[ArchEc] vault and plind opened
Thomas Krichel
krichel at openlib.org
Fri Sep 25 06:33:38 UTC 2020
I have just opened the vault and the plind for your
rsyncing pleasure:
krichel at trabbi/tmp$ mkdir plind
krichel at trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/
The plind is the payload index. It says where in a vault
file the PDF data for a paper can be found. The 'b' field
gives the start of the payload, the 'f' field its length.
The 'o' field holds the PDF status:
'm' it is a PDF according to the MIME type
'a' it has "%PDF" within the first 100 bytes
'p' it has "PDF" in the futli, important for ftp
'f' it has a URL starting with "ftp://"
'r' it is from a WARC resource record that contains a payload,
i.e. one not preceded by a WARC metadata record, or not
concurrent to another record.
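As a rough illustration of how 'b' and 'f' might be used, here is
a minimal sketch. It assumes that 'b' is a 0-based byte offset and
'f' a byte count into the vault file exactly as stored on disk;
neither is confirmed above. The function names extract_payload and
looks_like_pdf are hypothetical, and the second one only mimics
the 'a' status check.

```shell
# Hypothetical helper: print LENGTH bytes of FILE, starting at the
# 0-based byte OFFSET (assumed meaning of the 'b' and 'f' fields).
extract_payload() {
    file=$1
    offset=$2
    length=$3
    # tail -c +N counts from 1, hence the offset + 1
    tail -c +"$((offset + 1))" "$file" | head -c "$length"
}

# Hypothetical check mimicking status 'a': succeed if "%PDF"
# occurs within the first 100 bytes read from standard input.
looks_like_pdf() {
    head -c 100 | LC_ALL=C grep -aq '%PDF'
}
```

With values taken from a plind record, usage would be along the
lines of: extract_payload vault/SOMEFILE "$b" "$f" > paper.pdf,
where SOMEFILE stands for whichever vault file the record refers to.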
At this time only a tiny fraction of the plind data is available.
I'm still running the full set.
The vault contains the actual WARCs. I recommend excluding the
cdx files:
krichel at trabbi/tmp$ mkdir vault
krichel at trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/
At this time, there is no way to limit this to the files
that are mentioned in the plind. This matters since we hold
PDFs for only a minority of papers. Suggestions
welcome.
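One possibility, untested against this server, would be rsync's
--files-from option. The sketch below assumes that a plain list of
vault-relative file names, one per line, can somehow be derived
from the plind; the plind layout is not described here, so that
step is left out. The names unique_names, fetch_listed and
wanted.txt are hypothetical.

```shell
# Hypothetical: reduce a list of vault-relative file names
# (one per line, duplicates allowed) to a sorted unique list.
unique_names() {
    sort -u "$1"
}

# Hypothetical: fetch only the files named in LIST into DEST.
# --files-from is a standard rsync option; whether it suits this
# vault's layout is an assumption.
fetch_listed() {
    list=$1
    dest=$2
    rsync -av --files-from="$list" rsync://archec.repec.org/vault/ "$dest"
}
```

Usage would be something like: unique_names names.txt > wanted.txt,
followed by fetch_listed wanted.txt vault/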
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel