[ArchEc] vault and plind opened

Thomas Krichel krichel at openlib.org
Fri Sep 25 06:33:38 UTC 2020


  I have just opened the vault and the plind for your
  rsyncing pleasure.

krichel@trabbi/tmp$ mkdir plind
krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/

  The plind is the payload index. It says where in the vault
  file the PDF data for a paper can be found. The start of the
  payload is in the 'b' field, its length in the 'f' field. The
  'o' field has the PDF status, with these values:
    'm' it is a PDF according to the MIME type
    'a' it has "%PDF" within the first 100 bytes
    'p' it has "PDF" in the futli, important for ftp
    'f' it has a URL starting with "ftp://"
    'r' it is from a WARC resource record that contains a payload,
        i.e. not preceded by a WARC metadata record, and not
        concurrent with another record.
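
  As a rough illustration of how 'b' and 'f' could be used, here
  is a minimal sketch that cuts a payload out of a vault file and
  repeats the 'a' check on it. The function names are made up for
  this example. It assumes that 'b' is a zero-based byte offset,
  that 'f' is a length in bytes, and that the vault file can be
  read as-is (not compressed); none of this is confirmed above.

```shell
# Hypothetical helpers, not part of ArchEc.
# Assumption: 'b' is a zero-based byte offset and 'f' a length in bytes.

extract_payload() {
  # $1 = vault file, $2 = start ('b'), $3 = length ('f')
  # tail -c +N starts output at byte N, counting from 1, hence the +1.
  tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}

looks_like_pdf() {
  # Same arguments as extract_payload. Succeeds if "%PDF" occurs
  # within the first 100 bytes of the payload (the 'a' status).
  extract_payload "$1" "$2" "$3" | head -c 100 | LC_ALL=C grep -q -a -F '%PDF'
}

# Usage, with the file name and the values of b and f taken from the plind:
#   extract_payload vault/SOME_FILE "$b" "$f" > paper.pdf
#   looks_like_pdf  vault/SOME_FILE "$b" "$f" && echo "status a"
```

  If the offsets turn out to be one-based, or to refer to
  compressed data, the arithmetic above needs adjusting.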

  At this time only a tiny fraction of the plind data is available.
  I'm still generating the full set.

  The vault contains the actual WARC files. I recommend excluding
  the cdx files:

krichel@trabbi/tmp$ mkdir vault
krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/

  At this time, there is no way to limit this to the files that
  are actually mentioned in the plind. This matters because we
  hold PDFs for only a minority of papers. Suggestions welcome.
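
  One possible direction, offered only as a sketch: rsync has a
  --files-from option that restricts a transfer to the names
  listed in a file. The list file here (wanted.txt, one name per
  line, relative to the vault module) and both function names are
  hypothetical. How to pull those names out of the plind depends
  on its format, which is not covered here.

```shell
# Hypothetical helpers, not part of ArchEc.
# Assumption: wanted.txt holds one vault-relative file name per line,
# gathered from the plind by some means not shown here.

make_file_list() {
  # $1 = raw list; drop blank lines and duplicate names.
  grep -v '^[[:space:]]*$' "$1" | sort -u
}

fetch_listed() {
  # $1 = cleaned list; transfer only the files named in it.
  rsync -av --files-from="$1" rsync://archec.repec.org/vault/ vault/
}

# Usage:
#   make_file_list wanted.txt > vault-files.txt
#   fetch_listed vault-files.txt
```

  Whether this is practical depends on how many distinct vault
  files the plind ends up pointing at.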
  

-- 

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel


