[ArchEc] RePEc:bru problem

Thomas Krichel krichel at openlib.org
Sat May 29 04:33:50 UTC 2021


  When dealing with this, from the 2005-01-13 dump

archec at svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp00.rdf
-rw-r--r-- 1 archec archec 35394 Jul 13  2004 bruppp00.rdf

archec at svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp/bruppp03.rdf 
-rw-r--r-- 1 archec archec 35394 Jul 13  2004 bruppp/bruppp03.rdf

  Not just the same size, but the same contents.
  Only the latter is legal as far as the protocol goes.

  Now they seem to have the same date as well but I'm sure
  there will be cases when one or the other is older.

  What to do?

  I think the first step is to sort the files of a RePEc archive
  by modification time.

  When duplicate content is found, create a revisit record for warc,
  noting the modification time of the newer resource, and say
  it's a copy of the older resource. Within a single dump,
  the file names have to be different anyway.
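
  A minimal sketch of that duplicate pass, under some assumptions:
  the function name plan_records is made up for illustration, content
  is compared by SHA-1 digest (any digest would do), and the result is
  only a plan of which files would get a full record and which a
  revisit record. Writing the actual warc records is left out.

```python
import hashlib
import os


def plan_records(root):
    """Walk root, oldest file first, and decide which files get a full
    record and which get a revisit record pointing at an older copy.

    Hypothetical helper: returns (kind, path, mtime, original_path) tuples.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    # Sort by modification time; ties are broken by path so the result
    # is deterministic when two copies carry the same mtime.
    paths.sort(key=lambda p: (os.path.getmtime(p), p))

    first_seen = {}  # content digest -> path of the oldest copy
    plan = []
    for path in paths:
        with open(path, "rb") as handle:
            digest = hashlib.sha1(handle.read()).hexdigest()
        mtime = os.path.getmtime(path)
        if digest in first_seen:
            # Same bytes as an older file: note it as a copy of that file.
            plan.append(("revisit", path, mtime, first_seen[digest]))
        else:
            first_seen[digest] = path
            plan.append(("resource", path, mtime, None))
    return plan
```

  Note that when both copies have the same mtime, as in the bru case,
  the tie-break by path decides which one is kept in full, and that
  need not be the protocol-compliant one.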

  When examining the same archive from a different tarball,
  if the file name is the same, don't create a revisit record.
  Just make sure to process the tarballs in order, and assume
  the dates on the tarballs are correct. Warn if that assumption
  is violated, i.e. if the mtime on a file in a later tarball
  is older than the mtime in an earlier tarball.
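
  The ordering check could look roughly like this. It is a sketch
  only: check_mtime_order is a hypothetical name, and it assumes the
  caller passes the tarballs already sorted by their dump date.

```python
import tarfile
import warnings


def check_mtime_order(tarball_paths):
    """Take tarballs in chronological order and return (tarball, name)
    pairs for members whose mtime goes backwards between tarballs."""
    last_mtime = {}  # member name -> mtime seen in the most recent tarball
    offenders = []
    for tarball in tarball_paths:
        with tarfile.open(tarball) as archive:
            for member in archive:
                if not member.isfile():
                    continue
                previous = last_mtime.get(member.name)
                if previous is not None and member.mtime < previous:
                    warnings.warn(
                        f"{member.name} in {tarball} is older than "
                        "in an earlier tarball"
                    )
                    offenders.append((tarball, member.name))
                last_mtime[member.name] = member.mtime
    return offenders
```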

  The problem with that approach is that the preserved copy
  is not necessarily the one that is protocol compliant. 

  The reason I chose to work on this archive at all is that
  it contains files ending in ~, from emacs I guess. So I
  first wrote a function that looks for rdf~ files and stores
  these as if they had no ~. But I think sorting by time
  would not affect this order, since if the ~ file is a genuine
  backup, it will be older.
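
  To illustrate the point about ordering, here is a small sketch.
  It is not the function I mentioned above; stored_name and
  storage_order are hypothetical names, and the input is assumed to
  be (path, mtime) pairs.

```python
def stored_name(path):
    """Name to store a file under: 'x.rdf~' is stored as 'x.rdf'."""
    return path[:-1] if path.endswith("rdf~") else path


def storage_order(files):
    """files: iterable of (path, mtime) pairs.
    Returns (stored_name, original_path) pairs, oldest file first."""
    ordered = sorted(files, key=lambda item: item[1])
    return [(stored_name(path), path) for path, _mtime in ordered]
```

  With a genuine backup, the ~ file has the older mtime, so it comes
  out first and the current file follows under the same stored name.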
  

-- 

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel


