[ArchEc] RePEc:bru problem
Thomas Krichel
krichel at openlib.org
Sat May 29 04:33:50 UTC 2021
While dealing with this archive in the 2005-01-13 dump, I found:
archec at svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp00.rdf
-rw-r--r-- 1 archec archec 35394 Jul 13 2004 bruppp00.rdf
archec at svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp/bruppp03.rdf
-rw-r--r-- 1 archec archec 35394 Jul 13 2004 bruppp/bruppp03.rdf
Not just the same size: it is the same file.
Only the latter is legal as far as the protocol goes.
Now they seem to have the same date as well but I'm sure
there will be cases when one or the other is older.
What to do?
I think we should first sort a RePEc archive by file modification
times. When duplicate content is found, create a revisit record for
the WARC, noting the modification time of the newer resource, and say
it's a copy of the older resource. Within a single dump,
the file names have to be different anyway.
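A rough sketch of that first step, in Python. The names Entry, Record
and plan_records are made up for illustration, SHA-1 is just one
possible digest, and breaking mtime ties by file name is my assumption.
It only decides which files would become full records and which would
become revisit records; it does not write any WARC.

```python
import hashlib
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One file from a single dump (hypothetical structure)."""
    name: str
    mtime: float
    data: bytes


class Record(NamedTuple):
    """What we would write: a full resource or a revisit of an older one."""
    kind: str                 # "resource" or "revisit"
    name: str
    mtime: float
    refers_to: Optional[str]  # name of the older copy, for revisits only


def plan_records(entries):
    """Walk the files oldest first and mark later duplicates as revisits."""
    first_seen = {}  # content digest -> name of the oldest file with it
    records = []
    # Assumption: ties on mtime are broken by file name, to be deterministic.
    for entry in sorted(entries, key=lambda e: (e.mtime, e.name)):
        digest = hashlib.sha1(entry.data).hexdigest()
        original = first_seen.get(digest)
        if original is None:
            first_seen[digest] = entry.name
            records.append(Record("resource", entry.name, entry.mtime, None))
        else:
            records.append(Record("revisit", entry.name, entry.mtime, original))
    return records
```

The revisit keeps the mtime of the newer file and points at the name
of the older one, as described above.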
If, when examining the same archive from a different tarball,
the file name is the same, don't create a revisit record.
Just make sure to process the tarballs in order, and assume
the dates on the tarballs are correct. Warn if that assumption
is violated, i.e. if the mtime on a file in a later tarball
is older than the mtime in an earlier tarball.
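The ordering check could look something like this. It is only a
sketch: find_mtime_regressions is a made-up name, and I assume each
tarball has already been reduced to a label plus a mapping from file
name to mtime, passed in processing order.

```python
def find_mtime_regressions(tarballs):
    """Return warnings for files whose mtime goes backwards across tarballs.

    tarballs: list of (label, {file name: mtime}) pairs, in processing order.
    """
    latest = {}    # file name -> (newest mtime seen so far, tarball label)
    warnings = []
    for label, members in tarballs:
        for name, mtime in sorted(members.items()):
            seen = latest.get(name)
            if seen is not None and mtime < seen[0]:
                warnings.append(
                    f"{name}: mtime {mtime} in {label} is older than "
                    f"{seen[0]} in {seen[1]}"
                )
            if seen is None or mtime > seen[0]:
                latest[name] = (mtime, label)
    return warnings
```

An unchanged mtime between two tarballs is not reported, only a
strictly older one.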
The problem with that approach is that the preserved copy
is not necessarily the one that is protocol compliant.
The reason I chose to work on this archive at all is that
it contains files ending in ~, from emacs I guess. So I
first wrote a function that looks for rdf~ files and stores
these as if they had no ~. But I think sorting by time
would not impact this order since if the ~ is a genuine
backup, it will be older.
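To illustrate, a small sketch with made-up names (stored_name,
storage_order), assuming files come as (name, mtime) pairs. This is
not the function I actually wrote, just the idea.

```python
def stored_name(name):
    """Drop a trailing ~ from rdf~ backup files; leave other names alone."""
    return name[:-1] if name.endswith("rdf~") else name


def storage_order(files):
    """files: list of (name, mtime) pairs.

    Returns (stored name, original name, mtime) triples, oldest first.
    """
    return [(stored_name(n), n, m) for n, m in sorted(files, key=lambda f: f[1])]
```

With a genuine backup, the ~ file has the older mtime, so it still
comes before the current file under the same stored name.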
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel