[ArchEc] current work
Thomas Krichel
krichel at openlib.org
Sat Oct 9 12:43:53 UTC 2021
Hi gang,
I'm still working on the tarballs of ArchEc. Here is a
submission to a meeting that explains where I am heading
in non-technical terms. The technical details are still
in progress, even after two months. It's just that there is
a lot of junk to fight with in the RePEc tarballs.
-----------------------------------------------------------------
In digital archiving, it is quite common to start with records in
files. If each file contains just one record, then preserving the
files is essentially the same as preserving the records. But what if
the files contain several records? And what if the files change over
time? For example, the files may be harvested from a bunch of
sources, and these sources may be creating poorly-formed records. Each
source offers files that contain records. We can harvest the current
versions of these files and they land on our disk ... how do we
preserve them?
One approach is to preserve the files. In that case we lose no
information. We can just look at what files are changing, and copy the
changed ones into an archival location. I see two problems. (1) if
there are files in which records are accumulating, we are wasting disk
space because we store the same record over and over again. (2) what
about consumers of our archival material? They presumably do not care
about the files. They want the records. If we just tell them, “hey,
here are the files, go figure”, we are not likely to get much buy-in.
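The file-based approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ArchEc software: the function names, the `seen` dictionary of previously archived digests, and the layout of the archival directory are all hypothetical, and the archival directory is assumed to lie outside the harvest directory.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_changed_files(harvest_dir: Path, archive_dir: Path, seen: dict) -> list:
    """Copy files whose content changed since the last run.

    `seen` maps a relative file name to the digest archived last time
    (a hypothetical bookkeeping structure). Returns the relative names
    of the files that were copied.
    """
    copied = []
    for path in sorted(harvest_dir.rglob("*")):
        if not path.is_file():
            continue
        name = path.relative_to(harvest_dir).as_posix()
        digest = file_digest(path)
        if seen.get(name) == digest:
            continue  # unchanged, nothing to store
        # Store each version under a digest prefix so older versions survive.
        target = archive_dir / digest[:12] / name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        seen[name] = digest
        copied.append(name)
    return copied
```

Note that the sketch shows problem (1) directly: when a file gains one record, the whole file is copied again, including every record it already held.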
Another approach is to split the files into records, and preserve the
records. Then we solve the two problems with the file-based
approach. But we get more serious problems. We lose any information
that was tied to the fact that records were in the same file. If the
software used to split the records was buggy, or if we change our
opinion about record borders, or the nature of records, we have no way
to get back. We could circumvent that with some form of clever
metadata. It has to be clever because the data may be broken in all
sorts of ways.
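One simple form such metadata could take is sketched below. It is an assumption for illustration, not the method used in ArchEc: every byte of the file is assigned to some slice, and each slice remembers its source file and byte offset, so the original file can be rebuilt exactly even if the record borders later turn out to be wrong. The blank-line separator and the names `RecordSlice`, `split_with_provenance` and `rebuild_file` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecordSlice:
    source: str   # name of the file the slice came from
    offset: int   # byte offset of the slice within that file
    data: bytes   # raw bytes, including any trailing separator


def split_with_provenance(source: str, content: bytes, sep: bytes = b"\n\n") -> list:
    """Cut `content` after each separator without discarding any byte."""
    slices = []
    start = 0
    while start < len(content):
        cut = content.find(sep, start)
        # The last slice takes whatever is left, junk included.
        end = len(content) if cut == -1 else cut + len(sep)
        slices.append(RecordSlice(source, start, content[start:end]))
        start = end
    return slices


def rebuild_file(slices: list) -> bytes:
    """Reassemble the original file bytes from the slices of one source."""
    return b"".join(s.data for s in sorted(slices, key=lambda s: s.offset))
```

Because no byte is dropped, `rebuild_file(split_with_provenance(name, content))` returns `content` unchanged, whatever the splitter got wrong about the records themselves.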
The prior art that led me to this problem is the ArchEc
http://archec.repec.org project. It is a humble effort to preserve
RePEc http://repec.org data. A first stage received €3000 funding from
the Fondation Banque de France pour la recherche économique. In this
stage, I worked on preserving full-text instances pointed to in the
RePEc data. In the current stage, funded with €2000 by the same
funder, the work ultimately aims to preserve the actual RePEc
records. But even in the funding application, available at
http://governance.repec.org/applications/lebach.docx, I only pledged
to work on preserving files, because I was unsure about how to
preserve records.
Well, on 2 August 2021, I had an ingenious idea how to preserve files
and records. I have implemented it. Thus, I have theory, software and
a wealth of experience that I will share during the talk. While the
work is obviously made for RePEc, the conceptual framework and the
methods used apply to any time-varying collection of records held
in files. The data moves into WARCs. I guess that the audience
will be familiar with that format.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel