[ArchEc] TR: Repec: archive pdf
Thomas Krichel
krichel at openlib.org
Wed Dec 4 07:32:27 UTC 2019
1430-FONDATION-UT at banque-france.fr writes
> Can you tell me where you are in relation to the development of your project?
I keep a diary of all the work, by the minute
http://archec.repec.org/losheim
This shows 34 hours of work done. However, the diary is not quite
complete. It does not account for the work that I did with Jose Manuel
Barrueco Cruz, when we met in October. I had to work with him to get
the access to the data and understand how it is being organized. I
just could not hang around and count by the minute as I usually do,
even making a new entry when I go to the toilet. In general, there
are many more hours of work on the project because the billable work
excludes any sysadmin work, documentation and report writing. The
billable work is only for actual software writing. It’s only for raw
time on the computer typing the actual software. I call it parch for
something like “paper archiver”. It is fairly generic, and it has to
be because we have several sources of data to build the actual
archive. We have data that we bring in now, and we have the historic
data in the CitEc PDF storage.
The main reason that there are not more hours done is as follows. When
I made the application, I assumed that the project that I get paid to
work on, called Cirtec---which is very vaguely related to CitEc, but
must not be confused with it---would close in June. This assumption
turned out to be wrong. The Cirtec project is supposed to close at the
end of the year. In fact, my boss had to write the final report by
November 25th. When I learned that Cirtec would not close until
the end of the year, I hoped to be able to squeeze work in on
Losheim over the time until the end of the year. Unfortunately the
work on Cirtec intensified and I lost track of parch. Once I lose
track of the project it takes a few days to get back into it. I
got partly back to it when I met Jose Manuel Barrueco Cruz in
October. At that time I officially took an annual vacation from
Cirtec. While meeting with him, we worked on the inclusion of his
CitEc archive into the RePEc parch storage. Parch is conceived to make
it easy to handle data from various sources. But collecting the data
from the CitEc source is not easy. We knew that, and that’s why the
Losheim application says we will attempt it. In the latest run,
yesterday, I only
managed to find RePEc handles for 170000 of the roughly 800000 PDF
documents that are in the storage. Clearly this needs more work. You
may wonder why this is not trivial. Well the answer is that we don’t
have a historic set of RePEc data available. That is, we have PDF
files where we don’t have a clear indication of what RePEc paper the
PDF instantiates. Building a system that would preserve the RePEc
metadata is not part of Losheim. It is supposed to be part of a
second application to the foundation. Clearly if the current work is
not completed by January, RePEc will not apply for funding for that
second phase in 2020, but only in 2021.
Looking on the bright side, I have been extremely frugal in the way I
count the hours. I am confident that all the required work can be done
in the 100 hours that I counted. In fact the bulk of the work on parch
is done. We store PDFs that we can directly find in the NEP
data---which handles the current data going forward. This may sound
primitive. But the system has been conceived to handle a wide variety
of sources of data. Thus it can be configured to handle various
XML-based metadata formats. It can point to various components having
various functions. And it is robust to different payloads. That means
it can store various instances of payloads available at a URL in
the metadata, within the same storage file. The code is in Python. That’s
not a language I have used before, but we can’t use Perl, the lingua
franca of RePEc, because it has a steadily declining community of users
and no library to handle warcs. There are 486573 warc files in
storage at the last count.
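The idea of keeping several payload instances for one URL in the same
storage file can be illustrated with a small sketch. This is not the
parch code and it does not write real warc records. The class names,
the record fields, the example handle and the example URL are all
assumptions made up for illustration. It only shows the principle: a
new payload for a known URL is kept next to the old one, and an
identical payload is not stored twice.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Capture:
    """One fetched payload instance for a URL (hypothetical record layout)."""
    url: str
    fetched_at: str   # timestamp of the fetch, as a string
    digest: str       # SHA-1 hex digest of the payload bytes
    payload: bytes


@dataclass
class PaperStore:
    """In-memory stand-in for one per-paper storage file (an assumption)."""
    handle: str
    captures: list = field(default_factory=list)

    def add(self, url, payload, fetched_at):
        """Keep the payload unless the same bytes are already stored for the URL."""
        digest = hashlib.sha1(payload).hexdigest()
        for old in self.captures:
            if old.url == url and old.digest == digest:
                return False  # identical payload seen before, nothing to add
        self.captures.append(Capture(url, fetched_at, digest, payload))
        return True

    def instances(self, url):
        """Return all stored captures for a URL, in the order they were added."""
        return [c for c in self.captures if c.url == url]


# Made-up handle and URL, for illustration only.
store = PaperStore("RePEc:example:wpaper:1")
url = "http://example.org/paper.pdf"
first = store.add(url, b"%PDF-1.4 first version", "2019-11-01T00:00:00Z")
second = store.add(url, b"%PDF-1.4 revised version", "2019-12-01T00:00:00Z")
repeat = store.add(url, b"%PDF-1.4 first version", "2019-12-03T00:00:00Z")
```

Here the first two calls store a capture each, and the third call
stores nothing because the bytes are the same as in the first capture.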
As for Cirtec (the project that blocks my time), I’m waiting for the
November wage payment. Then I will limit my work to strictly two hours
a day. With that I can finish the 100 hours by mid January. But it is
not completely certain that this will get everything we have set out
in Losheim
(basically parch plus import of CitEc archive into parch data)
done. Plus the work has to be documented and published. The 100 hours
don’t pay for that; I’m supposed to do it for free. Equally Losheim
does not provide for the data to actually be used. The plan is to use it
in NEP. That will only start next year. Frankly, I’d rather be paid
when all the work is done. But I understand they have to close the
books this year. In that case it seems reasonable to pay now, knowing
there will be no further subsidy in case I can’t demonstrate during
next year that I’m worthy of a 2021 subsidy.
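The mid January estimate can be checked with a little arithmetic. The
sketch below uses the figures given above (100 hours counted, 34 hours
in the diary, two hours a day). The start date is an assumption: it
takes the date of this message as the first working day, although the
two-hour days really begin only after the November wage payment. The
function name and the weekdays-only option are also just illustration.

```python
from datetime import date, timedelta

TOTAL_HOURS = 100    # hours counted for the project (from the text)
LOGGED_HOURS = 34    # hours shown in the diary (from the text)
HOURS_PER_DAY = 2    # planned daily limit (from the text)


def finish_date(start, weekdays_only=False):
    """Return the last working day needed to use up the remaining hours."""
    remaining = TOTAL_HOURS - LOGGED_HOURS
    days_needed = -(-remaining // HOURS_PER_DAY)  # ceiling division
    day = start
    last = start
    worked = 0
    while worked < days_needed:
        # Monday is 0 and Sunday is 6, so weekday() < 5 means Monday to Friday.
        if not weekdays_only or day.weekday() < 5:
            worked += 1
            last = day
        day += timedelta(days=1)
    return last


start = date(2019, 12, 4)  # assumed first working day (date of this message)
every_day = finish_date(start)
weekdays = finish_date(start, weekdays_only=True)
```

With these assumptions the 66 remaining hours take 33 working days.
That ends on 5 January 2020 when working every day, and on 17 January
2020 when working Monday to Friday only. A later start moves both
dates back by the same amount.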
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel