[ArchEc] TR: Repec: archive pdf

Wed Dec 4 07:32:27 UTC 2019

  1430-FONDATION-UT at banque-france.fr writes

> Can you tell me where you are in relation to the development of your project?

  I keep a diary of all the work, by the minute:

http://archec.repec.org/losheim

  This shows 34 hours of work done. However, the diary is not quite
  complete. It does not account for the work that I did with Jose Manuel
  Barrueco Cruz, when we met in October. I had to work with him to get
  access to the data and understand how it is organized. I
  just could not stop to count by the minute as I usually do,
  even making a new entry when I go to the toilet. In general, there
  are many more hours of work on the project because the billable work
  excludes any sysadmin work, documentation and report writing. The
  billable work is only for actual software writing, that is, raw
  time at the computer typing the code. I call the software parch, for
  something like “paper archiver”. It is fairly generic, and it has to
  be because we have several sources of data to build the actual
  archive. We have data that we bring in now, and we have the historic
  data in the CitEc PDF storage.

  The main reason that there are not more hours done is as follows. When
  I made the application, I assumed that the project that I get paid to
  work on, called Cirtec---which is very vaguely related to CitEc, but
  must not be confused with it---would close in June. This assumption
  turned out to be wrong. The Cirtec project is supposed to close at the
  end of the year; in fact my boss had to write the final report by
  November 25th. When I learned that Cirtec would not close until
  the end of the year, I hoped to be able to squeeze in work on
  Losheim in the time remaining. Unfortunately the
  work on Cirtec intensified and I lost track of parch. Once I lose
  track of a project it takes a few days to get back into it. I
  got partly back to it when I met Jose Manuel Barrueco Cruz in
  October. At that time I officially took an annual vacation from
  Cirtec. While meeting with him we worked on the inclusion of his
  CitEc archive into the RePEc parch storage. Parch is conceived to make
  it easy to handle data from various sources. But collecting the data
  from the CitEc source is not easy. We knew that, and that’s why the
  Losheim application says we will attempt it. In the latest run, yesterday, I only
  managed to find RePEc handles for 170000 of the roughly 800000 PDF
  documents that are in the storage. Clearly this needs more work. You
  may wonder why this is not trivial. Well, the answer is that we don’t
  have a historic set of RePEc data available. That is, we have PDF
  files where we don’t have a clear indication of what RePEc paper the
  PDF instantiates. Building a system that would preserve the RePEc
  metadata is not part of Losheim. It is supposed to be part of a
  second application to the foundation. Clearly, if the current work is
  not completed by January, RePEc will not apply for funding for that
  second phase in 2020, but only in 2021.

  Looking on the bright side, I have been extremely frugal in the way I
  count the hours. I am confident that all the required work can be done
  in the 100 hours that I counted. In fact the bulk of the work on parch
  is done.  We store PDFs that we can directly find in the NEP
  data---which handles the current data going forward. This may sound
  primitive. But the system has been conceived to handle a wide variety
  of sources of data. Thus it can be configured to handle various
  XML-based metadata formats. It can point to various components having
  various functions. And it is robust to different payloads. That means
  it can store various instances of the payloads available at a URL given
  in the metadata, within the same storage file. The code is in Python. That’s
  not a language I have used before, but we can’t use Perl, the lingua
  franca of RePEc, because it has a steadily declining community of users
  and no library to handle WARC files. There are 486573 WARC files in
  storage at the last count.

  As for Cirtec (the project that blocks my time), I’m waiting for the
  November wage payment. Then I will limit my work to strictly two hours
  a day. With that I can finish the 100 hours by mid January. But it is
  not completely certain that this will get done everything we have set
  out in Losheim (basically parch plus import of the CitEc archive into
  the parch data). Plus the work has to be documented and published. The
  100 hours don’t pay for that; I’m supposed to do it for free. Equally,
  Losheim does not provide for the data actually being used. The plan is
  to use it in NEP. That will only start next year. Frankly, I’d rather be paid
  when all the work is done. But I understand they have to close the
  books this year. In that case it seems reasonable to pay now, knowing
  there will be no further subsidy in case I can’t demonstrate during
  next year that I’m worthy of a 2021 subsidy.
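
  As a rough check on the mid-January estimate, here is a minimal sketch
  of the arithmetic. The 100-hour budget, the 34 logged hours and the two
  hours a day come from this message. The start date of 9 December 2019,
  the function name finish_date and the choice between working every day
  or weekdays only are assumptions made for illustration. Holidays are
  ignored.

```python
from datetime import date, timedelta


def finish_date(start, hours_left, hours_per_day, weekdays_only=True):
    """Return the day on which the remaining hours are used up.

    Hypothetical helper, not part of parch. `start` counts as the
    first working day if it qualifies.
    """
    if hours_per_day <= 0:
        raise ValueError("hours_per_day must be positive")
    # Ceiling division: a partly used day still counts as a day.
    days_needed = -(-hours_left // hours_per_day)
    day = start
    worked = 0
    while True:
        # weekday() is 0..4 for Monday..Friday.
        if not weekdays_only or day.weekday() < 5:
            worked += 1
            if worked == days_needed:
                return day
        day += timedelta(days=1)


budget = 100                 # hours in the Losheim budget (from the message)
logged = 34                  # hours shown in the diary (from the message)
remaining = budget - logged  # hours still to be worked

start = date(2019, 12, 9)    # ASSUMPTION: first day at two hours a day

print("every day:    ", finish_date(start, remaining, 2, weekdays_only=False))
print("weekdays only:", finish_date(start, remaining, 2, weekdays_only=True))
```

  Under these assumptions the 66 remaining hours take 33 working days,
  ending on 10 January 2020 if every day is worked and on 22 January
  2020 if only weekdays are. A later start or any holidays push both
  dates back.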

-- 

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel