[ArchEc] end of tarballs reached

Thomas Krichel krichel at openlib.org
Mon Nov 22 12:35:16 UTC 2021


  This is long, and very important.
  
  I have finished working on the tarballs. Here is the historic listing

repecsnapshot at helos:~$ ls -l  archive/ 
total 1653215736
-rw-r--r-- 1 repecsnapshot repecsnapshot   484660039 Feb 14  2020 RePEc_2005-01-13.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot   904066637 Feb  1  2007 RePEc_2007-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot   941231076 Mar  1  2007 RePEc_2007-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot   963658331 Apr  1  2007 RePEc_2007-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot   985673832 May  1  2007 RePEc_2007-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  1456266164 Feb  1  2008 RePEc_2008-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  1473425648 Mar  1  2008 RePEc_2008-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  1862245883 Apr  5  2009 RePEc_2009-04-05.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6194425874 May  1  2009 RePEc_2009-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6449767907 Jun  1  2009 RePEc_2009-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6494054786 Jul  1  2009 RePEc_2009-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6454682521 Aug  1  2009 RePEc_2009-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6488278395 Sep  1  2009 RePEc_2009-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6537097218 Oct  1  2009 RePEc_2009-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6602806339 Nov  1  2009 RePEc_2009-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6633233978 Dec  1  2009 RePEc_2009-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6674276232 Jan  1  2010 RePEc_2010-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3056671753 Feb  1  2010 RePEc_2010-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2271868077 Mar  1  2010 RePEc_2010-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2299346881 Apr  1  2010 RePEc_2010-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2351089160 May  1  2010 RePEc_2010-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot   133496832 Jun  1  2010 RePEc_2010-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2370259005 Jul  1  2010 RePEc_2010-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2139186858 Aug  1  2010 RePEc_2010-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2210199963 Sep  1  2010 RePEc_2010-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2257736575 Oct  1  2010 RePEc_2010-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2363757937 Nov  1  2010 RePEc_2010-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2466406462 Dec  1  2010 RePEc_2010-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2602037604 Jan  1  2011 RePEc_2011-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2639670405 Feb  1  2011 RePEc_2011-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2706271220 Mar  1  2011 RePEc_2011-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2774583851 Apr  1  2011 RePEc_2011-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2827230082 May  1  2011 RePEc_2011-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  2954008530 Jun  1  2011 RePEc_2011-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3029854979 Jul  1  2011 RePEc_2011-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3081255473 Aug  1  2011 RePEc_2011-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3213419623 Sep  1  2011 RePEc_2011-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3322078071 Oct  1  2011 RePEc_2011-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3419023452 Nov  1  2011 RePEc_2011-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3553889294 Dec  1  2011 RePEc_2011-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3752766704 Jan  1  2012 RePEc_2012-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3838959870 Feb  1  2012 RePEc_2012-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3942625941 Mar  1  2012 RePEc_2012-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  3974520415 Apr  1  2012 RePEc_2012-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  4034926857 May  1  2012 RePEc_2012-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  4071460961 Jun  1  2012 RePEc_2012-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  4284563770 Jul  1  2012 RePEc_2012-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  4261127463 Aug  1  2012 RePEc_2012-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  4302407553 Sep  1  2012 RePEc_2012-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  4348557297 Oct  1  2012 RePEc_2012-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  4688488319 Nov  1  2012 RePEc_2012-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  5641946931 Jul 11  2016 RePEc_2013-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  5711299146 Feb 25  2013 RePEc_2013-02-25.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6120639264 Mar  1  2013 RePEc_2013-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6197183316 Apr  1  2013 RePEc_2013-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6294581775 May  1  2013 RePEc_2013-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6536699176 Jul  1  2013 RePEc_2013-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6515265030 Aug  1  2013 RePEc_2013-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  6726546517 Sep  1  2013 RePEc_2013-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  7040050302 Oct  1  2013 RePEc_2013-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  7176390568 Nov  1  2013 RePEc_2013-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  7705018567 Jan  1  2014 RePEc_2014-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  7819037451 Feb  1  2014 RePEc_2014-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  8055801010 Mar  1  2014 RePEc_2014-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot  8154382673 Apr  1  2014 RePEc_2014-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54249581051 Sep 18  2015 RePEc_2015-09-18.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54292904727 Oct  1  2015 RePEc_2015-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54477600620 Nov  1  2015 RePEc_2015-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 17861613243 Dec  1  2015 RePEc_2015-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 18152473788 Jan  1  2016 RePEc_2016-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 18643207732 Feb  1  2016 RePEc_2016-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19244499570 Mar  1  2016 RePEc_2016-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 12390121472 Apr  1  2016 RePEc_2016-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19202734174 May  1  2016 RePEc_2016-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19729705031 Jun  1  2016 RePEc_2016-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19906775204 Jul  1  2016 RePEc_2016-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 12988989440 Aug  1  2016 RePEc_2016-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 20380834585 Sep  1  2016 RePEc_2016-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 20604648388 Oct  1  2016 RePEc_2016-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 21685010228 Nov  1  2016 RePEc_2016-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 22583953287 Dec  1  2016 RePEc_2016-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23748121441 Jan  1  2017 RePEc_2017-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23988179882 Feb  1  2017 RePEc_2017-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23892989701 Mar  1  2017 RePEc_2017-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 24571501618 Apr  1  2017 RePEc_2017-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 25381949487 May  1  2017 RePEc_2017-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 32375688674 Jan  7  2020 RePEc_2020-01-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 32752245757 Feb  7  2020 RePEc_2020-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33228125218 Mar  7  2020 RePEc_2020-03-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33425579195 Apr  7  2020 RePEc_2020-04-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33724114112 May  7  2020 RePEc_2020-05-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 36387840691 Jun  7  2020 RePEc_2020-06-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 36978171038 Jul  7  2020 RePEc_2020-07-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37175093933 Aug  7  2020 RePEc_2020-08-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37606509948 Sep  7  2020 RePEc_2020-09-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37806459795 Oct  7  2020 RePEc_2020-10-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 38279176241 Nov  7  2020 RePEc_2020-11-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 38520442306 Dec  7  2020 RePEc_2020-12-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39072868567 Jan  7  2021 RePEc_2021-01-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39444298486 Feb  7  2021 RePEc_2021-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39742557323 Mar  7  2021 RePEc_2021-03-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39915211339 Apr  7  2021 RePEc_2021-04-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 50509905007 May  7  2021 RePEc_2021-05-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 50706129419 Jun  7 23:34 RePEc_2021-06-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 51138502107 Jul  7 23:20 RePEc_2021-07-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52134005876 Aug  7 23:50 RePEc_2021-08-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52359572651 Sep  7 23:51 RePEc_2021-09-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52793698280 Oct  8 00:00 RePEc_2021-10-07.tar.gz

repecsnapshot at helos:~$ du -sb archive/
1692892034551   archive/

  So it's roughly 1.7 terabytes. Some of the tarballs appear to
  contain Hollywood movies.

  Much of the ArchEc work of extracting templates is automated in the
  software I wrote. The template data is in the vault

archec at svega:~$ du -sb vault
9180530114      vault

  So the vault is 0.54% of the tarballs.

  All the non-template data in the archive had to be manually sorted
  into material that we want to keep.  It is in the cellar

archec at svega:~$ du -sb cellar/
61256595997     cellar/

  Thus the vault is only about 15% of the cellar. But that comparison
  is somewhat misleading, because the vault is compressed tarballs,
  one per RePEc archive, whereas the cellar is by tarball date

archec at svega:~/cellar$ du -sb * | less
634287180       2005-01-13
536861971       2007-02-01
44834272        2007-03-01
25349991        2007-04-01
21045749        2007-05-01
498172493       2008-02-01
17374225        2008-03-01
464236588       2009-04-05
94967981        2009-05-01
27444649        2009-06-01
45358045        2009-07-01
36796993        2009-08-01
25261355        2009-09-01
52208902        2009-10-01
71659459        2009-11-01
27825748        2009-12-01
51560339        2010-01-01
57371378        2010-02-01
24226893        2010-03-01
25456810        2010-04-01
49648187        2010-05-01
5224014         2010-06-01
43502725        2010-07-01
38417643        2010-08-01
55986330        2010-09-01
51172577        2010-10-01
129154130       2010-11-01
81382474        2010-12-01
151121065       2011-01-01
33716019        2011-02-01
81951320        2011-03-01
78195257        2011-04-01
42583732        2011-05-01
99387926        2011-06-01
88551512        2011-07-01
63851460        2011-08-01
168786636       2011-09-01
89232615        2011-10-01
105594132       2011-11-01
210477319       2011-12-01
191575866       2012-01-01
114081797       2012-02-01
164878279       2012-03-01
60185290        2012-04-01
66670237        2012-05-01
53926838        2012-06-01
219640488       2012-07-01
109846291       2012-08-01
67363920        2012-09-01
55089588        2012-10-01
357429531       2012-11-01
779190379       2013-02-07
169804007       2013-02-25
293097014       2013-03-01
132990958       2013-04-01
126529750       2013-05-01
248474898       2013-07-01
91941447        2013-08-01
244424689       2013-09-01
355380515       2013-10-01
180926842       2013-11-01
491438262       2014-01-01
199507171       2014-02-01
304653827       2014-03-01
125102163       2014-04-01
11951277650     2015-09-18
54133492        2015-10-01
179301723       2015-11-01
216168974       2015-12-01
296785856       2016-01-01
602075445       2016-02-01
550662685       2016-03-01
203808483       2016-04-01
285364623       2016-05-01
392463192       2016-06-01
343421203       2016-07-01
569052659       2016-08-01
259768124       2016-09-01
340470924       2016-10-01
1353382811      2016-11-01
244735724       2016-12-01
221196926       2017-01-01
314834229       2017-02-01
203896839       2017-03-01
432300255       2017-04-01
113302552       2017-05-01
9293990047      2020-01-07
697589478       2020-02-07
230345565       2020-03-07
235876226       2020-04-07
344132561       2020-05-07
2754910067      2020-06-07
575383272       2020-07-07
457447081       2020-08-07
538063845       2020-09-07
505377160       2020-10-07
554300026       2020-11-07
298479749       2020-12-07
541246056       2021-01-07
732556684       2021-02-07
244025703       2021-03-07
361546881       2021-04-07
11993508900     2021-05-07
279982020       2021-06-07
599068731       2021-07-07
459875968       2021-08-07
246665705       2021-09-07
529427666       2021-10-07

  and the data is not compressed.
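
  The size ratios can be rechecked directly from the du figures quoted
  in this message. The following is only a small arithmetic sketch; the
  constant and function names are made up for illustration.

```python
# Sizes in bytes, copied from the du -sb output quoted in this message.
TARBALLS = 1692892034551  # archive/ on helos
VAULT = 9180530114        # vault on svega
CELLAR = 61256595997      # cellar/ on svega


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"tarballs:          {TARBALLS / 1e12:.2f} TB")
    print(f"vault / tarballs:  {percent(VAULT, TARBALLS):.2f}%")
    print(f"vault / cellar:    {percent(VAULT, CELLAR):.2f}%")
    print(f"cellar / tarballs: {percent(CELLAR, TARBALLS):.2f}%")
```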
  
  Within each cellar date, I have taken care to create a symlink to an
  earlier version if that is identical. But if I don't operate by
  date, then I can't have two versions of the same file, unless I
  build a WARC archive similar to what I have in the vault. Ideally,
  I would merge the cellar into the vault.  At this time, this is not
  done. Getting to this stage has already been a Herculean task that
  has occupied me since July, and for which I am paid only 2000
  Euros.
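
  The symlinking of identical files across cellar dates could look
  roughly like the sketch below. This is not the code that was
  actually used; the function names are hypothetical, and it assumes
  that the cellar holds one directory per tarball date and that a file
  keeps the same relative path from one date to the next.

```python
import hashlib
import os
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the hex SHA1 of a file, read in chunks."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def link_identical(cellar: Path) -> int:
    """Replace a file by a symlink when the most recent real copy of the
    same relative path, in an earlier date directory, has the same
    contents. Return the number of symlinks created."""
    latest: dict[Path, tuple[Path, str]] = {}  # relative path -> (file, sha1)
    made = 0
    # ISO dates sort chronologically when sorted as strings.
    for date_dir in sorted(p for p in cellar.iterdir() if p.is_dir()):
        for path in sorted(date_dir.rglob("*")):
            if path.is_symlink() or not path.is_file():
                continue
            rel = path.relative_to(date_dir)
            digest = sha1_of(path)
            previous = latest.get(rel)
            if previous is not None and previous[1] == digest:
                # Relative target, so the cellar can be moved as a whole.
                target = os.path.relpath(previous[0], start=path.parent)
                path.unlink()
                path.symlink_to(target)
                made += 1
            else:
                latest[rel] = (path, digest)
    return made
```

  A call such as link_identical(Path("cellar")) would walk the dates
  in order. A file that changes and later changes back is kept as a
  real file, because the comparison is only against the latest real
  copy.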

  So this is the cellar, and then there is the trash

archec at svega:~$ du -sb trash
150200223       trash

  Of the trash, we only really have to note the checksums, but for the
  vast majority, I actually have the full files, in a file name that
  is the basename of the file prefixed by the SHA1 of the contents

archec at svega:~/trash$ ls -lrt | tail -5
-rw-r--r-- 1 archec archec      245 Oct  7 15:03 2KKDANE66S53P32ILOCPH5DYWCKHRFUD_borra_031.rdf
-rw-r--r-- 1 archec archec      245 Oct  7 15:03 2FFBOX4DCJNZSQOVT3F63L444KIXH3C4_2017a-10-195-255.rdf
-rw-r--r-- 1 archec archec      245 Oct  7 15:03 2DQ243MZNAAFZD7V4RAQPY46TZ5CZODI_borra_051.rdf
-rw-r--r-- 1 archec archec      245 Oct  7 15:03 2DH2QTRCGI6G5LZO4QKSTC67A5UWGJIP_2008-08.rdf
-rw-r--r-- 1 archec archec      245 Oct  7 15:03 24V7M67PQAPD45UA5SL4A7SQUK6N64QA_2014-05-505-526.rdf
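
  The naming scheme in this listing can be sketched as follows. The
  base32 encoding of the SHA1 is an inference from the names shown
  (32 characters drawn from A-Z and 2-7), not something stated in this
  message, and trash_name is a hypothetical helper, not part of rarch.

```python
import base64
import hashlib
from pathlib import Path


def trash_name(path: Path) -> str:
    """Build '<SHA1 of contents>_<basename>' for a file.

    Assumption: the digest is base32-encoded. A 20-byte SHA1 gives
    exactly 32 base32 characters, with no padding.
    """
    digest = hashlib.sha1(path.read_bytes()).digest()
    return base64.b32encode(digest).decode("ascii") + "_" + path.name
```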

  The last file in that listing, as an example:
  
archec at svega:~/trash$ cat 24V7M67PQAPD45UA5SL4A7SQUK6N64QA_2014-05-505-526.rdf
<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br><br>Your support ID is: <17685956588297156091><br><br><a href='javascript:history.back();'>[Go Back]</body></html>

  This is from RePEc:bdr, an archive that recently expanded and has
  only junk. Clearly this example shows an important weakness of
  rarch. It currently has no support for reading a potential template
  file and judging it by its contents. This has not been an important
  issue so far, but RePEc:bdr makes it important.
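
  A check by contents could start as simply as the sketch below. It is
  not part of rarch; the function name and the three labels are made
  up for illustration. The only outside fact relied upon is that ReDIF
  templates carry a Template-Type field.

```python
import re

# A Template-Type field at the start of some line.
TEMPLATE_TYPE = re.compile(rb"^\s*Template-Type\s*:", re.IGNORECASE | re.MULTILINE)
# An HTML page, such as a "Request Rejected" error served under an .rdf name.
HTML_START = re.compile(rb"\s*(<!doctype\s+html|<html)", re.IGNORECASE)
UTF8_BOM = b"\xef\xbb\xbf"


def classify(data: bytes) -> str:
    """Return 'html', 'redif' or 'unknown' for the raw bytes of a file."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    if HTML_START.match(data):
        return "html"
    if TEMPLATE_TYPE.search(data):
        return "redif"
    return "unknown"
```

  Files that come out as "unknown" would include the rare cases with
  otherwise plausible data but no Template-Type, so they would still
  need a look by hand.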

  There are a few rare occasions in the trash where the data contains
  no Template-Type, but otherwise seems to be correct ReDIF data. I
  have not corrected them manually. The only corrections I made were
  for garbled UTF-16-LE data, where I took empty bytes out. And I
  think once I manually changed en-dashes into normal dashes to be
  able to incorporate a file. Clearly starting to fix more would
  result in me not finishing by the end of the year. The only thing
  more that could be done is to fix the few files without a
  Template-Type.

  To summarize, we can use the shuftis. They contain summaries
  of the records only. Example

archec at svega:~/shufti$ zcat cit.json.gz 
{
 "RePEc:cit": {
  "2004-12-08T20:27:25Z": [
   1264,
   293
  ],
  "2013-03-10T16:49:37Z": [
   2862,
   293
  ]
 }
}

archec at svega:~/rarch/bin$ ./sumshu 
4574547 11268778

  The first number is the number of records, and the second the number
  of instances of those records. So we have 2.46 instances per
  record. Presumably, if I were to run this on a weekly basis, the
  number of instances would increase.
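
  For illustration, a count of this kind could be produced along the
  following lines. This is a guess at what sumshu does, not its actual
  code. It assumes that every shufti is a gzipped JSON file shaped
  like the cit.json.gz example, that each top-level key is one record,
  and that each timestamp under it is one instance. The meaning of the
  two numbers stored under each timestamp is not used.

```python
import gzip
import json
from pathlib import Path


def count_shufti(shufti_dir: Path) -> tuple[int, int]:
    """Return (records, instances) summed over all *.json.gz files."""
    records = 0
    instances = 0
    for path in sorted(shufti_dir.glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            summary = json.load(f)
        for versions in summary.values():
            records += 1               # assumed: one top-level key per record
            instances += len(versions) # assumed: one timestamp per instance
    return records, instances
```

  Under these assumptions the cit.json.gz example alone would count
  as one record with two instances.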

  The immediate step is to remove the tarballs from being backed up by
  aigtu. Aigtu was 98% full recently. Then, I will move archec to
  helos, and delete tarballs there over time, starting with ones that
  are not that useful, like the most recent ones. At the same time, I
  will start a live survey of the data. The live survey will work in
  a different way from the tarballs, but the main infrastructure is
  there.

  I give myself my heartfelt congratulations for this work.
  
--


  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel
