[ArchEc] end of tarballs reached
Thomas Krichel
krichel at openlib.org
Mon Nov 22 12:35:16 UTC 2021
This is long, and very important.
I have finished working on the tarballs. Here is the historical listing:
repecsnapshot@helos:~$ ls -l archive/
total 1653215736
-rw-r--r-- 1 repecsnapshot repecsnapshot 484660039 Feb 14 2020 RePEc_2005-01-13.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 904066637 Feb 1 2007 RePEc_2007-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 941231076 Mar 1 2007 RePEc_2007-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 963658331 Apr 1 2007 RePEc_2007-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 985673832 May 1 2007 RePEc_2007-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 1456266164 Feb 1 2008 RePEc_2008-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 1473425648 Mar 1 2008 RePEc_2008-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 1862245883 Apr 5 2009 RePEc_2009-04-05.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6194425874 May 1 2009 RePEc_2009-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6449767907 Jun 1 2009 RePEc_2009-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6494054786 Jul 1 2009 RePEc_2009-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6454682521 Aug 1 2009 RePEc_2009-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6488278395 Sep 1 2009 RePEc_2009-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6537097218 Oct 1 2009 RePEc_2009-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6602806339 Nov 1 2009 RePEc_2009-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6633233978 Dec 1 2009 RePEc_2009-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6674276232 Jan 1 2010 RePEc_2010-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3056671753 Feb 1 2010 RePEc_2010-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2271868077 Mar 1 2010 RePEc_2010-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2299346881 Apr 1 2010 RePEc_2010-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2351089160 May 1 2010 RePEc_2010-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 133496832 Jun 1 2010 RePEc_2010-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2370259005 Jul 1 2010 RePEc_2010-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2139186858 Aug 1 2010 RePEc_2010-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2210199963 Sep 1 2010 RePEc_2010-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2257736575 Oct 1 2010 RePEc_2010-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2363757937 Nov 1 2010 RePEc_2010-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2466406462 Dec 1 2010 RePEc_2010-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2602037604 Jan 1 2011 RePEc_2011-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2639670405 Feb 1 2011 RePEc_2011-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2706271220 Mar 1 2011 RePEc_2011-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2774583851 Apr 1 2011 RePEc_2011-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2827230082 May 1 2011 RePEc_2011-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2954008530 Jun 1 2011 RePEc_2011-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3029854979 Jul 1 2011 RePEc_2011-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3081255473 Aug 1 2011 RePEc_2011-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3213419623 Sep 1 2011 RePEc_2011-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3322078071 Oct 1 2011 RePEc_2011-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3419023452 Nov 1 2011 RePEc_2011-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3553889294 Dec 1 2011 RePEc_2011-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3752766704 Jan 1 2012 RePEc_2012-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3838959870 Feb 1 2012 RePEc_2012-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3942625941 Mar 1 2012 RePEc_2012-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3974520415 Apr 1 2012 RePEc_2012-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4034926857 May 1 2012 RePEc_2012-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4071460961 Jun 1 2012 RePEc_2012-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4284563770 Jul 1 2012 RePEc_2012-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4261127463 Aug 1 2012 RePEc_2012-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4302407553 Sep 1 2012 RePEc_2012-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4348557297 Oct 1 2012 RePEc_2012-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4688488319 Nov 1 2012 RePEc_2012-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 5641946931 Jul 11 2016 RePEc_2013-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 5711299146 Feb 25 2013 RePEc_2013-02-25.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6120639264 Mar 1 2013 RePEc_2013-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6197183316 Apr 1 2013 RePEc_2013-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6294581775 May 1 2013 RePEc_2013-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6536699176 Jul 1 2013 RePEc_2013-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6515265030 Aug 1 2013 RePEc_2013-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6726546517 Sep 1 2013 RePEc_2013-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 7040050302 Oct 1 2013 RePEc_2013-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 7176390568 Nov 1 2013 RePEc_2013-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 7705018567 Jan 1 2014 RePEc_2014-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 7819037451 Feb 1 2014 RePEc_2014-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 8055801010 Mar 1 2014 RePEc_2014-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 8154382673 Apr 1 2014 RePEc_2014-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54249581051 Sep 18 2015 RePEc_2015-09-18.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54292904727 Oct 1 2015 RePEc_2015-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54477600620 Nov 1 2015 RePEc_2015-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 17861613243 Dec 1 2015 RePEc_2015-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 18152473788 Jan 1 2016 RePEc_2016-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 18643207732 Feb 1 2016 RePEc_2016-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19244499570 Mar 1 2016 RePEc_2016-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 12390121472 Apr 1 2016 RePEc_2016-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19202734174 May 1 2016 RePEc_2016-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19729705031 Jun 1 2016 RePEc_2016-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19906775204 Jul 1 2016 RePEc_2016-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 12988989440 Aug 1 2016 RePEc_2016-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 20380834585 Sep 1 2016 RePEc_2016-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 20604648388 Oct 1 2016 RePEc_2016-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 21685010228 Nov 1 2016 RePEc_2016-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 22583953287 Dec 1 2016 RePEc_2016-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23748121441 Jan 1 2017 RePEc_2017-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23988179882 Feb 1 2017 RePEc_2017-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23892989701 Mar 1 2017 RePEc_2017-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 24571501618 Apr 1 2017 RePEc_2017-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 25381949487 May 1 2017 RePEc_2017-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 32375688674 Jan 7 2020 RePEc_2020-01-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 32752245757 Feb 7 2020 RePEc_2020-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33228125218 Mar 7 2020 RePEc_2020-03-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33425579195 Apr 7 2020 RePEc_2020-04-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33724114112 May 7 2020 RePEc_2020-05-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 36387840691 Jun 7 2020 RePEc_2020-06-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 36978171038 Jul 7 2020 RePEc_2020-07-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37175093933 Aug 7 2020 RePEc_2020-08-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37606509948 Sep 7 2020 RePEc_2020-09-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37806459795 Oct 7 2020 RePEc_2020-10-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 38279176241 Nov 7 2020 RePEc_2020-11-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 38520442306 Dec 7 2020 RePEc_2020-12-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39072868567 Jan 7 2021 RePEc_2021-01-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39444298486 Feb 7 2021 RePEc_2021-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39742557323 Mar 7 2021 RePEc_2021-03-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39915211339 Apr 7 2021 RePEc_2021-04-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 50509905007 May 7 2021 RePEc_2021-05-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 50706129419 Jun 7 23:34 RePEc_2021-06-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 51138502107 Jul 7 23:20 RePEc_2021-07-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52134005876 Aug 7 23:50 RePEc_2021-08-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52359572651 Sep 7 23:51 RePEc_2021-09-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52793698280 Oct 8 00:00 RePEc_2021-10-07.tar.gz
repecsnapshot@helos:~$ du -sb archive/
1692892034551 archive/
So it's roughly 1.7 terabytes. Some of the tarballs appear to contain
Hollywood movies.
Much of the ArchEc work of extracting templates is automated in the
software I wrote. The template data is in the vault
archec@svega:~$ du -sb vault
9180530114 vault
So the vault is 0.54% of the tarballs.
All the non-template data in the archive had to be manually sorted
into material that we want to keep. It is in the cellar
archec@svega:~$ du -sb cellar/
61256595997 cellar/
Thus the vault is only about 15% of the cellar (a ratio of 0.15). But that comparison
is somewhat misleading, because the vault is compressed tarballs,
one per RePEc archive, whereas the cellar is organized by tarball date
archec@svega:~/cellar$ du -sb * | less
634287180 2005-01-13
536861971 2007-02-01
44834272 2007-03-01
25349991 2007-04-01
21045749 2007-05-01
498172493 2008-02-01
17374225 2008-03-01
464236588 2009-04-05
94967981 2009-05-01
27444649 2009-06-01
45358045 2009-07-01
36796993 2009-08-01
25261355 2009-09-01
52208902 2009-10-01
71659459 2009-11-01
27825748 2009-12-01
51560339 2010-01-01
57371378 2010-02-01
24226893 2010-03-01
25456810 2010-04-01
49648187 2010-05-01
5224014 2010-06-01
43502725 2010-07-01
38417643 2010-08-01
55986330 2010-09-01
51172577 2010-10-01
129154130 2010-11-01
81382474 2010-12-01
151121065 2011-01-01
33716019 2011-02-01
81951320 2011-03-01
78195257 2011-04-01
42583732 2011-05-01
99387926 2011-06-01
88551512 2011-07-01
63851460 2011-08-01
168786636 2011-09-01
89232615 2011-10-01
105594132 2011-11-01
210477319 2011-12-01
191575866 2012-01-01
114081797 2012-02-01
164878279 2012-03-01
60185290 2012-04-01
66670237 2012-05-01
53926838 2012-06-01
219640488 2012-07-01
109846291 2012-08-01
67363920 2012-09-01
55089588 2012-10-01
357429531 2012-11-01
779190379 2013-02-07
169804007 2013-02-25
293097014 2013-03-01
132990958 2013-04-01
126529750 2013-05-01
248474898 2013-07-01
91941447 2013-08-01
244424689 2013-09-01
355380515 2013-10-01
180926842 2013-11-01
491438262 2014-01-01
199507171 2014-02-01
304653827 2014-03-01
125102163 2014-04-01
11951277650 2015-09-18
54133492 2015-10-01
179301723 2015-11-01
216168974 2015-12-01
296785856 2016-01-01
602075445 2016-02-01
550662685 2016-03-01
203808483 2016-04-01
285364623 2016-05-01
392463192 2016-06-01
343421203 2016-07-01
569052659 2016-08-01
259768124 2016-09-01
340470924 2016-10-01
1353382811 2016-11-01
244735724 2016-12-01
221196926 2017-01-01
314834229 2017-02-01
203896839 2017-03-01
432300255 2017-04-01
113302552 2017-05-01
9293990047 2020-01-07
697589478 2020-02-07
230345565 2020-03-07
235876226 2020-04-07
344132561 2020-05-07
2754910067 2020-06-07
575383272 2020-07-07
457447081 2020-08-07
538063845 2020-09-07
505377160 2020-10-07
554300026 2020-11-07
298479749 2020-12-07
541246056 2021-01-07
732556684 2021-02-07
244025703 2021-03-07
361546881 2021-04-07
11993508900 2021-05-07
279982020 2021-06-07
599068731 2021-07-07
459875968 2021-08-07
246665705 2021-09-07
529427666 2021-10-07
and the data is not compressed.
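The size comparisons above can be rechecked with a few lines of Python. This is only a sketch: the byte counts are copied from the du -sb output quoted in this message, and the constant and function names are made up for illustration.

```python
# Byte counts copied from the "du -sb" output quoted in this message.
ARCHIVE_BYTES = 1692892034551  # all tarballs in archive/ on helos
VAULT_BYTES = 9180530114       # extracted template data
CELLAR_BYTES = 61256595997     # manually sorted non-template data


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


print(f"archive size:    {ARCHIVE_BYTES / 1e12:.2f} TB")
print(f"vault / archive: {percent(VAULT_BYTES, ARCHIVE_BYTES):.2f}%")
print(f"vault / cellar:  {percent(VAULT_BYTES, CELLAR_BYTES):.1f}%")
```

The vault comes out at roughly half a percent of the tarballs and roughly 15% of the cellar, keeping in mind that the vault is compressed and the cellar is not.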
Within each cellar date, I have taken care to create a symlink to an
earlier version if that is identical. But if I don't operate by
date, then I can't have two versions of the same file, unless I
build a WARC archive similar to what I have in the
vault. Ideally, I would merge the cellar into the vault. At this
time, this is not done. Getting to this stage has already been a
Herculean task that has occupied me since July, and for which I am
paid only 2000 Euros.
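For readers who want to picture the symlink step, here is a minimal sketch. It is not the rarch code; the function name link_if_identical and the idea of comparing one pair of files at a time are assumptions made for illustration.

```python
import filecmp
import os


def link_if_identical(earlier_path, later_path):
    """Replace later_path with a relative symlink to earlier_path when both
    files have byte-identical contents. Returns True if a link was made.

    Hypothetical helper, not part of rarch.
    """
    if os.path.islink(later_path):
        return False  # already linked to an earlier version
    if not (os.path.isfile(earlier_path) and os.path.isfile(later_path)):
        return False
    # shallow=False forces a comparison of the contents, not just os.stat().
    if not filecmp.cmp(earlier_path, later_path, shallow=False):
        return False
    # Resolve the earlier file first, so that a chain of identical versions
    # always points at the oldest real file rather than at another symlink.
    later_dir = os.path.realpath(os.path.dirname(later_path))
    target = os.path.relpath(os.path.realpath(earlier_path), later_dir)
    os.remove(later_path)
    os.symlink(target, later_path)
    return True
```

A relative link target is used so that the cellar can be moved to another machine without breaking the links.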
So this is the cellar, and then there is the trash:
archec@svega:~$ du -sb trash
150200223 trash
Of the trash, we only really have to note the checksums, but for the
vast majority, I actually have the full files, under a file name that
is the basename of the file prefixed by the SHA1 of the contents
archec@svega:~/trash$ ls -lrt | tail -5
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 2KKDANE66S53P32ILOCPH5DYWCKHRFUD_borra_031.rdf
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 2FFBOX4DCJNZSQOVT3F63L444KIXH3C4_2017a-10-195-255.rdf
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 2DQ243MZNAAFZD7V4RAQPY46TZ5CZODI_borra_051.rdf
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 2DH2QTRCGI6G5LZO4QKSTC67A5UWGJIP_2008-08.rdf
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 24V7M67PQAPD45UA5SL4A7SQUK6N64QA_2014-05-505-526.rdf
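The naming scheme in this listing could be reproduced roughly as follows. This is a sketch under an assumption: the 32-character prefixes look like base32-encoded SHA-1 digests (160 bits is exactly 32 base32 characters), but the message does not say which encoding rarch uses. The function name trash_name is hypothetical.

```python
import base64
import hashlib
import os


def trash_name(original_path, contents):
    """Build a trash file name: SHA-1 of the contents, an underscore,
    then the basename of the original file.

    Assumption: the digest is base32-encoded, which is what the
    32-character prefixes in the listing appear to be.
    """
    digest = hashlib.sha1(contents).digest()           # 20 raw bytes
    prefix = base64.b32encode(digest).decode("ascii")  # 32 chars, no padding
    return f"{prefix}_{os.path.basename(original_path)}"
```

Because the prefix depends only on the contents, two files with the same basename but different contents get different names, and the checksum can be read off the file name.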
The last file as an example:
archec@svega:~/trash$ cat 24V7M67PQAPD45UA5SL4A7SQUK6N64QA_2014-05-505-526.rdf
<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br><br>Your support ID is: <17685956588297156091><br><br><a href='javascript:history.back();'>[Go Back]</body></html>archec@svega:~/trash$
This is from RePEc:bdr, an archive that recently expanded and has only
junk. Clearly this example shows an important weakness of rarch.
It currently has no support for reading a potential template file and
determining what it is by its contents. This has not been an important
issue, but RePEc:bdr makes it important.
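A content check of the kind that is missing might look like the sketch below. It is not part of rarch; the function name classify and the two patterns are assumptions. It relies only on the fact that ReDIF templates carry a Template-Type field, while a rejection page like the one above starts with an HTML tag.

```python
import re

# A ReDIF template has a "Template-Type:" field at the start of a line.
_TEMPLATE_TYPE = re.compile(rb"^\s*Template-Type\s*:", re.IGNORECASE | re.MULTILINE)
# An HTML page typically opens with a doctype declaration or an <html> tag.
_HTML_START = re.compile(rb"^\s*<(!doctype|html)\b", re.IGNORECASE)


def classify(contents):
    """Guess what a downloaded .rdf file really is, from its bytes.

    Returns "html", "redif" or "unknown". Hypothetical helper.
    """
    if contents.startswith(b"\xef\xbb\xbf"):
        contents = contents[3:]  # drop a UTF-8 byte order mark
    if _HTML_START.match(contents):
        return "html"
    if _TEMPLATE_TYPE.search(contents):
        return "redif"
    return "unknown"
```

Files classified as "html" could go straight to the trash, and "unknown" ones could be queued for a manual look.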
There are a few rare occasions in the trash where the data contains
no Template-Type, but otherwise seems to be correct ReDIF data. I have
not corrected them manually. The only corrections I made were for
garbled UTF-16-LE data, where I took empty bytes out. And I think
once I manually changed en-dashes into normal dashes to be able to
incorporate a file. Clearly, starting to fix more would result in me
not finishing by the end of the year. The only further thing that
could be done is to fix the few files without a Template-Type.
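The two manual corrections mentioned above can be pictured with this small sketch. The helper names are invented, and the real fixes were done by hand. Note the limitation: dropping empty bytes only recovers UTF-16-LE text whose characters are all plain ASCII.

```python
def strip_empty_bytes(contents):
    """Remove NUL bytes from data that was stored as UTF-16-LE.

    Only safe when every character is ASCII, so that the high byte of
    each 16-bit unit is the empty byte being removed.
    """
    return contents.replace(b"\x00", b"")


def normalise_dashes(text):
    """Replace en-dashes (U+2013) with plain hyphen-minus characters."""
    return text.replace("\u2013", "-")
```

For example, strip_empty_bytes("Template-Type".encode("utf-16-le")) gives back the plain bytes b"Template-Type".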
To summarize, we can use the shuftis. They contain summaries
of the records only. Example:
archec@svega:~/shufti$ zcat cit.json.gz
{
  "RePEc:cit": {
    "2004-12-08T20:27:25Z": [
      1264,
      293
    ],
    "2013-03-10T16:49:37Z": [
      2862,
      293
    ]
  }
}
archec@svega:~/rarch/bin$ ./sumshu
4574547 11268778
The first number is the number of records, and the second the number
of instances of those records. So we have 2.46 instances per
record. Presumably, if I were to run this on a weekly basis, the
number of instances would increase.
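The sumshu script itself is not shown here, so the following is only a guess at how such a count could work. It assumes that each top-level key in a shufti file is one record and that each timestamp under it is one instance; the two integers stored per timestamp are not interpreted. The function names are hypothetical.

```python
import gzip
import json
from pathlib import Path


def count_shufti(doc):
    """Count (records, instances) in one parsed shufti document.

    Assumption: top-level keys are records, timestamps are instances.
    """
    records = len(doc)
    instances = sum(len(stamps) for stamps in doc.values())
    return records, instances


def sum_shuftis(directory):
    """Add up the counts over every *.json.gz file in a directory."""
    total_records = total_instances = 0
    for path in sorted(Path(directory).glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            records, instances = count_shufti(json.load(handle))
        total_records += records
        total_instances += instances
    return total_records, total_instances
```

Under these assumptions, the cit.json.gz example above counts as one record with two instances, and the totals reported by sumshu give 11268778 / 4574547, or about 2.46 instances per record.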
The immediate step is to remove the tarballs from being backed up by
aigtu. Aigtu was 98% full recently. Then, I will move archec to
helos, and delete tarballs there over time, starting with ones that
are not that useful, like the most recent ones. At the same time, I
will start a live survey of the data. The live survey will work in a different
way from the tarballs, but the main infrastructure is there.
I give myself my heartfelt congratulations for this work.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel