[CollEc] RePEc Visual
Düben, Christian
Christian.Dueben at uni-hamburg.de
Fri Jun 19 18:50:49 UTC 2020
A quick update on the new CollEc:
Setting up the database within a Docker container turned out to be a good idea. The database kept crashing for days, and with Docker I could simply spin it up again without having to ask you to fix the main installation every time.
The source of the crashes was a mismatch between MariaDB's default settings and the size of the inserted data. After adjusting various settings, the containerized database is now stable.
An issue that remains is the insert speed of LOAD DATA LOCAL INFILE. Calculating four distance matrices of more than 2.2 billion cells each and writing them to disk does not take a lot of time. Loading them into the database, however, easily takes multiple days. Because MariaDB restricts the number of columns per table, the data has to be inserted in long format, i.e. more than 8.8 billion rows across the four matrices.
I am working on reducing that insert time to no more than a few hours. Using the distance matrices' symmetry, the zeros along the main diagonal, and the unconnectedness of authors across subgraphs, I cut the table length by more than half. Instead of N^{2} rows, where N is the total number of authors, I now insert sum_{i} N_{i} (N_{i} - 1)/2 rows, where N_{i} is the number of authors in subgraph i. This reduces the more than 2.2 billion rows to around 1.02 billion rows for each of the four transition functions. As this still takes a long time, I am testing further modifications to the database. There are various MariaDB system variables apart from the already modified ones (net_read_timeout, net_write_timeout, wait_timeout, innodb-fatal-semaphore-wait-threshold, max_allowed_packet, innodb-buffer-pool-size) for which I have yet to figure out appropriate values.
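To illustrate the row reduction described above, here is a minimal Python sketch. It is not the code that generates the CollEc data; the author ids, the toy matrices, and the function names pair_count and long_rows are made up for the example. It assumes each connected subgraph comes with its own symmetric distance matrix and writes only the strict upper triangle in long format.

```python
import csv
import io
from itertools import combinations


def pair_count(component_sizes):
    """Rows needed when only unordered within-subgraph pairs are stored."""
    return sum(n * (n - 1) // 2 for n in component_sizes)


def long_rows(authors, dist):
    """Yield (author_a, author_b, distance) for the strict upper triangle.

    `authors` lists the author ids of one connected subgraph and `dist` is
    that subgraph's symmetric distance matrix as a list of lists.
    """
    for i, j in combinations(range(len(authors)), 2):
        yield authors[i], authors[j], dist[i][j]


# Toy data (hypothetical): two subgraphs with 3 and 2 authors.
components = [
    (["a1", "a2", "a3"], [[0, 1, 2], [1, 0, 1], [2, 1, 0]]),
    (["b1", "b2"], [[0, 4], [4, 0]]),
]

# Write the long-format rows to an in-memory CSV; a real run would write a
# file for the database's bulk loader instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for authors, dist in components:
    writer.writerows(long_rows(authors, dist))

rows = buffer.getvalue().splitlines()
total_authors = sum(len(authors) for authors, _ in components)
print(len(rows), "rows instead of", total_authors ** 2)  # 4 rows instead of 25
```

With this layout a lookup has to put the two author ids into the stored order first, treat identical ids as distance zero, and treat a missing row as "not connected".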
Betweenness calculations now run in a manageable amount of time, but are only computed with three of the four transition functions. The currently implemented exponential transition function generates edge weights small enough to crash the system when used in betweenness computations. The app therefore does not cover this combination.
Addressing the database issues is taking longer than I expected. You can test the app once I have dealt with the performance bottlenecks. The app and the code generating the data once a day are ready to be deployed.
Have a nice day.
Kind regards,
Christian
Christian Düben
Research Associate
Chair of Macroeconomics
Hamburg University
Von-Melle-Park 5, Room 3102
20146 Hamburg
Germany
+49 40 42838 1898
christian.dueben at uni-hamburg.de
http://www.christian-dueben.com
-----Original Message-----
From: CollEc-run <collec-run-bounces at lists.openlib.org> On Behalf Of Düben, Christian
Sent: Wednesday, 10 June 2020 11:24
To: Thomas Krichel <krichel at openlib.org>
Cc: CollEc Run <collec-run at lists.openlib.org>
Subject: Re: [CollEc] RePEc Visual
I did indeed consider using the main installation. The container just turned out to be the easier solution because it automatically links the database to the other containers via the bridge network.
-----Original Message-----
From: Thomas Krichel <krichel at openlib.org>
Sent: Wednesday, 10 June 2020 11:14
To: Düben, Christian <Christian.Dueben at uni-hamburg.de>
Cc: CollEc Run <collec-run at lists.openlib.org>
Subject: Re: [CollEc] RePEc Visual
Düben, Christian writes
> Thanks. And sorry for breaking it in the first place. It should not happen again.
>
> I now use a containerized MariaDB which the other containers can directly access through the bridge network.
>
This issue should not have prevented you from using the main
installation.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel