[CollEc] Website fail

Christian Düben cdueben.ml at proton.me
Sun Sep 15 15:26:16 UTC 2024


You told me that you did not want to use a database. You said you wanted it written to text files. Text files are not a database.

Honestly, precomputing all shortest paths is a terrible idea. It is unnecessarily inefficient. Centrality measures need to be computed beforehand, but paths should be derived during user sessions. All paths taken together occupy hundreds of GB on disk.
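To illustrate what deriving paths during a user session could look like, here is a minimal sketch, not the actual CollEc code. It assumes the co-authorship network is unweighted and held in memory as an adjacency dictionary; the name `shortest_paths` and the toy graph are hypothetical. A breadth-first search from the requested author records every predecessor on a shortest route, so all shortest paths for one author pair can be rebuilt on request instead of being stored on disk.

```python
from collections import deque


def shortest_paths(graph, source, target):
    """Return all shortest paths from source to target in an unweighted graph.

    graph is an adjacency dict: node -> iterable of neighbouring nodes.
    """
    if source == target:
        return [[source]]
    dist = {source: 0}        # hop count from source
    parents = {source: []}    # predecessors lying on some shortest path
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break  # every predecessor of target has been expanded by now
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                parents[neighbour] = [node]
                queue.append(neighbour)
            elif dist[neighbour] == dist[node] + 1:
                parents[neighbour].append(node)  # another equally short route
    if target not in parents:
        return []  # target is not reachable from source

    def build(node):
        # Walk the predecessor lists back to the source.
        if node == source:
            return [[source]]
        return [path + [node] for parent in parents[node] for path in build(parent)]

    return build(target)


# Hypothetical toy network: authors a..e linked by co-authorship.
toy = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
paths = shortest_paths(toy, "a", "e")
```

The `dist` dictionary built during such a search holds the hop counts that closeness centrality is computed from, which is why those values are still worth computing ahead of time while the paths themselves are not.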

The best way would be to store the data in Neo4j and update the database based on messages to an API. But there is no API. CollEc's input is an XML file, which does not even come with a change log; it only comes as the full data set.

I can reduce the number of threads, i.e. the number of workers running in parallel, if the load is too heavy. RAM utilization is already minimal. The new code is the most performant program any version of CollEc has ever seen.
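As a rough illustration of that knob, and again not the actual CollEc code, the sketch below assumes a configurable worker count and gives each worker its own output file, so no locks are needed. The names `write_chunk`, `run`, and the file pattern `distances_<id>.txt` are hypothetical, and the line written per author is only a placeholder for real results.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_chunk(worker_id, sources, out_dir):
    """Write one line per source author to a file owned by this worker only."""
    path = Path(out_dir) / f"distances_{worker_id}.txt"
    with path.open("w", encoding="utf-8") as handle:
        for source in sources:
            # Placeholder payload: a real run would write computed results here.
            handle.write(f"{source}\n")
    return path


def run(sources, out_dir, n_workers):
    """Split the authors across n_workers threads, one output file per thread."""
    # Round-robin split, so every author lands in exactly one chunk.
    chunks = [sources[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(write_chunk, worker_id, chunk, out_dir)
            for worker_id, chunk in enumerate(chunks)
        ]
        return [future.result() for future in futures]
```

Lowering `n_workers` reduces the load on the machine at the cost of a longer run. In CPython, threads mainly help with I/O, so this sketch only demonstrates the one-file-per-worker layout, not the speed of a compiled program.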

I have sacrificed multiple days to craft this piece of software exactly to your demands. You now have the binary paths for individual authors. You have distance values and you have closeness centrality results. Everything is stored in the requested antique output formats.

All I get in return is insults. First, I am accused of not writing the code myself. Then, you complain about system design despite it meeting exactly the requirements.

If you had told me before that you do not want me to implement this, it would have saved me a lot of work. Just do it yourself. Write it in Perl, COBOL, or whatever. I am out. This was my last contribution to CollEc.

On Monday, September 16th, 2024 at 00:39, Thomas Krichel <krichel at openlib.org> wrote:

> Christian Düben writes
>
> > For performance reasons, threads write to their own files.
>
> I am not sure what threads are and why we need them here. All I
> need is to have the paths from one author to all others in a
> file. These can all be run in parallel. In your run, you seem to try
> to do all authors at the same time. This poses a great strain on the
> machine. I suggest calculating one author at a time, using
> parallel processing in a database on when author data has been
> changed.
>
> > This way, I can use parallelism without locks. If you prefer all
> > paths, distances, and closeness centrality values to respectively be
> > in single files instead of thread-specific files, I can change
> > that. However, that probably slows down the program's execution.
>
> This massive parallel way of handling the job makes no sense to
> me.
>
> > All shortest paths within an author pair are not necessarily stored
> > consecutively. A paths file might contain the first shortest path
> > from author 1 to author 2, followed by the first shortest path from
> > author 1 to author 4, followed by the second shortest path from
> > author 1 to author 2. I can order them, if needed - again at a
> > performance penalty.
>
> This makes no sense to me. This is not how I built the old
> CollEc. I ran a system that took nodes and updated them. Then
> I could run updates around the clock, and I could run as many
> processes as I had machine capacity for. That is a
> completely different approach than what you try, which is
> to make a complete calculation every now and then.
>
> Now the machine is so slow that I can hardly use it.
>
> It would be better to solve the task at hand, which is
> to create a fast program to do binary paths for an
> individual author. I can then take this up and try
> to resuscitate the old site.
>
> --
> Written by Thomas Krichel http://openlib.org/home/krichel on his 21653rd day.