[CollEc] RePEc Visual
Düben, Christian
Christian.Dueben at uni-hamburg.de
Wed Jul 8 19:36:51 UTC 2020
I changed the documentation. Take a look.
After weeks of optimizing the database to efficiently handle billions of pre-calculated distance values, I decided to implement the alternative solution. The app now computes distances directly from the graph rather than reading pre-calculated distance values from the SQL database.
The database optimizations I implemented did improve the daily data insertions. However, the performance did not get anywhere near the intended level. And the more I optimized the database, the more server resources it consumed. As a result, other processes on the server became slower, including the app. Another issue with reading the distance values from the SQL database is the extraction time. Even with indexed search variables, queries on tables with over a billion rows are not instant.
I had tried to keep any network analysis computations out of the app itself. But given how the database approach played out, in-app calculations appear to be the preferable choice. Calculating distances when a user interacts with the app rather than in the daily data generation process drastically cuts the time and resources consumed by that daily data generation. And reading the main graph from an optimized file and deriving the distance values turns out to be even faster than querying those values from a long MariaDB table. Reading the graph into memory and calculating the distances from a specific author to all other authors in that graph takes roughly 0.18 to 0.21 seconds. My attempt to push these 0.2 seconds closer to zero and to read a few KB of distance values rather than a 10 MB graph motivated my efforts to insert distances into the database in the first place. And now the supposedly less efficient solution turns out to be the more efficient one. Although I chose the alternative solution, I am glad to have tested the database implementation. I gained insights into the technical features of MariaDB (and MySQL) and the InnoDB engine which I may use at a later point.
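To illustrate the in-app calculation described above, here is a minimal sketch using igraph. It is not the app's actual code: the toy edge list and author labels are made up, the graph is built in memory instead of being read from the optimized file, and the inverse transition function is assumed for the weights.

```r
library(igraph)

# Hypothetical toy edge list: one row per pair of co-authors
# with the number of papers they wrote together
edges <- data.frame(
  from   = c("A", "A", "B", "C"),
  to     = c("B", "C", "C", "D"),
  papers = c(1, 2, 4, 1)
)

# Undirected graph with inverse transition weights (1 / number of joint papers)
g <- graph_from_data_frame(edges, directed = FALSE)
E(g)$weight <- 1 / E(g)$papers

# Shortest cost path lengths from author A to all authors in the graph,
# derived with Dijkstra's algorithm; the result is a 1 x N matrix
d <- distances(g, v = "A", to = V(g), weights = E(g)$weight, algorithm = "dijkstra")
print(d)
```

In this toy graph the cheapest route from A to B runs via C (0.5 + 0.25) rather than along the direct edge (1), which is the kind of effect the weighted edges are meant to capture.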
Much of the other data used by the app is still read from MariaDB tables. And that works fine.
Now the app is almost ready to be publicly released. I just have to fix a bug in the app's distances tab and produce the introductory video. And we have to agree on the open content questions and have to connect the app's port with one of the RePEc URLs. As I mentioned in the previous e-mail, everything related to the new implementation is up for discussion.
May I install Nginx on the server? Then I can link ShinyProxy's port 8080 to the default HTTP port 80.
You asked for the documentation's source code. Here it is:
h1("Documentation"),
h3("Network"),
p("CollEc constructs and examines the co-authorship network using methods from the field of network analysis. Assuming no computer science background on the
side of many CollEc users, this documentation begins with a short introduction to the basics of networks."),
p("Graphs, i.e. networks, exist in many different applications. Those include websites on the internet, geo-spatial data, social media connections, co-authorship
among economists etc. A graph consists of vertices, also called nodes, that are connected via edges. A vertex is e.g. a location in geo-spatial data, a
registered social media user, a website or a researcher. Edges between vertices are e.g. roads between locations, links to other websites, co-authorship between
economists etc. Both vertices and edges have attributes. A typical vertex attribute is a name. That might be the name of a location, or as in the case of
CollEc the name of an economist who published co-authored research. A common edge attribute is weights. Weights express the transition costs between vertices.
In geo-spatial applications the weight might express the distance between locations. In CollEc the weight represents the degree of collaboration, i.e. the
number of papers two authors wrote together. Edge weights are determined by transition functions. In CollEc you can interactively choose between different
function forms that model the transition cost between co-authors as a non-linear function of the number of joint papers. The transition costs are symmetric
as CollEc uses undirected graphs. Moving from author A to author B is as costly as moving from B to A."),
p("The following plot illustrates the concept of graphs. Authors A to K, the white vertices, are connected by joint research, the blue edges."),
img(src = "Example_Graph.png", width = "300px", style="display: block; margin-left: auto; margin-right: auto;"),
p("CollEc's network currently consists of more than 47,000 authors registered through the ", a(href = "https://authors.repec.org/", "RePEc Author
Service", target = "_blank"), ", a RePEc service maintained by ", a(href = "https://ideas.repec.org/zimm/", "Christian Zimmermann", target = "_blank"), ".
Each of them has at least one co-authored paper listed on RePEc. Not all of them are connected to the same graph. In fact there are over 900 unconnected
sub-graphs. The largest of them contains around 44,900 people and thus the vast majority of vertices. The remaining graphs are small and consist of e.g.
two otherwise unconnected people who published a joint paper. You can find the order of an author's graph, i.e. the graph or network size, below the distance, closeness
and betweenness plots."),
h3("Variable Definition"),
h4("Distance Measures"),
p("The computed distance is the length of the shortest cost path between the two selected authors. Edge weights measure the distance between adjacent authors.
The shortest cost path is, thus, the connection between two authors that minimizes the sum of edge weights, the path's length. Since the input is an undirected
graph with exclusively positive weights the shortest paths are derived through ", a(href = "https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm",
"Dijkstra's algorithm", target = "_blank"), ". The generated distance values are comparable within but not between transition functions."),
p("If you set ", tags$b("Weighted Edges"), " to ", tags$b("No"), ", edges are not weighted by the number of joint papers. Instead they all receive a weight
of one. So they are in fact weighted. But with all weights set to the same value the weights do not play a role."),
p("If you set ", tags$b("Weighted Edges"), " to ", tags$b("Yes"), " and ", tags$b("Transition Function"), " to ", tags$b("Inverse"), ", edges are weighted
by the inverse of the quantity of joint papers:"),
uiOutput("eq_distance_i"),
p("If you set ", tags$b("Weighted Edges"), " to ", tags$b("Yes"), " and ", tags$b("Transition Function"), " to ", tags$b("Gravity"), ", edges are weighted
by the inverse of the squared quantity of joint papers:"),
uiOutput("eq_distance_g"),
p("If you set ", tags$b("Weighted Edges"), " to ", tags$b("Yes"), " and ", tags$b("Transition Function"), " to ", tags$b("Exponential"), ", edges are weighted
by an exponential function based on the quantity of joint papers:"),
uiOutput("eq_distance_e"),
p("The following figure illustrates how the different transition functions translate the number of joint papers into edge weights. If two authors only collaborated
on one paper, it does not matter which transition function is selected. They all attribute a value of one to this connection. Beyond the first paper, the
effect varies between functions. ", tags$b("Inverse"), ", ", tags$b("Gravity"), " and ", tags$b("Exponential"), " all model diminishing returns to co-authorship
with the same person. However, with ", tags$b("Gravity"), " the edge weight, i.e. the transition cost or the distance, drops
more drastically over the first few papers than it does with ", tags$b("Inverse"), " and ", tags$b("Exponential"), "."),
img(src = "Transition_Functions.png", width = "500px", style="display: block; margin-left: auto; margin-right: auto;"),
p("CollEc users access bilateral distances through a plot of the following type."),
img(src = "Distances_Example.png", width = "500px", style="display: block; margin-left: auto; margin-right: auto;"),
p("In this example, the selected authors are Christian Düben and Thomas Krichel and distances are based on the inverse transition function. The two distributions
represent the kernel density estimates of distances from the two authors. The curves are similar in shape but shifted along the horizontal axis. Thomas Krichel is
closer to many authors in the graph than Christian Düben is. Junior researchers like Christian Düben tend to be less closely connected than people like Thomas
Krichel who have been in the field for decades. The red line denotes the bilateral distance between the two selected authors. It is slightly above the blue
distribution's mode and somewhere in the green distribution's upper tail. From Thomas Krichel's perspective, Christian Düben is a fairly distant author, whereas
from Christian Düben's perspective, Thomas Krichel is just as far away as many other authors are. A short text underneath the plot states the graph size and bilateral
distance. The two authors in this example are part of the main graph with around 45,000 people and are located at a distance of around 2.043."),
h4("Closeness"),
p(a(href = "https://en.wikipedia.org/wiki/Closeness_centrality", "Closeness", target = "_blank"), ", or closeness centrality, is the reciprocal of the sum of
the lengths of the shortest paths between a vertex and all other vertices. High closeness values imply short paths to other vertices and thus a central position."),
uiOutput("eq_closeness"),
p("\\(d(v,i)\\) is the length of the shortest cost path between vertex \\(v\\) and vertex \\(i\\). Given the constant inflow of new authors into the network,
closeness values are not comparable over time."),
p("A closeness plot with Nobel prize laureate Esther Duflo as selected author and distances based on the exponential transition function looks as follows."),
img(src = "Closeness_Example.png", width = "500px", style="display: block; margin-left: auto; margin-right: auto;"),
p("The blue distribution illustrates kernel density estimates of all authors' closeness values in that graph. Esther Duflo is part of the main graph which contains
around 45,000 authors and with it around 45,000 closeness values. With a closeness of around 2.24e-05, she is one of the most centrally located economists.
The red line representing that value is near the upper end of the closeness distribution. Graph size and closeness value are stated in a short text underneath
the plot."),
h4("Betweenness"),
p(a(href = "https://en.wikipedia.org/wiki/Betweenness_centrality", "Betweenness", target = "_blank"), ", or betweenness centrality, measures the number of
shortest paths passing through a vertex. High betweenness values imply high centrality."),
uiOutput("eq_betweenness"),
p("\\(\\sigma_{ij}\\) represents the number of shortest paths from vertex \\(i\\) to vertex \\(j\\). And \\(\\sigma_{ij}(v)\\) is the number of those paths
passing through vertex \\(v\\). Given the constant inflow of new authors into the network, betweenness values are not comparable over time."),
p("A betweenness plot with Nobel prize laureate Abhijit Banerjee as selected author and distances based on the gravity transition function looks as follows."),
img(src = "Betweenness_Example.png", width = "500px", style="display: block; margin-left: auto; margin-right: auto;"),
p("The blue distribution illustrates kernel density estimates of all authors' log betweenness values in that graph. It displays the logarithm of betweenness
because the distribution of actual betweenness is so wide that its shape can merely be guessed from a plot of this size. Abhijit Banerjee is part of the main
graph which contains around 45,000 authors and with it around 45,000 betweenness values. With a log betweenness of around \\(\\log(6,188,136) \\approx 15.638 \\),
he is one of the most centrally located economists. The red line representing that value is near the upper end of the betweenness distribution. Graph size and
betweenness value are stated in a short text underneath the plot."),
h3("Technical Implementation"),
h4("Data Generation"),
p("CollEc retrieves information on co-authorship from the ", a(href = "https://authors.repec.org/", "RePEc Author Service", target = "_blank"), ". In particular,
it extracts a vector of authors from each co-authored paper. These are then merged into a graph with edges weighted according to one of the four transition
functions. Any result available in this web application is derived from one of these graphs. The respective code is written in R with the graph construction and
analysis executed through the ", a(href = "https://igraph.org/r/", "igraph package", target = "_blank"), ". igraph is a wrapper for functions written in C and
C++ which makes it very efficient. Calculating more than 2.2 billion shortest cost path lengths and writing them to disk only takes a few minutes in an eight
CPU process."),
p("Much of that data is then inserted into a SQL database. As that process takes considerably more time than just writing the data to disk, the code does not
insert the full distance matrix. Instead of the \\(N \\times N\\) cells, which is currently more than 2.2 billion distance values, it uses around 1.02 billion
cells. Three properties allow the data to be cut by more than 50 percent without losing information. First, an undirected graph's distance matrix is symmetric.
Author A is as far from author B as author B is from author A. Second, all values along the main diagonal, the distance from an author to him- or herself,
are zero. Third, only authors that are part of the same graph are located at a finite bilateral distance. The number of stored distance values is thus
\\( \\sum_{i} N_{i} (N_{i} - 1)/2 \\) where \\( N_{i} \\) is the number of authors in graph \\( i \\)."),
p("Betweenness calculations using the exponential transition function include values small enough to crash the machine and are, thus, not implemented at this
point."),
p("CollEc retrieves RePEc Author Service data and computes the respective results once a day. Check the footer for the current update status. The respective
processes, i.e. the database and the R script generating the data, are executed from within ", a(href = "https://www.docker.com/", "Docker", target = "_blank"),
" containers."),
h4("Web Application"),
p("The web application is written in ", a(href = "https://shiny.rstudio.com/", "R Shiny", target = "_blank"), ", which merges server-side R with client-side
HTML, CSS and JavaScript. It reads data from the above-mentioned SQL database and displays it. The process generating the data and updating the database
once a day runs independently of the web application."),
p("CollEc uses ", a(href = "https://www.shinyproxy.io/", "ShinyProxy", target = "_blank"), " to deploy the app. ShinyProxy itself runs in a Docker container. When a user
visits the website, it spins up the app in another Docker container."),
h3("Privacy"),
p("CollEc does not set any cookies apart from the ones necessary to navigate a ", a(href = "https://shiny.rstudio.com/", "Shiny", target = "_blank"), " web
application. Users are not tracked anywhere outside this website and are not analyzed. There are no personalized ads and no data is shared with third parties.
Dropdown menu selections, including author names, transition functions etc., are only stored as long as a session is active. They are, therefore, deleted
within minutes of a user's inactivity."),
p("ShinyProxy logs access times, container crashes etc. but does not track what happens within the app. The underlying R session's output is usually not
printed to a file, except during testing and debugging by the maintainer."),
p("CollEc's decision not to track and analyze users and web application usage is motivated by compliance with strict European regulation. Instead of directly
observing who uses the app and how the app is used, CollEc relies on users to explicitly report their experience. If you encounter any errors, long loading
times or other issues, report them ",
a(href = "https://docs.google.com/forms/d/e/1FAIpQLSc6n-6FlzZx6YBorjlsSWpGm8PHbHAVxC9b9akcRyGVujLfQg/viewform?usp=sf_link", "anonymously", target = "_blank"),
" or contact the ", a(href = "http://www.christian-dueben.com", "maintainer", target = "_blank") , "."),
p("The introduction tab's video is a special case. It is a YouTube video embedded with the 'nocookies' option. YouTube only places cookies in the user's
browser once he or she clicks on the play button. Those cookies are subject to YouTube's cookie policy."),
p("This privacy statement, including the extent to which data is stored and to which cookies are used, may change with future updates to the web application."),
h3("Data Access"),
p("At this point, CollEc data is only available through this application's graphical output. A functionality to download the tabular data behind it will be
added in one of the next updates. In the meantime you can use other ", a(href = "https://ideas.repec.org/getdata.html", "RePEc data", target = "_blank"), "."),
h3("History"),
p("The first version of CollEc dates back to the year <YEAR>. It was developed by ", a(href = "http://openlib.org/home/krichel/", "Thomas Krichel", target = "_blank"), "
who also founded the RePEc Author Service. He wrote a software computing closeness and betweenness centrality using Perl and displayed the results on
a static website. An ", a(href = "http://collec.repec.org/", "image", target = "_blank"), " <UPDATE LINK> of that version is still available."),
p("In 2020, after decades of maintaining this project, Thomas Krichel transferred it to ", a(href = "http://www.christian-dueben.com", "Christian Düben", target = "_blank"),
". With the new maintainer came a new implementation. CollEc was re-written from scratch. Migrating the network analysis from Perl to modern C and C++
code wrapped in R functions boosted efficiency and facilitated extensions to the analysis. The primary extensions are bilateral distances and weighted
edges. The interface through which users view the data changed in various regards. Web applications' larger complexity compared to static websites gave
the new maintainer the flexibility to fundamentally redefine how the data is presented. The new CollEc puts results into perspective using graphical
output. When a user requests the distance between two authors, CollEc generates a plot comparing that bilateral distance to the distances to all other
authors in the network. A short text states further information on network size etc. The new CollEc revolves around the same approach as ",
a(href = "http://graphec.repec.org/", "GraphEc", target = "_blank"), ", another recently developed RePEc service, does. It is a highly interactive tool
presenting easily interpretable results and comparisons."),
h3("Contributions"),
p("If you would like to contribute to CollEc, register with the ", a(href = "https://authors.repec.org/", "RePEc Author Service", target = "_blank"), " and promote it among
your colleagues. RePEc handles are unique identifiers that are assigned to everything listed in RePEc, from authors to papers, journals and working paper
series. Especially in the case of authors they are a major advantage over bibliographic databases that only match by name. Duplicated names are very common
in a field as large as economics, so a network based on names would be heavily distorted. CollEc therefore constructs the co-authorship network
using RePEc handles. And for an author to be assigned a RePEc handle he or she must register with the RePEc Author Service. Each additional registered
economist with at least one co-authored paper fills in a missing link in CollEc."),
p("Other types of contribution are also welcome. Feel free to contact CollEc's current maintainer ", a(href = "http://www.christian-dueben.com", "Christian
Düben", target = "_blank"), " with your suggestions. Errors in the web application can be reported ",
a(href = "https://docs.google.com/forms/d/e/1FAIpQLSc6n-6FlzZx6YBorjlsSWpGm8PHbHAVxC9b9akcRyGVujLfQg/viewform?usp=sf_link", "anonymously", target = "_blank"),
" or via e-mail to the maintainer."),
h3("Citing CollEc"),
p(a(href = "http://www.christian-dueben.com", "Christian Düben", target = "_blank"), " is currently working on a CollEc-based paper which will be mentioned
here at some point."),
tags$footer(tags$small("CollEc was founded by", a(href = "http://openlib.org/home/krichel/", "Thomas Krichel"), "and is currently maintained by",
a(href = "http://www.christian-dueben.com", "Christian Düben", target = "_blank"), ". The server is sponsored by",
a(href = "https://www.symplectic.co.uk/", "Symplectic", target = "_blank"), ". Report errors using this ",
a(href = "https://docs.google.com/forms/d/e/1FAIpQLSc6n-6FlzZx6YBorjlsSWpGm8PHbHAVxC9b9akcRyGVujLfQg/viewform?usp=sf_link", "form", target = "_blank"),
". Latest data update: ", textOutput("current_date_text_dc", inline = TRUE)), ".",
img(src = "symplectic_logo.png", align = "right"), style = "position: static; left: 1; bottom: 0; width: 98%; text-align: left; margin:10px 0px")
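Two statements in the documentation above can be checked with a few lines of base R: the inverse and gravity edge weights, and the number of stored distance values. This is only a rough sketch. The function names and the graph sizes are hypothetical, and the exponential transition function is left out because its formula is not spelled out in the text.

```r
# Edge weights as a function of the number of joint papers
w_inverse <- function(n) 1 / n    # Inverse: inverse of the number of joint papers
w_gravity <- function(n) 1 / n^2  # Gravity: inverse of the squared number of joint papers

papers  <- 1:4
weights <- data.frame(papers,
                      inverse = w_inverse(papers),
                      gravity = w_gravity(papers))
print(weights)  # both functions return 1 for a single joint paper

# Number of stored distances: sum over all graphs of N_i * (N_i - 1) / 2
stored_distances <- function(sizes) sum(sizes * (sizes - 1) / 2)

sizes    <- c(5, 3, 2)               # hypothetical sizes of three unconnected graphs
n_full   <- sum(sizes)^2             # cells of the full N x N matrix
n_stored <- stored_distances(sizes)  # unordered pairs within the same graph
print(c(full = n_full, stored = n_stored))
```

With these made-up sizes, the full matrix has 100 cells while only 14 distances need to be stored, since symmetric duplicates, the zero diagonal and pairs in different graphs are dropped.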
Christian Düben
Research Associate
Chair of Macroeconomics
Hamburg University
Von-Melle-Park 5, Room 3102
20146 Hamburg
Germany
+49 40 42838 1898
christian.dueben at uni-hamburg.de
http://www.christian-dueben.com
-----Original Message-----
From: Thomas Krichel <krichel at openlib.org>
Sent: Freitag, 3. Juli 2020 19:06
To: Düben, Christian <Christian.Dueben at uni-hamburg.de>
Cc: CollEc Run <collec-run at lists.openlib.org>
Subject: Re: [CollEc] RePEc Visual
Thomas Krichel writes
> If that was not running in a container, I could help you with some
> Redirect statements that will give us nice meaningful URLs.
In the meantime, you could send me the doc page source code,
and I can work on that.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel