[CollEc] Helos offline
Düben, Christian
Christian.Dueben at uni-hamburg.de
Sun Jul 25 11:47:09 UTC 2021
I guessed that it was a number of different machines because unless a machine deletes the cookies and cuts the web socket connection, it should only be counted once.
I do not know why R invoked the oom killer. The app should not use a lot of memory. When something spawns 30,000 instances of the app, including 30,000 Docker containers, within a short time frame, that might cross a line though. After updating, I will install a script that logs memory use.
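A memory-use logger along those lines could be as small as the sketch below. The cron schedule, log path, and script name are assumptions, not the actual setup on helos:

```shell
#!/bin/sh
# Print one timestamped memory snapshot; append it to a log from cron, e.g.
# */5 * * * * /usr/local/bin/log_memory.sh >> /var/log/memory_use.log
printf '%s ' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
# the "Mem:" line of `free -m` holds the RAM figures in MiB;
# column 7 is "available", which is more meaningful than "free"
free -m | awk '/^Mem:/ {printf "total=%sMiB used=%sMiB available=%sMiB\n", $2, $3, $7}'
```

Grepping the resulting log around the times of the oom-killer entries would show how fast memory fills up when the app is hammered.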
We could block petalsearch for a few days. I do not mind the app store/search engine browsing CollEc. However, the bot should send its requests within a few sessions, not distribute them over 30,000 separate instances.
If you upgrade to a newer Ubuntu version, I recommend the latest LTS version: 20.04.2. The intermediate releases, like 21.04, are only supported for a few months. If you want to go with Debian testing instead, I do not know which version to recommend.
I do not know whether it also works for the type of machine that CollEc runs on, but Hetzner's (shared) cloud servers can be rebuilt through the cloud console on the Hetzner website. If you want a clean install that wipes the disk, that might be the preferred option. I have upgraded a machine from Ubuntu 18.04 to 20.04 through the command line before. It went okay, but somehow it did not properly reconnect the OS to the package repositories hosted by Hetzner itself.
(ShinyProxy) Shiny apps are probably not as robust as full-fledged NodeJS apps. But they are what I can write at this point, and they grant access to the graph theoretical methods that CollEc's data relies on. And looking at my current workload, I am fairly certain that I will not rewrite the app in another language this year.
Christian Düben
Doctoral Candidate
Chair of Macroeconomics
Hamburg University
Germany
christian.dueben at uni-hamburg.de
http://www.christian-dueben.com
-----Original Message-----
From: Thomas Krichel <krichel at openlib.org>
Sent: Sunday, 25 July 2021 12:28
To: Düben, Christian <Christian.Dueben at uni-hamburg.de>
Cc: CollEc Run <collec-run at lists.openlib.org>; Cezar Lica <cezar at symplectic.co.uk>
Subject: Re: [CollEc] Helos offline
Düben, Christian writes
> At the beginning of June, I installed a script that records the times
> CollEc was accessed - no other variable, just the access time. When
> plotting the results aggregated by day, you can see that the number of
> daily app visits tends to fluctuate around 1,000 (see Subset.pdf).
> However, yesterday it surged to almost 30,000 (see Full_Period.pdf).
> Monit just notified me at 9:30 am today that the app was offline. So,
> I do not know whether that is related to the server issue. But tons of
> machines firing requests at port 80 on one day and the server becoming
> inaccessible on the next appears to be an odd coincidence.
Well, if you just log the times, how can you claim it's "tons of
machines"? I did go through the Apache log, and the surge does indeed
appear to come from a bunch of servers belonging to Huawei's
petalsearch. The requests look legit. I'm sure they use reasonable
defaults. It's just that the shiny app is slow.
Apache kept saying 503 but kept logging, so it was still up. The
odd thing is that we could not get through over ssh. Since that is
our only route to the server, we are stuck and have to ask Cezar.
There is a change from 502 to 503:
114.119.158.156 - - [24/Jul/2021:09:04:30 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pel60%22 HTTP/1.1" 502 646 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.136.243 - - [24/Jul/2021:09:04:39 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22ppa963%22 HTTP/1.1" 502 646 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.134.212 - - [24/Jul/2021:09:04:43 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pkr268%22 HTTP/1.1" 503 575 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.146.29 - - [24/Jul/2021:09:04:43 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pbe625%22 HTTP/1.1" 503 575 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
at 9:04, so that's pretty consistent with what you note.
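For the record, a quick way to see how many distinct PetalBot IPs were involved, and which were busiest. The access log path is a guess; adjust it to however Apache is set up on helos:

```shell
# Requests per client IP for PetalBot hits, busiest first
# (log path is an assumption)
grep -F 'PetalBot' /var/log/apache2/access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head

# Number of distinct PetalBot client IPs
grep -F 'PetalBot' /var/log/apache2/access.log \
  | awk '{print $1}' | sort -u | wc -l
```

That would settle whether the 30,000 hits really came from thousands of machines or from a smaller pool of crawlers cycling through addresses.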
The inaccessibility presumably has to do with helos
running out of memory, but why did the oom killer not help?
Well, it ran, but that was not enough. We have in syslog:
root at helos /var/log # grep 'R invoked oom-killer' syslog.1
Jul 24 08:59:09 helos kernel: [14922235.506685] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 09:30:07 helos kernel: [14924093.497980] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 10:26:46 helos kernel: [14927492.848174] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 10:58:08 helos kernel: [14929347.932058] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 12:08:50 helos kernel: [14933616.461377] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 12:58:00 helos kernel: [14936548.248476] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 13:10:19 helos kernel: [14937294.624810] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 13:23:38 helos kernel: [14938104.947025] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 14:08:06 helos kernel: [14940762.579273] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 14:24:13 helos kernel: [14941739.313980] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 16:32:52 helos kernel: [14949437.368614] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 17:50:50 helos kernel: [14954122.341626] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
But seemingly these oom kills were not enough to keep ssh up.
I suspect what could be done is a script that checks whether
the uptime is greater than a day; if so, grep for
'R invoked oom-killer' in syslog and, if found, reboot. Run
that every hour. I've never written or run anything like that.
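A sketch of that hourly check might look as follows. The paths, the one-day threshold, and the cron line are assumptions, and the reboot itself is left as an echo so the script is safe to dry-run; swap it for /sbin/reboot on helos:

```shell
#!/bin/sh
# Hypothetical cron entry: 0 * * * * /usr/local/bin/oom_watchdog.sh
SYSLOG="${SYSLOG:-/var/log/syslog}"

# seconds since boot: integer part of the first field of /proc/uptime
up=$(cut -d' ' -f1 /proc/uptime | cut -d. -f1)

if [ "$up" -gt 86400 ] && grep -q 'R invoked oom-killer' "$SYSLOG" 2>/dev/null; then
    echo "uptime > 1 day and OOM kills in syslog: would reboot"   # replace with /sbin/reboot
else
    echo "ok"
fi
```

One caveat with this approach: after the reboot, the old oom-killer lines are still in syslog, so once the uptime passes a day again the box reboots again unless the log has rotated in the meantime; restricting the grep to entries newer than the last boot would close that hole.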
The easier thing is to disable petal via hosts.txt.
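If the bot honours robots.txt, as Huawei's PetalBot page claims it does, a gentler alternative would be a rule at the web root keeping it out of the app path seen in the log excerpt:

```
# robots.txt at the web root
User-agent: PetalBot
Disallow: /app_direct/collec_app/
```

Crawlers typically take a day or so to pick this up, so it complements rather than replaces an immediate block.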
Your thoughts?
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel