[CollEc] Helos offline

Thomas Krichel krichel at openlib.org
Sun Jul 25 10:27:49 UTC 2021


  Düben, Christian writes
 
> At the beginning of June, I installed a script that records the
> times CollEc was accessed - no other variable, just the access
> time. When plotting the results aggregated by day, you can see that
> the number of daily app visits tends to fluctuate around 1,000 (see
> Subset.pdf). However, yesterday it surged to almost 30,000 (see
> Full_Period.pdf). Monit just notified me at 9:30 am today that the
> app was offline. So, I do not know whether that is related to the
> server issue. But tons of machines firing requests at port 80 on one
> day and the server becoming inaccessible on the next appears to be
> an odd coincidence.

  Well, if you just log the times, how can you claim it's a "ton of
  machines"? I did go through the apache log, and the surge does
  indeed appear to come from a bunch of servers belonging to Huawei's
  petalsearch crawler. The requests look legit, and I'm sure they use
  reasonable defaults. It's just that the shinyapp is slow.
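
  For what it's worth, a quick way to confirm where the surge comes
  from, assuming the access log is at /var/log/apache2/access.log
  (adjust the path as needed), is to tally the user-agent field of
  the combined log format:

# count requests per user agent in the apache combined log
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head

  If petalbot is indeed the culprit, its entries should dominate the
  top of that list.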

  Apache kept returning 503, but it kept logging, so it was still up.
  The odd thing is that we could not get through over ssh. Since that
  is our only route to the server, we are stuck and have to ask
  Cezar.

  There is a change from 502 to 503:

114.119.158.156 - - [24/Jul/2021:09:04:30 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pel60%22 HTTP/1.1" 502 646 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.136.243 - - [24/Jul/2021:09:04:39 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22ppa963%22 HTTP/1.1" 502 646 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.134.212 - - [24/Jul/2021:09:04:43 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pkr268%22 HTTP/1.1" 503 575 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.146.29 - - [24/Jul/2021:09:04:43 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pbe625%22 HTTP/1.1" 503 575 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

  at 9:04, so that's pretty consistent with what you note.
  The inaccessibility presumably has to do with helos running out of
  memory, but why did the oom killer not work? Well, it did run, but
  it was not enough. We have this in syslog:

root@helos /var/log # grep 'R invoked oom-killer' syslog.1
Jul 24 08:59:09 helos kernel: [14922235.506685] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 09:30:07 helos kernel: [14924093.497980] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 10:26:46 helos kernel: [14927492.848174] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 10:58:08 helos kernel: [14929347.932058] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 12:08:50 helos kernel: [14933616.461377] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 12:58:00 helos kernel: [14936548.248476] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 13:10:19 helos kernel: [14937294.624810] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 13:23:38 helos kernel: [14938104.947025] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 14:08:06 helos kernel: [14940762.579273] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 14:24:13 helos kernel: [14941739.313980] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 16:32:52 helos kernel: [14949437.368614] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 17:50:50 helos kernel: [14954122.341626] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0

  But seemingly these oom kills were not enough to keep ssh up.
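
  One mitigation, if the problem is the OOM killer taking out sshd
  itself rather than the whole box thrashing, would be to pin sshd's
  oom score down so the kernel picks R instead. A sketch, assuming
  the Debian unit name ssh.service, via a systemd drop-in:

# /etc/systemd/system/ssh.service.d/oom.conf
# make the OOM killer very unlikely to pick sshd
[Service]
OOMScoreAdjust=-1000

  followed by systemctl daemon-reload and systemctl restart ssh. If
  the machine is simply thrashing, this will not help, but it is
  cheap to try.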

  I suspect what could be done is a script that checks whether the
  uptime is greater than a day and, if so, greps for 'R invoked
  oom-killer' in syslog; if it is found, reboot. Run that every hour
  from cron. I've never written or run anything like that, but a
  rough sketch follows.
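
  A minimal sketch, with the syslog path, the one-day threshold and
  the cron schedule all assumptions to adjust:

#!/bin/sh
# oom-watchdog.sh: reboot helos if it has been up for more than a
# day and R has triggered the oom-killer. Intended to run hourly
# from cron, e.g. /etc/cron.d/oom-watchdog:
#   0 * * * * root /usr/local/sbin/oom-watchdog.sh

# first field of /proc/uptime is the uptime in seconds
up_seconds=$(cut -d. -f1 /proc/uptime)

# do nothing during the first day after a (re)boot
[ "$up_seconds" -gt 86400 ] || exit 0

# if R has invoked the oom-killer since the last log rotation, reboot
if grep -q 'R invoked oom-killer' /var/log/syslog; then
    logger "oom-watchdog: R invoked oom-killer seen, rebooting"
    /sbin/reboot
fi

  The uptime check keeps it from rebooting in a loop while old
  oom-killer lines are still sitting in syslog.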

  The easier thing is to disable petal via robots.txt.
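
  A minimal robots.txt for that, served from the document root and
  assuming PetalBot honours the standard directives as its webmaster
  page suggests, though it may take a day or two to pick up the
  change:

# keep PetalBot out of the shiny app
User-agent: PetalBot
Disallow: /app_direct/

  or Disallow: / if we want it gone entirely.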

  Your thoughts?

-- 

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel


