[CollEc] Helos offline
Thomas Krichel
krichel at openlib.org
Sun Jul 25 10:27:49 UTC 2021
Düben, Christian writes
> At the beginning of June, I installed a script that records the
> times CollEc was accessed - no other variable, just the access
> time. When plotting the results aggregated by day, you can see that
> the number of daily app visits tends to fluctuate around 1,000 (see
> Subset.pdf). However, yesterday it surged to almost 30,000 (see
> Full_Period.pdf). Monit just notified me at 9:30 am today that the
> app was offline. So, I do not know whether that is related to the
> server issue. But tons of machines firing requests at port 80 on one
> day and the server becoming inaccessible on the next appears to be
> an odd coincidence.
Well, if you just log the times, how can you claim it's a "ton of
machines"? I did go through the apache log, and the surge does indeed
appear to come from a bunch of servers at Huawei's petalsearch.
The requests look legit. I'm sure they use reasonable defaults. It's
just that the shiny app is slow.
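Something like this over the access log (the path is a guess, it may
be different on helos) counts the PetalBot hits per day:

  grep 'PetalBot' /var/log/apache2/access.log \
    | cut -d'[' -f2 | cut -d: -f1 | sort | uniq -c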
Apache kept answering 503 but kept logging, so it was still up. The
odd thing is that we could not get through over ssh. Since that is
our only route to the server, we are stuck and have to ask Cezar
for help.
There is a change from 502 to 503:
114.119.158.156 - - [24/Jul/2021:09:04:30 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pel60%22 HTTP/1.1" 502 646 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.136.243 - - [24/Jul/2021:09:04:39 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22ppa963%22 HTTP/1.1" 502 646 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.134.212 - - [24/Jul/2021:09:04:43 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pkr268%22 HTTP/1.1" 503 575 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.146.29 - - [24/Jul/2021:09:04:43 +0200] "GET /app_direct/collec_app/?_inputs_&navbars=%22tab_Coauthors%22&_values_&g_author=%22pbe625%22 HTTP/1.1" 503 575 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
That's at 9:04, so it's pretty consistent with what you note.
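To see where the codes flip, one can tally status codes per hour,
assuming the combined log format above (again, the path is a guess):

  grep '24/Jul/2021' /var/log/apache2/access.log \
    | awk '{split($4, t, ":"); print t[2], $9}' | sort | uniq -c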
The non-accessibility presumably has to do with helos running out of
memory, but why did the oom killer not work? Well, it ran, but that
was not enough. We have in syslog
root at helos /var/log # grep 'R invoked oom-killer' syslog.1
Jul 24 08:59:09 helos kernel: [14922235.506685] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 09:30:07 helos kernel: [14924093.497980] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 10:26:46 helos kernel: [14927492.848174] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 10:58:08 helos kernel: [14929347.932058] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 12:08:50 helos kernel: [14933616.461377] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 12:58:00 helos kernel: [14936548.248476] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 13:10:19 helos kernel: [14937294.624810] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 13:23:38 helos kernel: [14938104.947025] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 14:08:06 helos kernel: [14940762.579273] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 14:24:13 helos kernel: [14941739.313980] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 16:32:52 helos kernel: [14949437.368614] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jul 24 17:50:50 helos kernel: [14954122.341626] R invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
But seemingly these oom kills were not enough to keep ssh up.
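To see what the kills actually freed, one can also grep for the
kernel's kill messages (the exact wording varies by kernel version):

  grep -E 'Out of memory|Killed process' /var/log/syslog.1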
I suspect what could be done is a script that checks whether the
uptime is greater than a day; in that case, grep for
'R invoked oom-killer' in syslog and, if found, reboot. Run that
every hour. I've never written / run anything like that, but a
rough sketch is below.
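Untested, and the script name, log path and crontab entry are just
guesses:

  #!/bin/sh
  # only act if helos has been up for over a day; then check syslog
  # for R hitting the oom-killer and reboot if it has
  up_days=$(awk '{print int($1 / 86400)}' /proc/uptime)
  if [ "$up_days" -ge 1 ] && grep -q 'R invoked oom-killer' /var/log/syslog
  then
      logger "oom watchdog: R invoked oom-killer, rebooting"
      /sbin/reboot
  fi

run hourly from root's crontab, say

  0 * * * * /usr/local/sbin/oom_reboot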
The easier thing is to disable petal via hosts.txt.
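(Or, if a robots.txt block is enough -- PetalBot claims to honour
robots.txt -- something like

  User-agent: PetalBot
  Disallow: /

at the document root would do it.)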
Your thoughts?
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel