So I ran into this problem at work today with a runit-based service breaching its open files limit.
My first thought was to increase the system ulimit for nofile in /etc/security/limits.conf, so I raised it
from 30k to about 60k. But strangely, the service still kept dying.
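The change itself was nothing exotic - something along these lines in /etc/security/limits.conf (the domain and the exact values here are my reconstruction, not a copy of the file):
# /etc/security/limits.conf
# <domain>  <type>  <item>   <value>
*           soft    nofile   60000
*           hard    nofile   60000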
Since the service uses chpst, my first suspect was that chpst
was being overbearing and changing the system limits before starting the process. The service I was
running was Python-based, so I added this small snippet at the beginning of the program to see what limits
it was seeing:
import resource
logger.info("rlimit for nofile: %s", resource.getrlimit(resource.RLIMIT_NOFILE))
The logs showed:
[INFO] rlimit for nofile: (1024, 4096)
That was shocking. Neither of those values is configured anywhere on the system.
An earlier lsof of the process had shown me that it had about a
thousand open files before it crashed, so that explained why it was crashing -
it was breaching the soft limit (the first number in the tuple getrlimit returns; the second is the hard limit).
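As an aside, you don't necessarily have to modify the program to see this: on Linux you can read a running process's limits straight out of /proc (the PID below is made up):
# replace 12345 with the PID of the supervised process
cat /proc/12345/limits | grep -i 'open files'
# Max open files            1024                 4096                 files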
I looked at the documentation for chpst and found an option, -o, which changes
the open files limit. So I set that in the chpst invocation (exec chpst -o 60000 ...), and I got:
[INFO] rlimit for nofile: (4096, 4096)
It seems that -o only affects the soft limit, and it can’t push it past the hard
limit - presumably that’s why asking for 60000 got me 4096. I took the win and the
service recovered, but after the crisis passed, I kept digging. I was curious
where all these limits were coming from.
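That matches how soft and hard limits behave in general: an unprivileged process can raise its soft limit up to, but not beyond, its hard limit. You can see the same thing in a plain shell (the numbers below just mirror the ones from the logs):
ulimit -S -n         # current soft limit, e.g. 1024
ulimit -H -n         # current hard limit, e.g. 4096
ulimit -S -n 4096    # OK: soft limit raised up to the hard limit
ulimit -S -n 8192    # fails without CAP_SYS_RESOURCE: above the hard limit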
The chpst source
didn’t reveal any limits being imposed by default, and I couldn’t find any other call to
setrlimit in the rest of the runit sources either.
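For the record, the searching was nothing more sophisticated than something like this against an unpacked runit tarball (the path and version are just examples):
# look for setrlimit anywhere in the runit/chpst sources
grep -rn 'setrlimit' runit-2.1.2/src/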
On a hunch, I tried printing the limits before chpst is called, using this
run file:
#!/bin/bash
exec 2>&1
echo "Soft limits"
ulimit -S -a
echo "Hard limits"
ulimit -H -a
exec chpst ...
It got me:
open files (-n) 1024 # soft limit
open files (-n) 4096 # hard limit
The actual ulimit values for the root user were 60000 (soft), 60000 (hard).
That showed me that the limits were not being modified by
chpst, but probably by runsv or some other part of the runit
system. I could be wrong though because, like I said before, I couldn’t find a
call to setrlimit in the sources.
Curiously, printing the ulimit values in the run file showed me another odd
change from the system limits - the max procs limit (ulimit -u). It seems
that when the run file is executing, the soft limit for this setting is set to
the hard limit.
On the RHEL6 machine I was running this on, the root user's shell showed these
limits: (1024, 514975). In the run file, however, the equivalents were
(514975, 514975). This was so weird that I double-checked.
Why is runit being conservative with the max open files limit, but more
generous with the max procs limit?
I have no idea.
In the end, yes, you can work around the default limits of runit by using ulimit
in the run script before invoking chpst. But even if you never need to, I hope this
serves as a reminder of the limits runit places on the services that it
manages.
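For completeness, the workaround run file is just a variation on the one above - raise the limit with ulimit before exec'ing chpst. The 60000 here is a placeholder, and the chpst arguments are whatever your service already uses:
#!/bin/bash
exec 2>&1
# raise the nofile limit (both soft and hard, since run files normally execute as root)
ulimit -n 60000
exec chpst ...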