Closed
Labels: Sled Agent, bug
This morning I came across a gimlet that I could not log in to because the SSH server could not fork.
We're seeing a lot of fork failures from sled-agent too.
BRM42220014 # dtrace -n 'forksys:return{@[execname,arg1==-1,errno]=count()}'
dtrace: description 'forksys:return' matched 2 probes
^C
sled-agent 1 11 32
devfsadm 0 0 36
ksh93 0 0 36
tfportd 0 0
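The columns are execname, whether the fork failed, errno, and count, so sled-agent had 32 forks fail with errno 11 (EAGAIN). We're also seeing the misc fork failure counter (zone_ffmisc) increasing: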
> zone0::print ! grep zone_ff
zone_ffcap = 0
zone_ffnoproc = 0
zone_ffnomem = 0
zone_ffmisc = 0x4976
> zone0::print ! grep zone_ff
zone_ffcap = 0
zone_ffnoproc = 0
zone_ffnomem = 0
zone_ffmisc = 0x4979
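To make the two samples above easier to compare, here is a minimal sketch that turns a pair of hex counter values into a decimal delta. The function name `ff_delta` is hypothetical and only for illustration; the values are the ones mdb printed.

```shell
# Print how many misc fork failures occurred between two samples of
# zone_ffmisc. Arguments are the hex values exactly as mdb printed them.
ff_delta() {
    # $1 = earlier sample, $2 = later sample (e.g. 0x4976 and 0x4979)
    echo $(( $2 - $1 ))
}

ff_delta 0x4976 0x4979
```

For the two samples shown, that is 3 additional failures between the two reads.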
After a bit of tracing, we find that the failing function is contract_process_fork(), which is returning 0 (NULL) for sled-agent:
BRM42220014 # dtrace -n 'contract_process_fork:return/execname=="sled-agent"/{trace(arg1)}'
dtrace: description 'contract_process_fork:return' matched 1 probe
CPU ID FUNCTION:NAME
0 38886 contract_process_fork:return 0
52 38886 contract_process_fork:return 0
52 38886 contract_process_fork:return 0
52 38886 contract_process_fork:return 0
50 38886 contract_process_fork:return 0
111 38886 contract_process_fork:return 0
How many contracts does sled agent have?
BRM42220014 # ptree `pgrep sled-agent`
652 ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /
654 /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.
BRM42220014 # ctstat | awk '$5 == 652 || $5 == 654 { print $5 }' | sort | uniq -c
1 652
9964 654
That 9964 is suspiciously close to 10,000. What's the contract limit for sled-agent?
BRM42220014 # prctl -i process -n project.max-contracts `pgrep sled-agent`
process: 654: /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.
NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT
project.max-contracts
privileged 10.0K - deny -
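The count and the limit can be combined into a single headroom figure. The following is a rough sketch, assuming the default `ctstat` column layout seen above (HOLDER in the fifth column); the function name `contract_headroom` is made up for this example.

```shell
# Read default `ctstat` output on stdin and report how many contracts the
# given holder PID owns, plus the headroom left under the given limit.
contract_headroom() {
    holder=$1
    limit=$2
    awk -v holder="$holder" -v limit="$limit" '
        $5 == holder { n++ }
        END { printf "%d held, %d remaining\n", n, limit - n }
    '
}
```

It would be invoked as `ctstat | contract_headroom 654 10000`. With the numbers above (9964 contracts against a limit of 10,000), that leaves 36 contracts of headroom, and anything else in the project competes for the same limit.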
Picking one of them:
BRM42220014 # ctstat -av -i 10275
CTID ZONEID TYPE STATE HOLDER EVENTS QTIME NTIME
10275 0 process owned 654 0 - -
cookie: 0
informative event set: none
critical event set: none
fatal event set: hwerr
parameter set: pgrponly regent
member processes: none
inherited contracts: none
service fmri: svc:/oxide/sled-agent:default
service fmri ctid: 60
creator: sled-agent
aux:
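The problem here seems to be that sled-agent creates a new contract each time it runs a command inside a zone, but never abandons that contract once the child process has exited. The empty contracts accumulate until sled-agent reaches the project.max-contracts limit of 10,000, at which point its forks start failing.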
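To check how many of the held contracts are empty, and not just the one sampled, something like the following sketch could be used. It assumes the `ctstat -av` layout shown above (a numeric summary row followed by an indented `member processes:` line); the function name `empty_contracts` is hypothetical.

```shell
# Read `ctstat -av` output on stdin and print the ID of every contract
# that is held by the given PID and has no member processes.
empty_contracts() {
    awk -v holder="$1" '
        # A summary row starts with a numeric contract ID; HOLDER is field 5.
        $1 ~ /^[0-9]+$/ && NF >= 5 { ctid = $1; owner = $5; next }
        # In the verbose section, an empty contract lists no members.
        /member processes:[ \t]*none/ && owner == holder { print ctid }
    '
}
```

Usage would be along the lines of `ctstat -av | empty_contracts 654 | wc -l`. If nearly all of the contracts held by 654 turn out to be empty, that supports the theory that they are leaked after the child exits. This only diagnoses the leak; releasing the contracts is up to the holder, for example via libcontract's `ct_ctl_abandon()`.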