sled-agent leaks contracts when executing commands in an NGZ #3753

@citrus-it

I came across a gimlet in a state this morning where I was unable to log in because the SSH server could not fork.

We're seeing a lot of fork failures from sled-agent too.

BRM42220014 # dtrace -n 'forksys:return{@[execname,arg1==-1,errno]=count()}'
dtrace: description 'forksys:return' matched 2 probes
^C

  sled-agent                                                1       11               32
  devfsadm                                                  0        0               36
  ksh93                                                     0        0               36
  tfportd                                                   0        0

The miscellaneous fork failure counter for the global zone (zone_ffmisc) is also increasing between samples:

> zone0::print ! grep zone_ff
    zone_ffcap = 0
    zone_ffnoproc = 0
    zone_ffnomem = 0
    zone_ffmisc = 0x4976
> zone0::print ! grep zone_ff
    zone_ffcap = 0
    zone_ffnoproc = 0
    zone_ffnomem = 0
    zone_ffmisc = 0x4979
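As a quick check on the two samples above, the hex values can be subtracted directly in the shell. This is only a sketch: it assumes a shell whose arithmetic expansion accepts hex constants (bash or ksh93, for example), and `counter_delta` is a hypothetical helper name, not an existing tool.

```shell
# Print the difference between two counter samples.
# Arguments may be decimal or 0x-prefixed hex, as printed by mdb.
counter_delta() {
    echo $(( $2 - $1 ))
}

# The two zone_ffmisc samples from this issue.
counter_delta 0x4976 0x4979   # prints 3
```

So three more miscellaneous fork failures were recorded between the two samples.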

After a bit of tracing, we find that the failing function is contract_process_fork(), which is returning 0 (NULL) for sled-agent:

BRM42220014 # dtrace -n 'contract_process_fork:return/execname=="sled-agent"/{trace(arg1)}'
dtrace: description 'contract_process_fork:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  38886     contract_process_fork:return                 0
 52  38886     contract_process_fork:return                 0
 52  38886     contract_process_fork:return                 0
 52  38886     contract_process_fork:return                 0
 50  38886     contract_process_fork:return                 0
111  38886     contract_process_fork:return                 0

How many contracts does sled-agent hold?

BRM42220014 # ptree `pgrep sled-agent`
652    ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /
  654    /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.

BRM42220014 # ctstat | awk '$5 == 652 || $5 == 654 { print $5 }' | sort | uniq -c
   1 652
9964 654
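The same count can be taken for every holder at once, which helps when it is not yet known which process is leaking. The following is a minimal sketch that assumes the default ctstat column layout shown in this issue (CTID ZONEID TYPE STATE HOLDER EVENTS QTIME NTIME); `count_contracts_by_holder` is a hypothetical helper name.

```shell
# Read ctstat-style rows on stdin and print "<count> <holder>" for each
# holder of process contracts, largest count first.
# Assumes column 3 is TYPE and column 5 is HOLDER, as in the output above.
count_contracts_by_holder() {
    awk '$1 != "CTID" && $3 == "process" { n[$5]++ }
         END { for (h in n) print n[h], h }' | sort -rn
}
```

On an affected system this could be run as `ctstat | count_contracts_by_holder | head`, which should put the leaking process at the top of the list.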

That 9964 is suspiciously close to 10,000. What's the contract limit for sled-agent?

BRM42220014 # prctl -i process -n project.max-contracts `pgrep sled-agent`
process: 654: /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
project.max-contracts
        privileged      10.0K       -   deny                                 -

Picking one of them:

BRM42220014 # ctstat -av -i 10275
CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME
10275   0       process owned   654     0       -       -
	cookie:                0
	informative event set: none
	critical event set:    none
	fatal event set:       hwerr
	parameter set:         pgrponly regent
	member processes:      none
	inherited contracts:   none
	service fmri:          svc:/oxide/sled-agent:default
	service fmri ctid:     60
	creator:               sled-agent
	aux:

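The problem appears to be that sled-agent creates a new process contract each time it runs a command inside a zone, but keeps holding that contract after the child process has exited. Contract 10275 above shows the pattern: it has no member processes, yet it is still owned by sled-agent (holder 654). These leftover contracts accumulate toward the project.max-contracts limit of 10,000, at which point fork starts failing.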
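To see how many of the contracts fit this pattern, the verbose ctstat output can be filtered for contracts that a given process holds but that have no members. This is a sketch under the assumption that `ctstat -av` prints each contract in the layout shown above (a summary row followed by a "member processes:" detail line); `list_empty_contracts` is a hypothetical helper name.

```shell
# Read `ctstat -av`-style output on stdin and print the CTID of every
# process contract held by the given PID that has no member processes.
list_empty_contracts() {
    holder="$1"
    awk -v holder="$holder" '
        # Summary row: remember the contract ID and its holder.
        $1 ~ /^[0-9]+$/ && $3 == "process" { ctid = $1; owner = $5 }
        # Detail row: report the contract if it is empty and ours.
        /member processes:/ && $NF == "none" && owner == holder { print ctid }
    '
}
```

For the system in this report, `ctstat -av | list_empty_contracts 654 | wc -l` would give the number of empty contracts still held by sled-agent.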

Labels: Sled Agent (Related to the Per-Sled Configuration and Management), bug (Something that isn't working)
