I'm trying to understand what's happening here:
I have a supervisor that keeps restarting one child in a cycle without ever triggering the MaxR/MaxT
mechanism. The child simply crashes slowly enough that the restart rate limit is never hit.
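For context, here is a minimal sketch of how such a supervisor might be configured. The module name gateway_sup_sketch and the values MaxR = 3 and MaxT = 10 are assumptions for illustration only; the restart type, shutdown timeout and child type are taken from the report further down, and the [channel] modules list is a guess.

```erlang
-module(gateway_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1, child_spec/2]).

start_link() ->
    supervisor:start_link({local, gateway_sup_sketch}, ?MODULE, []).

%% The supervisor gives up (terminates itself) only if more than MaxR
%% restarts happen within MaxT seconds. A child that needs several
%% seconds per failed attempt can stay below that limit forever.
init([]) ->
    MaxR = 3,   % assumed value, not from the original post
    MaxT = 10,  % assumed value in seconds, not from the original post
    {ok, {{one_for_one, MaxR, MaxT}, []}}.

%% Children are added dynamically, one per device found.
%% Args is the argument list passed to channel:start_link.
child_spec(Id, Args) ->
    {Id, {channel, start_link, Args}, transient, 10000, worker, [channel]}.
```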
There is also a second mechanism that uses supervisor:which_children/1,
delete_child/2 and start_child/2
to adapt the set of children to reality (it scans for USB devices and tries to keep one supervisor child per device found).
This would normally act as a safety net for the rate limitation, but strangely the mechanism that deletes and starts children does not seem to be called at all.
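A rough sketch of what such a reconciliation step could look like. The module child_sync_sketch, the function names and the SpecFun callback are hypothetical; only the supervisor calls are the real API. Note that every one of these calls is a synchronous request to the supervisor, so the whole step stalls if the supervisor is not processing its messages.

```erlang
-module(child_sync_sketch).
-export([sync/3, diff/2]).

%% Pure part: which child ids are missing and which are stale.
diff(WantedIds, RunningIds) ->
    {WantedIds -- RunningIds, RunningIds -- WantedIds}.

%% Sup       - supervisor name or pid
%% WantedIds - ids of the devices currently found (hypothetical input)
%% SpecFun   - fun(Id) -> ChildSpec, supplied by the caller (hypothetical)
sync(Sup, WantedIds, SpecFun) ->
    RunningIds = [Id || {Id, _Pid, _Type, _Mods}
                            <- supervisor:which_children(Sup)],
    {ToStart, ToStop} = diff(WantedIds, RunningIds),
    lists:foreach(
      fun(Id) ->
              %% delete_child/2 returns {error, running} for a live child,
              %% so the child is terminated first.
              _ = supervisor:terminate_child(Sup, Id),
              _ = supervisor:delete_child(Sup, Id)
      end, ToStop),
    lists:foreach(
      fun(Id) -> _ = supervisor:start_child(Sup, SpecFun(Id)) end,
      ToStart),
    ok.
```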
To find out what's going on, I called supervisor:which_children/1
from the shell, and the call just blocks and never returns.
Can it be that calls to the supervisor block while it is busy trying to restart a child?
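One way to check this without hanging the shell, as a sketch: erlang:process_info/2 does not need the target process to answer a message, and a blocking call can be wrapped in a helper process with a timeout. The module sup_probe_sketch and both function names are made up for this example.

```erlang
-module(sup_probe_sketch).
-export([peek/1, call_with_timeout/2]).

%% Look at a registered process from the outside. A supervisor stuck in a
%% restart loop would typically show a growing message_queue_len.
peek(Name) ->
    case whereis(Name) of
        undefined -> not_registered;
        Pid ->
            erlang:process_info(
              Pid, [current_function, message_queue_len, status])
    end.

%% Run Fun in a helper process and give up after Timeout milliseconds,
%% so the caller (for example the shell) is never blocked indefinitely.
call_with_timeout(Fun, Timeout) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Result} ->
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        erlang:demonitor(MRef, [flush]),
        exit(Pid, kill),
        timeout
    end.
```

From the shell this could be used as sup_probe_sketch:peek(gateway_sup) or sup_probe_sketch:call_with_timeout(fun() -> supervisor:which_children(gateway_sup) end, 2000).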
Addendum:
It looks like the crash happens during child start:
=SUPERVISOR REPORT==== 29-Mar-2011::21:36:20 ===
     Supervisor: {local,gateway_sup}
     Context:    start_error
     Reason:     {'EXIT',{timeout,{gen_server,call,[<0.155.0>,late_init]}}}
     Offender:   [{pid,<0.76.0>},
                  {name,gw_3_5},
                  {mfa,{channel,start_link,
                                [[{gateways,[{left,108},{right,103}]}],
                                 {3,5}]}},
                  {restart_type,transient},
                  {shutdown,10000},
                  {child_type,worker}]
The answer to the question, leaving the discussion aside, is:
When restarting a child that fails during startup, the supervisor loops inside its own process (internally it is a gen_server) and does not handle any API calls made to it.
This is especially bad if the supervisor's rate limitation is configured such that startup errors of the children never trigger it. In my example the startup is slow (especially on error).
So if the supervisor loops forever trying to restart a child, it is unreachable for any calls to it, which is usually bad.
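One possible way around this, as a sketch rather than a definitive fix: let start_link return quickly and do the slow initialization inside the child after it has started, so that a failure there is an ordinary crash of a running child and the supervisor gets back to its message loop between restarts. The module channel_sketch and the ready/1 function are hypothetical; the real channel module is not shown in the post.

```erlang
-module(channel_sketch).
-behaviour(gen_server).
-export([start_link/1, ready/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link(Args) ->
    gen_server:start_link(?MODULE, Args, []).

%% Hypothetical helper: has the late initialization finished?
ready(Pid) ->
    gen_server:call(Pid, ready).

%% init/1 only queues a message to itself and returns at once, so the
%% supervisor is not held up while the slow part runs.
init(Args) ->
    self() ! late_init,
    {ok, {initializing, Args}}.

handle_call(ready, _From, {Phase, _Args} = State) ->
    {reply, Phase =:= ready, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(late_init, {initializing, Args}) ->
    %% The slow device setup would go here (placeholder). A crash at this
    %% point happens after start_link has already returned.
    {noreply, {ready, Args}};
handle_info(_Other, State) ->
    {noreply, State}.

terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

The restarts still count towards MaxR/MaxT, so the limits need to be chosen with the real crash interval in mind.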