<title>Scheduler RPC timing and retry policies</title>
<body bgcolor=ffffff>
<h2>Scheduler RPC timing and retry policies</h2>
<p>
Each scheduler RPC reports results, gets work, or both.
The client's <b>scheduler RPC policy</b> has several components:
when to make a scheduler RPC, which project to contact,
which scheduling server to use for that project,
how much work to ask for, and what to do if the RPC fails.
<p>
The scheduler RPC policy has the following goals:
<ul>
<li> Make as few scheduler RPCs as possible.
<li> Use random exponential backoff if a project's scheduling servers are down.
This avoids an RPC storm when the servers come back up.
<li> Eventually re-read a project's master URL file in case its set
of schedulers changes.
<li> Report results before or soon after their deadlines.
</ul>
<h3>Resource debt</h3>
<p>
The client maintains an exponentially-averaged sum of the CPU time
it has devoted to each project.
The constant EXP_DECAY_RATE determines
the decay rate (currently a factor of e every week).
<p>
Each project is assigned a <b>resource debt</b>, computed as
<p>
resource_debt = resource_share / exp_avg_cpu
<p>
where exp_avg_cpu is the exponentially-averaged CPU time described above.
Resource debt is a measure of how much work the client owes the
project, and in general the project with the greatest resource debt is
the one from which work should be requested.
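<p>
The averaging and the debt formula can be sketched as follows.
This is a minimal illustration, not the client's code: the
<code>Project</code> class, the method names, the value chosen for
EXP_DECAY_RATE (1 / one week in seconds), and the handling of a project
with no recorded CPU time are all assumptions made for the example.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600
# Assumed value: decay by a factor of e per week, as the text describes.
EXP_DECAY_RATE = 1.0 / SECONDS_PER_WEEK


class Project:
    """Hypothetical per-project bookkeeping for this sketch."""

    def __init__(self, name, resource_share):
        self.name = name
        self.resource_share = resource_share
        self.exp_avg_cpu = 0.0

    def add_cpu_time(self, cpu_seconds, elapsed_seconds):
        # Decay the old sum over the elapsed wall-clock time,
        # then add the newly used CPU time.
        self.exp_avg_cpu *= math.exp(-EXP_DECAY_RATE * elapsed_seconds)
        self.exp_avg_cpu += cpu_seconds

    def resource_debt(self):
        # Assumption: a project that has received no CPU time is owed the most.
        if self.exp_avg_cpu == 0.0:
            return math.inf
        return self.resource_share / self.exp_avg_cpu


def project_with_greatest_debt(projects):
    # The project from which work should, in general, be requested.
    return max(projects, key=lambda p: p.resource_debt())
```

<p>
With equal resource shares, the project that has recently received less
CPU time has the larger debt and is chosen first.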
<h3>Minimum RPC time</h3>
<p>
The client maintains a <b>minimum RPC time</b> for each project.
This is the earliest time at which a scheduling RPC should be done to
that project (if zero, an RPC can be done immediately).
The minimum RPC time can be set for various reasons:
<ul>
<li> Because of a request from the project, i.e. a
&lt;request_delay&gt; element in a scheduler reply message.
<li> Because RPCs to all of the project's schedulers have failed.
An exponential backoff policy is used.
<li> Because one of the project's computations has failed (the
application crashed, or a file upload or download failed).
An exponential backoff policy is used to prevent a cycle of rapid failures.
</ul>
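<p>
One way a random exponential backoff could compute the minimum RPC time
is sketched below. The bounds MIN_BACKOFF_SECONDS and MAX_BACKOFF_SECONDS,
the explicit <code>now</code> argument, and the uniform random choice are
assumptions for illustration; this document does not specify them.

```python
import random

# Hypothetical bounds; the actual values are not given in this document.
MIN_BACKOFF_SECONDS = 60
MAX_BACKOFF_SECONDS = 4 * 3600


def exponential_backoff(now, nrpc_failures, rng=random):
    """Return the earliest time at which the next RPC should be attempted."""
    if nrpc_failures <= 0:
        return 0  # zero means an RPC can be done immediately
    # The upper bound doubles with each failure, up to a fixed ceiling.
    # The exponent is capped so the intermediate value stays small.
    upper = min(MAX_BACKOFF_SECONDS,
                MIN_BACKOFF_SECONDS * 2 ** min(nrpc_failures, 30))
    # Randomizing the delay spreads clients out, so they do not all
    # retry at the same moment when the servers come back up.
    delay = rng.uniform(MIN_BACKOFF_SECONDS, upper)
    return now + delay
```

<p>
Passing a seeded <code>random.Random</code> instance as <code>rng</code>
makes the delays reproducible in tests.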
<h3>Scheduler RPC sessions</h3>
<p>
Communication with schedulers is organized into <b>sessions</b>,
each of which may involve many RPCs.
There are two types of sessions:
</p>
<ul>
<li> <b>Get-work</b> sessions, whose goal is to get a certain amount of work.
Results may be reported as a side effect.
<li> <b>Report-result</b> sessions, whose goal is to report results.
Work may be fetched as a side effect.
</ul>
<p>
The internal logic of scheduler sessions is encapsulated in the class
SCHEDULER_OP.
It is implemented as a state machine, but expressed as a sequential
process its logic might look like:
<pre>
get_work_session() {
    while estimated work &lt; high water mark
        P = project with greatest debt and min_rpc_time &lt; now
        for each scheduler URL of P
            attempt an RPC to that URL
            if no error break
        if some RPC succeeded
            P.nrpc_failures = 0
        else
            P.nrpc_failures++
            P.min_rpc_time = exponential_backoff(P.nrpc_failures)
            if P.nrpc_failures mod MASTER_FETCH_PERIOD = 0
                P.fetch_master_flag = true
    for each project P with P.fetch_master_flag set
        read and parse master file
        if error
            P.nrpc_failures++
            P.min_rpc_time = exponential_backoff(P.nrpc_failures)
        if got any new scheduler urls
            P.nrpc_failures = 0
            P.min_rpc_time = 0
}

report_result_session(project P) {
    for each scheduler URL of P
        attempt an RPC to that URL
        if no error break
    if some RPC succeeded
        P.nrpc_failures = 0
    else
        P.nrpc_failures++
        P.min_rpc_time = exponential_backoff(P.nrpc_failures)
}
</pre>
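<p>
The per-project step of a get-work session (try each scheduler URL, then
update the failure count, the minimum RPC time, and the master-fetch flag)
could be written roughly as follows. This is a sketch only: the
<code>Project</code> class, the <code>do_rpc</code> and <code>backoff</code>
callables, and the value of MASTER_FETCH_PERIOD are hypothetical stand-ins,
not the SCHEDULER_OP implementation.

```python
MASTER_FETCH_PERIOD = 10  # hypothetical value, not taken from this document


class Project:
    """Hypothetical per-project RPC state for this sketch."""

    def __init__(self, scheduler_urls):
        self.scheduler_urls = list(scheduler_urls)
        self.nrpc_failures = 0
        self.min_rpc_time = 0
        self.fetch_master_flag = False


def try_schedulers(project, do_rpc, backoff, now):
    """Try each scheduler URL in turn. Return True if some RPC succeeded.

    do_rpc(url) returns True on success; backoff(now, nfailures) returns
    the new minimum RPC time. Both are supplied by the caller.
    """
    for url in project.scheduler_urls:
        if do_rpc(url):
            project.nrpc_failures = 0
            return True
    # Every scheduler failed: back off, and periodically ask for the
    # master file to be re-read in case the set of schedulers changed.
    project.nrpc_failures += 1
    project.min_rpc_time = backoff(now, project.nrpc_failures)
    if project.nrpc_failures % MASTER_FETCH_PERIOD == 0:
        project.fetch_master_flag = True
    return False
```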
<p>
The logic for initiating scheduler sessions is expressed in the
following poll function:
<pre>
if a scheduler RPC session is not active
    if estimated work is less than low-water mark
        start a get-work session
    else if some project P has overdue results
        start a report-result session for P;
        if P is the project with greatest resource debt,
            the RPC request should ask for enough work to bring us up
            to the high-water mark
</pre>
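<p>
The decision made by the poll function can be sketched as a pure function
that returns what to start, which keeps it easy to test. The function name,
its parameters, and the tuple it returns are assumptions for this example;
work is measured in arbitrary units.

```python
def poll(session_active, estimated_work, low_water, high_water,
         overdue_project, greatest_debt_project):
    """Return (session_kind, project, work_request), or None to do nothing.

    overdue_project is some project with overdue results, or None.
    """
    if session_active:
        return None
    if estimated_work < low_water:
        # A get-work session chooses its own projects by resource debt.
        return ("get_work", None, high_water - estimated_work)
    if overdue_project is not None:
        work_request = 0
        if overdue_project == greatest_debt_project:
            # Ask for enough work to reach the high-water mark.
            work_request = max(0, high_water - estimated_work)
        return ("report_result", overdue_project, work_request)
    return None
```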