A while back I changed the job scheduling and work fetch policies to use
REC-based project priority.
The work fetch logic sorts the project list (in CLIENT_STATE::projects)
by descending priority.
This causes two problems:
- If you have a lot of projects, it's hard to find a particular one
in the event log, e.g. in work_fetch_debug output.
- In the manager's Statistics tab, the selected project can change
unexpectedly since we identify it by array index,
and the array order may change.
Solution: sort CLIENT_STATE::projects alphabetically (case-insensitive).
In WORK_FETCH, copy this array to a separate array,
which is then sorted by decreasing priority.
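A minimal sketch of the two orderings (simplified PROJECT shape; the
helper names are illustrative, not the client's actual code):

    #include <algorithm>
    #include <strings.h>   // strcasecmp (POSIX; _stricmp on Windows)
    #include <vector>

    struct PROJECT {
        char name[256];
        double sched_priority;
    };

    // Case-insensitive name order for the master list.
    static bool compare_name(const PROJECT* a, const PROJECT* b) {
        return strcasecmp(a->name, b->name) < 0;
    }

    // Descending priority for the work-fetch copy.
    static bool compare_priority(const PROJECT* a, const PROJECT* b) {
        return a->sched_priority > b->sched_priority;
    }

    // CLIENT_STATE::projects stays alphabetical (stable event-log output,
    // stable Statistics-tab indices); work fetch sorts a separate copy.
    void make_work_fetch_order(
        std::vector<PROJECT*>& projects,      // CLIENT_STATE::projects
        std::vector<PROJECT*>& by_priority    // WORK_FETCH's own array
    ) {
        std::sort(projects.begin(), projects.end(), compare_name);
        by_priority = projects;
        std::sort(by_priority.begin(), by_priority.end(), compare_priority);
    }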
In work fetch setup, we were computing rsc_project_reason
before doing the round-robin simulation.
It needs to be done after, because it uses the # of idle devices,
which is computed by the simulation.
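A compressed sketch of the corrected ordering (all names invented;
the real calls live in the client's work-fetch setup):

    struct RSC_STATE {
        double nidle_now;   // # of idle instances; set by the simulation
    };

    // Stub: round-robin simulation; as a side effect it fills in
    // nidle_now for each resource.
    static void rr_simulation(RSC_STATE rsc[], int n) {
        for (int i = 0; i < n; i++) rsc[i].nidle_now = 0;  // placeholder
    }

    // Stub: the per-(project, resource) fetch reason; it can only be
    // computed correctly once nidle_now is valid.
    static int rsc_project_reason(const RSC_STATE& r) {
        return r.nidle_now > 0 ? 0 : 1;   // placeholder logic
    }

    void work_fetch_setup(RSC_STATE rsc[], int n) {
        rr_simulation(rsc, n);                  // first: compute idle counts
        for (int i = 0; i < n; i++) {
            (void) rsc_project_reason(rsc[i]);  // then: reasons see them
        }
    }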
- Remove code that tries to keep track of available GPU RAM
and defer jobs that don't fit.
This never worked: it relied on project estimates of RAM usage,
and it has been replaced by having the app do a temporary exit
if an allocation fails.
- Move the logic for checking for deferred jobs from CPU scheduling
to work fetch.
- Rename rsc_defer_sched to has_deferred_job,
and move it from PROJECT to RSC_PROJECT_WORK_FETCH
- tweak work_fetch_debug output
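A sketch of where the renamed flag now lives (struct shape simplified;
only the identifiers named above are real):

    // Per-(project, resource type) work-fetch state.
    struct RSC_PROJECT_WORK_FETCH {
        bool has_deferred_job;   // was PROJECT::rsc_defer_sched: this
                                 // project has a deferred job for this
                                 // resource type
        // ... backoffs and other per-pair state
    };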
The logic for deciding whether to fetch work for a project
or a (project, resource type) pair
was scattered among several functions, with confusing names.
Consolidate this logic, and use consistent names.
We weren't copying the request fields from RSC_WORK_FETCH to COPROC.
Do this, and clean up the code a bit.
Note: the arrays that parallel the COPROCS::coprocs array
are a bit of a kludge; that stuff logically belongs in COPROC.
But it's specific to the client, so I can't put it there.
Maybe I could do something fancy with derived classes, not sure.
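A hedged sketch of both points: mirroring the request fields into
COPROC, and one purely speculative shape the derived-class idea could
take (the field names and the CPU-at-index-0 convention are
assumptions here):

    struct COPROC {
        char type[256];
        int count;
        double req_secs;        // seconds of work requested
        double req_instances;   // idle instances reported
    };

    struct RSC_WORK_FETCH {
        double req_secs;
        double req_instances;
    };

    // Copy each GPU type's request fields into the COPROC descriptor
    // that goes out in the scheduler request; rsc_work_fetch[0] is
    // assumed to be the CPU, so GPU i maps to rsc_work_fetch[i+1].
    void copy_requests(COPROC coprocs[], RSC_WORK_FETCH rwf[], int ngpus) {
        for (int i = 0; i < ngpus; i++) {
            coprocs[i].req_secs      = rwf[i + 1].req_secs;
            coprocs[i].req_instances = rwf[i + 1].req_instances;
        }
    }

    // The "something fancy": a client-side subclass carrying the
    // client-only state, so the parallel arrays disappear.
    struct CLIENT_COPROC : COPROC {
        bool has_deferred_job;
        // ... other client-specific per-resource state
    };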
The "static estimate" is wu.rsc_fpops_est/app_version.flops.
The problem is: what if the elapsed time exceeds this.
In this case we were returning elapsed time,
resulting in a "time remaining" of zero, which is bad.
Instead, use the same exponential model that we use to
estimate fraction done when it's not reported.
This has the advantages that:
- time remaining monotonically decreases
(though potentially at a very slow rate)
- the combo of fraction done, elapsed time, and time remaining
is consistent for apps that don't report fraction done
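A minimal sketch, assuming the exponential form
f(t) = 1 - exp(-t/T) with T the static estimate (the same model the
text says is used for unreported fraction done):

    #include <cmath>

    // T = wu.rsc_fpops_est / app_version.flops (the static estimate).
    double est_fraction_done(double elapsed, double T) {
        return 1 - exp(-elapsed / T);
    }

    // remaining = (1 - fraction_done) * T = T * exp(-elapsed/T).
    // For elapsed << T this is about T - elapsed (the static estimate);
    // it decreases monotonically and approaches, but never hits, zero,
    // so "time remaining" no longer collapses to 0 past the estimate.
    double est_time_remaining(double elapsed, double T) {
        return T * exp(-elapsed / T);
    }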
Scheduling: if a resource has exclusions, put all jobs in the run list;
otherwise we might fail to have a job for a GPU instance, and starve it.
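A tiny sketch of the rule (all names invented):

    #include <vector>

    struct JOB { /* ... */ };

    static bool job_needed_soon(JOB*) { return false; }  // stub pruning test

    // With exclusions present, skip pruning: an instance excluded for
    // other projects might otherwise be left with no eligible job.
    void build_run_list(
        const std::vector<JOB*>& jobs,
        bool rsc_has_exclusions,
        std::vector<JOB*>& run_list
    ) {
        for (JOB* j : jobs) {
            if (rsc_has_exclusions || job_needed_soon(j)) {
                run_list.push_back(j);
            }
        }
    }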
Work fetch: allow work fetch from zero-share projects if the resource
has instances that are idle because of GPU exclusion
My last commit did this using a new API call.
But this would require rebuilding apps any time you want to change it;
too much work.
So instead make it an attribute of apps,
which you can set via the admin web interface.
Corresponding changes to client.
Currently the duration estimate for a task is a combination of
- a static estimate, based on wu.rsc_fpops_est and the estimated FLOPS
- a dynamic estimate, based on fraction done (FD) and elapsed time
The weighting of the dynamic estimate is FD^2;
the assumption is that fraction done is imprecise and improves
toward the end of a task.
This isn't ideal for apps that can supply accurate FD.
Solution: add a new API function
boinc_fraction_done_exact().
This notifies the client that the FD is accurate,
and that it should use only the dynamic estimate.
(New clients will do this; old clients will use the FD as they currently do.)
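A sketch of the combined estimate; the FD^2 weighting and the name
boinc_fraction_done_exact() come from the text above, the rest is
illustrative:

    // Client side: combine the two estimates, unless the app declared
    // its fraction done exact, in which case use the dynamic one alone.
    double est_duration(
        double static_est,   // wu.rsc_fpops_est / estimated FLOPS
        double elapsed,      // elapsed time so far
        double fd,           // fraction done, 0 < fd <= 1
        bool fd_exact        // app called boinc_fraction_done_exact()
    ) {
        double dynamic_est = elapsed / fd;   // dynamic estimate
        if (fd_exact) return dynamic_est;
        double w = fd * fd;                  // FD trusted more near the end
        return w * dynamic_est + (1 - w) * static_est;
    }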
My commit of Feb 7 caused work fetch from a project P
to be deferred for up to 5 min if an upload to P is active,
even if some instances are idle.
This was to deal with a case where the idleness was caused
by a jobs-in-progress limit at P,
and work requests led to long backoffs.
However, this can cause instances to be idle unnecessarily.
I changed things so that, if instances are idle,
a work fetch can happen even during upload.
But only one such fetch will be done.
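A sketch of the relaxed rule (flag names invented):

    struct PROJECT {
        bool uploading;               // an upload to this project is active
        bool fetched_during_upload;   // the one allowed fetch happened
    };

    // Allow one work fetch during an upload if instances are idle;
    // otherwise keep the (up to 5 min) deferral.
    bool can_fetch_now(PROJECT& p, int nidle) {
        if (!p.uploading) return true;
        if (nidle > 0 && !p.fetched_during_upload) {
            p.fetched_during_upload = true;
            return true;
        }
        return false;
    }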
- gpu_active_frac is the fraction of time GPU use is allowed
while the client is running.
Previously the client reported it but we weren't storing it in the DB.
We may need it in the future for batch scheduling logic.
- fix a crashing bug in scheduler
- client: minor message tweak
We were waiting until there was no task for the project
before asking for another task.
We should have been waiting until there was no in-progress task.
A while back we added a mechanism intended to defer work-request RPCs
while file uploads are happening,
with the goal of reporting completed tasks sooner
and reducing the number of RPCs.
There were 2 bugs in this mechanism.
First, the decision of whether an upload is active was flawed:
if several uploads were active and one finished,
it would act as if all had finished.
Second, when WORK_FETCH::choose_project() picks a project,
it sets p->sched_rpc_pending to RPC_REASON_NEED_WORK.
If we then decide not to request work because an upload
is active, we need to clear this field.
Otherwise scheduler_rpc_poll() will do an RPC to it,
piggybacking a work request and bypassing the upload check.
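A sketch covering both fixes; sched_rpc_pending,
RPC_REASON_NEED_WORK, and scheduler_rpc_poll() are the names from the
text, the rest (including the upload counter) is simplified:

    enum { RPC_REASON_NONE = 0, RPC_REASON_NEED_WORK = 1 };

    struct PROJECT {
        int n_active_uploads;   // bug 1: count uploads; a single flag
                                // cleared by the first finished upload
                                // made it look like all had finished
        int sched_rpc_pending;
    };

    static bool upload_active(PROJECT* p) {
        return p->n_active_uploads > 0;
    }

    void maybe_defer_work_request(PROJECT* p) {
        // WORK_FETCH::choose_project() set this when it picked p:
        //     p->sched_rpc_pending = RPC_REASON_NEED_WORK;
        if (p->sched_rpc_pending == RPC_REASON_NEED_WORK && upload_active(p)) {
            // Bug 2: clear the field; otherwise scheduler_rpc_poll()
            // still does an RPC, piggybacking a work request and
            // bypassing the upload check.
            p->sched_rpc_pending = RPC_REASON_NONE;
        }
    }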
It's reported that the client can repeatedly make work request RPCs
that don't request work for any resource.
I'm not sure why this happens, but prevent it.
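A sketch of the guard (names invented): skip the RPC entirely if no
resource type ends up with a nonzero request:

    struct RSC_REQUEST {
        double req_secs;
        double req_instances;
    };

    // If nothing is requested for any resource type, don't do the RPC.
    bool request_is_empty(const RSC_REQUEST reqs[], int n) {
        for (int i = 0; i < n; i++) {
            if (reqs[i].req_secs > 0 || reqs[i].req_instances > 0) {
                return false;
            }
        }
        return true;
    }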
This gives you a way to simulate the effects of app_config.xml
- client: piggyback requests for resources even if we're backed off from them
- client: change resource backoff logic
Old: if we requested work and didn't get any,
back off from resources for which we requested work
New: for each resource type T:
if we requested work for T and didn't get any, back off from T
Also, don't back off if we're already backed off
(i.e. if this is a piggyback request)
Also, only back off if the RPC was due to an automatic
and potentially rapid source
(namely: work fetch, result report, trickle up)
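A sketch of the new rule (names invented; the real client keeps
per-resource backoff timers):

    enum RPC_SOURCE {
        SRC_WORK_FETCH, SRC_RESULT_REPORT, SRC_TRICKLE_UP, SRC_USER
    };

    struct RSC_FETCH_STATE {
        double req_secs;    // what we asked for, per resource type T
        double got_secs;    // what the reply contained for T
        bool backed_off;    // T already backed off (piggyback request)
    };

    static bool automatic_source(RPC_SOURCE s) {
        return s == SRC_WORK_FETCH || s == SRC_RESULT_REPORT
            || s == SRC_TRICKLE_UP;
    }

    void update_backoff(RSC_FETCH_STATE rsc[], int n, RPC_SOURCE src) {
        if (!automatic_source(src)) return;   // e.g. user-initiated: never
        for (int i = 0; i < n; i++) {
            // back off from T iff we asked for T, got nothing for T,
            // and weren't already backed off from T
            if (rsc[i].req_secs > 0 && rsc[i].got_secs == 0
                && !rsc[i].backed_off
            ) {
                rsc[i].backed_off = true;   // real code: bump a timer
            }
        }
    }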
- client: fix small work fetch bug
by Jacob Klein.
The new policy is roughly as follows:
- find the highest-priority project P that is allowed
to fetch work for a resource below buf_min
- Ask P for work for all resources R below buf_max
for which it's allowed to fetch work,
unless there's a higher-priority project allowed
to request work for R.
If we're going to do an RPC to P for reasons other than work fetch,
the policy is:
- for each resource R for which P is the highest-priority project
allowed to fetch work, and R is below buf_max,
request work for R.
to use project's share of instances.
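A pseudocode-level sketch of the policy above (every helper is an
invented stand-in):

    #include <vector>

    struct PROJECT { /* ... */ };
    const int NRSC = 3;   // e.g. CPU + two GPU types (assumption)

    static bool below_buf_min(int) { return true; }            // stubs
    static bool below_buf_max(int) { return true; }
    static bool allowed(PROJECT*, int) { return true; }
    static bool higher_prio_allowed(PROJECT*, int) { return false; }

    // Highest-priority project allowed to fetch work for some resource
    // below buf_min; 'projects' is sorted by descending priority.
    PROJECT* choose_project(std::vector<PROJECT*>& projects) {
        for (PROJECT* p : projects) {
            for (int r = 0; r < NRSC; r++) {
                if (below_buf_min(r) && allowed(p, r)) return p;
            }
        }
        return nullptr;
    }

    // Given the chosen (or RPC-bound) project P, request work for every
    // resource below buf_max that P may fetch for, unless some
    // higher-priority project is allowed to request work for it.
    void set_requests(PROJECT* p, double req_secs[NRSC]) {
        for (int r = 0; r < NRSC; r++) {
            bool ask = below_buf_max(r) && allowed(p, r)
                && !higher_prio_allowed(p, r);
            req_secs[r] = ask ? 1.0 : 0.0;   // placeholder amount
        }
    }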
- client emulator: if client_state.xml doesn't have <no_rsc_apps>
for a project, and the project doesn't have apps for that resource,
then the project can be asked for work for that resource.