We were waiting until there was no task for the project
before asking for another task.
We should have been waiting until there was no in-progress task.
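A minimal sketch of the corrected check, assuming a simple task list
(the client's actual task bookkeeping is richer than this):

    #include <vector>

    struct TASK {
        bool in_progress;    // still running or scheduled, as opposed to finished
    };

    // Old behavior: any task at all for the project blocked a new request.
    // Corrected behavior: only an in-progress task should block it.
    bool may_request_new_task(const std::vector<TASK>& project_tasks) {
        for (const TASK& t : project_tasks) {
            if (t.in_progress) return false;
        }
        return true;
    }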
A while back we added a mechanism intended to defer work-request RPCs
while file uploads are happening,
with the goal of reporting completed tasks sooner
and reducing the number of RPCs.
There were 2 bugs in this mechanism.
First, the check for whether an upload is active was flawed;
if several uploads were active and one finished,
it behaved as if all had finished.
Second, when WORK_FETCH::choose_project() picks a project,
it sets p->sched_rpc_pending to RPC_REASON_NEED_WORK.
If we then decide not to request work because an upload
is active, we need to clear this field.
Otherwise scheduler_rpc_poll() will do an RPC to it,
piggybacking a work request and bypassing the upload check.
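Roughly, the two fixes look like this (a standalone sketch; the client's
real PROJECT and FILE_XFER structures and RPC_REASON constants are more
involved than these stand-ins):

    #include <vector>

    struct FILE_XFER {
        bool is_upload;
        bool is_active;
    };

    struct PROJECT {
        int sched_rpc_pending;   // reason code for a pending scheduler RPC, 0 if none
    };

    const int RPC_REASON_NEED_WORK = 1;   // stand-in value for illustration

    // Fix 1: treat uploads as "in progress" while at least one upload is still
    // active, instead of reacting to a single completion as if all had finished.
    bool uploads_in_progress(const std::vector<FILE_XFER>& xfers) {
        for (const FILE_XFER& fx : xfers) {
            if (fx.is_upload && fx.is_active) return true;
        }
        return false;
    }

    // Fix 2: if we decide not to request work because an upload is active,
    // clear the pending-RPC reason so scheduler_rpc_poll() doesn't issue the
    // RPC anyway and piggyback a work request, bypassing the upload check.
    void defer_work_request_if_uploading(PROJECT* p, const std::vector<FILE_XFER>& xfers) {
        if (uploads_in_progress(xfers) && p->sched_rpc_pending == RPC_REASON_NEED_WORK) {
            p->sched_rpc_pending = 0;
        }
    }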
It's reported that the client can repeatedly make work request RPCs
that don't request work for any resource.
I'm not sure why this happens, but prevent it.
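One way to prevent it, sketched under the assumption that a work-request
RPC carries per-resource requested seconds (req_secs[] is a hypothetical
stand-in, not the client's actual field):

    // Caller skips the work-request RPC when this returns false.
    bool any_resource_requested(const double req_secs[], int n_resources) {
        for (int i = 0; i < n_resources; i++) {
            if (req_secs[i] > 0) return true;
        }
        return false;
    }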
This gives you a way to simulate the effects of app_config.xml
- client: piggyback requests for resources even if we're backed off from them
- client: change resource backoff logic
Old: if we requested work and didn't get any,
back off from resources for which we requested work
New: for each resource type T:
if we requested work for T and didn't get any, back off from T
Also, don't back off if we're already backed off
(i.e. if this is a piggyback request)
Also, only back off if the RPC was due to an automatic
and potentially rapid source
(namely: work fetch, result report, trickle up)
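A sketch of the new backoff logic; the types, field names, and reason codes
below are illustrative, not the client's actual interface:

    #include <cstddef>

    enum RPC_REASON {            // stand-in reason codes
        REASON_USER_REQUEST,
        REASON_WORK_FETCH,
        REASON_RESULT_REPORT,
        REASON_TRICKLE_UP
    };

    struct RSC_FETCH {
        double req_secs;             // seconds of work requested for this resource type
        double got_secs;             // seconds of work actually returned
        bool already_backed_off;     // was backed off before this (piggyback) request
        void inc_backoff() { /* double the backoff interval, up to a maximum */ }
    };

    // Back off only in response to automatic, potentially rapid request sources.
    static bool automatic_source(RPC_REASON r) {
        return r == REASON_WORK_FETCH || r == REASON_RESULT_REPORT || r == REASON_TRICKLE_UP;
    }

    // New policy: for each resource type T, back off from T only if we asked
    // for work for T and got none, and T wasn't already backed off.
    void handle_reply_backoff(RSC_FETCH rsc[], size_t n, RPC_REASON reason) {
        if (!automatic_source(reason)) return;
        for (size_t i = 0; i < n; i++) {
            if (rsc[i].req_secs > 0 && rsc[i].got_secs == 0 && !rsc[i].already_backed_off) {
                rsc[i].inc_backoff();
            }
        }
    }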
- client: fix small work fetch bug
by Jacob Klein.
The new policy is roughly as follows:
- find the highest-priority project P that is allowed
to fetch work for a resource below buf_min
- Ask P for work for all resources R below buf_max
for which it's allowed to fetch work,
unless there's a higher-priority project allowed
to request work for R.
If we're going to do an RPC to P for reasons other than work fetch,
the policy is:
- for each resource R for which P is the highest-priority project
allowed to fetch work, and R is below buf_max,
request work for R.
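A condensed sketch of this selection logic; the real code lives in
WORK_FETCH::choose_project() and related routines, and the types and
priority ordering below are simplified stand-ins:

    #include <vector>

    struct PROJECT {
        double priority;                 // higher means fetch from it first
    };

    struct RSC_STATE {
        double buffered;                 // instance-seconds of buffered work
        double buf_min, buf_max;         // hysteresis bounds
        std::vector<PROJECT*> fetchable; // projects allowed to fetch this resource
    };

    // Highest-priority project allowed to fetch a given resource, or null.
    static PROJECT* top_project(const RSC_STATE& r) {
        PROJECT* best = nullptr;
        for (PROJECT* p : r.fetchable) {
            if (!best || p->priority > best->priority) best = p;
        }
        return best;
    }

    // Pick the highest-priority project P that may fetch work for some resource
    // below buf_min; then request work from P for every resource below buf_max
    // for which no higher-priority project is allowed to fetch.
    void choose_work_request(std::vector<RSC_STATE>& rsc,
                             PROJECT*& chosen, std::vector<int>& resources) {
        chosen = nullptr;
        for (const RSC_STATE& r : rsc) {
            if (r.buffered >= r.buf_min) continue;
            PROJECT* p = top_project(r);
            if (p && (!chosen || p->priority > chosen->priority)) chosen = p;
        }
        if (!chosen) return;
        for (int i = 0; i < (int)rsc.size(); i++) {
            if (rsc[i].buffered >= rsc[i].buf_max) continue;
            if (top_project(rsc[i]) == chosen) resources.push_back(i);
        }
    }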
to use project's share of instances.
- client emulator: if client_state.xml doesn't have <no_rsc_apps>
for a project, and the project doesn't have apps for that resource,
the project can be asked for work for that resource.
- scale the amount of work requested by
(# non-excluded instances)/#instances
- change policy:
old: don't fetch work if #jobs > #non-excluded instances
new: don't fetch work if # of instance-seconds used in RR sim
> work_buf_min * (#non-excluded instances)/#instances
Note: this fixes a major problem (starvation)
with project-level GPU exclusion.
However, project-level GPU exclusion interferes with most of
the client's scheduling policies.
E.g., round-robin simulation doesn't take GPU exclusion into account,
and the resulting completion estimates and device shortfalls
can be wrong by an order of magnitude.
The only way I can see to fix this would be to model each
GPU instance as a separate resource,
and to associate each job with a particular GPU instance.
This would be a sweeping change in both client and server.
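Expressed as a sketch (field names like n_instances, n_excluded, and
sim_instance_secs are stand-ins, not the client's actual variables):

    struct EXCLUSION_INFO {
        int n_instances;            // total instances of this resource on the host
        int n_excluded;             // instances excluded for this project
        double sim_instance_secs;   // instance-seconds used in the round-robin simulation
    };

    // Scale the requested amount by the fraction of instances the project may use.
    double scaled_request(double shortfall_secs, const EXCLUSION_INFO& x) {
        double usable_frac = double(x.n_instances - x.n_excluded) / x.n_instances;
        return shortfall_secs * usable_frac;
    }

    // New fetch condition, mirroring the formula above: don't fetch if the
    // RR-sim usage already exceeds work_buf_min scaled by the usable fraction.
    bool should_fetch(double work_buf_min, const EXCLUSION_INFO& x) {
        double usable_frac = double(x.n_instances - x.n_excluded) / x.n_instances;
        return x.sim_instance_secs <= work_buf_min * usable_frac;
    }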
initial work request to a project
- client: put some casts to double in NVIDIA detect code.
Shouldn't make any difference.
- volunteer storage: truncate file to right size after retrieval
svn path=/trunk/boinc/; revision=26051
allow it to fetch work of that type if the # of runnable
jobs is <= the # of non-excluded instances (rather than 0).
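As a sketch, using hypothetical names for the two counts:

    // Relaxed condition: the project may still fetch work of this type while
    // its runnable jobs don't exceed the non-excluded instances
    // (previously, only when it had zero runnable jobs of this type).
    bool may_fetch_despite_exclusion(int n_runnable_jobs, int n_non_excluded_instances) {
        return n_runnable_jobs <= n_non_excluded_instances;
    }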
svn path=/trunk/boinc/; revision=26045
for a reason other than work fetch,
and we're deciding whether to piggyback a work request,
skip the checks for hysteresis (buffer < min)
and for per-resource backoff time.
These checks are there only to limit the rate of RPCs,
which is not relevant since we're doing one anyway.
This fixes a bug affecting projects with sporadic jobs that specify
a next_rpc_delay to ensure regular polling from clients:
when these polls occur, they should request work regardless of backoff.
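A sketch of the piggyback decision under these rules (names are illustrative):

    struct RSC_FETCH_STATE {
        double buffered;        // buffered instance-seconds for this resource
        double buf_min;         // hysteresis threshold
        double backoff_until;   // normally, don't ask again before this time
    };

    // Decide whether to request work for one resource as part of an RPC.
    // If the RPC is happening anyway for another reason (e.g. next_rpc_delay
    // polling), skip the hysteresis and backoff checks: they exist only to
    // limit the RPC rate, which is moot here.
    bool request_work_for_rsc(const RSC_FETCH_STATE& r, bool rpc_for_other_reason, double now) {
        if (rpc_for_other_reason) return true;
        return r.buffered < r.buf_min && now >= r.backoff_until;
    }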
svn path=/trunk/boinc/; revision=26002
and change types of mem-size fields from int to double.
These fields are size_t in NVIDIA's version of this;
however, cuDeviceGetAttribute() returns them as int,
so I don't see where this makes any difference.
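The pattern is roughly the following; the struct and attribute choices are
illustrative, and the client's COPROC_NVIDIA fields differ:

    #include <cuda.h>

    struct NV_MEM_SIZES {
        double totalConstMem;        // changed from int to double
        double sharedMemPerBlock;    // changed from int to double
    };

    // cuDeviceGetAttribute() always reports through an int*, so read into an
    // int and cast to double when storing.
    bool read_mem_sizes(CUdevice dev, NV_MEM_SIZES& m) {
        int v = 0;
        if (cuDeviceGetAttribute(&v, CU_DEVICE_ATTRIBUTE_TOTAL_CONSTANT_MEMORY, dev) != CUDA_SUCCESS)
            return false;
        m.totalConstMem = (double)v;
        if (cuDeviceGetAttribute(&v, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, dev) != CUDA_SUCCESS)
            return false;
        m.sharedMemPerBlock = (double)v;
        return true;
    }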
- client: fix bug in handling of <no_rsc_apps> element.
- scheduler: message tweaks.
Note: [foo] means that the message is enabled by <debug_foo>.
svn path=/trunk/boinc/; revision=25849
and there's a simple reason
(e.g. the project is suspended, no-new-tasks, downloads stalled, etc.),
show it in the event log.
If the reason is more complex, don't try to explain.
svn path=/trunk/boinc/; revision=25827
1) a network connection is available and
2) network communication is allowed and
3) CPU computation is allowed
- If an app version is marked as needs_network,
use the above fraction in estimating its rate of progress
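As a sketch (the fraction and field names below are stand-ins for whatever
the client actually tracks):

    struct APP_VERSION_EST {
        bool needs_network;
        double flops;   // raw estimated processing rate
    };

    // Scale the progress-rate estimate of a needs_network app version by the
    // fraction of time that conditions 1)-3) above hold simultaneously.
    double effective_rate(const APP_VERSION_EST& av, double available_frac) {
        return av.needs_network ? av.flops * available_frac : av.flops;
    }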
- replace "core client" with "client" in comments.
- scheduler: message tweaks
svn path=/trunk/boinc/; revision=25803
on each request.
- client: when showing how much work a scheduler request returned,
scale by availability (as is done to show the amount of the request)
- client: in account manager request, <not_started_dur> and
<in_progress_dur> are in wall time, not run time
(i.e. scale them by availability)
Note: there's some confusion in the code between runtime and wall time,
where in general wall time = runtime / availability.
New convention: let's use "runtime" for the former,
and "duration" for the latter.
svn path=/trunk/boinc/; revision=25597