The basic problem: the way we assign GPU instances when creating
the "run list" is slightly different from the way we assign them
when we actually run the jobs;
the latter assigns a running job to the instance it's using,
but the former doesn't.
Solution (kludge): when building the run list,
don't reserve instances for currently running jobs.
This will result in more jobs in the run list, and avoid starvation.
For efficiency, do this only if there are exclusions for this type.
Comment: this is yet another complexity that would be eliminated
if GPU instances were modeled separately.
I wish I had time to do that.
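Roughly, the kludge amounts to something like this (a toy C++ sketch; the real client's types and functions differ):

    #include <vector>

    // Toy model of a runnable job (the real client uses RESULT/ACTIVE_TASK).
    struct Job {
        bool running;   // job currently holds a GPU instance
        int  usage;     // number of instances the job uses
    };

    // Build the run list for one GPU type.  If the type has exclusions,
    // running jobs don't reserve instances, so more jobs make the list
    // and projects limited to excluded instances aren't starved.
    std::vector<const Job*> build_run_list(
        const std::vector<Job>& jobs, int ninstances, bool have_exclusions
    ) {
        std::vector<const Job*> run_list;
        int free_inst = ninstances;
        for (const Job& j : jobs) {
            if (j.running && have_exclusions) {
                run_list.push_back(&j);   // don't reserve an instance for it
            } else if (free_inst > 0) {
                run_list.push_back(&j);
                free_inst -= j.usage;
            }
        }
        return run_list;
    }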
- client emulator: change default latency bound from 1 day to 10 days
This gives you a way to simulate the effects of app_config.xml
- client: piggyback requests for resources even if we're backed off from them
- client: change resource backoff logic
Old: if we requested work and didn't get any,
back off from resources for which we requested work
New: for each resource type T:
if we requested work for T and didn't get any, back off from T
Also, don't back off if we're already backed off
(i.e. if this is a piggyback request)
Also, only back off if the RPC was due to an automatic
and potentially rapid source
(namely: work fetch, result report, trickle up)
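In toy C++, the new per-resource rule is roughly this (enum and field names are illustrative, not the client's actual ones):

    enum RpcReason { RPC_WORK_FETCH, RPC_RESULT_REPORT, RPC_TRICKLE_UP, RPC_USER_REQUEST };

    struct ResourceState {
        double work_requested;       // seconds of work asked for in this RPC
        double work_received;        // seconds of work we got back
        bool   already_backed_off;   // backed off before this RPC (piggyback case)
    };

    // Decide, after a scheduler RPC, whether to back off from one resource type.
    bool should_back_off(const ResourceState& r, RpcReason reason) {
        // only automatic, potentially rapid RPC sources trigger backoff
        bool automatic = reason == RPC_WORK_FETCH
                      || reason == RPC_RESULT_REPORT
                      || reason == RPC_TRICKLE_UP;
        if (!automatic) return false;
        // if we were already backed off, this was a piggyback request; don't extend it
        if (r.already_backed_off) return false;
        // back off only from resources we asked work for and got none
        return r.work_requested > 0 && r.work_received == 0;
    }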
- client: fix small work fetch bug
- client: work fetch policy changes, from Jacob Klein.
The new policy is roughly as follows:
- find the highest-priority project P that is allowed
to fetch work for a resource below buf_min
- Ask P for work for all resources R below buf_max
for which it's allowed to fetch work,
unless there's a higher-priority project allowed
to request work for R.
If we're going to do an RPC to P for reasons other than work fetch,
the policy is:
- for each resource R for which P is the highest-priority project
allowed to fetch work, and R is below buf_max,
request work for R.
- scale amount of work request by
(# non-excluded instances)/#instances
- change policy (see the sketch below):
old: don't fetch work if #jobs > #non-excluded instances
new: don't fetch work if # of instance-seconds used in RR sim
> work_buf_min * (#non-excluded instances)/#instances
Note: this fixes a major problem (starvation)
with project-level GPU exclusion.
However, project-level GPU exclusion interferes with most of
the client's scheduling policies.
E.g., round-robin simulation doesn't take GPU exclusion into account,
and the resulting completion estimates and device shortfalls
can be wrong by an order of magnitude.
The only way I can see to fix this would be to model each
GPU instance as a separate resource,
and to associate each job with a particular GPU instance.
This would be a sweeping change in both client and server.
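A compact sketch of the exclusion-aware conditions above (toy C++; the formula is transcribed from the "new" rule, and names are made up):

    // Per-project, per-resource quantities used above.
    struct RscProject {
        int    ninstances;          // total instances of this GPU type
        int    non_excluded;        // instances not excluded for this project
        double sim_instance_secs;   // instance-seconds of queued work in the RR sim
    };

    // New rule: don't fetch if the RR sim already shows more queued
    // instance-seconds than the buffer target, scaled by the usable fraction.
    bool may_fetch(const RscProject& rp, double work_buf_min) {
        double usable_frac = double(rp.non_excluded) / rp.ninstances;
        return rp.sim_instance_secs <= work_buf_min * usable_frac;
    }

    // Scale the size of the work request by the same usable fraction.
    double scaled_request(double request_secs, const RscProject& rp) {
        return request_secs * rp.non_excluded / rp.ninstances;
    }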
initial work request to a project
- client: put some casts to double in NVIDIA detect code.
Shouldn't make any difference.
- volunteer storage: truncate file to right size after retrieval
svn path=/trunk/boinc/; revision=26051
allow it to fetch work of that type if the # of runnable
jobs is <= the # of non-excluded instances (rather than 0).
svn path=/trunk/boinc/; revision=26045
for a reason other than work fetch,
and we're deciding whether to piggyback a work request,
skip the checks for hysteresis (buffer < min)
and for per-resource backoff time.
These checks are there only to limit the rate of RPCs,
which is not relevant since we're doing one anyway.
This fixes a bug where a project w/ sporadic jobs specifies
a next_rpc_delay to ensure regular polling from clients.
When these polls occur they should request work regardless of backoff.
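The decision being described reduces to roughly this (toy C++; names are illustrative):

    // Decide whether to ask for work for one resource, given that a
    // scheduler RPC is happening anyway (piggyback) or purely for work fetch.
    struct RscStatus {
        double buffered;        // seconds of queued work for this resource
        double buf_min;         // hysteresis low-water mark (work_buf_min)
        double buf_max;         // high-water mark
        double backoff_until;   // per-resource backoff end time
    };

    bool request_work_for(const RscStatus& r, bool piggyback, double now) {
        if (piggyback) {
            // hysteresis and backoff exist only to limit the RPC rate;
            // we're doing an RPC anyway, so skip those checks
            return r.buffered < r.buf_max;
        }
        if (now < r.backoff_until) return false;   // still backed off
        return r.buffered < r.buf_min;             // normal hysteresis check
    }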
svn path=/trunk/boinc/; revision=26002
and there's a simple reason
(e.g. the project is suspended, no-new-tasks, downloads stalled, etc.),
show it in the event log.
If the reason is more complex, don't try to explain.
svn path=/trunk/boinc/; revision=25827
if we're making a scheduler RPC to a project for reasons
other than work fetch,
and we're deciding whether to ask for work, ignore hysteresis;
i.e. ask for work even if we're above the min buffer
(idea from John McLeod).
svn path=/trunk/boinc/; revision=25291
work fetch (e.g. to report completed jobs)
only request work if it's the project we would have chosen
if we were fetching work.
- client: the way in which project priorities were adjusted
in work fetch to reflect currently queued work was wrong.
- client: fix bug in the way project priorities are adjusted
in RR simulator
- client emulator: if there are results in the state file
with states DOWNLOADING or UPLOADING,
change them to DOWNLOADED or UPLOADED.
Otherwise they're stuck.
svn path=/trunk/boinc/; revision=24737
reduce its runtime from O(N^2) to O(N),
where N is the number of runnable jobs
(which can be in the thousands).
This will make the client emulator run a lot faster,
and will reduce the client CPU overhead a bit.
- API: change boinc_get_opencl_ids() so that it returns
a BOINC error code (< -100) if the app_init.xml is
missing or bad (i.e. we're running standalone),
and an OpenCL error code (> -100) if an OpenCL call failed.
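For example, an OpenCL app could distinguish the two ranges like this (a sketch; it assumes the two-argument overload of boinc_get_opencl_ids() from boinc_opencl.h):

    #include <cstdio>
    #include "boinc_api.h"
    #include "boinc_opencl.h"

    int main(int argc, char** argv) {
        boinc_init();
        cl_device_id device;
        cl_platform_id platform;
        int retval = boinc_get_opencl_ids(&device, &platform);
        if (retval) {
            if (retval < -100) {
                // BOINC error: app_init.xml missing or bad (e.g. running standalone)
                fprintf(stderr, "BOINC error %d getting OpenCL IDs\n", retval);
            } else {
                // OpenCL error code from a failed OpenCL call
                fprintf(stderr, "OpenCL error %d getting OpenCL IDs\n", retval);
            }
            boinc_finish(retval);
        }
        // ... create an OpenCL context and queue on (platform, device) ...
        boinc_finish(0);
        return 0;
    }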
svn path=/trunk/boinc/; revision=24469
so that they do what they're supposed to
(i.e. enforce resource shares)
- client: change log flag <debt_debug> to <priority_debug>
- client simulator: update REC even with large delta-t.
- client simulator: handle "no new work" apps correctly
svn path=/trunk/boinc/; revision=24429
by simulating time-slicing explicitly.
Also simulate changes in project REC
and hence in scheduling priority.
- client: add a log flag "rrsim_detail" that prints
time-slice-level info.
svn path=/trunk/boinc/; revision=24161
- client: cc_config.xml: if <devnum> is omitted from an <exclude_gpu> element,
it means exclude all instances of that GPU type
- client: if all instances of a GPU type are excluded for a project,
don't ask the project for jobs of that type
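A sketch of the resulting check (toy C++; the real client tracks exclusions per project and per GPU type):

    #include <set>

    // Exclusions of one GPU type for one project.
    // An <exclude_gpu> with no <devnum> is recorded as "all instances excluded".
    struct GpuExclusions {
        bool all_excluded = false;
        std::set<int> devnums;     // individually excluded device numbers
    };

    // Don't ask the project for jobs of this GPU type if every instance is excluded.
    bool may_request_gpu_work(const GpuExclusions& ex, int ninstances) {
        if (ex.all_excluded) return false;
        return (int)ex.devnums.size() < ninstances;
    }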
svn path=/trunk/boinc/; revision=23898
adjust project REC by the amount of work queued, to increase variety
NOTE: at some point I think I had a reason to not do this,
but I can't remember what it is.
- client, job scheduling policy: fix how project REC is adjusted
svn path=/trunk/boinc/; revision=23838
as non-high-priority
- client: don't print spurious "domino prevention"
and "thrashing prevention" msgs
- manager: show project descriptions in same size font
as the rest of the dialog
svn path=/trunk/boinc/; revision=23831
- new GPU types can be added easily
- users can specify GPUs in cc_config.xml,
referred to by app_info.xml,
and they will be scheduled by BOINC
and passed --device N options
Note: the parsing of cc_config.xml is not done yet.
- RPC protocols (account manager and scheduler)
can now specify GPU types in separate elements
rather than embedding them in tag names
e.g. <no_rsc>NVIDIA</no_rsc> rather than <no_cuda/>
- client: in account manager replies, parse elements of the form
<no_rsc>NAME</no_rsc>
indicating the GPUs of type NAME should not be used.
This allows account managers to control GPU types
not hardwired into the client.
Note: <no_cuda/> and <no_ati/> will continue to be supported.
- scheduler RPC reply: add
<no_rsc_apps>NAME</no_rsc_apps>
(NAME = GPU name)
to indicate that the project has no jobs for the indicated GPU type.
<no_cuda_apps> etc. are still supported
- client/lib: remove set_debts() GUI RPC
- client/scheduler RPC
remove <cuda_backoff> etc. (superseded by no_app)
Exception: <ip_result> elements in sched request
still have <ncudas> and <natis>.
Fix this later.
Implementation notes:
- client/lib: change "CUDA" to "NVIDIA" in type/variable names, and in XML
Continue to recognize "CUDA" for compatibility
- host_info.coprocs no longer used within the client;
use a global var (COPROCS coprocs) instead.
COPROCS now has an array of COPROCs;
GPU types are identified by the array index.
Index zero means CPU (see the sketch after this list).
- a bunch of other resource-specific structs (like RSC_WORK_FETCH)
are now stored in arrays, with same indices as COPROCS
(i.e. index 0 is CPU)
- COPROCS still has COPROC_NVIDIA and COPROC_ATI structs to hold vendor-specific info
- APP_VERSION now has a struct GPU_USAGE to describe its GPU usage
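The layout these notes describe is roughly as follows (a simplified C++ sketch; the real structs have many more fields):

    #include <string>
    #include <vector>

    // One processing resource type; identified by array index.
    struct COPROC {
        std::string type;     // e.g. "NVIDIA", "ATI"; unused for the CPU
        int count;            // number of instances
        double peak_flops;
    };

    struct COPROC_NVIDIA : COPROC { /* CUDA-specific fields */ };
    struct COPROC_ATI    : COPROC { /* CAL-specific fields */ };

    struct COPROCS {
        std::vector<COPROC> coprocs;   // [0] = CPU, [1..] = GPU types
        COPROC_NVIDIA nvidia;          // vendor-specific details, if present
        COPROC_ATI ati;
    };

    COPROCS coprocs;   // global, replacing host_info.coprocs inside the client

    // Other per-resource structs use the same indexing as COPROCS.
    struct RSC_WORK_FETCH { /* per-resource work-fetch state */ };
    std::vector<RSC_WORK_FETCH> rsc_work_fetch;   // [0] = CPU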
svn path=/trunk/boinc/; revision=23253
for RR sim's pending-job lists.
Erasing head of vector is slow.
- lib: allow GPU peak FLOPS to be specified in XML (for simulator)
- simulator work
- client: old work fetch policy: projects may need enough jobs
for all device instances, not just resource_share*ninst.
E.g. a project that has only CPU jobs in a CPU/GPU client
- client: with REC scheduling, don't ask for work for
secondary resources if project has negative priority.
- client: in RR sim, make sure we saturate devices if possible.
Otherwise we may report a shortfall incorrectly
svn path=/trunk/boinc/; revision=22894
since we now do round-robin for GPUs as well as CPU.
NOTE: this bug was found using the client simulator!
- client simulator: generate REC graph
svn path=/trunk/boinc/; revision=22746
recent estimated credit (REC) instead of debt.
These changes are enabled by
#define USE_REC
in work_fetch.h.
If this is commented out (the default) the client uses
debt-based scheduling, same as before.
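i.e. a compile-time switch along these lines (the surrounding code is made up for illustration):

    // work_fetch.h -- the switch itself:
    //#define USE_REC    // uncomment to enable REC-based scheduling

    // illustrative use elsewhere; higher return value = higher priority
    double scheduling_key(double rec, double debt) {
    #ifdef USE_REC
        return -rec;     // projects with less recent estimated credit get priority
    #else
        return debt;     // default: debt-based scheduling, same as before
    #endif
    }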
TODO: work-fetch policy changes
- client simulator: various fixes:
- compute idle and wasted fraction based on all processing resources,
not just CPU
- compute job completion times based on FLOPS, not CPU seconds
- compute and use project->no_X_apps
etc.
svn path=/trunk/boinc/; revision=22741
- scheduler: improve the deadline check mechanism slightly.
When updating "estimated delay" (a rough measure of how long
a resource is saturated with high-priority work)
take into account the # of instances used by the job,
and the # of total instances
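For example, weighting by instances might look like this (hypothetical names; the real computation is in the scheduler's deadline check):

    // Update a resource's "estimated delay" after committing a job to it.
    // Weight the job's runtime by the fraction of the resource it occupies,
    // rather than assuming it saturates every instance.
    double updated_estimated_delay(
        double estimated_delay,   // current saturation estimate, seconds
        double job_runtime,       // estimated runtime of the new job, seconds
        double instances_used,    // instances used by the job (may be fractional)
        int    total_instances    // total instances of the resource
    ) {
        return estimated_delay + job_runtime * instances_used / total_instances;
    }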
svn path=/trunk/boinc/; revision=22612
cmdline arg.
Suppresses the fetch of project list and of current client version #.
Use when running on grid nodes.
- debugging on client simulator. Not done yet.
svn path=/trunk/boinc/; revision=22414