Round-robin simulation, among other things, creates a bitmap
"sim_excluded_instances" of instances that are idle because of CPU exclusions.
There was a problem in how this was computed;
in the situation where there are fewer jobs than GPU instances
it could fail to set any bits, so no work fetch would happen.
My solution is a bit of a kludge, but should work in most cases.
The long-term solution is to treat GPU instances separately,
eliminating the need for GPU exclusions.
- scale the amount of work requested by
(# non-excluded instances)/(# instances)
- change policy:
old: don't fetch work if #jobs > # non-excluded instances
new: don't fetch work if # of instance-seconds used in RR sim
> work_buf_min * (# non-excluded instances)/(# instances)
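A minimal sketch of the new check, with illustrative names
(these are not the actual client identifiers):

bool should_fetch_gpu_work(
    double inst_secs_used,   // instance-seconds used in RR simulation
    double work_buf_min,     // min work buffer, in seconds
    int ninstances,          // total instances of this GPU type
    int nnon_excluded        // instances not excluded for this project
) {
    // new policy: fetch unless simulated usage already covers the
    // buffer, scaled by the fraction of instances we can actually use
    double scaled_buf = work_buf_min * (double)nnon_excluded / ninstances;
    return inst_secs_used <= scaled_buf;
}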
The handling of GPU exclusions
(especially per-app exclusions) was incomplete and buggy.
Changes:
- make bitmaps of included instances per (app, resource type)
- in round-robin simulation, we keep track of used instances
(so that we know if there are instances that are idle
because of exclusions).
Do this based on app-level exclusions
(previously it was done based on project-wide exclusions,
which didn't include app-level exclusions).
- compute RSC_PROJECT_WORK_FETCH::non_excluded_instances
as the logical OR of the per-app masks.
I.e. if you exclude an instance for all apps separately,
it's the same as excluding it for the project as a whole.
(Note: this bitmap is used for only 1 purpose:
if we have idle instances, don't request work from a project
for which those instances are excluded.)
- define RSC_PROJECT_WORK_FETCH::ncoprocs_excluded as the # of
instances excluded for *any* app, not the # excluded for all apps.
This quantity is used in work fetch to make sure we don't
unboundedly fetch jobs that turn out not to have a GPU to run on
due to exclusions.
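A sketch of how the two quantities above could be computed from
per-app included-instance bitmaps (the names and the 64-bit mask
representation are assumptions):

#include <cstdint>

// OR of the per-app masks: an instance is excluded for the project
// only if it's excluded for every app.
uint64_t non_excluded_instances(const uint64_t* app_masks, int napps) {
    uint64_t m = 0;
    for (int i = 0; i < napps; i++) m |= app_masks[i];
    return m;
}

// # of instances excluded for *any* app.
int ncoprocs_excluded(const uint64_t* app_masks, int napps, int ninst) {
    int n = 0;
    for (int i = 0; i < ninst; i++) {
        for (int j = 0; j < napps; j++) {
            if (!(app_masks[j] & (1ull << i))) { n++; break; }
        }
    }
    return n;
}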
the binding of the get_state() RPC
- client: move client_start_time and previous_uptime
from CLIENT_STATE to TIME_STATS,
so that these are also visible in GUI RPC
- scheduler RPC: move uptime and previous_uptime
into <time_stats>
- client: condition an RR simulation message on <rrsim_detail>
- boinccmd: show TIME_STATS info in --get_state
Note: this fixes a major problem (starvation)
with project-level GPU exclusion.
However, project-level GPU exclusion interferes with most of
the client's scheduling policies.
E.g., round-robin simulation doesn't take GPU exclusion into account,
and the resulting completion estimates and device shortfalls
can be wrong by an order of magnitude.
The only way I can see to fix this would be to model each
GPU instance as a separate resource,
and to associate each job with a particular GPU instance.
This would be a sweeping change in both client and server.
- client: if a project has excluded GPUs of a given type,
allow it to fetch work of that type if the # of runnable
jobs is <= the # of non-excluded instances (rather than 0).
svn path=/trunk/boinc/; revision=26045
- client: keep track of the fraction of time when
1) a network connection is available and
2) network communication is allowed and
3) CPU computation is allowed
- If an app version is marked as needs_network,
use the above fraction in estimating its rate of progress
- replace "core client" with "client" in comments.
- scheduler: message tweaks
svn path=/trunk/boinc/; revision=25803
on each request.
- client: when showing how much work a scheduler request returned,
scale by availability (as is done to show the amount of the request)
- client: in account manager request, <not_started_dur> and
<in_progress_dur> are in wall time, not run time
(i.e. scale them by availability)
Note: there's some confusion in the code between runtime and wall time,
where in general wall time = runtime / availability.
New convention: let's use "runtime" for the former,
and "duration" for the latter.
svn path=/trunk/boinc/; revision=25597
- client: when doing a scheduler RPC for a reason other than
work fetch (e.g. to report completed jobs),
only request work if it's the project we would have chosen
if we were fetching work.
- client: the way in which project priorities were adjusted
in work fetch to reflect currently queued work was wrong.
- client: fix a bug in the way project priorities are adjusted
in the RR simulator
- client emulator: if there are results in the state file
with states DOWNLOADING or UPLOADING,
change them to DOWNLOADED or UPLOADED.
Otherwise they're stuck.
svn path=/trunk/boinc/; revision=24737
reduce its runtime from O(N^2) to O(N),
where N is the number of runnable jobs
(which can be in the thousands).
This will make the client emulator run a lot faster,
and will reduce the client CPU overhead a bit.
- API: change boinc_get_opencl_ids() so that it returns
a BOINC error code (< -100) if the app_init.xml is
missing or bad (i.e. we're running standalone),
and an OpenCL error code (> -100) if an OpenCL call failed.
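Caller-side handling under this convention might look like the
following sketch (boinc_get_opencl_ids() also has variants taking
argc/argv; check boinc_opencl.h for the exact signatures):

#include "boinc_opencl.h"

int setup_opencl() {
    cl_device_id device;
    cl_platform_id platform;
    int retval = boinc_get_opencl_ids(&device, &platform);
    if (retval < -100) {
        // BOINC error: init data missing or bad (running standalone)
    } else if (retval) {
        // OpenCL error code (cl_int), e.g. CL_DEVICE_NOT_FOUND
    }
    return retval;
}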
svn path=/trunk/boinc/; revision=24469
so that they do what they're supposed to
(i.e. enforce resource shares)
- client: change log flag <debt_debug> to <priority_debug>
- client simulator: update REC even with large delta-t.
- client simulator: handle "no new work" apps correctly
svn path=/trunk/boinc/; revision=24429
- client: if an app version can't be used because the GPUs it needs
are all excluded, mark it and all its results as "coproc missing"
so that they won't be looked at in scheduling logic.
svn path=/trunk/boinc/; revision=24317
- client: fix a problem where work fetch didn't work right
in the presence of multiple GPUs and <exclude_gpu> config options.
For example, suppose:
- you have 2 GPUs and 2 projects
- Project A is excluded from GPU 1
- you have lots of jobs for project A
Then the client won't try to fetch jobs from project B.
The problem had 2 parts:
a) round-robin simulation wasn't taking GPU exclusions into account.
In the above example, it would think that both GPUs had jobs.
I fixed this by computing the # of GPUs each project
is excluded from, and using this in the RR simulation.
b) Once this was done, I needed to make the client
request GPU jobs from project B rather than project A.
I did this with following policy:
If a project has excluded GPUs of a given type,
and has a runnable job of that type,
don't ask it for more work of that type.
Notes:
- the policy in b) is crude, and it means that work-buffer
preferences are ignored in some cases.
- neither a) nor b) takes into account app-level exclusions.
I could fix both of these with a lot of work,
but I'd rather move to a model in which dissimilar GPUs
are modeled as different resources,
which would remove the need for the <exclude_gpu> mechanism
in the first place.
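For reference, a sketch of policy b) (the struct and field names
are stand-ins for the client's project state):

struct PROJECT_INFO {
    int ncoprocs_excluded[8];   // per resource type
    bool has_runnable_job[8];   // per resource type
};

// don't ask a project for work of a GPU type if it excludes GPUs of
// that type and already has a runnable job of that type
bool skip_work_request(const PROJECT_INFO& p, int rsc_type) {
    return p.ncoprocs_excluded[rsc_type] > 0
        && p.has_runnable_job[rsc_type];
}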
- web: remove extraneous ) at end of button tooltips
svn path=/trunk/boinc/; revision=24312
- Fix build problems on Mac OS X using autotools
- Consistently use #if HAVE_X for platform checks
(see the example below),
rather than #ifdef HAVE_X or #if defined(HAVE_X)
- In Unix build, make lots of compiler checks standard
- Fix some compile warnings
From Matt Arsenault.
Note: there are now lots of compile warnings in clientgui/ on Unix,
mostly in wxWidgets code
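For reference, the preferred form of a platform check
(HAVE_SYS_TIME_H is just an example macro):

// #if works whether configure defines the macro to 0 or 1,
// or leaves it undefined (undefined macros evaluate to 0 in #if);
// #ifdef would treat a 0 definition as "present".
#if HAVE_SYS_TIME_H
#include <sys/time.h>
#endif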
svn path=/trunk/boinc/; revision=24303
- client: make round-robin simulation more accurate
by simulating time-slicing explicitly.
Also simulate changes in project REC
and hence in scheduling priority.
- client: add a log flag "rrsim_detail" that prints
time-slice-level info.
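To enable it, a cc_config.xml entry along these lines should work
(standard log-flag syntax):

<cc_config>
  <log_flags>
    <rrsim_detail>1</rrsim_detail>
  </log_flags>
</cc_config>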
svn path=/trunk/boinc/; revision=24161
- new GPU types can be added easily
- users can specify GPUs in cc_config.xml,
referred to by app_info.xml,
and they will be scheduled by BOINC
and passed --device N options
Note: the parsing of cc_config.xml is not done yet.
- RPC protocols (account manager and scheduler)
can now specify GPU types in separate elements
rather than embedding them in tag names
e.g. <no_rsc>NVIDIA</no_rsc> rather than <no_cuda/>
- client: in account manager replies, parse elements of the form
<no_rsc>NAME</no_rsc>
indicating the GPUs of type NAME should not be used.
This allows account managers to control GPU types
not hardwired into the client.
Note: <no_cuda/> and <no_ati/> will continue to be supported.
- scheduler RPC reply: add
<no_rsc_apps>NAME</no_rsc_apps>
(NAME = GPU name)
to indicate that the project has no jobs for the indicated GPU type.
<no_cuda_apps> etc. are still supported
- client/lib: remove set_debts() GUI RPC
- client/scheduler RPC
remove <cuda_backoff> etc. (superseded by <no_rsc_apps>)
Exception: <ip_result> elements in sched request
still have <ncudas> and <natis>.
Fix this later.
Implementation notes:
- client/lib: change "CUDA" to "NVIDIA" in type/variable names, and in XML.
Continue to recognize "CUDA" for compatibility
- host_info.coprocs no longer used within the client;
use a global var (COPROCS coprocs) instead.
COPROCS now has an array of COPROCs;
GPU types are identified by the array index.
Index zero means CPU.
- a bunch of other resource-specific structs (like RSC_WORK_FETCH)
are now stored in arrays, with same indices as COPROCS
(i.e. index 0 is CPU)
- COPROCS still has COPROC_NVIDIA and COPROC_ATI structs to hold vendor-specific info
- APP_VERSION now has a struct GPU_USAGE to describe its GPU usage
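A sketch of the indexing scheme (the array bound and fields are
illustrative; RSC_WORK_FETCH is named above):

const int MAX_RSC = 8;

struct RSC_WORK_FETCH {
    double shortfall;
    void rr_init() { shortfall = 0; }
};

// parallel arrays share COPROCS indexing: 0 = CPU, 1..n-1 = GPU types
RSC_WORK_FETCH rsc_work_fetch[MAX_RSC];
int n_rsc = 3;   // e.g. 0 = CPU, 1 = NVIDIA, 2 = ATI

void rr_init_all() {
    for (int i = 0; i < n_rsc; i++) rsc_work_fetch[i].rr_init();
}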
svn path=/trunk/boinc/; revision=23253
The problem arises when there are jobs of projects
with widely differing resource shares,
and results in an overestimation of saturated time.
Old: at the start of simulation, call WORK_FETCH::compute_shares()
to get resources of runnable projects.
Use these throughout the simulation.
Problem: suppose you have 2 runnable projects;
P1 has large RS, P2 has small RS.
P1's jobs finish quickly.
P2's jobs then are running alone,
but their FLOPS is scaled (incorrectly) by P2's small RS.
Solution: recompute relative CPU resource share within the
simulation loop,
and compute it over the projects that have active jobs
in the simulation.
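A sketch of the fix, with hypothetical field names:

#include <vector>

struct SIM_PROJECT {
    double resource_share;
    bool has_active_sim_jobs;
    double relative_share;   // used to scale job FLOPS in the sim
};

// called inside the simulation loop, not once up front
void recompute_shares(std::vector<SIM_PROJECT>& projects) {
    double total = 0;
    for (auto& p : projects)
        if (p.has_active_sim_jobs) total += p.resource_share;
    if (total == 0) return;
    for (auto& p : projects)
        if (p.has_active_sim_jobs)
            p.relative_share = p.resource_share / total;
    // in the example above: once P1's jobs finish, P2's share becomes 1.0
}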
svn path=/trunk/boinc/; revision=23162
Currently we do a reschedule any time a job checkpoints,
in case there's a job that has finished a time slice
but hasn't checkpointed yet.
Instead: flag such jobs, and trigger a reschedule
on checkpoint only for flagged jobs.
- client: fix instability in job scheduling that happens
if a job's estimated completion time in RR sim is close to its deadline.
It can alternate between making and missing deadline,
causing the scheduler to alternate rapidly between jobs.
Solution: if RR sim has marked a job as deadline miss
any time in the last (CPU scheduling period),
treat it as a deadline miss.
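A sketch of the hysteresis (names assumed):

struct JOB {
    double last_rr_sim_miss_time;   // set when RR sim flags a miss
};

// keep treating the job as a deadline miss for one CPU scheduling
// period after the last simulated miss, to avoid rapid alternation
bool treat_as_deadline_miss(const JOB& j, double now, double cpu_sched_period) {
    return now - j.last_rr_sim_miss_time < cpu_sched_period;
}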
svn path=/trunk/boinc/; revision=22928
- client: use a more efficient data structure
for RR sim's pending-job lists.
Erasing the head of a vector is slow.
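(Erasing the front of a std::vector shifts every remaining element,
so repeatedly popping the head is O(N^2) overall. One alternative,
sketched here without claiming it's the structure actually adopted,
is to track a head index:)

#include <vector>

template <typename T>
struct PENDING_LIST {
    std::vector<T> items;
    size_t head = 0;
    bool empty() const { return head >= items.size(); }
    T& pop_front() { return items[head++]; }   // O(1), no shifting
};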
- lib: allow GPU peak FLOPS to be specified in XML (for simulator)
- simulator work
- client: old work fetch policy: projects may need enough jobs
for all device instances, not just resource_share*ninst.
E.g. a project that has only CPU jobs in a CPU/GPU client
- client: with REC scheduling, don't ask for work for
secondary resources if project has negative priority.
- client: in RR sim, make sure we saturate devices if possible.
Otherwise we may report a shortfall incorrectly
svn path=/trunk/boinc/; revision=22894
The round-robin simulation wasn't handling multithread jobs correctly.
For example, given two 3-CPU jobs,
it would model running them together on a 4-CPU host.
This doesn't correspond with the CPU scheduler,
which runs only 1 at a time.
So the simulator would say that there are no idle CPUs
when in fact there are, and no new CPU jobs would be fetched.
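A sketch of the corrected accounting, mirroring the CPU scheduler's
one-at-a-time packing (simplified; names illustrative):

#include <vector>

// co-schedule jobs only while they fit; leftover CPUs count as idle
double idle_cpus(const std::vector<double>& job_ncpus, double ncpus) {
    double used = 0;
    for (double n : job_ncpus) {
        if (used + n > ncpus) break;   // a second 3-CPU job won't fit
        used += n;
    }
    return ncpus - used;   // two 3-CPU jobs, 4 CPUs: 4 - 3 = 1 idle
}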
svn path=/trunk/boinc/; revision=22801
Additions to request message:
<not_started_dur>X</not_started_dur>
<in_progress_dur>X</in_progress_dur>
The estimated remaining duration of unstarted
and in-progress tasks
Additions to reply message, within <project>, optional:
<suspend>0|1</suspend>
suspend or resume project (overrides local state)
<abort_not_started>0|1</abort_not_started>
if set, abort unstarted jobs
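Put together, an illustrative reply fragment using the new elements:

<project>
  <suspend>1</suspend>
  <abort_not_started>1</abort_not_started>
</project>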
svn path=/trunk/boinc/; revision=22698
- client: don't keep pointers to dynamically allocated
COPROC-derived objects; just have the objects themselves.
Dynamic allocation should be avoided at all costs.
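Schematically, the change is from heap-allocated pointers to
in-place objects (a sketch, not the actual declarations):

struct COPROC { /* ... */ };

// before: std::vector<COPROC*> plus new/delete for each entry
// after: the objects themselves live inside COPROCS
struct COPROCS {
    COPROC coprocs[8];   // stored by value; no dynamic allocation
    int n_rsc;
};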
svn path=/trunk/boinc/; revision=21564
skip those jobs in RR sim.
Otherwise we add stuff to uninitialized data structures,
and a crash can result.
- client: initialize the above data structures anyway
svn path=/trunk/boinc/; revision=20753
- client: if a project has zero resource share,
treat it as a "backup project":
fetch work from it only if there is an idle instance
and no other projects have work.
svn path=/trunk/boinc/; revision=20286
if job A is unstarted and EDF,
and there's a job B later in the list that is started,
has the same app version, and has the same arrival time,
move A after B.
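A sketch of the swap condition (field names assumed):

struct RESULT_INFO {
    bool started;
    bool edf;               // scheduled high-priority (EDF)
    int app_version_id;
    double arrival_time;
};

// prefer finishing started job B over starting an equivalent
// unstarted EDF job A
bool move_a_after_b(const RESULT_INFO& a, const RESULT_INFO& b) {
    return !a.started && a.edf
        && b.started
        && a.app_version_id == b.app_version_id
        && a.arrival_time == b.arrival_time;
}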
- client: remove the "temp_dcf" mechanism,
which had the same goal but didn't work.
- client: in computing overall debt for a project,
subtract a term that reflects pending work.
This should reduce repeated fetches from the same project.
- client simulator: tweaks
svn path=/trunk/boinc/; revision=20223
- a project overestimates job FLOP counts
- the client starts jobs in EDF mode
- as job progresses and fraction done increases,
its completion time estimate decreases until
it's no longer a deadline miss.
- job gets preempted by other job from that project;
you end up with lots of partly completed jobs.
Solution (I hope): if an app version has running jobs,
compute a "temp DCF" for the app version:
the min, over its jobs, of (dynamic estimate)/(static estimate).
Apply this scaling factor to completion time estimates
for unstarted jobs in RR simulation
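A sketch of the computation (hypothetical names; assumes static
estimates are positive):

#include <algorithm>
#include <vector>

// temp DCF for an app version: min over its running jobs of
// (dynamic completion estimate) / (static completion estimate)
double temp_dcf(const std::vector<double>& dynamic_est,
                const std::vector<double>& static_est) {
    double dcf = 1.0;
    for (size_t i = 0; i < dynamic_est.size(); i++) {
        dcf = std::min(dcf, dynamic_est[i] / static_est[i]);
    }
    return dcf;   // multiplied into unstarted jobs' estimates in RR sim
}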
- client: the estimation of remaining time of running jobs was wrong
(how did this bug survive so long?)
svn path=/trunk/boinc/; revision=20077
will have enough jobs to use its share of resource instances.
This avoids situations where e.g. on a 2-CPU system
a project has 75% resource share and 1 CPU job,
and its STD increases without bound.
Did a general cleanup of the logic for computing
work request sizes (seconds and instances).
svn path=/trunk/boinc/; revision=20036
Old: if a project has RR sim deadline misses,
select jobs to run high-priority on the basis of:
1) deadline (earliest first)
2) estimated time to completion (least first)
This ignores whether jobs missed their deadline in RR sim,
so it may choose to run a job that's actually in no
danger of missing its deadline over one that is.
New: choose only jobs that miss their deadline in RR sim
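A sketch of the new selection (names assumed):

#include <vector>

struct JOB {
    bool rr_sim_deadline_miss;   // flagged during RR simulation
    double deadline;
    double est_time_remaining;
};

// only jobs that actually missed their deadline in RR sim are
// candidates for high-priority (EDF) execution
std::vector<JOB*> edf_candidates(std::vector<JOB>& jobs) {
    std::vector<JOB*> out;
    for (auto& j : jobs) {
        if (j.rr_sim_deadline_miss) out.push_back(&j);
    }
    return out;   // then sort by deadline, ties by est_time_remaining
}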
svn path=/trunk/boinc/; revision=19826
start only enough jobs to fill CPUs per project,
not all the CPU jobs at once.
I'm not sure how much difference this makes,
but this is how it's supposed to work.
- client: if app_info.xml doesn't specify flops,
use an estimate that takes GPUs into account.
- client: if it's been more than 2 weeks since time stats update,
don't decay on_frac at all.
svn path=/trunk/boinc/; revision=19035