Old: it's based entirely on CPU time.
So a GPU project, whose app uses only a fraction
of a CPU, accrues positive debt.
This is OK if the project has only GPU apps,
since STD is not (currently) used for GPU scheduling.
But some projects have both CPU and GPU apps.
New: STD is based on total processing.
It has terms for each resource type.
The notion of "runnable resource share" is specific to a type.
Note: the notion of "resource share fraction" appears in
a couple of other places:
- it's passed to apps in app_init_data.xml
- it's passed in scheduler requests.
It should be broken down by resource type in these cases too.
Note to self: do this later.
svn path=/trunk/boinc/; revision=19762
don't accumulate debt for that resource.
Otherwise we'll accumulate debt forever,
pushing other projects into overworked state.
svn path=/trunk/boinc/; revision=19547
(estimated throughput of all GPUs)/(estimated throughput of all CPUs)
rather than the ratio of 1 GPU to 1 CPU.
This change will hopefully cause ratios of granted credit
to more closely match resource shares.
svn path=/trunk/boinc/; revision=19311
Make them both peak FLOPS,
according to the formula supplied by the manufacturer.
The impact on the client is minor:
- the startup message describing the GPU
- the weight of the resource type in computing long-term debt
On the server, I changed the example app_plan() function
to assume that app FLOPS is 20% of peak FLOPS
(that's about what it is for SETI@home)
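For illustration, the change amounts to something like this
(sketch; names other than app_plan() are assumptions):

    // in app_plan(): assume the app achieves 20% of the
    // device's peak FLOPS (manufacturer formula)
    double peak_flops = gpu_peak_flops(coproc);
    hu.flops = 0.2*peak_flops;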
svn path=/trunk/boinc/; revision=19310
set a "coproc_missing" flag rather than aborting the job.
If the user removes a GPU board while there's a large queue of GPU jobs,
they'll stay queued (until their deadlines pass).
Note: this doesn't fix the situation where user connects via
Remote Desktop while GPU jobs are running or queued.
We should check for Remote Desktop every minute or so, and stop GPU jobs.
svn path=/trunk/boinc/; revision=19287
prefs and update, the change wouldn't take effect until client restart.
- client: fix bug in enforcement of "no CPU/NVIDIA/ATI" prefs
svn path=/trunk/boinc/; revision=19236
to accept CPU, NVIDIA and ATI jobs.
These prefs are shown only where relevant:
only for processor types for which the project has app versions;
if the project has versions for only one type, no pref is shown.
These prefs affect both client and scheduler.
The client won't ask for work for a device blocked by prefs,
and the scheduler won't send it.
This replaces earlier optional project-specific prefs for
"no CPU jobs" and "no GPU jobs".
(However, these prefs continue to be honored on the server side).
- client: if NVIDIA driver is unknown, say that rather than 0
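Enforcement of the prefs described above is conceptually
(sketch; the pref field names are assumptions):

    // skip (project, resource) pairs blocked by prefs,
    // on both the client and scheduler sides
    bool blocked(PROJECT* p, int rsc_type) {
        switch (rsc_type) {
        case RSC_TYPE_CPU:  return p->no_cpu_pref;
        case RSC_TYPE_CUDA: return p->no_cuda_pref;
        case RSC_TYPE_ATI:  return p->no_ati_pref;
        }
        return false;
    }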
svn path=/trunk/boinc/; revision=19194
and <ati_backoff> elements to scheduler reply.
These specify backoffs for the resource types,
overriding the existing backoff mechanism.
Projects can supply these if they don't have apps of a particular type
and don't want to get periodic requests for them.
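Client-side handling is roughly (sketch; names hypothetical):

    // a backoff in the reply overrides the usual
    // exponential backoff state for that resource
    if (sr.ati_backoff) {
        ati_work_fetch.set_backoff(p, sr.ati_backoff);
    }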
svn path=/trunk/boinc/; revision=19059
start only enough jobs per project to fill the CPUs,
not all the CPU jobs at once.
I'm not sure how much difference this makes,
but this is how it's supposed to work.
- client: if app_info.xml doesn't specify flops,
use an estimate that takes GPUs into account.
- client: if it's been more than 2 weeks since time stats update,
don't decay on_frac at all.
svn path=/trunk/boinc/; revision=19035
If you have 2 CPUs and a 1-day job in EDF mode,
the busy time should be zero, not .5 days.
Add a class BUSY_TIME_ESTIMATOR that makes a somewhat better
(though still fairly crude) estimate.
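The idea, in sketch form (simplified; details hypothetical):

    #include <algorithm>
    #include <vector>

    class BUSY_TIME_ESTIMATOR {
        std::vector<double> busy_time;  // per-instance load
    public:
        BUSY_TIME_ESTIMATOR(int n): busy_time(n, 0) {}
        void add_job(double duration, int ninst) {
            // crude: put the job on the least-loaded instances
            std::sort(busy_time.begin(), busy_time.end());
            for (int i=0; i<ninst && i<(int)busy_time.size(); i++) {
                busy_time[i] += duration;
            }
        }
        double get_busy_time() {
            // a new 1-instance job waits only for the least-loaded
            // instance (so 2 CPUs + one 1-day job => 0)
            if (busy_time.empty()) return 0;
            return *std::min_element(busy_time.begin(), busy_time.end());
        }
    };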
svn path=/trunk/boinc/; revision=19003
with a GPU request if project is anonymous platform
AND it has an app for that GPU type
- client: report overall work request as well as per-resource-type requests
svn path=/trunk/boinc/; revision=18994
to the max of the requests for different resource types.
Otherwise projects with old schedulers won't send us work.
svn path=/trunk/boinc/; revision=18945
2 * max(ncpus, ngpus);
show this in the state displayed by <work_fetch_debug>
- manager: show project-wide backoff in transfers tab
svn path=/trunk/boinc/; revision=18662
We need to estimate 2 different delays for each resource type:
1) "saturated time": the time the resource will be fully utilized
(new name for the old "estimated delay").
This is used to compute work requests.
2) "busy time": the time a new job would have to wait
to start using this resource.
This is passed to the scheduler and used for a crude deadline check.
Note: this is ill-defined; a single number doesn't suffice.
But as a very rough estimate, I'll use the sum of
(J.duration * J.ninstances)/ninstances
over all jobs that miss their deadline under RR sim.
svn path=/trunk/boinc/; revision=18629
(passed to server for crude deadline check) is computed.
Old: estimated delay is the interval for which the resource
is fully used (i.e., all instances busy).
Problem: this may cause unnecessary project starvation.
example: 1 CPU machine, has a month-long CPDN job
with a 1-year deadline (it's not in deadline trouble).
Then the CPU estimated delay will be 1 month,
and the client won't get any work from projects
with deadlines shorter than 1 month.
New: estimated delay is the latest time at which the
resource is fully used and is being used by at least 1 job
that is projected to miss its deadline under RR.
Note: this isn't precise, but I don't think we can improve it
much without getting a lot more complex.
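In RR-sim terms the new rule is roughly (sketch; names
hypothetical):

    // at each simulation step:
    if (all_instances_busy && a_running_job_misses_deadline) {
        // repeated assignment keeps the latest such time
        rsc.estimated_delay = sim_now - now;
    }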
svn path=/trunk/boinc/; revision=18607
- client: show times correctly in rr_sim debug msgs
- client: in "requesting new tasks" msg,
say what resources we're requesting (if there's more than CPU)
- client: estimated delay was possibly being calculated incorrectly
because of roundoff error
svn path=/trunk/boinc/; revision=18269
- first schedule jobs projected to miss deadline in EDF order
- then schedule remaining jobs in FIFO order
This is intended to reduce the number of preemptions of coproc jobs,
and hence (since they are always preempted by quit)
to reduce the wasted time due to checkpoint gaps
(the ordering is sketched below).
- client: the CPU scheduling policy made use of the number
of deadline misses in various places.
This should include only the deadline misses of CPU jobs.
So move "deadlines_missed" from RR_SIM_STATUS and PROJECT
to RSC_PROJECT_WORK_FETCH so that we have separate counts
for CPU and coproc jobs, and use the count for CPU jobs.
- GUI RPC: removed the rr_sim_deadlines_missed field
from project descriptor.
This is no longer meaningful, and it didn't seem to be used anywhere.
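The EDF-then-FIFO ordering above, in sketch form (hypothetical;
a lambda for brevity):

    #include <algorithm>
    #include <vector>

    struct JOB { bool misses_deadline; double deadline; };

    void order_jobs(std::vector<JOB>& jobs) {
        // projected deadline misses first, in EDF order;
        // stable_sort keeps FIFO order for the rest
        std::stable_sort(jobs.begin(), jobs.end(),
            [](const JOB& a, const JOB& b) {
                if (a.misses_deadline != b.misses_deadline) {
                    return a.misses_deadline;
                }
                if (a.misses_deadline) {
                    return a.deadline < b.deadline;
                }
                return false;
            }
        );
    }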
svn path=/trunk/boinc/; revision=17785
i.e., list both win/x86 and win/x86 + NVIDIA.
This will allow the manager to show which projects can
use the host's coprocessors,
and to gray out projects that require an absent coproc.
- fix compile warnings
svn path=/trunk/boinc/; revision=17735
set the job params to reasonable values (see below),
and make it easy to change these values in the script
- create_work (function and script): change default job params:
FLOPs est: 1e9 => 3600e9
FLOPs bound: 1e10 => 86400e9
mem bound: 100MB => 500MB
disk bound: 100MB => 1GB
delay bound: 100000s => 1 week
svn path=/trunk/boinc/; revision=17703
if resource is saturated for < work_buf_min()
(rather than saturated for 0).
So now the only significance of "overworked" is that we
won't ask overworked projects for work if resource is saturated
more than work_buf_min() but less than work_buf_total()
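That is, roughly (sketch, using the names above):

    if (saturated_time < work_buf_min()) {
        // fetch, even from overworked projects
    } else if (saturated_time < work_buf_total()) {
        // fetch, but skip overworked projects
    }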
svn path=/trunk/boinc/; revision=17620
other than work fetch (e.g., user request, project request)
temporarily clear resource backoffs while deciding
whether to request work.
The backoffs are there only to delay RPCs,
and we're doing an RPC anyway.
svn path=/trunk/boinc/; revision=17416
to ask for work inappropriately,
and tell the user that it wasn't asking for work.
Here's what was going on:
There are two different structures with work request fields
(req_secs, req_instances, estimated_delay):
COPROC_CUDA *coproc_cuda
and
RSC_WORK_FETCH cuda_work_fetch.
WORK_FETCH::choose_project() copied from cuda_work_fetch to coproc_cuda,
but only if a project was selected.
WORK_FETCH::clear_request() clears cuda_work_fetch but not coproc_cuda.
Scenario:
- a scheduler op is made to project A requesting X>0 secs of CUDA
- later, a scheduler op is made to project B for reason
other than work fetch (e.g., user request)
- choose_project() doesn't choose anything,
so the value of coproc_cuda->req_secs remains X
- clear_request() is called but that doesn't change *coproc_cuda
Solution: work-fetch code no longer knows about internals of
COPROC_CUDA and is not responsible for setting its request fields.
The copying of request fields from RSC_WORK_FETCH to COPROC
is done at a higher level,
in CLIENT_STATE::make_scheduler_request()
Additional bug fix: estimated_delay wasn't being cleared in some cases.
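The copying now happens in one place, roughly (sketch based
on the description above):

    // in CLIENT_STATE::make_scheduler_request(), done
    // unconditionally, so a stale request can't survive
    // clear_request()
    if (coproc_cuda) {
        coproc_cuda->req_secs = cuda_work_fetch.req_secs;
        coproc_cuda->req_instances = cuda_work_fetch.req_instances;
        coproc_cuda->estimated_delay = cuda_work_fetch.estimated_delay;
    }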
svn path=/trunk/boinc/; revision=17411
project, it must have no runnable jobs for ANY resource.
- client: work-fetch bug fix: when setting requests in the
shortfall case, don't request anything if project is backed off
or overworked for the resource.
svn path=/trunk/boinc/; revision=17338
There are situations where multiple projects can legitimately
have large negative LTD on a uniprocessor.
Instead...
- client: add <zero_debts> option to cc_config.xml
svn path=/trunk/boinc/; revision=17328
1) if an instance is idle, get work from highest-debt project,
even if it's overworked.
2) if resource has a shortfall, get work from highest-debt
non-overworked project
3) if there's a fetchable non-overworked project with no runnable jobs,
get work from the highest-debt one.
(each step is done first for GPU, then CPU)
Clause 3) is new.
It will cause the client to get jobs for as many projects as possible,
even if there is no shortfall.
This is necessary to make the notion of "overworked" meaningful
(otherwise, any project with long jobs can become overworked).
It also maintains as much variety as possible (like pre-6.6 clients).
Also (small bug fix) if a project is overworked for resource R,
request work for R only in case 1).
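In outline (sketch; helper names hypothetical):

    // run first for GPU, then for CPU
    PROJECT* choose_project(RSC_WORK_FETCH& rwf) {
        if (rwf.nidle_instances() > 0) {
            return highest_debt_project(rwf);         // 1) overworked OK
        }
        if (rwf.shortfall() > 0) {
            return highest_debt_not_overworked(rwf);  // 2)
        }
        return highest_debt_no_runnable(rwf);         // 3) new clause
    }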
svn path=/trunk/boinc/; revision=17327
stop accumulating debt if it's at or around zero.
This prevents other projects from being driven unboundedly negative.
- client: if the number of overworked projects exceeds the number
of device instances, clear debts; this indicates that an earlier
client was buggy and produced bad debt values.
svn path=/trunk/boinc/; revision=17325
This fixes a bug that can cause debts to NEVER get updated.
- client: added "abort_jobs_on_exit" feature
(available by --abort_jobs_on_exit cmdline
or <abort_jobs_on_exit> in cc_config.xml).
If set, when the client is exited by user request
(this includes signals on Unix)
it marks all pending jobs as aborted,
and does a scheduler RPC to all projects with jobs.
When these RPCs complete, the client exits.
This is useful when BOINC is being used on grids
where it is wiped clean after each run.
svn path=/trunk/boinc/; revision=17300
so that largest debt among eligible projects tends towards zero
- client: change definition of "overworked"; debt must be < 1 day
svn path=/trunk/boinc/; revision=17206
- client: if a project-requested RPC doesn't return work,
don't do resource backoff.
- client: if a user-requested scheduler RPC errors out, clear the request
svn path=/trunk/boinc/; revision=17191
ignore intervals longer than 10 secs;
that could only happen if the client or host was suspended/hibernated.
- client: in adjust_debts(), ignore intervals longer than
2*work fetch period, not 2*CPU sched period.
adjust_debts() is called from work fetch.
svn path=/trunk/boinc/; revision=17154
worked in the presence of coprocessors.
The simulator maintained per-project queues of pending jobs.
When a job finished (in the simulation) it would get
one or more jobs from that project's pending queue.
The problem: this could cause "holes" in the scheduling of GPUs,
and produce an erroneous nonzero shortfall for GPUs,
leading to infinite work fetch.
The solution: maintain a separate (per-resource, not per-project)
queue of pending coprocessor jobs.
When a coprocessor job finishes,
start pending jobs from the queue for that resource.
Another change: the simulator did strict reservation of coprocessors.
If there are 2 instances of CUDA,
and a 1-instance job is running in the simulation,
it wouldn't start an additional 2-instance job.
This also can cause erroneous nonzero shortfalls.
So instead, schedule coprocessors like CPUs, i.e. saturate them.
This can cause distorted completion time estimates,
but it's better than infinite work fetch.
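Sketch of the queue change (names hypothetical):

    #include <deque>

    std::deque<RESULT*> cuda_pending;  // per resource, not per project

    void cuda_job_finished_in_sim() {
        // start pending CUDA jobs regardless of project,
        // until the resource is saturated again
        while (!cuda_pending.empty() && cuda_has_capacity()) {
            start_in_sim(cuda_pending.front());
            cuda_pending.pop_front();
        }
    }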
svn path=/trunk/boinc/; revision=17093
There are two mechanisms to prevent the scheduler from
sending jobs that won't finish by their deadline.
Simple mechanism:
The client sends the interval x for which CPUs are projected
to be saturated.
Given a job with estimated duration y,
the scheduler doesn't send it if x + y exceeds the delay bound.
If it does send it, x is incremented by y.
Complex mechanism:
Client sends workload description.
Scheduler does EDF simulation, sees if deadlines are missed.
The only project using this AFAIK is BOINC alpha test.
Neither of these mechanisms takes coprocessors into account,
and as a result jobs could be sent that are doomed to
miss their deadline.
This checkin adds coprocessor awareness to the Simple mechanism.
Changes:
Client:
compute estimated delay (i.e. time until non-saturation)
for coprocessors as well as CPU.
Send them in scheduler request as part of coproc descriptor.
Scheduler:
Keep track of estimated delays separately for different resources
- client: fixed bug that computed CPU estimated delay incorrectly
- client: the work request (req_secs) for a resource is the min
of the project's share and the shortfall.
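Per resource, the Simple mechanism is just (sketch; names
hypothetical):

    // x: estimated delay reported by the client
    bool can_send(double& x, double est_duration, double delay_bound) {
        if (x + est_duration > delay_bound) {
            return false;  // job would miss its deadline
        }
        x += est_duration;  // resource busy that much longer
        return true;
    }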
svn path=/trunk/boinc/; revision=17086
- client: restore notion of overworked;
if a project is overworked for a resource R,
don't fetch work for R unless there are idle instances
svn path=/trunk/boinc/; revision=17057
1) net adjustment for eligible projects is zero;
2) max LTD is zero
- scheduler: fix msgs so disk size is shown in GB
svn path=/trunk/boinc/; revision=17031
- client: respect work-fetch backoff for non-CPU-intensive projects
- client: for non-CPU-intensive project, fetch new job
if no currently running jobs
- client: skip non-CPU-intensive projects in debt calculations
- manager: show resource backoff times correctly
svn path=/trunk/boinc/; revision=16998
(otherwise it doesn't work for coproc or multi-proc apps)
- client: in estimate of job completion time,
weight the estimate based on fraction done more heavily
(quadratic rather than linear)
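Schematically (sketch; names hypothetical):

    double fd = atp->fraction_done;
    double est = static_estimate;  // DCF-corrected WU estimate
    if (fd > 0) {
        double w = fd*fd;  // quadratic weight; was linear (w = fd)
        est = w*(elapsed_time/fd) + (1 - w)*static_estimate;
    }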
svn path=/trunk/boinc/; revision=16603
as the basis for estimating job completion times.
This should improve estimates for GPU apps,
and prevent the DCF from getting messed up.
svn path=/trunk/boinc/; revision=16598
are non-CPU-intensive or that use < 1 CPU (e.g., CUDA)
- client: get rid of spurious "internal error,
expected process to be executing" msg
- diag: don't check heap on every alloc
- fix a few compile warnings
svn path=/trunk/boinc/; revision=16323
Here are the new semantics: a scheduler reply can include
<next_rpc_delay>
Make another RPC ASAP after this amount of time elapses.
This is specified by the <next_rpc_delay> element in config.xml.
<request_delay>
Don't make another RPC until this amount of time elapses.
This is sent automatically (and sometimes with large delays)
by various parts of the scheduler.
next_rpc_delay now "overrides" request_delay in the sense that
request_delay is ignored if it's greater than next_rpc_delay.
In addition: the client maintains a min_rpc_time which is set based
on request_delay and also by various exponential backoff schemes.
next_rpc_delay now overrides this as well, in the same sense.
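So the client-side logic is roughly (sketch; field names
hypothetical):

    if (sr.next_rpc_delay) {
        p->next_rpc_time = now + sr.next_rpc_delay;
        // request_delay is ignored if greater than next_rpc_delay
        if (sr.request_delay && sr.request_delay <= sr.next_rpc_delay) {
            p->min_rpc_time = now + sr.request_delay;
        }
        // backoff-based min_rpc_time is capped the same way
        p->min_rpc_time = std::min(p->min_rpc_time, p->next_rpc_time);
    }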
svn path=/trunk/boinc/; revision=16206