and an upload started in the last 5 min, don't fetch work from it.
The goal is to merge the 2 scheduler RPCs
(fetch work, report completed taskS) into a single RPC.
Note: this may result in idleness in some cases.
- scheduler: if client doesn't handle plan class (pre-5.10),
check plan-class app versions anyway,
but only use if it's a single-CPU app.
This allows single-CPU app versions with specific requirements
(like SSE) to be issued to old clients.
From Bernd Machenschalk
svn path=/trunk/boinc/; revision=22841
since we now do round-robin for GPUs as well as CPU.
NOTE: this bug was found using the client simulator!
- client simulator: generate REC graph
svn path=/trunk/boinc/; revision=22746
recent estimated credit (REC) instead of debt.
These changes are enabled by
#define USE_REC
in work_fetch.h.
If this is commented out (the default) the client uses
debt-based scheduling, same as before.
TODO: work-fetch policy changes
- client simulator: various fixes:
- compute idle and wasted fraction based on all processing resources,
not just CPU
- compute job completion times based on FLOPS, not CPU seconds
- compute and use project->no_X_apps
etc.
svn path=/trunk/boinc/; revision=22741
even when the project doesn't have app versions that use the resource.
TODO: there are 2 functions,
compute_may_have_work() and dont_fetch(),
that do the same thing and both have misleading names.
Clean this up.
Rom: please back-port to 6.10
svn path=/trunk/boinc/; revision=22733
Additions to request message:
<not_started_dur>X</not_started_dur>
<in_progress_dur>X</in_progress_dur>
The estimated remaining duration of unstarted
and in-progress tasks
Additions to reply message, within <project>, optional:
<suspend>0|1</suspend>
suspend or resume project (overrides local state)
<abort_not_started>0|1</abort_not_started>
if set, abort unstarted jobs
svn path=/trunk/boinc/; revision=22698
- If the scheduler doesn't have any app versions for resource type X,
it includes an element <no_X_apps>1</no_X_apps> in the reply msg
(e.g., <no_cpu_apps>1</no_cpu_apps>)
- The client parses and stores these flags,
and doesn't ask a project for work for a resource
if the project doesn't have app versions for it.
Apparently I started this change in [19375] (October 2009)
and forgot to finish it.
svn path=/trunk/boinc/; revision=22661
as the major criterion in choosing non-EDF GPU jobs.
GPU scheduling now respects resource share,
and as a result STD should no longer diverge.
- client simulator: various improvements, most notably
that we now generate gnuplot graphs of all debt types
NOTE: the client problem was found and fixed using the simulator!
svn path=/trunk/boinc/; revision=22536
cmdline arg.
Suppresses the fetch of project list and of current client version #.
Use when running on grid nodes.
- debugging on client simulator. Not done yet.
svn path=/trunk/boinc/; revision=22414
Old: when a job finished, we cleared the backoffs for the
resources it used. The idea was to get more jobs
immediately in the case where the client was at
a jobs-in-progress limit.
Problem: this resulted in an RPC immediately,
typically before the output files were uploaded.
So the client is still at the limit, and doesn't get jobs.
New: clear the backoffs at the point when output files
have been uploaded and the job is ready to report.
- client: change range in resource backoff from (0,x) to (.5, 1.5*x)
svn path=/trunk/boinc/; revision=22411
Insteady of using its own XML input files,
the simulator now takes a client_state.xml file as input.
The simulator generates a synthetic workload based on the
projects, apps, app versions, WUs, and result it finds there.
This means that a user seeing aberrant behavior
can just send their client_state.xml file
and (hopefully) we can use the simulator to repro.
The simulator now can model GPUs.
As of this checkin, the simulator compiles but doesn't work.
There should be no change in the actual client.
svn path=/trunk/boinc/; revision=22409
1) individual file transfers
2) project-level file transfer backoff
3) scheduler operations
Old: scale by e.
Use random backoff in the range min..x
New: scale by 2.
Use random backoff in the rand x/2..x
- client: for file transfers, use backoff range of 10 min .. 12 hrs
rather than 1 min .. 4 hrs
svn path=/trunk/boinc/; revision=21887
if one of its downloads is stalled.
This fixes a situation that can cause processor or GPU
idleness when download servers are down for a while
svn path=/trunk/boinc/; revision=21877
If set, then:
if there are any active jobs at startup, don't fetch more work
otherwise make exactly 1 scheduler RPC requesting work,
and request only enough jobs to fill all devices.
- client: --exit_when_idle: make it available in config file
and change semantics to:
If set: exit if
1) there are no tasks, and
2) either there was an active task on startup,
or we made a scheduler RPC requesting work
Note: if there are not active tasks on startup,
and the client makes a work request which doesn't return work,
it will exit.
svn path=/trunk/boinc/; revision=21680
pointers to dynamically allocated COPROC-derived objects,
just have the objects themselves.
Dynamic allocation should be avoided at all costs.
svn path=/trunk/boinc/; revision=21564
show an alert.
- manager: in transfers tab, show it if transfers are suspended
because network is suspended
- manager: in tasks tab, if a task is downloading or uploading
and network is suspended, show it
svn path=/trunk/boinc/; revision=21346
skip those jobs in RR sim.
Otherwise we add stuff to uninitialized data structures,
and a crash can result.
- client: initialize the above data structures anyway
svn path=/trunk/boinc/; revision=20753
no longer means simulate zero CPUs.
There are several places that divide by ncpus.
Zero CPUs doesn't make any sense anyway.
svn path=/trunk/boinc/; revision=20474
treat it as a "backup project":
fetch work from it only if there is an idle instance
and no other projects have work.
svn path=/trunk/boinc/; revision=20286
if job A is unstarted and EDF,
and there's a job B that is later in the list,
is started, has the same app version,
and has the same arrival time,
move A after B.
- client: remove the "temp_dcf" mechanism,
which had the same goal but didn't work.
- client: in computing overall debt for a project,
subtract a term that reflects pending work.
This should reduce repeated fetches from the same project.
- client simulator: tweaks
svn path=/trunk/boinc/; revision=20223
"There is a bug in tra() that causes problems if one of the arguments
contains a replacement marker itself. For example, if the first
argument contains an encoded URL, which contains '%2', the second
argument may appear in the middle of the URL."
- client simulator: further fiddling around. Not done.
svn path=/trunk/boinc/; revision=20201
if project has crazy DCF, don't automatically request 1 sec;
only request work if there's a shortfall.
- intermediate checkin for notices stuff
svn path=/trunk/boinc/; revision=20145
- a project overestimates job FLOP counts
- the client starts jobs in EDF mode
- as job progresses and fraction done increases,
its completion time estimate decreases until
it's no longer a deadline miss.
- job gets preempted by other job from that project;
you end up with lots of partly completed jobs.
Solution (I hope): if an app version has running jobs,
compute a "temp DCF" for the app version,
which is the min of dynamic/static estimates for its jobs.
Apply this scaling factor to completion time estimates
for unstarted jobs in RR simulation
- client: the estimation of remaining time of running jobs was wrong
(how did this bug survive so long?)
svn path=/trunk/boinc/; revision=20077
will have enough jobs to use its share of resource instances.
This avoids situations where e.g. on a 2-CPU system
a project has 75% resource share and 1 CPU job,
and its STD increases without bound.
Did a general cleanup of the logic for computing
work request sizes (seconds and instances).
svn path=/trunk/boinc/; revision=20036
RAM to run job, but when we actually run the job
not enough GPU RAM is free, so the application fails.
This can cause a large number of jobs to fail.
Solution:
- app_plan() can specify the GPU RAM requirements of an app version.
This is passed to the client in a new field
<gpu_ram> of the <app_version> element.
- prior to starting or restarting a GPU app, the client
checks the amount of free RAM on the particular GPU.
If it's not enough for the app version,
the client doesn't start it,
and arranges for the scheduler to ignore it for 5 minutes
(by which point there might be more free GPU RAM)
Notes:
1) this change will have effect only when
both client and scheduler are updated.
2) the check is done in enforce_schedule(),
rather than schedule_cpus(),
because only at that point
have we assigned a specific GPU to the job.
3) there's another case to deal with:
a GPU app's malloc of GPU RAM fails in the middle of the job.
Currently the job fails.
I plan to add an API call boinc_temporary_exit(x) so
that the job can exit and potentially restart in x seconds.
(In principle this mechanism is sufficient for all cases,
but it could lead to a lot of starting/exiting,
so the current change is worthwhile).
svn path=/trunk/boinc/; revision=19864
can increase or decrease at N times real time.
My checkin of 7 Dec reflects this by changing
the STD limits to +- N*MAX_STD.
This looks like a bug to users.
Instead, scale that rate of STD change by 1/N,
and keep the old limits of +- MAX_STD
svn path=/trunk/boinc/; revision=19851
Old: if a project has RR sim deadline misses,
select jobs to run high-priority on the basis of:
1) deadline (earliest first)
2) estimated time to completion (least first)
This ignores whether jobs missed their deadline in RR sim,
so it may choose to run a job that's actually in no
danger of missing its deadline over one that is.
New: choose only jobs that miss their deadline in RR sim
svn path=/trunk/boinc/; revision=19826
Let them float around with other projects.
Fixes problem where, when a project finishes its last job
and has a negative STD, it gets an unfair increment
by being set to zero.
svn path=/trunk/boinc/; revision=19804
It computed an "overall STD" as the sum of CPU and coprocs,
weighted by the coproc's speed, as we do for LTD.
This was the wrong idea; in the presence of GPUs,
STDs quickly get pushed to +- 1 day and are truncated there.
New scheme: STD is maintained per (resource type, project).
This fixes the above problem,
and it opens to door to round-robin scheduling of GPUs.
- client: the calculation of "anticipated debt" was scaling
by relative resource share.
This wasn't correct, seems to me.
- client: rename "debt" to "long_term_debt" in a few places
(but not in the client state file, for compatibility)
svn path=/trunk/boinc/; revision=19777
only if the offset is positive.
- client: some cmdline args set members of config.
However, config was being cleared after cmdline args were parsed,
so these args had no effect.
Instead, clear config before parsing cmdline
svn path=/trunk/boinc/; revision=19776
Old: it's based entirely on CPU time.
So a GPU project, whose app uses only a fraction
of a CPU, accrues positive debt.
This is OK if the project has only GPU apps,
since STD is not (currently) used for GPU scheduling.
But some projects have both CPU and GPU apps.
New: STD is based on total processing.
It has terms for each resource type.
The notion of "runnable resource share" is specific to a type.
Note: the notion of "resource share fraction" appears in
a couple of other places:
- it's passed to apps in app_init_data.xml
- it's passed in scheduler requests.
It should be broken down by resource type in these cases too.
Note to self: do this later.
svn path=/trunk/boinc/; revision=19762
don't accumulate debt for that resource.
Otherwise we'll accumulate debt forever,
pushing other projects into overworked state.
svn path=/trunk/boinc/; revision=19547
(estimated throughput of all GPUs)/(estimated throughput of all CPUs)
rather than the ratio of 1 GPU to 1 CPU.
This change will hopefully cause ratios of granted credit
to more closely match resource shares.
svn path=/trunk/boinc/; revision=19311
Make them both peak FLOPS,
according to the formula supplied by the manufacturer.
The impact on the client is minor:
- the startup message describing the GPU
- the weight of the resource type in computing long-term debt
On the server, I changed the example app_plan() function
to assume that app FLOPS is 20% of peak FLOPS
(that's about what it is for SETI@home)
svn path=/trunk/boinc/; revision=19310
set a "coproc_missing" flag rather than aborting the job.
If use removes a GPU board while there's a large queue of GPU jobs,
they'll stay queued (until their deadline passes).
Note: this doesn't fix the situation where user connects via
Remote Desktop while GPU jobs are running or queued.
We should check for Remote Desktop every minute or so, and stop GPU jobs.
svn path=/trunk/boinc/; revision=19287
prefs and update, the change wouldn't take effect until client restart.
- client: fix bug in enforcement of "no CPU/NVIDIA/ATI" prefs
svn path=/trunk/boinc/; revision=19236
to accept CPU, NVIDIA and ATI jobs.
These prefs are shown only where relevant:
e.g., only for processor types for which the project has app versions,
and if it has versions for only one type, no pref is shown.
These prefs affect both client and scheduler.
The client won't ask for work for a device blocked by prefs,
and the scheduler won't send it.
This replaces earlier optional project-specific prefs for
"no CPU jobs" and "no GPU jobs".
(However, these prefs continue to be honored on the server side).
- client: if NVIDIA driver is unknown, say that rather than 0
svn path=/trunk/boinc/; revision=19194
and <ati_backoff> elements to scheduler reply.
These specify backoffs for the resource types,
overriding the existing backoff mechanism.
Projects can supply these if they don't have apps of a particular type
and don't want to get periodic requests for them.
svn path=/trunk/boinc/; revision=19059
start only enough jobs to fill CPUs per project,
not all the CPU jobs at once.
I'm not sure how much difference this makes,
but this is how it's supposed to work.
- client: if app_info.xml doesn't specify flops,
use an estimate that takes GPUs into account.
- client: if it's been more than 2 weeks since time stats update,
don't decay on_frac at all.
svn path=/trunk/boinc/; revision=19035
If you have 2 CPUs and a 1-day job in EDF mode,
the busy time should be zero, not .5 days.
Add a class BUSY_TIME_ESTIMATOR that makes a somewhat better
(though still fairly crude) estimate.
svn path=/trunk/boinc/; revision=19003
with a GPU request if project is anonymous platform
AND it has an app for that GPU type
- client: report overall work request as well as per-resource-type requests
svn path=/trunk/boinc/; revision=18994
to the max of the requests for different resource types.
Otherwise projects with old schedulers won't send us work.
svn path=/trunk/boinc/; revision=18945
2 * max(ncpus, ngpus);
show this in the state displayed by <work_fetch_debug>
- manager: show project-wide backoff in transfers tab
svn path=/trunk/boinc/; revision=18662
We need to estimate 2 different delays for each resource type:
1) "saturated time": the time the resource will be fully utilized
(new name for the old "estimated delay").
This is used to compute work requests.
2) "busy time": the time a new job would have to wait
to start using this resource.
This is passed to the scheduler and used for a crude deadline check.
Note: this is ill-defined; a single number doesn't suffice.
But as a very rough estimate, I'll use the sum of
(J.duration * J.ninstances)/ninstances
over all jobs that miss their deadline under RR sim.
svn path=/trunk/boinc/; revision=18629
(passed to server for crude deadline check) is computed.
Old: estimated delay is the interval for which the resource
is fully used (i.e., all instances busy).
Problem: this may cause unnecessary project starvation.
example: 1 CPU machine, has a month-long CPDN job
with a 1-year deadline (it's not in deadline trouble).
Then the CPU estimated delay will be 1 month,
and the client won't get any work from projects
with deadlines shorter than 1 month.
New: estimated delay is the latest time at which the
resource is fully used and is being used by at least 1 job
that is projected to miss its deadline under RR.
Note: this isn't precise, but I don't think we can improve it
much without getting a lot more complex.
svn path=/trunk/boinc/; revision=18607
- client: show times correctly in rr_sim debug msgs
- client: in "requesting new tasks" msg,
say what resources we're requesting (if there's more than CPU)
- client: estimated delay was possibly being calculated incorrectly
because of roundoff error
svn path=/trunk/boinc/; revision=18269
- first schedule jobs projected to miss deadline in EDF order
- then schedule remaining jobs in FIFO order
This is intended to reduce the number of preemptions of coproc jobs,
and hence (since they are always preempted by quit)
to reduce the wasted time due to checkpoint gaps.
- client: the CPU scheduling policy made use of the number
of deadline misses in various places.
This should include only the deadline misses of CPU jobs.
So move "deadlines_missed" from RR_SIM_STATUS and PROJECT
to RSC_PROJECT_WORK_FETCH so that we have separate counts
for CPU and coproc jobs, and use the count for CPU jobs.
- GUI RPC: removed the rr_sim_deadlines_missed field
from project descriptor.
This is no longer meaningful, and it didn't seem to be used anywhere.
svn path=/trunk/boinc/; revision=17785