Currently we do a reschedule any time a job checkpoints,
in case there's a job that has finished a time slice
but hasn't checkpointed yet.
Instead: flag such jobs, and trigger a reschedule
on checkpoint only for flagged jobs.
- client: fix instability in job scheduling that happens
if a job's estimated completion time in RR sim is close to its deadline.
It can alternate between making and missing its deadline,
causing the scheduler to alternate rapidly between jobs.
Solution: if RR sim has marked a job as a deadline miss
at any time in the last CPU scheduling period,
treat it as a deadline miss.
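A minimal sketch of this hysteresis; the field and constant names
(last_rr_sim_miss_time, CPU_SCHED_PERIOD) are illustrative, not the
actual client code:

    // Sketch: RR sim records when it last marked the job a deadline
    // miss; the job keeps being treated as a miss for one CPU
    // scheduling period after that, avoiding rapid alternation.
    const double CPU_SCHED_PERIOD = 60;   // seconds; assumed value

    struct RESULT {
        double last_rr_sim_miss_time;     // set by RR sim (illustrative)
    };

    bool treat_as_deadline_miss(const RESULT& r, double now) {
        return now - r.last_rr_sim_miss_time < CPU_SCHED_PERIOD;
    }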
svn path=/trunk/boinc/; revision=22928
for RR sim's pending-job lists.
Erasing the head of a vector is slow.
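For illustration only (the replacement container isn't named in the
truncated line above); this just shows why head erasure on a vector
is costly and what a constant-time alternative looks like:

    #include <deque>
    #include <vector>

    void example() {
        std::vector<int> v = {1, 2, 3};
        std::deque<int> d = {1, 2, 3};
        v.erase(v.begin());   // O(n): shifts every remaining element left
        d.pop_front();        // O(1): deque supports cheap head removal
    }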
- lib: allow GPU peak FLOPS to be specified in XML (for simulator)
- simulator work
- client: old work fetch policy: projects may need enough jobs
for all device instances, not just resource_share*ninst.
E.g., a project that has only CPU jobs in a CPU/GPU client.
- client: with REC scheduling, don't ask for work for
secondary resources if project has negative priority.
- client: in RR sim, make sure we saturate devices if possible.
Otherwise we may report a shortfall incorrectly
svn path=/trunk/boinc/; revision=22894
The round-robin simulation wasn't handling multithread jobs correctly.
For example, given two 3-CPU jobs,
it would model running them together on a 4-CPU host.
This doesn't match the CPU scheduler,
which runs only one of them at a time.
So the simulator would say that there are no idle CPUs
when in fact there are, and no new CPU jobs would be fetched.
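A sketch of the corrected admission check, with illustrative names
(avg_ncpus, ncpus_used):

    // Sketch: in the simulation, start a multithread job only if its
    // CPU usage fits in the CPUs still free, mirroring the CPU scheduler.
    struct SIM_JOB { double avg_ncpus; };

    bool can_start(const SIM_JOB& job, double ncpus, double ncpus_used) {
        return ncpus_used + job.avg_ncpus <= ncpus;
    }
    // With ncpus = 4: after one 3-CPU job starts, ncpus_used = 3,
    // so a second 3-CPU job (3 + 3 > 4) isn't started concurrently.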
svn path=/trunk/boinc/; revision=22801
Additions to request message:
<not_started_dur>X</not_started_dur>
<in_progress_dur>X</in_progress_dur>
The estimated remaining duration of unstarted
and in-progress tasks
Additions to reply message, within <project>, optional:
<suspend>0|1</suspend>
suspend or resume project (overrides local state)
<abort_not_started>0|1</abort_not_started>
if set, abort unstarted jobs
svn path=/trunk/boinc/; revision=22698
pointers to dynamically allocated COPROC-derived objects;
just have the objects themselves.
Dynamic allocation should be avoided at all costs.
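The shape of the change, sketched with simplified types (the real
COPROC classes have many more fields):

    // Before (sketch): std::vector<COPROC*> coprocs;  // new/delete needed
    // After (sketch): the derived objects are plain members.
    struct COPROC { int count = 0; };
    struct COPROC_CUDA : COPROC { /* ... */ };
    struct COPROC_ATI : COPROC { /* ... */ };

    struct COPROCS {
        COPROC_CUDA cuda;   // embedded directly; no dynamic allocation
        COPROC_ATI ati;
    };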
svn path=/trunk/boinc/; revision=21564
skip those jobs in RR sim.
Otherwise we write to uninitialized data structures,
and a crash can result.
- client: initialize the above data structures anyway
svn path=/trunk/boinc/; revision=20753
treat it as a "backup project":
fetch work from it only if there is an idle instance
and no other projects have work.
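Sketched as a work-fetch guard; how a backup project is flagged isn't
stated here (the entry's first line is truncated above), so the
is_backup field is an assumption:

    #include <vector>

    struct PROJECT {
        bool is_backup;   // how this is determined is assumed here
        bool has_work;
    };

    // Sketch: fetch from a backup project only if a device instance is
    // idle and no non-backup project can supply work.
    bool fetch_from_backup(const std::vector<PROJECT>& projects,
                           bool instance_idle) {
        if (!instance_idle) return false;
        for (const auto& p : projects) {
            if (!p.is_backup && p.has_work) return false;
        }
        return true;
    }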
svn path=/trunk/boinc/; revision=20286
if job A is unstarted and in EDF mode,
and a job B later in the list is started,
has the same app version,
and has the same arrival time,
move A after B.
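One way to implement the reordering just described (field names are
illustrative; whether A moves after the first or last matching B isn't
specified above, so this picks the last):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct JOB {
        bool started, edf;
        int app_version_id;
        double arrival_time;
    };

    void reorder(std::vector<JOB>& jobs) {
        for (std::size_t i = 0; i < jobs.size(); i++) {
            if (jobs[i].started || !jobs[i].edf) continue;
            // look for a later started job B matching A's app version
            // and arrival time
            for (std::size_t j = jobs.size(); j-- > i + 1; ) {
                if (jobs[j].started
                    && jobs[j].app_version_id == jobs[i].app_version_id
                    && jobs[j].arrival_time == jobs[i].arrival_time
                ) {
                    // move A (slot i) to just after B (slot j)
                    std::rotate(jobs.begin() + i, jobs.begin() + i + 1,
                                jobs.begin() + j + 1);
                    i--;   // re-examine the job shifted into slot i
                    break;
                }
            }
        }
    }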
- client: remove the "temp_dcf" mechanism,
which had the same goal but didn't work.
- client: in computing overall debt for a project,
subtract a term that reflects pending work.
This should reduce repeated fetches from the same project.
- client simulator: tweaks
svn path=/trunk/boinc/; revision=20223
- a project overestimates job FLOP counts
- the client starts jobs in EDF mode
- as a job progresses and its fraction done increases,
its completion time estimate decreases until
it's no longer a deadline miss.
- a job gets preempted by another job from that project;
you end up with lots of partly completed jobs.
Solution (I hope): if an app version has running jobs,
compute a "temp DCF" for the app version,
which is the min of dynamic/static estimates for its jobs.
Apply this scaling factor to completion time estimates
for unstarted jobs in RR simulation
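A sketch of the temp DCF computation under those definitions
(dynamic_estimate and static_estimate are illustrative names;
both are assumed positive):

    #include <algorithm>
    #include <vector>

    struct JOB {
        bool running;
        double dynamic_estimate;   // based on fraction done
        double static_estimate;    // based on FLOP count
    };

    // Per app version: temp DCF is the smallest dynamic/static ratio
    // among its running jobs (1.0 if none are running).
    double temp_dcf(const std::vector<JOB>& jobs) {
        double dcf = 1.0;
        for (const auto& j : jobs) {
            if (!j.running) continue;
            dcf = std::min(dcf, j.dynamic_estimate / j.static_estimate);
        }
        return dcf;
    }
    // RR sim estimate for an unstarted job: dcf * static_estimate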
- client: the estimation of remaining time of running jobs was wrong
(how did this bug survive so long?)
svn path=/trunk/boinc/; revision=20077
will have enough jobs to use its share of resource instances.
This avoids situations where, e.g., on a 2-CPU system
a project has a 75% resource share and 1 CPU job,
and its STD increases without bound.
Did a general cleanup of the logic for computing
work request sizes (seconds and instances).
svn path=/trunk/boinc/; revision=20036
Old: if a project has RR sim deadline misses,
select jobs to run high-priority on the basis of:
1) deadline (earliest first)
2) estimated time to completion (least first)
This ignores whether jobs missed their deadline in RR sim,
so it may choose to run a job that's actually in no
danger of missing its deadline over one that is.
New: choose only jobs that miss their deadline in RR sim
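Sketched selection under the new rule (the rr_sim_misses_deadline
flag is an illustrative name for whatever RR sim records):

    #include <algorithm>
    #include <vector>

    struct JOB {
        bool rr_sim_misses_deadline;   // set during RR simulation
        double deadline;
    };

    // Keep only jobs RR sim flagged as deadline misses,
    // earliest deadline first.
    std::vector<JOB> edf_candidates(const std::vector<JOB>& jobs) {
        std::vector<JOB> out;
        for (const auto& j : jobs) {
            if (j.rr_sim_misses_deadline) out.push_back(j);
        }
        std::sort(out.begin(), out.end(),
                  [](const JOB& a, const JOB& b) {
                      return a.deadline < b.deadline;
                  });
        return out;
    }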
svn path=/trunk/boinc/; revision=19826
start only enough jobs to fill CPUs per project,
not all the CPU jobs at once.
I'm not sure how much difference this makes,
but this is how it's supposed to work.
- client: if app_info.xml doesn't specify flops,
use an estimate that takes GPUs into account.
- client: if it's been more than 2 weeks since time stats update,
don't decay on_frac at all.
svn path=/trunk/boinc/; revision=19035
runs GPU jobs in a seemingly random order,
or preempts GPU jobs needlessly.
The change has two parts:
1) sort the "results" vector by received_time,
so that the RR simulation processes GPU jobs FIFO.
2) in the CPU scheduler (earliest_deadline_result()),
instead of choosing the earliest-deadline GPU job that
misses its deadline,
pick the earliest-deadline GPU job from a project that
has a deadline miss for that GPU type
(this is what's done in the CPU case).
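Part 1) is essentially a sort; a sketch (RESULT is reduced here to the
one field involved):

    #include <algorithm>
    #include <vector>

    struct RESULT { double received_time; };

    // Sort by arrival so the RR simulation processes jobs FIFO.
    void sort_results_fifo(std::vector<RESULT*>& results) {
        std::sort(results.begin(), results.end(),
                  [](const RESULT* a, const RESULT* b) {
                      return a->received_time < b->received_time;
                  });
    }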
- client: fix bug where if you have an exclusive app,
then remove it from cc_config.xml and do "update config",
it doesn't go away.
Need to clear the list before parsing.
svn path=/trunk/boinc/; revision=18842
We need to estimate 2 different delays for each resource type:
1) "saturated time": the time the resource will be fully utilized
(new name for the old "estimated delay").
This is used to compute work requests.
2) "busy time": the time a new job would have to wait
to start using this resource.
This is passed to the scheduler and used for a crude deadline check.
Note: this is ill-defined; a single number doesn't suffice.
But as a very rough estimate, I'll use the sum of
(J.duration * J.ninstances)/ninstances
over all jobs that miss their deadline under RR sim.
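The estimate, transcribed directly into code (field names are
illustrative):

    #include <vector>

    struct JOB {
        bool misses_deadline;   // per RR sim
        double duration;        // estimated remaining duration
        double ninstances;      // instances of this resource the job uses
    };

    // Rough busy-time estimate: sum of J.duration * J.ninstances over
    // deadline-miss jobs, divided by the resource's instance count.
    double busy_time(const std::vector<JOB>& jobs, double ninstances) {
        double sum = 0;
        for (const auto& j : jobs) {
            if (j.misses_deadline) sum += j.duration * j.ninstances;
        }
        return sum / ninstances;
    }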
svn path=/trunk/boinc/; revision=18629
(passed to server for crude deadline check) is computed.
Old: estimated delay is the interval for which the resource
is fully used (i.e., all instances busy).
Problem: this may cause unnecessary project starvation.
Example: a 1-CPU machine has a month-long CPDN job
with a 1-year deadline (it's not in deadline trouble).
Then the CPU estimated delay will be 1 month,
and the client won't get any work from projects
with deadlines shorter than 1 month.
New: estimated delay is the latest time at which the
resource is fully used and is being used by at least 1 job
that is projected to miss its deadline under RR.
Note: this isn't precise, but I don't think we can improve it
much without getting a lot more complex.
svn path=/trunk/boinc/; revision=18607
- client: show times correctly in rr_sim debug msgs
- client: in "requesting new tasks" msg,
say what resources we're requesting (if there's more than CPU)
- client: estimated delay was possibly being calculated incorrectly
because of roundoff error
svn path=/trunk/boinc/; revision=18269
- first schedule jobs projected to miss deadline in EDF order
- then schedule remaining jobs in FIFO order
This is intended to reduce the number of preemptions of coproc jobs,
and hence (since they are always preempted by quit)
to reduce the wasted time due to checkpoint gaps.
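A sketch of the two-phase ordering (field names are illustrative):

    #include <algorithm>
    #include <vector>

    struct JOB {
        bool projected_deadline_miss;   // per RR sim
        double deadline;
        double received_time;
    };

    // Deadline-miss jobs first, in EDF order; the rest FIFO.
    void order_jobs(std::vector<JOB>& jobs) {
        auto mid = std::stable_partition(jobs.begin(), jobs.end(),
            [](const JOB& j) { return j.projected_deadline_miss; });
        std::sort(jobs.begin(), mid,
            [](const JOB& a, const JOB& b) {
                return a.deadline < b.deadline;
            });
        std::sort(mid, jobs.end(),
            [](const JOB& a, const JOB& b) {
                return a.received_time < b.received_time;
            });
    }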
- client: the CPU scheduling policy made use of the number
of deadline misses in various places.
This should include only the deadline misses of CPU jobs.
So move "deadlines_missed" from RR_SIM_STATUS and PROJECT
to RSC_PROJECT_WORK_FETCH so that we have separate counts
for CPU and coproc jobs, and use the count for CPU jobs.
- GUI RPC: removed the rr_sim_deadlines_missed field
from project descriptor.
This is no longer meaningful, and it didn't seem to be used anywhere.
svn path=/trunk/boinc/; revision=17785
instead of host_info.fpops as the FLOPS estimate for non-GPU apps.
I don't see why this would make any difference
(these two are equal for non-GPU apps)
but people have reported that this change improves estimates.
svn path=/trunk/boinc/; revision=17624
project, it must have no runnable jobs for ANY resource.
- client: work-fetch bug fix: when setting requests in the
shortfall case, don't request anything if project is backed off
or overworked for the resource.
svn path=/trunk/boinc/; revision=17338
worked in the presence of coprocessors.
The simulator maintained per-project queues of pending jobs.
When a job finished (in the simulation), it would start
one or more jobs from that project's pending queue.
The problem: this could cause "holes" in the scheduling of GPUs,
and produce an erroneous nonzero shortfall for GPUs,
leading to infinite work fetch.
The solution: maintain a separate (per-resource, not per-project)
queue of pending coprocessor jobs.
When a coprocessor job finishes,
start pending jobs from the queue for that resource.
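A sketch of the per-resource queue; the free-instance bookkeeping is
reduced to a counter here:

    #include <deque>
    #include <map>
    #include <string>

    struct JOB { /* ... */ };

    // One pending queue per resource type, not per project, so a
    // finishing GPU job is backfilled immediately and no scheduling
    // "holes" (and spurious shortfalls) appear.
    std::map<std::string, std::deque<JOB*>> pending;

    void on_sim_job_finished(const std::string& resource, int free_instances) {
        auto& q = pending[resource];
        while (free_instances > 0 && !q.empty()) {
            JOB* next = q.front();
            q.pop_front();
            free_instances--;
            (void) next;   // start 'next' in the simulation here
        }
    }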
Another change: the simulator did strict reservation of coprocessors.
If there are 2 instances of CUDA,
and a 1-instance job is running in the simulation,
it wouldn't start an additional 2-instance job.
This also can cause erroneous nonzero shortfalls.
So instead, schedule coprocessors like CPUs, i.e. saturate them.
This can cause distorted completion time estimates,
but it's better than infinite work fetch.
svn path=/trunk/boinc/; revision=17093
There are two mechanisms to prevent the scheduler from
sending jobs that won't finish by their deadline.
Simple mechanism:
The client sends the interval x for which CPUs are projected
to be saturated.
Given a job with estimated duration y,
the scheduler doesn't send it if x + y exceeds the delay bound.
If it does send it, x is incremented by y.
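The Simple mechanism's check, transcribed into a sketch:

    // x: client-reported saturated interval for the resource;
    // y: the candidate job's estimated duration.
    bool send_job(double& x, double y, double delay_bound) {
        if (x + y > delay_bound) return false;   // would miss its deadline
        x += y;   // account for the job we're sending
        return true;
    }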
Complex mechanism:
Client sends workload description.
Scheduler does EDF simulation, sees if deadlines are missed.
The only project using this AFAIK is BOINC alpha test.
Neither of these mechanisms takes coprocessors into account,
and as a result jobs could be sent that are doomed to
miss their deadline.
This checkin adds coprocessor awareness to the Simple mechanism.
Changes:
Client:
compute estimated delay (i.e. time until non-saturation)
for coprocessors as well as CPU.
Send them in scheduler request as part of coproc descriptor.
Scheduler:
Keep track of estimated delays separately for different resources
- client: fixed bug that computed CPU estimated delay incorrectly
- client: the work request (req_secs) for a resource is the min
of the project's share and the shortfall.
svn path=/trunk/boinc/; revision=17086
for projects with no active results.
This is now wrong because coproc apps might have pending results.
Also remove the nidle_cpus > 0 conditional that increments CPU shortfall;
I think this is vestigial code.
svn path=/trunk/boinc/; revision=16646