- client: round-robin simulation, among other things, creates a bitmap
"sim_excluded_instances" of instances that are idle because of GPU exclusions.
There was a problem in how this was computed;
in the situation where there are fewer jobs than GPU instances
it could fail to set any bits, so no work fetch would happen.
My solution is a bit of a kludge, but should work in most cases.
The long-term solution is to treat GPU instances separately,
eliminating the need for GPU exclusions.
The basic problem: the way we assign GPU instances when creating
the "run list" is slightly different from the way we assign them
when we actually run the jobs;
the latter assigns a running job to the instance it's using,
but the former doesn't.
Solution (kludge): when building the run list,
don't reserve instances for currently running jobs.
This will result in more jobs in the run list, and avoid starvation.
For efficiency, do this only if there are exclusions for this type.
Comment: this is yet another complexity that would be eliminated
if GPU instances were modeled separately.
I wish I had time to do that.
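A sketch of the kludge (the types and names here are illustrative,
not the client's actual ones):

    #include <vector>

    struct JOB { bool running; };

    // Build the run list for a GPU type with ninstances instances.
    // Normally a running job reserves the instance it's on, reducing
    // how many further jobs get listed. The kludge: skip that
    // reservation when the type has exclusions, so more jobs land in
    // the run list and work fetch can't be starved.
    std::vector<JOB*> build_run_list(
        std::vector<JOB*>& jobs, int ninstances, bool type_has_exclusions
    ) {
        std::vector<JOB*> run_list;
        int free_instances = ninstances;
        for (JOB* j: jobs) {
            if (j->running && !type_has_exclusions) {
                free_instances--;       // reserve the instance it's using
                continue;
            }
            if (free_instances <= 0) break;
            run_list.push_back(j);
            free_instances--;
        }
        return run_list;
    }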
- client emulator: change default latency bound from 1 day to 10 days
This gives you a way to simulate the effects of app_config.xml
- client: piggyback requests for resources even if we're backed off from them
- client: change resource backoff logic
Old: if we requested work and didn't get any,
back off from resources for which we requested work
New: for each resource type T:
if we requested work for T and didn't get any, back off from T
Also, don't back off if we're already backed off
(i.e. if this is a piggyback request)
Also, only back off if the RPC was triggered by an automatic
and potentially rapid source
(namely: work fetch, result report, or trickle-up)
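The new rule, sketched (names are placeholders, not the client's):

    enum { RSC_CPU, RSC_NVIDIA, RSC_ATI, NUM_RSC };

    struct RSC_REQUEST {
        bool requested;     // this RPC asked for work of this type
        bool got_work;      // the reply included some
        bool backed_off;    // already backed off (piggyback request)
    };

    // rpc_automatic: the RPC came from work fetch, a result report,
    // or a trickle-up - sources that can repeat rapidly
    void update_backoffs(RSC_REQUEST (&req)[NUM_RSC], bool rpc_automatic) {
        if (!rpc_automatic) return;             // e.g. user-requested update
        for (int t = 0; t < NUM_RSC; t++) {
            if (req[t].backed_off) continue;    // don't extend a backoff
            if (req[t].requested && !req[t].got_work) {
                // start or increase the backoff for resource t
            }
        }
    }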
- client: fix small work fetch bug
by Jacob Klein.
The new policy is roughly as follows:
- find the highest-priority project P that is allowed
to fetch work for a resource below buf_min
- ask P for work for all resources R below buf_max
for which it's allowed to fetch work,
unless there's a higher-priority project allowed
to request work for R.
If we're going to do an RPC to P for reasons other than work fetch,
the policy is:
- for each resource R for which P is the highest-priority project
allowed to fetch work, and R is below buf_max,
request work for R.
to use project's share of instances.
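A sketch of the work-fetch selection described above (the
non-piggyback case); PROJECT, the helper functions, and their stub
bodies are placeholders for the client's real code:

    #include <vector>

    struct PROJECT { double priority; };
    const int NUM_RSC = 3;

    bool allowed_to_fetch(PROJECT*, int rsc) { return true; }   // stub
    bool below_buf_min(int rsc) { return true; }                // stub
    bool below_buf_max(int rsc) { return true; }                // stub

    // projects is sorted by decreasing priority; returns the resource
    // types to request from the chosen project p
    std::vector<int> choose_request(std::vector<PROJECT*>& projects, PROJECT*& p) {
        std::vector<int> request;
        // find the highest-priority project allowed to fetch work
        // for some resource below buf_min
        p = nullptr;
        for (PROJECT* q: projects) {
            for (int r = 0; r < NUM_RSC; r++) {
                if (below_buf_min(r) && allowed_to_fetch(q, r)) {
                    p = q;
                    break;
                }
            }
            if (p) break;
        }
        if (!p) return request;
        // ask p for work for each resource below buf_max that it may
        // fetch, unless a higher-priority project may fetch work for it
        for (int r = 0; r < NUM_RSC; r++) {
            if (!below_buf_max(r) || !allowed_to_fetch(p, r)) continue;
            bool higher = false;
            for (PROJECT* q: projects) {
                if (q == p) break;          // only projects above p
                if (allowed_to_fetch(q, r)) { higher = true; break; }
            }
            if (!higher) request.push_back(r);
        }
        return request;
    }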
- client emulator: if client_state.xml doesn't have <no_rsc_apps>
elements for a project, and the project doesn't have apps
for a given resource, the project can be asked for work for that resource.
- remote job submission:
- prefix error messages with "BOINC server:"
so higher levels can tell where the error is coming from
- "get templates" RPC can take job name instead of app name
- Condor interface
- add BOINC_SELECT_PROJECT function
- BOINC_SUBMIT no longer has info about output files
- Change BOINC_FETCH_OUTPUT semantics
- client: add <async_file_debug> log flag
- client: do decompression (both sync and async) to a temp file,
then rename it
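The temp-file-plus-rename pattern, sketched; decompress_file() stands
in for the real (sync or async) decompression:

    #include <cstdio>
    #include <string>

    int decompress_file(const std::string& src, const std::string& dst) {
        return 0;   // stub for the actual decompression
    }

    // Writing to a temp file and renaming makes the output appear
    // atomically under its final name, so a crash mid-decompress
    // can't leave a truncated file that looks complete.
    int decompress_via_temp(const std::string& src, const std::string& dst) {
        std::string temp = dst + ".temp";
        int retval = decompress_file(src, temp);
        if (retval) return retval;
        return std::rename(temp.c_str(), dst.c_str());
    }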
- client: if a file's status is VERIFY_PENDING on startup,
set it to NOT_PRESENT; that will trigger a verify
- client: do async copy only if size is above threshold
svn path=/trunk/boinc/; revision=25222
- client: when making a scheduler RPC for a reason other than
work fetch (e.g. to report completed jobs),
only request work if it's the project we would have chosen
if we were fetching work.
- client: the way in which project priorities were adjusted
in work fetch to reflect currently queued work was wrong.
- client: fix bug in the way project priorities are adjusted
in RR simulator
- client emulator: if there are results in the state file
with states DOWNLOADING or UPLOADING,
change them to DOWNLOADED or UPLOADED.
Otherwise they're stuck.
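The mapping, sketched; the constants are BOINC's result-state values
(as in common_defs.h):

    enum {
        RESULT_FILES_DOWNLOADING = 1,
        RESULT_FILES_DOWNLOADED = 2,
        RESULT_FILES_UPLOADING = 4,
        RESULT_FILES_UPLOADED = 5
    };

    // map in-transfer states to their completed equivalents
    int unstick_state(int state) {
        switch (state) {
        case RESULT_FILES_DOWNLOADING: return RESULT_FILES_DOWNLOADED;
        case RESULT_FILES_UPLOADING:   return RESULT_FILES_UPLOADED;
        default:                       return state;
        }
    }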
svn path=/trunk/boinc/; revision=24737
- client: fix a bug where the client crashes after giving up
(90 day timeout) on an upload.
I'm guessing this was caused by [24391],
which changed the order in the poll loop from
    garbage_collect
    file_xfers->poll
    pers_file_xfers->poll
    handle_pers_file_xfers
to
    garbage_collect
    handle_pers_file_xfers
    file_xfers->poll
    pers_file_xfers->poll
I don't understand why this would have caused a crash,
but so be it.
I restored the original order, but with handle_pers_file_xfers
not inside the if (!network_suspended).
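The restored order, sketched with stubs (the assumption here is that
the two transfer polls sit inside the network-suspended check):

    void garbage_collect() {}
    void file_xfers_poll() {}           // file_xfers->poll()
    void pers_file_xfers_poll() {}      // pers_file_xfers->poll()
    void handle_pers_file_xfers() {}

    void poll_once(bool network_suspended) {
        garbage_collect();
        if (!network_suspended) {
            file_xfers_poll();
            pers_file_xfers_poll();
        }
        // deliberately outside the if: runs even when network is suspended
        handle_pers_file_xfers();
    }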
- client: renamed handle_pers_file_xfers() to
create_and_delete_pers_file_xfers()
- client simulator: show simulator CPU time
svn path=/trunk/boinc/; revision=24531
- client: if we're contacting a project to report results,
only piggyback work requests for resources for which
that project is the highest-priority project that may have work.
- client: compute result.not_started more efficiently
TODO: continue efficiency work. There's still some quadratic stuff.
svn path=/trunk/boinc/; revision=24523
reduce its runtime from O(N^2) to O(N),
where N is the number of runnable jobs
(which can be in the thousands).
This will make the client emulator run a lot faster,
and will reduce the client CPU overhead a bit.
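A generic sketch of that kind of O(N^2)-to-O(N) change (illustrative,
not the actual code): replace a per-result scan of the active-task
list with a flag maintained at state transitions:

    #include <vector>

    struct RESULT { bool not_started = true; };
    struct TASK { RESULT* result; };

    // before: O(N) scan per result, O(N^2) per pass over N results
    bool not_started_slow(RESULT* r, std::vector<TASK>& active_tasks) {
        for (TASK& t: active_tasks) {
            if (t.result == r) return false;
        }
        return true;
    }

    // after: flip a cached flag once, at the state transition,
    // and read it in O(1)
    void on_task_started(RESULT& r) { r.not_started = false; }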
- API: change boinc_get_opencl_ids() so that it returns
a BOINC error code (< -100) if the app_init.xml is
missing or bad (i.e. we're running standalone),
and an OpenCL error code (> -100) if an OpenCL call failed.
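A usage sketch, assuming the two-argument form of
boinc_get_opencl_ids() from boinc_opencl.h:

    #include "boinc_opencl.h"

    int setup_opencl() {
        cl_device_id device;
        cl_platform_id platform;
        int retval = boinc_get_opencl_ids(&device, &platform);
        if (retval < -100) {
            // BOINC error: init data missing or bad (running standalone)
            return retval;
        }
        if (retval) {
            // nonzero and > -100: an OpenCL call failed;
            // retval is the OpenCL error code
            return retval;
        }
        // device and platform are ready to use
        return 0;
    }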
svn path=/trunk/boinc/; revision=24469
- client: fix the project priority calculations
so that they do what they're supposed to
(i.e. enforce resource shares)
- client: change log flag <debt_debug> to <priority_debug>
- client simulator: update REC even with large delta-t.
- client simulator: handle "no new work" apps correctly
svn path=/trunk/boinc/; revision=24429
- client: tag HTTP debug log messages with the project,
so that if you use <http_debug> and filter by project
you don't see other projects' HTTP stuff
- client simulator: cc_config.xml is part of the scenario;
log flags are part of the simulation
svn path=/trunk/boinc/; revision=24410
This will show pending uploads in the Transfers tab.
- file_upload_handler: fix message to client when can't acquire lock
- client: parse <alt_platform> in state file correctly
svn path=/trunk/boinc/; revision=24391
- client: if an app version can't be used because the GPUs it needs
are all excluded, mark it and all its results as "coproc missing"
so that they won't be looked at in scheduling logic.
svn path=/trunk/boinc/; revision=24317
- client: fix bugs where work fetch didn't work right in the presence of
multiple GPUs and <exclude_gpu> config options.
For example, suppose:
- you have 2 GPUs and 2 projects
- Project A is excluded from GPU 1
- you have lots of jobs for project A
Then the client won't try to fetch jobs from project B.
The problem had 2 parts:
a) round-robin simulation wasn't taking GPU exclusions into account.
In the above example, it would think that both GPUs had jobs.
I fixed this by computing the # of GPUs from which each project
is excluded, and using this in the RR simulation.
b) Once this was done, I needed to make the client
request GPU jobs from project B rather than project A.
I did this with the following policy:
If a project has excluded GPUs of a given type,
and has a runnable job of that type,
don't ask it for more work of that type.
Notes:
- the policy in b) is crude, and it means that work-buffer
preferences are ignored in some cases.
- neither a) nor b) takes into account app-level exclusions.
I could fix both of these with a lot of work,
but I'd rather move to a model in which dissimilar GPUs
are modeled as different resources,
which would remove the need for the <exclude_gpu> mechanism
in the first place.
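Both parts, sketched with placeholder names:

    const int NUM_RSC = 3;

    struct PROJECT {
        // a) precomputed: # of instances of each GPU type
        //    this project is excluded from
        int ncoprocs_excluded[NUM_RSC];
        bool has_runnable_job[NUM_RSC];
    };

    // a) in the RR simulation, a project can occupy at most this
    //    many instances of a GPU type
    int usable_instances(PROJECT* p, int rsc, int ninstances) {
        return ninstances - p->ncoprocs_excluded[rsc];
    }

    // b) crude policy: don't ask a project for more work of a GPU
    //    type if it's excluded from some instances of that type and
    //    already has a runnable job of that type
    bool may_request_work(PROJECT* p, int rsc) {
        return !(p->ncoprocs_excluded[rsc] && p->has_runnable_job[rsc]);
    }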
- web: remove extraneous ) at end of button tooltips
svn path=/trunk/boinc/; revision=24312
- client emulator: add a mode that simulates just the jobs
in the state file, rather than an infinite stream of jobs
modeled after the ones in the state file.
svn path=/trunk/boinc/; revision=24293
- client emulator: make log flags part of the simulation,
not the scenario.
If you want to run a simulation w/ different log flags,
you shouldn't have to create a new scenario.
- client emulator: add --config_prefix cmdline arg
- validator: prevent infinite loop when app_version.pfc_avg
is wonky (like 1e-300).
Next step: figure out how it got that way.
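One way to guard against such values (a sketch; the threshold and
fallback are illustrative, not necessarily what the validator does):

    #include <cmath>

    // reject denormal or garbage averages before using them
    // as a scale factor
    double sane_pfc_avg(double pfc_avg) {
        const double PFC_AVG_MIN = 1e-6;    // illustrative threshold
        if (!std::isfinite(pfc_avg) || pfc_avg < PFC_AVG_MIN) {
            return 1.0;                     // neutral fallback
        }
        return pfc_avg;
    }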
svn path=/trunk/boinc/; revision=23828