Synopsis: max concurrent was being enforced in the last stage of CPU sched,
but not in earlier stages, or in work fetch.
This caused starvation in some cases.
Fix this by modeling max concurrent in RR sim and make_run_list().
- CPU sched: model and enforce max concurrent limits in building run list
for CPU jobs; otherwise the list has jobs we can't actually run
- RR simulation: model and enforce max concurrent limits
- RR sim: fix bug in calculation of # idle instances
- RR sim: model unavailability of GPUs
e.g. if we can't run GPU jobs we can potentially run more CPU jobs
- work fetch: if a project is at a max concurrent limit,
don't fetch work from it.
The jobs we get might not be runnable.
NOTE: we currently provide max concurrent limits
at both project and app level.
The problem with app level is that apps can have versions that
use different resources.
It would be better to have limits at the resource level instead.
- In many cases (e.g. job completion) CPU sched and work fetch are both done
back to back. Each of them does RR simulation.
The simulation only needs to be done once in that case (efficiency).
- Show max concurrent settings in startup messages
- Make max runnable jobs (1000) into a #define
- Fix removal of "can't fetch work" notices
- Make "can't fetch work" notices resource-specific;
the reasons may differ between resources
- Get rid of WF_DEBUG macro;
just print everything if log_flags.work_fetch_debug is set.
- Change project- and resource-level work-fetch reason codes
(DONT_FETCH_PREFS etc.) from #defines to enums,
and give them prefixes RSC_REASON and PROJECT_REASON
- Fix bug where the return of compute_project_reason() wasn't
actually being stored in project.work_fetch.
- Add work-fetch reason MAX_CONCURRENT (project is at max concurrent limit)
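
The work-fetch items above (reason codes as enums, storing the computed reason, and skipping projects at their max concurrent limit) can be sketched as follows. This is a minimal illustration, not the client's actual code: only the PROJECT_REASON prefix and the MAX_CONCURRENT reason come from these notes, and the struct, field, and function names are assumptions made up for the example.

```cpp
#include <cassert>

// Assumed enum; the notes give only the prefix and the MAX_CONCURRENT reason.
enum ProjectReason {
    PROJECT_REASON_NONE = 0,
    PROJECT_REASON_MAX_CONCURRENT   // project is at its max concurrent limit
};

// Hypothetical, stripped-down project record.
struct ProjectSketch {
    int max_concurrent;    // 0 means no limit was configured
    int n_concurrent;      // jobs currently counted against the limit
    ProjectReason reason;  // last computed work-fetch reason
};

// Decide why (if at all) we should not fetch work from this project.
ProjectReason compute_project_reason_sketch(const ProjectSketch& p) {
    assert(p.n_concurrent >= 0);
    if (p.max_concurrent > 0 && p.n_concurrent >= p.max_concurrent) {
        return PROJECT_REASON_MAX_CONCURRENT;
    }
    return PROJECT_REASON_NONE;
}

// Store the computed reason in the project record, then use it to decide.
bool can_fetch_work(ProjectSketch& p) {
    p.reason = compute_project_reason_sketch(p);
    return p.reason == PROJECT_REASON_NONE;
}
```

Keeping the reason in the project record, rather than discarding the return value, is what lets a later message explain why no work was requested.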
- There was a scenario (#164 in fact) where CPUs were starved
because CPU jobs weren't being added to the run list.
The basic problem was that the max_concurrent logic was being
called in make_run_list().
It doesn't belong there - only in enforce_run_list().
- add the ability to handle app_config.xml files in the client emulator.
- fix a performance bug that caused extremely long run lists;
in make_run_list(), check for exclusion at the project level, not global.
- do max_concurrent logic only if a max_concurrent rule was given.
- fix bug where the emulator would assign the wrong
version number to results, then fail to find their app version.
Old: each scheduler process holds a semaphore
while scanning the shared-mem job array.
On machines with many CPUs
there seems to be contention for this semaphore,
causing slow scheduler response and possibly connection failures.
New: Don't hold the semaphore while scanning the array.
Instead, if we find a job that passes quick_check(),
acquire the semaphore and recheck that the job is still present in the array
and still passes quick_check().
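
The scan-then-recheck pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's code: std::mutex stands in for the semaphore, a std::vector stands in for the shared-memory job array, and JobSlot, JobArray, quick_check_sketch() and claim_job() are hypothetical names. The unlocked read is shown for clarity only; a multi-threaded C++ version would need atomic reads to avoid a data race.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical slot in the job array.
struct JobSlot {
    bool present;  // slot currently holds a job
    int app_id;    // app the job belongs to
};

struct JobArray {
    std::vector<JobSlot> slots;
    std::mutex sema;  // stand-in for the semaphore
};

// Cheap test that needs no lock; its answer may be stale.
bool quick_check_sketch(const JobSlot& s, int wanted_app) {
    return s.present && s.app_id == wanted_app;
}

// Returns the index of the claimed slot, or -1 if none qualifies.
int claim_job(JobArray& arr, int wanted_app) {
    for (std::size_t i = 0; i < arr.slots.size(); i++) {
        // First pass: scan without holding the lock.
        if (!quick_check_sketch(arr.slots[i], wanted_app)) continue;

        // Candidate found: take the lock and recheck, since another
        // scheduler process may have taken the job in the meantime.
        std::lock_guard<std::mutex> lock(arr.sema);
        if (!quick_check_sketch(arr.slots[i], wanted_app)) continue;

        arr.slots[i].present = false;  // claim the job
        return static_cast<int>(i);
    }
    return -1;
}
```

The lock is held only for the recheck and the claim, so processes that find nothing suitable never contend for it.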
- client: show messages if app_config.xml has unrecognized tags
http://boinc.berkeley.edu/trac/wiki/ClientAppConfig
This lets users do the following:
1) Limit the number of concurrent jobs of a given app
(e.g. for WCG apps that are I/O-intensive)
2) Specify the CPU and GPU usage parameters of GPU versions
of a given app.
Implementation notes:
- max app concurrency is enforced in 2 places:
1) when building the initial job run list
2) when enforcing the final job run list
Both are needed to avoid possible starvation.
- however, we don't enforce it during RR simulation.
Doing so could cause erroneous shortfall and work fetch.
This means, however, that work buffering will not work
as expected if you're using max concurrency.
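
To illustrate why the limit has to be applied when the initial run list is built, and not only at final enforcement, here is a minimal sketch. All names (JobSketch, build_run_list(), the max_concurrent map) are assumptions for the example and do not come from the client source. If the cap were skipped here, the list could fill up with jobs of a limited app that enforcement would later refuse to run, leaving CPUs idle.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical job record; only the app name matters for this sketch.
struct JobSketch {
    std::string app;
};

// Build a run list of at most ncpus jobs, in candidate order.
// max_concurrent maps app name -> limit; apps with no entry are unlimited.
std::vector<JobSketch> build_run_list(
    const std::vector<JobSketch>& candidates,
    const std::map<std::string, int>& max_concurrent,
    std::size_t ncpus
) {
    std::map<std::string, int> n_in_list;  // per-app count of jobs added so far
    std::vector<JobSketch> run_list;
    for (const JobSketch& j : candidates) {
        if (run_list.size() >= ncpus) break;
        auto it = max_concurrent.find(j.app);
        if (it != max_concurrent.end() && n_in_list[j.app] >= it->second) {
            continue;  // app is at its limit: skip so another app can use the slot
        }
        n_in_list[j.app]++;
        run_list.push_back(j);
    }
    return run_list;
}
```

With candidates io, io, io, cpu, cpu, a limit of 1 for io, and 3 CPUs, the list becomes io, cpu, cpu. Without the cap it would be io, io, io, and enforcement would then run only one of them.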