The basic problem: the way we assign GPU instances when creating
the "run list" is slightly different from the way we assign them
when we actually run the jobs;
the latter assigns a running job to the instance it's using,
but the former doesn't.
Solution (kludge): when building the run list,
don't reserve instances for currently running jobs.
This will result in more jobs in the run list, and avoid starvation.
For efficiency, do this only if there are exclusions for this type.
Comment: this is yet another complexity that would be eliminated
if GPU instances were modeled separately.
I wish I had time to do that.
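A minimal sketch of the kludge, using invented names (build_run_list(), RSC_STATE, has_exclusions) rather than the actual client code:

    // Sketch only: skip reserving an instance for an already-running job,
    // but only when this resource type has exclusions configured.
    #include <vector>

    struct RESULT {
        bool currently_running;
        int rsc_type;              // resource type the job needs (illustrative)
    };

    struct RSC_STATE {
        int ninstances;            // total device instances of this type
        int ninstances_reserved;   // instances already spoken for
        bool has_exclusions;       // any exclusions configured for this type?
    };

    void build_run_list(std::vector<RESULT*>& jobs, RSC_STATE& rsc,
                        std::vector<RESULT*>& run_list) {
        for (RESULT* rp : jobs) {
            if (rp->currently_running && rsc.has_exclusions) {
                // Kludge: don't reserve an instance; the enforcement step
                // will keep the job on the instance it's already using.
                run_list.push_back(rp);
                continue;
            }
            if (rsc.ninstances_reserved < rsc.ninstances) {
                rsc.ninstances_reserved++;
                run_list.push_back(rp);
            }
        }
    }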
- client emulator: change default latency bound from 1 day to 10 days
has an invalid URL, type, or app
- server, create_work() function: if a <file_info> in the input template
lists URLs, treat them as directories and append the filename to each one
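A rough illustration of the append step, with an invented helper name (expand_urls); the real logic lives in the server's create_work() path:

    // Illustrative only: the URLs listed in <file_info> are directories,
    // so the physical file name is appended to each one.
    #include <string>
    #include <vector>

    std::vector<std::string> expand_urls(
        const std::vector<std::string>& dir_urls,  // URLs from <file_info>
        const std::string& filename                // physical file name
    ) {
        std::vector<std::string> out;
        for (std::string url : dir_urls) {
            if (!url.empty() && url.back() != '/') url += '/';
            out.push_back(url + filename);
        }
        return out;
    }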
- client: handling of GPU exclusions
(especially per-app exclusions) was incomplete and buggy.
Changes:
- make bitmaps of included instances per (app, resource type)
- in round-robin simulation, we keep track of used instances
(so that we know if there are instances that are idle
because of exclusions).
Do this based on app-level exclusions
(previously it was done based on project-wide exclusions,
which didn't include app-level exclusions).
- compute RSC_PROJECT_WORK_FETCH::non_excluded_instances
as the logical OR of the per-app masks.
I.e. if you exclude an instance for all apps separately,
it's the same as excluding it for the project as a whole.
(Note: this bitmap is used for only 1 purpose:
if we have idle instances, don't request work from a project
for which those instances are excluded.)
- define RSC_PROJECT_WORK_FETCH::ncoprocs_excluded as the # of
instances excluded for *any* app, not the # excluded for all apps.
This quantity is used in work fetch to make sure we don't
fetch an unbounded number of jobs that turn out not to have
a GPU to run on due to exclusions (see the sketch after this list).
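A small sketch of this bookkeeping, assuming at most 64 instances per resource type; the struct and field names below are stand-ins, not the actual RSC_PROJECT_WORK_FETCH code:

    #include <cstdint>
    #include <vector>

    typedef uint64_t instance_mask;   // bit i set => instance i is included

    struct APP_EXCLUSION {
        instance_mask included;       // instances this app may use
    };

    struct PROJECT_RSC_INFO {
        std::vector<APP_EXCLUSION> apps;
        instance_mask non_excluded_instances;   // usable by at least one app
        int ncoprocs_excluded;                  // excluded for *any* app
    };

    void compute_exclusion_info(PROJECT_RSC_INFO& p, int ninstances) {
        // OR of the per-app masks: an instance is non-excluded for the project
        // if at least one app can use it.  Excluding an instance for every app
        // separately therefore excludes it for the project as a whole.
        p.non_excluded_instances = 0;
        for (const APP_EXCLUSION& a : p.apps) {
            p.non_excluded_instances |= a.included;
        }

        // ncoprocs_excluded: the number of instances excluded for *any* app.
        p.ncoprocs_excluded = 0;
        for (int i = 0; i < ninstances; i++) {
            instance_mask bit = (instance_mask)1 << i;
            for (const APP_EXCLUSION& a : p.apps) {
                if (!(a.included & bit)) {
                    p.ncoprocs_excluded++;
                    break;
                }
            }
        }
    }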
Note: this fixes a major problem (starvation)
with project-level GPU exclusion.
However, project-level GPU exclusion interferes with most of
the client's scheduling policies.
E.g., round-robin simulation doesn't take GPU exclusion into account,
and the resulting completion estimates and device shortfalls
can be wrong by an order of magnitude.
The only way I can see to fix this would be to model each
GPU instance as a separate resource,
and to associate each job with a particular GPU instance.
This would be a sweeping change in both client and server.
so that we can fetch work immediately
- client: in PERS_FILE_XFER::create_xfer(),
check for an already-existing file before checking whether we're allowed to start a new xfer
- client: in PERS_FILE_XFER::create_xfer(),
if an async verify is in progress, mark PERS_FILE_XFER as done.
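A rough sketch of the resulting order of checks; the flags and names below (file_exists_locally, async_verify_in_progress, transfers_allowed) are illustrative assumptions, not the actual PERS_FILE_XFER code:

    struct FILE_INFO {
        bool file_exists_locally;        // assumed: file already on disk
        bool async_verify_in_progress;   // assumed: async verify underway
    };

    struct PERS_FILE_XFER {
        bool pers_xfer_done = false;

        int create_xfer(FILE_INFO& fip, bool transfers_allowed) {
            // Check for an already-existing file *before* checking whether
            // we're allowed to start a new transfer.
            if (fip.file_exists_locally) {
                pers_xfer_done = true;
                return 0;
            }
            // If an async verify of the file is in progress, the download has
            // already happened; mark the persistent transfer as done.
            if (fip.async_verify_in_progress) {
                pers_xfer_done = true;
                return 0;
            }
            if (!transfers_allowed) return -1;   // can't start a new xfer now
            // ... otherwise start the transfer ...
            return 0;
        }
    };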
svn path=/trunk/boinc/; revision=25243
parsing cc_config.xml
- client: if an <exclude_gpu> element in cc_config.xml
specifies a nonexistent app, show an error msg with
a list of existing app names
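A sketch of that validation, with invented type names (EXCLUDE_GPU, APP); only the behavior described above comes from the change itself:

    #include <cstdio>
    #include <cstring>
    #include <vector>

    struct APP { char name[256]; };
    struct EXCLUDE_GPU { char appname[256]; };   // empty => applies to all apps

    void check_exclude_apps(const std::vector<EXCLUDE_GPU>& excludes,
                            const std::vector<APP>& apps) {
        for (const EXCLUDE_GPU& eg : excludes) {
            if (!strlen(eg.appname)) continue;
            bool found = false;
            for (const APP& app : apps) {
                if (!strcmp(eg.appname, app.name)) { found = true; break; }
            }
            if (found) continue;
            // Error message includes the list of existing app names.
            fprintf(stderr,
                "cc_config.xml: <exclude_gpu> specifies unknown app '%s'; existing apps:\n",
                eg.appname);
            for (const APP& app : apps) {
                fprintf(stderr, "    %s\n", app.name);
            }
        }
    }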
- web: increase the default mem limit from 64MB to 256MB
TODO: change user_hosts.php to show N at a time
svn path=/trunk/boinc/; revision=24593
(need to do this after reading the state file,
since GPU exclusions refer to projects).
- client: fix bug that added garbage <coproc> element
to <app_version> in state file when using GPU exclusions
svn path=/trunk/boinc/; revision=24522
so that they do what they're supposed to
(i.e. enforce resource shares)
- client: change log flag <debt_debug> to <priority_debug>
- client simulator: update REC even with large delta-t.
- client simulator: handle "no new work" apps correctly
svn path=/trunk/boinc/; revision=24429
- client: if an app version can't be used because the GPUs it needs
are all excluded, mark it and all its results as "coproc missing"
so that they won't be looked at in scheduling logic.
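A rough sketch of the marking step; coproc_missing is the flag named above, but the surrounding structures and the all_usable_gpus_excluded flag are illustrative assumptions:

    #include <vector>

    struct APP_VERSION {
        bool all_usable_gpus_excluded = false;   // assumed: precomputed from exclusions
        bool coproc_missing = false;
    };

    struct RESULT {
        APP_VERSION* avp = nullptr;
        bool coproc_missing = false;
    };

    void mark_excluded_app_versions(std::vector<APP_VERSION*>& app_versions,
                                    std::vector<RESULT*>& results) {
        for (APP_VERSION* avp : app_versions) {
            if (avp->all_usable_gpus_excluded) avp->coproc_missing = true;
        }
        // Propagate to results so the scheduling logic skips them.
        for (RESULT* rp : results) {
            if (rp->avp && rp->avp->coproc_missing) rp->coproc_missing = true;
        }
    }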
svn path=/trunk/boinc/; revision=24317
- client: fix a job-scheduling bug that could leave a GPU idle
in the presence of GPU exclusions.
The problem was in the job-selection phase,
which picks enough jobs to use all devices.
It was ignoring GPU exclusions, so for example on
a 2 GPU system it could pick 2 jobs from a project
for which 1 GPU is excluded,
and as a result 1 GPU would be idle.
Solution: during job selection,
keep track of GPU usage on a per-instance basis.
Select a job only if it can run on a non-excluded GPU.
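A sketch of the per-instance bookkeeping, assuming a bitmask of usable instances per job and at most 64 GPUs of a type; the names are illustrative, not the actual job-selection code:

    #include <cstdint>
    #include <vector>

    struct JOB {
        uint64_t usable_instances;   // bit i set => GPU instance i not excluded
        double gpu_usage;            // fraction of one GPU the job needs
    };

    // Try to place a job on a non-excluded, not-yet-full GPU instance.
    // Returns true (and records the usage) if an eligible instance was found.
    bool assign_gpu_instance(const JOB& job, std::vector<double>& instance_usage) {
        for (size_t i = 0; i < instance_usage.size(); i++) {
            if (!(job.usable_instances & ((uint64_t)1 << i))) continue;  // excluded
            if (instance_usage[i] + job.gpu_usage > 1.0) continue;       // busy
            instance_usage[i] += job.gpu_usage;
            return true;
        }
        return false;   // no eligible instance; don't select this job
    }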
- client: in computing ncoprocs_excluded (which is used in
work fetch policy) don't count exclusions of non-existent devices
svn path=/trunk/boinc/; revision=24316
- client: fix a problem where work fetch didn't work right in the presence of
multiple GPUs and <exclude_gpu> config options.
For example: suppose:
- you have 2 GPUs and 2 projects
- Project A is excluded from GPU 1
- you have lots of jobs for project A
Then the client won't try to fetch jobs from project B.
The problem had 2 parts:
a) round-robin simulation wasn't taking GPU exclusions into account.
In the above example, it would think that both GPUs had jobs.
I fixed this by computing the # of GPUs each project
is excluded from, and using this in the RR simulation.
b) Once this was done, I needed to make the client
request GPU jobs from project B rather than project A.
I did this with the following policy:
If a project has excluded GPUs of a given type,
and has a runnable job of that type,
don't ask it for more work of that type
(see the sketch after the notes below).
Notes:
- the policy in b) is crude, and it means that work-buffer
preferences are ignored in some cases.
- neither a) nor b) takes into account app-level exclusions.
I could fix both of these with a lot of work,
but I'd rather move to a model in which dissimilar GPUs
are modeled as different resources,
which would remove the need for the <exclude_gpu> mechanism
in the first place.
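A minimal sketch of policy b), with invented field names standing in for the client's per-(project, resource type) work-fetch state:

    struct RSC_PROJECT_STATE {
        int ncoprocs_excluded;    // # of this GPU type's instances excluded
        bool has_runnable_job;    // a runnable job of this type is queued
        bool dont_request;        // set => skip this (project, type) in work fetch
    };

    void apply_exclusion_policy(RSC_PROJECT_STATE& rp) {
        // If the project has excluded GPUs of this type and already has a
        // runnable job of this type, don't ask it for more work of this type.
        rp.dont_request = (rp.ncoprocs_excluded > 0) && rp.has_runnable_job;
    }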
- web: remove extraneous ) at end of button tooltips
svn path=/trunk/boinc/; revision=24312
types for which we have no app versions
- client: if too many <coproc> elements in cc_config.xml,
detect it and inform user
svn path=/trunk/boinc/; revision=24144
Add parsed_tag and is_tag to the XML_PARSER class,
so that parsing functions don't need to declare them
and pass them around.
- Complete the task of using XML_PARSER as the argument
to all parsing functions.
(Internally, many of these functions still use the old XML parser;
that's the next step.)
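A hypothetical example of the resulting style, where a parse function takes an XML_PARSER& and uses the parser's own parsed_tag/is_tag; the method names follow lib/parse.h, but treat the exact signatures as approximate:

    #include "parse.h"   // BOINC lib: XML_PARSER

    struct FOO {
        int count;
        bool flag;

        int parse(XML_PARSER& xp) {
            count = 0;
            flag = false;
            while (!xp.get_tag()) {          // fills xp.parsed_tag / xp.is_tag
                if (!xp.is_tag) continue;
                if (xp.match_tag("/foo")) return 0;
                if (xp.parse_int("count", count)) continue;
                if (xp.parse_bool("flag", flag)) continue;
                // unrecognized tag: ignored in this sketch
            }
            return 1;   // unexpected EOF (the real code would return an error code)
        }
    };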
svn path=/trunk/boinc/; revision=23978
- add <heartbeat_debug> log flag
- show trickle-up and intermediate file upload msgs if <app_msg_receive> is set
- if scheduler RPC reason is trickle-up, say so
- manager:
- restore "non CPU intensive" to task description
- project properties: show if RPC in progress or trickle-up pending.
(show these low-probability things only if present)
- manager: fix Unix build
(from Ian Hay)
svn path=/trunk/boinc/; revision=23365