This feature lets you run the BOINC client as a job on grid systems
that handle only 1-CPU jobs.
It disables various mechanisms that prevent multiple clients per host
(running multiple clients per host is normally a bad thing).
Old:
- Run the client with the --allow_multiple_clients flag.
This tells it not to use the mutex that prevents
multiple clients per host.
- Run the project with the <multiple_clients_per_host> config flag.
This suppresses two mechanisms:
- (avoid duplicate host records)
on a scheduler request with no host ID,
the scheduler looks for a host with the same domain name, OS type,
and memory size, and assumes the request is from that host
- (job retry)
if we get a request that has no host ID
but does have a host CPID,
mark its in-progress results as over
NOTE: I CAN'T REMEMBER WHY WE SUPPRESS THIS;
MARK S, DO YOU REMEMBER?
Problem:
if the grid clients attach to a project that
doesn't use <multiple_clients_per_host>, bad things happen.
E.g., if there are several requests at about the same time,
most of them will fail with
"another RPC already in progress" errors.
If a project does include this flag,
it loses protection from duplicate host records.
New:
- If the client is run with the --allow_multiple_clients flag,
it includes an <allow_multiple_clients> element
in scheduler requests.
- The scheduler skips the duplicate-host check on
requests that include this flag.
- There is no more <multiple_clients_per_host> scheduler option.
Note: if a project using the old mechanism upgrades to this change,
it will need to use new clients for its grid deployment.
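For illustration, the request fragment and the scheduler-side check,
as a sketch (only <allow_multiple_clients> is from this change;
the other names are illustrative):

    <scheduler_request>
        ...
        <allow_multiple_clients>1</allow_multiple_clients>
    </scheduler_request>

    // scheduler side: skip the duplicate-host lookup if the flag is set
    if (!sreq.hostid && !sreq.allow_multiple_clients) {
        // match on domain name, OS type, and memory size (as above)
        find_host_by_attributes(sreq);
    }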
svn path=/trunk/boinc/; revision=21839
You can now specify limits for specific apps,
and/or for the project as a whole.
Within each of these, you can specify limits on
CPU jobs, GPU jobs, or total jobs.
In the case of CPU and GPU limits, you can specify
whether the limit should be scaled by the number of devices.
Note: the enforcement of this is done in get_app_version(),
since per-resource-type limits may dictate what app versions
we can use for a particular job.
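For example, a config.xml fragment along these lines
(a sketch; the exact element names may differ in detail):

    <max_jobs_in_progress>
        <project>
            <total_limit>
                <jobs>100</jobs>
            </total_limit>
        </project>
        <app>
            <app_name>example_app</app_name>
            <cpu_limit>
                <jobs>4</jobs>
                <per_proc/>   <!-- scale the limit by number of CPUs -->
            </cpu_limit>
            <gpu_limit>
                <jobs>2</jobs>
                <per_proc/>   <!-- scale the limit by number of GPUs -->
            </gpu_limit>
        </app>
    </max_jobs_in_progress>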
svn path=/trunk/boinc/; revision=21674
- Don't keep pointers to dynamically allocated COPROC-derived objects;
just have the objects themselves.
Dynamic allocation should be avoided at all costs.
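As a sketch of the difference (COPROC class names as in the BOINC
source of this era; the members are illustrative):

    // before: pointers to heap-allocated objects; every new needs
    // a matching delete, and leaks are easy
    //     std::vector<COPROC*> coprocs;
    //     coprocs.push_back(new COPROC_CUDA);

    // after: the objects themselves are members; no new/delete at all
    struct COPROCS {
        COPROC_CUDA cuda;
        COPROC_ATI ati;
    };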
svn path=/trunk/boinc/; revision=21564
We store pointers to BEST_APP_VERSION in both APP_VERSION and RESULT,
so we can't then resize the vector that these point into
(reallocation would invalidate the pointers).
Switch back to using a vector of pointers.
This reintroduces the memory leak, which I'll deal with later.
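The underlying issue, as a small self-contained sketch:

    #include <vector>
    struct BEST_APP_VERSION { /* ... */ };

    void sketch() {
        std::vector<BEST_APP_VERSION> bavs(1);
        BEST_APP_VERSION* p = &bavs[0];       // the kind of pointer APP_VERSION/RESULT hold
        bavs.push_back(BEST_APP_VERSION());   // may reallocate: p now dangles

        // with a vector of pointers, the pointed-to objects never move:
        std::vector<BEST_APP_VERSION*> bav_ptrs;
        bav_ptrs.push_back(new BEST_APP_VERSION);
        BEST_APP_VERSION* q = bav_ptrs[0];
        bav_ptrs.push_back(new BEST_APP_VERSION);  // q stays valid
        (void)p; (void)q;  // ...but nothing deletes them: the leak noted above
    }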
svn path=/trunk/boinc/; revision=21494
Make the following mechanisms work per (host, app version)
rather than per host:
- daily quota mechanism
- reliable mechanism (accelerated retries)
- "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
host_scale_check set.
- validator: do scale probation on invalid results
(need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options
Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
a host will have separate quotas for each one.
That means it could error out on 100 jobs for cuda_fermi,
and when its quota goes to zero,
error out on 100 jobs for cuda23, etc.
This is intentional; there may be cases where one version
works but not the others.
- host.error_rate and host.max_results_day are deprecated
TODO:
- the values in the app table for limits on jobs in progress etc.
should override those in config.xml, not the other way around.
Implementation notes:
scheduler:
process_request():
read all host_app_versions for the host at the start;
compute "reliable" and "trusted" for each one;
write modified records at the end
get_app_version():
add "reliable_only" arg; if set, use only reliable versions
skip over-quota versions
Multi-pass scheduling: if have at least one reliable version,
do a pass for jobs that need reliable,
and use only reliable versions.
Then clear best_app_versions cache.
Score-based scheduling: for need-reliable jobs,
it will pick the fastest version,
then give a score bonus if that version happens to be reliable.
When we get back a successful result from the client:
increase the daily quota.
When we get back an error result from the client:
impose scale probation;
decrease the daily quota unless the job was aborted.
Validator:
when handling a WU, create a vector of HOST_APP_VERSION
parallel to the vector of RESULT;
pass it to assign_credit_set();
make copies of the originals so we can update only the modified ones;
update HOST_APP_VERSION error rates.
Transitioner:
decrease quota on timeout
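A rough sketch of the version filter and quota handling described
above (only get_app_version() and the reliable/quota notions are from
this change; the other names are illustrative):

    // inside get_app_version(), per candidate (host, app version):
    for (HOST_APP_VERSION& hav : host_app_versions) {
        if (reliable_only && !hav.reliable) continue;       // reliable pass only
        if (hav.n_jobs_today >= hav.daily_quota) continue;  // over quota
        // ...otherwise the version stays a candidate for this job
    }

    // result handling (per the notes above):
    //   success             -> increase daily quota
    //   error (not aborted) -> scale probation; decrease quota
    //   timeout             -> transitioner decreases quota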
svn path=/trunk/boinc/; revision=21181
There are two distinct notions of FLOPS:
1) peak FLOPS (based on benchmarks or GPU attributes).
This does not change over time.
It's not adjusted on the basis of statistics.
It's not affected by wu.rsc_fpops_est.
It can be compared across projects.
versus
2) projected FLOPS: the scheduler's best guess at the value X satisfying
X * elapsed_time = wu.rsc_fpops_est;
this is used to make server-side runtime estimates,
and it's sent to the client and used for its runtime estimates.
It may be based on the (host, app version) elapsed time average.
My checkin [21153] mistakenly confounded these two.
Notes:
1) app_plan() now must return both peak and projected FLOPS.
2) result.flops_estimate stores peak FLOPS
3) the <flops> field in app_info.xml files should be
projected FLOPS. But its accuracy is not important;
it's not used once the server has statistics
for the (host, app version) pair.
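As a sketch, projected FLOPS feeds runtime estimates like this
(projected_flops is illustrative; wu.rsc_fpops_est and
result.flops_estimate are from the notes above):

    // server- and client-side runtime estimate:
    double est_elapsed_time = wu.rsc_fpops_est / projected_flops;

    // peak FLOPS, by contrast, comes from benchmarks or GPU attributes,
    // never adapts to statistics, and is what result.flops_estimate stores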
svn path=/trunk/boinc/; revision=21164
an app version will run on a particular host.
- scheduler: fix memory leak: BEST_APP_VERSIONs weren't being freed
svn path=/trunk/boinc/; revision=21148
Encode version numbers as 10000*major + 100*minor + release,
rather than 100*major + minor.
Sometimes you need release-level resolution.
This affects:
- app_version.min_core_version
- config: min_core_client_version_announced
- config: min_core_client_version
Projects using these must multiply them by 100.
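A worked example of the two encodings:

    old encoding:  100*major + minor                    6.10    -> 610
    new encoding:  10000*major + 100*minor + release    6.10.18 -> 61018
    migration:     multiply existing values by 100      610     -> 61000 (= 6.10.0)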
svn path=/trunk/boinc/; revision=20149
If a project has a crazy DCF (duration correction factor),
don't automatically request 1 sec of work;
only request work if there's a shortfall.
- intermediate checkin for notices stuff
svn path=/trunk/boinc/; revision=20145
Problem: a GPU may have enough total
RAM to run a job, but when we actually run the job
not enough GPU RAM is free, so the application fails.
This can cause a large number of jobs to fail.
Solution:
- app_plan() can specify the GPU RAM requirements of an app version.
This is passed to the client in a new field
<gpu_ram> of the <app_version> element.
- prior to starting or restarting a GPU app, the client
checks the amount of free RAM on the particular GPU.
If it's not enough for the app version,
the client doesn't start it,
and arranges for the scheduler to ignore it for 5 minutes
(by which point there might be more free GPU RAM)
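A sketch of that check (only <gpu_ram> is from this change;
the helper names are illustrative):

    // client, before (re)starting a GPU job on its assigned GPU:
    double free_ram = free_ram_on_assigned_gpu(rp);
    if (free_ram < rp->avp->gpu_ram) {
        defer_gpu_job(rp, 300);   // ignore it for 5 minutes
    } else {
        start_or_resume(rp);
    }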
Notes:
1) this change will have effect only when
both client and scheduler are updated.
2) the check is done in enforce_schedule(),
rather than schedule_cpus(),
because only at that point
have we assigned a specific GPU to the job.
3) there's another case to deal with:
a GPU app's malloc of GPU RAM fails in the middle of the job.
Currently the job fails.
I plan to add an API call boinc_temporary_exit(x) so
that the job can exit and potentially restart in x seconds.
(In principle this mechanism is sufficient for all cases,
but it could lead to a lot of starting/exiting,
so the current change is worthwhile).
svn path=/trunk/boinc/; revision=19864
- Add preferences for whether to accept CPU, NVIDIA and ATI jobs.
These prefs are shown only where relevant:
e.g., only for processor types for which the project has app versions,
and if it has versions for only one type, no pref is shown.
These prefs affect both client and scheduler.
The client won't ask for work for a device blocked by prefs,
and the scheduler won't send it.
This replaces earlier optional project-specific prefs for
"no CPU jobs" and "no GPU jobs".
(However, these prefs continue to be honored on the server side).
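In project preferences these show up roughly as follows
(a sketch; the exact element names may differ):

    <no_cpu>1</no_cpu>      <!-- don't accept CPU jobs -->
    <no_cuda>1</no_cuda>    <!-- don't accept NVIDIA jobs -->
    <no_ati>1</no_ati>      <!-- don't accept ATI jobs -->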
- client: if NVIDIA driver is unknown, say that rather than 0
svn path=/trunk/boinc/; revision=19194
- Support apps that use only a fraction of a GPU,
e.g. the Milkyway@home ATI app, of which we can typically run
2 or 3 instances at once on a GPU.
Changes include:
- In APP_VERSION, don't use a COPROCS to represent the GPU
requirements; just use doubles ncudas and natis.
- sufficient_coprocs() etc. are no longer members of COPROCS
- in HOST_USAGE, ncudas and natis are doubles
- in scheduler request, req_instances is now a double
This checkin doesn't include the job scheduling logic,
i.e. assigning jobs to GPUs. That will follow.
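A sketch of the new representation (HOST_USAGE, ncudas, and natis are
from this change; the numbers are illustrative):

    struct HOST_USAGE {
        double ncudas;   // fraction of an NVIDIA GPU one job instance uses
        double natis;    // fraction of an ATI GPU one job instance uses
        // ...
    };

    // an app_plan() for an app like the one above might set:
    //     hu.natis = 0.5;   // two instances share one ATI GPU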
svn path=/trunk/boinc/; revision=18868
don't modify user preferences or CPID.
- client: fix bug that shows ATI version incorrectly
- database: host.posts has been repurposed as a salt (or seqno)
for a new type of weak authenticator that won't depend on password
- web code:
modify forum_preferences.posts instead of host.posts.
(actually, the former isn't used either; we just do a select count(*).
Should fix this at some point.)
svn path=/trunk/boinc/; revision=18865