- whether host is "reliable" for an app version
- whether host is eligible for single replication for an app version
- whether to use host scaling
In each case, the answer is yes if the number of
consecutive valid results is above a threshold.
This replaces existing "error rate" and "scale probation" mechanisms.
TODO: the # of consecutive valid results should also determine
a limit on jobs in progress for an app version.
Namely, if N is the threshold for host scaling, the limit should be
ndevices*(max(1, consecutive_valid - N))
The client currently doesn't supply enough
app version info to do this.
It could be approximated; that would give some protection
against cherry-picking.
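A minimal sketch of the consecutive-valid checks and the proposed job limit described above, written in Python for brevity (the scheduler itself is C++). The threshold values, the `HostAppVersion` class and the function names are hypothetical; only the rule "yes if the count is above a threshold" and the limit formula come from these notes.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real values are not given in these notes.
RELIABLE_MIN = 10
SINGLE_REPLICATION_MIN = 10
HOST_SCALE_MIN = 10


@dataclass
class HostAppVersion:
    consecutive_valid: int = 0


def record_result(hav, valid):
    # Assumption: any invalid result resets the streak to zero.
    hav.consecutive_valid = hav.consecutive_valid + 1 if valid else 0


def is_reliable(hav):
    return hav.consecutive_valid > RELIABLE_MIN


def single_replication_ok(hav):
    return hav.consecutive_valid > SINGLE_REPLICATION_MIN


def use_host_scaling(hav):
    return hav.consecutive_valid > HOST_SCALE_MIN


def max_jobs_in_progress(ndevices, hav):
    # Limit from the TODO: ndevices * max(1, consecutive_valid - N),
    # where N is the host-scaling threshold.
    return ndevices * max(1, hav.consecutive_valid - HOST_SCALE_MIN)
```

With these assumed thresholds, a host with 2 devices and 13 consecutive valid results would be limited to 2 * 3 = 6 jobs in progress, while a host at or below the threshold gets one job per device.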
- credit: more conservative formulas for combining claimed credit
among replicas.
If there are normal replicas, we use a "low average"
that weights each sample by the sum of the other samples.
Otherwise we use the min (not the average) of the approximate samples.
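The "low average" above can be sketched as follows, in Python for brevity. Weighting each sample by the sum of the others gives large samples less influence than a plain mean would. The function names and the handling of empty or all-zero input are assumptions, not taken from the code.

```python
def low_average(samples):
    """Average in which each sample is weighted by the sum of the others."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    if n == 1:
        return samples[0]
    total = sum(samples)
    if total == 0:
        return 0.0  # assumption: all-zero claims average to zero
    weighted = sum(x * (total - x) for x in samples)
    # The weights (total - x) sum to (n - 1) * total.
    return weighted / ((n - 1) * total)


def combine_claimed_credit(normal, approx):
    """Use the low average of normal samples if any, else the min of approximate ones."""
    if normal:
        return low_average(normal)
    return min(approx)
```

For two samples of 10 and 30 this gives 15, compared with a plain average of 20.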
NOTE: a DB update is required
svn path=/trunk/boinc/; revision=21230
of other jobs of that type.
They're waiting for GPU RAM, which may now be available.
- client: bug fix in GPU RAM availability
- client: fix testing setup for GPU RAM availability
svn path=/trunk/boinc/; revision=21206
old: assign GPUs, then check available RAM
Problem: may cause starvation on multi-GPU systems.
new: use available RAM info in the assignment process.
Prevents starvation, also reduces the number of driver calls.
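A rough sketch of the "new" approach, in Python for brevity: the available RAM of each GPU is consulted while choosing a device, so a job is not pinned to a full GPU while another one has room. The function name, the list-of-RAM representation and the first-fit policy are assumptions for illustration only.

```python
def assign_gpu(available_ram, needed_ram):
    """Pick the first GPU with enough available RAM and reserve that RAM.

    available_ram: list of available bytes per GPU, updated in place.
    Returns the GPU index, or None if no GPU currently has enough RAM.
    """
    for i, avail in enumerate(available_ram):
        if avail >= needed_ram:
            available_ram[i] -= needed_ram
            return i
    return None
```

Under the old scheme, a job assigned to GPU 0 before the RAM check would wait even if GPU 1 had enough memory; here it is placed on GPU 1 directly.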
svn path=/trunk/boinc/; revision=21205
size to the dialog creation routines instead of setting the size
after dialog initialization. This avoids artifacts and the need
to tweak things later.
clientgui/
DlgEventLog.cpp, .h
svn path=/trunk/boinc/; revision=21194
- daily quota mechanism
- reliable mechanism (accelerated retries)
- "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
host_scale_check set.
- validator: do scale probation on invalid results
(need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options
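Accepting both option spellings can be sketched as below, in Python for brevity. `matches_option` is a hypothetical helper name, not a function from the back-end code.

```python
def matches_option(arg, name):
    """True if arg is -name or --name (exactly one or two leading dashes)."""
    return arg == "-" + name or arg == "--" + name
```

A back-end program would then test each command-line argument with `matches_option(arg, "foo")` instead of comparing against `-foo` alone.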
Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
a host will have separate quotas for each one.
That means it could error out on 100 jobs for cuda_fermi,
and when its quota goes to zero,
error out on 100 jobs for cuda23, etc.
This is intentional; there may be cases where one version
works but not the others.
- host.error_rate and host.max_results_day are deprecated
TODO:
- the values in the app table for limits on jobs in progress etc.
should take precedence over those in config.xml.
Implementation notes:
scheduler:
process_request():
read all host_app_versions for host at start;
Compute "reliable" and "trusted" for each one.
write modified records at end
get_app_version():
add "reliable_only" arg; if set, use only reliable versions
skip over-quota versions
Multi-pass scheduling: if the host has at least one reliable version,
do a pass for jobs that need reliable,
and use only reliable versions.
Then clear best_app_versions cache.
Score-based scheduling: for need-reliable jobs,
it will pick the fastest version,
then give a score bonus if that version happens to be reliable.
When a successful result comes back from the client:
increase daily quota
When an error result comes back from the client:
impose scale probation
decrease daily quota if not aborted
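The multi-pass idea above can be sketched as follows, in Python for brevity. The `Version` class, the `schedule` function and the "pick the fastest eligible version" rule are assumptions for illustration; the best_app_versions cache is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Version:
    name: str
    projected_flops: float
    reliable: bool
    over_quota: bool = False


def get_app_version(versions, reliable_only=False):
    """Fastest version that is within quota and, if requested, reliable."""
    candidates = [
        v for v in versions
        if not v.over_quota and (v.reliable or not reliable_only)
    ]
    return max(candidates, key=lambda v: v.projected_flops, default=None)


def schedule(versions, jobs):
    """jobs: list of (job_name, need_reliable). Returns {job_name: version_name}."""
    assigned = {}
    # First pass: only if some version is reliable, and only for jobs that need it.
    if any(v.reliable for v in versions):
        for job, need_reliable in jobs:
            if need_reliable:
                v = get_app_version(versions, reliable_only=True)
                if v is not None:
                    assigned[job] = v.name
    # Second pass: everything not yet assigned may use any version.
    for job, _ in jobs:
        if job not in assigned:
            v = get_app_version(versions)
            if v is not None:
                assigned[job] = v.name
    return assigned
```

With a fast unreliable GPU version and a slow reliable CPU version, a need-reliable job goes to the CPU version and an ordinary job goes to the GPU version.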
Validator:
when handling a WU, create a vector of HOST_APP_VERSION
parallel to vector of RESULT.
Pass it to assign_credit_set().
Make copies of originals so we can update only modified ones
update HOST_APP_VERSION error rates
Transitioner:
decrease quota on timeout
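The quota bookkeeping in these notes can be sketched as below, in Python for brevity. Only the direction of each adjustment comes from the notes; the cap of 100, the doubling on success and the decrement by one are assumptions, as are all names.

```python
from dataclasses import dataclass

MAX_JOBS_PER_DAY = 100  # hypothetical project-wide cap


@dataclass
class HostAppVersion:
    max_jobs_per_day: int = MAX_JOBS_PER_DAY
    scale_probation: bool = False


def on_success(hav):
    # Assumption: double the quota, up to the cap.
    hav.max_jobs_per_day = min(MAX_JOBS_PER_DAY, hav.max_jobs_per_day * 2)


def on_error(hav, aborted=False):
    hav.scale_probation = True
    if not aborted:
        # Assumption: decrement by one, never below one.
        hav.max_jobs_per_day = max(1, hav.max_jobs_per_day - 1)


def on_timeout(hav):
    hav.max_jobs_per_day = max(1, hav.max_jobs_per_day - 1)


# One record per (host, app version), so each plan class has its own quota.
quotas = {"cuda": HostAppVersion(), "cuda23": HostAppVersion()}
```

Because the records are separate, errors on the cuda version lower only that version's quota and leave cuda23 untouched.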
svn path=/trunk/boinc/; revision=21181
- scheduler: fix bug that broke anon platform
Note: Bruce Allen once advised me to take a few days and just
observe BOINC in action.
I should really do this more often; it always turns up bugs
and/or design flaws.
svn path=/trunk/boinc/; revision=21165
1) peak FLOPS (based on benchmarks or GPU attributes).
This does not change over time.
It's not adjusted on the basis of statistics.
It's not affected by wu.rsc_fpops_est.
It can be compared across projects.
versus
2) projected FLOPS: the scheduler's best guess at the value X that satisfies
X * elapsed_time = wu.rsc_fpops_est;
this is used to make server-side runtime estimates,
and it's sent to the client and used for its runtime estimates.
It may be based on the (host, app version) elapsed time average.
My checkin [21153] mistakenly confounded these two.
Notes:
1) app_plan() now must return both peak and projected FLOPS.
2) result.flops_estimate stores peak FLOPS
3) the <flops> field in app_info.xml files should be
projected FLOPS. But its accuracy is not important;
it's not used once the server has statistics
for the (host, app version)
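The relationship between the two quantities can be sketched as follows, in Python for brevity. The function names and the fallback to peak FLOPS when no elapsed-time statistics exist are assumptions; the notes only say that projected FLOPS "may be based on" the elapsed time average.

```python
def projected_flops(rsc_fpops_est, avg_elapsed_time, peak_flops):
    """Solve X * elapsed_time = rsc_fpops_est for X.

    avg_elapsed_time: average elapsed time (seconds) for this
    (host, app version), or None if there are no statistics yet.
    """
    if avg_elapsed_time is not None and avg_elapsed_time > 0:
        return rsc_fpops_est / avg_elapsed_time
    # Assumption: with no statistics, fall back to peak FLOPS.
    return peak_flops


def estimated_runtime(rsc_fpops_est, proj_flops):
    """Runtime estimate in seconds, as used on both server and client."""
    return rsc_fpops_est / proj_flops
```

For example, a job with rsc_fpops_est = 1e12 that averages 500 seconds on a host has projected FLOPS of 2e9, whatever the host's peak FLOPS is.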
svn path=/trunk/boinc/; revision=21164
to multiple jobs at the same time.
I fixed one error (reference arg to assign_coprocs())
but I can't see why this would explain the problem.
I added a lot of extra <coproc_debug> log messages.
- user web: give scientists moderator privileges
svn path=/trunk/boinc/; revision=21158
an app version will run on a particular host.
- scheduler: fix memory leak: BEST_APP_VERSIONs weren't being freed
svn path=/trunk/boinc/; revision=21148
installer for the auto created project_init.xml file.
win_build/installerv2/redist/Windows/src/boinccas/
boinccas.rc
CACreateProjectInitFile.cpp
win_build/installerv2/redist/Windows/Win32/
boinccas.dll
boinccas95.dll
win_build/installerv2/redist/Windows/x64/
boinccas.dll
boinccas95.dll
svn path=/trunk/boinc/; revision=21147