- whether a host is "reliable" for an app version
- whether a host is eligible for single replication for an app version
- whether to use host scaling
In each case, the answer is yes if the number of
consecutive valid results is above a threshold.
This replaces existing "error rate" and "scale probation" mechanisms.
TODO: the # of consecutive valid results should also determine
a limit on jobs in progress for an app version.
Namely, if N is the threshold for host scaling, the limit should be
ndevices*(max(1, consecutive_valid - N))
The client currently doesn't supply enough
app version info to do this.
It could be approximated; that would give some protection
against cherry-picking.
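A minimal sketch of that limit, assuming hypothetical names for the inputs
(ndevices, consecutive_valid, and the host-scaling threshold N):

    // Hypothetical sketch: evaluates ndevices * max(1, consecutive_valid - N).
    inline int jobs_in_progress_limit(int ndevices, int consecutive_valid, int N) {
        int surplus = consecutive_valid - N;
        if (surplus < 1) surplus = 1;    // never below one job per device
        return ndevices * surplus;
    }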
- credit: more conservative formulas for combining claimed credit
among replicas.
If there are normal replicas, we use a "low average"
that weights each sample by the sum of the other samples.
Otherwise we use the min (not the average) of the approximate samples.
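A rough sketch of the "low average", assuming the claimed credits are already
collected in a vector (names are illustrative, not the actual BOINC code):

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Weight each claimed credit by the sum of the other claims,
    // so low claims count for more than high ones.
    double low_average(const std::vector<double>& claims) {
        if (claims.empty()) return 0;
        double sum = std::accumulate(claims.begin(), claims.end(), 0.0);
        double num = 0, denom = 0;
        for (double x : claims) {
            double w = sum - x;              // sum of the other samples
            num += w * x;
            denom += w;
        }
        return denom > 0 ? num / denom : sum / claims.size();
    }

    // With only approximate samples, take the min instead:
    //   double granted = *std::min_element(claims.begin(), claims.end());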
NOTE: a DB update is required
svn path=/trunk/boinc/; revision=21230
- daily quota mechanism
- reliable mechanism (accelerated retries)
- "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
host_scale_check set.
- validator: do scale probation on invalid results
(needed in addition to the error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options
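A small sketch of option matching that accepts both spellings
(the helper name is illustrative, not the actual BOINC code):

    #include <cstring>

    // Return true if arg matches "-name" or "--name".
    bool matches_option(const char* arg, const char* name) {
        if (*arg != '-') return false;
        arg++;
        if (*arg == '-') arg++;    // tolerate a second leading dash
        return strcmp(arg, name) == 0;
    }

    // e.g. matches_option(argv[i], "app") accepts both -app and --app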
Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
a host will have separate quotas for each one.
That means it could error out on 100 jobs for cuda_fermi,
and when its quota goes to zero,
error out on 100 jobs for cuda23, etc.
This is intentional; there may be cases where one version
works but not the others.
- host.error_rate and host.max_results_day are deprecated
TODO:
- the values in the app table for limits on jobs in progress etc.
should override those in config.xml, not the other way around.
Implementation notes:
scheduler:
process_request():
read all host_app_versions for the host at the start;
compute "reliable" and "trusted" for each one;
write the modified records at the end
get_app_version():
add "reliable_only" arg; if set, use only reliable versions
skip over-quota versions
Multi-pass scheduling: if we have at least one reliable version,
do a pass for jobs that need reliable hosts,
and use only reliable versions.
Then clear best_app_versions cache.
Score-based scheduling: for need-reliable jobs,
it will pick the fastest version,
then give a score bonus if that version happens to be reliable.
When a successful result comes back from the client:
increase the daily quota
When an error result comes back from the client:
impose scale probation
decrease the daily quota unless the result was aborted
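A hedged sketch of that quota bookkeeping; the struct mirrors host_app_version,
and the exact increment and penalty are assumptions.

    // Illustrative only: adjust the per-(host, app version) daily quota
    // as results come back.  The real scheduler logic may differ.
    struct HostAppVersionSketch {
        int max_jobs_per_day;
        int n_jobs_today;
    };

    void on_success(HostAppVersionSketch& hav, int quota_ceiling) {
        if (hav.max_jobs_per_day < quota_ceiling) {
            hav.max_jobs_per_day++;         // reward a valid result
        }
    }

    void on_error(HostAppVersionSketch& hav, bool aborted) {
        // scale probation is imposed separately; here only shrink the quota
        if (!aborted && hav.max_jobs_per_day > 1) {
            hav.max_jobs_per_day /= 2;      // assumed penalty
        }
    }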
Validator:
when handling a WU, create a vector of HOST_APP_VERSION
parallel to the vector of RESULTs.
Pass it to assign_credit_set().
Make copies of originals so we can update only modified ones
update HOST_APP_VERSION error rates
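A sketch of that layout, with stand-in types and assumed helpers
(the real assign_credit_set() call is abridged):

    #include <vector>

    // Minimal stand-ins for the DB types; illustrative only.
    struct RESULT { int id; int hostid; };
    struct HOST_APP_VERSION { int host_id; int app_version_id; double error_rate; };

    HOST_APP_VERSION lookup_host_app_version(const RESULT&);                        // assumed
    void assign_credit_set(std::vector<RESULT>&, std::vector<HOST_APP_VERSION>&);   // abridged
    void update_host_app_version(const HOST_APP_VERSION&);                          // assumed

    // havs[i] corresponds to results[i]; keep copies of the originals
    // so only modified records are written back.
    void handle_wu_sketch(std::vector<RESULT>& results) {
        std::vector<HOST_APP_VERSION> havs;
        havs.reserve(results.size());
        for (RESULT& r : results) {
            havs.push_back(lookup_host_app_version(r));
        }
        std::vector<HOST_APP_VERSION> havs_orig = havs;
        assign_credit_set(results, havs);            // may adjust error rates
        for (size_t i = 0; i < havs.size(); i++) {
            if (havs[i].error_rate != havs_orig[i].error_rate) {
                update_host_app_version(havs[i]);    // write back only modified ones
            }
        }
    }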
Transitioner:
decrease quota on timeout
svn path=/trunk/boinc/; revision=21181
see http://boinc.berkeley.edu/trac/wiki/CreditNew
Projects will need to update DB and recompile all back-end programs.
Summary:
- new way of computing credit
- "reliable host" mechanism is per app version
- "host punishment" mechanism is per app version
- adjustment of wu.rsc_fpops_est provides the
equivalent of per-app-version DCF (see the sketch after this list)
- max jobs in progress is now per app
- max jobs per RPC is now per app
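A hedged sketch of the per-app-version DCF idea: keep a per-(host, app version)
correction factor and apply it to wu.rsc_fpops_est when estimating runtime.
All names here are illustrative assumptions, not the shipped code.

    // Illustrative only: a per-app-version duration correction plays
    // the role the global host DCF used to play.
    struct AppVersionStatsSketch {
        double correction;      // running ratio of actual to estimated work
        double flops_estimate;  // projected speed for this app version
    };

    double estimated_runtime(double rsc_fpops_est, const AppVersionStatsSketch& avs) {
        // scale the static estimate by the observed correction factor
        return rsc_fpops_est * avs.correction / avs.flops_estimate;
    }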
TODO:
- reliable mechanism:
- populate and use host_app_version.error_rate
- populate host_app_version.turnaround
- host punishment:
- populate host_app_version.max_jobs_per_day
- populate host_app_version.n_jobs_today
- use app.max_jobs_per_day_init
- job limits:
- use app.max_jobs_in_progress, max_gpu_jobs_in_progress
- use app.max_jobs_per_rpc
- adjust wu.rsc_fpops_est
- remove old credit stuff
fpops_cumulative, credit_multiplier
credit computation in scheduler
- AVERAGE class: use Knuth's online mean/variance algorithm (see Wikipedia)
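For reference, a minimal sketch of that algorithm (Welford's formulation of
Knuth's online update); the member names are assumptions, not the actual
AVERAGE class.

    // Online update of mean and variance, one sample at a time.
    struct AverageSketch {
        double n = 0, mean = 0, m2 = 0;   // m2 = sum of squared deviations
        void update(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        double variance() const { return n > 1 ? m2 / (n - 1) : 0; }
    };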
svn path=/trunk/boinc/; revision=21021
Triggering the work generator is now done via the DB
instead of flat files.
Since only E@h uses locality scheduling,
I kept the DB changes in a separate file (db/schema_locality.sql).
There's a new field in the workunit table;
that update is required (handled in db_update.php)
- manager: compile fix
svn path=/trunk/boinc/; revision=20807
elapsed_time: the elapsed time (runtime) as reported by the client
flops_estimate: the app's estimated FLOPS as reported by app_plan()
app_version_id: the DB ID of the app_version used
(or -1 if anonymous platform)
TODO: show these in the web interfaces,
and use them where appropriate
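One natural use of the new fields, as a sketch rather than the shipped code:
a result's peak FLOP count is its runtime times the app version's projected FLOPS.

    // Sketch only: combine the two new result fields.
    double peak_flop_count(double elapsed_time, double flops_estimate) {
        return elapsed_time * flops_estimate;   // seconds * FLOP/s = FLOPs
    }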
svn path=/trunk/boinc/; revision=19002
Old:
1) check deadline based on wu.delay_bound
2) in add_result_to_reply(), potentially modify wu.delay_bound,
e.g. because of retry acceleration
problem: reducing the delay bound may cause a deadline miss
New:
1) new function get_delay_bound_range()
(called from wu_is_infeasible_fast())
returns optimistic and pessimistic delay bounds.
Retry acceleration logic is here.
2) check deadline based on optimistic bound;
if that fails, check based on pessimistic bound.
Set wu.delay_bound to the one that worked.
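A hedged sketch of the two-step check; the get_delay_bound_range() signature
shown here and the deadline_missed() helper are assumptions for illustration.

    // Assumed declarations standing in for the real scheduler code.
    void get_delay_bound_range(int& opt_delay, int& pess_delay);
    bool deadline_missed(double delay_bound);   // true if the host can't make it

    // Try the optimistic bound first; fall back to the pessimistic one.
    bool check_deadline(double& wu_delay_bound) {
        int opt, pess;
        get_delay_bound_range(opt, pess);
        if (!deadline_missed(opt)) {
            wu_delay_bound = opt;
        } else if (!deadline_missed(pess)) {
            wu_delay_bound = pess;
        } else {
            return false;   // infeasible with either bound
        }
        return true;
    }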
Notes:
- get_delay_bound_range() needs result priority and report deadline,
and it's called before we read the full result.
So add these items to WORK_ITEM and WU_RESULT.
- get_delay_bound_range() could be customized for
project-specific deadline policy.
- add_result_to_reply() was becoming a toxic waste dump.
Deadline-related stuff should have been factored out in any case.
svn path=/trunk/boinc/; revision=18946