Say that a job has a "long-term failure" if it fails in a way
(as evidenced by its exit code and/or stderr)
suggesting that other jobs for that (host, app version) will fail too.
In this case we want to avoid sending more jobs to that (host, app version).
This checkin implements that feature.
To use it, have your validator's init_result() return
VAL_RESULT_LONG_TERM_FAIL if it finds a long-term failure,
and run your validator with the --check_punitive option.
("Punitive" because we're "punishing" the host for its failure).
The validator punishes the (host, app version) by
setting host_app_version.max_jobs_per_day to 1.
One job per day can still be sent.
That way if the underlying problem is fixed
(e.g. the user enables VM acceleration in the BIOS)
we'll eventually go back to normal.
Also: normally HAV.max_jobs_per_day is scaled by the number
of CPUs and GPUs.
Disable this scaling when the value is 1,
so the punishment isn't multiplied away.
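Roughly (a sketch; the exact scaling formula in the scheduler may differ):

    // Illustrative only. The special case is the point: a quota of 1
    // marks a punished (host, app version) and must not be multiplied
    // by the processor counts.
    int effective_daily_quota(HOST_APP_VERSION& hav, int ncpus, int ngpus) {
        if (hav.max_jobs_per_day == 1) {
            return 1;    // punished: exactly one job/day
        }
        return hav.max_jobs_per_day * (ncpus + ngpus);
    }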
Add --post_assigned_credit option to validator.
If set, the validator takes claimed credit from result.claimed_credit
(put there by the project's init_result() function).
The claimed credit of the canonical result is the job's granted credit.
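So with --post_assigned_credit, a project's init_result() might look
like this; parse_credit_from_output() is a hypothetical project helper:

    // Sketch only. The framework later copies the canonical result's
    // claimed_credit into the job's granted credit.
    int init_result(RESULT& result, void*& data) {
        double credit;
        // hypothetical: read the credit the app wrote to its output file
        int retval = parse_credit_from_output(result, credit);
        if (retval) return retval;
        result.claimed_credit = credit;
        data = NULL;
        return 0;
    }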
Also changed --credit_from_runtime so that it averages
claimed credit across instances,
instead of just using the canonical instance.
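The averaging amounts to something like this sketch
(validate_state and claimed_credit are fields of the validator's RESULT):

    // Illustrative only: average claimed credit over all valid
    // instances instead of taking the canonical instance's value.
    double total = 0;
    int nvalid = 0;
    for (unsigned int i = 0; i < results.size(); i++) {
        RESULT& r = results[i];
        if (r.validate_state != VALIDATE_STATE_VALID) continue;
        total += r.claimed_credit;
        nvalid++;
    }
    double granted_credit = nvalid ? total / nvalid : 0;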
A validator can now mark a single result as "suspicious"
by having init_result() return VAL_RESULT_SUSPICIOUS.
If this is the only result in the quorum of an adaptively replicated job,
another instance will be generated for validation.
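In the same init_result() hook, that might look like this
(looks_anomalous() is a hypothetical project-specific heuristic):

    // Sketch: flag a lone result; the server then generates another
    // instance to check it against.
    int init_result(RESULT& result, void*& data) {
        if (looks_anomalous(result)) {
            return VAL_RESULT_SUSPICIOUS;
        }
        data = NULL;
        return 0;
    }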
If you link your functions (init_result(), compare_results(),
cleanup_result()) with validate_test.cpp,
you'll get a program that you can run as
validate_test file1 file2
and it will compare the two files
(this works only for validators that expect 1 file per result).
I added a makefile, sched/makefile_validator_test,
that you can use for this.
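For instance, a trivial byte-for-byte validator to link with
validate_test.cpp (a sketch; read_file_string() is a hypothetical
helper, and the hook signatures may differ between versions --
check validate_util2.h):

    #include <string>

    int init_result(RESULT& result, void*& data) {
        std::string* s = new std::string;
        // hypothetical: load the result's single output file into *s
        int retval = read_file_string(result, *s);
        if (retval) {
            delete s;
            return retval;
        }
        data = (void*) s;
        return 0;
    }

    int compare_results(RESULT& r1, void* d1, RESULT& r2, void* d2, bool& match) {
        match = (*(std::string*)d1 == *(std::string*)d2);
        return 0;
    }

    int cleanup_result(RESULT& result, void* data) {
        delete (std::string*) data;
        return 0;
    }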
- server: shuffle code so that the above doesn't need to
link MySQL libraries
- client: if we fetch a master file and it contains no scheduler URLs,
show a message of class INTERNAL_ERROR
- client/scheduler: make CUDA_DEVICE_PROP.totalGlobalMem a double,
and remove dtotalGlobalMem.
Although NVIDIA reports RAM size as a size_t,
there's no reason to store it as an integer after that.
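The change amounts to this (a sketch; only the relevant field shown):

    #include <cuda_runtime.h>

    struct CUDA_DEVICE_PROP {
        double totalGlobalMem;   // was size_t, plus a separate dtotalGlobalMem
        // ... other fields unchanged
    };

    void fill_prop(CUDA_DEVICE_PROP& prop) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        prop.totalGlobalMem = (double) p.totalGlobalMem;  // size_t -> double
    }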
svn path=/trunk/boinc/; revision=25542
see http://boinc.berkeley.edu/trac/wiki/CreditNew
Projects will need to update DB and recompile all back-end programs.
Summary:
- new way of computing credit
- "reliable host" mechanism is per app version
- "host punishment" mechanism is per app version
- adjustment of wu.rsc_fpops_est provides the
equivalent of per app version DCF
- max jobs in progress is now per app
- max jobs per RPC is now per app
TODO:
- reliable mechanism:
- populate and use host_app_version.error_rate
- populate host_app_version.turnaround
- host punishment:
- populate host_app_version.max_jobs_per_day
- populate host_app_version.n_jobs_today
- use app.max_jobs_per_day_init
- job limits:
- use app.max_jobs_in_progress, max_gpu_jobs_in_progress
- use app.max_jobs_per_rpc
- adjust wu.rsc_fpops_est
- remove old credit stuff:
  fpops_cumulative, credit_multiplier,
  credit computation in scheduler
- AVERAGE class: use the Knuth algorithm (Wikipedia)
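That's the one-pass mean/variance recurrence; a sketch
(field names illustrative, not the actual AVERAGE class):

    // Online mean/variance (Knuth, TAOCP vol. 2; a.k.a. Welford's
    // algorithm): one pass, numerically stable, no big sum of squares.
    struct AVERAGE {
        double n;       // sample count
        double avg;     // running mean
        double m2;      // sum of squared deviations from the running mean

        AVERAGE(): n(0), avg(0), m2(0) {}

        void update(double x) {
            n++;
            double delta = x - avg;
            avg += delta / n;           // uses the new count
            m2 += delta * (x - avg);    // old delta times new delta
        }
        double variance() {
            return n > 1 ? m2 / (n - 1) : 0;
        }
    };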
svn path=/trunk/boinc/; revision=21021
- scheduler: fix bug in adaptive replication:
when sending an unreplicated job to an untrusted host,
set both wu.target_nresults and wu.min_quorum to app.target_nresults.
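The fix, as a sketch (variable names illustrative):

    // Make sure target_nresults and min_quorum are both raised to the
    // app's replication target when the host isn't trusted.
    if (!host_is_trusted && wu.target_nresults == 1) {
        wu.target_nresults = app.target_nresults;
        wu.min_quorum = app.target_nresults;
    }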
svn path=/trunk/boinc/; revision=15762
(attempt to send big jobs to fast hosts, small jobs to slow hosts).
- have "census" compute mean/stdev of host speeds,
write it to a file perf_info.txt
- have feeder compute mean/stdev of sizes of jobs in shmem
- have feeder read perf_info.txt into shmem
  (see the sketch after this list)
- scheduler: add some debugging messages for app version selection
- Add LGPL license to a few files
- upgrade/setup scripts: copy census to bin/
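A sketch of the feeder side, assuming perf_info.txt holds two numbers
(mean and stdev of host speeds); the file format and names here are
assumptions:

    #include <cstdio>

    // Illustrative only: read the census output into shared memory.
    void read_perf_info(double& host_fpops_mean, double& host_fpops_stdev) {
        FILE* f = fopen("perf_info.txt", "r");
        if (!f) return;   // keep old values if census hasn't run yet
        fscanf(f, "%lf %lf", &host_fpops_mean, &host_fpops_stdev);
        fclose(f);
    }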
svn path=/trunk/boinc/; revision=15136
to allow the validator to assign different credit
to different instances of a job
(see the sketch at the end of this entry)
- Scheduler: if the DB can't be opened, return <project_is_down/>
(fixes #578)
- clean up logic of modify_claimed_credit
- feeder: for -priority_order_create_time, use workunitid
rather than create time (faster for the DB)
from Kevin Reed
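The per-instance credit idea, as a sketch
(compute_credit_for() is hypothetical):

    // Illustrative only: grant each valid instance its own credit
    // instead of one per-job value.
    for (unsigned int i = 0; i < results.size(); i++) {
        RESULT& r = results[i];
        if (r.validate_state != VALIDATE_STATE_VALID) continue;
        r.granted_credit = compute_credit_for(r);   // hypothetical policy
    }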
svn path=/trunk/boinc/; revision=14908