A 'transient error' is one that will go away after a while,
e.g. an fopen() failure because of a broken NFS mount.
In general, the BOINC back-end code (validation, assimilation)
handles transient errors sensibly:
if there's a bad NFS mount, it retries validation for a few hours
rather than marking thousands of jobs as failed.
Extend this to script-based validation.
If a script (either init_result or compare_results) exits with 3,
treat that as a transient error.
Treat other nonzero exits (or lack of an exit code) as a permanent error.
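For illustration, a minimal sketch (hypothetical helper name,
not the actual script_validator code) of mapping a script's
exit status under this convention:

    #include <cstdlib>
    #include <sys/wait.h>

    enum SCRIPT_OUTCOME { SCRIPT_OK, SCRIPT_TRANSIENT, SCRIPT_PERMANENT };

    SCRIPT_OUTCOME run_validation_script(const char* cmd) {
        int status = std::system(cmd);
        if (status == -1 || !WIFEXITED(status)) {
            // no exit code at all (e.g. killed by a signal): permanent error
            return SCRIPT_PERMANENT;
        }
        int code = WEXITSTATUS(status);
        if (code == 0) return SCRIPT_OK;
        if (code == 3) return SCRIPT_TRANSIENT;  // exit 3 => transient; retry later
        return SCRIPT_PERMANENT;                 // any other nonzero exit
    }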
More generally (for all validators) add a return value
VAL_RESULT_TRANSIENT_ERROR for init_result() and compare_results().
This means any transient error.
Previously we checked only for ERR_OPENDIR.
And for compare_results() we treated all nonzero returns as permanent.
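A rough sketch of a project init_result() that reports a transient error;
the plugin signature and the get_output_file_path() helper follow the
validator framework, but treat the details as illustrative:

    #include <cstdio>
    #include <string>
    #include "validate_util.h"    // get_output_file_path(), per the framework
    #include "validate_util2.h"   // plugin declarations, VAL_RESULT_* values

    int init_result(RESULT& result, void*& data) {
        std::string path;
        int retval = get_output_file_path(result, path);
        if (retval) return retval;

        FILE* f = fopen(path.c_str(), "r");
        if (!f) {
            // could be a broken NFS mount; ask the framework to retry later
            return VAL_RESULT_TRANSIENT_ERROR;
        }
        // ... parse the output file, stash project-specific data in *data ...
        fclose(f);
        return 0;
    }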
Web: a couple of links still pointed to Trac;
change these to the Github wiki.
text_transform.inc: the [github]wiki:xxx[/github] tag linked
to a non-existent boinc-dev-doc repo.
Link to the Github wiki instead.
A validator can now mark a single result as "suspicious"
by having init_result() return VAL_RESULT_SUSPICIOUS.
If this is the single quorum result of an adaptive replication,
another task will be generated for validation.
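A minimal sketch (the anomaly check is hypothetical, project-specific code):

    int init_result(RESULT& result, void*& data) {
        // ... parse the result's output as usual ...
        if (output_looks_anomalous()) {     // hypothetical project-specific check
            return VAL_RESULT_SUSPICIOUS;   // ask for another replica before accepting
        }
        return 0;
    }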
The SETI@home result table is about to run out of 32-bit IDs,
so we need to move to 64-bit result IDs.
This will happen to the workunit table at some point too.
I changed the server C++ code to use the "long" type for all DB IDs
(and to use appropriate conversion codes like %lu).
"long" is 64 bit on 64-bit machines.
For uniformity I did this for all tables,
even ones (like app) that will never get big.
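Roughly, the pattern is (illustrative code, not the actual server source):

    #include <cstdio>

    void lookup_result(long result_id) {
        char query[256];
        // %ld matches the signed "long" ID; 64 bits on 64-bit Linux servers
        std::snprintf(query, sizeof(query),
            "select * from result where id=%ld", result_id);
        // ... run the query ...
    }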
I chose NOT to change the DB schema for now.
The new code will work with 32-bit ID fields in the DB.
As projects approach the 32-bit limit on a table they can change
its ID field, and fields that reference this table, to BIGINT.
This is likely to happen only on the result and workunit tables.
I put functions in html/ops/db_update.php
to change the IDs of these tables.
check_set() wasn't returning "retry" properly in the case where
one of the calls to init_result() returns ERR_OPENDIR
(treated as a transient failure, since it can be caused by a failed NFS mount).
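The corrected pattern, sketched with an abridged argument list
(the real check_set() in sched/validate_util2.cpp takes more parameters):

    #include <vector>

    int check_set(std::vector<RESULT>& results, bool& retry) {
        retry = false;
        for (RESULT& r : results) {
            void* data = NULL;
            int retval = init_result(r, data);
            if (retval == ERR_OPENDIR) {
                // transient failure (e.g. NFS mount not available);
                // tell the caller to try the whole set again later
                retry = true;
                return 0;
            }
            if (retval) {
                // permanent failure: mark just this result as invalid
                r.validate_state = VALIDATE_STATE_INVALID;
                continue;
            }
            // ... compare r against the other results, pick a canonical one ...
        }
        return 0;
    }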
This assigns credit proportional to runtime*p_fpops.
To prevent cheating, p_fpops is capped at the 95th percentile value
among active hosts,
and runtime is capped at a specified limit.
This option supports apps, like LHC's CERNvm app,
that run for a certain amount of time and then exit.
The CreditNew system doesn't work for such apps.
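A rough sketch of that computation; the 200-credit-per-GFLOPS-day
cobblestone scale is standard BOINC, the parameter names are illustrative:

    #include <algorithm>

    // 200 units of credit per day on a 1 GFLOPS host (the "cobblestone" scale)
    const double COBBLESTONE_SCALE = 200.0 / (86400.0 * 1e9);

    double runtime_credit(
        double runtime,        // seconds reported for the job
        double p_fpops,        // host's benchmarked FLOPS
        double p_fpops_95pct,  // 95th percentile of p_fpops over active hosts
        double max_runtime     // configured runtime cap
    ) {
        double fpops = std::min(p_fpops, p_fpops_95pct);
        double t = std::min(runtime, max_runtime);
        return t * fpops * COBBLESTONE_SCALE;
    }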
- trickle_credit:
To prevent cheating,
cap p_fpops at the 95th percentile value among active hosts,
and require a limit on runtime.
- require that trickle handlers supply an initialization function
svn path=/trunk/boinc/; revision=24182
see http://boinc.berkeley.edu/trac/wiki/CreditNew
Projects will need to update DB and recompile all back-end programs.
Summary:
- new way of computing credit
- "reliable host" mechanism is per app version
- "host punishment" mechanism is per app version
- adjustment of wu.rsc_fpops_est provides the
equivalent of per app version DCF (duration correction factor)
- max jobs in progress is now per app
- max jobs per RPC is now per app
TODO:
- reliable mechanism:
- populate and use host_app_version.error_rate
- populate host_app_version.turnaround
- host punishment:
- populate host_app_version.max_jobs_per_day
- populate host_app_version.n_jobs_today
- use app.max_jobs_per_day_init
- job limits:
- use app.max_jobs_in_progress, max_gpu_jobs_in_progress
- use app.max_jobs_per_rpc
- adjust wu.rsc_fpops_est
- remove old credit stuff
fpops_cumulative, credit_multiplier
credit computation in scheduler
- AVERAGE class: use the Knuth algorithm (Wikipedia)
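For reference, a sketch of the Knuth/Welford running-average update
this item refers to (names are illustrative, not the actual AVERAGE class):

    #include <cmath>

    struct AVERAGE_SKETCH {
        long long n = 0;
        double mean = 0;
        double m2 = 0;      // sum of squared deviations from the mean

        void update(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;          // incremental mean, no overflow-prone sum
            m2 += delta * (x - mean);   // uses the updated mean
        }
        double stdev() const {
            return n > 1 ? std::sqrt(m2 / (n - 1)) : 0;
        }
    };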
svn path=/trunk/boinc/; revision=21021