check_set() wasn't returning "retry" properly in the case where
one of the calls to init_result() returns ERR_OPEN_DIR
(treated as a transient failure, since it can be caused by a failed NFS mount)
- If all results for a workunit were PFC_MODE_INVALID, a NaN pfc
could be used, causing database update errors. Solved
by using wu_estimated_pfc() as pfc in that case.
- Sanity check was comparing raw_pfc directly to rsc_fpops_bound. That
was causing problems for GPUs with high performance estimates. Fixed by
including the app_version scale factor in the check. I thought I had
already committed this...
- Removed a few lines of commented out experimental code accidentally
committed earlier.
- Committed to git repository on 8/24
svn path=/trunk/boinc/; revision=26144
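The NaN fallback and the scaled sanity check described in this entry can be sketched roughly as follows. This is a minimal illustration, not the validator's code: the function names, the parameter names, and the direction in which the app_version scale factor is applied are all assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: if no usable PFC came out of the result set (NaN),
// fall back to a workunit-based estimate such as the value returned by
// wu_estimated_pfc(). The estimate is passed in as a plain number here.
double choose_pfc(double pfc, double wu_estimate) {
    return std::isnan(pfc) ? wu_estimate : pfc;
}

// Hypothetical helper: compare the scaled PFC, not the raw PFC, to the
// workunit's FLOP bound. Multiplying by the scale is an assumption.
bool pfc_within_bound(double raw_pfc, double av_scale, double rsc_fpops_bound) {
    return raw_pfc * av_scale <= rsc_fpops_bound;
}
```

With a raw PFC of 5e15, a bound of 1e15 and an assumed scale of 0.1, the unscaled comparison rejects the result while the scaled one accepts it.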
- When a normal and an approx result are compared the normal result now gets
double weight in a pegged_avg() with any approx results. "Normal mode" GPU
results are frequently resulting in bad credit values for as yet undetermined
reasons. Since GPUs are so fast, there can be a lot of bad values in a
short time. Including the prior average and another result seems to
prevent this in many cases.
- Replaced many of the if { msg_log.printf } with msg_log.cond_printf()
- Accidentally changed some of the formatting when trying a new editor that
apparently autoformats. Sorry for the extra diff lines.
- There's a problem with pfc calculation for hosts whose credit calculation
fails the sanity check. This has been a problem for a long time. Because the
result fails the sanity check, the host_app_version pfc is never updated.
Because hav.pfc is never updated, the credit calculation continues to be
wrong. To quote Quinn, it's like one of those vicious things. I hope to
fix that soon.
svn path=/trunk/boinc/; revision=25999
proximity to the average estimate. This reduces the number of results that
are granted extremely low credit when a new app_version is released and (I
hope) improves work/duration estimates by speeding up the convergence of app
versions stats. I may try using this in lieu of low_average for normal
results, but haven't yet.
svn path=/trunk/boinc/; revision=25953
- validator: add some sanity-checking for credit,
to prevent granting 1e38 credit.
max_granted_credit now defaults to the equivalent of 1 TeraFLOP-year.
Instances that exceed this are not counted in the credit
calculation, and a critical-mode log message is written
- wrapper: remove wall_cpu_time; not used anymore
svn path=/trunk/boinc/; revision=25825
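For scale, the "1 TeraFLOP-year" default can be worked out from the Cobblestone definition (200 credits per day on a 1 GFLOPS computer). The sketch below is only arithmetic plus a hypothetical check; the function names and the 365-day year are assumptions, and the validator may compute the default differently.

```cpp
#include <cassert>

// Cobblestone definition: 200 credits per day of 1 GFLOPS.
const double COBBLESTONES_PER_GFLOPS_DAY = 200.0;

double credit_for(double gflops, double days) {
    return gflops * days * COBBLESTONES_PER_GFLOPS_DAY;
}

// 1 TeraFLOPS = 1000 GFLOPS; a 365-day year is assumed here.
double one_teraflop_year_credit() {
    return credit_for(1000.0, 365.0);
}

// Hypothetical check: an instance whose credit exceeds the cap is left out
// of the credit calculation.
bool exceeds_max_granted_credit(double credit, double max_granted_credit) {
    return credit > max_granted_credit;
}
```

Under these assumptions the cap comes to 73,000,000 credits, far below a runaway value such as 1e38.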
the first few jobs of a new application
(in wu_estimated_pfc(), only multiply by app.min_avg_pfc
if it's nonzero).
svn path=/trunk/boinc/; revision=25484
scale their PFC by 0.1 in credit calculations.
This reflects the fact that GPU apps are typically less efficient
(relative to device peak FLOPS) than are CPU apps.
The actual values from SETI@home and Milkyway are 0.05 and 0.08.
svn path=/trunk/boinc/; revision=24842
Tells multicore apps how many cores to use.
The --nthreads command line arg to the app is now deprecated
though we'll keep it around for the time being.
svn path=/trunk/boinc/; revision=24708
is a "runtime outlier", i.e. its runtime does
not correspond to the job's rsc_fpops_est.
Runtime outliers are not counted in the statistics for
elapsed time, turnaround time, and peak FLOPs count.
This is intended for applications like SETI@home,
some of whose jobs finish more or less instantly
(this happens if the data contains a lot of interference).
If a host happens to get a bunch of these short jobs,
its statistics will get skewed: in essence, the server
will think that the host is extremely fast,
and will send it too many jobs.
svn path=/trunk/boinc/; revision=24225
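The effect of excluding runtime outliers from the statistics can be illustrated with a small running average. The struct and function names below are hypothetical, and the real server keeps more statistics than this sketch does.

```cpp
#include <cassert>

// Simple incremental mean (illustrative, not the server's AVERAGE class).
struct RunningAverage {
    int n = 0;
    double avg = 0;
    void update(double x) {
        n++;
        avg += (x - avg) / n;
    }
};

// Hypothetical job report: elapsed time plus the outlier flag.
struct JobReport {
    double elapsed_time;
    bool runtime_outlier;
};

// Outliers are skipped, so a burst of near-instant jobs does not make
// the host look faster than it is.
void record_elapsed_time(RunningAverage& stats, const JobReport& job) {
    if (job.runtime_outlier) return;
    stats.update(job.elapsed_time);
}
```

Recording jobs of 100 s, 1 s (flagged as an outlier) and 200 s leaves the average at 150 s over two samples.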
reporting incremental runtime every x seconds of runtime.
- client: more XML parsing cleanup
- credit trickle handler: do sanity checks on CPU speed
svn path=/trunk/boinc/; revision=24017
was doing memset(this, 0, sizeof(RESULT)),
i.e. it wasn't zeroing out the whole structure.
The elapsed_time field (which isn't reported by old clients),
is near the end of the struct,
and it was getting garbage, e.g. 1e-304, in some cases,
which led to zero credit (and maybe other problems)
- validator: treat 1e-304 like zero in case of other problems
like the above.
- remote job submission: tweaks
svn path=/trunk/boinc/; revision=23947
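The memset problem above is the classic case of clearing a derived object with the size of its base. The sketch below reproduces it with two made-up plain structs (RESULT_BASE and RESULT_EXT are stand-ins, not the real classes), and adds a hypothetical "effectively zero" test whose threshold is an assumption.

```cpp
#include <cassert>
#include <cstring>

// Stand-ins for the real structures; fields are illustrative.
struct RESULT_BASE {
    int id;
    double cpu_time;
};
struct RESULT_EXT : RESULT_BASE {
    double elapsed_time;  // lives past the end of the base part
};

// Buggy: only the base-sized prefix is zeroed.
void clear_buggy(RESULT_EXT& r) {
    std::memset(&r, 0, sizeof(RESULT_BASE));
}

// Fixed: zero the whole object.
void clear_fixed(RESULT_EXT& r) {
    std::memset(&r, 0, sizeof(r));
}

// Hypothetical guard: treat tiny garbage values as zero.
// The 1e-100 threshold is an assumption chosen for illustration.
bool effectively_zero(double x) {
    return x < 1e-100 && x > -1e-100;
}
```

After clear_buggy() a stale elapsed_time such as 1e-304 survives; after clear_fixed() it is exactly zero.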
adjust project REC by the amount of work queued, to increase variety
NOTE: at some point I think I had a reason to not do this,
but I can't remember what it is.
- client, job scheduling policy: fix how project REC is adjusted
svn path=/trunk/boinc/; revision=23838
PFC values should be around 1.
If they differ from 1 by a factor of > 1e4, ignore them,
and put an error message into the validator log
- validator: if get_pfc() fails because an app version is
missing from the DB (i.e. the project deleted it)
keep going so we don't reprocess the WU forever
svn path=/trunk/boinc/; revision=23837
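The "within a factor of 1e4 of 1" rule can be sketched as a simple range test. The function name is hypothetical, and whether the limits are inclusive is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical plausibility test for a normalized PFC value.
// Values outside (1e-4, 1e4), as well as NaN, are rejected; the caller
// would then ignore the value and log an error.
bool pfc_is_plausible(double pfc) {
    return pfc > 1e-4 && pfc < 1e4;
}
```

A NaN fails both comparisons, so it is rejected without a separate check.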
of the simulation, not the scenario.
If you want to run a simulation w/ different log flags,
you shouldn't have to create a new scenario.
- client emulator: add --config_prefix cmdline arg
- validator: prevent infinite loop when app_version.pfc_avg
is wonky (like 1e-300).
Next step: figure out how it got that way.
svn path=/trunk/boinc/; revision=23828
trickle_credit: grants credit based on CPU time reported in msg
trickle_echo: echoes trickle-up as a trickle-down
svn path=/trunk/boinc/; revision=23118
allow <no_cpu>, <no_cuda> and <no_ati> bools
within <account> in reply message.
They suppress work fetch for that resource type from that project.
- scheduler:
check max_granted_credit after wu.rsc_fpops_bound,
so that max_granted_credit will be enforced
even if wu.rsc_fpops_bound is absurdly high
Fixes #1034. From Diggory Hardy.
svn path=/trunk/boinc/; revision=22793
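The ordering fix can be pictured as two caps applied in sequence, with max_granted_credit last so it always has the final word. This is a sketch under assumptions: the function name, the idea of a credit value derived from rsc_fpops_bound, and treating a zero max_granted_credit as "no cap" are all illustrative.

```cpp
#include <cassert>

// Hypothetical capping routine.
// bound_credit: credit corresponding to wu.rsc_fpops_bound (assumed input).
// max_granted_credit: project-wide cap; 0 is assumed to mean "not set".
double cap_credit(double credit, double bound_credit, double max_granted_credit) {
    if (credit > bound_credit) {
        credit = bound_credit;
    }
    // Applied last, so an absurdly high bound cannot defeat it.
    if (max_granted_credit > 0 && credit > max_granted_credit) {
        credit = max_granted_credit;
    }
    return credit;
}
```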
That produced a messed-up query that assigned garbage values to:
host_app_version.turnaround_var
host_app_version.turnaround_q
host_app_version.max_jobs_per_day
host_app_version.consecutive_valid
To repair these:
- set turnaround_var and turnaround_q to zero
- if max_jobs_per_day is outside of
(0..config.daily_result_quota)
set it to config.daily_result_quota
- if consecutive_valid is outside (0..1000), set it to zero
I added a script, html/ops/repair_21812.php, that does this;
if you ran server code between [21181] and [21812], run this script.
- scheduler/transitioner: add <debug_quota> log flag
- changed the build system to always use -Wall
(if we'd done this before, this bug wouldn't have happened)
- fixed a bunch of other compile warnings
svn path=/trunk/boinc/; revision=21812
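The repair rules listed in this entry amount to the following per-row logic. The actual script is PHP; this C++ version is only a sketch, the struct is a hypothetical stand-in for a host_app_version row, and treating the ranges as inclusive is an assumption.

```cpp
#include <cassert>

// Hypothetical in-memory copy of the affected host_app_version columns.
struct HavRow {
    double turnaround_var;
    double turnaround_q;
    int max_jobs_per_day;
    int consecutive_valid;
};

// Apply the repair rules to one row.
void repair_row(HavRow& r, int daily_result_quota) {
    r.turnaround_var = 0;
    r.turnaround_q = 0;
    if (r.max_jobs_per_day < 0 || r.max_jobs_per_day > daily_result_quota) {
        r.max_jobs_per_day = daily_result_quota;
    }
    if (r.consecutive_valid < 0 || r.consecutive_valid > 1000) {
        r.consecutive_valid = 0;
    }
}
```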
scale wu.rsc_fpops_est by app.min_avg_pfc.
- validator: assume that app.min_avg_pfc is nonzero;
it will be, since the DB default is now 1.
svn path=/trunk/boinc/; revision=21804
- whether host is "reliable" for an app version
- whether host is eligible for single replication for an app version
- whether to use host scaling
In each case, the answer is yes if the number of
consecutive valid results is above a threshold.
This replaces existing "error rate" and "scale probation" mechanisms.
TODO: the # of consecutive valid results should also determine
a limit on jobs in progress for an app version.
Namely, if N is the threshold for host scaling, the limit should be
ndevices*(max(1, consecutive_valid - N))
The client currently doesn't supply enough
app version info to do this.
It could be approximated; that would give some protection
against cherry-picking.
- credit: more conservative formulas for combining claimed credit
among replicas.
If there are normal replicas, we use a "low average"
that weights each sample by the sum of the other samples.
Otherwise we use the min (not the average) of the approximate samples.
NOTE: a DB update is required
svn path=/trunk/boinc/; revision=21230
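Two formulas from this entry can be written out directly: the "low average" that weights each sample by the sum of the other samples, and the proposed jobs-in-progress limit ndevices*(max(1, consecutive_valid - N)). The code below is a sketch of those descriptions, not a copy of the server's implementation; the function signatures and the handling of empty or all-zero input are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted mean in which sample i has weight (sum of all samples - sample i).
// Large samples therefore get small weights, pulling the result downward.
double low_average(const std::vector<double>& v) {
    std::size_t n = v.size();
    if (n == 0) return 0;      // assumption: empty input yields 0
    if (n == 1) return v[0];
    double sum = 0;
    for (double x : v) sum += x;
    double num = 0, den = 0;
    for (double x : v) {
        double w = sum - x;
        num += w * x;
        den += w;
    }
    if (den <= 0) return 0;    // assumption: all-zero input yields 0
    return num / den;
}

// Limit from the TODO: ndevices * max(1, consecutive_valid - N),
// where N is the host-scaling threshold.
int jobs_in_progress_limit(int ndevices, int consecutive_valid, int threshold) {
    return ndevices * std::max(1, consecutive_valid - threshold);
}
```

For samples 1 and 3 the low average is 1.5, compared with an arithmetic mean of 2.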
- daily quota mechanism
- reliable mechanism (accelerated retries)
- "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
host_scale_check set.
- validator: do scale probation on invalid results
(need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options
Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
a host will have separate quotas for each one.
That means it could error out on 100 jobs for cuda_fermi,
and when its quota goes to zero,
error out on 100 jobs for cuda23, etc.
This is intentional; there may be cases where one version
works but not the others.
- host.error_rate and host.max_results_day are deprecated
TODO:
- the values in the app table for limits on jobs in progress etc.
should override those in config.xml.
Implementation notes:
scheduler:
process_request():
read all host_app_versions for host at start;
Compute "reliable" and "trusted" for each one.
write modified records at end
get_app_version():
add "reliable_only" arg; if set, use only reliable versions
skip over-quota versions
Multi-pass scheduling: if have at least one reliable version,
do a pass for jobs that need reliable,
and use only reliable versions.
Then clear best_app_versions cache.
Score-based scheduling: for need-reliable jobs,
it will pick the fastest version,
then give a score bonus if that version happens to be reliable.
When get back a successful result from client:
increase daily quota
When get back an error result from client:
impose scale probation
decrease daily quota if not aborted
Validator:
when handling a WU, create a vector of HOST_APP_VERSION
parallel to vector of RESULT.
Pass it to assign_credit_set().
Make copies of originals so we can update only modified ones
update HOST_APP_VERSION error rates
Transitioner:
decrease quota on timeout
svn path=/trunk/boinc/; revision=21181
1) peak FLOPS (based on benchmarks or GPU attributes).
This does not change over time.
It's not adjusted on the basis of statistics.
It's not affected by wu.rsc_fpops_est.
It can be compared across projects.
versus
2) projected FLOPS: the scheduler's best guess as to what will satisfy
X * elapsed_time = wu.rsc_fpops_est;
this is used to make server-side runtime estimates,
and it's sent to the client and used for its runtime estimates.
It may be based on the (host, app version) elapsed time average.
My checkin [21153] mistakenly confounded these two.
Notes:
1) app_plan() now must return both peak and projected FLOPS.
2) result.flops_estimate stores peak FLOPS
3) the <flops> field in app_info.xml files should be
projected FLOPS. But its accuracy is not important;
it's not used once the server has statistics
for the (host, app version)
svn path=/trunk/boinc/; revision=21164
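The definition of projected FLOPS, X * elapsed_time = wu.rsc_fpops_est, gives X = rsc_fpops_est / elapsed_time. A minimal sketch, assuming the (host, app version) elapsed-time average is used when available and some other estimate is used otherwise; the function and parameter names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: solve X * elapsed_time = rsc_fpops_est for X.
// avg_elapsed_time: average elapsed time for this (host, app version),
//                   or <= 0 if no statistics exist yet.
// fallback_flops:   estimate to use before statistics exist (assumption).
double projected_flops(double rsc_fpops_est, double avg_elapsed_time, double fallback_flops) {
    if (avg_elapsed_time <= 0) {
        return fallback_flops;
    }
    return rsc_fpops_est / avg_elapsed_time;
}
```

Peak FLOPS plays no part in this calculation, which is the distinction the entry draws.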
TODO: remove related code
- validator: update wu.canonical_credit correctly.
However, this field should be deprecated.
- validator: check for error return from assign_credit_set().
svn path=/trunk/boinc/; revision=21096
(todo: do this for all daemons)
- validator: change cmdline args from -foo to --foo
(todo: do this for all daemons)
- validator: pass max_granted_credit to assign_credit_set()
svn path=/trunk/boinc/; revision=21093
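Accepting both -foo and --foo during a transition can be done with a small matching helper. The function below is a hypothetical sketch, not the helper the daemons actually use.

```cpp
#include <cassert>
#include <cstring>

// Returns true if arg is "-name" or "--name".
bool is_option(const char* arg, const char* name) {
    if (arg[0] != '-') return false;
    const char* p = arg + 1;
    if (*p == '-') p++;  // accept a second dash
    return std::strcmp(p, name) == 0;
}
```

A command-line loop would then call is_option(argv[i], "foo") instead of comparing against a single spelling.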