Commit Graph

58 Commits

Author SHA1 Message Date
David Anderson 9049737d1f validator: retry if transient failure
check_set() wasn't returning "retry" properly in the case where
one of the calls to init_result() returns ERR_OPEN_DIR
(treated as a transient failure, since it can be caused by a failed NFS mount)
2013-05-20 13:01:10 -07:00
David Anderson c9c9f2bae0 - scheduler: code shuffle; new file sched_check.cpp contains functions
that decide whether a job can be sent to a host
2013-04-09 12:19:00 -07:00
David Anderson 3017ed943f - scheduler: debug the above 2013-02-26 16:44:26 +01:00
BOINC Admin c8b8c6155f need to add files? 2013-02-26 16:37:26 +01:00
Eric J. Korpela 33962b77e1 - sched: 2 bug fixes in credit.cpp
- It was possible if all results for a workunit were PFC_MODE_INVALID
          that NaN pfc would be used causing database update errors.  Solved
          by using wu_estimated_pfc() as pfc in that case.
        - Sanity check was comparing raw_pfc directly to rsc_fpops_bound.  That
          was causing problems for GPUs with high performance estimates.  Fixed by
          including the app_version scale factor in the check.  I thought I had
          already committed this...
        - Removed a few lines of commented out experimental code accidentally
          committed earlier.
        - Committed to git repository on 8/24


svn path=/trunk/boinc/; revision=26144
2012-10-02 15:20:13 +00:00
David Anderson 25c2f6b49c - client: treat all 4xx HTTP errors as permanent
- code cleanup
- API: increase a buffer in timer_handler() from 256 to 512.


svn path=/trunk/boinc/; revision=26012
2012-08-13 18:23:20 +00:00
Eric J. Korpela 9b9ec18d69 - Fixed typo
svn path=/trunk/boinc/; revision=26000
2012-08-09 16:23:38 +00:00
Eric J. Korpela 5c24fc50eb - Credit is more stable when pegged_avg() is used.
- When a normal and an approx result are compared the normal result now gets
  double weight in a pegged_avg() with any approx results. "Normal mode" GPU 
  results are frequently resulting in bad credit values for yet undetermined 
  reasons.  Since GPUs are so fast, there can be a lot of bad values in a 
  short time.  Including the prior average and another result even seems to
  prevent this in many cases.
- Replaced many of the if { msg_log.printf } with msg_log.cond_printf()
- Accidentally changed some of the formatting when trying a new editor that
  apparently autoformats. Sorry for the extra diff lines.
- There's a problem with pfc calculation for hosts whose credit calculation
  fails the sanity check.  This has been a problem for a long time.  Because the
  result fails the sanity check, the host_app_version pfc is never updated.
  Because hav.pfc is never updated, the credit calculation continues to be
  wrong. To quote Quinn, it's like one of those viscious things. I hope to 
  fix that soon.

svn path=/trunk/boinc/; revision=25999
2012-08-09 03:02:51 +00:00
Eric J. Korpela 29d5781a34 - Modified credit granting for "approx credit" results to weight results by
proximity to the average estimate.  This reduces the number of results that
  are granted extremely low credit when a new app_version is released and (I
  hope) improves work/duration estimates by speeding up the convergence of app
  version stats.  I may try using this in lieu of low_average for normal
  results, but haven't yet.


svn path=/trunk/boinc/; revision=25953
2012-08-02 15:45:13 +00:00
David Anderson da7e40f142 - use <cmath> instead of <math.h>. Seems to be needed on Debian.
svn path=/trunk/boinc/; revision=25938
2012-08-01 21:21:38 +00:00
David Anderson fc2af21221 - client: add missing end tag for <pci_info>. Doh!
- validator: add some sanity-checking for credit,
    to prevent granting 1e38 credit.
    max_granted_credit now defaults to the equivalent of 1 TeraFLOP-year.
    Instances that exceed this are not counted in the credit
    calculation, and a critical-mode log message is written
- wrapper: remove wall_cpu_time; not used anymore


svn path=/trunk/boinc/; revision=25825
2012-06-29 22:24:07 +00:00
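For scale, the 1 TeraFLOP-year default in fc2af21221 can be converted to credit units. The sketch below assumes the usual BOINC definition of credit (200 units per day of work on a 1 GFLOPS host); the function and constant names are illustrative and not taken from the BOINC source.

```cpp
// Illustrative only: names are hypothetical, not BOINC identifiers.
// Assumption: 200 credits are granted per GFLOPS-day.
constexpr double kCreditsPerGflopsDay = 200.0;

// Credit equivalent of running at `gflops` for `days` days.
constexpr double credit_for(double gflops, double days) {
    return kCreditsPerGflopsDay * gflops * days;
}

// 1 TeraFLOPS = 1000 GFLOPS, sustained for a 365-day year.
constexpr double kTeraflopYearCredit = credit_for(1000.0, 365.0);
```

Under that assumption the cap works out to 200 * 1000 * 365 = 73,000,000 credits per instance.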
David Anderson ec0ca2615d - scheduler: fix bug that could cause zero credit for
the first few jobs of a new application
    (in wu_estimated_pfc(), only multiply by app.min_avg_pfc
    if it's nonzero).


svn path=/trunk/boinc/; revision=25484
2012-03-23 21:47:06 +00:00
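The guard described in ec0ca2615d can be sketched as follows. This is a minimal illustration, assuming the estimate starts from wu.rsc_fpops_est and is then scaled by app.min_avg_pfc; the function name and signature are hypothetical, not the real wu_estimated_pfc().

```cpp
// Hypothetical stand-in for the estimate described in the commit.
// A zero min_avg_pfc (no statistics yet for a new application) is
// skipped, so the estimate, and hence the credit, does not collapse to zero.
double estimated_pfc(double rsc_fpops_est, double min_avg_pfc) {
    double pfc = rsc_fpops_est;
    if (min_avg_pfc != 0) {
        pfc *= min_avg_pfc;
    }
    return pfc;
}
```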
David Anderson fe90776614 - scheduler: if an app has only GPU versions,
scale their PFC by 0.1 in credit calculations.
    This reflects the fact that GPU apps are typically less efficient
    (relative to device peak FLOPS) than are CPU apps.
    The actual values from SETI@home and Milkyway are 0.05 and 0.08.


svn path=/trunk/boinc/; revision=24842
2011-12-21 03:21:52 +00:00
David Anderson dd93780787 - API and client: add "ncpus" field to APP_INIT_DATA.
Tells multicore apps how many cores to use.
    The --nthreads command line arg to the app is now deprecated
    though we'll keep it around for the time being.


svn path=/trunk/boinc/; revision=24708
2011-12-01 18:44:19 +00:00
David Anderson d53b89fe6f - feeder: fix logic error in the way app_version.pfc_scale is updated
(from Kevin Reed)


svn path=/trunk/boinc/; revision=24514
2011-11-03 07:08:52 +00:00
David Anderson 61dc940872 - validator: add runtime_outlier message
svn path=/trunk/boinc/; revision=24229
2011-09-16 21:30:21 +00:00
David Anderson e49f945908 - Validator: allow project-specific code to mark a result
as a "runtime outlier", i.e. its runtime does
    not correspond to the job's rsc_fpops_est.
    Runtime outliers are not counted in the statistics for
    elapsed time, turnaround time, and peak FLOPs count.

    This is intended for applications like SETI@home,
    some of whose jobs finish more or less instantly
    (this happens if the data contains a lot of interference).
    If a host happens to get a bunch of these short jobs,
    its statistics will get skewed: in essence, the server
    will think that the host is extremely fast,
    and will send it too many jobs.


svn path=/trunk/boinc/; revision=24225
2011-09-16 16:43:15 +00:00
David Anderson 826cd355e5 - validator: old scheduler bugs may cause result.flops_estimate
to be negative in some cases.
    Detect this, and use 1e10 instead


svn path=/trunk/boinc/; revision=24146
2011-09-08 19:36:14 +00:00
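The fallback in 826cd355e5 amounts to a small guard. The sketch below is an assumption about its shape, with a hypothetical helper name; it also covers the zero case that the earlier commit f2e8d4601b handles with the same 1e10 value.

```cpp
// Hypothetical helper, not the BOINC implementation.
// Replaces a non-positive FLOPS estimate with the 1e10 fallback
// mentioned in the commit message.
double usable_flops_estimate(double flops_estimate) {
    const double kFallbackFlops = 1e10;
    return flops_estimate > 0 ? flops_estimate : kFallbackFlops;
}
```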
David Anderson 8fda6c0497 - Vbox wrapper: add --trickle x option; sends a trickle-up message
reporting incremental runtime every x seconds of runtime.
- client: more XML parsing cleanup
- credit trickle handler: do sanity checks on CPU speed


svn path=/trunk/boinc/; revision=24017
2011-08-21 11:18:08 +00:00
David Anderson 578d5f924f - scheduler: fix nasty bug where SCHED_DB_RESULT::parse()
was doing memset(this, 0, sizeof(RESULT)),
    i.e. it wasn't zeroing out the whole structure.
    The elapsed_time field (which isn't reported by old clients),
    is near the end of the struct,
    and it was getting garbage, e.g. 1e-304, in some cases,
    which led to zero credit (and maybe other problems)
- validator: treat 1e-304 like zero in case of other problems
    like the above.
- remote job submission: tweaks


svn path=/trunk/boinc/; revision=23947
2011-08-08 04:37:53 +00:00
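The memset bug in 578d5f924f is easy to reproduce in miniature. The structs below are simplified, hypothetical stand-ins for RESULT and SCHED_DB_RESULT (composition is used here instead of the real class layout); the point is only that sizeof of the smaller type leaves the trailing field untouched.

```cpp
#include <cstring>

// Simplified, hypothetical stand-ins for RESULT and SCHED_DB_RESULT.
struct Result {
    int id;
    double cpu_time;
};

struct SchedDbResult {
    Result base;          // fields shared with Result
    double elapsed_time;  // extra field placed after the Result part
};

// Buggy: clears only the first sizeof(Result) bytes,
// so elapsed_time keeps whatever value it had before.
void clear_buggy(SchedDbResult& r) {
    std::memset(&r, 0, sizeof(Result));
}

// Fixed: clears the whole object.
void clear_fixed(SchedDbResult& r) {
    std::memset(&r, 0, sizeof(r));
}
```

With the buggy version, a stale value such as 1e-304 in elapsed_time survives the "reset", which matches the garbage described in the commit message.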
David Anderson d1be15b9fb - validator: remove spurious messages
svn path=/trunk/boinc/; revision=23855
2011-07-19 18:48:03 +00:00
David Anderson 8ca24cbbab - client, work fetch policy:
adjust project REC by the amount of work queued, to increase variety
    NOTE: at some point I think I had a reason to not do this,
    but I can't remember what it is.
- client, job scheduling policy: fix how project REC is adjusted


svn path=/trunk/boinc/; revision=23838
2011-07-13 19:46:03 +00:00
David Anderson f44c9910e7 - validator: if job FLOPs estimates are accurate,
PFC values should be around 1.
    If they differ from 1 by a factor of > 1e4, ignore them,
    and put an error message into the validator log
- validator: if get_pfc() fails because an app version is
    missing from the DB (i.e. the project deleted it)
    keep going so we don't reprocess the WU forever


svn path=/trunk/boinc/; revision=23837
2011-07-12 20:44:28 +00:00
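The PFC plausibility rule in f44c9910e7 can be written as a range check. This is a sketch with a hypothetical function name; the treatment of the exact boundary values is an assumption.

```cpp
// Hypothetical helper: true if pfc is within a factor of 1e4 of 1,
// i.e. in the range [1e-4, 1e4]. Values outside it (including
// negatives and NaN) would be ignored and logged by the validator.
bool pfc_is_plausible(double pfc) {
    const double kMaxFactor = 1e4;
    return pfc >= 1.0 / kMaxFactor && pfc <= kMaxFactor;
}
```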
David Anderson 62074bd4fa - client emulator web interface: make cc_config.xml an attribute
of the simulation, not the scenario.
    If you want to run a simulation w/ different log flags,
    you shouldn't have to create a new scenario.
- client emulator: add --config_prefix cmdline arg
- validator: prevent infinite loop when app_version.pfc_avg
    is wonky (like 1e-300).
    Next step: figure out how it got that way.


svn path=/trunk/boinc/; revision=23828
2011-07-10 07:05:07 +00:00
David Anderson 732866b8aa - back end: add two example trickle handlers:
    trickle_credit: grants credit based on CPU time reported in msg
    trickle_echo: echoes trickle-up as a trickle-down

svn path=/trunk/boinc/; revision=23118
2011-02-27 00:10:14 +00:00
David Anderson 58dadd91a8 - client, acct manager protocol:
allow <no_cpu>, <no_cuda> and <no_ati> bools
    within <account> in reply message.
    They suppress work fetch for that resource type from that project.
- scheduler:
    check max_granted_credit after wu.rsc_fpops_bound,
    so that max_granted_credit will be enforced
    even if wu.rsc_fpops_bound is absurdly high
    Fixes #1034.  From Diggory Hardy.


svn path=/trunk/boinc/; revision=22793
2010-12-02 04:53:12 +00:00
David Anderson b169e5ab0f - server programs: print error message instead of numeric retval
in log messages

svn path=/trunk/boinc/; revision=22647
2010-11-08 17:51:57 +00:00
David Anderson 8aa29bec33 - validator: fix another bug with --credit_from_wu
- make_project, update scripts: don't quit if user_profiles
    already exists


svn path=/trunk/boinc/; revision=22630
2010-11-05 17:15:27 +00:00
David Anderson 4edfe2ec28 - client: small initial checkin for new scheduling system.
Keep track of per-project recent estimated credit

svn path=/trunk/boinc/; revision=22608
2010-10-29 23:41:34 +00:00
David Anderson 5ef4dead7d - validator: need parens in boolean expression
svn path=/trunk/boinc/; revision=21814
2010-06-25 19:23:16 +00:00
David Anderson 7c51512cbf - transitioner: the format string for a DB query had %.15d instead of %.15e.
That produced a messed-up query that assigned garbage values to:
        host_app_version.turnaround_var
        host_app_version.turnaround_q
        host_app_version.max_jobs_per_day
        host_app_version.consecutive_valid
    To repair these:
        - set turnaround_var and turnaround_q to zero
        - if max_jobs_per_day is outside of
            (0..config.daily_result_quota)
            set it to config.daily_result_quota
        - if consecutive_valid is outside (0..1000), set it to zero
    I added a script, html/ops/repair_21812.php, that does this;
    if you ran server code between [21181] and [21812], run this script.
- scheduler/transitioner: add <debug_quota> log flag
- changed the build system to always use -Wall
    (if we'd done this before, this bug wouldn't have happened)
- fixed a bunch of other compile warnings


svn path=/trunk/boinc/; revision=21812
2010-06-25 18:54:37 +00:00
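The repair rules listed in 7c51512cbf can be restated in C++. The actual fix is the PHP script html/ops/repair_21812.php; the struct and function below are hypothetical and only mirror the rules as written, assuming both ranges are inclusive.

```cpp
// Hypothetical mirror of the affected host_app_version columns.
struct HostAppVersionStats {
    double turnaround_var;
    double turnaround_q;
    int max_jobs_per_day;
    int consecutive_valid;
};

// Applies the repair rules from the commit message.
// Assumption: the ranges (0..quota) and (0..1000) are inclusive.
void repair(HostAppVersionStats& hav, int daily_result_quota) {
    hav.turnaround_var = 0;
    hav.turnaround_q = 0;
    if (hav.max_jobs_per_day < 0 || hav.max_jobs_per_day > daily_result_quota) {
        hav.max_jobs_per_day = daily_result_quota;
    }
    if (hav.consecutive_valid < 0 || hav.consecutive_valid > 1000) {
        hav.consecutive_valid = 0;
    }
}
```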
David Anderson 25f9b05bdb - validator: there were a couple of places where we needed to
scale wu.rsc_fpops_est by app.min_avg_pfc.
- validator: assume that app.min_avg_pfc is nonzero;
    it will be, since the DB default is now 1.

svn path=/trunk/boinc/; revision=21804
2010-06-24 22:17:33 +00:00
David Anderson 1250f41313 - validator: fix a divide by zero (happens w/ old clients
that don't report elapsed time)

svn path=/trunk/boinc/; revision=21788
2010-06-22 18:09:55 +00:00
David Anderson 9262cc8c1a - validator: fix possible divide-by-zero
- validator: when claimed credit is too high,
    assign standard credit rather than exiting.

svn path=/trunk/boinc/; revision=21783
2010-06-21 17:56:12 +00:00
David Anderson f2e8d4601b - validator: because of the above problem,
some results have flops_estimate == 0, which causes divide by zero.
    Check for this and use 1e10.

svn path=/trunk/boinc/; revision=21776
2010-06-18 22:27:09 +00:00
David Anderson 4147249de2 - server: delete old credit stuff
- user web: show host link in user result list.  Fixes #999


svn path=/trunk/boinc/; revision=21735
2010-06-12 22:08:15 +00:00
David Anderson ef0019d8c3 - validator: bug fixes: bad formula for low_average();
failure to reread app_versions because of 1e6/1e-6 typo


svn path=/trunk/boinc/; revision=21302
2010-04-26 23:12:40 +00:00
David Anderson 5035007b90 - back end: new way of deciding:
    - whether host is "reliable" for an app version
    - whether host is eligible for single replication for an app version
    - whether to use host scaling
    In each case, the answer is yes if the number of
    consecutive valid results is above a threshold.
    This replaces existing "error rate" and "scale probation" mechanisms.

    TODO: the # of consecutive valid results should also determine
        a limit on jobs in progress for an app version.
        Namely, if N is the threshold for host scaling, the limit should be
            ndevices*(max(1, consecutive_valid - N))
        The client currently doesn't supply enough
        app version info to do this.
        It could be approximated; that would give some protection
        against cherry-picking.
- credit: more conservative formulas for combining claimed credit
    among replicas.
    If there are normal replicas, we use a "low average"
    that weights each sample by the sum of the other samples.
    Otherwise we use the min (not the average) of the approximate samples.

NOTE: a DB update is required


svn path=/trunk/boinc/; revision=21230
2010-04-21 19:33:20 +00:00
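Two formulas from 5035007b90 lend themselves to a short sketch: the "low average" and the proposed in-progress job limit. Both functions are hypothetical and follow the wording of the commit message only; a later commit (ef0019d8c3) fixes a bad formula in low_average(), so the real code may differ.

```cpp
#include <algorithm>
#include <vector>

// "Low average" as worded in the commit message: each sample is
// weighted by the sum of the other samples, so larger samples get
// smaller weights. Hypothetical sketch, not BOINC's code.
double low_average(const std::vector<double>& samples) {
    if (samples.empty()) return 0.0;
    if (samples.size() == 1) return samples[0];
    double total = 0.0;
    for (double x : samples) total += x;
    double weighted_sum = 0.0;
    double weight_sum = 0.0;
    for (double x : samples) {
        double w = total - x;  // sum of the other samples
        weighted_sum += w * x;
        weight_sum += w;
    }
    return weight_sum > 0 ? weighted_sum / weight_sum : 0.0;
}

// Job limit from the TODO: ndevices * max(1, consecutive_valid - N),
// where N (`threshold`) is the threshold for host scaling.
int job_limit(int ndevices, int consecutive_valid, int threshold) {
    return ndevices * std::max(1, consecutive_valid - threshold);
}
```

For two samples of 1 and 3 this low average gives 1.5 rather than the arithmetic mean of 2, which is the conservative bias the commit describes.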
David Anderson 6893691ae2 - validator: message tweak
svn path=/trunk/boinc/; revision=21212
2010-04-19 22:57:49 +00:00
David Anderson 61195cb59d - validator: fix bug where host.total_credit not incremented
svn path=/trunk/boinc/; revision=21211
2010-04-19 21:46:45 +00:00
David Anderson b71d3e6cf4 - back end: typo and tweaks
svn path=/trunk/boinc/; revision=21196
2010-04-16 21:16:18 +00:00
David Anderson 021edb02c2 - back end programs: improve log msgs
svn path=/trunk/boinc/; revision=21193
2010-04-16 18:07:08 +00:00
David Anderson 02717af2f3 - bug fixes
svn path=/trunk/boinc/; revision=21187
2010-04-15 21:58:44 +00:00
David Anderson b2451544e1 - server: change the following from per-host to per-(host, app version):
    - daily quota mechanism
    - reliable mechanism (accelerated retries)
    - "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
    host_scale_check set.
- validator: do scale probation on invalid results
    (need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options

Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
    a host will have separate quotas for each one.
    That means it could error out on 100 jobs for cuda_fermi,
    and when its quota goes to zero,
    error out on 100 jobs for cuda23, etc.
    This is intentional; there may be cases where one version
    works but not the others.
- host.error_rate and host.max_results_day are deprecated

TODO:
    - the values in the app table for limits on jobs in progress etc.
        should override those in config.xml.

Implementation notes:
scheduler:
    process_request():
        read all host_app_versions for host at start;
        Compute "reliable" and "trusted" for each one.
        write modified records at end
    get_app_version():
        add "reliable_only" arg; if set, use only reliable versions
        skip over-quota versions
    Multi-pass scheduling: if have at least one reliable version,
        do a pass for jobs that need reliable,
        and use only reliable versions.
        Then clear best_app_versions cache.
    Score-based scheduling: for need-reliable jobs,
        it will pick the fastest version,
        then give a score bonus if that version happens to be reliable.
    When get back a successful result from client:
        increase daily quota
    When get back an error result from client:
        impose scale probation
        decrease daily quota if not aborted
Validator:
    when handling a WU, create a vector of HOST_APP_VERSION
        parallel to vector of RESULT.
        Pass it to assign_credit_set().
        Make copies of originals so we can update only modified ones
    update HOST_APP_VERSION error rates
Transitioner:
    decrease quota on timeout


svn path=/trunk/boinc/; revision=21181
2010-04-15 03:13:56 +00:00
David Anderson e05a479f42 - scheduler and validator: distinguish between
    1) peak FLOPS (based on benchmarks or GPU attributes).
        This does not change over time.
        It's not adjusted on the basis of statistics.
        It's not affected by wu.rsc_fpops_est.
        It can be compared across projects.
    versus
    2) projected FLOPS: the scheduler's best guess as to what will satisfy
        X * elapsed_time = wu.rsc_fpops_est;
        this is used to make server-side runtime estimates,
        and it's sent to the client and used for its runtime estimates.
        It may be based on the (host, app version) elapsed time average.
    My checkin [21153] mistakenly confounded these two.

    Notes:
    1) app_plan() now must return both peak and projected FLOPS.
    2) result.flops_estimate stores peak FLOPS
    3) the <flops> field in app_info.xml files should be
        projected FLOPS.  But its accuracy is not important;
        it's not used once the server has statistics
        for the (host, app version)

svn path=/trunk/boinc/; revision=21164
2010-04-10 05:49:51 +00:00
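The definition of projected FLOPS in e05a479f42 is the solution X of X * elapsed_time = wu.rsc_fpops_est. A minimal sketch follows, assuming the (host, app version) elapsed time average is available as a plain number; the function name and the zero return for a missing average are hypothetical choices.

```cpp
// Projected FLOPS X solves X * elapsed_time = rsc_fpops_est.
// Hypothetical helper; returns 0 when no usable elapsed time exists,
// leaving the caller to fall back to some other estimate.
double projected_flops(double rsc_fpops_est, double avg_elapsed_time) {
    if (avg_elapsed_time <= 0) return 0.0;
    return rsc_fpops_est / avg_elapsed_time;
}
```

For example, a job with rsc_fpops_est = 1e12 on a host whose jobs of this app version average 100 seconds gives a projected 1e10 FLOPS. Peak FLOPS, by contrast, would come from benchmarks or GPU attributes and not from this ratio.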
David Anderson 1d765245ed - scheduler: sweeping changes to the way job runtimes are estimated:
see http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation


svn path=/trunk/boinc/; revision=21153
2010-04-08 23:14:47 +00:00
David Anderson 212fb765e9 - validator: detect jobs that used GPU app but fell back to CPU
(SETI@home does this if GPU initialization fails).
    Treat these like CPU apps for credit purposes.

svn path=/trunk/boinc/; revision=21130
2010-04-06 23:48:35 +00:00
David Anderson e276aa5ed6 - server: make the -d 4 feature work with FCGI
svn path=/trunk/boinc/; revision=21109
2010-04-05 23:12:02 +00:00
David Anderson 2536797068 - validator: remove update_credit_per_cpu_sec(). Irrelevant.
TODO: remove related code
- validator: update wu.canonical_credit correctly.
    However, this field should be deprecated.
- validator: check for error return from assign_credit_set().

svn path=/trunk/boinc/; revision=21096
2010-04-05 20:03:54 +00:00
David Anderson a2a661993b - validator: -d 4 means -d 3 plus print all DB queries
(todo: do this for all daemons)
- validator: change cmdline args from -foo to --foo
    (todo: do this for all daemons)
- validator: pass max_granted_credit to assign_credit_set()

svn path=/trunk/boinc/; revision=21093
2010-04-05 18:59:16 +00:00