Commit Graph

36 Commits

Author SHA1 Message Date
David Anderson 94c8e53204 Server: add "punitive validation" mechanism
Say that a job has a "long-term failure" if it fails in a way
(as evidenced by its exit code and/or stderr)
suggesting that other jobs for that (host, app version) will fail too.
In this case we want to avoid sending more jobs to that (host, app version).

This implements this feature.
To use it, have your validator's init_result() return
VAL_RESULT_LONG_TERM_FAIL if it finds a long-term failure,
and run your validator with the --check_punitive option.
("Punitive" because we're "punishing" the host for its failure).

The validator punishes the (host, app version) by
setting host_app_version.max_jobs_per_day to 1.
One job per day can still be sent.
That way if the underlying problem is fixed
(e.g. the user enables VM acceleration in the BIOS)
we'll eventually go back to normal.

Also: normally HAV.max_jobs_per_day is scaled by the numbers
of CPUs and GPUs.
Disable this scaling in the case where it's 1.
2019-02-18 21:29:04 -08:00
David Anderson 7878d24f95 Add header comments to sched/*.cpp 2016-06-24 15:42:11 -07:00
David Anderson a87d039f49 server: more 64-bit ID fixes
negative values are stored in app_version_id fields to represent
anonymous platform versions.
So need to use %ld rather than %lu for these fields.

Also there were a couple of more changes of int do DB_ID_TYPE
2015-07-29 17:32:57 -07:00
David Anderson 8cd8c8e7ee server software: handle 64-bit database IDs
The SETI@home result table is about to run out of 32-bit IDs,
so we need to move to 64-bit result IDs.
This will happen to the workunit table at some point too.

I changed the server C++ code to use the "long" type for all DB IDs
(and to use appropriate conversion codes like %lu).
"long" is 64 bit on 64-bit machines.
For uniformity I did this for all tables,
even ones (like app) that will never get big.

I chose NOT to change the DB schema for now.
The new code will work with 32-bit ID fields in the DB.
As projects approach the 32-bit limit on a table they can change
its ID field, and fields that reference this table, to BIGINT.
This is likely to happen only on the result and workunit tables.
I put functions in html/ops/db_update.php
to change the IDs of these tables.
2015-07-23 10:11:08 -07:00
David Anderson 60a3582151 scheduler: if client reports zero elapsed time, set ET to CPU time 2014-09-03 10:35:52 -07:00
David Anderson 7a65508923 scheduler: don't penalize hosts for jobs aborted by server
If a job was aborted by server (exit status is EXIT_ABORTED_BY_PROJECT)
don't decrement daily quota or reset the consecutive valid count.
2014-05-23 12:25:41 -07:00
David Anderson bb4f4194d0 scheduler: cap CPU time of reported results at elapsed time * ncpus
This affects only result display,
since CPU time is no longer used for anything.
2014-04-10 23:52:13 -07:00
David Anderson fc7c75b200 server: parse peak memory/disk info from client, store in DB, display in web
The latest client reports the peak working set size, swap size,
and disk usage for completed jobs.
Add fields to the results table to store these.
Parse them in scheduler request messages, and write to the DB.
Display them in the result web page.

This data can be used to improve (or even automate)
the job estimates for memory and disk usage.
2014-04-02 19:35:59 -07:00
Eric J Korpela 244ba5bc85 SCHED: modified scheduled log output to use unsigned format for WU and RESULT
ids.  This allows IDs greater than 2^31 to be printed.
2013-06-19 10:15:08 -07:00
David Anderson 73eebc69fc - scheduler: we were using CPU time for elapsed time
when the latter wasn't reported.
    Do this BEFORE sanity checks on elapsed time
    to prevent cheating.


svn path=/trunk/boinc/; revision=24906
2011-12-26 14:31:14 +00:00
David Anderson 4111c5696c - scheduler: fix crashing bug (don't memset SCHED_REQUEST).
svn path=/trunk/boinc/; revision=24660
2011-11-29 04:47:10 +00:00
David Anderson 8837274aae - scheduler: check for negative elapsed time in results
reported by client, set to zero


svn path=/trunk/boinc/; revision=23909
2011-07-31 17:31:12 +00:00
David Anderson 732866b8aa - back end: add two example trickle handlers:
trickle_credit: grants credit based on CPU time reported in msg
    trickle_echo: echoes trickle-up as a trickle-down

svn path=/trunk/boinc/; revision=23118
2011-02-27 00:10:14 +00:00
David Anderson b169e5ab0f - server programs: print error message instead of numeric retval
in log messages

svn path=/trunk/boinc/; revision=22647
2010-11-08 17:51:57 +00:00
David Anderson 5f5d64c978 - scheduler: check whether client is reporting the same result twice
in a given request message.
    Eliminate duplicates; they mess things up.
- scheduler: fix RESULT#0 problem in message log
- user web: keep credit totals when merging hosts by name

svn path=/trunk/boinc/; revision=22432
2010-09-30 21:40:44 +00:00
David Anderson 3dffe0a8bc - API: remove deprected stuff related to:
1) old-style apps with graphics in main program.
        No one should be using these anymore.
    2) writing init_data.xml in boinc_finish().
        This was used by deprecated "compound app" scheme
- scheduler: if request reports results that were previously reported,
    that's evidence that the previous reply was not received by client.
    It may have contained results.
    So set a "resend lost results" flag.

svn path=/trunk/boinc/; revision=22203
2010-08-11 22:02:41 +00:00
David Anderson 570c5cbfa4 - file deleter: if we're configured to generate cached MD5 checksums,
delete those files too.
- scheduler: add log messages (conditioned by debug_credit)
    if result.fpops_cumulative or result.fpops_per_cpu_sec is present

svn path=/trunk/boinc/; revision=22083
2010-07-31 04:08:14 +00:00
David Anderson 7c51512cbf - transitioner: the format string for a DB query had %.15d instead of %.15e.
That produced a messed-up query that assigned garbage values to:
        host_app_version.turnaround_var
        host_app_version.turnaround_q
        host_app_version.max_jobs_per_day
        host_app_version.consecutive_valid
    To repair these:
        - set turnaround_var and turnaround_q to zero
        - if max_jobs_per_day is outside of
            (0..config.daily_result_quota)
            set it to config.daily_result_quota
        - if consecutive_valid is outside (0..1000), set it to zero
    I added a script, html/ops/repair_21812.php, that does this;
    if you ran server code between [21181] and [21812], run this script.
- scheduler/transitioner: add <debug_quota> log flag
- changed the build system to always use -Wall
    (if we'd done this before, this bug wouldn't have happened)
- fixed a bunch of other compile warnings


svn path=/trunk/boinc/; revision=21812
2010-06-25 18:54:37 +00:00
David Anderson 4147249de2 - server: delete old credit stuff
- user web: show host link in user result list.  Fixes #999


svn path=/trunk/boinc/; revision=21735
2010-06-12 22:08:15 +00:00
David Anderson e80e54fd4d - user web: add "Application info" link in host page,
linking to new page showing host_app_versions for this host
- scheduler: message tweaks

svn path=/trunk/boinc/; revision=21690
2010-06-03 20:26:02 +00:00
David Anderson 89fab4ece5 - back end: change "daily result quota" mechanism.
Old: config.xml specifies an initial daily quota (say, 100).
        Each host_app_version starts out with this quota.
        On the return of a SUCCESS result,
        the quota is doubled, up to the initial value.
        On the return of an error result, or a timeout,
        the quota is decremented down to 1.
    Problem:
        Doesn't accommodate hosts that can do more than 100 jobs/day.
    New: similar, but
        - on validation of a job, daily quota is incremented.
        - on invalidation of a job, daily quota is decremented.
        - on return of an error result, or a timeout,
            daily quota is min'd with initial quota, then decremented.
    Notes:
        - This allows a host to have an unboundedly large quota
            as long as it continues to return more valid
            than invalid results.
        - Even with this change, hosts that return SUCCESS but
            invalid results will continue to get the initial daily quota.
            It would be desirable to reduce their quota to 1.

svn path=/trunk/boinc/; revision=21675
2010-06-02 00:11:01 +00:00
David Anderson 64def3d588 - scheduler: fix bug that caused resent jobs with anonymous platform
to have zero FPOPS est and bound

svn path=/trunk/boinc/; revision=21671
2010-06-01 19:56:54 +00:00
David Anderson 5035007b90 - back end: new way of deciding:
- whether host is "reliable" for an app version
    - whether host is eligible for single replication for an app version
    - whether to use host scaling
    In each case, the answer is yes if the number of
    consecutive valid results is above a threshold.
    This replaces existing "error rate" and "scale probation" mechanisms.

    TODO: the # of consecutive valid results should also determine
        a limit on jobs in progress for an app version.
        Namely, if N is the threshold for host scaling, the limit should be
            ndevices*(max(1, consecutive_valid - N))
        The client currently doesn't supply enough
        app version info to do this.
        It could be approximated; that would give some protection
        against cherry-picking.
- credit: more conservative formulas for combining claimed credit
    among replicas.
    If there are normal replicas, we use a "low average"
    that weights each sample by the sum of the other samples.
    Otherwise we use the min (not the average) of the approximate samples.

NOTE: a DB update is required


svn path=/trunk/boinc/; revision=21230
2010-04-21 19:33:20 +00:00
David Anderson b2451544e1 - server: change the following from per-host to per-(host, app version):
- daily quota mechanism
    - reliable mechanism (accelerated retries)
    - "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
    host_scale_check set.
- validator: do scale probation on invalid results
    (need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options

Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
    a host will have separate quotas for each one.
    That means it could error out on 100 jobs for cuda_fermi,
    and when its quota goes to zero,
    error out on 100 jobs for cuda23, etc.
    This is intentional; there may be cases where one version
    works but not the others.
- host.error_rate and host.max_results_day are deprecated

TODO:
    - the values in the app table for limits on jobs in progress etc.
        should override rather than config.xml.

Implementation notes:
scheduler:
    process_request():
        read all host_app_versions for host at start;
        Compute "reliable" and "trusted" for each one.
        write modified records at end
    get_app_version():
        add "reliable_only" arg; if set, use only reliable versions
        skip over-quota versions
    Multi-pass scheduling: if have at least one reliable version,
        do a pass for jobs that need reliable,
        and use only reliable versions.
        Then clear best_app_versions cache.
    Score-based scheduling: for need-reliable jobs,
        it will pick the fastest version,
        then give a score bonus if that version happens to be reliable.
    When get back a successful result from client:
        increase daily quota
    When get back an error result from client:
        impose scale probation
        decrease daily quota if not aborted
Validator:
    when handling a WU, create a vector of HOST_APP_VERSION
        parallel to vector of RESULT.
        Pass it to assign_credit_set().
        Make copies of originals so we can update only modified ones
    update HOST_APP_VERSION error rates
Transitioner:
    decrease quota on timeout


svn path=/trunk/boinc/; revision=21181
2010-04-15 03:13:56 +00:00
David Anderson da7e82fe15 - scheduler and back end: add new fields to result table:
elapsed_time: the elapsed time (runtime) as reported by client
    flops_estimate: the app's estimated FLOPS as reported by app_plan()
    app_version_id: the DB ID of the app_version used
        (or -1 if anonymous platform)
    TODO: show these in the web interfaces,
    and use them where appropriate

svn path=/trunk/boinc/; revision=19002
2009-09-03 20:26:31 +00:00
David Anderson 9e9f2a9878 - scheduler: code cleanup
svn path=/trunk/boinc/; revision=18896
2009-08-21 19:14:15 +00:00
David Anderson b300519444 svn path=/trunk/boinc/; revision=18825 2009-08-10 04:49:02 +00:00
Rytis Slatkevičius f239587bdb Sched: config option not to store stderr_out if exit_status==0 (to save on DB size). With help from Nicolas Alvarez.
svn path=/trunk/boinc/; revision=18528
2009-06-30 18:00:58 +00:00
David Anderson 10f9e11ee6 - lib: created a new file for declaring "replacements"
for functions like strlcpy() etc.
    config.h is included here rather than in str_util.h


svn path=/trunk/boinc/; revision=18437
2009-06-16 20:54:44 +00:00
David Anderson 8a765c0f4a - file deleter: detect cases where the upload/download dir doesn't exist,
and treat it as a recoverable error (i.e., retry).
    The file deleter may run on a host that NSF-mounts
    the upload/download dirs, and NSF mounts can file.
- scheduler: include WU#ID in log msgs for handled results

svn path=/trunk/boinc/; revision=18351
2009-06-10 17:42:18 +00:00
David Anderson c2fda4db09 - scheduler: add <report_max> config parameter;
limits the # of completed results handled per scheduler RPC.
    This may be needed to avoid crashes due to memory allocation
    failure (each reported result uses about 128KB memory).
- web: In showing result lists,
    include "Validate error" results in the "Invalid" category.
    (Previously they didn't appear in any category)

svn path=/trunk/boinc/; revision=18104
2009-05-14 19:01:40 +00:00
David Anderson dcc3bbe36f - scheduler: slight code cleanup
svn path=/trunk/boinc/; revision=17395
2009-02-26 03:03:35 +00:00
David Anderson 85a8e6a772 - scheduler: remove the config flag <have_cuda_apps>,
and add <cuda_multiplier>.
    The latter is used in calculating max jobs/day for a host;
    namely, it's host.max_results_day * (NCPUS + NCUDA*cuda_multiplier).
    Set it to 10 or so if you have CUDA apps.
- scheduler: don't overload effective_ncpus();
    instead, add two new functions,
    max_results_day_multiplier() and max_wus_in_progress_multiplier()
- scheduler: don't reduce max_results_day if we get an aborted job
    (it might have been aborted by the project;
    not appopriate to punish host in this case)

svn path=/trunk/boinc/; revision=16959
2009-01-20 00:54:16 +00:00
David Anderson 91e120b3f4 - scheduler: improve message formatting; add <debug_locality> flag
for locality scheduling messages

svn path=/trunk/boinc/; revision=16921
2009-01-15 20:23:20 +00:00
David Anderson 312ffba708 - API: remove BOINC_OPTIONS::worker_thread_stack_size
- web: check whether to show profile in separate function
    from displaying profile; eliminate double headers
- scheduler: finish purge of redundant arguments

svn path=/trunk/boinc/; revision=16726
2008-12-19 18:14:02 +00:00
David Anderson 98cfb8d3b0 - rename .C files to .cpp so that Doxygen will work
svn path=/trunk/boinc/; revision=16069
2008-09-26 18:20:24 +00:00