Commit Graph

21 Commits

Author SHA1 Message Date
David Anderson 5035007b90 - back end: new way of deciding:
    - whether host is "reliable" for an app version
    - whether host is eligible for single replication for an app version
    - whether to use host scaling
    In each case, the answer is yes if the number of
    consecutive valid results is above a threshold.
    This replaces existing "error rate" and "scale probation" mechanisms.

    TODO: the # of consecutive valid results should also determine
        a limit on jobs in progress for an app version.
        Namely, if N is the threshold for host scaling, the limit should be
            ndevices*(max(1, consecutive_valid - N))
        The client currently doesn't supply enough
        app version info to do this.
        It could be approximated; that would give some protection
        against cherry-picking.
- credit: more conservative formulas for combining claimed credit
    among replicas.
    If there are normal replicas, we use a "low average"
    that weights each sample by the sum of the other samples.
    Otherwise we use the min (not the average) of the approximate samples.

NOTE: a DB update is required


svn path=/trunk/boinc/; revision=21230
2010-04-21 19:33:20 +00:00
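
The two formulas in the message above (revision 21230) can be sketched in Python as follows. This is a minimal illustration, not the BOINC source: the function names `low_average` and `job_limit` and all sample values are hypothetical, and the weighting shown is one reading of "weights each sample by the sum of the other samples".

```python
def low_average(samples):
    """Weighted mean where each sample's weight is the sum of the others.

    Large outliers receive small weights, so the result is at most the
    plain mean. Assumes non-negative samples with a positive total.
    """
    n = len(samples)
    if n == 1:
        return samples[0]
    total = sum(samples)
    weighted = sum(x * (total - x) for x in samples)
    # The weights (total - x) add up to (n - 1) * total.
    return weighted / ((n - 1) * total)


def job_limit(ndevices, consecutive_valid, threshold):
    """Proposed cap on jobs in progress for one (host, app version).

    threshold is N, the host-scaling threshold from the TODO.
    """
    return ndevices * max(1, consecutive_valid - threshold)
```

For two samples the weighted form reduces to the harmonic mean: claims of 10 and 30 combine to 15 rather than the plain mean of 20.
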
David Anderson 6893691ae2 - validator: message tweak
svn path=/trunk/boinc/; revision=21212
2010-04-19 22:57:49 +00:00
David Anderson 61195cb59d - validator: fix bug where host.total_credit not incremented
svn path=/trunk/boinc/; revision=21211
2010-04-19 21:46:45 +00:00
David Anderson b71d3e6cf4 - back end: typo and tweaks
svn path=/trunk/boinc/; revision=21196
2010-04-16 21:16:18 +00:00
David Anderson 021edb02c2 - back end programs: improve log msgs
svn path=/trunk/boinc/; revision=21193
2010-04-16 18:07:08 +00:00
David Anderson 02717af2f3 - bug fixes
svn path=/trunk/boinc/; revision=21187
2010-04-15 21:58:44 +00:00
David Anderson b2451544e1 - server: change the following from per-host to per-(host, app version):
    - daily quota mechanism
    - reliable mechanism (accelerated retries)
    - "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
    host_scale_check set.
- validator: do scale probation on invalid results
    (need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options

Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
    a host will have separate quotas for each one.
    That means it could error out on 100 jobs for cuda_fermi,
    and when its quota goes to zero,
    error out on 100 jobs for cuda23, etc.
    This is intentional; there may be cases where one version
    works but not the others.
- host.error_rate and host.max_results_day are deprecated

TODO:
    - the values in the app table for limits on jobs in progress etc.
        should override those in config.xml.

Implementation notes:
scheduler:
    process_request():
        read all host_app_versions for host at start;
        Compute "reliable" and "trusted" for each one.
        write modified records at end
    get_app_version():
        add "reliable_only" arg; if set, use only reliable versions
        skip over-quota versions
    Multi-pass scheduling: if have at least one reliable version,
        do a pass for jobs that need reliable,
        and use only reliable versions.
        Then clear best_app_versions cache.
    Score-based scheduling: for need-reliable jobs,
        it will pick the fastest version,
        then give a score bonus if that version happens to be reliable.
    When get back a successful result from client:
        increase daily quota
    When get back an error result from client:
        impose scale probation
        decrease daily quota if not aborted
Validator:
    when handling a WU, create a vector of HOST_APP_VERSION
        parallel to vector of RESULT.
        Pass it to assign_credit_set().
        Make copies of originals so we can update only modified ones
    update HOST_APP_VERSION error rates
Transitioner:
    decrease quota on timeout


svn path=/trunk/boinc/; revision=21181
2010-04-15 03:13:56 +00:00
David Anderson e05a479f42 - scheduler and validator: distinguish between
    1) peak FLOPS (based on benchmarks or GPU attributes).
        This does not change over time.
        It's not adjusted on the basis of statistics.
        It's not affected by wu.rsc_fpops_est.
        It can be compared across projects.
    versus
    2) projected FLOPS: the scheduler's best guess as to what will satisfy
        X * elapsed_time = wu.rsc_fpops_est;
        this is used to make server-side runtime estimates,
        and it's sent to the client and used for its runtime estimates.
        It may be based on the (host, app version) elapsed time average.
    My checkin [21153] mistakenly confounded these two.

    Notes:
    1) app_plan() now must return both peak and projected FLOPS.
    2) result.flops_estimate stores peak FLOPS
    3) the <flops> field in app_info.xml files should be
        projected FLOPS.  But its accuracy is not important;
        it's not used once the server has statistics
        for the (host, app version)

svn path=/trunk/boinc/; revision=21164
2010-04-10 05:49:51 +00:00
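
The defining relation for projected FLOPS in the message above, X * elapsed_time = wu.rsc_fpops_est, can be illustrated with a short Python sketch. The function names and the figures in the usage note are hypothetical; only the relation itself comes from the message.

```python
def projected_flops(rsc_fpops_est, avg_elapsed_time):
    """Solve X * elapsed_time = rsc_fpops_est for X.

    avg_elapsed_time is the (host, app version) elapsed-time average
    in seconds for jobs of estimated size rsc_fpops_est.
    """
    return rsc_fpops_est / avg_elapsed_time


def estimated_runtime(rsc_fpops_est, proj_flops):
    """Runtime estimate in seconds for a job of the given estimated size."""
    return rsc_fpops_est / proj_flops
```

For example, a host that averages 1800 seconds on jobs with an estimate of 3.6e12 operations has a projected rate of 2e9, so a job estimated at 7.2e12 operations would be expected to take 3600 seconds. Peak FLOPS does not enter this calculation.
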
David Anderson 1d765245ed - scheduler: sweeping changes to the way job runtimes are estimated:
see http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation


svn path=/trunk/boinc/; revision=21153
2010-04-08 23:14:47 +00:00
David Anderson 212fb765e9 - validator: detect jobs that used GPU app but fell back to CPU
    (SETI@home does this if GPU initialization fails).
    Treat these like CPU apps for credit purposes.

svn path=/trunk/boinc/; revision=21130
2010-04-06 23:48:35 +00:00
David Anderson e276aa5ed6 - server: make the -d 4 feature work with FCGI
svn path=/trunk/boinc/; revision=21109
2010-04-05 23:12:02 +00:00
David Anderson 2536797068 - validator: remove update_credit_per_cpu_sec(). Irrelevant.
    TODO: remove related code
- validator: update wu.canonical_credit correctly.
    However, this field should be deprecated.
- validator: check for error return from assign_credit_set().

svn path=/trunk/boinc/; revision=21096
2010-04-05 20:03:54 +00:00
David Anderson a2a661993b - validator: -d 4 means -d 3 plus print all DB queries
    (todo: do this for all daemons)
- validator: change cmdline args from -foo to --foo
    (todo: do this for all daemons)
- validator: pass max_granted_credit to assign_credit_set()

svn path=/trunk/boinc/; revision=21093
2010-04-05 18:59:16 +00:00
David Anderson 54dce55e98 - backend: fix scaling problem that was producing xe15 size credits.
    This had messed up the beta DB, which I had to clean up.
    Added a cap (1e5) to prevent this in the future.

svn path=/trunk/boinc/; revision=21064
2010-04-02 23:18:47 +00:00
David Anderson 78d11a263b - backend: improved messages for app version credit updates
svn path=/trunk/boinc/; revision=21063
2010-04-02 21:45:43 +00:00
David Anderson 19f7d66b53 - backend programs: change the way PFC and elapsed-time statistics
    are written to the DB.
    The incremental approach was bogus.
    New approach:
    host_app_version: write directly; R/W interval is tiny
    app_version: maintain an explicit list of update samples
        for both PFC and credit.
        When the validator flushes its app_version cache,
        do careful updates.
    Note: when using double fields in careful updates,
    you can't test for equality.  Use abs(new-old) < 1e-N

svn path=/trunk/boinc/; revision=21057
2010-04-02 19:10:37 +00:00
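
The note on double fields in the message above (revision 21057) can be sketched as follows. This assumes a "careful update" is an UPDATE whose WHERE clause also checks that the row still holds the value read earlier. The sketch uses Python's built-in `sqlite3` module rather than the project's real database, and the table layout, the function name `careful_update`, and the tolerance `EPSILON` are all hypothetical.

```python
import sqlite3

EPSILON = 1e-6  # hypothetical tolerance; the message only says 1e-N


def careful_update(conn, row_id, old_value, new_value):
    """Update pfc_avg only if it still (approximately) equals old_value.

    Returns True if the row was updated, False if the stored value no
    longer matches what the caller read earlier.
    """
    cur = conn.execute(
        "UPDATE app_version SET pfc_avg = ? "
        "WHERE id = ? AND abs(pfc_avg - ?) < ?",
        (new_value, row_id, old_value, EPSILON),
    )
    conn.commit()
    return cur.rowcount == 1


# Small in-memory table for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_version (id INTEGER PRIMARY KEY, pfc_avg REAL)")
# 0.1 + 0.2 is not exactly 0.3 in floating point, so "pfc_avg = 0.3" would miss.
conn.execute("INSERT INTO app_version VALUES (1, 0.1 + 0.2)")
```

With this table, `careful_update(conn, 1, 0.3, 0.5)` succeeds even though the stored value is not bit-for-bit equal to 0.3, while a second call that still passes 0.3 as the old value is rejected because the row has since changed.
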
David Anderson 38bd1c8def - validator: improved log messages
- fix some compiler warnings


svn path=/trunk/boinc/; revision=21053
2010-04-01 22:51:19 +00:00
David Anderson fb851311e0 - server: various changes;
see http://boinc.berkeley.edu/trac/wiki/CreditNew

    Projects will need to update DB and recompile all back-end programs.

    Summary:
    - new way of computing credit
    - "reliable host" mechanism is per app version
    - "host punishment" mechanism is per app version
    - adjustment of wu.rsc_fpops_est provides the
        equivalent of per app version DCF
    - max jobs in progress is now per app
    - max jobs per RPC is now per app

    TODO:
    - reliable mechanism:
        - populate and use host_app_version.error_rate
        - populate host_app_version.turnaround
    - host punishment:
        - populate host_app_version.max_jobs_per_day
        - populate host_app_version.n_jobs_today
        - use app.max_jobs_per_day_init
    - job limits:
        - use app.max_jobs_in_progress, max_gpu_jobs_in_progress
        - use app.max_jobs_per_rpc
    - adjust wu.rsc_fpops_est
    - remove old credit stuff
        fpops_cumulative, credit_multiplier
        credit computation in scheduler

- AVERAGE class: use the Knuth algorithm (Wikipedia)


svn path=/trunk/boinc/; revision=21021
2010-03-29 22:28:20 +00:00
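
The message above says only that the AVERAGE class uses "the Knuth algorithm (Wikipedia)". Assuming this refers to the online mean and variance algorithm that Knuth attributes to Welford, a minimal Python sketch looks like this. The class name `RunningAverage` and its methods are hypothetical and do not mirror the actual AVERAGE class.

```python
import math


class RunningAverage:
    """Online mean and variance (Welford's algorithm, as given by Knuth)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; 0.0 until at least two samples are seen."""
        if self.n < 2:
            return 0.0
        return self.m2 / (self.n - 1)

    def stddev(self):
        return math.sqrt(self.variance())
```

The appeal of this form is that it needs only three stored numbers per statistic and avoids the cancellation errors of the "sum of squares minus square of sum" formula.
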
David Anderson 3fb7c8f13f - server code: moved everything related to credit-granting to credit.cpp,
where it can be used by trickle handlers as well as by validators.

svn path=/trunk/boinc/; revision=18831
2009-08-12 16:26:43 +00:00
David Anderson 3eeefc0048 - server code cleanup
svn path=/trunk/boinc/; revision=18830
2009-08-12 16:01:46 +00:00
David Anderson f6d3e8a477
svn path=/trunk/boinc/; revision=18829
2009-08-11 15:17:37 +00:00