Commit Graph

89 Commits

Author SHA1 Message Date
David Anderson 01402bb45a - client: improve GPU scheduling
old: assign GPUs, then check available RAM
        Problem: may cause starvation on multi-GPU systems.
    new: use available RAM info in the assignment process.
        Prevents starvation, also reduces the number of driver calls.

svn path=/trunk/boinc/; revision=21205
2010-04-18 03:00:33 +00:00
David Anderson b2451544e1 - server: change the following from per-host to per-(host, app version):
- daily quota mechanism
    - reliable mechanism (accelerated retries)
    - "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
    host_scale_check set.
- validator: do scale probation on invalid results
    (need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options

Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
    a host will have separate quotas for each one.
    That means it could error out on 100 jobs for cuda_fermi,
    and when its quota goes to zero,
    error out on 100 jobs for cuda23, etc.
    This is intentional; there may be cases where one version
    works but not the others.
- host.error_rate and host.max_results_day are deprecated

TODO:
    - the values in the app table for limits on jobs in progress etc.
        should override those in config.xml, rather than the reverse.

Implementation notes:
scheduler:
    process_request():
        read all host_app_versions for host at start;
        Compute "reliable" and "trusted" for each one.
        write modified records at end
    get_app_version():
        add "reliable_only" arg; if set, use only reliable versions
        skip over-quota versions
    Multi-pass scheduling: if have at least one reliable version,
        do a pass for jobs that need reliable,
        and use only reliable versions.
        Then clear best_app_versions cache.
    Score-based scheduling: for need-reliable jobs,
        it will pick the fastest version,
        then give a score bonus if that version happens to be reliable.
    When get back a successful result from client:
        increase daily quota
    When get back an error result from client:
        impose scale probation
        decrease daily quota if not aborted
Validator:
    when handling a WU, create a vector of HOST_APP_VERSION
        parallel to vector of RESULT.
        Pass it to assign_credit_set().
        Make copies of originals so we can update only modified ones
    update HOST_APP_VERSION error rates
Transitioner:
    decrease quota on timeout


svn path=/trunk/boinc/; revision=21181
2010-04-15 03:13:56 +00:00
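
The per-(host, app version) quota handling described in revision 21181 can be sketched roughly as follows. This is an illustration only, not the scheduler's code: the struct `HavQuota`, the function names, and the specific adjustment rules (double on success up to a project limit, subtract one on error or timeout, floor of 1) are assumptions made for the example. The commit message only states that the quota rises on success and falls on errors (unless the job was aborted) and on timeouts.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-(host, app version) record; names are illustrative.
struct HavQuota {
    int max_jobs_per_day;  // current daily quota for this (host, app version)
    int jobs_today;        // jobs already sent today
};

// Assumed policy: a successful result doubles the quota, capped at the project limit.
inline void on_success(HavQuota& q, int project_limit) {
    assert(project_limit >= 1);
    q.max_jobs_per_day = std::min(q.max_jobs_per_day * 2, project_limit);
}

// Assumed policy: an error result lowers the quota by one, unless the job was aborted.
inline void on_error(HavQuota& q, bool aborted) {
    if (aborted) return;
    q.max_jobs_per_day = std::max(q.max_jobs_per_day - 1, 1);
}

// The transitioner's timeout case, handled like a non-aborted error (assumption).
inline void on_timeout(HavQuota& q) {
    q.max_jobs_per_day = std::max(q.max_jobs_per_day - 1, 1);
}

// get_app_version() would skip versions for which this returns true.
inline bool over_quota(const HavQuota& q) {
    return q.jobs_today >= q.max_jobs_per_day;
}
```

Because each plan class has its own record, a host that exhausts its quota for one version (say cuda_fermi) is unaffected for another (say cuda23), which matches the behavior the notes call intentional.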
David Anderson e05a479f42 - scheduler and validator: distinguish between
1) peak FLOPS (based on benchmarks or GPU attributes).
        This does not change over time.
        It's not adjusted on the basis of statistics.
        It's not affected by wu.rsc_fpops_est.
        It can be compared across projects.
    versus
    2) projected FLOPS: the scheduler's best guess as to what will satisfy
        X * elapsed_time = wu.rsc_fpops_est;
        this is used to make server-side runtime estimates,
        and it's sent to the client and used for its runtime estimates.
        It may be based on the (host, app version) elapsed time average.
    My checkin [21153] mistakenly conflated these two.

    Notes:
    1) app_plan() now must return both peak and projected FLOPS.
    2) result.flops_estimate stores peak FLOPS
    3) the <flops> field in app_info.xml files should be
        projected FLOPS.  But its accuracy is not important;
        it's not used once the server has statistics
        for the (host, app version)

svn path=/trunk/boinc/; revision=21164
2010-04-10 05:49:51 +00:00
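
The distinction drawn in revision 21164 can be illustrated with a small sketch. Projected FLOPS is the X satisfying X * elapsed_time = wu.rsc_fpops_est, so given an average of elapsed time per estimated FLOP for a (host, app version), X is the reciprocal of that average. The names `HavStats`, `kMinSamples`, `projected_flops()` and `estimated_runtime()`, and the fallback rule of using the app_plan() value until enough samples exist, are assumptions for this example rather than the actual scheduler code.

```cpp
#include <cassert>

// Hypothetical statistics kept per (host, app version).
struct HavStats {
    double avg_elapsed_per_fpop;  // mean of elapsed_time / rsc_fpops_est, in seconds
    int n_samples;                // number of results contributing to the mean
};

// Assumed threshold below which the statistics are not trusted.
const int kMinSamples = 10;

// Projected FLOPS: use statistics when available, otherwise fall back to
// the projected value supplied by app_plan() (or <flops> in app_info.xml).
inline double projected_flops(const HavStats& s, double plan_projected_flops) {
    if (s.n_samples >= kMinSamples && s.avg_elapsed_per_fpop > 0) {
        return 1.0 / s.avg_elapsed_per_fpop;
    }
    return plan_projected_flops;
}

// Server-side runtime estimate derived from projected (not peak) FLOPS.
inline double estimated_runtime(double rsc_fpops_est, double projected) {
    assert(projected > 0);
    return rsc_fpops_est / projected;
}
```

Peak FLOPS never enters this calculation; it would be stored separately (for example in result.flops_estimate) and stays comparable across projects.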
David Anderson 132a35c38a typo
svn path=/trunk/boinc/; revision=21154
2010-04-09 03:45:25 +00:00
David Anderson 85e06afe4b - scheduler: app_plan() no longer has to guess how efficiently
an app version will run on a particular host.
- scheduler: fix memory leak: BEST_APP_VERSIONs weren't being freed


svn path=/trunk/boinc/; revision=21148
2010-04-08 18:27:27 +00:00
David Anderson 4462fe534b - client: don't do RSS fetch if network suspended
svn path=/trunk/boinc/; revision=21123
2010-04-06 20:32:02 +00:00
David Anderson ad3ed99b96 - scheduler: choose cuda_fermi over other cuda plan classes
svn path=/trunk/boinc/; revision=21052
2010-04-01 21:18:16 +00:00
David Anderson a8ed958cd6 - scheduler: cuda_fermi class needs CUDA version 3.0+
- boinccmd: "result" -> "task"

svn path=/trunk/boinc/; revision=20784
2010-03-03 22:36:36 +00:00
David Anderson 9020d0b715 - server: if MySQL version is 5.0.19 <= v < 5.1,
set the reconnect option before real_connect() instead of after.
    From Oliver Bock.

svn path=/trunk/boinc/; revision=20763
2010-03-01 19:12:19 +00:00
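
The version window in revision 20763 can be expressed as a small predicate. This sketch assumes the version is encoded as major*10000 + minor*100 + patch, the format returned by the MySQL C API function mysql_get_client_version(); the function name `set_reconnect_before_connect()` is hypothetical, and the sketch does not show the actual call that sets the reconnect option.

```cpp
// True when the reconnect option should be set before real_connect(),
// i.e. for versions v with 5.0.19 <= v < 5.1.
// v is encoded as major*10000 + minor*100 + patch (e.g. 5.0.19 -> 50019).
constexpr bool set_reconnect_before_connect(unsigned long v) {
    return v >= 50019UL && v < 50100UL;
}
```

For any version outside that window the option would be set after connecting, as before.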
David Anderson f82216e203 - scheduler: add plan class "cuda_fermi":
requires CUDA 2.3 and compute capability 2.0+

svn path=/trunk/boinc/; revision=20748
2010-02-26 23:45:12 +00:00
David Anderson 84701e3624 - scheduler: add example code for SETI@home's situation
where app has both GPU and CPU versions,
    but for certain jobs (VLAR WUs in this case)
    the GPU version performs poorly and shouldn't be used.

    The fix is a kludge - it will result in these jobs
    not being sent to the host at all,
    rather than being sent with the CPU app.
    The current architecture makes it difficult to do otherwise.
    One possible fix would be to create a separate app
    for VLAR jobs, with only CPU app versions.


svn path=/trunk/boinc/; revision=20419
2010-02-04 17:34:55 +00:00
David Anderson 78695969e4 - scheduler: don't send CUDA jobs to Macs with client < 6.10.28;
they'll crash.

svn path=/trunk/boinc/; revision=20360
2010-02-02 17:18:39 +00:00
David Anderson 746b75d0d0 - scheduler: use a more accurate way of estimating total FLOPS
and avg_ncpus for GPU apps.
    App versions are now characterized by two parameters
    (we assume that the app uses either the CPU or the GPU
    at a given time, but not both):
    - the fraction of FLOPs performed on the CPU
    - when the app is using the GPU, the fraction of peak FLOPS
        that it gets.
    We then run the numbers to get the total FLOPS and avg_ncpus.


svn path=/trunk/boinc/; revision=19977
2009-12-18 23:28:10 +00:00
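
One way to "run the numbers" for the two-parameter model in revision 19977 is sketched below. Under the stated assumption that the app uses either the CPU or the GPU at a given time, the time per unit of work is the sum of the CPU part and the GPU part; total FLOPS is the reciprocal of that sum and avg_ncpus is the CPU share of it. The names `GpuAppUsage` and `estimate_usage()` are hypothetical, and this derivation is an interpretation of the commit message, not a copy of the scheduler's function.

```cpp
#include <cassert>

// Result of the estimate; names are illustrative.
struct GpuAppUsage {
    double flops;      // effective total FLOPS of the app version
    double avg_ncpus;  // average fraction of a CPU used over the run
};

// cpu_frac: fraction of the app's FLOPs performed on the CPU (0..1)
// gpu_eff:  fraction of the GPU's peak FLOPS the app gets while on the GPU (0..1]
inline GpuAppUsage estimate_usage(double cpu_flops, double gpu_peak_flops,
                                  double cpu_frac, double gpu_eff) {
    assert(cpu_flops > 0 && gpu_peak_flops > 0);
    assert(cpu_frac >= 0 && cpu_frac <= 1);
    assert(gpu_eff > 0 && gpu_eff <= 1);

    // Seconds spent on each device per FLOP of total work.
    double cpu_time = cpu_frac / cpu_flops;
    double gpu_time = (1.0 - cpu_frac) / (gpu_eff * gpu_peak_flops);
    double total = cpu_time + gpu_time;

    return GpuAppUsage{1.0 / total, cpu_time / total};
}
```

For example, with a 1 GFLOPS CPU, a 100 GFLOPS (peak) GPU, 1% of the work on the CPU and 10% GPU efficiency, the sketch yields roughly 9.2 GFLOPS total and an avg_ncpus of about 0.09.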
David Anderson a151ad6cb3 - client/scheduler: deal with situation where GPU has enough
RAM to run job, but when we actually run the job
    not enough GPU RAM is free, so the application fails.
    This can cause a large number of jobs to fail.
    Solution:
    - app_plan() can specify the GPU RAM requirements of an app version.
        This is passed to the client in a new field
        <gpu_ram> of the <app_version> element.
    - prior to starting or restarting a GPU app, the client
        checks the amount of free RAM on the particular GPU.
        If it's not enough for the app version,
        the client doesn't start it,
        and arranges for the scheduler to ignore it for 5 minutes
        (by which point there might be more free GPU RAM)
    Notes:
    1) this change will have effect only when
        both client and scheduler are updated.
    2) the check is done in enforce_schedule(),
        rather than schedule_cpus(),
        because only at that point
        have we assigned a specific GPU to the job.
    3) there's another case to deal with:
        a GPU app's malloc of GPU RAM fails in the middle of the job.
        Currently the job fails.
        I plan to add an API call boinc_temporary_exit(x) so
        that the job can exit and potentially restart in x seconds.
        (In principle this mechanism is sufficient for all cases,
        but it could lead to a lot of starting/exiting,
        so the current change is worthwhile).

svn path=/trunk/boinc/; revision=19864
2009-12-11 22:45:59 +00:00
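
The client-side check from revision 19864 might look something like the sketch below: before starting or restarting a GPU job, compare the free RAM on the assigned GPU with the app version's <gpu_ram> requirement, and if it falls short, have the client's job scheduler ignore the job for 5 minutes. The struct `GpuJob`, the function `try_start_on_gpu()`, and the way free RAM is passed in are assumptions for illustration; the real check lives in enforce_schedule() and queries the GPU directly.

```cpp
#include <cassert>

// "5 minutes" from the commit message, in seconds.
const double GPU_RAM_DEFER_SECS = 300;

// Hypothetical view of a GPU job that has been assigned a specific GPU.
struct GpuJob {
    double gpu_ram_needed;  // bytes, from <gpu_ram> in <app_version>
    double defer_until;     // time before which the job is ignored
};

// Returns true if the job may start now. Otherwise the job is left
// unstarted, and if the cause is a RAM shortfall it is deferred.
inline bool try_start_on_gpu(GpuJob& job, double gpu_free_ram, double now) {
    assert(job.gpu_ram_needed >= 0);
    if (now < job.defer_until) {
        return false;  // still inside an earlier deferral window
    }
    if (gpu_free_ram < job.gpu_ram_needed) {
        job.defer_until = now + GPU_RAM_DEFER_SECS;
        return false;  // not enough free GPU RAM right now
    }
    return true;
}
```

The check has to happen at this late stage because, as note 2 says, only then is it known which GPU the job will run on.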
David Anderson fe2a18f282 - client/scheduler: standardize the FLOPS estimate between NVIDIA and ATI.
Make them both peak FLOPS,
    according to the formula supplied by the manufacturer.

    The impact on the client is minor:
    - the startup message describing the GPU
    - the weight of the resource type in computing long-term debt

    On the server, I changed the example app_plan() function
    to assume that app FLOPS is 20% of peak FLOPS
    (that's about what it is for SETI@home)

svn path=/trunk/boinc/; revision=19310
2009-10-16 00:13:01 +00:00
David Anderson 62c1c4811b - scheduler: fix app_plan_uses_gpu()
svn path=/trunk/boinc/; revision=19172
2009-09-25 21:06:34 +00:00
David Anderson 67a42e0106 svn path=/trunk/boinc/; revision=19171 2009-09-25 20:59:13 +00:00
Rom Walton 2f61827ea0 - scheduler: set up priorities for the ATI plan classes
sched/
        sched_customize.cpp

svn path=/trunk/boinc/; revision=19169
2009-09-25 18:39:05 +00:00
David Anderson 348f6e6db8 - scheduler: fix app_plan() bug, improve ATI-related msgs
svn path=/trunk/boinc/; revision=19164
2009-09-25 16:35:43 +00:00
Rom Walton ad455ab09d - client: Add support for checking for both amd* prefixed CAL libraries
and ati* prefixed CAL libraries.
    - scheduler: redefine ati plan classes again.
        ati: CAL 1.0+, amd* prefixed libraries
        ati13amd: CAL 1.3+, amd* prefixed libraries
        ati13ati: CAL 1.3+, ati* prefixed libraries
        ati14: CAL 1.4+, ati* prefixed libraries

    sched/
        sched_customize.cpp
    lib/
        coproc.cpp, .h

svn path=/trunk/boinc/; revision=19162
2009-09-25 15:40:16 +00:00
David Anderson cfcfeffd21 - client: for ATI enumeration, use only aticalrt.dll
(amdcalrt.dll is old version w/ funky DLL names)
- client: make GPU enumeration warnings more consistent
    (e.g., "NVIDIA" instead of "CUDA").
- scheduler: get rid of ati13 plan class.
    Require 1.4+ driver for plan class ati.

svn path=/trunk/boinc/; revision=19158
2009-09-24 18:33:40 +00:00
Rom Walton 9dd532b54e - scheduler: bug fix.
sched/
        sched_customize.cpp

svn path=/trunk/boinc/; revision=19150
2009-09-23 23:59:19 +00:00
David Anderson 9049f5fa14 - scheduler: change it to:
"ati" means CAL 1.2 or less (Catalyst 9.1 or less)
    "ati13" means CAL 1.3 or greater (Catalyst 9.2+)

svn path=/trunk/boinc/; revision=19149
2009-09-23 22:55:14 +00:00
David Anderson 2282c901d4 - scheduler: add a plan class "ati13186" for apps that require
CAL version 1.3.186 or greater.


svn path=/trunk/boinc/; revision=19148
2009-09-23 21:47:52 +00:00
David Anderson 1c8af5232d - scheduler: add comments in sched_customize.cpp to say that
wu_is_infeasible_custom() can assign the resource usage
    and/or FLOPS estimate for a particular host.

svn path=/trunk/boinc/; revision=19083
2009-09-17 21:34:42 +00:00
David Anderson 3bcaefd1d7 - web: show BOINC version in host displays
svn path=/trunk/boinc/; revision=19038
2009-09-10 20:30:46 +00:00
David Anderson 9d38ecb835 svn path=/trunk/boinc/; revision=19029 2009-09-08 19:30:06 +00:00
David Anderson 8b701fc73f - scheduler: fix messed-up deadline check logic.
Old:
        1) check deadline based on wu.delay_bound
        2) in add_result_to_reply(), potentially modify wu.delay_bound,
            e.g. because of retry acceleration
        problem: reducing delay bound may cause deadline miss
    New:
        1) new function get_delay_bound_range()
            (called from wu_is_infeasible_fast())
            returns optimistic and pessimistic delay bounds.
            Retry acceleration logic is here.
        2) check deadline based on optimistic bound;
            if that fails, check based on pessimistic bound.
            Set wu.delay_bound to the one that worked.
    Notes:
    - get_delay_bound_range() needs result priority and report deadline,
        and it's called before we read the full result.
        So add these items to WORK_ITEM and WU_RESULT.
    - get_delay_bound_range() could be customized for
        project-specific deadline policy.
    - add_result_to_reply() was becoming a toxic waste dump.
        Deadline-related stuff should have been factored out in any case.

svn path=/trunk/boinc/; revision=18946
2009-08-31 19:35:46 +00:00
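
The new deadline logic in revision 18946 can be sketched as a two-step check. This is a simplified illustration under assumptions: the optimistic bound is taken to be the tighter one (the delay bound shortened by retry acceleration) and the pessimistic bound the full wu.delay_bound, since falling back only helps if the pessimistic bound is at least as large. The names `DelayBoundRange`, `delay_bound_range()` and `choose_delay_bound()`, and the `accel_factor` parameter, are hypothetical stand-ins and not the actual signature of get_delay_bound_range().

```cpp
#include <cassert>

// Optimistic and pessimistic delay bounds, in seconds.
struct DelayBoundRange {
    double opt;
    double pess;
};

// Stand-in for get_delay_bound_range(): retry acceleration (assumed to be
// a multiplicative factor in (0, 1]) shortens only the optimistic bound.
inline DelayBoundRange delay_bound_range(double wu_delay_bound,
                                         bool accelerate_retry,
                                         double accel_factor) {
    assert(accel_factor > 0 && accel_factor <= 1);
    double opt = accelerate_retry ? wu_delay_bound * accel_factor
                                  : wu_delay_bound;
    return DelayBoundRange{opt, wu_delay_bound};
}

// Check the deadline against the optimistic bound first, then the
// pessimistic one. Returns the bound that worked (to be stored in
// wu.delay_bound), or 0 if the job cannot meet either.
inline double choose_delay_bound(const DelayBoundRange& r,
                                 double est_completion_delay) {
    if (est_completion_delay <= r.opt) return r.opt;
    if (est_completion_delay <= r.pess) return r.pess;
    return 0;
}
```

Because the bound is chosen before the feasibility decision is final, shortening it later can no longer cause an unexpected deadline miss, which was the problem with the old ordering.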
David Anderson 720ec66f28 - scheduler: fix CUDA RAM warning msg
svn path=/trunk/boinc/; revision=18922
2009-08-26 16:10:46 +00:00
David Anderson eafb410cf8 - scheduler: simplify and fix the way that app_plan() conveys messages
to the user.  app_plan() now generates the messages directly
    rather than returning integer error codes.

svn path=/trunk/boinc/; revision=18899
2009-08-21 20:38:39 +00:00
David Anderson 9e9f2a9878 - scheduler: code cleanup
svn path=/trunk/boinc/; revision=18896
2009-08-21 19:14:15 +00:00
Eric J. Korpela 208956257a - Moved credit.cpp into libboinc_sched where it should have gone in the first
place.
- Added a separate GPU memory requirement for the CUDA23 plan


svn path=/trunk/boinc/; revision=18887
2009-08-21 02:12:31 +00:00
David Anderson 073e6ded2c - client and scheduler: lay the groundwork for "fractional coproc jobs",
e.g. the Milkyway@home ATI app, of which we can typically run
    2 or 3 instances at once on a GPU.
    Changes include:
    - In APP_VERSION, don't use a COPROCS to represent the GPU
        requirements; just use doubles ncudas and natis.
    - sufficient_coprocs() etc. are no longer members of COPROCS
    - in HOST_USAGE, ncudas and natis are doubles
    - in scheduler request, req_instances is now a double

    This checkin doesn't include the job scheduling logic,
    i.e. assigning jobs to GPUs.  That will follow.

svn path=/trunk/boinc/; revision=18868
2009-08-19 18:41:47 +00:00
David Anderson 7278ab1787 - scheduler: add support for ATI GPUs
svn path=/trunk/boinc/; revision=18851
2009-08-17 17:07:38 +00:00
David Anderson ca24bc3cb1 - scheduler: fixes for cuda23 plan class
svn path=/trunk/boinc/; revision=18841
2009-08-14 02:42:52 +00:00
David Anderson b300519444 svn path=/trunk/boinc/; revision=18825 2009-08-10 04:49:02 +00:00
David Anderson f163897d8a - scheduler: add plan class for CUDA 2.3
svn path=/trunk/boinc/; revision=18804
2009-08-03 21:30:19 +00:00
David Anderson fa0c32c20e - scheduler: compile fixes
svn path=/trunk/boinc/; revision=18783
2009-07-30 17:00:43 +00:00
David Anderson e3363c7eb8 - scheduler: on second thought, it would be better to add the above
feature without requiring use of score-based scheduling.
    So add a new customizable function, wu_is_infeasible_custom(),
    where projects can put job-specific checks.

    Also, move customizable functions (of which there are now 4)
    to a new file, sched_customize.cpp.

svn path=/trunk/boinc/; revision=18767
2009-07-29 18:55:50 +00:00