248 Commits

Author SHA1 Message Date
David Anderson d6da81b862 client: fix bugs with CPU throttling and GPU apps
Various bad things could happen when CPU throttling was used together with GPU apps.
Examples:
- on a multi-GPU system, several GPU tasks are assigned to the same GPU
- a suspended GPU task remains in memory (tying up its GPU resources)
while other tasks try to use the GPU.

The problem was that parts of the code assumed that suspended
GPU processes don't exist - i.e. that when a GPU task is suspended
it's always removed from memory.
This isn't true in the presence of CPU throttling.

So I made the following changes:
- When assigning GPUs to tasks, treat suspended tasks like running tasks
  (i.e. reserve their GPUs)
- At the end of the CPU-scheduling logic, if there are any GPU tasks
  that are suspended and not scheduled, remove them from memory,
  and trigger a reschedule so we can reallocate their GPUs.

Also, a cosmetic change: in the resource usage string shown in the GUI,
include "(device X)" even if the task is suspended (i.e. because of throttling).

Also: zero out COPROC::opencl_device_indexes[] so we don't write
a garbage number to init_data.xml for non-OpenCL jobs
2013-11-29 11:44:09 -08:00
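
The first change in d6da81b862 (treat suspended tasks like running tasks when assigning GPUs) can be illustrated with a minimal C++ sketch. The task model, `TaskState`, `GpuTask`, `reserved_gpus()` and `pick_free_gpu()` are hypothetical names made up for illustration; they are not taken from the BOINC source.

```cpp
#include <vector>

// Hypothetical, simplified task model (not BOINC's actual data structures).
enum class TaskState { Running, Suspended, NotInMemory };

struct GpuTask {
    TaskState state;
    int device;  // index of the GPU the task holds, or -1 if none
};

// Returns, for each GPU, whether a task in memory already holds it.
// Suspended tasks count as holders, exactly like running ones.
std::vector<bool> reserved_gpus(const std::vector<GpuTask>& tasks, int ngpus) {
    std::vector<bool> reserved(ngpus, false);
    for (const GpuTask& t : tasks) {
        if (t.device < 0 || t.device >= ngpus) continue;
        if (t.state == TaskState::Running || t.state == TaskState::Suspended) {
            reserved[t.device] = true;
        }
    }
    return reserved;
}

// Picks the first unreserved GPU, or -1 if every GPU is taken.
int pick_free_gpu(const std::vector<bool>& reserved) {
    for (int i = 0; i < static_cast<int>(reserved.size()); ++i) {
        if (!reserved[i]) return i;
    }
    return -1;
}
```

With a throttled (suspended) task on GPU 0 of a two-GPU host, a new task would be steered to GPU 1 rather than doubling up on GPU 0.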
David Anderson 3d910a0190 client: message tweak 2013-11-13 21:24:16 -08:00
David Anderson 45dfb684a6 Client: don't allow more than 1000 slot dirs.
There was a report of a situation where the client created unbounded slot dirs.
Not sure why this happened, but may as well impose a limit.
2013-10-23 21:37:24 -07:00
David Anderson 39af029598 client: mostly revert dddf586, which could lead to way overcommitted CPU 2013-07-03 00:56:01 -07:00
David Anderson dddf586532 client: remove code that avoids overcommitting CPUs if MT jobs present.
This can lead to starving the CPUs if there are both GPU and MT jobs.
The basic problem is that a host with GPUs will never have all its CPUs
available for MT jobs.
It should probably advertise fewer CPUs, or something.
2013-06-17 08:48:05 -07:00
David Anderson 4323afee1f client: task schedule tweak to avoid starvation case
In enforce_run_list(), don't count the RAM usage of NCI tasks.
NCI tasks run sporadically, so it doesn't make sense to count their RAM usage;
doing so can starve regular jobs in some cases.
2013-05-09 15:24:44 -07:00
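
A rough C++ sketch of the RAM accounting described in 4323afee1f, assuming a simple greedy pass over the run list. The `Job` struct and `jobs_that_fit()` are hypothetical; the real enforce_run_list() does considerably more than this.

```cpp
#include <vector>

// Hypothetical job record for illustration only.
struct Job {
    double working_set_bytes;
    bool non_cpu_intensive;  // "NCI" in the commit message
};

// Walks the run list in order and returns the indices of jobs admitted.
// NCI jobs are always admitted and their memory is not counted;
// other jobs are admitted only while they fit under the RAM limit.
std::vector<int> jobs_that_fit(const std::vector<Job>& run_list,
                               double ram_limit_bytes) {
    std::vector<int> chosen;
    double ram_used = 0.0;
    for (int i = 0; i < static_cast<int>(run_list.size()); ++i) {
        const Job& j = run_list[i];
        if (j.non_cpu_intensive) {
            chosen.push_back(i);  // no RAM charge for NCI jobs
            continue;
        }
        if (ram_used + j.working_set_bytes > ram_limit_bytes) continue;
        ram_used += j.working_set_bytes;
        chosen.push_back(i);
    }
    return chosen;
}
```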
David Anderson 6b6c2ac519 - client: fix bug that could cause idle GPUs when exclusions are present.
The basic problem: the way we assign GPU instances when creating
        the "run list" is slightly different from the way we assign them
        when we actually run the jobs;
        the latter assigns a running job to the instance it's using,
        but the former doesn't.
    Solution (kludge): when building the run list,
        don't reserve instances for currently running jobs.
        This will result in more jobs in the run list, and avoid starvation.
        For efficiency, do this only if there are exclusions for this type.
    Comment: this is yet another complexity that would be eliminated
        if GPU instances were modeled separately.
        I wish I had time to do that.
- client emulator: change default latency bound from 1 day to 10 days
2013-04-07 13:00:15 -07:00
David Anderson 1b9ad86694 - client: don't prefix <task> messages with [task] 2013-04-02 12:31:32 -07:00
David Anderson b93e80c6f5 - client: code cleanup. Some variable/function/constant names
contained "debt" when they actually refer to REC.
    Change these names to use "rec".
2013-03-24 11:22:01 -07:00
David Anderson 702798b84b - client: a couple more clock-change fixes 2013-03-22 10:28:20 +01:00
David Anderson 3c029c7613 - client: job scheduler tweak to avoid CPU idleness in situation
where GPU jobs use different CPU fractions
- single-job submission: default platform is that of server
2013-03-05 15:57:34 +01:00
Rom Walton 2dd82881de - client/server: fix build breaks I introduced last night with a variable
rename.
2013-03-04 15:30:03 +01:00
Charlie Fenton ce87ec9848 OpenCL: First pass at adding support for Intel Ivy Bridge GPUs 2013-03-04 15:23:39 +01:00
David Anderson 952a495fb7 - client: add "client app configuration" feature; see
http://boinc.berkeley.edu/trac/wiki/ClientAppConfig
    This lets users do the following:
    1) limit the number of concurrent jobs of a given app
        (e.g. for WCG apps that are I/O-intensive)
    2) Specify the CPU and GPU usage parameters of GPU versions
        of a given app.
    Implementation notes:
    - max app concurrency is enforced in 2 places:
        1) when building the initial job run list
        2) when enforcing the final job run list
        Both are needed to avoid possible starvation.
    - however, we don't enforce it during RR simulation.
        Doing so could cause erroneous shortfall and work fetch.
        This means, however, that work buffering will not work
        as expected if you're using max concurrency.
2013-03-04 15:20:32 +01:00
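
The per-app concurrency limit from 952a495fb7 can be sketched in C++ as a single filtering pass. This is an illustration under assumptions: `Job`, `apply_max_concurrent()` and the map of limits are hypothetical, and the commit says the real client applies the limit in two separate places.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical job record: only the app name matters for this rule.
struct Job {
    std::string app_name;
};

// Returns the indices of the jobs that may run, in list order.
// A job is skipped once its app already has `limit` jobs selected.
// Apps with no entry in `max_concurrent` are treated as unlimited.
std::vector<int> apply_max_concurrent(
    const std::vector<Job>& jobs,
    const std::map<std::string, int>& max_concurrent) {
    std::map<std::string, int> selected_per_app;
    std::vector<int> selected;
    for (int i = 0; i < static_cast<int>(jobs.size()); ++i) {
        const std::string& app = jobs[i].app_name;
        auto limit = max_concurrent.find(app);
        if (limit != max_concurrent.end() &&
            selected_per_app[app] >= limit->second) {
            continue;  // this app is at its concurrency limit
        }
        ++selected_per_app[app];
        selected.push_back(i);
    }
    return selected;
}
```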
Charlie Fenton 687c8e1a5d Mac: fix build break.
svn path=/trunk/boinc/; revision=25842
2012-07-03 07:31:06 +00:00
David Anderson 430f6a0813 - client: in the job scheduler, there's a check to prevent
overcommitting the CPUs if an MT job is scheduled.
    Skip this check for GPU jobs.


svn path=/trunk/boinc/; revision=25835
2012-07-02 17:58:33 +00:00
David Anderson 331114d961 - client: minor code shuffle
svn path=/trunk/boinc/; revision=25627
2012-04-30 21:12:35 +00:00
David Anderson bbfbef0fe8 - client: code cleanup. Move RESULT and PROJECT to separate files
svn path=/trunk/boinc/; revision=25621
2012-04-30 21:00:28 +00:00
David Anderson 9d25481174 - scheduler: fix bug that tried to open plan class spec file
on each request.
- client: when showing how much work a scheduler request returned,
    scale by availability (as is done to show the amount of the request)
- client: in account manager request, <not_started_dur> and
    <in_progress_dur> are in wall time, not run time
    (i.e. scale them by availability)

Note: there's some confusion in the code between runtime and wall time,
    where in general wall time = runtime / availability.
    New convention: let's use "runtime" for the former,
    and "duration" for the latter.

svn path=/trunk/boinc/; revision=25597
2012-04-25 04:10:29 +00:00
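
The runtime/duration convention noted in 9d25481174 amounts to one division. A small C++ sketch follows; `runtime_to_duration()` is a hypothetical helper name, and returning infinity for a non-positive availability is an assumption made here, not something the commit specifies.

```cpp
#include <limits>

// Converts runtime (time actually spent computing) to duration (wall time),
// using the relation from the commit note: duration = runtime / availability.
// `availability` is the fraction of wall time the host can compute, in (0, 1].
double runtime_to_duration(double runtime_secs, double availability) {
    if (availability <= 0.0) {
        // Assumption: a host that is never available never finishes.
        return std::numeric_limits<double>::infinity();
    }
    return runtime_secs / availability;
}
```

For example, one hour of runtime on a host that is available half the time corresponds to a duration of two hours.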
David Anderson b6b02aedf4 - client: fix bug that caused a project's jobs to all be run EDF
if the project has the <dont_use_dcf> flag set.

svn path=/trunk/boinc/; revision=25593
2012-04-24 06:07:36 +00:00
David Anderson f317329321 - client: fix typo that prevented GPU jobs from running
if CPUs were filled with EDF jobs


svn path=/trunk/boinc/; revision=25497
2012-03-27 17:20:47 +00:00
David Anderson 6498b0bba2 - client: set PROJECT::last_upload_start whenever an upload starts,
not just when a result becomes ready to upload.
    Fix bug where a scheduler RPC to report results is done
    even though uploads are active.
- client: cpu_sched_debug enables messages about not scheduling jobs
    because of insufficient RAM
    

svn path=/trunk/boinc/; revision=25493
2012-03-26 22:01:31 +00:00
David Anderson adab6254bc Update Translation
svn path=/trunk/boinc/; revision=25477
2012-03-23 16:25:19 +00:00
David Anderson dfd34e631f - client: job scheduling policy tweak:
if CPUs are fully committed (e.g. with EDF jobs)
    allow GPU jobs but only up to CPU usage of ncpus+1


svn path=/trunk/boinc/; revision=25454
2012-03-19 17:39:26 +00:00
David Anderson 22f6512135 - client: changes to job scheduling policy:
- fix bug that could greatly overcommit CPUs
        if there are several EDF jobs and several non-EDF GPU jobs.
    - don't overcommit CPUs if any job is MT (MT means avg_ncpus > 1).
        For example, on a 4-CPU machine we will run:
            a 0.5-CPU GPU job and 4 1-CPU jobs
            but not
            a 0.5-CPU GPU job and 1 4-CPU job

svn path=/trunk/boinc/; revision=25442
2012-03-18 05:50:47 +00:00
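
The MT rule in 22f6512135 can be checked against its own example with a short C++ sketch. `can_run_together()` is a hypothetical helper; it deliberately leaves out any upper bound on how far non-MT job sets may overcommit, which the real policy would also need.

```cpp
#include <vector>

// Decides whether a set of jobs may run at once on `ncpus` CPUs.
// Each entry of `avg_ncpus` is one job's CPU usage; a job is MT
// (multithreaded) when its usage exceeds 1, as defined in the commit.
bool can_run_together(const std::vector<double>& avg_ncpus, double ncpus) {
    double total = 0.0;
    bool any_mt = false;
    for (double n : avg_ncpus) {
        total += n;
        if (n > 1.0) any_mt = true;
    }
    if (total <= ncpus) return true;  // not overcommitted at all
    return !any_mt;  // overcommit is tolerated only without MT jobs
}
```

On a 4-CPU machine this accepts {0.5, 1, 1, 1, 1} and rejects {0.5, 4}, matching the two cases listed in the commit message.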
David Anderson 14c5493c69 - client: change the job scheduling policy for MT jobs.
The old policy avoided running an N-CPU job unless N CPUs were free.
    This could result in idle CPUs for long periods; for example:
    on a 4-CPU machine, suppose you have a long 1-CPU job in EDF mode,
    and some 4-CPU jobs.
    3 CPUs will be idle until the 1-CPU job finishes.
    Furthermore, the work fetch mechanism won't try to get
    jobs (possibly non-MT) from other projects,
    because the RR simulation doesn't reflect the scheduling
    policy's exclusion principle.

    The change: schedule jobs until ncpus_used >= ncpus.
    E.g. in the above situation run the 1- and 4-CPU jobs together.
    In extreme cases we might run 3 1-CPU jobs and the 4-CPU job.
    This will degrade the performance of the 4-CPU job,
    but that's probably better than having idle CPUs.


svn path=/trunk/boinc/; revision=25312
2012-02-22 21:11:41 +00:00
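
The new policy in 14c5493c69 ("schedule jobs until ncpus_used >= ncpus") is easy to express as a loop. The C++ sketch below assumes the jobs are already in priority order; `schedule_until_full()` is a hypothetical name.

```cpp
#include <vector>

// Takes jobs in priority order and keeps adding them while the CPUs are
// not yet fully used. The last job added may push usage past `ncpus`.
// Returns the indices of the jobs chosen.
std::vector<int> schedule_until_full(const std::vector<double>& job_ncpus,
                                     double ncpus) {
    std::vector<int> chosen;
    double ncpus_used = 0.0;
    for (int i = 0; i < static_cast<int>(job_ncpus.size()); ++i) {
        if (ncpus_used >= ncpus) break;  // CPUs are full; stop scheduling
        chosen.push_back(i);
        ncpus_used += job_ncpus[i];
    }
    return chosen;
}
```

With 4 CPUs and jobs {1, 4, 4}, the 1-CPU job and the first 4-CPU job run together. With {1, 1, 1, 4}, all four jobs run, which is the extreme case the commit mentions.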
David Anderson 015a70e757 - client: define an "arrived-first" order on results
in which the tiebreaker is MD5 of name.
    That way the order is stable
    (it doesn't change from one run of the client to the next)
    and it doesn't group results with similar names
    (and hence for the same app).
    This ordering is used for
    1) the order of display in the manager
    2) the job scheduler's notion of FIFO


svn path=/trunk/boinc/; revision=25300
2012-02-20 22:31:40 +00:00
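
A C++ sketch of a stable "arrived-first" comparator in the spirit of 015a70e757. Two things here are assumptions: arrival time is taken as the primary key, and a 64-bit FNV-1a hash stands in for MD5, because the C++ standard library has no MD5. `Result`, `name_hash()` and `arrived_first()` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result record for illustration only.
struct Result {
    std::string name;
    double received_time;
};

// Stand-in for MD5 (NOT what the commit uses): 64-bit FNV-1a.
// Like MD5, it depends only on the name and scatters similar names.
std::uint64_t name_hash(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Earlier arrival first; ties are broken by the hash of the name,
// then by the name itself in case two names hash to the same value.
bool arrived_first(const Result& a, const Result& b) {
    if (a.received_time != b.received_time) {
        return a.received_time < b.received_time;
    }
    std::uint64_t ha = name_hash(a.name);
    std::uint64_t hb = name_hash(b.name);
    if (ha != hb) return ha < hb;
    return a.name < b.name;
}

void sort_arrived_first(std::vector<Result>& results) {
    std::sort(results.begin(), results.end(), arrived_first);
}
```

Because the tiebreaker depends only on the name, the resulting order is the same from one run to the next regardless of the order in which the results were loaded.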
David Anderson a4cd8e5cdb - storage stuff
- client: message tweak


svn path=/trunk/boinc/; revision=25244
2012-02-13 08:41:48 +00:00
David Anderson b36779b22a - client: fix job scheduler problem:
old: RR simulation marks some jobs as missing their deadline,
        and the job scheduler runs those jobs as "high priority".
    problem: those generally aren't the ones we should run.
        E.g. if the client has a lot of jobs from a project,
        typically the ones with later deadlines are the ones
        whose deadlines are missed in the simulation.
        But in this case the EDF policy says we should run
        the ones with earliest deadlines.
    new: if a project has N deadline misses,
        run its N earliest-deadline jobs,
        regardless of whether they missed their deadline in the sim.
    Note: this is how it used to be (as designed by John McLeod).
        I attempted to improve it, and got it wrong.


svn path=/trunk/boinc/; revision=25188
2012-02-02 17:05:55 +00:00
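
The "new" rule in b36779b22a (a project with N deadline misses runs its N earliest-deadline jobs) can be sketched in C++ as follows. `Job` and `edf_candidates()` are hypothetical, and the miss count is assumed to come from a separate simulation step that is not shown.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical job record for illustration only.
struct Job {
    int id;
    double deadline;
};

// Given one project's jobs and the number of deadline misses the
// simulation predicted for that project, returns the ids of the jobs
// to run at high priority: the n_misses jobs with the earliest deadlines,
// whether or not those particular jobs missed in the simulation.
std::vector<int> edf_candidates(std::vector<Job> jobs, int n_misses) {
    std::stable_sort(jobs.begin(), jobs.end(),
                     [](const Job& a, const Job& b) {
                         return a.deadline < b.deadline;
                     });
    int n = std::min(n_misses, static_cast<int>(jobs.size()));
    std::vector<int> ids;
    for (int i = 0; i < n; ++i) ids.push_back(jobs[i].id);
    return ids;
}
```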
Charlie Fenton 65b5930423 client: don't defer scheduling a task based on insufficient GPU RAM
svn path=/trunk/boinc/; revision=25166
2012-01-30 10:09:44 +00:00
David Anderson f4b8f357bb - client: fix divide-by-zero bug in calculation of priority
of projects with zero resource share

svn path=/trunk/boinc/; revision=25127
2012-01-23 07:34:16 +00:00
Charlie Fenton bd55ab5968 client: Add logging message for insufficient GPU RAM details to coproc_debug flag
svn path=/trunk/boinc/; revision=25077
2012-01-17 04:03:19 +00:00
Charlie Fenton ee48c85db6 client: Add logging message for insufficient GPU RAM details to coproc_debug flag
svn path=/trunk/boinc/; revision=25074
2012-01-17 02:42:49 +00:00
David Anderson bba4ce24ce - client: compute projects' disk share (based on resource share).
Report it (along with disk usage) in scheduler request messages.
    This will allow the scheduler to send file-delete commands
    if the project is using more than its share.
- client: add <disk_usage_debug> log flag
- create_work: add --help, show --command_line option


svn path=/trunk/boinc/; revision=24968
2012-01-02 05:53:42 +00:00
David Anderson 69834e0c01 - client: compile fix; remove redundant total_peak_flops()
svn path=/trunk/boinc/; revision=24738
2011-12-06 09:20:30 +00:00
David Anderson bc35060726 - client: when contacting a project for reasons other than
work fetch (e.g. to report completed jobs)
    only request work if it's the project we would have chosen
    if we were fetching work.
- client: the way in which project priorities were adjusted
    in work fetch to reflect currently queued work was wrong.
- client: fix bug in the way project priorities are adjusted
    in RR simulator
- client emulator: if there are results in the state file
    with states DOWNLOADING or UPLOADING,
    change them to DOWNLOADED or UPLOADED.
    Otherwise they're stuck.


svn path=/trunk/boinc/; revision=24737
2011-12-06 04:21:27 +00:00
David Anderson 07e54fc86b - client: fix work fetch bug.
If we're contacting a project to report results,
    only piggyback work requests for resources for which
    that project is the highest priority that may have work.
- client: compute result.not_started more efficiently

TODO: continue efficiency work.  There's still some quadratic stuff


svn path=/trunk/boinc/; revision=24523
2011-11-04 08:15:04 +00:00
David Anderson ad2f3771da - client: fix bugs when writing/parsing cc_config.xml via GUI RPCs
(e.g. when editing it via the Manager).
    Include only the GPUs that were specified in the original cc_config.xml,
    not those detected by the client.
- client: fix bug that failed to require authorization for
    GUI RPCs that are supposed to be authorized
- client: report parse errors in acct_mgr_url.xml and acct_mgr_login.xml
- fix compile warnings
- user web: in sample project_specific_prefs.inc,
    get app names from the DB instead of listing them in the PHP code.


svn path=/trunk/boinc/; revision=24518
2011-11-03 19:19:36 +00:00
David Anderson 6297bdbc77 - web: typo in forum RSS from Daniel L G; fixes #1147
- client: message tweak


svn path=/trunk/boinc/; revision=24483
2011-10-25 17:22:18 +00:00
David Anderson b95ac02c5b - client: change the way project priorities are computed,
so that they do what they're supposed to
    (i.e. enforce resource shares)
- client: change log flag <debt_debug> to <priority_debug>
- client simulator: update REC even with large delta-t.
- client simulator: handle "no new work" apps correctly


svn path=/trunk/boinc/; revision=24429
2011-10-19 06:37:03 +00:00
David Anderson f8e7662e1f - client: improvements to job scheduling and work fetch policies.
- Job scheduling: the baseline policy is to schedule based on "project priority",
            which is how much processing a project P should receive based on its resource share
            minus how much it actually has received recently.
            This policy tends to run jobs from the same project together,
            so we modified it by adding a priority adjustment as jobs are scheduled.
            The idea is that if 2 projects have about the same priority
            they should split the processors.

            The problem: the adjustment was too large on hosts that are on
            only a small fraction of the time,
            thus tending to run 1 job from each project, regardless of priority.

            Solution: make an adjustment that reflects the host's actual throughput.
            See adjust_rec_sched() for details.

        - Work fetch: similar situation.
            We were making an adjustment based on how much work the project currently has queued,
            but the adjustment drowned out the project priority,
            so we'd tend to always get work from the project that has the least work queued.
            Solution: make a smaller adjustment (-.3 ... .3)

    - client: in message announcing app start, show the plan class

    - client: don't show "unrecognized XML" messages for account files.
        It's typically project-specific prefs that the client doesn't know about.

svn path=/trunk/boinc/; revision=24403
2011-10-15 20:28:26 +00:00
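
The baseline "project priority" described in f8e7662e1f (share owed minus share recently received) can be written out as a small C++ sketch. It covers only the baseline and not the adjustment made in adjust_rec_sched(). `Project`, `recent_work` and `project_priorities()` are hypothetical names, and normalizing both terms to fractions is an assumption.

```cpp
#include <vector>

// Hypothetical project record for illustration only.
struct Project {
    double resource_share;  // user-assigned share
    double recent_work;     // processing received recently (any consistent unit)
};

// Priority = fraction of processing the project should get
//            minus the fraction it actually got recently.
// Positive values mean the project is owed processing.
std::vector<double> project_priorities(const std::vector<Project>& projects) {
    double total_share = 0.0;
    double total_work = 0.0;
    for (const Project& p : projects) {
        total_share += p.resource_share;
        total_work += p.recent_work;
    }
    std::vector<double> priorities;
    for (const Project& p : projects) {
        // Guard against division by zero when a total is zero.
        double share_frac = total_share > 0.0 ? p.resource_share / total_share : 0.0;
        double work_frac = total_work > 0.0 ? p.recent_work / total_work : 0.0;
        priorities.push_back(share_frac - work_frac);
    }
    return priorities;
}
```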
David Anderson 80a6db29d6 - client: win compile fixes
svn path=/trunk/boinc/; revision=24330
2011-10-04 18:19:57 +00:00
Charlie Fenton 95ecb2acda client: Fix compiler warnings
svn path=/trunk/boinc/; revision=24319
2011-10-03 07:55:33 +00:00
Charlie Fenton ebdb8094f1 client: Fix compiler warnings
svn path=/trunk/boinc/; revision=24318
2011-10-03 07:54:22 +00:00
David Anderson 5c0d5d371e - client: compute project scheduling priority more efficiently
- client: if an app version can't be used because the GPUs it needs
    are all excluded, mark it and all its results as "coproc missing"
    so that they won't be looked at in scheduling logic.


svn path=/trunk/boinc/; revision=24317
2011-10-03 06:18:58 +00:00
David Anderson 090050c0ca - client: fix bug that could cause GPU idleness
in the presence of GPU exclusions.
    The problem was in the job-selection phase,
    which picks enough jobs to use all devices.
    It was ignoring GPU exclusions, so for example on
    a 2 GPU system it could pick 2 jobs from a project
    for which 1 GPU is excluded,
    and as a result 1 GPU would be idle.

    Solution: during job selection,
    keep track of GPU usage on a per-instance basis.
    Select a job only if it can run on a non-excluded GPU.

- client: in computing ncprocs_excluded (which is used in
    work fetch policy) don't count exclusions of non-existent devices


svn path=/trunk/boinc/; revision=24316
2011-10-03 03:29:58 +00:00
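
The fix in 090050c0ca (track GPU usage per instance during job selection and pick a job only if it can run on a non-excluded GPU) can be sketched in C++ with a greedy first-fit pass. The data layout and `select_gpu_jobs()` are hypothetical, and first-fit is a simplification chosen here, not necessarily what the client does.

```cpp
#include <utility>
#include <vector>

// job_project[i] is the project index of job i (jobs in priority order).
// excluded[p][g] is true when project p may not use GPU instance g.
// Returns (job index, GPU index) pairs for the jobs selected.
std::vector<std::pair<int, int>> select_gpu_jobs(
    const std::vector<int>& job_project,
    const std::vector<std::vector<bool>>& excluded,
    int ngpus) {
    std::vector<bool> in_use(ngpus, false);  // per-instance usage
    std::vector<std::pair<int, int>> picks;
    for (int i = 0; i < static_cast<int>(job_project.size()); ++i) {
        const std::vector<bool>& ex = excluded[job_project[i]];
        for (int g = 0; g < ngpus; ++g) {
            if (in_use[g] || ex[g]) continue;  // busy or excluded
            in_use[g] = true;
            picks.emplace_back(i, g);
            break;
        }
    }
    return picks;
}
```

In the commit's scenario (2 GPUs, one project excluded from GPU 1), the second job from that project is skipped and a job from another project takes GPU 1, so no GPU is left idle.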
Charlie Fenton a0096d3ae1 client: Fix compile break on Mac
svn path=/trunk/boinc/; revision=24299
2011-09-27 11:34:18 +00:00
David Anderson 9667ff52a8 - client simulator: fixes
- client: message tweaks


svn path=/trunk/boinc/; revision=24297
2011-09-26 23:34:40 +00:00
David Anderson 7411dd60aa - client: change in the use of GPU available RAM:
- measure the available RAM of each GPU when BOINC starts up.
        If this fails, set available = physical.
        Show available RAM in startup messages.
    - use available RAM rather than physical RAM in selecting
        the "best" GPU instance
    - report available RAM to the scheduler
TODO: change the scheduler to use available rather than physical
    if it's reported


svn path=/trunk/boinc/; revision=24210
2011-09-14 22:45:26 +00:00
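
A C++ sketch of the available-RAM handling in 7411dd60aa: fall back to physical RAM when the measurement fails, and compare GPUs by that value. `Gpu`, `usable_ram()` and `best_gpu_by_ram()` are hypothetical names. The client's notion of the "best" GPU may involve other criteria, so RAM is the only one modeled here.

```cpp
#include <vector>

// Hypothetical GPU description for illustration only.
struct Gpu {
    double physical_ram;
    double available_ram;  // meaningful only when measured_ok is true
    bool measured_ok;      // false if the startup measurement failed
};

// "If this fails, set available = physical."
double usable_ram(const Gpu& g) {
    return g.measured_ok ? g.available_ram : g.physical_ram;
}

// Returns the index of the GPU with the most usable RAM, or -1 if none.
int best_gpu_by_ram(const std::vector<Gpu>& gpus) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(gpus.size()); ++i) {
        if (best < 0 || usable_ram(gpus[i]) > usable_ram(gpus[best])) {
            best = i;
        }
    }
    return best;
}
```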
David Anderson 7f2a3c0ce1 - client: get GPU available RAM at startup (only)
- client: fix compile warning


svn path=/trunk/boinc/; revision=24188
2011-09-13 22:58:39 +00:00