92 Commits

Author SHA1 Message Date
David Anderson 73b990b4b0 client: fix bug that sometimes prevented work fetch when GPU exclusions used 2013-06-16 20:10:17 -07:00
David Anderson 02fcc45ec4 client: fix work fetch bugs that caused incorrect GPU fetches 2013-06-10 10:36:05 -07:00
David Anderson 424b8c4034 client: fix work-fetch bug that can cause idle GPUs when exclusions are used
Round-robin simulation, among other things, creates a bitmap
"sim_excluded_instances" of instances that are idle because of GPU exclusions.
There was a problem in how this was computed;
in the situation where there are fewer jobs than GPU instances
it could fail to set any bits, so no work fetch would happen.

My solution is a bit of a kludge, but should work in most cases.
The long-term solution is to treat GPU instances separately,
eliminating the need for GPU exclusions.
2013-06-08 16:25:53 -07:00
David Anderson 2e23bfedaa - client, work fetch policy. Change policy for projects w/ GPU exclusions 2013-03-07 11:28:43 +01:00
David Anderson a63ebbc13e - client: change work fetch policy to work better with GPU exclusions
- scale amount of work request by
        (# non-excluded instances)/#instances
    - change policy:
        old: don't fetch work if #jobs > #non-excluded instances
        new: don't fetch work if # of instance-seconds used in RR sim
            > work_buf_min * (#non-excluded instances)/#instances
2013-03-07 11:28:42 +01:00
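As a rough illustration of the policy change in a63ebbc13e, the request scaling and the new fetch test might be sketched as follows. The function and parameter names are hypothetical and are not taken from the BOINC source; only the two formulas come from the commit message.

```cpp
#include <cassert>

// Hypothetical helper: fraction of instances the project may use,
// i.e. (# non-excluded instances) / #instances.
double non_excluded_frac(int n_instances, int n_excluded) {
    assert(n_instances > 0);
    assert(n_excluded >= 0 && n_excluded <= n_instances);
    return static_cast<double>(n_instances - n_excluded) / n_instances;
}

// Scale a work request (in seconds) by the non-excluded fraction.
double scaled_request(double req_secs, int n_instances, int n_excluded) {
    return req_secs * non_excluded_frac(n_instances, n_excluded);
}

// New policy: fetching is blocked once the instance-seconds used in the
// RR simulation exceed work_buf_min * non-excluded fraction.
bool may_fetch(double sim_inst_secs, double work_buf_min,
               int n_instances, int n_excluded) {
    return sim_inst_secs <=
           work_buf_min * non_excluded_frac(n_instances, n_excluded);
}
```

With 4 instances of which 1 is excluded, a 1000-second request would be scaled to 750 seconds under this sketch.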
David Anderson 3c73f40809 - client: the logic for work fetch in the presence of GPU exclusions
(especially per-app exclusions) was incomplete and buggy.
    Changes:
    - make bitmaps of included instances per (app, resource type)
    - in round-robin simulation, we keep track of used instances
        (so that we know if there are instances that are idle
        because of exclusions).
        Do this based on app-level exclusions
        (previously it was done based on project-wide exclusions,
        which didn't include app-level exclusions).
    - compute RSC_PROJECT_WORK_FETCH::non_excluded_instances
        as the logical OR of the per-app masks.
        I.e. if you exclude an instance for all apps separately,
        it's the same as excluding it for the project as a whole.
        (Note: this bitmap is used for only 1 purpose:
        if we have idle instances, don't request work from a project
        for which those instances are excluded.)
    - define RSC_PROJECT_WORK_FETCH::ncoprocs_excluded as the # of
        instances excluded for *any* app, not the # excluded for all apps.
        This quantity is used in work fetch to make sure we don't
        unboundedly fetch jobs that turn out not to have a GPU to run on
        due to exclusions.
2013-03-05 13:42:00 +01:00
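The two quantities described in 3c73f40809 can be sketched with plain bitmasks. This is only an illustration under stated assumptions: bit i set means instance i is usable, there are fewer than 32 instances, and all function names are invented here rather than taken from the client.

```cpp
#include <cassert>
#include <vector>

// Mask with one bit set per existing instance.
unsigned all_instances_mask(int n_instances) {
    assert(n_instances > 0 && n_instances < 32);
    return (1u << n_instances) - 1u;
}

// Project-level "non-excluded" mask: logical OR of the per-app masks.
// An instance is excluded for the project only if every app excludes it.
unsigned project_non_excluded(const std::vector<unsigned>& app_masks) {
    unsigned mask = 0;
    for (unsigned m : app_masks) mask |= m;
    return mask;
}

// Number of instances excluded for *any* app (not for all apps).
int n_excluded_for_any_app(const std::vector<unsigned>& app_masks,
                           int n_instances) {
    const unsigned all = all_instances_mask(n_instances);
    unsigned excluded = 0;
    for (unsigned m : app_masks) excluded |= (~m & all);
    int count = 0;
    for (; excluded != 0; excluded >>= 1) {
        count += static_cast<int>(excluded & 1u);
    }
    return count;
}
```

With two GPUs, one app limited to GPU 0 and another limited to GPU 1, the project-level mask covers both GPUs, yet both GPUs count as excluded for some app.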
David Anderson 9cf10b400a - GUI RPC: expose TIME_STATS info (e.g. on_frac) in
the binding of the get_state() RPC
- client: move client_start_time and previous_uptime
    from CLIENT_STATE to TIME_STATS,
    so that these are also visible in GUI RPC
- scheduler RPC: move uptime and previous_uptime
    into <time_stats>
- client: condition an RR simulation message on <rrsim_detail>
- boinccmd: show TIME_STATS info in --get_state
2013-03-01 16:08:52 +01:00
David Anderson 777f1f11e8 - client: change work fetch policy to avoid starving GPUs in situations where GPU exclusions are used.
- client: fix bug in round-robin simulation when GPU exclusions are used.
Note: this fixes a major problem (starvation)
    with project-level GPU exclusion.
    However, project-level GPU exclusion interferes with most of
    the client's scheduling policies.
    E.g., round-robin simulation doesn't take GPU exclusion into account,
    and the resulting completion estimates and device shortfalls
    can be wrong by an order of magnitude.

    The only way I can see to fix this would be to model each
    GPU instance as a separate resource,
    and to associate each job with a particular GPU instance.
    This would be a sweeping change in both client and server.
2013-03-01 15:31:41 +01:00
David Anderson 4fea52c6f2 - client: if a project has excluded GPUs of a given type,
allow it to fetch work of that type if the # of runnable
    jobs is <= the # of non-excluded instances (rather than 0).


svn path=/trunk/boinc/; revision=26045
2012-08-18 23:26:10 +00:00
David Anderson f8c1665722 - client: keep track of the fraction of time that
1) a network connection is available and
    2) network communication is allowed and
    3) CPU computation is allowed
- If an app version is marked as needs_network,
    use the above fraction in estimating its rate of progress
- replace "core client" with "client" in comments.
- scheduler: message tweaks


svn path=/trunk/boinc/; revision=25803
2012-06-26 20:30:56 +00:00
David Anderson bbfbef0fe8 - client: code cleanup. Move RESULT and PROJECT to separate files
svn path=/trunk/boinc/; revision=25621
2012-04-30 21:00:28 +00:00
David Anderson 9d25481174 - scheduler: fix bug that tried to open plan class spec file
on each request.
- client: when showing how much work a scheduler request returned,
    scale by availability (as is done to show the amount of the request)
- client: in account manager request, <not_started_dur> and
    <in_progress_dur> are in wall time, not run time
    (i.e. scale them by availability)

Note: there's some confusion in the code between runtime and wall time,
    where in general wall time = runtime / availability.
    New convention: let's use "runtime" for the former,
    and "duration" for the latter.

svn path=/trunk/boinc/; revision=25597
2012-04-25 04:10:29 +00:00
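The runtime/duration convention from 9d25481174 amounts to a single division. A minimal sketch, assuming availability is expressed as a fraction in (0, 1]; the function name is hypothetical:

```cpp
#include <cassert>

// "runtime" is time spent computing; "duration" is the wall time that
// runtime takes when the host is only available part of the time.
double duration_from_runtime(double runtime_secs, double availability) {
    assert(runtime_secs >= 0);
    assert(availability > 0 && availability <= 1);
    return runtime_secs / availability;
}
```

For example, one hour of runtime on a host that is available half the time corresponds to a duration of two hours.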
David Anderson bc35060726 - client: when contacting a project for reasons other than
work fetch (e.g. to report completed jobs)
    only request work if it's the project we would have chosen
    if we were fetching work.
- client: the way in which project priorities were adjusted
    in work fetch to reflect currently queued work was wrong.
- client: fix bug in the way project priorities are adjusted
    in RR simulator
- client emulator: if there are results in the state file
    with states DOWNLOADING or UPLOADING,
    change them to DOWNLOADED or UPLOADED.
    Otherwise they're stuck.


svn path=/trunk/boinc/; revision=24737
2011-12-06 04:21:27 +00:00
David Anderson 312c44415d - client: condition RR sim negative FLOPs message on rr_simulation.
svn path=/trunk/boinc/; revision=24540
2011-11-07 18:53:37 +00:00
David Anderson 98ba6807ab svn path=/trunk/boinc/; revision=24537 2011-11-07 05:12:02 +00:00
David Anderson 7b28215032 - client: reimplement the round-robin simulator to
reduce its runtime from O(N^2) to O(N),
    where N is the number of runnable jobs
    (which can be in the thousands).
    This will make the client emulator run a lot faster,
    and will reduce the client CPU overhead a bit.
- API: change boinc_get_opencl_ids() so that it returns
    a BOINC error code (< -100) if the app_init.xml is
    missing or bad (i.e. we're running standalone),
    and an OpenCL error code (> -100) if an OpenCL call failed.


svn path=/trunk/boinc/; revision=24469
2011-10-24 17:53:09 +00:00
David Anderson b95ac02c5b - client: change the way project priorities are computed,
so that they do what they're supposed to
    (i.e. enforce resource shares)
- client: change log flag <debt_debug> to <priority_debug>
- client simulator: update REC even with large delta-t.
- client simulator: handle "no new work" apps correctly


svn path=/trunk/boinc/; revision=24429
2011-10-19 06:37:03 +00:00
David Anderson 5c0d5d371e - client: compute project scheduling priority more efficiently
- client: if an app version can't be used because the GPUs it needs
    are all excluded, mark it and all its results as "coproc missing"
    so that they won't be looked at in scheduling logic.


svn path=/trunk/boinc/; revision=24317
2011-10-03 06:18:58 +00:00
David Anderson b7f1aa0226 - client: fix a bug reported by Jacob Klein,
where work fetch didn't work right in the presence of
    multiple GPUs and <exclude_gpu> config options.
    For example: suppose:
        - you have 2 GPUs and 2 projects
        - Project A is excluded from GPU 1
        - you have lots of jobs for project A
    Then the client won't try to fetch jobs from project B.

    The problem had 2 parts:
    a) round-robin simulation wasn't taking GPU exclusions into account.
        In the above example, it would think that both GPUs had jobs.
        I fixed this by computing the # of GPUs from which each project
        is excluded, and using this in the RR simulation.
    b) Once this was done, I needed to make the client
        request GPU jobs from project B rather than project A.
        I did this with the following policy:
        If a project has excluded GPUs of a given type,
        and has a runnable job of that type,
        don't ask it for more work of that type.

    Notes:
    - the policy in b) is crude, and it means that work-buffer
        preferences are ignored in some cases.
    - neither a) nor b) takes into account app-level exclusions.

    I could fix both of these with a lot of work,
    but I'd rather move to a model in which dissimilar GPUs
    are modeled as different resources,
    which would remove the need for the <exclude_gpu> mechanism
    in the first place.

- web: remove extraneous ) at end of button tooltips


svn path=/trunk/boinc/; revision=24312
2011-10-01 16:23:28 +00:00
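The policy in part b) of b7f1aa0226 is simple enough to state as a predicate. The struct and function below are hypothetical names used only for this sketch; they describe one project's state for one GPU type.

```cpp
#include <cassert>

// Hypothetical per-project, per-GPU-type state.
struct GpuTypeState {
    int n_excluded;       // # of GPUs of this type the project is excluded from
    int n_runnable_jobs;  // # of runnable jobs of this type
};

// If a project has excluded GPUs of a given type and has a runnable job
// of that type, don't ask it for more work of that type.
bool may_request_work(const GpuTypeState& s) {
    assert(s.n_excluded >= 0 && s.n_runnable_jobs >= 0);
    return !(s.n_excluded > 0 && s.n_runnable_jobs > 0);
}
```

In the two-GPU example above, project A (excluded from GPU 1, with many jobs) would be refused, which leaves project B as the candidate for GPU work.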
David Anderson e279b59913 - Updates Linux notifications to use current libnotify.
- Fix build problems on Mac OS X using autotools
- Consistently use #if HAVE_X for platform checks,
    rather than #ifdef HAVE_X or #if defined(HAVE_X)
- In Unix build, make lots of compiler checks standard
- Fix some compile warnings

From Matt Arsenault.

Note: there are now lots of compile warnings in clientgui/ on Unix,
    mostly in WxWidgets code


svn path=/trunk/boinc/; revision=24303
2011-09-27 19:45:27 +00:00
David Anderson e0956b06df - minor code shuffle
svn path=/trunk/boinc/; revision=24222
2011-09-15 17:12:18 +00:00
David Anderson be1d379f6a - client: message tweak
svn path=/trunk/boinc/; revision=24162
2011-09-12 17:22:36 +00:00
David Anderson f81cb82b8e - client: make RR simulation more accurate
by simulating time-slicing explicitly.
    Also simulate changes in project REC
    and hence in scheduling priority.
- client: add a log flag "rrsim_detail" that prints
    time-slice-level info.


svn path=/trunk/boinc/; revision=24161
2011-09-12 17:01:54 +00:00
David Anderson 7b9e20ee78 - client: make round-robin simulator match what the job scheduler now does:
give lowest priority to projects with zero resource share.


svn path=/trunk/boinc/; revision=23963
2011-08-08 19:07:54 +00:00
David Anderson a21abed078 - client: fix typo that caused a lot of spurious
"project has XXXXXX deadline misses" messages
- fix compile warnings


svn path=/trunk/boinc/; revision=23816
2011-07-07 23:58:23 +00:00
David Anderson 94e8c48220 - client: change --detach_phase_two (??) to --detach_console
- eliminate compiler warnings (e.g. shadowed vars)
    in various places, mostly in client


svn path=/trunk/boinc/; revision=23710
2011-06-12 20:58:43 +00:00
David Anderson 3b906a191c - client: generalize the GPU framework so that
- new GPU types can be added easily
		- users can specify GPUs in cc_config.xml,
			referred to by app_info.xml,
			and they will be scheduled by BOINC
			and passed --device N options
			Note: the parsing of cc_config.xml is not done yet.
		- RPC protocols (account manager and scheduler)
			can now specify GPU types in separate elements
			rather than embedding them in tag names
			e.g. <no_rsc>NVIDIA</no_rsc> rather than <no_cuda/>
	- client: in account manager replies, parse elements of the form
		<no_rsc>NAME</no_rsc>
		indicating that GPUs of type NAME should not be used.
		This allows account managers to control GPU types
		not hardwired into the client.
		Note: <no_cuda/> and <no_ati/> will continue to be supported.
	- scheduler RPC reply: add
		<no_rsc_apps>NAME</no_rsc_apps>
		(NAME = GPU name)
		to indicate that the project has no jobs for the indicated GPU type.
		<no_cuda_apps> etc. are still supported 
	- client/lib: remove set_debts() GUI RPC
	- client/scheduler RPC
		remove <cuda_backoff> etc. (superseded by no_app)
		Exception: <ip_result> elements in sched request
		still have <ncudas> and <natis>.
		Fix this later.

	Implementation notes:
	- client/lib: change "CUDA" to "NVIDIA" in type/variable names, and in XML
		Continue to recognize "CUDA" for compatibility
	- host_info.coprocs no longer used within the client;
		use a global var (COPROCS coprocs) instead.
		COPROCS now has an array of COPROCs;
		GPU types are identified by the array index.
		Index zero means CPU.
	- a bunch of other resource-specific structs (like RSC_WORK_FETCH)
		are now stored in arrays, with same indices as COPROCS
		(i.e. index 0 is CPU)
	- COPROCS still has COPROC_NVIDIA and COPROC_ATI structs to hold vendor-specific info
	- APP_VERSION now has a struct GPU_USAGE to describe its GPU usage

svn path=/trunk/boinc/; revision=23253
2011-03-25 03:44:09 +00:00
David Anderson 0685bd508e - client: fix inaccuracy in RR simulation reported by Bill Barber.
The problem arises when there are jobs of projects
    with widely differing resource shares,
    and results in an overestimation of saturated time.

    Old: at the start of simulation, call WORK_FETCH::compute_shares() 
        to get resources of runnable projects.
        Use these throughout the simulation.

    Problem: suppose you have 2 runnable projects;
        P1 has large RS, P2 has small RS.
        P1's jobs finish quickly.
        P2's jobs then are running alone,
        but their FLOPS is scaled (incorrectly) by P2's small RS.

    Solution: recompute relative CPU resource share within the
        simulation loop,
        and compute it over the projects that have active jobs
        in the simulation.

svn path=/trunk/boinc/; revision=23162
2011-03-03 20:32:54 +00:00
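The fix in 0685bd508e can be pictured as renormalizing shares over the projects that still have active jobs. The sketch below is an assumption-laden illustration, not the client's code: `SimProject` and `recompute_shares` are invented names, and it would be called inside the simulation loop each time a project runs out of jobs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical simulation-side view of a project.
struct SimProject {
    double resource_share;  // user-assigned resource share (RS)
    int n_active_jobs;      // jobs still active in the simulation
    double rel_share;       // relative share, recomputed below
};

// Recompute relative shares over only the projects with active jobs.
void recompute_shares(std::vector<SimProject>& projects) {
    double total = 0;
    for (const SimProject& p : projects) {
        assert(p.resource_share >= 0);
        if (p.n_active_jobs > 0) total += p.resource_share;
    }
    for (SimProject& p : projects) {
        const bool active = p.n_active_jobs > 0 && total > 0;
        p.rel_share = active ? p.resource_share / total : 0.0;
    }
}
```

Once P1's jobs finish, P2 is the only active project and its relative share becomes 1, so its FLOPS are no longer scaled down by its small RS.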
David Anderson 795e89dbf5 - client: eliminate unnecessary CPU reschedules.
Currently we do a reschedule any time a job checkpoints,
    in case there's a job that has finished a time slice
    but hasn't checkpointed yet.
    Instead: flag such jobs, and trigger a reschedule
    on checkpoint only for flagged jobs.
- client: fix instability in job scheduling that happens
    if a job's estimated completion time in RR sim is close to its deadline.
    It can alternate between making and missing deadline,
    causing the scheduler to alternate rapidly between jobs.
    Solution: if RR sim has marked a job as deadline miss
    any time in the last (CPU scheduling period),
    treat it as a deadline miss.


svn path=/trunk/boinc/; revision=22928
2011-01-19 16:46:55 +00:00
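The stabilizing rule in the second item of 795e89dbf5 is a form of hysteresis: a job stays flagged for one scheduling period after the RR sim last marked it. A minimal sketch, assuming the time of the last miss is recorded per job and that -1 means "never marked"; the names are hypothetical.

```cpp
#include <cassert>

// Treat a job as a deadline miss if RR sim marked it as one at any time
// within the last CPU scheduling period.
bool treat_as_deadline_miss(double now, double last_rr_miss_time,
                            double sched_period) {
    assert(sched_period > 0);
    if (last_rr_miss_time < 0) return false;  // never marked by RR sim
    return now - last_rr_miss_time <= sched_period;
}
```

A job whose estimate flips between making and missing its deadline on successive simulations therefore keeps its deadline-miss status instead of toggling.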
David Anderson 717c45a2db - client: use std::deque instead of std::vector
for RR sim's pending-job lists.
    Erasing head of vector is slow.
- lib: allow GPU peak FLOPS to be specified in XML (for simulator)
- simulator work
- client: old work fetch policy: projects may need enough jobs
    for all device instances, not just resource_share*ninst.
    E.g. a project that has only CPU jobs in a CPU/GPU client
- client: with REC scheduling, don't ask for work for
    secondary resources if project has negative priority.
- client: in RR sim, make sure we saturate devices if possible.
    Otherwise we may report a shortfall incorrectly


svn path=/trunk/boinc/; revision=22894
2011-01-12 00:47:51 +00:00
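The first item of 717c45a2db rests on a standard-library property: erasing the first element of a `std::vector` shifts every remaining element, while `std::deque::pop_front` takes constant time. A small sketch of taking the next pending job from the head of a deque, using an `int` job id as a stand-in for the real job type:

```cpp
#include <cassert>
#include <deque>

// Remove and return the job at the head of the pending list.
// With std::vector this would be v.erase(v.begin()), which is O(n).
int pop_next_pending(std::deque<int>& pending) {
    assert(!pending.empty());
    const int job_id = pending.front();
    pending.pop_front();  // O(1) for std::deque
    return job_id;
}
```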
David Anderson 5c2636b743 - client: fix scheduling bug.
The round-robin simulation wasn't handling multithread jobs correctly.
    For example, given two 3-CPU jobs,
    it would model running them together on a 4-CPU host.
    This doesn't correspond with the CPU scheduler,
    which runs only 1 at a time.
    So the simulator would say that there are no idle CPUs
    when in fact there are, and no new CPU jobs would be fetched.

svn path=/trunk/boinc/; revision=22801
2010-12-02 17:26:03 +00:00
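The multithread case in 5c2636b743 can be reproduced with a few lines of arithmetic. The sketch assumes jobs are started in list order and a job is started only if it fits in the remaining CPUs; that packing rule and the function name are assumptions made for illustration, not the client's exact logic.

```cpp
#include <cassert>
#include <vector>

// Return the number of CPUs left idle after starting jobs in order,
// skipping any job that does not fit in the remaining CPUs.
int idle_cpus_after_packing(const std::vector<int>& job_ncpus, int ncpus) {
    assert(ncpus > 0);
    int used = 0;
    for (int n : job_ncpus) {
        assert(n > 0);
        if (used + n <= ncpus) used += n;
    }
    return ncpus - used;
}
```

Given two 3-CPU jobs on a 4-CPU host, only one job fits and one CPU is idle, which is the situation the old simulation failed to report.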
David Anderson 8d9cf013c5 - client: account manager RPC:
Additions to request message:
        <not_started_dur>X</not_started_dur>
        <in_progress_dur>X</in_progress_dur>
        The estimated remaining duration of unstarted
        and in-progress tasks
    Additions to reply message, within <project>, optional:
        <suspend>0|1</suspend>
            suspend or resume project (overrides local state)
        <abort_not_started>0|1</abort_not_started>
            if set, abort unstarted jobs


svn path=/trunk/boinc/; revision=22698
2010-11-17 20:04:58 +00:00
David Anderson de944b928e - admin web: fix bugs in manage_app_versions page
- client: message tweak

svn path=/trunk/boinc/; revision=22633
2010-11-05 23:23:28 +00:00
David Anderson fdf15fb3af - client: maintain "gpu_active_frac" in addition to "active_frac"
(which really means CPU active)

svn path=/trunk/boinc/; revision=22283
2010-08-23 05:00:22 +00:00
David Anderson 2b33429f18 - scheduler: fix bug in single-replication decision (from Rytis)
svn path=/trunk/boinc/; revision=21576
2010-05-18 22:32:05 +00:00
David Anderson 40eebe00af - client/scheduler: in COPROCS, instead of having a vector of
pointers to dynamically allocated COPROC-derived objects,
    just have the objects themselves.
    Dynamic allocation should be avoided at all costs.

svn path=/trunk/boinc/; revision=21564
2010-05-18 19:22:34 +00:00
Rom Walton 9cb3e6ffc7 - client & lib: bring header inclusion up-to-date for the CC to begin
hunting down a memory leak.
        
    client/
        <Various Files>
    lib/
        <Various Files>

svn path=/trunk/boinc/; revision=21457
2010-05-11 19:10:29 +00:00
David Anderson 7db608660f - client: standardize debug messages.
Messages enabled by <foo_debug> are prefixed by "[foo]"


svn path=/trunk/boinc/; revision=21335
2010-04-29 20:32:51 +00:00
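The naming convention in 7db608660f maps a log flag to a message prefix. The helper names below are hypothetical, and the sketch handles only flags that end in `_debug`:

```cpp
#include <cassert>
#include <string>

// Map a log flag name such as "foo_debug" to the prefix "[foo]".
std::string debug_prefix(const std::string& flag) {
    const std::string suffix = "_debug";
    assert(flag.size() > suffix.size());
    assert(flag.compare(flag.size() - suffix.size(), suffix.size(),
                        suffix) == 0);
    return "[" + flag.substr(0, flag.size() - suffix.size()) + "]";
}

// Build a message line carrying the standardized prefix.
std::string format_debug_msg(const std::string& flag,
                             const std::string& text) {
    return debug_prefix(flag) + " " + text;
}
```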
David Anderson b7d48765a8 - client: if have coproc jobs but coproc is missing,
skip those jobs in RR sim.
    Otherwise we add stuff to uninitialized data structures,
    and a crash can result.
- client: initialize the above data structures anyway


svn path=/trunk/boinc/; revision=20753
2010-02-28 04:32:10 +00:00
David Anderson f716dcf7ae - client: if a project has zero resource share,
treat it as a "backup project":
    fetch work from it only if there is an idle instance
    and no other projects have work.


svn path=/trunk/boinc/; revision=20286
2010-01-28 05:21:14 +00:00
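The "backup project" rule from f716dcf7ae can be written as a predicate. This is a sketch with hypothetical names; it models only this rule and ignores every other work-fetch condition that applies to projects with a nonzero share.

```cpp
#include <cassert>

// A project with zero resource share is a backup project: fetch from it
// only if there is an idle instance and no other project has work.
bool backup_rule_allows_fetch(double resource_share,
                              bool have_idle_instance,
                              bool other_projects_have_work) {
    assert(resource_share >= 0);
    if (resource_share > 0) return true;  // rule does not restrict it
    return have_idle_instance && !other_projects_have_work;
}
```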
David Anderson b5124fe729 - client: brute-force attempt at eliminating domino-effect preemption:
if job A is unstarted and EDF,
    and there's a job B that is later in the list,
    is started, has the same app version,
    and has the same arrival time,
    move A after B.
- client: remove the "temp_dcf" mechanism,
    which had the same goal but didn't work.
- client: in computing overall debt for a project,
    subtract a term that reflects pending work.
    This should reduce repeated fetches from the same project.
- client simulator: tweaks

svn path=/trunk/boinc/; revision=20223
2010-01-21 00:14:56 +00:00
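The reordering in the first item of b5124fe729 might look roughly like the following. The `Job` fields and the function name are invented for this sketch, and where several matching started jobs exist it moves A behind the last of them, which is one possible reading of the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical job record for the ordered run list.
struct Job {
    int id;
    int app_version;
    double arrival_time;
    bool started;
    bool edf;  // scheduled in earliest-deadline-first mode
};

// If job A is unstarted and EDF, and a later job B is started and has
// the same app version and arrival time, move A after B.
void defer_unstarted_behind_started(std::vector<Job>& jobs) {
    std::size_t i = 0;
    while (i < jobs.size()) {
        const Job a = jobs[i];
        std::size_t last = i;
        if (!a.started && a.edf) {
            for (std::size_t j = i + 1; j < jobs.size(); j++) {
                const Job& b = jobs[j];
                if (b.started && b.app_version == a.app_version &&
                    b.arrival_time == a.arrival_time) {
                    last = j;
                }
            }
        }
        if (last > i) {
            // Rotate A from position i to position `last`.
            std::rotate(jobs.begin() + i, jobs.begin() + i + 1,
                        jobs.begin() + last + 1);
            assert(jobs[last].id == a.id);
        } else {
            i++;
        }
    }
}
```

The relative order of all other jobs is preserved, so an already-started job is no longer preempted by an unstarted job from the same batch.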
David Anderson fe7d8b34f3 - client simulator: done for now
svn path=/trunk/boinc/; revision=20204
2010-01-20 06:35:57 +00:00
David Anderson d6b6f8d5db - client (Mac): append /usr/local/cuda/lib to LD_LIBRARY_PATH
and DYLD_LIBRARY_PATH
- client simulator: compile fixes

svn path=/trunk/boinc/; revision=20117
2010-01-09 16:41:17 +00:00
David Anderson 37aae854f3 - client: scheduling problem:
- a project overestimates job FLOP counts
    - the client starts jobs in EDF mode
    - as job progresses and fraction done increases,
        its completion time estimate decreases until
        it's no longer a deadline miss.
    - job gets preempted by other job from that project;
        you end up with lots of partly completed jobs.
    Solution (I hope): if an app version has running jobs,
        compute a "temp DCF" for the app version,
        which is the min of dynamic/static estimates for its jobs.
        Apply this scaling factor to completion time estimates
        for unstarted jobs in RR simulation
- client: the estimation of remaining time of running jobs was wrong
    (how did this bug survive so long?)

svn path=/trunk/boinc/; revision=20077
2010-01-06 06:01:23 +00:00
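The "temp DCF" proposed in 37aae854f3 is the minimum ratio of dynamic to static estimates over an app version's running jobs. The sketch below uses hypothetical names and assumes a factor of 1 when nothing is running. Note that b5124fe729 above later removed this mechanism.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record for a running job of one app version.
struct RunningJob {
    double static_est;   // estimate from the project's FLOP count
    double dynamic_est;  // estimate from observed progress
};

// Min over running jobs of dynamic/static estimate.
double temp_dcf(const std::vector<RunningJob>& running) {
    if (running.empty()) return 1.0;  // assumption: no scaling
    double dcf = 0;
    bool first = true;
    for (const RunningJob& j : running) {
        assert(j.static_est > 0);
        const double ratio = j.dynamic_est / j.static_est;
        dcf = first ? ratio : std::min(dcf, ratio);
        first = false;
    }
    return dcf;
}

// Apply the factor to an unstarted job's completion time estimate.
double scaled_estimate(double static_est, double dcf) {
    return static_est * dcf;
}
```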
David Anderson 876522c6aa - client: add logic to work fetch so that each project
will have enough jobs to use its share of resource instances.
    This avoids situations where e.g. on a 2-CPU system
    a project has 75% resource share and 1 CPU job,
    and its STD increases without bound.
    
    Did a general cleanup of the logic for computing
    work request sizes (seconds and instances).

svn path=/trunk/boinc/; revision=20036
2009-12-24 20:40:27 +00:00
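The 2-CPU example in 876522c6aa can be checked numerically. This sketch assumes each job occupies one instance and that a project's share of instances is its resource-share fraction times the instance count; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Number of additional instances' worth of jobs a project needs in
// order to use its share of the resource.
double instance_shortfall(double share_frac, int n_instances, int n_jobs) {
    assert(share_frac >= 0 && share_frac <= 1);
    assert(n_instances > 0 && n_jobs >= 0);
    return std::max(0.0, share_frac * n_instances - n_jobs);
}
```

With a 75% share on a 2-CPU system and a single CPU job, the project is entitled to 1.5 instances but can use only 1, leaving a shortfall of 0.5.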
David Anderson e9a4debf9c - client: scheduling tweak.
Old: if a project has RR sim deadline misses,
			select jobs to run high-priority on the basis of:
			1) deadline (earliest first)
			2) estimated time to completion (least first)
			This ignores whether jobs missed their deadline in RR sim,
			so it may choose to run a job that's actually in no
			danger of missing its deadline over one that is.
		New: choose only jobs that miss their deadline in RR sim

svn path=/trunk/boinc/; revision=19826
2009-12-08 20:39:46 +00:00
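The selection change in e9a4debf9c can be sketched as a filter followed by the existing ordering. The struct and function names are hypothetical, and keeping the old sort keys (deadline, then estimated time to completion) after filtering is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical job record.
struct Job {
    int id;
    double deadline;
    double est_remaining;  // estimated time to completion
    bool rr_sim_miss;      // missed its deadline in RR sim
};

// New behavior: consider only jobs that miss their deadline in RR sim,
// ordered by deadline (earliest first), then by remaining time.
std::vector<Job> high_priority_candidates(std::vector<Job> jobs) {
    jobs.erase(std::remove_if(jobs.begin(), jobs.end(),
                              [](const Job& j) { return !j.rr_sim_miss; }),
               jobs.end());
    std::sort(jobs.begin(), jobs.end(), [](const Job& a, const Job& b) {
        if (a.deadline != b.deadline) return a.deadline < b.deadline;
        return a.est_remaining < b.est_remaining;
    });
    return jobs;
}
```

A job with the earliest deadline that is in no danger of missing it is no longer selected ahead of a job that actually misses in the simulation.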
David Anderson 4d96415576 - client: fix bug introduced in [19035] that causes wrong nidle instances
(and resulting work fetch problems)
- Unix build: don't touch svn_version.sh if it hasn't changed,
    to avoid remake of sched/ (from Gabor Gombas)

svn path=/trunk/boinc/; revision=19096
2009-09-18 19:26:34 +00:00
David Anderson f5a6f862bf - client: fix bug in RR simulation:
start only enough jobs to fill CPUs per project,
    not all the CPU jobs at once.
    I'm not sure how much difference this makes,
    but this is how it's supposed to work.
- client: if app_info.xml doesn't specify flops,
    use an estimate that takes GPUs into account.
- client: if it's been more than 2 weeks since time stats update,
    don't decay on_frac at all.

svn path=/trunk/boinc/; revision=19035
2009-09-09 22:18:02 +00:00
David Anderson c3fe504e1d - client: add ATI support to job scheduling and work fetch
svn path=/trunk/boinc/; revision=18850
2009-08-17 16:50:40 +00:00
David Anderson 0a523d5f3f svn path=/trunk/boinc/; revision=18843 2009-08-14 17:10:52 +00:00