Commit Graph

50 Commits

Author SHA1 Message Date
David Anderson 0b11b3f6e2 scheduler: don't send spurious "no tasks available" msgs w/ score scheduling 2014-06-16 16:53:49 -07:00
David Anderson 56934f8fbe scheduler: clean up job dispatch logging
There are now 3 flags for job dispatch logging:
<debug_send/>: info about work request, jobs sent, other high-level stuff
<debug_send_scan/>: info about scans through job cache
<debug_send_job/>: info about individual jobs (e.g. reason for not sending)
2014-06-12 11:33:11 -07:00
David Anderson 52bd196c4f scheduler: fix bugs in sending non-compute-intensive jobs 2014-05-26 21:07:07 -07:00
David Anderson 3c64bbb837 scheduler: fix bug where first NCI job in shared mem never gets sent
In SCHED_SHMEM::no_work() we "lock" the job by setting its state to our PID.
When checking the job in send_job_for_app() we need to
accept this state was well as STATE_PRESENT.  From Jack Yang.
2014-05-09 00:18:20 -07:00
Eric J Korpela 244ba5bc85 SCHED: modified scheduled log output to use unsigned format for WU and RESULT
ids.  This allows IDs greater than 2^31 to be printed.
2013-06-19 10:15:08 -07:00
David Anderson c9c9f2bae0 - scheduler: code shuffle; new file sched_check.cpp contains functions
that decide whether a job can be sent to a host
2013-04-09 12:19:00 -07:00
David Anderson 12319ca82b - scheduler: add code (commented out for now) for new implementation
of score-based scheduling.
2013-04-09 11:10:50 -07:00
David Anderson 6f962d5b61 - file upload handler: in FCGI version, check for trigger file
each time through loop (from Bernd).
- validator: fix bug that zeroed result.random
2013-03-04 17:24:18 +01:00
David Anderson a64cb793f1 - scheduler: attempted performance enhancement.
Old: each scheduler process holds a semaphore
        while scanning the shared-mem job array.
        On machines with many CPUs
        there seems to be contention for this semaphore,
        causing slow scheduler response and possibly connection failures.
    New: Don't hold the semaphore while scanning array.
        Instead, if find a job that passes quick_check(),
        acquire the semaphore and recheck that the job is present in array
        and passes quick_check().
- client: show messages if app_config.xml has unrecognized tags
2013-03-04 17:16:56 +01:00
David Anderson 01c0a9a4b0 - scheduler: when resend jobs:
- don't use devices for which work is not being requested
    - obey wu_is_infeasible_custom()
        (e.g. don't send SETI@home VLAR jobs to GPUs)
- scheduler: add <debug_array_detail> log flag for slot-level messages
- admin web: show and allow control of app.beta
2013-03-01 16:26:08 +01:00
David Anderson ca652519cf - scheduler: log message tweaks
- Some C++ files in client had execute permissions (??).  Clear them.
2013-03-01 16:26:08 +01:00
David Anderson 9fb0328d9b - scheduler: add separate log flag for locality sched lite 2013-03-01 16:26:08 +01:00
David Anderson cc13f2ee6f - scheduler: fix logic error limited locality scheduling.
In LLS array pass, skip file-on-host check if host
    doesn't have any sticky files.
    TODO: it should actually be "any sticky files for this app".
    But we currently don't have any way to know that.


svn path=/trunk/boinc/; revision=26108
2012-09-13 17:38:55 +00:00
David Anderson 77f44e521c - scheduler: more detailed msgs for NCI job sending
svn path=/trunk/boinc/; revision=26080
2012-09-08 04:05:50 +00:00
David Anderson 6b81e2ffc3 - scheduler: fix sending of NCI jobs.
We were failing to mark the cache entries as free.
- API: initialize GPU device # to -1;
    If client doesn't give us a device number, something is wrong
    and it's better to not start computing.


svn path=/trunk/boinc/; revision=26079
2012-09-06 23:44:03 +00:00
David Anderson 11a6e85632 - scheduler: support for projects with some non-CPU-intensive apps
(but not all) wasn't finished.
    New logic: if the project has an NCI app then:
    - make a list of NCI apps for which the client doesn't have
        a job in progress.
    - try to send one job for each of these apps
    - do this even if no work is being requested.
    - don't send jobs for NCI apps by other mechanisms

NOTE: the client logic isn't quite right for mixed NCI projects.
    If there's no job for a given NCI app,
    the client should do a scheduler RPC.
    This isn't critical so we won't do this now.


svn path=/trunk/boinc/; revision=26068
2012-09-01 04:58:12 +00:00
David Anderson b1d1e21de4 - remote job submission: start writing a general-purpose
cmdline tool for remote job submission (not done)
- remote job submission: support the 4 file modes described
    in the documentation (not done)


svn path=/trunk/boinc/; revision=26067
2012-08-31 06:11:06 +00:00
David Anderson 6b7fb36056 - scheduler: msg tweaks
svn path=/trunk/boinc/; revision=26066
2012-08-29 18:08:15 +00:00
David Anderson 96b6e172f9 - scheduler: improved log messages for limited locality scheduling
svn path=/trunk/boinc/; revision=26065
2012-08-29 03:09:10 +00:00
David Anderson 9ccb8fa38d - scheduler: add support for limited locality scheduling
- API: remove support for PPM files


svn path=/trunk/boinc/; revision=26062
2012-08-27 17:00:43 +00:00
David Anderson 1776a244ae - web: when showing a batch, recompute and update its fraction done
- feeder: don't enumerate results for WUs with nonzero error_mask
- scheduler: in slow_check(), make sure the WU error_mask is still zero


svn path=/trunk/boinc/; revision=25822
2012-06-29 06:53:48 +00:00
David Anderson 023031a497 - scheduler: add a lot more debug messages if <debug_array> is set
svn path=/trunk/boinc/; revision=25694
2012-05-18 18:13:04 +00:00
David Anderson 2ed1cfbbb2 - scheduler and create_work: fix bugs that caused targeted jobs
to be sent to non-targeted hosts.
    The feeder was erroneously putting targeted jobs
    in the shared mem cache.
    Changes:
    - The feeder only enumerates jobs for which
        workunit.transitioner_flags is zero.
        NOTE: this field is nonzero iff the job is assigned.
    - create_work: when creating an assigned jobs,
        set workunit.transitioner_flags appropriately


svn path=/trunk/boinc/; revision=25314
2012-02-22 22:13:08 +00:00
David Anderson 5d76e13277 - scheduler: tweaks to last night's checkin.
In the inner loop of scan_work_array() there are two WORKUNITs:
    - the one that's part of wu_result (in the shared-mem array)
    - a temp copy.
        quick_check() may modify this in host-specific ways
        (e.g., adjusting rsc_fpops_est or delay_bound).
        This is the one we pass to add_result_to_reply().
    When we reread hr_class and app_version_id from the DB,
    update both structs.


svn path=/trunk/boinc/; revision=24493
2011-10-26 16:51:10 +00:00
David Anderson 4b826b52a0 - scheduler: fix bug in the "homogeneous app version" (HAV) feature
(reported by Kevin Reed).
    The problem: cache inconsistency.
    If there are 2 results for the same WU in shared mem,
    and 2 scheduler instances get them around the same time,
    they can send them with different app versions.
    We already fixed this problem for HR by
    1) rereading the relevant WU fields while deciding
        whether to send the result
    2) doing a "careful update" of the WU field using a where clause
        to make sure it wasn't modified in the (short) interval
        since rereading it.
    I fixed the HAV problem in the same way,
    and merged the two mechanisms to combine the DB queries.

    Also:
    - The rereads are done in slow_check() (see below).
    - The careful updates are done in update_wu_on_send(),
        and this is called *before* doing careful updates on result fields.
        That way, if the WU updates fail, we don't have orphaned results.
    - already_sent_to_different_platform_careful() (sic)
        no longer does DB stuff, so it's merged with
        already_send_to_different_hr_class() (better name)

    NOTE: slow_check() is used in array scheduling only.
        Score-based scheduling uses other code,
        in which this bug is not yet fixed.
        Locality scheduling doesn't support HR or HAV at all.
        This should be unified.


svn path=/trunk/boinc/; revision=24484
2011-10-26 07:15:22 +00:00
David Anderson 436415cfe1 - scheduler, back end: add "homogeneous app version" feature.
Lets you specify, on a per-app basis,
    that all instances should be done using the same app version.
    This is for validation in the presence of GPUs.
- scheduler: code cleanup
    - Instead of adding a bunch of non-DB fields to RESULT,
        used a derived class SCHED_DB_RESULT.
    - Instead of storing a pointer to BEST_APP_VERSION in RESULT,
        store the structure itself.
        This simplifies the memory allocation situation.
- client: condition "Got server request to delete file" messages
    on <file_xfer_debug>


svn path=/trunk/boinc/; revision=23636
2011-06-06 03:40:42 +00:00
David Anderson e480cef000 - scheduler: if we're not sending jobs because of user prefs
(no CPU, no GPU, selected apps)
    send a message, not a notice.
    Assume the user knew what they were doing,
    and doesn't want to be nagged.
- scheduler: check for the existence of an app version
    before checking for user selected-app prefs.
    This prevents sending "no jobs available for selected apps"
    message when no app versions exist for non-selected apps
- scheduler: use "tasks" instead of "work" in user messages


svn path=/trunk/boinc/; revision=23168
2011-03-04 19:40:59 +00:00
David Anderson b169e5ab0f - server programs: print error message instead of numeric retval
in log messages

svn path=/trunk/boinc/; revision=22647
2010-11-08 17:51:57 +00:00
David Anderson 81973a9fff - scheduler: fix structural problems with sending user messages.
Old: various redundant and/or misleading messages were sent.
    New:
        - if host w/ no GPU contacts a GPU-only project,
            send high-pri message saying they need a GPU
        - if host w/ GPU has driver too old for all versions,
            send high-pri message saying to update driver
        - if host w/ GPU has driver too old for some versions,
            send low-pri message saying to update driver
        - if host has GPU but too little RAM for any app,
            send low-pri message saying so
- scheduler: revamp GPU plan class functions

svn path=/trunk/boinc/; revision=21760
2010-06-16 22:07:19 +00:00
David Anderson b2451544e1 - server: change the following from per-host to per-(host, app version):
- daily quota mechanism
    - reliable mechanism (accelerated retries)
    - "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
    host_scale_check set.
- validator: do scale probation on invalid results
    (need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options

Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
    a host will have separate quotas for each one.
    That means it could error out on 100 jobs for cuda_fermi,
    and when its quota goes to zero,
    error out on 100 jobs for cuda23, etc.
    This is intentional; there may be cases where one version
    works but not the others.
- host.error_rate and host.max_results_day are deprecated

TODO:
    - the values in the app table for limits on jobs in progress etc.
        should override rather than config.xml.

Implementation notes:
scheduler:
    process_request():
        read all host_app_versions for host at start;
        Compute "reliable" and "trusted" for each one.
        write modified records at end
    get_app_version():
        add "reliable_only" arg; if set, use only reliable versions
        skip over-quota versions
    Multi-pass scheduling: if have at least one reliable version,
        do a pass for jobs that need reliable,
        and use only reliable versions.
        Then clear best_app_versions cache.
    Score-based scheduling: for need-reliable jobs,
        it will pick the fastest version,
        then give a score bonus if that version happens to be reliable.
    When get back a successful result from client:
        increase daily quota
    When get back an error result from client:
        impose scale probation
        decrease daily quota if not aborted
Validator:
    when handling a WU, create a vector of HOST_APP_VERSION
        parallel to vector of RESULT.
        Pass it to assign_credit_set().
        Make copies of originals so we can update only modified ones
    update HOST_APP_VERSION error rates
Transitioner:
    decrease quota on timeout


svn path=/trunk/boinc/; revision=21181
2010-04-15 03:13:56 +00:00
David Anderson 5b7f8b8348 - web: fix bug that caused "send email" and "show hosts"
in project prefs to always select "no"


svn path=/trunk/boinc/; revision=20786
2010-03-04 04:16:00 +00:00
David Anderson 12a85e5ced - scheduler: code cleanup: goto considered harmful
- scheduler: when calculate scheduler runtime,
    don't include the part reading request msg from client.
    That can be misleadingly long

svn path=/trunk/boinc/; revision=20781
2010-03-03 19:29:23 +00:00
David Anderson 56a8296b5b - scheduler: compute no_jobs_available correctly
in the presence of multiple scheduling types
    (e.g., locality and job array)
    From Nils Brause

svn path=/trunk/boinc/; revision=19559
2009-11-12 21:30:33 +00:00
David Anderson 8b701fc73f - scheduler: fix messed-up deadline check logic.
Old:
        1) check deadline based on wu.delay_bound
        2) in add_result_to_reply(), potentially modify wu.delay_bound,
            e.g. because of retry acceleration
        problem: reducing delay bound may cause deadline miss
    New:
        1) new function get_delay_bound_range()
            (called from wu_is_infeasible_fast())
            returns optimistic and pessimistic delay bounds.
            Retry acceleration logic is here.
        2) check deadline based on optimistic bound;
            if that fails, check based on pessimistic bound.
            Set wu.delay_bound to the one that worked.
    Notes:
    - get_delay_bound_range() needs result priority and report deadline,
        and it's called before we read the full result.
        So add these items to WORK_ITEM and WU_RESULT.
    - get_delay_bound_range() could be customized for
        project-specific deadline policy.
    - add_result_to_reply() was becoming a toxic waste dump.
        Deadline-related stuff should have been factored out in any case.

svn path=/trunk/boinc/; revision=18946
2009-08-31 19:35:46 +00:00
David Anderson b300519444 svn path=/trunk/boinc/; revision=18825 2009-08-10 04:49:02 +00:00
David Anderson 2e5d9bd778 - scheduler: add new config option <max_wus_in_progress_gpus>.
The limit on jobs in progress is now
        max_wus_in_progress * NCPUS
        + max_wus_in_progress * NGPUS
    where NCPUS and NGPUS reflect prefs and are capped.
    Furthermore: if the client reports plan class for in-progress jobs
    (see checkin of 31 May 2009)
    then these limits are enforced separately;
    i.e. the # of in-progress CPU jobs is <= max_wus_in_progress*NCPUS,
    and the # of in-progress GPU jobs is <= max_wus_in_progress_gpu*NGPUS
- scheduler config: rename <cuda_multiplier> to <gpu_multiplier>
- scheduler: <max_wus_to_send> is now scaled by
    (NCPUS + gpu_multiplier*NGPUS)
- scheduler: don't keep scanning array if !work_needed()
- scheduler: moved array-scan logic from sched_send.cpp to sched_array.cpp
- scheduler: don't say "no work available" if jobs are available
    but work_needed() is initially false


svn path=/trunk/boinc/; revision=18255
2009-06-01 22:15:14 +00:00
David Anderson 84afd18450 - scheduler: move app-version selection and score-based scheduling
to new files.

svn path=/trunk/boinc/; revision=17630
2009-03-19 16:35:35 +00:00
David Anderson 76da7d8653 - scheduler: msg tweak
svn path=/trunk/boinc/; revision=17584
2009-03-10 21:34:49 +00:00
David Anderson 41ed82f791 - scheduler: fix bugs that caused only 1 job to be sent
svn path=/trunk/boinc/; revision=17555
2009-03-07 01:00:05 +00:00
David Anderson c22b62f25b - scheduler: fix bugs in support for anonymous platform + coprocs
(app versions don't have a <coprocs> around coproc elements,
    may an oversight but let's stick with it).
    Anyway, I think it's working now.
- lib: remove "owner" array from COPROC.
    This was used in client to keep track of assignment of
    coprocessors to tasks, but we got rid of the reserve/free scheme.
    NOTE: this breaks the mechanism for passing --device N to apps;
    I'll have to do this another way.  Stay tuned.

svn path=/trunk/boinc/; revision=17543
2009-03-06 22:21:47 +00:00
David Anderson 33d5a81cf6 - scheduler: add locality_scheduling arg to add_result_to_reply();
eliminate the need to diddle around with config.locality_scheduling.

svn path=/trunk/boinc/; revision=17445
2009-03-03 16:38:54 +00:00
David Anderson 4cd3c530b0 - scheduler: reduce frequency of calls to work_needed()
svn path=/trunk/boinc/; revision=17003
2009-01-23 22:52:35 +00:00
David Anderson 91e120b3f4 - scheduler: improve message formatting; add <debug_locality> flag
for locality scheduling messages

svn path=/trunk/boinc/; revision=16921
2009-01-15 20:23:20 +00:00
David Anderson 74423f23b6 - scheduler: if no jobs available to send, inform the user
svn path=/trunk/boinc/; revision=16730
2008-12-22 00:10:02 +00:00
David Anderson 312ffba708 - API: remove BOINC_OPTIONS::worker_thread_stack_size
- web: check whether to show profile in separate function
    from displaying profile; eliminate double headers
- scheduler: finish purge of redundant arguments

svn path=/trunk/boinc/; revision=16726
2008-12-19 18:14:02 +00:00
David Anderson ef52366c1b - web: fix bug that caused login to fail
- sched: more global vars

svn path=/trunk/boinc/; revision=16695
2008-12-16 16:29:54 +00:00
David Anderson 49a69de194 - scheduler: estimate job durations based on the FLOPS estimate
for the selected APP_VERSION, rather than on the CPU benchmarks.
    Otherwise estimates are wrong for GPU or multi-thread apps.
- scheduler: start switching from having SCHED_REQUEST and
    SCHED_REPLY as globals instead of passing them around as args;
    to be continued.

svn path=/trunk/boinc/; revision=16691
2008-12-15 21:14:32 +00:00
David Anderson 8ea8081626 - scheduler: fix memory leak when reporting time stats logs
- scheduler: fix egregious bug where wu_is_infeasible_fast() result
    is ignored, and we send jobs to hosts that can't handle them.
- scheduler: don't check for disk space in work_needed();
    do it in check_disk(), which generates a message to user.
- scheduler: add -debug_log flag, which sends stderr to
    "debug_log" rather than scheduler_log.txt (for debugging)

svn path=/trunk/boinc/; revision=16578
2008-11-26 21:49:36 +00:00
David Anderson 5039207e2c - scheduler: add <have_cuda_apps> config flag.
If set the "effective NCPUS" (which is used to scale
    daily_result_quota and max_wus_in_progress)
    is max'd with the # of CUDA GPUs.

svn path=/trunk/boinc/; revision=16246
2008-10-21 23:16:07 +00:00
David Anderson 98cfb8d3b0 - rename .C files to .cpp so that Doxygen will work
svn path=/trunk/boinc/; revision=16069
2008-09-26 18:20:24 +00:00