Commit Graph

22 Commits

Author SHA1 Message Date
David Anderson 1267531181 - transitioner: fix bug where retry jobs weren't getting sent
because invalid jobs were counted as successful.
    How could this bug possibly have survived this long?
    From TJM (thanks -- who are you?)
    Fixes #1029

svn path=/trunk/boinc/; revision=22839
2010-12-10 00:33:45 +00:00
David Anderson b169e5ab0f - server programs: print error message instead of numeric retval
in log messages

svn path=/trunk/boinc/; revision=22647
2010-11-08 17:51:57 +00:00
David Anderson d756994bda - scheduler and back end: message tweaks and fixes
svn path=/trunk/boinc/; revision=21835
2010-06-29 03:20:19 +00:00
David Anderson 7c51512cbf - transitioner: the format string for a DB query had %.15d instead of %.15e.
That produced a messed-up query that assigned garbage values to:
        host_app_version.turnaround_var
        host_app_version.turnaround_q
        host_app_version.max_jobs_per_day
        host_app_version.consecutive_valid
    To repair these:
        - set turnaround_var and turnaround_q to zero
        - if max_jobs_per_day is outside of
            (0..config.daily_result_quota)
            set it to config.daily_result_quota
        - if consecutive_valid is outside (0..1000), set it to zero
    I added a script, html/ops/repair_21812.php, that does this;
    if you ran server code between [21181] and [21812], run this script.
- scheduler/transitioner: add <debug_quota> log flag
- changed the build system to always use -Wall
    (if we'd done this before, this bug wouldn't have happened)
- fixed a bunch of other compile warnings


svn path=/trunk/boinc/; revision=21812
2010-06-25 18:54:37 +00:00
David Anderson 89fab4ece5 - back end: change "daily result quota" mechanism.
Old: config.xml specifies an initial daily quota (say, 100).
        Each host_app_version starts out with this quota.
        On the return of a SUCCESS result,
        the quota is doubled, up to the initial value.
        On the return of an error result, or a timeout,
        the quota is decremented down to 1.
    Problem:
        Doesn't accommodate hosts that can do more than 100 jobs/day.
    New: similar, but
        - on validation of a job, daily quota is incremented.
        - on invalidation of a job, daily quota is decremented.
        - on return of an error result, or a timeout,
            daily quota is min'd with initial quota, then decremented.
    Notes:
        - This allows a host to have an unboundedly large quota
            as long as it continues to return more valid
            than invalid results.
        - Even with this change, hosts that return SUCCESS but
            invalid results will continue to get the initial daily quota.
            It would be desirable to reduce their quota to 1.

svn path=/trunk/boinc/; revision=21675
2010-06-02 00:11:01 +00:00
David Anderson 5035007b90 - back end: new way of deciding:
- whether host is "reliable" for an app version
    - whether host is eligible for single replication for an app version
    - whether to use host scaling
    In each case, the answer is yes if the number of
    consecutive valid results is above a threshold.
    This replaces existing "error rate" and "scale probation" mechanisms.

    TODO: the # of consecutive valid results should also determine
        a limit on jobs in progress for an app version.
        Namely, if N is the threshold for host scaling, the limit should be
            ndevices*(max(1, consecutive_valid - N))
        The client currently doesn't supply enough
        app version info to do this.
        It could be approximated; that would give some protection
        against cherry-picking.
- credit: more conservative formulas for combining claimed credit
    among replicas.
    If there are normal replicas, we use a "low average"
    that weights each sample by the sum of the other samples.
    Otherwise we use the min (not the average) of the approximate samples.

NOTE: a DB update is required


svn path=/trunk/boinc/; revision=21230
2010-04-21 19:33:20 +00:00
David Anderson b2451544e1 - server: change the following from per-host to per-(host, app version):
- daily quota mechanism
    - reliable mechanism (accelerated retries)
    - "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
    host_scale_check set.
- validator: do scale probation on invalid results
    (need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options

Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
    a host will have separate quotas for each one.
    That means it could error out on 100 jobs for cuda_fermi,
    and when its quota goes to zero,
    error out on 100 jobs for cuda23, etc.
    This is intentional; there may be cases where one version
    works but not the others.
- host.error_rate and host.max_results_day are deprecated

TODO:
    - the values in the app table for limits on jobs in progress etc.
        should override rather than config.xml.

Implementation notes:
scheduler:
    process_request():
        read all host_app_versions for host at start;
        Compute "reliable" and "trusted" for each one.
        write modified records at end
    get_app_version():
        add "reliable_only" arg; if set, use only reliable versions
        skip over-quota versions
    Multi-pass scheduling: if have at least one reliable version,
        do a pass for jobs that need reliable,
        and use only reliable versions.
        Then clear best_app_versions cache.
    Score-based scheduling: for need-reliable jobs,
        it will pick the fastest version,
        then give a score bonus if that version happens to be reliable.
    When get back a successful result from client:
        increase daily quota
    When get back an error result from client:
        impose scale probation
        decrease daily quota if not aborted
Validator:
    when handling a WU, create a vector of HOST_APP_VERSION
        parallel to vector of RESULT.
        Pass it to assign_credit_set().
        Make copies of originals so we can update only modified ones
    update HOST_APP_VERSION error rates
Transitioner:
    decrease quota on timeout


svn path=/trunk/boinc/; revision=21181
2010-04-15 03:13:56 +00:00
David Anderson 515113d7dd - server: change all backend programs so that -d 4 means
-d 3 plus print DB queries

svn path=/trunk/boinc/; revision=21106
2010-04-05 21:59:33 +00:00
David Anderson fb851311e0 - server: various changes;
see http://boinc.berkeley.edu/trac/wiki/CreditNew

    Projects will need to update DB and recompile all back-end programs.

    Summary:
    - new way of computing credit
    - "reliable host" mechanism is per app version
    - "host punishment" mechanism is per app version
    - adjustment of wu.rsc_fpops_est provides the
        equivalent of per app version DCF
    - max jobs in progress is now per app
    - max jobs per RPC is now per app

    TODO:
    - reliable mechanism:
        - populate and use host_app_version.error_rate
        - populate host_app_version.turnaround
    - host punishment:
        - populate host_app_version.max_jobs_per_day
        - populate host_app_version.n_jobs_today
        - use app.max_jobs_per_day_init
    - job limits:
        - use app.max_jobs_in_progress, max_gpu_jobs_in_progress
        - use app.max_jobs_per_rpc
    - adjust wu.rsc_fpops_est
    - remove old credit stuff
        fpops_cumulative, credit_multiplier
        credit computation in scheduler

- AVERAGE class: use the Knuth algorithm (Wikipedia)


svn path=/trunk/boinc/; revision=21021
2010-03-29 22:28:20 +00:00
David Anderson 67e438bc6b - transitioner: fix bug where WUs with error_mask <> 0 keep
transitioning every 10 days, hence never become eligible for purging.
    The problem: the transitioner has a "safety net" where,
    if the WU doesn't have a canonical result,
    it arranges for another transition in 10 days.
    Skip this if error_mask<>0.

svn path=/trunk/boinc/; revision=20265
2010-01-25 23:35:16 +00:00
David Anderson 6b3ea3d339 - user web: don't show "database error" if result refers
to deleted app version

svn path=/trunk/boinc/; revision=20251
2010-01-23 00:36:12 +00:00
David Anderson 974ea0469c - DB purge: allow fractional min_age_days (from Travis Desell)
svn path=/trunk/boinc/; revision=20250
2010-01-22 23:55:50 +00:00
David Anderson 69f09b9384 - transitioner: fix to 15 Sept checkin
svn path=/trunk/boinc/; revision=19089
2009-09-18 15:59:40 +00:00
David Anderson 75d2a45491 - server programs: add --help and --version cmdline options to all.
From Nils Chr. Brause.

svn path=/trunk/boinc/; revision=19079
2009-09-17 17:56:59 +00:00
David Anderson 814a8e4925 - transitioner: don't create more than max_total_results results
svn path=/trunk/boinc/; revision=19055
2009-09-16 04:35:42 +00:00
David Anderson 390e7b3c3a svn path=/trunk/boinc/; revision=18617 2009-07-17 16:13:51 +00:00
David Anderson 12eb6057e5 - client, Mac: don't do res_init(). It causes a crash.
- client (Unix): if client crashes while benchmark processes are going,
    make sure they detect this and exit.
- back-end programs: remove hardwired assumptions about
    what directory they run in, and hence where config.xml is.
    E.g., daemons look for it in "..", others expect it in current dir.
    New approach: all the programs look for the project dir as follows:
    1) the environment var BOINC_PROJECT_DIR, if defined
    2) the current dir, if config.xml is there.
    3) else ".."
    This means you can run programs in either proj/bin/ or proj/,
    or (using BOINC_PROJECT_DIR) you can keep executables
    outside of the project dir.


svn path=/trunk/boinc/; revision=18042
2009-05-07 13:54:51 +00:00
Eric J. Korpela 8f3abcc835 - Added checks for net/*.h, arpa/*.h, netinet/*.h and code to figure out
which of those files to include
    - Modified MAC address check to work on some non-Linux unixes.
      (mac_address.cpp)
    - Added suggested change to "already attached to project" checking.
      (ProjectInfoPage.cpp)
    - changed includes of standard c header files to their c++ equivalents
      (i.e. replaced <stdio.h> with <cstdio>) for namespace protection.
    - replaced "using namespace std;" with more explicit "using std::function" in
      several files.
    - Fixed bug in checking whether the os is OS/2 and added conditional OS_OS2
      to the build environment. (boinc_platform.m4,configure.ac)
    - Changed build environment to not use -nostandardlibs unless we are using
      G++ and static linkage is specified. (configure.ac)
    - Added makefiles and package building files for solaris CSW package manager.
    - Fixed bug with attempting to find login name using logname. (configure.ac)
    - Added ifdef HAVE_* protection around some include files commonly found in
      sys.
    - Added support for unified binary for x86_64/i686-pc-solaris.
      (cs_platforms.cpp)
    - generate_host_cpid() now uses MAC address on non-linux unix.
      (hostinfo_network.cpp)
    - Macro BOINC_SET_COMPILE_FLAGS now doesn't check gcc only flags on non-gcc
      compilers. (boinc_set_compile_flags.m4)
    - Library compiles no longer depend upon the library extension or require
      the library to be prefixed with lib.
    - More fixes for fcgi builds.
    - Added declaration of "struct ether_addr" and ether_ntoa().  Have not yet
      implemented ether_ntoa() for machines that don't have it, or where it is
      buggy.  (unix_util.h)
    - Added FCGI::perror() which calls FCGI_perror(). (boinc_fcgi.{h,cpp})
    - Fixed library Makefiles so that all required headers get installed.


svn path=/trunk/boinc/; revision=17388
2009-02-26 00:23:23 +00:00
David Anderson f8ed8163d2 - server: sleep intervals are integers
svn path=/trunk/boinc/; revision=16577
2008-11-26 20:37:11 +00:00
David Anderson 07bd768e9d - server: add -sleep_interval args to file_deleter and transitioner
(from Nicolas; fixes #783)


svn path=/trunk/boinc/; revision=16576
2008-11-26 19:09:27 +00:00
David Anderson be4cd9bb79 - scheduler: notify user if we're not sending work
because we don't have any (matchmaker only).
- back end programs: for programs that do enumerations,
    check for error returns and exit
    (otherwise we'll get stuck forever if DB fails)

NOTE: In the course of researching this I came across a bug
in the transitioner: if there's a WU with more than 1000 results,
the enumeration will always return ERR_DB_NOT_FOUND,
and the transitioner won't ever do anything again.
Fixing this is a little tricky, so I'm not going to do it right now.


svn path=/trunk/boinc/; revision=16324
2008-10-27 21:23:07 +00:00
David Anderson 98cfb8d3b0 - rename .C files to .cpp so that Doxygen will work
svn path=/trunk/boinc/; revision=16069
2008-09-26 18:20:24 +00:00