Commit Graph

220 Commits

Author SHA1 Message Date
Bernd Machenschalk 34c823a9ab Merge branch 'EinsteinAtHome' into 'master'
This is meant not to break anything, just add some
(optional) logging and features needed for Einstein@Home.
Please contact me before changing or removing any of this.

Conflicts:
	sched/db_dump.cpp
	sched/file_deleter.cpp
	sched/validator.cpp
2014-05-26 14:42:36 +02:00
Bernd Machenschalk 2f6d140c56 validator:
added options -min_wu_id and -max_wu_id to validator
2014-05-23 12:06:00 +02:00
David Anderson de6540cbc0 scheduler: if a result was aborted by user, don't count it as an error 2014-05-22 23:54:56 -07:00
David Anderson b17455816d db_dump: include badges in XML stats export
I did this by including list of badges in the tables.xml file,
and writing the list of badge assignments to 2 new files,
badge_user.gz (for users) and badge_team.gz (for teams).

I considered including the badges within the <user> and <team> elements.
However, this would require enumerating the badges for a particular user
within the enumeration of users, which doesn't work;
only one enumeration can be active at a time.
Plus it would be less efficient, and db_dump already takes
a half hour on a big project.
2014-05-18 19:19:05 -07:00
David Anderson fec574f4e8 create_work: increase the efficiency of bulk job creation
The job submission RPC handler (PHP) originally ran the
create_work program once per job.
This took about 1.5 minutes to create 1000 jobs.
Recently I changed this so that create_work only is run once;
it does one SQL insert per job.
Disappointingly, this was only slightly faster: 1 min per 1000 jobs.

This commit changes create_work to create multiple jobs per SQL insert
(as many as will fit in a 1 MB query, which is the default limit).
This speeds things up by a factor of 100: 1000 jobs in 0.5 sec.
2014-04-10 23:53:19 -07:00
David Anderson fc7c75b200 server: parse peak memory/disk info from client, store in DB, display in web
The latest client reports the peak working set size, swap size,
and disk usage for completed jobs.
Add fields to the results table to store these.
Parse them in scheduler request messages, and write to the DB.
Display them in the result web page.

This data can be used to improve (or even automate)
the job estimates for memory and disk usage.
2014-04-02 19:35:59 -07:00
David Anderson 0c430ce1fa Add support for multi-size apps
See http://boinc.berkeley.edu/trac/wiki/MultiSize
The components of this include:
- DB changes:
    add size_class to workunit and result
    n_size_classes to app; >1 means multi-size
- size_regulator daemon program: change results states
    from INACTIVE to UNSENT carefully
- size_census program; writes quantile info in flat files
- transitioner: when creating results for multi-size apps,
    set server state to INACTIVE
- sched shmem (feeder): read quantile info from flat files,
    store in shared memory
- scheduler (score-based scheduling): for multi-size apps,
    add component to score function for size class.
- show_shmem: show result size class
- make_work (and other callers of count_unsent_results()):
    count both INACTIVE and UNSENT
- create_work: add --size_class cmdline option

Also:
- if get MySQL errors in upgrade, don't rewrite db_version
2013-04-25 00:27:35 -07:00
David Anderson 2ded3ff67d - fix typo in GUI RPC
- check in some code for multi-user job prioritization
2013-03-04 15:23:39 +01:00
David Anderson 68f9880615 - client: remove "device" entry from CUDA_DEVICE_PROP,
and change types of mem-size fields from int to double.
    These fields are size_t in NVIDIA's version of this;
    however, cuDeviceGetAttribute() returns them as int,
    so I don't see where this makes any difference.
- client: fix bug in handling of <no_rsc_apps> element.
- scheduler: message tweaks.
    Note: [foo] means that the message is enabled by <debug_foo>.



svn path=/trunk/boinc/; revision=25849
2012-07-05 20:24:17 +00:00
David Anderson 759c23ed27 - server: create a harness for testing validator code.
If you link your functions (init_result(), compare_results(),
    cleanup_result()) with validate_test.cpp,
    you'll get a program that you can run as
        validate_test file1 file2
    and it will compare the two files
    (this works only for validators that expect 1 file per result).

    I added a makefile, sched/makefile_validator_test,
    that you can use for this.
- server: shuffle code so that the above doesn't need to
    link MySQL libraries
- client: if we fetch a master file and it contains no scheduler URLs,
    show a message of class INTERNAL_ERROR
- client/scheduler: make CUDA_DEVICE_PROP.totalGlobalMem a double,
    and remove dtotalGlobalMem.
    Although NVIDIA reports RAM size as a size_t,
    there's no reason to store it as an integer after that.


svn path=/trunk/boinc/; revision=25542
2012-04-10 00:32:35 +00:00
David Anderson 86f50ba080 - admin web: when resetting app statistics,
clear elapsed time stats as well as PFC stats


svn path=/trunk/boinc/; revision=25530
2012-04-05 11:01:38 +00:00
David Anderson 4a50b2b2e2 - wrapper: compute final CPU time correctly for multi-process apps
- storage stuff


svn path=/trunk/boinc/; revision=25356
2012-02-29 20:58:45 +00:00
David Anderson 516e5ad798 - storage stuff
svn path=/trunk/boinc/; revision=25354
2012-02-29 01:11:28 +00:00
David Anderson ce52c9cf3e - storage stuff
svn path=/trunk/boinc/; revision=25341
2012-02-24 22:55:11 +00:00
David Anderson a8f883d2fa - server: split out the "antique file deletion" feature of
file_deleter.cpp into a separate program,
    since it blocks normal file deletion while it's running.
    From Bernd.
- storage stuff


svn path=/trunk/boinc/; revision=25321
2012-02-24 03:09:56 +00:00
David Anderson 2ed1cfbbb2 - scheduler and create_work: fix bugs that caused targeted jobs
to be sent to non-targeted hosts.
    The feeder was erroneously putting targeted jobs
    in the shared mem cache.
    Changes:
    - The feeder only enumerates jobs for which
        workunit.transitioner_flags is zero.
        NOTE: this field is nonzero iff the job is assigned.
    - create_work: when creating an assigned jobs,
        set workunit.transitioner_flags appropriately


svn path=/trunk/boinc/; revision=25314
2012-02-22 22:13:08 +00:00
David Anderson c4d1229830 - scheduler: in version selection, when deciding which version is fastest,
we multiple projected FLOPS by a normal random var
    with mean 1 and stddev 0.1.
    Make the stddev configurable; in particular it can be zero.


svn path=/trunk/boinc/; revision=25311
2012-02-22 19:51:09 +00:00
David Anderson 1b8d6b098d - storage stuff (work in progress)
- small code shuffle


svn path=/trunk/boinc/; revision=25274
2012-02-16 23:59:26 +00:00
David Anderson 480e28b54c - web: fix the user search feature
- scheduler: parse d_project_share
- scheduler: if vbox and vbox_mt are both available,
    use vbox for a 1-CPU machine


svn path=/trunk/boinc/; revision=25176
2012-02-01 03:30:14 +00:00
David Anderson 130d6ed4f0 - server: revamp the "assigned job" mechanism.
This now supports two main use cases:
    1) there's a job that you want to run once on all hosts,
        present and future
        (or all hosts belonging to a user, or to a team).
        The job is never transitioned, validated, or assimilated.
    2) There's a normal job for which you want to use only
        hosts belonging to a specific user (e.g. cluster or cloud hosts).
        This restriction can be made either when the job is created,
        or on the fly,
        e.g. as part of a scheme for accelerating batch completion.
        For the latter purposes we now provide a function
            restrict_wu_to_user(DB_WORKUNIT&, int userid);

        The job goes through the standard
        transitioner/validator/assimilator path.

    These cases are enabled by config flags
        <enable_assignment_multi/>
        <enable_assignment/>
    respectively.

    Assignment of type 2) are no longer stored in shared mem,
    so there is no limit on their number.

    There is no longer a rule that assigned job names must contain "asgn".

    NOTE: this requires a database update.


svn path=/trunk/boinc/; revision=25169
2012-01-30 22:39:13 +00:00
David Anderson 10c79a7166 - scheduler: initialize COPROC_ATI::version to zero;
avoid sending spurious "update driver" messages


svn path=/trunk/boinc/; revision=25131
2012-01-23 21:59:12 +00:00
David Anderson c05444ad1e - GUI RPC: switching to the new XML parser
(which won't parse a double as an int)
    revealed a type mismatch in FILE_TRANSFER::next_request_time
    between client and server.


svn path=/trunk/boinc/; revision=25125
2012-01-23 05:03:52 +00:00
David Anderson dd16170fc1 - scheduler: the p_fpops value reported by clients can't be trusted.
Some credit cheats (e.g. with credit_by_runtime) can be done
    by reporting a huge value.
    Fix this by capping the value at 1.1 times the 95th percentile
    of host.p_fpops, taken over active hosts.


svn path=/trunk/boinc/; revision=25017
2012-01-09 17:35:48 +00:00
David Anderson e8657adfd2 - scheduler: change vbox_mt app plan function to use 1, 2 or 3 CPUs
depending on how many the host has,
    and whether CPU VM extensions are present
    (this reflects the requirements of CernVM).


svn path=/trunk/boinc/; revision=25009
2012-01-08 01:28:39 +00:00
David Anderson 95ebb112c2 - client: for VBox apps, check stderr for "ERR_CPU_VM_EXTENSIONS_DISABLED".
If found, set HOST_INFO::p_vm_extensions_disabled,
    and pass this to the scheduler.
- scheduler (VBox app plan function) if a host has p_vm_extensions_disabled
    set, don't sent it multicore VBox jobs.

Note: if you have a host with VM extensions, and they're disabled
    in the BIOS, and you enable them, you can remove the
    <p_vm_extensions_disabled> line from client_state.xml
    and you'll be eligible to get multicore VM jobs again.


svn path=/trunk/boinc/; revision=24944
2011-12-30 09:43:58 +00:00
David Anderson 8877aa5183 - web: in GPU model list page,
look for plan classes containing "nvidia" as well as "cuda".


svn path=/trunk/boinc/; revision=24614
2011-11-16 19:47:40 +00:00
David Anderson e49f945908 - Validator: allow project-specific code to mark a result
is a "runtime outlier", i.e. its runtime does
    not correspond to the job's rsc_fpops_est.
    Runtime outliers are not counted in the statistics for
    elapsed time, turnaround time, and peak FLOPs count.

    The is intended for applications like SETI@home,
    some of whose jobs finish more or less instantly
    (this happens if the data contains a lot of interference).
    If a host happens to get a bunch of these short jobs,
    its statistics will get skewed: in essence, the server
    will think that the host is extremely fast,
    and will send it too many jobs.


svn path=/trunk/boinc/; revision=24225
2011-09-16 16:43:15 +00:00
David Anderson 176b0a4327 - validator: add a --credit_from_runtime option.
This assigns credit proportional to runtime*p_fpops.
    To prevent cheating, p_fpops is capped at the 95th percentile value
    among active hosts,
    and runtime is capped at a specified limit.
    This option supports apps, like LHC's CERNvm app,
    that run for a certain amount of time and then exit.
    The CreditNew system doesn't work for such apps.
- trickle_credit:
    To prevent cheating,
    cap p_fpops at the 95th percentile value among active hosts,
    and require a limit on runtime.
- require that trickle handlers supply an initialization function


svn path=/trunk/boinc/; revision=24182
2011-09-13 21:01:42 +00:00
David Anderson 7c81d72378 - web: fix warnings in forum pages
- scheduler: when using elapsed time stats to predict runtime,
    cap the estimated FLOPS at twice the peak FLOPS;
    otherwise, if a host has received a lot of very short jobs
    recently, it will get a too-high FLOPS estimate and
    will exceed the rsc_fpops_bound limit.


svn path=/trunk/boinc/; revision=24128
2011-09-05 17:29:53 +00:00
David Anderson c5c5975b44 - Improve interface of XML_PARSER.
Add parsed_tag and is_tag to the class,
    so that parsing functions don't need to declare them
    and pass them around.
- Complete the task of using XML_PARSER as the argument
    to all parsing functions.
    (Internally, many of these functions still use the old XML parser;
    that's the next step.)


svn path=/trunk/boinc/; revision=23978
2011-08-10 17:11:08 +00:00
David Anderson 271699ea0a - server: fix typo
svn path=/trunk/boinc/; revision=23904
2011-07-30 22:42:05 +00:00
David Anderson 6e5acbbe60 - web: remote job submission:
- add fields to batch table, extend APIs accordingly
    - require that example web interface run on BOINC server
        (this makes many things easier;
        an actual remote interface would require a bit more work)


svn path=/trunk/boinc/; revision=23881
2011-07-27 06:20:48 +00:00
David Anderson 83965db576 - web: more remote job submission code. Not finished.
svn path=/trunk/boinc/; revision=23871
2011-07-25 21:45:53 +00:00
David Anderson 2177a6bd95 - server: restore fpops/intops_cumulative to RESULT
(structure, not table) for AQUA
- client, Windows: when wake up from hibernation,
    get the time before printing log msg


svn path=/trunk/boinc/; revision=23784
2011-06-29 23:00:39 +00:00
David Anderson 4403df42d8 - client: add <type> element to <exclude_gpu> config option,
in case of multiple GPU types


svn path=/trunk/boinc/; revision=23777
2011-06-25 05:13:56 +00:00
David Anderson 436415cfe1 - scheduler, back end: add "homogeneous app version" feature.
Lets you specify, on a per-app basis,
    that all instances should be done using the same app version.
    This is for validation in the presence of GPUs.
- scheduler: code cleanup
    - Instead of adding a bunch of non-DB fields to RESULT,
        used a derived class SCHED_DB_RESULT.
    - Instead of storing a pointer to BEST_APP_VERSION in RESULT,
        store the structure itself.
        This simplifies the memory allocation situation.
- client: condition "Got server request to delete file" messages
    on <file_xfer_debug>


svn path=/trunk/boinc/; revision=23636
2011-06-06 03:40:42 +00:00
David Anderson b6140088e3 - update_versions: trim XML strings
svn path=/trunk/boinc/; revision=23569
2011-05-21 06:22:15 +00:00
David Anderson 53a7307305 - scheduler: fix nasty bug introduced in [23040]
that caused no jobs to be sent.


svn path=/trunk/boinc/; revision=23096
2011-02-23 21:22:45 +00:00
David Anderson 5421335dbb - transitioner: fix bug that could cause file deletion to not be done
for some WUs
- back end: fix the way "report grace period" is implemented
    old: result.report_deadline (i.e. what's in the DB) and
        the deadline sent to the client are the same.
        Some confusing and incorrect logic in the transitioner
        tries to provide the desired semantics.
    new: result.report_deadline is the deadline sent to the client,
        plus the grace period.
        No logic in the transitioner is needed.


svn path=/trunk/boinc/; revision=23040
2011-02-15 22:07:14 +00:00
David Anderson 3355b66241 - client and scheduler: a client host may have multiple VM systems installed.
TODO: check for VirtualBox on Mac, Linux

svn path=/trunk/boinc/; revision=22704
2010-11-17 23:19:07 +00:00
Rom Walton 1564a49816 - sched: Parse the detected virtual machine software from
the scheduler request so it can be used in plan classes.
        
    db/
        boinc_db.h
    sched/
        sched_types.cpp

svn path=/trunk/boinc/; revision=22703
2010-11-17 20:52:01 +00:00
David Anderson ae7866b251 - scheduler: restore scaling of daily quota by # processors
and/or config.gpu_multiplier
- client: msg tweak

svn path=/trunk/boinc/; revision=21753
2010-06-15 22:21:57 +00:00
David Anderson 4147249de2 - server: delete old credit stuff
- user web: show host link in user result list.  Fixes #999


svn path=/trunk/boinc/; revision=21735
2010-06-12 22:08:15 +00:00
David Anderson 8b836a391b - database: remove unused fields from app table
svn path=/trunk/boinc/; revision=21728
2010-06-11 03:50:47 +00:00
David Anderson c4df1f3104 svn path=/trunk/boinc/; revision=21232 2010-04-21 20:11:41 +00:00
David Anderson 5035007b90 - back end: new way of deciding:
- whether host is "reliable" for an app version
    - whether host is eligible for single replication for an app version
    - whether to use host scaling
    In each case, the answer is yes if the number of
    consecutive valid results is above a threshold.
    This replaces existing "error rate" and "scale probation" mechanisms.

    TODO: the # of consecutive valid results should also determine
        a limit on jobs in progress for an app version.
        Namely, if N is the threshold for host scaling, the limit should be
            ndevices*(max(1, consecutive_valid - N))
        The client currently doesn't supply enough
        app version info to do this.
        It could be approximated; that would give some protection
        against cherry-picking.
- credit: more conservative formulas for combining claimed credit
    among replicas.
    If there are normal replicas, we use a "low average"
    that weights each sample by the sum of the other samples.
    Otherwise we use the min (not the average) of the approximate samples.

NOTE: a DB update is required


svn path=/trunk/boinc/; revision=21230
2010-04-21 19:33:20 +00:00
David Anderson 021edb02c2 - back end programs: improve log msgs
svn path=/trunk/boinc/; revision=21193
2010-04-16 18:07:08 +00:00
David Anderson b2451544e1 - server: change the following from per-host to per-(host, app version):
- daily quota mechanism
    - reliable mechanism (accelerated retries)
    - "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
    host_scale_check set.
- validator: do scale probation on invalid results
    (need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options

Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
    a host will have separate quotas for each one.
    That means it could error out on 100 jobs for cuda_fermi,
    and when its quota goes to zero,
    error out on 100 jobs for cuda23, etc.
    This is intentional; there may be cases where one version
    works but not the others.
- host.error_rate and host.max_results_day are deprecated

TODO:
    - the values in the app table for limits on jobs in progress etc.
        should override rather than config.xml.

Implementation notes:
scheduler:
    process_request():
        read all host_app_versions for host at start;
        Compute "reliable" and "trusted" for each one.
        write modified records at end
    get_app_version():
        add "reliable_only" arg; if set, use only reliable versions
        skip over-quota versions
    Multi-pass scheduling: if have at least one reliable version,
        do a pass for jobs that need reliable,
        and use only reliable versions.
        Then clear best_app_versions cache.
    Score-based scheduling: for need-reliable jobs,
        it will pick the fastest version,
        then give a score bonus if that version happens to be reliable.
    When get back a successful result from client:
        increase daily quota
    When get back an error result from client:
        impose scale probation
        decrease daily quota if not aborted
Validator:
    when handling a WU, create a vector of HOST_APP_VERSION
        parallel to vector of RESULT.
        Pass it to assign_credit_set().
        Make copies of originals so we can update only modified ones
    update HOST_APP_VERSION error rates
Transitioner:
    decrease quota on timeout


svn path=/trunk/boinc/; revision=21181
2010-04-15 03:13:56 +00:00
David Anderson 2536797068 - validator: remove update_credit_per_cpu_sec(). Irrelevant.
TODO: remove related code
- validator: update wu.canonical_credit correctly.
    However, this field should be deprecated.
- validator: check for error return from assign_credit_set().

svn path=/trunk/boinc/; revision=21096
2010-04-05 20:03:54 +00:00
David Anderson 19f7d66b53 - backend programs: change the way PFC and elapsed-time statistics
are written to the DB.
    The incremental approach was bogus.
    New approach:
    host_app_version: write directly; R/W interval is tiny
    app_version: maintain an explicit list of update samples
        for both PFC and credit.
        When the validator flushes its app_version cache,
        do careful updates.
    Note: when using double fields in careful updates,
    you can't test for equality.  Use abs(new-old) < 1e-N

svn path=/trunk/boinc/; revision=21057
2010-04-02 19:10:37 +00:00