- whether host is "reliable" for an app version
- whether host is eligible for single replication for an app version
- whether to use host scaling
In each case, the answer is yes if the number of
consecutive valid results is above a threshold.
This replaces existing "error rate" and "scale probation" mechanisms.
TODO: the # of consecutive valid results should also determine
a limit on jobs in progress for an app version.
Namely, if N is the threshold for host scaling, the limit should be
ndevices*(max(1, consecutive_valid - N))
The client currently doesn't supply enough
app version info to do this.
It could be approximated; that would give some protection
against cherry-picking.
- credit: more conservative formulas for combining claimed credit
among replicas.
If there are normal replicas, we use a "low average"
that weights each sample by the sum of the other samples.
Otherwise we use the min (not the average) of the approximate samples.
NOTE: a DB update is required
svn path=/trunk/boinc/; revision=21230
- daily quota mechanism
- reliable mechanism (accelerated retries)
- "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
host_scale_check set.
- validator: do scale probation on invalid results
(need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options
Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
a host will have separate quotas for each one.
That means it could error out on 100 jobs for cuda_fermi,
and when its quota goes to zero,
error out on 100 jobs for cuda23, etc.
This is intentional; there may be cases where one version
works but not the others.
- host.error_rate and host.max_results_day are deprecated
TODO:
- the values in the app table for limits on jobs in progress etc.
should override rather than config.xml.
Implementation notes:
scheduler:
process_request():
read all host_app_versions for host at start;
Compute "reliable" and "trusted" for each one.
write modified records at end
get_app_version():
add "reliable_only" arg; if set, use only reliable versions
skip over-quota versions
Multi-pass scheduling: if have at least one reliable version,
do a pass for jobs that need reliable,
and use only reliable versions.
Then clear best_app_versions cache.
Score-based scheduling: for need-reliable jobs,
it will pick the fastest version,
then give a score bonus if that version happens to be reliable.
When get back a successful result from client:
increase daily quota
When get back an error result from client:
impose scale probation
decrease daily quota if not aborted
Validator:
when handling a WU, create a vector of HOST_APP_VERSION
parallel to vector of RESULT.
Pass it to assign_credit_set().
Make copies of originals so we can update only modified ones
update HOST_APP_VERSION error rates
Transitioner:
decrease quota on timeout
svn path=/trunk/boinc/; revision=21181
TODO: remove related code
- validator: update wu.canonical_credit correctly.
However, this field should be deprecated.
- validator: check for error return from assign_credit_set().
svn path=/trunk/boinc/; revision=21096
are written to the DB.
The incremental approach was bogus.
New approach:
host_app_version: write directly; R/W interval is tiny
app_version: maintain an explicit list of update samples
for both PFC and credit.
When the validator flushes its app_version cache,
do careful updates.
Note: when using double fields in careful updates,
you can't test for equality. Use abs(new-old) < 1e-N
svn path=/trunk/boinc/; revision=21057
see http://boinc.berkeley.edu/trac/wiki/CreditNew
Projects will need to update DB and recompile all back-end programs.
Summary:
- new way of computing credit
- "reliable host" mechanism is per app version
- "host punishment" mechanism is per app version
- adjustment of wu.rsc_fpops_est provides the
equivalent of per app version DCF
- max jobs in progress is now per app
- max jobs per RPC is now per app
TODO:
- reliable mechanism:
- populate and use host_app_version.error_rate
- populate host_app_version.turnaround
- host punishment:
- populate host_app_version.max_jobs_per_day
- populate host_app_version.n_jobs_today
- use app.max_jobs_per_day_init
- job limits:
- use app.max_jobs_in_progress, max_gpu_jobs_in_progress
- use app.max_jobs_per_rpc
- adjust wu.rsc_fpops_est
- remove old credit stuff
fpops_cumulative, credit_multiplier
credit computation in scheduler
- AVERAGE class: use the Knuth algorithm (Wikipedia)
svn path=/trunk/boinc/; revision=21021
New policy: anon platform and old platform jobs
get average credit, possibly scaled by elapsed time.
We no longer attempt to guess what app version produced them.
svn path=/trunk/boinc/; revision=20816
Triggering the work generator is now done via the DB
instead of flat files.
Since only E@h uses locality scheduling,
I kept the DB changes in a separate file (db/schema_locality.sql).
There's a new field in the workunit table,
and that's a required update (in db_update.php)
- manager: compile fix
svn path=/trunk/boinc/; revision=20807
elapsed_time: the elapsed time (runtime) as reported by client
flops_estimate: the app's estimated FLOPS as reported by app_plan()
app_version_id: the DB ID of the app_version used
(or -1 if anonymous platform)
TODO: show these in the web interfaces,
and use them where appropriate
svn path=/trunk/boinc/; revision=19002
Old:
1) check deadline based on wu.delay_bound
2) in add_result_to_reply(), potentially modify wu.delay_bound,
e.g. because of retry acceleration
problem: reducing delay bound may cause deadline miss
New:
1) new function get_delay_bound_range()
(called from wu_is_infeasible_fast())
returns optimistic and pessimistic delay bounds.
Retry acceleration logic is here.
2) check deadline based on optimistic bound;
if that fails, check based on pessimistic bound.
Set wu.delay_bound to the one that worked.
Notes:
- get_delay_bound_range() needs result priority and report deadline,
and it's called before we read the full result.
So add these items to WORK_ITEM and WU_RESULT.
- get_delay_bound_range() could be customized for
project-specific deadline policy.
- add_result_to_reply() was becoming a toxic waste dump.
Deadline-related stuff should have been factored out in any case.
svn path=/trunk/boinc/; revision=18946
don't modify user preferences or CPID.
- client: fix bug that shows ATI version incorrectly
- database: host.posts has been repurposed as a salt (or seqno)
for a new type of weak authenticator that won't depend on password
- web code:
modify forum_preferences.posts instead of host.posts.
(actually, the former isn't used either, we just do a select count(*);
should fix this at some point).
svn path=/trunk/boinc/; revision=18865
and add <cuda_multiplier>.
The latter is used in calculating max jobs/day for a host;
namely, it's host.max_results_day * (NCPUS + NCUDA*cuda_multiplier).
Set it to 10 or so if you have CUDA apps.
- scheduler: don't overload effective_ncpus();
instead, add two new functions,
max_results_day_multiplier() and max_wus_in_progress_multiplier()
- scheduler: don't reduce max_results_day if we get an aborted job
(it might have been aborted by the project;
not appopriate to punish host in this case)
svn path=/trunk/boinc/; revision=16959
put a textual summary of them in host.serialnum (currently unused)
- web: show coprocs on host detail page
- db_dump: include coproc info in host XML
svn path=/trunk/boinc/; revision=16697
we're the main program (otherwise we didn't lock it in
the first place, and a crash results). From Artyom Sharov.
- scheduler: add support for the GCL simulator,
which uses special versions of backend programs
that use virtual time,
and that wait for signals instead of sleep()ing.
To compile:
make clean
configure CXXFLAGS="-DGCL_SIMULATOR"
make
svn path=/trunk/boinc/; revision=16038
- scheduler: fix bug in adaptive replication:
if send an unreplicated job to untrusted host,
set both wu.target_nresults and wu.min_quorum to app.target_nresults.
svn path=/trunk/boinc/; revision=15762
wish to use it.
- The script calculate_credit_multiplier (expected to be run daily as
a config.xml task) looks at the ratio of granted credit to CPU time
for recent results for each app. Multiplier is calculated to cause
median hosts granted credit per cpu second to equal to equal that
expected from its benchmarks. This is 30-day exponentially averaged
with the previous value of the multplier and stored in the table
credit_multplier.
- When a result is received the server adjusts claimed credit by the
value the multiplier had when the result was sent.
svn path=/trunk/boinc/; revision=15661
and change the correspending structure field from 64KB to 256KB
(could increase this if needed).
This is needed to handle app versions with lots (> 100) of files
- change LARGE_BLOB_SIZE to BLOB_SIZE a bunch of places
- Change COPROCS from vector<COPROC> to vector<COPROC*>.
Otherwise the right virtual functions of COPROCs don't get called
svn path=/trunk/boinc/; revision=14986
in server->client reply messages and in the client itself,
move app-planning info from RESULT to APP_VERSION.
This was necessary to allow anonymous platform info (app_info.xml)
to specify avg_ncpus, etc.
e.g., if someone wants to write a multithread version of SETI@home,
or a GPU/CUDA version,
they can run it using the anonymous platform mechanism
and it will be scheduled correctly.
If a server sends an existing APP_VERSION but with different
app-planning info, the client will accept and use the new info.
svn path=/trunk/boinc/; revision=14978
- update_versions: use __ (not :) as separator for plan class
- client: add plan_class to APP_VERSION;
an app version is now identified by platform/version/plan_class
- client CPU scheduler: don't assume apps use 1 CPU
- client: add avg_ncpus, max_cpus, flops, cmdline to RESULT
- scheduler: implement app planning scheme
Other changes:
- client: if symlink() fails, make a XML soft link instead
(for Unix running off a FAT32 FS)
- client: don't accept nonpositive resource share from AMS
- daemons and DB: check for error returns from enumerations,
and exit if so. Thus, if the MySQL server goes down,
all the daemons will soon exit.
The cron script will restart them every 5 min,
so when the DB server comes back up so will the project.
- web: show empty max CPU % as ---
- API: get rid of all_threads_cpu_time option (always the case now)
svn path=/trunk/boinc/; revision=14966
and apps that use coprocessors.
There now can be several app_versions for the same
(app, platform, version_num) combination.
This changes a number of things.
- Added app_version.plan_class field to DB
- update_versions now looks for a :plan-class in the
file or directory name, and puts it in the app_version's DB record
- Change uniqueness constraint to include plan_class
- Feeder: the feeder was putting non-deprecated app_versions
in shared mem, and leaving it to the scheduler to
find the latest version for a given platform.
This is dumb.
Instead, for each app/platform pair the feeder now
finds the highest version number of a non-deprecated app version,
and enumerates all non-deprecated app_versions with that
app/platform/version
- Scheduler: add a BEST_APP_VERSION data structure that keeps track,
for each app, what the best app_version is for this host.
This saves the work of recomputing it for each job.
svn path=/trunk/boinc/; revision=14906
The design has been changed to constant #threads per app version
Various changes from Kevin Reed/WCG:
- server: add workunit.rsc_bandwidth_bound: if nonzero,
send this WU only to hosts with that much download bandwidth
- assimilators: if a handler returns DEFER_ASSIMILATION,
the WU remains in INIT state and will be handled when the
next instance completes.
Useful if you want the assimilator to see all instances.
- scheduler: when setting result.outcome = DETACHED,
set received_time to now
- scheduler: removed the reliable_time and reliable_min_avg_credit
options
- scheduler/web: add optional <allow_non_preferred_projects>
in project preferences.
If present, user will accept work from non-selected apps
if no work is available for selected apps
- scheduler: improved messages for projects with multiple apps
- scheduler: added config options
<granted_credit_weight> and <granted_credit_ramp_up>.
Used in calculating host.claimed_credit_per_cpu_sec,
but I'm not sure how.
- Added two new credit-granting formulas (validate_util.C):
stddev_credit() and two_credit()
- server DB: add rollback_transaction() and affected_rows() to DB_CONN
NOTE: DB update required
svn path=/trunk/boinc/; revision=14870
Lets you assign a WU to a particular host,
to one or all hosts belonging to a user or team, or to all hosts.
See http://boinc.berkeley.edu/trac/wiki/AssignedWork
Disabled unless you include <enable_assignment> in config.xml
Uses a new DB table.
Tested but only a little.
- Server: code cleanup; moved result-handling to a new file,
and removed the PLATFORM_LIST arg to everything
(put it in SCHEDULER_REQUEST instead)
svn path=/trunk/boinc/; revision=14767
1) a WU is marked as ready for assimilation and has no errors;
2) it has no canonical result
In this case, the assimilate handler gets called anyway,
typically with the canonical result of the previous WU as arg.
Note: this situation doesn't arise normally;
it might happen if some results are deleted accidentally.
The fix:
- identify this situation, and set the WU.error_mask to a new code
(WU_ERROR_NO_CANONICAL_RESULT)
- zero out the "canonical_result" variable passed to the handler,
so even if the handler fails to check wu.error_mask,
at least it won't assimilate the same result twice.
Thanks to Hendrik Verhoek for finding this bug.
- DB schema: team table type is MyISAM, not InnoDB
svn path=/trunk/boinc/; revision=13938
like source code and text files. I skipped to check most files in html/
and mac_*/ though.
- Added svn:executable to tools/watch_tcp because it has a shebang.
svn path=/trunk/boinc/; revision=13819
update_versions to add version 6 apps.
It looks for API_VERSION string in main executable,
adds the API version to the app_version XML,
and sets min_core_version to 6 for version 6+ apps
- API: include API_VERSION string
- convert tabs to spaces here and there
- scheduler: parse unused elements in <net_stats>
- ops/show_log.php: if no URL args, just show form (fixes#415)
- client: parse and store api_version (not used yet)
svn path=/trunk/boinc/; revision=13627
Make a single function that creates teams
and cleanses arguments.
- API: don't include config.h in parse.h.
This file is included from apps
(indirectly, via graphics_api.h)
so it shouldn't assume that config.h exists
svn path=/trunk/boinc/; revision=13212
To do this, set host.max_results_day to -1.
If you do this, scheduler requests from that host
will get an error message, and will otherwise be ignored
(no jobs in or out, no trickles).
- Scheduler: send_message() should be called ONLY if you're
not going to call handle_request();
otherwise we'll write two separate replies.
To fix this, I added a separate function (send_error_message())
that can be called within handle_request()
to deal with error situations.
- Scheduler: moved debug_sched() to main.C
- Scheduler: moved logic to send "delete file" commands
out of handle_request() into a separate function,
send_file_deletes() in sched_locality.C.
Remove #ifdef EINSTEIN_AT_HOMEs; maybe someday another project
will use locality scheduling!
svn path=/trunk/boinc/; revision=13108
on successive calls, scans through ALL the sendable
jobs satisfying the select clause
(it does this by ID order, so there's no order clause)
This is used for HR, so that if a job has been committed
to an HR class, we eventually get it.
With extremely minimal testing, the new HR stuff seems to work.
db/
boinc_db.C,h
sched/
feeder.C
sample_work_generator.C
server_types.C
svn path=/trunk/boinc/; revision=12988
there is for each HR class, and writes it to a file.
This will be used soon for HR support in the feeder.
- split the HR code into hr.C,h (stuff used by both census and scheduler)
and sched_hr.C (stuff used only by the scheduler)
- database: change DB_CREDITED_JOB to treat workunitid
as a double (which it is) rather than a long.
BTW, long == int.
- fixed lots of compile warnings in the server code
db/
boinc_db.C,h
lib/
boinc_cmd.C
miofile.C
util.C
sched/
Makefile.am
census.C (new)
feeder.C
file_deleter.C
file_upload_handler.C
handle_request.C
hr.C,h (new)
main.C
sample_assimilator.C
sample_work_generator.C
sched_array.C
sched_hr.C,h
sched_send.C
server_types.C
transitioner.C
validator.C
svn path=/trunk/boinc/; revision=12970
e.g. distinguish between models of Intel and AMD
- Scheduler: add a quick HR check that doesn't access the DB
- Transitioner: if a workunit has >0 error results and no success results,
set its HR class to zero
From M.F. Somers.
db/
boinc_db.C,h
sched/
sched_array.C
sched_hr.C,h
transitioner.C
svn path=/trunk/boinc/; revision=12773