- whether host is "reliable" for an app version
- whether host is eligible for single replication for an app version
- whether to use host scaling
In each case, the answer is yes if the number of
consecutive valid results is above a threshold.
This replaces existing "error rate" and "scale probation" mechanisms.
TODO: the # of consecutive valid results should also determine
a limit on jobs in progress for an app version.
Namely, if N is the threshold for host scaling, the limit should be
ndevices*(max(1, consecutive_valid - N))
The client currently doesn't supply enough
app version info to do this.
It could be approximated; that would give some protection
against cherry-picking.
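A minimal sketch of the proposed limit, for illustration only. The function name and parameter names are hypothetical; only the formula itself comes from the note above.

```cpp
#include <algorithm>

// Hypothetical helper (not an existing scheduler function).
// ndevices:          number of devices (CPUs or GPUs) the app version uses
// consecutive_valid: current count of consecutive valid results
// threshold:         N, the consecutive-valid threshold for host scaling
// Returns the proposed limit on jobs in progress for the app version.
int jobs_in_progress_limit(int ndevices, int consecutive_valid, int threshold) {
    return ndevices * std::max(1, consecutive_valid - threshold);
}
```

With this formula a host below the threshold is held to one job per device, and the limit grows linearly once it has more than N+1 consecutive valid results.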
- credit: more conservative formulas for combining claimed credit
among replicas.
If there are normal replicas, we use a "low average"
that weights each sample by the sum of the other samples.
Otherwise we use the min (not the average) of the approximate samples.
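One way to read the "low average" described above, as a sketch rather than the actual validator code: each sample is weighted by the sum of the other samples, so large claims get small weights. The function names and the handling of empty or all-zero input are assumptions.

```cpp
#include <algorithm>
#include <vector>

// Weighted mean where sample i has weight (total - x[i]),
// i.e. the sum of all the other samples.
double low_average(const std::vector<double>& x) {
    if (x.empty()) return 0;            // assumption: no samples -> 0
    if (x.size() == 1) return x[0];
    double total = 0;
    for (double v : x) total += v;
    double num = 0, den = 0;
    for (double v : x) {
        num += v * (total - v);
        den += (total - v);
    }
    if (den <= 0) return 0;             // assumption: all-zero samples -> 0
    return num / den;
}

// Fallback when there are no normal replicas: take the minimum sample.
double min_sample(const std::vector<double>& x) {
    if (x.empty()) return 0;            // assumption: no samples -> 0
    return *std::min_element(x.begin(), x.end());
}
```

For two samples this reduces to 2ab/(a+b), the harmonic mean, which is never larger than the plain average.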
NOTE: a DB update is required
svn path=/trunk/boinc/; revision=21230
- daily quota mechanism
- reliable mechanism (accelerated retries)
- "trusted" mechanism (adaptive replication)
- scheduler: enforce host scale probation only for apps with
host_scale_check set.
- validator: do scale probation on invalid results
(need this in addition to error and timeout cases)
- feeder: update app version scales every 10 min, not 10 sec
- back-end apps: support --foo as well as -foo for options
Notes:
- If you have, say, cuda, cuda23 and cuda_fermi plan classes,
a host will have separate quotas for each one.
That means it could error out on 100 jobs for cuda_fermi,
and when its quota goes to zero,
error out on 100 jobs for cuda23, etc.
This is intentional; there may be cases where one version
works but not the others.
- host.error_rate and host.max_results_day are deprecated
TODO:
- the values in the app table for limits on jobs in progress etc.
should override those in config.xml.
Implementation notes:
scheduler:
process_request():
read all host_app_versions for host at start;
Compute "reliable" and "trusted" for each one.
write modified records at end
get_app_version():
add "reliable_only" arg; if set, use only reliable versions
skip over-quota versions
Multi-pass scheduling: if the host has at least one reliable version,
do a pass for jobs that need reliable,
and use only reliable versions.
Then clear best_app_versions cache.
Score-based scheduling: for need-reliable jobs,
it will pick the fastest version,
then give a score bonus if that version happens to be reliable.
When we get back a successful result from client:
increase daily quota
When we get back an error result from client:
impose scale probation
decrease daily quota if not aborted
Validator:
when handling a WU, create a vector of HOST_APP_VERSION
parallel to vector of RESULT.
Pass it to assign_credit_set().
Make copies of originals so we can update only modified ones
update HOST_APP_VERSION error rates
Transitioner:
decrease quota on timeout
svn path=/trunk/boinc/; revision=21181
TODO: remove related code
- validator: update wu.canonical_credit correctly.
However, this field should be deprecated.
- validator: check for error return from assign_credit_set().
svn path=/trunk/boinc/; revision=21096
are written to the DB.
The incremental approach was bogus.
New approach:
host_app_version: write directly; R/W interval is tiny
app_version: maintain an explicit list of update samples
for both PFC and credit.
When the validator flushes its app_version cache,
do careful updates.
Note: when using double fields in careful updates,
you can't test for equality. Use abs(new-old) < 1e-N
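The tolerance test in the note can be sketched as follows. The helper name and the default tolerance are hypothetical; in a careful update the same comparison would appear in the WHERE clause that checks the old value.

```cpp
#include <cmath>

// Hypothetical helper: decide whether a double read back from the DB
// is "the same" as the value we expect, within a tolerance.
// The default 1e-6 is an arbitrary choice for this sketch.
bool nearly_equal(double new_val, double old_val, double tol = 1e-6) {
    return std::fabs(new_val - old_val) < tol;
}
```

An absolute tolerance like this only makes sense when the field's magnitude is known; for values spanning many orders of magnitude a relative tolerance may be a better fit.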
svn path=/trunk/boinc/; revision=21057
see http://boinc.berkeley.edu/trac/wiki/CreditNew
Projects will need to update DB and recompile all back-end programs.
Summary:
- new way of computing credit
- "reliable host" mechanism is per app version
- "host punishment" mechanism is per app version
- adjustment of wu.rsc_fpops_est provides the
equivalent of per app version DCF
- max jobs in progress is now per app
- max jobs per RPC is now per app
TODO:
- reliable mechanism:
- populate and use host_app_version.error_rate
- populate host_app_version.turnaround
- host punishment:
- populate host_app_version.max_jobs_per_day
- populate host_app_version.n_jobs_today
- use app.max_jobs_per_day_init
- job limits:
- use app.max_jobs_in_progress, max_gpu_jobs_in_progress
- use app.max_jobs_per_rpc
- adjust wu.rsc_fpops_est
- remove old credit stuff
fpops_cumulative, credit_multiplier
credit computation in scheduler
- AVERAGE class: use the Knuth algorithm (Wikipedia)
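The "Knuth algorithm" in the last TODO item presumably refers to the online mean/variance update that Knuth attributes to Welford; that reading is an assumption. A minimal sketch follows, with a hypothetical struct name that is not the actual AVERAGE class:

```cpp
// Running mean and variance, updated one sample at a time
// (Welford's method as presented by Knuth).
struct RunningStats {
    long n = 0;        // number of samples seen
    double mean = 0;   // running mean
    double m2 = 0;     // sum of squared deviations from the mean

    void update(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // uses the updated mean
    }

    // Sample variance; 0 if fewer than two samples.
    double variance() const {
        return n > 1 ? m2 / (n - 1) : 0;
    }
};
```

The appeal of this form is that it needs only three stored values per average and avoids the cancellation error of the naive sum-of-squares formula.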
svn path=/trunk/boinc/; revision=21021
However, MySQL's default is that "affected rows" is
rows actually modified, which is not what we want.
Use the CLIENT_FOUND_ROWS option in mysql_real_connect()
to change the semantics to "rows matched".
From Oliver Bock.
svn path=/trunk/boinc/; revision=20880
New policy: anon platform and old platform jobs
get average credit, possibly scaled by elapsed time.
We no longer attempt to guess what app version produced them.
svn path=/trunk/boinc/; revision=20816
Triggering the work generator is now done via the DB
instead of flat files.
Since only E@h uses locality scheduling,
I kept the DB changes in a separate file (db/schema_locality.sql).
There's a new field in the workunit table,
and that's a required update (in db_update.php)
- manager: compile fix
svn path=/trunk/boinc/; revision=20807
elapsed_time: the elapsed time (runtime) as reported by client
flops_estimate: the app's estimated FLOPS as reported by app_plan()
app_version_id: the DB ID of the app_version used
(or -1 if anonymous platform)
TODO: show these in the web interfaces,
and use them where appropriate
svn path=/trunk/boinc/; revision=19002
Old:
1) check deadline based on wu.delay_bound
2) in add_result_to_reply(), potentially modify wu.delay_bound,
e.g. because of retry acceleration
problem: reducing delay bound may cause deadline miss
New:
1) new function get_delay_bound_range()
(called from wu_is_infeasible_fast())
returns optimistic and pessimistic delay bounds.
Retry acceleration logic is here.
2) check deadline based on optimistic bound;
if that fails, check based on pessimistic bound.
Set wu.delay_bound to the one that worked.
Notes:
- get_delay_bound_range() needs result priority and report deadline,
and it's called before we read the full result.
So add these items to WORK_ITEM and WU_RESULT.
- get_delay_bound_range() could be customized for
project-specific deadline policy.
- add_result_to_reply() was becoming a toxic waste dump.
Deadline-related stuff should have been factored out in any case.
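The two-step check described under "New" can be sketched as below. This is not the scheduler's code: the function names, signatures, and the `reduction` factor are hypothetical, and the sketch assumes the optimistic bound is the shortened (retry-accelerated) one and the pessimistic bound is the unmodified wu.delay_bound.

```cpp
struct DelayBoundRange {
    double opt;    // optimistic bound (assumed: shortened for accelerated retries)
    double pess;   // pessimistic bound (assumed: the original delay bound)
};

// Hypothetical stand-in for get_delay_bound_range().
// reduction is an assumed factor in (0,1] applied when accelerating a retry.
DelayBoundRange sketch_delay_bound_range(
    double wu_delay_bound, bool accelerate_retry, double reduction
) {
    DelayBoundRange r;
    r.pess = wu_delay_bound;
    r.opt = accelerate_retry ? wu_delay_bound * reduction : wu_delay_bound;
    return r;
}

// Try the optimistic bound first, then the pessimistic one.
// On success, store the bound that worked in `chosen` and return true.
bool choose_delay_bound(
    double est_completion_delay, const DelayBoundRange& r, double& chosen
) {
    if (est_completion_delay <= r.opt) { chosen = r.opt; return true; }
    if (est_completion_delay <= r.pess) { chosen = r.pess; return true; }
    return false;
}
```

Because the bound is chosen before the job is committed to the reply, shortening it can no longer turn a feasible job into a deadline miss.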
svn path=/trunk/boinc/; revision=18946
don't modify user preferences or CPID.
- client: fix bug that shows ATI version incorrectly
- database: host.posts has been repurposed as a salt (or seqno)
for a new type of weak authenticator that won't depend on password
- web code:
modify forum_preferences.posts instead of host.posts.
(actually, the former isn't used either, we just do a select count(*);
should fix this at some point).
svn path=/trunk/boinc/; revision=18865
after freeing the MYSQL_RES it came from.
(this didn't appear to cause any problems, but not good form).
Fixes #883
svn path=/trunk/boinc/; revision=17904
and add <cuda_multiplier>.
The latter is used in calculating max jobs/day for a host;
namely, it's host.max_results_day * (NCPUS + NCUDA*cuda_multiplier).
Set it to 10 or so if you have CUDA apps.
- scheduler: don't overload effective_ncpus();
instead, add two new functions,
max_results_day_multiplier() and max_wus_in_progress_multiplier()
- scheduler: don't reduce max_results_day if we get an aborted job
(it might have been aborted by the project;
not appropriate to punish host in this case)
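The max jobs/day formula from the <cuda_multiplier> item above, as a small sketch. The function name and signature are hypothetical and are not those of max_results_day_multiplier().

```cpp
// Hypothetical helper: daily job limit for a host, following
//   host.max_results_day * (NCPUS + NCUDA*cuda_multiplier)
double daily_job_limit(
    int max_results_day, int ncpus, int ncuda, double cuda_multiplier
) {
    return max_results_day * (ncpus + ncuda * cuda_multiplier);
}
```

For example, a host with max_results_day = 100, 4 CPUs, 1 CUDA device and cuda_multiplier = 10 would get a limit of 1400 jobs per day.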
svn path=/trunk/boinc/; revision=16959
put a textual summary of them in host.serialnum (currently unused)
- web: show coprocs on host detail page
- db_dump: include coproc info in host XML
svn path=/trunk/boinc/; revision=16697
is still alive before handling a request. If not, try to reconnect.
This will hopefully make things work better if MySQL goes down and up
when using FCGI.
svn path=/trunk/boinc/; revision=16112
we're the main program (otherwise we didn't lock it in
the first place, and a crash results). From Artyom Sharov.
- scheduler: add support for the GCL simulator,
which uses special versions of backend programs
that use virtual time,
and that wait for signals instead of sleep()ing.
To compile:
make clean
configure CXXFLAGS="-DGCL_SIMULATOR"
make
svn path=/trunk/boinc/; revision=16038
- API: in APP_INIT_DATA, enclose project preferences in tags
so that it's legal XML
- scheduler: add <multiple_clients_per_host> option.
Use this if your project runs on Condor or grids
and (to use multicore machines) you're running
multiple clients per host.
This will skip the host lookup based on IP address.
svn path=/trunk/boinc/; revision=15954
added "count" field to DB table to keep track of how many times
we've refreshed.
- show refresh schedule on main courses page
- default for random structure is all units, not 1
svn path=/trunk/boinc/; revision=15846
- scheduler: fix bug in adaptive replication:
if we send an unreplicated job to an untrusted host,
set both wu.target_nresults and wu.min_quorum to app.target_nresults.
svn path=/trunk/boinc/; revision=15762
wish to use it.
- The script calculate_credit_multiplier (expected to be run daily as
a config.xml task) looks at the ratio of granted credit to CPU time
for recent results for each app. The multiplier is calculated so that the
median host's granted credit per CPU second equals that
expected from its benchmarks. This is 30-day exponentially averaged
with the previous value of the multiplier and stored in the table
credit_multiplier.
- When a result is received the server adjusts claimed credit by the
value the multiplier had when the result was sent.
svn path=/trunk/boinc/; revision=15661
If an app is hard, the scheduler always does the deadline check,
even if the client has no other jobs for this project.
And the estimated wallclock duration is multiplied by 1.3,
to avoid sending jobs to hosts that will barely make the deadline.
Hard apps are marked by setting weight = -1.
This is a total kludge, to avoid adding another field to app.
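A sketch of the hard-app deadline test described above. The function names are hypothetical; the 1.3 factor and the weight = -1 marker come from the note.

```cpp
// An app is marked "hard" by setting its weight to -1.
bool is_hard_app(double weight) {
    return weight == -1;
}

// Hypothetical deadline test: for hard apps the estimated wallclock
// duration is inflated by 1.3 before comparing against the delay bound.
bool would_miss_deadline(double est_wallclock, double delay_bound, bool hard) {
    double factor = hard ? 1.3 : 1.0;
    return est_wallclock * factor > delay_bound;
}
```

So a job estimated at 80% of the delay bound is rejected for a hard app (0.8 * 1.3 > 1) but accepted for an ordinary one.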
svn path=/trunk/boinc/; revision=15607
(calculate_credit_multiplier) to determine what factor to multiply
claimed credit by before insertion into the database. Changes to scheduler
to implement have not yet been checked in.
svn path=/trunk/boinc/; revision=15309
added client/scripts to default client build
removed sea from the default clientgui build
added locale/client to the default clientgui build
moved installed headers from $(includedir) to $(pkgincludedir) which
is $(includedir)/boinc by default.
removed redefinitions of $(includedir) from Makefiles.
- configure:
added locale/client/Makefile to AC_CONFIG_FILES
svn path=/trunk/boinc/; revision=15300
and change the corresponding structure field from 64KB to 256KB
(could increase this if needed).
This is needed to handle app versions with lots (> 100) of files
- change LARGE_BLOB_SIZE to BLOB_SIZE in a bunch of places
- Change COPROCS from vector<COPROC> to vector<COPROC*>.
Otherwise the right virtual functions of COPROCs don't get called
svn path=/trunk/boinc/; revision=14986
in server->client reply messages and in the client itself,
move app-planning info from RESULT to APP_VERSION.
This was necessary to allow anonymous platform info (app_info.xml)
to specify avg_ncpus, etc.
e.g., if someone wants to write a multithread version of SETI@home,
or a GPU/CUDA version,
they can run it using the anonymous platform mechanism
and it will be scheduled correctly.
If a server sends an existing APP_VERSION but with different
app-planning info, the client will accept and use the new info.
svn path=/trunk/boinc/; revision=14978
- update_versions: use __ (not :) as separator for plan class
- client: add plan_class to APP_VERSION;
an app version is now identified by platform/version/plan_class
- client CPU scheduler: don't assume apps use 1 CPU
- client: add avg_ncpus, max_cpus, flops, cmdline to RESULT
- scheduler: implement app planning scheme
Other changes:
- client: if symlink() fails, make an XML soft link instead
(for Unix running off a FAT32 FS)
- client: don't accept nonpositive resource share from AMS
- daemons and DB: check for error returns from enumerations,
and exit if so. Thus, if the MySQL server goes down,
all the daemons will soon exit.
The cron script will restart them every 5 min,
so when the DB server comes back up so will the project.
- web: show empty max CPU % as ---
- API: get rid of all_threads_cpu_time option (always the case now)
svn path=/trunk/boinc/; revision=14966
- DB code: remove "is_high_priority" stuff.
- scheduler: merge find_app_version() into get_app_version().
Have the latter memoize its results (both positive and negative).
Have it call app_plan() for apps with nonempty plan_class.
- scheduler: first steps towards improved selectability of log messages.
It will eventually be like the client,
where you can select among various types of messages.
- feeder: if can't unlink the reread_db trigger file, exit
(else we'd go into an infinite loop)
svn path=/trunk/boinc/; revision=14940
and apps that use coprocessors.
There now can be several app_versions for the same
(app, platform, version_num) combination.
This changes a number of things.
- Added app_version.plan_class field to DB
- update_versions now looks for a :plan-class in the
file or directory name, and puts it in the app_version's DB record
- Change uniqueness constraint to include plan_class
- Feeder: the feeder was putting non-deprecated app_versions
in shared mem, and leaving it to the scheduler to
find the latest version for a given platform.
This is dumb.
Instead, for each app/platform pair the feeder now
finds the highest version number of a non-deprecated app version,
and enumerates all non-deprecated app_versions with that
app/platform/version
- Scheduler: add a BEST_APP_VERSION data structure that keeps track,
for each app, of the best app_version for this host.
This saves the work of recomputing it for each job.
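The feeder's new selection rule can be sketched as follows. The struct and function here are simplified, hypothetical stand-ins for the real DB records and feeder code: for each app/platform pair, find the highest version number among non-deprecated app versions, then keep every non-deprecated app version with that number (one per plan class).

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for an app_version DB record.
struct AppVersion {
    int appid;
    int platformid;
    int version_num;
    std::string plan_class;
    bool deprecated;
};

// Return the app versions the feeder would put in shared memory.
std::vector<AppVersion> select_for_shared_mem(const std::vector<AppVersion>& all) {
    // Pass 1: highest non-deprecated version number per (app, platform).
    std::map<std::pair<int, int>, int> latest;
    for (const AppVersion& av : all) {
        if (av.deprecated) continue;
        std::pair<int, int> key(av.appid, av.platformid);
        auto it = latest.find(key);
        if (it == latest.end() || av.version_num > it->second) {
            latest[key] = av.version_num;
        }
    }
    // Pass 2: keep all non-deprecated versions matching that number.
    std::vector<AppVersion> out;
    for (const AppVersion& av : all) {
        if (av.deprecated) continue;
        std::pair<int, int> key(av.appid, av.platformid);
        if (av.version_num == latest.at(key)) out.push_back(av);
    }
    return out;
}
```

The scheduler is then left only with the choice among plan classes, which is what BEST_APP_VERSION caches per app.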
svn path=/trunk/boinc/; revision=14906
The new field (workunit.rsc_bandwidth_bound)
goes at the END of the record.
Always do it this way!
- make_work: after creating a batch of new WUs,
we were waiting 60 sec for the transitioner to
create the results for them
(so that our next count of unsent results would be correct).
This is bogus; if e.g. the transitioner isn't running,
we'll never get the results, and we'll keep creating WUs forever.
Instead: explicitly wait for there to be results for
the last WU from the batch just created.
- scheduler: parse <allow_non_preferred_apps>, <allow_beta_work> correctly.
svn path=/trunk/boinc/; revision=14875
The design has been changed to constant #threads per app version
Various changes from Kevin Reed/WCG:
- server: add workunit.rsc_bandwidth_bound: if nonzero,
send this WU only to hosts with that much download bandwidth
- assimilators: if a handler returns DEFER_ASSIMILATION,
the WU remains in INIT state and will be handled when the
next instance completes.
Useful if you want the assimilator to see all instances.
- scheduler: when setting result.outcome = DETACHED,
set received_time to now
- scheduler: removed the reliable_time and reliable_min_avg_credit
options
- scheduler/web: add optional <allow_non_preferred_apps>
in project preferences.
If present, user will accept work from non-selected apps
if no work is available for selected apps
- scheduler: improved messages for projects with multiple apps
- scheduler: added config options
<granted_credit_weight> and <granted_credit_ramp_up>.
Used in calculating host.claimed_credit_per_cpu_sec,
but I'm not sure how.
- Added two new credit-granting formulas (validate_util.C):
stddev_credit() and two_credit()
- server DB: add rollback_transaction() and affected_rows() to DB_CONN
NOTE: DB update required
svn path=/trunk/boinc/; revision=14870
But if we do, set their transitioner time to plus infinity
so that we don't see them again.
(otherwise we go into an infinite loop)
- DB code: remove "high_priority" from queries not from scheduler
(should probably remove them from there too)
- file_deleter: print error msg if apache user doesn't exist
svn path=/trunk/boinc/; revision=14835
Lets you assign a WU to a particular host,
to one or all hosts belonging to a user or team, or to all hosts.
See http://boinc.berkeley.edu/trac/wiki/AssignedWork
Disabled unless you include <enable_assignment> in config.xml
Uses a new DB table.
Tested but only a little.
- Server: code cleanup; moved result-handling to a new file,
and removed the PLATFORM_LIST arg to everything
(put it in SCHEDULER_REQUEST instead)
svn path=/trunk/boinc/; revision=14767
which added only confusion.
Implement login directly, using cookies.
- All cookie setting/clearing now goes through two functions,
send_cookie() and clear_cookie().
These deal with path and expiry
(e.g. if you want to have different language or forum settings
on two projects on the same server, that now works).
svn path=/trunk/boinc/; revision=14735
in case someone else changed it since we read it.
Hopefully this will fix a race condition
where WU results get sent to different HR classes.
(Alternatively we could use transactions,
or acquire the semaphore during read/update,
but this could impact performance).
svn path=/trunk/boinc/; revision=14710
causing a potential loss of precision.
Change it to double (same as atof())
- When moderator locks a thread, let them specify reason
svn path=/trunk/boinc/; revision=14662