Work generators create jobs (workunits);
the transitioner creates instances (results).
If a work generator tries to maintain a certain number of unsent results
(as the sample work generator does)
it must wait for a bit, after creating jobs,
to let the transitioner create instances of those jobs.
The example work generator waited 5 seconds.
Problem: on a heavily loaded project, the transitioner can fall behind -
minutes or hours behind.
So the above policy can create way too many jobs.
Solution: after creating jobs, the sample work generator
notes the current time X,
then waits until the transitioner catches up to time X
(i.e., until the min workunit.transition_time exceeds X).
This ensures that instances have been created for all the new jobs.
Other work generators the limit the number of unsent jobs
should use the same technique;
use min_transition_time(x) to get the min transition time.
Code cleanup: get_double should be a member of DB_CONN, not DB_BASE.
(reported by Kevin Reed).
The problem: cache inconsistency.
If there are 2 results for the same WU in shared mem,
and 2 scheduler instances get them around the same time,
they can send them with different app versions.
We already fixed this problem for HR by
1) rereading the relevant WU fields while deciding
whether to send the result
2) doing a "careful update" of the WU field using a where clause
to make sure it wasn't modified in the (short) interval
since rereading it.
I fixed the HAV problem in the same way,
and merged the two mechanisms to combine the DB queries.
Also:
- The rereads are done in slow_check() (see below).
- The careful updates are done in update_wu_on_send(),
and this is called *before* doing careful updates on result fields.
That way, if the WU updates fail, we don't have orphaned results.
- already_sent_to_different_platform_careful() (sic)
no longer does DB stuff, so it's merged with
already_send_to_different_hr_class() (better name)
NOTE: slow_check() is used in array scheduling only.
Score-based scheduling uses other code,
in which this bug is not yet fixed.
Locality scheduling doesn't support HR or HAV at all.
This should be unified.
svn path=/trunk/boinc/; revision=24484
see http://boinc.berkeley.edu/trac/wiki/CreditNew
Projects will need to update DB and recompile all back-end programs.
Summary:
- new way of computing credit
- "reliable host" mechanism is per app version
- "host punishment" mechanism is per app version
- adjustment of wu.rsc_fpops_est provides the
equivalent of per app version DCF
- max jobs in progress is now per app
- max jobs per RPC is now per app
TODO:
- reliable mechanism:
- populate and use host_app_version.error_rate
- populate host_app_version.turnaround
- host punishment:
- populate host_app_version.max_jobs_per_day
- populate host_app_version.n_jobs_today
- use app.max_jobs_per_day_init
- job limits:
- use app.max_jobs_in_progress, max_gpu_jobs_in_progress
- use app.max_jobs_per_rpc
- adjust wu.rsc_fpops_est
- remove old credit stuff
fpops_cumulative, credit_multiplier
credit computation in scheduler
- AVERAGE class: use the Knuth algorithm (Wikipedia)
svn path=/trunk/boinc/; revision=21021
is still alive before handling a request. If not, try to reconnect.
This will hopefully make things work better if MySQL goes down and up
when using FCGI.
svn path=/trunk/boinc/; revision=16112
- scheduler: fix bug in adaptive replication:
if send an unreplicated job to untrusted host,
set both wu.target_nresults and wu.min_quorum to app.target_nresults.
svn path=/trunk/boinc/; revision=15762
- DB code: remove "is_high_priority" stuff.
- scheduler: merge find_app_version() into get_app_version().
Have the latter memoize its results (both positive and negative).
Have it call app_plan() for apps with nonempty plan_class.
- scheduler: first steps towards improved selectability of log messages.
It will eventually be like the client,
where you can select among various types of messages.
- feeder: if can't unlink the reread_db trigger file, exit
(else we'd go into an infinite loop)
svn path=/trunk/boinc/; revision=14940
The design has been changed to constant #threads per app version
Various changes from Kevin Reed/WCG:
- server: add workunit.rsc_bandwidth_bound: if nonzero,
send this WU only to hosts with that much download bandwidth
- assimilators: if a handler returns DEFER_ASSIMILATION,
the WU remains in INIT state and will be handled when the
next instance completes.
Useful if you want the assimilator to see all instances.
- scheduler: when setting result.outcome = DETACHED,
set received_time to now
- scheduler: removed the reliable_time and reliable_min_avg_credit
options
- scheduler/web: add optional <allow_non_preferred_projects>
in project preferences.
If present, user will accept work from non-selected apps
if no work is available for selected apps
- scheduler: improved messages for projects with multiple apps
- scheduler: added config options
<granted_credit_weight> and <granted_credit_ramp_up>.
Used in calculating host.claimed_credit_per_cpu_sec,
but I'm not sure how.
- Added two new credit-granting formulas (validate_util.C):
stddev_credit() and two_credit()
- server DB: add rollback_transaction() and affected_rows() to DB_CONN
NOTE: DB update required
svn path=/trunk/boinc/; revision=14870
in case someone else changed since we read it.
Hopefully this will fix a race condition
where WU results get sent to different HR classes.
(Alternatively we could use transactions,
or acquire the semaphore during read/update,
but this could impact performance).
svn path=/trunk/boinc/; revision=14710
causing a potential loss of precision.
Change it to double (same as atof())
- When moderator locks a thread, let them specify reason
svn path=/trunk/boinc/; revision=14662