Work generators create jobs (workunits);
the transitioner creates instances (results).
If a work generator tries to maintain a certain number of unsent results
(as the sample work generator does)
it must wait for a bit, after creating jobs,
to let the transitioner create instances of those jobs.
The example work generator waited 5 seconds.
Problem: on a heavily loaded project, the transitioner can fall behind -
minutes or hours behind.
So the above policy can create way too many jobs.
Solution: after creating jobs, the sample work generator
notes the current time X,
then waits until the transitioner catches up to time X
(i.e., until the min workunit.transition_time exceeds X).
This ensures that instances have been created for all the new jobs.
Other work generators that limit the number of unsent jobs
should use the same technique;
use min_transition_time(x) to get the min transition time.
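A minimal sketch of that wait loop, assuming min_transition_time()
fills in the minimum workunit.transition_time; the surrounding names
are illustrative:

    #include <unistd.h>     // sleep()

    // Block until the transitioner has processed all jobs created
    // before time x, i.e. until min(workunit.transition_time) > x.
    void wait_for_transitioner(double x) {
        double min_tt = 0;
        while (1) {
            int retval = min_transition_time(min_tt);
            if (!retval && min_tt > x) break;   // transitioner caught up
            sleep(5);                           // arbitrary poll interval
        }
    }

    // in the work generator: after creating jobs, note the time
    // and wait:
    //     make_jobs();
    //     wait_for_transitioner(dtime());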
Code cleanup: get_double should be a member of DB_CONN, not DB_BASE.
- DB: add tables for badges and badge/user and badge/team associations
- add script that defines 3 RAC-based badges and assigns them
- add images for these badges
- add admin page for creating/editing badges
- show badges on user page
not done:
- figure out how to send badges to client
- display badges somewhere in the GUIs
- export badges in db_dump
- enable badges by default for new projects
The OPENCL_CPU_PROP structure was being referred to as both
"opencl_cpu_prop" and "cpu_opencl_prop", roughly 50/50,
in variable names and XML tags.
Let's standardize on "opencl_cpu_prop",
which is what current clients are sending in scheduler requests.
- Batches now have optional "expire time".
If this time passes and the batch is not retired, abort and retire it.
- Add script "expire_batches" which enforces the above.
Run it as a periodic task.
- Add a web RPC for setting the expire time of a batch
(it can be changed multiple times)
- Add a C++ interface for this RPC (see the sketch below)
- Add a BOINC_SET_LEASE command to the BOINC GAHP
("lease" is Condor term for expire time)
Problem: a workunit could error out with unsent results.
The feeder skips such results, but the size_regulator counts them
and so doesn't promote any new results.
Solution: the feeder scans for results even if their workunit has errors.
It marks these results as server state OVER, outcome DIDNT_NEED.
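In sketch form (DB_RESULT and the state constants are BOINC names;
the helper itself is illustrative):

    // Retire an unsent instance of an errored-out workunit so that
    // it's no longer counted as unsent.
    void retire_unsendable(DB_RESULT& result) {
        result.server_state = RESULT_SERVER_STATE_OVER;
        result.outcome = RESULT_OUTCOME_DIDNT_NEED;
        result.update();    // persist the new state
    }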
See http://boinc.berkeley.edu/trac/wiki/MultiSize
The components of this include:
- DB changes:
add size_class to workunit and result
n_size_classes to app; >1 means multi-size
- size_regulator daemon program: change results states
from INACTIVE to UNSENT carefully
- size_census program; writes quantile info in flat files
- transitioner: when creating results for multi-size apps,
set server state to INACTIVE
- sched shmem (feeder): read quantile info from flat files,
store in shared memory
- scheduler (score-based scheduling): for multi-size apps,
add a size-class term to the score function (sketched after this list).
- show_shmem: show result size class
- make_work (and other callers of count_unsent_results()):
count both INACTIVE and UNSENT
- create_work: add --size_class cmdline option
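For the scheduler piece, the size-class term might look roughly like
this; the weight and the host's size class are assumptions for
illustration:

    #include <cstdlib>      // abs()

    const double SIZE_CLASS_WEIGHT = 1.0;   // assumed weight

    // Score term favoring hosts whose size class (from the quantile
    // info in shared memory) matches the job's size_class.
    double size_class_term(int job_size_class, int host_size_class) {
        return -abs(job_size_class - host_size_class) * SIZE_CLASS_WEIGHT;
    }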
Also:
- if we get MySQL errors during the upgrade, don't rewrite db_version
- client: many actions are done periodically: we remember the time
(usually in a static variable called "last_time")
of the last time we did the action,
and we only do it again when now - last_time exceeds some interval.
Example: sending heartbeat messages to apps.
Problem: if the system clock is decreased by X,
we won't do any of these actions for a period of X,
making it appear that the client is frozen.
Solution: when we detect that the system clock has decreased,
set a global var "clock_change" for 1 iteration of the polling loop,
and disable these time checks if clock_change is set.
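The pattern, with the fix, looks like this (heartbeats as the
example; the interval value is illustrative):

    // Set for one iteration of the polling loop when the client
    // detects that the system clock moved backward.
    bool clock_change = false;

    const double HEARTBEAT_INTERVAL = 1.0;  // illustrative

    void maybe_send_heartbeats(double now) {
        static double last_time = 0;
        // skip the time check if the clock just jumped backward,
        // so the action can't be frozen for the length of the jump
        if (!clock_change && now - last_time < HEARTBEAT_INTERVAL) {
            return;     // not due yet
        }
        last_time = now;
        // ... send heartbeat messages to running apps ...
    }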
A "viable" result is one that could potentially become the canonical result,
i.e. the outcome is SUCCESS and the validate state is not INVALID.
The existing code treated all results with outcome SUCCESS as viable,
which is wrong.
In particular, this could cause workunit.target_nresults
to be incremented inappropriately.
- scheduler: the old support for NCI apps was partly (but not all) finished.
New logic: if the project has an NCI app then:
- make a list of NCI apps for which the client doesn't have
a job in progress.
- try to send one job for each of these apps
- do this even if no work is being requested.
- don't send jobs for NCI apps by other mechanisms
NOTE: the client logic isn't quite right for mixed NCI projects.
If there's no job for a given NCI app,
the client should do a scheduler RPC.
This isn't critical so we won't do this now.
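In outline, the new send pass (ssp and APP.non_cpu_intensive are
scheduler names; the two helpers are illustrative):

    void send_nci_jobs() {
        for (int i = 0; i < ssp->napps; i++) {
            APP& app = ssp->apps[i];
            if (!app.non_cpu_intensive) continue;
            // skip apps the client already has a job for
            if (host_has_job_in_progress(app)) continue;
            send_one_job(app);  // even if no work was requested
        }
    }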
svn path=/trunk/boinc/; revision=26068
- add support for projects with a mix of CPU-intensive
and non-CPU-intensive applications.
An app can be specified as non-CPU-intensive in project.xml,
and this attribute can be set or cleared using the admin web interface.
Note: support for this was added to the client in 2011,
but we didn't add server-side support at that time.
This change is in 6.12 and later clients.
svn path=/trunk/boinc/; revision=26060
- add a config item vda_host_timeout.
A host that hasn't done a scheduler RPC for this long
is considered dead.
- a host that's not running a version 7+ client is considered dead
- host.cpu_efficiency (an otherwise unused field) is used
as a flag for dead hosts
- the scheduler clears the flag if the client is v7+
- vdad sets the flag for hosts where last RPC is old
- before choosing a host for chunk download,
vdad checks its client version.
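Putting the rules together, vdad's test might look like this
(host.rpc_time is the time of the last scheduler RPC; passing the
client's major version in is a simplification):

    bool host_is_dead(HOST& host, int client_major, double now) {
        // no scheduler RPC for too long
        if (now - host.rpc_time > config.vda_host_timeout) return true;
        // pre-version-7 clients can't do volunteer storage
        if (client_major < 7) return true;
        return false;
    }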
svn path=/trunk/boinc/; revision=26059
- Allow projects to report "desired disk usage" (DDU).
If the client learns that a project wants disk space,
it can shrink the allocation to other projects.
- Base share computation on DDU rather than disk usage.
- Introduce the notion of "disk resource share".
This is defined (somewhat arbitrarily) as resource share
plus 1/10 of the largest resource share.
This is intended to ensure that even zero-share projects
get enough disk space to store app versions and data files;
otherwise they wouldn't be able to compute
(a sketch of this computation follows the list).
- server: use host.d_boinc_max (which wasn't being used)
to store the d_project_share reported by the client.
- volunteer storage: change the way hosts are allocated to chunks.
Allow hosts to store several chunks of the same file, if needed
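A sketch of the disk resource share computation defined above:

    // resource share plus 1/10 of the largest share across projects
    double disk_resource_share(double share, double largest_share) {
        return share + largest_share / 10;
    }

    // Example: shares of 100, 50, 0 give disk shares of 110, 60, 10,
    // so the zero-share project still gets some disk space.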
svn path=/trunk/boinc/; revision=26052
- remove the locking of files while writing to them.
It's not clear to me that this locking is beneficial,
and it may be causing filesystem problems at WCG
- volunteer storage stuff
svn path=/trunk/boinc/; revision=26021
- client: change types of mem-size fields from int to double.
These fields are size_t in NVIDIA's version of this;
however, cuDeviceGetAttribute() returns them as int,
so I don't see where this makes any difference.
- client: fix bug in handling of <no_rsc_apps> element.
- scheduler: message tweaks.
Note: [foo] means that the message is enabled by <debug_foo>.
svn path=/trunk/boinc/; revision=25849
- Fix various #include issues.
CODING STYLE LAW (minimal inclusion principle):
If foo.cpp requires <blah.h>,
#include <blah.h> in foo.cpp, NOT foo.h
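Concretely (illustrative files):

    // foo.h: no #include <vector> here; a forward declaration
    // is enough for the interface.
    struct BLAH;
    void munge(BLAH* b);

    // foo.cpp: the implementation uses std::vector, so the
    // #include goes here, not in foo.h.
    #include <vector>
    #include "foo.h"

    void munge(BLAH* b) {
        std::vector<int> v;     // <vector> needed only in this file
        // ...
    }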
svn path=/trunk/boinc/; revision=25837
- feeder: don't enumerate results for WUs with nonzero error_mask
- scheduler: in slow_check(), make sure the WU error_mask is still zero
svn path=/trunk/boinc/; revision=25822
Both are for use by projects.
- job submission file sandbox: don't delete the physical file
when deleting a sandbox entry.
We'll have to figure out how to garbage-collect physical files.
- LAMMPS job submission:
use the 50th-percentile host, not the 0th.
svn path=/trunk/boinc/; revision=25734
This reruns validation for instances that are successful
but marked as invalid or inconclusive.
Use this if you changed your validator to be more permissive,
and you want to grant credit for instances that were
originally marked as invalid.
svn path=/trunk/boinc/; revision=25714