This is not meant to break anything, just to add some
(optional) logging and features needed for Einstein@Home.
Please contact me before changing or removing any of this.
Conflicts:
sched/db_dump.cpp
sched/file_deleter.cpp
sched/validator.cpp
I did this by including a list of badges in the tables.xml file,
and writing the list of badge assignments to 2 new files,
badge_user.gz (for users) and badge_team.gz (for teams).
I considered including the badges within the <user> and <team> elements.
However, this would require enumerating the badges for a particular user
within the enumeration of users, which doesn't work;
only one enumeration can be active at a time.
Plus it would be less efficient, and db_dump already takes
a half hour on a big project.
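For reference, the badge_user.gz writer is essentially one enumeration
(a sketch only; it assumes a DB_BADGE_USER enumerator in the style of
the other DB_* wrappers, and the field names and the gzip step are
illustrative):

    DB_BADGE_USER bu;
    FILE* f = fopen("badge_user", "w");  // gzipped to badge_user.gz afterwards
    fprintf(f, "<badge_users>\n");
    // DB_* enumerators return 0 while rows remain
    while (!bu.enumerate("")) {
        fprintf(f,
            "  <badge_user>\n"
            "    <user_id>%d</user_id>\n"
            "    <badge_id>%d</badge_id>\n"
            "    <create_time>%f</create_time>\n"
            "  </badge_user>\n",
            bu.user_id, bu.badge_id, bu.create_time
        );
    }
    fprintf(f, "</badge_users>\n");
    fclose(f);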
My last commit did this using a new API call.
But this would require rebuilding apps any time you want to change it;
too much work.
So instead make it an attribute of apps,
which you can set via the admin web interface.
Corresponding changes to client.
The job submission RPC handler (PHP) originally ran the
create_work program once per job.
This took about 1.5 minutes to create 1000 jobs.
Recently I changed this so that create_work is run only once;
it does one SQL insert per job.
Disappointingly, this was only slightly faster: 1 min per 1000 jobs.
This commit changes create_work to create multiple jobs per SQL insert
(as many as will fit in a 1 MB query, which is the default limit).
This speeds things up by a factor of 100: 1000 jobs in 0.5 sec.
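The batching idea, in sketch form (names are illustrative, not the
actual create_work code; DB_CONN::do_query() is the usual query entry
point):

    #include <string>
    #include <vector>

    const size_t MAX_QUERY_LEN = 1000000;  // MySQL default max_allowed_packet

    // tuples[i] is the "(...)" VALUES clause for one job
    void insert_jobs(std::vector<std::string>& tuples, DB_CONN& db) {
        std::string prefix = "insert into workunit (name, xml_doc) values ";
        std::string query;
        for (auto& t: tuples) {
            if (!query.empty() && query.size() + t.size() + 1 > MAX_QUERY_LEN) {
                db.do_query(query.c_str());   // flush a full ~1 MB batch
                query.clear();
            }
            query += query.empty() ? prefix + t : "," + t;
        }
        if (!query.empty()) db.do_query(query.c_str());
    }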
The latest client reports the peak working set size, swap size,
and disk usage for completed jobs.
Add fields to the results table to store these.
Parse them in scheduler request messages, and write to the DB.
Display them in the result web page.
This data can be used to improve (or even automate)
the job estimates for memory and disk usage.
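In the scheduler's request parser this amounts to a few extra cases
(a sketch using BOINC's XML_PARSER idiom; the tag and field names are
assumed to match what the client reports and the new DB columns):

    // parsing the new fields from the <result> element of a request
    void parse_final_sizes(XML_PARSER& xp, RESULT& result) {
        while (!xp.get_tag()) {
            if (xp.parse_double("peak_working_set_size",
                result.peak_working_set_size)) continue;
            if (xp.parse_double("peak_swap_size",
                result.peak_swap_size)) continue;
            if (xp.parse_double("peak_disk_usage",
                result.peak_disk_usage)) continue;
            // ... existing fields parsed here ...
        }
    }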
- add --is_gzip option to sample_bitwise_validator.
If set, all files are treated as gzip archives.
Check each file's 10-byte gzip header to verify that it's a gzip file,
but ignore the header when comparing file contents
(see the sketch after this list).
- validator.cpp: don't error out on unparsed cmdline args,
since we're now using them in sample_bitwise_validator
and sample_substr_validator.
- fix build error on Debian
- gpu_active_frac is the fraction of time GPU use is allowed
while the client is running.
Previously the client reported it but we weren't storing it in the DB.
We may need it in the future for batch scheduling logic.
- fix a crashing bug in scheduler
- client: minor message tweak
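For the --is_gzip check above, the comparison reduces to something
like this (a sketch; error handling is elided and the helper names
are mine):

    #include <cstdio>

    #define GZIP_HEADER_LEN 10

    // gzip files start with magic bytes 0x1f 0x8b
    static bool is_gzip(FILE* f) {
        int b0 = fgetc(f), b1 = fgetc(f);
        return (b0 == 0x1f) && (b1 == 0x8b);
    }

    // compare everything after the 10-byte headers; the header
    // contains a timestamp, so it can differ between otherwise
    // identical files
    int compare_gzip(FILE* f1, FILE* f2, bool& match) {
        if (!is_gzip(f1) || !is_gzip(f2)) return -1;   // not gzip: error out
        fseek(f1, GZIP_HEADER_LEN, SEEK_SET);
        fseek(f2, GZIP_HEADER_LEN, SEEK_SET);
        int c1, c2;
        do {
            c1 = fgetc(f1);
            c2 = fgetc(f2);
            if (c1 != c2) { match = false; return 0; }
        } while (c1 != EOF);
        match = true;
        return 0;
    }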
Work generators create jobs (workunits);
the transitioner creates instances (results).
If a work generator tries to maintain a certain number of unsent results
(as the sample work generator does)
it must wait for a bit, after creating jobs,
to let the transitioner create instances of those jobs.
The sample work generator waited 5 seconds.
Problem: on a heavily loaded project, the transitioner can fall behind -
minutes or hours behind.
So the above policy can create way too many jobs.
Solution: after creating jobs, the sample work generator
notes the current time X,
then waits until the transitioner catches up to time X
(i.e., until the min workunit.transition_time exceeds X).
This ensures that instances have been created for all the new jobs.
Other work generators that limit the number of unsent jobs
should use the same technique;
use min_transition_time(x) to get the min transition time.
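The wait, in sketch form (dtime() is BOINC's wall-clock helper;
daemon_sleep() stands in for the daemon's usual sleep call; and
min_transition_time() is assumed to return nonzero on error):

    // After committing a batch of jobs at time X, wait until the
    // transitioner has processed everything older than X.
    double x = dtime();
    while (1) {
        double min_tt;
        int retval = min_transition_time(min_tt);
        if (!retval && min_tt > x) break;   // transitioner has caught up
        daemon_sleep(2);
    }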
Code cleanup: get_double should be a member of DB_CONN, not DB_BASE.
- DB: add tables for badges and badge/user and badge/team associations
- add script that defines 3 RAC-based badges and assigns them
- add images for these badges
- add admin page for creating/editing badges
- show badges on user page
not done:
- figure out how to send badges to client
- display badges somewhere in the GUIs
- export badges in db_dump
- enable badges by default for new projects
The OPENCL_CPU_PROP structure was being referred to as both
"opencl_cpu_prop" and "cpu_opencl_prop", roughly 50/50,
in variable names and XML tags.
Let's standardize on "opencl_cpu_prop",
which is what current clients are sending in scheduler requests.
- Batches now have optional "expire time".
If this time passes and the batch is not retired, abort and retire it.
- Add script "expire_batches" which enforces the above.
Run it as a periodic task.
- Add a web RPC for setting the expire time of a batch
(it can be changed multiple times)
- Add a C++ interface for this RPC
- Add a BOINC_SET_LEASE command to the BOINC GAHP
("lease" is Condor term for expire time)
Problem: a workunit could error out with unsent results.
The feeder skips such results, but the size_regulator counts them
and so doesn't promote any new results.
Solution: the feeder scans for results even with workunit errors.
It marks these results as server state OVER, outcome DIDNT_NEED.
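The shape of the fix in the feeder scan (a sketch; the constants are
the existing server-state and outcome enums):

    // a result whose workunit errored out can't be sent,
    // so retire it instead of skipping it silently
    if (wu.error_mask) {
        result.server_state = RESULT_SERVER_STATE_OVER;
        result.outcome = RESULT_OUTCOME_DIDNT_NEED;
        result.update();   // DB_RESULT::update()
        continue;
    }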
See http://boinc.berkeley.edu/trac/wiki/MultiSize
The components of this include:
- DB changes:
add size_class to workunit and result
n_size_classes to app; >1 means multi-size
- size_regulator daemon program: change result states
from INACTIVE to UNSENT carefully
- size_census program; writes quantile info in flat files
- transitioner: when creating results for multi-size apps,
set server state to INACTIVE
- sched shmem (feeder): read quantile info from flat files,
store in shared memory
- scheduler (score-based scheduling): for multi-size apps,
add a component to the score function for size class
(sketched after this list).
- show_shmem: show result size class
- make_work (and other callers of count_unsent_results()):
count both INACTIVE and UNSENT
- create_work: add --size_class cmdline option
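For the scheduler change above, the size-class component can be as
simple as the following (an illustrative sketch; how the host's own
size class is derived from the quantile info is omitted):

    // score-based scheduling: prefer right-sized jobs for multi-size apps
    if (app->n_size_classes > 1) {
        if (wu.size_class == host_size_class) {
            score += 1;    // job size matches this host's speed quantile
        } else {
            score -= 1;    // penalize mismatches rather than forbid them
        }
    }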
Also:
- if we get MySQL errors during upgrade, don't rewrite db_version
The client does many things periodically: it keeps track
(usually in a static variable called "last_time")
of the last time we did something,
and we only do it again when now - last_time exceeds some interval.
Example: sending heartbeat messages to apps.
Problem: if the system clock is decreased by X,
we won't do any of these actions for time X,
making it appear that the client is frozen.
Solution: when we detect that the system clock has decreased,
set a global var "clock_change" for 1 iteration of the polling loop,
and disable these time checks if clock_change is set.
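The pattern, with the fix (a sketch; names like HEARTBEAT_INTERVAL
and send_heartbeat() are illustrative):

    #define HEARTBEAT_INTERVAL 1.0   // illustrative interval (seconds)
    void send_heartbeat();           // hypothetical periodic action

    static bool clock_change = false;  // true for 1 iteration after a decrease
    static double last_loop_time = 0;

    void poll_once() {
        double now = dtime();
        clock_change = last_loop_time && now < last_loop_time;
        last_loop_time = now;

        static double last_time = 0;
        // skip the elapsed-time check when the clock went backwards
        if (clock_change || now - last_time > HEARTBEAT_INTERVAL) {
            last_time = now;
            send_heartbeat();
        }
    }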
A "viable" result is one that could potentially become the canonical result,
i.e. the outcome is SUCCESS and the validate state is not INVALID.
The existing code treated all results with outcome SUCCESS as viable,
which is wrong.
In particular, this could cause workunit.target_nresults
to be incremented inappropriately.
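As a predicate (the constants are the existing enums):

    inline bool is_viable(RESULT& r) {
        return r.outcome == RESULT_OUTCOME_SUCCESS
            && r.validate_state != VALIDATE_STATE_INVALID;
    }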
The old handling of projects where some apps are NCI
(but not all) wasn't finished.
New logic: if the project has an NCI app then:
- make a list of NCI apps for which the client doesn't have
a job in progress.
- try to send one job for each of these apps
- do this even if no work is being requested.
- don't send jobs for NCI apps by other mechanisms
NOTE: the client logic isn't quite right for mixed NCI projects.
If there's no job for a given NCI app,
the client should do a scheduler RPC.
This isn't critical so we won't do this now.
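The scheduler-side pass, roughly (a sketch; the helpers are
hypothetical):

    // try to send one job for each NCI app the host has no job for,
    // regardless of how much work was requested
    for (unsigned int i = 0; i < nci_apps.size(); i++) {
        APP& app = nci_apps[i];
        if (host_has_job_in_progress(app)) continue;  // hypothetical helper
        send_one_job_for_app(app);                    // hypothetical helper
    }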
svn path=/trunk/boinc/; revision=26068
Add support for projects with a mix of CPU-intensive
and non-CPU-intensive applications.
An app can be specified as non-CPU-intensive in project.xml,
and this attribute can be set or cleared using the admin web interface.
Note: support for this was added to the client in 2011,
but we didn't add server-side support at that time.
This change is in 6.12 and later clients.
svn path=/trunk/boinc/; revision=26060
- add a config item vda_host_timeout.
A host that hasn't done a scheduler RPC for this long
is considered dead.
- a host that's not running a version 7+ client is considered dead
- host.cpu_efficiency (an otherwise unused field) is used
as a flag for dead hosts
- the scheduler clears the flag if the client is v7+
- vdad sets the flag for hosts whose last RPC is old
- before choosing a host for chunk download,
vdad checks its client version.
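vdad's dead-host test then reads roughly as follows (a sketch;
vda_host_timeout is the new config item, cpu_efficiency is the
repurposed flag described above, and the flag polarity shown here,
nonzero meaning dead, is an assumption):

    bool host_is_dead(HOST& h, double now) {
        if (h.cpu_efficiency) return true;   // flagged by vdad/scheduler
        // no scheduler RPC within the timeout: vdad will flag it
        return now - h.rpc_time > config.vda_host_timeout;
    }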
svn path=/trunk/boinc/; revision=26059
- Allow projects to report "desired disk usage" (DDU).
If the client learns that a project wants disk space,
it can shrink the allocation to other projects.
- Base share computation on DDU rather than disk usage.
- Introduce the notion of "disk resource share".
This is defined (somewhat arbitrarily) as resource share
plus 1/10 of the largest resource share.
This is intended to ensure that even zero-share projects
get enough disk space to store app versions and data files;
otherwise they wouldn't be able to compute.
- server: use host.d_boinc_max (which wasn't being used)
to store the d_project_share reported by the client.
- volunteer storage: change the way hosts are allocated to chunks.
Allow hosts to store several chunks of the same file, if needed
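Numerically: with two projects at resource shares 100 and 0, the disk
resource shares are 100 + 10 = 110 and 0 + 10 = 10, so the zero-share
project still gets about 8% of the available disk. As a formula
(sketch; the function name is mine):

    // disk resource share = share + largest_share/10
    double disk_resource_share(PROJECT& p, double largest_share) {
        return p.resource_share + largest_share/10;
    }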
svn path=/trunk/boinc/; revision=26052
- don't lock files while writing to them.
It's not clear to me that this locking is beneficial,
and it may be causing filesystem problems at WCG
- volunteer storage stuff
svn path=/trunk/boinc/; revision=26021