2273 Commits

Author SHA1 Message Date
David Anderson 7a65508923 scheduler: don't penalize hosts for jobs aborted by server
If a job was aborted by server (exit status is EXIT_ABORTED_BY_PROJECT)
don't decrement daily quota or reset the consecutive valid count.
2014-05-23 12:25:41 -07:00
David Anderson de6540cbc0 scheduler: if a result was aborted by user, don't count it as an error 2014-05-22 23:54:56 -07:00
David Anderson c05e74321f Scheduler: only one instance of assigned jobs should be in progress 2014-05-20 10:40:34 -07:00
David Anderson b17455816d db_dump: include badges in XML stats export
I did this by including list of badges in the tables.xml file,
and writing the list of badge assignments to 2 new files,
badge_user.gz (for users) and badge_team.gz (for teams).

I considered including the badges within the <user> and <team> elements.
However, this would require enumerating the badges for a particular user
within the enumeration of users, which doesn't work;
only one enumeration can be active at a time.
Plus it would be less efficient, and db_dump already takes
a half hour on a big project.
2014-05-18 19:19:05 -07:00
David Anderson 641099040e scheduler: add log message for beta-test pref, score-based case 2014-05-12 10:21:28 -07:00
David Anderson af5d5b35f2 scheduler: add stub code for debugging a particular user or host, for Rytis 2014-05-09 01:06:37 -07:00
David Anderson 3c64bbb837 scheduler: fix bug where first NCI job in shared mem never gets sent
In SCHED_SHMEM::no_work() we "lock" the job by setting its state to our PID.
When checking the job in send_job_for_app() we need to
accept this state as well as STATE_PRESENT.  From Jack Yang.
2014-05-09 00:18:20 -07:00
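The state check described in 3c64bbb837 can be sketched as below. This is a minimal illustration, not the scheduler's code: the constant values and the helper name `job_usable_by` are assumptions; only the rule itself (accept STATE_PRESENT or our own PID) comes from the commit message.

```cpp
// Assumed numeric values, chosen only for this sketch.
const int STATE_EMPTY = 0;
const int STATE_PRESENT = 1;

// Hypothetical helper: a shared-memory job slot is usable by this
// scheduler process if it is present, or if this process already
// "locked" it by writing its own PID into the state field.
bool job_usable_by(int state, int my_pid) {
    return state == STATE_PRESENT || state == my_pid;
}
```

A slot whose state holds another process's PID is still rejected, which is what keeps two scheduler instances from sending the same job.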
David Anderson c25ce3177c file_deleter: delete gzipped versions of files also 2014-05-06 12:58:13 -07:00
David Anderson e5810f3061 client/server: change implementation of "exact fraction done".
My last commit did this using a new API call.
But this would require rebuilding apps any time you want to change it;
too much work.
So instead make it an attribute of apps,
which you can set via the admin web interface.

Corresponding changes to client.
2014-05-04 00:02:32 -07:00
David Anderson 425f67f4c6 scheduler: don't show error msg if no plan class spec file 2014-05-02 12:03:07 -07:00
David Anderson b0516e635c make_work: fix bug that prevented --max_wus from working 2014-04-30 15:35:04 -07:00
David Anderson bb4f4194d0 scheduler: cap CPU time of reported results at elapsed time * ncpus
This affects only result display,
since CPU time is no longer used for anything.
2014-04-10 23:52:13 -07:00
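The cap described in bb4f4194d0 amounts to one comparison. A minimal sketch, assuming a hypothetical helper name `cap_cpu_time` (the real scheduler code may be structured differently):

```cpp
#include <algorithm>

// Hypothetical helper: a job cannot have used more CPU time than
// the wall-clock time it ran multiplied by the CPUs it used,
// so clamp the reported value to that bound.
double cap_cpu_time(double cpu_time, double elapsed_time, double ncpus) {
    double limit = elapsed_time * ncpus;
    return std::min(cpu_time, limit);
}
```

For example, a result reporting 500 s of CPU time for 100 s elapsed on 4 CPUs would be displayed as 400 s.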
David Anderson fc7c75b200 server: parse peak memory/disk info from client, store in DB, display in web
The latest client reports the peak working set size, swap size,
and disk usage for completed jobs.
Add fields to the results table to store these.
Parse them in scheduler request messages, and write to the DB.
Display them in the result web page.

This data can be used to improve (or even automate)
the job estimates for memory and disk usage.
2014-04-02 19:35:59 -07:00
David Anderson e91eee67da trickle handler daemon: mark message as handled even if handler returns error.
This is because errors in general are non-recoverable,
and we'll end up retrying infinitely.
If an error actually is recoverable, exit().
2014-03-29 09:25:01 -07:00
David Anderson 6216673eca web: fix missing mysqli change 2014-03-22 09:04:58 -07:00
David Anderson 6f29a50812 validator: fixes and features
- add --is_gzip option to sample_bitwise_validator.
  If set, all files are treated as gzip archives.
  Check their 10-byte header to verify that it's a gzip file,
  but ignore it when comparing files.
- validator.cpp: don't error out on unparsed cmdline args,
  since we're now using them in sample_bitwise_validator
  and sample_substr_validator.
- fix build error on Debian
2014-03-20 12:38:29 -07:00
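The --is_gzip comparison in 6f29a50812 can be sketched as follows. This is an illustration under assumptions, not the validator's code: the function names are hypothetical and the files are taken as already read into strings. The 10-byte header length comes from the commit message; the magic bytes 0x1f 0x8b are the standard gzip signature. The header carries a modification time, which is one reason two otherwise identical outputs can differ there.

```cpp
#include <cstddef>
#include <string>

const std::size_t GZIP_HEADER_LEN = 10;

// Hypothetical check: long enough to hold the fixed header
// and starts with the gzip magic bytes.
bool looks_like_gzip(const std::string& data) {
    return data.size() >= GZIP_HEADER_LEN
        && static_cast<unsigned char>(data[0]) == 0x1f
        && static_cast<unsigned char>(data[1]) == 0x8b;
}

// Hypothetical comparison: both inputs must be gzip files;
// everything after the 10-byte header must match bitwise.
bool gzip_bodies_equal(const std::string& a, const std::string& b) {
    if (!looks_like_gzip(a) || !looks_like_gzip(b)) return false;
    return a.compare(GZIP_HEADER_LEN, std::string::npos,
                     b, GZIP_HEADER_LEN, std::string::npos) == 0;
}
```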
David Anderson cf0a0817c0 server: fix some compile warnings
Add a derived class DB_APP_VERSION_VAL for use by the validator,
containing the extra fields it uses,
so that we're not doing memset 0 on vectors
2014-03-19 14:55:16 -07:00
David Anderson 8aa10ee5a9 scheduler: check if cpu_time and elapsed_time are infinite, set to zero if so
Some (old? buggy?) clients report these as infinity.
This causes the result update queries to fail.
2014-03-18 20:19:04 -07:00
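The check in 8aa10ee5a9 might look like the sketch below. The struct and function names are hypothetical. Note that std::isfinite also rejects NaN, which goes slightly beyond the infinity case the commit message names.

```cpp
#include <cmath>

// Hypothetical stand-in for the time fields of a reported result.
struct ReportedTimes {
    double cpu_time;
    double elapsed_time;
};

// Replace non-finite values with zero before they reach a DB query.
void sanitize_times(ReportedTimes& r) {
    if (!std::isfinite(r.cpu_time)) r.cpu_time = 0;
    if (!std::isfinite(r.elapsed_time)) r.elapsed_time = 0;
}
```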
David Anderson 834ac11661 server: add sample validator that checks for string in stderr 2014-03-18 19:12:13 -07:00
David Anderson c2fd2b33e0 scheduler: fix bug that caused no jobs to be sent 2014-03-12 15:31:12 -07:00
David Anderson 2f91cd6b5e scheduler: add support for jobs targeted at hosts and teams
Also: add code to db_purge to delete assignment records for completed WUs
2014-03-12 00:03:17 -07:00
David Anderson 9889ee8fb6 scheduler: enforce GPU job limits separately for each GPU type
Previously, if a project specified a limit on GPU jobs in progress,
it would be enforced across GPU types.
This could lead to starvation for hosts with multiple GPU types.
E.g. the limit is 10, and a host has 10 NVIDIA jobs and no AMD jobs.

Fix this by enforcing limits separately for each GPU type.
2014-03-08 11:17:16 -08:00
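The per-type enforcement in 9889ee8fb6 can be illustrated with a small sketch. The struct, its fields, and the GPU type strings used as keys are assumptions made for this example, not the scheduler's data structures.

```cpp
#include <map>
#include <string>

// Hypothetical per-host bookkeeping for GPU jobs in progress.
struct GpuJobLimits {
    int max_per_type;                        // project-specified limit
    std::map<std::string, int> in_progress;  // count keyed by GPU type

    // The limit is checked against the count for this GPU type only,
    // not against the total across all GPU types.
    bool can_send(const std::string& gpu_type) const {
        auto it = in_progress.find(gpu_type);
        int n = (it == in_progress.end()) ? 0 : it->second;
        return n < max_per_type;
    }
};
```

With a limit of 10 and 10 NVIDIA jobs in progress, `can_send("nvidia")` is false while `can_send("amd")` is still true, which removes the starvation case from the commit message.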
David Anderson 5381def663 server: use gpu_active_frac in scheduling decisions
On some hosts, gpu_active_frac may be much less than active_frac
(i.e., GPUs may be available much less than CPUs).
Use gpu_active_frac in the following places:

- scheduler: in estimating the elapsed time of jobs,
    to decide whether they can meet deadline
- scheduler: in computing the effective speed of a (host, app version),
    when deciding what size class it belongs to
- size_census: in computing effective speed of (host, app versions)

(Previously, we were just using active_frac in all these cases)
2014-03-06 21:23:02 -08:00
David Anderson df1d8e2bde server: store and display gpu_active_frac
- gpu_active_frac is the fraction of time GPU use is allowed
  while the client is running.
  Previously the client reported it but we weren't storing it in the DB.
  We may need it in the future for batch scheduling logic.
- fix a crashing bug in scheduler
- client: minor message tweak
2014-03-06 13:23:52 -08:00
David Anderson 593181e196 scheduler: if gui_urls.xml or project_files.xml don't end with \n, add one
Otherwise the scheduler reply has two tags on one line,
which messes up old clients that don't use the new XML parser
2014-02-26 16:16:51 -08:00
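The fix in 593181e196 is a one-line normalization. A minimal sketch with a hypothetical helper name, assuming the file contents are already in a string:

```cpp
#include <string>

// Append '\n' if the text is non-empty and does not already end
// with one, so the next tag in the reply starts on its own line.
void ensure_trailing_newline(std::string& text) {
    if (!text.empty() && text.back() != '\n') {
        text.push_back('\n');
    }
}
```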
David Anderson 0d8a22e75c Server: add optional size_class parameter to count_unsent_results().
This lets you write work generators that maintain min levels of
unsent jobs for each size class.
2014-02-20 13:44:56 -08:00
David Anderson 4b5a099f81 scheduler: create host_app_version records in NCI case 2014-02-04 15:58:01 -08:00
David Anderson c7db808abd Scheduler: message tweak 2014-02-04 10:07:46 -08:00
David Anderson d861862ca1 server: fix compile warnings and file descriptor leaks
Also, we were using memset() to zero WORK_REQ,
which contains several std::vector's.
This apparently works on Linux, but not in general.
2014-01-08 22:00:13 -08:00
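The memset() problem noted in d861862ca1 has a common remedy, sketched below. `WorkReqSketch` is a hypothetical stand-in, not the real WORK_REQ, and the commit message does not say which replacement was actually used. Overwriting a std::vector's internals with zero bytes bypasses its constructor and destructor; assigning a freshly constructed object resets every member through its own assignment operator.

```cpp
#include <vector>

// Hypothetical stand-in for a request struct with vector members.
struct WorkReqSketch {
    double seconds_requested = 0;
    int njobs_sent = 0;
    std::vector<int> sent_job_ids;

    // Reset all fields without memset(): the vector is cleared
    // by its own assignment operator.
    void clear() { *this = WorkReqSketch(); }
};
```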
David Anderson cbc419ccab scheduler: fix bug that caused sticky files to always get deleted when file_delete_regexp mechanism used 2013-12-18 16:33:14 -08:00
David Anderson 2e4d561647 sample work generator: wait until transitioner has processed jobs before creating any more
Work generators create jobs (workunits);
the transitioner creates instances (results).
If a work generator tries to maintain a certain number of unsent results
(as the sample work generator does)
it must wait for a bit, after creating jobs,
to let the transitioner create instances of those jobs.
The example work generator waited 5 seconds.

Problem: on a heavily loaded project, the transitioner can fall behind -
minutes or hours behind.
So the above policy can create way too many jobs.

Solution: after creating jobs, the sample work generator
notes the current time X,
then waits until the transitioner catches up to time X
(i.e., until the min workunit.transition_time exceeds X).
This ensures that instances have been created for all the new jobs.

Other work generators that limit the number of unsent jobs
should use the same technique;
use min_transition_time(x) to get the min transition time.

Code cleanup: get_double should be a member of DB_CONN, not DB_BASE.
2013-12-14 16:36:18 -08:00
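The waiting policy from 2e4d561647 can be sketched as a polling loop. This is an illustration only: the function name is hypothetical, and the query and the sleep are passed in as callbacks because the exact signature of min_transition_time() is not given in the commit message.

```cpp
#include <functional>

// Poll until the minimum workunit.transition_time exceeds x,
// i.e. until the transitioner has caught up to time x.
// Returns the number of times the caller had to sleep.
int wait_for_transitioner(
    double x,
    const std::function<double()>& get_min_transition_time,
    const std::function<void()>& sleep_fn
) {
    int polls = 0;
    while (get_min_transition_time() <= x) {
        sleep_fn();
        ++polls;
    }
    return polls;
}
```

A work generator would record x right after creating a batch of jobs, call this, and only then count unsent results again.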
David Anderson 6d4999767f example app: print "starting" message after boinc_init, so that it appears in stderr file
Also remove old score-based sched code
2013-12-10 14:00:31 -08:00
David Anderson 7d54e6537e scheduler: add <vm_accel_required> flag to plan class XML spec 2013-12-03 15:54:56 -08:00
David Anderson 99332624f3 scheduler: parse <opencl_cpu_prop> in scheduler requests correctly
The OPENCL_CPU_PROP structure was being referred to as both
"opencl_cpu_prop" and "cpu_opencl_prop", roughly 50/50,
in variable names and XML tags.
Let's standardize on "opencl_cpu_prop",
which is what current clients are sending in scheduler requests.
2013-11-28 14:11:42 -08:00
David Anderson feb2f1971d scheduler: fix bug that prevented Intel GPU work from being sent to anonymous platform clients 2013-11-21 22:31:15 -08:00
Rom Walton bec26d2447 VBOX: Add support for vbox32_hwaccel and vbox64_hwaccel plan classes in the stock server scheduler. 2013-11-18 14:43:44 -05:00
David Anderson 863c9496b0 deadline-extension trickle handler: message tweaks 2013-11-11 13:24:09 -08:00
David Anderson 5192fe2545 scheduler: assigned jobs should respect user app preferences 2013-10-06 21:23:28 -07:00
David Anderson 5b76909f04 scheduler: parse OpenCL/CPU descriptors, and add plan class for OpenCL/CPU/Intel 2013-08-26 23:32:32 -07:00
David Anderson b2e06e0704 Server: various fixes for "make install" 2013-08-24 20:36:49 -07:00
David Anderson f13c3d58ea fix bug in trickle handler framework; from Christian 2013-08-23 13:01:53 -07:00
David Anderson 628ba8f0ef Tweaks to deadline-extension trickle handler, from Christian 2013-08-23 09:45:45 -07:00
David Anderson 95d12b76e7 server: add code for extending deadlines via trickle-ups; from Christian 2013-08-23 00:34:37 -07:00
David Anderson ef82d5d9fb server: fix compile error on systems that don't define MAXPATHLEN 2013-08-22 17:01:45 -07:00
David Anderson 1c31f6feaa Condor: fix bug when 2 input files have same contents; fix error messages 2013-08-09 16:06:36 -07:00
Eric J Korpela 48d995061f Merge branch 'master' of ssh://boinc.berkeley.edu/boinc-v2 2013-08-08 11:31:27 -07:00
Eric J Korpela 60c7814250 SCHED: Removed claimed credit sanity check because GPU machines often have host
scales that cause it to fail. That prevents host_app_version PFCs from being
updated for perfectly reasonable credit claims.  Since there is a max credit
granted, this mechanism is unnecessary anyway.
2013-08-08 11:23:30 -07:00
David Anderson b156e88208 scheduler: sample code for the SSE3 plan class must check for "pni" rather than "sse3"; clients report "pni" 2013-08-08 11:00:29 -07:00
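The "pni" check from b156e88208 could be written as below. This is a sketch under an assumption: the host's CPU features are taken to be a whitespace-separated string, and both function names are hypothetical. Matching whole tokens avoids a false hit on "ssse3" when looking for a shorter name.

```cpp
#include <sstream>
#include <string>

// Return true if `name` appears as a whole token in `features`.
bool has_cpu_feature(const std::string& features, const std::string& name) {
    std::istringstream in(features);
    std::string token;
    while (in >> token) {
        if (token == name) return true;
    }
    return false;
}

// Clients report SSE3 support as "pni", so test for that token.
bool host_has_sse3(const std::string& features) {
    return has_cpu_feature(features, "pni");
}
```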
Eric J Korpela 03e64f720b SCHED: Added "intel_gpu" to app_plan_uses_gpu() 2013-06-25 19:31:23 -07:00
Eric J Korpela 4e338e946e - SCHED: Added plan class spec option "<need_amd_libs>" (similar to
  "<need_ati_libs>").  Before this the default was to require AMD libraries unless
  need_ati_libs was set.  Now the default is to require neither.  This is
  necessary for MacOS compatibility (where there is no distinction).
- SCHED: Changed intel gpu type search to match any string in the gpu_type
  beginning with "intel".  This was done because there have been
  inconsistencies in the code where "intel" vs "intel_gpu" is used.
2013-06-25 19:17:46 -07:00
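The prefix match described in 4e338e946e can be sketched as follows; the helper name is hypothetical and the real scheduler code may differ.

```cpp
#include <string>

// True for any gpu_type string that begins with "intel",
// which covers both "intel" and "intel_gpu".
bool is_intel_gpu_type(const std::string& gpu_type) {
    return gpu_type.compare(0, 5, "intel") == 0;
}
```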