boinc

Commit Graph

Author	SHA1	Message	Date
David Anderson	9049737d1f	validator: retry if transient failure. check_set() wasn't returning "retry" properly in the case where one of the calls to init_result() returns ERR_OPEN_DIR (treated as a transient failure, since it can be caused by a failed NFS mount)	2013-05-20 13:01:10 -07:00
David Anderson	c9c9f2bae0	- scheduler: code shuffle; new file sched_check.cpp contains functions that decide whether a job can be sent to a host	2013-04-09 12:19:00 -07:00
David Anderson	3017ed943f	- scheduler: debug the above	2013-02-26 16:44:26 +01:00
BOINC Admin	c8b8c6155f	need to add files?	2013-02-26 16:37:26 +01:00
Eric J. Korpela	33962b77e1	- sched: 2 bug fixes in credit.cpp - It was possible if all results for a workunit were PFC_MODE_INVALID that NaN pfc would be used causing database update errors. Solved by using wu_estimated_pfc() as pfc in that case. - Sanity check was comparing raw_pfc directly to rsc_fpops_bound. That was causing problems for GPUs with high performance estimates. Fixed by including the app_version scale factor in the check. I thought I had already committed this... - Removed a few lines of commented out experimental code accidentally committed earlier. - Committed to git repository on 8/24 svn path=/trunk/boinc/; revision=26144	2012-10-02 15:20:13 +00:00
David Anderson	25c2f6b49c	- client: treat all 4xx HTTP errors as permanent - code cleanup - API: increase a buffer in timer_handler() from 256 to 512. svn path=/trunk/boinc/; revision=26012	2012-08-13 18:23:20 +00:00
Eric J. Korpela	9b9ec18d69	- Fixed typo svn path=/trunk/boinc/; revision=26000	2012-08-09 16:23:38 +00:00
Eric J. Korpela	5c24fc50eb	- Credit is more stable when pegged_avg() is used. - When a normal and an approx result are compared the normal result now gets double weight in a pegged_avg() with any approx results. "Normal mode" GPU results are frequently resulting in bad credit values for yet undetermined reasons. Since GPUs are so fast, there can be a lot of bad values in a short time. Including the prior average and another result even seems to prevent this in many cases. - Replaced many of the if { msg_log.printf } with msg_log.cond_printf() - Accidentally changed some of the formatting when trying a new editor that apparently autoformats. Sorry for the extra diff lines. - There's a problem with pfc calculation for hosts whose credit calculation fails the sanity check. This has been a problem for a long time. Because the result fails the sanity check, the host_app_version pfc is never updated. Because hav.pfc is never updated, the credit calculation continues to be wrong. To quote Quinn, it's like one of those vicious things. I hope to fix that soon. svn path=/trunk/boinc/; revision=25999	2012-08-09 03:02:51 +00:00
Eric J. Korpela	29d5781a34	- Modified credit granting for "approx credit" results to weight results by proximity to the average estimate. This reduces the number of results that are granted extremely low credit when a new app_version is released and (I hope) improves work/duration estimates by speeding up the convergence of app version stats. I may try using this in lieu of low_average for normal results, but haven't yet. svn path=/trunk/boinc/; revision=25953	2012-08-02 15:45:13 +00:00
David Anderson	da7e40f142	- use <cmath> instead of <math.h>. Seems to be needed on Debian. svn path=/trunk/boinc/; revision=25938	2012-08-01 21:21:38 +00:00
David Anderson	fc2af21221	- client: add missing end tag for <pci_info>. Doh! - validator: add some sanity-checking for credit, to prevent granting 1e38 credit. max_granted_credit now defaults to the equivalent of 1 TeraFLOP-year. Instances that exceed this are not counted in the credit calculation, and a critical-mode log message is written - wrapper: remove wall_cpu_time; not used anymore svn path=/trunk/boinc/; revision=25825	2012-06-29 22:24:07 +00:00
David Anderson	ec0ca2615d	- scheduler: fix bug that could cause zero credit for the first few jobs of a new application (in wu_estimated_pfc(), only multiply by app.min_avg_pfc if it's nonzero). svn path=/trunk/boinc/; revision=25484	2012-03-23 21:47:06 +00:00
David Anderson	fe90776614	- scheduler: if an app has only GPU versions, scale their PFC by 0.1 in credit calculations. This reflects the fact that GPU apps are typically less efficient (relative to device peak FLOPS) than are CPU apps. The actual values from SETI@home and Milkyway are 0.05 and 0.08. svn path=/trunk/boinc/; revision=24842	2011-12-21 03:21:52 +00:00
David Anderson	dd93780787	- API and client: add "ncpus" field to APP_INIT_DATA. Tells multicore apps how many cores to use. The --nthreads command line arg to the app is now deprecated though we'll keep it around for the time being. svn path=/trunk/boinc/; revision=24708	2011-12-01 18:44:19 +00:00
David Anderson	d53b89fe6f	- feeder: fix logic error in the way app_version.pfc_scale is updated (from Kevin Reed) svn path=/trunk/boinc/; revision=24514	2011-11-03 07:08:52 +00:00
David Anderson	61dc940872	- validator: add runtime_outlier message svn path=/trunk/boinc/; revision=24229	2011-09-16 21:30:21 +00:00
David Anderson	e49f945908	- Validator: allow project-specific code to mark a result as a "runtime outlier", i.e. its runtime does not correspond to the job's rsc_fpops_est. Runtime outliers are not counted in the statistics for elapsed time, turnaround time, and peak FLOPs count. This is intended for applications like SETI@home, some of whose jobs finish more or less instantly (this happens if the data contains a lot of interference). If a host happens to get a bunch of these short jobs, its statistics will get skewed: in essence, the server will think that the host is extremely fast, and will send it too many jobs. svn path=/trunk/boinc/; revision=24225	2011-09-16 16:43:15 +00:00
David Anderson	826cd355e5	- validator: old scheduler bugs may cause result.flops_estimate to be negative in some cases. Detect this, and use 1e10 instead svn path=/trunk/boinc/; revision=24146	2011-09-08 19:36:14 +00:00
David Anderson	8fda6c0497	- Vbox wrapper: add --trickle x option; sends a trickle-up message reporting incremental runtime every x seconds of runtime. - client: more XML parsing cleanup - credit trickle handler: do sanity checks on CPU speed svn path=/trunk/boinc/; revision=24017	2011-08-21 11:18:08 +00:00
David Anderson	578d5f924f	- scheduler: fix nasty bug where SCHED_DB_RESULT::parse() was doing memset(this, 0, sizeof(RESULT)), i.e. it wasn't zeroing out the whole structure. The elapsed_time field (which isn't reported by old clients), is near the end of the struct, and it was getting garbage, e.g. 1e-304, in some cases, which led to zero credit (and maybe other problems) - validator: treat 1e-304 like zero in case of other problems like the above. - remote job submission: tweaks svn path=/trunk/boinc/; revision=23947	2011-08-08 04:37:53 +00:00
David Anderson	d1be15b9fb	- validator: remove spurious messages svn path=/trunk/boinc/; revision=23855	2011-07-19 18:48:03 +00:00
David Anderson	8ca24cbbab	- client, work fetch policy: adjust project REC by the amount of work queued, to increase variety NOTE: at some point I think I had a reason to not do this, but I can't remember what it is. - client, job scheduling policy: fix how project REC is adjusted svn path=/trunk/boinc/; revision=23838	2011-07-13 19:46:03 +00:00
David Anderson	f44c9910e7	- validator: if job FLOPs estimates are accurate, PFC values should be around 1. If they differ from 1 by a factor of > 1e4, ignore them, and put an error message into the validator log - validator: if get_pfc() fails because an app version is missing from the DB (i.e. the project deleted it) keep going so we don't reprocess the WU forever svn path=/trunk/boinc/; revision=23837	2011-07-12 20:44:28 +00:00
David Anderson	62074bd4fa	- client emulator web interface: make cc_config.xml an attribute of the simulation, not the scenario. If you want to run a simulation w/ different log flags, you shouldn't have to create a new scenario. - client emulator: add --config_prefix cmdline arg - validator: prevent infinite loop when app_version.pfc_avg is wonky (like 1e-300). Next step: figure out how it got that way. svn path=/trunk/boinc/; revision=23828	2011-07-10 07:05:07 +00:00
David Anderson	732866b8aa	- back end: add two example trickle handlers: trickle_credit: grants credit based on CPU time reported in msg trickle_echo: echoes trickle-up as a trickle-down svn path=/trunk/boinc/; revision=23118	2011-02-27 00:10:14 +00:00
David Anderson	58dadd91a8	- client, acct manager protocol: allow <no_cpu>, <no_cuda> and <no_ati> bools within <account> in reply message. They suppress work fetch for that resource type from that project. - scheduler: check max_granted_credit after wu.rsc_fpops_bound, so that max_granted_credit will be enforced even if wu.rsc_fpops_bound is absurdly high Fixes #1034. From Diggory Hardy. svn path=/trunk/boinc/; revision=22793	2010-12-02 04:53:12 +00:00
David Anderson	b169e5ab0f	- server programs: print error message instead of numeric retval in log messages svn path=/trunk/boinc/; revision=22647	2010-11-08 17:51:57 +00:00
David Anderson	8aa29bec33	- validator: fix another bug with --credit_from_wu - make_project, update scripts: don't quit if user_profiles already exists svn path=/trunk/boinc/; revision=22630	2010-11-05 17:15:27 +00:00
David Anderson	4edfe2ec28	- client: small initial checkin for new scheduling system. Keep track of per-project recent estimated credit svn path=/trunk/boinc/; revision=22608	2010-10-29 23:41:34 +00:00
David Anderson	5ef4dead7d	- validator: need parens in boolean expression svn path=/trunk/boinc/; revision=21814	2010-06-25 19:23:16 +00:00
David Anderson	7c51512cbf	- transitioner: the format string for a DB query had %.15d instead of %.15e. That produced a messed-up query that assigned garbage values to: host_app_version.turnaround_var host_app_version.turnaround_q host_app_version.max_jobs_per_day host_app_version.consecutive_valid To repair these: - set turnaround_var and turnaround_q to zero - if max_jobs_per_day is outside of (0..config.daily_result_quota) set it to config.daily_result_quota - if consecutive_valid is outside (0..1000), set it to zero I added a script, html/ops/repair_21812.php, that does this; if you ran server code between [21181] and [21812], run this script. - scheduler/transitioner: add <debug_quota> log flag - changed the build system to always use -Wall (if we'd done this before, this bug wouldn't have happened) - fixed a bunch of other compile warnings svn path=/trunk/boinc/; revision=21812	2010-06-25 18:54:37 +00:00
David Anderson	25f9b05bdb	- validator: there were a couple of places where we needed to scale wu.rsc_fpops_est by app.min_avg_pfc. - validator: assume that app.min_avg_pfc is nonzero; it will be, since the DB default is now 1. svn path=/trunk/boinc/; revision=21804	2010-06-24 22:17:33 +00:00
David Anderson	1250f41313	- validator: fix a divide by zero (happens w/ old clients that don't report elapsed time) svn path=/trunk/boinc/; revision=21788	2010-06-22 18:09:55 +00:00
David Anderson	9262cc8c1a	- validator: fix possible divide-by-zero - validator: when claimed credit is too high, assign standard credit rather than exiting. svn path=/trunk/boinc/; revision=21783	2010-06-21 17:56:12 +00:00
David Anderson	f2e8d4601b	- validator: because of the above problem, some results have flops_estimate == 0, which causes divide by zero. Check for this and use 1e10. svn path=/trunk/boinc/; revision=21776	2010-06-18 22:27:09 +00:00
David Anderson	4147249de2	- server: delete old credit stuff - user web: show host link in user result list. Fixes #999 svn path=/trunk/boinc/; revision=21735	2010-06-12 22:08:15 +00:00
David Anderson	ef0019d8c3	- validator: bug fixes: bad formula for low_average(); failure to reread app_versions because of 1e6/1e-6 typo svn path=/trunk/boinc/; revision=21302	2010-04-26 23:12:40 +00:00
David Anderson	5035007b90	- back end: new way of deciding: - whether host is "reliable" for an app version - whether host is eligible for single replication for an app version - whether to use host scaling In each case, the answer is yes if the number of consecutive valid results is above a threshold. This replaces existing "error rate" and "scale probation" mechanisms. TODO: the # of consecutive valid results should also determine a limit on jobs in progress for an app version. Namely, if N is the threshold for host scaling, the limit should be ndevices*(max(1, consecutive_valid - N)) The client currently doesn't supply enough app version info to do this. It could be approximated; that would give some protection against cherry-picking. - credit: more conservative formulas for combining claimed credit among replicas. If there are normal replicas, we use a "low average" that weights each sample by the sum of the other samples. Otherwise we use the min (not the average) of the approximate samples. NOTE: a DB update is required svn path=/trunk/boinc/; revision=21230	2010-04-21 19:33:20 +00:00
David Anderson	6893691ae2	- validator: message tweak svn path=/trunk/boinc/; revision=21212	2010-04-19 22:57:49 +00:00
David Anderson	61195cb59d	- validator: fix bug where host.total_credit not incremented svn path=/trunk/boinc/; revision=21211	2010-04-19 21:46:45 +00:00
David Anderson	b71d3e6cf4	- back end: typo and tweaks svn path=/trunk/boinc/; revision=21196	2010-04-16 21:16:18 +00:00
David Anderson	021edb02c2	- back end programs: improve log msgs svn path=/trunk/boinc/; revision=21193	2010-04-16 18:07:08 +00:00
David Anderson	02717af2f3	- bug fixes svn path=/trunk/boinc/; revision=21187	2010-04-15 21:58:44 +00:00
David Anderson	b2451544e1	- server: change the following from per-host to per-(host, app version): - daily quota mechanism - reliable mechanism (accelerated retries) - "trusted" mechanism (adaptive replication) - scheduler: enforce host scale probation only for apps with host_scale_check set. - validator: do scale probation on invalid results (need this in addition to error and timeout cases) - feeder: update app version scales every 10 min, not 10 sec - back-end apps: support --foo as well as -foo for options Notes: - If you have, say, cuda, cuda23 and cuda_fermi plan classes, a host will have separate quotas for each one. That means it could error out on 100 jobs for cuda_fermi, and when its quota goes to zero, error out on 100 jobs for cuda23, etc. This is intentional; there may be cases where one version works but not the others. - host.error_rate and host.max_results_day are deprecated TODO: - the values in the app table for limits on jobs in progress etc. should override rather than config.xml. Implementation notes: scheduler: process_request(): read all host_app_versions for host at start; Compute "reliable" and "trusted" for each one. write modified records at end get_app_version(): add "reliable_only" arg; if set, use only reliable versions skip over-quota versions Multi-pass scheduling: if have at least one reliable version, do a pass for jobs that need reliable, and use only reliable versions. Then clear best_app_versions cache. Score-based scheduling: for need-reliable jobs, it will pick the fastest version, then give a score bonus if that version happens to be reliable. When get back a successful result from client: increase daily quota When get back an error result from client: impose scale probation decrease daily quota if not aborted Validator: when handling a WU, create a vector of HOST_APP_VERSION parallel to vector of RESULT. Pass it to assign_credit_set(). Make copies of originals so we can update only modified ones update HOST_APP_VERSION error rates Transitioner: decrease quota on timeout svn path=/trunk/boinc/; revision=21181	2010-04-15 03:13:56 +00:00
David Anderson	e05a479f42	- scheduler and validator: distinguish between 1) peak FLOPS (based on benchmarks or GPU attributes). This does not change over time. It's not adjusted on the basis of statistics. It's not affected by wu.rsc_fpops_est. It can be compared across projects. versus 2) projected FLOPS: the scheduler's best guess as to what will satisfy X * elapsed_time = wu.rsc_fpops_est; this is used to make server-side runtime estimates, and it's sent to the client and used for its runtime estimates. It may be based on the (host, app version) elapsed time average. My checkin [21153] mistakenly confounded these two. Notes: 1) app_plan() now must return both peak and projected FLOPS. 2) result.flops_estimate stores peak FLOPS 3) the <flops> field in app_info.xml files should be projected FLOPS. But its accuracy is not important; it's not used once the server has statistics for the (host, app version) svn path=/trunk/boinc/; revision=21164	2010-04-10 05:49:51 +00:00
David Anderson	1d765245ed	- scheduler: sweeping changes to the way job runtimes are estimated: see http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation svn path=/trunk/boinc/; revision=21153	2010-04-08 23:14:47 +00:00
David Anderson	212fb765e9	- validator: detect jobs that used GPU app but fell back to CPU (SETI@home does this if GPU initialization fails). Treat these like CPU apps for credit purposes. svn path=/trunk/boinc/; revision=21130	2010-04-06 23:48:35 +00:00
David Anderson	e276aa5ed6	- server: make the -d 4 feature work with FCGI svn path=/trunk/boinc/; revision=21109	2010-04-05 23:12:02 +00:00
David Anderson	2536797068	- validator: remove update_credit_per_cpu_sec(). Irrelevant. TODO: remove related code - validator: update wu.canonical_credit correctly. However, this field should be deprecated. - validator: check for error return from assign_credit_set(). svn path=/trunk/boinc/; revision=21096	2010-04-05 20:03:54 +00:00
David Anderson	a2a661993b	- validator: -d 4 means -d 3 plus print all DB queries (todo: do this for all daemons) - validator: change cmdline args from -foo to --foo (todo: do this for all daemons) - validator: pass max_granted_credit to assign_credit_set() svn path=/trunk/boinc/; revision=21093	2010-04-05 18:59:16 +00:00
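
Commit 5035007b90 describes a "low average" that weights each claimed-credit sample by the sum of the other samples. The sketch below is one reading of that sentence, not the code from credit.cpp: the function name, signature, and the handling of single-sample and all-zero input are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: weighted mean in which sample i gets the weight
// (sum of all samples - sample i). Larger samples get smaller weights,
// so the result is pulled toward the low end of the set.
// Assumes the samples are non-negative credit claims.
double low_average(const std::vector<double>& samples) {
    assert(!samples.empty());
    if (samples.size() == 1) return samples[0];  // nothing to weight against
    double sum = 0;
    for (double x : samples) sum += x;
    if (sum <= 0) return 0;  // all-zero input: avoid dividing by zero
    double weighted = 0;
    for (double x : samples) weighted += x * (sum - x);
    // The weights (sum - x_i) add up to (n - 1) * sum.
    return weighted / ((samples.size() - 1) * sum);
}
```

For two samples this reduces to the harmonic mean (1 and 3 give 1.5 rather than 2), which is why a single inflated claim moves the result much less than it would move a plain average.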


58 Commits