On startup the client sees, for each project,
whether it's barred from using particular resources (CPU, GPU).
It was showing these situations as notices.
I think the idea was to make sure the user hadn't changed
a setting and forgotten about it.
But this was annoying overkill.
Instead, just show startup messages (not notices)
saying which (project, resource) pairs are disallowed, and why.
Sometimes jobs finish (and create finish files) while the
client is shutting down, and the client doesn't notice they finished.
This can result in such jobs being restarted from the beginning.
Fix this by checking for finish files on startup.
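A minimal sketch of that startup check; the finish-file name is the one
conventionally written by the BOINC API, but the function itself is
illustrative, not the client's actual code.

    // Illustrative sketch: at startup, before restarting an active task,
    // see whether its slot directory already contains a finish file
    // ("boinc_finish_called"); if so, treat the job as completed rather
    // than starting it over.
    #include <cstdio>
    #include <string>

    static bool finish_file_present(const std::string& slot_dir) {
        std::string path = slot_dir + "/boinc_finish_called";
        FILE* f = fopen(path.c_str(), "r");
        if (!f) return false;
        fclose(f);
        return true;
    }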
If a CPU benchmark is requested by the user via RPC, it should always run, even if 'skip_cpu_benchmarks' is set.
This fixes #2358
Signed-off-by: Vitalii Koshura <lestat.de.lionkur@gmail.com>
A job is assigned a max runtime as:
    max_elapsed_time = rp->wup->rsc_fpops_bound / rp->avp->flops
The purpose is to eventually abort jobs that are in an infinite loop.
Various problems (e.g. bad GPU peak FLOPS calculations)
can cause this limit to be too small, e.g. one second,
in which case the job is aborted almost immediately.
In this change, if the calculated limit is < 2 minutes,
it's assumed to be in error, a limit of 30 minutes is used instead,
and an event log message is written.
Of course the underlying problem still must be addressed.
But this change will, in some cases, prevent a situation where
thousands of jobs are dispatched and immediately aborted.
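A sketch of the sanity check described above; the constant names and the
message text are illustrative, not the client's exact code.

    // If the computed elapsed-time limit is implausibly small, fall back
    // to a 30-minute limit and log the event.
    #include <cstdio>

    const double MIN_TIME_BOUND = 120;      // limits below 2 minutes are assumed bogus
    const double DEFAULT_TIME_BOUND = 1800; // fall back to 30 minutes

    double sane_max_elapsed_time(double rsc_fpops_bound, double flops) {
        double max_elapsed_time = rsc_fpops_bound / flops;
        if (max_elapsed_time < MIN_TIME_BOUND) {
            fprintf(stderr,
                "Elapsed time limit %f sec is implausibly small; using %f sec instead\n",
                max_elapsed_time, DEFAULT_TIME_BOUND
            );
            max_elapsed_time = DEFAULT_TIME_BOUND;
        }
        return max_elapsed_time;
    }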
When an app finishes, it writes a "finish file",
which tells the client that the app really finished.
If the app process is still there N seconds after the finish file appears,
the client assumes that something went wrong, and it aborts the job.
Previously N was 10.
This was too small during periods of heavy paging.
I increased it to 300.
It has been pointed out that if the app creates the finish file,
and its output files are present,
it should be treated as successful regardless of whether it exits.
This is probably true, but right now we don't have a mechanism
for killing a job and marking it as success.
The longer timeout makes this moot.
Synopsis: max concurrent limits were being enforced in the last stage of CPU sched,
but not in earlier stages, or in work fetch.
This caused starvation in some cases.
Fix this by modeling max concurrent in RR sim and make_run_list().
- CPU sched: model and enforce max concurrent limits in building run list
for CPU jobs; otherwise the list has jobs we can't actually run
- RR simulation: model and enforce max concurrent limits
- RR sim: fix bug in calculation of # idle instances
- RR sim: model unavailability of GPUs
e.g. if we can't run GPU jobs we can potentially run more CPU jobs
- work fetch: if a project is at a max concurrent limit,
don't fetch work from it.
The jobs we'd get might not be runnable.
NOTE: we currently provide max concurrent limits
at both project and app level.
The problem with app level is that apps can have versions that
use different resources.
It would be better to have limits at the resource level instead.
- In many cases (e.g. job completion) CPU sched and work fetch are both done
back to back. Each of them does RR simulation.
We only need to do this once (efficiency).
- Show max concurrent settings in startup messages
- Make max runnable jobs (1000) into a #define
- Fix removal of "can't fetch work" notices
- Make "can't fetch work" notices resource-specific;
the reasons may differ between resources
- Get rid of WF_DEBUG macro;
just print everything if log_flags.work_fetch_debug is set.
- Change project- and resource-level work-fetch reason codes
(DONT_FETCH_PREFS etc.) from #defines to enums,
and give them prefixes RSC_REASON and PROJECT_REASON
- Fix a bug where the return value of compute_project_reason() wasn't
actually being stored in project.work_fetch.
- Add work-fetch reason MAX_CONCURRENT (project is at max concurrent limit)
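To illustrate the run-list part of the fix, here is a rough sketch of
per-app max-concurrent enforcement while building the run list; the types
and fields are simplified stand-ins for the client's data structures.

    // Skip jobs whose app is already at its max concurrent limit, so the
    // run list only contains jobs we can actually run.
    #include <map>
    #include <string>
    #include <vector>

    struct Job {
        std::string app_name;
        int max_concurrent;     // per-app limit; 0 means no limit
    };

    std::vector<Job*> make_run_list(const std::vector<Job*>& runnable) {
        std::vector<Job*> run_list;
        std::map<std::string, int> n_per_app;   // jobs per app already in the list
        for (Job* job : runnable) {             // assume 'runnable' is in sched order
            int& n = n_per_app[job->app_name];
            if (job->max_concurrent && n >= job->max_concurrent) {
                continue;   // adding this job would exceed the app's limit
            }
            n++;
            run_list.push_back(job);
        }
        return run_list;
    }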
It may take a minute or two between deciding to fetch work
and actually getting some; you may have to try a few projects.
So it's better to start work fetch a bit before a resource instance becomes idle,
rather than waiting until it's idle.
We were already doing this, with a "lead time" of 3 minutes,
except for the case where all the fetchable projects
are zero resource share ("backup" projects).
We'd request work from backup projects only when an instance is idle.
This change fixes that by allowing work fetch from backup projects
if an instance is within 3 minutes of going idle.
It also turns the 3 minutes, in both places,
into a constant (WF_EST_FETCH_TIME) rather than a hardwired value.
BTW, the reason for the old policy is that we want to avoid
situations where we fetch a big job from a backup project
when jobs from a non-backup project would have been available soon.
This change may cause that to happen (rarely)
but it's worth it to avoid idleness.
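A sketch of the "about to go idle" test, using the constant name from the
commit; the function and parameter names are illustrative.

    // Allow fetching from a zero-resource-share ("backup") project when an
    // instance is within the estimated fetch time of going idle, instead of
    // only when it is already idle.
    const double WF_EST_FETCH_TIME = 180;   // estimated time to get new work, 3 min

    bool backup_fetch_ok(double busy_time_remaining) {
        // previously this effectively required busy_time_remaining == 0
        return busy_time_remaining < WF_EST_FETCH_TIME;
    }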
newer_version_startup_check() compares gstate.client_version_check_url
to nvc_config.client_version_check_url.
If different, it clears gstate.newer_version
and updates gstate.client_version_check_url.
- on startup, look for and parse a file containing the installer filename,
which encodes a project ID and login token.
- look up the project ID in the all-projects list
- do an RPC to that project, mapping the login token to weak auth
- attach to that project using weak auth
A result with a lot of failed uploads could overflow a 4K buffer.
Change report_result_error() so you just pass it the error message,
rather than va_args nonsense.
- use std::string instead of malloced array for ACCT_MGR_OP::global_prefs_xml
- use copy_element_contents() instead of dup_element_contents()
to get global prefs.
The latter uses fgets instead of fgetc,
so it requires that the close tag be on a line by itself.
TODO: don't use fgets anywhere in XML parsing.
- fix a bug in copy_element_contents() where it consumes an extra character
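For illustration, a minimal fgetc-based element copy that stops as soon as
the close tag is seen, so the tag doesn't have to sit on its own line; the
client's copy_element_contents() differs in detail.

    // Accumulate characters until the close tag appears anywhere in the
    // stream, not just on its own line.
    #include <cstdio>
    #include <cstring>
    #include <string>

    bool copy_contents_until(FILE* in, const char* end_tag, std::string& out) {
        out.clear();
        size_t tag_len = strlen(end_tag);
        int c;
        while ((c = fgetc(in)) != EOF) {
            out += (char)c;
            if (out.size() >= tag_len
                && !out.compare(out.size() - tag_len, tag_len, end_tag)
            ) {
                out.erase(out.size() - tag_len);    // drop the close tag itself
                return true;
            }
        }
        return false;   // hit EOF before seeing the close tag
    }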
There was a 20-30 second delay between exclusive app exit
and resuming tasks. This was excessive.
Reduce it to 5-15 sec (the uncertainty is because we
check exclusive apps every 10 sec).
This is the interval between when the client sends a "quit" message
to when it kills the app via TerminateProcess or SIGKILL.
Apparently VM apps that do their own state-saving
(i.e. that don't use snapshot-based checkpointing)
can take more than 15 seconds to do so.
Hopefully 60 is enough.
Rom suggested making this interval shorter in the case where
the OS is shutting down.
I don't think this is necessary, since the OS kills everything anyway
after some period (5 sec in the case of Win 10).
There were two problems:
1) we were sorting before parsing the client state file
(which is where we get project names from)
2) the Win implementation of strcasecmp() wasn't right;
it indicated whether the strings differed, but not their ordering.
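For reference, a correct order-preserving case-insensitive compare looks
something like the following sketch (not the exact code that went into the
client):

    // Like POSIX strcasecmp(): returns <0, 0, or >0, not just
    // "equal or different".
    #include <cctype>

    int strcasecmp_sketch(const char* s1, const char* s2) {
        while (*s1 && *s2) {
            int c1 = tolower((unsigned char)*s1++);
            int c2 = tolower((unsigned char)*s2++);
            if (c1 != c2) return c1 - c2;
        }
        return tolower((unsigned char)*s1) - tolower((unsigned char)*s2);
    }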
A while back I changed the job sched and work fetch policies to use
REC-based project priority.
The work fetch logic sorts the project list (in CLIENT_STATE::projects)
by descending priority.
This causes two problems:
- If you have a lot of projects, it's hard to find a particular one
in the event log, e.g. in work_fetch_debug output.
- In the manager's Statistics tab, the selected project can change
unexpectedly since we identify it by array index,
and the array order may change.
Solution: sort CLIENT_STATE::projects alphabetically (case insensitive).
In WORK_FETCH, copy this array to a separate array,
that is then sorted by decreasing priority.
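The two orderings might look roughly like this; the types, fields, and
comparators are simplified stand-ins for the client's actual structures.

    // CLIENT_STATE::projects stays alphabetical; WORK_FETCH sorts its own
    // copy by decreasing priority.
    #include <algorithm>
    #include <string>
    #include <strings.h>    // strcasecmp() (POSIX)
    #include <vector>

    struct Project {
        std::string name;
        double sched_priority;
    };

    // alphabetical, case-insensitive
    void sort_projects_by_name(std::vector<Project*>& projects) {
        std::sort(projects.begin(), projects.end(),
            [](const Project* a, const Project* b) {
                return strcasecmp(a->name.c_str(), b->name.c_str()) < 0;
            }
        );
    }

    // separate copy, sorted by decreasing priority
    std::vector<Project*> priority_order(const std::vector<Project*>& projects) {
        std::vector<Project*> copy = projects;
        std::sort(copy.begin(), copy.end(),
            [](const Project* a, const Project* b) {
                return a->sched_priority > b->sched_priority;
            }
        );
        return copy;
    }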
Currently the server doesn't know about different client "brands",
e.g. HTC Power to Give, Charity Engine, GridRepublic, etc.,
so there's no way to collect statistics about them.
Changes:
- client: at startup, read a "client brand" string from client_brand.txt
(i.e. branded clients will have to include this file in their installer),
and report this string in scheduler requests.
- scheduler: parse this request element,
and store it in host.serialnum as [BOINC|7.4.2|brand]
On client startup, decide whether we need to do CPU benchmarks
(cmdline option was set, or we haven't done them for 30 days).
If so, do them when possible.
If a project has active uploads, defer work fetch from it for 5 minutes
even if there are idle devices (that's the change).
This addresses a situation (reported by Rytis) where
- a project P has a jobs-in-progress limit less than NCPUS
- P's jobs finish and are uploading
- the client asks P for work and doesn't get any because of the limit
- the client does exponential backoff from P
Over the long term, P can get much less than its fair share of work.
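A sketch of the deferral rule; the constant and parameter names are
illustrative, not necessarily what the client uses.

    // Skip work fetch from a project that started an upload within the
    // last 5 minutes, even if devices are idle.
    const double UPLOAD_DEFER_PERIOD = 300;     // 5 minutes

    bool defer_fetch_for_uploads(
        bool has_active_uploads, double now, double last_upload_start
    ) {
        return has_active_uploads
            && (now - last_upload_start < UPLOAD_DEFER_PERIOD);
    }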
- If run with --gui_rpc_unix_domain, the client will listen on
a Unix-domain socket (named "boinc_socket") rather than on a TCP port.
- Add RPC_CLIENT::init_unix_domain() function to C++ GUI RPC interface
(Note: we'll need to add a corresponding function to the Java interface)
- boinccmd: add --unix_domain option
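A minimal usage sketch of the new C++ entry point (error handling trimmed;
the exact signature may differ).

    // Connect to a local client over the Unix-domain socket instead of TCP.
    #include "gui_rpc_client.h"

    int connect_to_local_client() {
        RPC_CLIENT rpc;
        int retval = rpc.init_unix_domain();    // connects to "boinc_socket"
        if (retval) return retval;
        // ... issue GUI RPCs as usual ...
        return 0;
    }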
I forgot that the wrapper has a 1-second poll for suspend and resume,
so sub-second throttling won't work properly for wrapper apps.
Revert to a variant of the old scheme,
in which the min of the suspended and resumed periods is 1 sec.
Also, fix task start/suspend/resume log messages.
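To make the scheme concrete, here is a sketch of computing run/suspend
periods whose shorter member is pinned at 1 second; it is illustrative,
not the wrapper's exact code.

    // Pick run/suspend periods whose ratio matches the CPU usage limit,
    // with the shorter period fixed at 1 second so a 1-second polling
    // wrapper still sees the transitions.
    void throttle_periods(double cpu_usage_limit_pct, double& run_sec, double& suspend_sec) {
        double frac = cpu_usage_limit_pct / 100.0;  // fraction of time to run
        if (frac >= 1) { run_sec = 1; suspend_sec = 0; return; }
        if (frac <= 0) { run_sec = 0; suspend_sec = 1; return; }
        if (frac < 0.5) {
            run_sec = 1;
            suspend_sec = (1 - frac) / frac;        // e.g. 25% -> run 1 s, suspend 3 s
        } else {
            suspend_sec = 1;
            run_sec = frac / (1 - frac);            // e.g. 75% -> run 3 s, suspend 1 s
        }
    }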
- Add a GUI RPC ("set_language") that lets the Manager communicate
the user's selected language code to the client at startup.
- The client stores the language code in the client state file
- The client appends a "lang=X" GET argument to the URLs from
which notices are fetched.
- The next steps (not done) are 1) to change the get_notices.php
script to parse the argument and do translation, and
2) to extend our Pootle system to allow volunteer translation
of notices by all projects.
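A sketch of appending the language code to a notice-fetch URL; the helper
name is made up for illustration.

    // Append the user's language code as a GET argument,
    // e.g. ".../notices.php" -> ".../notices.php?lang=de".
    #include <string>

    std::string add_lang_arg(const std::string& url, const std::string& lang_code) {
        if (lang_code.empty()) return url;
        char sep = (url.find('?') == std::string::npos) ? '?' : '&';
        return url + sep + "lang=" + lang_code;
    }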