This can lead to starving the CPUs if there are both GPU and MT jobs.
The basic problem is that a host with GPUs will never have all its CPUs
available for MT jobs.
It should probably advertise fewer CPUs, or something.
We were waiting until there was no task for the project
before asking for another task.
We should have been waiting until there was no in-progress task.
A while back we added a mechanism intended to defer work-request RPCs
while file uploads are happening,
with the goal of reporting completed tasks sooner
and reducing the number of RPCs.
There were 2 bugs in this mechanism.
First, the decision of whether an upload is active was flawed;
if several uploads were active and 1 finished,
it would act like all had finished.
Second, when WORK_FETCH::choose_project() picks a project,
it sets p->sched_rpc_pending to RPC_REASON_NEED_WORK.
If we then decide not to request work because an upload
is active, we need to clear this field.
Otherwise scheduler_rpc_poll() will do an RPC to it,
piggybacking a work request and bypassing the upload check.
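A sketch of the second fix (PROJECT, sched_rpc_pending, and RPC_REASON_NEED_WORK are the client structures described above; uploads_active() is a stand-in for the actual upload check):

    PROJECT* p = work_fetch.choose_project();
    if (p && uploads_active(p)) {
        // back out of the choice; otherwise scheduler_rpc_poll()
        // would piggyback a work request on the pending RPC
        if (p->sched_rpc_pending == RPC_REASON_NEED_WORK) {
            p->sched_rpc_pending = 0;
        }
        p = NULL;
    }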
Round-robin simulation, among other things, creates a bitmap
"sim_excluded_instances" of instances that are idle because of CPU exclusions.
There was a problem in how this was computed;
in the situation where there are fewer jobs than GPU instances
it could fail to set any bits, so no work fetch would happen.
My solution is a bit of a kludge, but should work in most cases.
The long-term solution is to treat GPU instances separately,
eliminating the need for GPU exclusions.
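Roughly what the computation is trying to do (simplified sketch; sim_used_instances is the used-instances bitmap from the simulation, and instance_excluded_for_some_project() is a hypothetical helper):

    uint64_t sim_excluded_instances = 0;    // one bit per GPU instance
    for (int i = 0; i < n_instances; i++) {
        bool used = (sim_used_instances >> i) & 1;
        if (!used && instance_excluded_for_some_project(i)) {
            sim_excluded_instances |= (uint64_t)1 << i;
        }
    }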
It's reported that the client can repeatedly make work request RPCs
that don't request work for any resource.
I'm not sure why this happens, but prevent it.
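One way to prevent it, sketched with the client's per-resource request fields (the helper name is made up):

    // skip the scheduler RPC entirely if nothing is being asked for
    bool work_request_is_empty() {
        for (int i = 0; i < coprocs.n_rsc; i++) {
            if (rsc_work_fetch[i].req_secs > 0) return false;
            if (rsc_work_fetch[i].req_instances > 0) return false;
        }
        return true;
    }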
If the user typed an extremely long URL into the
Attach to Account Manager wizard, a buffer overrun could result.
There were several places in the code that assumed user-entered
URLs are small (e.g. 256 chars):
- canonicalize_master_url()
- several GUI RPC interfaces, when generating XML request message
- URL-escaping (not relevant here, but fix anyway)
Change all these to stay within buffers regardless of URL size.
Note: do this by truncation.
This will cause error messages like "can't connect to project"
rather than saying the URL is too long. That's OK.
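The truncation itself is the usual bounded copy; a minimal sketch (strlcpy() is the bounded-copy helper in BOINC's str_util; snprintf() would serve equally well):

    char buf[256];
    // copies at most sizeof(buf)-1 chars and always NUL-terminates,
    // silently truncating an oversized user-entered URL
    strlcpy(buf, user_entered_url, sizeof(buf));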
We want to track the product name (e.g. "HTC One X") of Android devices.
On Android, the API to get this is Java,
so we need to do it in the GUI rather than the client.
- Add product_name field to HOST_INFO
- Add a GUI RPC for passing this info from the GUI to the client.
- Store it in client_state.xml, so that the client knows it initially.
The product name is included in scheduler RPC requests, as part of <host_info>.
TODO: add server-side support for parsing it and storing in DB.
Also: move DEVICE_STATUS out of HOST_INFO; it didn't belong there.
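Sketch of the two sides (the RPC name and plumbing here are illustrative, not necessarily the exact ones added):

    // GUI side: after querying android.os.Build.MODEL in Java,
    // pass it down via the GUI RPC library, roughly:
    //   rpc.set_host_info(hi);   // hi.product_name = "HTC One X"
    //
    // client side: parse the field from the RPC / client_state.xml
    if (xp.parse_str("product_name", host_info.product_name,
        sizeof(host_info.product_name))
    ) continue;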
If a project sends us <no_rsc_apps> flags for all processor types,
then by default the client will never do a scheduler RPC to that project again.
This could happen because of a transient condition in the project,
e.g. it deprecates all its app versions for a while.
To avoid this situation, the client now checks whether the no_rsc_apps flags
are set for all processor types.
If they are, it clears them all.
This will cause work fetch to use backoff,
and the client will occasionally contact the project.
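The check is a simple pass over the flags; a sketch (field names follow the PROJECT::no_rsc_apps flags described above):

    bool all_set = true;
    for (int i = 0; i < coprocs.n_rsc; i++) {
        if (!p->no_rsc_apps[i]) {
            all_set = false;
            break;
        }
    }
    if (all_set) {
        // probably a transient project condition: clear them all,
        // so backoff (rather than "never") governs future RPCs
        for (int i = 0; i < coprocs.n_rsc; i++) {
            p->no_rsc_apps[i] = false;
        }
    }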
Previously the client had (C++) code to
- check whether on AC or USB power
- get battery status and temperature
- check whether on wifi
These functions looked in various places under /sys.
Problem: the paths are system-dependent,
so whatever we do won't work on all devices.
The Android APIs for getting this info are in Java,
so we can't call them from the client.
Solution: have the GUI periodically get this info
and report it to the client via a GUI RPC.
The GUI must make this RPC periodically:
if the client doesn't get one within some period of time
(currently 30 sec) it suspends computing and network.
Also: if suspending jobs because of battery charge level
or temperature, leave them in memory.
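A sketch of the client-side watchdog (the constant and names are illustrative):

    #define DEVICE_STATUS_TIMEOUT 30    // seconds
    // device_status_time is updated whenever the GUI's
    // device-status RPC arrives
    if (now - device_status_time > DEVICE_STATUS_TIMEOUT) {
        // no recent report: we don't know the battery/power state,
        // so suspend computing and network activity
        suspend_computing_and_network();
    }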
In enforce_run_list(), don't count the RAM usage of NCI tasks.
NCI tasks run sporadically, so it doesn't make sense to count it;
doing so can starve regular jobs in some cases.
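Sketch of the accounting change inside enforce_run_list() (simplified; the working-set field follows the client's PROCINFO naming):

    for (ACTIVE_TASK* atp : active_tasks) {
        // NCI tasks run sporadically; counting their RAM here
        // could starve regular jobs
        if (atp->result->project->non_cpu_intensive) continue;
        ram_left -= atp->procinfo.working_set_size_smoothed;
    }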
Add OPENCL_DEVICE_PROP cpu_opencl_prop to HOST_INFO;
this stores info about the host's ability to run CPU OpenCL apps.
Detect this, and report it in scheduler requests.
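Detection is a standard OpenCL probe; a sketch:

    #include <CL/cl.h>

    // probe for a CPU OpenCL device on the first platform
    cl_platform_id plat;
    cl_uint np;
    if (clGetPlatformIDs(1, &plat, &np) == CL_SUCCESS && np) {
        cl_device_id dev;
        cl_uint nd;
        if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_CPU, 1, &dev, &nd)
            == CL_SUCCESS && nd
        ) {
            // fill cpu_opencl_prop via clGetDeviceInfo(dev, ...)
        }
    }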
The basic problem: the way we assign GPU instances when creating
the "run list" is slightly different from the way we assign them
when we actually run the jobs;
the latter assigns a running job to the instance it's using,
but the former doesn't.
Solution (kludge): when building the run list,
don't reserve instances for currently running jobs.
This will result in more jobs in the run list, and avoid starvation.
For efficiency, do this only if there are exclusions for this type.
Comment: this is yet another complexity that would be eliminated
if GPU instances were modeled separately.
I wish I had time to do that.
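A standalone sketch of the kludge with simplified types (not the client's actual structures):

    #include <vector>

    struct Job { bool running; };

    // # of GPU instances to treat as reserved while building the run list
    int instances_reserved(const std::vector<Job>& jobs, bool type_has_exclusions) {
        int n = 0;
        for (const Job& j : jobs) {
            // kludge: with exclusions present, don't reserve instances
            // for jobs that are already running; job scheduling will
            // bind them to the instances they're actually using
            if (type_has_exclusions && j.running) continue;
            n++;
        }
        return n;
    }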
- client emulator: change default latency bound from 1 day to 10 days
This gives you a way to simulate the effects of app_config.xml
- client: piggyback requests for resources even if we're backed off from them
- client: change resource backoff logic
Old: if we requested work and didn't get any,
back off from resources for which we requested work
New: for each resource type T:
if we requested work for T and didn't get any, back off from T
Also, don't back off if we're already backed off
(i.e. if this is a piggyback request)
Also, only back off if the RPC was due to an automatic
and potentially rapid source
(namely: work fetch, result report, trickle up)
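A standalone sketch of the new rule (simplified types, not the client's structures):

    struct RscRequest {
        bool requested;    // did the RPC ask for work for this resource?
        bool got_work;     // did the reply include any for it?
        bool backed_off;   // already backed off, i.e. a piggyback request
    };

    // true if resource T should start/extend its backoff after an RPC
    bool should_back_off(const RscRequest& r, bool rpc_was_automatic) {
        // rpc_was_automatic: work fetch, result report, or trickle up
        return rpc_was_automatic && r.requested && !r.got_work && !r.backed_off;
    }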
- client: fix small work fetch bug, by Jacob Klein.
The new policy is roughly as follows:
- find the highest-priority project P that is allowed
to fetch work for a resource below buf_min
- Ask P for work for all resources R below buf_max
for which it's allowed to fetch work,
unless there's a higher-priority project allowed
to request work for R.
If we're going to do an RPC to P for reasons other than work fetch,
the policy is:
- for each resource R for which P is the highest-priority project
allowed to fetch work, and R is below buf_max,
request work for R.
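The same policy as a standalone sketch (simplified types; buffered[] holds the buffered instance-seconds per resource):

    #include <vector>

    struct Proj {
        double priority;
        std::vector<bool> may_fetch;   // per resource type
    };

    // resources to ask project P for, per the policy above
    std::vector<bool> resources_to_request(
        const Proj& P, const std::vector<Proj>& projects,
        const std::vector<double>& buffered, double buf_max
    ) {
        std::vector<bool> req(buffered.size(), false);
        for (size_t r = 0; r < buffered.size(); r++) {
            if (buffered[r] >= buf_max) continue;   // R not below buf_max
            if (!P.may_fetch[r]) continue;          // P may not fetch for R
            bool higher = false;
            for (const Proj& q : projects) {
                if (q.priority > P.priority && q.may_fetch[r]) {
                    higher = true;                  // defer to q
                    break;
                }
            }
            if (!higher) req[r] = true;
        }
        return req;
    }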
- client: when parsing an MD5, use a 64-char instead of a 33-char buffer.
When the XML parser reads a string,
it enforces the buffer size limit BEFORE it strips whitespace.
So if a project put whitespace before or after the MD5,
it would fail to parse.
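Sketch of the fix (the tag name is illustrative):

    char md5[64];   // was 33: exactly an MD5 hash plus NUL
    // the parser enforces the 64-char limit before trimming, so
    // " d41d8cd98f00b204e9800998ecf8427e " now fits and parses
    if (xp.parse_str("md5_cksum", md5, sizeof(md5))) continue;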
to use project's share of instances.
- client emulator: if client_state.xml doesn't have <no_rsc_apps>
for a project, and the project doesn't have apps for that resource,
the project can be asked for work for that resource.
- remote job submission:
- prefix error messages with "BOINC server:"
so higher levels can tell where the error is coming from
- "get templates" RPC can take job name instead of app name
- Condor interface
- add BOINC_SELECT_PROJECT function
- BOINC_SUBMIT no longer has info about output files
- Change BOINC_FETCH_OUTPUT semantics
There are many places in the code where we keep track
(usually in a static variable called "last_time")
of the last time we did something,
and we only do it again when now - last_time exceeds some interval.
Example: sending heartbeat messages to apps.
Problem: if the system clock is decreased by X,
we won't do any of these actions for a period of X,
making it appear that the client is frozen.
Solution: when we detect that the system clock has decreased,
set a global var "clock_change" for 1 iteration of the polling loop,
and disable these time checks if clock_change is set.
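Sketch of the pattern and the guard (names from the description above):

    static double last_time = 0;
    bool clock_change = false;   // set for one poll iteration when the
                                 // system clock is seen to decrease

    bool time_to_do_it(double now, double interval) {
        if (clock_change || now - last_time > interval) {
            last_time = now;
            return true;
        }
        return false;
    }

    // in the polling loop:
    //   if (now < previous_now) clock_change = true;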
- scale amount of work request by
(# non-excluded instances)/#instances
- change policy:
old: don't fetch work if #jobs > #non-excluded instances
new: don't fetch work if # of instance-seconds used in RR sim
> work_buf_min * (#non-excluded instances)/#instances
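Expressed as a sketch (variable names illustrative):

    double frac = (double)n_non_excluded / n_instances;
    // scale the amount of work requested
    double request = base_request * frac;
    // new fetch test
    bool dont_fetch = rr_sim_instance_secs_used > work_buf_min() * frac;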
has an invalid URL, type, or app
- server, create_work() function: if a <file_info> in input template
lists URLs, they're directories; append filename to each one
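Sketch of the server-side completion (names illustrative):

    #include <string>
    #include <vector>

    // each URL in the template is a directory; append the file name
    void complete_urls(std::vector<std::string>& urls, const std::string& fname) {
        for (std::string& url : urls) {
            if (!url.empty() && url.back() != '/') url += '/';
            url += fname;
        }
    }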
Android: For all power/battery file descriptors, NULL out their buffers so the client will grab the latest information and not recycle the old information.
- Don't compute if the battery is overheated
- Don't compute until the battery is 95% charged.
Then stop computing if it falls below 90%.
(On some devices, computing causes the battery to drain
even while it's recharging).
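The charge rule is a simple hysteresis; a sketch (function and parameter names are mine):

    // don't start until 95%; once computing, keep going until the
    // charge drops below 90% (avoids flapping while the battery
    // drains under load even on the charger)
    bool battery_allows_computing(double charge_pct, bool was_computing) {
        if (was_computing) return charge_pct >= 90;
        return charge_pct >= 95;
    }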
This was supposed to be in my 507cd79 commit, but it got botched somehow.
- client: the <task> debug flag enables suspend/resume messages
for both CPU and GPU.
Previously CPU messages were always shown,
and GPU messages were shown if <cpu_sched_debug> was set.
- client: fix bug where reschedule wasn't being done on GPU suspend or resume.
The previous handling of GPU exclusions
(especially per-app exclusions) was incomplete and buggy.
Changes:
- make bitmaps of included instances per (app, resource type)
- in round-robin simulation, we keep track of used instances
(so that we know if there are instances that are idle
because of exclusions).
Do this based on app-level exclusions
(previously it was done based on project-wide exclusions,
which didn't include app-level exclusions).
- compute RSC_PROJECT_WORK_FETCH::non_excluded_instances
as the logical OR of the per-app masks.
I.e. if you exclude an instance for all apps separately,
it's the same as excluding it for the project as a whole.
(Note: this bitmap is used for only 1 purpose:
if we have idle instances, don't request work from a project
for which those instances are excluded.)
- define RSC_PROJECT_WORK_FETCH::ncoprocs_excluded as the # of
instances excluded for *any* app, not the # excluded for all apps.
This quantity is used in work fetch to make sure we don't
unboundedly fetch jobs that turn out not to have a GPU to run on
due to exclusions.
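Sketch of the two derived quantities (bitmaps as 64-bit masks, one bit per instance; container names illustrative):

    #include <cstdint>
    #include <vector>

    // per_app_included[a] = bitmap of instances app a may use
    void derive_exclusion_info(
        const std::vector<uint64_t>& per_app_included, int n_instances,
        uint64_t& non_excluded_instances, int& ncoprocs_excluded
    ) {
        non_excluded_instances = 0;
        for (uint64_t m : per_app_included) {
            non_excluded_instances |= m;    // usable by at least one app
        }
        ncoprocs_excluded = 0;
        for (int i = 0; i < n_instances; i++) {
            for (uint64_t m : per_app_included) {
                if (!(m & ((uint64_t)1 << i))) {
                    ncoprocs_excluded++;    // excluded for *any* app
                    break;
                }
            }
        }
    }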
* Move the windows_format_error_string function to win_util.cpp, .h instead of it being scattered between util.h and str_util.cpp.
* Convert the Windows error string into UTF-8 before allowing it to be used by the caller
* Remove windows_error_string from library
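A standalone sketch of the conversion (the real function's signature may differ; the Win32 calls are the standard ones):

    #include <windows.h>
    #include <string>

    std::string windows_error_string_utf8(DWORD err) {
        wchar_t* wbuf = NULL;
        FormatMessageW(
            FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM
                | FORMAT_MESSAGE_IGNORE_INSERTS,
            NULL, err, 0, (LPWSTR)&wbuf, 0, NULL
        );
        if (!wbuf) return std::string();
        // convert UTF-16 -> UTF-8
        int n = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, NULL, 0, NULL, NULL);
        std::string out;
        if (n > 1) {
            out.resize(n);
            WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, &out[0], n, NULL, NULL);
            out.resize(n - 1);   // drop the trailing NUL
        }
        LocalFree(wbuf);
        return out;
    }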