Previously the client had (C++) code to
- check whether on AC or USB power
- get battery status and temperature
- check whether on wifi
These functions looked in various places under /sys.
Problem: the paths are system-dependent,
so whatever we do won't work on all devices.
The Android APIs for getting this info are in Java,
so we can't call them from the client.
Solution: have the GUI periodically get this info
and report it to the client via a GUI RPC.
The GUI must make this RPC periodically:
if the client doesn't get one within some period of time
(currently 30 sec) it suspends computing and network.
Also: if suspending jobs because of battery charge level
or temperature, leave them in memory.
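A minimal sketch of the 30-second watchdog described above (identifier
and function names here are assumptions, not the actual client's):

    // If the GUI hasn't reported device status recently, assume the worst.
    const double DEVICE_STATUS_TIMEOUT = 30;    // seconds
    double device_status_time = 0;              // set by the GUI RPC handler

    void check_device_status(double now) {
        if (now - device_status_time > DEVICE_STATUS_TIMEOUT) {
            // no recent report; we may be on battery, so play it safe
            suspend_computing();
            suspend_network();
        }
    }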
In enforce_run_list(), don't count the RAM usage of NCI tasks.
NCI tasks run sporadically, so it doesn't make sense to count their RAM usage;
doing so can starve regular jobs in some cases.
Add OPENCL_DEVICE_PROP cpu_opencl_prop to HOST_INFO;
this stores info about the host's ability to run CPU OpenCL apps.
Detect this, and report it in scheduler requests.
The basic problem: the way we assign GPU instances when creating
the "run list" is slightly different from the way we assign them
when we actually run the jobs;
the latter assigns a running job to the instance it's using,
but the former doesn't.
Solution (kludge): when building the run list,
don't reserve instances for currently running jobs.
This will result in more jobs in the run list, and avoid starvation.
For efficiency, do this only if there are exclusions for this type.
Comment: this is yet another complexity that would be eliminated
if GPU instances were modeled separately.
I wish I had time to do that.
- client emulator: change default latency bound from 1 day to 10 days
This gives you a way to simulate the effects of app_config.xml
- client: piggyback requests for resources even if we're backed off from them
- client: change resource backoff logic
Old: if we requested work and didn't get any,
back off from resources for which we requested work
New: for each resource type T:
if we requested work for T and didn't get any, back off from T
Also, don't back off if we're already backed off
(i.e. if this is a piggyback request)
Also, only back off if the RPC was due to an automatic
and potentially rapid source
(namely: work fetch, result report, trickle up)
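A rough sketch of the new per-resource rule (all names are illustrative;
the real client keeps this state in its work-fetch structures):

    // After parsing a scheduler reply:
    for (int t = 0; t < NPROC_TYPES; t++) {
        if (!requested_work[t]) continue;   // didn't ask for this type
        if (got_work[t]) continue;          // got some; no backoff
        if (backed_off[t]) continue;        // piggyback request; skip
        if (!rpc_is_automatic) continue;    // only work fetch, result
                                            // report, and trickle up
        start_backoff(t);
    }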
- client: fix small work fetch bug (by Jacob Klein).
The new policy is roughly as follows:
- find the highest-priority project P that is allowed
to fetch work for a resource below buf_min
- Ask P for work for all resources R below buf_max
for which it's allowed to fetch work,
unless there's a higher-priority project allowed
to request work for R.
If we're going to do an RPC to P for reasons other than work fetch,
the policy is:
- for each resource R for which P is the highest-priority project
allowed to fetch work, and R is below buf_max,
request work for R.
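A rough sketch of both cases (all names illustrative, not the client's
actual identifiers):

    // Case 1: RPC initiated for work fetch.
    PROJECT* p = highest_priority_project_below_buf_min();
    if (p) {
        for (int rsc = 0; rsc < NPROC_TYPES; rsc++) {
            if (!below_buf_max(rsc)) continue;
            if (!allowed_to_fetch(p, rsc)) continue;
            if (higher_priority_project_allowed(rsc, p)) continue;
            request_work(p, rsc);
        }
    }

    // Case 2: RPC to p for some other reason; piggyback work requests.
    for (int rsc = 0; rsc < NPROC_TYPES; rsc++) {
        if (!below_buf_max(rsc)) continue;
        if (highest_priority_allowed_project(rsc) != p) continue;
        request_work(p, rsc);
    }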
- client: when parsing MD5, use 64 instead of 33 char buffer.
When the XML parser reads a string,
it enforces the buffer size limit BEFORE it strips whitespace.
So if a project put whitespace before or after the MD5,
it would fail to parse.
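Why 64 works: an MD5 digest is 32 hex chars plus a terminating NUL, but
the raw element text (including any surrounding whitespace) must fit in
the buffer before it's trimmed. A sketch, assuming the usual
parse-then-strip pattern:

    char md5[64];    // was char md5[33]; too tight once padding counts
    xp.parse_str("md5_cksum", md5, sizeof(md5));
    strip_whitespace(md5);    // at most 32 chars + NUL remain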
to use project's share of instances.
- client emulator: if client_state.xml doesn't have <no_rsc_apps>
for a project, and the project doesn't have apps for that resource,
the project can be asked for work for that resource.
- remote job submission:
- prefix error messages with "BOINC server:"
so higher levels can tell where the error is coming from
- "get templates" RPC can take job name instead of app name
- Condor interface
- add BOINC_SELECT_PROJECT function
- BOINC_SUBMIT no longer has info about output files
- Change BOINC_FETCH_OUTPUT semantics
(usually in a static variable called "last_time")
of the last time we did something,
and we only do it again when now - last_time exceeds some interval.
Example: sending heartbeat messages to apps.
Problem: if the system clock is decreased by X,
we won't do any of these actions for time X,
making it appear that the client is frozen.
Solution: when we detect that the system clock has decreased,
set a global var "clock_change" for 1 iteration of the polling loop,
and disable these time checks if clock_change is set.
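A sketch of the pattern and the fix (variable names illustrative):

    static double last_time = 0;
    extern bool clock_change;   // true for one poll-loop iteration after
                                // a backward clock jump is detected

    void heartbeat_poll(double now) {
        if (!clock_change && now - last_time < HEARTBEAT_INTERVAL) {
            return;             // not due yet
        }
        last_time = now;
        send_heartbeats();
    }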
- scale amount of work request by
(# non-excluded instances)/#instances
- change policy:
old: don't fetch work if #jobs > #non-excluded instances
new: don't fetch work if # of instance-seconds used in RR sim
> work_buf_min * (#non-excluded instances)/#instances
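In code form (illustrative names; the fraction is per resource type):

    double frac = (double)(ninstances - nexcluded) / ninstances;
    request *= frac;    // scale the work request
    bool dont_fetch = busy_instance_seconds > work_buf_min() * frac;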
has an invalid URL, type, or app
- server, create_work() function: if a <file_info> in input template
lists URLs, they're directories; append filename to each one
Android: For all power/battery file descriptors, NULL out their buffers
so the client will grab the latest information and not recycle the old information.
- Don't compute if the battery is overheated
- Don't compute until the battery is 95% charged.
Then stop computing if it falls below 90%.
(On some devices, computing causes the battery to drain
even while it's recharging).
This was supposed to be in my 507cd79 commit, but it got botched somehow.
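A sketch of the charge hysteresis (thresholds from the entry above;
names illustrative):

    // Start computing only at >= 95% charge; once started, keep going
    // until charge drops below 90%.  The gap avoids rapid on/off cycling.
    bool ok_to_compute(double charge_pct, bool currently_computing) {
        if (currently_computing) return charge_pct >= 90;
        return charge_pct >= 95;
    }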
- client: the <task> debug flag enables suspend/resume messages
for both CPU and GPU.
Previously CPU messages were always shown,
and GPU messages were shown if <cpu_sched_debug> was set.
- client: fix bug where reschedule wasn't being done on GPU suspend or resume.
(especially per-app exclusions) was incomplete and buggy.
Changes:
- make bitmaps of included instances per (app, resource type)
- in round-robin simulation, we keep track of used instances
(so that we know if there are instances that are idle
because of exclusions).
Do this based on app-level exclusions
(previously it was done based on project-wide exclusions,
which didn't include app-level exclusions).
- compute RSC_PROJECT_WORK_FETCH::non_excluded_instances
as the logical OR of the per-app masks.
I.e. if you exclude an instance for all apps separately,
it's the same as excluding it for the project as a whole.
(Note: this bitmap is used for only 1 purpose:
if we have idle instances, don't request work from a project
for which those instances are excluded.)
- define RSC_PROJECT_WORK_FETCH::ncoprocs_excluded as the # of
instances excluded for *any* app, not the # excluded for all apps.
This quantity is used in work fetch to make sure we don't
unboundedly fetch jobs that turn out not to have a GPU to run on
due to exclusions.
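A sketch of the mask arithmetic (illustrative; the real quantities live
in RSC_PROJECT_WORK_FETCH):

    // app_included_mask[a]: bit i set => instance i usable by app a
    void compute_exclusion_masks(
        const int* app_included_mask, int napps, int ninstances
    ) {
        int all_mask = (1 << ninstances) - 1;
        int non_excluded_instances = 0;   // OR of per-app included masks
        int excluded_by_any_app = 0;      // excluded for at least one app
        for (int a = 0; a < napps; a++) {
            non_excluded_instances |= app_included_mask[a];
            excluded_by_any_app |= all_mask & ~app_included_mask[a];
        }
        int ncoprocs_excluded = 0;        // # excluded for *any* app
        for (int i = 0; i < ninstances; i++) {
            if (excluded_by_any_app & (1 << i)) ncoprocs_excluded++;
        }
    }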
* Move the windows_format_error_string function to win_util.cpp/.h
instead of it being scattered between util.h and str_util.cpp.
* Convert the Windows error string into UTF8 before allowing it to be used by the caller
* Remove windows_error_string from library
Old: each scheduler process holds a semaphore
while scanning the shared-mem job array.
On machines with many CPUs
there seems to be contention for this semaphore,
causing slow scheduler response and possibly connection failures.
New: Don't hold the semaphore while scanning array.
Instead, if we find a job that passes quick_check(),
acquire the semaphore and recheck that the job is present in the array
and passes quick_check().
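Sketch of the new scan (structure names follow the scheduler's
shared-memory array; the helper names are assumptions):

    for (int i = 0; i < ss.max_wu_results; i++) {
        WU_RESULT& wu_result = ss.wu_results[i];
        if (!quick_check(wu_result)) continue;    // no semaphore held
        lock_sema();
        // the slot may have changed while we weren't holding the lock
        if (wu_result.state == WR_STATE_PRESENT && quick_check(wu_result)) {
            claim(wu_result);    // mark the slot as ours
        }
        unlock_sema();
    }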
- client: show messages if app_config.xml has unrecognized tags
- Win process control (affects API and wrapper):
Since Win doesn't have an API for process suspend/resume,
we were suspending processes by
1) enumerating all the threads in the system (typically several thousand)
2) suspending those belonging to the given process
The problem: for each thread, the code was calling a function
in diagnostics_win.cpp to see if the thread was exempted from suspension.
This check (which is unnecessary anyway if we're suspending another process)
was surrounded by a semaphore acquire/release.
The result: performance problems.
It could take a minute to suspend the threads.
Solution:
1) do the check for exemption only if we're suspending threads
in our own process (i.e. from the API)
2) if we're suspending multiple processes, enumerate the threads
only once, and see if each one belongs to any of the processes
3) have the wrapper elevate itself to normal priority.
Otherwise it can get preempted for long periods,
sometimes in the middle of scanning the threads.
Note: post-9x versions of Win have a process group API that includes suspend/resume.
We'll switch to this soon.
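A sketch of the single-snapshot thread walk in (2), using the toolhelp
API (error handling trimmed; this is an illustration, not the wrapper's
actual code):

    #include <windows.h>
    #include <tlhelp32.h>
    #include <vector>

    void suspend_processes(const std::vector<DWORD>& pids) {
        HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
        if (snap == INVALID_HANDLE_VALUE) return;
        THREADENTRY32 te = { sizeof(te) };
        if (Thread32First(snap, &te)) {
            do {
                // does this thread belong to a target process?
                for (DWORD pid : pids) {
                    if (te.th32OwnerProcessID != pid) continue;
                    HANDLE th = OpenThread(
                        THREAD_SUSPEND_RESUME, FALSE, te.th32ThreadID
                    );
                    if (th) {
                        SuspendThread(th);
                        CloseHandle(th);
                    }
                    break;
                }
            } while (Thread32Next(snap, &te));
        }
        CloseHandle(snap);
    }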
PRINCIPLE: AVOID PER-GPU-TYPE VARIABLES
- get rid of alloca() stuff in gutil.cpp; almost certainly not needed
- don't include malloc.h; it doesn't exist on BSD systems
http://boinc.berkeley.edu/trac/wiki/ClientAppConfig
This lets users do the following:
1) limit the number of concurrent jobs of a given app
(e.g. for WCG apps that are I/O-intensive)
2) Specify the CPU and GPU usage parameters of GPU versions
of a given app.
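For reference, an app_config.xml along the lines described on that page
(the app name and numbers are just examples):

    <app_config>
        <app>
            <name>example_app</name>
            <max_concurrent>1</max_concurrent>
            <gpu_versions>
                <gpu_usage>0.5</gpu_usage>
                <cpu_usage>0.4</cpu_usage>
            </gpu_versions>
        </app>
    </app_config>

This runs at most one job of example_app at a time, and tells the client
that each GPU job uses half a GPU and 0.4 CPUs.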
Implementation notes:
- max app concurrency is enforced in 2 places:
1) when building the initial job run list
2) when enforcing the final job run list
Both are needed to avoid possible starvation.
- however, we don't enforce it during RR simulation.
Doing so could cause erroneous shortfall and work fetch.
This means, however, that work buffering will not work
as expected if you're using max concurrency.
report them.
64 is chosen a bit arbitrarily, but the idea is to
limit the number of tasks reported per RPC,
and to accelerate the reporting of small tasks.
the binding of the get_state() RPC
- client: move client_start_time and previous_uptime
from CLIENT_STATE to TIME_STATS,
so that these are also visible in GUI RPC
- scheduler RPC: move uptime and previous_uptime
into <time_stats>
- client: condition an RR simulation message on <rrsim_detail>
- boinccmd: show TIME_STATS info in --get_state
- client: if an app's finish file has existed for 10 seconds, kill it;
it must be hung in boinc_finish().
This behavior has been seen with LHC@home and maybe other projects.
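Sketch of the check (names illustrative; the client records when it
first saw the finish file):

    if (boinc_file_exists(finish_file_path)) {
        if (!finish_file_time) {
            finish_file_time = now;              // first sighting
        } else if (now - finish_file_time > 10) {
            kill_task();                         // hung in boinc_finish()
        }
    }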
Note: this fixes a major problem (starvation)
with project-level GPU exclusion.
However, project-level GPU exclusion interferes with most of
the client's scheduling policies.
E.g., round-robin simulation doesn't take GPU exclusion into account,
and the resulting completion estimates and device shortfalls
can be wrong by an order of magnitude.
The only way I can see to fix this would be to model each
GPU instance as a separate resource,
and to associate each job with a particular GPU instance.
This would be a sweeping change in both client and server.
Old: heartbeat mechanism
Problem: if the client is blocked for > 30 secs
(e.g. because it takes a long time to write the state file,
or because it's stopped in a debugger)
then apps exit.
This is bad if the app doesn't checkpoint and has been
running for a long time.
New: the client passes its PID to the app.
The app periodically (10 sec) checks that the process still exists.
Notes:
- For backward compatibility (e.g. new API w/ old client,
or vice versa) the client still sends heartbeats,
and the API checks heartbeats if the client doesn't pass a PID.
- The new mechanism works only if the client's PID isn't assigned
to a new process within 10 secs of the client exiting.
Windows 2000 reuses PIDs immediately, so check for Win2K
and don't use this mechanism if so.
TODO: For Unix multithread apps,
critical sections aren't currently being enforced.
Need to fix this by masking signals.
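On Unix the liveness check can be a 0-signal probe; a sketch of what the
API side does every 10 seconds:

    #include <signal.h>
    #include <errno.h>

    bool client_alive(pid_t client_pid) {
        // signal 0 delivers nothing, but reports whether the PID exists
        if (kill(client_pid, 0) == 0) return true;
        return errno != ESRCH;    // EPERM etc. still mean it exists
    }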
svn path=/trunk/boinc/; revision=26147
cards. It appears that the Nvidia API was only setting 32-bits
of the 64-bit value. The remaining 32-bits were whatever
was on the stack.
client/
gpu_nvidia.cpp
svn path=/trunk/boinc/; revision=26084
- Allow projects to report "desired disk usage" (DDU).
If the client learns that a project wants disk space,
it can shrink the allocation to other projects.
- Base share computation on DDU rather than disk usage.
- Introduce the notion of "disk resource share".
This is defined (somewhat arbitrarily) as resource share
plus 1/10 of the largest resource share (see the sketch below).
This is intended to ensure that even zero-share projects
get enough disk space to store app versions and data files;
otherwise they wouldn't be able to compute.
- server: use host.d_boinc_max (which wasn't being used)
to store d_project_share reported by client.
- volunteer storage: change the way hosts are allocated to chunks.
Allow hosts to store several chunks of the same file, if needed
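The "disk resource share" defined above, as a one-liner (field names
assumed):

    double disk_resource_share(PROJECT* p, double largest_share) {
        // even a zero-share project gets a tenth of the largest share
        return p->resource_share + largest_share / 10;
    }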
svn path=/trunk/boinc/; revision=26052
initial work request to a project
- client: put some casts to double in NVIDIA detect code.
Shouldn't make any difference.
- volunteer storage: truncate file to right size after retrieval
svn path=/trunk/boinc/; revision=26051
allow it to fetch work of that type if the # of runnable
jobs is <= the # of non-excluded instances (rather than 0).
svn path=/trunk/boinc/; revision=26045
for a reason other than work fetch,
and we're deciding whether to piggyback a work request,
skip the checks for hysteresis (buffer < min)
and for per-resource backoff time.
These checks are there only to limit the rate of RPCs,
which is not relevant since we're doing one anyway.
This fixes a bug where a project w/ sporadic jobs specifies
a next_rpc_delay to ensure regular polling from clients.
When these polls occur they should request work regardless of backoff.
svn path=/trunk/boinc/; revision=26002
"cpu" in XML, and other code was looking for "CPU".
To fix this and prevent similar problems,
processor type names are now encapsulated in proc_type_name_xml().
Code should use this rather than having hard-wired names.
Redefine: GPU_TYPE_* as macros that call proc_type_name_xml().
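A sketch of the encapsulation (the constant and string values follow the
client's processor-type scheme; treat exact values as assumptions):

    const char* proc_type_name_xml(int pt) {
        switch (pt) {
        case PROC_TYPE_CPU:        return "cpu";
        case PROC_TYPE_NVIDIA_GPU: return "nvidia";
        case PROC_TYPE_AMD_GPU:    return "ati";
        default:                   return "unknown";
        }
    }
    #define GPU_TYPE_NVIDIA proc_type_name_xml(PROC_TYPE_NVIDIA_GPU)
    #define GPU_TYPE_ATI    proc_type_name_xml(PROC_TYPE_AMD_GPU)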
svn path=/trunk/boinc/; revision=25996
the power management window proc, it was removed during one of the Win9x
code scrubs. When we see it, inform the client it is time to shut down.
client/
sysmon_win.cpp
svn path=/trunk/boinc/; revision=25882
keep the RESULT record so that we can report it to the scheduler.
Otherwise we'll keep getting the same job if the project has
<resend_lost_results> set.
svn path=/trunk/boinc/; revision=25879
and change types of mem-size fields from int to double.
These fields are size_t in NVIDIA's version of this;
however, cuDeviceGetAttribute() returns them as int,
so I don't see where this makes any difference.
- client: fix bug in handling of <no_rsc_apps> element.
- scheduler: message tweaks.
Note: [foo] means that the message is enabled by <debug_foo>.
svn path=/trunk/boinc/; revision=25849
Otherwise it doesn't work for files >= 2GB
- Client: TIME_STATS::trim_stats_log() wasn't working because
it's called in the constructor of TIME_STATS,
which is called before we've done a chdir() to the data dir.
Note: for this reason, no disk access should be done in constructors
of global objects. A quick scan found no instances of this.
svn path=/trunk/boinc/; revision=25846