- Fix a crash bug which prevented "Launch another BOINC Manager" from working
- Fix compiler warnings of possible buffer overflow due to incorrect calls to strncat()
- Adjust directories for per-user files on Macintosh
- gpu_active_frac is the fraction of time GPU use is allowed
while the client is running.
Previously the client reported it but we weren't storing it in the DB.
We may need it in the future for batch scheduling logic.
- fix a crashing bug in scheduler
- client: minor message tweak
- If run with --gui_rpc_unix_domain, the client will listen on
a Unix-domain socket (named "boinc_socket") rather than on a TCP port.
- Add RPC_CLIENT::init_unix_domain() function to C++ GUI RPC interface
(Note: we'll need to add a corresponding function to the Java interface)
- boinccmd: add --unix_domain option
The web RPCs done by the client during project attach
(lookup_account, create_account)
have an email address and password hash in their request.
Network sniffers could potentially see these,
so we should use HTTPS for these RPCs if possible.
However, not all BOINC projects have SSL-enabled web servers.
So I did the following:
- Change get_project_config.php to return an additional
<web_rpc_url_base> element.
This is SECURE_URL_BASE (if specified in the project's
project.inc config file) or, if not, the master URL.
- This new element is parsed into the PROJECT_CONFIG structure.
- In calls to create_account and lookup_account,
the Manager uses PROJECT_CONFIG::web_rpc_url_base
if it's available, else the master URL.
So, the new Manager/client uses HTTPS for RPCs to projects
that have updated their get_project_config.php,
and specify a SECURE_URL_BASE with https:// prefix.
Android note: I added code to parse the new config element,
but didn't change the higher-level code;
Joachim will need to do this.
The batch query call used by Condor (query_batch_set(), in the C++ API)
returned info about all the jobs in the set of batches,
even those that hadn't changed.
This is potentially inefficient - a query might return info
about 10,000 jobs, only a few (or none) of which have changed state
since the last call.
Solution: add a "min_mod_time" parameter to the call.
Only jobs that have changed state since that time are reported.
Also, add a "server_time" field to the return,
giving the current time on the server
(in case there's clock skew between client and server)
Also, fix some text scrambling introduced in previous checkin;
there must have been a gremlin in my vim.
Also remove our copies of the Windows debug engine. Recent version of Windows have recent enough versions of the debugger engine to generate call-stacks.
- Add a GUI RPC ("set_language") that lets the Manager communicate
the user's selected language code to the client at startup.
- The client stores the language code in the client state file
- The client appends a "lang=X" GET argument to the URLs from
which notices are fetched.
- The next steps (not done) are 1) to change the get_notices.php
script to parse the argument and do translation, and
2) extend our Pootle system to allow volunteer translation
of notices by all projects.
Various bad things could happen when CPU throttling was used together w/ GPU apps.
Examples:
- on a multi-GPU system, several GPU tasks are assigned to the same GPU
- a suspended GPU task remains in memory (tying up its GPU resources)
while other tasks try to use the GPU.
The problem was that parts of the code assumed that suspended
GPU processes don't exist - i.e. that when a GPU task is suspended
it's always removed from memory.
This isn't true in the presence of CPU throttling.
So I made the following changes:
- When assigning GPUs to tasks, treat suspended tasks like running tasks
(i.e. reserve their GPUs)
- At the end of the CPU-scheduling logic, if there are any GPU tasks
that are suspended and not scheduled, remove them from memory,
and trigger a reschedule so we can reallocate their GPUs.
Also, a cosmetic change: in the resource usage string shown in the GUI,
include "(device X)" even if the task is suspended (i.e. because of throttling).
Also: zero out COPROC::opencl_device_indexes[] so we don't write
a garbage number to init_data.xml for non-OpenCL jobs
The OPENCL_CPU_PROP structure was being referred to as both
"opencl_cpu_prop" and "cpu_opencl_prop", roughly 50/50,
in variable names and XML tags.
Let's standardize on "opencl_cpu_prop",
which is what current clients are sending in scheduler requests.
This makes the host CPID stable; if you repeatedly install BOINC
on a particular node, it will get the same host CPID each time,
and your host table won't get lots of redundant entries.
A host can have multiple NICs;
we use the MAC address of the first Ethernet controller we find,
or the last NIC if there is none.
Of course, this will create problems if we get the same MAC address
for different hosts; in principle this shouldn't happen.
Remove the unused file hostinfo_network.h
This will allow the core client to kill VirtualBox VM's launched indirectly by vboxwrapper. Vboxwrapper launches vboxsvc.exe which launches vboxheadless.exe. This should also take care of the core client being able to kill child processes of the regular wrapper as well. I don't know the full scope of this type of issue? Maybe the default ACLs for a process changed within the last couple of versions of Windows.
On Windows, the working-set size reported by the OS for VM apps is too low.
Apparently the RAM usage is in fact roughly the VM size.
This can lead to running multiple VM apps,
which use more RAM than is available, causing performance problems.
Solution: use workunit.rsc_memory_bound as the working set size for VM apps.
(Note: for now, a VM app is one where the plan class includes "vbox").
- Show the OpenCL platform vendor for each OpenCL CPU description.
- OpenCL may not reliably report total RAM, available RAM and max FLOPS for CPUs, so exclude these from the OpenCL CPU descriptions; that information is available elsewhere.
- Batches now have optional "expire time".
If this time passes and the batch is not retired, abort and retire it.
- Add script "expire_batches" which enforces the above.
Run it as a periodic task.
- Add a web RPC for setting the expire time of a batch
(it can be changed multiple times)
- Add a C++ interface for this RPC
- Add a BOINC_SET_LEASE command to the BOINC GAHP
("lease" is Condor term for expire time)
Notes:
- The same CPU can have a different cpu_opencl_prop for each of multiple OpenCL platforms. We send them all to the project server because:
- Different OpenCL platforms report different values for the same CPU.
- Some OpenCL CPU apps may work better with certain OpenCL platforms.
- OpenCL has only 64 bits for global_mem_size, so it can report a max of only 4GB; get the CPU RAM size from gstate.hostinfo.m_nbytes.
- the eliminates the need to include global_prefs_override.xml in the Android distribution
- defering reporting tasks confuses some Android users; since WiFi connectivity is likely to be sporadic, and jobs tend to be long, it's probably better the report them ASAP
Added safety features requested by Rom Walton:
* Change COPROC_ATI::get_available_ram and COPROC_NVIDIA::get_available_ram to static routines to prevent calling them without first loading CAL or CUDA libraries.
* Add tests for NULL library calls in these routines.
* Add comments warning about need to call from a separate child process on dual-GPU laptops, proper library initialization, etc.
Some dual-GPU laptops (e.g., Macbook Pro) don't power down the more powerful GPU until all applications which used them exit. To save battery life, the client launches a second instance of the client as a child process to detect and get info about the GPUs.
The child process writes the info to a temp file which our main client then reads.
This option is enabled at compile time by defining USE_CHILD_PROCESS_TO_DETECT_GPUS as non-zero in gpu_detect.cpp
Old: if the timer thread gets a <suspend> message while we're in
a critical section, it sets a "suspend_request" flag.
The timer then periodically (10X/sec) checks whether
suspend_request is set and we're no longer in a critical section;
if so it suspends the worker thread.
Problem (pointed out by Oliver): this doesn't work if the worker thread
is almost always in a critical section
(as is the case for GPU apps, which treat GPU kernels as critical sections).
The app never gets suspended.
New:
1) boinc_end_critical_section() checks suspend_request;
if set, it calls suspend_activities()
2) On Unix, if suspend_activities() is called from the worker thread,
it calls sleep() in a loop until the suspension is over.
(Note: pthreads has no suspend/resume).
3) Add a mutex to protect the data structures shared between
the timer and worker threads.
Oliver pointed out that
BOINC_QUERY_BATCHES now prints, for each queried batch,
a count of jobs followed by the jobs
BOINC_ABORT_JOBS takes a list of jobs, which may belong to different batches.
The handler for this looks up the batches and makes sure
the jobs belong to the user.
If the user typed an extremely long URL into the
Attach to Account Manager wizard, a buffer overrun could result.
There were several places in the code that assumed user-entered
URLs are small (e.g. 256 chars):
- canonicalize_master_url.cpp()
- several GUI RPC interfaces, when generating XML request message
- URL-escaping (not relevant here, but fix anyway)
Change all these to stay within buffers regardless of URL size.
Note: do this by truncation.
This will cause error messages like "can't connect to project"
rather than saying the URL is too long. That's OK.
- XML parser: for parse_string(), malloc the 256KB buffer instead of
allocating it on the stack; the latter crashes threads with 32KB stacks.
However, do the malloc() only if we've actually seen the start tag
(this required a bit of code shuffle).
- BOINC GAHP: escape spaces in error msgs
We want to track the product name (e.g. "HTC One X") of Android devices.
On Android, the API to get this is Java,
so we need to do it in the GUI rather than the client.
- Add product_name field to HOST_INFO
- Add a GUI RPC for passing this info from the GUI to the client.
- Store it in client_state.xml, so that the client knows it initially.
The product name is included in scheduler RPC requests, as part of <host_info>.
TODO: add server-side support for parsing it and storing in DB.
Also: move DEVICE_STATUS out of HOST_INFO; it didn't belong there.
if a project sends us <no_rsc_apps> flags for all processor types,
then by default the client will never do a scheduler RPC to that project again.
This could happen because of a transient condition in the project,
e.g. it deprecates all its app versions for a while.
To avoid this situation, the client now checks whether the no_rsc_apps flags
are set for all processor types.
If they are, it clears them all.
This will cause work fetch to use backoff,
and the client will occasionally contact the project.
Previously the client had (C++) code to
- check whether on AC or USB power
- get battery status and temperature
- check whether on wifi
These functions looked in various places under /sys.
Problem: the paths are system-dependent,
so whatever we do won't work on all devices.
The Android APIs for getting this info are in Java,
so we can't call them from the client.
Solution: have the GUI periodically get this info
and report it to the client via a GUI RPC.
The GUI must make this RPC periodically:
if the client doesn't get one within some period of time
(currently 30 sec) it suspends computing and network.
Also: if suspending jobs because of battery charge level
or temperature, leave them in memory.
Add OPENCL_DEVICE_PROP cpu_opencl_prop to HOST_INFO;
this store info about the host's ability to run CPU OpenCL apps.
Detect this, and report it in scheduler requests.
- client: when parsing MD5, use 64 instead of 33 char buffer.
When the XML parser reads a string,
it enforces the buffer size limit BEFORE it strips whitespace.
So if a project put whitespaces before or after the MD5,
it would fail to parse.
- remote job submission:
- prefix error messages with "BOINC server:"
so higher levels can tell where the error is coming from
- "get templates" RPC can take job name instead of app name
- Condor interface
- add BOINC_SELECT_PROJECT function
- BOINC_SUBMIT no longer has info about output files
- Change BOINC_FETCH_OUTPUT semantics
- change "query_batch" to "query_batches"; allow multiple batches
- add "ping server" web RPC and GAHP function
- change BoincDb::get() so that it generates XML error message if needed
- add Web RPC for querying a completed job (returns status,
stderr out, run times, etc.)
- support Jaime's changes to GAHP protocol
- support for zipped output files
- Don't compute if the battery is overheated
- Don't compute until the batter is 95% charged.
Then stop computing if it falls below 90%.
(On some devices, computing causes the batter to drain
even while it's recharging).
This was supposed to be in my 507cd79 commit, but it got botched somehow.
- client: the <task> debug flag enables suspend/resume messages
for both CPU and GPU.
Previously CPU messages were always shown,
and GPU messages were shown if <cpu_sched_debug> was set.
- client: fix bug where reschedule wasn't being done on GPU suspend or resume.
* Move the windows_format_error_string function to win_util.cpp, .h instead of it being scattered between util.h and str_util.cpp.
* Convert the Windows error string into UTF8 before allowing it to be used by the caller
* Remove windows_error_string from library