boinc/dcapi/doc/boinc.xml

585 lines
20 KiB
XML

<?xml version="1.0"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<sect1 id="boinc">
<title>BOINC</title>
<sect2>
<title>Terminology</title>
<para>
The term <emphasis>workunit</emphasis> means approximately the same thing
for DC-API as for BOINC.
</para>
<para>
Both the DC-API and BOINC uses the term <emphasis>result</emphasis>, but
they mean different things. In BOINC, results are instances of a work
unit waiting to be downloaded or currently under execution. The DC-API
result is what BOINC calls the <emphasis>canonical result</emphasis>.
This means that when BOINC generates multiple results (e.g. for
redundant computation), the DC-API will not be notified about the status
of individual BOINC results; instead, it will be notified only if the
canonical result is found or the whole work unit is marked as failed by
BOINC.
</para>
<para>
In the following sections, <emphasis>result</emphasis> will mean the
DC-API term, while <emphasis>BOINC result</emphasis> will refer to the
BOINC definition.
</para>
<para>
The DC-API master application is an <emphasis>assimilator</emphasis> in
BOINC terms.
</para>
</sect2>
<sect2>
<title>Configuration options</title>
<sect3>
<title>Master side</title>
<warning>
Important note: the directories specified by
<literal>WorkingDirectory</literal>, <literal>ProjectRootDir</literal>
and the upload &amp; download directories specified in BOINC's
<filename>config.xml</filename> must all reside on the same filesystem
since the DC-API uses the <function>link()</function> and
<function>rename()</function> system calls.
</warning>
<variablelist>
<varlistentry>
<term>InstanceUUID</term>
<listitem>
<para>
REQUIRED. The value must be a Universally Unique Identifier. The
value must be unique for every master application running on the
same grid backend. If two master applications are started with the
same <literal>InstanceUUID</literal> value, their behaviour is
undefined.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>BoincConfigXML</term>
<listitem>
<para>
REQUIRED. The location of the BOINC <filename>config.xml</filename>
configuration file.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>ProjectRootDir</term>
<listitem>
<para>
REQUIRED. The location of the project's root directory. This
directory. This is the directory that contains the
<filename>templates</filename>, <filename>upload</filename>,
<filename>download</filename> and other BOINC-related
subdirectories.
</para>
</listitem>
</varlistentry>
</variablelist>
</sect3>
<sect3>
<title>Per-client configuration</title>
<variablelist>
<varlistentry>
<term>Redundancy</term>
<listitem>
<anchor id="DC-API-Boinc-Redundancy"/>
<para>
OPTIONAL. Integer value specifying the quorum required to consider
the work unit as valid. The default value is 1. If this value is N,
<itemizedlist>
<listitem>
<para>
N + log(N) initial BOINC results will be created. If one of
them finishes, a new one will be created automatically until
the work unit either succeeds or fails.
</para>
</listitem>
<listitem>
<para>
The work unit will be considered failed if more than N +
log(N + 2) + 1 BOINC results fail.
</para>
</listitem>
<listitem>
<para>
The work unit will be considered failed if there are N +
log(N + 2) + 1 successful results but the validator could
not find a canonical result.
</para>
</listitem>
<listitem>
<para>
The work unit will be considered failed if the state of the
work unit is still not decided after 2 * (N + log(N + 2))
BOINC results have been received.
</para>
</listitem>
</itemizedlist>
<note>
When the redundancy is greater than 1, the work unit can not be
suspended using <function><link
linkend="DC-suspendWU">DC_suspendWU()</link></function>.
</note>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>MaxOutputSize</term>
<listitem>
<para>
OPTIONAL. Max. size of any output files the client application
generates. The default is 256 KiB. If the size of an output file
exceeds this value, the BOINC core client will not upload that
file and will report the BOINC result as failed.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>MaxMemUsage</term>
<listitem>
<para>
OPTIONAL. Max. memory usage of the client application. The default
is 128 MiB. Hosts with less available memory will not download
work units for this application. Also, if the applications's real
memory usage exceeds this limit, the BOINC core client aborts the
application and reports the BOINC result as failed.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>MaxDiskUsage</term>
<listitem>
<para>
OPTIONAL. Max. disk usage of the client application, including all
output and temporary files. The default is 64 MiB. Hosts with less
usable disk space will not download work units for this
application. Also, if the application's disk usage exceeds this
limit, the BOINC core client aborts the apllication and reports
the BOINC result as failed.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>EstimatedFPOps</term>
<listitem>
<para>
OPTIONAL. The estimated run-time of the client application,
expressed in the number of floating point operations. The default
is 10<superscript>13</superscript>. This value is used by the
BOINC server to decide whether a given host is eligible to run a
work unit and is also used by the BOINC core client for scheduling
decisions.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>MaxFPOps</term>
<listitem>
<para>
OPTIONAL. Max. CPU usage of the client application, expressed in the number
of floating point operations. The default is
10<superscript>15</superscript>. If the application uses more CPU
time than this value divided by the CPU's speed, then the BOINC
core client aborts the application and reports the BOINC result as
failed.
</para>
<note>
As per recommendations in the BOINC documentation, the value of
<literal>MaxFPOps</literal> should be several times larger than
the expected run time of a work unit on an avarage host.
</note>
</listitem>
</varlistentry>
<varlistentry>
<term>DelayBound</term>
<listitem>
<para>
Time in seconds the BOINC server waits for a result to finish. If
a client has donwloaded a BOINC result and did not finish in the
given time, the result is considered failed and a new one is
generated.
</para>
<note>
If <literal>DelayBound</literal> is smaller than the estimated run
time of the application on a given host (calculated by dividing
<literal>EstimatedFPOps</literal> by the host's speed), then the
BOINC result will not be offered for download. If no host is fast
enough to complete the application within the specified time
limit, the result will remain unsent for an unspecified amount of
time and DC-API will receive no feedback for it.
</note>
</listitem>
</varlistentry>
<varlistentry>
<term>EnableSuspend</term>
<listitem>
<para>
OPTIONAL. Boolean value telling if work units for this client can
be suspended using <function><link
linkend="DC-suspendWU">DC_suspendWU()</link></function> or
not. The default value is true.
<note>
When the redundancy is greater than 1, the work unit can not be
suspended using <function><link
linkend="DC-suspendWU">DC_suspendWU()</link></function>,
regardless of the value of this configuration option.
</note>
</para>
</listitem>
</varlistentry>
</variablelist>
</sect3>
<sect3>
<anchor id="Boinc-Config">
<title>Considerations for BOINC configuration</title>
</anchor>
<para>
If you want to use master-to-client messaging, you must enable it in the
BOINC project's configuration by making sure that the
<literal>&lt;msg_to_host/&gt;</literal> tag is present in
<filename>config.xml</filename>. Client-to-master messaging is always
enabled and does not require configuration.
</para>
</sect3>
</sect2>
<sect2>
<title>Backend-specific issues</title>
<sect3>
<title>Deploying the application</title>
<para>
Deploying the application consists of two steps: registering the client
application(s) in the BOINC database, and running the master daemon.
</para>
<para>
All client applications should be compiled for every platform you need,
and installed under the project's <filename>apps</filename> directory.
The BOINC name of the client application must be the same as the master
uses when it calls <function><link
linkend="DC-createWU">DC_createWU()</link></function>. See the BOINC
documentation about how the client binaries should be named and placed
and how they should be registered in the database.
</para>
<para>
The most common method of deploying the master application is to run it
as a BOINC daemon by adding it to BOINC's
<filename>config.xml</filename>. See the BOINC documentation for
details. Other methods of deploying the master application depending on
how it was designed are also possible, but the following rules must be
fulfilled:
<itemizedlist>
<listitem>
<para>
The master application must have access to the BOINC project's
<filename>config.xml</filename>
</para>
</listitem>
<listitem>
<para>
The BOINC <application>file_deleter</application> process must
have enough privileges to be able to remove files and directories
created by the master application. If the master runs under the
same user account as the BOINC daemons, this is usually not a
problem.
</para>
</listitem>
<listitem>
<para>
The master application must be able to create files and
directories under the project's <filename>download</filename>
directory, and it must be able to access files under the project's
<filename>upload</filename> directory.
</para>
</listitem>
</itemizedlist>
</para>
<!-- XXX Update when the validator API is added -->
<para>
Besides the master and client applications, you must also define a
validator for the application in <filename>config.xml</filename>. If you
are not using redundancy then you may use the
<filename>sample_trivial_validator</filename> that comes with BOINC.
This validator accepts everything without checking.
</para>
<para>
If redundancy is desired, you may use the
<application>validator_for_dcapi</application> validator which does a
textual (meaning converting between UNIX and Windows line endings)
comparison of the first output file.
</para>
<warning>
If you are running multiple master applications under the same BOINC
project, and you want to use
<filename>sample_trivial_validator</filename> for any of them, then
you must use it for all of them. This restricition exists for any
other validator that is not DC-API aware, since it can not determine
which work unit belongs to which master and therefore which results
should it validate and which ones should it leave alone.
</warning>
</sect3>
<sect3>
<title>Redundant computation</title>
<para>
Redundancy is very important if you are running computations on
untrusted clients instead but may even be useful on dedicated clients to
protect from hardware failures. Besides deliberate tampering with the
output, clients may also produce incorrect results due to hardware
problems like bad memory, overheating or faulty CPU or simply disk
corruption.
</para>
<para>
Redundant computing means sending the same work unit to multiple
different clients and comparing the results. The comparison is performed
by a tool BOINC calls <emphasis>validator</emphasis>. The validator
usually is application-specific as it must understand the output file
format to filter out unimportant noise (like different line endings on
different operating systems, or small differences between floating point
results due to the different rounding characteristics of different CPU
architectures).
</para>
<para>
Redundancy can be enabled in DC-API on a per client application basis by
adding the appropriate <link
linkend="DC-API-Boinc-Redundancy"><literal>Redundancy</literal></link>
value to the client's configuration group.
</para>
<para>
If redundancy is enabled for a client application, work units for that
client can not be suspended. The reason that it is generally impossible
to compare the state of two BOINC results suspended at two different
stage of their execution. If one of the suspended results is already
corrupt and is restarted, the validator will no longer recognize the
corruption since all new results starting from the corrupted initial
state will produce the same but bad output.
</para>
</sect3>
<sect3>
<title>Messaging</title>
<para>
BOINC provides a limited messaging support that is accessible thru the
DC-API <function><link
linkend="DC-sendWUMessage">DC_sendWUMessage()</link></function> and
<function><link
linkend="DC-sendMessage">DC_sendMessage()</link></function>
functions on the master and client side, respectively.
<note>
See the note about <link linkend="Boinc-Config">configuration</link>
requirements for master-to-client messaging.
</note>
</para>
<para>
BOINC messaging has several restrictions:
<itemizedlist>
<listitem>
<para>
Messages can only be sent to BOINC results that are currently
running. If a work unit has no running result, messages sent to it
are silently discarded.
</para>
</listitem>
<listitem>
<para>
If redundancy is enabled, master-to-client messages are sent to
all running BOINC results regardless their state. In case of
client-to-master messages, the master cannot tell which BOINC
result sent the message. This means that "request-response" style
messaging is hard to implement correctly when redundancy is
enabled.
</para>
</listitem>
<listitem>
<para>
Messages sent by the master are delivered only when the client
connects to the master next time. Since the master has no control
over this, the client should periodically send messages to the
master to force a connection if timely receiving of messages sent
by the master is important. Be caraful about the extra load placed
on the BOINC server by clients sending messages too frequently.
</para>
</listitem>
<listitem>
<para>
When multiple messages are being queued in either direction due to
the client not connecting to the server frequently enough, they
will be delivered to the peer in an undefined order.
</para>
</listitem>
</itemizedlist>
</para>
</sect3>
<sect3>
<title>Cancelling a running work unit</title>
<para>
The <function><link
linkend="DC-cancelWU">DC_cancelWU()</link></function> function can
be used to cancel a running work unit. This function is implemented
by sending a special message to all running BOINC results. This implies
that unless clients where BOINC results for this work unit are running
connect back to the BOINC server, the cancel request may not be
delivered until the client finishes the computation.
</para>
<para>
Due to a race condition between various components of the BOINC system,
it is also possible that a new BOINC result is created and is sent out
after the work unit has been cancelled. Such BOINC result will not
receive the cancellation request and will run until it finishes its
computation. Its result however will not be reported to the DC-API
master, so the application should not be concerned about this.
</para>
</sect3>
<sect3>
<anchor id="Boinc-Result">
<title>When is a result reported</title>
</anchor>
<!-- XXX Rework when the validator API is merged -->
<para>
The BOINC core client handles the completion of a BOINC result in two
phases: first it uploads the output files, then it notifies the BOINC
server that the result has been finised. The validator will notice the
completion of the BOINC result only when this notification is received.
</para>
<para>
However, this notification is sent only when the core client has to
connect the BOINC server the next time, which may be a long time if the
core client has already started processing the next BOINC result while
the output files of the previous result were being uploaded.
</para>
<para>
When there are no more work units to download, the client sleeps for a
couple of minutes before trying again. This means that the reporting of
the completion of the last work unit may be delayed for a couple of
minutes even after all its output files have been uploaded.
</para>
<para>
The DC-API master application will receive notification about a result
when the validator has made its decision. This may also introduce some
delay after all BOINC results have been completed.
</para>
</sect3>
<sect3>
<title>Work unit priority</title>
<para>
The priority of a work unit can be set either by using the
<function><link
linkend="DC-setWUPriority">DC_setWUPriority()</link></function>
function or by specifying it in the configuration file using the
<literal>DefaultPriority</literal> key. Either way, the priority can
be an arbitrary 32-bit integer.
</para>
<para>
The BOINC scheduler dispatches higher priority work units first. Results
belonging to work units with lower priorities will not be offered to
clients until all the higher priority work units are exhausted.
</para>
</sect3>
<sect3>
<title>Common errors</title>
<para>
There are some common errors:
<variablelist>
<varlistentry>
<term>No results are reported</term>
<listitem>
<para>
Check the validator. When there is no validator defined in
<filename>config.xml</filename> or the validator fails for some
reason, the DC-API master will not receive result notifications.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>The final result is not reported</term>
<listitem>
<para>
There is a couple minutes <link
linkend="Boinc-Result">delay</link> before reporting the final
result. It is normal.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>
I've fixed a bug in a client application, but results are still
computed using the old client
</term>
<listitem>
<para>
Be sure to give the new client binary with a version number
greater than the old client, or otherwise the clients will not
notice that the binary has been updated and will not download
it.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
</sect3>
<sect3>
<title>Open issues</title>
<para>
The following list contains the known problems with DC-API's BOINC
backend:
</para>
<itemizedlist>
<listitem>
<para>
Messages are not removed from the <literal>msg_to_host</literal>
and <literal>msg_from_host</literal> tables by the
<application>db_purge</application> tool, so they need to be cleaned
up manually from time to time to prevent the database from being
filled up.
</para>
</listitem>
<listitem>
<para>
The DC-API creates result template files in the
<filename>templates</filename> subdirectory in the project's root
directory, but those files are never removed.
</para>
</listitem>
</itemizedlist>
</sect3>
</sect2>
</sect1>
<!-- vim: set ai sw=2 tw=80: -->