Job replication
David Anderson edited this page 2024-01-10 18:21:00 -08:00

The result returned for a job cannot be trusted on its own, because:

  • Some hosts have consistent or sporadic hardware problems, typically causing errors in floating-point computation.
  • Some volunteers may maliciously return wrong results; they may even reverse-engineer your application, deciphering and defeating any internal validation mechanism it might contain.

BOINC offers several mechanisms for validating results. However, there is no "one size fits all" solution. The choice depends on your requirements, and on the nature of your applications (you can use different mechanisms for different applications).

No replication

The first option is to not use replication. Each job is run once, and the validator examines each result on its own.

This approach is useful if you have some way (application-specific) of detecting wrong results with high probability.
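As a minimal sketch of such an application-specific check, assume a hypothetical application whose jobs ask for a nontrivial factor of an integer. Checking a returned factor is much cheaper than finding one, so a single result can be verified directly. The function name and the result format are illustrative assumptions, not part of BOINC.

```python
def is_valid_factor_result(n: int, factor: int) -> bool:
    """Accept a single result only if `factor` is a nontrivial divisor of `n`.

    Hypothetical check for an imagined factoring application; a real
    validator would parse `factor` from the job's output file.
    """
    return 1 < factor < n and n % factor == 0
```

A check of this kind works without replication because a wrong answer, whether caused by faulty hardware or a malicious volunteer, fails the test regardless of how it was produced.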

Replication

BOINC supports replication: each job gets done on N different hosts, and a result is considered valid if a strict majority of hosts return it.
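The strict-majority rule can be sketched as follows. This is an illustration of the rule as stated above, not BOINC's validator code; it assumes results have already been reduced to comparable values (for example, a hash of the output file), and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_result(results: Sequence[Hashable], n_replicas: int) -> Optional[Hashable]:
    """Return the result reported by a strict majority of the N replicas.

    `results` holds the results returned so far; `n_replicas` is N.
    Returns None if no result has more than N/2 votes yet.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > n_replicas / 2 else None
```

For example, with N = 3, two matching results are enough to decide, while with N = 4 a two-against-two split yields no valid result.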

One problem with replication is that different computers do floating-point math in slightly different ways. This makes it hard to determine when two results "agree"; two different results may be equally correct.

There are several different ways of dealing with this problem.

Eliminate discrepancies

It may be possible to eliminate numerical discrepancies. To do so you'll need to select an appropriate compiler, compiler options, and math libraries, and to make sure that your checkpoint files are written at full precision.

This lets you do bitwise comparison of results. However, it is difficult and generally reduces the performance of your application.
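The point about full-precision checkpoint files can be illustrated with a small sketch. It assumes a hypothetical application that saves one double-precision value; the function names are illustrative. Writing the value as rounded decimal text changes it on restart, while writing the raw 8-byte representation round-trips exactly, which is a precondition for bitwise comparison of outputs.

```python
import struct


def save_text_truncated(x: float) -> str:
    """Lossy checkpoint: six decimal places discard low-order bits."""
    return f"{x:.6f}"


def save_binary_full(x: float) -> bytes:
    """Full-precision checkpoint: the raw little-endian IEEE 754 double."""
    return struct.pack("<d", x)


def load_binary_full(data: bytes) -> float:
    """Restore a value written by save_binary_full."""
    return struct.unpack("<d", data)[0]


def outputs_identical(a: bytes, b: bytes) -> bool:
    """Bitwise comparison: outputs agree only if every byte matches."""
    return a == b
```

A job that is restarted from a truncated checkpoint continues from a slightly different state, so its final output can differ bitwise from that of an uninterrupted run on an otherwise identical host.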

Fuzzy comparison

If your application is numerically stable (i.e., small discrepancies lead to small differences in the result) you can write a "fuzzy comparison function" for the validator that considers two results as equivalent if they agree within some tolerance.
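A fuzzy comparison function might look like the following minimal sketch. It assumes each result has already been parsed into a list of floating-point values; the function name and the default tolerances are illustrative and would need to be chosen for your application.

```python
import math
from typing import Sequence


def results_agree(
    a: Sequence[float],
    b: Sequence[float],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> bool:
    """Treat two results as equivalent if every pair of values is close.

    `abs_tol` handles values near zero, where a relative tolerance
    alone would reject tiny differences.
    """
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )
```

Note that agreement within a tolerance is not transitive: A may agree with B and B with C while A and C do not agree, so the outcome can depend on which results are compared.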

Homogeneous replication

With this variant of replication, once an instance of a job has been sent to a host, additional instances are sent only to hosts that are "numerically equivalent" (i.e. that will return bit-identical results).

The notion of "numerical equivalence" depends on your application and how it was compiled. BOINC supplies two pre-defined equivalence relations, "coarse" and "fine". Use either of these ("coarse" is preferable, if it's fine enough for your app) or define your own if needed.
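The idea can be sketched as follows. The host attributes and the two classification functions below are assumptions made for illustration; this page does not define what BOINC's "coarse" and "fine" relations actually compare, so treat them as stand-ins for whichever relation you choose.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Host:
    # Hypothetical attributes; not BOINC's host record.
    os_name: str
    cpu_vendor: str
    cpu_model: str


def coarse_class(host: Host) -> Tuple[str, ...]:
    """Illustrative coarse relation: same OS and CPU vendor."""
    return (host.os_name, host.cpu_vendor)


def fine_class(host: Host) -> Tuple[str, ...]:
    """Illustrative fine relation: additionally the same CPU model."""
    return (host.os_name, host.cpu_vendor, host.cpu_model)


def may_send(
    job_class: Optional[Tuple[str, ...]],
    host: Host,
    classify: Callable[[Host], Tuple[str, ...]],
) -> bool:
    """A job with no class yet can go anywhere; afterwards only to
    hosts in the class of the host that received its first instance."""
    return job_class is None or job_class == classify(host)
```

A coarser relation puts more hosts in each class, so further instances of a job can be dispatched sooner; a finer relation is needed only if hosts within a coarse class can still produce different bits.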

Adaptive replication

This is a refinement of the replication policy. It randomly decides whether to replicate jobs, based on the measured error rate of hosts. If the first instance of a job is sent to a host with a low error rate, then with high probability no further instances will be sent.
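The random decision can be sketched as below. The threshold and the probability formula are hypothetical, chosen only to show the shape of such a policy; they are not BOINC's actual rule.

```python
import random
from typing import Callable


def needs_replication(
    host_error_rate: float,
    threshold: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to send further instances of a job.

    Hypothetical policy: hosts at or above `threshold` are always
    replicated; below it, the chance of replication shrinks linearly
    with the host's measured error rate.
    """
    if host_error_rate >= threshold:
        return True
    return rng() < host_error_rate / threshold
```

Keeping some nonzero chance of replication for hosts that have made errors recently means that a host cannot earn trust and then return wrong results without eventually being checked.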

Adaptive replication is independent of the comparison policy; you can use it with any of bitwise comparison, fuzzy comparison, or homogeneous replication.