Generating result retries
Hosts may fail to process and return results for various reasons;
such results are said to be lost.
A combination of lost and erroneous results may prevent
finding a canonical result for a workunit.
The result retry mechanism generates additional
results as needed to find a canonical result.
The result retry mechanism has the following project-supplied parameters:
- DWU: the expected delay (in seconds) between
creating a WU and getting a canonical result.
- Dresult: the expected delay (in seconds) between
creating a result and getting a confirmation.
- Nerror: give up on a workunit if it gets this many error results
(i.e., there must be a bug in the application).
- Ndet: give up on a workunit if it gets this many
non-error results without finding a canonical result
(i.e., the algorithm must be nondeterministic).
- Nredundancy: try to get at least this many non-error results.
Each workunit has a retry check time.
This is initially set to now + DWU,
and is set to zero if a canonical result is found for the WU.
Each result has a deadline,
a time by which a confirmation is expected for the result.
This is initially set to now + Dresult.
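
The timing rules above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the class RetryParams, its field names, and the three helper functions are hypothetical names chosen for this example, and times are assumed to be plain seconds since the epoch.

```python
from dataclasses import dataclass


@dataclass
class RetryParams:
    """Project-supplied retry parameters (field names are hypothetical)."""
    d_wu: int          # DWU: expected seconds from WU creation to canonical result
    d_result: int      # Dresult: expected seconds from result creation to confirmation
    n_error: int       # Nerror: give up after this many error results
    n_det: int         # Ndet: give up after this many non-error results
    n_redundancy: int  # Nredundancy: target number of non-error results


def initial_check_time(params: RetryParams, now: float) -> float:
    """Retry check time assigned to a newly created workunit."""
    return now + params.d_wu


def initial_deadline(params: RetryParams, now: float) -> float:
    """Deadline assigned to a newly created result."""
    return now + params.d_result


def check_time_after_canonical() -> float:
    """A check time of zero marks a WU that needs no further retry checks."""
    return 0.0
```

For instance, with DWU set to 86400 and Dresult set to 3600, a workunit created at time t is first checked at t + 86400, and each of its results is expected to be confirmed by t + 3600.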
Retry generation is handled by the program result_retry, invoked as
result_retry -appname name
This program continually checks for workunits past their check time
and without pending validation.
For each such workunit, it does the following:
- If any result has not been sent, generate a project warning
and give up on the WU (i.e., set its check time to zero).
- If at least Nerror results have an error,
generate a project warning and give up on the WU.
- If at least Ndet results are done,
generate a project warning and give up on the WU.
- Generate Nredundancy - n new results for the WU,
where n is the number of results that are done.
The deadline of these results is now + Dresult.
- Set the check time of the WU to now + DWU.
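
The per-workunit steps listed above can be sketched as follows. This is a simplified in-memory model, not the result_retry source: the State, Result, and Workunit types, the give_up helper, and the warnings list standing in for "a project warning" are all assumptions made for illustration. It also assumes that "done" means a result returned without error, and that the caller has already selected a workunit that is past its check time and has no pending validation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    UNSENT = "unsent"            # created but not yet sent to a host
    IN_PROGRESS = "in_progress"  # sent, no reply yet
    DONE = "done"                # returned without error (assumed meaning of "done")
    ERROR = "error"              # returned with an error


@dataclass
class Result:
    state: State
    deadline: float = 0.0


@dataclass
class Workunit:
    results: List[Result]
    check_time: float
    warnings: List[str] = field(default_factory=list)


def give_up(wu: Workunit, reason: str) -> None:
    """Record a project warning and stop retry checks for this WU."""
    wu.warnings.append(reason)
    wu.check_time = 0.0


def retry_check(wu: Workunit, now: float, d_wu: int, d_result: int,
                n_error: int, n_det: int, n_redundancy: int) -> None:
    """Apply the retry steps to one workunit that is due for a check."""
    if any(r.state is State.UNSENT for r in wu.results):
        give_up(wu, "unsent result at check time")
        return
    n_errors = sum(r.state is State.ERROR for r in wu.results)
    if n_errors >= n_error:
        give_up(wu, "too many error results")
        return
    n_done = sum(r.state is State.DONE for r in wu.results)
    if n_done >= n_det:
        give_up(wu, "too many results without a canonical result")
        return
    # Top up to Nredundancy; each new result gets deadline now + Dresult.
    for _ in range(max(0, n_redundancy - n_done)):
        wu.results.append(Result(State.UNSENT, deadline=now + d_result))
    wu.check_time = now + d_wu
```

In this sketch the three give-up tests run in the order listed, so a workunit with an unsent result is abandoned before its error count is examined.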
You should use crontab to make sure that
result_retry is always running.