Generating result retries
Hosts may fail to process and return results for various reasons;
such results are said to be lost.
A combination of lost and erroneous results may prevent
finding a canonical result for a workunit.
The result retry mechanism generates additional
results as needed to find a canonical result.
The result retry mechanism has the following project-supplied parameters:
- DWU: the expected delay (in seconds) between
creating a workunit (WU) and getting a canonical result.
- Dresult: the expected delay (in seconds) between
creating a result and getting a confirmation.
- Nerror: give up on a workunit if it gets this many error results
(i.e., there must be a bug in the application).
- Ndet: give up on a workunit if it gets this many
non-error results without finding a canonical result
(i.e., the algorithm must be nondeterministic).
- Nredundancy: try to get at least this many non-error results.
Each workunit has a retry check time.
This is initially set to now + DWU,
and is set to zero if a canonical result is found for the WU.
Each result has a deadline,
a time by which a confirmation is expected for the result.
This is initially set to now + Dresult.
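
The parameters and the two initial timestamps described above can be sketched as follows. This is only an illustration, not the project's actual code: the RetryParams class, its field names, and the two helper functions are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryParams:
    """Project-supplied retry parameters (hypothetical container)."""
    d_wu: int          # DWU: expected seconds from WU creation to canonical result
    d_result: int      # Dresult: expected seconds from result creation to confirmation
    n_error: int       # Nerror: give up after this many error results
    n_det: int         # Ndet: give up after this many non-error results
    n_redundancy: int  # Nredundancy: target number of non-error results


def initial_check_time(params: RetryParams, now: int) -> int:
    """Retry check time of a newly created workunit: now + DWU."""
    return now + params.d_wu


def initial_deadline(params: RetryParams, now: int) -> int:
    """Deadline of a newly created result: now + Dresult."""
    return now + params.d_result


# A check time of zero marks a workunit that needs no further checks
# (a canonical result was found, or the workunit was given up on).
NO_CHECK = 0
```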
Retry generation is handled by the program result_retry, invoked as
result_retry -appname name
This program continually checks for workunits past their check time
and without pending validation.
For each such workunit, the program does the following:
- If any result has not been sent, generate an error message,
and give up on the WU (i.e., set its check time to zero).
This condition indicates that either
1) the resource requirements of the WU are too much for
any host;
2) there are insufficient hosts to handle the rate of work generation; or
3) scheduling servers have been out of service.
- If at least Nerror results have an error,
generate an error message and give up on the WU.
- If at least Ndet results are done,
generate an error message and give up on the WU.
- Generate Nredundancy - n new results for the WU,
where n is the number of results that are done.
The deadline of these results is now + Dresult.
- Set the check time of the WU to now + DWU.
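
The per-workunit steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the result_retry source: the State, Result, and Workunit types and the needs_check and check_workunit functions are hypothetical. It assumes that a "done" result is one that completed without error, and that no new results are generated when Nredundancy - n is not positive.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class State(Enum):
    UNSENT = "unsent"            # created but not yet sent to a host
    IN_PROGRESS = "in_progress"  # sent, no outcome yet
    DONE = "done"                # assumed: completed without error
    ERROR = "error"              # completed with an error


@dataclass
class Result:
    state: State
    deadline: int


@dataclass
class Workunit:
    check_time: int                       # 0 means no further checks
    results: List[Result] = field(default_factory=list)
    validation_pending: bool = False


def needs_check(wu: Workunit, now: int) -> bool:
    """True if the WU is past its check time and has no pending validation."""
    return wu.check_time != 0 and wu.check_time <= now and not wu.validation_pending


def check_workunit(wu: Workunit, now: int, d_wu: int, d_result: int,
                   n_error: int, n_det: int, n_redundancy: int) -> Optional[str]:
    """Apply the retry steps to one WU; return an error message if giving up."""
    states = [r.state for r in wu.results]

    # Step 1: a result that was never sent means the WU cannot be served.
    if State.UNSENT in states:
        wu.check_time = 0
        return "giving up: a result was never sent"

    # Step 2: too many error results.
    if states.count(State.ERROR) >= n_error:
        wu.check_time = 0
        return "giving up: too many error results"

    # Step 3: too many done results without a canonical result.
    n_done = states.count(State.DONE)
    if n_done >= n_det:
        wu.check_time = 0
        return "giving up: too many results without a canonical result"

    # Step 4: generate Nredundancy - n new results (never a negative count).
    for _ in range(max(0, n_redundancy - n_done)):
        wu.results.append(Result(State.UNSENT, now + d_result))

    # Step 5: schedule the next check.
    wu.check_time = now + d_wu
    return None
```

Because newly generated results start out unsent, a workunit whose retries are still unsent at the next check is given up on by step 1, which matches the conditions listed for that case.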
Use crontab to run result_retry continuously.