Workunit and result states
The processing of workunits and results involves several independent activities.
To keep track of these activities,
workunit and result database records have several parameters and state fields,
and their processing can be expressed in terms of state machines.
Workunit.delay_bound
An upper bound on the interval between the time a scheduler
sends an instance of this WU to a host
and the time the host sends the completion message.
It should be several times the execution time on an average host.
If this bound is exceeded, the server "gives up" on the result
and may delete its input files.
If the result is returned later,
it will still be validated and credited.
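As a rough illustration of the sizing guideline, the sketch below derives delay_bound from a measured average execution time. The helper name choose_delay_bound and the default multiplier of 3 are assumptions; the text only says "several times".

```python
def choose_delay_bound(avg_exec_seconds: float, multiplier: float = 3.0) -> float:
    """Hypothetical helper: delay_bound as a multiple of the average
    execution time on a typical host. The default multiplier of 3 is an
    assumption standing in for "several times"."""
    if avg_exec_seconds <= 0:
        raise ValueError("average execution time must be positive")
    if multiplier <= 1:
        raise ValueError("delay_bound should exceed the execution time")
    return avg_exec_seconds * multiplier


# A WU that takes about 2 hours on an average host gets a 6-hour bound.
print(choose_delay_bound(2 * 3600))  # 21600.0
```

A larger multiplier tolerates slower or intermittently available hosts at the cost of waiting longer before giving up on a result.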
Workunit.canonical_resultid
The ID of the "canonical" result for this workunit, or zero.
Workunit.timeout_check_time
The next time to check for timeouts on this WU
(e.g. to give up on results and create new ones).
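A minimal sketch of how a periodic server process might use timeout_check_time, assuming that 0 means "no check pending" (consistent with the invariant below that it eventually becomes 0) and that the next check is scheduled at the earliest outstanding report deadline. The Workunit class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Workunit:
    id: int
    timeout_check_time: int = 0  # 0 = no timeout check pending (assumed)


def due_for_timeout_check(wus: Iterable[Workunit], now: int) -> List[Workunit]:
    """Select WUs whose scheduled timeout check has come due."""
    return [wu for wu in wus if 0 < wu.timeout_check_time <= now]


def reschedule(wu: Workunit, pending_deadlines: Iterable[int]) -> None:
    """After handling a WU, schedule the next check at the earliest
    outstanding report deadline, or clear it (0) if none remain."""
    wu.timeout_check_time = min(pending_deadlines, default=0)
```

Clearing the field once nothing can time out is what lets the invariant "eventually timeout_check_time=0" hold.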
Workunit.file_delete_state
Indicates whether input files should be deleted.
Workunit.assimilate_state
Indicates whether the workunit should be assimilated.
Workunit.need_validate
Indicates that the workunit has a result that needs validation.
Workunit.error_mask
A bit mask for error conditions.
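A sketch of error_mask as a bit mask. The flag names and bit values are hypothetical, chosen only to illustrate setting and testing bits; the text does not list the actual error conditions. The wu_resolved helper expresses the first invariant below (eventually either canonical_resultid or error_mask is set).

```python
from enum import IntFlag


class WuError(IntFlag):
    # Hypothetical error conditions; one bit each.
    COULDNT_SEND_RESULT = 1
    TOO_MANY_ERROR_RESULTS = 2
    TOO_MANY_SUCCESS_RESULTS = 4
    CANCELLED = 8


def set_error(mask: int, err: WuError) -> int:
    """Return mask with the bit for err turned on."""
    return int(mask | err)


def has_error(mask: int, err: WuError) -> bool:
    """True if the bit for err is set in mask."""
    return bool(mask & err)


def wu_resolved(canonical_resultid: int, error_mask: int) -> bool:
    """The WU has reached an end state: a canonical result or an error."""
    return canonical_resultid != 0 or error_mask != 0


mask = set_error(0, WuError.CANCELLED)
mask = set_error(mask, WuError.TOO_MANY_ERROR_RESULTS)
print(mask)  # 10
```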
Workunit invariants:
- eventually either canonical_resultid or error_mask is set
- eventually timeout_check_time=0
- WUs are eventually assimilated
- input files are eventually deleted,
but only when all results have server_state = OVER
(since results that arrive after assimilation may still need to be validated)
and wu.assimilate_state = DONE
(since the project may want to do something with the WU in the error case)
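The input-file deletion condition can be written as a predicate. In this sketch only OVER and DONE come from the text; the other enum members and the function name are assumptions.

```python
from enum import Enum
from typing import Iterable


class ServerState(Enum):
    UNSENT = "UNSENT"            # hypothetical
    IN_PROGRESS = "IN_PROGRESS"  # hypothetical
    OVER = "OVER"                # named in the text


class AssimilateState(Enum):
    INIT = "INIT"    # hypothetical
    READY = "READY"  # hypothetical
    DONE = "DONE"    # named in the text


def can_delete_input_files(assimilate_state: AssimilateState,
                           result_states: Iterable[ServerState]) -> bool:
    """Input files may go only once the WU is assimilated (the project may
    still need them in the error case) and every result is OVER (late
    results may still need validation)."""
    return (assimilate_state is AssimilateState.DONE
            and all(s is ServerState.OVER for s in result_states))
```

Note that all() over an empty sequence is True, so a WU with no results passes the result check; whether that case can arise is not addressed here.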
Result.report_deadline
The time by which a reply must arrive; if none has arrived by then,
the server gives up on the result (and possibly deletes its input files).
Assigned when the result is sent: now + WU.delay_bound.
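A sketch of the report_deadline assignment and the corresponding timeout test, assuming times are Unix timestamps in seconds and that "no reply by this time" means the deadline has strictly passed. The Result class and function names are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    id: int
    sent_time: Optional[float] = None
    report_deadline: Optional[float] = None


def send_result(result: Result, delay_bound: float, now: Optional[float] = None) -> None:
    """Record the send and set report_deadline = now + WU.delay_bound."""
    if now is None:
        now = time.time()
    result.sent_time = now
    result.report_deadline = now + delay_bound


def timed_out(result: Result, now: float) -> bool:
    """True once the deadline has passed with no reply; a result that
    was never sent cannot time out."""
    return result.report_deadline is not None and now > result.report_deadline
```

Passing now explicitly keeps the functions deterministic for testing; production code would use the clock.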
Result.server_state
Result.outcome
One of SUCCESS, COULDNT_SEND, CLIENT_ERROR, NO_REPLY, or DIDNT_NEED.
Defined only if server_state = OVER.
Result.client_state
Records the client state (upload, process, or download)
in which the error occurred.
Defined only if outcome = CLIENT_ERROR.
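Because outcome and client_state are defined only under certain conditions, code that reads them can treat them as absent otherwise. This sketch models server_state as a plain string and the two fields as enums built from the values listed above; the accessor names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "SUCCESS"
    COULDNT_SEND = "COULDNT_SEND"
    CLIENT_ERROR = "CLIENT_ERROR"
    NO_REPLY = "NO_REPLY"
    DIDNT_NEED = "DIDNT_NEED"


class ClientState(Enum):
    # Stage of client processing in which an error occurred.
    UPLOAD = "upload"
    PROCESS = "process"
    DOWNLOAD = "download"


def effective_outcome(server_state: str, outcome: Optional[Outcome]) -> Optional[Outcome]:
    """Return outcome only when it is defined (server_state == "OVER")."""
    return outcome if server_state == "OVER" else None


def error_stage(server_state: str, outcome: Optional[Outcome],
                client_state: Optional[ClientState]) -> Optional[ClientState]:
    """Return client_state only when it is defined (outcome == CLIENT_ERROR)."""
    if effective_outcome(server_state, outcome) is Outcome.CLIENT_ERROR:
        return client_state
    return None
```

Chaining error_stage through effective_outcome means a stale stored value in either field is ignored unless every enclosing condition holds.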
Result.file_delete_state
Result.validate_state
Result invariants:
- eventually server_state = OVER
- output files are eventually deleted
Output files of non-canonical results can be deleted as soon as the WU is assimilated.
Output files of the canonical result can be deleted only when all results have server_state = OVER.
If a result reply arrives after its timeout,
its output files can be deleted immediately.
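The output-file rules above can be gathered into a single predicate. The function and parameter names are hypothetical, and additionally requiring the canonical result's WU to be assimilated is an assumption of this sketch (the text states only the server_state = OVER condition).

```python
def can_delete_output_files(*, is_canonical: bool, wu_assimilated: bool,
                            all_results_over: bool,
                            arrived_after_timeout: bool) -> bool:
    """Hypothetical predicate combining the output-file deletion rules."""
    if arrived_after_timeout:
        # A reply that arrives after its timeout: delete immediately.
        return True
    if is_canonical:
        # Text: only when all results are OVER. Also requiring
        # assimilation is an assumption of this sketch.
        return wu_assimilated and all_results_over
    # Non-canonical: as soon as the WU is assimilated.
    return wu_assimilated
```

Keyword-only parameters make call sites self-describing, which matters when four booleans are passed together.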
How do we delete output files that arrive really late
(e.g., uploaded after all results have timed out, and never reported)?
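Let X = the creation time of the oldest unassimilated WU.
Any output files created before X can be deleted:
such a file belongs to a WU created before X, which has therefore already been assimilated.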
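A sketch of the cutoff rule for very late output files. File creation times are passed in as a dictionary rather than read from disk (on many filesystems the modification time would have to stand in for creation time). The function names are hypothetical, and returning nothing when there are no unassimilated WUs is a conservative assumption the text does not address.

```python
from typing import Dict, Iterable, List, Optional


def deletion_cutoff(unassimilated_create_times: Iterable[float]) -> Optional[float]:
    """X = creation time of the oldest unassimilated WU (None if there is none)."""
    return min(unassimilated_create_times, default=None)


def stale_output_files(file_create_times: Dict[str, float],
                       cutoff: Optional[float]) -> List[str]:
    """Names of output files created strictly before the cutoff X."""
    if cutoff is None:
        # Conservative assumption: with no unassimilated WU, delete nothing.
        return []
    return [name for name, t in file_create_times.items() if t < cutoff]
```

The sweep needs only one aggregate query (the minimum creation time) plus a directory scan, so it does not have to match each orphaned file to a result record.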