The CPU scheduling policy aims to achieve the following goals (in decreasing priority):
A result is 'active' if there is a slot directory for it. There can be more active results than CPUs.
The notion of 'debt' is used to respect the resource share allocation for each project. The debt to a project represents the amount of work (in CPU time) we owe it. Debt is decreased by the CPU time actually devoted to the project, and increased by the total work done in the time period, scaled by the project's resource share.
For example, consider a system participating in two projects, A and B, with resource shares 75% and 25%, respectively. Suppose in some time period, the system devotes 25 minutes of CPU time to project A and 15 minutes of CPU time to project B. We decrease the debt to A by 25 minutes and increase it by 30 minutes (75% of (25 + 15)). So A's debt increases overall by 5 minutes. This makes sense: project A received a smaller fraction of the CPU time than its resource share calls for, so we owe it more.
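To make the arithmetic concrete, here is a minimal C++ sketch of this debt update for the two-project example (variable names are illustrative, not the client's actual code):

    #include <cstdio>

    int main() {
        double share_a = 0.75, share_b = 0.25;  // resource shares
        double debt_a = 0, debt_b = 0;          // debts start at zero
        double work_a = 25, work_b = 15;        // CPU minutes this period
        double total = work_a + work_b;         // 40 minutes

        // debt += (share * total work done) - (work actually done)
        debt_a += share_a * total - work_a;     // 30 - 25 = +5
        debt_b += share_b * total - work_b;     // 10 - 15 = -5

        printf("debt A: %+.1f min, debt B: %+.1f min\n", debt_a, debt_b);
    }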
The choice of projects for which to start result computations can simply follow the debt ordering of the projects. The algorithm computes the 'anticipated debt' to a project (the debt we expect to owe after the time period expires) as it chooses result computations to run.
If a project has no runnable results, its resource share should not be considered when determining the debts for other projects. Furthermore, such a project should not be allowed to build up debt while it has no work. Thus, its debt should be reset to zero.
This algorithm is run:
We will attempt to minimize the number of active result computations for a project by dynamically choosing results to compute from a global pool. When we allocate CPU time to a project, we choose tasks that are already running first, then preempted tasks, and start a new result computation only as a last resort. This does not guarantee the above property, but we hope it comes close to achieving it.
data structures:

ACTIVE_TASK:
    double cpu_time_at_last_sched
    double current_cpu_time
    scheduler_state: PREEMPTED or RUNNING
    next_scheduler_state            // temp

PROJECT:
    double work_done_this_period    // temp
    double debt
    double anticipated_debt         // temp
    RESULT next_runnable_result

schedule_cpus():
    foreach project P:
        P.work_done_this_period = 0

    total_work_done_this_period = 0
    foreach task T that is RUNNING:
        x = T.current_cpu_time - T.cpu_time_at_last_sched
        T.project.work_done_this_period += x
        total_work_done_this_period += x

    adjusted_total_resource_share = 0
    foreach P in projects:
        if P has a runnable result:
            adjusted_total_resource_share += P.resource_share

    foreach P in projects:
        if P has no runnable result:
            P.debt = 0
        else:
            P.debt += (P.resource_share / adjusted_total_resource_share)
                      * total_work_done_this_period
                      - P.work_done_this_period

    expected_pay_off = total_work_done_this_period / num_cpus

    foreach P in projects:
        P.anticipated_debt = P.debt

    foreach task T:
        T.next_scheduler_state = PREEMPTED

    do num_cpus times:
        // choose the project with the largest anticipated debt
        P = argmax { P.anticipated_debt } over all P in projects with a runnable result
        if none:
            break
        if some T (not already scheduled to run) for P is RUNNING:
            T.next_scheduler_state = RUNNING
            P.anticipated_debt -= expected_pay_off
            continue
        if some T (not already scheduled to run) for P is PREEMPTED:
            T.next_scheduler_state = RUNNING
            P.anticipated_debt -= expected_pay_off
            continue
        if some R in results is for P, not active, and ready to run:
            choose R with the earliest deadline
            T = new ACTIVE_TASK for R
            T.next_scheduler_state = RUNNING
            P.anticipated_debt -= expected_pay_off

    foreach task T:
        if T.scheduler_state == PREEMPTED and T.next_scheduler_state == RUNNING:
            unsuspend or run T
        if T.scheduler_state == RUNNING and T.next_scheduler_state == PREEMPTED:
            suspend (or kill) T

    foreach task T:
        T.cpu_time_at_last_sched = T.current_cpu_time
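For illustration, here is a compact C++ sketch of the anticipated-debt selection loop above (the Project type and its fields are hypothetical stand-ins, and the RUNNING/PREEMPTED/new-task preference order is elided):

    #include <vector>

    struct Project {
        double anticipated_debt = 0;
        bool runnable = false;   // does this project have a runnable result?
    };

    // Decide which project gets each of the num_cpus slots next period.
    void choose_projects(std::vector<Project*>& projects,
                         int num_cpus, double expected_pay_off) {
        for (int cpu = 0; cpu < num_cpus; cpu++) {
            // argmax of anticipated debt over projects with runnable results
            Project* best = nullptr;
            for (Project* p : projects) {
                if (!p->runnable) continue;
                if (!best || p->anticipated_debt > best->anticipated_debt) best = p;
            }
            if (!best) break;    // no project has runnable work
            // (here: prefer a RUNNING task, then a PREEMPTED one, then a new one)
            best->anticipated_debt -= expected_pay_off;
        }
    }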
The CPU scheduler is at its best when projects always have runnable results. When the CPU scheduler would allocate a CPU to a project that has no runnable results, we say the CPU scheduler is 'starved'.
The work fetch policy has the following goal: maintain enough work locally that the CPU scheduler is never starved.
At a given time, the CPU scheduler may need as many as
min_results(P) = ceil(ncpus * P.resource_share / total_resource_share)
results for project P to avoid starvation.
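For example, on the 2-CPU host from the earlier 75%/25% scenario, a quick C++ sketch of this formula (a hypothetical helper, not the client's actual code):

    #include <cmath>
    #include <cstdio>

    // min_results(P) = ceil(ncpus * share / total_share), per the formula above
    int min_results(int ncpus, double share, double total_share) {
        return (int)std::ceil(ncpus * share / total_share);
    }

    int main() {
        printf("A needs %d, B needs %d\n",
               min_results(2, 75, 100),   // ceil(1.5) = 2
               min_results(2, 25, 100));  // ceil(0.5) = 1
    }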
Let

    ETTRC(RS, k)    [estimated time to result count]

be the amount of time that will elapse until the number of results in the set RS reaches k, given the CPU speed, the number of CPUs, the resource share, and the active fraction. (The 'active fraction' is the fraction of time in which the core client is running and enabled to work; this statistic is continually updated.) For a project P, define

    ETTPRC(P, k) = ETTRC(P.runnable_results, k)    [estimated time to project result count]
Then the amount of time that will elapse until starvation is estimated as
min { ETTPRC(P, min_results(P)-1) } over all P.
It is time to get work when, for any project P,

    ETTPRC(P, min_results(P)-1) < T

where T is the connection period defined above.
First, define estimated_cpu_time(S) to be the total FLOP estimate for the results in S divided by the FLOPS of a single CPU.
Let the set of results RS(P) = { R_1, R_2, ..., R_N } for project P be ordered by increasing deadline; the CPU scheduler will schedule these results in the same order. ETTPRC(P, k) can then be approximated by

    estimated_cpu_time({ R_1, R_2, ..., R_(N-k) }) / avg_proc_rate(P)

where avg_proc_rate(P) is the average number of CPU seconds completed by the client for project P per second of wall-clock time:

    avg_proc_rate(P) = P.resource_share / total_resource_share * ncpus * 'active fraction'
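As a hedged C++ sketch of these two formulas (hypothetical names; the active fraction is passed in explicitly rather than read from client statistics):

    #include <vector>

    // CPU seconds completed per wall-clock second for a project
    double avg_proc_rate(double share, double total_share,
                         int ncpus, double active_frac) {
        return share / total_share * ncpus * active_frac;
    }

    // Approximate ETTPRC(P, k): the CPU time of all but the k latest-deadline
    // results, divided by the project's average processing rate.
    // cpu_times must be ordered by increasing deadline.
    double ettprc(const std::vector<double>& cpu_times, int k, double rate) {
        double est = 0;
        for (int i = 0; i + k < (int)cpu_times.size(); i++) est += cpu_times[i];
        return est / rate;
    }

For instance, the 75% project on a 2-CPU host with an active fraction of 0.8 has avg_proc_rate = 0.75 * 2 * 0.8 = 1.2 CPU seconds per wall-clock second.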
To contact scheduling servers only about every T days, the client should request enough work that, for each project P,

    ETTPRC(P, min_results(P)-1) >= 2T.

(Fetching work when this buffer falls below T and refilling it to 2T means connections happen, on average, about every T days.) More specifically, we want to get a set of results REQUEST such that

    ETTRC(REQUEST, 0) = 2T - ETTPRC(P, min_results(P)-1).

Since requests are made in terms of CPU seconds, we make a request of

    estimated_cpu_time(REQUEST) = avg_proc_rate(P) * (2T - ETTPRC(P, min_results(P)-1)).
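For example (with illustrative numbers): if avg_proc_rate(P) = 1.2, T = 1 day (86,400 seconds), and ETTPRC(P, min_results(P)-1) is half a day (43,200 seconds), the client requests 1.2 * (172,800 - 43,200) = 155,520 CPU seconds of work from P.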
compute_work_request() returns one of three urgency levels:

    NEED_WORK_IMMEDIATELY    the CPU scheduler is currently starved
    NEED_WORK                the CPU scheduler will starve within T days
    DONT_NEED_WORK           otherwise

It can be called whenever the client can make a scheduler RPC.
The mechanism for getting work periodically (once a second) calls compute_work_request(), checks whether any project has a non-zero work request, and, if so, makes a scheduler RPC to request the work (a sketch of this loop follows the pseudocode below).
data structures:

PROJECT:
    double work_request

total_resource_share = sum of all projects' resource_share

avg_proc_rate(P):
    return P.resource_share / total_resource_share * ncpus * time_stats.active_frac

ettprc(P, k):
    results_to_skip = k
    est = 0
    foreach result R for P in order of DECREASING deadline:
        if results_to_skip > 0:
            results_to_skip--
            continue
        est += estimated_cpu_time(R) / avg_proc_rate(P)
    return est

compute_work_request():
    urgency = DONT_NEED_WORK
    foreach project P:
        P.work_request = 0
        est_time_to_starvation = ettprc(P, min_results(P)-1)
        if est_time_to_starvation < T:
            if est_time_to_starvation == 0:
                urgency = NEED_WORK_IMMEDIATELY
            else:
                urgency = max(NEED_WORK, urgency)
            P.work_request = max(0, (2*T - est_time_to_starvation) * avg_proc_rate(P))
    if urgency == DONT_NEED_WORK:
        foreach project P:
            P.work_request = 0
    return urgency
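A minimal C++ sketch of the polling mechanism described above (the enum ordering and both function stubs are assumptions of this sketch, not the client's actual interfaces):

    #include <vector>

    enum Urgency { DONT_NEED_WORK, NEED_WORK, NEED_WORK_IMMEDIATELY };

    struct Project {
        double work_request = 0;   // CPU seconds wanted, set by compute_work_request()
    };

    Urgency compute_work_request(std::vector<Project>& projects);  // as in the pseudocode
    void scheduler_rpc(Project& p, double cpu_seconds);            // RPC machinery elided

    // Called about once a second by the client's main loop.
    void poll_work_fetch(std::vector<Project>& projects) {
        if (compute_work_request(projects) == DONT_NEED_WORK) return;
        for (Project& p : projects) {
            if (p.work_request > 0) {
                scheduler_rpc(p, p.work_request);  // request the work
            }
        }
    }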