• CPU scheduling policy: what result to run when.
  • Work fetch policy: when to contact scheduling servers, and which one(s) to contact.

    CPU scheduling

    CPU scheduling aims to achieve the following goals (decreasing priority):

    1. Maximize CPU utilization
    2. Respect the resource share allocation for each project. A project's resource share represents how much computing resources (CPU time, network bandwidth, storage space) a user wants to allocate to the project relative to the resources allocated to all of the other projects in which they participate. To be faithful to the user, the client should respect this allocation. In the case of CPU time, the scheduling of result computations should achieve the expected time shares over a reasonable time period.
    3. Satisfy result deadlines if possible.
    4. Given a 'minimum variety' parameter MV (seconds), reschedule CPUs at least once every MV seconds. The motivation for this goal stems from the potential orders-of-magnitude differences in expected completion time for results from different projects: some projects have results that complete in hours, while others have results that take months. A scheduler that runs result computations to completion before starting new ones would keep projects with short-running results stuck behind projects with long-running results. A participant in multiple projects expects to see their computer work on each of these projects within a reasonable time period, not just on the project with the long-running results.
    5. Minimize mean time to completion for results. This means that the number of active result computations for a project should be minimized. For example, it's better to have one result from project P complete in time T than to have two results from project P simultaneously complete in time 2T.

    A result is 'active' if there is a slot directory for it. A consequence of result preemption is that there can be more active results than CPUs.

    Debt

    The notion of 'debt' is used to respect the resource share allocation for each project. The debt to a project represents the amount of work (in CPU time) we owe it. Debt is decreased when CPU time is devoted to a project. We increase the debt to a project according to the total amount of work done in a time period scaled by the project's resource share.

    For example, consider a system participating in two projects, A and B, with resource shares 75% and 25%, respectively. Suppose in some time period the system devotes 25 minutes of CPU time to project A and 15 minutes to project B. We decrease the debt to A by 25 minutes (the CPU time it received) and increase it by 30 minutes (75% of the 40 minutes of total work), so A's debt increases by 5 minutes overall. This makes sense because we expected project A to receive a larger fraction of the system's resources than it actually got.
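
    In code, the per-period update for each project is debt += resource_share * total_work - work_done. The following small Python sketch (illustrative only; it assumes, as throughout this section, that resource shares are normalized fractions) checks the example's arithmetic:

    def debt_change(resource_share, work_done, total_work):
        # the share of the total work we owed the project, minus what it got
        return resource_share * total_work - work_done

    total = 25 + 15                      # 40 minutes of CPU time overall
    print(debt_change(0.75, 25, total))  # A: +5.0 (owed 30, received 25)
    print(debt_change(0.25, 15, total))  # B: -5.0 (owed 10, received 15)

    Note that with normalized shares the adjustments sum to zero across projects, so debt measures relative imbalance between projects rather than an absolute backlog.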

    The choice of projects for which to start result computations can simply follow the debt ordering of the projects. The algorithm computes the 'anticipated debt' to a project (the debt we expect to owe after the time period expires) as it chooses result computations to run.

    A sketch of the CPU scheduling algorithm

    This algorithm is run whenever the client needs to choose result computations to run, and at least once every MV seconds (per goal 4 above).

    We attempt to minimize the number of active result computations for a project by dynamically choosing results to compute from a global pool. When we allocate CPU time to a project, we choose already-running tasks first, then preempted tasks, and start a new result computation only as a last resort. This does not guarantee the above property, but we hope it comes close to achieving it.

    1. Decrease debts to projects according to the amount of work done for the projects in the last period.
    2. Increase debts to projects according to the projects' resource shares.
    3. Let the anticipated debt for each project be initialized to its current debt.
    4. Repeat until we decide on a result to compute for each processor:
      1. Choose the project that has the largest anticipated debt and a ready-to-compute result.
      2. Decrease the anticipated debt of the project by the expected amount of CPU time.
    5. Preempt current result computations, and start new ones.

    Pseudocode

    data structures:
    ACTIVE_TASK:
        double cpu_at_last_schedule_point
        double current_cpu_time
        scheduler_state:
            PREEMPTED
            RUNNING
        next_scheduler_state    // temp
    PROJECT:
        double work_done_this_period    // temp
        double debt
        double anticipated_debt // temp
        bool has_runnable_result
    
    schedule_cpus():
    
    foreach project P
        P.work_done_this_period = 0
    
    total_work_done_this_period = 0
    foreach task T that is RUNNING:
        x = T.current_cpu_time - T.cpu_at_last_schedule_point
        T.project.work_done_this_period += x
        total_work_done_this_period += x
    
    foreach P in projects:
        // P is owed its share of the total work, and pays down its
        // debt with the work it actually received
        P.debt += P.resource_share * total_work_done_this_period
                - P.work_done_this_period
    
    // CPU time one processor is expected to deliver in the next period
    expected_pay_off = total_work_done_this_period / num_cpus
    
    foreach P in projects:
        P.anticipated_debt = P.debt
    
    foreach task T
        T.next_scheduler_state = PREEMPTED
    
    do num_cpus times:
        // choose the project with the largest anticipated debt
        P = argmax { P.anticipated_debt } over all P in projects with runnable result
        if none:
            break
        // prefer an already-running task, then a preempted one; checking
        // next_scheduler_state avoids choosing the same task twice
        if (some T in P is RUNNING and T.next_scheduler_state == PREEMPTED):
            T.next_scheduler_state = RUNNING
            P.anticipated_debt -= expected_pay_off
            continue
        if (some T in P is PREEMPTED and T.next_scheduler_state == PREEMPTED):
            T.next_scheduler_state = RUNNING
            P.anticipated_debt -= expected_pay_off
            continue
        if (some R in results is for P, not active, and ready to run):
            T = new ACTIVE_TASK for R
            T.next_scheduler_state = RUNNING
            P.anticipated_debt -= expected_pay_off
    
    foreach task T
        if T.scheduler_state == PREEMPTED and T.next_scheduler_state == RUNNING
            unsuspend or run T
        if T.scheduler_state == RUNNING and T.next_scheduler_state == PREEMPTED
            suspend (or kill) T
    
    foreach task T
        T.cpu_at_last_schedule_point = T.current_cpu_time
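
    For concreteness, here is a runnable Python sketch of schedule_cpus(). It illustrates the algorithm above and is not the client's implementation: class and field names follow the pseudocode, resource shares are assumed to be normalized fractions, and State, unstarted_results, and the suspend/resume step are stand-ins invented for the sketch.

    from enum import Enum

    class State(Enum):
        PREEMPTED = 0
        RUNNING = 1

    class Project:
        def __init__(self, resource_share):
            self.resource_share = resource_share   # normalized fraction
            self.debt = 0.0
            self.anticipated_debt = 0.0            # temp
            self.work_done_this_period = 0.0       # temp
            self.unstarted_results = 0             # ready to run, no slot yet

    class ActiveTask:
        def __init__(self, project):
            self.project = project
            self.current_cpu_time = 0.0
            self.cpu_at_last_schedule_point = 0.0
            self.scheduler_state = State.PREEMPTED
            self.next_scheduler_state = State.PREEMPTED   # temp

    def schedule_cpus(projects, tasks, num_cpus):
        # steps 1-2: settle debts for the period that just ended
        for p in projects:
            p.work_done_this_period = 0.0
        total_work = 0.0
        for t in tasks:
            if t.scheduler_state == State.RUNNING:
                x = t.current_cpu_time - t.cpu_at_last_schedule_point
                t.project.work_done_this_period += x
                total_work += x
        for p in projects:
            p.debt += p.resource_share * total_work - p.work_done_this_period
            p.anticipated_debt = p.debt            # step 3

        # CPU time one processor is expected to deliver in the next period
        expected_pay_off = total_work / num_cpus

        for t in tasks:
            t.next_scheduler_state = State.PREEMPTED

        def schedulable(p):
            # P can use a CPU if it has an unchosen task or an unstarted result
            return p.unstarted_results > 0 or any(
                t.project is p and t.next_scheduler_state == State.PREEMPTED
                for t in tasks)

        # step 4: give each CPU to the project with the largest anticipated debt
        for _ in range(num_cpus):
            candidates = [p for p in projects if schedulable(p)]
            if not candidates:
                break
            p = max(candidates, key=lambda q: q.anticipated_debt)
            # prefer a running task, then a preempted one, then a new one
            pool = [t for t in tasks if t.project is p
                    and t.next_scheduler_state == State.PREEMPTED]
            chosen = next((t for t in pool
                           if t.scheduler_state == State.RUNNING),
                          pool[0] if pool else None)
            if chosen is None:
                p.unstarted_results -= 1
                chosen = ActiveTask(p)             # create a slot for a result
                tasks.append(chosen)
            chosen.next_scheduler_state = State.RUNNING
            p.anticipated_debt -= expected_pay_off

        # step 5: apply the decisions; a real client would suspend, resume,
        # or kill the corresponding processes here
        for t in tasks:
            t.scheduler_state = t.next_scheduler_state
            t.cpu_at_last_schedule_point = t.current_cpu_time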
    

    Work fetch policy

    The work fetch policy has the following goal: maintain enough work for each project that the CPU scheduler can respect the projects' resource shares, while contacting scheduling servers as infrequently as possible.

    When to get work

    The CPU scheduler needs a minimum number of results from a project in order to respect the project's resource share. We effectively have too little work when the number of results for a project is less than this minimum number.

    min_results(P) = ceil(ncpus * P.resource_share)
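
    For example, on a 4-CPU host a project with a 60% resource share needs ceil(4 * 0.6) = 3 results. A minimal Python equivalent (again taking resource_share as a normalized fraction):

    import math

    def min_results(ncpus, resource_share):
        # fewest queued results the CPU scheduler needs to honor the share
        return math.ceil(ncpus * resource_share)

    print(min_results(4, 0.60))  # 3
    print(min_results(1, 0.25))  # 1: every attached project needs at least one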

    The client can estimate the amount of time that will elapse until it has too little work for a project. When this length of time is less than a buffer period T (a user-tunable parameter), it is time to get more work.

    A sketch of the work fetch algorithm

    This algorithm determines whether a project needs more work and, if so, computes how much it needs. It is called whenever the client can make a scheduler RPC.

    1. For each project
      1. If the number of results for the project is too few
        1. Set the project's work request to 2T
        2. Return NEED WORK IMMEDIATELY
      2. For all results except the top (min_results - 1) with the longest expected time to completion:
        1. Sum the results' expected completion times, scaled by the work rate and the project's resource share; call the sum S
      3. If the sum S is less than T
        1. Set the project's work request to 2T - S
        2. Return NEED WORK
      4. Else, set the project's work request to 0 and return DON'T NEED WORK

    The mechanism for actually getting work checks if a project has a non-zero work request and if so, makes the scheduler RPC call to request the work.

    Pseudocode

    data structures:
    PROJECT:
        double work_request_days
    
    check_work_needed(Project P):
    
    if num_results(P) < min_results(P):
        P.work_request_days = 2*T / seconds_per_day
        return NEED_WORK_IMMEDIATELY
    
    top_results = top (min_results(P) - 1) results of P by expected completion time
    
    work_remaining = 0
    foreach result R of P that is not in top_results:
        work_remaining += R.expected_completion_time
    work_remaining *= P.resource_share * active_frac / ncpus
    
    if work_remaining < T:
        P.work_request_days = (2*T - work_remaining) / seconds_per_day
        return NEED_WORK
    else:
        P.work_request_days = 0
        return DONT_NEED_WORK
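
    As with the CPU scheduler, here is a runnable Python sketch of check_work_needed(), mirroring the pseudocode above. It is illustrative only: T is assumed to be in seconds, active_frac is assumed to be the measured fraction of time the client is allowed to compute, and a project's results are represented by their expected completion times.

    import math
    from types import SimpleNamespace

    SECONDS_PER_DAY = 86400.0
    NEED_WORK_IMMEDIATELY, NEED_WORK, DONT_NEED_WORK = range(3)

    def min_results(ncpus, resource_share):
        return math.ceil(ncpus * resource_share)

    def check_work_needed(p, ncpus, active_frac, T):
        # p.results: expected completion times (seconds) of P's results
        m = min_results(ncpus, p.resource_share)
        if len(p.results) < m:
            p.work_request_days = 2 * T / SECONDS_PER_DAY
            return NEED_WORK_IMMEDIATELY

        # exclude the (m - 1) results with the longest expected completion
        kept = sorted(p.results)[:len(p.results) - (m - 1)]
        work_remaining = sum(kept) * p.resource_share * active_frac / ncpus

        if work_remaining < T:
            p.work_request_days = (2 * T - work_remaining) / SECONDS_PER_DAY
            return NEED_WORK
        p.work_request_days = 0.0
        return DONT_NEED_WORK

    # hypothetical example: a 50%-share project with three results queued
    p = SimpleNamespace(resource_share=0.5, results=[3600, 7200, 86400],
                        work_request_days=0.0)
    print(check_work_needed(p, ncpus=2, active_frac=0.8, T=86400.0))  # NEED_WORK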
    
    
    
    "; page_tail(); ?>