System Simulation -- Project

Project

This page describes the project for System Simulation.

Data centers consume a large and growing amount of energy (see this report and this article). Much of this energy is wasted when servers operate at very low utilization levels, because server machines generally consume roughly the same amount of energy whether lightly or highly utilized. This notion, and its implications for the energy use of data centers, is described in this paper. A key technology to reduce energy waste in data centers is virtualization (see this paper). With virtualization, the workloads of multiple physical machines can be consolidated onto one machine when the offered load to the data center is low. This enables unused machines to be powered down, which reduces energy consumption. Previous work suggests that even for large data centers, power management is most effective at the rack or cluster level (see this paper). A key challenge is to develop a policy to consolidate machines such that the performance criteria specified by a Service Level Agreement (SLA) can still be met. Thus, the trade-off is one of energy consumption versus response time.

For this project you will build a simulation model of a server cluster and experiment with policies to power-up and power-down machines (assuming that virtualization is used to allow such consolidation to occur - you will not be modeling virtualization). You will use both synthetic request arrivals (Poisson arrivals) and a trace of request interarrival times taken from a real web server as your workload. Several simplifications will be made to make this problem tractable.

System specification

The specification of the system is:

The cluster has 10 server machines, each of which can be modeled as a single-server queue.
The request service time is deterministic and is 200 milliseconds for each machine in the cluster.
A load balancer controls the cluster. The load balancer distributes arriving requests to powered-up machines in a round-robin fashion.
The load balancer also executes an algorithm (or policy) to power-up and power-down server machines in response to measurements taken at the machines. One possible policy is described below.
A powered-up machine consumes 200 W and a powered-down machine consumes 5 W.
Powering a machine up or down is instantaneous (that is, the transition takes zero time).
A machine must be powered-up (or powered-down) for a minimum of 1 minute before it can change its power state.
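
As a starting point, the queueing and dispatching parts of the specification above might be sketched as follows. This is only a sketch under assumptions: the function names (next_machine, serve_request) and the choice of seconds as the time unit are hypothetical and not part of the specification, and each machine is assumed to serve its requests in first-come-first-served order.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_MACHINES 10     /* cluster size from the specification */
#define SERVICE_TIME 0.200  /* deterministic service time in seconds */

/* Hypothetical helper: round-robin choice of the next powered-up machine.
   powered_up[i] is nonzero if machine i is powered-up, and last is the
   index of the machine that received the previous request (-1 if none).
   Returns the chosen index, or -1 if no machine is powered-up. */
int next_machine(const int powered_up[], int last)
{
    int step;

    assert(powered_up != NULL);
    assert(last >= -1 && last < NUM_MACHINES);
    for (step = 1; step <= NUM_MACHINES; step++) {
        int i = (last + step) % NUM_MACHINES;
        if (powered_up[i])
            return i;
    }
    return -1;
}

/* Hypothetical helper: serve one request at a single-server queue
   (assumed FIFO). arrival is the arrival time and *last_departure is the
   departure time of the previous request at this machine; it is updated
   in place. Returns the response time (waiting time plus service time). */
double serve_request(double arrival, double *last_departure)
{
    double start;

    assert(last_departure != NULL);
    /* Service starts on arrival or when the previous request departs,
       whichever is later. */
    start = (arrival > *last_departure) ? arrival : *last_departure;
    *last_departure = start + SERVICE_TIME;
    return *last_departure - arrival;
}
```

For example, a request arriving at time 1.0 s at an idle machine has a response time of 0.2 s, and a second request arriving at 1.1 s at the same machine must wait for the first and has a response time of 0.3 s.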

A figure of the system is here.

Service Level Agreement (SLA)

The SLA for the system is:

The server cluster must maintain an SLA based on measured response time. The SLA states that the mean response time must not exceed 250 milliseconds and that the 99th percentile of response time must not exceed 500 milliseconds.
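
One way to check the SLA from a set of measured response times is sketched below. The function name sla_met is hypothetical, times are assumed to be in seconds, and the 99% point is computed with the nearest-rank method, which is an assumption (other percentile definitions give slightly different values for small samples).

```c
#include <assert.h>
#include <stdlib.h>

#define SLA_MEAN 0.250  /* mean response time limit in seconds */
#define SLA_P99  0.500  /* 99% response time limit in seconds */

/* Comparison function for qsort() on doubles (ascending order). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;

    return (x > y) - (x < y);
}

/* Hypothetical helper: returns 1 if both SLA limits are met for the n
   response times in resp[], and 0 otherwise. Sorts resp[] in place. */
int sla_met(double resp[], size_t n)
{
    double sum = 0.0;
    size_t i, rank;

    assert(resp != NULL && n > 0);
    for (i = 0; i < n; i++)
        sum += resp[i];
    qsort(resp, n, sizeof(double), cmp_double);

    /* Nearest-rank percentile: rank = ceil(0.99 * n), counted from 1. */
    rank = (99 * n + 99) / 100;
    return (sum / (double)n <= SLA_MEAN) && (resp[rank - 1] <= SLA_P99);
}
```

Note that with 100 samples a single response time above 500 milliseconds does not by itself violate the 99% limit under this definition, but two such samples do.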

Given power-up/power-down policy

The given load balancer power-up/power-down policy is as follows:


   Do forever
      Wait for a 1 minute sample period
      Collect statistics to determine the utilization for the last sample period from all powered-up machines
      Determine the grand mean utilization for all powered-up machines
      If the grand mean is greater than a high threshold then power-up one additional machine
      If the grand mean is less than a low threshold then power-down one additional machine

Note that the number of powered-up machines cannot be greater than 10 or less than 1 at any time.
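
One decision step of the given policy, including the limits of 1 and 10 machines, might look like the sketch below. The function name policy_step and its parameters are hypothetical; the low and high thresholds are the factors you are to study, so no values are suggested here. The array util[] is assumed to hold the utilization measured over the last sample period for each powered-up machine.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_MACHINES 1   /* at least one machine is always powered-up */
#define MAX_MACHINES 10  /* cluster size from the specification */

/* Hypothetical helper: one iteration of the given policy, executed at the
   end of each 1 minute sample period. util[0..num_up-1] are the measured
   utilizations of the powered-up machines. Returns the number of machines
   that should be powered-up for the next sample period. */
int policy_step(const double util[], int num_up, double low, double high)
{
    double sum = 0.0;
    double grand_mean;
    int i;

    assert(util != NULL);
    assert(num_up >= MIN_MACHINES && num_up <= MAX_MACHINES);
    assert(low <= high);

    for (i = 0; i < num_up; i++)
        sum += util[i];
    grand_mean = sum / num_up;

    if (grand_mean > high && num_up < MAX_MACHINES)
        return num_up + 1;  /* power-up one additional machine */
    if (grand_mean < low && num_up > MIN_MACHINES)
        return num_up - 1;  /* power-down one machine */
    return num_up;          /* no change */
}
```

Because this sketch changes at most one machine per 1 minute sample period, every machine remains in its power state for at least 1 minute, which is consistent with the minimum time stated in the system specification.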

Workload

You are to study the performance of this system with two workloads.

Poisson arrivals with a rate of 1.245 requests per second.
A trace from a real production web server (the trace is here). The trace (in ASCII text format) contains one month of interarrival times to a real production server at a small business.

What you are to do (and grading)

You are to model the above described system and its power-up/power-down policy and study the effects of key parameters (factors) on response time performance. You are to determine the best possible parameter values, that is, the values that minimize energy use while still meeting the SLA. You are also to invent, describe, model, and evaluate your own policy to try to improve on the given policy. Even if your policy is not better than the given policy, this is OK if the policy is based on good engineering judgement and its evaluation is complete.

You are to do the following:

Characterize the server trace (10 points)
Develop the simulation model for the above system (and policy) and validate it (10 points)
Describe the factors and the experiment design (10 points)
Determine the best possible parameter values for the Poisson workload and determine the energy savings (20 points)
Determine the best possible parameter values for the server trace workload and determine the energy savings (20 points)
Invent and describe your own policy (10 points)
Evaluate your own policy and compare it to the given policy for both the Poisson and server trace workloads (10 points)
Complete a related work literature review (10 points)
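
The energy savings asked for in the items above could be computed along the lines of the sketch below. The function names are hypothetical, and the baseline used here (all 10 machines powered-up for the entire run) is an assumption; state clearly in your paper which baseline you use.

```c
#include <assert.h>

#define NUM_MACHINES 10     /* cluster size from the specification */
#define POWER_UP_W   200.0  /* power of a powered-up machine in watts */
#define POWER_DOWN_W 5.0    /* power of a powered-down machine in watts */

/* Hypothetical helper: total cluster power in watts when num_up machines
   are powered-up and the remaining machines are powered-down. */
double cluster_power(int num_up)
{
    assert(num_up >= 0 && num_up <= NUM_MACHINES);
    return num_up * POWER_UP_W + (NUM_MACHINES - num_up) * POWER_DOWN_W;
}

/* Hypothetical helper: fraction of energy saved relative to the assumed
   baseline of all machines powered-up for duration_s seconds, given the
   energy in joules actually consumed over the same duration. */
double energy_savings(double energy_j, double duration_s)
{
    double baseline_j;

    assert(duration_s > 0.0);
    baseline_j = cluster_power(NUM_MACHINES) * duration_s;
    return 1.0 - energy_j / baseline_j;
}
```

In a simulation, the consumed energy can be accumulated by adding cluster_power(num_up) multiplied by the length of each interval during which num_up is constant. With only one machine powered-up, the cluster draws 245 W instead of 2000 W, which corresponds to a savings of 87.75% under this baseline.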

You are to document your findings in a properly formatted IEEE-style paper of maximum length 5 pages.

Up to 30 points can be subtracted for a poorly written or poorly formatted paper and source code. Up to 20 points of extra credit can be earned for a particularly insightful policy that yields much better performance (that is, uses less energy while meeting the SLA) than the above given policy.

Project submission

Please submit your project as follows. Please email me one zip file named with your last name followed by your first name (e.g., ChristensenKen.zip). In the zip file, please include your paper, named with your last name followed by your first name followed by "_paper" (e.g., ChristensenKen_paper.pdf), and your model source code (i.e., the .c files). Please name your source files with your last name and first name as well (e.g., ChristensenKen.c). Include a readme in your zip file if you have more than one source code file. Please do not email your submission multiple times; you will receive a "Got it" email from me when I have received your submission.

Miscellaneous

Some miscellaneous items are:

The template for IEEE-style papers is here.
The coding style guide you are to follow is here.

Last update on June 25, 2013