ACM Comput. Surv., Vol. 52, No. 1, Article 1, Publication date: February 2019.
DOI: https://doi.org/10.1145/3291049
Power management for data centers has been extensively studied in the past 10 years. Most research has focused on owner-operated data centers with less focus on Multi-Tenant Data Centers (MTDC) or colocation data centers. In an MTDC, an operator owns the building and leases out space, power, and cooling to tenants to install their own IT equipment. MTDCs present new challenges for data center power management due to an inherent lack of coordination between the operator and tenants. In this article, we conduct a comprehensive survey of existing MTDC power management techniques for demand response programs, sustainability, and/or power hierarchy oversubscription. Power oversubscription is of particular interest, as it can maximize resource utilization, increase operator profit, and reduce tenant costs. We create a taxonomy to classify and compare key works. Our taxonomy and review differ from existing works in that our emphasis is on safe power oversubscription, which has been neglected in previous surveys. We propose future research for prediction and control of power overload events in an oversubscribed MTDC.
ACM Reference format:
Sulav Malla and Ken Christensen. 2019. A Survey on Power Management Techniques for Oversubscription of Multi-Tenant Data Centers. ACM Comput. Surv. 52, 1, Article 1 (February 2019), 31 pages. https://doi.org/10.1145/3291049
The proliferation of smart consumer devices and Internet of Things (IoT) has placed an increasing demand on data processing, storage, and management in the cloud. Increasingly, companies are migrating toward cloud-based services (Columbus 2017) as such services are easy to manage and highly accessible to clients. Moreover, popular Internet services, such as search, email, movie/music streaming, and so on, inherently require a cluster of interconnected computing and storage nodes rather than a single machine. This rise in server-side computation has driven the need for more data centers, the physical home of the cloud, both in terms of number as well as size.
Data centers can range from small server closets and rooms to large buildings consuming tens of megawatts of power (equivalent to that of a small city). In 2014, electricity consumption by data centers accounted for approximately 1.8% (70 TWh) of the entire electricity used in the U.S. (Shehabi et al. 2016). Of the total energy consumed by data centers, on average, only 60% is used to power the IT equipment (Shehabi et al. 2016; Uptime Institute 2014), including servers, storage, and networking equipment, while the rest, two-fifths of the energy, is wasted in the mechanical (cooling) and electrical (lighting) systems of the data center. Given such a large energy footprint and massive energy waste, adoption of energy efficiency techniques in data centers is necessary. Consequently, significant research has been done in the past decade to tackle this problem, as described in many past surveys (Beloglazov et al. 2011; Kong and Liu 2014; Liu and Zhu 2010; Rahman et al. 2014; Shuja et al. 2016).
However, the focus of past research has mostly been toward owner-operated data centers where a single entity owns and manages the data center infrastructure as well as the IT equipment in them. Equally important, but far less explored, is the multi-tenant data center (MTDC), also referred to as colocation data center or simply as “colos.” In an MTDC, the data center infrastructure is owned by an operator who leases out space and power capacity in the form of racks, cages, or rooms to tenants. The operator provides power, cooling, and security while tenants install and maintain their own IT equipment to provide services to the tenant's customers, as illustrated in Figure 1. Unlike in an owner-operated data center, where the operator has control over all aspects of the data center, in an MTDC the operator does not control the IT hardware that belongs to tenants. Many businesses opt for an MTDC because building their own data center is expensive and can take years to complete. They might also want to place their servers closer to client locations, move servers from inefficient office rooms to an MTDC with higher power efficiency to minimize their carbon footprint, and/or expand globally to multiple locations around the world. An MTDC offers a quick and scalable solution to businesses at relatively small capital expenditure.
Major websites, e-commerce, banks, and even IT giants like Google, Microsoft, IBM, and Apple use MTDC to complement their own data centers (Islam et al. 2016; Ren 2017). For example, Apple owns and operates five data centers but also leases data center space at different MTDCs worldwide, all of which consumed a total of 778GWh of electricity in 2016 (Apple 2017). As shown in Figure 2(a), about one-fifth of this electricity use was from MTDCs. The trend of Apple's increasing reliance on MTDCs can be seen in Figure 2(b). The energy use footprint by MTDCs has grown by more than fourfold in four years.
There are 4,000+ colocation data centers worldwide with 40% of them in the U.S. (Data Center Map 2017). The MTDC market size is estimated at $31.5 billion and increasing at a 14.6% annual rate (MarketsandMarkets 2017). Some of the biggest data centers in the world are multi-tenant data centers. For example, the Digital Realty data center in Chicago has a total area of 1.1 million square feet and draws more than 100MW power from the utility (Digital Realty 2017). More importantly, MTDCs consume about one-fifth of the entire data center electricity use in the U.S. This is about five times more electricity than Google-type hyperscale data centers (NRDC 2014). Therefore, the MTDC is an important type of data center, and making MTDCs energy efficient is equally, if not more, important. However, there is a gap between the operator and the tenants, as identified by Ren and Islam in their seminal paper (Ren and Islam 2014). The operator might want to manage power or adopt energy efficient techniques in their data center, but they lack control over the physical servers and the workload, which are owned and managed by the tenants independently. The same power management techniques that have been proposed for data centers with a single owner and operator are therefore not directly applicable to an MTDC.
Only very recently has there been some research to address the problem of power management in MTDC (Islam et al. 2014, 2016; Malla and Christensen 2017; Ren and Islam 2014; Tran et al. 2015b; Zhang et al. 2016). Since an MTDC operator sells data center space/power to tenants, an MTDC can be viewed as a market with the operator as the seller and the tenants as the buyers. Hence, some kind of market mechanism is required to incentivize tenants and establish coordination between the MTDC operator and the tenants. In this article, we do a comprehensive survey of power management techniques for MTDC with a special focus on power oversubscription.
Oversubscription is essential for efficient use of resources. For example, telephone lines and the Internet are oversubscribed, and airline tickets are overbooked. Power oversubscription, that is, installing servers such that the aggregated peak power is greater than the power capacity limit, is desirable in data centers as it increases the overall utilization of the power infrastructure and reduces the amortized cost associated with power provisioning. MTDC operators may want to oversubscribe power for greater profit, while tenants themselves may want to oversubscribe the power they have purchased to reduce cost. We formally define what power oversubscription is and outline in more detail the profit, risk, and challenges associated with oversubscribing the power capacity of an MTDC in Section 2.
The outline of the remainder of this article is as follows. In Section 2, we discuss the different aspects of MTDC ranging from its power hierarchy to different leasing schemes. In Section 3, we describe our new taxonomy and use it to classify the existing literature. In the following three sections, 4, 5, and 6, we introduce research work in each of the three main categories. In Section 7, we discuss related surveys in power management of data centers. In the last section, we discuss open problems and future directions.
In this section, we describe the power hierarchy of MTDC, concept of oversubscribing the power hierarchy, demand response programs from the utility, and different types of costs for tenants and the operator.
2.1.1 Power Hierarchy. Figure 3 shows a simple MTDC power infrastructure hierarchy without any redundancy. After a series of voltage step-downs from high to low voltage (typically, 480V), power from the utility enters the data center. On-site backup diesel generators are generally present to provide power in case of a utility power outage. An Automatic Transfer Switch (ATS) chooses between the two inputs automatically. Uninterruptible Power Supplies (UPS) with batteries regulate power from the utility, removing any sags or spikes as well as performing power factor correction of the load side. UPS's also enhance reliability by supplying stored energy during utility power outages, since backup diesel generators take a few seconds to start and bear the full load. These are generally double conversion UPS's with a rectifier (AC-DC) step that also charges the batteries and an inverter (DC-AC) step.
The conditioned output of the UPS's goes to multiple Power Distribution Units (PDU) spread across the server room. Each PDU transforms voltage to that suitable for the installed IT equipment (typically 110V in the U.S.) and splits to multiple outputs with circuit breakers for protection (similar to a residential circuit breaker panel) that ultimately go to racks and servers (Barroso et al. 2013). Alternative configurations are also possible, for example having redundant components (UPS and PDU) and a power bus for greater reliability and availability at increased construction cost. Multiple distributed UPS's per rack or server instead of a centralized UPS or a high-voltage DC power distribution system can be used to avoid double conversion losses for increased efficiency. Increasingly, data centers also have local renewable energy sources, like wind and solar, to minimize their dependency on the grid and reduce their carbon footprint (Deng et al. 2014).
As mentioned earlier, the MTDC operator owns and manages the power infrastructure. Tenants can rent a single server rack, a cage (with several racks), or even an entire room/building, depending upon their needs.
2.1.2 Power Consumption. Two major sources of power consumption in an MTDC are the IT equipment of tenants and the cooling system used to maintain the temperature and humidity inside the data center.
IT equipment power: Tenants typically have IT equipment including servers for computing, storage disks for storing data, and networking equipment for communication. Servers are the predominant power consumers and their power consumption varies according to their utilization. Hence, power consumption by tenants is generally modeled as depending upon the number of servers they have powered on and the average utilization of these servers. For example, consider an MTDC with $N$ tenants with each tenant, $i \in \lbrace 1,2,\ldots ,N\rbrace$, having $M_i$ homogeneous servers. A server consumes $P_{i}^{idle}$ power when completely idle, $P_{i}^{max}$ power when fully utilized, and tenants can choose to power down some of their servers when not needed to reduce the power use. The time under consideration (usually, the monthly billing cycle) can be divided into $T$ time slots, with a time interval $t \in \lbrace 1,2,\ldots ,T\rbrace$. If, for each time interval $t$ and tenant $i$, the average workload arrival rate is $\lambda _{i}(t)$, the average service rate of a server is $\mu _{i}$, and the tenant has $m_{i}(t)$ servers powered on, then the mean utilization of a server is $\lambda _{i}(t)/(m_{i}(t) \mu _{i})$. A linear power model is a popular way to estimate a server's power consumption (Fan et al. 2007; Tran et al. 2016c). The total power consumption by the tenant, $x_{i}(t)$, is given as
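As a minimal sketch of the linear power model described above, assume each powered-on server draws its idle power plus a share of its dynamic range ($P_{i}^{max}-P_{i}^{idle}$) proportional to its mean utilization. The function name, argument names, and the cap on utilization at 1 are illustrative assumptions and are not taken from the cited works.

```python
def tenant_power(arrival_rate, service_rate, servers_on, p_idle, p_max):
    """Estimate one tenant's total power draw (watts) in a single time slot.

    Assumed linear model: per-server power rises linearly from p_idle
    (utilization 0) to p_max (utilization 1).
    """
    if servers_on <= 0:
        return 0.0  # all servers powered down
    # Mean utilization of each powered-on server, capped at 1 (assumption).
    utilization = min(arrival_rate / (servers_on * service_rate), 1.0)
    per_server = p_idle + (p_max - p_idle) * utilization
    return servers_on * per_server


# Hypothetical tenant: 100 servers on, each serving 10 requests/s,
# 500 requests/s arriving, 100 W idle and 200 W peak per server.
print(tenant_power(500, 10, 100, 100, 200))  # 15000.0
```

In this sketch, powering down servers raises the utilization of the remaining ones, so the saving comes only from the idle power of the servers that are switched off.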
Cooling system power: The heat generated from IT equipment must be removed for reliable operation of an MTDC. In a typical MTDC, computer room air conditioning (CRAC) units are located in server rooms with raised floors to take in the hot air produced by servers. CRAC units have heat exchangers, where the hot air is cooled, typically using chilled water from a water-cooled chiller (Barroso et al. 2013). The cold air is pumped back to the server room through the underfloor plenum. Rooms with raised floors have perforated tiles in front of the server racks such that cool air blown from beneath enters the servers, and the process repeats. Cooling systems consume a significant amount of power (up to 40% of entire data center power (Barroso et al. 2013)). One way of modeling the cooling power of an MTDC, $p_{cool}(t)$, is as (Zhang et al. 2013)
Power usage effectiveness: The total power used by an MTDC, $p_{MTDC}(t)$, can be approximated from Equation (2) and Equation (3) as
The critical power capacity of a data center is defined as the actual power available for IT equipment, ignoring other power consumption overhead (e.g., cooling). Infrastructure cost of a data center scales linearly with its critical power capacity and can cost between $11 to $25 per provisioned critical watt (Barroso et al. 2013; Karidis et al. 2009). The amortized capital expenditure for building the power infrastructure of a data center can be 1.5 times higher than its operational cost over its entire lifetime (Wang et al. 2012b). Given the high cost of building the power infrastructure of a data center, 100% utilization would be ideal. However, this is not the case in actual data centers, because the power use of a server is proportional to its workload and servers are not 100% utilized all the time. The average server utilization in well-tuned hyperscale data centers is about 45% (Shehabi et al. 2016). Furthermore, it has been found that the conservative approach of installing servers such that the sum of maximum power use by individual servers is less than the critical power capacity leads to further inefficiencies and wasted resources. Even though a server may frequently consume peak power, it is rare for a large group of servers to peak simultaneously.
Fan et al. (2007) characterized the power usage of a cluster of 5,000 servers at Google for different real world workloads over a period of six months. The main finding is summarized in Figure 4(a). Fan et al. found that, as we move up the power hierarchy from rack to PDU to cluster, dynamic range of power usage is narrower and the actual maximum power use compared to potential maximum power use decreases. For example, at the rack level the actual peak is 93% of maximum possible, but at the cluster level, the actual power use did not exceed 72% of the aggregated peak power. This is to say that more than a quarter of the critical power capacity of the data center never got utilized. The authors identified and explored the opportunity for maximizing the utilization of data center power infrastructure through power oversubscription, that is, installing more servers than allowed by the critical power capacity. Similar observations were made by Islam et al. (2016) when they studied the power usage characteristics of 10 tenants in an MTDC. Figure 4(b) shows the CDF of power use by 1, 5, and 10 tenants normalized to the maximum possible power use. We see that as the number of tenants increases, the maximum normalized aggregate power use of all tenants decreases. This is due to the fact that, different tenants in an MTDC have different workloads, and the event of correlated spikes is very rare. Hence, it is possible to oversubscribe the power hierarchy in an MTDC for increased utilization and efficiency. This observation is key to the future work proposed in this article.
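The aggregation effect reported above can be illustrated with a small sketch that compares the peak of the summed power traces against the sum of the individual peaks. The traces below are made-up numbers chosen only for illustration, and `peak_ratio` is a hypothetical helper, not a method from the cited studies.

```python
def peak_ratio(traces):
    """Return (peak of aggregate power) / (sum of individual peaks).

    `traces` is a list of equal-length power traces, one per tenant.
    A ratio below 1 means the tenants do not peak at the same time.
    """
    sum_of_peaks = sum(max(trace) for trace in traces)
    aggregate = [sum(samples) for samples in zip(*traces)]
    return max(aggregate) / sum_of_peaks


# Three hypothetical tenants (kW per time slot) that peak in different slots.
traces = [
    [10, 4, 3],
    [2, 9, 4],
    [3, 2, 8],
]
ratio = peak_ratio(traces)
print(f"aggregate peak is {ratio:.0%} of the sum of individual peaks")
```

The gap between this ratio and 1 is the headroom that oversubscription tries to reclaim; when tenant peaks are correlated, the ratio approaches 1 and the headroom disappears.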
2.2.1 Profit and Risk of Power Oversubscription. MTDC power oversubscription is profitable to the operator as they are able to host more tenants than allowed by the power capacity. Tenants are generally charged at a fixed rate based on their power subscription (Islam et al. 2015). So power oversubscription leads to an increase in the operator's profit (revenue increases without additional cost). However, any time the power infrastructure is oversubscribed, there exists the risk of having a power overload; that is, the total power consumption is above the power capacity. This can have disastrous results. For example, at the rack or PDU level, the circuit breakers may trip, leading to server downtime and service outage. Similarly, at the cluster or data center level, where the power limit is contractual rather than physical, there may be fines from the utility (Malla and Christensen 2017). Considering a typical leasing cost of $150/kW/month and practical power use characteristics of tenants, the potential profit to the operator and the associated risk (power overloading probability) at different power oversubscription levels are shown in Table 1 (Islam et al. 2016).
Oversubscription | Extra revenue ($/kW/year) | Probability of overloading |
---|---|---|
10% | 180 | 1% |
15% | 270 | 1% |
20% | 360 | 2% |
25% | 450 | 3% |
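The extra revenue column in Table 1 is consistent with simple arithmetic, assuming the operator earns the full $150/kW/month leasing rate on every oversubscribed kW. A short sketch under that assumption (the function name is hypothetical):

```python
def extra_revenue_per_kw_year(lease_rate_per_kw_month, oversubscription_pct):
    """Extra yearly revenue per kW of capacity from leasing the
    oversubscribed share at the normal rate (illustrative assumption)."""
    return lease_rate_per_kw_month * 12 * oversubscription_pct / 100


for pct in (10, 15, 20, 25):
    print(pct, extra_revenue_per_kw_year(150, pct))
# 10 180.0
# 15 270.0
# 20 360.0
# 25 450.0
```

The overload probabilities in the table come from measured tenant power traces and cannot be derived this way.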
To avoid a power overload situation and ensure safe power oversubscription, some kind of power capping mechanism is required. Power capping is a method of restricting the power use of a server or group of servers such that a certain power limit/budget is not exceeded. Power management technologies in modern servers have this power capping capability. An example is Intel's Running Average Power Limit (RAPL) interface, which limits a server's average power use over a time window (David et al. 2010; Rountree et al. 2012). Capping the power of a server may affect its performance. Power capping at the server level forms the building block of power capping mechanisms at higher levels (for example, PDU and data center) in operator-owned data centers (Azimi et al. 2017; Wang et al. 2012a; Wu et al. 2016) and hence enables safe power oversubscription. But in an MTDC, the operator lacks control over the servers for any kind of power capping.
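To make the idea of building a higher-level cap out of server-level caps concrete, the following sketch splits a PDU-level budget across servers in proportion to their current demand. This proportional policy is only an illustrative assumption; it is not the algorithm of any of the cited works, and it does not call any real power management interface such as RAPL.

```python
def proportional_caps(demands_w, budget_w):
    """Return a per-server power cap (watts) so the total stays within budget.

    If total demand fits the budget, every server keeps its demand.
    Otherwise each cap is scaled down by the same factor (assumed policy).
    """
    total = sum(demands_w)
    if total <= budget_w:
        return list(demands_w)
    scale = budget_w / total
    return [demand * scale for demand in demands_w]


# Three hypothetical servers demanding 1000 W under an 800 W budget.
caps = proportional_caps([300, 200, 500], 800)
print(caps)
```

In an owner-operated data center the resulting caps could then be enforced on each server; in an MTDC the operator has no such enforcement path, which is the gap the surveyed incentive mechanisms try to close.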
2.2.2 Tenant-side Power Oversubscription. It is also possible for tenants to oversubscribe their own power limit (putting in more servers than allowed by their power capacity), without the operator's knowledge, to reduce the cost of power subscription. Tenants are generally charged by the MTDC for power they subscribe to rather than the actual energy they use or for their peak power consumption (Palasamudram et al. 2012). Figure 5 shows the time-varying nature of a tenant's power consumption. Normally, tenants subscribe to power to ensure that peak power use is below the power capacity (Figure 5(a)). This can lead to wastage of available power as peak power consumption by tenants may be rare. Tenants can oversubscribe power to better utilize the available power and reduce their operating cost as well (lower leasing cost). However, this can lead to power overload as shown in Figure 5(b) if proper control mechanisms are not in place. In practice, MTDC operators do not allow tenants to consume more than 75% (Verizon 2014) or 80% (TekLinks 2017) of the rated power circuit limit to ensure safety. Here, the oversubscription is at the lower (rack or cage) levels of the power hierarchy.
If tenant $i$ has $M_i$ servers, each with maximum power rating of $P_{i}^{max}$, then the maximum possible power use by the tenant is $x_{i}^{max}=M_i P_{i}^{max}$. Assuming that the tenant subscribes to a power level of $S_{i}$ from the operator, we say the tenant has oversubscribed the available power when the maximum possible power consumption by the tenant is greater than its subscribed capacity, that is, $x_{i}^{max} > S_{i}$.
2.2.3 Operator-side Power Oversubscription. The MTDC operator can oversubscribe the power hierarchy similar to power oversubscription in an operator-owned data center. This can be completely transparent to the tenants. Let $C_{0}$ be the critical power capacity of an MTDC, and let each tenant $i$ have a power subscription of $S_{i}$. The MTDC is said to be oversubscribed by the operator when the total allocated peak power to the tenants is greater than the critical power capacity, that is, $\sum_{i=1}^{N} S_{i} > C_{0}$.
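Both definitions compare a provisioned peak against a limit, so each can be expressed as a ratio that exceeds 1 under oversubscription. The sketch below uses hypothetical function names and made-up numbers.

```python
def tenant_oversubscription_ratio(num_servers, p_max_w, subscribed_w):
    """Tenant side: maximum possible draw M_i * P_i^max over subscription S_i."""
    return (num_servers * p_max_w) / subscribed_w


def operator_oversubscription_ratio(subscriptions_kw, critical_capacity_kw):
    """Operator side: sum of tenant subscriptions S_i over capacity C_0."""
    return sum(subscriptions_kw) / critical_capacity_kw


# A tenant with 50 servers rated 250 W each on a 10 kW subscription.
print(tenant_oversubscription_ratio(50, 250, 10_000))        # 1.25

# An operator leasing 40 + 35 + 45 kW on 100 kW of critical capacity.
print(operator_oversubscription_ratio([40, 35, 45], 100))    # 1.2
```

A ratio of 1.25 corresponds to 25% oversubscription in the sense used in Table 1.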
Safe oversubscription of an MTDC is not as straightforward as in an operator-owned data center, because the operator lacks control over the tenant servers. Tenants will be unaware of the power oversubscription, which is an operator-created artifact. There is no mechanism in place to prevent power consumption from tenants peaking simultaneously and creating a power overload situation. We look into existing research work that focuses on enabling safe power oversubscription in MTDC by an operator (Islam et al. 2016; Malla and Christensen 2017). This limited existing work is covered in Section 6 of this article.
The main source of power for an MTDC is typically the electric utility grid. Power from the grid is considered stable and reliable. In case of grid power failure, UPS with batteries help in transition to on-site diesel generators.
2.3.1 Utility Pricing. As large power consumers, MTDCs are charged industrial rates by the utility. The electricity bill generally has two components: a volume (or energy) charge and a demand (or peak power) charge (Wang et al. 2014).
Volume charge: This component of the bill is for the energy consumption by the MTDC, charged as dollars per kWh. The charge may vary throughout the billing period (for example, one month) with a higher price when the demand is high. For example, in time of use (TOU) pricing we could have different time blocks in a day marked as on-peak with higher rates and off-peak with lower rates. A more complex version is real time pricing (RTP), where the electricity price varies every hour according to demand and the customers are informed an hour or a day ahead of time (Albadi and El-Saadany 2008).
Demand charge: This part of the bill is for the peak power consumption by the MTDC during the billing period, charged as dollar per kW. The power use by the MTDC is monitored by the utility and the highest average power consumption in a sampling interval (for example, 15 minutes) is considered as the peak demand (measured in kW) for that billing period. Thus, demand charge is only determined at the end of the billing cycle. Utilities have a demand charge to compensate for the expensive overhead infrastructures (for example, generators, transformers, and substations) they have to build to meet the power demand at all times (Islam et al. 2015). Demand charge can constitute up to 40% of the entire electricity bill of the MTDC (Govindan et al. 2011).
If we denote $\alpha (t)$ as the volume charge at time $t$, $\beta$ as the demand charge, and time interval length to be $\tau$, then the total electric bill of an MTDC in a billing cycle is
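A minimal sketch of such a two-part bill, assuming the volume charge is the sum over time slots of price times average power times slot length, and the demand charge is $\beta$ times the highest slot-average power. The function and argument names are illustrative, and the slot length is assumed to equal the utility's sampling interval.

```python
def electricity_bill(power_kw, energy_price, demand_price, slot_hours):
    """Two-part bill for one billing cycle.

    power_kw     : average MTDC power in each time slot (kW)
    energy_price : volume charge alpha(t) in each slot ($/kWh)
    demand_price : demand charge beta ($/kW of peak)
    slot_hours   : slot length tau, in hours
    """
    if len(power_kw) != len(energy_price):
        raise ValueError("power and price series must have the same length")
    volume = sum(a * p * slot_hours for a, p in zip(energy_price, power_kw))
    demand = demand_price * max(power_kw)
    return volume + demand


# Four hypothetical 15-minute slots with on-peak pricing in the middle two.
bill = electricity_bill(
    power_kw=[100, 200, 150, 50],
    energy_price=[0.05, 0.10, 0.10, 0.05],
    demand_price=10.0,
    slot_hours=0.25,
)
print(round(bill, 3))  # 2010.625
```

In this toy case the single 200 kW peak accounts for almost the entire bill only because the series covers one hour; over a full month the volume term grows with every slot while the demand term is still set by one peak interval.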
2.3.2 Demand Response. To make power grids more reliable, efficient, and sustainable, and to incorporate highly varying renewable energy sources into the grid, utilities offer different demand response (DR) programs to their customers. DR refers to reshaping the consumer's electricity consumption profile using time-varying prices or by offering incentives for power reduction when requested, for example, when there is peak demand, the wholesale market price for electricity is high, or the reliability of the grid is in danger (Siano 2014). DR programs help match demand to supply by coordinating power use with power generation rather than the traditional way of matching supply to demand (Liu et al. 2014).
As noted in a survey (Wierman et al. 2014), “data centers are large loads, but are also flexible,” which makes them good candidates for demand response programs. Data centers are generally viewed as power hungry customers but can alternatively be viewed as energy storage devices due to their flexible demand. Utilities can extract this flexibility using demand response programs. A data center can participate in demand response programs in various ways as described below.
In an MTDC, the operator can participate in DR programs only by either using stored energy (that may last for just a few minutes) or using backup diesel generators (that are environmentally unfriendly). The IT control mechanisms described above are under the control of the tenants, who do not have any incentive for DR. Hence, participation in DR programs by an MTDC operator is limited and poses a unique challenge. Recent research (Guo et al. 2018a; Ren and Islam 2014; Tran et al. 2015a; Zhang et al. 2015b) has focused on this problem to design incentive programs for tenants so that an MTDC can participate in DR programs. We describe this existing work in Section 4.
2.4.1 Operator Leasing Schemes. There are various ways in which an MTDC operator can charge tenants a monthly fee for power and space. Three major pricing models are described below (Islam et al. 2015). The operator can also charge extra for other additional services, such as setup fees and network fees.
Space-based pricing: In this pricing model, the operator charges a monthly fee according to the space occupied. It may be a per square foot charge or a per rack space charge, measured in units of “U” (1.75 inches) (Optimal Networks 2015). Modern data centers are, however, moving from space-based pricing to power-based pricing as critical power is increasingly becoming the primary source of cost (Dines 2011; Islam et al. 2015).
Power-based pricing: MTDCs generally charge tenants based on their power subscription. All costs for space and energy use are bundled into the flat monthly fee charged for each kW of power supplied to the tenant. At the time of contract, the tenant may specify the amount of power they want to subscribe to (their power limit). Power subscription charges are in the range of $150–200 per kW per month (Islam et al. 2015). This is a widely used pricing model in MTDCs.
Energy-based pricing: In both space-based and power-based pricing models, the tenant is not charged for metered power use. Rather, the monthly bill is a flat fee. This kind of pricing scheme does not encourage tenants to be energy efficient in their operation. In contrast, in the energy-based pricing model, tenants are charged for actual energy use on top of a flat power subscription fee (which is less than the charge in power-based pricing (Islam et al. 2015)). Such a pricing model encourages tenants to be energy efficient and is usually found in wholesale MTDCs where the operator leases power in excess of 100kW to a single tenant (Dines 2011).
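The difference in incentives between the power-based and energy-based models can be seen in a small sketch. The reduced subscription rate ($100/kW/month) and the energy price ($0.125/kWh) below are made-up values for illustration only; only the $150/kW/month figure comes from the range quoted above.

```python
def power_based_bill(subscribed_kw, rate_per_kw_month):
    """Flat monthly fee that depends only on the subscribed power."""
    return subscribed_kw * rate_per_kw_month


def energy_based_bill(subscribed_kw, reduced_rate_per_kw_month,
                      energy_kwh, energy_price_per_kwh):
    """Lower flat subscription fee plus a charge for metered energy."""
    return (subscribed_kw * reduced_rate_per_kw_month
            + energy_kwh * energy_price_per_kwh)


# A hypothetical tenant subscribing to 10 kW.
print(power_based_bill(10, 150))                  # 1500, whatever the usage
print(energy_based_bill(10, 100, 3000, 0.125))    # 1375.0 at 3000 kWh
print(energy_based_bill(10, 100, 2000, 0.125))    # 1250.0 at 2000 kWh
```

Only the energy-based bill falls when the tenant uses less energy, which is why that model rewards efficient operation.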
2.4.2 Tenant Costs. Tenants may adopt energy efficient techniques, like powering down some servers during periods of light workload, if they have proper incentive to do so. Powering down servers or putting servers in some kind of power saving sleep mode can cause inconvenience to tenants (Oo et al. 2017), like performance degradation, time overhead of powering back up, and wear-and-tear of equipment. This is generally modeled using monotonically increasing cost (Islam et al. 2014) or monotonically decreasing utility (Malla and Christensen 2017) as a function of number of servers turned off. Such cost functions are private to individual tenants and are determined at their own discretion. Two types of costs generally associated with tenants are delay performance cost and switching cost (Lin et al. 2013; Oo et al. 2017; Ren and Islam 2014).
Delay performance cost: Tenants may have computational tasks or may be providing some kind of interactive service to their own customers. Powering down servers can delay computation and degrade service time performance. Delay performance cost models the cost associated with an increase in delay that may also cause violation of Service Level Agreements (SLA) (Lin et al. 2013; Ren and Islam 2014). It is a common practice to model each active server as an M/M/1 queue, for which the average delay for tenant $i$ with $m_i$ active servers
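A minimal sketch of this delay model, assuming the tenant's workload is split evenly across its $m_i$ active servers so that each server behaves as an independent M/M/1 queue with arrival rate $\lambda_i/m_i$ and service rate $\mu_i$. Under that assumption, the standard M/M/1 result gives a mean delay of $1/(\mu_i - \lambda_i/m_i)$. The function name and the error handling are illustrative.

```python
def mm1_mean_delay(arrival_rate, service_rate, active_servers):
    """Mean time a request spends at one server (queueing plus service),
    assuming an even load split across identical M/M/1 servers."""
    if active_servers <= 0:
        raise ValueError("at least one server must be active")
    per_server_arrivals = arrival_rate / active_servers
    if per_server_arrivals >= service_rate:
        raise ValueError("unstable queue: arrivals meet or exceed service rate")
    return 1.0 / (service_rate - per_server_arrivals)


# 80 requests/s over servers that each handle 10 requests/s.
print(mm1_mean_delay(80, 10, 10))  # 0.5 seconds with 10 servers
print(mm1_mean_delay(80, 10, 16))  # 0.2 seconds with 16 servers
```

The delay grows sharply as the number of active servers approaches the minimum needed for stability, which is why powering down servers carries a performance cost.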
Switching cost: When switching servers between power saving modes, additional costs such as server wear-and-tear cost, state migration cost, and startup energy overhead cost are incurred (Guo et al. 2018a; Islam et al. 2014; Lin et al. 2013; Ren and Islam 2014). We can model these cost components as a switching cost that grows linearly with the number of servers powering down. For a tenant $i$ with total $M_i$ servers but only $m_i$ servers active, we can express switching cost as
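A sketch of a linear switching cost, assuming a single per-server coefficient (called `cost_per_server` here, a hypothetical name) that lumps together wear-and-tear, state migration, and startup energy overhead:

```python
def switching_cost(total_servers, active_servers, cost_per_server):
    """Cost that grows linearly with the number of servers powered down."""
    if not 0 <= active_servers <= total_servers:
        raise ValueError("active_servers must be between 0 and total_servers")
    return cost_per_server * (total_servers - active_servers)


# A hypothetical tenant turning off 30 of 100 servers at 0.5 cost units each.
print(switching_cost(100, 70, 0.5))  # 15.0
```

Because the coefficient is private to each tenant, an operator would not observe it directly, which is one reason the surveyed works rely on bidding or pricing mechanisms.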
In this survey article, we focus on works in power management of MTDC and categorize them according to the research problem they try to solve. Figure 6 shows our taxonomy, and in Table 2 we categorize all surveyed works according to our taxonomy. These works can be categorized as follows.
Category | Subcategory | Works |
---|---|---|
Demand response (DR) | Economic DR | (Ren and Islam 2014; Tran et al. 2015a, 2015b, 2016; Xu et al. 2017) |
Demand response (DR) | Emergency DR, single data center | (Ahmed et al. 2015; Chen et al. 2015; Guo et al. 2018a; Niu et al. 2016; Sun et al. 2016b, 2015; Tran et al. 2018, 2015b, 2016b, 2016c; Wang et al. 2015, 2017; Zhang et al. 2015b; Zhao et al. 2016) |
Demand response (DR) | Emergency DR, geo-distributed data center | (Sun et al. 2016a; Zhang et al. 2016) |
Sustainability | Carbon footprint reduction | (Islam et al. 2014; Mahmud and Iyengar 2016) |
Sustainability | Energy cost reduction | (Guo and Pan 2015; Guo et al. 2018a, 2018b; Islam et al. 2015; Zhang et al. 2015a) |
Power oversubscription | Tenant side | (Palasamudram et al. 2012) |
Power oversubscription | Operator side | (Islam et al. 2016; Malla and Christensen 2017) |
Demand response (DR) programs aim to make power grids more stable and reliable. During periods of peak power demand or when the wholesale electricity price is high, utilities may offer monetary rewards to consumers who decrease their electricity usage. We consider such programs as economic demand response programs, where the consumer can voluntarily participate and reduce their power demand at their own will and to their economic benefit (Tran et al. 2015a, 2015b, 2016a). Another type of DR program is emergency demand response, which refers to demand response in case of an emergency (for example, extreme weather that stresses a grid) when the grid is about to fail. Participants usually sign a contract with the utility for a fixed energy reduction during an emergency (signaled by the utility) in return for a monetary reward (Chen et al. 2015). Emergency DR is the most widely adopted DR program, representing 87% of the total DR capability across all reliability regions in the U.S. (Managan 2014), and it acts as the last line of defense against a cascading power failure. Furthermore, data centers have been recognized as key participants in emergency DR programs by the EPA (EnerNOC 2013). The main difference between economic and emergency DR is that the former is voluntary with flexible energy reduction while the latter is mandatory and has a fixed amount of energy to reduce. While there have been a number of studies to enable DR participation for operator-owned data centers (Liu et al. 2014, 2013; Wierman et al. 2014), they cannot be applied to MTDC directly due to the lack of coordination between the operator and tenants. An indirect control method is required. MTDC participation in these DR programs therefore needs some reward mechanism to incentivize tenants to reduce power use. The operator actively passes on the financial gain from the electric utility (reward for participating in DR programs) to tenants for coordinated power demand reduction.
An MTDC might want to become environmentally friendly and green to have a good public image, attract pro-sustainable tenants or simply reduce energy cost by consuming less energy. Being green, an MTDC can also earn Leadership in Energy and Environmental Design (LEED) certification that may have tax benefits (Islam et al. 2014; USGBC 2017). In this category, we review works that focus on reducing the carbon footprint of an MTDC (Islam et al. 2014; Mahmud and Iyengar 2016) or cutting their electricity bill (Guo and Pan 2015; Guo et al. 2018a, 2018b; Islam et al. 2015; Zhang et al. 2015a). In both cases, coordination among tenants is necessary to make the MTDC energy efficient and, hence, more sustainable.
Power oversubscription can lead to power overload, which may result in power outages. The average cost of a data center power outage is estimated at $740,357 (Ponemon Institute 2016). To ensure safe power oversubscription, tenants can employ most of the control mechanisms that have been proposed for operator-owned data centers (Azimi et al. 2017; Bhattacharya et al. 2013; Wang et al. 2012a; Wu et al. 2016) to reduce their power use. However, if the operator wants to oversubscribe the MTDC, the operator cannot directly reduce IT power consumption during power overload, as it lacks control over tenant servers and workload. One option is to tap into the energy stored in UPS batteries (Govindan et al. 2011) or to use the on-site diesel generator, which prevents overdrawing power from the utility. However, even if the power infrastructure can handle the excess power, the cooling capacity may be exceeded (Islam et al. 2016), leading to overheating or thermal shutdown of servers. The key idea is to coordinate the power consumption of multiple tenants in an otherwise uncoordinated MTDC. We review works that develop incentive mechanisms that MTDC operators can adopt to encourage tenants to reduce power consumption, while minimizing the impact on their performance, to avoid power overload in an oversubscribed MTDC.
In this section, we review works that focus on sustainable MTDC participation in different types of demand response programs.
Various works have looked into MTDC participation in economic DR programs.
4.1.1 Problem Description. MTDC operators may have a financial incentive to participate in economic DR programs. Generally, data centers can lower their total power use by powering down servers, throttling them to lower power consumption states, and/or shifting workload in time or space (to a different data center). As tenants have no incentive to take part in a DR program, the problem is to find incentive mechanisms through which the operator can pass on the compensation from the utility to its tenants so that they participate in economic DR programs.
4.1.2 Existing Solutions. Ren and Islam (2014) were among the first to study demand response participation for MTDC and to identify the split incentive in MTDC: the operator may want to participate in demand response programs but does not have control over the servers to reduce power consumption, while tenants, who can reduce IT power use, have no incentive to participate. Ren and Islam propose a reverse auction mechanism called iCODE to break this split incentive by rewarding tenants for power reduction. The authors formulate the problem as an energy reduction maximization problem subject to the constraint that the total reward to tenants is less than what the operator receives from the utility for demand response participation. During a demand response period, the utility notifies the MTDC of the compensation rate. The MTDC operator in turn notifies tenants to reduce their power consumption. Tenants then voluntarily prepare a set of bids (energy reduction and the payment requested for it) and submit it to the operator. The operator selects the winning bids for each tenant by trying to maximize the energy reduction while keeping the payments lower than what it receives as compensation from the utility. The problem of selecting winning bids is NP-hard, and hence the authors use a branch and bound technique to find a close-to-optimal solution.
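The winner-selection step described above can be illustrated with a small search. The following is a minimal sketch, not the iCODE implementation: it assumes each tenant submits a list of (energy reduction, requested payment) bids, that at most one bid per tenant can win, and that the operator's budget equals the compensation it receives from the utility. The function name `select_winning_bids` and the data layout are hypothetical.

```python
from typing import List, Optional, Tuple

Bid = Tuple[float, float]  # (energy reduction in kWh, requested payment in $)

def select_winning_bids(bids_per_tenant: List[List[Bid]], budget: float):
    """Pick at most one bid per tenant to maximize total energy reduction
    while keeping total payments within the budget (assumed to be the
    utility compensation). Depth-first search with a simple bound."""
    n = len(bids_per_tenant)
    # best_remaining[i]: most reduction tenants i..n-1 could add if the
    # budget were ignored; used to prune branches that cannot win.
    best_remaining = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        top = max((e for e, _ in bids_per_tenant[i]), default=0.0)
        best_remaining[i] = best_remaining[i + 1] + top

    best = {"reduction": 0.0, "choice": [None] * n}

    def search(i: int, reduction: float, spent: float,
               choice: List[Optional[int]]) -> None:
        if reduction + best_remaining[i] <= best["reduction"]:
            return  # bound: this branch cannot beat the incumbent
        if i == n:
            best["reduction"], best["choice"] = reduction, choice[:]
            return
        for k, (energy, payment) in enumerate(bids_per_tenant[i]):
            if spent + payment <= budget:
                choice[i] = k
                search(i + 1, reduction + energy, spent + payment, choice)
        choice[i] = None  # branch in which tenant i wins nothing
        search(i + 1, reduction, spent, choice)

    search(0, 0.0, 0.0, [None] * n)
    return best["reduction"], best["choice"]

# Three tenants, operator budget of $15.
bids = [[(10, 5), (20, 12)], [(15, 6)], [(8, 2), (12, 7)]]
reduction, choice = select_winning_bids(bids, budget=15)
```

In this example the search settles on the first bid of every tenant (33 kWh for $13): accepting the larger bids of the first or third tenant would either exceed the budget or crowd out the second tenant.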
It should be noted that iCODE cannot guarantee the truthfulness of strategic tenants. Game-theoretic approaches for economic demand response have been proposed to handle strategic tenants. Tran et al. (2015a, 2015b) design a reward program in which the operator first sets the reward rate to maximize its total profit (revenue minus cost), and then tenants reduce power consumption to maximize their own profit. The interaction between tenants and operator is modeled and analyzed as a two-stage Stackelberg game. The reward is set by the operator (leader) and tenants (followers) respond to it, where both the tenants and the operator act to maximize their own utility (profit). Using backward induction, the authors examine the game and find the Stackelberg equilibrium. However, in this incentive mechanism, tenants need to communicate all their cost and workload information to the operator beforehand, so that the operator knows each tenant's best response to the reward rate. The operator then solves the profit maximization problem centrally to determine the optimum reward rate. This may not be practical in an MTDC, as tenants do not want to reveal their cost and workload information to the operator. This work is extended in Tran et al. (2016a) to account for the role of the utility in the demand response program. Rather than treating the utility as an outside component, the utility and the determination of its compensation rate are incorporated as part of the game. The authors propose a Reward-to-Reduce (R2R) mechanism in which the utility first determines the compensation rate for the operator, who then decides the reward rate for tenants.
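Backward induction in a two-stage leader-follower game of this kind can be sketched as follows. This is an illustrative toy model, not the formulation of Tran et al.: it assumes each tenant has a quadratic cost `cost_coeff * e**2` for reducing `e` kWh and a cap on its reduction, and that the operator earns the utility's compensation rate minus the reward rate on every kWh reduced. All names and the grid search are assumptions made for the example.

```python
def tenant_best_response(reward_rate, cost_coeff, max_reduction):
    """Follower stage: the tenant maximizes reward_rate*e - cost_coeff*e**2
    over 0 <= e <= max_reduction (assumed quadratic cost)."""
    unconstrained = reward_rate / (2.0 * cost_coeff)
    return min(max(unconstrained, 0.0), max_reduction)

def operator_profit(reward_rate, compensation_rate, tenants):
    """Leader's profit given that every tenant plays its best response."""
    total = sum(tenant_best_response(reward_rate, c, m) for c, m in tenants)
    return (compensation_rate - reward_rate) * total

def stackelberg_reward_rate(compensation_rate, tenants, steps=1000):
    """Leader stage: search a grid of reward rates in [0, compensation_rate],
    anticipating the followers' responses (backward induction)."""
    candidates = [compensation_rate * k / steps for k in range(steps + 1)]
    return max(candidates,
               key=lambda r: operator_profit(r, compensation_rate, tenants))

# Tenants are (cost_coeff, max_reduction) pairs that the operator must know.
tenants = [(0.5, 100.0), (1.0, 100.0)]
rate = stackelberg_reward_rate(compensation_rate=0.2, tenants=tenants)
```

Note that `stackelberg_reward_rate` needs every tenant's cost coefficient as input, which is exactly the information-revelation requirement that limits the practicality of this approach in an MTDC. With the uncapped quadratic costs assumed here, the operator's best reward rate is half of the compensation rate.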
Similarly, Xu et al. (2017) analyze the interaction between tenants and operator as a non-cooperative Stackelberg game in which the operator is the leader and tenants are the followers. In their mechanism, however, tenants do not have to reveal all their cost parameters to the operator. Tenants bid their energy reduction capability, and the operator sets a reward rate in an iterative fashion. The authors prove that the iteration converges to a unique Stackelberg equilibrium. Through simulation under different scenarios, the authors show that the tenants' bids, the operator's reward rate, and the profits of all parties converge after 20 to 30 iterations. All these works are summarized and compared in Table 3.
Paper | Objective | Solution | Technique | Result |
---|---|---|---|---|
Ren and Islam (2014) | Max. energy reduction | iCODE | Bidding | 50% hourly energy reduction |
Tran et al. (2015a) | Max. utility | A linear time algorithm | Operator sets reward rate | Optimal reward rate and operator profit |
Tran et al. (2015b) | Max. utility | A linear time algorithm | Operator sets reward rate | Optimal reward rate and operator profit |
Tran et al. (2016a) | Max. utility | Reward-to-Reduce | Operator sets reward rate | Stackelberg equilibrium exists |
Xu et al. (2017) | Max. utility | Stackelberg equilibrium | Operator sets reward rate iteratively | Stackelberg equilibrium convergence |
Most of the works on power management of MTDC have focused on DR participation, more specifically on emergency DR.
4.2.1 Problem Description. This DR program differs from the economic DR program in that, while the economic DR program is flexible and participation is voluntary, the emergency DR program is mandatory: the utility defines a fixed amount of energy to be reduced, and non-compliance can result in a financial penalty. Since servers and their corresponding workload are controlled by the tenants in an MTDC, tenants need incentives to take part in emergency DR programs. The problem is to find incentive programs that are efficient, robust, not susceptible to cheating, minimize the impact on tenants, and achieve a total energy reduction greater than or equal to that demanded by the utility.
4.2.2 Existing Solutions. Several works that have focused on MTDC power management have tried to solve the problem of sustainable MTDC emergency DR participation. The majority of these works have focused on a single MTDC while some have focused on geo-distributed MTDCs.
Single data center: Zhang et al. (2015b) proposed Truth-DR, a reverse auction-based incentive mechanism to reward tenants for power reduction. During an emergency DR event, the operator solicits bids as an amount of energy reduction and cost associated with such energy reduction, from tenants. Then, the operator selects winners and calculates payment amounts. The problem of deciding the winning bid is stated as a social cost minimization problem, which is NP-hard. Social cost of an MTDC is defined as the sum of aggregate cost of all tenants (cost for energy reduction) and the cost of the operator (cost of using backup generators in the case where energy reduction by tenants is not sufficient). The authors come up with a primal-dual-based 2-approximation algorithm. Using realistic simulations they show how Truth-DR can incentivize tenants for energy reduction in a close-to-optimal way. Wang et al. (2015, 2017) propose a similar incentive mechanism of tenants bidding for energy reduction but they consider the case where all tenants coordinate among each other in addition to the coordination of tenants with the operator.
Truth-DR is not fair in the sense that two tenants reducing the same amount of energy may be rewarded differently. Sun et al. (2015) proposed FairDR, a reverse auction similar to Truth-DR that solves the fairness issue. In contrast to Truth-DR, which considers the case of a single emergency DR event in isolation, FairDR considers the case of multiple emergency signals (for example, multiple consecutive emergency DR events in a single day). Only one bid is requested from tenants for the entire time window, which may consist of multiple time slots with energy reduction signals. Tenant energy reduction decisions for each time slot are made in an online fashion without knowledge of future energy reduction requirements. Through theoretical analysis, the authors show that their mechanism is fair and truthful and has a bounded competitive ratio in social cost saving. The authors extend their work in Sun et al. (2016b) for tenants with delay-tolerant batch jobs. They design an online reverse auction mechanism where both tenant jobs and emergency DR signals can extend over multiple time slots. Tenants bid their valuation function (cost for energy reduction), and based on these valuation functions, the operator assigns energy reduction amounts to the tenants and the diesel generator such that the operator's cost is minimized. The operator then calculates the reward for each participating tenant. The authors also develop an online scheduling algorithm for tenants to maximize their individual utility. Zhao et al. (2016) developed a reverse auction called TECH that considers the cooling infrastructure of the MTDC as well. The authors use a server heat recirculation model to find the optimum temperature of the supply air such that the server temperature does not exceed a threshold while consuming minimum energy. Tenants submit bids containing the servers they want to power down (energy reduction) and the desired compensation for it (cost), similar to other bidding mechanisms. The operator then selects winners and calculates their payments such that the energy reduction target, cost constraints (the operator's cost is less than that of using the diesel generator for energy reduction), and temperature constraints (server temperature below a threshold) are met.
Chen et al. (2015) propose ColoEDR, a pricing mechanism based on parameterized supply function bidding. An advantage of using the supply function is that tenants do not have to reveal their private cost information to the operator as they did in Truth-DR and FairDR. The basic idea is that tenants bid for energy reduction during emergency situations using a parameterized supply function. Through a parameter in this supply function, they can express their energy reduction flexibility/willingness. The operator then chooses a market clearing price and communicates it to tenants. Tenants can now use the price from the operator in their supply function to calculate individual required energy reductions and payments for it. The authors also present mathematical analysis of the efficiency of ColoEDR when adapted to economic demand response, where no fixed energy reduction is required.
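The market-clearing step described above can be made concrete with a small sketch. It assumes, purely for illustration, a parameterized supply function of the form s_i(p) = delta - b_i / p, a form that appears in the supply function bidding literature; the exact function used by ColoEDR is not restated in this survey. Here `delta` is a common offset chosen by the operator, `b_i` is tenant i's bid (a larger bid signals less willingness to reduce), and the function name `clear_market` is hypothetical.

```python
def clear_market(bids, delta, target_reduction):
    """Find the price p at which sum_i (delta - b_i / p) equals the target,
    then derive each tenant's reduction and payment from that price.

    Assumes positive bids and the illustrative supply function
    s_i(p) = delta - b_i / p."""
    n = len(bids)
    if n * delta <= target_reduction:
        raise ValueError("target exceeds what this supply function can deliver")
    # Solve n*delta - sum(b)/p = target for p.
    price = sum(bids) / (n * delta - target_reduction)
    reductions = [delta - b / price for b in bids]
    payments = [price * s for s in reductions]
    return price, reductions, payments

# Three tenants; the operator must shed 18 kWh in total.
price, reductions, payments = clear_market([1.0, 2.0, 3.0], delta=10.0,
                                           target_reduction=18.0)
# price == 0.5, reductions == [8.0, 6.0, 4.0], payments == [4.0, 3.0, 2.0]
```

Each tenant communicates only the scalar bid `b_i`, not its cost function, which mirrors the privacy advantage noted above.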
Niu et al. (2016) set up the problem as an optimization problem with the goal of maximizing the social welfare (the sum of the profits of the tenants and the profit of the operator). The incentive mechanism resembles other works in that the operator rewards tenants for power reduction, but differs in that it models an MTDC as a cooperative environment. Tenants and operator cooperate to achieve the goal of emergency DR. During an emergency DR event, the operator bargains with tenants over the amount of power they need to reduce and the desired compensation for it. The authors propose a solution based on the Nash bargaining solution under concurrent bargaining of the operator with all tenants. They prove that their solution is max-min fair, Pareto efficient, and maximizes social welfare. The authors extend their work in Guo et al. (2018a) to consider sequential bargaining, where the operator bargains with tenants one by one. They compare sequential and concurrent bargaining and show that concurrent bargaining always maximizes social welfare, whereas sequential bargaining may not, and its outcome depends on the order in which bargaining takes place. They also argue that reverse auction type mechanisms (Chen et al. 2015; Sun et al. 2015; Zhang et al. 2015b) may not be suitable for MTDC, since the operator, who is the auctioneer, is self-interested and the outcome is highly dependent on the auctioneer selection.
In contrast to tenants bidding for energy reduction, Ahmed et al. (2015) proposed Contract-DR, in which the operator directly determines a set of contracts (energy reduction and reward amount) that are pushed to tenants. A tenant may participate by accepting one or none of the contracts. Similarly, Tran et al. (2015b, 2016c) propose incentive mechanisms where an operator starts out by broadcasting a reward rate. Tenants then let the operator know their energy reduction level (the one that maximizes the tenant's profit), and the operator updates the announced reward rate based on this knowledge. This continues in an iterative and distributed fashion until convergence. This mechanism works for non-strategic (price-taking) tenants, whereas the authors develop a bidding game to deal with strategic (price-anticipating) tenants. The authors use a similar mechanism in Tran et al. (2016b), where they consider a multi-tenant data center situated in a shared building, that is, a building that is not entirely dedicated to the data center but has other office spaces as well. This non-data-center portion occupies a significant area and shares the power infrastructure of the building. They refer to such buildings as mixed-use buildings (MUB). The main differentiating factor of this work is that the authors additionally consider energy reduction by the non-data-center space through increasing the set point of heating, ventilation, and air conditioning (HVAC). The operator initially provides reward rates along with energy reduction targets and a target deviation penalty to the offices, the data center tenants, and the diesel generator. Based on this information, all participants decide their own energy reduction to maximize their own profit. The operator then updates the reward rates and energy reduction targets, and the process repeats until convergence. The authors extend their work in Tran et al. (2018), where they also consider strategic tenants.
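The broadcast-and-respond loop for price-taking tenants can be sketched as below. This is a simplified illustration rather than the algorithm of Tran et al.: the tenants' quadratic cost model, the fixed step size, and the names `tenant_response` and `iterate_reward_rate` are assumptions. The operator sees only the reported reductions, never the tenants' cost coefficients.

```python
def tenant_response(reward_rate, cost_coeff):
    """Run privately by a price-taking tenant: maximize
    reward_rate*e - cost_coeff*e**2 over e >= 0 (assumed quadratic cost)."""
    return max(reward_rate, 0.0) / (2.0 * cost_coeff)

def iterate_reward_rate(responders, target_kwh, step=0.5, tol=1e-6,
                        max_rounds=1000):
    """Operator loop: broadcast a rate, collect reported reductions, and
    raise (lower) the rate when the total falls short of (exceeds) the target."""
    reward_rate = 0.0
    for rounds in range(1, max_rounds + 1):
        total = sum(respond(reward_rate) for respond in responders)
        gap = target_kwh - total
        if abs(gap) <= tol:
            return reward_rate, rounds
        reward_rate = max(reward_rate + step * gap, 0.0)
    return reward_rate, max_rounds

# Each responder hides its own cost coefficient from the operator.
responders = [lambda r: tenant_response(r, 0.5),
              lambda r: tenant_response(r, 1.0)]
rate, rounds = iterate_reward_rate(responders, target_kwh=30.0)
```

With these two tenants the total response is 1.5 times the rate, so the loop settles near a rate of 20. The fixed step size only converges when it is small relative to how strongly tenants react to the rate; a practical scheme would need to choose or adapt it.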
Geo-distributed data center: An MTDC operator might own more than one data center, and there may be tenants who have their servers in one or more of these data centers. In this case, tenants can migrate their workload, in addition to powering down idle servers, to reduce energy use in a particular data center. Sun et al. (2016a) have proposed BatchEDR, an online incentive mechanism, to enable emergency DR in geo-distributed MTDC. As in the previous work by the authors (Sun et al. 2016b), they consider the case of multiple emergency DR events and model tenants as having delay-tolerant workloads that can be shifted in time and processed later. During an emergency DR time period, the tenants bid with the following information at each data center: the workload that can be deferred, a factor to convert workload deferment to energy reduction, and the claimed cost of energy reduction. The tenant also bids the total workload that can be deferred across all data centers. Additionally, in the first bid, the tenant states the total workload deferment over all the time periods in which DR events occur. The authors develop a Vickrey-Clarke-Groves (VCG) mechanism to determine the energy reduction and reward for each tenant. Zhang et al. (2016) also extend their incentive mechanism, Truth-DR (Zhang et al. 2015b), to the case of geo-distributed MTDC with a similar formulation, solution, and results.
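To make the VCG idea concrete, the sketch below applies the textbook VCG payment rule to a much simpler, single-slot, single-site setting than BatchEDR handles. It assumes each tenant submits one all-or-nothing bid (energy reduction, claimed cost) and that any shortfall against the target is covered by a diesel generator at a fixed cost per kWh. The exhaustive subset search and all function names are illustrative assumptions.

```python
from itertools import combinations

def social_cost(selected, bids, target, generator_rate):
    """Claimed tenant costs plus generator cost for any remaining shortfall."""
    reduction = sum(bids[i][0] for i in selected)
    cost = sum(bids[i][1] for i in selected)
    shortfall = max(target - reduction, 0.0)
    return cost + generator_rate * shortfall

def best_allocation(candidates, bids, target, generator_rate):
    """Exhaustively find the subset of candidates with minimum social cost."""
    best_set = ()
    best_cost = social_cost((), bids, target, generator_rate)
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            cost = social_cost(subset, bids, target, generator_rate)
            if cost < best_cost:
                best_set, best_cost = subset, cost
    return best_set, best_cost

def vcg(bids, target, generator_rate):
    """Pay each winner the externality it removes: the optimal cost without
    that tenant minus the cost borne by everyone else when it participates."""
    everyone = list(range(len(bids)))
    winners, cost_all = best_allocation(everyone, bids, target, generator_rate)
    payments = {}
    for i in winners:
        others = [j for j in everyone if j != i]
        _, cost_without_i = best_allocation(others, bids, target, generator_rate)
        payments[i] = cost_without_i - (cost_all - bids[i][1])
    return winners, payments

# Three tenants each offer 10 kWh at different claimed costs; target is 20 kWh.
winners, payments = vcg([(10, 3), (10, 5), (10, 8)], target=20,
                        generator_rate=1.0)
# winners == (0, 1); each winner is paid 8.0
```

Both winners are paid 8, the cost of the best excluded alternative, independent of their own claimed cost. This independence is what removes the incentive to misreport costs.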
All works in emergency demand response of single or geo-distributed MTDC are summarized and compared in Table 4.
Paper | Objective | Solution | Technique | Result |
---|---|---|---|---|
Ahmed et al. (2015) | Min. operator cost | Contract-DR | Operator sets contracts directly | Cost minimized |
Chen et al. (2015) | Min. social cost | ColoEDR | Operator chooses a market clearing price | Close to the optimal social cost |
Guo et al. (2018a) | Max. social welfare | Nash bargaining solution | Cooperatively bargain between operator and tenants | Bargaining improves social welfare |
Niu et al. (2016) | Max. social welfare | Nash bargaining solution | Cooperatively bargain between operator and tenants | Concurrent bargaining maximizes social welfare |
Sun et al. (2015) | Max. social cost saving | FairDR | Bidding | Fairness is guaranteed |
Sun et al. (2016a) | Min. social cost | BatchEDR | Bidding | 32% higher social cost saving |
Sun et al. (2016b) | Min. social cost | Online mechanism design | Bidding | Close to offline optimal social welfare |
Tran et al. (2015b) | Min. social cost | EPM and SWO | Operator sets reward rate iteratively | Optimal social cost |
Tran et al. (2018) | Min. mixed-use building cost | MECH-SA and MECH-NA | Operator sets reward rate iteratively | Optimal mixed-use building cost |
Tran et al. (2016b) | Min. mixed-use building cost | DAMESH | Operator sets reward rate iteratively | Cost is smaller compared to baseline |
Tran et al. (2016c) | Min. social cost | EPM and SWO | Operator sets reward rate iteratively | Optimal social cost |
Wang et al. (2015) | Min. weighted tenant cost | P-RAA and D-RAA | Operator selects a set of tenants for energy-saving | 78% average energy-saving |
Wang et al. (2017) | Min. tenant cost | Co-Colo | Bidding | Improved resource utilization/energy efficiency |
Zhang et al. (2015b) | Min. social cost | Truth-DR | Bidding | Close to the optimal social cost |
Zhang et al. (2016) | Min. social cost | Truth-DR | Bidding | 20% to 60% cost saving |
Zhao et al. (2016) | Max. tenant energy reduction | TECH | Bidding | 20% higher energy reduction |
MTDCs have lagged behind operator-owned data centers in adopting renewable energy sources and making their data centers green (Greenpeace 2017). With growing concerns about climate change and a push from tenants (like Apple (Apple 2017), Facebook (Facebook 2017), and Google (Google 2017)) committed to becoming 100% renewable powered, it is increasingly important for MTDCs to decrease their carbon footprint and become sustainable. An MTDC can become sustainable by reducing “brown” energy use and increasing “green” energy use, or by reducing its overall energy use by being more energy efficient. However, using only green energy but using it very inefficiently is not sustainable, because it wastes energy that could be used by others.
5.1.1 Problem Description. Research works in this category try to minimize the carbon footprint of MTDC by taking into account the carbon emission per kilowatt-hour (measured in g/kWh) of the fuel mix used by the grid to produce electricity. There is, however, a cost associated with being green; for example, an operator would have to give financial rewards to tenants for the energy reduction required for carbon emission reduction, or invest in on-site renewable energy sources to decrease reliance on grid electricity. Works that focus on minimizing carbon emission must therefore also take operating budget constraints into account.
5.1.2 Existing Solutions. Islam et al. (2014) looked into the problem of reducing the carbon footprint of an MTDC by reducing energy use whenever possible. They proposed a reverse auction incentive mechanism called GreenColo, where the operator rewards tenants for energy use reduction. Tenants submit energy reduction bids (amount of energy reduction and proposed payment for it), and the operator decides the winning bids to minimize the carbon footprint of the MTDC based on the operator's long-term budget. Carbon emission is calculated from the type of fuel and the fuel mix (which changes by time of day) used by the utility that provides electricity to the MTDC. An online algorithm is developed to keep track of the reward as well as minimize the carbon footprint at each time interval such that the cumulative reward is less than the operator's budget. The authors evaluate their algorithm through trace-based simulation and demonstrate that GreenColo decreases carbon emission by 18% while tenants can lower their cost by 25% without the operator increasing its operational expense. This is because tenants reduce energy consumption, which lowers the electricity bill; the savings are passed back to tenants as rewards, while the operator's total cost remains the same. Mahmud and Iyengar (2016) investigated geographical load balancing (GLB) to minimize the overall carbon footprint in a hybrid (operator-owned + multi-tenant) data center setting. The focus is on large IT organizations that have servers in their own data centers as well as in leased MTDCs. The authors develop a distributed resource management algorithm called CAGE (Carbon and Cost Aware GEographical Job Scheduling) that an organization can use to make GLB decisions based on reduction of carbon emission and cost of electricity while satisfying performance guarantees. The price of electricity generally does not correlate with carbon emission, so the authors introduce a carbon-cost parameter in their problem formulation to capture the tradeoff between carbon emission minimization and electricity cost minimization. This parameter can be set by the organization independently to match its carbon reduction target and budget.
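The role of a carbon-cost parameter can be illustrated with a toy site-selection rule. This is not the CAGE algorithm: it assumes a single job with known energy use, ignores performance guarantees, and scores each site by electricity price plus a weight times carbon intensity. The site data, the weight values, and the name `pick_site` are made up for the example.

```python
def pick_site(sites, energy_kwh, carbon_weight):
    """Choose the site minimizing electricity cost plus weighted carbon.

    sites: list of (name, price in $/kWh, carbon intensity in g/kWh)
    carbon_weight: assumed tradeoff parameter in $ per gram of CO2
    """
    def objective(site):
        _, price, intensity = site
        return energy_kwh * (price + carbon_weight * intensity)
    return min(sites, key=objective)[0]

sites = [("coal_heavy", 0.05, 800.0), ("hydro", 0.09, 50.0)]

# With no weight on carbon the cheaper, dirtier site wins;
# a modest weight flips the decision to the cleaner site.
cost_only = pick_site(sites, energy_kwh=100.0, carbon_weight=0.0)
carbon_aware = pick_site(sites, energy_kwh=100.0, carbon_weight=0.0001)
```

Because cheap electricity and clean electricity do not coincide in this example, the single weight lets an organization move between the two extremes according to its own target and budget.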
5.2.1 Problem Description. Works in this category focus on reducing the electricity use of an MTDC. Their primary objective is to minimize the operating cost of an MTDC by cutting the electricity bill. Overall, this translates into the broader goal of reducing the energy consumption of MTDCs through coordination, making them more sustainable.
5.2.2 Existing Solutions. Islam et al. (2015) proposed REward for COst reduction (RECO), an incentive mechanism where the operator offers a reward rate to encourage tenants to reduce their energy use. The objective is to minimize the operating expense of the MTDC: the electricity cost plus the rewards to tenants. The authors identify three challenges in designing such a reward program: (1) the dynamic operating conditions of an MTDC (changing outside temperature, cooling efficiency, and on-site solar generation), (2) electricity pricing with a volume charge and a demand charge, and (3) the unknown tenant response to rewards. Their solution addresses each of these challenges. Time-varying PUE calculations based on the outside ambient temperature and an autoregressive moving average (ARMA) model to predict on-site solar energy generation are used to capture the dynamics of MTDC operation. The peak power demand of an MTDC (a significant factor in the electricity bill) can only be known at the end of the billing cycle. The authors propose tracking the peak power of the MTDC and using it as feedback while setting the reward rate, for example, to offer a higher reward if the power consumption in the next time slot is predicted to be greater than the tracked peak power. Similarly, they develop a parameterized response function that is updated online to learn the tenants' response to rewards. Finally, they design an online heuristic algorithm to set the reward rate and validate their solution with a prototype as well as simulations. Zhang et al. (2015a) also focus on minimizing the electricity cost of the MTDC, taking into account the non-trivial utility pricing that has a volume charge and a demand charge for electricity. They design online algorithms to solve the minimization problem through two different approaches: (1) pricing, where the operator sets a reward rate, tenants let the operator know their energy reduction interest, and the operator selects tenants to reduce energy, and (2) auction, where tenants bid for energy reduction (energy reduction and payment for it) and the operator decides the winning bids.
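The effect of a demand charge, and why tracking the running peak matters, can be seen in a short calculation. The sketch below assumes a bill made of a volume charge (per kWh) plus a demand charge (per kW of the highest power drawn in the billing cycle); the prices, the hourly trace, and the function names are illustrative and not taken from the works above.

```python
def electricity_bill(power_kw, slot_hours, energy_price, demand_price):
    """Volume charge on total energy plus demand charge on the peak power."""
    energy_kwh = sum(p * slot_hours for p in power_kw)
    return energy_price * energy_kwh + demand_price * max(power_kw)

def should_raise_reward(tracked_peak_kw, predicted_kw):
    """Peak-tracking rule of thumb: only a slot predicted to exceed the
    peak seen so far would increase the demand charge."""
    return predicted_kw > tracked_peak_kw

trace = [100.0, 120.0, 200.0, 110.0]   # kW in four one-hour slots
shaved = [100.0, 120.0, 150.0, 110.0]  # tenants shed 50 kW in the peak slot

baseline_bill = electricity_bill(trace, 1.0, energy_price=0.10, demand_price=10.0)
shaved_bill = electricity_bill(shaved, 1.0, energy_price=0.10, demand_price=10.0)
```

Here shedding 50 kW for a single hour saves about $5 in volume charge but about $500 in demand charge, which is why a reward is worth far more to the operator in a slot that would set a new peak than in any other slot.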
Inefficiencies occur in MTDC due to the lack of coordination among tenants as well as between a tenant and the operator. Guo and Pan (2015) looked into this problem and designed a reward program to coordinate energy management for improved energy efficiency and sustainability of MTDC. Tenants are rewarded by the operator for energy reduction such that the MTDC electricity bill is minimized. The authors formulate the problem as a convex optimization problem in which the sum of the operator cost and tenant cost is minimized such that each tenant's performance SLA is not violated (delay is below a threshold). This optimization problem can be solved easily by the operator in a centralized fashion, but doing so requires the operator to know every tenant's cost function (which is not feasible in an MTDC), and the solution would not be scalable. Using the alternating direction method of multipliers (ADMM), the authors solve the optimization problem in a decentralized fashion. They argue that the ADMM-based solution is better than the commonly used approach of dual decomposition in that it converges faster, does not require the objective function to be strictly convex, and avoids the problem of choosing an appropriate step size as in the subgradient update method. The authors extend their work in Guo et al. (2017b) to take into consideration the uncertainties in a tenant's workload as well as the electricity cost from the utility. They also extend the formulation to account for tenants with delay-tolerant workloads that span multiple time slots. The problem is restated as a stochastic optimization problem, where the objective is to minimize the expected total cost over a long period of time. They develop an online centralized solution based on Lyapunov optimization and also a practical distributed solution that applies ADMM, building on their previous work.
Building on previous work on tenant coordination in MTDC, Guo et al. (2018b) investigate the scenario where MTDCs buy power directly from a wholesale electricity market rather than from the local utility, to avoid the costs added by the retail supplier. They argue that cooperation among tenants can lead to a lower electricity cost than acting non-cooperatively. The basic idea is that, due to statistical multiplexing, tenant power consumption is less uncertain when aggregated than when each tenant is considered individually. This decrease in uncertainty can result in a lower cost in the day-ahead wholesale market. The authors formulate the problem as a cooperative game to model cooperative electricity procurement, where each tenant tries to minimize its own electricity cost. The proposed aggregation and cost allocation method is shown to be both fair and stable. A simulation based on a Google cluster trace shows a saving of 18.03% in the average hourly electricity cost.
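The statistical multiplexing argument can be checked with a short calculation. The sketch assumes tenant demands are statistically independent and that a buyer hedges uncertainty by procuring a margin of `k` standard deviations above the mean demand; neither assumption is taken from Guo et al. (2018b), and the function name `reserve_needed` is hypothetical.

```python
import math

def reserve_needed(stds_kw, k=2.0, cooperative=True):
    """Safety margin (kW) bought on top of the mean demand.

    Acting alone, every tenant buys k times its own standard deviation.
    Acting together, the margin covers the aggregate, whose standard
    deviation is the root of the summed variances (independence assumed)."""
    if cooperative:
        return k * math.sqrt(sum(s * s for s in stds_kw))
    return k * sum(stds_kw)

stds = [3.0, 4.0]  # per-tenant demand standard deviations in kW
alone = reserve_needed(stds, cooperative=False)    # 2 * (3 + 4) = 14 kW
together = reserve_needed(stds, cooperative=True)  # 2 * 5 = 10 kW
```

The pooled margin is never larger than the sum of the individual margins, and the gap grows as more tenants with uncorrelated demand join. How the resulting saving is split among tenants is the cost allocation question addressed by the cooperative game.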
All works related to sustainability of MTDC are summarized in Table 5.
Paper | Objective | Solution | Technique | Result |
---|---|---|---|---|
Guo and Pan (2015) | Min. social cost | A decentralized algorithm | Operator and tenants iteratively solve the optimization problem | 27% average cost saving |
Guo et al. (2017b) | Min. social cost | Online decentralized algorithm | Operator and tenants iteratively solve the optimization problem | 27% average cost saving |
Guo et al. (2018b) | Min. tenant cost | A cost allocation solution | Tenants cooperate to reduce their aggregate power cost | 18.03% average cost saving |
Islam et al. (2014) | Min. carbon footprint | GreenColo | Bidding | 18% decrease in carbon emission |
Islam et al. (2015) | Min. operator cost | RECO | Operator sets reward rate | 27% reduction in operator cost |
Mahmud and Iyengar (2016) | Min. carbon footprint and cost | CAGE | Distribute workload to geo-distributed data centers | 36% decrease in carbon emission |
Zhang et al. (2015a) | Min. operator cost | Various online algorithms | (1) Operator sets reward rate, (2) Bidding | Close to offline optimal cost |
In this section, we discuss works that aim to enable safe power oversubscription of MTDC. Whereas for DR participation an MTDC reduces power consumption when signaled by the utility, for safe oversubscription an MTDC must reduce power consumption when it reaches its power limit to avoid sustained power overload.
6.1.1 Problem Description. MTDCs typically have a power-based pricing model in which they charge tenants per kilowatt of power subscription. Tenants can save on power subscription cost if they oversubscribe, that is, install more servers than the power capacity allows. Tenant-side power oversubscription happens at the lower levels of the power hierarchy, the rack level or cage (few racks) level, where the power use fluctuations (power dynamic range) and the probability of power overload are greater than at higher levels of the power hierarchy (Fan et al. 2007; Wu et al. 2016). If tenants exceed their power draw limit, then the circuit breaker may trip, leading to a power outage in their server rack, or the operator may penalize tenants for power over-utilization and violating the SLA (TekLinks 2017; Verizon 2014). How can tenants safely oversubscribe the power capacity they have bought?
6.1.2 Existing Solutions. There are existing control mechanisms designed for operator-owned data centers that can be used by tenants to enable power oversubscription. The basic building block is server level power capping (Gandhi et al. 2009; Lefurgy et al. 2007, 2008; Zhang and Hoffmann 2016) through techniques like CPU throttling, DVFS, and admission control to operate a server within a power constraint. Techniques that enable power capping at higher (PDU, cluster) level of the data center (Azimi et al. 2017; Bhattacharya et al. 2013; Fu et al. 2011; Lim et al. 2011; Raghavendra et al. 2008; Wang et al. 2012a; Wu et al. 2016) coordinate server-level power capping to minimize performance degradation. Tenants can apply these control methods to ensure that power consumption is always below the subscribed power budget.
An alternative approach to power oversubscription, without affecting server performance or workload, is to use UPS as an energy buffer that can provide the excess power. Govindan et al. (2011) looked into tapping the stored energy in central UPS batteries to handle peak power draw in data centers. This may not be directly applicable to tenants as they have no control over the central UPS in an MTDC. However, distributed smaller UPS at rack level or even server level rather than a central UPS are becoming more popular in data centers. Distributed UPS avoids a central point of failure and can also avoid the AC-DC-AC double conversion loss at the central UPS (Kontorinis et al. 2012). Previous work (Govindan et al. 2012; Kontorinis et al. 2012; Palasamudram et al. 2012) has looked into utilizing such distributed UPS to oversubscribe power (or, equivalently, under provision the power infrastructure). Such techniques can be applied by tenants in MTDC to oversubscribe power and save cost.
Palasamudram et al. (2012) proposed using batteries for tenants to oversubscribe power and reduce power cost. Although this work focuses on content delivery networks (CDN) deployed in multiple MTDCs around the world, the technique is applicable to other types of tenants as well. The idea is for tenants to put their own batteries at the server or rack level. Tenants do not use all of the subscribed power all the time. Hence, when tenants are using less power than they subscribed to, they charge the battery, and when power use is high and exceeds the power subscription, they discharge the battery to meet the excess demand rather than overdrawing from the data center power source. One advantage of this approach is that tenants do not have to cap server power, and hence their performance is not affected. Total energy consumption does not change in this case. However, procuring batteries incurs capital expenditure (CapEx), although it decreases the monthly leasing fee (OpEx) of tenants due to oversubscription. Therefore, the authors look at the total cost (leasing fee + amortized cost of the batteries) and analyze the tradeoff between CapEx and OpEx.
Palasamudram et al. formulate two optimization problems. The first, power supply minimization, finds the minimum power to subscribe for a tenant that already has a battery of a certain capacity. The second, total cost minimization, simultaneously finds the power to subscribe and the battery capacity to purchase so as to minimize the total cost (CapEx + OpEx). Both problems are formulated as linear programs that can be solved efficiently. Given the power trace of a tenant, the optimization problem is solved offline. The proposed techniques are evaluated on Akamai's CDN traces from 22 different cluster locations in the U.S. over a 25-day period. The authors found that the amount of safe power oversubscription is a concave function of battery size and saturates beyond a certain battery capacity. A battery that can last for 5 minutes allows 7.5% oversubscription, whereas a 40-minute battery allows up to 16% oversubscription, which remains flat at 16% as the battery size is increased further. This is because, under excessive oversubscription, the battery is discharged more often than it is charged; there is less time to recharge, and the battery runs out of charge regardless of its size. The authors also analyze the total cost tradeoff between CapEx and OpEx. Cost savings depend on several battery factors (such as unit price, technology used, and expected lifetime), but the authors show that, in a typical case, cost savings of up to 13.9% are possible with reasonably sized batteries.
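The power supply minimization problem can be illustrated with a small simulation. The sketch below is not the authors' linear program: it substitutes a bisection search over the subscribed power, which works here because feasibility is monotone in the subscription. It assumes a lossless battery with no charge or discharge rate limits that starts fully charged, and the function names `is_feasible` and `min_subscription` are hypothetical.

```python
def is_feasible(trace_kw, subscribed_kw, battery_kwh, step_h):
    """Return True if the battery covers every excursion above subscribed_kw."""
    charge = battery_kwh  # assumption: the battery starts fully charged
    for demand in trace_kw:
        # Positive surplus charges the battery; negative surplus discharges it.
        surplus_kwh = (subscribed_kw - demand) * step_h
        charge = min(battery_kwh, charge + surplus_kwh)
        if charge < -1e-9:  # battery exhausted: the tenant would overdraw
            return False
    return True


def min_subscription(trace_kw, battery_kwh, step_h, tol=1e-6):
    """Smallest subscribed power (kW) that never overdraws, found by bisection."""
    lo, hi = 0.0, max(trace_kw)  # subscribing to the peak is always feasible
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(trace_kw, mid, battery_kwh, step_h):
            hi = mid
        else:
            lo = mid
    return hi
```

For a made-up hourly trace of 50, 50, 100, 50, 50 kW, a 20 kWh battery lets the tenant subscribe to roughly 80 kW instead of the 100 kW peak. Back-to-back peaks leave no time to recharge, which is one way to see why the benefit of a larger battery eventually saturates.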
6.2.1 Problem Description. An MTDC operator would benefit from oversubscribing the power infrastructure to increase power utilization and lower the provisioning cost. For example, an MTDC that operates on average at 40% of its peak capacity has effectively double the provisioning cost per watt of an MTDC that operates on average at 80% power utilization (Barroso et al. 2013). Leasing out more power capacity to tenants is also more profitable to the operator. But as soon as the power hierarchy is oversubscribed, there is always a chance of power overload. Power capping solutions have been proposed to avoid coincidental power peaking of servers in a cluster and thereby prevent power overload (Azimi et al. 2017; Wang et al. 2012a; Wu et al. 2016). However, the MTDC operator lacks control over tenant servers, and there is no mechanism in place to inform tenants about a power overload. Randomly cutting off power to tenants is unacceptable (Islam et al. 2016), as it causes service disruption to tenants and may damage the reputation of the MTDC operator, causing tenants to move to another MTDC. The problem is to find indirect mechanisms for power capping in an oversubscribed MTDC that avoid power overload situations while keeping the effect on tenants both controlled and minimal.
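The factor of two in the utilization example follows from simple arithmetic. As an illustration only, assume the facility costs a fixed amount $c$ per watt of provisioned capacity (a symbol introduced here, not taken from the cited work) and let $u$ be the average utilization:

```latex
\text{cost per utilized watt} = \frac{c}{u},
\qquad
\frac{c/0.4}{c/0.8} = \frac{2.5\,c}{1.25\,c} = 2 .
```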
6.2.2 Existing Solutions. Two solutions have been proposed to address the problem of safe oversubscription in MTDC.
COOP: Islam et al. (2016) proposed a market mechanism called COOrdinated Power (COOP) based on supply function bidding. The basic idea is to offer rewards to tenants to incentivize them to reduce power during a power overload. They consider power oversubscription at both the PDU level and the central UPS level. The problem is formulated as minimizing the total cost to tenants (due to power reduction) subject to the constraint that all power capping requirements are satisfied. With supply function bidding, tenants do not have to reveal their private cost functions. The approach is also efficient, as tenants only need to bid one parameter of a predetermined supply function to specify the amount of power they are willing to reduce at different reward rates. Tenants are suppliers of power reduction, and the operator is in effect buying back power from tenants.
The mechanism works as follows. The operator continually monitors the power use of all tenants. Whenever the aggregate power consumption exceeds the capacity at any oversubscribed level, the operator broadcasts the emergency to all tenants, along with the form of the supply function and the maximum power reduction possible for each tenant (usually its current power usage). Tenants then bid for power reduction depending on their cost of performance degradation. A key point is that tenants do not know the operator's reward rate at this point, so a tenant's bid reflects its power reduction flexibility (its power reduction at different reward rates) rather than the actual power to be reduced. After receiving bids from the tenants, the operator chooses the lowest reward rate that satisfies the power capping constraint at each level (PDU, UPS) and communicates it to the tenants. Tenants determine their power reduction based on the reward rate and cap power until notified by the operator to resume normal operation. Based on the power capping time interval, the power reduced, and the reward rate, the operator calculates the reward for each tenant.
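The clearing step of this mechanism can be sketched for a single oversubscribed level. This is a minimal illustration, not the authors' implementation. The one-parameter supply function `max(0, max_cut - bid / rate)` is an assumed form chosen for the example, and the names `supply` and `clearing_rate` and all numeric values are hypothetical. Because the offered reduction never decreases as the reward rate grows, the lowest sufficient rate can be found by bisection.

```python
def supply(bid, max_cut_kw, rate):
    """Power (kW) a tenant offers to shed at a reward rate (assumed form)."""
    return max(0.0, max_cut_kw - bid / rate)


def clearing_rate(bids, max_cuts_kw, required_kw, rate_hi=1e6, tol=1e-9):
    """Lowest reward rate whose total offered reduction meets required_kw.

    Returns None when even rate_hi cannot procure enough reduction.
    """
    def total(rate):
        return sum(supply(b, m, rate) for b, m in zip(bids, max_cuts_kw))

    if total(rate_hi) < required_kw:
        return None
    lo, hi = 0.0, rate_hi
    # total() is nondecreasing in the rate, so bisection finds the threshold.
    while hi - lo > tol * max(1.0, hi):
        mid = (lo + hi) / 2
        if total(mid) >= required_kw:
            hi = mid
        else:
            lo = mid
    return hi
```

With bids of 2 and 4, a maximum cut of 10 kW per tenant, and 8 kW of required reduction, the clearing rate comes out at about 0.5, at which the two tenants shed about 6 kW and 2 kW. The reward for each tenant would then be the rate multiplied by its reduction and the duration of the capping interval.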
The authors evaluated COOP on a prototype of 11 servers hosting 5 tenants with realistic workloads. They considered three oversubscription levels: aggressive (20%), moderate (15%), and conservative (10%). Tenants reduced their power consumption through the DVFS technique available on the servers. They evaluated COOP for two cases, price-taking tenants (who bid without considering the effect of their bid on the operator's reward rate) and price-anticipating tenants (who bid to maximize profit, considering the effect of their bid on the predicted operator's reward rate), and compared it with the optimal case (OPT), in which the operator can directly control tenant servers and knows all their cost functions (which is not practical in real MTDCs). Figure 7(a) shows the total performance cost of tenants, for which COOP is quite close to the optimal. Reward rates offered to tenants at different oversubscription levels are shown in Figure 7(b). The reward rate increases with the level of oversubscription, as the amount of power reduction required increases. Moreover, the operator has to offer higher rewards when tenants are intelligent (price-anticipating) and can predict how rewards are offered.
The authors argue that COOP is profitable to both tenants and the operator, as shown in Figure 8. They show that COOP increases profit for the operator, which is able to sell more power, and increases profit for tenants through the rewards they receive. However, when an operator oversubscribes aggressively, the profit actually decreases, as a higher reward rate must be offered for large power reductions from tenants during power overload.
LOCAP: We proposed a dynamic pricing method called LOCAP (LOCAl price for Power) (Malla and Christensen 2017) to enable safe power oversubscription. We consider the case of energy-based pricing, where tenants are charged separately for space/power capacity as a subscription fee (dollars per kW) and for energy use at a local price (dollars per kWh). In this scheme, tenants pay for their actual consumption, which incentivizes them to be energy efficient. The key idea for enabling oversubscription is to make the local price dynamic and vary it to reflect total power demand in the data center. During normal operation, the local price can be fixed at a certain level. In case of a power overload, the operator can increase the local price to reduce power demand. The increase in local price must be just enough to keep the power consumption below the power limit.
We consider data center level power oversubscription and formulate the problem as a utility maximization problem with the constraint that total power use does not exceed the data center critical power capacity. This problem can be solved by the operator in a centralized fashion, but that requires the operator to know the utility function of each individual tenant. By taking a distributed approach and solving the dual problem using the iterative gradient projection method, the operator does not need to know the utility function of each tenant. The result is a simple and intuitive algorithm to update the local price: when power demand exceeds the capacity, increase the local price (at a rate proportional to the amount by which demand exceeds supply), and vice versa. Meanwhile, tenants (who are self-interested) consume power to maximize their own profit. During periods of high power price, it may be more profitable for tenants to reduce power consumption and incur some performance degradation rather than pay the high price for power.
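The price update described above can be sketched as a short iteration. This is an illustrative sketch under stated assumptions, not the published LOCAP implementation. Each tenant is assumed to have a logarithmic utility `weight * log(1 + x)` for consuming `x` kW, the price is projected onto a floor equal to the normal-operation price, and the step size, weights, and function names are all hypothetical.

```python
def tenant_demand(price, weight, max_kw):
    """Power (kW) maximizing weight*log(1 + x) - price*x on [0, max_kw].

    Setting the derivative weight/(1 + x) - price to zero gives
    x = weight/price - 1, which is then clipped to the allowed range.
    """
    return min(max_kw, max(0.0, weight / price - 1.0))


def update_price(price, total_demand_kw, capacity_kw, step, base_price):
    """One projected gradient step on the local price (the dual variable)."""
    return max(base_price, price + step * (total_demand_kw - capacity_kw))


def run_locap(weights, max_kws, capacity_kw, base_price, step=1e-4, iters=5000):
    """Iterate price updates and tenant responses; return (price, demands)."""
    price = base_price
    for _ in range(iters):
        demand = sum(tenant_demand(price, w, m) for w, m in zip(weights, max_kws))
        price = update_price(price, demand, capacity_kw, step, base_price)
    demands = [tenant_demand(price, w, m) for w, m in zip(weights, max_kws)]
    return price, demands
```

The operator only observes aggregate demand, and each tenant only observes the price, so neither side needs the other's private information. When capacity is ample, the projection keeps the price at its base level; during an overload, the price rises until aggregate demand settles at the capacity.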
To evaluate our proposal, we simulated an MTDC with 3 tenants having 10,000 servers each. Using real-world workload traces and identical scenarios, we compared LOCAP with COOP and with a case without any oversubscription (NOOP). Figures 9(a) and 9(b) show profit and total data center energy consumption, respectively, for the three cases. Again, oversubscription is profitable to the operator as well as to tenants. Because LOCAP charges for power use, it incentivizes tenants to be energy efficient all the time and hence lowers the total energy consumption of the MTDC as a whole. Both LOCAP and COOP are able to cap the total power consumption of the MTDC. As shown in Figure 9(c), during a power emergency, COOP actively offers rewards to tenants, while LOCAP raises the local price to avoid a power overload situation.
All works relating to power oversubscription of MTDC are summarized and compared in Table 6.
Paper | Objective | Solution | Technique | Result
---|---|---|---|---
Islam et al. (2016) | Min. tenant cost | COOP | Operator sets reward rate | Close to optimal performance cost and power reduction |
Malla and Christensen (2017) | Max. tenant utility | LOCAP | Dynamic power pricing | 34% energy reduction |
Palasamudram et al. (2012) | Min. power supply; Min. power cost | Solve linear programming | Use batteries during power overload | 40-minute battery allows 16% oversubscription; 13.9% cost savings |
Several surveys of power management techniques for data centers have been published in recent years (Beloglazov et al. 2011; Cavdar and Alagoz 2012; Ge et al. 2013; Hammadi and Mhamdi 2014; Jin et al. 2016; Kong and Liu 2014; Liu and Zhu 2010; Mastelic et al. 2014; Mittal 2014; Orgerie et al. 2014; Rahman et al. 2014; Shuja et al. 2016; Zakarya and Gillam 2017); however, only one survey has focused on MTDCs (Oo et al. 2017). In this section, we briefly review these published surveys and show how our survey adds to the body of knowledge by reviewing additional key works and focusing on an important topic that has been neglected in existing surveys: power oversubscription in MTDCs.
Beloglazov et al. (2011) surveyed works in energy and power management of computer systems at different levels. Works at the hardware (including dynamic component deactivation and DVFS), operating system, virtualization, and data center (including workload consolidation) levels are described in detail. The authors point out that lower instantaneous power consumption does not necessarily mean lower energy consumption, as the same amount of computation can take longer at a lower power level. Consuming less energy can reduce the electricity bill, while consuming less power can, in addition to reducing the electricity bill, lower the cost/capacity of the required UPS, generators, power distribution units, and cooling infrastructure. Surveys by Mastelic et al. (2014), Orgerie et al. (2014), Mittal (2014), and Shuja et al. (2016) review literature that focuses on improving the energy efficiency of servers, storage, and networks, the building blocks of a data center. Mastelic et al. (2014), in addition to hardware (servers and networks), survey various research works on the energy efficiency of data center software (resource management systems and user applications) with a specific focus on cloud computing, while Mittal (2014) notes that integrating different solutions in a coordinated way for optimal energy efficiency of the entire data center is a challenge. A similar survey of works that improve the energy efficiency of large-scale computing systems, including clusters, grids, and cloud data centers, is conducted by Zakarya and Gillam (2017). Similarly, Liu and Zhu (2010) survey research works with the goals of energy efficiency, power capping, and thermal management of data centers. They also survey power management of high performance computing (HPC) systems, where the jobs are mostly noninteractive.
There are also research works that focus on greening the data center. They emphasize integrating on-site renewable energy sources and/or reducing the overall energy use of the data center. Cavdar and Alagoz (2012) and Kong and Liu (2014) survey power management techniques that focus on making data centers green. Rahman et al. (2014) survey the use of geographical load balancing for power management of data centers. The focus of their survey is on Internet-scale geo-distributed data centers that are connected to smart grids. Such data centers can utilize the dynamic features of smart grids, such as distributed generation, integration of various renewable energy sources, and demand response, to achieve different goals, including minimizing electricity cost, minimizing carbon emissions, and maximizing renewable energy usage. In addition to surveying works in energy efficiency, resource management, and temperature control, Jin et al. (2016) survey different metrics used to monitor and track the greenness of data centers.
Data center networks have also received considerable attention. The architectural evolution and energy efficiency of data center networks have been surveyed by Hammadi and Mhamdi (2014). Additionally, Ge et al. (2013) survey power-saving techniques in content delivery networks, which are used for fast and effective delivery of web content to users around the globe.
A comprehensive survey by You et al. (2017) of more than 30 surveys, from 2011 to the present, on the energy efficiency of data centers and cloud computing environments reflects the myriad research studies and surveys conducted on the subject. However, multi-tenant data centers have received less attention from the research community. Power management techniques that are suitable for owner-operated data centers cannot be directly applied, as the operator does not have control over a tenant's IT equipment. The survey by Oo et al. (2017) on coordination and power management in multi-tenant data centers is most closely related to our work. The authors mainly focus on market mechanisms that enable DR participation by MTDCs and/or make MTDCs more sustainable, while leaving out works on power oversubscription of MTDCs, which we consider essential. MTDC power oversubscription can increase utilization of resources and is profitable to the operator.
In contrast to previous work, we survey works on power management of multi-tenant data centers with a focus on power oversubscription. We formulate the problem and describe the challenges in solving it. We discuss the reward-based and dynamic pricing market mechanisms that have been proposed in the literature, highlighting their strengths and weaknesses. Finally, we propose possible future directions and open issues that remain to be addressed.
In this survey article, we have reviewed previous research on power management of MTDC. First, we explained why power management in MTDC is equally, if not more, important than in owner-operated data centers, despite the fact that this area has been largely overlooked by the research community. Then, to have a general understanding of an MTDC, its power hierarchy and economics were described with a detailed discussion on what it means to oversubscribe the power hierarchy and the motivations for oversubscription. Next, we introduced a new taxonomy to classify previous work.
Key findings from our survey are as follows:
To achieve safe power oversubscription of an MTDC, we need to better understand the power usage characteristics of tenants. In the future, it would be interesting to investigate the relationship between the oversubscription level and the probability of overloading, that is, the frequency of power overload events in an oversubscribed MTDC. Markov models have been used to model home electricity consumption (Ardakanian et al. 2011), and teletraffic theory has been used to size transformers in a distribution network for a given loss-of-load probability (LOLP) (Ardakanian et al. 2012). Similar analysis could be used to find the appropriate level of power oversubscription for a set of tenants that keeps the probability of a power overload event below a fixed value. This would help MTDC operators make capacity provisioning decisions for their data centers.
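As a rough illustration of the kind of analysis suggested here, the sketch below estimates the overload frequency directly from power traces rather than from a Markov or teletraffic model. It assumes that an oversubscription level of `level` means the total subscribed power equals `(1 + level)` times the physical capacity, and that per-tenant traces are time-aligned. The function names and any data passed to them are hypothetical.

```python
def overload_probability(traces_kw, subscribed_kw, level):
    """Fraction of time steps in which aggregate demand exceeds capacity."""
    # Assumed definition: subscribed power = (1 + level) * physical capacity.
    capacity = sum(subscribed_kw) / (1.0 + level)
    aggregate = [sum(step) for step in zip(*traces_kw)]
    overloads = sum(1 for a in aggregate if a > capacity)
    return overloads / len(aggregate)


def max_safe_level(traces_kw, subscribed_kw, target, levels):
    """Largest candidate level whose overload frequency stays within target."""
    safe = [lv for lv in levels
            if overload_probability(traces_kw, subscribed_kw, lv) <= target]
    return max(safe) if safe else None
```

An empirical frequency like this only describes the traces it was computed from. A stochastic model of tenant demand would be needed to extrapolate to overload probabilities that are too small to observe in a finite trace.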
Another future direction is to determine whether power overload events can be predicted. Time series predictive methods such as autoregressive moving average (ARMA) and machine learning techniques such as neural networks and support vector regression have been used for predicting renewable (solar and wind) energy production (Aksanli et al. 2012), home peak load (Singh et al. 2012), and data center network load (Prevost et al. 2011). Similar predictive models could be used for short-term (minutes, hours) and long-term (days) prediction of tenant power demand. The relationship between the accuracy of such models and the forecast length could be studied. Using such prediction models, we could develop predictive control methods for safe power oversubscription, rather than reactive control methods that detect power overload events and reduce their duration by power capping. If successful, a predictive control method would avoid power overload events altogether, preventing transient power spikes on the power infrastructure that would compromise the reliability of the data center. One major challenge to the analysis of tenant power consumption is the availability of MTDC power use data. We are not aware of any publicly available power use data for tenants in MTDCs.
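To make the idea of predictive control concrete, the sketch below fits a first-order autoregressive model, the simplest special case of ARMA, to an aggregate power series and checks whether the forecast crosses the capacity within a given horizon. It is a toy example under stated assumptions: the model order, the least-squares fit, and the names `fit_ar1`, `forecast`, and `predicts_overload` are choices made for illustration, not a method proposed in the literature surveyed here.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = cov_xy / var_x  # requires a series that is not constant
    c = mean_y - phi * mean_x
    return c, phi


def forecast(series, c, phi, horizon):
    """Iterate the fitted recurrence horizon steps past the last sample."""
    preds, last = [], series[-1]
    for _ in range(horizon):
        last = c + phi * last
        preds.append(last)
    return preds


def predicts_overload(series, capacity_kw, horizon):
    """True if any forecast value within the horizon exceeds the capacity."""
    c, phi = fit_ar1(series)
    return any(p > capacity_kw for p in forecast(series, c, phi, horizon))
```

An operator could use a positive prediction to start price or reward signaling before the overload occurs. How far ahead such a forecast remains accurate is exactly the open question raised above.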
Authors’ addresses: S. Malla and K. Christensen, Department of Computer Science and Engineering, University of South Florida, 4202 East Fowler Avenue, ENB 118, Tampa, Florida 33620; emails: sulavmalla@mail.usf.edu, christen@cse.usf.edu.
©2019 Association for Computing Machinery.
Publication History: Received January 2018; revised August 2018; accepted October 2018