2019 Dac SSD Vibration
Abstract
Vibration generated in modern computing environments such as autonomous vehicles, edge computing infrastructure, and data center systems is an increasing concern. In this paper, we systematically measure, quantify, and characterize the impact of vibration on the performance of SSD devices. Our experiments and analysis uncover that exposure to both short-term and long-term vibration, even within the vendor-specified limits, can significantly affect SSD I/O performance and reliability.

Keywords
SSD; Vibration; Reliability; Data Centers; Autonomous Vehicles

ACM Reference Format:
Janki Bhimani, Tirthak Patel, Ningfang Mi, and Devesh Tiwari. 2019. What does Vibration do to Your SSD?. In The 56th Annual Design Automation Conference 2019 (DAC ’19), June 2–6, 2019, Las Vegas, NV, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3316781.3317931

1 Introduction
There has been an increasing concern about the effect of noise and vibration on the performance of computing and storage infrastructures from data centers to autonomous vehicles [13, 15, 23]. Recent events have highlighted the significant disruptions caused by noise and vibration to operations of computing centers [11, 24]. The most notable and severely affected examples include the Nasdaq Nordic stock exchange data center in Finland (2018), the Microsoft Azure data center in Europe (2017), and the ING Bank data center in Romania (2016) [23, 25, 26]. As computing and storage devices will increasingly operate in harsh environments such as space exploration, edge computing, and autonomous vehicles [9, 13, 15, 16], the effects of vibration will continue to worsen.

Vibration has been shown to primarily affect the performance of hard disk drives (HDDs) because HDDs have moving mechanical parts that can be physically perturbed by vibration [5, 6]. However, HDDs are increasingly being replaced with solid state drives (SSDs) due to their lower I/O access latency and higher I/O bandwidth. SSDs have also been hypothesized to be less prone to vibration-related side effects because of the absence of any mechanical components [6, 21]. This work aims to perform a systematic study to investigate whether SSDs are resistant to the adverse effects of vibration. In particular, this is the first work to investigate the following research questions (RQs):

RQ1: What is the impact of vibration on SSD performance (e.g., I/O operation latency and bandwidth)?

RQ2: Does the impact of vibration on SSD performance vary across different SSD vendors and I/O operation types (e.g., read and write)?

RQ3: Is the performance of SSD devices sensitive to the length of vibration exposure?

In this work, we systematically measure, quantify, and characterize the impact of vibration on the performance of SSD devices. Our results show that exposure to vibration, even within the vendor-specified limits, can significantly affect SSD I/O performance. Our experiments discover that the degree of impact varies across vendors and workload types; in some cases, vibration can negatively affect the read/write tail latency by more than 10%, which is critical for safety in autonomous vehicles [15] and performance in data center computing environments [11]. Interestingly, we also observe that repeated exposure to short-term vibration has lingering after-effects on SSD performance even in the absence of vibration. On the other hand, long-term exposure to vibration may result in more than 30% performance degradation. Long-term exposure to vibration can lead to performance slowdowns and abrupt failures, although SSDs continue to function again after a restart.

During this study, we experimented for thousands of SSD-hours with close to one hundred SSDs from different vendors. By the end of it, many SSDs came out permanently bruised due to vibration, and some SSDs succumbed to its adverse effects, despite the vibration being within the vendor-specified limits. We analyzed a large amount of sensor and performance data from these SSDs via various I/O tools. However, we share only selected findings and observations that we could conclude with high statistical significance. Anonymized experimental data is being made publicly available at https://github.com/GoodwillComputingLab/SSDVibration for the research community to better understand, model, and mitigate the impact of vibration on SSDs.
This project was partially supported by the NSF Career Award CNS-1452751.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
DAC ’19, June 2–6, 2019, Las Vegas, NV, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6725-7/19/06...$15.00
https://doi.org/10.1145/3316781.3317931

2 Background
This section describes the architectural components of an SSD and prior works that study the impact of vibration on storage devices.

2.1 SSD Internal Components
Figure 1: Internal components of an SSD.
As mentioned earlier, SSDs provide higher I/O bandwidth and lower I/O operation latency than HDDs, and hence are becoming increasingly prevalent from data centers to autonomous systems. An SSD uses semiconductor chips to persistently store data, as opposed to an HDD, which stores data on magnetic platters. The absence of mechanical components distinguishes SSDs from conventional electro-mechanical HDDs, which contain spinning disks and movable read/write heads. As shown in Fig. 1, the two main components of an SSD are the flash chips and the flash memory controller. The flash chips consist of NAND flash cells that store data bits, and the flash memory controller manages all the I/O operations.
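To make the controller's role in managing I/O operations more concrete, the following is a minimal sketch of a toy page-mapped controller. It is not any vendor's algorithm (those are proprietary); the class name `ToyController`, its methods, and the policies (pick the least-erased block with a free page, reclaim the block with the fewest valid pages) are hypothetical simplifications chosen only for illustration.

```python
class ToyController:
    """Toy page-mapped flash controller (illustrative assumption, not a real design)."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.mapping = {}  # logical block address -> (block, page)
        self.valid = [[False] * pages_per_block for _ in range(num_blocks)]
        self.next_page = [0] * num_blocks    # next unwritten page in each block
        self.erase_count = [0] * num_blocks  # per-block wear indicator

    def _open_blocks(self):
        # Blocks that still have at least one unwritten page.
        return [b for b, p in enumerate(self.next_page) if p < self.pages_per_block]

    def _program(self, block, lba):
        # Flash pages are written once per erase cycle, in order.
        page = self.next_page[block]
        self.valid[block][page] = True
        self.next_page[block] += 1
        self.mapping[lba] = (block, page)

    def _garbage_collect(self):
        # Victim: the block holding the fewest valid pages.
        victim = min(range(len(self.valid)), key=lambda b: sum(self.valid[b]))
        if all(self.valid[victim]):
            raise RuntimeError("no reclaimable space")
        survivors = [lba for lba, (b, _) in self.mapping.items() if b == victim]
        # Erase the whole block, then rewrite the still-valid data into it.
        self.valid[victim] = [False] * self.pages_per_block
        self.next_page[victim] = 0
        self.erase_count[victim] += 1
        for lba in survivors:
            self._program(victim, lba)

    def write(self, lba):
        # Out-of-place update: the old copy becomes stale instead of being overwritten.
        old = self.mapping.pop(lba, None)
        if old is not None:
            self.valid[old[0]][old[1]] = False
        if not self._open_blocks():
            self._garbage_collect()
        # Simple wear leveling: prefer the least-erased block with a free page.
        block = min(self._open_blocks(), key=lambda b: self.erase_count[b])
        self._program(block, lba)

    def read(self, lba):
        # Return the physical (block, page) location currently holding this address.
        return self.mapping[lba]


ctrl = ToyController(num_blocks=2, pages_per_block=2)
for _ in range(5):
    ctrl.write(0)  # repeatedly overwrite one logical address
print(ctrl.read(0), ctrl.erase_count)  # (0, 0) [1, 0]
```

In this toy run, five overwrites of a single logical address fill both blocks with stale pages, so the fifth write has to wait for an erase. This gives a rough intuition for why background block reclamation can show up as occasional slow I/O operations.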
NAND Flash Memory is a type of nonvolatile storage technology that does not require power to retain data and uses NAND flash cells to store data. There are different types of NAND flash memories depending on the number of bits stored in each cell and the arrangement of the cells.

The Flash Memory Controller manages the data stored on flash memory and communicates with the compute system. The flash memory controller includes the Flash Translation Layer (FTL), which maps the host-side logical block addresses (LBAs) to the physical addresses of the flash memory. This controller is responsible for implementing flash management algorithms such as over-provisioning, wear leveling, and garbage collection. To make the storage device operate properly, the controller maps out bad flash memory cells and allocates spare cells from the over-provisioned area to be substituted for cells that fail in the future. To mitigate write-endurance issues, the controller performs wear leveling to distribute write I/O operations uniformly so that data blocks age at a similar rate. The controller also periodically performs garbage collection to improve endurance, which can also cause high tail latency.

We note that the flash management algorithms, including over-provisioning, wear leveling, and garbage collection, have a significant impact on SSD performance, but the vendors do not disclose the details of this proprietary information. This limits our ability to identify the root causes and provide explanations for our findings about the impact of vibration on SSD performance.

2.2 Effect of Vibration on Storage Devices
Autonomous vehicles operate in a dynamic environment where vibration, shock, high temperature, humidity, and other environmental conditions can affect the computing and storage devices on board [9, 13, 15, 16]. Data centers house a large number of server and storage racks with sophisticated power supply and cooling systems that maintain efficient operations. Thus, data center vibration can be generated by multiple sources, including computer servers and power/cooling infrastructure. In a data center, servers with high load, high-velocity airflow, large fans, cooling units, chillers, compressors, and standby power sources can contribute to vibration.

Prior related works have attempted to identify the effects of this vibration on different components of the computing and storage systems [6, 24]. These works have observed that the performance of traditional HDDs is significantly impacted by vibration [12]. I/O performance of HDDs can degrade by up to 50% in some cases [7]. Different electro-mechanical faults on storage drives have also been linked to vibration [8, 11, 28].

As SSDs do not have any moving mechanical parts, they are believed to be more resistant to vibration [6, 21]. However, as discussed earlier, SSDs are composed of sensitive integrated circuit (IC) assemblies for the NAND chips and the controller, on which the effect of vibration has not been studied. Therefore, to bridge this knowledge gap, we study the impact of vibration on SSDs.

3 Experimental Methodology
In this section, we describe the experimental methodology to systematically study the impact of vibration on SSDs in our controlled environment. More specifically, we create a controlled experimental setup that enables us to accurately capture and analyze the effect of vibration on SSD performance. Previous works have performed field studies to observe and analyze the effects of vibration on HDDs in large-scale data centers [6, 11, 24]. However, performing accurate, interference-free, and fine-grained experiments to develop a systematic understanding of vibration’s impact is often not possible in real-world data centers and autonomous vehicles. Therefore, the approach of this study is to draw conclusions by performing controlled experiments on different types of SSDs.

3.1 Experimental Platform Setup
Figure 2: Experimental setup for SSD vibration tests.
Fig. 2 shows the major components of our experimental platform setup and how these components are connected to each other. The SSD is placed on the vibration plate of the vibration generator. The intensity of the vibration in the vibration generator is controlled by the wave function generator. We place the Operating System (OS) on a separate disk, mounted on a rack and insulated from vibration, to ensure that the system kernel is decoupled from the impact of vibration. The SSDs are connected to the system connector through SATA extension cables to ensure that no system component other than the SSDs is impacted by vibration. The SATA cable connecting the SSDs is tightly secured to guard against loose-connection problems during the vibration experiments. We hold the SSD tightly in place using tape and metal plates while it is under vibration. However, we note that tightly packed devices in a data center rack or a moving vehicle are still affected by vibration since not all metals can absorb vibration. Racks specifically designed to absorb vibration are typically 4x more expensive than traditional racks [2].

We note that the whole setup is inside an isolated room under normal operating conditions, away from other kinds of vibration, heat, or external factors that may affect our conclusions. The servers
performance metrics. We use SSDs from three different major SSD
vendors. For anonymity reasons, we do not disclose the vendor
names, but they are major, representative SSD vendors that hold
a large share of the market. We have chosen these vendors to ensure
broad coverage of NAND type and variety in the flash management
algorithms (although the details are proprietary).
Figure 9: Long-term exposure to vibration significantly degrades the SSD write tail latency under both vibration types. This is especially true for vendors B and C.
Figure 10: Long-term exposure to vibration can also decrease the mean read and write I/O bandwidth.

maintain some post-effect. However, the magnitude is not as large as the degradation during the vibration period. Interestingly, in some cases, = vibration appears to have a relatively higher post-effect than ⊥ vibration, although the effect of = vibration on tail latency during the vibration phase itself is lower than that of ⊥ vibration. Lingering post-effects of vibration during no-vibration phases, small but persistent, could be the root cause of fail-slow type performance defects observed in field studies [11].

Effect of long-term vibration on SSD performance: Motivated by the significant immediate impact of vibration during the short term, we explore whether long-term exposure to vibration affects SSD performance. To capture and compare the effects methodically against no-vibration, we employ multiple SSDs. All SSDs run the same random read-write workload for an equal amount of time; one-third of the group each is kept under no-vibration, = vibration, and ⊥ vibration. Performance after the long-term period is normalized with respect to the initial short-term period in each group to factor out manufacturing variability across devices.

Fig. 8 and 9 show the degradation in read and write tail latencies across vibration types and vendors. We make several interesting observations. First, when the SSD experiences no vibration, the performance deteriorates negligibly over the long term (120 hours in this study) across all vendors and I/O operation types. However, the long-term impact of vibration on SSD performance is dramatic, up to 45% in many cases. Second, the degree of performance degradation due to long-term exposure varies significantly across vendors. Vendor A observes a relatively small impact (less than 5%) compared to the other vendors, potentially because of the NAND type (e.g., MLC, TLC). Third, our results reveal that = vibration has a much higher impact compared to ⊥ vibration. Note that ⊥ vibration results in high performance degradation in the short term itself, while = vibration does not. Our results show that, while harmless in the short term, = vibration becomes harmful in the long term, almost as bad as the short-term effect of ⊥ vibration. The degradation due to ⊥ vibration does not further increase dramatically over the long term.

Interestingly, we observed that after long-term exposure to vibration, the mean bandwidth also drops noticeably, up to 10% for = vibration (Fig. 10). While not shown in the results due to space constraints, we observed higher variation in bandwidth under = and ⊥ vibration compared to no-vibration after running the SSD for the specified long-term period. We also examined SMART attributes such as Media_Wearout_Indicator, Available_Reservd_Space, and Hardware_ECC_Recovered, but did not observe any conclusive impact of vibration. This indicates that even when vibration is causing performance degradation, the corresponding symptoms may not be visible, even in the long term, via traditional performance and health check tools.

Long-term exposure to vibration can lead to silent failures: We continued our long-term vibration experiments, intending to run them until the SSD wears out by writing more data than what is specified in the warranty sheet. To our surprise, some SSDs running under vibration started experiencing silent and transient failures well before they surpassed the write endurance limit and soon after the end of our long-term window. We note that these failures were not observed in all SSDs under vibration, making it difficult to predict and proactively manage such failures. Also,
our previous long-term discussion did not include any SSD performance data with such failures. On the other hand, no SSD kept under no-vibration showed any such behavior.

The SSD failures resulted in the running workload being terminated unexpectedly; however, the SSDs worked fine after a restart until the next failure. Fig. 11 shows the syslog snippet for one such occurrence of this type of SSD failure. The error in syslog indicates that the SSD, which is connected as “sda”, suddenly goes undetected. Then, upon relaunching the workload, it resumes proper execution. Also, the Linux command “lsblk” reports the SSD correctly. Thus, this transient stop fault of the SSD is prone to go undetected or be classified as an NDF (no defect found) in the data center setting. We performed a more in-depth analysis to understand the performance trend during these transient failures.

Notably, we observed that the performance of SSDs that experience such transient faults drops significantly. Fig. 12 and 13 show the Cumulative Distribution Function (CDF) and mean of bandwidth for two failing SSDs from vendors A and C under different types of vibration, as representative examples. These figures show the performance during the period between multiple transient failures. We make several new observations. First, when we compare the SSD I/O bandwidth without vibration and with vibration, we observe a significant performance drop after the transient failure. Second, between consecutive phases of no-vibration separated by transient failures, the bandwidth decreases by more than 20% in one case. We note that this decrease is larger than the decrease observed in the long term under no-vibration. Upon further inspection, we estimated that the “media wear-out” suddenly increases at a higher rate, despite the fact that the SSD should have been far from its write-endurance limit. This is potentially because the damaged NAND cells are replaced by spare NAND cells of the over-provision (OP) region. Essentially, these failures appear to be silent and transient at first but eventually lead to premature death of the SSD, as we found in several cases during our study.

Figure 12: Write bandwidth CDF and mean write bandwidth of vendor A SSD may change in between the stop failures.
Figure 13: Write bandwidth CDF and mean write bandwidth of vendor C SSD may change in between the stop failures.

Future Implications: Tail latency and performance slowdowns are the most critical factors for both autonomous systems to make real-time decisions and data center providers to guarantee SLAs. Our results show that vibration may have a significant impact on these factors for SSD devices. Thus, storage system researchers and practitioners need to pay closer attention to such impacts for better provisioning and management of SSDs. Our results also indicate that SSD manufacturers need to devise better strategies to contain and mitigate the side-effects of vibration on SSD performance in variable computing environments including autonomous vehicles and data centers.

5 Conclusion
This paper begins by posing a simple question for investigation: “how does vibration impact the performance of your SSD?” We conclude by observing for the first time that vibration can have a severe impact on the tail latency of the SSD and that this impact is dependent on the vendor. We discovered that exposure to vibration can, surprisingly, leave post-effects even when the SSD is not under vibration. Additionally, it can damage the SSD performance in the long term, which has serious implications for data center SLAs and usage of SSDs in autonomous vehicles.

References
[1] Bruce Allen. 2018. smartmontools. https://linux.die.net/man/8/smartctl
[2] Startup Takes Aim at Performance-Killing Vibration in Datacenter. 2010. vibrationrack. https://bit.ly/2FGNH6L
[3] Jens Axboe. 2018. FIO. https://fio.readthedocs.io/en/latest/fio_doc.html
[4] Jens Axboe et al. 2018. blktrace. https://linux.die.net/man/8/blktrace
[5] Ethan Brush. 2018. Noise and Vibration Considerations for Data Centers and IT Facilities. https://bit.ly/2UpXK9u
[6] Christine S Chan, Boxiang Pan, et al. 2014. Correcting Vibration-Induced Performance Degradation in Enterprise Servers. ACM SIGMETRICS Performance Evaluation Review 41, 3 (2014), 83–88.
[7] Trinoy Dutta and Andrew R Barnard. 2017. Performance of Hard Disk Drives in High Noise Environments. Noise Control Engineering Journal 65, 5 (2017).
[8] Takehiko Eguchi, Yohei Asai, et al. 2017. Airborne and Structure-Borne Transmission of High Frequency Fan Vibration in a Storage Box. In Conference on Information Storage and Processing Systems. ASME.
[9] Ming Yang et al. 2019. Re-thinking CNN Frameworks for Time-Sensitive Autonomous-Driving Applications: Addressing an Industrial Challenge. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).
[10] Sebastien Godard. 2018. iostat. https://linux.die.net/man/1/iostat
[11] Haryadi S Gunawi, Riza O Suminto, et al. 2018. Fail-slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems. ACM Transactions on Storage (TOS) 14, 3 (2018), 23.
[12] YY Hu, S Yoshida, et al. 2009. Analysis of Built-In Speaker-Induced Structural-Acoustic Vibration of Hard Disk Drives in Notebook PCs. IEEE Transactions on Magnetics 45, 11 (2009), 4950–4955.
[13] R Wayne Johnson et al. 2004. The changing automotive environment: high-temperature electronics. IEEE Transactions on Electronics Packaging Manufacturing 27, 3 (2004), 164–176.
[14] Kingston. 2018. A400 SSD. https://bit.ly/2WF6hTk
[15] Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md E Haque, Lingjia Tang, and Jason Mars. 2018. The architectural implications of autonomous driving: Constraints and acceleration. In ASPLOS 2018. ACM, 751–766.
[16] Shaoshan Liu, Jie Tang, Zhe Zhang, and Jean-Luc Gaudiot. 2017. Computer architectures for autonomous driving. Computer 50, 8 (2017), 18–25.
[17] Micron. 2018. 5100 Series. https://bit.ly/2Upqpvt
[18] Mydigitalssd. 2018. Superboot. https://mydigitalssd.com/2.5-inch-sata-ssd.php
[19] Samsung. 2018. 850 EVO SSD. https://images-eu.ssl-images-amazon.com/images/I/61HGJaHYy-L.pdf
[20] Sandisk. 2018. Extreme II. http://mp3support.sandisk.com/downloads/qsg/extreme2-ssd-datasheet.pdf
[21] Christine Taylor. 2018. SSD vs. HDD. http://www.enterprisestorageforum.com/storage-hardware/ssd-vs.-hdd.html
[22] Techspot. 2018. Effect of Vibrations on CPU. https://www.techspot.com/community/topics/cpu-fan-vibrating.99261/
[23] Iain Thomson. 2018. Azure Fell over for 7 Hours in Europe because Someone Accidentally Set Off the Fire Extinguishers. https://www.theregister.co.uk/2017/10/03/faulty_fire_systems_take_down_azure_across_northern_europe/
[24] Julian Turner. 2010. Effects of Data Center Vibration on Compute System Performance. In SustainIT.
[25] Marcel van den Berg. 2018. Bank’s Data Center Shut Down. http://up2v.nl/2016/09/12/a-loud-sound-just-shut-down-a-banks-data-center-for-10-hours/
[26] Marcel van den Berg. 2018. Datacenter Failure. http://up2v.nl/2018/04/25/nasdaq-nordic-datacenter-failure-because-of-noise-of-fire-suppression-system/
[27] Dag Wieers. 2018. dstat. https://dag.wiee.rs/home-made/dstat
[28] Jiaping Yang, Cheng Peng Tan, et al. 2017. An Effective System-Level Vibration Prediction Analysis Approach for Data Storage System Chassis. Microsystem Technologies 23, 8 (2017), 3097–3105.