Cloud Promise versus Practice: Real-World Examples of High-Performance Data Management

Kern, Kirk

Cloud computing is more practice than promise today. Cloud hosting is now the go-to way to modernize IT infrastructure or develop new digital services. IDC estimates that global spending on public cloud will reach $370 billion by 2022, a 22.5% five-year compound annual growth rate. [i] U.S. Government directives like Cloud Smart [ii] and the Data Center Optimization Initiative (DCOI) [iii] encourage cloud computing as an alternative to on-premises IT infrastructure. Organizations rely on cloud services to do more with their data, whether for more precise weather forecasts or for precision medicine. However, environmental and other scientific workloads that require high-performance data processing still face migration and performance hurdles in cloud-based IT environments. Environmental and scientific data present a unique challenge: the data itself is usually stored in unstructured files, and the overall data volume is very large.

This presentation will highlight several real-world examples of how cloud providers are delivering services to meet the high-performance, low-latency needs of scientific workloads. Cloud services have evolved from basic Infrastructure as a Service (IaaS) to easy-to-use, high-performance capabilities that match the features and service levels available in on-premises datacenters. These capabilities include extreme file processing with guaranteed quality of service. This talk will also share how cloud service providers and “traditional” technology firms are partnering to deliver data migration capabilities that enable environmental and other scientific organizations to move existing workloads to the cloud with minimal disruption.

This talk targets attendees interested in new cloud services to leverage data as a strategic asset, as well as IT practitioners who want proven tools to help move datasets non-disruptively to take advantage of the scale and variety of cloud-based offerings.

Key takeaways from this presentation include practical examples of how to:

Expedite big data analytics by addressing data management challenges shared by genomics sequencing and environmental data processing
Accelerate time-sensitive content in the cloud
Build out a high-performance, cloud-based data repository
Migrate existing workloads to the cloud non-disruptively, using a Federal agency example

Forecasting Weather vs Diagnosing Cancer – Common Data Challenges

Environmental data processing has a lot in common with genomics sequencing and precision medicine initiatives. Both involve a high volume of data that must be brought together for diagnosis or analysis to reach a desired outcome. High-performance workloads like genome sequencing, rendering, and database workloads have traditionally had a hard time using cloud computing. These workloads require a high-performance clustered architecture with highly scalable local storage that can also provide access to many research scientists who need to mine the complete clinical data of millions of individuals. These types of analytics workloads need high-speed I/O processing for large amounts of data. For example, a single complete genome is made up of 3 billion “base pairs” of DNA molecules, and sequencing a whole genome generates more than 100 GB of data. By 2025, over 100 million human genomes could be sequenced, representing over a zettabyte of data. Similarly, environmental data (satellite, radar, and observation data) needs to be processed and reassembled prior to dissemination. Cloud service providers now offer high-performance file processing solutions that keep pace with their scalable cloud compute services. This talk will share how environmental workloads can take advantage of lessons learned from genomics sequencing in the cloud using new high-powered file services from Microsoft Azure and other cloud providers.
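As a rough illustration of why these workloads are I/O-bound, the sketch below works through the data volumes quoted above. It is a back-of-the-envelope calculation only: it uses decimal units, treats 100 GB as a lower bound per genome, and the throughput values are hypothetical numbers chosen for illustration, not measurements of any cloud service. At the 100 GB lower bound, 100 million genomes come to about 10 exabytes. A zettabyte-scale total would correspond to roughly 10 TB per genome, so that projection presumably counts more than the 100 GB figure alone (an assumption; the abstract does not say what it includes).

```python
# Back-of-the-envelope sizing using the figures quoted in the text (decimal units).
GB = 10**9
TB = 10**12
EB = 10**18
ZB = 10**21

BYTES_PER_GENOME = 100 * GB      # "more than 100 GB" per genome, used as a lower bound
GENOMES_BY_2025 = 100_000_000    # "over 100 million human genomes"


def total_bytes(genomes: int, bytes_per_genome: int) -> int:
    """Total storage for a number of genomes at a fixed size per genome."""
    return genomes * bytes_per_genome


def read_time_seconds(size_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time to stream size_bytes once at a sustained throughput."""
    return size_bytes / throughput_bytes_per_s


total = total_bytes(GENOMES_BY_2025, BYTES_PER_GENOME)
print(f"Total at 100 GB per genome: {total / EB:.0f} EB")
print(f"Per-genome size implied by 1 ZB: {ZB / GENOMES_BY_2025 / TB:.0f} TB")

# Hypothetical sustained throughputs, for illustration only.
for label, bytes_per_s in [("100 MB/s", 100 * 10**6),
                           ("1 GB/s", 1 * GB),
                           ("4 GB/s", 4 * GB)]:
    secs = read_time_seconds(BYTES_PER_GENOME, bytes_per_s)
    print(f"Reading one 100 GB genome at {label}: {secs / 60:.1f} min")
```

Even a single pass over one genome takes minutes at modest throughput, which is why file-service performance matters as much as compute scale for this class of workload.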

Accelerating Time Sensitive Content in the Cloud

Federal agencies with time-sensitive content are realizing that they must address the performance of the “backend” data management and storage services as much as they do the “front end” of a website. One federal agency realized that at peak access times, it was unable to serve citizens looking for weather and related environmental information. The agency required a lower-latency, higher-performance file service for its website backend. It resolved its performance and latency challenge using Microsoft Azure NetApp Files, a high-powered Microsoft service that delivers quality of service to meet the most stringent application latency and response requirements. This presentation will describe Microsoft Azure NetApp Files, a cloud-native service that combines the cloud scalability of Azure with the data management (I/O) performance of enterprise data storage provider NetApp.

Building a High-Performance Repository for Environmental Data

The cloud seems tailor-made for building out a large repository to collect and analyze data on demand. As in the examples above, however, the performance of native cloud file services can be a challenge for large volumes of data files. In this example, a public corporation collects wind turbine data that is captured and ingested in large zip files. The company needed a high-performance repository for this data, and it now uses NetApp enterprise file services running in a major cloud service provider (CSP). The NetApp data service enabled the company to extend the capabilities of the native CSP offering and run reports and analytics against the data, all in the cloud.

Migrating Existing Workloads to Cloud Without Disrupting Operations

After an organization has decided to move to the cloud, the next question is: how? This talk illustrates one way federal agencies and commercial organizations have moved their physical on-premises workloads to cloud services without disrupting operations. Data management firm NetApp has partnered with Microsoft and other cloud service providers to use existing data replication tools to move data into Microsoft Azure and other cloud services. This presentation will highlight one agency that moved 900 workloads from its existing infrastructure into the cloud in approximately 5 months.
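To put the agency example in planning terms, the short sketch below converts the figures above (900 workloads, approximately 5 months) into an average migration pace. It assumes the work was spread evenly across the period, which the abstract does not state, and the 300-workload estimate at the end is a hypothetical scenario rather than a figure from the talk.

```python
import math

WORKLOADS = 900            # workloads moved in the agency example
MONTHS = 5                 # "approximately 5 months"
WEEKS_PER_MONTH = 52 / 12  # average calendar weeks per month


def migration_rate(workloads: int, months: float) -> tuple[float, float]:
    """Average pace as (workloads per month, workloads per week)."""
    per_month = workloads / months
    per_week = per_month / WEEKS_PER_MONTH
    return per_month, per_week


def months_needed(workloads: int, per_month: float) -> int:
    """Whole months required to move `workloads` at a given monthly pace."""
    return math.ceil(workloads / per_month)


per_month, per_week = migration_rate(WORKLOADS, MONTHS)
print(f"Average pace: {per_month:.0f} workloads/month, about {per_week:.1f} per week")

# Hypothetical: a smaller estate of 300 workloads migrated at the same average pace.
print(f"300 workloads at that pace: about {months_needed(300, per_month)} months")
```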

The promise of cloud computing is here. The challenge for organizations is to use it in the right place, at the right time, to meet data-centric requirements. Fortunately, cloud services now offer the higher performance needed to process large volumes of data (whether genomic or environmental) and proven tools to ease migration. This talk will leave attendees with practical options for using cloud services for environmental and other scientific uses.

[i] https://www.idc.com/getdoc.jsp?containerId=prUS44891519

[ii] https://cloud.cio.gov/strategy/

[iii] https://datacenters.cio.gov/

6A.1 Cloud Promise versus Practice: Real-World Examples of High-Performance Data Management