J11.1 Enabling Direct User Capacity Planning for Data Storage

Wednesday, 25 January 2017: 10:30 AM
Conference Center: Chelan 5 (Washington State Convention Center)
Michael E. Fotta, Global Science & Technology, Inc., Fairmont, WV


The Comprehensive Large Array-data Stewardship System (CLASS) archives over 8 petabytes (PB) of NOAA's environmental data and is adding to these holdings at a current rate of over 1.5 PB per year. With the addition of new major data sources (GOES-R and JPSS), this growth will increase to over 7 PB per year by the end of 2017. Typically, each 1 TB/day of data ingested needs about 28 TB of disk cache for processing, over 30 TB of cache for delivery, and 146 tapes per year. It is critical that CLASS be able to estimate the storage media necessary for future data. Overestimating leads to overspending, while underestimating leads to a shortfall in the storage available for new data. Estimates are needed for projecting both the overall growth of CLASS storage needs and the storage needs of individual campaigns (particular data sets). The latter is especially important because CLASS charges the individual NOAA campaigns for the media resources they use. Users of CLASS wanted a capacity estimation tool that would let them estimate the cost of their particular campaign. Furthermore, users wanted to run multiple simulations in which they could manipulate the variables related to this storage; that is, the variables that could raise or lower the cost. To meet these needs, Forio's Simulate™ was used to build the CLASS Storage Estimation Tool (CSET).
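The rule-of-thumb figures quoted above lend themselves to a quick back-of-the-envelope calculation. The following is a minimal sketch, assuming that media needs scale linearly with the daily ingest rate; that linearity is an assumption made here for illustration, since the abstract does not describe CSET's actual model. The names `MediaEstimate` and `estimate_media` are hypothetical, and the delivery cache figure uses the quoted "over 30 TB" as a lower bound.

```python
from dataclasses import dataclass

# Rule-of-thumb figures quoted in the abstract, per 1 TB/day of data ingested.
PROCESSING_CACHE_TB = 28.0  # "about 28 TB" of disk cache for processing
DELIVERY_CACHE_TB = 30.0    # "over 30 TB" of cache for delivery (lower bound)
TAPES_PER_YEAR = 146.0      # tapes per year


@dataclass
class MediaEstimate:
    processing_cache_tb: float
    delivery_cache_tb: float
    tapes_per_year: float


def estimate_media(ingest_tb_per_day: float) -> MediaEstimate:
    """Scale the per-1-TB/day figures linearly (illustrative assumption only)."""
    if ingest_tb_per_day < 0:
        raise ValueError("ingest rate must be non-negative")
    return MediaEstimate(
        processing_cache_tb=PROCESSING_CACHE_TB * ingest_tb_per_day,
        delivery_cache_tb=DELIVERY_CACHE_TB * ingest_tb_per_day,
        tapes_per_year=TAPES_PER_YEAR * ingest_tb_per_day,
    )


if __name__ == "__main__":
    # Example: a hypothetical campaign ingesting 2 TB/day.
    print(estimate_media(2.0))
```

A real estimate would also depend on the other inputs CSET accepts, such as file counts, file sizes and the fraction of data kept permanently on disk.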


The input variables under a user’s control related to media needs are: 1) the daily data ingest rate to CLASS, 2) the percent of data to be kept permanently on disks, 3) the volume of data to be delivered from CLASS storage daily, 4) the number of files ingested daily, 5) the size of these files, 6) the period of time (months or years) over which the data is ingested, and 7) the “situation” for ingesting the data into CLASS. The situation refers to how the data is sent to CLASS: either Backlog (an existing store of data), Operational (on a daily basis), or a combination of the two. In addition, the Operational data variables may change on an annual basis. This leads to five different situations for ingesting data: 1) Backlog only, 2) Operational with no input variation, 3) Operational with yearly input variation, 4) Backlog and Operational with no input variation, and 5) Backlog and Operational with yearly variation.
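The five situations follow from three yes/no choices: whether Backlog data is ingested, whether Operational data is ingested, and whether the Operational inputs vary by year. The sketch below shows one way to encode that mapping; the `Situation` enum and `classify_situation` function are hypothetical names chosen for illustration and are not part of CSET.

```python
from enum import Enum


class Situation(Enum):
    BACKLOG_ONLY = 1
    OPERATIONAL_NO_VARIATION = 2
    OPERATIONAL_YEARLY_VARIATION = 3
    BACKLOG_AND_OPERATIONAL_NO_VARIATION = 4
    BACKLOG_AND_OPERATIONAL_YEARLY_VARIATION = 5


def classify_situation(backlog: bool, operational: bool,
                       yearly_variation: bool = False) -> Situation:
    """Map the three ingest choices onto the five situations listed above."""
    if not backlog and not operational:
        raise ValueError("select Backlog, Operational, or both")
    if not operational:
        # The abstract describes yearly variation only for Operational data.
        if yearly_variation:
            raise ValueError("yearly variation applies only to Operational data")
        return Situation.BACKLOG_ONLY
    if backlog:
        return (Situation.BACKLOG_AND_OPERATIONAL_YEARLY_VARIATION
                if yearly_variation
                else Situation.BACKLOG_AND_OPERATIONAL_NO_VARIATION)
    return (Situation.OPERATIONAL_YEARLY_VARIATION
            if yearly_variation
            else Situation.OPERATIONAL_NO_VARIATION)


if __name__ == "__main__":
    print(classify_situation(backlog=True, operational=True, yearly_variation=True))
```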

The CSET model enables users to run simulations for these different situations while manipulating variable values, determine the media capacities and cost for each simulation run, and then compare the runs. A campaign often has some discretion in how the data is brought in (Backlog and/or Operational) and in the variables characterizing the input of the data. Users often start with variable values that would give them the fastest ingest of the largest number of files per day. After a user enters their values and runs the simulation, CSET provides a summary Resource Estimation web page for that situation. If the costs are well outside their budget, they can go back and run a new simulation with reduced values for some variables (e.g., a lower daily ingest rate).

Users then save the input variable values and the Resource Estimation results from each run for comparison with other simulation runs. The data from the saved runs is then presented in a Run Table, as described below.


Once a user has executed a number of runs with different variable inputs, they can use the Run Tables to understand how their changes affect media capacity and cost. As the name implies, a Run Table presents the data for each saved run. To give users as much information as possible without overwhelming them, the Run Tables are presented on a web page with separate tabs (see Figure 1). There are tabs for Final Results, Backlog Results, Operational Results and Inputs. The Results tabs show the capacities and costs for the different media (disks, tapes and tape drives) for each run, while the Inputs tab shows the variable values entered for each run. The information shown in the Final Results tab is the same as either the Backlog or Operational Results when only one of these situations is used, but differs when data is ingested via both Backlog and Operational situations.

Using the Run Tables, users can compare the disk, tape and drive capacities and costs resulting from their variable inputs on different runs. By visually comparing the results and using the Inputs tab, users can determine how changes in one or more variables affect these capacities and costs. All of the data shown in the Run Tables can also be saved to Excel or other tools for further analysis, if desired.

In the example shown in Figure 1, the user did four runs and noticed that variation in Disk Cost was the main driver of changes in Total Cost. Looking through the data, the user saw that the Disk Cost variation appeared to be related to the variation in SFS Disks, as the number of HPSS Disks hardly varied. The user then saw that the Ingest Rate was the same (1.1) for the two runs with the fewest SFS Disks, but that Run 4 had far fewer SFS Disks than Run 2. At this point the user referred to the Inputs tab to discover how the inputs varied. In this case the user found that Run 4 had specified that only 1% of data be kept permanently on disks, while Run 2 had specified 10%. The user decided that the reduced use of disks for permanent storage was acceptable given the reduction in cost.
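The comparison in this example amounts to finding which inputs differ between two saved runs. A minimal sketch of that step, for users who export the Inputs tab for further analysis, is shown below. The function `differing_inputs` and the dictionary keys are hypothetical; only the two values mentioned in the example (an Ingest Rate of 1.1 and 10% versus 1% of data kept permanently on disks) are taken from the text, and the Ingest Rate unit is not stated there.

```python
def differing_inputs(run_a: dict, run_b: dict) -> dict:
    """Return {variable: (value_in_a, value_in_b)} for inputs that differ."""
    keys = sorted(set(run_a) | set(run_b))
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in keys
        if run_a.get(key) != run_b.get(key)
    }


# Inputs for the two runs discussed above (hypothetical key names).
run_2 = {"ingest_rate": 1.1, "percent_permanent_on_disk": 10}
run_4 = {"ingest_rate": 1.1, "percent_permanent_on_disk": 1}

if __name__ == "__main__":
    print(differing_inputs(run_2, run_4))
    # prints {'percent_permanent_on_disk': (10, 1)}
```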

Detailed examples of how users have applied CSET, including how variables are entered and modified and how runs are compared, will be presented.

Figure 1. Example CSET Run Table Web Page.
