Massive Data Analysis

My student, Daniel Wang, and I worked with Earth System Science Professor Charles Zender on an aspect of his NCO/SDO program. The essential problem is that geoscience simulations and sensors produce enormous data sets containing detailed, time-varying, multidimensional information. This data tends to reside on servers and information portals such as the Earth System Grid. To analyze it, a scientist must retrieve significant chunks over the network to her workstation, then use tools, such as those in Charlie Zender’s NCO (netCDF Operators), to reduce, compare, average, or otherwise manipulate the data. With current data sizes, it may take 100 GB of data to produce a few megabytes of results that can be graphed and understood by humans. Downloading this much data is already impractical, and the problem will only get worse as simulation sizes grow and sensor resolution increases.

Conventional approaches, such as Grids, do not help much, because they cannot escape the data movement problem. The solution we propose is therefore to move the computation to the data centers, where locality can be exploited and high-performance, low-latency LANs can move data far more cost-efficiently than wide-area networks. The difficulty with moving computation rather than data is how to specify the operations. Scientists already have large scripts of NCO commands and are reluctant to convert them to graphical or XML representations, so our approach is to allow the use of these scripts essentially as-is. The system will parse the script and perform data-flow analysis on the operations to define a workflow that can be executed on the data center’s cluster. It will coordinate producers and consumers of data to exploit locality where possible while maintaining a high degree of parallelism. Our results to date show that remotely executing these scripts of highly data-intensive, but computationally simple, operations is much faster than first downloading the required data and then executing the script locally.
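
To make the data-flow idea concrete, here is a minimal Python sketch of how a script of NCO-style commands might be turned into stages that can run in parallel. It is an illustration only, not the project's implementation. The helper names (parse_command, build_stages), the file names, and the small VALUE_OPTIONS set are all hypothetical, and the parser assumes the last file name on each line is the output and every other file name is an input.

```python
import shlex

# Assumption: a small, non-exhaustive set of options that take a separate value.
VALUE_OPTIONS = {"-v", "-d", "-a"}


def parse_command(line):
    """Split one NCO-style command line into (tool, inputs, output)."""
    tokens = shlex.split(line)
    tool, files = tokens[0], []
    skip_next = False
    for tok in tokens[1:]:
        if skip_next:                 # value belonging to the previous option
            skip_next = False
        elif tok in VALUE_OPTIONS:    # option whose value is the next token
            skip_next = True
        elif not tok.startswith("-"):
            files.append(tok)         # anything else is treated as a file name
    # Assumption: the last file is the output, the rest are inputs.
    return tool, files[:-1], files[-1]


def build_stages(script):
    """Group commands into stages; commands within a stage are independent."""
    commands = [parse_command(l) for l in script.splitlines() if l.strip()]
    last_writer = {}   # file name -> index of the command that produced it
    level = []         # stage number assigned to each command
    for idx, (_tool, inputs, output) in enumerate(commands):
        # A command depends on the producers of any inputs made by the script.
        deps = [last_writer[f] for f in inputs if f in last_writer]
        level.append(1 + max((level[d] for d in deps), default=-1))
        last_writer[output] = idx
    stages = [[] for _ in range(max(level, default=-1) + 1)]
    for idx, lv in enumerate(level):
        stages[lv].append(commands[idx])
    return stages


# Hypothetical four-line script: two averages, a difference, a zonal mean.
SCRIPT = """
ncra -O jan.nc feb.nc winter.nc
ncra -O jul.nc aug.nc summer.nc
ncdiff -O summer.nc winter.nc diff.nc
ncwa -O -a lon diff.nc zonal.nc
"""

stages = build_stages(SCRIPT)
for number, stage in enumerate(stages):
    print(number, [(tool, output) for tool, _inputs, output in stage])
```

In this sketch the two ncra commands share no files, so they land in the same stage and could run concurrently, while ncdiff must wait for both of their outputs. A real scheduler would also need to handle cases this sketch ignores, such as a later command overwriting a file that an earlier command still reads, and would place consumers near the data their producers wrote.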

This research has resulted in several talks, posters, and papers:

D. Wang, C. Zender, and S. Jenks, “Server-side parallel data reduction and analysis,” in Proceedings of the International Conference on Grid and Pervasive Computing (GPC), (Paris, France), May 2007.

D. L. Wang, C. S. Zender, and S. F. Jenks, “DAP-enabled server-side data reduction and analysis,” in 23rd Conference on Interactive Information Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology at the 2007 American Meteorological Society’s Annual Meeting, (San Antonio, TX), January 2007. (Presented by Daniel Wang)

D. L. Wang, C. S. Zender, and S. F. Jenks, “Server-side netCDF data reduction and analysis,” in American Geophysical Union (AGU) Fall Meeting, (San Francisco, CA), December 2006. (Presented by Daniel Wang)