This article is published in the March 2008 issue.

Data-Intensive Computing

Message from the CISE AD

I have some exciting news to share with all of you: NSF is partnering with Google and IBM to explore data-intensive computing. Through this partnership, Google and IBM are providing software and services running on a large cluster to the broad academic community to explore innovative research and education ideas in data-intensive computing. Google and IBM launched the Academic Cluster Computing Initiative 1 last October with instructional programs at six pilot universities, and NSF will join this initiative as its first research-oriented pilot partner. We are calling the NSF program that provides access to these types of resources the Cluster Exploratory (CluE).

Here are excerpts from a prior announcement made by Dan Atkins (Director, Office of Cyberinfrastructure) and me to the community.

Data-Intensive Computing

Data-intensive computing is a computational paradigm in which the sheer volume of data is the dominant performance parameter. Storage and computation are co-located, enabling large-scale parallelism over terabytes of data. For example, Google runs an average of 100,000 MapReduce jobs per day on its clusters, processing over 20 petabytes daily 2. This scale of computing effectively supports applications specified in high-level programming primitives, where the run-time system manages parallelism and data access. The architecture is extremely fault-tolerant and exhibits high degrees of reliability and availability.
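
The programming model behind this paradigm can be illustrated with a small, self-contained sketch. The Python script below (purely illustrative; it simulates on one machine what a cluster performs in parallel, and the function names `mapper`, `reducer`, and `mapreduce` are our own) counts words with a map phase, a shuffle that groups intermediate pairs by key, and a reduce phase:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: combine all partial counts for one key.
    return (word, sum(counts))

def mapreduce(lines):
    # Shuffle: group intermediate pairs by key, as the runtime would
    # do across machines before handing each group to a reducer.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(line) for line in lines):
        groups[key].append(value)
    # Each group can be reduced independently -- the source of the
    # large-scale parallelism described above.
    return dict(reducer(k, v) for k, v in groups.items())

print(mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Because the reducers share no state, the run-time system is free to schedule them on thousands of machines and to re-run any that fail, which is how the paradigm achieves its fault tolerance.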

Data-intensive computing raises important research challenges:

  • For science
    • What are the fundamental capabilities and limitations of this paradigm?
    • What new programming abstractions (including models, languages, algorithms) does this computational model suggest?
  • For technology
    • How can we automatically manage the hardware and software of these systems?
    • How can we reduce their power consumption?
  • For society
    • What (new) applications can best exploit this computing paradigm?

Data-intensive computing is at the forefront of ultra-large-scale commercial data processing. A July 2006 New York Times article 3 notes that “Google, Microsoft and Yahoo are spending vast sums of capital to build out their computing capabilities.” Not only is there an increasing need for advances in data-intensive computing systems software and hardware, but there is also an increasing demand for a trained workforce to operate and use these systems. To date, however, the academic community has had limited access to such systems.

Enter Google and IBM

On October 8, 2007, Google and IBM announced that they had teamed up to provide six universities with access to a large-scale computing cluster, together with the software and services to use it effectively 4. After several months of discussions, NSF will join this initiative, partnering with Google and IBM to broaden the reach of this powerful computing resource and to foster more innovation than might be possible in the initial pilot.

Access to the Google-IBM academic cluster via the CluE program will provide the academic community with the opportunity to do fundamental, disruptive research in data-intensive computing and to explore powerful new applications. This facility can also serve as a tool for educating the next generation of scientists and engineers. This partnership is an excellent example of an academic-industry-government relationship that benefits all three parties.

System Description

The Google-IBM cluster contains well over a thousand processors connected to terabytes of memory and hundreds of terabytes of storage, with internal networking as well as a substantial external network connection. The system will be configured with open-source software, including Linux and Apache Hadoop 5, a large-scale distributed computing platform inspired by Google’s MapReduce 6 and the Google File System 7. IBM’s Tivoli 8 software will also be used for management, monitoring, and dynamic resource provisioning of the cluster.
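
To give a feel for how jobs are written against Hadoop, here is a sketch of a word-count job in the Hadoop Streaming style, in which the mapper and reducer are ordinary programs that read from standard input and write tab-separated key/value pairs to standard output. The script below is an assumption-laden illustration, not the cluster's actual configuration (which the article does not specify); the `run_mapper` and `run_reducer` names are our own, and the `__main__` section merely simulates locally the map, sort, and reduce pipeline that Hadoop runs across machines:

```python
import io
import sys

def run_mapper(instream, out):
    # Streaming mapper: emit one tab-separated (word, 1) pair per word.
    for line in instream:
        for word in line.split():
            out.write(f"{word.lower()}\t1\n")

def run_reducer(instream, out):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so all lines for the same word arrive consecutively.
    current, total = None, 0
    for line in instream:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        out.write(f"{current}\t{total}\n")

if __name__ == "__main__":
    # Local stand-in for the map -> sort -> reduce pipeline.
    mapped = io.StringIO()
    run_mapper(io.StringIO("the quick fox\nthe lazy dog\n"), mapped)
    shuffled = io.StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
    run_reducer(shuffled, sys.stdout)
```

On a real cluster, such scripts are typically submitted with Hadoop Streaming along the lines of `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in -output out`; the framework handles the intermediate sort and the distribution of work.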

The system will provide a powerful resource for large-scale data analysis, mining and visualization in addition to support for Internet-scale computing applications. Tutorial information describing the programming environment of the Google-IBM academic cluster available via the CluE program can be found on the Google Code for Educators website 9. Much of this material was developed in collaboration with the University of Washington, and all of it is available under permissive licenses such as the Creative Commons Attribution License.

Upcoming Solicitation

CISE is currently developing a program solicitation that will invite researchers to submit proposals requesting allocations on the Google-IBM cluster, both to pursue new, innovative uses of the system and to probe the possibilities and fundamental limits of this new computing paradigm. The emphasis of the program will be on developing new approaches and applications that lie outside the typical high-performance computing applications running on today’s supercomputers.

The challenge to the academic community is threefold: to use existing tools and to develop new programming abstractions for such a “computer” to solve problems unsolvable any other way; to solve old problems in simpler or more efficient ways; and to enable new applications. This resource will also provide an opportunity to teach students how to build, use, and manage data-intensive computing systems—systems that are already widely used in industry, but are largely unavailable to students and faculty today.

Please look for the new solicitation, which will be posted on the CISE web site. CISE looks forward to your bold, creative proposals for CluE!


1 Official Google Blog:
2 J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. of the ACM, 51(1), January 2008, pp. 107-113.
4 See the text of the press release.
