This article is published in the March 2007 issue.

SDSC: Harnessing Data for Science and Society


Stroll the halls of the San Diego Supercomputer Center (SDSC), and a world of discovery—from the inner space of the mind to the outer space of the universe—is brought into focus. Images of neurotransmitters activating synapses, proteins docking into molecular targets, and animations of the birth of the solar system line the center’s corridors. What were once streams of mathematical theorems, equations and solutions are transformed into visual scenes, where the surreal approaches reality.

Such is daily life at SDSC, where petabytes of electronic data (a petabyte is a thousand trillion bytes) are transmitted, stored, mined, archived and computationally translated for a global user community of about 4,000 scientists and engineers, helping in their quest to solve the planet's most complex, yet critical, mysteries. If the world is awash in a tidal wave of data, then SDSC may be likened to an international dam, turning this on-rushing torrent into a large but gentle reservoir.

Indeed, the overriding theme of SDSC is the harnessing of data for science and society. Last year, the center expanded its archival tape storage capacity to 25 petabytes (25 thousand trillion bytes), or roughly 1,000 times the digital plain-text equivalent of the printed collection of the Library of Congress. As a result, SDSC’s home campus of the University of California, San Diego (UCSD) now has more storage capacity than any other educational institution in the world.

The center also maintains 2 petabytes of online disk storage, along with powerful high-end computing resources, ever-widening network bandwidth and a broad spectrum of software, portals, workbenches and services, the key components of what is generally referred to as cyberinfrastructure.

Combined with human expertise, these large-scale computing and storage resources make SDSC a world leader in data cyberinfrastructure, providing tools for scientists and engineers to discover new knowledge through end-to-end collaborative environments for data integration, large-scale analysis and simulation, data visualization and, finally, the dissemination and preservation of data-driven results.

For example, in 2006:

  • SDSC and an eight-institution team of scientists from the Southern California Earthquake Center conducted the largest and most detailed simulation to date of a magnitude-7.7 earthquake on a 230-kilometer stretch of the San Andreas Fault. In second-by-second detail, scientists were able to visualize the impact of the quake's powerful force on the city of Los Angeles and its 11 million residents. Such findings are not only giving earthquake scientists a more accurate picture of the potential destruction a quake of this size could trigger; they are also providing insights that will eventually find their way into building designs that mitigate such damage.
  • To learn more about the state of the planet's oceans, past and present, SDSC staffers worked with Carl Wunsch, professor of physical oceanography at MIT, and other scientists in the Estimating the Circulation and Climate of the Ocean (ECCO) Consortium to run their highly scalable parallel simulation code, MITgcm. The simulation results agreed well with observations of the Southern Ocean for the year 2000. Findings such as these offer scientists a more accurate way to assess how ocean temperature, motion and other physical characteristics affect the planet's climate, fishery dynamics and shipping.
  • David Baker, a Howard Hughes Medical Institute investigator at the University of Washington, worked with SDSC staff to dramatically step up the speed of three-dimensional protein structure prediction using a code dubbed Rosetta. One run for the Critical Assessment of Structure Prediction 7 (CASP7) competition was the largest calculation ever performed with the code: using Rosetta's Monte Carlo minimization scheme, it produced an accurate prediction of a CASP7 target in less than three hours, for a structure that would normally take weeks to compute (a toy sketch of the Monte Carlo minimization idea follows this list). Such research is expected to play a critical role in the rational design of future drugs for cancer, Alzheimer's disease and HIV, among other illnesses.

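Rosetta's actual energy function and move set are far more elaborate than anything that fits on a page, but the Monte Carlo minimization loop at its core can be sketched in miniature: randomly perturb a conformation, locally minimize it, then accept or reject the minimized result using the Metropolis criterion. The short Python sketch below illustrates only that loop; the "energy" function, the torsion-angle representation and all parameters are invented for illustration.

    import math
    import random

    # Stand-in "energy" over a list of torsion angles. Rosetta's real score
    # function is physically derived and vastly more complex.
    def energy(angles):
        return sum(math.sin(3 * a) + 0.1 * a * a for a in angles)

    # Crude local minimizer (coordinate descent); a stand-in for the
    # gradient-based minimization a production code would use.
    def local_minimize(angles, step=0.01, iters=50):
        best = list(angles)
        for _ in range(iters):
            for i in range(len(best)):
                for delta in (-step, step):
                    trial = list(best)
                    trial[i] += delta
                    if energy(trial) < energy(best):
                        best = trial
        return best

    def monte_carlo_minimization(n_angles=8, n_moves=200, kT=1.0):
        current = local_minimize([random.uniform(-math.pi, math.pi)
                                  for _ in range(n_angles)])
        for _ in range(n_moves):
            trial = list(current)
            trial[random.randrange(n_angles)] += random.gauss(0, 0.5)
            trial = local_minimize(trial)          # minimize after each move
            dE = energy(trial) - energy(current)   # Metropolis on minimized E
            if dE < 0 or random.random() < math.exp(-dE / kT):
                current = trial
        return current, energy(current)

    angles, final_energy = monte_carlo_minimization()
    print("final energy: %.3f" % final_energy)
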
Big problems require access to big computational resources, sometimes working together across a powerful grid. Toward that end, SDSC was one of the founding sites of the National Science Foundation's (NSF) TeraGrid, a multi-year effort to build and deploy the first national-scale grid infrastructure for open scientific research. Eight other centers participate in the TeraGrid project, including the National Center for Supercomputing Applications at the University of Illinois, Argonne National Laboratory, the Pittsburgh Supercomputing Center and the Texas Advanced Computing Center. Combined, the TeraGrid harnesses more than 150 teraflops of computing power (150 trillion calculations per second, roughly the combined power of 15,000 average desktop computers) through a cross-country network backbone that operates at 10 gigabits per second or more.

For its part, SDSC's national-scale computational resources, housed in its 13,000-square-foot machine room, include one of the top computers in the world, the IBM BlueGene Data (known inside SDSC as "Intimidata"), which packs 17.2 teraflops and more than 6,000 processors into just three racks. DataStar, another SDSC supercomputer, is a 15.6-teraflop IBM machine with a total shared memory of seven terabytes, designed for large-scale, data-intensive and compute-intensive scientific codes. These machines are available through the TeraGrid, along with an IA-64-based cluster with a total peak speed of 3.1 teraflops.
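
A quick back-of-the-envelope calculation shows what that packing density implies per processor. The Python snippet below assumes the standard Blue Gene/L layout of 2,048 processors per rack, which is consistent with the article's "more than 6,000 processors" in three racks but is an assumption rather than a figure from the text.

    # Rough sanity check of the BlueGene Data figures quoted above.
    # Assumes 2,048 processors per Blue Gene/L rack; the article says only
    # "more than 6,000 processors" across three racks.
    peak_flops = 17.2e12            # 17.2 teraflops, from the article
    processors = 3 * 2048           # assumed layout: 6,144 processors

    per_proc_gflops = peak_flops / processors / 1e9
    print("%d processors -> %.1f gigaflops each" % (processors, per_proc_gflops))
    # About 2.8 gigaflops per processor: the machine gets its speed by
    # packing thousands of modest, low-power processors tightly together.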

At SDSC, data is a driving force, and managing, analyzing, visualizing and computing with data are all critical to speeding science and engineering discovery. SDSC hosts the Protein Data Bank, a global resource for information on the three-dimensional structures of proteins and other biological molecules. The center also created DataCentral, the first nationally allocated storage infrastructure for community digital collections across all academic disciplines. DataCentral currently hosts 93 such collections, including digital tomographic images of the human brain, astronomical observations from the Two Micron All Sky Survey (2MASS), digital visualizations of earthquake simulations, tsunami data, digital videos of bee behavior, Chinese text from the Pacific Rim Library Alliance, digital data collections from the Library of Congress, and even digital images of Japanese art.

DataCentral also stores data on network behavior from the Cooperative Association for Internet Data Analysis (CAIDA), based at SDSC, which provides engineering and traffic analysis of Internet infrastructure and performance. The CAIDA collection includes data from the UCSD "Network Telescope," which monitors traffic arriving at a large block of otherwise unused address space, capturing unexpected traffic and network security events such as the infection of hosts by Internet worms.
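
The telescope's underlying idea fits in a few lines: traffic addressed to unused address space is unsolicited by definition, so a source that touches many such addresses is likely a scanning worm or backscatter from an attack. The Python sketch below is a toy version of that idea; the dark prefix and the packet records are invented, and CAIDA's real collection and analysis pipeline is far more sophisticated.

    from collections import Counter
    from ipaddress import ip_address, ip_network

    # Placeholder "dark" prefix; the real UCSD Network Telescope watches a
    # much larger block of routed but unused address space.
    DARK_PREFIX = ip_network("203.0.113.0/24")

    packets = [                               # (source, destination), invented
        ("198.51.100.7", "203.0.113.5"),
        ("198.51.100.7", "203.0.113.9"),
        ("198.51.100.7", "203.0.113.42"),
        ("192.0.2.33", "203.0.113.17"),
        ("192.0.2.33", "8.8.8.8"),            # ordinary traffic, ignored
    ]

    # Count unsolicited packets per source; heavy hitters stand out.
    hits = Counter(src for src, dst in packets
                   if ip_address(dst) in DARK_PREFIX)

    for src, count in hits.most_common():
        print("%s: %d unsolicited packets" % (src, count))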

Last year, the National Archives and Records Administration (NARA), the NSF and SDSC/UCSD signed a landmark Memorandum of Understanding providing the legal basis for SDSC to preserve federal electronic records and other informational materials resulting from federally sponsored scientific and engineering research and education. Preserving valuable digital assets is critical if the nation is to maintain its competitive edge in science and education. The agreement marked the first time NARA had established an affiliated relationship for preserving digital data with an academic institution.

SDSC also signed an agreement with the National Center for Atmospheric Research (NCAR) under which each institution archives 100 terabytes of the other's data, a significant step toward replicating and protecting critical research and education data collections for the science and engineering communities.

Managing and using the current explosion of data is often easier said than done. For almost a decade, SDSC's Storage Resource Broker (SRB) has been widely used to manage and integrate distributed shared collections for a variety of academic research projects in the United States and around the globe. A new middleware system from the same group, iRODS (the integrated Rule-Oriented Data System), provides next-generation data management that can be easily customized to user needs and community policies.
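
iRODS expresses such policies in its own rule language, mapping administrative events to conditions and actions that the system enforces automatically. The Python sketch below is only a schematic of that rule-oriented idea; the names, events and actions are invented for illustration and do not reflect actual iRODS syntax.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        event: str                            # e.g. "ingest", "access"
        condition: Callable[[dict], bool]     # when does the rule apply?
        action: Callable[[dict], None]        # what should happen then?

    def replicate(obj):
        print("replicating %s to the tape archive" % obj["name"])

    def checksum(obj):
        print("verifying checksum for %s" % obj["name"])

    # Community policy stated as data rather than buried in application code:
    # replicate everything in the earthquake collection; checksum large files.
    RULES = [
        Rule("ingest", lambda o: o["collection"] == "earthquake", replicate),
        Rule("ingest", lambda o: o["size"] > 1_000_000_000, checksum),
    ]

    def on_event(event, obj):
        for rule in RULES:
            if rule.event == event and rule.condition(obj):
                rule.action(obj)

    on_event("ingest", {"name": "shakemap.h5",
                        "collection": "earthquake",
                        "size": 2_500_000_000})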

SDSC is also contributing to the development and implementation of efficient schemes for moving data over wireless networks in real time. Center researchers are working on several projects with the NSF-supported High Performance Wireless Research and Education Network (HPWREN), led by Principal Investigator Hans-Werner Braun at SDSC and Co-Principal Investigator Frank Vernon at the Scripps Institution of Oceanography, both at UCSD, in collaboration with scientists at San Diego State University.

Last summer, HPWREN researchers were recruited by the California Department of Forestry and Fire Protection (CDF) to establish a critical communications lifeline for firefighters battling a 7,000-acre wildfire, known as the Horse Fire, in the Cleveland National Forest. The researchers set up hardware at key points so that firefighters in remote locations could communicate over a wireless link from the Horse Fire incident command post to the Internet. HPWREN also plays an important role in large-scale sensor network applications in several NSF initiatives covering the earth sciences, oceanography, biology and earthquake engineering simulation.

The Synthesis Center, operated by SDSC and the California Institute for Telecommunications and Information Technology (Calit2) at UCSD, was launched in 2005 to help today's scientists and engineers address complex, multidisciplinary problems collaboratively. Synthesis Center investigators have made significant advances in visualization techniques and technologies. For example, SDSC's Greg Quinn is developing a program that will allow doctors to view a patient's medical history, including X-rays and diagnostic scans, on mobile devices such as cell phones and PDAs, a key step in the development of personalized medicine.

Of all the resources at SDSC, perhaps the greatest is its professional staff of scientists, computer scientists, software developers and support personnel. The staff helps users optimize their experience with the Center's computational and data resources, and partners with the community in the large-scale collaborations that are the hallmark of today's science. SDSC is an integral partner in key cyberinfrastructure-oriented community projects, including GEON (development of cyberinfrastructure for the geosciences), BIRN (development of a national infrastructure for biomedical informatics) and the cyberinfrastructure center for the George E. Brown, Jr. Network for Earthquake Engineering Simulation (NEES) project.

A vital part of SDSC's guiding philosophy is the empowerment of science and engineering communities, both present and future. SDSC offers full-time user support, including 24-hour helpdesk services, code optimization, training and portal development, as well as workshops, short courses and outreach activities for the community. During the summer, SDSC opens its auditorium doors to hundreds of people interested in hands-on training in the use of cyberinfrastructure and high-performance computing across a variety of disciplines, including the humanities, arts and social sciences.

SDSC also invites students and teachers to experience computing at its highest levels. More than 1,200 teachers from about 140 schools attended TeacherTECH workshops at SDSC in 2006, more than double the participation from the previous year. What’s more, six workshops introduced nearly 200 high school students and community college biology and ecology professors to SDSC’s Discover Data Portal.

Through such efforts, the stream of data for tomorrow's discoveries will continue to flow, waiting to be mined by the next generation of scientists and engineers. And SDSC will be there to harness that data for all those seeking solutions to big problems and answers to intractable mysteries.

(For more information about the San Diego Supercomputer Center, please visit the SDSC website at www.sdsc.edu.)

Warren Froelich is Director of Communications and Public Relations at the San Diego Supercomputer Center.
