This article is published in the March 2015 issue.

NSF and the National Big Data Initiative


Three years have passed since the launch in March 2012 of the National Big Data Research and Development Initiative by the White House Office of Science and Technology Policy (OSTP). The breathtaking pace of activity in big data has continued unabated in the intervening years. In August 2014, Gartner declared that “big data” had passed the peak of the so-called “Hype Cycle.” This only means that the community can now roll up their collective sleeves and get to work on the real issues, rather than worrying about the hype!

While dealing with data is not a new phenomenon—whether in science, business, or government—there is recognition in every field and discipline that the easy availability of vast amounts of data, continuous data streams, a heterogeneous range of datasets, and the use of all of these data for action and insights—including in “real time”—has indeed created a new phenomenon, which the community is beginning to embrace as the new discipline of Data Science, or Data Science and Engineering.

Every national priority area and initiative, whether cybersecurity, Precision Medicine, BRAIN, smart energy, Materials Genome, or Advanced Manufacturing, will generate more data and new kinds of data, and will run into the challenges of Big Data. Scientific breakthroughs and new business, government, and societal applications will only come from effective use of all these data. The ability to convert data to action and insights will be the gating factor in our ability to create efficient and effective solutions in all of these priority areas.

Since the Big Data Initiative announcement three years ago, the White House has taken action in a number of ways. In May 2013, the Administration released the Open Data Policy so that information generated and stored by the Federal Government is made more open and accessible to innovators and the public to fuel entrepreneurship and economic growth while increasing government transparency and efficiency. Given the interest in gaining maximum value from their data assets, government agencies are hiring Chief Data Officers and Data Scientists. On February 18, 2015, the White House announced the appointment of Dr. DJ Patil as the first U.S. Chief Data Scientist. In a keynote talk the next day at the Strata+Hadoop World Conference in San Jose, Dr. Patil noted that he had already seen a number of innovative uses of data in government, sometimes even surpassing industry’s use of data.

Coordination among federal agencies for the Big Data Initiative is enabled through the Networking and Information Technology Research and Development (NITRD) Big Data Senior Steering Group (BDSSG), co-chaired by NSF and NIH, with members from DARPA, DOD OSD, DHS, DOE-Science, HHS, NARA, NASA, NIST, NOAA, NSA, and USGS. Last fall, the BDSSG issued a Request for Input to inform the development of a framework, set of priorities, and ultimately a strategic plan for the National Big Data Initiative[1]; and last month, NSF sponsored a workshop at Georgetown University to obtain additional input from academia, industry, and the community at large. A second related workshop was hosted by the Homeland Security Advanced Research Projects Agency (HSARPA) on February 23, 2015 in Washington, DC.

One of the cornerstones of the original Big Data Initiative announcement was the creation of a research program in Core Technologies and Techniques for Advancing Big Data Science & Engineering, or BigData. In the first year, this was a joint initiative between NSF and NIH. In the second year (2014), NIH initiated its BD2K program, and NSF continued with the BigData program. The third and most recent NSF BigData solicitation, released on February 19, 2015, includes participation from all NSF Directorates, as well as participation by the Office of Financial Research, Department of the Treasury. In subsequent years, we hope to collaborate with additional agencies.

In addition to leading research efforts to advance Big Data science and engineering, NSF is also providing leadership to accelerate the Big Data innovation ecosystem. Building on the momentum of the White House Data to Knowledge to Action event in November 2013, which announced new Big Data partnerships, NSF last fall issued a Request For Input on the formation of Big Data Regional Innovation Hubs. We plan to soon announce a series of regional workshops for later this year to further explore this concept.

While the Big Data program creates enormous opportunities for creating new knowledge from large-scale data across all disciplines, there are also new challenges to be addressed, including sustainability (identifying which data, from the vast ocean of data sets, need to be retained for the long term, and the business models to support that); reproducibility (ensuring that results from data experiments can be reproduced at a later time, especially by others); data to action (how to reach decisions and take confident action from data, for example, in business and/or government applications); and data to insight (obtaining an understanding of the underlying phenomena from data, for example, in medical and/or scientific applications).

Furthermore, even as organizations grapple with issues of managing and exploiting their ever-increasing data resources, we may only be at the beginning of this data deluge. With the impending arrival of the so-called Internet of Things (IoT), one can expect even larger volumes of data from a vast array of sensors spanning spatial scales from the individual (e.g., wearables and personal monitoring devices), to the home or factory (Smart Homes, Industrial Internet), and urban environments (Smart Cities). NSF is planning to organize a series of workshops in 2015 on the topic of Big Data and the Internet of Things.

The Big Data phenomenon has led to the recognition of data science and engineering as a new disciplinary area, not just at the PhD and Master’s levels but also at the undergraduate level. A number of universities are actively developing a full undergraduate curriculum in Data Science, or Data Science and Engineering. Indeed, the picture that is emerging is that the scale of data science is much larger than we had originally anticipated. As an example from the Big Data Strategic Initiative workshop at Georgetown earlier this year, Dr. Andrew Moore’s keynote noted that Google currently employs about 10,000 people who help curate Web data to assist in Google Search[2]. The CISE-supported Expeditions in Computing project AMPLab at UC Berkeley is about Algorithms, Machines, and People. The “human in the loop” will be a key factor in our ability to fully exploit data resources in the future.

As the Google example illustrates, much of this work is not necessarily at the graduate level. There will likely be an ecosystem of data science-related employment at the graduate level, undergraduate level and, possibly, at the community college level. NSF plans to organize workshops to explore educational opportunities to serve all aspects of the data science industry.

These are indeed exciting times for Big Data and Data Science, and we invite you to participate in this new opportunity, not only for the CISE community but for a host of related disciplines!