What is Data Science in the 21st Century?
By Barbara Ryder, Virginia Tech
Last July, a distinguished panel of computer scientists discussed this question with approximately 100 attendees of the CRA Conference at Snowbird. The panelists were David Culler (UC Berkeley), Rayid Ghani (U of Chicago), Rahel Jhirad (Hearst), and Rob Rutenbar (UIUC). There was agreement that data science is an interdisciplinary field, combining techniques from machine learning, natural language processing, data mining, algorithms, and information retrieval, among others.
Rutenbar and Culler addressed the relationship between computer science (CS) and data science in academia, and the skills students need to pursue a career in data science. Rutenbar characterized the growing interest in CS and data science as a “tsunami” of student enrollment. He described new UIUC blended majors, CS+X, which present CS as an enabling technology (e.g., with X as anthropology, astronomy, linguistics, or statistics). These CS+X majors integrate a subset of topics from a CS degree with a specific subset of domain topics, resulting in an integrated course of study achieved through close cooperation between faculty in the two departments. Rutenbar predicted that data science will soon be added to the mix (i.e., CS+DS+X majors). Culler described the Berkeley approach, which begins with a new course, Foundations of Data Science, that fundamentally co-mingles CS and statistics through active learning exercises. This course is the foundation for a variety of majors that integrate CS and statistics with, for example, physics or business, emphasizing “principled use of computational methods and statistical techniques” to draw conclusions about complex data. Culler described experimentation with delivering this vision in different ways, including a possible data science minor accompanying many of today’s majors.
Ghani spoke of the many perspectives on data science, including the emphasis in academia on using data science for social good, the emphasis in industry on predictive modeling, and the government/non-profit emphasis on specific public policy objectives enabled by data science and mobile technology. Stressing application to real-world problems, Ghani presented specific ideas on how data science could be applied to improve public policy. Jhirad discussed the skill set new data science employees need to become productive contributors in industry. She catalogued the many different functional roles of an industrial data scientist, including database architect, ‘big data’ engineer, and machine learning/statistics expert, and emphasized the need to communicate well with domain experts.
From the wide-ranging discussions, it became clear that CS departments are either involved in or interested in data science educational efforts, many of them integrated across the university.
Questions discussed included:
- Where should data science research be published to reach both the CS and domain communities?
- How to engage domain departments in data science?
- What to teach about ‘data cleaning’, a key activity of data scientists?
- Where does ethics ‘fit’ as part of data science studies?
- How should the study of data science be structured, given the heterogeneity of models, data, and validation techniques?
- And how can academia and industry interact productively in the education of future data scientists?
More information about the session including panel slides, as well as articles, resources, and a listing of data science programs contributed by the participants, is available at http://bit.ly/cradsp.
Given the strong interest in data science in the community, CRA formed a committee on data science, which released a public announcement on behalf of the CRA Board of Directors, Computing Research and the Emerging Field of Data Science. It was posted on October 7, 2016 and is available here: https://cra.org/data-science/. Also of possible interest is the recent statement of the American Statistical Association on the Role of Statistics in Data Science, available here: http://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/. In addition, a subcommittee of the NSF CISE Advisory Committee is preparing a report on data science, which it hopes to release in December 2016.