A Task-Centric Framework to Revolutionize Big Data Systems Research

The following Great Innovative Idea is from Da Yan, tenure-track assistant professor in the Department of Computer Science at the University of Alabama at Birmingham (UAB). Yan presented his poster, "A Task-Centric Framework to Revolutionize Big Data Systems Research," at the Computing Community Consortium (CCC) Early Career Researcher Symposium, August 1-2, 2018.

The Idea

Big Data frameworks such as Apache Hadoop and Apache Spark are becoming increasingly popular due to their emphasis on ease of programming, but they are predominantly designed for data-intensive iterative computations, and an efficient solution for compute-intensive Big Data analytics is still lacking. Based on my insight that compute-intensive problems are often solved by divide and conquer (e.g., a recursive algorithm), I have developed a general task-centric framework, called T-thinker, for compute-intensive Big Data problems. The framework effectively utilizes the CPU cores in a cluster by dividing a problem over a big dataset into tasks over smaller subsets of the dataset, and by overlapping CPU processing with network communication (e.g., requesting a subset of the dataset).
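The two ideas in this paragraph, splitting a problem into tasks over small data subsets and overlapping computation with data requests, can be illustrated with a minimal single-process sketch. This is not T-thinker's API: the names GRAPH, fetch_subset, run_task and count_triangles are hypothetical, the "network" is simulated with a short sleep, and a thread pool stands in for a cluster. The example counts triangles in a tiny graph, with one task per vertex.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "remote" dataset: adjacency lists of a small undirected graph.
GRAPH = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}


def fetch_subset(vertex):
    """Simulate a network request for the data that one task needs.

    A task rooted at `vertex` only needs the adjacency lists of its
    higher-numbered neighbors, which is a small subset of the dataset.
    """
    time.sleep(0.01)  # stand-in for network latency (assumption)
    higher = [u for u in GRAPH[vertex] if u > vertex]
    return {u: set(GRAPH[u]) for u in higher}


def run_task(vertex, subset):
    """CPU work: count triangles whose smallest vertex is `vertex`."""
    nbrs = sorted(subset)
    return sum(
        1
        for i, u in enumerate(nbrs)
        for w in nbrs[i + 1:]
        if w in subset[u]  # edge (u, w) closes the triangle (vertex, u, w)
    )


def count_triangles(num_fetchers=2):
    """Divide the problem into per-vertex tasks and overlap fetch with compute."""
    with ThreadPoolExecutor(max_workers=num_fetchers) as pool:
        # Issue every data request up front; they proceed in the background.
        pending = [(v, pool.submit(fetch_subset, v)) for v in GRAPH]
        total = 0
        for v, future in pending:
            # While this task computes, later requests are still being served.
            total += run_task(v, future.result())
    return total


if __name__ == "__main__":
    print(count_triangles())
```

Each task touches only the adjacency lists it requested, so tasks are independent and could in principle be spread across cores or machines; the sketch keeps the compute loop sequential only to stay short.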

Impact

Many compute-intensive applications can be built on top of T-thinker for efficient parallel execution, such as community detection, subgraph matching, training decision trees, frequent pattern mining, facility location problems, and matrix computations. A successful example is the graph mining system G-thinker, open-sourced at http://www.cs.uab.edu/yanda/gthinker/. T-thinker will greatly benefit researchers and practitioners who need tools for compute-intensive Big Data processing, which are currently lacking.

Other Research

Dr. Da Yan’s research interests include Big Data analytics systems, algorithms for processing Big Data, parallel/distributed computing, data mining and machine learning.

Researcher’s Background

Dr. Da Yan is currently a tenure-track Assistant Professor in the Department of Computer Science at the University of Alabama at Birmingham. He is the sole winner of the Hong Kong 2015 Young Scientist Award in Physical/Mathematical Science, and the recipient of the DASFAA 2011 Best Paper Award. He has developed a comprehensive platform of systems, collectively called BigGraph@CUHK, for data-intensive iterative big graph analytics. These systems are orders of magnitude faster than their competitors, and have been used by other researchers in work published in top venues such as SIGMOD, ICDE, and IEEE Cluster. Dr. Yan regularly publishes in first-tier conferences and journals such as SIGMOD, PVLDB, SIGKDD, ICDE, WWW, TKDE, TPDS, SoCC, and EuroSys. He was invited as first author to write two books, in Foundations and Trends in Databases and Springer Briefs in Computer Science, respectively, and a book chapter in the Encyclopedia of Big Data Technologies. He also regularly serves as a reviewer for top journals including TODS, VLDBJ, TKDE, TPDS, WWWJ, and TNSE, and serves on the program committees of top conferences such as SIGMOD 2019, PVLDB 2018, IJCAI 2017, ICPP 2018, ICA3PP 2017, IRI 2017-2018, and ICPADS 2016. Dr. Yan is the lead program co-chair of the BIOKDD 2018 workshop, held in conjunction with SIGKDD 2018, and serves on the PCs of a number of other workshops on database and data mining research. Dr. Yan’s research is sponsored by NSF, Microsoft Azure, and the South Big Data Hub.