Machine Learning for Storage and Execution Layers of Database Systems

CIFellows Spotlight highlights the work of the Computing Innovation Fellows (CIFellows) for the computing research community.

Ibrahim Sabek began his CIFellowship in September 2020 after receiving his PhD from the University of Minnesota in January 2020. Sabek is at the Massachusetts Institute of Technology (MIT) working with Michael Cafarella, Principal Research Scientist, and Tim Kraska, Associate Professor, at MIT’s Computer Science and Artificial Intelligence Laboratory.

Current Project

My current project uses machine learning (ML) models to optimize the performance of data-intensive systems, with a special focus on the data access and query execution modules. This includes introducing ML-optimized core data structures, such as indexes and Bloom filters, and boosting the performance of core in-memory operations, such as joins and query scheduling, using statistical and deep learning techniques.

ML has been applied in many computing fields, such as computer vision, natural language processing, artificial intelligence, and bioinformatics, where researchers have succeeded in building solutions that autonomously exhibit useful learning behavior. Data management is no exception: over the past few decades there has been a flurry of research on using ML to choose database indexes automatically and to fine-tune physical query plans. However, these efforts remain limited and have not yet explored the full power of ML. My project aims to complement them by exploring the applicability of learned solutions inside the core functions of database engines, rather than using ML only as an out-of-the-box solution on top of database operations.

Impact

With the current explosion in the amount of data generated around us, database vendors need intelligent ways to manage, query, and analyze data. With my work, the core data access and query execution operations in databases become smarter and instance-optimized for user-specific data and query workloads, resulting in improved performance. This paves the way for building fully self-driving, full-fledged learned data management systems with value-added features that differentiate them from traditional databases.

Other Research

Besides this project, I have a special interest in spatial data analysis and management, and in how statistical learning and inference methods can be adopted to provide efficient and scalable spatial-aware knowledge bases and analytics tools. In this line of work, my research won first place in the graduate Student Research Competition (SRC) at ACM SIGSPATIAL’19, a top research venue in spatial data management and computing. In addition, my TurboReg system, a tool for efficient and scalable spatial regression analysis, was selected among the best papers of ACM SIGSPATIAL’18, and its extension, the RegRocket system, was invited to a special issue of ACM TSAS, a top journal in spatial algorithms and systems.