
Table representation learning for dataset discovery and data integration in data lakes

Summary

This thesis proposal aims to define and develop new solutions for structured tabular data discovery by learning table representations with Large Language Models (LLMs) and Graph Neural Networks (GNNs). The premise is that the transfer learning capabilities of LLMs, combined with the ability of GNNs to handle graph-structured data, provide a robust framework for modern data integration, enabling deeper analysis and more accurate models for discovering and integrating heterogeneous datasets in a data lake. The work requires theoretical and practical expertise in structured data processing and deep learning.

Context and Motivation

Table representation learning [1,4,5,8,9] offers a powerful approach to data analysis and interpretation. By leveraging Large Language Models (LLMs) [7] and Graph Neural Networks (GNNs) [2], which encode data into dense numerical vectors, traditional tabular data (rows and columns of cells) can be transformed into rich, semantic representations. This enables a more nuanced understanding and exploration of complex datasets, facilitating deeper insights and more accurate predictions.
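As an illustration of this encoding step, the following minimal sketch serializes the columns of a toy table into strings and embeds them with a pre-trained language model. Python and the sentence-transformers library are illustrative assumptions here, not choices prescribed by this proposal.

    # Minimal sketch: encode table columns as dense vectors with a
    # pre-trained language model (the model name is an assumption).
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Toy table: column name -> cell values.
    table = {
        "city": ["Paris", "Lyon", "Marseille"],
        "population": ["2148000", "515695", "861635"],
    }

    # Serialize each column into one string, then encode all of them.
    texts = [f"{name}: {', '.join(cells)}" for name, cells in table.items()]
    embeddings = model.encode(texts)  # one dense vector per column
    print(embeddings.shape)           # (2, 384) for this model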

LLMs and GNNs have been widely used for various data processing tasks such as table extraction [14], data cleaning [10], column type annotation [15], data and schema matching [16], table classification, question answering [17], and dataset discovery [3]. In this thesis we are particularly interested in the use of table representation learning for related table discovery [11,12]. Recent work shows that LLMs and GNNs can be used for data matching [16] in general, and for the detection of joinable tables [6] and unionable tables [13] in particular, which are fundamental tasks for dataset discovery and data integration in data lakes. However, current approaches remain very task specific and rather limited in the semantics they are able to capture over tabular data.
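To make the joinability task concrete, the sketch below scores column pairs by the cosine similarity of their embeddings, in the spirit of DeepJoin [6]; the "header: values" serialization and the interpretation of the scores are simplifying assumptions, not the published pipeline.

    # Sketch of embedding-based joinable-column discovery (simplified,
    # DeepJoin-inspired; not the published method).
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    col_a = "country: France, Germany, Italy, Spain"
    col_b = "nation: Italy, Spain, Portugal, France"
    col_c = "price: 9.99, 14.50, 3.20, 7.80"

    # normalize_embeddings=True turns dot products into cosine similarities.
    vecs = model.encode([col_a, col_b, col_c], normalize_embeddings=True)

    print(vecs[0] @ vecs[1])  # high similarity: join candidates
    print(vecs[0] @ vecs[2])  # low similarity: unrelated columns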

Scientific objective and methodology

One goal of using LLMs and GNNs for data integration is to facilitate the merging of heterogeneous datasets from different sources into a unified, semantically meaningful representation. Embeddings transform raw data, which may come in various formats and structures, into a lower-dimensional vector space in which similarities and relationships between data points are preserved.
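On the GNN side, a table must first be cast into a graph before it can be embedded. The sketch below builds a simple bipartite row/column graph with PyTorch Geometric; the construction is loosely inspired by hypergraph-based models such as HyTrel [4], but the exact graph layout and the random feature initialization are illustrative assumptions.

    # Sketch: represent a toy 3x2 table as a bipartite graph for a GNN.
    # Nodes 0-2 stand for rows, nodes 3-4 for columns; each cell becomes
    # an edge between its row node and its column node.
    import torch
    from torch_geometric.data import Data

    num_rows, num_cols = 3, 2
    edges = [(r, num_rows + c) for r in range(num_rows)
                               for c in range(num_cols)]
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()

    # In practice, node features would be LLM embeddings of headers and
    # cell values; random vectors stand in for them here.
    x = torch.randn(num_rows + num_cols, 16)

    graph = Data(x=x, edge_index=edge_index)
    print(graph)  # Data(x=[5, 16], edge_index=[2, 6])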

By leveraging table representation learning, we aim to overcome the challenges of integrating diverse datasets, such as differences in data formats, schemas, and semantics. The generated embedding vectors capture the inherent characteristics and relationships within the data, enabling more effective integration and analysis. Furthermore, embeddings can enhance the interoperability of data across domains, allowing for cross-domain data integration and knowledge transfer.

More precisely, learning representations for a set of tables is difficult because it involves a sequence of non-trivial steps. Our goal is to design a novel and effective solution for each of these steps.

Another goal of this project is to address scalability and efficiency issues. While both LLMs and GNNs have shown promising scalability to large datasets, more efficient training and inference algorithms are still needed to handle the growing volumes of data encountered in modern data integration scenarios.
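One concrete direction for this issue is to index the embeddings of an entire data lake so that candidate tables are retrieved by nearest-neighbor search instead of exhaustive pairwise comparison. The sketch below uses FAISS with random stand-in vectors; the library choice and the index type are illustrative assumptions, not commitments of this proposal.

    # Sketch: index data-lake column embeddings for fast similarity search.
    # IndexFlatIP performs exact inner-product search; approximate indexes
    # (e.g. IVF or HNSW) can replace it for very large lakes.
    import faiss
    import numpy as np

    d = 384                                            # embedding size
    rng = np.random.default_rng(0)
    lake = rng.random((100_000, d), dtype=np.float32)  # stand-in vectors
    faiss.normalize_L2(lake)                           # cosine via dot product

    index = faiss.IndexFlatIP(d)
    index.add(lake)

    query = rng.random((1, d), dtype=np.float32)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)               # top-5 candidates
    print(ids[0], scores[0])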

Justification of the scientific approach

The use of Large Language Models (LLMs) and Graph Neural Networks (GNNs) for data integration is justified for several reasons: LLMs contribute transfer learning capabilities acquired through large-scale pre-training, while GNNs contribute the ability to model graph-structured data, such as the relationships between cells, rows, columns, and tables.

Overall, the combination of LLMs and GNNs offers a powerful and versatile framework for data integration, leveraging the strengths of both approaches to handle the complexities of modern data integration.

References

  1. Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. 2023. Transformers for Tabular Data Representation: A Survey of Models and Applications. Transactions of the Association for Computational Linguistics 11: 227–249. https://doi.org/10.1162/tacl_a_00544
  2. Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9: 1616–1637.
  3. Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez, Emilia Kacprzak, and Paul Groth. 2020. Dataset search: a survey. The VLDB Journal 29, 1: 251–272. https://doi.org/10.1007/s00778-019-00564-x
  4. Pei Chen, Soumajyoti Sarkar, Leonard Lausen, Balasubramaniam Srinivasan, Sheng Zha, Ruihong Huang, and George Karypis. 2023. HyTrel: Hypergraph-enhanced Tabular Data Representation Learning. Advances in Neural Information Processing Systems 36: 32173–32193. Retrieved March 27, 2024 from https://proceedings.neurips.cc/paper_files/paper/2023/hash/66178beae8f12fcd48699de95acc1152-Abstract-Conference.html
  5. Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2022. Turl: Table understanding through representation learning. ACM SIGMOD Record 51, 1: 33–40.
  6. Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, and Masafumi Oyamada. 2023. DeepJoin: Joinable Table Discovery with Pre-Trained Language Models. Proceedings of the VLDB Endowment 16, 10: 2458–2470. https://doi.org/10.14778/3603581.3603587
  7. Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. 2024. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding – A Survey. Retrieved March 13, 2024 from http://arxiv.org/abs/2402.17944
  8. Madelon Hulsebos, Xiang Deng, Huan Sun, and Paolo Papotti. 2023. Models and Practice of Neural Table Representations. In Companion of the 2023 International Conference on Management of Data (SIGMOD ’23), 83–89. https://doi.org/10.1145/3555041.3589411
  9. Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained Representations of Tabular Data. arXiv preprint arXiv:2105.02584.
  10. Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, and Mirek Riedewald. 2021. DomainNet: Homograph Detection for Data Lake Disambiguation. https://doi.org/10.48550/arXiv.2103.09940
  11. Rutian Liu, Eric Simon, Bernd Amann, and Stéphane Gançarski. 2019. Augmenting analytic datasets using natural and aggregate-based schema complements. In Bases de données avancées (BDA 2019). Retrieved from https://hal.sorbonne-universite.fr/hal-03981998
  12. Rutian Liu, Eric Simon, Bernd Amann, and Stéphane Gançarski. 2020. Discovering and merging related analytic datasets. Information Systems 91: 101495. https://doi.org/10.1016/j.is.2020.101495
  13. Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. 2018. Table union search on open data. Proceedings of the VLDB Endowment 11, 7: 813–825.
  14. Shubham Singh Paliwal, D. Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. 2019. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 128–133. Retrieved April 3, 2024 from https://ieeexplore.ieee.org/abstract/document/8978013/
  15. Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, Çağatay Demiralp, Chen Chen, and Wang-Chiew Tan. 2022. Annotating Columns with Pre-trained Language Models. In Proceedings of the 2022 International Conference on Management of Data, 1493–1503. https://doi.org/10.1145/3514221.3517906
  16. Runhui Wang, Yuliang Li, and Jin Wang. 2022. Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation. Retrieved April 2, 2024 from http://arxiv.org/abs/2207.04122
  17. Bowen Zhao, Changkai Ji, Yuejie Zhang, Wen He, Yingwen Wang, Qing Wang, Rui Feng, and Xiaobo Zhang. 2023. Large Language Models are Complex Table Parsers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 14786–14802. https://doi.org/10.18653/v1/2023.emnlp-main.914