
Table representation learning for dataset discovery and data integration in data lakes

Summary

This thesis proposal aims to define and develop new solutions for structured tabular data discovery by learning table representations with Large Language Models (LLMs) and Graph Neural Networks (GNNs). The premise is that the transfer learning capabilities of LLMs, combined with the ability of GNNs to handle graph-structured data, provide a robust framework for modern data integration, enabling deeper analysis and more accurate models for discovering and integrating heterogeneous datasets in a data lake. The work requires theoretical and practical expertise in structured data processing and deep learning.

Context and Motivation

Table representation learning [1,4,5,8,9] offers a powerful approach to data analysis and interpretation. By leveraging Large Language Models (LLMs) [7] and Graph Neural Networks (GNNs) [2], which encode data into dense numerical vectors, traditional tabular data (rows and columns of cells) can be transformed into rich, semantic representations. This enables a more nuanced understanding and exploration of complex datasets, facilitating deeper insights and more accurate predictions.
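As an illustration of this encoding step, the following minimal sketch serializes the columns of a toy table into strings and embeds them with a pre-trained language model. Python and the sentence-transformers library are illustrative assumptions here, not choices prescribed by this proposal.

    # Minimal sketch: encode table columns as dense vectors with a
    # pre-trained language model (the model name is an assumption).
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Toy table: column name -> cell values.
    table = {
        "city": ["Paris", "Lyon", "Marseille"],
        "population": ["2148000", "515695", "861635"],
    }

    # Serialize each column into one string, then encode all of them.
    texts = [f"{name}: {', '.join(cells)}" for name, cells in table.items()]
    embeddings = model.encode(texts)  # one dense vector per column
    print(embeddings.shape)           # (2, 384) for this model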

LLMs and GNNs have been widely used for various data processing tasks such as table extraction [14], data cleaning [10], column type annotation [15], data and schema matching [16], table classification, question answering [17], and dataset discovery [3]. In this thesis we are particularly interested in the use of table representation learning for related table discovery [11,12]. Recent work shows that LLMs and GNNs can be used for data matching [16] in general, and for the detection of joinable tables [6] and unionable tables [13] in particular, which are fundamental tasks for dataset discovery and data integration in data lakes. However, current approaches remain very task specific and rather limited in the semantics they are able to capture over tabular data.
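To make the joinability task concrete, the sketch below scores column pairs by the cosine similarity of their embeddings, in the spirit of DeepJoin [6]; the "header: values" serialization and the interpretation of the scores are simplifying assumptions, not the published pipeline.

    # Sketch of embedding-based joinable-column discovery (simplified,
    # DeepJoin-inspired; not the published method).
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    col_a = "country: France, Germany, Italy, Spain"
    col_b = "nation: Italy, Spain, Portugal, France"
    col_c = "price: 9.99, 14.50, 3.20, 7.80"

    # normalize_embeddings=True turns dot products into cosine similarities.
    vecs = model.encode([col_a, col_b, col_c], normalize_embeddings=True)

    print(vecs[0] @ vecs[1])  # high similarity: join candidates
    print(vecs[0] @ vecs[2])  # low similarity: unrelated columns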

Scientific objective and methodology

One goal of using LLMs and GNNs for data integration is to facilitate the merging of heterogeneous datasets from different sources into a unified, semantically meaningful representation. Embeddings transform raw data, which may come in various formats and structures, into a lower-dimensional vector space in which similarities and relationships between data points are preserved.
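On the GNN side, a table must first be cast into a graph before it can be embedded. The sketch below builds a simple bipartite row/column graph with PyTorch Geometric; the construction is loosely inspired by hypergraph-based models such as HyTrel [4], but the exact graph layout and the random feature initialization are illustrative assumptions.

    # Sketch: represent a toy 3x2 table as a bipartite graph for a GNN.
    # Nodes 0-2 stand for rows, nodes 3-4 for columns; each cell becomes
    # an edge between its row node and its column node.
    import torch
    from torch_geometric.data import Data

    num_rows, num_cols = 3, 2
    edges = [(r, num_rows + c) for r in range(num_rows)
                               for c in range(num_cols)]
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()

    # In practice, node features would be LLM embeddings of headers and
    # cell values; random vectors stand in for them here.
    x = torch.randn(num_rows + num_cols, 16)

    graph = Data(x=x, edge_index=edge_index)
    print(graph)  # Data(x=[5, 16], edge_index=[2, 6])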

By leveraging table representation learning, we aim to overcome the challenges of integrating diverse datasets, such as differences in data formats, schemas, and semantics. The generated embedding vectors capture the inherent characteristics and relationships within the data, enabling more effective integration and analysis. Furthermore, embeddings can enhance the interoperability of data across domains, allowing for cross-domain data integration and knowledge transfer.

More precisely, learning representations for a set of tables is difficult because it involves a sequence of non-trivial steps. Our goal is to design a novel and effective solution for each of these steps.

Another goal of this project is to address scalability and efficiency issues. While both LLMs and GNNs have shown promising scalability to large datasets, more efficient training and inference algorithms are still needed to handle the growing volumes of data encountered in modern data integration scenarios.
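One concrete direction for this issue is to index the embeddings of an entire data lake so that candidate tables are retrieved by nearest-neighbor search instead of exhaustive pairwise comparison. The sketch below uses FAISS with random stand-in vectors; the library choice and the index type are illustrative assumptions, not commitments of this proposal.

    # Sketch: index data-lake column embeddings for fast similarity search.
    # IndexFlatIP performs exact inner-product search; approximate indexes
    # (e.g. IVF or HNSW) can replace it for very large lakes.
    import faiss
    import numpy as np

    d = 384                                            # embedding size
    rng = np.random.default_rng(0)
    lake = rng.random((100_000, d), dtype=np.float32)  # stand-in vectors
    faiss.normalize_L2(lake)                           # cosine via dot product

    index = faiss.IndexFlatIP(d)
    index.add(lake)

    query = rng.random((1, d), dtype=np.float32)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)               # top-5 candidates
    print(ids[0], scores[0])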

Justification of the scientific approach

The use of Large Language Models (LLMs) and Graph Neural Networks (GNNs) for data integration is justified for several reasons: LLMs contribute transfer learning capabilities acquired through large-scale pre-training, while GNNs contribute the ability to model graph-structured data, such as the relationships between cells, rows, columns, and tables.

Overall, the combination of LLMs and GNNs offers a powerful and versatile framework for data integration, leveraging the strengths of both approaches to handle the complexities of modern data integration.

References

  1. Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. 2023. Transformers for Tabular Data Representation: A Survey of Models and Applications. Transactions of the Association for Computational Linguistics 11: 227–249. https://doi.org/10.1162/tacl_a_00544
  2. Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9: 1616–1637.
  3. Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez, Emilia Kacprzak, and Paul Groth. 2020. Dataset search: a survey. The VLDB Journal 29, 1: 251–272. https://doi.org/10.1007/s00778-019-00564-x
  4. Pei Chen, Soumajyoti Sarkar, Leonard Lausen, Balasubramaniam Srinivasan, Sheng Zha, Ruihong Huang, and George Karypis. 2023. HyTrel: Hypergraph-enhanced Tabular Data Representation Learning. Advances in Neural Information Processing Systems 36: 32173–32193. Retrieved March 27, 2024 from https://proceedings.neurips.cc/paper_files/paper/2023/hash/66178beae8f12fcd48699de95acc1152-Abstract-Conference.html
  5. Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2022. Turl: Table understanding through representation learning. ACM SIGMOD Record 51, 1: 33–40.
  6. Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, and Masafumi Oyamada. 2023. DeepJoin: Joinable Table Discovery with Pre-Trained Language Models. Proceedings of the VLDB Endowment 16, 10: 2458–2470. https://doi.org/10.14778/3603581.3603587
  7. Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. 2024. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding – A Survey. Retrieved March 13, 2024 from http://arxiv.org/abs/2402.17944
  8. Madelon Hulsebos, Xiang Deng, Huan Sun, and Paolo Papotti. 2023. Models and Practice of Neural Table Representations. In Companion of the 2023 International Conference on Management of Data (SIGMOD ’23), 83–89. https://doi.org/10.1145/3555041.3589411
  9. Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained Representations of Tabular Data. arXiv preprint arXiv:2105.02584.
  10. Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, and Mirek Riedewald. 2021. DomainNet: Homograph Detection for Data Lake Disambiguation. https://doi.org/10.48550/arXiv.2103.09940
  11. Rutian Liu, Eric Simon, Bernd Amann, and Stéphane Gançarski. 2019. Augmenting analytic datasets using natural and aggregate-based schema complements. In Bases de données avancées (BDA 2019). Retrieved from https://hal.sorbonne-universite.fr/hal-03981998
  12. Rutian Liu, Eric Simon, Bernd Amann, and Stéphane Gançarski. 2020. Discovering and merging related analytic datasets. Information Systems 91: 101495. https://doi.org/10.1016/j.is.2020.101495
  13. Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. 2018. Table union search on open data. Proceedings of the VLDB Endowment 11, 7: 813–825.
  14. Shubham Singh Paliwal, D. Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. 2019. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 128–133. Retrieved April 3, 2024 from https://ieeexplore.ieee.org/abstract/document/8978013/
  15. Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, Çağatay Demiralp, Chen Chen, and Wang-Chiew Tan. 2022. Annotating Columns with Pre-trained Language Models. In Proceedings of the 2022 International Conference on Management of Data, 1493–1503. https://doi.org/10.1145/3514221.3517906
  16. Runhui Wang, Yuliang Li, and Jin Wang. 2022. Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation. Retrieved April 2, 2024 from http://arxiv.org/abs/2207.04122
  17. Bowen Zhao, Changkai Ji, Yuejie Zhang, Wen He, Yingwen Wang, Qing Wang, Rui Feng, and Xiaobo Zhang. 2023. Large Language Models are Complex Table Parsers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 14786–14802. https://doi.org/10.18653/v1/2023.emnlp-main.914