Use this identifier to cite or link to this item:
https://repositorio.ufpe.br/handle/123456789/55271
Title: | Finding structured data from text using language models |
Author(s): | SILVA, Levy de Souza |
Keywords: | Computational intelligence; Web tables; Table retrieval; News-table matching |
Issue date: | 7-Dec-2023 |
Publisher: | Universidade Federal de Pernambuco |
Citation: | SILVA, Levy de Souza. Finding structured data from text using language models. 2023. Thesis (Doctorate in Computer Science) – Universidade Federal de Pernambuco, Recife, 2023. |
Abstract: | The Internet is a rich source of structured information. From Web Tables to public datasets, there exists a huge corpus of relational data online. Previous studies estimate that over 418M tables, in Hypertext Markup Language (HTML) format, can be found on the Web. In addition, a large number of data repositories provide access to thousands of datasets. As a result, over the last years, a growing body of work has begun to explore this data for several downstream applications. For example, Web Tables have been widely used for the task of Question Answering (QA), whose goal is to retrieve a table that answers a query from a table collection. In the context of datasets, their most popular application is the dataset retrieval task, which aims to find structured datasets for an end-user. What table retrieval and dataset retrieval have in common is that both must match unstructured queries against relational data, and both are ranking tasks. The core challenge is how to build a robust matching model for computing the degree of similarity between the two. To this end, this thesis is divided into three parts. In the first, we explore the problem of QA Table Retrieval, in which our goal is to outline the best solutions for this task. Next, we focus on an unexplored news-table matching problem, in which Web Tables are used to augment news stories. Lastly, we concentrate on the dataset retrieval task. Specifically, we summarize our main contributions as follows: (I) we present a novel taxonomy for table retrieval that classifies table retrieval methods into five groups, from probabilistic approaches to sophisticated neural networks. Our research also points out that the best results for this task are achieved by deep neural models built on top of recurrent networks and convolutional architectures; (II) we introduce a novel attention model based on Bidirectional Encoder Representations from Transformers (BERT) for computing the degree of similarity between news stories and Web Tables, and we compare its performance against Information Retrieval (IR) techniques, document/sentence encoders, text-matching models, and neural IR approaches. In short, a hypothesis test confirms that our approach outperforms all baselines in terms of the Mean Reciprocal Rank metric; and (III) we propose the Data Augmentation Pipeline for Dataset Retrieval (DAPDR), a solution that leverages Large Language Models (LLMs) to create synthetic questions for dataset descriptions, which are then used to train supervised retrievers. Finally, we evaluate DAPDR on dataset search benchmarks using a set of dense retrievers; the main results show that the retrievers fine-tuned with DAPDR statistically outperform the original models at different Normalized Discounted Cumulative Gain (NDCG) levels. |
URI: | https://repositorio.ufpe.br/handle/123456789/55271 |
Appears in collections: | Teses de Doutorado - Ciência da Computação (Doctoral Theses - Computer Science) |
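
The abstract reports results in terms of Mean Reciprocal Rank and NDCG at several cutoffs. For readers unfamiliar with these ranking metrics, the following is a minimal sketch of how they are commonly computed; it is not taken from the thesis. The function names are hypothetical, the NDCG variant uses linear gains (some implementations use 2^rel - 1 instead), and the ideal ranking is derived from the retrieved list itself, which assumes that list contains every judged item.

```python
import math
from typing import Sequence


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[int]]) -> float:
    """Average the reciprocal rank over all queries (one relevance list each)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r) for r in runs) / len(runs)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items, with linear gains."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the same gains sorted in ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Two queries: the first relevant result is at rank 2, then at rank 1.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
# Graded relevance of a ranked list, evaluated at cutoff 3.
print(ndcg_at_k([3, 2, 1], 3))
```

Each inner list holds the relevance labels of one query's results in ranked order, so comparing two retrievers amounts to computing these scores per query and then testing the difference statistically, as the abstract describes.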
Files associated with this item:
File | Description | Size | Format | |
---|---|---|---|---|
TESE Levy de Souza Silva.pdf | | 1.83 MB | Adobe PDF | View/Open |
This file is protected by copyright
This item is licensed under a Creative Commons License