Discovering a domain-specific schema from general-purpose knowledge base

SILVA NETO, Everaldo Costa

Please use this identifier to cite or link to this item: https://repositorio.ufpe.br/handle/123456789/51840

Title:	Discovering a domain-specific schema from general-purpose knowledge base
Authors:	SILVA NETO, Everaldo Costa
Keywords:	Database; Schema discovery; Domain discovery; Entity representation
Issue Date:	13-Jun-2023
Publisher:	Universidade Federal de Pernambuco
Citation:	SILVA NETO, Everaldo Costa. Discovering a domain-specific schema from general-purpose knowledge base. 2023. Tese (Doutorado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2023.
Abstract:	General-purpose knowledge bases (KBs), e.g., DBpedia, YAGO, and Wikidata, store factual data about a set of entities. These KBs have been constructed to store cross-domain knowledge, e.g., health, entertainment, industry, sports, and arts. Most applications that use data from general-purpose KBs are domain-specific. Some tasks, such as query formulation and information extraction, require a data schema to explore the contents of a KB. However, schema-related declarations are not mandatory and, sometimes, are not provided. Therefore, these domain-specific applications face two issues: (1) they require only the subset of data that falls within the domain of interest, but general-purpose KBs hold a large volume of factual data across many distinct domains; and (2) schema-related information is often lacking. In this thesis, we address the problem of domain-specific schema discovery from general-purpose KBs. Specifically, we build ANCHOR, an end-to-end pipeline that automatically identifies a domain-specific dataset as well as its schematic description. ANCHOR works in three steps: domain discovery, class identification, and class schema discovery. First, it extracts a specific domain by exploring the category-category mappings in the KB. From this, it identifies domain entities through entity-category mappings. Next, the class identification step discovers implicit classes within the dataset. For that, ANCHOR learns entity representations from entity-category mappings and uses them to identify the entities' implicit classes by grouping similar entities. Finally, the class schema discovery step builds the class schema, i.e., it identifies a set of relevant attributes that best describe the entities within the same class. For that, ANCHOR runs CoFFee, an approach based on attribute co-occurrence and frequency, to identify a set of core attributes for each class discovered in the previous step.
We have performed an extensive experimental evaluation on four distinct DBpedia domains. For the class identification task, we compare ANCHOR against traditional and embedding-based baselines. The results show that, when applied to standard clustering algorithms, our entity representation outperforms the baselines and is effective for the class identification task. For the class schema discovery task, we compare CoFFee against two state-of-the-art approaches. The results show that CoFFee is effective in filtering out less relevant attributes: it selects a set of core attributes while keeping its retrieval rate high, producing a higher-quality class schema for the identified classes.
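Illustrative sketch (not part of the original record): the abstract says only that CoFFee combines attribute frequency and co-occurrence to pick core attributes for a class. The Python below is a minimal sketch of that general idea and is not the thesis's CoFFee algorithm. The function name `core_attributes`, the thresholds `min_support` and `min_jaccard`, the use of Jaccard similarity as the co-occurrence measure, and the sample `players` data are all assumptions made for illustration.

```python
from collections import defaultdict


def core_attributes(entities, min_support=0.7, min_jaccard=0.7):
    """Select core attributes for one class (hypothetical sketch, not CoFFee).

    entities: list of sets, each holding the attribute names of one entity.
    An attribute is kept if it is frequent on its own (support), or if the
    set of entities using it overlaps strongly (Jaccard) with the set of
    entities using some frequent attribute.
    """
    n = len(entities)
    if n == 0:
        return set()

    # Map each attribute to the indices of the entities that use it.
    holders = defaultdict(set)
    for idx, attrs in enumerate(entities):
        for attr in attrs:
            holders[attr].add(idx)

    # Frequency step: attributes used by a large share of the entities.
    seeds = {a for a, ids in holders.items() if len(ids) / n >= min_support}

    # Co-occurrence step: rescue attributes that appear almost exactly
    # where a frequent attribute appears.
    core = set(seeds)
    for attr, ids in holders.items():
        if attr in seeds:
            continue
        for seed in seeds:
            jaccard = len(ids & holders[seed]) / len(ids | holders[seed])
            if jaccard >= min_jaccard:
                core.add(attr)
                break
    return core


# Made-up entities of a single class, for illustration only.
players = [
    {"name", "birthDate", "team"},
    {"name", "birthDate", "team"},
    {"name", "birthDate", "team", "height"},
    {"name", "nickname"},
    {"name", "birthDate"},
]

print(sorted(core_attributes(players)))
```

In this made-up data, "name" and "birthDate" pass the frequency threshold directly, "team" is kept because it co-occurs closely with "birthDate", and the rare attributes "height" and "nickname" are filtered out.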
URI:	https://repositorio.ufpe.br/handle/123456789/51840
Appears in Collections:	Teses de Doutorado - Ciência da Computação

Files in This Item:

File	Description	Size	Format
TESE Everaldo Costa Silva Neto.pdf		4.11 MB	Adobe PDF

This item is protected by original copyright

This item is licensed under a Creative Commons License