Abstract

The web contains a vast amount of relational data in the form of HTML tables. In contrast to databases, however, web tables do not carry metadata that conveys their schema, and the attribute labels that can be obtained from a table's header are unreliable. The challenge is to identify attribute labels if they are missing from the header, complete them if they are incomplete, and replace them if they are uninformative or false. In this thesis we describe a two-fold approach to this problem. First, we assign class labels from a knowledge base to the columns of a table by linking the column cells to entities in the knowledge base. Second, we apply information extraction techniques to find attribute labels in the context of a web table and to identify the relation name. We conduct experiments on a web table corpus extracted from Wikipedia, using the YAGO knowledge base. The experiments show that our approach to retrieving attribute labels from the contents and the context of web tables is promising.
