Dresden Web Table Corpus (DWTC)

Contact: Dr.-Ing. Julian Eberius, Dr.-Ing. Maik Thiele

     About

    The Common Crawl is a freely available web crawl created and maintained by the foundation of the same name. Its July 2014 incarnation, which served as the basis for the Dresden Web Table Corpus, contains 3.6 billion web pages and is 266 TB in size. The data is hosted on Amazon S3 and could thus be processed easily using EC2. Data tables were recognized through a combination of simple heuristics for pre-filtering and a trained classifier that distinguishes layout tables from various kinds of data tables. We included not only (pseudo-)relational tables, but also other kinds of data tables, such as the vertical-schema/single-entity tables that are common on the web. The features used are similar to those in related work, e.g., Cafarella et al., "WebTables: Exploring the Power of Tables on the Web".
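
    The exact heuristics and classifier features are not listed here; purely as an illustration, a pre-filter of the kind mentioned above might look as follows (Java, with made-up thresholds that are not the extractor's actual rules):

    import java.util.List;

    // Illustrative pre-filter in the spirit of the heuristics described above;
    // all thresholds are invented for demonstration, not taken from the extractor.
    public class TablePreFilter {

        // Returns true if the table should be passed on to the trained classifier.
        public static boolean isCandidateDataTable(List<List<String>> rows) {
            if (rows.size() < 2) {
                return false;                  // too few rows to carry data
            }
            int columns = rows.get(0).size();
            if (columns < 2) {
                return false;                  // single-column tables are usually layout
            }
            int cells = 0;
            int emptyCells = 0;
            for (List<String> row : rows) {
                if (row.size() != columns) {
                    return false;              // ragged tables are usually layout
                }
                for (String cell : row) {
                    cells++;
                    if (cell == null || cell.trim().isEmpty()) {
                        emptyCells++;
                    }
                }
            }
            return emptyCells < cells * 0.5;   // mostly empty tables are usually layout
        }
    }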

    We discovered that the Common Crawl contains many physically identical pages under different logical URLs, as serving multiple URL variants for a single page is common practice on the web today. This led to many duplicate tables in the initially extracted data. While we originally extracted 174M tables from the Common Crawl, which is consistent with numbers reported in related work, only 125M tables remained after content-based deduplication.
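
    As a minimal, in-memory illustration of such content-based deduplication (not the implementation actually used during extraction), one can fingerprint the normalized cell contents of each table and keep only the first table per fingerprint:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative content-based deduplication sketch.
    public class TableDeduplicator {

        private final Set<String> seen = new HashSet<>();

        // Returns true the first time a table with this content is encountered.
        public boolean isNew(List<List<String>> rows) throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (List<String> row : rows) {
                for (String cell : row) {
                    // Normalize whitespace and case so trivially reformatted copies collide.
                    digest.update(cell.trim().toLowerCase().getBytes(StandardCharsets.UTF_8));
                    digest.update((byte) 0x1F);   // cell separator
                }
                digest.update((byte) 0x1E);       // row separator
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return seen.add(hex.toString());      // true if fingerprint not seen before
        }
    }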

    The final corpus includes only the extracted table data and the metadata described below, not the complete HTML pages from which the tables originated. Instead, we provide code that automatically retrieves the full HTML text from the Common Crawl S3 bucket using the metadata bundled with the data. This reduces the corpus size to 70 GB of gzip-compressed data.
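
    The retrieval itself is implemented in the companion library linked below; as a rough sketch of the underlying mechanism, each table's metadata points into a Common Crawl WARC file, whose individually gzip-compressed records can be fetched with an HTTP range request. The parameter names below are placeholders, not the corpus's actual metadata field names:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    // Sketch: fetch one gzip-compressed WARC record from the Common Crawl by byte range.
    public class WarcRecordFetcher {

        private final HttpClient client = HttpClient.newHttpClient();

        // warcUrl, recordOffset, recordLength: placeholder names for the bundled metadata.
        public String fetchRecord(String warcUrl, long recordOffset, long recordLength)
                throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(warcUrl))
                    .header("Range", "bytes=" + recordOffset + "-" + (recordOffset + recordLength - 1))
                    .build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            // Common Crawl WARC files are multi-member gzip files, so a single record's
            // byte range can be decompressed on its own.
            try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(response.body()))) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }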

    News

    • 25.02.2015: Published version 1.1.0 of the corpus and companion libraries. The new version is based on the July 2014 version of the Common Crawl. It contains more tables, fixes errors of the old version, and adds new attributes, such as table classification results that allow distinguishing relational and single-entity tables (see Schema for details).

    Getting Started

    The corpus consists of 500 individual files directly downloadable from the TU Dresden Database Technology Group’s web server, with URLs of the form https://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-XXX.json.gz (zero-padded indices, e.g., dwtc-000.json.gz for the first file, which can serve as a sample). To download the full dataset (or a subset of any size), you can use a shell command such as:

    for i in $(seq -w 0 500); do wget https://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-$i.json.gz; done

    The easiest way to work with the dataset is to use the provided Java library, documented at its GitHub repository page. We also provide a description of the corpus data format and schema below.

    Schema

    The corpus consists of a set of gzip-compressed text files. Each line of text contains one JSON document representing one extracted table and its metadata. The easiest way to use these documents is the provided Java library, but you can also decompress the files, read them line by line, and parse each line with any JSON parser, as sketched below. We provide a JSON schema.
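
    If you prefer not to use the library, the following minimal sketch shows how to iterate over one corpus file using the Jackson JSON parser (the file name is taken from the download example above):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    // Sketch: iterate over the tables in one corpus file without the companion library.
    public class DwtcFileReader {

        public static void main(String[] args) throws IOException {
            ObjectMapper mapper = new ObjectMapper();   // Jackson JSON parser
            int tables = 0;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream("dwtc-000.json.gz")),
                    StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Each line is one JSON document: one extracted table plus its metadata.
                    JsonNode table = mapper.readTree(line);
                    tables++;
                }
            }
            System.out.println(tables + " tables in this file");
        }
    }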

    Code

    We provide both the code for the extractor, which is partly based on code published by the Web Data Commons project, and a companion library for working with the dataset. The library is available at https://github.com/JulianEberius/dwtc-tools and the extractor at https://github.com/JulianEberius/dwtc-extractor.

    Related Work

    The Web Data Commons project recently published a very similar web table corpus, based on an older version of the Common Crawl, and using a different extraction method and storage format. A very good overview of the related work can be found on their project page.

    Corpus Statistics

    Detailed statistics are available covering general table statistics (as tables and graphs), the most common TLDs, the most common domains, and the estimated number of distinct attributes.

    Citation

    The corpus was initially created for and published in conjunction with the following paper.

    @inproceedings{Eberius:2015,
      Author = {Eberius, Julian and Thiele, Maik and Braunschweig, Katrin and Lehner, Wolfgang},
      Title = {Top-k Entity Augmentation Using Consistent Set Covering},
      Series = {SSDBM '15},
      Year = {2015},
      Doi = {10.1145/2791347.2791353}
    }
    

    License

    The corpus data is provided under the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus. The code, which derives in part from code used by the Web Data Commons project, can be used under the terms of the same license, the Apache Software License.

    Credits

    The extraction of the Web Table Corpus was supported by an Amazon Web Services in Education Grant award.
