This page is the home of the Dresden Web Tables Corpus, a collection of about 125 million data tables extracted from the Common Crawl.
Each extracted table's data and metadata are provided as one JSON document, in a simple one-table-per-line corpus file format.
A Java companion library for working with the data set is included: iterate, process, index, analyze.
The Common Crawl is a freely available web crawl created and maintained by the foundation of the same name. The July 2014 crawl, which forms the basis of this corpus, contains 3.6 billion web pages and is 266TB in size. The data is hosted on Amazon S3 and can thus be processed easily using EC2. Data tables were recognized using a combination of simple heuristics for pre-filtering and a trained classifier that distinguishes layout tables from various kinds of data tables. We included not only (pseudo-)relational tables, but also other kinds of data tables, such as the vertical-schema/single-entity tables that are common on the web. The features used are similar to those in related work, e.g., Cafarella et al., "WebTables: Exploring the Power of Tables on the Web".
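The concrete heuristics and classifier features can be found in the dwtc-extractor repository linked below. Purely as an illustration of the pre-filtering idea, and not the extractor's actual code, a size-based filter over parsed HTML might look like this sketch using the jsoup parser:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

/** Illustrative pre-filter only; the real heuristics are in dwtc-extractor. */
public class TablePreFilter {

    /** Returns the table elements of a page that are worth classifying. */
    public static List<Element> candidateTables(String html) {
        Document doc = Jsoup.parse(html);
        List<Element> candidates = new ArrayList<>();
        for (Element table : doc.select("table")) {
            // A table containing further tables is a strong layout-table signal.
            // Note: element.select("table") matches the element itself, too.
            if (table.select("table").size() > 1) {
                continue;
            }
            int rows = table.select("tr").size();
            int cols = rows == 0 ? 0 : table.select("tr").first().select("td, th").size();
            // Require at least a 2x2 grid before handing the table to the classifier.
            if (rows >= 2 && cols >= 2) {
                candidates.add(table);
            }
        }
        return candidates;
    }
}
```

Tables surviving such a filter would then be passed to the classifier, which assigns one of the table types listed in the statistics below.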
We discovered that the Common Crawl contains many physically identical pages under varying logical URLs, as providing multiple URL variants for a single page is common practice on the web today. This led to many duplicate tables in the initially extracted data. While we originally extracted 174M tables from the Common Crawl, which is consistent with numbers reported in related work, only 125M tables remained after content-based deduplication.
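As a minimal sketch of content-based deduplication, assuming tables are materialized as 2D string arrays and making no claim about the exact procedure we used, one can fingerprint each table's cell contents and drop repeat fingerprints:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

/** Minimal content-based deduplication sketch; not the exact procedure used. */
public class TableDeduplicator {

    private final Set<String> seenFingerprints = new HashSet<>();

    /** Returns true if a table with identical cell contents was seen before. */
    public boolean isDuplicate(String[][] cells) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        for (String[] row : cells) {
            for (String cell : row) {
                digest.update(cell.getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // cell separator, so ["ab","c"] != ["a","bc"]
            }
            digest.update((byte) 1); // row separator
        }
        String fingerprint = Base64.getEncoder().encodeToString(digest.digest());
        return !seenFingerprints.add(fingerprint);
    }
}
```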
The final corpus includes only the extracted table data and the metadata described below, not the complete HTML pages from which the tables originated. Instead, we provide code that can automatically retrieve the full HTML text from the Common Crawl S3 bucket using the metadata bundled with the data. This reduces the corpus size to 70GB of gzip-compressed data.
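For reference, and independent of the provided code, a single record can be retrieved from the public Common Crawl bucket with an HTTP Range request, given the archive path and byte offsets stored in a table's metadata. The endpoint and parameter names in this sketch are assumptions; consult the JSON schema and the library documentation for the actual keys:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

/** Sketch: fetch one gzip-compressed WARC record via an HTTP Range request. */
public class WarcRecordFetcher {

    // Assumed endpoint for the public Common Crawl data; the S3 link in the
    // table metadata may use an older bucket layout.
    private static final String BASE_URL = "https://data.commoncrawl.org/";

    /** warcPath and the byte offsets come from a table's metadata fields. */
    public static String fetchRecord(String warcPath, long offset, long endOffset)
            throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(BASE_URL + warcPath).openConnection();
        // WARC files are concatenations of gzip members, one per record, so the
        // byte range of a single record can be decompressed in isolation.
        conn.setRequestProperty("Range", "bytes=" + offset + "-" + endOffset);
        StringBuilder record = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(conn.getInputStream()), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                record.append(line).append('\n');
            }
        }
        return record.toString(); // WARC headers, HTTP headers, then the HTML payload
    }
}
```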
The corpus consists of 500 individual files directly downloadable from the TU Dresden Database Technology Group's web server, with URLs of the form
http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-XXX.json.gz
To download the first file as a sample, fetch http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-000.json.gz. To download the full data set (or a subset of any size), you can use a shell command such as
for i in $(seq -w 0 499); do wget http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-$i.json.gz; done
The easiest way to work with the data set is the provided Java library, documented at its GitHub repository page. A description of the corpus data format and schema is also given below.
The corpus consists of a set of gzip-compressed text files. Each line of text contains one JSON document representing one extracted table and its metadata. Again, the provided Java library is the most convenient way to consume these documents, but you can also simply decompress the files, read them line by line, and use any JSON parser. A JSON schema is provided.
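For example, a corpus file can be streamed without the library using standard Java and any JSON parser (Jackson below). The `url` field is read here for illustration only; consult the JSON schema for the authoritative field names:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

/** Streams one corpus file without the companion library. */
public class CorpusReader {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("dwtc-000.json.gz")),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is one self-contained JSON document: a table plus metadata.
                JsonNode table = mapper.readTree(line);
                System.out.println(table.path("url").asText()); // field name: see the JSON schema
            }
        }
    }
}
```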
We provide both the code for the extractor, which is partly based on code published by the Web Data Commons project, and a companion library for working with the data set. The library can be found at https://github.com/JulianEberius/dwtc-tools and the extractor at https://github.com/JulianEberius/dwtc-extractor.
The Web Data Commons project recently published a very similar web table corpus, based on an older version of the Common Crawl and using a different extraction method and storage format. A good overview of related work can be found on their project page.
Statistics by table type:

| Statistic | entity | relation | matrix | other |
|---|---|---|---|---|
| Total number | 77,666,916 | 58,674,016 | 1,973,354 | 7,219,536 |
| Avg. #columns | 2.47 | 5.79 | 7.50 | 8.15 |
| Avg. #rows | 8.96 | 17.16 | 17.69 | 15.31 |
| Min. #columns | 2 | 2 | 3 | 2 |
| Min. #rows | 2 | 2 | 2 | 2 |
| Max. #columns | 6,878 | 7,291 | 2,035 | 113,682 |
| Max. #rows | 46,743 | 28,891 | 9,030 | 15,023 |
Top 20 top-level domains by number of extracted tables:

| TLD | Count |
|---|---|
com | 92,416,580 |
org | 21,391,455 |
net | 5,190,071 |
edu | 4,377,877 |
co.uk | 2,912,414 |
gov | 2,836,297 |
de | 1,949,211 |
es | 853,369 |
ca | 849,214 |
fr | 799,457 |
com.au | 767,434 |
ac.uk | 556,531 |
info | 500,836 |
it | 442,881 |
eu | 418,469 |
ru | 393,374 |
com.br | 390,334 |
pl | 348,186 |
nl | 336,709 |
tx.us | 304,570 |
Top 20 domains by number of extracted tables:

| Domain | Count |
|---|---|
wikipedia.org | 5,843,615 |
google.com | 3,657,730 |
worldcat.org | 2,005,895 |
godlikeproductions.com | 1,724,842 |
flightaware.com | 1,572,751 |
itjobswatch.co.uk | 1,297,494 |
stackexchange.com | 1,227,872 |
cricketarchive.com | 1,154,101 |
e90post.com | 1,047,063 |
hotels.com | 972,114 |
m3post.com | 919,125 |
go.com | 904,348 |
mixedmartialarts.com | 755,110 |
wowprogress.com | 747,538 |
sports-reference.com | 724,744 |
baseball-reference.com | 675,849 |
macrumors.com | 668,493 |
nhl.com | 660,072 |
stackoverflow.com | 643,068 |
weatherbase.com | 629,457 |
Top 20 attribute (column header) names, where NULL means no attribute name was extracted:

| Attribute | Count |
|---|---|
NULL | 108,884,792 |
date | 8,283,909 |
title | 6,768,556 |
name | 5,625,831 |
1 | 4,309,750 |
description | 4,285,050 |
2 | 3,594,400 |
location | 3,244,015 |
3 | 3,122,464 |
type | 3,015,000 |
views | 2,974,465 |
5 | 2,874,085 |
4 | 2,838,993 |
publication date | 2,715,011 |
rating | 2,691,175 |
year | 2,578,777 |
filing date | 2,519,377 |
6 | 2,387,802 |
author | 2,356,401 |
7 | 2,309,544 |
If you use the corpus in your research, please cite:

@inproceedings{Eberius:2015,
Author = {Eberius, Julian and Thiele, Maik and Braunschweig, Katrin and Lehner, Wolfgang},
Title = {Top-k Entity Augmentation Using Consistent Set Covering},
Series = {SSDBM '15},
Year = {2015},
Doi = {10.1145/2791347.2791353}
}
The corpus data is provided according to the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus. The code, which derives partly from code used by the Web Data Commons project, can be used under the terms of the same license, the Apache Software License.
The extraction of the Web Table Corpus was supported by an Amazon Web Services in Education Grant award.