Dresden Web Tables Corpus


125M Web Data Tables

This page is the home of the Dresden Web Tables Corpus, a collection of about 125 million data tables extracted from the Common Crawl.

Easy to use

Each extracted table's data and metadata are provided as one JSON document, in a simple one-table-per-line corpus file format.

Tools included

Includes a Java companion library to work with the data set. Iterate, process, index, analyze.


About

The Common Crawl is a freely available web crawl created and maintained by the foundation of the same name. The July 2014 crawl, which serves as the basis for this corpus, contains 3.6 billion web pages and is 266TB in size. The data is hosted on Amazon S3 and can thus be processed easily using EC2. The data tables were recognized through a combination of simple heuristics for pre-filtering and a trained classifier that distinguishes layout tables from various kinds of data tables. We included not only (pseudo-)relational tables, but also other kinds of data tables, such as the vertical schema/single entity tables that are common on the web. The features used are similar to those used in related work, e.g., Cafarella et al., "WebTables: exploring the power of tables on the web".

We discovered that the Common Crawl contains many physically identical pages under various logical URLs, as providing multiple URL variants for a single page is common practice on the web today. This led to many duplicate tables in the initially extracted data. We originally extracted 174M tables from the Common Crawl, which is consistent with numbers from related work, but only 125M tables remained after content-based deduplication.
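
To illustrate what content-based deduplication means here, the following Python sketch keys each table on a hash of its cell contents and ignores the URL it was found under. This is not the extractor's actual code; the function names, the (url, rows) input shape, and the choice of SHA-1 over a JSON serialization are assumptions made for the example.

```python
import hashlib
import json


def content_key(rows):
    """Return a stable hash computed from a table's cell contents only."""
    # Serialize just the cells; the page URL is deliberately left out, so the
    # same table reached through different URLs produces the same key.
    canonical = json.dumps(rows, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def deduplicate(tables):
    """Yield (url, rows) pairs, keeping the first table per distinct content.

    `tables` is an iterable of (url, rows) pairs; this input shape is an
    assumption made for the sketch, not the corpus format.
    """
    seen = set()
    for url, rows in tables:
        key = content_key(rows)
        if key not in seen:
            seen.add(key)
            yield url, rows
```

Keeping only the hashes in memory, rather than the tables themselves, is what makes this kind of check feasible at the scale of hundreds of millions of tables.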

The final corpus includes only the extracted table data and some metadata described below, not the complete HTML pages from which each table originated. Instead, we provide code that can automatically retrieve the full HTML text from the Common Crawl S3 bucket using the metadata bundled with the data. This reduces the corpus size to 70GB of gzip-compressed data.
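
The retrieval idea can be sketched in Python as follows. Common Crawl archives store each record as its own gzip member, so a single page can be fetched with an HTTP byte-range request and decompressed on its own. The sketch below covers only the byte-range arithmetic and the decompression step; the parameter names `offset` and `end_offset`, and the assumption that the end offset is inclusive, are hypothetical and should be checked against the JSON schema and the companion library.

```python
import gzip


def range_header(offset, end_offset):
    """Build an HTTP Range header for bytes offset..end_offset.

    HTTP byte ranges are inclusive on both ends; treating the corpus
    metadata the same way is an assumption of this sketch.
    """
    return {"Range": "bytes={}-{}".format(offset, end_offset)}


def extract_record(archive_bytes, offset, end_offset):
    """Decompress the single gzip member stored at the given byte range."""
    # Slice out exactly one member and decompress it in isolation.
    return gzip.decompress(archive_bytes[offset:end_offset + 1])
```

The actual download is left out on purpose: any HTTP client can send the header from `range_header` to the archive file's URL, and the response body can then be passed to `gzip.decompress` directly.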

Getting Started

The corpus consists of 500 individual files directly downloadable from the TU Dresden Database Technology Group's web server, with URLs of the form http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-XXX.json.gz
The first file can be downloaded on its own as a sample. To download the full dataset (or a subset of any size), you can use a shell command such as

for i in $(seq -w 0 500); do wget http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-$i.json.gz; done

The easiest way to work with the dataset is to use the provided Java library, documented at its GitHub repository page. We also provide a description of the corpus data format and schema below.

Schema

The corpus consists of a set of gzip-compressed text files. Each line of text contains one JSON document representing one extracted table and its metadata. The easiest way to use these documents is through the provided Java library, but you can also decompress the files, read them line by line, and use any JSON parser. We provide a JSON schema.
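
As a minimal sketch of the manual route, the following Python function streams one corpus file and yields one parsed table per line, using only the standard library. The name `iter_tables` is an illustrative assumption. No field names are assumed either, so consult the JSON schema for the actual keys.

```python
import gzip
import json


def iter_tables(path):
    """Yield one parsed JSON document (one table) per non-empty line."""
    # Mode "rt" decodes the gzip stream as text, so the file is read
    # lazily, line by line, without decompressing it to disk first.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `next(iter_tables("dwtc-000.json.gz")).keys()` would list the fields of the first table in a local copy of a corpus file. The file name here is a placeholder following the URL pattern above.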

Code

We provide both the code for the extractor, which is partly based on code published by the Web Data Commons project, and a companion library for working with the data set. The library is found at https://github.com/JulianEberius/dwtc-tools and the extractor at https://github.com/JulianEberius/dwtc-extractor.

Related Work

The Web Data Commons project recently published a very similar web table corpus, based on an older version of the Common Crawl, and using a different extraction method and storage format. A very good overview of the related work can be found on their project page.

Corpus Statistics

Table Statistics
Table Type   Total number   Avg. #columns   Avg. #rows   Min. #columns   Min. #rows   Max. #columns   Max. #rows
entity       77,666,916     2.47            8.96         2               2            6,878           46,743
relation     58,674,016     5.79            17.16        2               2            7,291           28,891
matrix       1,973,354      7.50            17.69        3               2            2,035           9,030
other        7,219,536      8.15            15.31        2               2            113,682         15,023
Most common TLDs
TLD Count
com 92,416,580
org 21,391,455
net 5,190,071
edu 4,377,877
co.uk 2,912,414
gov 2,836,297
de 1,949,211
es 853,369
ca 849,214
fr 799,457
com.au 767,434
ac.uk 556,531
info 500,836
it 442,881
eu 418,469
ru 393,374
com.br 390,334
pl 348,186
nl 336,709
tx.us 304,570
Most common domains
Domain Count
wikipedia.org 5,843,615
google.com 3,657,730
worldcat.org 2,005,895
godlikeproductions.com 1,724,842
flightaware.com 1,572,751
itjobswatch.co.uk 1,297,494
stackexchange.com 1,227,872
cricketarchive.com 1,154,101
e90post.com 1,047,063
hotels.com 972,114
m3post.com 919,125
go.com 904,348
mixedmartialarts.com 755,110
wowprogress.com 747,538
sports-reference.com 724,744
baseball-reference.com 675,849
macrumors.com 668,493
nhl.com 660,072
stackoverflow.com 643,068
weatherbase.com 629,457
Estimated Distinct Attributes
Attribute Count
NULL 108,884,792
date 8,283,909
title 6,768,556
name 5,625,831
1 4,309,750
description 4,285,050
2 3,594,400
location 3,244,015
3 3,122,464
type 3,015,000
views 2,974,465
5 2,874,085
4 2,838,993
publication date 2,715,011
rating 2,691,175
year 2,578,777
filing date 2,519,377
6 2,387,802
author 2,356,401
7 2,309,544

Citation

The corpus was initially created for and published in conjunction with the following paper.
@inproceedings{Eberius:2015,
  Author = {Eberius, Julian and Thiele, Maik and Braunschweig, Katrin and Lehner, Wolfgang},
  Title = {Top-k Entity Augmentation Using Consistent Set Covering},
  Series = {SSDBM '15},
  Year = {2015},
  Doi = {10.1145/2791347.2791353}
}

License

The corpus data is provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus. The code, which derives partly from the code used by the Web Data Commons project, can be used under the terms of the same license, the Apache Software License.

Credits

The extraction of the Web Table Corpus was supported by an Amazon Web Services in Education Grant award.

Contact


Mail:
julian.eberius@tu-dresden.de
Address:
Dep. of Computer Science
Technische Universität Dresden
Room 3108
Noethnitzer Str. 46
01062 Dresden
Phone:
+49 (351) 463 38283
Fax:
+49 (351) 463 38359

maintained by Julian Eberius (julian.eberius@tu-dresden.de)