Dresden Web Table Corpus (DWTC)

Contact: Dr.-Ing. Julian Eberius, Dr.-Ing. Maik Thiele

     About

    The Common Crawl is a freely available web crawl created and maintained by the foundation of the same name. Its July 2014 incarnation, which served as the basis for the Dresden Web Table Corpus, contains 3.6 billion web pages and is 266 TB in size. The data is hosted on Amazon S3 and could thus be processed easily using EC2. Data tables were recognized through a combination of simple heuristics for pre-filtering and a trained classifier that distinguishes layout tables from various kinds of data tables. We included not only (pseudo-)relational tables, but also other kinds of data tables, such as the vertical-schema/single-entity tables that are common on the web. The features used are similar to those in related work, e.g., Cafarella et al., "WebTables: Exploring the Power of Tables on the Web".
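
    The exact heuristics and classifier features are not listed here; purely as an illustration, a pre-filter of the kind mentioned above might look as follows (Java, with made-up thresholds that are not the extractor's actual rules):

    import java.util.List;

    // Illustrative pre-filter in the spirit of the heuristics described above;
    // all thresholds are invented for demonstration, not taken from the extractor.
    public class TablePreFilter {

        // Returns true if the table should be passed on to the trained classifier.
        public static boolean isCandidateDataTable(List<List<String>> rows) {
            if (rows.size() < 2) {
                return false;                  // too few rows to carry data
            }
            int columns = rows.get(0).size();
            if (columns < 2) {
                return false;                  // single-column tables are usually layout
            }
            int cells = 0;
            int emptyCells = 0;
            for (List<String> row : rows) {
                if (row.size() != columns) {
                    return false;              // ragged tables are usually layout
                }
                for (String cell : row) {
                    cells++;
                    if (cell == null || cell.trim().isEmpty()) {
                        emptyCells++;
                    }
                }
            }
            return emptyCells < cells * 0.5;   // mostly empty tables are usually layout
        }
    }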

    We discovered that the Common Crawl contains many physically identical pages under different logical URLs, as serving multiple URL variants for a single page is common practice on the web today. This led to many duplicate tables in the initially extracted data. While we originally extracted 174M tables from the Common Crawl, which is consistent with numbers reported in related work, only 125M tables remained after content-based deduplication.
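
    As a minimal, in-memory illustration of such content-based deduplication (not the implementation actually used during extraction), one can fingerprint the normalized cell contents of each table and keep only the first table per fingerprint:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative content-based deduplication sketch.
    public class TableDeduplicator {

        private final Set<String> seen = new HashSet<>();

        // Returns true the first time a table with this content is encountered.
        public boolean isNew(List<List<String>> rows) throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (List<String> row : rows) {
                for (String cell : row) {
                    // Normalize whitespace and case so trivially reformatted copies collide.
                    digest.update(cell.trim().toLowerCase().getBytes(StandardCharsets.UTF_8));
                    digest.update((byte) 0x1F);   // cell separator
                }
                digest.update((byte) 0x1E);       // row separator
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return seen.add(hex.toString());      // true if fingerprint not seen before
        }
    }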

    The final corpus includes only the extracted table data and the metadata described below, not the complete HTML pages from which the tables originated. Instead, we provide code that automatically retrieves the full HTML text from the Common Crawl S3 bucket using the metadata bundled with the data. This reduces the corpus size to 70 GB of gzip-compressed data.
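
    The retrieval itself is implemented in the companion library linked below; as a rough sketch of the underlying mechanism, each table's metadata points into a Common Crawl WARC file, whose individually gzip-compressed records can be fetched with an HTTP range request. The parameter names below are placeholders, not the corpus's actual metadata field names:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    // Sketch: fetch one gzip-compressed WARC record from the Common Crawl by byte range.
    public class WarcRecordFetcher {

        private final HttpClient client = HttpClient.newHttpClient();

        // warcUrl, recordOffset, recordLength: placeholder names for the bundled metadata.
        public String fetchRecord(String warcUrl, long recordOffset, long recordLength)
                throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(warcUrl))
                    .header("Range", "bytes=" + recordOffset + "-" + (recordOffset + recordLength - 1))
                    .build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            // Common Crawl WARC files are multi-member gzip files, so a single record's
            // byte range can be decompressed on its own.
            try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(response.body()))) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }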

    News

    • 25.02.2015: Published version 1.1.0 of the corpus and companion libraries. The new version is based on the July 2014 version of the Common Crawl. It contains more tables, fixes errors of the old version, and adds new attributes, such as table classification results that allow distinguishing relational and single-entity tables (see Schema for details).

    Getting Started

    The corpus consists of 500 individual files directly downloadable from the TU Dresden Database Technology Group’s web server, with URLs of the form https://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-XXX.json.gz (zero-padded indices, e.g., dwtc-000.json.gz for the first file, which can serve as a sample). To download the full dataset (or a subset of any size), you can use a shell command such as:

    for i in $(seq -w 0 500); do wget https://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-$i.json.gz; done

    The easiest way to work with the dataset is to use the provided Java library, documented at its GitHub repository page. We also provide a description of the corpus data format and schema below.

    Schema

    The corpus consists of a set of gzip-compressed text files. Each line of text contains one JSON document representing one extracted table and its metadata. The easiest way to use these documents is the provided Java library, but you can also decompress the files, read them line by line, and parse each line with any JSON parser, as sketched below. We provide a JSON schema.
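
    If you prefer not to use the library, the following minimal sketch shows how to iterate over one corpus file using the Jackson JSON parser (the file name is taken from the download example above):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    // Sketch: iterate over the tables in one corpus file without the companion library.
    public class DwtcFileReader {

        public static void main(String[] args) throws IOException {
            ObjectMapper mapper = new ObjectMapper();   // Jackson JSON parser
            int tables = 0;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream("dwtc-000.json.gz")),
                    StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Each line is one JSON document: one extracted table plus its metadata.
                    JsonNode table = mapper.readTree(line);
                    tables++;
                }
            }
            System.out.println(tables + " tables in this file");
        }
    }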

    Code

    We provide both the code for the extractor, which is partly based on code published by the Web Data Commons project, and a companion library for working with the dataset. The library is available at https://github.com/JulianEberius/dwtc-tools and the extractor at https://github.com/JulianEberius/dwtc-extractor.

    Related Work

    The Web Data Commons project recently published a very similar web table corpus, based on an older version of the Common Crawl, and using a different extraction method and storage format. A very good overview of the related work can be found on their project page.

    Corpus Statistics

    Detailed statistics are available covering general table statistics (as tables and graphs), the most common TLDs, the most common domains, and the estimated number of distinct attributes.

    Citation

    The corpus was initially created for and published in conjunction with the following paper.

    @inproceedings{Eberius:2015,
      Author = {Eberius, Julian and Thiele, Maik and Braunschweig, Katrin and Lehner, Wolfgang},
      Title = {Top-k Entity Augmentation Using Consistent Set Covering},
      Series = {SSDBM '15},
      Year = {2015},
      Doi = {10.1145/2791347.2791353}
    }
    

    License

    The corpus data is provided under the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus. The code, which derives in part from code used by the Web Data Commons project, can be used under the terms of the same license, the Apache Software License.

    Credits

    The extraction of the Web Table Corpus was supported by an Amazon Web Services in Education Grant award.
