This page is home to the DeExcelarator project, which aims to develop a comprehensive approach for information and knowledge extraction from spreadsheet files.
Our approach works at cell granularity, combining machine learning techniques with heuristics. We have incorporated and further extended features from related work.
We have built a corpus of 828,252 annotated cells. It consists of spreadsheets from three corpora (FUSE, ENRON, and EUSES), covering different domains.
Our work includes a Java desktop application for annotating regions of cells in Excel spreadsheets. Work is ongoing on a processing pipeline able to handle large corpora of spreadsheets.
Spreadsheet applications are among the most widely used tools for content generation and presentation in industry and on the Web. Despite this success, there is no comprehensive approach for automatically extracting and reusing the rich data maintained in this format. The biggest obstacle is the lack of awareness about the structure of the data in spreadsheets. Unlike other file formats, such as XML and JSON, spreadsheets do not carry metadata describing the (structural) function of the individual units of data. This makes it difficult for machines to interpret the information maintained in these files, although the same task is rather easy for humans.
On this page, we summarize the current status of our work on table identification and layout recognition in spreadsheets. In the first stage, we have focused on discovering the layout of the data, using machine learning for classification together with a set of heuristics. We work at the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three different corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer to automatic information extraction from spreadsheet files.
The layout inference process is defined as a series of steps, illustrated in the figure below. First, the application reads the spreadsheet file and extracts the features of each non-blank cell. Next, cells are classified with high accuracy based on their features. Finally, a post-processing step further improves the quality of the results by applying a set of rules that identify cells that were most probably misclassified and relabel them. The figure also includes the Table Reconstruction task, which forms a separate topic and is left as future work.
We define five building blocks for spreadsheet tables: Headers, Attributes, Metadata, Data, and Derived (see figure below). A "Header" (H) cell represents the label of a column and can be flat or hierarchical (stacked). Hierarchical structures can also be found in the left-most or right-most columns of a table, which we call "Attributes" (A), a term first introduced in (Chen 2013). Attributes can be seen as instances from the same or different (relational) dimensions, placed in one or multiple columns in a way that conveys the existence of a hierarchy. We label cells as "Metadata" (M) when they provide additional information about the table as a whole or about specific sections of it. Examples of Metadata are the table name, the creation date, and the unit for the values of a column. The remaining cells form the actual payload of the table and are labeled as "Data" (D). Additionally, we use the label "Derived" (B) to distinguish those cells that are aggregations of other Data cells' values. Derived cells can have a different structure from the core Data cells, so we treat them separately.
We annotated a collection of 465 Excel sheets (216 files) from three spreadsheet corpora: FUSE, ENRON, and EUSES. To accomplish this task, we developed a specialized tool for interactive annotation of Microsoft Excel spreadsheets. In the figures below, we present statistics about these annotations.
The files were selected at random, but with the intention of keeping the sample roughly proportional to the original size of each corpus. Therefore, the largest number of sheets comes from FUSE, followed by ENRON and then EUSES.
We have also annotated the tables in the selected spreadsheets. They will be relevant for future work on automatic table identification.
The following figure displays the contribution of each corpus as a percentage of the number of cells per defined annotation label. At the top of each column in the chart, we display the total number of cells for that label. In total, 828,252 non-blank cells were annotated.
Since the number of annotated Data cells (808,179, or 97.6% of all annotated cells) is overwhelmingly larger than that of the other classes (labels), we decided to downsample. This not only reduces the possibility of a biased feature selection process, but also speeds up training and validation. For this purpose, from the Data region of each annotated table we consider only the first row, the last row, and three random rows in between. By applying this technique, the Data class was reduced to 32,905 cells. Together with the annotated cells from the other classes, the final gold standard, used for training and validation, consists of 52,970 instances.
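The row selection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `downsample_data_rows`, its parameters, and the choice to keep every row of tables with five or fewer Data rows are assumptions made for the example.

```python
import random


def downsample_data_rows(row_indices, n_random=3, seed=None):
    """Keep the first row, the last row, and n_random random rows in between.

    row_indices: row numbers of the Data region of one annotated table.
    Hypothetical helper; small regions are returned unchanged (an assumption).
    """
    rows = sorted(row_indices)
    # Nothing to drop if the region has no more rows than we would keep.
    if len(rows) <= 2 + n_random:
        return rows
    rng = random.Random(seed)
    # Sample without replacement from the rows strictly between first and last.
    middle = rng.sample(rows[1:-1], n_random)
    return sorted([rows[0], rows[-1]] + middle)


if __name__ == "__main__":
    # A Data region spanning rows 10..109 is reduced to five rows.
    print(downsample_data_rows(range(10, 110), seed=0))
```

All cells in the selected rows would then be kept as Data instances, while the cells of the other four classes stay untouched.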
The initial set consisted of 81 cell features, including string features such as the file and sheet name. We retrieved them using Apache POI, which is the most complete Java library for working with Excel spreadsheets. Nevertheless, some of the features required a custom implementation on top of this library. Furthermore, using Weka, a well-known toolkit for data mining tasks, we binarized, cleaned, and evaluated these features. The final set contains 43 features, listed in the table below.
We have grouped the selected features into 5 categories based on their characteristics. The content features describe the cell value, but not its format. The features in the second group (column) characterize the styling aspects of the cell except for the font, which is the subject of the third group. Formulas referencing other cells are the subject of the fourth group. Finally, we have defined features that describe the location (neighborhood) of the cell. The adjacent cells are referred to as “neighbors”. Note that hidden and blank cells are not counted as neighbors.
Independently of the category, features can be numeric or boolean. We mark the numeric ones with a hash sign (#) as a suffix, and the boolean ones with a question mark (?).
For our evaluation, we consider various classification algorithms, most of which have been successfully applied to similar tasks in the literature. Specifically, we consider CART (SimpleCART in Weka), C4.5 (J48 in Weka), Random Forest, and Support Vector Machines (SMO in Weka, which trains the classifier using the sequential minimal optimization algorithm). For the latter, we consider both a polynomial kernel and an RBF kernel. We evaluate the classification performance using 10-fold cross validation. The best scores are achieved by the Random Forest classifier. More detailed results can be found below.
The results from the 10-fold cross validation are summarized in the following table. For each of the considered classification algorithms, we provide the precision, recall, and F1 measure per label. The values are displayed as percentages.
The following table presents the confusion matrix from the evaluation of the Random Forest classifier, which had the highest accuracy in the 10-fold cross validation. The values in bold font (the diagonal) are the numbers of correctly classified cells for each label. The values marked in light red are the most problematic misclassifications, since they occur much more often than other combinations of labels. The second column, which displays the cases where cells were predicted as Data, contains many such values.
| a | d | h | m | b | ← Classified as |
|---|---|---|---|---|---|
| 4038 | 123 | 15 | 33 | 13 | a = Attribute |
| 130 | 32581 | 83 | 51 | 60 | d = Data |
| 20 | 119 | 9195 | 23 | 2 | h = Header |
| 37 | 122 | 94 | 3110 | 7 | m = Metadata |
| 28 | 197 | 18 | 6 | 2865 | b = Derived |
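As a worked example, the per-label measures can be recomputed from the confusion matrix above using the standard definitions of precision, recall, and F1. The sketch below is only an illustration: the names `LABELS`, `CONFUSION`, and `per_label_scores` are introduced here and are not part of the project. Rows are the annotated labels and columns are the predicted labels, in the same order as the table.

```python
LABELS = ["Attribute", "Data", "Header", "Metadata", "Derived"]

# Rows: annotated (true) label; columns: predicted label (a, d, h, m, b).
CONFUSION = [
    [4038, 123, 15, 33, 13],
    [130, 32581, 83, 51, 60],
    [20, 119, 9195, 23, 2],
    [37, 122, 94, 3110, 7],
    [28, 197, 18, 6, 2865],
]


def per_label_scores(matrix, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: cells annotated as label
        predicted = sum(row[i] for row in matrix)  # column sum: cells predicted as label
        precision = true_positives / predicted
        recall = true_positives / actual
        f1 = 2 * precision * recall / (precision + recall)
        scores[label] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    for label, (p, r, f1) in per_label_scores(CONFUSION, LABELS).items():
        print(f"{label:<10} P={p:.1%} R={r:.1%} F1={f1:.1%}")
```

The row sums of the matrix add up to 52,970 instances, of which 32,905 are Data, which matches the size of the downsampled gold standard.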
We also trained the Random Forest classifier and ran 10-fold cross validation on the full dataset of annotated cells (828,252) with the selected features. The F1 measure does not change much for Attribute (96.6%) and Header (97.7%) cells. The classifier scores 99.9% on the F1 measure for Data cells, since the full dataset contains a vast number of instances of this class. The F1 measure decreases for Metadata and Derived, to 93.5% and 94.9% respectively.
To provide a more concrete picture of the accuracy of the classification, the chart below displays the percentage of sheets for which the classifier misclassified two or fewer cells. We have stacked the cases that have no misclassifications with those that have one or two.
Use the following links to access the components used for this research project.
* The Spreadsheet Annotator tool is still in development. It has been tested so far on Windows 7 SP1 with Microsoft Office Excel 2013.
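The per-sheet view described above could be computed along the following lines. This is a sketch under assumptions: the input format (one `(sheet_id, annotated_label, predicted_label)` tuple per cell) and the function name `sheet_error_buckets` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def sheet_error_buckets(cells):
    """Percentage of sheets with 0, and with 1 or 2, misclassified cells.

    cells: iterable of (sheet_id, annotated_label, predicted_label) tuples.
    """
    errors = Counter()
    for sheet_id, annotated, predicted in cells:
        # Adding 0 still registers the sheet, so error-free sheets are counted.
        errors[sheet_id] += int(annotated != predicted)
    total = len(errors)
    if total == 0:
        raise ValueError("no cells given")
    zero = sum(1 for n in errors.values() if n == 0)
    one_or_two = sum(1 for n in errors.values() if 1 <= n <= 2)
    return {
        "0": 100.0 * zero / total,
        "1-2": 100.0 * one_or_two / total,
        "at most 2": 100.0 * (zero + one_or_two) / total,
    }


if __name__ == "__main__":
    sample = [
        ("s1", "Data", "Data"),
        ("s2", "Header", "Data"),
        ("s3", "Data", "Header"), ("s3", "Data", "Header"), ("s3", "Data", "Header"),
    ]
    print(sheet_error_buckets(sample))
```

Stacking the "0" and "1-2" values per corpus would give bars of the kind the chart describes.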