Spreadsheets constitute a notably large and valuable collection of documents in enterprise settings and on the Web. However, extracting data from these files is a cumbersome task: they are optimized for human consumption and are poorly suited to automatic machine processing.
Koci et al. therefore proposed a table identification and extraction process that incorporates a cell classification task as a first step. Since classification is a supervised-learning approach, it requires a substantial amount of labeled training data, which is expensive to obtain. In this thesis, we therefore want to apply Active Learning (AL), an iterative learning process that trains classification models by selecting only the most informative examples from an unlabeled dataset and passing them to an oracle (normally a human expert) for labeling. The main task is to implement the active learning cycle for the spreadsheet cell classification scenario and to evaluate the different selection strategies described in the research literature. The implementation should be flexible enough to test various strategies.
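To make the active learning cycle concrete, the following is a minimal sketch in Python, not the implementation developed in this thesis. It assumes a pool of numeric feature vectors, an `oracle` callable that stands in for the human expert, and a toy nearest-centroid classifier in place of the actual cell classifier. All function names, the feature values, and the labels `"data"` and `"header"` are hypothetical. The selection strategy is passed in as a function, which is one simple way to keep strategies interchangeable.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
# A strategy maps predicted class probabilities to an informativeness score.
Strategy = Callable[[List[float]], float]


def fit_centroids(X: List[Vector], y: List[str]) -> Dict[str, List[float]]:
    """Toy stand-in classifier (an assumption): one mean vector per class."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_proba(model: Dict[str, List[float]], x: Vector) -> List[float]:
    """Softmax over negative Euclidean distances to the class centroids."""
    weights = [math.exp(-math.dist(x, centroid)) for centroid in model.values()]
    total = sum(weights)
    return [w / total for w in weights]


def least_confidence(probs: List[float]) -> float:
    """Uncertainty sampling: high score when the top class is not confident."""
    return 1.0 - max(probs)


def entropy(probs: List[float]) -> float:
    """Alternative strategy: Shannon entropy of the predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def active_learning(
    pool: List[Vector],
    oracle: Callable[[Vector], str],
    strategy: Strategy,
    seed: List[int],
    budget: int,
) -> Dict[int, str]:
    """Run `budget` query rounds and return the labels collected per pool index."""
    labeled = {i: oracle(pool[i]) for i in seed}  # initial labeled set
    for _ in range(budget):
        # Retrain on everything labeled so far.
        model = fit_centroids([pool[i] for i in labeled], list(labeled.values()))
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # Query the example the strategy scores as most informative.
        query = max(candidates, key=lambda i: strategy(predict_proba(model, pool[i])))
        labeled[query] = oracle(pool[query])
    return labeled
```

Swapping `least_confidence` for `entropy` (or any other function with the same signature) changes the selection strategy without touching the loop, which is the kind of flexibility the task calls for.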