From Semi-Structured Documents to Relations

Spreadsheets constitute a large and valuable body of documents in enterprise settings and on the Web. They are used extensively by business professionals, scientists, and everyday users. In recent years, with the advent of the open data movement, an increasing number of government agencies, nonprofit organizations, and other institutions have made data available as spreadsheets. However, transforming these data into another format or combining them with other sources (including other spreadsheets) is a rather cumbersome task that still requires considerable involvement from the user. The reason is that spreadsheets were designed primarily for human consumption rather than machine consumption. At the same time, growing data availability and technological advances have increased both the demand for and the feasibility of deeper and more accurate data analysis. At the enterprise level, new concepts have emerged, such as "big data" and "data lakes". It has become increasingly apparent that the ability to integrate and reuse data from different formats can be very beneficial. These observations motivate the search for better methods to leverage the richness of spreadsheet data.

This PhD thesis aims to tackle this challenge by implementing a system (pipeline) that is able to understand the characteristics of arbitrary spreadsheets (e.g., the structure of their data) and to extract their data. This processing pipeline has to perform many consecutive tasks automatically, each dealing with a different aspect of the spreadsheet content, before it can produce a rich, usable output. In addition, the system should take into account that not all spreadsheets contain meaningful data. Spreadsheets are also used to create forms, scorecards, graphs, and other structures that are not genuine tables. The intended solution should be able to filter out such cases and process only genuine tables.

We envision that RDBMSs will be the primary environments for consuming the data exported from a spreadsheet. After all, RDBMSs are the most widely used data management systems. However, our aim is to go beyond the relational model. The output of the processing pipeline will be stored in a generic intermediate format that is capable of maintaining not only the exported data, but also its explicit and implicit characteristics. Furthermore, the system should be capable of transforming this output on demand into popular formats, such as JSON, XML, RDF, and relational tables.

Finally, we aim at a solution that is able to handle large and varied collections of spreadsheets. Therefore, we have considered datasets of considerable size from different domains for our experiments. These provide the setting for building a system that can be used at the enterprise level. Furthermore, such a system can become an integral component of research projects in related areas, such as information retrieval, data management, and document analysis.