DanieleDePaola / UnravelMultiFormatAnalysis

A possible implementation of the Unravel algorithm. The original paper is available here: https://www.microsoft.com/en-us/research/uploads/prod/2020/10/OOPSLA20_Structural_Interpretation_Text_Formats.pdf

UnravelMultiFormatAnalysis

Extracting data from raw datasets and transforming it into readable, comprehensible formats is a common and critical activity in every data science project and application. This is especially true for data coming from IoT sensors (e.g. log files containing JSON, hyperlinks and several other types of information stored with different representations) and for datasets extracted from hypermedia web pages (containing several types of data such as links and images).

As the variety of text formats and data representations grows, an increasing percentage of documents contains a mixture of data notations that can hardly be interpreted by a single parsing technique: the fields or pieces of information in a text document may be separated by different sets of delimiting punctuation, may be represented using complex open standard file formats such as JSON or XML, or may be organised through a combination of delimiters and text formatting techniques. The presence of non-standard delimiters and notations specifically hinders the generation of widely used standard dataset formats such as CSV, which represent the data as well-separated fields organised into tables.

In this work, we therefore introduce a possible implementation of a text parsing procedure, Unravel (Gulwani et al., 2020), which is able to handle text files containing different text formats and to translate them into several readable and accessible tabular versions, according to the user's preferences. After introducing several motivating examples for the application of the Unravel procedure, we define the main theoretical characteristics of the general procedure formulated in the original paper (Gulwani et al., 2020). We then present the simplified version we implemented and, finally, some interesting results from the experiments conducted on different datasets.
The original paper is available here: https://www.microsoft.com/en-us/research/uploads/prod/2020/10/OOPSLA20_Structural_Interpretation_Text_Formats.pdf
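
To make the mixed-format problem concrete, the following minimal Python sketch shows why a single delimiter-based parse is not enough for a line that embeds JSON inside delimited fields. It is not the Unravel algorithm and is not taken from this repository: the log line, the field names and the helper functions `naive_comma_split` and `two_stage_parse` are hypothetical and hand-written for illustration only. Unravel aims to infer this kind of structure automatically instead of relying on a parser written by hand for each format.

```python
import json

# Hypothetical IoT-style log line (an assumption for illustration):
# pipe-delimited fields, one of which is a JSON object.
LINE = (
    '2020-10-01T12:00:03|sensor-7|'
    '{"temp": 21.5, "unit": "C", "tags": ["lab,east", "roof"]}|'
    'https://example.com/d?id=7'
)


def naive_comma_split(line):
    """Treat the line as plain comma-separated text.

    The commas inside the JSON payload are mistaken for field
    separators, so the resulting pieces do not match real fields.
    """
    return line.split(",")


def two_stage_parse(line):
    """Split on the outer delimiter, then parse the JSON field separately."""
    timestamp, sensor, payload, url = line.split("|")
    record = {"timestamp": timestamp, "sensor": sensor, "url": url}
    # Flatten the top-level JSON keys into columns of the same row.
    for key, value in json.loads(payload).items():
        record[f"payload.{key}"] = value
    return record


pieces = naive_comma_split(LINE)
row = two_stage_parse(LINE)
print(len(pieces), "pieces from the naive split")
print(row)
```

The two-stage version only works because the outer delimiter and the position of the JSON field are known in advance, which is exactly the knowledge that is missing when a file mixes unknown notations.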

About

License: Creative Commons Zero v1.0 Universal


Languages

Language: Jupyter Notebook 100.0%