Concept-and-use-of-Data-Dictionary-for-Data-Lakes

Constructing a polystore involves building a data dictionary on top of a data lake to manage heterogeneous data models. Creating the data dictionary at the level of the data lake process can also help facilitate data analytics. Because the data models take the form of different DBMSs and data types, mapping and transformation between data models is required.

In this master thesis, MySQL is used as the repository for relational databases, MongoDB as the repository for JSON documents, and Neo4j as the repository for graph databases. In addition, a separate MySQL server is used to store the data dictionary of all data models.

The challenge faced in the project is constructing a middleware that extracts the schema and metadata of each source and maps them into a final data dictionary in the form of a relational data model stored in MySQL.

The format and structure of JSON documents in MongoDB differ from those of graph databases in Neo4j, and both are schema-less data structures. For this reason, mapping, converting and building a relational data model from JSON documents and graph databases is the most important task of my master thesis. The mapping of all three data models is handled by a Python console program. The data dictionary is created as tables in different databases on a single, separate MySQL server running on localhost. Each data model has its own challenges, both in mapping the data model and in transferring the data dictionary into the data dictionary repository.

The challenge in retrieving the data dictionary from a relational database lies in transferring the query results from one relational DBMS (MySQL) into another MySQL server, which is used as the container of the data dictionary. The data dictionary is extracted from the views of the "information_schema" database and inserted into the other MySQL server, given the required permissions.
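The transfer described above can be sketched as follows. This is a minimal illustration and not the middleware's actual code: the target table `dd_columns` and the function name `copy_column_metadata` are hypothetical, and the cursors are assumed to be DB-API cursors from a MySQL driver that uses `%s` placeholders.

```python
# Columns metadata of one schema, read from the source server.
COLUMNS_QUERY = (
    "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE, COLUMN_KEY "
    "FROM information_schema.COLUMNS "
    "WHERE TABLE_SCHEMA = %s "
    "ORDER BY TABLE_NAME, ORDINAL_POSITION"
)

# Hypothetical target table on the data dictionary server (assumption).
INSERT_STMT = (
    "INSERT INTO dd_columns "
    "(source_schema, table_name, column_name, data_type, is_nullable, column_key) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def copy_column_metadata(src_cursor, dst_cursor, schema):
    """Copy column metadata of `schema` from the source to the dictionary server.

    Returns the number of metadata rows that were copied.
    """
    src_cursor.execute(COLUMNS_QUERY, (schema,))
    # Prefix every row with the schema name so the origin stays traceable.
    rows = [(schema, *row) for row in src_cursor.fetchall()]
    if rows:
        dst_cursor.executemany(INSERT_STMT, rows)
    return len(rows)
```

The caller is expected to commit on the target connection afterwards, and the account used on the target server needs INSERT permission on the dictionary table.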

The process of extracting the data dictionary from MongoDB collections is entirely different. After the data dictionary extraction middleware has selected the requested collections from MongoDB, the collections are loaded into Python data structures. Each document is traversed in full, and its schema is stored in a list of dictionaries. During schema extraction from a JSON document, keys that already exist on the same path are not added to the Python data structures again. In particular, arrays that contain similar records, and nested records that contain similar records, are checked for redundancy in the schema.

As a result, only one sample record of key-value pairs is stored in the JSON schema variable, because there can be dozens of identical records at the same level of a document. Therefore, just one candidate record is stored, and the other records are not added to the schema.
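The traversal and duplicate check described above might look like the following sketch. It is written under assumptions: the function name `extract_schema`, the path notation (`$`, `.key`, `[]`) and the entry layout are illustrative and not taken from the thesis code. All elements of an array share one path, so repeated records contribute each key only once, and the type recorded is the one seen first.

```python
def extract_schema(document, path="$", schema=None, seen=None):
    """Collect one schema entry per distinct (path, key) pair of a document."""
    if schema is None:
        schema = []
    if seen is None:
        seen = set()

    if isinstance(document, dict):
        for key, value in document.items():
            # Skip keys that were already recorded on the same path.
            if (path, key) not in seen:
                seen.add((path, key))
                schema.append(
                    {"path": path, "key": key, "type": type(value).__name__}
                )
            if isinstance(value, (dict, list)):
                extract_schema(value, f"{path}.{key}", schema, seen)
    elif isinstance(document, list):
        for item in document:
            # Every array element is mapped to the same path, so similar
            # records inside one array are stored only once.
            if isinstance(item, (dict, list)):
                extract_schema(item, f"{path}[]", schema, seen)
    return schema


doc = {
    "name": "Alice",
    "orders": [
        {"id": 1, "total": 9.5},
        {"id": 2, "total": 3.0, "note": "gift"},
    ],
}
for entry in extract_schema(doc):
    print(entry)
```

To cover a whole collection, the same `schema` and `seen` objects can be passed to each call so that keys repeated across documents are also stored once.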

After the schema has been extracted into the Python data structure, it is transferred to the MySQL data dictionary server. During this transformation, the name of the collection becomes the name of a database in the data dictionary repository (MySQL). Each key of an object in the list data structure becomes the name of a table, and the value of that object, which holds a set of key-value pairs, becomes the content of the table. Each resulting table has two attributes, the key and the type of the key, and its last tuple records the number of occurrences of the record inside the collection. The same process is repeated for all selected collections.
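The mapping rules above can be illustrated by a function that turns an extracted schema into SQL statements. This is a sketch under assumptions: the input layout (a list of single-entry dictionaries mapping an object name to its key/type pairs), the column names `key_name` and `key_type`, and the `occurrence_count` label of the last tuple are all hypothetical choices made for the example.

```python
def quote_ident(name):
    """Quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_dictionary_statements(collection, schema, occurrences):
    """Return (sql, params) pairs that store `schema` in the data dictionary.

    collection  -> database name
    object name -> table name
    key / type  -> one row per key
    last row    -> number of occurrences of the record in the collection
    """
    db = quote_ident(collection)
    statements = [(f"CREATE DATABASE IF NOT EXISTS {db}", ())]
    for entry in schema:
        for table, fields in entry.items():
            qualified = f"{db}.{quote_ident(table)}"
            statements.append((
                f"CREATE TABLE IF NOT EXISTS {qualified} "
                "(`key_name` VARCHAR(255), `key_type` VARCHAR(64))",
                (),
            ))
            rows = list(fields.items())
            # The last tuple of each table holds the occurrence count.
            rows.append(("occurrence_count", str(occurrences.get(table, 0))))
            insert = (
                f"INSERT INTO {qualified} (`key_name`, `key_type`) "
                "VALUES (%s, %s)"
            )
            statements.extend((insert, row) for row in rows)
    return statements


schema = [{"address": {"city": "str", "zip": "int"}}]
for sql, params in build_dictionary_statements("customers", schema, {"address": 12}):
    print(sql, params)
```

Returning statements instead of executing them keeps the mapping logic separate from the database connection, so each pair can be passed to `cursor.execute(sql, params)` on the data dictionary server.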

To extract the schema from a graph database in Neo4j, the middleware first connects to the active graph database and then executes a number of predefined queries written in the Cypher query language. The results are produced in Python as a list of lists, similar to a table in a relational database, and are finally transferred into the data dictionary tables one by one.
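A possible shape of this step is sketched below. The three Cypher queries are examples only, since the thesis does not list its predefined queries here, and `records_to_table` is a hypothetical helper. The sketch assumes each query result is available as a list of dictionaries, which the official Neo4j Python driver can provide via `[record.data() for record in session.run(query)]`.

```python
# Example schema queries (assumption: not the thesis's actual query set).
SCHEMA_QUERIES = {
    "node_labels": "CALL db.labels() YIELD label RETURN label ORDER BY label",
    "relationship_types": (
        "CALL db.relationshipTypes() YIELD relationshipType "
        "RETURN relationshipType ORDER BY relationshipType"
    ),
    "node_properties": (
        "MATCH (n) UNWIND labels(n) AS label UNWIND keys(n) AS property "
        "RETURN DISTINCT label, property ORDER BY label, property"
    ),
}


def records_to_table(records):
    """Convert a list of record dicts into a list of lists.

    The first inner list is the header; each following list is one row,
    which mirrors the layout of a relational table.
    """
    records = list(records)
    if not records:
        return []
    header = list(records[0].keys())
    return [header] + [[rec.get(col) for col in header] for rec in records]


# Stand-in for the result of the "node_properties" query.
sample = [
    {"label": "Person", "property": "name"},
    {"label": "Person", "property": "age"},
]
for row in records_to_table(sample):
    print(row)
```

Each resulting table can then be written to the data dictionary server, one query result per table.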

AMIRezanejad / Concept-and-use-of-Data-Dictionary-for-Data-Lakes
