ADV - Automatic Data Visualization Tool

Automatic Data Visualization Tool - Idea

This file contains the ideas and work plan of the ADV tool.

The Idea:

    Automatic Data Visualization (ADV) is a tool that will allow users to feed any kind of data into the pipeline and get plots of it back.

    ADV basically analyses the contents of a file and decides which plotting algorithms or methods best fit the data the user provides. The input is analysed from all kinds of aspects, such as column types, relations, data variances, means, and medians. The tool then generates metadata for the input, no matter what shape the input has. The system uses state-of-the-art machine learning techniques to decide what works best while looking only at this metadata. The resulting decisions become the input for the plotter engine, an algorithm that can perform any kind of plotting action. Users can then view the generated results within seconds in a good-looking UI, interact with them, and give them one last touch before exporting. The final step is the export of the results, prepared in a way that preserves the meaning of every aspect of the data.

    In the end, one simple tool will do all of the work a data scientist would perform on a dataset to get meaningful plots. The mission here is to help individuals not only by showing what is hidden inside the data, but also by helping them understand it.

Process:

  1. Import API
  2. Input analysis
    1. File format checker
    2. File content checker
      1. File datatypes checking
      2. Language analysis & defining categorical values
      3. Numeric data stats calculation
    3. Metadata (Data dictionary) generating
      1. Metadata export API
  3. Plot models selection (Metadata fed to ML models)
  4. Plotter Engine
  5. Interactive Visualization Engine
  6. Export API

Design Notes on Steps:

  1. Import API

    A basic import API. It receives the desired data through a UI and, in production, places the files in S3 or another secure data store.
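
    A minimal sketch of what the Import API could look like, assuming FastAPI for the HTTP layer and boto3 for S3; the framework choice and the bucket name are illustrative, not decided here.

    import boto3
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI()
    s3 = boto3.client("s3")
    BUCKET = "adv-uploads"  # hypothetical bucket name

    @app.post("/import")
    async def import_file(file: UploadFile = File(...)):
        # Stream the uploaded file straight into object storage so later
        # pipeline stages can fetch it by key.
        s3.upload_fileobj(file.file, BUCKET, file.filename)
        return {"key": file.filename, "status": "stored"}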

  2. Input analysis

    1. File format checker

      Validates the file format. Start with .csv and expand later on. The diversity of formats must not affect the tool's working model, since every format ends up as the same metadata; in other words, the metadata generator is the adaptor between the user-facing side and the rest of the pipeline.
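
      A small sketch of the format check, restricted to .csv for now as described above; csv.Sniffer is one possible way to confirm the content actually parses as CSV.

      import csv
      from pathlib import Path

      SUPPORTED_FORMATS = {".csv"}  # expand later on (.xlsx, .json, ...)

      def check_file_format(path: str) -> bool:
          # Reject unsupported extensions outright.
          if Path(path).suffix.lower() not in SUPPORTED_FORMATS:
              return False
          try:
              # Sniff a small sample; csv.Error means it does not look like CSV.
              with open(path, newline="") as f:
                  csv.Sniffer().sniff(f.read(2048))
              return True
          except (csv.Error, OSError):
              return False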

    2. File content checker

      1. File datatypes checking

        Checks for possible problems and errors in the data. The output is a go or no-go decision for the entire process.
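
        A sketch of the go / no-go decision, assuming pandas is used to load the file: the step passes only if the file parses, is non-empty, and has unambiguous column names.

        import pandas as pd

        def check_file_content(path: str) -> bool:
            try:
                df = pd.read_csv(path)
            except Exception:
                return False  # unreadable or malformed file: no go
            if df.empty or df.columns.duplicated().any():
                return False  # nothing to plot, or ambiguous columns: no go
            return True  # go for the rest of the pipeline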

      2. Language analysis & defining categorical values

        NLP tools decide whether the data is hierarchical and check its context. #Cont. improvement.
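
        The NLP part is a continuous-improvement item; as a stand-in, a simple heuristic can flag a string column as categorical when it has few distinct values relative to the number of rows. The 0.5 threshold below is an assumption, not a specified parameter.

        import pandas as pd

        def looks_categorical(column: pd.Series, max_unique_ratio: float = 0.5) -> bool:
            # Only string-typed columns are considered here; numeric and time
            # columns are handled by the other checks.
            if column.dtype != object:
                return False
            return column.nunique() / max(len(column), 1) <= max_unique_ratio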

      3. Numeric data stats calculation

        Basic data science formulas and calculations are applied to the numerical columns to measure and characterize the data and its patterns.
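
        A sketch of the per-column statistics; ddof=0 gives the population variance, which matches the variance values in the example metadata at the end of this document.

        import pandas as pd

        def numeric_stats(column: pd.Series) -> dict:
            # Summary statistics that later feed the metadata record.
            return {
                "mean": column.mean(),
                "variance": column.var(ddof=0),
                "median": column.median(),
                "min": column.min(),
                "max": column.max(),
            }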

    3. Metadata (Data dictionary) generating

      The most important step of the pipeline. Converts the input data into a single metadata record (data dictionary) for feeding the ML models; a sketch follows the export sub-item below.

      1. Metadata Export API (Export API can be used)

        Extra feature: export the metadata of the input data back to the user.
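
      A sketch of the metadata generator, assuming pandas: it reduces any tabular input to one fixed-shape data dictionary with the same fields used in the example at the end of this document. The is_time heuristic is a crude placeholder.

      import pandas as pd

      def generate_metadata(df: pd.DataFrame) -> pd.DataFrame:
          rows = []
          for name in df.columns:
              col = df[name]
              numeric = pd.api.types.is_numeric_dtype(col)
              rows.append({
                  "Row_Name": name,
                  "Data_Type": col.dtype.name,
                  "is_categorical": int(col.dtype == object and col.nunique() <= len(col) / 2),
                  "is_time": int("year" in name.lower() or "date" in name.lower()),
                  "mean": col.mean() if numeric else None,
                  "variance": col.var(ddof=0) if numeric else None,
              })
          return pd.DataFrame(rows)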

  3. Plot models selection (Metadata fed to ML models)

    ML models that take the metadata and return the best plot options (with custom parameters) for each column of the input data. #Cont. improvement.
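
    A sketch of how the metadata could feed a plot-deciding model, assuming scikit-learn; the labelled training data (metadata rows paired with the best plot type) does not exist yet and is listed under "Data Finding" in the TO-DOs below.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    FEATURES = ["is_categorical", "is_time", "mean", "variance"]

    def train_plot_model(metadata: pd.DataFrame, plot_labels) -> DecisionTreeClassifier:
        # Missing stats (e.g. the mean of a string column) are imputed with 0
        # here; a real model would need a better missing-value strategy.
        model = DecisionTreeClassifier()
        model.fit(metadata[FEATURES].fillna(0), plot_labels)
        return model

    def select_plots(model, metadata: pd.DataFrame):
        # Returns one predicted plot type per column of the input data.
        return model.predict(metadata[FEATURES].fillna(0))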

  4. Plotter Engine

    Plots the real data in the order and with the settings provided by the ML models.
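
    A sketch of the plotter engine, assuming matplotlib: it walks the ML output and dispatches each column of the real data to a routine keyed by Plot_Type (the types are those used in the example ML output at the end of this document).

    import matplotlib.pyplot as plt
    import pandas as pd

    def plot_column(data: pd.Series, plot_type: str):
        fig, ax = plt.subplots()
        if plot_type == "Histogram" and pd.api.types.is_numeric_dtype(data):
            ax.hist(data.dropna())
        elif plot_type in ("Histogram", "Piechart"):
            # Non-numeric columns are reduced to value counts first.
            counts = data.value_counts()
            if plot_type == "Piechart":
                ax.pie(counts.values, labels=counts.index)
            else:
                ax.bar(counts.index.astype(str), counts.values)
        elif plot_type == "Time_series_plot":
            data.value_counts().sort_index().plot(ax=ax)
        ax.set_title(str(data.name))
        return fig

    def run_plotter(df: pd.DataFrame, ml_output: pd.DataFrame):
        # One figure per column that received a plot decision.
        return [plot_column(df[row.Row_Name], row.Plot_Type)
                for row in ml_output.itertuples() if row.Plot_Type]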

  5. Interactive Visualization Engine & Display

    A UI for the user-facing side of the tool. Users can edit, view, and interact with the data and export its final form.

  6. Export API

    Exports the last version of the visuals with customized templates.
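
    A sketch of the export step, assuming the matplotlib figures produced by the plotter engine: each figure is written out in the requested format, and the template argument is only a placeholder for the customized templates mentioned above.

    from pathlib import Path

    def export_figures(figures, out_dir: str, fmt: str = "png", template: str = "default"):
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        paths = []
        for i, fig in enumerate(figures):
            path = Path(out_dir) / f"plot_{i}.{fmt}"
            fig.savefig(path)  # template styling would be applied here
            paths.append(str(path))
        return paths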

TO-DOs:

List of unordered tasks that need to be done.

  • Import API
  • Input Analyse Algorithm
    • Input Analyse Algorithm - Research & Design
    • Input Analyse Algorithm - File Format Check (Done at import api)
    • Input Analyse Algorithm - File Datatype Checker
    • Input Analyse Algorithm - Language Analysis Model Build for EN
    • Input Analyse Algorithm - Language Analysis Model Build for TR
    • Input Analyse Algorithm - Language Analysis Model Integration to Pipeline
    • Input Analyse Algorithm - Language Analysis - Process String Input Data
    • Input Analyse Algorithm - Numeric Input Analyse & Calculation
  • Metadata Generator
  • Export API
  • Plot Deciding Model Development
    • Plot Deciding Model Development - Research & Design
    • Plot Deciding Model Development - Data Finding
    • Plot Deciding Model Development - Training
    • Plot Deciding Model Development - Integration
  • Plotter Engine
    • Plotter Engine - Research & Design
    • Plotter Engine - Build
    • Plotter Engine - Integration
  • UI Development
    • UI Development - Design
    • UI Development - Input Page Development
    • UI Development - Output Page Development
  • Export Wrapper Engine
  • Dockerize
    • Dockerize - Convert to Microservice Architecture
    • Dockerize - Dockerfile Configuration for Each Microservice
    • Dockerize - Docker-Compose Configuration
  • Log Module Implementation
  • Git Repo Design & Structure
  • S3 Bucket Configuration
  • Cloud DevOps - Research & Design
  • Proof of Concept

Example Data

import pandas as pd

# A small example of raw user input.
example_input_data = {"Id": [1, 2, 3, 4],
                      "Size": ["L", "XL", "S", "S"],
                      "Price": [5.99, 6.99, 3.99, 3.99],
                      "Color": ["Red", "Red", "Red", "Blue"],
                      "Weight": [200, 300, 150, 150],
                      "Produce_Year": [2019, 2018, 2019, 2017]}
# The metadata (data dictionary) the analysis step would derive from it.
example_metadata = {"Row_Name": ["Id", "Size", "Price", "Color", "Weight", "Produce_Year"],
                    "Data_Type": ["int", "str", "float", "str", "int", "int"],
                    "is_categorical": [0, 1, 0, 0, 0, 0],
                    "is_time": [0, 0, 0, 0, 0, 1],
                    "mean": [None, None, 5.24, None, 200, 2018.25],
                    "variance": [None, None, 1.6875, None, 3750.0, 0.6875]}
# The plot decisions the ML model would return for each column.
ml_output = {"Row_Name": ["Id", "Size", "Price", "Color", "Weight", "Produce_Year"],
             "Plot_Type": [None, "Histogram", "Histogram", "Piechart", "Histogram", "Time_series_plot"],
             "Custom_Parameters": [None, "p=2", "p=3, k=9", "colors=['r','b','g','o']", None, "zoom_level=3"]}
df1 = pd.DataFrame(data=example_input_data)
df2 = pd.DataFrame(data=example_metadata)
df3 = pd.DataFrame(data=ml_output)
print("Input Data:")
df1
Input Data:
   Id Size  Price Color  Weight  Produce_Year
0   1    L   5.99   Red     200          2019
1   2   XL   6.99   Red     300          2018
2   3    S   3.99   Red     150          2019
3   4    S   3.99  Blue     150          2017
print("Metadata:")
df2
Metadata:
       Row_Name Data_Type  is_categorical  is_time     mean   variance
0            Id       int               0        0      NaN        NaN
1          Size       str               1        0      NaN        NaN
2         Price     float               0        0     5.24     1.6875
3         Color       str               0        0      NaN        NaN
4        Weight       int               0        0   200.00  3750.0000
5  Produce_Year       int               0        1  2018.25     0.6875
print("ML Output:")
df3
ML Output:
       Row_Name         Plot_Type         Custom_Parameters
0            Id              None                      None
1          Size         Histogram                       p=2
2         Price         Histogram                  p=3, k=9
3         Color          Piechart  colors=['r','b','g','o']
4        Weight         Histogram                      None
5  Produce_Year  Time_series_plot              zoom_level=3
