ADV - Automatic Data Visualization Tool

Automatic Data Visualization Tool - Idea

This file contains the ideas and work plan of the ADV tool.

The Idea:

    Automatic Data Visualization (ADV) is a tool that will allow users to feed any kind of data into the pipeline and get plots of it back.

    ADV basically analyses the contents of a file and decides which plotting algorithms or methods best fit the data the user provides. The input is analysed from all kinds of aspects, such as column types, relations, data variances, means, and medians. The tool then generates metadata for the input, no matter what shape the input has. The system uses state-of-the-art machine learning techniques to decide what works best while looking only at this metadata. The resulting decisions become the input for the plotter engine, an algorithm that can perform any kind of plotting action. Users can then view the generated results within seconds in a good-looking UI, interact with them, and give them one last touch before exporting. The final step is the export of the results, prepared in a way that preserves the meaning of every aspect of the data.

    In the end, one simple tool will do all of the work a data scientist would perform on a dataset to get meaningful plots. The mission here is to help individuals not only by showing what is hidden inside the data, but also by helping them understand it.

Process:

  1. Import API
  2. Input analysis
    1. File format checker
    2. File content checker
      1. File datatypes checking
      2. Language analysis & defining categorical values
      3. Numeric data stats calculation
    3. Metadata (Data dictionary) generating
      1. Metadata export API
  3. Plot models selection (Metadata fed to ML models)
  4. Plotter Engine
  5. Interactive Visualization Engine
  6. Export API

Design Notes on Steps:

  1. Import API

    A basic import API. It receives the desired data through a UI and, in production, places the files in S3 or another secure data store.
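
    A minimal sketch of what the Import API could look like, assuming FastAPI for the HTTP layer and boto3 for S3; the framework choice and the bucket name are illustrative, not decided here.

    import boto3
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI()
    s3 = boto3.client("s3")
    BUCKET = "adv-uploads"  # hypothetical bucket name

    @app.post("/import")
    async def import_file(file: UploadFile = File(...)):
        # Stream the uploaded file straight into object storage so later
        # pipeline stages can fetch it by key.
        s3.upload_fileobj(file.file, BUCKET, file.filename)
        return {"key": file.filename, "status": "stored"}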

  2. Input analysis

    1. File format checker

      Validates the file format. Start with .csv and expand later on. The diversity of formats must not affect the tool's working model, since every format ends up as the same metadata; in other words, the metadata generator is the adaptor between the user-facing side and the rest of the pipeline.
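
      A small sketch of the format check, restricted to .csv for now as described above; csv.Sniffer is one possible way to confirm the content actually parses as CSV.

      import csv
      from pathlib import Path

      SUPPORTED_FORMATS = {".csv"}  # expand later on (.xlsx, .json, ...)

      def check_file_format(path: str) -> bool:
          # Reject unsupported extensions outright.
          if Path(path).suffix.lower() not in SUPPORTED_FORMATS:
              return False
          try:
              # Sniff a small sample; csv.Error means it does not look like CSV.
              with open(path, newline="") as f:
                  csv.Sniffer().sniff(f.read(2048))
              return True
          except (csv.Error, OSError):
              return False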

    2. File content checker

      1. File datatypes checking

        Checks for possible problems and errors in the data. The output is a go or no-go decision for the entire process.
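
        A sketch of the go / no-go decision, assuming pandas is used to load the file: the step passes only if the file parses, is non-empty, and has unambiguous column names.

        import pandas as pd

        def check_file_content(path: str) -> bool:
            try:
                df = pd.read_csv(path)
            except Exception:
                return False  # unreadable or malformed file: no go
            if df.empty or df.columns.duplicated().any():
                return False  # nothing to plot, or ambiguous columns: no go
            return True  # go for the rest of the pipeline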

      2. Language analysis & defining categorical values

        NLP tools decide whether the data is hierarchical and check its context. #Cont. improvement.
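
        The NLP part is a continuous-improvement item; as a stand-in, a simple heuristic can flag a string column as categorical when it has few distinct values relative to the number of rows. The 0.5 threshold below is an assumption, not a specified parameter.

        import pandas as pd

        def looks_categorical(column: pd.Series, max_unique_ratio: float = 0.5) -> bool:
            # Only string-typed columns are considered here; numeric and time
            # columns are handled by the other checks.
            if column.dtype != object:
                return False
            return column.nunique() / max(len(column), 1) <= max_unique_ratio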

      3. Numeric data stats calculation

        Basic data science formulas and calculations are applied to the numerical columns to measure and characterize the data and its patterns.
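
        A sketch of the per-column statistics; ddof=0 gives the population variance, which matches the variance values in the example metadata at the end of this document.

        import pandas as pd

        def numeric_stats(column: pd.Series) -> dict:
            # Summary statistics that later feed the metadata record.
            return {
                "mean": column.mean(),
                "variance": column.var(ddof=0),
                "median": column.median(),
                "min": column.min(),
                "max": column.max(),
            }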

    3. Metadata (Data dictionary) generating

      The most important step of the pipeline. Converts the input data into a single metadata record (data dictionary) for feeding the ML models; a sketch follows the export sub-item below.

      1. Metadata Export API (Export API can be used)

        Extra feature: export the metadata of the input data back to the user.
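
      A sketch of the metadata generator, assuming pandas: it reduces any tabular input to one fixed-shape data dictionary with the same fields used in the example at the end of this document. The is_time heuristic is a crude placeholder.

      import pandas as pd

      def generate_metadata(df: pd.DataFrame) -> pd.DataFrame:
          rows = []
          for name in df.columns:
              col = df[name]
              numeric = pd.api.types.is_numeric_dtype(col)
              rows.append({
                  "Row_Name": name,
                  "Data_Type": col.dtype.name,
                  "is_categorical": int(col.dtype == object and col.nunique() <= len(col) / 2),
                  "is_time": int("year" in name.lower() or "date" in name.lower()),
                  "mean": col.mean() if numeric else None,
                  "variance": col.var(ddof=0) if numeric else None,
              })
          return pd.DataFrame(rows)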

  3. Plot models selection (Metadata fed to ML models)

    ML models that take the metadata and return the best plot options (with custom parameters) for each column of the input data. #Cont. improvement.
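
    A sketch of how the metadata could feed a plot-deciding model, assuming scikit-learn; the labelled training data (metadata rows paired with the best plot type) does not exist yet and is listed under "Data Finding" in the TO-DOs below.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    FEATURES = ["is_categorical", "is_time", "mean", "variance"]

    def train_plot_model(metadata: pd.DataFrame, plot_labels) -> DecisionTreeClassifier:
        # Missing stats (e.g. the mean of a string column) are imputed with 0
        # here; a real model would need a better missing-value strategy.
        model = DecisionTreeClassifier()
        model.fit(metadata[FEATURES].fillna(0), plot_labels)
        return model

    def select_plots(model, metadata: pd.DataFrame):
        # Returns one predicted plot type per column of the input data.
        return model.predict(metadata[FEATURES].fillna(0))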

  4. Plotter Engine

    Plots the real data in the order and with the settings provided by the ML models.
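
    A sketch of the plotter engine, assuming matplotlib: it walks the ML output and dispatches each column of the real data to a routine keyed by Plot_Type (the types are those used in the example ML output at the end of this document).

    import matplotlib.pyplot as plt
    import pandas as pd

    def plot_column(data: pd.Series, plot_type: str):
        fig, ax = plt.subplots()
        if plot_type == "Histogram" and pd.api.types.is_numeric_dtype(data):
            ax.hist(data.dropna())
        elif plot_type in ("Histogram", "Piechart"):
            # Non-numeric columns are reduced to value counts first.
            counts = data.value_counts()
            if plot_type == "Piechart":
                ax.pie(counts.values, labels=counts.index)
            else:
                ax.bar(counts.index.astype(str), counts.values)
        elif plot_type == "Time_series_plot":
            data.value_counts().sort_index().plot(ax=ax)
        ax.set_title(str(data.name))
        return fig

    def run_plotter(df: pd.DataFrame, ml_output: pd.DataFrame):
        # One figure per column that received a plot decision.
        return [plot_column(df[row.Row_Name], row.Plot_Type)
                for row in ml_output.itertuples() if row.Plot_Type]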

  5. Interactive Visualization Engine & Display

    A UI for the user-facing side of the tool. Users can edit, view, and interact with the data and export its final form.

  6. Export API

    Exports the last version of the visuals with customized templates.
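
    A sketch of the export step, assuming the matplotlib figures produced by the plotter engine: each figure is written out in the requested format, and the template argument is only a placeholder for the customized templates mentioned above.

    from pathlib import Path

    def export_figures(figures, out_dir: str, fmt: str = "png", template: str = "default"):
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        paths = []
        for i, fig in enumerate(figures):
            path = Path(out_dir) / f"plot_{i}.{fmt}"
            fig.savefig(path)  # template styling would be applied here
            paths.append(str(path))
        return paths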

TO-DOs:

List of unordered tasks that need to be done.

  • Import API
  • Input Analyse Algorithm
    • Input Analyse Algorithm - Research & Design
    • Input Analyse Algorithm - File Format Check (Done at import api)
    • Input Analyse Algorithm - File Datatype Checker
    • Input Analyse Algorithm - Language Analysis Model Build for EN
    • Input Analyse Algorithm - Language Analysis Model Build for TR
    • Input Analyse Algorithm - Language Analysis Model Integration to Pipeline
    • Input Analyse Algorithm - Language Analysis - Process String Input Data
    • Input Analyse Algorithm - Numeric Input Analyse & Calculation
  • Metadata Generator
  • Export API
  • Plot Deciding Model Development
    • Plot Deciding Model Development - Research & Design
    • Plot Deciding Model Development - Data Finding
    • Plot Deciding Model Development - Training
    • Plot Deciding Model Development - Integration
  • Plotter Engine
    • Plotter Engine - Research & Design
    • Plotter Engine - Build
    • Plotter Engine - Integration
  • UI Development
    • UI Development - Design
    • UI Development - Input Page Development
    • UI Development - Output Page Development
  • Export Wrapper Engine
  • Dockerize
    • Dockerize - Convert to Microservice Architecture
    • Dockerize - Dockerfile Configuration for Each Microservice
    • Dockerize - Docker-Compose Configuration
  • Log Module Implementation
  • Git Repo Design & Structure
  • S3 Bucket Configuration
  • Cloud DevOps - Research & Design
  • Proof of Concept

Example Data

import pandas as pd

# A small example of raw user input.
example_input_data = {"Id": [1, 2, 3, 4],
                      "Size": ["L", "XL", "S", "S"],
                      "Price": [5.99, 6.99, 3.99, 3.99],
                      "Color": ["Red", "Red", "Red", "Blue"],
                      "Weight": [200, 300, 150, 150],
                      "Produce_Year": [2019, 2018, 2019, 2017]}
# The metadata (data dictionary) the analysis step would derive from it.
example_metadata = {"Row_Name": ["Id", "Size", "Price", "Color", "Weight", "Produce_Year"],
                    "Data_Type": ["int", "str", "float", "str", "int", "int"],
                    "is_categorical": [0, 1, 0, 0, 0, 0],
                    "is_time": [0, 0, 0, 0, 0, 1],
                    "mean": [None, None, 5.24, None, 200, 2018.25],
                    "variance": [None, None, 1.6875, None, 3750.0, 0.6875]}
# The plot decisions the ML model would return for each column.
ml_output = {"Row_Name": ["Id", "Size", "Price", "Color", "Weight", "Produce_Year"],
             "Plot_Type": [None, "Histogram", "Histogram", "Piechart", "Histogram", "Time_series_plot"],
             "Custom_Parameters": [None, "p=2", "p=3, k=9", "colors=['r','b','g','o']", None, "zoom_level=3"]}
df1 = pd.DataFrame(data=example_input_data)
df2 = pd.DataFrame(data=example_metadata)
df3 = pd.DataFrame(data=ml_output)
print("Input Data:")
df1
Input Data:
   Id Size  Price Color  Weight  Produce_Year
0   1    L   5.99   Red     200          2019
1   2   XL   6.99   Red     300          2018
2   3    S   3.99   Red     150          2019
3   4    S   3.99  Blue     150          2017
print("Metadata:")
df2
Metadata:
       Row_Name Data_Type  is_categorical  is_time     mean   variance
0            Id       int               0        0      NaN        NaN
1          Size       str               1        0      NaN        NaN
2         Price     float               0        0     5.24     1.6875
3         Color       str               0        0      NaN        NaN
4        Weight       int               0        0   200.00  3750.0000
5  Produce_Year       int               0        1  2018.25     0.6875
print("ML Output:")
df3
ML Output:
       Row_Name         Plot_Type         Custom_Parameters
0            Id              None                      None
1          Size         Histogram                       p=2
2         Price         Histogram                  p=3, k=9
3         Color          Piechart  colors=['r','b','g','o']
4        Weight         Histogram                      None
5  Produce_Year  Time_series_plot              zoom_level=3
