ghosthamlet / autoEDA-resources

A list of software and papers related to automatic and fast Exploratory Data Analysis

autoEDA-resources

A list of software and papers related to automated Exploratory Data Analysis, including

  • fast data exploration and visualization,
  • augmented analytics,
  • visualization recommendation and other tools that speed up data exploration (visual exploration in particular).

Pull requests with software, papers and conference presentations are welcome.

Software

R packages

My summary of R packages is in The R Journal.

Complete Packages

  • dataMaid (CRAN package) - automated checks of data validity.

  • DataExplorer (CRAN package) - automated data exploration (including univariate and bivariate plots, PCA) and treatment.

  • funModeling (CRAN package) - automated EDA, simple feature engineering and outlier detection.

  • SmartEDA (CRAN package) - automated generation of descriptive statistics, uni- and bivariate plots and parallel coordinate plots. Details can be found in a dedicated paper.

  • autoEDA (GitHub package) - automated EDA with uni- and bivariate plots. An article with an introduction can be found on LinkedIn.

    • auto-EDA (GitHub package) - uni- and bivariate plots for data exploration in regression and classification problems. The package cleans data automatically to improve the plots. Another version of Xander Horn's package.

  • visdat (CRAN package) - 6 exploratory/diagnostic plots for initial data analysis.

  • dlookr (CRAN package) - tools for data quality diagnosis, basic exploration and feature transformations.

  • xray (CRAN package) - first look at the data - distributions and anomalies. More in the blog post.

  • arsenal (CRAN package) - statistical summaries (models and exploration) and quick reporting.

  • RtutoR (CRAN package) - learning material with an automatic reports module. More at R-Bloggers.

  • exploreR (CRAN package) - exploration based on univariate linear regression.

  • summarytools (CRAN package) - tables to summarise datasets and perform simple uni- and bivariate analyses.

  • inspectdf (CRAN package) - tools for column-wise exploration and comparison of data frames. Examples are provided in a README of the GitHub repo.

  • explore (CRAN package) - interactive Shiny app for comprehensive dataset exploration (including uni- and bivariate relationships, correlation analysis and simple modeling with decision trees) and a stand-alone function for quick exploration. Examples are given in a vignette.

  • skimr (CRAN package) - well formatted summaries of data frames, vectors and matrices. Examples are provided in a vignette.

  • janitor (CRAN package) - tools for fast data cleaning. All functionalities are introduced in the vignette.

  • autoplotly (CRAN package) - a library for fast visualization of statistical results supported by ggfortify. Details can be found in the vignette or the JOSS paper.

Packages in Development

  • AEDA (GitHub package) - summary statistics, correlation analysis, cluster analysis, PCA & other projections.

  • dataexpks (GitHub package) - quick reports with basic data summaries.

  • automatic-data-explorer (GitHub package) - basic EDA and creating Markdown reports from multiple R scripts.

  • xda (GitHub package) - basic data summaries.

  • EDA - stub of a package.

  • modeler (GitHub package) - tools for exploration and pre-processing.

  • IEDA (GitHub package) - EDA simplified through interactive visualization.

  • seda (GitHub package) - fast EDA tool in active development.

Domain-specific packages

Related packages

  • featuretoolsR (CRAN package) - R port of a Python library for automated feature engineering.

  • vtreat (CRAN package) - data treatment (pre-processing) that includes dealing with missing data and large categorical variables. Details can be found in the paper about vtreat.

  • report - automated modeling report generation.

  • FactoInvestigate (CRAN package) - has an automatic reporting module which selects the best plots to summarise different projection techniques.

  • gtsummary (GitHub package) - presentation-ready tables summarizing data sets, regression models, and more.

  • clean (CRAN package) - fast data cleaning.

  • finalfit (CRAN package) - tables and plots to quickly visualize regression results.

  • modelsummary (GitHub package) - summary tables for regression models.

Python libraries

Complete Packages

  • Dora (pip library) - data cleaning, feature engineering and simple modeling tools.

  • statsmodels (pip library) - collection of statistical tools, including EDA.

  • TPOT (pip library) - autoML tool with feature engineering module.

  • HoloViews (pip library) - automated visualization based on short data annotations.

  • lens (pip library) - fast calculation of summary statistics and correlations. Presentation about the library.

  • pandas-profiling - popular library for quick data summaries and correlation analysis.

  • speedML (pip library) - large library for ML with module dedicated to fast EDA.

  • edaviz - Python library for fast data exploration that provides functions for dataset overviews, bivariate plots and finding good predictors (the free version only works for small datasets).
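
The "quick data summaries" that several of the libraries above automate can be illustrated with a minimal, standard-library-only sketch. It does not use the API of any listed library; the function name `summarize` and the toy records are hypothetical, chosen only to show the kind of per-column overview (missing counts, basic statistics, most frequent value) such tools typically produce.

```python
from collections import Counter
from statistics import mean, median


def is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def summarize(rows):
    """Return a per-column summary for a list of dict records (hypothetical helper)."""
    # Collect column names in first-seen order.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        info = {"n": len(values), "missing": len(values) - len(present)}
        if present and all(is_number(v) for v in present):
            # Numeric column: basic location statistics.
            info.update(kind="numeric", min=min(present), max=max(present),
                        mean=mean(present), median=median(present))
        else:
            # Anything else is treated as categorical.
            top, count = Counter(present).most_common(1)[0] if present else (None, 0)
            info.update(kind="categorical", unique=len(set(present)),
                        top=top, top_count=count)
        summary[col] = info
    return summary


# Toy records made up for illustration; None marks a missing value.
rows = [
    {"species": "setosa", "sepal_length": 5.1},
    {"species": "setosa", "sepal_length": 4.9},
    {"species": "virginica", "sepal_length": None},
    {"species": None, "sepal_length": 6.3},
]

report = summarize(rows)
for name, info in report.items():
    print(name, info)
```

The libraries listed here go well beyond this sketch (plots, correlation analysis, HTML reports), but the underlying idea is the same: one call that inspects every column and reports its type, completeness and distribution.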

Packages in Development

Related packages

  • featuretools - library for automated feature engineering.

  • pyvtreat - Python version of R's vtreat package.

  • autoimpute - easier handling of missing values.

Stata packages

  • eda - a package that produces a PDF report with all permutations of univariate and bivariate visualizations and tables. Notably, three-dimensional displays are also possible.

Web services

  • DIVE - MIT's tool for data exploration that tries to choose the best (most informative) visualizations.

  • Automatic Statistician - tool for automated EDA and modeling.

  • Several Shiny apps by R Squared Computing, including visulizer and descriptr.

Standalone software

  • auto-eda - automatic EDA with SQL.

  • elycite - tools for exploration and modelling available (locally) as a web application. Designed for NLP problems.

Papers

Methods and tools for autoEDA

Visualization recommendation frameworks

Augmented analytics

Conference presentations

License: Creative Commons Attribution 4.0 International