DataStrategist / How-to-do-things-with-R

A braindump of things I find cool, and/or things I want to experiment with and/or my canonical way to tackle DS problems. In theory, kept up-to-date, buuuuuuut

R things that are hot right now

to do r things to keep in mind

To sort:

Cool r commands :

Use count() with its sort and wt arguments.

add_count() instead of group_by/mutate/ungroup
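A minimal sketch of both commands, assuming dplyr is installed. The `orders` table and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one row per order line
orders <- tibble(
  customer = c("a", "a", "b", "c", "c", "c"),
  amount   = c(10, 5, 20, 1, 2, 3)
)

# count(): rows per customer, biggest group first
n_orders <- orders %>% count(customer, sort = TRUE)

# count() with wt: sum `amount` per customer instead of counting rows
spend <- orders %>% count(customer, wt = amount, sort = TRUE, name = "total")

# add_count(): keeps every row and adds the group size as column `n`,
# replacing the group_by() / mutate(n = n()) / ungroup() dance
with_n <- orders %>% add_count(customer)
```

The `name` argument only renames the output column; leave it out to get the default `n`.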

summarize(x = list())
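One way to read this tip: wrapping a column in list() inside summarize() gives a list-column with one element per group. A small sketch on mtcars, assuming dplyr 1.0 or later (for the `.groups` argument):

```r
library(dplyr)

# One row per cylinder count; `mpgs` holds every mpg value of that group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mpgs = list(mpg), .groups = "drop")

# Each element is an ordinary numeric vector
lengths(by_cyl$mpgs)
```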

mutate(thingie = fct_reorder(column, other_column, .fun = some_function))

geom_col + coord_flip (to deal w/ pesky labels)
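The two tips above combine nicely. A sketch on mtcars, assuming dplyr, forcats and ggplot2 are installed; the choice of columns is only for illustration.

```r
library(dplyr)
library(forcats)
library(ggplot2)

avg_mpg <- mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg), .groups = "drop") %>%
  # Reorder the factor levels by mean mpg so the bars come out sorted
  mutate(cyl = fct_reorder(factor(cyl), mpg))

# Flipping the axes puts the category labels on the y axis,
# where long labels have room to breathe
p <- ggplot(avg_mpg, aes(x = cyl, y = mpg)) +
  geom_col() +
  coord_flip()
```

Print `p` to draw the chart.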

flow::flow_view() on a function, a quoted expression, or the path of an R script to visualize it.

flow::flow_run()  on a call to a function to visualize which logical path in the code was taken. Set browse = TRUE to debug your function block by block (similar to base::browser()) as the diagram updates.

janitor::clean_names to clean df column names when imported by a silly method.
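A quick sketch, assuming janitor is installed. The messy column names are invented to mimic a bad import.

```r
library(janitor)

# Hypothetical data frame with inconsistent column names
raw <- data.frame(a = 1, b = 2, c = 3)
names(raw) <- c("First Name", "lastName", "Zip.Code")

# clean_names() converts everything to snake_case by default
clean <- clean_names(raw)
names(clean)
```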

dplyr::slice_max to get the top n entries of a df (according to a certain field)
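A sketch on mtcars, assuming dplyr 1.0 or later. One thing worth remembering: ties are kept by default, so asking for n rows can return more than n.

```r
library(dplyr)

# Two cars tie for third place on mpg, so this returns 4 rows
top_with_ties <- mtcars %>% slice_max(mpg, n = 3)

# with_ties = FALSE returns exactly 3 rows
top_exact <- mtcars %>% slice_max(mpg, n = 3, with_ties = FALSE)
```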

combine `crossing` with `augment` especially augment(data_that_has_been_crossed, type.predict = "response")
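A sketch of the pattern, assuming tidyr and broom are installed. The model and the grid values are arbitrary choices for illustration.

```r
library(tidyr)
library(broom)

# Logistic regression: is the car a manual, given weight and horsepower?
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Every combination of the predictor values we want to look at
grid <- crossing(wt = seq(2, 4, by = 0.5), hp = c(100, 200))

# type.predict = "response" gives probabilities instead of log-odds
preds <- augment(fit, newdata = grid, type.predict = "response")
```

The predictions land in the `.fitted` column, one row per grid combination, ready to plot.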

To do Net Promoter Score or other marketing stuff: https://cran.r-project.org/web/packages/marketr/vignettes/introduction_to_marketer.html

To send better bash scripts (to talk to the console):  https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/shQuote
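A small base R sketch of why shQuote() helps. The file name and the `wc` command are made up; `type = "sh"` is set explicitly because the default differs on Windows.

```r
# Hypothetical file name containing a space, which would break a naive paste()
path <- "my report.txt"

# shQuote() wraps the argument so the shell sees it as a single token
cmd <- paste("wc -l", shQuote(path, type = "sh"))
cmd

# The command could then be passed along, e.g. system(cmd)
```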

Row-wise operations (by_row() moved from purrr to the purrrlyr package):

mtcars %>% select(1,2,3) %>%

    purrrlyr::by_row(sum, .collate = "cols", .to = "BOOM")

Cool correlation Chart: PerformanceAnalytics::chart.Correlation(iris[-5],pch=21)

https://prodi.gy/ - a tool that helps create labels for data, and learns while it's going.

How it all started:

How I used R to create a word cloud, step by step | Georeferenced

Concept / meta

Startup

Complicated startup sequence: https://rstats.wtf/images/R-startup.svg

Efficiency

https://www.fstpackage.org/

Tips

4 ways to be more productive, using RStudio's terminal - Jozef's Rblog

https://speakerdeck.com/jennybc/how-to-name-files - How to name fil.....

Teaching

swcarpentry/r-novice-gapminder: Introduction to R for non-programmers using gapminder data.

RStudio Cloud

https://education.rstudio.com/learn/

master-the-tidyverse/01-Visualize-Data.Rmd at master · rstudio/master-the-tidyverse

Teaching Tech Together

Learning to Teach Machines to Learn | Alison Hill

To create exams: http://www.r-exams.org/tutorials/

To create dummy data/fake PII data for examples: https://github.com/paulhendricks/generator or https://github.com/trinker/wakefield

Blogs

R in Business Intelligence – Jan Gorecki – blog

## EDA ## Explore Your Dataset in R — Little Miss Data

Principles

t-test and how big sample group - Alexa and Accented English

omg, binder! - the stupidest thing...

Debugging

Debugging in R: How to Easily and Efficiently Conquer Errors in Your Code

To view errors smarter: recover(), e.g. via options(error = recover) (https://www.inwt-statistics.com/read-blog/debugging-in-r.html)

proffer v0.0.2: Builds on pprof to provide profiling tools capable of detecting sources of slowness in R code. Look here for more information.

Convenience:

http://dirk.eddelbuettel.com/code/anytime.html automatically detect date format from ANY string
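A short sketch, assuming the anytime package is installed. The three input formats below are ones I believe it recognises out of the box; other formats may need checking against the package docs.

```r
library(anytime)

# The same day written three different ways
inputs <- c("2021-01-26", "20210126", "2021/01/26")

# anydate() guesses the format and returns a Date vector
dates <- anydate(inputs)
dates
```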

funneljoin v0.1.0: Implements time-based joins to analyze sequences of events, both in memory and out of memory. See the vignette for details.

biglmm v0.9-1: Provides regression for data too large to fit in memory. This package functions exactly like the biglm package, but works with later versions of R.

dbx v0.2.1: Provides select, insert, update, upsert, and delete database operations for PostgreSQL, MySQL, SQLite, and other databases. See the README for usage

metaDigitise v1.0.0: Provides functions to extract, summarize and digitize data from published figures in research papers. The vignette shows how to use the package.

visdat - vis_guess() guesses the type of each field

naniar::vis_miss - to visualize missing fields

Automation/pipeline

targets (formerly drake) - Lets you set up a pipeline of steps, including a .sh file and network analysis!

callr - for controller scripts that source in many things, this keeps each call in its own environment
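A sketch of the isolation callr gives you, assuming the package is installed. Strictly speaking each call runs in its own fresh R process, which is what keeps the controller script's state untouched.

```r
library(callr)

x <- "parent value"

# The function body runs in a separate background R session
child <- r(function() {
  x <- "child value"  # never reaches the parent session
  list(pid = Sys.getpid(), x = x)
})

# Different process, and the parent's `x` is unchanged
child$pid != Sys.getpid()
x
```

Since the child session starts empty, anything it needs has to be passed in through the `args` argument of r().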

docker - Talk about deploying Docker & Kubernetes

Obtaining, Cleaning & Processing

DataOps

Scheduling R Tasks via Windows Task Scheduler | TRinker's R Blog

CRAN - Package genderizeR

Twitter analysis using R (Semantic analysis of French elections)

Google Vision API in R with RoogleVision | Stoltzmaniac

How to make your machine learning model available as an API with the plumber package

NCmisc-package: Miscellaneous Functions for Creating Adaptive Functions and... in NCmisc: Miscellaneous Functions for Creating Adaptive Functions and Scripts

Securing a dockerized plumber API with SSL and Basic Authentication | QUNIS

About — Deon

Scraping

CRAN - Package robotstxt

Pirating Web Content Responsibly With R | rud.is

O'Reilly book on Mining social networks TOC - github

Analysis

General

MultiFit v0.1.2: Provides functions to test for independence of two random vectors and learn and report the dependency structure. For more information, see Gorsky and Ma (2018) and the vignette. Like correlation?

Compare data.frames: compareDF::compare_df() and then to visualize, compareDF::create_output_table

To categorize numeric variable:

In ggplot2:

  • cut_number(): Makes n groups with (approximately) equal numbers of observations
  • cut_interval(): Makes n groups with equal range
  • cut_width(): Makes groups of a given width
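The difference between the three helpers shows up most clearly on skewed data. A sketch, assuming ggplot2 is installed; the vector `x` is made up, with one deliberate outlier.

```r
library(ggplot2)

# Nine small values and one outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Equal counts: the split lands at the median, 5 values on each side
by_count <- cut_number(x, n = 2)

# Equal range: the split lands at the midpoint of [1, 100],
# so the outlier ends up alone in its bin
by_range <- cut_interval(x, n = 2)

# Fixed width: as many bins of width 50 as needed to cover the data
by_width <- cut_width(x, width = 50)

table(by_count)
table(by_range)
table(by_width)
```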

Recommendation systems: 

Analyze satellite imagery:

https://www.youtube.com/watch?v=k1K6nqgtRL8

Causal inference: 

https://deepmind.com/blog/article/Causal_Bayesian_Networks

SNA/network

Good tutorial: https://www.mr.schochastics.net/material/netVizR

Drag and drop, collapsible d3.js Tree with 50,000 nodes - bl.ocks.org

Collapsible Force Layout - bl.ocks.org

Summary of community detection algorithms in igraph 0.6 | R-bloggers

RPubs - Network Visualization Tutorial 2015

Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R | OUseful.Info, the blog

Good book: https://www.cs.cornell.edu/home/kleinber/networks-book/networks-book-ch03.pdf

The mother of all packages:  https://igraph.org/

Best way to visualize:  https://datastorm-open.github.io/visNetwork/

    (for big networks, use: visNetwork::visPhysics(stabilization = FALSE) %>% visNetwork::visIgraphLayout()  )
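A fuller sketch of the big-network recipe, assuming visNetwork and igraph are both installed (visIgraphLayout() needs igraph). The random graph is only there to have something to draw.

```r
library(visNetwork)

set.seed(42)

# Hypothetical random network: 200 nodes, 400 edges
nodes <- data.frame(id = 1:200)
edges <- data.frame(
  from = sample(1:200, 400, replace = TRUE),
  to   = sample(1:200, 400, replace = TRUE)
)

# Skip the stabilization phase in the browser and let igraph
# compute the node coordinates up front instead
net <- visNetwork(nodes, edges) %>%
  visPhysics(stabilization = FALSE) %>%
  visIgraphLayout()
```

Print `net` to open the widget in the viewer.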

to calculate bridges: https://cran.r-project.org/web/packages/networktools/networktools.pdf

tidy manipulation of SNA data:  https://www.data-imaginist.com/2017/introducing-tidygraph/

to plot geo networks: https://ggobi.github.io/ggally/#ggallyggnetworkmap

To draw cool networks: http://blog.schochastics.net/post/sketchy-hand-drawn-like-networks-in-r/

O'Reilly list of Graph theory resources

To filter SNAs: MultiScale Algorithm

SNA examples:

Recipe recommendation using ingredient networks

Mapping Reddit using backbone and cluster

More datasets and some convenience functions - http://blog.schochastics.net/post/extending-network-analysis-in-r-with-netutils/

Visualise network in a more simplified way: https://blog.revolutionanalytics.com/2015/08/contracting-and-simplifying-a-network-graph.html

NLP / Sentiment Analysis

Extracting basic Plots from Novels: Dracula is a Man in a Hole – Learning Machines

NLPclient v1.0: Implements an interface to the Stanford CoreNLP annotation client which includes a part-of-speech (POS) tagger, a named entity recognizer (NER), a parser, and a co-reference resolution system.

sentimentr - Sentiment analysis including negation

udpipe - break down text analysis into 4 parts:  'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing'

quanteda - for viz

textclean - for cleaning text (including  replace_emoticon(), check_text() )

Emojis Analysis in R | R-bloggers

Emoji Sentiment Ranking 1.0

400+ Sarcastic Quotes, Sarcasm Sayings - CoolNSmart

bfelbo/DeepMoji: State-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc.

1606.07772.pdf

MonkeyLearn - Natural Language Processing

Emoji data science in R: A tutorial – PRISMOJI

Automated Text Feature Engineering using textfeatures in R | DataScience+

NLP's ImageNet moment has arrived

1801.06146.pdf

bnosac :: open analytical helpers - You did a sentiment analysis with tidytext but you forgot to do dependency parsing to answer WHY is something positive/negative

Textrank for summarizing text

https://bookdown.org/max/FES/text-data.html#text-data - How to check the keywords relevant to one class in a multi-class problem.

Topic modelling

Automated Topic Discovery: An Approachable Explanation

Topic modeling made just simple enough. | The Stone and the Shell

Julia Silge - Training, evaluating, and interpreting topic models

Semi-supervised topic modelling - CorEx

[textrank](https://cran.r-project.org/web/packages/textrank/vignettes/textrank.html) - To find the most relevant sentences in a topic

Tidy Topic Modeling

For fuzzy matching names, think about using Initials in order to avoid some problems. Tom/Thomas/Tommy --> T
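A base R sketch of the initials idea. The helper `name_key()` is hypothetical, and the key format (first initial plus lower-cased surname) is just one possible choice.

```r
# Reduce a full name to "<first initial> <surname>" so that
# nickname variants collapse onto the same matching key
name_key <- function(full_name) {
  parts <- strsplit(trimws(full_name), "\\s+")
  vapply(parts, function(p) {
    paste(toupper(substr(p[1], 1, 1)), tolower(p[length(p)]))
  }, character(1))
}

keys <- name_key(c("Tom Smith", "Thomas Smith", "Tommy  Smith"))
keys
```

The obvious trade-off: "Tina Smith" gets the same key too, so this works best as a first blocking step before a stricter comparison.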

To visualize topics: https://github.com/cpsievert/LDAvis

another option: https://www.rtextminer.com/articles/a_start_here.html#why-textminer

Qual

Discourse Network Analysis: Undertaking Literature Reviews in R

A very brief introduction to species distribution models in R

CRAN - Package anomalize

rOpenSci | Working with audio in R using av

Exploring correlations in R with corrr

AutoEDA stuff:

BESTTT: DataExplorer  DataExplorer::create_report(df)

Good blog article:  https://www.groundai.com/project/the-landscape-of-r-packages-for-automated-exploratory-data-analysis/1

ggpairs - https://ggobi.github.io/ggally/#columns_and_mapping

to view summary data:  skimr::skim()

SmartEDA - several things, but especially the Parallel Coordinate Plots

modelling & ML

General

Structural Equation Modeling with lavaan in R (article) - DataCamp

https://pbiecek.github.io/ceterisParibus/ - presents model responses around a single point in the feature space, for example around a single prediction for an interesting observation. The plots are model-agnostic: they work for any machine learning model and allow model comparisons. Can do what-if analysis, single and multiple classification, regression, a bunch of stuff.

Best Subsets Regression - to figure out the best model varying the components: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/155-best-subsets-regression-essentials-in-r/#:~:text=The%20R%20function%20regsubsets(),to%20incorporate%20in%20the%20model.&text=The%20function%20summary()%20reports,variables%20for%20each%20model%20size.
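A minimal sketch of best subsets with regsubsets(), assuming the leaps package is installed. Predicting mpg on mtcars and capping the model size at 4 are arbitrary choices for illustration.

```r
library(leaps)

# Exhaustive search: the best model of each size, up to 4 predictors
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 4)
res <- summary(fit)

# Pick the model size with the highest adjusted R-squared
best_size <- which.max(res$adjr2)

# Predictors included in that model (dropping the intercept)
best_vars <- names(which(res$which[best_size, ]))[-1]
best_vars
```

summary() also reports `cp` and `bic`, which can be swapped in as the selection criterion.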

NeuralNetwork

How to create a sequential model in Keras for R

Modelling

Missing Value Treatment | DataScience+

Utah Water Time-series and anomaly detection

Handling missing data with MICE package; a simple approach | DataScience+

Graphically analyzing variable interactions in R | R-bloggers

Feature Selection using Genetic Algorithms in R

Machine Learning

Unsupervised

Quick and easy t-SNE analysis in R – intobioinformatics

Machine Learning vs. Statistics | Open Data Science

Practical Machine Learning Problems - Machine Learning Mastery

Random forest in parallel example

Imputing Missing Data with R; MICE package | DataScience+

Slides from my talk on the broom package – Variance Explained

Partitioning cluster analysis: Quick start guide - Unsupervised Machine Learning - Documentation - STHDA

What are the Best Machine Learning Packages in R? | R-bloggers

Does money buy happiness after all? Machine Learning with One Rule

How to Identify the Most Important Predictor Variables in Regression Models | Minitab

Web Scraping and Applied Clustering Global Happiness and Social Progress Index | DataScience+

Explaining complex machine learning models with LIME

IMDB Genre Classification using Deep Learning – Florian Teschner – YaDS (Yet another Data Scientist)

A guide to GPU-accelerated ship recognition in satellite imagery using Keras and R (part I)

GANs explained. Generative Adversarial Networks applied to Generating Images | Open Data Science

Dealing with unbalanced data in machine learning

Explain.png (3350×2058)

MI2DataLab/modelDown: modelDown generates a website with HTML summaries for predictive models

Tuning xgboost in R: Part I | insightR

Hidden Technical Debt in Machine Learning Systems

stanford-cs-230-deep-learning/super-cheatsheet-deep-learning.pdf at master · afshinea/stanford-cs-230-deep-learning

Tell Me a Story: How to Generate Textual Explanations for Predictive Models – SmarterPoland.pl

When Cross-Validation is More Powerful than Regularization – Win-Vector Blog

Visualizing

General

https://www.data-to-viz.com/ - What viz should I use?

ggeasy - To make eeeeverything easy

To combine plots in one:

ggmatrix 

https://ggobi.github.io/ggally/#ggallyggmatrix

To plot 2 different facet levels: 

gggibbous v0.1.0: Extends ggplot2 to offer moon charts, pie charts where the proportions are shown as crescent or gibbous portions of a circle, like the lit and unlit portions of the moon. It is all illuminated in the vignette.

ggvoronoi v0.8.0: Provides functions to create, manipulate and visualize Voronoi diagrams using the deldir and ggplot2 packages. The vignette shows how.

To highlight areas of the plot:

ggalt (also to do additional shapes and functionalities)

or 

gghighlight: highlight certain series

3d Plots

https://github.com/bwlewis/rthreejs

https://www.rayshader.com/

https://symbolixau.github.io/mapdeck/articles/layers.html

Markdown

Options - Chunk options and package options - Yihui Xie | 谢益辉

https://github.com/trinker/numform - presenting numbers better (like percents, rounding etc... suitable for inclusion in report tables).

Viz

animint/references.org at master · tdhock/animint

rCharts

hadley/gg2v

ggplot2 - Easy way to mix multiple graphs on the same page - R software and data visualization - Documentation - STHDA

candlestick chart - Animating googleVis plots in R - Stack Overflow

Better animation (interpolation for points, to be used w/ gganimate) - https://github.com/thomasp85/tweenr

Radar Charts

trelliscopejs

R to D3 rendering tools • r2d3

nachocab/clickme interactive plots

Cool examples of tables

Shiny

The R Shiny packages you need for your web apps! - Enhance Data Science

Debugging with Shiny

Discovery Dashboards | Engineering | Wikimedia Foundation

trestletech/shinyTable

seascapemodels

Introduction to DataExplorer

sortable v0.4.2: Provides functions to enable drag-and-drop behavior in Shiny apps, by exposing the functionality of the SortableJS JavaScript library as an htmlwidget. There is a live demo on Using Sortable and another on Using Sortable widgets, and a vignette on the Interface to SortableJS.

Package building

Deal with dependencies in package generation:

Unit/Integration Testing

Testing, testing, testing! | R-bloggers

The Travis CI Blog: What is CI - Testing and Deploying (Part 2)

Travis CI for R — Advanced guide – Towards Data Science

mocking using mockr and mockery: https://www.youtube.com/watch?v=iRFJ6f7ZhsQ

Topics

HR - Human Resources

Chapter 13 Gender Pay Gap | HR Analytics in R

Economics and R

Music

R-Music: Introduction to the chorrrds package

The Minor fall, the Major lift: inferring emotional valence of musical chords through lyrics

TileMaker/tile_maker.R at master · DataStrategist/TileMaker

Finance

Algorithmic Trading: Using Quantopian's Zipline Python Library In R And Backtest Optimizations By Grid Search And Parallel Processing

Maps

How to highlight countries on a map - SHARP SIGHT LABS

Reverse Geocoding

tmap in a nutshell

CRAN - Package mapview

Merging spatial buffers in R | Insights of a PhD

Is London a Forest? How to Use GIS and Open Data to Find Out

Unique IDs - PlayerIds · Robert Nguyen

Many examples: https://gitlab.com/dickoa/30daymapchallenge

Free geocoding! https://photon.komoot.io/

Instruction

Communicating with R Markdown Workshop | Alison Hill

Principles & Practice of Data Visualization

CONJ620: CM 1.4

Getting LearnR tutorials to run on mybinder.org | Ted Laderas, PhD

(PDF) Influencer Fraud on Instagram - A Descriptive Analysis of the World's Largest Engagement Community (Master Thesis by Jonas Schröder)

Data

General

climate v0.3.0: Provides access to meteorological and hydrological data from OGIMET, University of Wyoming - atmospheric vertical profiling data, and Polish Institute of Meteorology and Water Management - National Research Institute. There is a vignette.

CCAMLRGIS v3.0.1: Loads and creates spatial data, including layers and tools that are relevant to the activities of the Commission for the Conservation of Antarctic Marine Living Resources (CCAMLR). Have a look at the vignette.

schrute v0.1.1: Contains the complete scripts from the American version of the Office television show in tibble format. Have a look at the vignette and practice NLP.

fredr v1.0.0: Provides an R client for the Federal Reserve Economic Data (FRED). There are vignettes on FRED Categories, Releases, Series, Sources, and Tags, as well as a Getting Started Guide.

jstor v0.3.2: Provides functions to import metadata, ngrams, and full-texts delivered by Data for Research by JSTOR. There is an Introduction, and vignettes on Automating File Import and Known Quirks. Useful for analyzing publications/papers.

rLandsat v0.1.0: Provides functions to search and acquire Landsat data using an API built by Development Seed and the U.S. Geological Survey. See README for how to use the package.

weathercan v0.2.7: Provides tools for downloading historical weather data from the Environment and Climate Change Canada website. Data can be downloaded from multiple stations over large date ranges, and automatically processed into a single dataset. There is an Introduction, a Glossary, and vignettes on Flags and Interpolation.

Music lyrics: https://statnamara.wordpress.com/2021/01/26/scraping-analysing-and-visualising-lyrics-in-r/

NLP

Introducing the schrute Package: the Entire Transcripts From The Office · technistema

Film Corpus 2.0 | Natural Language and Dialogue Systems

Data strategy

Summary-Designed-Data-Maturity-Framework-Social-Sector-FINAL-v1.pdf

Crime

Accessing the Justice Data Lab service - GOV.UK

A large repository of networkdata · David Schoch

HDX Universe: The shape of the Humanitarian Data Exchange

From data to Viz | Find the graphic you need

Twitter Trending Hashtags and Topics - Trendsmap

Omdena | Building AI for Good Through Community Collaboration

Sovereign Environmental, Social, and Governance Data | World Bank

Global Marine Environment Datasets

 
