summarytools: An R Package For Descriptive Statistics

Version 0.8.2 (available on the official R repository) fixes the issues reported with 0.8.0 and 0.8.1. Please make sure you are using the latest version!

The two following vignettes complement information found on this page:

What is summarytools?

summarytools is an R package providing tools to neatly and quickly summarize data. Its main purpose is to provide hassle-free functions that every R programmer once wished were included in base R:

descriptive statistics with all common univariate statistics for numerical vectors.
frequency tables with proportions, cumulative proportions and missing data information.
cross-tabulations between two factors or any discrete data, with total, rows or columns proportions.
Extensive data frame summaries that facilitate data cleaning and firsthand evaluation

One of its goals is to make it easier for newcomeRs who might be used to other statistical software which usually provide a wide range of functions and procedures out-of-the-box, making it very simple to create, with code or with a few point-and-click actions, thorough and well-formatted reports. For most tasks not relying on advanced statistical methods, summarytools allows you to do just that.

Here are some of the package's features:

Variable and data frame labels are supported
Sampling weights can be used in frequency tables and descriptive statistics
Output files of various formats can be generated (plaintext, markdown and html)
The html outputs are well integrated in RStudio (if an external browser is not your preferred method)

Why use summarytools?

You're looking for straightforward descriptive functions to get up and running in no time
You're looking for flexibility in terms of outputs

How to install

To benefit from all the latests fixes, install it from GitHub:

install.packages("devtools")
library(devtools)
install_github('dcomtois/summarytools')

To install the most recent version on the R-CRAN repository, just type into your R console:

install.packages("summarytools")

For the most up-to-date version that has all the latest features but might also contain bugs (which can be fixed rapidly in most cases):

install.packages("devtools")
library(devtools)
install_github('dcomtois/summarytools', ref='dev-current')

You can also see the source code and documentation on the official R site here.

A First Example

Using the iris sample data frame, we'll jump right to the most popular function in the package, dfSummary (Data Frame Summary).

With the following 2 lines of code, we'll generate a summary report for ''iris`` and have it displayed in RStudio's Viewer pane:

library(summarytools)
view(dfSummary(iris))

Building on the strengths of pander and htmltools, the summary reports produced by summarytools can be:

Displayed in plain text in the R console (default behaviour)
Written to plain text files / markdown text files
Written to html files that fire up in RStudio's Viewer pane or in your system's default browser

Four Core Functions

1 - Frequency tables with `freq()`

The freq() function generates a table of frequencies with counts and proportions.

library(summarytools)
freq(iris$Species)

Frequencies 
iris$Species 
Type: Factor (unordered) 

                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
---------------- ------ --------- -------------- --------- --------------
          setosa     50     33.33          33.33     33.33          33.33
      versicolor     50     33.33          66.67     33.33          66.67
       virginica     50     33.33         100.00     33.33         100.00
            <NA>      0                               0.00         100.00
           Total    150    100.00         100.00    100.00         100.00

2 - Descriptive (univariate) statistics with `descr()`

The descr() function generates common central tendency statistics and measures of dispersion for numerical data. It can handle single vectors as well as dataframes, in which case it just ignores non-numerical columns.

We'll use the rmarkdown style for the next example:

descr(iris, style = "rmarkdown")

Descriptive Statistics: iris

N: 150

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
Mean	5.84	3.06	3.76	1.20
Std.Dev	0.83	0.44	1.77	0.76
Min	4.30	2.00	1.00	0.10
Median	5.80	3.00	4.35	1.30
Max	7.90	4.40	6.90	2.50
MAD	1.04	0.44	1.85	1.04
IQR	1.30	0.50	3.50	1.50
CV	7.06	7.01	2.13	1.57
Skewness	0.31	0.31	-0.27	-0.10
SE.Skewness	0.20	0.20	0.20	0.20
Kurtosis	-0.61	0.14	-1.42	-1.36
N.Valid	150.00	150.00	150.00	150.00
Pct.Valid	100.00	100.00	100.00	100.00

Transposing and selecting the stats you need

If your eyes/brain prefer seeing things the other way around, just use "transpose=TRUE". Here, we also select only the statistics we wish to see:

descr(iris, stats = c("mean", "sd", "min", "med", "max"), transpose = TRUE)

Descriptive Statistics: iris

N: 150

	Mean	Std.Dev	Min	Median	Max
Sepal.Length	5.84	0.83	4.30	5.80	7.90
Sepal.Width	3.06	0.44	2.00	3.00	4.40
Petal.Length	3.76	1.77	1.00	4.35	6.90
Petal.Width	1.20	0.76	0.10	1.30	2.50

3 - Cross-tabulations with `ctable()`

Here we'll use a sample data frame included in the package (tobacco), which contains simulated data. Say we want to cross-tabulate variables smoker and diseased. By default, ctable() gives row proportions, so we don't need to specify any additionnal parameter.

Here, instead of ctable(tobacco$smoker, tobacco$diseased), we'll make use of the with() (R-base) function:

data("tobacco")
with(tobacco, ctable(smoker, diseased))

Cross-Tabulation / Row proportions 
smoker * diseased 
Data Frame: tobacco 
-------- ---------- ------------- ------------- ---------------
           diseased           Yes            No           Total
  smoker                                                       
     Yes              125 (41.9%)   173 (58.1%)    298 (100.0%)
      No               99 (14.1%)   603 (85.9%)    702 (100.0%)
   Total              224 (22.4%)   776 (77.6%)   1000 (100.0%)
-------- ---------- ------------- ------------- ---------------

Note that markdown has no support (yet) for multi-line headers. Here is what the html version looks like:

view(with(tobacco, ctable(smoker, diseased)))

It is possible to display column, total, or no proportions at all. We can also omit the marginal totals and proportions altogether, so at its simplest form, we have:

with(tobacco, ctable(smoker, diseased, prop = "n", totals = FALSE))

Cross-Tabulation 
smoker * diseased 
Data Frame: tobacco 
-------- ---------- ----- -----
           diseased   Yes    No
  smoker                       
     Yes              125   173
      No               99   603
-------- ---------- ----- -----

4 - Data Frame Summaries

As seen earlier, the dfSummary() function gives information for all variables contained in a data frame. The objective in building this function was to fit as much relevant information as possible in a concise, legible table. Version 0.8.0 introduced graphs for both ascii and html outputs.

dfSummary(tobacco)

Data Frame Summary: tobacco                                                                         
N: 1000                                                                                             
-------------------------------------------------------------------------------------------------------------------------
No   Variable         Stats / Values               Freqs (% of Valid)     Text Graph                  Valid     Missing  
---- ---------------- ---------------------------- ---------------------- --------------------------- --------- ---------
1    gender           1. F                         489 (50.0%)            IIIIIIIIIIIIIIII            978       22       
     [factor]         2. M                         489 (50.0%)            IIIIIIIIIIIIIIII            (97.8%)   (2.2%)   
                                                                                                    
2    age              mean (sd) : 49.6 (18.29)     63 distinct val.       .     .         .   . :     975       25       
     [numeric]        min < med < max :                                   : . . : : :   : : . : :     (97.5%)   (2.5%)   
                      18 < 50 < 80                                        : : : : : : : : : : : :                        
                      IQR (CV) : 32 (0.37)                                : : : : : : : : : : : :                        
                                                                          : : : : : : : : : : : :                        
                                                                          : : : : : : : : : : : :                        
                                                                                                    
3    age.gr           1. 18-34                     258 (26.5%)            IIIIIIIIIIIII               975       25       
     [factor]         2. 35-50                     241 (24.7%)            IIIIIIIIIIII                (97.5%)   (2.5%)   
                      3. 51-70                     317 (32.5%)            IIIIIIIIIIIIIIII                               
                      4. 71 +                      159 (16.3%)            IIIIIIII                                       
                                                                                                    
4    BMI              mean (sd) : 25.73 (4.49)     974 distinct val.                  :               974       26       
     [numeric]        min < med < max :                                             : : .             (97.4%)   (2.6%)   
                      8.83 < 25.62 < 39.44                                          : : :                                
                      IQR (CV) : 5.72 (0.17)                                      : : : : :                              
                                                                                  : : : : : .                            
                                                                              . : : : : : : : .                          
                                                                                                    
5    smoker           1. Yes                       298 (29.8%)            IIIIII                      1000      0        
     [factor]         2. No                        702 (70.2%)            IIIIIIIIIIIIIIII            (100%)    (0%)     
                                                                                                    
6    cigs.per.day     mean (sd) : 6.78 (11.88)     37 distinct val.       :                           965       35       
     [numeric]        min < med < max :                                   :                           (96.5%)   (3.5%)   
                      0 < 0 < 40                                          :                                              
                      IQR (CV) : 11 (1.75)                                :                                              
                                                                          :                                              
                                                                          :     .         .     .                        
                                                                                                    
7    diseased         1. Yes                       224 (22.4%)            IIII                        1000      0        
     [factor]         2. No                        776 (77.6%)            IIIIIIIIIIIIIIII            (100%)    (0%)     
                                                                                                    
8    disease          1. Hypertension              36 (16.2%)             IIIIIIIIIIIIIIII            222       778      
     [character]      2. Cancer                    34 (15.3%)             IIIIIIIIIIIIIII             (22.2%)   (77.8%)  
                      3. Cholesterol               21 ( 9.5%)             IIIIIIIII                                      
                      4. Heart                     20 ( 9.0%)             IIIIIIII                                       
                      5. Pulmonary                 20 ( 9.0%)             IIIIIIII                                       
                      6. Musculoskeletal           19 ( 8.6%)             IIIIIIII                                       
                      7. Diabetes                  14 ( 6.3%)             IIIIII                                         
                      8. Hearing                   14 ( 6.3%)             IIIIII                                         
                      9. Digestive                 12 ( 5.4%)             IIIII                                          
                      10. Hypotension              11 ( 5.0%)             IIII                                           
                      [ 3 others ]                 21 ( 9.4%)             IIIIIIIII                                      
                                                                                                    
9    samp.wgts        mean (sd) : 1 (0.08)         0.86!: 267 (26.7%)     IIIIIIIIIIIII               1000      0        
     [numeric]        min < med < max :            1.04!: 249 (24.9%)     IIIIIIIIIIII                (100%)    (0%)     
                      0.86 < 1.04 < 1.06           1.05!: 324 (32.4%)     IIIIIIIIIIIIIIII                               
                      IQR (CV) : 0.19 (0.08)       1.06!: 160 (16.0%)     IIIIIII                                        
                                                   ! rounded                                                             
-------------------------------------------------------------------------------------------------------------------------

Markdown

summarytools uses the pander package to generate ascii content. As a consequence, we can easily generate markdown content; We do this simply by specifying style="rmarkdown".

In the console, the output of a function using style = 'rmarkdown' looks like this:

## Frequencies 
### iris$Species 
**Type:** Factor (unordered) 

|         &nbsp; | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---------------:|-----:|--------:|-------------:|--------:|-------------:|
|     **setosa** |   50 |   33.33 |        33.33 |   33.33 |        33.33 |
| **versicolor** |   50 |   33.33 |        66.67 |   33.33 |        66.67 |
|  **virginica** |   50 |   33.33 |       100.00 |   33.33 |       100.00 |
|     **\<NA\>** |    0 |         |              |    0.00 |       100.00 |
|      **Total** |  150 |  100.00 |       100.00 |  100.00 |       100.00 |

This ``ascii-plus-plus'' content needs an interpreted in order to be displayed as html. markdown documents can be converted to other formats as well, such as pdf or rtf.

To learn more about markdown and rmarkdown, see John MacFarlane's page and this RStudio's R Markdown Quicktour.

Under the Hood - Generating Html

Version 0.8.0 of summarytools uses RStudio's htmltools package and version 4 of Bootstrap's cascading stylesheets.

It is also possible to include your own css if you wish to customize the look of the output tables.

print() and view()

summarytools has a generic print() method, print.summarytools. By default, its method argument is set to 'pander'. To easily display html outputs in RStudio's Viewer, we use the view() function, which acts as a wrapper around the generic print function, this time using method = 'viewer'. When used outside of RStudio, the method falls back on 'browser' and the report is opened in your system's default browser.

Using by() and lapply()

When using by() (with freq() or descr()) or lapply() (with freq()), R returns a list containing summarytools objects. Using the view() function with those objects is necessary in order to have non-redundant and clean section headings.

Example: Using the iris data frame, we will display descriptive statistics broken down by Species.

# First save the results
iris_stats_by_species <- by(data = iris, 
                            INDICES = iris$Species, 
                            FUN = descr, stats = c("mean", "sd", "min", "med", "max"), 
                            transpose = TRUE)
# Then use view(), like so:
view(iris_stats_by_species, method = "pander")

Non-numerical variable(s) ignored: Species

Descriptive Statistics: iris 
Group: iris$Species = setosa 
N: 50 

                     Mean   Std.Dev    Min   Median    Max
------------------ ------ --------- ------ -------- ------
      Sepal.Length   5.01      0.35   4.30     5.00   5.80
       Sepal.Width   3.43      0.38   2.30     3.40   4.40
      Petal.Length   1.46      0.17   1.00     1.50   1.90
       Petal.Width   0.25      0.11   0.10     0.20   0.60


Group: iris$Species = versicolor 
N: 50 

                     Mean   Std.Dev    Min   Median    Max
------------------ ------ --------- ------ -------- ------
      Sepal.Length   5.94      0.52   4.90     5.90   7.00
       Sepal.Width   2.77      0.31   2.00     2.80   3.40
      Petal.Length   4.26      0.47   3.00     4.35   5.10
       Petal.Width   1.33      0.20   1.00     1.30   1.80


Group: iris$Species = virginica 
N: 50 

                     Mean   Std.Dev    Min   Median    Max
------------------ ------ --------- ------ -------- ------
      Sepal.Length   6.59      0.64   4.90     6.50   7.90
       Sepal.Width   2.97      0.32   2.20     3.00   3.80
      Petal.Length   5.55      0.55   4.50     5.55   6.90
       Petal.Width   2.03      0.27   1.40     2.00   2.50

To see an html version of these results, proceed like this:

view(iris_stats_by_species)

Writing to files

The console will always tell you the location of the temporary html file that is created in the process. However, you can specify the name and location of that file explicitly if you need to reuse it later on:

view(iris_stats_by_species, file = "~/iris_stats_by_species.html")

There is also an append = boolean parameter for adding content to existing reports.

Overriding formatting attributes

When a summarytools object is stored, its formatting attributes are stored with it. However, you can override most of them when using the print() function.

Example:

data(tobacco)
bmi_stats <- descr(tobacco$BMI)
print(bmi_stats, Variable.label = "Body Mass Index")

Descriptive Statistics: tobacco$BMI 
Variable Label: Body Mass Index 
N: 1000 

                       BMI
----------------- --------
             Mean    25.73
          Std.Dev     4.49
              Min     8.83
           Median    25.62
              Max    39.44
              MAD     4.18
              IQR     5.72
               CV     5.73
         Skewness     0.02
      SE.Skewness     0.08
         Kurtosis     0.26
          N.Valid   974.00
        Pct.Valid    97.40

Extra Features

Weighted statistics

Versions 0.5 and above support weights for freq() and descr().

Function what.is() helps you figure out quickly what an object is by...

Putting together the object's class(es), type (typeof), mode, storage mode, length, dim and object.size, all in a single table
Extending the is() function in a way that the object is tested against all functions starting with is. -- see this post on StackOverflow for details;
Giving a list of the object's attributes names and length (c.g. rownames, dimnames, labels, etc.)

Some examples

> what.is(c)
$properties
      property    value
1        class function
2       typeof  builtin
3         mode function
4 storage.mode function
5          dim         
6       length        1
7  object.size 56 Bytes

$extensive.is
[1] "is.function"  "is.primitive" "is.recursive"

$function.type
[1] "primitive" "generic"  


> what.is(NaN)
$properties
      property    value
1        class  numeric
2       typeof   double
3         mode  numeric
4 storage.mode   double
5          dim         
6       length        1
7  object.size 48 Bytes

$extensive.is
[1] "is.atomic"  "is.double"  "is.na"      "is.nan"     "is.numeric" "is.vector" 

$object.type
[1] "base"

News

I've been working on summarytools on and off over the last few months. Now I'm happy to introduce version 0.8. The most notable changes compared to 0.7 are:

A cross-tabulation function, ctable()
Improved support for by(), with() and lapply() functions
Optional barplots and histograms in dfSummary()
New layouts, most noticeable in html reports
Improved flexibility on several fronts
Support for Date / POSIX columns

Final notes

The package comes with no guarantees. It is a work in progress and feedback / feature requests are welcome. Just send me an email (dominic.comtois (at) gmail.com), or open an Issue if you find a bug.

Also, the package grew significantly larger, and maintaining it all by myself is time consuming. If you would like to contribute, please get in touch, I'd greatly appreciate the help.

anishsingh20 / summarytools

summarytools: An R Package For Descriptive Statistics

What is summarytools?

Why use summarytools?

How to install

A First Example

Building on the strengths of pander and htmltools, the summary reports produced by summarytools can be:

Four Core Functions

1 - Frequency tables with `freq()`

2 - Descriptive (univariate) statistics with `descr()`

Descriptive Statistics: iris

Transposing and selecting the stats you need

Descriptive Statistics: iris

3 - Cross-tabulations with `ctable()`

4 - Data Frame Summaries

Markdown

Under the Hood - Generating Html

print() and view()

Using by() and lapply()

Writing to files

Overriding formatting attributes

Extra Features

Weighted statistics

Function what.is() helps you figure out quickly what an object is by...

Some examples

News

Final notes

About

Languages

summarytools: An R Package For Descriptive Statistics

What is summarytools?

Why use summarytools?

How to install

A First Example

Building on the strengths of pander and htmltools, the summary reports produced by summarytools can be:

Four Core Functions

1 - Frequency tables with freq()

2 - Descriptive (univariate) statistics with descr()

Descriptive Statistics: iris

Transposing and selecting the stats you need

Descriptive Statistics: iris

3 - Cross-tabulations with ctable()

4 - Data Frame Summaries

Markdown

Under the Hood - Generating Html

print() and view()

Using by() and lapply()

Writing to files

Overriding formatting attributes

Extra Features

Weighted statistics

Function what.is() helps you figure out quickly what an object is by...

Some examples

News

Final notes

About

Languages

1 - Frequency tables with `freq()`

2 - Descriptive (univariate) statistics with `descr()`

3 - Cross-tabulations with `ctable()`