RedAmber

A simple dataframe library for Ruby.

Powered by Red Arrow
Inspired by the dataframe library Rover-df

Requirements

Supported Ruby version is >= 2.7.

Since v0.2.0, this library uses pattern matching, which is an experimental feature in Ruby 2.7. It works, but Ruby 2.7 prints a warning message. I recommend Ruby 3 for performance.

# Libraries required
gem 'red-arrow',   '>= 9.0.0'

gem 'red-parquet', '>= 9.0.0' # Optional, if you use IO from/to parquet
gem 'rover-df',    '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame

Installation

Install the requirements before you install RedAmber.

Apache Arrow GLib (>= 9.0.0)
Apache Parquet GLib (>= 9.0.0) # If you use IO from/to parquet

See the Apache Arrow install document.

A minimum installation example for the latest Ubuntu is in the 'Prepare the Apache Arrow' section of RedAmber's CI test.

Add this line to your Gemfile:

gem 'red_amber'

And then execute:

bundle install

Or install it yourself as:

gem install red_amber

Docker image and Jupyter Notebook

RubyData Docker Stacks is available as a ready-to-run Docker image containing Jupyter and useful data tools as well as RedAmber (Thanks to @mrkn).

You can also try the contents of this README interactively on Binder.

`RedAmber::DataFrame`

It represents a set of data in a 2D shape. The underlying entity is a Red Arrow Table object.

require 'red_amber' # require 'red-amber' is also OK.
require 'datasets-arrow'

arrow = Datasets::Penguins.new.to_arrow
penguins = RedAmber::DataFrame.new(arrow)

# =>
#<RedAmber::DataFrame : 344 x 8 Vectors, 0x0000000000013790>
    species  island    bill_length_mm bill_depth_mm flipper_length_mm ...     year
    <string> <string>        <double>      <double>           <uint8> ... <uint16>
  0 Adelie   Torgersen           39.1          18.7               181 ...     2007
  1 Adelie   Torgersen           39.5          17.4               186 ...     2007
  2 Adelie   Torgersen           40.3          18.0               195 ...     2007
  3 Adelie   Torgersen          (nil)         (nil)             (nil) ...     2007
  4 Adelie   Torgersen           36.7          19.3               193 ...     2007
  : :        :                      :             :                 : ...        :
341 Gentoo   Biscoe              50.4          15.7               222 ...     2009
342 Gentoo   Biscoe              45.2          14.8               212 ...     2009
343 Gentoo   Biscoe              49.9          16.1               213 ...     2009

For example, DataFrame#pick accepts keys as arguments and returns a sub DataFrame.

penguins.keys
# =>
[:species,
 :island,
 :bill_length_mm,
 :bill_depth_mm,
 :flipper_length_mm,
 :body_mass_g,
 :sex,
 :year]

df = penguins.pick(:species, :island, :body_mass_g)
df

# =>
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003cc1c>
    species  island    body_mass_g
    <string> <string>     <uint16>
  0 Adelie   Torgersen        3750
  1 Adelie   Torgersen        3800
  2 Adelie   Torgersen        3250
  3 Adelie   Torgersen       (nil)
  4 Adelie   Torgersen        3450
  : :        :                   :
341 Gentoo   Biscoe           5750
342 Gentoo   Biscoe           5200
343 Gentoo   Biscoe           5400

DataFrame#drop drops some columns and returns a DataFrame of the remaining columns.

You can specify the columns by keys or by a boolean array of the same size as n_keys.

# Same as df.drop(:species, :island)
df = df.drop(true, true, false)

# =>
#<RedAmber::DataFrame : 344 x 1 Vector, 0x0000000000048760>
    body_mass_g
       <uint16>
  0        3750
  1        3800
  2        3250
  3       (nil)
  4        3450
  :           :
341        5750
342        5200
343        5400

Arrow data is immutable, so these methods always return a new object.
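
To make the boolean-mask form of drop and the return-a-new-object behavior concrete, here is a minimal plain-Ruby sketch. It models a frame as a frozen Hash of columns; `drop_columns` is a hypothetical helper written for illustration and is not RedAmber's implementation.

```ruby
# Plain-Ruby sketch (no RedAmber needed): a "frame" is a frozen Hash of key => column.
# drop_columns is a hypothetical helper, not part of RedAmber.
def drop_columns(columns, *spec)
  booleans = spec.all? { |s| s == true || s == false }
  if booleans
    # A boolean mask must have one entry per key.
    raise ArgumentError, 'mask size must equal the number of keys' unless spec.size == columns.size

    dropped = columns.keys.zip(spec).select { |_key, flag| flag }.map(&:first)
  else
    dropped = spec
  end
  # Hash#reject builds a new Hash; the input is never modified.
  columns.reject { |key, _values| dropped.include?(key) }.freeze
end

frame = {
  species: %w[Adelie Adelie Gentoo],
  island: %w[Torgersen Torgersen Biscoe],
  body_mass_g: [3750, 3800, 5750]
}.freeze

by_mask = drop_columns(frame, true, true, false)
by_keys = drop_columns(frame, :species, :island)

p by_mask.keys        # the mask form and the key form keep the same column
p by_mask == by_keys
p frame.keys          # the original frame still has all three keys
```

Because the source Hash is frozen, any accidental in-place change would raise an error, which mirrors the idea that every operation hands back a fresh object.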

DataFrame#assign creates new columns or updates existing columns.

# New column is created because ':body_mass_kg' is a new key.
df.assign(:body_mass_kg => df[:body_mass_g] / 1000.0)

# =>
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x00000000000212f0>
    body_mass_g body_mass_kg
       <uint16>     <double>
  0        3750          3.8
  1        3800          3.8
  2        3250          3.3
  3       (nil)        (nil)
  4        3450          3.5
  :           :            :
341        5750          5.8
342        5200          5.2
343        5400          5.4
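
The create-or-update rule of assign can be sketched in plain Ruby with `Hash#merge`. This is only a model of the behavior described above, with columns held as Arrays in a Hash; `assign_columns` is a hypothetical helper, not RedAmber's implementation.

```ruby
# Plain-Ruby sketch: columns are a Hash of key => Array.
# assign_columns is a hypothetical helper, not RedAmber's implementation.
def assign_columns(columns, pairs)
  # Hash#merge returns a new Hash: existing keys keep their position and
  # take the new values, unknown keys are appended at the end.
  columns.merge(pairs.to_h)
end

frame = { body_mass_g: [3750, 3800, nil] }

# New key: a column is appended (nil stays nil).
kg = frame[:body_mass_g].map { |g| g && g / 1000.0 }
with_kg = assign_columns(frame, body_mass_kg: kg)
p with_kg.keys

# Existing key: the column is replaced and keeps its position.
p assign_columns(with_kg, body_mass_g: [1, 2, 3])
```

`pairs` may be a Hash or an Array of `[key, values]` pairs, since both respond to `to_h`.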

DataFrame#slice selects rows (observations) to create a sub DataFrame.

# returns 5 rows at the start and 5 rows from the end
penguins.slice(0...5, -5..-1)

# =>
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
  species  island    bill_length_mm bill_depth_mm flipper_length_mm ...     year
  <string> <string>        <double>      <double>           <uint8> ... <uint16>
0 Adelie   Torgersen           39.1          18.7               181 ...     2007
1 Adelie   Torgersen           39.5          17.4               186 ...     2007
2 Adelie   Torgersen           40.3          18.0               195 ...     2007
3 Adelie   Torgersen          (nil)         (nil)             (nil) ...     2007
4 Adelie   Torgersen           36.7          19.3               193 ...     2007
: :        :                      :             :                 : ...        :
7 Gentoo   Biscoe              50.4          15.7               222 ...     2009
8 Gentoo   Biscoe              45.2          14.8               212 ...     2009
9 Gentoo   Biscoe              49.9          16.1               213 ...     2009
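
Selecting rows by several ranges, including negative indices counted from the end, behaves much like `Array#values_at` in plain Ruby. The sketch below uses row numbers as stand-in data; `head_and_tail` is a hypothetical helper and not part of RedAmber.

```ruby
# Plain-Ruby sketch of selecting rows by ranges. Array#values_at accepts
# several ranges at once and understands negative (from-the-end) indices.
# head_and_tail is a hypothetical helper, not part of RedAmber.
def head_and_tail(rows, count)
  rows.values_at(0...count, -count..-1)
end

rows = (0...344).to_a   # stand-in for the 344 row positions of penguins
p head_and_tail(rows, 5)
```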

DataFrame#remove rejects rows (observations) and returns a DataFrame of the remaining rows.

# penguins[:bill_length_mm] < 40 returns a boolean Vector
penguins.remove(penguins[:bill_length_mm] < 40)

# =>
#<RedAmber::DataFrame : 244 x 8 Vectors, 0x000000000007d6f4>
    species  island    bill_length_mm bill_depth_mm flipper_length_mm ...     year
    <string> <string>        <double>      <double>           <uint8> ... <uint16>
  0 Adelie   Torgersen           40.3          18.0               195 ...     2007
  1 Adelie   Torgersen          (nil)         (nil)             (nil) ...     2007
  2 Adelie   Torgersen           42.0          20.2               190 ...     2007
  3 Adelie   Torgersen           41.1          17.6               182 ...     2007
  4 Adelie   Torgersen           42.5          20.7               197 ...     2007
  : :        :                      :             :                 : ...        :
241 Gentoo   Biscoe              50.4          15.7               222 ...     2009
242 Gentoo   Biscoe              45.2          14.8               212 ...     2009
243 Gentoo   Biscoe              49.9          16.1               213 ...     2009

DataFrame manipulation methods such as pick, drop, slice, remove, rename and assign accept a block.

The previous example can also be written with a block.

penguins.remove { bill_length_mm < 40 }
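
One way a block can refer to a column by its bare name is to evaluate the block with `instance_eval` on an object that exposes each key as a method. The sketch below illustrates that idea only; whether RedAmber works exactly this way is an assumption, and `eval_with_columns` is a hypothetical helper.

```ruby
# Sketch of how a block can see column names as bare identifiers.
# Assumption: this models the idea with instance_eval; it is not a copy of
# RedAmber's code, and eval_with_columns is a hypothetical helper.
def eval_with_columns(columns, &block)
  context = Object.new
  columns.each do |key, values|
    # Each key becomes a reader method on the evaluation context.
    context.define_singleton_method(key) { values }
  end
  # Inside the block, self is the context, so `bill_length_mm` is a method call.
  context.instance_eval(&block)
end

columns = { bill_length_mm: [39.1, 40.3, nil, 36.7] }

mask = eval_with_columns(columns) do
  # Plain Arrays have no element-wise <, so the comparison is mapped by hand.
  bill_length_mm.map { |x| x.nil? ? nil : x < 40 }
end
p mask
```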

The next example uses a block to update a column.

df = RedAmber::DataFrame.new(
  integer: [0, 1, 2, 3, nil],
  float:   [0.0, 1.1,  2.2, Float::NAN, nil],
  string:  ['A', 'B', 'C', 'D', nil],
  boolean: [true, false, true, false, nil])
df

# =>
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000003131c>
  integer    float string   boolean
  <uint8> <double> <string> <boolean>
0       0      0.0 A        true
1       1      1.1 B        false
2       2      2.2 C        true
3       3      NaN D        false
4   (nil)    (nil) (nil)    (nil)

df.assign do
  vectors.select(&:float?).map { |v| [v.key, -v] }
  # => returns [[:float, Vector of [-0.0, -1.1, -2.2, NaN, nil]]]
end

# =>
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x00000000000e270c>
  integer    float string   boolean
  <uint8> <double> <string> <boolean>
0       0     -0.0 A        true
1       1     -1.1 B        false
2       2     -2.2 C        true
3       3      NaN D        false
4   (nil)    (nil) (nil)    (nil)

The next example eliminates rows containing nil.

# remove all observations containing nil
nil_removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
nil_removed.tdr

# =>
RedAmber::DataFrame : 342 x 8 Vectors
Vectors : 5 numeric, 3 strings
# key                type   level data_preview
0 :species           string     3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
1 :island            string     3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
2 :bill_length_mm    double   164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
3 :bill_depth_mm     double    80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
4 :flipper_length_mm int64     55 [181, 186, 195, 193, 190, ... ]
5 :body_mass_g       int64     94 [3750, 3800, 3250, 3450, 3650, ... ]
6 :sex               string     3 {"male"=>168, "female"=>165, ""=>9}
7 :year              int64      3 {2007=>109, 2008=>114, 2009=>119}

This task is needed often, so there is a much simpler way to do it.

penguins.remove_nil # => same result as above
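
The mask-and-reduce idea behind the block version (`vectors.map(&:is_nil).reduce(&:|)`) can be sketched in plain Ruby as follows. Columns are plain Arrays here and the data is a small made-up sample; `remove_nil_rows` is a hypothetical helper, not RedAmber's implementation.

```ruby
# Plain-Ruby sketch of the mask-and-reduce idea; remove_nil_rows is a
# hypothetical helper, not RedAmber's implementation.
def remove_nil_rows(columns)
  # One boolean mask per column: true where the value is nil.
  masks = columns.values.map { |col| col.map(&:nil?) }
  # OR the masks together position by position: true if any column is nil there.
  any_nil = masks.reduce { |a, b| a.zip(b).map { |x, y| x | y } }
  # Keep only the positions whose combined mask is false.
  columns.transform_values { |col| col.reject.with_index { |_value, i| any_nil[i] } }
end

frame = {
  bill_length_mm: [39.1, nil, 40.3, 36.7],
  sex: ['male', 'female', nil, 'female']
}
p remove_nil_rows(frame)
```

Rows 1 and 2 each contain a nil in one column, so only the first and last rows remain.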

DataFrame#summary shows summary statistics in a DataFrame.

puts penguins.summary.to_s(width: 82)

# =>
  variables            count     mean      std      min      25%   median      75%      max
  <dictionary>      <uint16> <double> <double> <double> <double> <double> <double> <double>
0 bill_length_mm         342    43.92     5.46     32.1    39.23    44.38     48.5     59.6
1 bill_depth_mm          342    17.15     1.97     13.1     15.6    17.32     18.7     21.5
2 flipper_length_mm      342   200.92    14.06    172.0    190.0    197.0    213.0    231.0
3 body_mass_g            342  4201.75   801.95   2700.0   3550.0   4031.5   4750.0   6300.0
4 year                   344  2008.03     0.82   2007.0   2007.0   2008.0   2009.0   2009.0

DataFrame#group can be used for grouping tasks.

starwars = RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
starwars

# =>
#<RedAmber::DataFrame : 87 x 12 Vectors, 0x000000000000607c>
   unnamed1 name            height     mass hair_color skin_color  eye_color ... species
    <int64> <string>       <int64> <double> <string>   <string>    <string>  ... <string>
 0        1 Luke Skywalker     172     77.0 blond      fair        blue      ... Human
 1        2 C-3PO              167     75.0 NA         gold        yellow    ... Droid
 2        3 R2-D2               96     32.0 NA         white, blue red       ... Droid
 3        4 Darth Vader        202    136.0 none       white       yellow    ... Human
 4        5 Leia Organa        150     49.0 brown      light       brown     ... Human
 :        : :                    :        : :          :           :         ... :
84       85 BB8              (nil)    (nil) none       none        black     ... Droid
85       86 Captain Phasma   (nil)    (nil) unknown    unknown     unknown   ... NA
86       87 Padmé Amidala      165     45.0 brown      light       brown     ... Human

starwars.group(:species) { [count(:species), mean(:height, :mass)] }
        .slice { count > 1 }

# =>
#<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000006e848>
  species    count mean(height) mean(mass)
  <string> <int64>     <double>   <double>
0 Human         35        176.6       82.8
1 Droid          6        131.2       69.8
2 Wookiee        2        231.0      124.0
3 Gungan         3        208.7       74.0
4 NA             4        181.3       48.0
5 Zabrak         2        173.0       80.0
6 Twi'lek        2        179.0       55.0
7 Mirialan       2        168.0       53.1
8 Kaminoan       2        221.0       88.0
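
The group, aggregate, then filter pipeline above has a close plain-Ruby analogue built on `Enumerable#group_by`. The rows below are made-up stand-ins rather than values from the starwars dataset, and `mean_of` and `group_summary` are hypothetical helpers, so treat this as a sketch of the idea only.

```ruby
# Plain-Ruby sketch of group -> aggregate -> filter using Enumerable#group_by.
# The rows are made-up stand-ins, not values from the starwars dataset, and
# mean_of / group_summary are hypothetical helpers.
def mean_of(values)
  present = values.compact          # ignore missing values
  present.empty? ? nil : present.sum.to_f / present.size
end

def group_summary(rows, key, value)
  rows.group_by { |row| row[key] }.map do |group, members|
    { key => group, count: members.size, mean: mean_of(members.map { |m| m[value] }) }
  end
end

rows = [
  { species: 'A', height: 170 },
  { species: 'A', height: nil },
  { species: 'A', height: 180 },
  { species: 'B', height: 96 }
]

summary = group_summary(rows, :species, :height)
p summary.select { |g| g[:count] > 1 }   # keep groups with more than one member
```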

See DataFrame.md for other examples and details.

`RedAmber::Vector`

Class RedAmber::Vector represents a series of data in the DataFrame. Method RedAmber::DataFrame#[key] returns the Vector with the given key.

penguins[:bill_length_mm]
# =>
#<RedAmber::Vector(:double, size=344):0x000000000000f8fc>
[39.1, 39.5, 40.3, nil, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, ... ]

Vectors accept some functional methods from Arrow.

This is an element-wise comparison; it returns a boolean Vector of the same size.

penguins[:bill_length_mm] < 40

# =>
#<RedAmber::Vector(:boolean, size=344):0x000000000007e7ac>
[true, true, false, nil, true, true, true, true, true, false, true, true, false, ... ]
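
Note that nil stays nil in the result instead of becoming false. A plain-Ruby sketch of that element-wise rule, using the first five values shown above (`less_than` is a hypothetical helper, not RedAmber's implementation):

```ruby
# Plain-Ruby sketch of an element-wise comparison that keeps nil as nil.
# less_than is a hypothetical helper, not RedAmber's implementation.
def less_than(values, threshold)
  values.map { |v| v.nil? ? nil : v < threshold }
end

p less_than([39.1, 39.5, 40.3, nil, 36.7], 40)
```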

The next example returns an aggregated result.

penguins[:bill_length_mm].mean

# =>
43.92192982456141
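
The Vector contains nil values, yet the mean is a number, which matches the count of 342 in the summary above: missing values are skipped. A plain-Ruby sketch of that rule on a small sample (`mean_skipping_nil` is a hypothetical helper, not RedAmber's implementation):

```ruby
# Plain-Ruby sketch of a mean that skips missing values.
# mean_skipping_nil is a hypothetical helper, not RedAmber's implementation.
def mean_skipping_nil(values)
  present = values.compact
  return nil if present.empty?   # no data, no mean

  present.sum / present.size.to_f
end

p mean_skipping_nil([39.1, 39.5, 40.3, nil, 36.7])
```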

See Vector.md for details.

Jupyter notebook

'71 Examples of Red Amber' shows more examples in a Jupyter notebook.

You can try this notebook on Binder.

Development

git clone https://github.com/heronshoes/red_amber.git
cd red_amber
bundle install
bundle exec rake test

Community

I would appreciate it if you could help improve this project. Here are a few ways you can help:

- Let's talk in the discussions.
  - Browse Q and A, how to use, tips, etc.
  - Ask questions you’re wondering about.
  - Share ideas. The idea may be promoted to issues or pull requests.
- Report bugs or suggest new features.
- Fix bugs and submit pull requests.
- Write, clarify, or fix documentation.

License

The gem is available as open source under the terms of the MIT License.
