Hai Vo 7/21/23
- Introduction
- How it works
- Installation
- Milestone
- Documentation
- Usage example
- Random speed comparison
- Implement list
- Different type of i/o
This project is a Python package that mimics the functionality of R's stringr. The core functions are written in Rust and then exported to Python. Note that I wrote this package mostly for personal use (convenience and speed) and for learning purposes, so please use it with care!
Contributions of any kind are welcome!
- Uses the Arrow format to store the main input array.
- Uses pyo3 for the Python bindings.
- Converts Python types (mostly List) to Rust types (mostly Vec) for the cases that do not use Arrow. This may cause some overhead, but it makes the code more flexible. For example, many functions vectorize not only over the main array but also over their arguments.
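To make "vectorize over arguments" concrete, here is a minimal pure-Python sketch of the idea. The helper names `recycle` and `pad_left` are hypothetical and are not part of stringpy; the real logic lives in Rust and may handle lengths and missing values differently.

```python
from collections.abc import Sequence


def recycle(arg, n):
    """Broadcast a scalar (or length-1 list) to length n; pass a length-n list through."""
    # Strings are sequences too, but here they count as scalars.
    if isinstance(arg, str) or not isinstance(arg, Sequence):
        return [arg] * n
    if len(arg) == 1:
        return list(arg) * n
    if len(arg) != n:
        raise ValueError(f"argument has length {len(arg)}, expected 1 or {n}")
    return list(arg)


def pad_left(strings, width, pad=" "):
    """Hypothetical example: left-pad each string, vectorized over width and pad."""
    n = len(strings)
    widths = recycle(width, n)
    pads = recycle(pad, n)
    # Assumption: missing values (None) are passed through unchanged.
    return [
        None if s is None else s.rjust(w, p)
        for s, w, p in zip(strings, widths, pads)
    ]


print(pad_left(["a", "bb"], 3, "*"))
print(pad_left(["a", "bb"], [2, 4], ["-", "+"]))
```

The first call uses one width and one pad character for every element; the second supplies a separate width and pad character per element.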
This package is not on PyPI yet, so you need to compile it from source.
First, you need the Rust compiler:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Then install this package like a normal Python package:
git clone https://github.com/vohai611/stringpy.git
pip3 install ./stringpy
Or you can download and install a prebuilt wheel from the GitHub Actions artifacts.
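After either install route, a quick way to check that the package is visible to your interpreter is the standard-library lookup below. This is only a sketch: `is_installed` is a hypothetical helper, and it only confirms that the module can be found, not that the compiled extension loads correctly.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "stringpy" is the import name used in the usage examples.
    print("stringpy importable:", is_installed("stringpy"))
```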
- Implement basic functions
- Add documentation
- Add tests
- Add CI/CD
- Add examples
- Add codecov
- [] Release on PyPI
- [] Add benchmarks
- [] Vectorize on arguments
The documentation can be found here.
Code
# setup
import stringpy as sp
import pandas as pd
import numpy as np
import random
import string
Code
df = pd.DataFrame({'group': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'value': ['one', 'two', 'three', 'four', None, 'six', 'seven', 'eight', 'nine', 'ten']})
df2 = df.groupby('group').agg(lambda x: sp.str_c(x, collapse='->'))
df2
| group | value |
|---|---|
| a | one->three->->seven->nine |
| b | two->four->six->eight->ten |
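For readers who want to see what `collapse` does to each group, here is a rough pure-Python equivalent of the result above. It is not how stringpy is implemented; `collapse` is a hypothetical helper, and treating `None` as an empty string is an assumption inferred from the `one->three->->seven->nine` output.

```python
from collections import defaultdict


def collapse(values, sep):
    """Join values with `sep`, treating None as an empty string (assumption)."""
    return sep.join("" if v is None else v for v in values)


groups = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
values = ['one', 'two', 'three', 'four', None, 'six', 'seven', 'eight', 'nine', 'ten']

# Bucket the values by group, keeping their original order.
by_group = defaultdict(list)
for g, v in zip(groups, values):
    by_group[g].append(v)

result = {g: collapse(vs, '->') for g, vs in by_group.items()}
print(result)
```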
Code
sp.str_split(df2['value'], pattern='->')
<pyarrow.lib.ListArray object at 0x134b45360>
[
[
"one",
"three",
"",
"seven",
"nine"
],
[
"two",
"four",
"six",
"eight",
"ten"
]
]
Code
a = sp.str_replace_all(['ThisIsSomeCamelCase', 'ObjectNotFound'],
                       pattern='([a-z])([A-Z])', replace='$1 $2').to_pylist()
sp.str_replace_all(sp.str_to_lower(a), pattern=' ', replace='_')
<pyarrow.lib.StringArray object at 0x104077e20>
[
"this_is_some_camel_case",
"object_not_found"
]
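As a cross-check, the same camelCase-to-snake_case steps can be written with Python's standard `re` module. One difference to keep in mind: the stringpy call above refers to capture groups as `$1` and `$2`, while Python's `re.sub` uses `\1` and `\2`. `camel_to_snake` is a hypothetical helper, not a stringpy function.

```python
import re


def camel_to_snake(text: str) -> str:
    """Insert a space at each lower-to-upper boundary, lowercase, then use underscores."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return spaced.lower().replace(" ", "_")


print([camel_to_snake(s) for s in ['ThisIsSomeCamelCase', 'ObjectNotFound']])
```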
Code
vietnam = ['Hà Nội', 'Hồ Chí Minh', 'Đà Nẵng', 'Hải Phòng', 'Cần Thơ', 'Biên Hòa', 'Nha Trang', 'BMT', 'Huế', 'Buôn Ma Thuột', 'Bắc Giang', 'Bắc Ninh', 'Bến Tre', 'Bình Dương', 'Bình Phước', 'Bình Thuận', 'Cà Mau', 'Cao Bằng', 'Đắk Lắk', 'Đắk Nông', 'Điện Biên', 'Đồng Nai', 'Đồng Tháp']
sp.str_remove_ascent(vietnam)
<pyarrow.lib.StringArray object at 0x134b45de0>
[
"Ha Noi",
"Ho Chi Minh",
"Da Nang",
"Hai Phong",
"Can Tho",
"Bien Hoa",
"Nha Trang",
"BMT",
"Hue",
"Buon Ma Thuot",
...
"Binh Duong",
"Binh Phuoc",
"Binh Thuan",
"Ca Mau",
"Cao Bang",
"Dak Lak",
"Dak Nong",
"Dien Bien",
"Dong Nai",
"Dong Thap"
]
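One common way to strip accents, which may or may not match what stringpy does internally, is to decompose each character with Unicode NFD normalization and drop the combining marks. The letters `Đ` and `đ` have no Unicode decomposition, so this sketch maps them by hand. `remove_accents` is a hypothetical helper.

```python
import unicodedata

# U+0110 (Đ) and U+0111 (đ) do not decompose under NFD, so map them explicitly.
_NO_DECOMPOSITION = str.maketrans({"\u0110": "D", "\u0111": "d"})


def remove_accents(text: str) -> str:
    """Drop combining marks after NFD normalization, then fix Đ/đ."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.translate(_NO_DECOMPOSITION)


print([remove_accents(s) for s in ['Hà Nội', 'Đà Nẵng', 'Cần Thơ', 'Đắk Lắk']])
```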
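Although speed optimization is not the goal of this package, in most cases it still gets a decent speedup compared with pandas, thanks to Rust! Note that in the runs recorded below, which pass a plain Python list to stringpy, pandas is the faster of the two, so treat these numbers as specific to this input type and machine.
Below are some random comparisons between stringpy and pandas: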
Code
letters = string.ascii_lowercase
a = [''.join(random.choice(letters) for _ in range(10)) for _ in range(600_000)]
a_sr = pd.Series(a)
Code
%%time
a_sr.str.replace(r'\w', 'b', regex=True)
CPU times: user 447 ms, sys: 7.09 ms, total: 454 ms
Wall time: 454 ms
0 bbbbbbbbbb
1 bbbbbbbbbb
2 bbbbbbbbbb
3 bbbbbbbbbb
4 bbbbbbbbbb
...
599995 bbbbbbbbbb
599996 bbbbbbbbbb
599997 bbbbbbbbbb
599998 bbbbbbbbbb
599999 bbbbbbbbbb
Length: 600000, dtype: object
Code
%%time
sp.str_replace_all(a, pattern=r'\w', replace='b')
CPU times: user 4.95 s, sys: 27 ms, total: 4.98 s
Wall time: 4.98 s
<pyarrow.lib.StringArray object at 0x104077ca0>
[
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
...
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb",
"bbbbbbbbbb"
]
Code
%%time
a_sr.str.slice(2,4)
CPU times: user 55.7 ms, sys: 4.04 ms, total: 59.8 ms
Wall time: 59.7 ms
0 az
1 aj
2 qr
3 wr
4 ao
..
599995 ky
599996 tn
599997 dj
599998 dg
599999 ny
Length: 600000, dtype: object
Code
%%time
sp.str_sub(a, start=2, end=4)
CPU times: user 272 ms, sys: 4.64 ms, total: 277 ms
Wall time: 276 ms
<pyarrow.lib.StringArray object at 0x134b44100>
[
"az",
"aj",
"qr",
"wr",
"ao",
"ds",
"ef",
"br",
"pi",
"dg",
...
"ps",
"mn",
"mm",
"dt",
"co",
"ky",
"tn",
"dj",
"dg",
"ny"
]
## Counting
Code
%%time
a_sr.str.count('a')
CPU times: user 132 ms, sys: 3.08 ms, total: 135 ms
Wall time: 135 ms
0 2
1 1
2 0
3 1
4 1
..
599995 2
599996 0
599997 0
599998 1
599999 0
Length: 600000, dtype: int64
Code
%%time
sp.str_count(a, pattern='a')
CPU times: user 427 ms, sys: 2.26 ms, total: 430 ms
Wall time: 430 ms
<pyarrow.lib.Int32Array object at 0x134b458a0>
[
2,
1,
0,
1,
1,
0,
1,
0,
1,
1,
...
1,
0,
0,
0,
0,
2,
0,
0,
1,
0
]
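The timings above come from single `%%time` runs, which only work inside IPython/Jupyter and can be noisy. If you want to repeat a comparison in a plain script, a small best-of-N helper like the sketch below is one option. `best_wall_time` and `replace_word_chars` are hypothetical names; the stand-in function uses Python's `re` so the example runs without stringpy or pandas installed, and you would pass your own callable instead.

```python
import re
import time


def best_wall_time(fn, *args, repeats=5, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times and return the fastest wall time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def replace_word_chars(strings):
    """Stand-in workload: replace every word character with 'b'."""
    pattern = re.compile(r"\w")
    return [pattern.sub("b", s) for s in strings]


sample = ["abcdefghij"] * 10_000
elapsed = best_wall_time(replace_word_chars, sample, repeats=3)
print(f"best of 3: {elapsed * 1000:.1f} ms")
```

Taking the minimum over several runs reduces the effect of one-off pauses (garbage collection, other processes) on the reported number.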
- str_count()
- str_detect()
- str_extract() str_extract_all()
- [] str_locate() str_locate_all()
- str_match() str_match_all()
- str_replace() str_replace_all()
- str_remove() str_remove_all()
- str_split()
- [] str_split_1() str_split_fixed() str_split_i()
- str_starts() str_ends()
- str_subset()
- str_which()
- str_c(), str_combine()
- [] str_flatten() str_flatten_comma()
- str_dup()
- str_length() str_width()
- str_pad()
- str_sub()/ str_sub_all()
- str_trim() str_squish()
- str_trunc()
- [] str_wrap()
- str_to_upper() str_to_lower() str_to_title() str_to_sentence()
- str_unique()
- str_remove_ascent()
- @export: one array in, one array out
- @export2: multiple arrays in, one array out

Macros:

- apply_utf8!()
- apply_utf8_bool!()
- apply_utf8_lst!()

Combinations:

- vec in vec out
  - apply_utf8!()
  - @export
- vec+ in vec out
  - apply_utf8!()
  - @export2
- vec in vec out
  - apply_utf8_bool!()
  - @export
- vec in vec<vec> out
  - apply_utf8_lst!()
  - @export
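The input/output shapes listed above can be pictured with a conceptual Python analogue. This is only an illustration: the real `apply_utf8*!` macros and `@export` wrappers are written in Rust and their internals are not shown here. `map_one` and `map_many` are hypothetical helpers, and passing `None` through unchanged is an assumption about how missing values are handled.

```python
def map_one(values, fn):
    """One array in, one array out; None is passed through (assumption)."""
    return [None if v is None else fn(v) for v in values]


def map_many(values, others, fn):
    """Several arrays in, one array out; the arrays must have equal length."""
    if len(values) != len(others):
        raise ValueError("input arrays must have the same length")
    return [
        None if v is None or o is None else fn(v, o)
        for v, o in zip(values, others)
    ]


words = ["one", None, "two three"]

upper = map_one(words, str.upper)                      # strings out
has_t = map_one(words, lambda s: s.startswith("t"))    # booleans out
parts = map_one(words, str.split)                      # list of lists out
dup = map_many(words, [2, 1, 1], lambda s, n: s * n)   # two arrays in

print(upper, has_t, parts, dup, sep="\n")
```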