alperyilmaz / qsv

CSVs sliced, diced & analyzed.

qsv: Ultra-fast CSV data-wrangling toolkit

qsv is a command line program for indexing, slicing, analyzing, filtering, enriching, validating & joining CSV files. Commands are simple, fast & composable.

Table of Contents

* Available Commands
* Installation
* Whirlwind Tour
* Cookbook
* FAQ
* Changelog
* Performance Tuning
* Benchmarks
* NYC School of Data 2022 slides
* Sponsor

ℹ️ NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's May 2018 release. On top of xsv's 20 commands, it adds numerous new features, 27 additional commands, 6 apply subcommands & 31 apply operations (for a total of 84). See FAQ for more details.

Available commands

Command - Description
apply [1][2] - Apply a series of string, date, math, currency & geocoding transformations to a CSV column. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex & language detection).
behead - Drop headers from a CSV.
cat - Concatenate CSV files by row or by column.
count [3] - Count the rows in a CSV file. (Instantaneous with an index.)
dedup [4][2] - Remove duplicate rows (see also the extsort & sortcheck commands).
enum - Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value.
excel - Export a specified Excel/ODS sheet to a CSV file.
exclude [3] - Remove a set of CSV data from another set based on the specified columns.
explode - Explode rows into multiple ones by splitting a column value based on the given separator.
extsort [2] - Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm.
fetch - Fetch data from web services for every row using HTTP GET. Comes with jql JSON query language support, dynamic throttling (RateLimit) & caching with optional Redis support for persistent caching.
fetchpost - Similar to fetch, but uses HTTP POST. (See HTTP GET vs POST methods.)
fill - Fill empty values.
fixlengths - Force a CSV to have same-length records by either padding or truncating them.
flatten - A flattened view of CSV records. Useful for viewing one record at a time, e.g. qsv slice -i 5 data.csv | qsv flatten.
fmt - Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
foreach [1] - Loop over a CSV to execute bash commands. (Not available on Windows.)
frequency [3][5] - Build frequency tables of each column. (Uses multithreading to go faster if an index is present.)
generate [1] - Generate test data by profiling a CSV using Markov decision process machine learning.
headers - Show the headers of a CSV. Or show the intersection of all headers between many CSV files.
index - Create an index for a CSV. This is very quick & provides constant time indexing into the CSV file. Also enables multithreading for the frequency, split, stats and schema commands.
input [3] - Read CSV data with special quoting, trimming, line-skipping and UTF-8 transcoding rules. Typically used to "normalize" a CSV for further processing with other qsv commands.
join [3] - Inner, outer, cross, anti & semi joins. Uses a simple hash index to make it fast.
jsonl - Convert newline-delimited JSON (JSONL/NDJSON) to CSV. See the tojsonl command to convert CSV to JSONL.
lua [1] - Execute a Lua 5.4 script over CSV lines to transform, aggregate or filter them. Lua is much faster than Python.
luajit [1] - Execute a LuaJIT 2.0 (a Just-In-Time compiler for Lua 5.1) script over CSV lines to transform, aggregate or filter them. LuaJIT is even faster still than Lua.
partition - Partition a CSV based on a column value.
pseudo - Pseudonymise the values of the given column by replacing them with an incremental identifier.
py [1] - Evaluate a Python expression over CSV lines to transform or filter them. Python's f-strings are particularly useful for extended formatting, with the ability to evaluate Python expressions as well. Consider using the lua/luajit commands instead if you're having Python version issues (Python 3.6 and up supported, with Python 3.11 required on prebuilt qsv), as Lua is much faster, embedded, can do aggregations & has no external dependencies.
rename - Rename the columns of a CSV efficiently.
replace - Replace CSV data using a regex.
reverse [4] - Reverse the order of rows in a CSV. Unlike the sort --reverse command, it preserves the order of rows with the same key.
sample [3] - Randomly draw rows (with optional seed) from a CSV using reservoir sampling (i.e., use memory proportional to the size of the sample).
schema [5] - Infer schema from CSV data and output in JSON Schema format. Uses multithreading to go faster if an index is present. See the validate command to use the generated JSON Schema to validate whether similar CSVs comply with the schema.
search - Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows.
searchset - Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows.
select - Select, re-order, duplicate or drop columns.
slice [3][4] - Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
sniff [3] - Quickly sniff CSV metadata (delimiter, header row, preamble rows, quote character, flexible, is_utf8, number of records, number of fields, field names & data types).
sort [2] - Sort CSV data in alphabetical, numerical, reverse or random (with optional seed) order (see also the extsort & sortcheck commands).
sortcheck [3] - Check if a CSV is sorted. With the --json option, also retrieve record count, sort breaks & duplicate count.
split [3][5] - Split one CSV file into many CSV files of N chunks. (Uses multithreading to go faster if an index is present.)
stats [3][4][5] - Infer data type (Null, String, Float, Integer, Date, DateTime) & compute descriptive statistics for each column in a CSV (sum, min/max, min/max length, mean, stddev, variance, nullcount, quartiles, IQR, lower/upper fences, skewness, median, mode & cardinality). Uses multithreading to go faster if an index is present.
table [4] - Show aligned output of a CSV using elastic tabstops.
tojsonl [5] - Smartly convert CSV to newline-delimited JSON (JSONL/NDJSON). By scanning the CSV first, it "smartly" infers the appropriate JSON data type for each column. See the jsonl command to convert JSONL to CSV.
transpose [4] - Transpose rows/columns of a CSV.
validate [3][2] - Validate CSV data with a JSON Schema (see the schema command). If no jsonschema file is provided, validates if a CSV conforms to the RFC 4180 standard.
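
The sample command's description mentions reservoir sampling. To illustrate why that keeps memory proportional to the sample size, here is a minimal Python sketch of the classic "Algorithm R". It is only an illustration of the technique, not qsv's Rust implementation; the reservoir_sample function and its parameters are names made up for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows drawn uniformly at random from a stream of rows.

    Hypothetical helper implementing the classic "Algorithm R". Only the
    k-row reservoir is held in memory, never the whole input.
    """
    rng = random.Random(seed)  # an optional seed makes the draw repeatable
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # The first k rows fill the reservoir.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir
```

Because rows are consumed one at a time, the input can be any iterator (for example, a csv.reader over a large file), and passing the same seed reproduces the same sample.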

Installation

For macOS and Linux (64-bit), you can quickly install qsv with Homebrew:

brew install qsv

Pre-built binaries for Windows, Linux and macOS are also available for download, including binaries compiled with Rust Nightly/Unstable (more info).

There are four variants of qsv:

  • qsv enables all features valid for the target platform [6]
  • qsvnp enables all features EXCEPT python ("np" stands for "no python")
  • qsvlite has all features disabled (~half the size of qsv)
  • qsvdp is optimized for use with DataPusher+, with only DataPusher+ relevant commands and the self-update engine removed (~one-sixth the size of qsv).

Alternatively, you can install from source by installing Rust and installing qsv using Rust's cargo command [7]:

cargo install qsv --features all_full

If you encounter compilation errors, ensure you have the Python development libraries installed and you're using the exact version of the dependencies qsv was built with by issuing:

cargo install qsv --locked --features all_full

The binary will be installed in ~/.cargo/bin.

Compiling from source also works similarly:

git clone git@github.com:jqnatividad/qsv.git
cd qsv
cargo build --release --features all_full
# or if you encounter compilation errors
cargo build --release --locked --features all_full

The compiled binary will end up in ./target/release/.

To enable optional features, use cargo --features (see Feature Flags for more info):

cargo install qsv --features apply,generate,lua,fetch,foreach,python,self_update,full
# or shorthand
cargo install qsv --features all_full
# or to install all features EXCEPT python
cargo install qsv --features nopython_full
# or to install qsvlite
cargo install qsv --features lite
# or to install qsvdp
cargo install qsv --features datapusher_plus

# or when compiling from a local repo
cargo build --release --features apply,generate,lua,fetch,foreach,python,self_update,full
# shorthand
cargo build --release --features all_full
# all features EXCEPT python
cargo build --release --features nopython_full
# for qsvlite
cargo build --release --features lite
# for qsvdp
cargo build --release --features datapusher_plus

Minimum Supported Rust Version

qsv's MSRV policy is to require Rust stable - currently version 1.65.

Tab Completion

qsv's command-line options are quite extensive. Thankfully, since it uses docopt for CLI processing, we can take advantage of docopt.rs' tab completion support to make it easier to use qsv at the command-line (currently, only bash shell is supported):

# install docopt-wordlist
cargo install docopt

# IMPORTANT: run these commands from the root directory of your qsv git repository
# to setup bash qsv tab completion
echo "DOCOPT_WORDLIST_BIN=\"$(which docopt-wordlist)\"" >> $HOME/.bash_completion
echo "source \"$(pwd)/scripts/docopt-wordlist.bash\"" >> $HOME/.bash_completion
echo "complete -F _docopt_wordlist_commands qsv" >> $HOME/.bash_completion

File formats

qsv recognizes UTF-8/ASCII encoded CSV (.csv) and TSV (.tsv and .tab) files. CSV files are assumed to have "," (comma) as a delimiter, and TSV files "\t" (tab). The delimiter is a single ASCII character that can be set with the --delimiter command-line option or the QSV_DEFAULT_DELIMITER environment variable, or automatically detected when QSV_SNIFF_DELIMITER is set.

When using the --output option, note that qsv will UTF-8 encode the file and automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv and .tab files.
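
The extension rule above can be pictured with a small Python sketch. It is an illustration only, not qsv's code: output_delimiter and write_rows are hypothetical helpers, and falling back to a comma for unknown extensions and matching extensions case-insensitively are assumptions of this sketch.

```python
import csv
from pathlib import Path
from typing import Iterable, Sequence

# Mapping taken from the paragraph above: comma for .csv, tab for .tsv/.tab.
_DELIMITER_BY_EXTENSION = {".csv": ",", ".tsv": "\t", ".tab": "\t"}


def output_delimiter(output_path: str) -> str:
    """Pick the delimiter for an output file from its extension.

    Assumption: unknown extensions fall back to a comma, and the
    extension is matched case-insensitively.
    """
    ext = Path(output_path).suffix.lower()
    return _DELIMITER_BY_EXTENSION.get(ext, ",")


def write_rows(output_path: str, rows: Iterable[Sequence[str]]) -> None:
    """Write rows as UTF-8, using the delimiter implied by the extension."""
    with open(output_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter=output_delimiter(output_path))
        writer.writerows(rows)
```

With this sketch, writing the same rows to out.csv and out.tsv yields comma-separated and tab-separated files respectively, both UTF-8 encoded.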

JSONL/NDJSON files are also recognized and converted from/to CSV with the jsonl and tojsonl commands respectively.

The fetch & fetchpost commands also produce JSONL files when invoked without the --new-column option, and TSV files with the --report option.

The sniff, sortcheck and validate commands produce JSON files with their --json and --pretty-json options.

The schema command produces a JSON Schema Validation (Draft 7) file with the ".schema.json" file extension, which can be used with the validate command.

The excel command recognizes Excel and Open Document Spreadsheet (ODS) files (.xls, .xlsx, .xlsm, .xlsb and .ods files).

RFC 4180

qsv validates against the RFC 4180 CSV standard. In practice, however, CSV formats vary significantly, so qsv is deliberately not strictly compliant with the specification in order to process "real-world" CSV files. qsv leverages the awesome Rust CSV crate to read/write CSV files.

More information is available on how qsv conforms to the standard using this crate.

UTF-8 Encoding

The following commands require UTF-8 encoded input (of which ASCII is a subset) - dedup, exclude, fetch, fetchpost, frequency, join, schema, sort, stats & validate.

For these commands, qsv checks if the input is UTF-8 encoded by scanning the first 8k, and aborts if it is not, unless QSV_SKIPUTF8_CHECK is set. On Linux and macOS, UTF-8 encoding is the default.

This was done to increase the performance of these commands, as they make extensive use of from_utf8_unchecked so as not to pay the repetitive UTF-8 validation penalty, no matter how small, even for already UTF-8 encoded files.
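
If you want to pre-check a file yourself, the idea of scanning only the head of a file can be sketched in Python as follows. This is an illustration, not qsv's implementation: looks_like_utf8 is a hypothetical helper, and reading "8k" as 8192 bytes is an assumption.

```python
import codecs


def looks_like_utf8(path: str, scan_bytes: int = 8192) -> bool:
    """Return True if the first scan_bytes bytes of the file decode as UTF-8.

    Hypothetical helper. Like any head-only scan, it cannot see invalid
    bytes that occur after the scanned prefix.
    """
    with open(path, "rb") as f:
        head = f.read(scan_bytes)
    decoder = codecs.getincrementaldecoder("utf-8")()
    # If the whole file fit in the scan window, a trailing partial character
    # is an error; otherwise it may simply have been cut at the boundary.
    whole_file_scanned = len(head) < scan_bytes
    try:
        decoder.decode(head, final=whole_file_scanned)
    except UnicodeDecodeError:
        return False
    return True
```

The incremental decoder is used so that a multi-byte character split by the scan boundary is not reported as invalid.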

Should you need to re-encode CSV/TSV files, you can use the input command to transcode to UTF-8. It will replace all invalid UTF-8 sequences with . Alternatively, there are several utilities you can use to do so on Linux/macOS and Windows.

Windows Usage Note

Unlike other modern operating systems, Microsoft Windows' default encoding is UTF16-LE. This causes problems when you redirect qsv's output to a CSV file and then open it with Excel, which ignores the comma delimiter and puts everything in the first column:

qsv stats wcp.csv > wcpstats.csv

This is weird, since you would think Microsoft's own Excel would properly recognize UTF16-LE encoded CSV files. Regardless, to create a properly UTF-8 encoded file, use the --output option instead:

# so instead of redirecting stdout to a file
qsv stats wcp.csv > wcpstats.csv

# do this instead
qsv stats wcp.csv --output wcpstats.csv

Python

With the python feature, qsv will look for Python shared libraries (libpython* on Linux/macOS, python*.dll on Windows) against which it was compiled, and abort with an error if not found, detailing the Python library it was looking for.

Note that this will happen as soon as the qsv binary is invoked, even if you're not running the py command.

If you don't need to run the py command, simply use qsvnp ("np" stands for "no python"), qsvlite, or qsvdp.

If you need the py command, the prebuilt qsv binary is compiled, as a policy, using the current stable Python minor version (currently Python 3.11) at the time of release.

If you require a different Python version (Python 3.6 and up are supported), you'll need to install/compile from source, making sure you have the development libraries for the desired Python version installed when doing so.

PyO3, the underlying crate that enables the python feature, uses a build script to determine the Python version and set the correct linker arguments. By default it uses the python3 executable. You can override the Python interpreter by setting PYO3_PYTHON (e.g., PYO3_PYTHON=python3.6) before installing/compiling qsv. See the PyO3 User Guide for more information.

If you're distributing python-enabled qsv, you can also "bundle" the Python shared library by including it in the same directory as the qsv binary. qsv will automatically use the "bundled" library instead of the default Python version in the environment.

Also, consider using the lua/luajit commands instead of the py command if the mapping/filtering operation you're trying to do can be done with lua/luajit. Lua is much faster than Python and LuaJIT is even faster still. In addition, Lua/LuaJIT is embedded into qsv, can do aggregations & has no external dependencies, unlike Python.

The py command cannot do aggregations because of PyO3's GIL-bound memory limitations, which would quickly consume a lot of memory (see issue 449 for details). To prevent this, the py command processes CSVs in batches (default: 30,000 records), with a GIL pool for each batch, so no globals are available across batches.
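
To see why per-batch state rules out aggregations, consider this pure-Python sketch. It only models the batching behaviour described above and has nothing to do with PyO3 internals; batches, apply_in_batches and running_total are hypothetical names, and the per-batch state dict stands in for "globals".

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batches(rows: Iterable[List[str]], batch_size: int = 30_000) -> Iterator[List[List[str]]]:
    """Yield successive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def apply_in_batches(
    rows: Iterable[List[str]],
    fn: Callable[[List[str], Dict], object],
    batch_size: int = 30_000,
) -> Iterator[object]:
    """Call fn(row, state) per row; state is recreated for every batch."""
    for batch in batches(rows, batch_size):
        state: Dict = {}  # fresh per batch: nothing survives into the next one
        for row in batch:
            yield fn(row, state)


def running_total(row: List[str], state: Dict) -> int:
    """An attempted aggregation: a running sum of the first column."""
    state["total"] = state.get("total", 0) + int(row[0])
    return state["total"]
```

With five rows holding 1 to 5 and batch_size=2, list(apply_in_batches(rows, running_total, batch_size=2)) gives [1, 3, 3, 7, 5] rather than a true running total, because the sum restarts with each batch.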

Environment Variables

Variable - Description
QSV_DEFAULT_DELIMITER - single ASCII character to use as delimiter. Overrides the --delimiter option. Defaults to "," (comma) for CSV files and "\t" (tab) for TSV files when not set. Note that this will also set the delimiter for qsv's output to stdout. However, using the --output option, regardless of this environment variable, will automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv and .tab files.
QSV_SNIFF_DELIMITER - if set, the delimiter is automatically detected. Overrides QSV_DEFAULT_DELIMITER and the --delimiter option. Note that this does not work with stdin.
QSV_NO_HEADERS - if set, the first row will NOT be interpreted as headers. Supersedes QSV_TOGGLE_HEADERS.
QSV_TOGGLE_HEADERS - if set to 1, toggles the header setting - i.e. inverts qsv header behavior, with no headers being the default, and setting --no-headers will actually mean headers will not be ignored.
QSV_AUTOINDEX - if set, automatically create an index when none is detected. Also automatically update stale indices.
QSV_COMMENT_CHAR - set to an ASCII character. If set, any lines (including the header) that start with this character are ignored.
QSV_MAX_JOBS - number of jobs to use for multithreaded commands (currently apply, dedup, extsort, frequency, schema, sort, split, stats, tojsonl and validate). If not set, max_jobs is set to the detected number of logical processors. See Multithreading for more info.
QSV_NO_UPDATE - if set, prohibit the self-update version check for the latest qsv release published on GitHub.
QSV_PREFER_DMY - if set, date parsing will use DMY format. Otherwise, use MDY format (used with the apply datefmt, schema, sniff & stats commands).
QSV_REGEX_UNICODE - if set, makes the search, searchset and replace commands unicode-aware. For increased performance, these commands are not unicode-aware by default and will ignore unicode values when matching and will abort when unicode characters are used in the regex. Note that the apply command's regex_replace operation is always unicode-aware.
QSV_SKIPUTF8_CHECK - if set, skip the UTF-8 encoding check. Otherwise, for several commands that require UTF-8 encoded input (see UTF-8 Encoding), qsv scans the first 8k.
QSV_RDR_BUFFER_CAPACITY - reader buffer size in bytes (default: 16384).
QSV_WTR_BUFFER_CAPACITY - writer buffer size in bytes (default: 65536).
QSV_LOG_LEVEL - desired log level (default: off; other levels: error, warn, info, trace, debug).
QSV_LOG_DIR - when logging is enabled, the directory where the log files will be stored. If the specified directory does not exist, qsv will attempt to create it. If not set, the log files are created in the directory where qsv was started. See Logging for more info.
QSV_PROGRESSBAR - if set, enable the --progressbar option on the apply, fetch, fetchpost, foreach, lua, py, replace, search, searchset, sortcheck and validate commands.
QSV_REDIS_CONNSTR - the fetch command can use Redis to cache responses. Set to connect to the desired Redis instance (default: redis:127.0.0.1:6379/1). For more info on valid Redis connection string formats, see https://docs.rs/redis/latest/redis/#connection-parameters.
QSV_FP_REDIS_CONNSTR - the fetchpost command can also use Redis to cache responses (default: redis:127.0.0.1:6379/2). Note that fetchpost connects to database 2, as opposed to fetch which connects to database 1.
QSV_REDIS_MAX_POOL_SIZE - the maximum Redis connection pool size (default: 20).
QSV_REDIS_TTL_SECONDS - time-to-live of Redis cached values in seconds (default: 2419200, i.e. 28 days).
QSV_REDIS_TTL_REFRESH - if set, enables cache hits to refresh the TTL of cached values.
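
The precedence rules for the delimiter and header variables are easy to misread, so here is a minimal Python sketch of one reading of them. It is not qsv's code: resolve_delimiter and first_row_is_header are hypothetical helpers, "set" is taken to mean "present in the environment", and Python's csv.Sniffer on an arbitrary 8192-character sample stands in for qsv's own delimiter detection.

```python
import csv
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_delimiter(
    path: str,
    cli_delimiter: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve the input delimiter following the precedence in the table."""
    env = os.environ if env is None else env
    if "QSV_SNIFF_DELIMITER" in env:
        # Highest precedence: detect from the file itself (stand-in sniffer).
        with open(path, encoding="utf-8", newline="") as f:
            return csv.Sniffer().sniff(f.read(8192)).delimiter
    if "QSV_DEFAULT_DELIMITER" in env:
        # Per the table, this overrides the --delimiter option.
        return env["QSV_DEFAULT_DELIMITER"]
    if cli_delimiter is not None:
        return cli_delimiter
    # Nothing set: tab for .tsv/.tab files, comma otherwise.
    return "\t" if Path(path).suffix.lower() in {".tsv", ".tab"} else ","


def first_row_is_header(
    no_headers_flag: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide whether the first row is treated as headers."""
    env = os.environ if env is None else env
    if "QSV_NO_HEADERS" in env:
        return False  # supersedes QSV_TOGGLE_HEADERS
    if env.get("QSV_TOGGLE_HEADERS") == "1":
        return no_headers_flag  # inverted: --no-headers now means "has headers"
    return not no_headers_flag
```

Passing env explicitly (for example env={"QSV_TOGGLE_HEADERS": "1"}) makes it easy to try combinations without touching the real environment.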

Several dependencies also have environment variables that influence qsv's performance & behavior:

  • Memory Management (mimalloc)
    When incorporating qsv into a data pipeline that runs in batch mode, particularly with very large CSV files using qsv commands that load entire CSV files into memory, you can fine-tune Mimalloc's behavior using its environment variables.
  • Network Access (reqwest)
    qsv uses reqwest for its fetch, validate and --update functions and will honor proxy settings set through the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables.

ℹ️ NOTE: To get a list of all active qsv-relevant environment variables, run qsv --envlist. Relevant env vars are defined as anything that starts with QSV_ and MIMALLOC_, and the proxy variables listed above.

Feature Flags

qsv has several features:

  • mimalloc (default) - use the mimalloc allocator (see Memory Allocator for more info).
  • apply - enable apply command. This swiss-army knife of CSV transformations is very powerful, but it has a lot of dependencies that increase both compile time and binary size.
  • fetch - enables the fetch and fetchpost commands.
  • generate - enable generate command.
  • full - enable to build qsv binary variant which is feature-capable.
  • all_full - enable to build qsv binary variant with all features enabled (apply,fetch,foreach,generate,luajit,python).
  • nopython_full - enable to build qsvnp binary variant with all features (apply,fetch,foreach,generate,luajit) EXCEPT python.
  • lite - enable to build qsvlite binary variant with all features disabled.
  • datapusher_plus - enable to build qsvdp binary variant - the DataPusher+ optimized qsv binary.
  • nightly - enable to turn on nightly/unstable features in the rand, regex, hashbrown, parking_lot and pyo3 crates when building with Rust nightly/unstable.
  • self_update - enable self-update engine, checking GitHub for the latest release.

The following "power-user" features can be abused and present "foot-shooting" scenarios:

  • lua - enable lua command. Embeds a Lua 5.4 interpreter into qsv.
  • luajit - enable luajit command. Embeds a LuaJIT 2.0 interpreter into qsv. LuaJIT is a Just-In-Time compiler for the Lua 5.1 language and is thus much faster than Lua. Note that the lua and luajit interpreters are mutually exclusive features.
  • foreach - enable foreach command (not valid for Windows).
  • python - enable py command (requires Python 3.6+ shared library). Note that qsv will look for the Python shared library (libpython.* on Linux/macOS, python*.dll on Windows) for the Python version it was compiled against and will abort if the library is not found, even if you're not using the py command. Check Python section for more info.

ℹ️ NOTE: qsvlite, as the name implies, always has non-default features disabled. qsv can be built with any combination of the above features using the cargo --features & --no-default-features flags. The pre-built qsv binaries have all applicable features valid for the target platform [6].

License

Dual-licensed under MIT or the UNLICENSE.

Sponsor

qsv was made possible by
datHere
Standards-based, best-of-breed, open source solutions
to make your Data Useful, Usable & Used.

Naming Collision

This project is unrelated to Intel's Quick Sync Video.

Footnotes

  1. enabled by optional feature flag. Not available on qsvlite & qsvdp.

  2. multithreaded even without an index.

  3. uses an index when available.

  4. loads the entire CSV into memory. Note that dedup, stats & transpose have modes that do not load the entire CSV into memory.

  5. multithreaded when an index is available.

  6. The foreach feature is not available on Windows. The python feature is not enabled on cross-compiled pre-built binaries as we don't have access to a native python interpreter for those platforms (aarch64, i686, and arm) on GitHub's x86_64-based action runners. Compile natively on those platforms with a Python 3.6+ development environment installed if you want to enable the python feature.

  7. Of course, you'll also need a linker and a C compiler. Linux users should generally install GCC or Clang, according to their distribution's documentation. For example, if you use Ubuntu, you can install the build-essential package. On macOS, you can get a C compiler by running $ xcode-select --install. For Windows, this means installing Visual Studio 2022. When prompted for workloads, include "Desktop Development with C++", the Windows 10 or 11 SDK, and the English language pack, along with any other language packs you require.
