ridjohansen / structured-text-tools

A list of command line tools for manipulating structured text data

What follows is a list of text-based file formats with command line tools for manipulating each (with a focus on Linux).

DSV

Delimiter-separated values, including CSV, TSV, etc.

Awk

Awk is a POSIX-standard command line tool and programming language for processing DSV data. A list of Awk links follows.

  • Awk.info — an extensive resource on Awk.
  • AWK Vs NAWK Vs GAWK — a comparison of implementations.
  • If you already know how to program in some language, the nawk man page is a great way to learn Awk quickly. What you learn from it will apply to other implementations on different platforms. Read it first if you feel overwhelmed by the sheer size of the GNU Awk manual.
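
As a minimal sketch of the kind of DSV processing Awk is suited to, the following sums a numeric column per key in a small comma-separated sample. The sample data and the variable names `sales_csv` and `totals` are made up for illustration. Only POSIX Awk features are used, and quoted CSV fields that contain commas are not handled.

```shell
# Hypothetical sample data: a header row followed by name,amount records.
sales_csv='name,amount
alice,10
bob,5
alice,7'

# -F, sets the field separator to a comma; NR > 1 skips the header row.
# Amounts are accumulated per name in an associative array and printed at
# the end. The order of "for (k in sum)" is unspecified, so sort the output.
totals=$(printf '%s\n' "$sales_csv" |
    awk -F, 'NR > 1 { sum[$1] += $2 } END { for (k in sum) print k "," sum[k] }' |
    sort)

printf '%s\n' "$totals"
```

For real-world CSV with quoting and embedded separators, one of the dedicated tools listed further down is likely a better fit than plain field splitting.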

POSIX commands

| Name and link | Description |
| --- | --- |
| cut | Select portions of each line of one or more files. Can work with delimiter-separated fields. See man 1 cut on your system (GNU, FreeBSD). |
| join | Join the lines of two files on a common field. See man 1 join on your system (GNU, FreeBSD). |
| paste | Merge corresponding lines of several files, or combine consecutive lines of one file into one. See man 1 paste on your system (GNU, FreeBSD). |
| sort | Sort lines by key fields. See man 1 sort on your system (GNU, FreeBSD). |
| uniq | Find or remove repeated lines. See man 1 uniq on your system (GNU, FreeBSD). |
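
The commands above are usually combined in pipelines. The following is a rough sketch on two made-up colon-separated files; the file names, their contents and the variable names are assumptions for illustration, and `mktemp -d` is assumed to be available (it is on GNU and BSD systems, but it is not part of POSIX).

```shell
# Hypothetical sample files in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '1:alice:admin' '2:bob:dev' '3:carol:dev' > "$tmpdir/users"
printf '%s\n' '1:London' '2:Oslo' '3:Lima' > "$tmpdir/cities"

# cut selects the third colon-separated field (the role) of each line;
# sort groups equal lines together so that uniq can drop the repeats.
roles=$(cut -d: -f3 "$tmpdir/users" | sort | uniq)

# join matches lines of both files on their first field. Both inputs
# must already be sorted on that field, as they are here.
joined=$(join -t: "$tmpdir/users" "$tmpdir/cities")

# sort -k2,2 orders the lines by their second field only (the city).
by_city=$(sort -t: -k2,2 "$tmpdir/cities")

# paste -s combines all input lines into one, separated by commas.
names=$(cut -d: -f2 "$tmpdir/users" | paste -s -d, -)

printf '%s\n' "$roles" "$joined" "$by_city" "$names"
rm -r "$tmpdir"
```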

Other tools

| Name and link | Description |
| --- | --- |
| GNU datamash | Perform statistical operations on text input. |
| Miller | Like sed, awk, cut, join and sort, but for name-indexed data such as CSV and tabular JSON. |
| tab | A non-Turing-complete programming language for data processing. An alternative to Awk. |
| xsv | Index, slice, analyze, split and join CSV files. |

SQL-based utilities

| Name | Programming language and database engine | Features | Usage link | License |
| --- | --- | --- | --- | --- |
| csvkit | Python, SQLite 3 | Use header row for column names, custom input and output encoding, custom input field separator, custom output field separator, custom output formatting, CSV JOINs, Python module. Excel and JSON to CSV. CSV to JSON. SQL queries for CSV. | Usage | MIT |
| q | Python, SQLite 3 | Use header row for column names, custom input and output encoding, gzipped input, custom input field separator (string literal), custom output field separator, custom output formatting, table JOINs, Python module. | Usage | GNU GPL 3 |
| Sqawk | Tcl, SQLite 3 | Use header row for column names, custom input field separator (regexp, per-file), custom input record delimiter (regexp, per-file), custom table names, custom output field separator, custom output record separator, merge selected columns into one, ASCII/Unicode table output, CSV input and output, JSON output, Tcl output, table JOINs. | Usage | MIT |
| sqawk | C, SQLite 3 | Use header row for column names, column name aliases, can skip lines until a regexp matches, custom input field separator (string literal, per-file), keep SQLite file, show generated SQL, table JOINs. | Usage | ? |
| Squawk | Python, custom SQL interpreter | Access log and CSV input, JSON and CSV output, Python code generation. | | Three-clause BSD |
| termsql | Python, SQLite 3 | Use header rows for column names, custom field separator (regexp), custom record separator (string literal), lines as columns, skip a given number of lines at the beginning and at the end, merge selected columns into one, HTML, CSV, SQL and Tcl output. | Manual | MIT |
| textql | Go, SQLite 3 | Use header rows for column names, keep SQLite file, custom input field separator (string literal). | Usage | MIT |

XML, HTML

| Name and link | Description |
| --- | --- |
| pup | Filter HTML pages using CSS selectors. Inspired by jq. |
| Saxon | Scrape XML and HTML data using XPath. Documentation. |
| tq | Retrieve content from HTML using CSS selectors. |
| xml2 | Convert XML and HTML to and from flat, greppable lists of "path=value" statements. |
| XMLStarlet | Transform, query, validate and edit XML documents. |

See also: Grep and Sed Equivalent for XML Command Line Processing on StackOverflow.

JSON

| Name and link | Description |
| --- | --- |
| jo | Create JSON objects from the shell. |
| jq | Create and manipulate JSON with a functional (as in "functional programming") DSL. Can convert JSON to other formats. |
| jshon | Create and manipulate JSON using getopt-style command-line options. |
| json2 | Convert JSON to and from flat, greppable lists of "path=value" statements. Modeled after xml2. |
| jsonaxe | A JSON processor similar to jq with a Python-based DSL. |
| json | Similar to jq but written in JavaScript. Can run arbitrary JavaScript on the JSON input. |
| json-table | Transform JSON data structures into tables of columns and rows for processing in the shell. |
| json.tool (Python 3 docs) | Validate and pretty-print JSON data. This module is part of the standard library of Python 2/3 and so is likely available wherever Python is installed. |
| jsonwatch | Track changes in JSON data live from the command line. Works like watch -d. |
| lobar | Explore JSON interactively or process it in batch with a wrapper for lodash.chain(). An alternative to jq with a JavaScript syntax. |
| validjson | Validate or pretty-print JSON data. |
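
Since json.tool ships with Python, it is often the quickest way to validate or pretty-print JSON without installing anything. A small sketch follows, assuming `python3` is on the `PATH`; the sample documents and the variable names `pretty` and `valid` are made up for illustration.

```shell
# json.tool reads standard input when no file argument is given and
# writes an indented version of the document to standard output.
pretty=$(printf '%s' '{"name":"jq","tags":["json","cli"]}' | python3 -m json.tool)
printf '%s\n' "$pretty"

# Invalid input makes the command exit with a non-zero status,
# which is what makes it usable as a validator in scripts.
if printf '%s' '{"name": }' | python3 -m json.tool >/dev/null 2>&1; then
    valid=yes
else
    valid=no
fi
printf 'second document valid: %s\n' "$valid"
```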

YAML, TOML

With a format converter like Remarshal (below) you can use JSON tools to process YAML and TOML, but take care not to lose data in the conversion (example).

| Name and link | Description |
| --- | --- |
| Remarshal | Convert between YAML, TOML and JSON. Validate or pretty-print each of the three formats. |
| shyaml | Read data from YAML files. Can output null-terminated strings for use in shell scripts. |
| validyaml | Validate or pretty-print YAML data. |

INI

| Name and link | Platform | License | Description |
| --- | --- | --- | --- |
| crudini | Any with Python 2.x | GNU GPLv2 | Set and remove properties in INI files. Retrieve properties as shell script commands to set the corresponding variables. Outputs updated INI data or changes files in place. |
| IniFile (DOS version) | Windows (x86, x86-64), MS-DOS | Closed-source freeware | Set and remove properties in INI files. Retrieve properties as a list of batch file set commands to set the corresponding variables. Changes files in place. |
| initool | Windows, Linux, FreeBSD | MIT | Set and remove properties in INI files and check for their existence. Outputs updated INI data. |

Configuration files

  • Augeas — extract data from and modify a number of file formats. Note that not all formats are equally well supported by Augeas and for some only a limited subset of all valid files can be parsed.
  • Elektra — manipulate configuration files. Shares Augeas' limitations when it comes to application-specific configuration files (it uses the same lenses) but has better support for generic formats such as JSON or INI.

Bonus round: CLIs for single-file databases

| Name | Description | File format |
| --- | --- | --- |
| GNU Recutils | "[A] set of tools and libraries to access human-editable, plain text databases called recfiles." | Text-based, roughly "key: value" |
| SDB | "[A] simple string key/value database based on djb's cdb disk storage and supports JSON and arrays introspection." | Binary |
| sqlite3(1) | "[A] simple command-line utility [...] that allows the user to manually enter and execute SQL statements against an SQLite database." | Binary |

License

The contents of this document are licensed under the Creative Commons Attribution 4.0 International License. By contributing you agree to release your contribution under this license.

Disclosure

Sqawk, jsonwatch, Remarshal and initool were written by the curator of this document.
