Unix table column and row manipulation using column names
pick
is an expressive low-memory command-line tool for manipulating text file tables.
Entire scripts can be replaced by concise command line invocations.
Pick allows database-style queries (select) and filters (where) on a single text file or stream using its column names (or indexes if no names are present). Columns can be selected, mapped, transformed and combined and rows can be filtered using conditions. Additionally output can be demuxed into different files.
pick
is robust and intuitive by supporting column names as handles.
It is lightweight as it processes data per-line without the need to load the table into memory.
It is expressive in that short command lines are sufficient to get at the data.
Note
For your benefit, miller (unix command mlr
)
is an amazing widely-used command-line tool for handling tables (using column names also), in an entirely
different league than pick in terms of capabilities. It is available in most
Linux distributions as a supported package.
Pick embodies, comparatively, an extremely minimalist approach with a different and greatly limited focus in the same problem space. Within its narrow focus on column manipulation and row selection it is very concise, has extensive support for SAM format, and has miscellaneous features such as simultaneous transformations of multiple columns, demultiplexing rows to different files, and mapping values using dictionaries. Think of it as one of these weirdly evolved deep-sea creatures (one that is pretty).
In simple to middling cases pick can avoid both the need for a script (R, awk, Python, Ruby et cetera) and
having to load the entire data set into memory.
I use it in conjunction with UNIX tools such as comm
, join
, sort
and datamash
to simplify file-based computational workflows
and make them more robust and understandable by promoting the use of column names as handles
(as opposed to column indexes as used with cut
and awk
). You can
- Use column names or column indexes to
- Select columns
- Change columns (using computation and string operations)
- Combine columns into new columns (using computation and string operations)
- Filter (or fork) rows on boolean clauses computed on columns
- Select multiple columns using ranges or regular expressions
- Take the same action on multiple columns using a lambda expression
- Split/demultiplex rows to different output files based on (computed) labels in columns
There is no downside, except, as ever, it comes with its own syntax for computation. For plain column selection and row filtering this syntax is not needed though; pick command lines look pleasant enough for common use cases.
Computation syntax is minimalist and terse, employing a stack language with just three types (variables, constants and operators).
In order to work as a command line tool, the pick
computation language does away with whitespace entirely.
On first sight it might look arcane or terrifying, requiring a long second look.
Compensating for the terse stack language, pick
's inner computation loop is simple and dependable.
Pick one or more columns
Pick columns and filter or select rows
Selecting based on numerical proximity
Syntax for computing new columns
Examples of computing new columns
Selecting and manipulating multiple columns with regular expressions, lists and ranges
Map column values using a dictionary
Ragged input
SAM and CIGAR support
Unique or counted values
Splitting, demultiplexing and forking rows across different outputs
Retrieving unique values and asserting the number of rows found
Miscellaneous
Escaping special characters
Maps can be useful to select or filter out data
Creating fasta and fastq files
Useful regular expression features
Applying the same action to each table entry
Loading data from the previous row
Loading a previous row within a group
Option processing
Pick options
Pick operators
Implementation notes
Pick one or more columns
Pick columns foo
and bar
from the file data.txt
. Order is as specified, the output
will contain a header with column names foo
and bar
.
pick foo bar < data.txt
Below
(1) pick columns bar
and foo
from data.txt
, in that order. With -h
the output header is dropped.
(2) Pick all columns excluding bar
and foo
.
(3) With -A
all columns are selected; this is useful when the goal is just to filter rows (see below).
(1) pick -h bar foo < data.txt
(2) pick -x bar foo < data.txt
(3) pick -A < data.txt
Columns can be picked using a regular expression for column names. This can be helpful for large tables. Quotes
are needed to prevent shell interpretation of characters that are special to the shell.
The following examples selects column zut
, columns with names that start with foo
followed by a digits
and columns that start with bar_
.
pick zut '^foo\d+$' '^bar_' < data.txt
A pattern that contains any of [({\*?^$
is assumed to be a regular expression rather than just a column name.
Pick allows use of regular expressions selection in various places.
Several pick column operators also use regular expressions.
Picking columns using indexes and index ranges
If no header is present indexes and index ranges can be used.
-k
implies the first row has no special meaning (as column names) and handles are 1-based indexes.
pick -k 5 3 7-9 < data.txt
The following index expressions are supported:
x column x
x-y columns from x to y
x- column x and all onward
'o+x-y*m' columns o+x to o+my with increments of m (quotes needed for *)
'x-y*m' columns mx to my with increments of m (quotes needed for *)
o+x-y columns o+x to o+y
Pick columns and filter or select rows
- Strings starting with
@
indicate a selection on one or two column values. - Selections can operate on computed columns and computed values that are not output (see further below).
- Selections are performed only after all computations are finished. Hence it is currently not possible to perform a computation conditionally on a selection.
- Selections can occur anywhere, even mixed in with column selections and computations. This will always be the case; new syntax will be required should a pre-compute selection feature be added.
Pick columns foo
and bar
, only taking rows where tim
fields are larger than zero.
multiple @
selections are possible; default is AND
of multiple clauses, use -o
for OR
.
tim
can refer to a newly computed variable (see below).
pick foo bar @tim/gt/0 < data.txt
where tim
is larger than the column value in zut
(the leading colon in :zut
indicates
that the value to compare to should be taken from column zut
):
pick foo bar @tim/gt/:zut < data.txt
It is possible for zut
to be a newly computed value derived from other (existing or computed) columns.
Further selection examples
where tim
is the string flub123
:
pick foo bar @tim=flub123 < data.txt
where tim
is NOT the string flub123
:
pick foo bar @tim/=flub123 < data.txt
where tim
matches the string flub123
:
pick foo bar @tim~flub123 < data.txt
where tim
matches the string flub
followed by zero or more digits:
pick foo bar @tim~flub'\d*' < data.txt
where tim
matches the string flub
followed by one or more digits:
pick foo bar @tim~flub'\d+' < data.txt
where the entirety of the tim
column value matches the string flub
followed by one or more digits,
and nothing else, by anchoring the regular expression:
pick foo bar @tim~^flub'\d+$' < data.txt
where tim
does not match the string flub
followed by one or more digits:
pick foo bar @tim/~flub'\d+' < data.txt
The full list of comparison operators:
= /= string identy select, avoid
~ /~ string (Perl) regular expression select, avoid
~eq~ ~ne~ ~lt~ ~le~ ~ge~ ~gt~ string comparison
/eq/ /ne/ /lt/ /le/ /ge/ /gt/ numerical comparison
/ep/ /om/ numerical proximity (additive, multiplicative)
/all/ /any/ /none/ bit selection
=
is for string identity, /=
is for string not equal to. These are shorthand
for ~eq~
and ~ne~
, respectively. ~
tests against
a perl regular expression, accepting matches, /~
tests against a perl regular
expression, discarding matches. /ep/
(epsilon) and /om/
(order of magnitude)
are described here.
By default comparison is to a constant value; in order to compare to a column
its name or index is used, preceded by a colon:
pick foo bar @tim/gt/:bob < data.txt
pick -k 3 5 @8/gt/:6 < data.txt
Selecting based on numerical proximity
Using epsilon and selecting within additive range
Select all rows where tim
is approximately 1.0. The default epsilon (maximum allowed
deviation) for this is 0.0001 but can be changed (see below).
pick -A @tim/ep/1.0 < data.txt
As above, but make epsilon more stringent (one in a million).
pick -A @tim/ep/1.0/0.000001 < data.txt
In this case, select rows where columns tim
and pat
are no further than one apart.
pick -A @tim/ep/:pat/1 < data.txt
Using order of magnitude and selecting within multiplicative range
The default order of magnitude is 2 but can be changed. Below selects rows
where column tim
is no larger than twice column pat
and column pat
is
no larger than twice column tim
, ignoring signs.
pick -A @tim/om/:pat < data.txt
Add @tim/gt/0
to additionally require the sign to be positive for example.
Change the order of magnitude by adding it as a parameter, in this case 1.01.
pick -A @tim/om/:pat/1.01 < data.txt
Syntax for computing new columns
Derived values, also known as computations can be
- output as a new column
- compared against with selection criteria
- used to break up computations into smaller parts
A computation is expressed in a stack language that has three types. These
are the column handle type, the constant value type
(a number or a string) and the operator type.
A column handle is either a column name or a column index if -k
is used.
Each of the three types is designated by and introduced by a specific character.
These are
- colon
:
for a column handle - caret
^
for a constant value (number or string) - comma
,
for an operator
Constant values and column handles are URL-decoded, hence the escape mechanism
for including any of the characters ^:,%
in a constant value or column handle is to url-encode them.
The following is an example of a computation:
:foo^144,add
is an expression that indicates the column named foo
, the number 144 and the add
operator.
The result of it is the sum of the value in the foo
column and 144.
Each computation needs a name. It can be thought of as a variable name. If the computation
is output as a new column the name will be used as the column name. The two forms are below,
where (1) newname
will not be output as a new column (but is still available e.g. for other computations or comparison)
and (2) newname2
will be output.
(1) newname1:=<compute>
(2) newname2::<compute>
Examples of computing new columns
In the example below the <compute>
part (with name doodle
) is yam:bob,sub^1,add
. It does not start with
either a colon, caret or comma.
By default the first part is always assumed to be a column handle unless a constant value or operator is found.
This particular compute puts two column values on the stack (for columns yam
and bob
), then subtracts
bob
from yam
, and adds 1 to the result. If the two columns denote inclusive bounds for an interval
then this will give the interval length.
In this example, the final output is the existing columns foo
, bar
and the new column doodle
.
pick foo bar doodle::yam:bob,sub^1,add < data.txt
By default pick
will refuse a compute for which the name clashes with an existing name.
Allowing such can be useful however if the goal is to update an existing column. This is facilitated by the -i
(in-place) option.
The example below selects all columns (-A
) and adds 1 to column foo
in-place.
pick -Ai foo::foo^1,add < data.txt
Once all operators are exhausted pick will concatenate everything that is still on the stack. Thus below
simply concatenates columns foo
and bar
.
pick -h ::foo:bar < data.txt
In several places pick is happy to accept empty strings. One example is the compute name.
Each compute needs an associated name that is unique (the part before ::).
If no compute name is specified pick will construct a unique name automatically, which
is useful if output column names are not required.
In this example pick
outputs the length of each field in the foo
column.
pick -h ::foo,len < data.txt | hissyfit
The automatic compute names are visible if neither -h
(no output header) nor -k
(additionally no input header) is specified.
Leaving out compute names is only sensible or useful in the presence of one of these two options.
The following example swaps two columns whilst retaining all other columns. This is just to illustrate how columns and compute names interact; a simpler way to do the same is shown after. Compute names are like normal variables, so to swap two values a third name is needed.
pick -Aki foo:=1 1::2 2::foo < data.txt
- -k implies no columns names are read, column handles are 1 2 3 ..
- -A selects all columns for output.
- -i is needed to allow overwriting existing columns 1 and 2.
- Assignments happen proceeding from left to right
- := computes a value without outputting it,
- :: computes a value and selects it for output.
A simpler way of doing the same is this:
pick -k 2 1 3- < data.txt
If you just want columns 2
and 1
in that order it only needs
pick -k 2 1 < data.txt
Selecting and manipulating multiple columns with regular expressions, lists and ranges
There are three modes of selecting/modifying multiple columns. Each is briefly introduced below, followed by more examples and explanation.
- Simply selecting multiple columns for output. Example usage
pick 'num\d{2}$' < data.txt
- Selecting multiple columns and reducing them to a single value by e.g. concatenation, taking the minimum or maximum, or adding all values. Examples of usage:
pick nummax::'num\d+$',maxall < data.txt # largest among all num[digit] columns
echo {1..20} | tr ' ' $'\t' | pick -k ::'.*',mulall # compute 20 factorial
echo {1..9} | tr ' ' '\t' | pick -iK sum-squared::'.*',addall,sq '.*':=__^3,pow sum-cubes::'.*',addall | column -t
# compute 1**3 + 2**3 + .. + 9**3 and
# (1+2+..+9)**2
-
Selecting multiple columns and executing the same operation on each column using a lambda expression. The parameter in pick lambda expressions is written
:__
. Each instance of it will be replaced by the column name, multiplexed over all selected columns. Below is a list of examples; another set can be found here.Multiple column selection and modification using a regular expression:
pick -i '^num\d{2}$'::__^1,add < data.txt
Multiple column selection and modification using a list:
pick -i foo:bar:zut::__^1,add < data.txt
Lists can take a mix of regular expressions and column names:
pick -i foo:bar:zut:'num\d+':'yay\d+'::__^1,add < data.txt
It is possible to rename the columns with a prefix and/or a suffix:
pick pfx/foo:bar:zut:'num\d+':'yay\d+'/sfx::__^1,add < data.txt
pick foo:bar:zut:'num\d+':'yay\d+'/sfx::__^1,add < data.txt
pick pfx/foo:bar:zut:'num\d+':'yay\d+'/::__^1,add < data.txt
With a regular expression, if parentheses are used then the outer group can be used to capture a single element to be used in renaming:
> echo -e "col01\tcol02\tcol03\n3\t4\t5" | pick x_/'^col(\d{2})$'/::__^1,add
x_01 x_02 x_03
4 5 6
It can be useful to have two version for each in a set of columns, for example
to present a column both as a percentage and as a count. If double slashes are
used pick
will include the original as well as the derived column:
> echo -e "a\tb\tc\n3\t4\t5" | pick '.*'//_pct::__:c^1,pct
a a_pct b b_pct c c_pct
3 60.0 4 80.0 5 100.0
It is possible to transform columns while keeping their old values around for
other use (e.g. filtering or computation). In this example the column values
are squared. The old columns are renamed by adding the suffix o
but are
withheld from output due to the use of :=
rather than ::
.
> echo -e "a\tb\tc\n3\t4\t5" | pick -i '.*'/o:=__ '.*'::__,sq oldsum::ao:bo:co,addall
a b c oldsum
9 16 25 12
Of note is that currently regular expression selection only works on the input columns
and does not take into account newly computed columns. Hence it is not possible
to specify the computation oldsum::ao:bo:co,addall
with a regex as 'oldsum:.o$,addall'
(although this can be achieved easily by piping pick output to a second pick invocation).
The order in which the above was specified is important. If the two computations are switched (with the column copy/rename coming last) then the copy will pick up the in-place-modified columns:
> echo -e "a\tb\tc\n3\t4\t5" | pick -i '.*'::__,sq '.*'/o:=__ oldsum::ao:bo:co,addall
a b c oldsum
9 16 25 50
Lambda expressions with index selection rather than column names
Lambda expressions work with -k
as well:
pick -k 3:5-8::__^1,add < data.txt
Regular expressions
A pattern that contains any of [({\*?^$
is assumed to be a regular expression rather than just a column name.
Use -F
(fixed) to prevent regular expressions being used.
Be careful with patterns in the compute part (as above). If the pattern starts with ^
(for start of string), it must be url-encoded as %5E
; otherwise it will be
interpreted as the pick
token introducing a constant value. The characters
^ : ,
have special meaning in the pick
stack language (see above) and must
be url-encoded.
Map column values using a dictionary
Dictionaries can be specified in different ways:
--fdict-NAME=/path/to/dictfile (key,value) = (col1, col2) (rows with two fields)
or (col1, 1) (rows with one field)
--cdict-NAME=foo:bar,zut:tim comma-separated key:value pairs
--cdict-NAME=foo,zut comma-separated keys, all set to value 1
--fasta-dict-NAME=/path/to/fastafile read ID->sequence mapping from fasta file
--fastq-dict-NAME=/path/to/fastqfile read ID->sequence mapping from fastq file
--table-dict-NAME=/path/to/tablefile read ID->column->item mapping from table file
NAME
is the name of the dictionary. Multiple dictionaries can be imported.
A dictionary is specified by its name for use with the map
operator or tmap
operator for table dictionaries.
A table dictionary uses the row names in the file as key,
and associates for each row name its column values by using the column name as key.
map
needs two keys; the first is the item to look up, the second is the NAME
of the dictionary to use.
tmap
needs a third key; the name of the column.
Multiple dictionary specifications can be used for the same NAME
.
echo -e "a\t3\nb\t4\nc\t8" | pick -Aik --cdict-foo=a:Alpha,b:Beta 1::1^foo,map
By default if no key is found in the dictionary the value is left alone. It is possible to specify a not-found string using this syntax:
--fdict-NAME/STRING=/path/to/dictfile
--cdict-NAME/STRING=foo:bar,zut:tim
--fasta-dict-NAME/STRING=/path/to/fastafile
--fastq-dict-NAME/STRING=/path/to/fastqfile
--table-dict-NAME/STRING=/path/to/tablefile
For example
echo -e "a\t3\nb\t4\nc\t8" | pick -Aik --cdict-foo/FOONOTFOUND=a:Alpha,b:Beta 1::1^foo,map
gives as output
Alpha 3
Beta 4
FOONOTFOUND 8
You could grep that value, or use pick itself to select or filter such columns, e.g. below shows an idiomatic way to find rows where a column value is not part of a limited set of prescribed values.
- the
-i
(in-place) option is dropped. - dictionary values are not specified and thus set to 1 by pick.
- the dictionary is given the name
foo
, refered to later by themap
operator. - the mapped values in column 1 are put in variable
check
. check
is set to zero if the field incol1
is not found in the dictionary.check
is not output (:=
instead of::
).- Those rows are selected where
check
has value 0 (not found).
echo -e "col1\tcol2\na\t3\nb\t4\nc\t8" | pick -A --cdict-foo/0=a,b check:=col1^foo,map @check=0
col1 col2
c 8
Use --fdict-dictNAME/STRING=FILENAME
if you want to read the dictionary values from file instead.
Ragged input
Ragged input (rows with varying number of columns, such as possible with SAM format) can be processed by using the option -O<NUM>
and requires additionally the -k
parameter. With this type of input column names are not supported.
At most <NUM>
columns are consumed in each row. Excess fields in the row will be concatenated
onto the last consumed column. If the input row has fewer than <NUM>
fields additional empty fields
will be added (and output e.g. if -A
is used).
For SAM input just use either --sam
or --sam-h
(the latter will output the SAM header if present).
See below for more information about SAM and CIGAR support.
SAM and CIGAR support
Use --sam
or --sam-h
if the input is SAM format. This will set the options -k
(headerless input) and -O11
(overflow columns collated in column 11) and make the sequence lengths available in the seqlen
dictionary (if the sam header is found). If the output should still contain the SAM header, use --sam-h
.
It is possible to load sequences to combine (and excise subsequences from) with SAM input by
using either --fasta-dict-NAME
or --fastq-dict-NAME
. With this
(where NAME
can be freely chosen) a sequence can be retrieved from the read name field (column 1 in SAM format) with
::1^NAME,map
To obtain the left- and right-clipped (non-aligned) sequences as well as the matched part from the read (see below
for operators such as qryclipl
):
SEQ:=1^SEQ,map \
leftclip::SEQ^0,qryclipl,substr \
matchedpart::SEQ,qryclipl,qrycov,substr \
rightclip::SEQ,qryclipr,dup,neg,xch,substr \
When using either of --sam
or --sam-h
pick makes several new operators available that compute certain alignment-related
offsets and widths. The following table lists these shorthand operators, along with
a more verbose and obtuse/obsolete pick equivalent using older operators (still available).
Shown below are simple computes with just a single operator used. Obviously these
can be combined in various ways.
With these operators pick can be used to efficiently filter alignments, for example removing those that do not start near expected primer sites (see below). Other applications include the computation and extraction of quantities for quality control.
using --sam without using --sam or --sam-h
or --sam-h
---------------------------------------
qs::,qrystart qs::6,cgqrystart query start, 1-based
qe::,qryend qe::6,cgqryend query end 1-based, inclusive
qc::,qrycov qc::6,cgqrycov amount of bases covered by alignment in query
ql::,qrylen qc::6,cgqrylen query length
rs::,refstart rs::4 reference start, 1-based
re::,refend re::4:6,cgrefcov,add reference end, 1-based, inclusive
re::,refcov rc::6,cgrefcov amount of bases covered by alignment in reference
rl::,reflen rl::3^seqlen,map reference length
qcl::,qryclipl (omitted) Number of 5p trailing query bases [sam]
qcr::,qryclipr (omitted) Number of 3p trailing query bases [sam]
rcl::,refclipl (omitted) Number of 5p trailing reference bases [sam]
rcr::,refclipr (omitted) Number of 3p trailing reference bases [sam]
Make sure to use samtools view -h
to include header information so that reflen
is available.
Should a sequence name not be found in the seqlen
dictionary the value 0
is returned for the sequence length.
In this case pick currently issues an error only if reflen
is used (not in case 3^seqlen,map
is used).
To require alignment to be proximal within 20 bases to primer sites, use e.g.
mark5p=123 # your value here
mark3p=1234 # your value here
samtools view -h <bamfile> | pick --sam-h -A delta5p:=,refstart^$mark5p,sub delta3p:=^$mark3p,refend,sub @delta5p/le/20 @delta3p/le/20
Pick has a few other/older operators that support parsing of SAM columns. For now this pertains specifically to the CIGAR
string in the sixth column. Below <cigaritems>
is a user-defined subset of MINDSHP=X
, the different
alignment types supported by CIGAR strings
(respectively alignment match, insertion in reference, deletion from reference, skip from reference,
soft-clip, hard-clip, padding, sequence match, sequence mismatch).
The operators are
<cigarstring>
<cigaritems>
cgsum
Count the total number of bases covered by all alignment types in <cigaritems>
.
<cigarstring>
<cigaritems>
cgmax
Returns the size of the longest stretch of bases across all alignment types in <cigaritems>
.
<cigarstring>
<cigaritems>
cgcount
Returns the number of events across all alignment types in <cigaritems>
.
<cigarstring>
cgqrycov - still supported but qrycov operator prefered.
The number of bases in query covered by this alignment; the sum of all events in MI=X
.
<cigarstring>
cgqryend - still supported but qryend operator prefered.
The end of the alignment in query (1-based).
<cigarstring>
cgqrylen - still supported but qrylen operator prefered.
The length of query, the sum of all events in MIS=X
.
<cigarstring>
cgqrystart - still supported byut qrystart operator prefered.
The start of the alignment in query (1-based).
<cigarstring>
cgrefcov - still supported but refcov operator prefered.
The number of bases in reference covered by this alignment; the sum of all events in MDN=X
.
You can use the get
operator (<value> <regex> get
)
to retrieve information from the concatenated fields in picks last input column.
Splitting, demultiplexing and forking rows across different outputs
Pick can be used to split or demultiplex output into different files. Use e.g. this combination, where NAME
is
of your choice:
--demux=NAME NAME:=sampleid^.txt
This tells pick to use a row's NAME
column as the file name to write the row to, where
NAME
can be any column (input or computed).
In this example NAME
is a computed column that is not output, where the filename is
formed from the value in the sampleid
column with a .txt
suffix added to it.
Pick will recognise file names ending in .gz
or .gzip
and in that case compress the output using gzip.
The next example splits the input into chunks of size 1000
, retaining the header for each,
with output names defined in the S
column as split<N>.txt
, where <N>
are zero-padded batch numbers.
pick -A --demux=S S:=^split,r0wno^1000,idiv^4,zp^.txt < data.txt
File Written Filtered
split0001.txt 1000 0
split0003.txt 1000 0
split0002.txt 1000 0
split0000.txt 384 0
If --demux
is used pick will output on STDERR
a table of output files and tallies of
how many rows each file contains, as well as how many were deselected.
The set of all output files will always correspond to the full set of unique values accumulated
over the <NAME>
column across all input rows, regardless of whether a row is deselected or not.
Hence, in the presence of selection, demux files may contain zero data rows.
Demux output files have or do not have a header line in line with the -k
and -h
options,
just like normal output.
A separate and compatible forking mechanism exists that allows sending of any de-selected row (i.e. one that
does not satisfy the @
selection criteria) to a specified file name. This is achieved with
--other=<FILENAME>
These two mechanisms can be used simultaneously. Similar to demuxing, a file name ending in .gz
or .gzip
causes the file to be compressed using gzip
.
Retrieving unique values and asserting the number of rows found
If the input is queried for a value that should be present and unique, you can do pick let the checking
by passing -E1
. More generally -E<NUM>
will exit with an error if the number of rows found is
different from <NUM>
.
Miscellaneous
Escaping special characters
Some uses of pick, especially involving computation, may require characters with special meaning either to the shell or to pick to be escaped. For the shell aspect this is usually possible simply by using single quotes. For pick the mechanism used is url-encoding, and this can equally be used for characters with special meaning to the shell.
A url-encoded character is written as a percent sign followed by two hexadecimal digits (a hexadecimal
digit is one of 0123456789ABCDEF
), for example %0A
for <NEWLINE>
. A list of useful cases (note that
lower case versions of these are allowed too):
^ %5E ; %3B ( %28 <TAB> %09
: %3A ! %21 ) %29 <NEWLINE> %0A
, %2C / %2F < %3C <CR> %0D @ %40
% %25 \ %5C > %3E <SPACE> %20 = %3D
Use pick -z
to show this list, use pick -z <string>
to url-encode string, and pick -zz <string>
to url-decode <string>
.
The characters = / , : ^
require url-encoding in certain contexts as they are used as pick syntax:
^
,:
,,
and=
are used in computation syntax./
,:
and,
are used in map specifications using--cdict-NAME=k1:v1,k2:v2
.
These will be URL-decoded:
- Column names specified on the command line, including regular expressions expanding to column names.
- For a computation
<name>::<compute>
, both<name>
and any constants and names found in<compute>
. - In selection filters
@<name><op><:name|constant>
both<name>
and<:name|constant>
. - In
--cdict-NAME/default=k1:v1,k2:v2
all keys (k1
etc) and values (v1
etc).
Maps can be useful to select or filter out data
Direct filtering of data based on information in the table is not always possible. In some cases an external list has been computed that contains identifiers for which the rows should be deleted or retained. This is generically done like this:
pick -A --fdict-DEL/___=delete-file.txt DEL:=myid^DEL,map @DEL=___ < data.txt > reduced-data.txt
Here a temporary column DEL
is computed that contains the value in the myid
column mapped
using the keys in the file delete-file.txt
. If the value is not to be deleted it is set to the
default value ___
. Finally those rows are chosen where DEL
has that value ___
.
Of note is that DEL
is used here in two ways, once as the name of the map, and once as the name
of a column - these are entirely different namespaces. The following is equivalent:
pick -A --fdict-DELMAP/___=delete-file.txt DELCOL:=myid^DELMAP,map @DELCOL=___ < data.txt > reduced-data.txt
In any case, DEL
or DELMAP
and DELCOL
are never seen in the output, the string used
should be chosen to make the command line more legible. Similarly, the string ___
can be chosen
by the user. It should not be among the values assigned to the keys in delete-file.txt
, noting
that the default value used by pick is 1
.
The pick
invocation if keys need to be retained is very similar, only in this case we keep
those rows where the mapped value is not the same as the default (not found) value.
pick -A --fdict-KEEP/___=keep-file.txt KEEP:=myid^KEEP,map @KEEP/=___ < data.txt > reduced-data.txt
Creating FASTA and FASTQ files
Create FASTA files with pick. The operator ,fasta
makes this easy.
Previously one needed, assuming identifier and sequence are stored in key
and sequence
(quotes needed as >
is special to the shell),
pick -h '::^>:key^%0A:sequence' > out.fa
This has now been simplified to
pick -h ::key:sequence,fasta > out.fa
The ,fasta
operator requires two string values on the stack. To add further
annotation to the identifier line construct the required sequence of strings
and then apply for example ,catall
. The example below constructs a template
(zut=#)
and then replaces the placeholder character #
with the column/variable zut
.
The template is quoted to avoid shell interpretation of parenthesis and hash sign.
pick -h ::key^' (zut=#)^#':zut,ed,catall:sequence,fasta > out.fa
The ,fastq
operator works in exactly the same way as the ,fasta
operator.
The result is a FASTQ record, where the quality string is currently always set to Z
in every position.
Useful regular expression features
-
Use
\K
(keep) to anchor a pattern but retain it withed edg del delg
, e.g.:HANDLE^'patx\Kpaty',delg
will retain patternpatx
and only delete patternpaty
.Example:
> echo -e "a\nthequickbrownfox theslowbrownbear" | pick -h ::a^'quick\Kbrown',delg
thequickfox theslowbrownbear
-
Use
patx(?=paty)
to anchorpatx
topaty
without includingpaty
in the matched part.:HANDLE^'patx(?=paty)',get
will just fetchpatx
.Example:
> echo -e "a\nthequickbrownfox theslowbrownbear" | pick -h ::a^'brown(?=bear)',delg
thequickbrownfox theslowbear
Such patterns can be combined - here either of the two is considered match:
> echo -e "a\nthequickbrownfox theslowbrownbear" | pick -h ::a^'brown(?=bear)|quick\Kbrown',delg
thequickfox theslowbear
- Use
(?i)pat
to make a pattern case insensitive.
Applying the same action to each table entry
The recipes below can be limited to a set of columns by
using regular expressions, lists and ranges.
In these examples all column names are selected with the regular expression '.*'
that will match any string of at least one character.
The in-place option -i
is needed as input columns are changed and output under the same name.
Increment each entry by one:
pick -i '.*'::__,incr < data.txt
Format each entry to have two digits behind the decimal comma:
pick -i '.*'::__^2,dd < data.txt
Format each entry in scientific notation with five significant digits:
pick -i '.*'::__^5,sn < data.txt
Remove leadinig and trailing whitespace (%5E
encodes beginning of line ^
, here needed as ^
indicates
a constant in pick computations):
pick -i '.*'::__^'(%5E\s+|\s+$)',delg < data.txt
Loading data from the previous row
To cache/store the previous row use one of
--pstore
--pstore/<LIST>
--pstore/<LIST>/<DEFAULT>
--pstore//<DEFAULT>
Fields from the previous row are then available to load with ^colname,pload
.
If specified, <LIST>
should be a comma-separated string of key-value pairs themselves
separated by a colon; all keys and values will be URL-decoded. The keys should be column names;
the values will be used to initialise the fields of the predecessor of the first row.
If <DEFAULT>
is specified it is used for all columns not yet named.
Example (compute the first ten Fibonacci numbers):
yes | head | pick -k --pstore/x:1,y:0 x::^y,pload y::x^x,pload,add
Loading a previous row within a group
This functionality is an extension of the general caching mechanism (--pstore
in the previous section). With either of
--group=<COLNAME>
--group-first-ref=<COLNAME>
pick recognises groups of consecutive rows where column <COLNAME>
has the same value.
The first row of such a group is always skipped (after computation, before output). Each subsequent row of the
group can load column values from a reference row using pload
.
With --group
the reference row is simply the previous row.
With --group-first-ref
the first (skipped) row is the reference row.
If there are no consecutive rows in the input where <COLNAME>
assumes the same value then all rows will be skipped.
Below groups based on the value found in column gene
, then retrieves the previous exon end coordinate
and the current exon start coordinate, increments the former and decrements the latter, thus
outputting intron coordinates.
pick --group=gene intron_start::^exon_end,pload,incr intron_end::exon_start,decr < data.txt
Option processing
Single-letter options can be combined or specified separately. The offset for -O
(ragged input), optional offset for -A
(insertion of new columns) and -E
expected result count are accommodated, so e.g. -kA2O11
will be understood by pick.
The option for purging lines with a certain pattern /<pat>
and the option for passing through
lines with a certain pattern //<pat>
can be tagged on at the end, e.g. -kA2/#
.
Pick options
-
-l
print a table of all pick operators. -
-l <str>
as above, limited to operators in sections matching<str>
.
Available sections arearithmetic bitop devour dictionary format input math output precision regex sam stack string
. -
-H
summary of pick syntax. -
--sam
/--sam-h
Expect sam input, pass through sam header (--sam-h
).
These options effectively set-k -O11
,-/^@
(--sam
) or-//^@
(--sam-h
) and additionally store sequence lengths in theseqlen
dictionary if the input contains a sam header. -
-h
do not print header -
-k
headerless input, use 1 2 .. for input column names,x-y
for range fromx
toy
. -
-o
OR multiple select criteria (default is AND) -
-x
take complement of selected input column(s) (works with-i
) -
-c
only output the count of rows that pass filtering -
-i
in-place:<HANDLE>::<COMPUTE>
replaces<HANDLE>
if it exists -
-/<pat>
skip lines matching<pat>
; use e.g.-/^#
for commented lines,-/^@
for sam files -
-//<pat>
pass through lines matching (allows perl regular expressions, e.g.^ $ . [] * ? (|)
work. -
-v
verbose -
-A
print all input columns (selecting by colspec applies, -T
accepted) -
-A<N>
<N>
integer; insert new columns at position<N>
. Negative<N>
is relative to rightmost column. -
-O<N>
<N>
integer; allow ragged input (e.g. SAM use-O11
), merge all columns at/after position<N>
-
-E<N>
<N>
integer; expect rows returned, exit with error if this is not the case. -
-P
protect against 'nan' and 'inf' results (see-H
for environment variablesPICK_*_INF
) -
-Z
as-P
, discard rows that have items that need protecting -
-K
headerless input as-k
but use derived names to output column names -
-U
with-k
and-K
keep output columns unique and in original order -
-R
add_
column variable if no row name field exists in the header. Note: an empty field is recognised and mapped to_
automatically. -
-f
force processing (allows both identical input and output column names) -
-F
fixed names; do not interpret names as regular expressions. Default behaviour is to assume a regular expression if a name contains one of[ { ( \ * ? ^ $
. -
-z ARG+
print url-encoding ofARG+
(no argument prints a few especially useful cases) -
-zz ARG+
print url-decoding ofARG+
-
--inf=<str>
Set divide-by-zero result to<str>
-
--add-inames=<csv>
,--inames=<csv>
comma-separated values to use as column names instead of actual column names. The list must cover all columns in the input. Names that are used in selection, compute and filter expressions must be picked from this list. Output names are from the list. If using-k
then--inames=CSV
provides temporary handles; use--add-inames=CSV
to add them to the output. -
--onames=<csv>
Override output column names to be taken from comma-separated values. -
--idx-list
Output list of selected indexes.
--name-list
Output list of selected column names.
These can be combined. If combined, indexes will always come before names (transpose the output to obtain an index-name mapping). Pick will exit if any of them is used. -
--version
Output version number; outputs a corresponding git tag and date tag. The aim is for this to be the git tagx
of commitx
that is prior to commity
that insertedx
into the pick version tag. I'm not quite sure how well this executes the idea of an informative and lazy version numbering system. -
--pstore
--pstore/<LIST>
--pstore/<LIST>/<DEFAULT>
--pstore//<DEFAULT>
Use one of these to load data from the previous row.
Pick operators
Pick supports a wide range of functionality. Standard arithmetic, bit
operations and a number of math functions are provided (see below). It is also possible
to match and extract substrings using Perl regexes (as a derived value or new
column) with get
, change an existing column using a regex with ed
and
edg
, compute md5 sums, URL-encode and decode, convert to and from binary,
octal and hex, reverse complement DNA/RNA, and extract statistics from cigar
strings. Display options include formatting of fractions and percentages
and zero padding of integers.
For an idea of the possibilities you could look at the Makefile in the test directory, although it is more geared towards tests of selection of and operations on multiple columns.
The documentation is output when given -H
(-h
is the option to prevent
output of column names) or -l
for a table of operators, also supplied below.
Arithmetic: add addall catall decr div idiv incr max maxall min minall mod mul mulall pow sub
Bit operators: and or xor
Stack devourers: addall gmeanall hmeanall joinall maxall meanall minall mulall
Dictionary: map
Formating: binto dd frac hexto md5 octto pct pml sn tobin tohex tooct urldc urlec
Input: binto hexto lineno md5 octto rowno r0wno urldc urlec
Math: abs ceil cos dd exp exp10 floor int log log10 sign sin sn sq sqrt tan
Output: md5 urldc urlec zp
Precision: dd frac pct pml sn
Regular expressions: del delg ed edg get
Sam file support: qrystart qryend qrycov qrylen refstart refend refcov reflen cgsum cgmax cgcount
Use --sam
(sam input) or --sam-h
(additionally copy/output sam header) to activate these operators
and set various pick options.
Stack control: dup pop xch
String manipulation: cat del delg ed edg get joinall lc len map md5 rc rev substr uc uie urldc urlec
Below is the table pick supplies when given -l
.
Operator Consumed Produced Description
--------------------------------------------------------------------------------
F0 - F[0] First input column [demo]
abs x abs(x) Absolute value of x [math]
add x y x+y Add x and y, sum, addition [arithmetic]
addall * sum(Stack) Sum of all entries in stack [arithmetic/devour]
and x y x and y Bitwise and between x and y [bitop]
binto x x' Read binary representation x [input/format]
cat x y xy Concatenation of x and y [string]
catall * Stack-joined Stringified stack [string/devour]
ceil x ceil(x) The ceil of x [math]
cgcount c s Count of s in c Count of s items in cigar string c [string/sam]
cgmax c s Max of s in c Max of lengths of s items in cigar string c [string/sam]
cgqrycov c qrycov Count of query bases covered by cigar string c (MI=X events) [string/sam]
cgqryend c qryend Last base considered aligned in query for cigar string c [string/sam]
cgqrylen c qrylen Length of query (MIS=X events) in cigar string c [string/sam]
cgqrystart c qrystart First base considered aligned in query for cigar string c [string/sam]
cgrefcov c refcov Count of reference bases covered by cigar string c (MDN=X events) [string/sam]
cgsum c s Sum of s in c Sum of lengths of s items in cigar string c [string/sam]
cos x cos(x) Cosine of x [math]
dd x N x' Floating point x printed with N decimal digits [math/format/precision]
decr x x-- x decremented by one [arithmetic]
del x p x =~ s/p// Delete pattern p in x [string/regex]
delg x p x =~ s/p// Globally delete pattern p in x [string/regex]
div x y x/y Division, fraction, (cf -P and PICK_DIV_INF) [arithmetic]
dup x x x Duplicate top entry x [stack]
ed x p s x =~ s/p/s/ Substitute pattern p by s in x [string/regex]
edg x p s x =~ s/p/s/g Globally substitute pattern p by s in x [string/regex]
exp x e**x Exponential function applied to x [math]
exp10 x 10^x 10 to the power of x [math]
floor x floor(x) The floor of x [math]
frac x y N x/y Division, fraction x/y with N decimal digits (cf -P and PICK_DIV_INF) [precision/format]
get x r r-match-of-x If x matches regex r take outer () group or entire match, empty string otherwise (cf uie) [string/regex]
gmeanall * gmean(Stack) Geometric mean of all entries in stack, multiplication [arithmetic/devour]
groupi - x Push within-group offset x onto stack [input]
groupno - x Push group number x onto stack [input]
hexto x x' Read hex representation x [input/format]
hmeanall * hmean(Stack) Harmonic mean of all entries in stack [arithmetic/devour]
idiv x y x // y Integer division, divide (cf -P and PICK_DIV_INF) [arithmetic]
incr x x++ x incremented by one [arithmetic]
int x int(x) x truncated towards zero (do not use for rounding) [math]
joinall * s Stack-joined-by-s Stringified stack with s as separator [string/devour]
lc x lc(x) Lower case of x [string]
len x len(x) Length of string x [string]
lineno - x Push file line number x onto stack [input]
log x log(x) Natural logarithm of x [math]
log10 x log10(x) Logarithm of x in base 10 [math]
log2 x log2(x) Logarithm of x in base 2 [math]
map x dname map-of-x Use map of x in dictionary dname (if found; cf --cdict-dname= --fdict-dname=) [string/dictionary]
max x y max(x,y) Maximum of x and y [arithmetic]
maxall * max(Stack) Max over all entries in stack [arithmetic/devour]
md5 x md5(x) MD5 sum of x [string/format/input/output]
meanall * mean(Stack) Mean of all entries in stack [arithmetic/devour]
min x y min(x,y) Minimum of x and y [arithmetic]
minall * min(Stack) Min over all entries in stack [arithmetic/devour]
mod x y x mod y x modulo y, remainder [arithmetic]
mul x y x*y Multiply x and y, multiplication, product [arithmetic]
mulall * product(Stack) Product of all entries in stack, multiplication [arithmetic/devour]
neg x -x The sign-reversed value of x [math]
octto x x' Read octal representation x [input/format]
or x y x or y Bitwise or between x and y [bitop]
pct x y N pct(x/y) Percentage of x relative to y with N decimal digits (cf -P and PICK_DIV_INF) [precision/format]
pload c prevrow[c] Field of column c in the previous row [state]
pml x y N pct(x/y) Promille of x relative to y with N decimal digits (cf -P and PICK_DIV_INF) [precision/format]
pop x - Remove top entry x from stack [stack]
pow x y x**y x raised to power y [arithmetic]
r0wno - x Push current table (start zero) row number x onto stack [input]
rc x rc(x) Reverse complement [string]
rev x rev(x) String reverse of x [string]
rot13 x rot13(x) Rot13 encoding of x [crypto]
rowno - x Push current table (start one) row number x onto stack [input]
sign x sign(x) The sign of x (-1, 0 or 1) [math]
sin x sin(x) Sine of x [math]
sn x N x' Floating point x in scientific notation with N decimal digits [math/format/precision]
sq x x^2 Square of x [math]
sqrt x sqrt(x) Square root of x [math]
sub x y x-y Subtract y from x, subtraction [arithmetic]
substr x i k x[i:i+k-1] Substring of x starting at i (zero-based) of length k [string]
tan x tan(x) Tangens of x [math]
tobin x x' Binary representation of x [format]
tohex x x' Hex representation of x [format]
tooct x x' Octal representation of x [format]
uc x uc(x) Upper case of x [string]
uie x y x-or-y Use x if not empty, otherwise use y [string]
urldc x urldc(x) Url decoding of x [string/format/input/output]
urlec x urlec(x) Url encoding of x [string/format/input/output]
xch x y y x Exchange x and y [stack]
xor x y x xor y Bitwise exclusive or between x and y [bitop]
zp x N x' x left zero-padded to width of N [output/string/format]
These are additionally available if --sam
is supplied:
Operator Consumed Produced Description
--------------------------------------------------------------------------------
alnmatch - refmatch Amount of reference/query matched by alignment (ignoring indels) [sam]
cgcount c s Count of s in c Count of s items in cigar string c [string/sam]
cgmax c s Max of s in c Max of lengths of s items in cigar string c [string/sam]
cgqrycov c qrycov Count of query bases covered by cigar string c (MI=X events) [string/sam]
cgqryend c qryend Last base considered aligned in query for cigar string c [string/sam]
cgqrylen c qrylen Length of query (MIS=X events) in cigar string c [string/sam]
cgqrystart c qrystart First base considered aligned in query for cigar string c [string/sam]
cgrefcov c refcov Count of reference bases covered by cigar string c (MDN=X events) [string/sam]
cgsum c s Sum of s in c Sum of lengths of s items in cigar string c [string/sam]
fasta i s fasta format ID and sequence in FASTA format [sam]
fastq i s fastq format ID and sequence in FASTQ format [sam]
qryclipl - qryclipl Number of 5p trailing query bases [sam]
qryclipr - qryclipr Number of 3p trailing query bases [sam]
qrycov - qrycov Span of query covered by alignment [sam]
qryend - qryend Last base in query covered by alignment [sam]
qrylen - qrylen Length of query sequence [sam]
qrystart - qrystart Start of alignment in query [sam]
refclipl - refclipl Number of 5p trailing reference bases [sam]
refclipr - refclipr Number of 3p trailing reference bases [sam]
refcov - refcov Span of reference covered by alignment [sam]
refend - refend Last base in reference covered by alignment [sam]
reflen - reflen Length of reference sequence (requires samtools view -h) [sam]
refstart - refstart Field 4 from sam format [sam]
Implementation notes
Pick is currently implemented in Perl, a language not as popular as it once was. Nonetheless for data/record munging and manipulation Perl is a formidable competitor. In particular pick benefits from the power of perl regular expressions (regexes); these can be used as pick selection and modification operators on the command line. Perl's support for regexes is built deeply into the language. I've been pleasantly surprised by the seamlessness and ease of its treatment of command-line strings as regexes. Some useful regex features are described here.
Pick additionally benefits greatly from Perl's mechanisms for number/string and string/number conversion. Some interesting insights into Perl's data type conversions.
This implementation compiles all references to column names into array offsets. It has no hash lookups during the core computation and output loop. Each computation is stored as a stack with code references where needed. I see no drastic improvements available in pure perl, but I'd love to be wrong about this (unwrapping the code references may lead to some speed-up but code modularity would suffer).
It is tempting to implement pick in C or Rust to get a speed boost. However,
reinventing an integer/float/string equivalence system (with its many niggling
corner cases) from scratch does not seem right (where's the C library for that?).
Below gives a rough indication of pick speed relative to baseline perl speed;
the latter is measured as a skeleton loop over lines of input with each line split into fields.
The timings can be perfomed by running make time
and make time2
in the test directory.
Timings of comparisons and compute, no output
███▋ 87 perl one comparison
███▊ 92 perl two comparisons
█████████▌ 230 pick one comparison
████████████▉ 309 pick two comparisons
████████████████▌ 398 pick one compute (addition)
███████████████████████████ 649 pick two computes (addition)
██████████████████████████████████▎ 824 pick three computes (addition)
███████████████████████████████████████████▏ 1036 pick four computes (addition)
█████████████████████████████████████████████████████▍ 1283 pick five computes (addition)
████████████████████████████████████████████████████▏ 1251 pick five computes (multiplication)
███████████████████████████████████▌ 853 pick one compute, (five add operators)
Timings of output and compute
███▋ 87 perl print none
███▌ 85 perl print one
█████ 121 perl print all
█████▌ 134 perl print all, add column (addition)
██████ 146 pick print none
████████▊ 210 pick print one
████████████ 288 pick print all
█████████████████████████▋ 617 pick print all, twice
███████████████████████████████████████▉ 958 pick print all, thrice
████████████████████████▌ 589 pick print all, add column (addition)
█████████████████████████▌ 613 pick print all plus compute
████████████████████████████████████ 864 pick print all plus long compute
███████████████████████████████▊ 764 pick print all plus long compute shortcut