gretl - Graph evaluation toolkit

Description

gretl is a tool for basic graph statistics using GFA format input. The statistics are based on nodes, edges, and paths. Walks are also supported, but are represented as paths internally. Many commands do not work without path/walk information.

Requirements on GFA file:

GFA format v1.0, v1.1 or v1.2.
Node IDs must be numerical.

Comment:

Sorted node IDs are not required, but all "Jump" related statistics will be based on the order of the nodes in the GFA file. Check this paper for more information. Run "odgi sort -O" to sort the graph in pan-genomic order.
We recommend dense node IDs, starting at 1 and ending at the number of nodes. This is more memory efficient on multiple levels.
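
The two requirements above can be checked before running gretl. The following is a minimal Python sketch, not part of gretl: the function name check_node_ids is hypothetical, and "dense" is read here as "IDs are exactly 1..N without gaps", which is an assumption about the recommendation.

```python
def check_node_ids(gfa_lines):
    """Return (all_numeric, dense) for the segment IDs found in GFA S lines."""
    ids = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) > 1:
            ids.append(fields[1])
    all_numeric = all(i.isdigit() for i in ids)
    if not all_numeric or not ids:
        return all_numeric, False
    numbers = sorted(int(i) for i in ids)
    # Assumption: dense means 1, 2, ..., N with no gaps and no duplicates.
    dense = numbers == list(range(1, len(numbers) + 1))
    return all_numeric, dense


example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tGG",
    "S\t4\tT",
    "L\t1\t+\t2\t+\t0M",
]
print(check_node_ids(example))  # (True, False): numeric, but ID 3 is missing
```

If the first value is False, the graph needs to be converted (see ID2INT below) before any other command is used.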

Installation:

Git

git clone https://github.com/MoinSebi/gretl  
cd gretl   
cargo build --release  
./target/release/gretl

Testing

We provide a small test suite to test the basic functionality of the tool. If you are interested in output format, check the data/test/yeast/ directory after running the following command.

cargo test

Usage

Stats

Calculate statistics on GFA file. A list of all stats can be found here. Please consider using the --pansn option to group the paths by sample. Read more information about PanSN-spec here.

Available options:

-bins Adjust number and size of bins. Histogram-like statistics which classify nodes by their length into bins.
-path Report statistics for each path in the graph.
-y Report output in YAML format (default is tsv).

Graph statistics also include "hybrid" statistics, which are the average and standard deviation of all path statistics. All hybrid stats have the prefix "Path". A full list of all statistics can be found in the paper directory in this repository.
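
To illustrate what a hybrid statistic is, here is a small Python sketch that reduces one per-path statistic to its average and standard deviation. The function name, the output labels, and the choice of the population standard deviation are assumptions made for this example; gretl's exact labels and formula may differ.

```python
import statistics


def hybrid_stats(path_values, name):
    """Summarize one per-path statistic as average and standard deviation."""
    values = list(path_values.values())
    return {
        # Labels are hypothetical; the source only states the "Path" prefix.
        f"Path {name} (average)": statistics.mean(values),
        # Assumption: population standard deviation.
        f"Path {name} (std)": statistics.pstdev(values),
    }


# Hypothetical per-path values (e.g. number of nodes per path).
per_path = {"sampleA": 2, "sampleB": 4, "sampleC": 6}
print(hybrid_stats(per_path, "nodes"))
```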

Example


./gretl stats -g /path/to/graph.gfa -o /path/to/output.txt

Result

TSV or YAML file with statistics
Merge the output of multiple graphs to compare them.
Example comparison: plot
Example output

ID2INT

Convert any string-based node identifier to numeric values. Use odgi sort to sort the graph in pan-genomic order, which will create more meaningful statistics in gretl stats (see above). Either way, numerical node IDs are required by every gretl command.

Available options:

-d, --dict <dict> Write a dictionary with new and old IDs to a plain text file.

Example

./gretl id2int -g /path/to/graph.gfa -o /path/to/output.gfa -d /path/to/dict.txt

Result:

GFA file with numerical node IDs

Comment: This function will convert all IDs in the graph. Additional data in tags will not be converted.
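
The idea behind the conversion can be sketched in a few lines of Python. This is an illustration only and not gretl's implementation: the function name is hypothetical, IDs are assigned in order of first appearance in S lines (an assumption), and only S, L and P lines are handled (W lines and tags are left untouched).

```python
def id2int(gfa_lines):
    """Rename segment IDs to 1..N and return (new_lines, old_to_new)."""
    mapping = {}
    for line in gfa_lines:
        fields = line.split("\t")
        if fields[0] == "S":
            # First appearance in an S line decides the new numeric ID.
            mapping.setdefault(fields[1], str(len(mapping) + 1))

    out = []
    for line in gfa_lines:
        fields = line.split("\t")
        kind = fields[0]
        if kind == "S":
            fields[1] = mapping[fields[1]]
        elif kind == "L":
            fields[1] = mapping[fields[1]]
            fields[3] = mapping[fields[3]]
        elif kind == "P":
            # Each step is "<id><orientation>", e.g. "chr1_a+".
            steps = fields[2].split(",")
            fields[2] = ",".join(mapping[s[:-1]] + s[-1] for s in steps)
        out.append("\t".join(fields))
    return out, mapping


lines = [
    "S\tchr1_a\tACGT",
    "S\tchr1_b\tGG",
    "L\tchr1_a\t+\tchr1_b\t-\t0M",
    "P\tp1\tchr1_a+,chr1_b-\t*",
]
converted, dictionary = id2int(lines)
print("\n".join(converted))
print(dictionary)  # {'chr1_a': '1', 'chr1_b': '2'}
```

The returned dictionary plays the role of the file written with -d: it lets you translate results back to the original IDs.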

Node-list

Individual node statistics. Statistics provided:

Length
Degree
Depth
Core

Length and degree are based on the graph itself, while depth and core are based on the paths.

Example

./gretl node-list -g /path/to/graph.gfa -o /path/to/output.txt

Result

TSV file output

Nodes	1	2	3	4	5	6	7	8
Length	21176	15530	15530	24351	24367	100	1	1
Core	1	1	1	1	1	2	1	1
Depth	1	1	1	1	1	2	1	1
ND_in	0	0	0	0	0	2	1	1
ND_out	1	1	1	1	1	2	1	1
ND_total	1	1	1	1	1	4	2	2

Comment: The reported table can be used for individual lookups or to create your own window-like statistics (over nodes).
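
As a sketch of both uses, the Python snippet below parses a table laid out like the example above (one statistic per row, one node per column), looks up a single node, and computes a simple node-based window mean. The function names are hypothetical, and the layout is assumed to match the example shown here.

```python
def read_node_list(tsv_text):
    """Parse a table with one statistic per row and one node per column."""
    table = {}
    for line in tsv_text.strip().splitlines():
        name, *values = line.split("\t")
        table[name] = [int(v) for v in values]
    return table


def window_mean(values, size):
    """Mean of each run of `size` consecutive nodes (step of one node)."""
    return [sum(values[i:i + size]) / size for i in range(len(values) - size + 1)]


text = (
    "Nodes\t1\t2\t3\t4\t5\t6\t7\t8\n"
    "Length\t21176\t15530\t15530\t24351\t24367\t100\t1\t1\n"
    "Depth\t1\t1\t1\t1\t1\t2\t1\t1\n"
)
table = read_node_list(text)
depth_of = dict(zip(table["Nodes"], table["Depth"]))
print(depth_of[6])                     # 2
print(window_mean(table["Depth"], 4))  # [1.0, 1.0, 1.25, 1.25, 1.25]
```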

Core

Compute user-defined statistics of the graph (-s). The statistic is calculated for each node, and the number of nodes and the amount of sequence are summarized for each possible value. An additional file ("*.private.txt") reports, for each path, the number of nodes and the amount of sequence present only in that sample.

Available options:

-s, --stats <statistics> Define the statistic you want to summarize (see above) [default: similarity].

./gretl core -g /path/to/graph.gfa -o /path/to/output.txt
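
The summary can be pictured with a small Python sketch for the default statistic (similarity). This is not gretl's code: the function name is hypothetical, and similarity is assumed here to be the number of distinct paths that visit a node (with --pansn, gretl groups paths by sample instead).

```python
from collections import Counter


def similarity_summary(node_length, paths):
    """Return ({level: [sequence_bp, n_nodes]}, {path: (private_bp, private_nodes)})."""
    similarity = Counter()
    for nodes in paths.values():
        for node in set(nodes):  # count each path once per node
            similarity[node] += 1

    summary = {level: [0, 0] for level in range(len(paths) + 1)}
    for node, length in node_length.items():
        level = similarity[node]  # 0 if no path visits the node
        summary[level][0] += length
        summary[level][1] += 1

    private = {}
    for name, nodes in paths.items():
        own = [n for n in set(nodes) if similarity[n] == 1]
        private[name] = (sum(node_length[n] for n in own), len(own))
    return summary, private


# Toy graph: node ID -> length in bp, path name -> visited node IDs.
node_length = {1: 10, 2: 5, 3: 7, 4: 2}
paths = {"A": [1, 2, 4], "B": [1, 3, 4], "C": [1, 4]}
summary, private = similarity_summary(node_length, paths)
print(summary)  # {0: [0, 0], 1: [12, 2], 2: [0, 0], 3: [12, 2]}
print(private)  # {'A': (5, 1), 'B': (7, 1), 'C': (0, 0)}
```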

Result

core plot

Path similarity (PS)

Calculate for each path the amount of nodes and sequence at each similarity level.

./gretl ps -g /path/to/graph.gfa -o /path/to/output.txt

Result ps plot

Example output: General path similarity

Similarity	Sequence[bp]	#Node
0	0	0
1	264241	7315
2	10804	2191
3	13800	2240
4	73893	6833
5	597805	7655

Private table:

Path	Sequence[bp]	#Node
ABQ_6.ChrX	47050	336
BIH_4.ChrX	26389	278
ABF_6.ChrX	33120	2181
BPN_2.ChrX	104353	1250
BCK_8.ChrX	53334	3275

Feature

Select nodes based on input settings. The output can be used as input for gfa2bin.

./gretl feature -g /path/to/graph.gfa -o /path/to/nodes.txt -D 10

Result

List of nodes which fulfill the input settings (plain-text, one node per line)

Path

Select paths based on input settings. The output can be used as input for gfa2bin.

./gretl feature -g /path/to/graph.gfa -o /path/to/nodes.txt -s "N/D ration" -m 10

Result

List of paths/samples which fulfill the input settings (plain-text, one path per line)

Bootstrap

Sample-based bootstrapping to calculate the number of nodes and the amount of sequence for each possible number of samples. Start with a "complete" graph and remove random paths for each run, then recalculate the general statistics and summarize the amount of sequence/nodes for each level (e.g. similarity).
We recommend bootstrapping graphs that follow PanSN-spec. Use --nodes if the bootstrap should only run on a subset of nodes.
You can adjust the number of bootstraps, calculate only one "level", or provide a meta file as input. Examples are shown in the data/example_data/ directory.
Meta files can be used to apply the same "combinations" to multiple graphs. This only works if the paths/samples of the graphs are in the same order.

Available options:

--nodes <nodes> Run bootstrap only on these nodes
--meta-input <meta input> Use a meta file as input.
--level <level> Run bootstrap only for a specific level
--number <number> Number of bootstraps for each number of genomes
--meta-line <meta line> Run a bootstrap of a specific line in the meta file.
--meta <meta> Report the meta information in the output.

Example

./gretl bootstrap -g /path/to/graph.gfa -o /path/to/output.txt -n 20

Result

Use this script to create the bootstrap plot.

Size	Run	Node:1	Node:2	Node:3	Node:4	Node:5	Seq:1	Seq:2	Seq:3	Seq:4	Seq:5
2	0	7651	13495				94383	680604
2	1	12238	10890				112920	666501
2	2	10184	11766				105283	665966
3	0	7773	7263	9587			122129	23996	662555
3	1	7710	7317	9657			140411	25086	663453
3	2	5255	5680	11466			131387	23065	664906
4	2	6756	2420	6487	9325		165241	7105	22037	661811
4	3	7870	3085	7085	7858		220983	19845	74507	598158
4	4	4988	2305	4912	10754		214961	9350	78758	604140
5	0	7315	2191	2240	6833	7655	264241	10804	13800	73893	597805
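
The procedure behind a table like this can be sketched in Python. This is a simplified illustration, not gretl's implementation: the function name is hypothetical, random subsets of paths are drawn directly (instead of removing paths from the complete graph), and only node counts per similarity level are reported. Sequence would be summed the same way using node lengths.

```python
import random
from collections import Counter


def bootstrap(paths, runs, seed=0):
    """Return rows (size, run, [node count at level 1..size])."""
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    names = sorted(paths)
    rows = []
    for size in range(2, len(names) + 1):
        for run in range(runs):
            chosen = rng.sample(names, size)
            visits = Counter()
            for name in chosen:
                visits.update(set(paths[name]))  # one count per path and node
            per_level = Counter(visits.values())
            rows.append((size, run, [per_level[level] for level in range(1, size + 1)]))
    return rows


# Toy input: path name -> visited node IDs.
paths = {"A": [1, 2, 4], "B": [1, 3, 4], "C": [1, 4]}
for size, run, levels in bootstrap(paths, runs=2):
    print(size, run, *levels, sep="\t")
```

Using a fixed seed (or a stored list of combinations) is what makes runs comparable across graphs, which is the purpose of the meta file.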

(Sliding, path) window

Calculate statistics on a node level (graph- or path-based) and summarize them for each path in a sliding window approach. In detail: Iterate over the nodes of a path (window-like), summarize the stats of all nodes in the window and report a single value for each window.

Example

./gretl window -g /path/to/graph.gfa -o /path/to/output.txt -s 1000 --step 100

Result

Use this script to create the window plot.

Table: Path name in the first column, similarity values in all other columns (each column is a 1000 bp window, moved in 100 bp steps)

ABQ_6.ChrX	5	5	5	5	5	5
BIH_4.ChrX	5	3.5	5	5	5	5
ABF_6.ChrX	5	5	5	5	5	5
BPN_2.ChrX	5	5	5	5	5	5
BCK_8.ChrX	5	5	5	5	5	4.5
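
A minimal Python sketch of the sliding-window idea is shown below. It is an assumption-laden illustration, not gretl's algorithm: the function name is hypothetical, each window value is assumed to be the base-pair weighted mean of the node values, and the per-base list is only practical for toy inputs.

```python
def path_windows(path, node_length, node_value, size, step):
    """Mean node value in each window of `size` bp, moved by `step` bp."""
    per_base = []
    for node in path:
        # Repeat the node's value once per base pair of the node.
        per_base.extend([node_value[node]] * node_length[node])
    return [
        sum(per_base[start:start + size]) / size
        for start in range(0, len(per_base) - size + 1, step)
    ]


# Toy path of three nodes; node 2 has a lower similarity than its neighbours.
path = [1, 2, 3]
node_length = {1: 4, 2: 2, 3: 4}
node_value = {1: 5, 2: 2, 3: 5}
print(path_windows(path, node_length, node_value, size=4, step=2))
# [5.0, 3.5, 3.5, 5.0]
```

Windows that overlap the low-similarity node drop below 5, which is the kind of dip visible in the example table.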

Nwindow

Summarizing the graph by a window of nodes. We iterate numerically over the nodes and calculate the statistics for each window. We start at the current node and move away from it based on provided edges, collecting the new nodes. We repeat this process starting at the "new" nodes until one of the following conditions is met:

Jumps: A jump is defined as the difference between the current and the next node. Your input refers to the sum of all jumps in the window.
Steps: A step is a move we make in the graph. Your input is the maximum number of steps from the starting node.
Sequence: Limit the window by a sequence threshold. We stop if the sequence length is larger than the provided threshold.

Example: How many nodes do I need to collect 1000 bp?

./gretl nwindow -g /path/to/graph.gfa -o /path/to/output.txt --sequence 1000 --node-number

Output: You can report the number of collected nodes, the total number of jumps, or the total sequence. Some combinations of input limit and output do not provide any additional information.
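
The expansion described in this section can be sketched as a breadth-first search in Python. This is an illustration under assumptions, not gretl's code: the function name is hypothetical, edges are treated as undirected neighbour lists, and only the step and sequence limits are implemented (a jump limit would be checked in the same place).

```python
def collect_window(start, neighbors, node_length, max_steps=None, max_sequence=None):
    """Expand from `start` until a limit is hit; return (n_nodes, sequence_bp)."""
    seen = {start}
    frontier = [start]
    sequence = node_length[start]
    steps = 0
    while frontier:
        if max_steps is not None and steps >= max_steps:
            break
        if max_sequence is not None and sequence > max_sequence:
            break
        new = []
        for node in frontier:
            for nxt in neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    new.append(nxt)
                    sequence += node_length[nxt]
        frontier = new  # continue from the newly collected nodes
        steps += 1
    return len(seen), sequence


# Toy chain graph 1-2-3-4-5 with 10 bp per node.
neighbors = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
node_length = {n: 10 for n in neighbors}
print(collect_window(3, neighbors, node_length, max_steps=1))      # (3, 30)
print(collect_window(1, neighbors, node_length, max_sequence=25))  # (3, 30)
```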

nwindow plot

Table: Node ID, number of nodes, amount of sequence, and sum of jumps (collected in a window)

nodeid	node	sequence	jumps
240	54	50623	1321
241	53	50589	1296
242	46	1862	696
243	46	1862	709
244	44	1832	637
245	38	1762	458
246	38	1762	463
247	37	1567	410
248	33	575	280
249	33	575	280
250	33	391	256

Find

Find a specific node (e.g. 10), directed node (e.g. 10+), or edge (e.g. 10+20+) in the graph and get the exact (sequence) position in the paths. Output is a BED file with the positions. You can add flanking sequence (-l) on both sides, which helps if you want to realign to a database and the node is very small.

./gretl find -g /path/to/graph.gfa -o /path/to/output.txt --length 1000 -f feature.txt

Example of feature file is data/example_data/dirnodes.txt
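
How a node is translated into path coordinates can be sketched in Python. The function name and the input layout are hypothetical, only undirected node lookups are covered, and the coordinates follow the usual BED convention (0-based start, exclusive end); gretl's exact output columns may differ.

```python
def node_to_bed(path_name, steps, node_length, target, flank=0):
    """Return BED-like rows (path, start, end) for each visit of `target`."""
    total = sum(node_length[node] for node, _ in steps)
    rows = []
    offset = 0  # position of the current node's first base on the path
    for node, _orientation in steps:
        length = node_length[node]
        if node == target:
            start = max(0, offset - flank)               # clip at path start
            end = min(total, offset + length + flank)    # clip at path end
            rows.append((path_name, start, end))
        offset += length
    return rows


# Toy path that visits node 10 twice.
steps = [(1, "+"), (10, "+"), (3, "-"), (10, "-")]
node_length = {1: 100, 10: 5, 3: 50}
for row in node_to_bed("sample#1#chrX", steps, node_length, target=10, flank=20):
    print(*row, sep="\t")
# sample#1#chrX	80	125
# sample#1#chrX	135	160
```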

Scripts

We provide multiple Jupyter notebooks to visualize the output of the tool.

Requirements

Jupyter
Matplotlib
Pandas
Numpy
Seaborn
