This repository contains all the necessary code and commands for "GraSSRep: Graph-Based Self-Supervised Learning for Repeat Detection in Metagenomic Assembly."
We propose GraSSRep, a novel approach that leverages the assembly graph’s structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences (contigs) into repetitive and non-repetitive categories. Here is an overview of our method:
The code is based on Python 3.7 and should run on Unix-like operating systems (MacOS, Linux).
To install the dependencies for GraSSRep, you can use the environment.yml
file provided to build a Conda environment with all necessary dependencies.
The GraSSRep environment will need to be activated for each usage.
conda env create -f environment.yml
conda activate GraSSRep
Additionally, you will need to install other packages using pip
after creating and activating the environment.
pip install -r requirements.txt
Finally, make sure you have Spades
and MUMmer
installed. You can follow the installation instructions provided in the Spades and MUMmer repository.
- Folder Structure:
You should have the following directory structure in the project folder:
├── Codes.py
├── Results
├── Data
│ ├── shakya_1
│ │ ├── (read pairs and reference genome)
│ ├── shakya_2
│ │ ├── (read pairs and reference genome)
│ ├── (other folders for different cases)
│ │ ├── (read pairs and reference genome)
You need to place your data files, including read pairs, in .fq
format and reference genome in .fasta
format in the respective folders inside the Data
directory.
For example, for the shakya_1 dataset:
├── Data
│ ├── ...
│ ├── shakya_2
│ ├── shakya_1
│ │ ├── outRead1.fq
│ │ ├── outRead2.fq
│ │ ├── ref_genome.fasta
│ ├── ...
For simulated data, you can run SimulatedData.sh
to simulate reference genomes, generate read-pairs, assemble them, and run the code for the three different cases discussed in the paper.
chmod +x SimulatedData.sh
./SimulatedData.sh
- Generating the results:
For any dataset with the read pairs and possible reference genome, there are two main steps to detect the repeated contigs:
i) Assembly Graph:
In this step, reads are assembled into contigs, the assembly graph is provided by spades
, and the features are calculated for the contigs or nodes of the assembly graph. Note that if the reference genome is available, reads are generated from the reference genome using wgsim
. Ground truth labels for repeats and non-repeat contigs are also found based on the reference genome using Nucmer
.
All these steps are executed in the mainGraph.py
code. In other words, this code includes Steps 1 and 2 in the overview figure.
You can vary the following parameters:
Parameter | Default Value | Description |
---|---|---|
--name |
"shakya_1" | Folder containing reads and reference genomes (if available) in Data folder |
--read1 |
"outRead1.fq" | Read 1 file name |
--read2 |
"outRead2.fq" | Read 2 file name |
--assembly |
1 | Flag to run the metaSpades or not |
(if the assembly graph or spades is generated before, change this to 0) | ||
--idy |
95 | Identity for repeat detection in % |
--nL |
95 | Normalized length for repeat detection in % |
--cN |
2 | Copy number for repeat detection |
--isGT |
1 | Flaf for the availability of the ground truth (reference genome) |
(if there is no reference genome available, change this to 0) | ||
--num_processes |
30 | Number of processors |
Please note that the reference genome is used solely for finding the ground truth contigs and evaluating the model. If a dataset lacks a reference genome, the code can be modified to utilize the provided read pairs for identifying repetitive contigs.
ii) Repeat detection:
Finally, using the assembly graph structure and node features, the mainRepeatDetection.py
code classifies contigs into repeat and non-repeat classes using GNN and in a self-supervised manner and finally applies a fine tuning step. This includes Steps 3, 4, and 5 in the overview figure. Parameters that you can vary for this code are:
Parameter | Default Value | Description |
---|---|---|
--name |
"shakya_1" | Folder containing reads and reference genomes (if available) in Data folder |
--p |
35 | p threshold to generate pseudo-labels and apply fine-tuning |
--N_iter |
10 | Number of iterations to repeat the results and average it |
--isSemiSupervised |
0 | Flag to use semi-supervised learning instead of self-supervised |
--noGNN |
0 | Flag to exclude GNN step or not |
--noRF |
0 | Flag to exclude RF step or not |
The above two-step process reads data files from the Data
directory, generates results for each setup specified in the script, and saves them in the corresponding folder within the Results
directory.
The main files generated by the project are listed below:
-
before_rr.fasta: Contains contigs (nodes) from the assembly graph provided by SPAdes.
-
assembly_graph.fastg: Represents the connections between contigs, forming the assembly graph structure.
Note: The above two files, provided by SPAdes, result from read-pairs and serve as the primary inputs for our method. If these files are present in any other dataset, we can exclude assembly with SPAdes from the process. The following files are generated by GraSSRep:
-
Bandage_map.csv: Provides the mapping between contig labels used for plotting with Bandage and the labels used in our method, in sequential order.
-
G_node_to_index.csv: Contains three columns:
contig
: Represents the label of contigs in sequential order as mentioned in the previous file.idx
: Denotes the index of each contig on the graph for Graph Neural Network (GNN).y_true_B
: Indicates the true label of each contig.
-
final_pred.csv: Shares the same columns as the previous file, with two additional columns:
y_pred_B
: Represents the predicted label for each contig.train_mask
: Indicates whether each contig was included in the training set or not.