RTlabCBM / FidelityFinderSimulation

Python script to simulate an RT-PCR experiment incorporating cDNA barcoding and NGS sequencing

FidelityFinderSimulation

Table of Contents

  1. Introduction
  2. Quick Start
  3. Input Parameters
  4. Sample Output Results
  5. Creative Commons
  6. Citation
  7. Developers

Introduction

Python script to simulate an RT-PCR experiment incorporating cDNA barcoding and NGS sequencing. The program aims to assess the ability of single-strand consensus sequencing (SSCS) to evaluate the fidelity of reverse transcriptases, as well as its accuracy in discarding errors introduced during library preparation and sequencing.

Single Strand Consensus Sequencing method to determine the fidelity of reverse transcriptases: SSCS

The program begins by generating a specified number of cDNA sequences (initial_seq_number), each with a length defined by seq_len. A barcode (also known as a Unique Molecular Identifier, UMI) is assigned to each cDNA sequence; its length is set with the barcode_len parameter. Random mutations are introduced into the cDNA sequences with the probability defined by the RT_error_rate parameter. These mutations emulate transcription and reverse transcription errors.
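As a rough illustration of this first step, the sketch below generates barcoded cDNAs with substitution errors. It is not taken from the notebook: the function names (random_sequence, mutate, generate_cdnas), the seed argument, and the choice of deriving every cDNA from a single random template are assumptions made for the example.

```python
import random

BASES = "ACGT"


def random_sequence(length, rng):
    """Return a random nucleotide string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(seq, error_rate, rng):
    """Substitute each base, with probability error_rate, by a different base."""
    mutated = []
    for base in seq:
        if rng.random() < error_rate:
            base = rng.choice([b for b in BASES if b != base])
        mutated.append(base)
    return "".join(mutated)


def generate_cdnas(initial_seq_number, seq_len, barcode_len, RT_error_rate, seed=0):
    """Hypothetical helper: build (barcode, sequence) pairs from one template.

    Assumption: all cDNAs are copies of a single random template, and each
    copy receives an independent random barcode.
    """
    rng = random.Random(seed)
    template = random_sequence(seq_len, rng)
    cdnas = []
    for _ in range(initial_seq_number):
        seq = mutate(template, RT_error_rate, rng)
        barcode = random_sequence(barcode_len, rng)
        cdnas.append((barcode, seq))
    return template, cdnas


if __name__ == "__main__":
    template, cdnas = generate_cdnas(3000, 300, 14, 0.00002)
    n_mutated = sum(seq != template for _, seq in cdnas)
    print(f"{n_mutated} of {len(cdnas)} cDNAs carry at least one RT error")
```

Only substitutions are modelled here, in line with the statement that single nucleotide substitutions are the sole simulated error type.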

The cDNAs are then amplified through the indicated number of PCR cycles (PCR_cycles_number), with a probability of introducing errors in the sequences (determined by PCR_error_rate) or in the associated barcodes (PCR_error_rate_for_barcodes). PCR_selection_rate governs the proportion of sequences amplified during each PCR cycle, thereby simulating PCR bias.
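One way to picture the amplification step is the minimal sketch below. It assumes that, in each cycle, every molecule in the pool is copied with probability PCR_selection_rate and that errors are applied only to the new copy; the notebook may implement selection differently. The names substitute and run_pcr and the seed argument are hypothetical.

```python
import random

BASES = "ACGT"


def substitute(seq, error_rate, rng):
    """Copy seq, replacing each base with a different one with probability error_rate."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < error_rate else base
        for base in seq
    )


def run_pcr(molecules, PCR_cycles_number, PCR_selection_rate,
            PCR_error_rate, PCR_error_rate_for_barcodes, seed=0):
    """Hypothetical PCR model over a list of (barcode, sequence) pairs.

    Assumption: in every cycle each molecule is selected independently with
    probability PCR_selection_rate, and a selected molecule contributes one
    (possibly mutated) copy to the pool.
    """
    rng = random.Random(seed)
    pool = list(molecules)
    for _ in range(PCR_cycles_number):
        copies = []
        for barcode, seq in pool:
            if rng.random() < PCR_selection_rate:
                copies.append((
                    substitute(barcode, PCR_error_rate_for_barcodes, rng),
                    substitute(seq, PCR_error_rate, rng),
                ))
        pool.extend(copies)
    return pool


if __name__ == "__main__":
    start = [("ACGTACGTACGTAC", "ACGT" * 10)] * 100
    amplified = run_pcr(start, 10, 0.3, 0.00003, 0.00003)
    print(f"{len(start)} molecules grew to {len(amplified)} after 10 cycles")
```

Under this model the pool grows by a factor of about (1 + PCR_selection_rate) per cycle, so the selection rate acts as a per-cycle amplification efficiency.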

Following PCR amplification, a subset of sequences (determined by NGS_reads_number) undergoes simulated sequencing. Errors may be introduced in the sequences or their associated barcodes, as defined by NGS_error_rate and NGS_error_rate_for_barcodes, respectively. The program also allows errors to be introduced at specific barcode positions with a different probability, controlled by NGS_error_rate_for_barcodes_hotspots, with the affected positions selected by hotspots_module. For example, a hotspots_module of 2 introduces hotspot errors at even positions of the barcodes, while a value higher than barcode_len has no effect.
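The hotspot rule can be read as a modulo test on the barcode position. The sketch below assumes 1-based positions and treats a position as a hotspot when it is divisible by hotspots_module; that reading is consistent with the two examples given (2 selects even positions, a value above barcode_len selects none), but it is an interpretation rather than the notebook's confirmed logic. Sampling reads without replacement and the function names are also assumptions.

```python
import random

BASES = "ACGT"


def is_hotspot(position, hotspots_module):
    """Assumed rule: a 1-based barcode position is a hotspot if divisible by the module."""
    return position % hotspots_module == 0


def maybe_substitute(base, rate, rng):
    """Return a different base with probability rate, otherwise the same base."""
    if rng.random() < rate:
        return rng.choice([b for b in BASES if b != base])
    return base


def simulate_sequencing(pool, NGS_reads_number, NGS_error_rate,
                        NGS_error_rate_for_barcodes,
                        NGS_error_rate_for_barcodes_hotspots,
                        hotspots_module, seed=0):
    """Hypothetical NGS step over a list of (barcode, sequence) pairs."""
    rng = random.Random(seed)
    # Assumption: reads are drawn without replacement from the amplified pool.
    n_reads = min(NGS_reads_number, len(pool))
    reads = []
    for barcode, seq in rng.sample(pool, n_reads):
        read_barcode = "".join(
            maybe_substitute(
                base,
                NGS_error_rate_for_barcodes_hotspots
                if is_hotspot(position, hotspots_module)
                else NGS_error_rate_for_barcodes,
                rng,
            )
            for position, base in enumerate(barcode, start=1)
        )
        read_seq = "".join(maybe_substitute(base, NGS_error_rate, rng) for base in seq)
        reads.append((read_barcode, read_seq))
    return reads


if __name__ == "__main__":
    barcode_len = 14
    for module in (2, 15):
        hotspots = [p for p in range(1, barcode_len + 1) if is_hotspot(p, module)]
        print(f"hotspots_module={module}: hotspot positions {hotspots}")
```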

Single nucleotide substitution errors are the sole type of simulated errors.

Quick Start

The program is available as a Jupyter Notebook. It can be opened and run with the following Google Colab link: fidelity_finder_simulation

Input Parameters

Parameters that can be provided as input together with example values:

  • Number of cDNA sequences that are generated by reverse transcription
initial_seq_number = 3000
  • Length of the initial sequences (nt)
seq_len = 300
  • Length of the barcode added to each sequence (nt)
barcode_len = 14
  • Error rate during reverse transcription
RT_error_rate = 0.00002
  • Number of PCR cycles
PCR_cycles_number = 15
  • Proportion of sequences that are amplified in each PCR cycle (0.0-1.0)
PCR_selection_rate = 0.3
  • Error rate during each cycle of PCR
PCR_error_rate = 0.00003
  • Error rate during each cycle of PCR for barcode amplification
PCR_error_rate_for_barcodes = 0.00003
  • Number of reads that are sequenced
NGS_reads_number = 130000
  • Error rate during sequencing
NGS_error_rate = 0.001
  • Error rate during sequencing of barcodes
NGS_error_rate_for_barcodes = 0.001
  • Error rate during sequencing of barcodes hotspots
NGS_error_rate_for_barcodes_hotspots = 0.01
  • Hotspots module. Set it to 2 to introduce hotspots in even positions of the barcode. Set it to a number higher than barcode_len to avoid hotspots
hotspots_module = 2
  • Cutoffs for sequence filtering. Barcodes whose frequency is equal to or lower than the cutoff are discarded. Provide a comma-separated list
cutoffs_list = "1,2,3,4"
  • Thresholds for consensus construction. Provide a comma-separated list
thresholds_list = "0,75,100"
  • Prefix for the names of the generated output files
output_prefix = "simulation1"
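To make the roles of cutoffs_list and thresholds_list more concrete, the sketch below groups reads by barcode, discards families whose size is equal to or lower than the cutoff, and builds a per-position consensus. The threshold handling is an assumption: the most frequent nucleotide is accepted only if its percentage reaches the threshold, and the position is otherwise written as "N". The notebook may treat sub-threshold positions differently, and all function names here are hypothetical.

```python
from collections import Counter, defaultdict


def parse_int_list(text):
    """Turn a comma-separated string such as "1,2,3,4" into a list of ints."""
    return [int(item) for item in text.split(",") if item.strip()]


def group_by_barcode(reads):
    """Collect the sequences of (barcode, sequence) reads into barcode families."""
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)
    return families


def consensus(seqs, threshold):
    """Per-position majority vote over equal-length sequences.

    Assumption: a base is called only if its share (in percent) is at least
    `threshold`; otherwise the position is reported as "N".
    """
    called = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        percentage = 100.0 * count / len(column)
        called.append(base if percentage >= threshold else "N")
    return "".join(called)


def build_consensus(reads, cutoff, threshold):
    """Keep families larger than `cutoff` and return one consensus per barcode."""
    return {
        barcode: consensus(seqs, threshold)
        for barcode, seqs in group_by_barcode(reads).items()
        if len(seqs) > cutoff
    }


if __name__ == "__main__":
    reads = [("AAGT", "ACGT")] * 3 + [("AAGT", "ACGA"), ("CCTG", "TTTT")]
    for cutoff in parse_int_list("1,2,3,4"):
        for threshold in parse_int_list("0,75,100"):
            print(cutoff, threshold, build_consensus(reads, cutoff, threshold))
```

With a threshold of 0 every position is called from the most frequent nucleotide, while a threshold of 100 calls a position only when all reads in the family agree.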

Sample Output Results

The program generates several files, including graphs, an Excel file with a summary of the obtained results, and .json files with specific data. These are some examples of the output data:

  • Summary data Excel file (<output_prefix>summary_data.xlsx)
  • Distribution of barcode family sizes (<output_prefix>total_frequencies_distribution_graph.png)
  • Distribution of mutations across the sequence for a fixed cutoff and threshold (<output_prefix>mutations_distribution_graph_cutoff_3_threshold_100)
  • Error rates by barcode for a fixed threshold value (<output_prefix>all_error_rates_graph.png)
  • Histogram with the percentages of the highest frequency nucleotide in each position of the aligned sequences used for consensus construction (<output_prefix>Max_frequent_nucleotides_percentages_histogram_cutoff_1_threshold_0.png)
  • Distribution of barcode family sizes together with offspring percentages (<output_prefix>percentage_differences.png)

Creative Commons

Attribution-NonCommercial-ShareAlike 4.0 International
(CC BY-NC-SA 4.0)

Citation

We politely request that this work be cited as:
(Citation details are not yet available)

Developers

  • Javier Martínez del Río


Languages

Language:Jupyter Notebook 100.0%