Please note that this step is only necessary if you plan to edit scripts and have access to this repository as a collaborator.
- Create a GitHub account here.
- Access the HPCC by going to https://ondemand.hpcc.msu.edu and clicking `dev-intel-16` under the `Development Nodes` drop-down, which will open a terminal in a new tab.
- In the terminal, follow the instructions under "Generating a new SSH key" here to create an SSH key.
- Once the key is created, run `cat ~/.ssh/id_ed25519.pub` and copy the contents it displays (it should look like "ssh-ed25519 AAA …… email@website.com").
- Log in to GitHub, click the icon in the top right-hand corner, and select "Settings". Click the section labeled "SSH and GPG keys". Next, click "New SSH Key", paste the contents copied above into the box provided, and give it a descriptive title (for example, your computer name) and an expiration date if desired.
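Before pasting the key into GitHub, it can help to confirm that the file really holds an ed25519 public key. The sketch below is only an illustration: `looks_like_ed25519_pubkey` is a hypothetical helper name that does not exist in this repository, and the pattern assumes the usual `ssh-ed25519 <base64> [comment]` layout.

```bash
# Hypothetical helper (not part of this repository): succeeds if the given
# file looks like an ed25519 public key, i.e. "ssh-ed25519 <base64> [comment]".
looks_like_ed25519_pubkey() {
    [ -f "$1" ] || return 1
    grep -Eq '^ssh-ed25519 [A-Za-z0-9+/]+=*( .*)?$' "$1"
}

# Example use on the HPCC (path taken from the step above):
#   looks_like_ed25519_pubkey ~/.ssh/id_ed25519.pub && echo "key looks fine"
# After adding the key on GitHub, the connection can be tested with:
#   ssh -T git@github.com
```

If the check fails, the most likely causes are copying the private key (`id_ed25519`, without `.pub`) or generating a different key type.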
The following should be done on the Lab MacBook:
- Download the zipped data folder to the Desktop and unzip it. Verify that it contains two subfolders called `A` and `B`, where `A` contains the sequencing files for the MUT reads and `B` contains the sequencing files for the WT reads. Each file should have a name like `V300098986_L03_PLAujbeR032370-663_1.fq.gz`, where `L03` indicates that the file is from sequencing lane 3, `_1` means that it is the `R1` read, and `fq.gz` indicates a gzip-compressed FASTQ file. The current sequencing company splits reads across 3 lanes, and these three files must be joined together in increasing lane order. For example, if the lanes are `L02`, `L03`, and `L04`, we must join `L02` -> `L03` -> `L04` such that `L02` is the beginning and `L04` is the end. The scripts described below are intended to concatenate the three separate files for each of the four read types: MUT R1, MUT R2, WT R1, and WT R2.
- Once we have checked that the data is formatted correctly, upload the data folder to the HPCC. This can be done in two ways:
  - Online GUI: Go to https://ondemand.hpcc.msu.edu and navigate to `Files` -> `Home Directory`. Here, you can navigate through the GUI to the correct folder and hit the blue `Upload` button to upload folders or files from the local computer.
  - Terminal: Open Terminal on the MacBook and run the following. This will prompt you to enter the password for your HPCC account (same as your MSU account).

        scp -r ~/Desktop/DataFolder <username>@hpcc.msu.edu:<DataFolder> # replace <>'s with your MSU username and the name of your data folder
- Access the HPCC terminal. This can also be done in two ways:
  - HPCC Web Terminal: Go to https://ondemand.hpcc.msu.edu and click `Development Nodes` -> `dev-intel-16` to open a terminal in a new tab.
  - Local Terminal: Open Terminal on the MacBook and run the following. This will prompt you to enter the password for your HPCC account (same as your MSU account).

        ssh <username>@hpcc.msu.edu # replace <> with your MSU username
- Set up an analysis directory adjacent to the data folder. This can be done in one of the following ways:
  - Manually: Download this GitHub repository to the MacBook as a zipped folder and unzip it. Next, rename the folder with the name of the suppressor mutant. Then, upload the folder to the HPCC so that it sits in the same folder as the data folder.
  - Terminal, after manual download: Open Terminal on the MacBook and run the following. This will prompt you to enter the password for your HPCC account (same as your MSU account).

        scp -r ~/Desktop/<analysis_folder> <username>@hpcc.msu.edu:<analysis_folder> # replace <>'s with your MSU username and the name of your analysis folder
  - HPCC Web Terminal, straight from GitHub: Navigate to the HPCC's web terminal and run the following code. (This may require SSH keys to be set up on the HPCC.)

        cd ~
        git clone git@github.com:yashmanne/Benning_Simple.git
        mv Benning_Simple <analysis_folder> # replace <analysis_folder> with your sample name
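The data-folder check described in the first step of the list above can be scripted before uploading. This is a minimal sketch under stated assumptions, not part of this repository: `check_data_folder` is a hypothetical name, and it assumes subfolders `A` and `B`, file names ending in `_1.fq.gz` or `_2.fq.gz`, and exactly three lanes per read direction.

```bash
# Hypothetical pre-upload check (not part of this repository).
# Assumes: subfolders A (MUT) and B (WT), file names ending in _1.fq.gz (R1)
# or _2.fq.gz (R2), and three lane files per read direction.
check_data_folder() {
    local data_dir="$1" status=0 sub mate n
    for sub in A B; do
        if [ ! -d "$data_dir/$sub" ]; then
            echo "missing subfolder: $sub"
            status=1
            continue
        fi
        for mate in 1 2; do
            # Count the files for this read direction in this subfolder.
            n=$(find "$data_dir/$sub" -maxdepth 1 -type f -name "*_${mate}.fq.gz" | wc -l | tr -d ' ')
            if [ "$n" -ne 3 ]; then
                echo "$sub R$mate: expected 3 lane files, found $n"
                status=1
            fi
        done
    done
    return "$status"
}

# Example use on the MacBook:
#   check_data_folder ~/Desktop/DataFolder && echo "layout looks fine"
```

If the sequencing run used a different number of lanes, the expected count of 3 would need to change accordingly.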
Once the analysis folder is set up, open the `variables.sh` file in the web GUI by clicking the three dots next to the file and hitting `EDIT` to open the file in a new tab. The script should look as follows:
```bash
#### Change these to what yours are
# data folder
dataLocation="../DefaultData"
# output file
lineName="defaultEMS"
#### Note that in the example case, data was spread across 3 lanes L02, L03, L04. This may not always be the case and might need to be modified
lane1="L02"
lane2="L03"
lane3="L04"
```
Now, change the `dataLocation` variable to `"../DataFolder"`, where `"DataFolder"` is the name of your data folder. Next, change `lineName` to `"your_sample_name"`, where `"your_sample_name"` is the name of your sample. Then set `lane1` to the lowest lane value (in this case `"L02"`), `lane2` to the next lowest value, and `lane3` to the highest lane value. Save the script by hitting the blue `SAVE` button.
Once this file is saved, we can run subsequent HPCC scripts!
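Before launching the pipeline, the edited values can be sanity-checked from the terminal. The following is a rough sketch, not part of this repository: `check_variables` is a hypothetical helper that only confirms the five variables shown above are set and that the lanes are in strictly increasing order.

```bash
# Hypothetical sanity check for variables.sh (not part of this repository).
# The body runs in a subshell, so sourcing the file leaves the current
# shell untouched.
check_variables() (
    . "$1" || exit 1
    for name in dataLocation lineName lane1 lane2 lane3; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is empty or unset"
            exit 1
        fi
    done
    # Lanes must be strictly increasing, e.g. L02 < L03 < L04.
    if ! printf '%s\n' "$lane1" "$lane2" "$lane3" | sort -c -u 2>/dev/null; then
        echo "lanes are not in increasing order: $lane1 $lane2 $lane3"
        exit 1
    fi
    echo "variables.sh looks consistent for sample $lineName"
)

# Example use from the analysis folder:
#   check_variables ./variables.sh
```

The check deliberately does not test whether `dataLocation` exists, since that depends on which directory the pipeline scripts are run from.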
Navigate to the folder you uploaded to the HPCC from this GitHub repository (the one you just named after your mutant):
cd ./<folder_name>/
To run the analysis, simply run the following in terminal:
bash runScripts.sh
This will run all scripts listed below in the following order. These scripts can be found in the scripts folder of this repository.
1. `import_data.sb`: gathers all data into the `input` folder under compatible file names.
2. `getReferences_Arabidopsis.sb`: downloads the necessary reference genome information.
3. `runBwaMem2Index_slurm_ArabidopsisEMS.sb`: indexes the reference genome for comparison with the sequencing files.
4. `runBwaMem2Aln_slurm_ArabidopsisEMS.sb`: aligns the sequencing files to the reference genome and generates `.sam` files that contain the alignment data.
5. `runSamtoolsSamToBam_slurm_ArabidopsisEMS.sb`: converts the alignment data into the more efficient `.bam` format.
6. `runSamtoolsBam_Sort_IndexEMS.sb`: sorts and indexes the alignments for future SNP calling.
7. `runSamtoolsMarkDuplicatesEMS.sb`: removes all duplicates to ensure efficient and accurate SNP calling.
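How `runScripts.sh` enforces this order is not reproduced here. As a general illustration only, one common way to make Slurm jobs run strictly one after another is to submit each job with a dependency on the previous one. The sketch below assumes that approach; `submit_in_order` is a hypothetical name and is not the repository's implementation.

```bash
# Hypothetical illustration (not the repository's runScripts.sh).
# Submits each script so that it starts only after the previous job
# has finished successfully.
submit_in_order() {
    local prev="" out script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            out=$(sbatch --parsable "$script") || return 1
        else
            out=$(sbatch --parsable --dependency=afterok:"$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
        prev=${out%%;*}
        echo "submitted $script as job $prev"
    done
}

# Example use from the scripts folder:
#   submit_in_order import_data.sb getReferences_Arabidopsis.sb
```

With `afterok`, a failed job leaves the later jobs pending with an unsatisfied dependency rather than running them on incomplete data.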
Each `.sb` script will generate a `slurm-########.out` file that contains the log of the commands run in that script. These files are useful for debugging any issues that may come up. Most issues are easy to debug; if not, the HPCC staff can be contacted here.
Each script can be edited if needed by going to https://ondemand.hpcc.msu.edu/ and clicking `Home Directory` under the `Files` drop-down. Then, navigate to the desired script and hit `EDIT` under the three-dots drop-down for that file. A new tab will open where you can edit the file; save it by hitting the `SAVE` button on the top left.
All scripts are finished when all 7 `slurm-########.out` files show up in the `scripts` folder. If all scripts are successful, there should be multiple non-empty files in the `output` folder.
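These two completion conditions can be checked from the terminal instead of the GUI. The sketch below is a hedged illustration: `check_run` is a hypothetical helper, and it assumes the logs sit directly in the `scripts` folder and the results in the `output` folder of the analysis directory.

```bash
# Hypothetical completion check (not part of this repository).
# Usage: check_run <analysis_folder> [expected_log_count]
check_run() {
    local analysis_dir="$1" expected="${2:-7}" logs outputs
    # Count the Slurm log files in the scripts folder.
    logs=$(find "$analysis_dir/scripts" -maxdepth 1 -type f -name 'slurm-*.out' 2>/dev/null | wc -l | tr -d ' ')
    # Count the non-empty files anywhere under the output folder.
    outputs=$(find "$analysis_dir/output" -type f -size +0 2>/dev/null | wc -l | tr -d ' ')
    echo "slurm logs: $logs of $expected; non-empty output files: $outputs"
    [ "$logs" -eq "$expected" ] && [ "$outputs" -gt 0 ]
}

# Example use from the folder that contains the analysis directory:
#   check_run ./<folder_name>
```

A passing check only shows that the files exist; the logs themselves still need to be read if a result looks wrong.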
A common issue while running the scripts is that one or more of the data files are corrupted. In this case, the corrupted portion of each affected file must be cut out. It is then common to manually concatenate the three lane files for each read type and put the results into the `input` folder. Since the data has already been added to the `input` folder, script 1 (`import_data.sb`) can be skipped and the subsequent scripts can be run with the following:
```bash
bash runScripts.sh data
```
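The manual route mentioned above can be sketched as two small helpers: one to spot corrupted gzip files with `gzip -t`, and one to join lane files in increasing lane order. Both names are hypothetical and not part of this repository. The join relies on the fact that concatenated gzip files form a valid gzip stream, and it assumes the file names differ only in the lane tag so that sorting by name sorts by lane.

```bash
# Hypothetical helpers (not part of this repository).

# Print the names of any gzip files that fail an integrity test.
find_corrupt_gz() {
    local f status=0
    for f in "$@"; do
        if ! gzip -t "$f" 2>/dev/null; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}

# Join lane files into one output file, in increasing order of file name
# (so L02 comes before L03, which comes before L04).
join_lanes() {
    local out="$1"
    shift
    printf '%s\n' "$@" | sort | while IFS= read -r f; do
        cat "$f"
    done > "$out"
}

# Example use (joined_R1.fq.gz is a placeholder name):
#   find_corrupt_gz A/*.fq.gz
#   join_lanes joined_R1.fq.gz A/*_1.fq.gz
```

The output name above is only a placeholder; the joined files must use whatever names the later scripts expect to find in the `input` folder.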
Additionally, the quality of the sequencing files can be tested by running the `runFastqc_slurm_EMS.sb` script, which outputs HTML files containing plots of different data-quality metrics. This can be run separately by doing the following:
```bash
# navigate to scripts folder
cd ./scripts/
sbatch runFastqc_slurm_EMS.sb
```
Once the script is run, its progress can be checked by the following:
```bash
squeue -lu <username> # replace <> with your MSU username
```
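Rather than re-running `squeue` by hand, the check can be wrapped in a small polling loop. This is a minimal sketch, assuming only the standard `squeue` options `-h` (no header) and `-u` (filter by user); `wait_for_jobs` is a hypothetical name.

```bash
# Hypothetical polling helper (not part of this repository).
# Usage: wait_for_jobs <username> [seconds_between_checks]
wait_for_jobs() {
    local user="$1" interval="${2:-60}" remaining
    while :; do
        # With -h there is no header line, so each line is one job.
        remaining=$(squeue -h -u "$user" | wc -l | tr -d ' ')
        if [ "$remaining" -eq 0 ]; then
            break
        fi
        echo "$remaining job(s) still queued or running"
        sleep "$interval"
    done
    echo "no jobs left in the queue for $user"
}

# Example use on the HPCC:
#   wait_for_jobs <username> 300
```

An empty queue means the jobs have ended, not that they succeeded, so the `slurm-########.out` logs are still worth a look.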
Once all the output files have been successfully generated, copy the output files from the HPCC to the Lab MacBook by doing the following:
- Open up Terminal.
- Navigate to the Desktop and copy the output folder by running:

      cd ~/Desktop/
      scp -r <username>@hpcc.msu.edu:<analysis_folder_name>/output . # replace <>'s with your MSU username and name of analysis folder
- Run other lab experiments while the files download. (It can take close to an hour.)
- Once all the files have transferred, download Simple to your Desktop and unzip it. Rename the folder to `Simple` instead of `Simple-master`. Go to the `simple_variables.sh` file under `Simple/scripts/` and change the `line` variable from "EMS" to your desired sample name, as done for the `lineName` variable above.
- Next, copy all files in the output folder to the `Simple/output` folder.
- Now, in `Simple/scripts/`, replace `simple.sh` with the `simple.sh` file present in the `scripts` folder of the data analysis repository. The new `simple.sh` file can be downloaded using the HPCC GUI as done above.
- Now, SIMPLE is ready to run by doing the following on the MacBook:

      cd ~/Desktop/Simple
      chmod +x ./scripts/simple.sh
      ./scripts/simple.sh
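Because the run is long, it may be convenient to start it in the background and keep its messages in a log file. This is an optional sketch and not part of SIMPLE or this repository: `run_logged` and the log file name are hypothetical, and it uses only the standard `nohup` utility.

```bash
# Hypothetical wrapper (not part of SIMPLE or this repository).
# Starts a script in the background, writes all of its output to a log file,
# and prints the process ID so the run can be checked later.
run_logged() {
    local script="$1" log="$2"
    nohup "$script" > "$log" 2>&1 &
    echo $!
}

# Example use on the MacBook:
#   cd ~/Desktop/Simple
#   run_logged ./scripts/simple.sh simple_run.log
#   tail -f simple_run.log   # follow progress; Ctrl-C stops tail, not the run
```

The MacBook still has to stay awake for the run to make progress.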
Come back in a day and it should be ready. Congrats, now you get to do the fun stuff!