bio.sh

Basic intro to Linux and shell scripting for Bioinformatics

Outline

Unix and the command line
Connect and transfer
Basic commands
Users and groups
Get into your machine and system
Examples for Bioinformatics

1. Unix and the command line

Unix history

Unix operating systems derive from the original AT&T UNIX, whose development began in 1969 at the Bell Labs research center, led by Ken Thompson and Dennis Ritchie, among others.

Linux history

A Unix-like operating system
Started in 1991 by Linus Torvalds

Ubuntu

Ubuntu is one of many open source operating systems based on Linux. https://ubuntu.com/download/desktop

Linux file system

The Filesystem Hierarchy Standard (FHS) defines the main directories and their contents in Linux operating systems.
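
To make the directory layout concrete, here is a minimal sketch that checks a handful of top-level FHS directories on the current machine. The one-line descriptions are simplified summaries, and the variable names are only illustrative.

```shell
# A few top-level directories defined by the FHS and what they usually hold.
fhs_report=$(
    for entry in "/bin:essential commands" "/etc:system configuration" \
                 "/home:user home directories" "/tmp:temporary files" \
                 "/usr:installed programs and libraries" "/var:logs and other changing data"
    do
        dir=${entry%%:*}     # text before the first colon
        role=${entry#*:}     # text after the first colon
        if [ -d "$dir" ]; then
            echo "$dir ($role): present"
        else
            echo "$dir ($role): not found"
        fi
    done
)
echo "$fhs_report"
```

Running ls / gives the full list of top-level directories on a given system.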

Linux shell

Shell: a program that takes your commands from the keyboard and passes them to the operating system to execute
Terminal: a program that runs a shell
Directory: a folder, that is, a location that holds files

2. Connect and transfer

What do I need to connect to a remote Linux server?

Windows: PuTTY (https://www.putty.org/) or an Ubuntu-based Bash shell
Linux/Mac: ssh (available from the terminal)

SSH syntax

ssh user@hostname
Example 1: ssh joel@darwin.dei.uc.pt
Example 2: ssh maria@193.137.200.184

SCP syntax

SCP is used to copy files to and from the server.

scp file.txt user@hostname:/some/remote/directory ← copy a local file to the remote server
scp user@hostname:file.txt /some/local/directory ← copy a remote file to my computer
Example 1: scp P00750.fasta joel@darwin.dei.uc.pt:~/
Example 2: scp maria@193.137.200.184:P00750.fasta ./
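
The two transfer directions can be rehearsed as a dry run. The sketch below only builds and prints the scp commands, so it is safe to try without access to a server. The variable names are hypothetical, the user, host, and file come from the examples above, and uploading to the remote home directory (~/) is an assumption.

```shell
# Hypothetical variable names; user, host, and file are taken from the examples.
remote_user="joel"
remote_host="darwin.dei.uc.pt"
fasta_file="P00750.fasta"

# Build the two transfer commands as text (nothing is copied yet).
upload_cmd="scp $fasta_file $remote_user@$remote_host:~/"
download_cmd="scp $remote_user@$remote_host:$fasta_file ./"

# Print them for review; paste one into the terminal to run it for real.
echo "$upload_cmd"
echo "$download_cmd"
```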

Wget syntax

Wget is used to download public files from a server.

wget example-url
Example 1: wget https://www.uniprot.org/uniprot/P00750.fasta

3. Basic commands

ls ← list files from current directory

ls /home/reports ← list files from a specific directory

ls -lhS ← long listing with human-readable sizes, sorted by size (more options: http://manpages.ubuntu.com/manpages/trusty/man1/ls.1.html)

head -n3 file.txt ← print the first 3 lines

cp file.txt file1.txt ← make a copy of file.txt

rm file.txt ← delete file.txt

mv file1.txt file.txt ← rename file1.txt to file.txt (mv also moves files to another location)

mkdir some_folder ← create a folder

mkdir some_folder_{0..9} ← create multiple folders (some_folder_0 to some_folder_9)

cd ← go to home directory

cd some_folder ← change folder

cd some_folder/reports ← change folder

cd ~/some_folder/reports ← path starting from the home directory (~)

cd ./some_folder/reports ← relative path from current directory

cd ../some_folder/reports ← relative path from parent directory
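
The commands above can be combined into one short session. This is a minimal sketch that works inside a throwaway directory created with mktemp, so no real files are touched. The file contents and variable names are made up for illustration.

```shell
start_dir=$(pwd)
workdir=$(mktemp -d)                 # throwaway directory for the demo
cd "$workdir"

mkdir -p some_folder/reports         # -p creates the nested folders in one step
printf 'line1\nline2\nline3\nline4\n' > file.txt

cp file.txt file1.txt                # make a copy
rm file.txt                          # delete the original
mv file1.txt some_folder/reports/file.txt   # move and rename in one step

cd some_folder/reports               # relative path from the current directory
first_lines=$(head -n3 file.txt)     # first 3 lines of the file
echo "$first_lines"

cd ../..                             # go up two levels, back to the demo directory
listing=$(ls some_folder/reports)
echo "$listing"

cd "$start_dir"                      # return to where we started
rm -rf "$workdir"                    # clean up
```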

4. On-the-go Python example

Create or open a file with the nano text editor.

nano example.py

import sys
if len(sys.argv)>1:
	for i in range(int(sys.argv[1])):
		print(i)

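python3 example.py 10 ← run the script; prints the numbers 0 to 9

python3 example.py 10 >out.txt ← write the output to out.txt (overwrites the file)

python3 example.py 10 >>out.txt ← append the output to out.txt

python3 example.py 10 | sort -r | head -n3 ← pipe the output through sort (reverse order) and keep the first 3 lines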
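
The difference between >, >> and | is easy to check with a stand-in for the script. This sketch uses seq 0 9, which prints the same ten numbers as python3 example.py 10, so it runs even where Python is not installed. The variable names are illustrative.

```shell
out_file=$(mktemp)

# seq 0 9 prints 0 to 9, one number per line.
seq 0 9 > "$out_file"      # > creates the file or overwrites it
seq 0 9 >> "$out_file"     # >> appends to the existing content
line_count=$(wc -l < "$out_file")

# | sends the output of one command to the input of the next.
top_three=$(seq 0 9 | sort -rn | head -n3)

echo "lines in file: $line_count"
echo "$top_three"
rm -f "$out_file"
```

The sketch uses sort -rn rather than sort -r: plain sort compares text, so with multi-digit values 10 would be placed between 2 and 1. The -n option makes the comparison numeric.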

5. Foreground and background

Terminate a process: CTRL-C
Suspend a process: CTRL-Z
fg ← bring a suspended process back to the foreground
sleep 20 & ← & runs the command in the background
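
Inside a script there is no keyboard to press CTRL-C, so background processes are usually managed by process ID. A minimal sketch, assuming a POSIX shell; the variable names are illustrative.

```shell
sleep 30 &                 # start a long command in the background
bg_pid=$!                  # $! holds the process ID of the last background command
echo "started process $bg_pid"

kill "$bg_pid"             # send the default TERM signal to stop it early

bg_status=0
wait "$bg_pid" || bg_status=$?   # wait collects the exit status of the process
echo "exit status: $bg_status"   # non-zero, because the process was killed
```

In an interactive terminal, jobs lists the background processes started from that shell.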

6. Useful generic commands

whoami ← Who am I?
lsb_release -a ← Linux distribution and version
ifconfig ← get my IP address (on newer systems: ip addr)
df -h ← disk space available
du -sbh ← size of the current directory
w ← who is logged in to this computer
top ← what processes are running

7. Advanced commands

grep: search for a pattern
- grep "ATG" file
wc: count lines, words, and bytes (-l counts lines only)
- wc -l file
- grep "ATG" file | wc -l ← count the lines that contain ATG
sort: sort a text file
- sort -r file ← sort the file in reverse order
tr: translate, squeeze, and/or delete characters
- cat file | tr -d '>|' ← delete the characters > and | from a text file
uniq: filter adjacent matching lines
- sort file | uniq -c ← count repeated lines (sort first, because uniq only compares adjacent lines)
cut: print selected parts of lines
- cut -d '|' -f3 file ← split each line by | and get column 3
sed: stream editor for filtering and processing text
awk: pattern scanning and processing language
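
To see these tools side by side, the sketch below runs each of them on a tiny pipe-delimited file. The records are made up for illustration (database|accession|sequence), and the sed and awk one-liners are just one possible use of each tool.

```shell
records=$(mktemp)
# Made-up records: database|accession|sequence
cat > "$records" <<'EOF'
sp|A0001|ATGGCCTA
tr|A0002|GGCATG
sp|A0003|CCCGGG
sp|A0004|ATGATGATGA
EOF

atg_count=$(grep -c "ATG" "$records")                     # number of lines containing ATG
accessions=$(cut -d '|' -f2 "$records")                   # second field of every line
db_counts=$(cut -d '|' -f1 "$records" | sort | uniq -c)   # how often each database tag occurs
no_marks=$(printf '>sp|A0001\n' | tr -d '>|')             # delete the characters > and |
rna=$(cut -d '|' -f3 "$records" | sed 's/T/U/g')          # sed: replace every T with U
long_ids=$(awk -F'|' 'length($3) > 6 {print $2}' "$records")  # awk: IDs with sequence longer than 6

echo "lines with ATG: $atg_count"
echo "$db_counts"
echo "$long_ids"
rm -f "$records"
```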

Bash vs Python

Using the best of both worlds

8. Examples for Bioinformatics

1. Simple processing of a tab delimited file

Get a tab-delimited file from UniProt (create the folder first: mkdir -p ./data)
- wget -O ./data/9606.uniprot.tab 'https://www.uniprot.org/uniprot/?query=*&format=tab&columns=id,entry%20name,reviewed,protein%20names,genes,organism,length&fil=organism:%22Homo%20sapiens%20(Human)%20[9606]%22'
Format and general stats
- head -n10 ./data/9606.uniprot.tab
- tail -n10 ./data/9606.uniprot.tab
- wc -l ./data/9606.uniprot.tab
Search for a specific pattern
- grep "BRCA2" ./data/9606.uniprot.tab
- grep "ubiquitin" ./data/9606.uniprot.tab
Filter on a specific column
- awk -F"\t" '$7>2000' ./data/9606.uniprot.tab
- awk -F'\t' '$3 == "unreviewed"' ./data/9606.uniprot.tab
- awk -F'\t' '$3 == "unreviewed"||$7<200' ./data/9606.uniprot.tab
Get the ID and length of all proteins matching a filter, sorted by length
- awk -F"\t" '$7>2000' ./data/9606.uniprot.tab | cut -f1,7 |sort -k2n
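
The column filters can be tried without downloading anything. This sketch builds a three-row table with the same seven columns as the download. All rows are invented, so the accessions and lengths do not refer to real proteins. It also skips the header row with NR > 1, since awk would otherwise compare the text "Length" against the number.

```shell
tab_file=$(mktemp)
# Made-up rows with the same 7 columns as the download (the real file is far larger).
printf 'Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\n' > "$tab_file"
printf 'X00001\tAAA_HUMAN\treviewed\tProtein A\tGENEA\tHomo sapiens (Human)\t3400\n' >> "$tab_file"
printf 'X00002\tBBB_HUMAN\tunreviewed\tProtein B\tGENEB\tHomo sapiens (Human)\t150\n' >> "$tab_file"
printf 'X00003\tCCC_HUMAN\treviewed\tProtein C\tGENEC\tHomo sapiens (Human)\t2150\n' >> "$tab_file"

# Proteins longer than 2000 residues: print ID and length, sorted by length.
long_proteins=$(awk -F'\t' 'NR > 1 && $7 > 2000 {print $1 "\t" $7}' "$tab_file" | sort -k2,2n)

# Number of unreviewed entries.
unreviewed=$(awk -F'\t' 'NR > 1 && $3 == "unreviewed"' "$tab_file" | wc -l)

# Mean protein length over all data rows.
mean_length=$(awk -F'\t' 'NR > 1 {sum += $7; n++} END {print sum / n}' "$tab_file")

echo "$long_proteins"
echo "unreviewed entries: $unreviewed"
echo "mean length: $mean_length"
rm -f "$tab_file"
```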

2. Search over a FASTA file

Get the human FASTA file from UniProt
- wget -O ./data/9606.uniprot.fasta 'https://www.uniprot.org/uniprot/?query=*&format=fasta&fil=organism:%22Homo%20sapiens%20(Human)%20[9606]%22'
Convert the FASTA file to one sequence per line (create the folder first: mkdir -p ./tmp)
- awk '/^>/ {printf("%s%s|",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' <./data/9606.uniprot.fasta >./tmp/9606.uniprot.line.fasta
Get the IDs from a FASTA file (or another field)
- cat ./tmp/9606.uniprot.line.fasta | cut -d '|' -f2 |sort
Intersect two lists
C2H2 zinc finger motif (assume the zinc finger motif to be C-X2-C-X12-H-X3-H, where X is any amino acid)
- cat ./tmp/9606.uniprot.line.fasta | grep --color -E "C.{2}C.{12}H.{3}H"
Any regular expression
- cat ./tmp/9606.uniprot.line.fasta | grep --color "L[AST]Q"
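
The FASTA steps, including the list intersection, can be rehearsed on a toy file. The sketch below uses three invented records with UniProt-style headers, reuses the awk program above to put each record on one line, and intersects the two ID lists with comm -12. Using comm for the intersection and writing the motif with quantifiers (C.{2}C.{12}H.{3}H) are assumptions of this sketch, not necessarily the author's approach.

```shell
work=$(mktemp -d)

# Made-up FASTA records with UniProt-style headers (>db|accession|name).
cat > "$work/demo.fasta" <<'EOF'
>sp|X00001|AAA_HUMAN Protein A
MKCAACGGGGGG
GGGGGGHAAAHLTQ
>sp|X00002|BBB_HUMAN Protein B
MLSQAAAA
>sp|X00003|CCC_HUMAN Protein C
CAACGGGGGG
GGGGGGHAAAH
EOF

# Join each record onto one line in the form header|sequence.
awk '/^>/ {printf("%s%s|",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
    < "$work/demo.fasta" > "$work/demo.line.fasta"

# IDs of the records matching each pattern; sorted, because comm needs sorted input.
grep -E "C.{2}C.{12}H.{3}H" "$work/demo.line.fasta" | cut -d '|' -f2 | sort > "$work/zinc.ids"
grep "L[AST]Q" "$work/demo.line.fasta" | cut -d '|' -f2 | sort > "$work/lsq.ids"

zinc_ids=$(cat "$work/zinc.ids")
# comm -12 hides lines unique to either file, leaving only the shared ones.
both=$(comm -12 "$work/zinc.ids" "$work/lsq.ids")

echo "zinc finger hits: $(echo $zinc_ids)"
echo "in both lists: $both"
rm -rf "$work"
```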

3. Using VCF files (TBD)

VascoMSantos / bio.sh