Basic intro to Linux and shell scripting for Bioinformatics
- Unix and the command line
- Connecting and sharing
- Basic commands
- Users and groups
- Get into your machine and system
- Examples for Bioinformatics
Unix history
Unix operating systems derive from the original AT&T UNIX, developed from 1969 onward at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and others.
Linux history
- A Unix-like operating system
- Started in 1991 by Linus Torvalds
Ubuntu
- Ubuntu is one of many open source operating systems based on Linux. https://ubuntu.com/download/desktop
Linux file system
- The Filesystem Hierarchy Standard (FHS) defines the main directories and their contents in Linux operating systems.
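A quick way to see the FHS layout on your own machine is to check a few of the standard top-level directories. The sketch below assumes a typical Linux installation; the function name check_fhs_dirs is made up for this example, the directory list is only a small sample, and the one-line descriptions are summaries rather than the full FHS definitions.

```shell
# Report whether some standard FHS directories exist on this machine.
check_fhs_dirs() {
    for entry in "/bin:essential commands" "/etc:system configuration" \
                 "/home:user home directories" "/tmp:temporary files" \
                 "/usr:installed programs and libraries" \
                 "/var:logs and other variable data"; do
        dir=${entry%%:*}      # text before the first ':'
        purpose=${entry#*:}   # text after the first ':'
        if [ -d "$dir" ]; then
            echo "$dir ($purpose)"
        else
            echo "$dir not found on this system"
        fi
    done
}

check_fhs_dirs
```

Minimal systems such as containers may lack some of these directories, which is why the function reports a missing entry instead of failing.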
Linux shell
- Shell: the shell is a program that takes your commands from the keyboard and gives them to the operating system to perform
- Terminal: a program that runs a shell
- Directory: folder, or location of a file
What do I need to connect to a remote Linux server?
- Windows: PuTTY (https://www.putty.org/) or an Ubuntu-based bash shell
- Linux/Mac: ssh (built-in on terminal)
SSH syntax
ssh user@hostname
- Example 1:
ssh joel@darwin.dei.uc.pt
- Example 2:
ssh maria@193.137.200.184
SCP syntax
- SCP is used to copy files to and from the server
scp file.txt user@hostname:/some/remote/directory
← copy a local file to the remote server
scp user@hostname:file.txt /some/local/directory
← copy a remote file to my computer
- Example 1:
scp P00750.fasta joel@darwin.dei.uc.pt:/
- Example 2:
scp maria@193.137.200.184:P00750.fasta ./
Wget syntax
- Wget is used to get public files from a server.
wget example-url
- Example 1:
wget https://www.uniprot.org/uniprot/P00750.fasta
ls
← list files from current directory
ls /home/reports
← list files from a specific directory
ls -lhS
(optional arguments: http://manpages.ubuntu.com/manpages/trusty/man1/ls.1.html)
head -n3 file.txt
← print the first 3 lines
cp file.txt file1.txt
← make a copy of file.txt
rm file.txt
← delete file.txt
mv file1.txt file.txt
← rename file1.txt to file.txt (mv can change the location, the name, or both)
mkdir some_folder
← create a folder
mkdir some_folder_{0..9}
← create multiple folders
cd
← go to home directory
cd some_folder
← change folder
cd some_folder/reports
← change folder
cd ~/some_folder/reports
← path relative to the home directory (~)
cd ./some_folder/reports
← relative path from current directory
cd ../some_folder/reports
← relative path from parent directory
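The commands above can be tried safely in a throwaway directory. This is a minimal sketch: the folder and file names (project, reports, seq.txt) are made up for illustration, and mktemp -d is assumed to be available, as it is on Ubuntu.

```shell
# Work inside a throwaway directory so nothing in your home is touched.
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir -p project/reports          # create nested folders in one step
echo "ATGCGT" > project/seq.txt   # create a small file to play with

cp project/seq.txt project/seq_copy.txt              # copy within the same folder
mv project/seq_copy.txt project/reports/seq_v2.txt   # move and rename at once

cd project/reports   # relative path from the current directory
ls                   # shows seq_v2.txt
cd ..                # go up to the parent directory (project)
pwd                  # print where we are now
rm seq.txt           # delete the original file
ls -R .              # list what is left, recursively
```

Note that rm deletes immediately and there is no trash folder to recover from, so a sandbox like this is a reasonable place to practice.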
- Create/open file with the nano text editor.
nano example.py
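import sys
if len(sys.argv) > 1:
    for i in range(int(sys.argv[1])):
        print(i)
python3 example.py 10
← run the script; prints the numbers 0 to 9
python3 example.py 10 >out.txt
← write the output to out.txt (overwrites the file)
python3 example.py 10 >>out.txt
← append the output to the end of out.txt
python3 example.py 10 | sort -r | head -n3
← pipe the output through sort (reverse order) and keep the first 3 lines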
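The difference between >, >> and | is easiest to see by counting lines. The sketch below uses seq 0 9 as a stand-in for the Python script (it also prints 0 to 9, one per line), so it runs without example.py; the temporary directory is an assumption made only to keep the example self-contained.

```shell
outdir=$(mktemp -d)

seq 0 9 >  "$outdir/out.txt"    # '>' creates or overwrites the file (10 lines)
seq 0 9 >> "$outdir/out.txt"    # '>>' appends, so the file now holds 20 lines
wc -l < "$outdir/out.txt"       # count the lines to confirm

# '|' sends the output of one command to the input of the next:
# sort the numbers in reverse numeric order and keep the top three (9, 8, 7).
seq 0 9 | sort -rn | head -n3
```

Running the first seq line again would reset the file to 10 lines, which is the usual way results get overwritten by accident.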
- Terminate a process:
CTRL-C
- Suspend a process:
CTRL-Z
fg
← bring it back to the foreground
sleep 20 &
← & means that the command will run in the background
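In a script there is no keyboard to press CTRL-Z, so background jobs are usually handled with & and wait. The following is a small sketch under that assumption; the log file and the message are invented for the example.

```shell
logfile=$(mktemp)

# Start a short job in the background; '&' returns control immediately.
( sleep 1; echo "background job done" >> "$logfile" ) &
bg_pid=$!     # $! holds the process ID of the most recent background job

echo "foreground continues while PID $bg_pid runs"

wait "$bg_pid"    # block until the background job has finished
cat "$logfile"    # the message written by the background job
```

The same process ID could be passed to kill to stop the job early, which is the scripted counterpart of CTRL-C.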
whoami
← who am I?
lsb_release -a
← Linux distribution and version
ifconfig
← get my IP address (on systems without ifconfig, try ip addr)
df -h
← disk space available
du -sbh
← size of the directory
w
← who is logged in to this computer
top
← what processes are running
- grep: search for a pattern
grep "ATG" file
- wc: count lines, words, and bytes
wc -l file
← count the lines in file
grep "ATG" file | wc -l
← count the lines that contain ATG
- sort: sort text files
sort -r file
← sort file in reverse order
- tr: translate, squeeze, and/or delete characters
cat file | tr -d '>|'
← delete the characters > and | from a text file
- uniq: filter adjacent matching lines
uniq -c file
← count how many times each adjacent line repeats
- cut: print selected parts of lines
cut -d '|' -f3 file
← split each line on | and get column number 3
- sed: stream editor for filtering and processing text
- awk: pattern scanning and processing language
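sed and awk are listed above without a command, so here is a minimal sketch of each. It assumes two sample header lines in the UniProt ">db|accession|name" layout, written to a temporary file; the file and variable names are made up for illustration.

```shell
# Two sample FASTA-style header lines, written to a temporary file.
headers=$(mktemp)
printf '%s\n' '>sp|P00750|TPA_HUMAN' '>sp|P38398|BRCA1_HUMAN' > "$headers"

# sed: replace every '|' with a space (s/old/new/g acts on all matches per line)
sed 's/|/ /g' "$headers"

# awk: split each line on '|' and print the second field (the accession)
awk -F'|' '{ print $2 }' "$headers"

# uniq only merges adjacent duplicates, so sort first when counting values
printf '%s\n' sp tr sp | sort | uniq -c
```

The awk line gives the same result as cut -d '|' -f2, but awk can also filter and compute on the fields, which cut cannot.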
- Using the best of both worlds (more)
1. Simple processing of a tab delimited file
- Get a tab file from UniProt
mkdir -p ./data
← create the output folder if it does not exist yet
wget -O ./data/9606.uniprot.tab 'https://www.uniprot.org/uniprot/?query=*&format=tab&columns=id,entry%20name,reviewed,protein%20names,genes,organism,length&fil=organism:%22Homo%20sapiens%20(Human)%20[9606]%22'
- Format and general stats
head -n10 ./data/9606.uniprot.tab
tail -n10 ./data/9606.uniprot.tab
wc -l ./data/9606.uniprot.tab
- Search for a specific pattern
grep "BRCA2" ./data/9606.uniprot.tab
grep "ubiquitin" ./data/9606.uniprot.tab
- Search on a specific column
awk -F"\t" '$7>2000' ./data/9606.uniprot.tab
awk -F'\t' '$3 == "unreviewed"' ./data/9606.uniprot.tab
awk -F'\t' '$3 == "unreviewed"||$7<200' ./data/9606.uniprot.tab
- Get all proteins for a specific search
awk -F"\t" '$7>2000' ./data/9606.uniprot.tab | cut -f1,7 | sort -k2n
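The column filters above can be tried without downloading anything. The sketch below builds a tiny tab-delimited file with the same seven columns; the rows (X00001 and so on) and the exact header names are invented for illustration. It also skips the header line with NR > 1, since a text header such as "Length" is compared as a string and can slip through a filter like $7>2000.

```shell
tabfile=$(mktemp)
# Columns mirror the download above; the rows are invented for illustration.
printf 'Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\n' > "$tabfile"
printf 'X00001\tDEMO1_HUMAN\treviewed\tDemo protein 1\tDEMO1\tHomo sapiens (Human)\t3418\n' >> "$tabfile"
printf 'X00002\tDEMO2_HUMAN\tunreviewed\tDemo protein 2\tDEMO2\tHomo sapiens (Human)\t150\n' >> "$tabfile"
printf 'X00003\tDEMO3_HUMAN\treviewed\tDemo protein 3\tDEMO3\tHomo sapiens (Human)\t2100\n' >> "$tabfile"

# NR > 1 skips the header; "$7 + 0" forces a numeric comparison on the length.
# Keep ID and length, then sort by length (column 2, numeric).
awk -F'\t' 'NR > 1 && $7 + 0 > 2000' "$tabfile" | cut -f1,7 | sort -k2,2n

# Count entries per review status (column 3) using an awk array.
awk -F'\t' 'NR > 1 { n[$3]++ } END { for (s in n) print s, n[s] }' "$tabfile" | sort
```

With these invented rows the first command lists X00003 before X00001, and the second reports two reviewed entries and one unreviewed entry.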
2. Search over a FASTA file
- Get the human FASTA file from UniProt
mkdir -p ./data ./tmp
← create the output folders if they do not exist yet
wget -O ./data/9606.uniprot.fasta 'https://www.uniprot.org/uniprot/?query=*&format=fasta&fil=organism:%22Homo%20sapiens%20(Human)%20[9606]%22'
- Convert the FASTA file to one sequence per line
awk '/^>/ {printf("%s%s|",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' <./data/9606.uniprot.fasta >./tmp/9606.uniprot.line.fasta
- Get the IDs from a FASTA file (or another field)
cat ./tmp/9606.uniprot.line.fasta | cut -d '|' -f2 | sort
- Intersect two lists
- C2H2 zinc finger motif (assume the zinc finger motif to be C-X(2)-C-X(12)-H-X(3)-H, where X is any residue)
cat ./tmp/9606.uniprot.line.fasta | grep --color "C..C............H...H"
- Any regular expression
cat ./tmp/9606.uniprot.line.fasta | grep --color "L[AST]Q"
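The steps in this example can be rehearsed on a toy file before running them on the full proteome. The sketch below is only an illustration: the three sequences and their IDs are invented, and comm -12 on two sorted ID lists is one possible way to intersect two lists (the text above does not name a command). The pattern C..C.{12}H...H is the zinc finger expression with the twelve dots written as a repeat count.

```shell
workdir=$(mktemp -d)

# A tiny invented FASTA file; headers follow the UniProt '>db|ID|name' layout.
cat > "$workdir/demo.fasta" <<'EOF'
>sp|X00001|DEMO1_HUMAN Demo protein 1
MTGEKPYKCPECGKS
FSRSDHLAEHQRTHLSQ
>sp|X00002|DEMO2_HUMAN Demo protein 2
MGGLAQGG
>sp|X00003|DEMO3_HUMAN Demo protein 3
MPPPPCPPPP
EOF

# One record per line: header, then '|', then the joined sequence.
awk '/^>/ {printf("%s%s|",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
    < "$workdir/demo.fasta" > "$workdir/demo.line.fasta"

# IDs of the records matching each pattern, sorted so that comm can compare them.
grep -E 'C..C.{12}H...H' "$workdir/demo.line.fasta" | cut -d '|' -f2 | sort > "$workdir/zinc.ids"
grep 'L[AST]Q' "$workdir/demo.line.fasta" | cut -d '|' -f2 | sort > "$workdir/lq.ids"

# comm -12 prints only the lines present in both sorted files (the intersection).
comm -12 "$workdir/zinc.ids" "$workdir/lq.ids"
```

Because each record is a single line, grep searches the header text as well as the sequence, so a pattern made of common letters may also match protein names. Cutting out the sequence field first avoids this.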
3. Using VCF files TBD