jafarijason / ops_for_data_scientists

Ops for data scientists

Sources and more readings

https://www.gnu.org/home.en.html

https://dagshub.com/blog/effective-linux-bash-data-scientists/

https://dagshub.com/blog/setting-up-data-science-workspace-with-docker/

https://linuxjourney.com/lesson/touch-command

https://www.redhat.com/en/topics/linux




How do you make the computer do what you want

  • What is Linux?
  • Why do people use it?
  • What is Bash?
  • How to use the terminal?
  • How to exit vim?!
  • Why would you use Linux, Bash, and other system tools?
  • What's the smart way to do it, based on our subjective experience?
  • What common problems will you come across, and how to solve them?
  • What's the mental framework for working with these tools, to gain understanding and learn more by playing?

Who is this for?

The curriculum and some of the tips are aimed at data scientists who want an introduction to the topics of Linux & Bash. However, the data science orientation mainly comes into play in a few domain specific tips, and in the stated motivations to learn these things - if you're an aspiring web developer, there's no reason not to benefit from this guide as well!

What is GNU?

GNU is an operating system that is free software; that is, it respects users' freedom. The GNU operating system consists of GNU packages (programs specifically released by the GNU Project) as well as free software released by third parties. The development of GNU made it possible to use a computer without software that would trample your freedom.
We recommend installable versions of GNU (more precisely, GNU/Linux distributions) which are entirely free software. More about GNU below.

What is Linux?

You can find a simple answer here.

  • A family of open source operating systems.
  • The Linux kernel was created by Linus Torvalds, who also created Git to manage its source code.
  • An operating system is a program that takes over shortly after your computer turns on.
  • For the first few seconds after your computer switches on, the motherboard runs a small built-in program called the BIOS (firmware rather than a full operating system), but it quickly hands control over to an operating system kernel, which is installed on one of the hard drives, a USB stick or a CD.
  • From that point on, the kernel decides which programs to run when, and how to control physical devices (via drivers).
  • An operating system is a bundle of programs that come packaged together. The kernel is the most important part, but it comes with more programs which help the users communicate with the kernel.
  • e.g. File explorers are part of the OS, but not the kernel - they're just graphical interfaces which sit between the user and the kernel.
  • Operating systems normally also handle file systems, user permissions, memory management, and many other things.
  • The thing that unites all the different operating systems in the Linux family is they all use the same Linux kernel - other parts differ. More on that later in the section about distributions.


What is Linux good for?


An operating system is, surprisingly, just a type of system. Systems are designed by humans, and better designs lead to better performance, stability, and flexibility. Linux is simply a better designed operating system. It's super flexible and stable - "blue screens of death" are exceedingly rare in production Linux servers, and their performance is very reliable. Which is why a vast majority of production systems run on Linux, and that's also why it's good for anyone working in tech to be Linux literate. That includes you, dear reader.

Being open source leads to high quality, as bugs have fewer dark places to hide in. Developers can peer under the covers to make sure their Linux applications will work well, rather than guessing and relying on questionable documentation from closed source operating system developers.

But with great power and flexibility comes a great ability to shoot yourself in the foot. Linux makes that easy as well.


Linux-like systems

  • macOS and Unix are very similar to Linux, but technically they are not Linux. You will have a hard time telling the difference, unless you dive deep.
  • Unix is older than Linux and extremely similar - In fact, Linux is an open source re-implementation of Unix (which was closed source, but very good). This is pretty much historic trivia, as Unix is rarely seen nowadays, but know that some people use the words Unix and Linux interchangeably.
  • In general, there’s a name for operating systems that look and feel like Unix – POSIX compliant, or *nix. When you see these words, translate them as “follows the conventions of Linux, such as basic commands for file manipulation (ls, cd, mkdir) and "/" as the root of the file system etc.”
  • GNU is a large set of free software which is the foundation for much of Linux – compilers, C libraries, programs to zip files, and many others. It's also the name of an independent POSIX operating system, with more hardcore ideology around free software than Linux.
  • All of the above systems, as well as Linux itself, are examples of POSIX compliant or *nix systems.

  • There are (too?) many flavours of “real Linux”, called distros or distributions. It can be a headache to differentiate them.
  • A distribution is like a "company", which invents a new operating system. They wrap the Linux Kernel with a new bundle of peripheral programs - i.e. they may use a different mix of GUI programs, support different hardware by default, etc. They release new versions occasionally.
  • Ubuntu is one of the most user friendly, widely supported, and easy to install distros.
  • Red Hat Enterprise Linux, or RHEL, is a different distro which is used sometimes in heavy duty production servers.
  • Fedora is the desktop equivalent of RHEL - usually, developers aiming to run their applications on RHEL servers will use Fedora for their development computers, to avoid compatibility issues.
  • Alpine is a super minimal distro which is used for many Docker images.
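
To see the kernel/distribution split on a machine you are logged in to, a minimal sketch like the following can help. It assumes the system ships /etc/os-release (most modern distros do); the variable names are only illustrative.

```shell
#!/usr/bin/env bash
# The kernel reports its own name and release.
kernel_name="$(uname -s)"
kernel_release="$(uname -r)"
echo "Kernel: $kernel_name $kernel_release"

# The distribution describes itself in /etc/os-release (assumed to exist;
# we fall back to "unknown" if it does not).
distro="unknown"
if [ -r /etc/os-release ]; then
  # Source the file inside the command substitution so its variables
  # do not leak into this shell.
  distro="$(. /etc/os-release && echo "${PRETTY_NAME:-unknown}")"
fi
echo "Distro: $distro"
```

Two machines running different distros can report the same kernel name here while the distro line differs.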








Interfaces

  • When people think of Linux, they usually associate it with a scary terminal (plus attached Anonymous hacker with a hoodie).
  • Don't Panic – it’s not so scary! Today, it’s really easy to install Linux on a computer, with a regular GUI wizard, if you pick a distro that cares about that sort of thing (for example, Ubuntu).
  • We'll focus on terminals / shells in this lecture, since that is always available, and generally where "real work" is done. Production servers will rarely have GUIs. Don't let that discourage you - after you get used to it, using the shell can become much more convenient than GUIs!

CLI

The Linux command line offers a stable of powerful tools that can boost productivity and help you understand the current state of your machine (e.g. disk space, running processes, RAM, CPU usage).


Working on a remote Linux instance is often a great way of becoming familiar with the command line, as you are forced to use it and cannot fall back on Mac’s Finder to navigate the file system.

Want a terminal that looks like this? Follow along and you will have one.

hollywood, Genact, Blessed-contrib

sudo apt update
sudo apt install hollywood byobu
hollywood
hollywood -s 4
hollywood -q

For more information, see hollywood, Genact, and Blessed-contrib.

Prepare environment

There are several ways to get a Linux environment:

  • A standalone PC or laptop with Linux installed
  • A virtual machine (VM) on your PC or laptop
  • A dedicated server in a data center
  • A VM in a data center
  • A VM from an IaaS provider such as AWS, GCP, or Azure
  • A container, for example with Docker

Install virtualbox

You can download it here; pick the build for your OS.

Download .iso file

In this course we work with Ubuntu Desktop and Ubuntu Server.

apt or apt-get

sudo apt update
sudo apt upgrade
sudo apt install <package>
sudo apt remove <package>

ssh

sudo apt install openssh-server

From your OS terminal (on Windows, install Git Bash first):

ssh <username>@<ip address>

firewall

sudo nano /etc/default/ufw
IPV6=yes
sudo ufw default deny incoming
sudo ufw default allow outgoing

sudo ufw allow OpenSSH
sudo ufw allow 22

sudo ufw show added

sudo ufw enable

create password for root user

sudo passwd root

allow the root user to log in over ssh

sudo nano /etc/ssh/sshd_config

Find PermitRootLogin (ctrl + w searches in nano) and change it to

PermitRootLogin yes

Press ctrl + x, then y and Enter, to save and exit

sudo systemctl restart sshd
sudo service sshd restart

reboot

sudo reboot

generate ssh key for your user

cd ~/.ssh
ls -la
ssh-keygen
cd ~/.ssh
ls -la
cat id_rsa.pub

Copy the content and add it to whatever service you are using (in this case GitHub).

If you don't want to type your password every time, copy your local machine's public SSH key to the Linux machine:

ssh-copy-id <user name>@<ip or host name>

Remote ssh with vscode

First, install Visual Studio Code.
Then add the Remote - SSH extension to VS Code.

Once connected, VS Code works against your Linux machine and you can do whatever you want there.


cd <target folder>
code .

git config

sudo apt install git

~/.gitconfig

[user]
	email = <your email in github>
	name = <your name>
[core]
	excludesFile = ~/.gitignore

~/.gitignore

node_modules

cd
mkdir gitHub
cd gitHub
git clone git@github.com:jafarijason/ops_for_data_scientists.git
cd ops_for_data_scientists

add hostname

export USE_HOSTNAME=<your host name>
# the redirect must run as root, so pipe through sudo tee instead of "sudo echo ... >"
echo "$USE_HOSTNAME" | sudo tee /etc/hostname
sudo hostname -F /etc/hostname


update and upgrade

sudo apt-get update
sudo apt-get upgrade -y

Zsh and oh-my-zsh

sudo apt install zsh

sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

~/.zshrc


export ZSH="$HOME/.oh-my-zsh"

ZSH_THEME="fino-time"

plugins=(
git
docker
docker-compose
rsync
aws
cp
dash
pep8
pip
pipenv
postgres
python
sudo
tmux
ubuntu
ufw
)

source $ZSH/oh-my-zsh.sh

add zsh at the end of ~/.bashrc

sudo visudo
YOUR_USERNAME_HERE ALL=(ALL) NOPASSWD: ALL

remote desktop (linux desktop only)

sudo apt -y install xrdp tigervnc-standalone-server
sudo systemctl enable xrdp
sudo systemctl start xrdp
sudo ufw allow 3389
sudo ufw reload

Useful commands

echo, date, whoami

echo

man echo
echo "Hello World! "
VAR="Ops for data scientists"
echo $VAR
VAR="Test for re define variable"
echo $VAR

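A=2
B=3
C=$A+$B            # plain string concatenation, no arithmetic
echo $C            # prints 2+3
C=`expr $A + $B`   # legacy backtick command substitution calling expr
echo $C            # prints 5
C=$(expr $A + $B)  # same, with the preferred $(...) form
echo $C            # prints 5
C=$(($A + $B))     # shell arithmetic expansion, no external command
echo $C            # prints 5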

example for bash file

cat ./bash/sum.sh
bash ./bash/sum.sh 10 11
sh ./bash/sum.sh 10 11

chmod +x ./bash/sum.sh
./bash/sum.sh 10 11
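
The contents of ./bash/sum.sh are not shown here, so the following is only a hypothetical stand-in for what such a script might look like. The function name sum_args is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ./bash/sum.sh (the real file is not shown here).
# Adds up all numeric arguments and prints the total.
sum_args() {
  local total=0
  local n
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# When saved as a script, "$@" holds the command-line arguments,
# e.g. ./sum.sh 10 11
sum_args "$@"
```

The shebang on the first line is what lets `./sum.sh 10 11` work after `chmod +x`; calling `bash ./sum.sh` names the interpreter explicitly instead.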

example2 for bash file

cat ./bash/sum2.sh
bash ./bash/sum2.sh
sh ./bash/sum2.sh

chmod +x ./bash/sum2.sh
./bash/sum2.sh

pwd

man pwd
pwd
cd .
pwd
cd ..
pwd
cd ...   # works in zsh with oh-my-zsh (alias for ../..); plain bash reports an error
pwd
cd ~
pwd
cd -

whoami

man whoami
whoami

whatis

whatis ls
whatis cat
whatis bash

man

man ls

add user in linux

sudo adduser test1

sudo usermod -aG sudo test1

su - test1

exit

sudo deluser test1 --remove-home

touch

touch <file name>

cat

cat <file name>
cat ./data/geolocation.csv | more
more ./data/geolocation.csv
cat ./data/geolocation.csv | less
less ./data/geolocation.csv
cat ./data/geolocation.csv | head
head ./data/geolocation.csv
cat ./data/geolocation.csv | tail
tail ./data/geolocation.csv

history

history

clear

clear

mkdir

mkdir books paintings
mkdir -p books/hemmingway/favorites

cp, mv, rm

cp ./imgs/distro-family-tree.png /tmp
mkdir -p /tmp/test
cp ./imgs/*.png /tmp/test
cp -r imgs /tmp/test/imgs-copy
cp -R imgs /tmp/test/imgs-copy


mv /tmp/distro-family-tree.png /tmp/distro-family-tree2.png


rm file1
rm -f file1
rm -i file
rm -r directory


rm -rf /tmp/test
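
Since rm -rf does not ask for confirmation, it can be worth rehearsing these commands in a throwaway directory first. The sketch below does that with mktemp; the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Work inside a throwaway directory so nothing important can be deleted.
workdir="$(mktemp -d)"

mkdir -p "$workdir/imgs"
touch "$workdir/imgs/a.png" "$workdir/imgs/b.png"   # hypothetical sample files

# Copy a whole directory tree; -r is required for directories.
cp -r "$workdir/imgs" "$workdir/imgs-copy"

# mv renames (or moves) in place; no copy is left behind.
mv "$workdir/imgs-copy/a.png" "$workdir/imgs-copy/renamed.png"

listing="$(ls "$workdir/imgs-copy")"
echo "$listing"

# Remove the sandbox when done. Quoting the variable and checking that it is
# not empty guards against the command expanding to something unintended.
[ -n "$workdir" ] && rm -rf "$workdir"
```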

ls

ls 
ls -la
ls -hlS
ls -lha

uname

uname
uname --help
uname -a

lsb_release

lsb_release 
lsb_release  --help

watch

watch ls -la 

watch -n 3 ls -la

ps

ps
ps -a

top

top

htop

htop

btop

# sudo snap install btop
btop

mc

mc

kill

kill -9 PID
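
kill -9 (SIGKILL) gives the process no chance to clean up, so a common pattern is to try the default SIGTERM first and escalate only if needed. A minimal sketch, using a dummy sleep process as the target:

```shell
#!/usr/bin/env bash
# Start a long-running dummy process in the background.
sleep 300 &
pid=$!

# Plain kill sends SIGTERM (signal 15), which lets the process clean up.
kill "$pid"

# wait returns 128 + signal number for a process ended by a signal.
status=0
wait "$pid" 2>/dev/null || status=$?
echo "sleep exited with status $status"

# kill -0 sends no signal; it only checks whether the PID still exists.
if kill -0 "$pid" 2>/dev/null; then
  kill -9 "$pid"   # last resort: SIGKILL cannot be caught or ignored
fi
```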

df

df 
df -h 

du

du
du -d 2 -h 
du -d 2 -h /
du -d 2 -h /root

scp, winscp

# scp stands for secure copy; it sends files to and from a remote instance.
# Send a file to the remote:
scp -i ~/.ssh/path/to_pem_file /path/to/local/file ubuntu@IPv4:./desired/file/location
# Recursively download a directory from the remote:
scp -i ~/.ssh/path/to_pem_file -r ubuntu@IPv4:./path/to/desired/folder/ ~/my_projects/
# Download a single file from the remote:
scp -i ~/.ssh/path/to_pem_file ubuntu@IPv4:./path/to/desired/file ~/my_projects/

nc

nc -zv 10.0.250.2 22-500
nc -zv 127.0.0.1 20-100

crontab

crontab -e

cat /etc/crontab

#* * * * * /bin/bash /root/gitHub/ops_for_data_scientists/bash/crontab.sh
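
The crontab line above has five time fields followed by the command to run. The script it points to is not shown here, so the following is a hypothetical stand-in that only appends a timestamp to a log file; LOG_FILE and its default path are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for bash/crontab.sh (the real script is not shown).
# A crontab entry has five time fields followed by the command:
#   minute  hour  day-of-month  month  day-of-week  command
#   *       *     *             *      *            /bin/bash /path/to/crontab.sh
# Five stars mean "every minute".

# LOG_FILE is an assumed name; cron jobs have no terminal, so write to a file.
LOG_FILE="${LOG_FILE:-${TMPDIR:-/tmp}/crontab_demo.log}"
echo "$(date '+%Y-%m-%d %H:%M:%S') job ran" >> "$LOG_FILE"
tail -n 1 "$LOG_FILE"
```

Running the script by hand once before uncommenting the crontab line is a quick way to check that it works outside cron.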

at

sudo apt install at

at 1pm + 2 days

atq

atrm <id>

/etc/passwd, /etc/shadow

cat  /etc/passwd | less
cat  /etc/shadow | less

network

sudo apt install net-tools
ifconfig

ip add show
ip a s

ss
ss -l4
sudo ss -tulpn


sudo lsof -i -P -n
sudo lsof -i -P -n | grep LISTEN

less /etc/services


cat /etc/services
grep  '22/tcp' /etc/services 

# https://www.cyberciti.biz/faq/how-to-check-open-ports-in-linux-using-the-cli/
sudo netstat
sudo netstat -tulpn
sudo netstat -tulpn | grep LISTEN

wall

wall

find

find $(pwd) -name geolocation.csv
find $(pwd) -type d -name data
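
The same two searches can be tried on a small scratch tree. The directory layout below is made up for illustration.

```shell
#!/usr/bin/env bash
# Build a small hypothetical project tree to search.
root="$(mktemp -d)"
mkdir -p "$root/data/raw" "$root/notebooks"
touch "$root/data/geolocation.csv" "$root/data/raw/old.csv" "$root/notebooks/eda.ipynb"

# All CSV files under the tree; quote the pattern so the shell does not expand it.
csv_files="$(find "$root" -type f -name '*.csv' | sort)"
echo "$csv_files"

# Only directories named "data".
data_dirs="$(find "$root" -type d -name data)"
echo "$data_dirs"

rm -rf "$root"
```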

tree

tree
tree /
tree ~
tree $(pwd)

tmux

tmux

lynx

sudo apt install lynx
lynx

wget

cd /tmp 
wget https://github.com/jafarijason/ops_for_data_scientists/raw/master/imgs/vscode-ssh3.png

curl

curl ifconfig.me

pv

cd /tmp 
fallocate -l 1G test.img

pv test.img > test3.img

dns server

# on older systems the DNS servers were set directly in this file
cat /etc/resolv.conf 


sudo nano /etc/systemd/resolved.conf

DNS=<dns server, e.g. 4.2.2.4>
sudo systemctl restart systemd-resolved.service
sudo systemctl status systemd-resolved.service
resolvectl status   # on older releases: systemd-resolve --status

Our Good Friend grep

grep is a command-line tool that searches for patterns within a file. grep will print each line within the file that has a pattern match to standard output (terminal screen). This can be especially useful when we maybe want to model or perform EDA on a subset of our data with a given pattern:

grep -n 'California' ./data/geolocation.csv > ./test/new_example_data1.csv
cat ./pythons/test1.py
python3 ./pythons/test1.py 
man grep
cat /etc/passwd
cat /etc/passwd | grep 'root'
grep 'root' /etc/passwd
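
To experiment with these options without touching real data, here is a self-contained sketch on a tiny CSV. The columns and rows are hypothetical; the real ./data/geolocation.csv may look different.

```shell
#!/usr/bin/env bash
# Hypothetical sample; the real ./data/geolocation.csv may have other columns.
csv="$(mktemp)"
cat > "$csv" <<'EOF'
city,state,zip
Roseville,California,95661
Austin,Texas,73301
Fresno,California,93650
EOF

# Keep the header plus every row mentioning California.
subset="$(head -n 1 "$csv"; grep 'California' "$csv")"
echo "$subset"

# -c counts matching lines instead of printing them.
count="$(grep -c 'California' "$csv")"
echo "matches: $count"

# -v inverts the match: rows that do NOT mention California (header included).
others="$(grep -v 'California' "$csv")"
echo "$others"

rm -f "$csv"
```

Note that -n prefixes each line with its line number. That is handy for inspection, but it adds an extra field when the output is redirected into a new CSV.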

cat ./data/geolocation.csv | grep 'Cal'  | more

grep 'Cal' < ./data/geolocation.csv

cat ./data/geolocation.csv | grep 'Cal'  | grep 'Roseville'
cat ./data/geolocation.csv | grep 'Cal'  | grep 'Roseville' | grep '29'

ls -la | grep 'data'

tree /
tree / | grep 'hollywood.png'

cat /var/log/syslog
cat /var/log/syslog | grep 'root'

cat ./data/mm
cat ./data/mm | grep -v -e '^$'

tail -f /var/log/auth.log | grep 'su'

tail -f /var/log/auth.log | grep 'su' &

grep -e j ./data/grepTest

grep -f ./data/grepTest  /etc/passwd 

grep -i -f ./data/grepTest  /etc/passwd 

grep -i -v -f ./data/grepTest  /etc/passwd 

cat /etc/ssh/ssh_config
cat /etc/ssh/ssh_config | grep -v '#'
cat /etc/ssh/ssh_config | grep -v '^#'
cat /etc/ssh/ssh_config | grep -v '^#' | grep -v '^$'

grep 'root' /etc/passwd
grep `whoami` /etc/passwd
grep -c -w `whoami` /etc/*.*
grep -s -c -w `whoami` /etc/*
grep -l -s -w `whoami` /etc/*
grep -L -s -w `whoami` /etc/*


grep -H -w root /etc/passwd
grep -H -w root /etc/passwd | cut -f 1 -d :
grep -H -w root /etc/passwd | cut -f 2 -d :
grep -H -w root /etc/passwd | cut -f 8 -d :

grep -T -H -w root /etc/passwd

grep -T -n -H -w root /etc/passwd

grep -T -A 3 -B 3 -n -H -w root /etc/passwd

Basic analysis

Word/symbol count

wc ./data/geolocation.csv

Unique elements

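# uniq only compares adjacent lines, so sort first;
# -u keeps the lines that occur exactly once
sort ./data/geolocation.csv | uniq -u

cut -d"," -f2 ./data/geolocation.csv | sort | uniq -u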
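
A related pattern that often comes up in quick EDA is counting how many times each value appears in a column. The sketch below uses a hypothetical two-column CSV and assumes the first line is a header.

```shell
#!/usr/bin/env bash
# Hypothetical sample data with a header row.
csv="$(mktemp)"
cat > "$csv" <<'EOF'
city,state
Roseville,California
Austin,Texas
Fresno,California
Dallas,Texas
Reno,Nevada
EOF

# Skip the header (tail -n +2), take column 2, sort so duplicates are
# adjacent, count each group, then order by count, highest first.
counts="$(tail -n +2 "$csv" | cut -d',' -f2 | sort | uniq -c | sort -rn)"
echo "$counts"

# Number of distinct values in the column.
distinct="$(tail -n +2 "$csv" | cut -d',' -f2 | sort -u | wc -l | tr -d ' ')"
echo "distinct: $distinct"

rm -f "$csv"
```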

Acquire data

head / tail

head -n 5 ./data/geolocation.csv     # first 5 lines
head -n -5 ./data/geolocation.csv    # everything except the last 5 lines (GNU head)
tail -n 15 ./data/geolocation.csv    # last 15 lines
tail -n -15 ./data/geolocation.csv   # same as -n 15

column

column -s"," -t ./data/geolocation.csv
column -s"," -t ./data/geolocation.csv | head
column -s"," -t ./data/geolocation.csv | tail

head -n 5 ./data/geolocation.csv | column -s"," -t

cut -d"," -f2,5 ./data/geolocation.csv | head
tail -n +1 ./data/geolocation.csv | sort -t"," -k1,1g -k2,2gr -k2,2
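
When a CSV has a header row, sorting the whole file mixes the header in with the data. One way around that, sketched here on a hypothetical three-column file, is to print the header separately and sort only the remaining lines.

```shell
#!/usr/bin/env bash
# Hypothetical sample: city, state, and a made-up numeric score.
csv="$(mktemp)"
cat > "$csv" <<'EOF'
city,state,score
Austin,Texas,3
Fresno,California,10
Roseville,California,2
Dallas,Texas,7
EOF

# Header first, then rows sorted by state (column 2, ascending) and
# by score (column 3, numeric, descending) within each state.
sorted="$(head -n 1 "$csv"; tail -n +2 "$csv" | sort -t',' -k2,2 -k3,3nr)"
echo "$sorted"

rm -f "$csv"
```

`tail -n +2` starts output at line 2, which is what drops the header; `tail -n +1` prints the file from the first line.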

Data cleaning

Drop columns

cut -d"," -f2,5 ./data/geolocation.csv > ./test/new_example_data1.csv

Filtering with grep

grep -n 'Cal' ./data/geolocation.csv 
grep -n 'Cal' ./data/geolocation.csv > ./test/new_example_data1.csv

Sampling

shuf -n 4 ./data/geolocation.csv
tail -n +1 ./test/new_example_data1.csv | shuf -n 4
sudo apt install athena-jot

jot 10
jot -r 5 1 100
jot 10 555
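
Coming back to shuf: when sampling rows from a CSV, the header usually needs to stay at the top. A minimal sketch, assuming the first line is a header and using made-up rows:

```shell
#!/usr/bin/env bash
# Hypothetical sample data with a header row.
csv="$(mktemp)"
cat > "$csv" <<'EOF'
city,state
Roseville,California
Austin,Texas
Fresno,California
Dallas,Texas
Reno,Nevada
EOF

# Print the header as-is, then a random sample of 2 of the data rows.
sample="$(head -n 1 "$csv"; tail -n +2 "$csv" | shuf -n 2)"
echo "$sample"

rm -f "$csv"
```

The sampled rows differ from run to run, so only the header line and the row count are predictable.
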

awk

# Printing a column or field
awk '{print $3 "\t" $4}' ./data/marks.txt
awk '{print $0}' ./data/marks.txt

# Printing all lines that match a pattern
awk '/a/' ./data/marks.txt

# Printing columns by pattern
awk '/a/ {print $3 "\t" $4}' ./data/marks.txt

# Printing columns in any order
awk '/a/ {print $4 "\t" $3}' ./data/marks.txt

# Counting and printing matched patterns
awk '/a/{++cnt} END {print "Count = ", cnt}' ./data/marks.txt
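
awk can also aggregate, which is handy for quick summary statistics. The sketch below computes an overall mean and a per-group mean. The file layout (id, name, subject, mark, separated by spaces) and its contents are hypothetical; the real ./data/marks.txt may differ.

```shell
#!/usr/bin/env bash
# Hypothetical file: id, name, subject, mark, separated by spaces.
marks="$(mktemp)"
cat > "$marks" <<'EOF'
1 Ana Physics 80
2 Ben Maths 90
3 Cara Biology 85
4 Dev Physics 75
EOF

# Mean of column 4 over all rows; NR is the number of rows read.
avg="$(awk '{ sum += $4 } END { printf "%.1f", sum / NR }' "$marks")"
echo "average: $avg"

# Mean mark per subject, using arrays keyed by column 3.
per_subject="$(awk '{ s[$3] += $4; n[$3]++ }
  END { for (k in s) printf "%s %.1f\n", k, s[k] / n[k] }' "$marks" | sort)"
echo "$per_subject"

rm -f "$marks"
```

The final sort is there because awk does not guarantee any order when looping over array keys.
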

sed

sed 's/unix/linux/' ./data/sed_test.txt

echo "Welcome To The Geek Stuff" | sed 's/\(\b[A-Z]\)/\(\1\)/g'
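
One detail worth seeing in isolation: without the g flag, sed's s command replaces only the first match on each line. A small sketch on an inline string (the sample text is made up):

```shell
#!/usr/bin/env bash
line="unix is great; unix is everywhere"

# Without g: only the first match on the line is replaced.
first_only="$(echo "$line" | sed 's/unix/linux/')"
echo "$first_only"

# With g: every match on the line is replaced.
all="$(echo "$line" | sed 's/unix/linux/g')"
echo "$all"

# sed can also delete lines: here, every empty line.
no_blank="$(printf 'a\n\nb\n' | sed '/^$/d')"
echo "$no_blank"
```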








install docker

sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

install docker compose

sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
docker-compose --version
