youvegotmoxie / tech-SysSnapv2

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Script details

This repo is managed by cPanel. Please submit all bugs and feature requests via email, or submit a pull request via GitHub. 

Requirements:

  • Root-level SSH access
  • CentOS or RedHat system

Why sys-snap?

Resource shortages can feel overwhelming and  impossible to track down without adequate data to diagnose the problem. Servers inevitably have problems when their sys-admins are not watching. While the Daily Process Log in WHM can be very helpful in these situations, sometimes more information is needed than WHM can provide.

Sys-snap is designed to help you see what is causing the resource shortages, whether CPU or Memory related, even when no one is looking.

This version of sys-snap is specifically designed to be used via SSH by the root user on cPanel servers, which means that this documentation and application is aimed at RedHat and CentOS systems.

History of Sys-Snap

System Snapshot began at EV1 Servers in the late 1990s or early 2000s.  It was written by Mike Kroh before being extended by Nate Custer.  The script is often used when traditional methods of investigation do not shed light on what is causing servers load to skyrocket without warning or the server to crash.  Many versions of the script exist, as it has been carried by various techs to different companies who have extended and modified to fit both their general needs and specific special circumstances.  The version presented here was modified for use by employees at Hostgator and AlphaRed before being merged into a different version being used by cPanel around 2011.  A descendant of the 2011 version was ported to perl by Paul Trost at cPanel in 2013. The recent version discussed in this article was completed by Bryan Christensen at cPanel in 2015.

How to use sys-snap

Install

Sys-snap’s installation is incredibly simple two step process: Download the script, and run the install.

wget -O /root/sys-snap.pl https://raw.githubusercontent.com/cPanelTechs/SysSnapv2/master/sys-snap.pl && cd /root/ && chmod 744 sys-snap.pl && perl sys-snap.pl --start

Once installed, the script continues to run on the server until you stop it, or until the server reboots. After a reboot, if you want the script to resume recording data, you would need to start it again.

Every time you start the sys-snap.pl script, the existing data will be archived in .tar.gz format in the /root/ directory for your records.

Starting and stopping the script

To start the process, run the script with the --start flag:

[~] perl /root/sys-snap.pl --start
Sys-snap is not currently running
Start sys-snap logging to '/root/system-snapshot/' (y/n)?:y
Starting…
[~]

Sys-snap will run in the background. Logs will be written to /root/sys-snapshot/ every minute. Every hour a new folder with the current hour will be created. After 24 hours the folder should look similar to this:

[~/system-snapshot] ls -lah | head
total 104K
drwxr-xr-x  26 root root 4.0K May  5 13:50 ./
dr-xr-x---. 26 root root 4.0K May  4 22:05 ../
drwxr-xr-x   2 root root 4.0K Apr 24 00:59 0/
drwxr-xr-x   2 root root 4.0K Apr 24 01:52 1/
drwxr-xr-x   2 root root 4.0K Apr 24 10:53 10/
drwxr-xr-x   2 root root 4.0K Apr 24 11:54 11/
drwxr-xr-x   2 root root 4.0K Apr 24 12:55 12/
drwxr-xr-x   2 root root 4.0K Apr 24 13:33 13/
drwxr-xr-x   2 root root 4.0K Apr 23 14:56 14/

Each hour will have logs that were created for every minute of the hour:

[~/system-snapshot/0] ls -lah |head
total 3.5M
drwxr-xr-x  2 root root 4.0K Apr 24 00:59 ./
drwxr-xr-x 26 root root 4.0K May  5 13:51 ../
-rw-r--r--  1 root root  54K May  5 00:00 0.log
-rw-r--r--  1 root root  49K May  5 00:10 10.log
-rw-r--r--  1 root root  52K May  5 00:11 11.log
-rw-r--r--  1 root root  50K May  5 00:12 12.log
-rw-r--r--  1 root root  49K May  5 00:14 13.log
-rw-r--r--  1 root root  51K May  2 00:14 14.log
-rw-r--r--  1 root root  49K May  5 00:15 15.log

After 24 hours the logs will start to overwrite the previous logs. Each minute will overwrite the oldest log file. The logs are based on 24 hour time. 0 is 12AM.

To stop the process, run the script with the --stop flag, and the script will ask you to confirm the process it is stopping:

[~] perl /root/sys-snap.pl --stop
Current process: 20081 root     perl sys-snap.pl --start
Stop this process (y/n)?:y
Stopping 20081
[~]

Gathering data from sys-snap

If the load average of your server is larger than the number of processors, load issues can occur. cPanel backups, log processing, and stat processing are delayed if this happens. Tracking down the cause of the load increase is where sys-snap comes in.

Using sar to narrow your window

One utility that can be used in tandem with sys-snap to help you track and diagnose instability is called sar. To verify that the sysstat package is installed, use the command below. Please note: sysstat will only begin recording information after it is installed, and cannot provide insight for a server before the package was installed.

yum install -y sysstat

Using various flags you can display different information that has been recorded about your server’s state. In diagnosing instability, or resource shortages, using the -q and -r flags will likely be most helpful to you. Here is a small piece of output from the 'sar -q' command. The 'ldavg' is the load average. The 'ldavg-#' is the time range for the load average. The three rightmost columns represent the 1, 5, and 15 minutes load averages for the time listed in the leftmost column:

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
...
06:10:01 AM      7    107   1.22   1.19   0.76
06:20:01 AM      5    104   1.31   0.99   0.83
06:30:01 AM     26    151  21.11  13.58   6.76
06:40:11 AM      7    158  21.74  20.63  13.96
06:50:02 AM     25    146  22.29  21.95  17.84
07:00:01 AM     24    148  23.46  23.35  20.46
07:10:02 AM     24    138  23.50  23.05  21.56
07:20:01 AM     23    142  19.20  20.36  20.91
07:30:01 AM     17    135  14.67  16.01  18.45
07:40:01 AM      7    103   1.70   8.44  14.09
07:50:01 AM      4    103   0.06   1.49   7.66
08:00:01 AM      4    105   0.01   0.23   4.02
08:10:01 AM      4    102   0.03   0.08   2.12
08:20:01 AM      5    111   0.05   0.07   1.13
08:30:01 AM      6    110   0.06   0.06   0.60
...

In the above example we see that the load was high from 6:30 AM to 7:50 AM, so that is where we need to focus our investigation. Sys-snap can print high resource using processes for a time range.

Polling sys-snap for specific information using --print

First, go on your server to the directory where the sys-snap.pl script is located, which is /root by default. The --print flag attempts to programmatically calculate what it calls memory and cpu scores. It does this by adding together the %MEM and %CPU columns, respectively. While mathematically incorrect, it gives us a general overview. You will see that each set of processes is sorted by user, and you’ll see a memory- and cpu- score displayed, similar to what is displayed below:

[~] ./sys-snap.pl --print 9:00 10:00
user: root           
cpu-score: 1.30         
memory-score: 299.60      
user: named          
cpu-score: 0.00         
memory-score: 28.60       
user: mysql          
cpu-score: 0.00         
memory-score: 80.60       
user: mailnull       
cpu-score: 0.00         
memory-score: 3.90        
user: dovecot        
cpu-score: 0.00         
memory-score: 1.30        
user: nobody         
cpu-score: 0.00         
memory-score: 19.50       
user: dovenull       
cpu-score: 0.00         
memory-score: 11.70       
[~]

The --print flag assumes you will be polling the sys-snap data for a specific time, so always pass the --print flag with a defined start and end time. For example, if we are trying to look at information between 6:30 and 7:50, as would be helpful given the output from our example above, this command would print information for the time range:

[~] /root/sys-snap.pl --print 6:30 7:50

When you start the script, old data is compressed into a tar file to prevent overwriting. To use sys-snap to parse old data, untar the file and pass the path using the --path flag:

[~] /root/sys-snap.pl --print 6:30 7:50 --dir=/system-snapshot.20150422

If we add the verbose flag to --print we can get even more detail:

[~] /root/sys-snap.pl --print 9:00 10:00 v
< manually truncated for brevity >
user: dovenull       
cpu-score: 0.00      
C: 0.00 proc: \_ dovecot/imap-login
C: 0.00 proc: \_ dovecot/pop3-login
memory-score: 14.40      
M: 8.00 proc: \_ dovecot/imap-login
M: 6.40 proc: \_ dovecot/pop3-login
< manually truncated for brevity >

sys-snap has a myriad of flags to make parsing through the information that is provides easier. Run the script with no flags to get the full list:

[~] ./sys-snap.pl
USAGE: ./sys-snap.pl [options]
--start : creates, disowns, and drops 'sys-snap.pl --start' process into the background
--print <start-time end-time>: Where time HH:MM, prints basic usage by default
--v | v : verbose output from --print
--check : checks if sys-snap is running
--stop : stops sys-snap after confirming PID info
--loadavg <start-time end-time>: Where time HH:MM, prints load average for time period - default 10 min interval
--max-lines : max number of processes printed per mem/cpu section
--no-cpu | --nc : skips CPU output
--no-mem | --nm : skips memory output

You can see some examples of these flags in use below in the ‘Advanced Examples’ section below.

In the example output below, the user 'eve' is showing the highest CPU usage for the interval, and you can see the command that was being run by that user:

user: dovecot         
memory-score: 84.30    memory-score:
M: 84.30 proc: \_ dovecot/auth
M: 0.00 proc: \_ dovecot/anvil
cpu-score: 6.90      
C: 6.90 proc: \_ dovecot/auth
C: 0.00 proc: \_ dovecot/anvil

user: eve
memory-score: 345.00
M: 345.00 proc: /usr/bin/php /home/eve/public_html/website.com/script.php
M: 0.00 proc: /usr/bin/ruby /usr/bin/mongrel_rails start -p 12008 -d -e production -P log/mongrel.pid
cpu-score: 23847.00
C: 23847.00 proc: /usr/bin/php /home/eve/public_html/website.com/script.php
C: 0.00 proc: /usr/bin/ruby /usr/bin/mongrel_rails start -p 12008 -d -e production -P log/mongrel.pid

Based on that output, the next step would be to investigate the '/home/eve/public_html/website.com/script.php script. This could be done by reading the script and checking the '/home/eve/access-logs/' logs to see what the script is doing. Many times there will be several processes and users that will need to be investigated. If a user is causing sever load, it might help to suspend them while the system administrator investigates the issue. If you have CloudLinux, the LVE manager could limit their resources and help increase server stability.

http://docs.cloudlinux.com/cpanel_lve_manager.html

If most of the users have the same resource usage but the server has high load, it’s time to think about upgrading the server hardware or moving some users to a different server.

Advanced uses and examples

Add an alias for quick access

If you want to easily run sys-snap from any directory, add this alias to your /etc/bashrc file:

alias syssnap="/root/sys-snap.pl"

Print only CPU scores

If you think you have a user that is spiking the processor between 11am and 11:15am, you would run a command like this to narrow down the user:

[~] ./sys-snap.pl --print 11:00 11:15 --no-mem | head
user: brock             
cpu-score: 142.76  
user: root           
cpu-score: 46.20       
user: ntp            
cpu-score: 0.00        
user: snapper            
cpu-score: 0.00        
user: mailnull       
cpu-score: 0.00        
user: brock             
cpu-score: 0.00     

Limit the number of lines per user in verbose output

If you want to limit the number of lines that are output per user when you are parsing through the verbose output of --print, use the --max-lines flag.

[~] ./sys-snap.pl --print 10:00 11:00 v --max-lines 5 | head 13
user: root           
cpu-score: 47.70     
C: 41.00 proc: \_ spamd child
C: 6.00 proc: \_ cpanellogd - updating bandwidth
C: 0.70 proc: \_ [cpaneld - servi] <defunct>
C: 0.00 proc: \_ [cgroup]
C: 0.00 proc: \_ [flush-252:0]

memory-score: 541.60     
M: 386.70 proc: /usr/local/cpanel/3rdparty/bin/clamd
M: 65.60 proc: \_ spamd child
M: 23.40 proc: lfd - sleeping
M: 15.60 proc: tailwatchd

Display load averages (helpful for servers without Sar)

If you want to print load averages for a given time frame, you can with the --loadavg flag.

[~] ./sys-snap.pl --loadavg 11:00 11:30
Time	1min-avg 5min-avg 15min-avg
11:00	  0.19	   0.07		0.03
11:10	  0.06	   0.04		0.00
11:20	  0.00	   0.02		0.00
11:30	  0.06	   0.03		0.00

Add -i to change the interval between the load averages displayed

[~] ./sys-snap.pl --loadavg 11:00 11:30 --i=5

Time 	1min-avg 5min-avg 15min-avg
11:00	 0.19	  0.07		0.03
11:05	 0.03	  0.05		0.02
11:10	 0.06	  0.04		0.00
11:15 	 0.02	  0.05		0.00
11:20	 0.00	  0.02		0.00
11:25	 0.31	  0.14		0.10
11:30	 0.06	  0.03		0.00

Parsing archived data

When you start the script you will see that it archives historical data:

[~] cd /root/ && chmod 744 sys-snap.pl && perl sys-snap.pl --start
Sys-snap is not currently running
Start sys-snap logging to '/root/system-snapshot/' (y/n)?:y
Starting...
tar: Removing leading `/' from member names
[~]

If you want to access that historical data, you just need to unarchive the folder and pass the --dir flag:

[~] tar -xf system-snapshot.20150422.1341.tar.gz 
[~] ./sys-snap.pl --print 6:30 7:50 --dir=/root/system-snapshot.20150422.1341 | head -n11
user: root           
cpu-score: 64.20       
memory-score: 1696.60     
user: roompod        
cpu-score: 9.30        
memory-score: 2.10        
user: patrick        
cpu-score: 8.50        
memory-score: 13.10       
user: msusci    
cpu-score: 0.20        
[~] 

Common problems

You may see this error when attempting to install the script:

[~] ./sys-snap.pl --install

Can't locate Time/Piece.pm in @INC (@INC contains: /usr/local/lib/perl5/5.8.8/x86_64-linux /usr/local/lib/perl5/5.8.8 /usr/local/lib/perl5/site_perl/5.8.8/x86_64-linux /usr/local/lib/perl5/site_perl/5.8.8 /usr/local/lib/perl5/site_perl .) at ./sys-snap.pl line 126.
BEGIN failed--compilation aborted at ./sys-snap.pl line 126.
[~]

You can correct it with this command:

[~] cpan -i Time::Piece

About

License:GNU General Public License v3.0


Languages

Language:Perl 100.0%