Java Application Monitoring and Troubleshooting Basics

4. Java Application as a Runtime White Box: App running, JVM and application monitoring, troubleshooting, faults analysing and tuning. 24 hrs / 3 days.

You have Java application running at prod server, Prometheus and ssh terminal. Let's start monitoring!

Training Objectives

Understanding modern application architecture and defect hotspots
Understanding JVM classes, memory and threading architecture
Hands-on skill of monitoring modern applications
Understanding modern IO architecture and its pitfalls
Hands-on skill of monitoring persistent data-driven applications

Prerequisites

Hardware

RAM ≥ 8Гб
Wi-Fi with Internet access

Software at student's developer desktop

Git Installation at Windows:

Download distribution: Git for Windows Setup
Run distribution
All installation questions and options leave as default

AdoptOpenJDK OpenJDK 8 HotSpot Latest Release Installation at Windows:

Download distribution: preffered Windows Installer at corporate repo (softwarecenter:SoftwareID=ScopeId_1999C8B9-1E4A-4832-84B4-AAC87980BDC9/Application_ec47efb0-71ec-4f0d-a374-05df14c3ae65) or fallback Windows installer at official website
Run distribution
All installation questions and options leave as default

VisualVM latest Installation at Windows:

Download distribution: default latest version
Unpack distribution to folder by your choice (e.g. "C:\Program Files")
Install plugins "Visual GC", "VisualVM-XXX" and "Tracer-XXX"

IntelliJ IDEA CE latest Community Edition Installation at Windows:

Download distribution: latest Windows Community edition
Run distribution
All installation questions and options leave as default

Maven latest binary zip Installation at Windows:

Download distribution: latest Binary zip archive
Unpack distribution to folder by your choice (e.g. "C:\Program Files")
Remember this path, later we'll refer it as M2_HOME. Or add Windows system variable M2_HOME with value of path to unpacked distribution folder.
Add full path to "bin" subfolder to Windows system PATH variable
Check installation: run "cmd" program and get Maven version mvn --version

SSH terminal (for Windows only)

Check if you already have SSH client:
run "cmd" program
try run ssh -V

If you don't have pre-installed SSH client, installation PuTTY for Windows at Windows:

Download distribution: MSI Windows Installer 64-bit
Run distribution
All installation questions and options leave as default

JMeter Installation at Windows:

Download distribution: latest Binary ZIP
Unpack distribution to folder by your choice (e.g. "C:\Program Files")
Remember this path, later we'll refer it as JMETER_HOME. Or add Windows system variable JMETER_HOME with value of path to unpacked distribution folder.
Add full path to "bin" subfolder to Windows system PATH variable

Network access from student stations to emulation of prod host

prod host accessible
Ports at {{ prod }}:ports_needed accessible

Agenda (20 a.hr.)

Training introducing and focusing (15m)

Schedule
Trainer
Training overview
Rules

Hands-on: Teams and their demand (15m)

Pairs forming and introduction
Attendees prerequisites check
Topics focus demand from attendees
Additional topics demand form attendees

Java app development crash course (50m)

What do any application doing?

Concept	Metaphor	Code
Thread	Worker man	Thread created by runtime: `java MyApplication`
Thread Pool	Workers Team often the same Role	Typical architecture: pool managed by container, async arch: some pools managed by framework/application
Data input	Visitor's wishes	Console user input
Data processing	Meal recipes, conversation scripts, labor instructions	Code as instructions
Data storing	Persistent production store	Files as persistent store
Data output	Giving away to Visitor his meals	Console output

How we do model the data?

Concept	Metaphor	Code
Primitive Types	People can think and communicate only with numbers and strings	String restaurant menu
Structures	People can think with composite entities, concepts	Domain class and enum
Object of structure	Instance of concept, with its own state differs from other instance	Dealing with particular object while processing request

How we do model the behavior?

Concept	Metaphor	Code
Procedure	Meal recipe or conversation script	Setting behavior with methods
Call stack	Chain of actions workers call at others	Calling method from method
Class	Role: Chief or Waiter, state + bunch of procedures dealing with it	Today we likely divide state and behavior to domain entities and services
Object of class	Johnny the Chief and Maggy the chief differs with its state but have same behavior
Application logic	Scenario how to behave all the workers in any case	Workers takes responsibilities on them to rule at their level

Where data is stored? Core data scopes

Concept	Metaphor	Implementation
Local/method/stack variables	Short-term memory: Chief remember sugar doze only when doing sugaring	Call Stack
Parameters	Details when asking others to do some work: waiter asks johnnyChief.makeMeal(whatMeals?)	Call Stack
Object state	State of worker or structure: its current properties values	Heap object space
- Request scope	Some object state accessible to all the workers in call chain handling request: sticky note or voice message given each worker to next, "not spicy"	Parameters, framework support, ThreadLocal
- Session scope	Some object state accessible to all the workers handling all requests from the same Visitor: "its for table 13"	Framework support
- Singleton/application scope	Some object state accessible to all the workers	Framework support, Language support for static variables
Persistent	Long-term data store surviving system restarts	File, embedded/local database, remote filesystem, remote database
Integration	Data stored and processed by external system	Remote system procedure call, message queue

How do we implement application with Java

Concept	Metaphor	Reality
Runtime	If Developer is CEO setting application logic, Runtime is your vice	JVM API and system library API
Working with thread: Thread API, states, pooling	We can create work force on demand to execute our instructions	But we have some RAM memory and performance cost
Working with class: dynamic classloading	Instructions what to do workers get just in time not ahead but worker remember it till die	But we have run-time latency costs
Working with instance: create and GC	We ask our vice to hire and retire workers	Objects state costs us RAM memory. When object's no longer needed it purged from RAM

Teamwork: Why we do monitoring? What metrics do we consider for dev, test, qa and production environments? (30m)

Why we do monitoring?
What Quality Attributes/NFRs does JVM provide for application?
What Quality Attributes/NFRs do we satisfy with application monitoring?

Java app delivery crash course (50m)

How do we build Java application?

How do we run Java application?

JVM vs JRE vs JDK
Run with JVM
Ways for application run-time parameterization: jvm parameters, program arguments, sys/app properties
Key JVM parameters for memory setup

How do we monitor Java application internals?

JMX simple tooling demo: JVisualVM
JMX architecture overview

Hands-on quest: Simple application local building, running and monitoring (50m)

Given

Satisfied prerequisites
Cloned locally training content

git clone --depth 1 --branch master https://github.com/eugene-krivosheyev/java-application-monitoring-and-troubleshooting
cd java-application-monitoring-and-troubleshooting

Credentials for corporate Maven Artifactory repo set up

cp iaac/roles/maven/files/settings.xml $M2_HOME/conf/
cp iaac/roles/maven/files/settings-security.xml ~/.m2/
mvn --encrypt-master-password {{ trainer_given_master_password }}
vi ~/.m2/settings-security.xml

Set up Maven to ignore SSL cert errors

-Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true -Dmaven.wagon.http.ssl.ignore.validity.dates=true

When

Project application built locally with IDEA's built-in maven mvn clean verify [-DskipTests]
OR built JAR found in iaac/roles/target_app/files
Project application ran locally with CLI

java \
  -Xms128m -Xmx256m \
  -cp target/dbo-1.0-SNAPSHOT.jar \
  -Dapp.property=value \
    com.acme.dbo.Presentation \
      program arguments

JVisualVM profiler ran $JAVA_HOME/bin/jvisualvm
JVisualVM profiler connected to running app Local connections
OS-specific monitoring tool shows application process details

linux$ top [-pid jvmpid]
windows> taskmgr

Then answered and reviewed at debrief

What is the default encoding for I/O?
Does the maximun heap size shown with JVisualVM correspond to -Xmx key?
What is the default heap size for app running?
How many java threads is active within JVM?
How many OS threads is active within OS JVM process?
What is the minimal possible heap size for app running?
What is the difference for profiler times: Self time/Total time, CPU time?

After debrief

Application architecture overview
Java Command Line Inspector

Modern applications architecture and deployment: What tiers do we monitor? (1h)

Tier
Application logic layers: UI/P, API/C, BL/S, DAL/R
Application caching
Thread Pool
JPA Caching
JPA subsystem
Connection Pools
JDBC subsystem
Framework configuration with profiles
Framework for Spring modules management
Framework for Web/SOAP/REST application expose
Framework for Application
Application Server/Servlet Container
JVM: application debug API
JVM: application profiling API
JVM: universal monitoring API
JVM: threads, IO
JVM: memory, GC
JVM: process
Container: Networking
Container: Core
Message queues
DBMS
OS: Threads
OS: Processes
Hardware: HDD/SSD
Hardware: RAM
Hardware: CPU

Tiers and components to monitor diagram

puml

@startuml
!define SPRITESURL https://raw.githubusercontent.com/rabelenda/cicon-plantuml-sprites/v1.0/sprites
!includeurl SPRITESURL/server.puml
!includeurl SPRITESURL/linux.puml
!includeurl SPRITESURL/docker.puml
!includeurl SPRITESURL/java.puml
!includeurl SPRITESURL/tomcat.puml
!includeurl SPRITESURL/cog.puml
!includeurl https://raw.githubusercontent.com/tupadr3/plantuml-icon-font-sprites/v2.4.0/devicons/coda.puml


component "<$server>\nhardware" as hardware #lightgray {
    [CPU]
    [RAM]
    [HDD]
    [LAN]

    component "<$linux>\nOS" as os #white {
        [container support] 
        [process management]
        [thread management]
        [filesystem i/o]
        [network i/o]

        component "<$docker>\ncontainer" as container #lightgray {
            [network virtualization]
            [port mapping]
            [overlay fs]
            database "disk image"
            
            component "<$java>\njvm process" as jvm #white {
                [dynamic class loading]
                [JIT]
                [memory management]
                [heap GC]
                [threads management]
                [filesystem i/o api]
                [network i/o api]
                [monitoring API]
                [profiling API]
                [debug API]

                component "<$tomcat>\nservlet container" as web_container #lightgray {
                    [tcp connections\nmanagement]
                    [thread pool\nmanagement]
                    [http protocol\nhandling]
                    [web application\nlifecycle]
                    [java components\nlifecycle]

                    component "jdbc connection pool" as container_cp {
                        [jdbc driver]
                    }

                    component "<$coda>\nframework modules management system" as spring_boot #white {
                        [framework modules\nmanagement]
                        [application\nconfiguration context\nmanagement]

                        component "<$coda>\napplication framework" as spring_core #lightgray {
                            [application configuration\nhandling]
                            [application configuration\nprofiles support]
                            [application components\nmanagement]
                            [common scopes\nmanagement]
                            [user-defined thread pools\nmanagement]
                            [logging\nmanagement]

                            component "jpa persistent provider" #white {
                                [db data caching\nmanagement]
                                component "jdbc connection pool" as app_cp {
                                    [jdbc driver]
                                }
                            }

                            component "<$coda>\nweb/soap/rest framework" as spring_mvc #white {
                                [http protocol\nabstraction]
                                [request routing]
                                [data\n(de)serialization]
                                [data\nvalidation]
                                [monitoring\nendpoint]

                                component "<$cog>\napplication" as app #lightgray {
                                    [app data\ncaching management] #lightgray 
                                    
                                    package "data access\nlayer" as dal #white {
                                        [repository]
                                    }
                                    package "business logic\nlayer" as bl #white {
                                        [service]
                                    }
                                    package "api\nlayer" as cl #white {
                                        [controller]
                                    }
                                    package "presentation\nlayer" as pl #white {
                                        [view]
                                    }

                                    service -> repository 
                                    controller -> service
                                    view -> controller
                                }
                            }
                        }
                    }
                }     
            }
        } 
    }
}
@enduml

Monitoring architecture overview (30m)

pUML source

@startuml
node "dev station" as devstation {
 [ssh terminal] as terminal
 [browser]
 [jmeter]
 [jvm profiler] as profiler
}

actor Ops as ops
ops --> profiler
ops --> terminal
ops --> browser
ops --> jmeter

node "prod host" as prod {
 [node exporter] as node_exporter

 component [application] {
  [monitoring endpoint] as monitor
 }

 component [prometheus] {
  database metrics_history
 }
  
 [grafana] --> prometheus

 prometheus --> monitor
 prometheus -> node_exporter

 jmeter -> application
 node_exporter --> prod
 
 application --> [External REST service Stub]
}

terminal --> prod
browser --> prometheus
browser --> application
browser --> grafana
browser --> monitor
profiler --> application
@enduml

Demo: monitoring environment (15m)

Node Exporter
Prometheus
Grafana: setting up datasource and export dashboard
Custom Grafana dashboard as training outcome
Adding metrics to Custom Grafana dashboard

Load generation architecture overview (15m)

Types of performance testing except stress testing?
While monitoring: What type should we use? What performance metrics do we test?
Testing vs Monitoring

Demo: load generation tools (15m)

Hands-on quest: Modern application remote building, running and monitoring (50m)

Given

SSH user session with domain account to {{ prod host }} ssh {{ account }}@s-msk-t-jvm-XXX
Demo Application codebase cloned remotely

cd /opt
sudo git clone --depth 1 --branch master https://github.com/eugene-krivosheyev/agile-practices-application
sudo chown {{ account }}:users -R agile-practices-application
cd agile-practices-application

Made Maven able to run with given user

chmod a+x -R /opt/maven

Credentials for corporate Maven Artifactory repo set up

mkdir ~/.m2
cp /opt/maven/settings-security.xml ~/.m2/
mvn --encrypt-master-password {{ trainer_given_master_password }}
vi ~/.m2/settings-security.xml

Demo Application built remotely

cd /opt/agile-practices-application
mvn clean verify [-DskipTests]

External Legacy System REST stub started

cd target/test-classes # cat mappings/legacyAccountingSystemResponse.json
java -jar wiremock-jre8-standalone-2.31.0.jar --port 8888 [--verbose] & # curl localhost:8888/api/account

Application ran at {{ prod }}

cd /opt/agile-practices-application
rm -rf dbo-db

nohup \
  java \
    -Xms128m -Xmx128m \
    -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions \
    -XX:+PrintCompilation -XX:+LogCompilation -XX:LogFile=jit.log \
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=heapdump.hprof \
    -XX:+TraceClassLoading -XX:+TraceClassUnloading \
    -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=8 -XX:GCLogFileSize=8m \
    -Xlog:gc*,safepoint:gc.log:time,uptime:filecount=10,filesize=128K \
    -XX:NativeMemoryTracking=detail \
    -Dderby.stream.error.file=log/derby.log \
    -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false \
    -Djava.rmi.server.hostname="$(hostname -i)" -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9999 \
    -jar dbo-1.0-SNAPSHOT.jar \
      --spring.profiles.active=qa \
      --server.port=8080 \
> /dev/null 2>&1 &

Local load emulation set up: database test data provisioning (dbo-db folder)

cd java-application-monitoring-and-troubleshooting
jmeter -t load.jmx -j log/jmeter/jmeter.log # GUI mode

Read constants section
Set up ${PROD_HOST} constant
Toggled on setup test plan entry
Shown entry setup/Summary Report
Menu: Run/Start
Wait while got samples of ${CLIENTS} constant count
Toggled off setup test plan entry
Toggled on reporting-users, admin-users, operations-users test plan entries
Test plan saved

Local load emulation ran

cd java-application-monitoring-and-troubleshooting
mkdir -p log/jmeter/report
jmeter -n -t load.jmx -j log/jmeter/jmeter.log -l log/jmeter/jmeter.jtl -e -o log/jmeter/report # CLI mode

When

Java Command Line Inspector used to analyse application startup command line
CLI tools used at {{ prod }}

uname --all
cat /etc/os-release
uptime

df -ah
free -m

ps -ef
ps -eaux --forest
ps -eT | grep <pid>

top + 'q'
top + 'f'
top -p <pid>
top -H -p <pid>

vmstat 1 [-w] # mpstat 1
iostat 1 [-xm]
pidstat 1
netstat 1 [-tulnp]

jps [-lvm]
jcmd <pid> help
jcmd <pid> VM.uptime
jcmd <pid> VM.system_properties
jcmd <pid> VM.flags
jcmd <pid> PerfCounter.print
jcmd <pid> GC.heap_info
jcmd <pid> GC.finalizer_info
jcmd <pid> GC.class_stats
jcmd <pid> GC.class_histogram
jcmd <pid> Thread.print

java -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -XX:+PrintFlagsFinal -version
jinfo <pid>
jinfo -flag PrintGCDetails <pid> # get jvm flag value
jinfo -flag +PrintGCDetails <pid> # change flag value, makes sense only for _manageable_ ones

Web applications used

http://{{ prod }}:8080/dbo/swagger-ui.html

http://{{ prod }}:8080/dbo/actuator/health
http://{{ prod }}:8080/dbo/actuator
http://{{ prod }}:8080/dbo/actuator/metrics
http://{{ prod }}:8080/dbo/actuator/metrics/jvm.memory.max?tag=area:nonheap&tag=id:Metaspace

http://{{ prod }}:8080/dbo/actuator/prometheus

http://{{ prod }}:9090/alerts
http://{{ prod }}:9090/graph
http://{{ prod }}:9090/graph?g0.range_input=15m&g0.tab=0&g0.expr=http_server_requests_seconds_count

http://{{ prod }}:3000

Finally

JMeter load emulation stopped at dev station after ${TEST_DURATION_SEC}
Application gracefully stopped at {{ prod }} curl --request POST http://{{ prod }}:8080/dbo/actuator/shutdown

Then answered and reviewed at debrief

After debrief

Updated your custom Grafana dashboard with metrics you think is important http://{{ prod }}:3000
Full JVM Options Explorer added to bookmarks

Modern applications architecture and deployment: How do we monitor tiers? (1h)

Tier	Implementation	Tools
Application Layers	PWA or Server-side Template Engine, Spring @Controllers, @Services, Spring Data JPA @Repositories	Spring Metrics for Counters, Timers, Long Task Timers, Statistics
Application caching	spring-boot-starter-cache module + built-in default Simple cache provider	Spring Metrics for Caches
Thread Pool	Java built-in ExecutorService	Spring Metrics for DataSources
JPA subsystem and JPA Caching	Hibernate	service:jmx:// Hibernate built-in statistics
JDBC subsystem and Connection Pools	Derby JDBC driver + HikariCP	service:jmx://com.zaxxer.hikari, Spring Metrics for DataSources
Framework for modules management	Spring Boot	spring-boot-actuator + Built-in Micrometer + Prometheus Adapter
Framework for Application	Spring Core + Spring MVC (spring-boot-starter-web)	Spring Metrics for Web Instrumentation [for Prometheus], Core Micrometer [for Prometheus]
Application Server/Servlet Container	spring-boot-starter-tomcat
JVM: application debug API	JPDA	jsadebugd
JVM: application profiling API	JVMTI	hprof
JVM: threads, IO	JVM scheduler, JNI	jstack
JVM: memory, GC	Built-in Garbage Collectors	jstat, jstatd, jmap, jhat removed at jdk9
JVM: universal monitoring API	JMX	jvisualvm
JVM: process	Oracle/OpenJDK JRE	jps, jcmd 9 + jcmd 8 (non-experimental status), jinfo
Containers	Docker	docker cli, docker api for Prometheus, Prometheus cAdvisor
Message queues	n/u	vendor tools, prometheus exporters
DBMS	Apache Derby / Postgresql	vendor tools, Prometheus pg_exporter, pg explain, pg analyse
OS	Linux	ps, top
Hardware	x86	`df`, `free`, SNMP, Prometheus Node Exporter

Teamwork: What metrics do we consider for dev, test, qa and production environments? (15m)

Adding metrics to Custom Grafana dashboard

JIT compilation (45m)

JIT

~~javac~~
JIT compilation and compilation eligibility
Code Cache Memory: JIT native code
C1 and C2 JIT compilers
Compilation threads: C1(1/3) + C2(2/3)
On-stack replacement (OSR) and background compilation queue
C1 & C2 compiler queues dynamically resizes depending on available %CPU
Tiers/Levels:

0: interpreted mode
1: simple C1 compiled code: w/o profiling 
2: limited C1 compiled code: basic profiling
3: full C1 compiled code: full profiling
4: C2 compiled code

Performance-happy path: 0 -> 3 -> 4
Trivial Method: 0 -> {2,3} -> 1
Deoptimization -> level 0 (not entrant or zombie code)

Optimization examples

Dead code elimination
Inlining
C2 Escape analysis for var caching, synchronization, skipping object creation
Intel SSE and AVX instruction set

Tiered compilation trade-offs

Classes compiled -> commited code cache size
Startup time
Application throughput as f(warmup time) CodeCache is full. Compiler has been disabled.

Hands-on quest: JIT compilation monitoring (50m)

Given

Application ran at {{ prod }}
External Legacy System REST stub started
Load emulation ran

When

CLI tools used at {{ prod }}

# java -XX:+PrintCompilation -XX:+PrintInlining -XX:+PrintAssembly -XX:+PrintOptoAssembly (C2 only)
# java -XX:+LogCompilation -XX:LogFile=jit.log

java -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -XX:+PrintFlagsFinal -version | grep CodeCache
jinfo -flag UseCodeCacheFlushing <pid>
jinfo -flag InitialCodeCacheSize <pid>
jinfo -flag ReservedCodeCacheSize <pid>

jinfo -flag CICompilerCount <pid>
jinfo -flag BackgroundCompilation <pid>

jinfo -flag TieredCompilation <pid>
jinfo -flag CompileThreshold <pid> # _compile_ threshold applies only when standard compilation: -XX:-TieredCompilation
jinfo -flag Tier3InvocationThreshold <pid> # _invocation_ threshold applies when tiered compilation: -XX:+TieredCompilation
jinfo -flag Tier4InvocationThreshold <pid> # _invocation_ threshold applies when tiered compilation: -XX:+TieredCompilation

jstat -compiler <pid>
jstat -printcompilation <pid> [1000]
jstat -snap <pid> -J-Djstat.showUnsupported=true | grep .ci.

jcmd <pid> Compiler.codecache
jcmd <pid> Compiler.codelist | more

Web applications used

http://{{ prod }}:8080/dbo/actuator/metrics
http://{{ prod }}:9090/graph

JConsole Profiler used

jconsole://{{ prod }}:9999/Memory/Code cache
jconsole://{{ prod }}:9999/MBeans

jitwatch tool used

Finally

JMeter load emulation stopped
Application gracefully stopped
Database filled up with tests data removed

Then answered and reviewed at debrief

Did system warm-up while load?
Is Code Cache enough?

After debrief

Updated your custom Grafana dashboard with metrics you think is important

Profilers

Profiling

CPU (threads)
Memory (f.o.a. heap)
Events (allocations, synchronizations, io, etc.)

JConsole, (J)VisualVM

JMX Console
Instrumenting profiler (-)
Sampling profiler (+/-)

async profiler

Sampling profiler (+)
perf events
Flame graphs

JDK Mission Control

JMX Console
Sampling profiler
"OS events"
JVM embeds Flight Recorder events dump

JVM threading management (2h)

JVM threading architecture

Thread definition and duality
Scheduler and preemptive concurrency
Scheduling overhead
Green and native threads
Thread resources consumption: heap, stack, CPU, OS limits for threads and native stack memory

Thread states

Main states and transitions
Types of blocking/waiting

Stack

What does thread do?
Stack trace
Thread dump

Thread pools

Why thread pools?
Pool metrics

Teamwork: Where and when threads start in the application code? (15m)

For given application codebase spot all the timepoints and places of starting thread.

Typical concurrency issues and solutions

Issue	WTF	Solutions
Data Race	Concurrent non-atomic operation execution	Blocking: synchronized/monitor, non-blocking: atomics/CAS
Visibility and Reordering	JVM aggressive optimizations	JMM: synchronized + volatile
Deadlock	Threads blocked by each other	oops (

Teamwork: Is there correlation between sync code presence and application throughput (15m)

Performance = f(% sync code) ?

Concurrent memory-intensive applications design patterns

async operations + thread synchronization: wait/notify
threading patterns for logic processing: sync/async with multiple pools
MMO game/DBMS cases: entry point synchronization, coarse-grained data objects locking, fine-grained data objects subsequent locking
concurrent data structures

Teamwork: What metrics do we consider for dev, test, qa and production environments? (15m)

Adding metrics to Custom Grafana dashboard

Hands-on quest: Threads monitoring (50m)

Given

Application ran at {{ prod }}
External Legacy System REST stub started
Load emulation ran

When

CLI tools used at {{ prod }}

top + 'f' -> nTH -> space -> Esc
jstack <pid> > thread-dump.tdump
scp -P 22 {{ account }}@{{ prod }}:/tmp/thread-dump.txt .

Profiler used

jvisualvm://File/Load (thread-dump.tdump)
jvisualvm://Threads (plugin Threads inspector required)

Thread dump analysed
Web applications used

http://{{ prod }}:8080/dbo/actuator/metrics
http://{{ prod }}:9090/graph

Finally

JMeter load emulation stopped
Application gracefully stopped
Database filled up with tests data removed

Then answered and reviewed at debrief

How many threads are in system?
Native or Green threads implemented by JVM?
How many threads working on requests processing?
Common thread state at normal load?
Is it enough of threads?
Where threads count can be adjusted?
Is there contended in-memory resource?
Hypothesis on what business logic is most CPU consuming
Hypothesis on application threading patterns: (a) connection handling, (b) logic processing, (c) data access?

After debrief

Updated your custom Grafana dashboard with metrics you think is important

JVM IO management (1.5h)

Blocking IO architecture

Synchronous IO concept
What sources do we use to get/store data?

Typical issues

Excessive IO wrapper classes objects allocation
Encoding
Buffering as extra-resource allocation, non-reusability with OS
Buffering as CAP case
Blocking expensive server thread for user data IO
Excessive resource allocation: closing resources

Outgoing connections pooling

Resource pools main metrics

HTTP protocol

Teamwork: What metrics do we consider for dev, test, qa and production environments? (15m)

Adding metrics to Custom Grafana dashboard

Non-blocking IO architecture

Asynchronous IO concept
NIO overview
Threading patterns for connection data processing: sync/async (supported by sync/async Servlets, Spring MVC/WebFlux)

IO data processing architecture patterns

Threading scope patterns: thread-per-request, thread-per-connection, thread-per-session
Threading patterns for logic processing: sync/async with multiple pools
Data access: sync/async (files, DB, HTTP REST calls)

Hands-on quest: IO monitoring (50m)

Given

Application ran at {{ prod }}
External Legacy System REST stub started
Load emulation ran

When

Profiler used

jvisualvm://Threads (plugin Threads inspector required)

Thread dump analysed
Heap dump analysed
Web applications used

http://{{ prod }}:8080/dbo/actuator/metrics
http://{{ prod }}:9090/graph

Finally

JMeter load emulation stopped
Application gracefully stopped
Database filled up with tests data removed

Then answered and reviewed at debrief

Did test clients get http errors?
Incoming http connections long-liveness
What is the system default IO encoding?
Is there excessive IO wrapper objects allocation?
Is outgoing HTTP connections pooled?
Is it enough HTTP connections in pool?
What threading pattern for connection data processing used in application design?
What threading scope pattern used in application design?

After debrief

Updated your custom Grafana dashboard with metrics you think is important

JVM memory management (2h)

JVM memory architecture

[Non-heap] Code Cache: JIT native code, thread stacks, native libraries
[Non-heap] Allocated native memory: direct allocation, NIO buffers
[Non-heap] Compressed Class space
[Non-heap] Perm/Meta space
[Heap/Object space] New/Young generation (including Eden and Survivor spaces)
[Heap/Object space] Old/Tenured generation

Teamwork: Where and when is memory allocated in the application code? (15m)

For given application codebase spot all the timepoints and places of memory allocation.

Garbage dealing

What is garbage: objects ready for gc?
GC Roots: Stack trace, static fields, etc.

Generational GC algorithms

Minor VS Full GC and theirs ratio
Copying collector
Mark-sweep-compact (MSC) collector

GC itself threads

Single-threaded GC
Parallel GC

Concurrency

Stop-the-world pauses issue
Concurrent/low-pause collectors: without stopping application threads when it possible (e.g. Mark phase)
Trade-off: low-pauses (latency) vs CPU utilization taken out from application (throughput) vs Memory overhead (footprint)

Internal Compiler API names and GC framework names

	Serial	Parallel	Parallel + Concurrent
New Gen	`DefNew`	`ParNew`/`PScavenge`
Old Gen	`Tenured`/`PSOld`	`PSParOld`	`CMS`
Whole heap			`G1`

HotSpot collectors

Collector name	Main idea	Outcome	Full support /deprecation /experimental in JDK8	JDK 11	JDK 12	Settings
Serial	Single-threaded for new and old gen. S-t-w for new and old gen. Copying for new gen, MSC for old gen.	Suitable for containerized apps.	+	+	+	`-XX:+UseSerialGC`
Throughput / Parallel	Parallel for new and old gen. S-t-w for new and old gen. Copying for new gen, MSC for old gen.	Maximum throughput ценой latency.	+	+	+	`-XX:+UseParallelGC` `-XX:+UseParallelOldGC`
Concurrent Mark-Sweep (CMS)	Parallel for new and old gen. S-t-w for new gen, mostly concurrent for old gen. Copying for new gen, MS for old gen, Compact for full GC only.	Latency oriented.	+	-	-	`-XX:+UseConcMarkSweepGC`
G1	Multi-regional. Parallel for new and old gen. S-t-w for new gen, partly concurrent (for old gen Mark phase). Copying for new and old gen.	Latency oriented.	+/-	+	+	`-XX:+UseG1GC`
Epsilon	No-Op Garbage Collector	Test and research oriented.	n/a	+	+	`-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC`
ZGC	Scalable low latency concurrent garbage collector.	Latency oriented. Pauses no more 10ms.	n/a	?	?
Shenandoah	Scalable low latency concurrent garbage collector.	Latency oriented.	??	??	??
Actual collectors by JVM vendor and version

G1 tuning

Multi-regional: -XX:G1HeapRegionSize=n, value will be a power of two and can range from 1MB to 32MB. The goal is to have around 2048 regions based on the minimum Java heap size
Dynamic multi-regional: regions made E,S,O dynamically at run-time
Pause time oriented ergonomics:
-XX:MaxGCPauseMillis=200
-XX:GCPauseIntervalMillis
-XX:G1HeapWastePercent=10 (garbage allowed to left, Karl!)
нет гарантий по паузам
Loves large heaps (> 5G) and non-full heaps
in case of heaps < 2G maybe CMS?
G1 tracks gc times to autotune
Footprint overhead ~5-10% (Remember sets, Collection Sets)
Allowed heap size autotune
-XX:AdaptiveSizePolicy
-XX:MinHeapFreeRatio=40 (расширение)
-XX:MinHeapFreeRatio=70 (сжатие)
G1 schedules Old (mixed) GC based on heap usage: -XX:InitiatingHeapOccupancyPercent=45
Young, mixed and full GCs
young regions
mixed: young + 1/8 of old regions
full
Humongous objects corner case
50% of region
The only owner of region(s)
Old gen from the beginning

Heap dumps

Creating heap dump
Analysing heap dump
GC roots
Sizes: Shallow, deep (aggregation) and retained (composition)
Heap Dominators
Objects size and alignment (e.g. %8bytes), jol tool
Object reference types: hard, soft, weak, phantom
Compressed Oops as 35-bit reference (up to 32Gb) stored as 32-bit reference

Teamwork: What metrics do we consider for dev, test, qa and production environments? (15m)

Adding metrics to Custom Grafana dashboard

Native/off-heap memory more deep dive and new settlers

Hands-on quest: Memory monitoring (50m)

Given

Application ran at {{ prod }}
External Legacy System REST stub started
Load emulation ran

When

CLI tools used at {{ prod }}

jstat -gccapacity <pid>
jstat -gcutil <pid> [1000]
jstat -gc <pid>
jstat -gc <pid> 1000 10

jcmd <pid> GC.heap_info
jinfo -flag DisableExplicitGC <pid>
jcmd <pid> GC.run
jcmd <pid> GC.heap_info

jinfo -flag NewRatio <pid>
jinfo -flag NewSize <pid>
jinfo -flag MaxNewSize <pid>
jinfo -flag MetaspaceSize <pid>
jinfo -flag MaxMetaspaceSize <pid>
jinfo -flag ParallelGCThreads <pid>
jinfo -flag ConcGCThreads <pid> 
jinfo -flag MaxGCPauseMillis <pid> # for Throughput collector 
jinfo -flag InitiatingHeapOccupancyPercent <pid> # for G1 collector 

jcmd <pid> GC.class_histogram # jmap -histo[:live] <pid>

jcmd <pid> GC.heap_dump /tmp/dump.hprof # jmap -dump[:live,format=b],file=/tmp/dump.hprof <pid>
devstation$ scp -P 22 {{ account }}@{{ prod }}:/tmp/dump.hprof .

jcmd <pid> VM.native_memory summary

Web applications used

http://{{ prod }}:8080/dbo/actuator/metrics
http://{{ prod }}:9090/graph

Profiler used

jconsole://localhost:9999/Memory/Chart:*
jconsole://localhost:9999/MBeans

jvisualvm://File/Load (dump.hprof)

jvisualvm://VisualGC (plugin required)
jvisualvm://Buffer Pools (plugin required)

Finally

JMeter load emulation stopped
Application gracefully stopped
Database filled up with tests data removed

Then answered and reviewed at debrief

What GC type is used?
Has application stop-the-world GC pauses?
Is enough memory for given load?
Memory region with most risk of OOME
Top memory consuming application classes
What are the memory dominators?
What is the native memory footprint?

After debrief

Updated your custom Grafana dashboard with metrics you think is important

Persistent data management (1.5h)

JDBC subsystem architecture

DB Connection pooling

DB Connection pool architecture
Prepare Statements pooling
Pool metrics

JDBC Transactions

Transaction concept
Isolation levels
Transaction resources trade-offs

Teamwork: What metrics do we consider for dev, test, qa and production environments? (15m)

Adding metrics to Custom Grafana dashboard

JPA architecture

JPA API
EntityManager and Context/Session
Caching levels
JPA transactions architecture

Spring JPA architecture

Spring Data JPA module
Repositories code generation
Spring transaction management

Hands-on quest: JDBC subsystem monitoring (50m)

Given

Application ran at {{ prod }}
External Legacy System REST stub started
Load emulation ran

When

Profiler used

jvisualvm://Threads (plugin Threads inspector required)

Thread dump analysed
Heap dump analysed
Web applications used

http://{{ prod }}:8080/dbo/actuator/metrics
http://{{ prod }}:9090/graph

Finally

JMeter load emulation stopped
Application gracefully stopped
Database filled up with tests data removed

Then answered and reviewed at debrief

Is JDBC connections pooled?
Is it enough JDBC connections in pool?
ORM cache state?
Can we set less transaction isolation level to gain throughput?

After debrief

Updated your custom Grafana dashboard with metrics you think is important

Final retro (15m)

Value taken
Process Improvement Actions
Training Improvement Actions
Custom Grafana dashboards competition