
RHEL 6 PERFORMANCE & TUNING

Rodrigo Freire
Sr. Technical Account Manager
22/May/2014

Agenda

Performance Tuning Theory

RHEL6 Performance Improvements

CPU Performance Tuning/Power Management

Memory/NUMA Performance Tuning

Network Performance Tuning


Performance Tuning Theory


Performance Tuning Food Groups


CPU

Memory

I/O

Network

Performance or Reliability?

Faster transactions? Or efficiency?

Tradeoffs

Risks

Cost


Basic OS Setup

Disable unnecessary services and use runlevel 3

Avoid disk access in the critical path

Consider disabling filesystem journaling and {a,dir}time updates

Ever considered running swapless? (vm.swappiness)

CPU isolation?

Be aware of the BIOS making power-management decisions
(see the sketch below)
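
A minimal sketch of these basics on RHEL 6; the service names, mount point, and swappiness value are illustrative assumptions, not recommendations from this deck:

# disable services you do not need (names are examples)
chkconfig bluetooth off
chkconfig cups off
# boot to runlevel 3: set "id:3:initdefault:" in /etc/inittab
# drop atime/diratime updates on a data filesystem (mount point is an example)
mount -o remount,noatime,nodiratime /data
# tame swapping without going fully swapless
sysctl -w vm.swappiness=10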


Efficiency decision

Bandwidth: maximum possible throughput

Throughput: current bandwidth usage

High-efficiency approach: larger and fewer packets

Low-latency approach: send data immediately (more and smaller packets)


More bandwidth?

Cutting edge costs

Non-linear $/performance

Benefits?


In a nutshell...

TEST, MEASURE, TEST, MEASURE, TEST...

Be patient, accurate, and methodical... it's iterative

Know your hardware!!!!!!!!!

Be aware of the latency vs throughput balancing act

This cannot be stressed enough...


The enemy of extreme low latency: batching/coalescing

Avoid disk when you can...

Use tools such as SystemTap and perf (a quick perf sketch below)
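
A quick, hedged profiling example with perf; the 10-second sampling window is arbitrary:

# sample all CPUs (with call graphs) for 10 seconds, then summarize hotspots
perf record -a -g sleep 10
perf report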


RHEL6 Performance
Improvements


Performance Improvements in RHEL6

Component     Feature
CPU/Kernel    NUMA; ticketed spinlocks; Completely Fair Scheduler;
              extensive use of Read-Copy-Update (RCU);
              scales up to 64 vCPUs per guest
Memory        Large-memory optimizations: Transparent Huge Pages,
              ideal for virtualization
Networking    vhost-net, a kernel-based virtio with better throughput
              and latency; SR-IOV for ~native performance; RFS/XPS
Block         AIO, MSI, scatter-gather


tuned profile summary...

Tunable                              default   enterprise-  virtual-  virtual-  latency-     throughput-
                                               storage      host      guest     performance  performance
kernel.sched_min_granularity_ns      4ms       10ms         10ms      10ms      -            10ms
kernel.sched_wakeup_granularity_ns   4ms       15ms         15ms      15ms      -            15ms
vm.dirty_ratio                       20% RAM   40%          10%       40%       -            40%
vm.dirty_background_ratio            10% RAM   -            5%        -         -            -
vm.swappiness                        60        10           -         30        -            -
I/O scheduler (elevator)             CFQ       deadline     deadline  deadline  deadline     deadline
Filesystem barriers                  On        Off          Off       Off       -            -
CPU governor                         ondemand  performance  -         -         performance  performance
Disk read-ahead                      -         4x           -         -         -            -
Disable THP                          -         -            -         -         Yes          -
Disable C-states                     -         -            -         -         Yes          -
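
Applying one of these profiles is a one-liner with tuned-adm:

# list the available profiles, switch to one, then confirm
tuned-adm list
tuned-adm profile enterprise-storage
tuned-adm active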


Block Devices


Available Technologies


Solid-State Device

Spinning HDD


Virtualization Tuning: I/O elevators - OLTP

[Chart: Performance impact of I/O elevators on an OLTP workload, host running the deadline scheduler. Y-axis: transactions per minute, 0-300K; series: noop, CFQ, deadline; X-axis: 1, 2, and 4 guests.]


But if you use central storage...

Use the deadline elevator (example below)

The storage array has its own optimization algorithms and caches

Is the VM image on the storage?

CPU Performance Tuning


A word about CPU power management...

C- and P-states
You probably don't always need what you paid for...
Recent chips from major vendors slow themselves down
(called P-states)
or lower voltages and disable portions of the core, like timers
(called C-states)
and spin them back up on demand.
This adds latency.

Monitoring:
Use powertop, or turbostat from the kernel source (sketch below)
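
A hedged monitoring sketch; turbostat ships in the kernel source tree and may need to be built before use:

# interactive overview of wakeups, governors, and C-state residency
powertop
# report C-state/P-state residency while a command runs
turbostat sleep 5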


CPU Tuning

Variable frequencies
Multiple cores
Power-saving modes (cpuspeed governors):
performance
ondemand
userspace

Examples:
echo "performance" > \
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Best of both worlds: cron jobs that switch the governor mode using
tuned-adm (sketch below)
tuned-adm profile {default,latency-performance}
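
A minimal cron sketch of that "best of both worlds" idea; the schedule and profile choice are illustrative assumptions:

# crontab: low latency during business hours, defaults overnight
0 8 * * 1-5 /usr/sbin/tuned-adm profile latency-performance
0 18 * * 1-5 /usr/sbin/tuned-adm profile default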


Scheduler Policies

TS SCHED_NORMAL (aka SCHED_OTHER, the default policy)

FF SCHED_FIFO (realtime policy, first in, first out)

Don't set your RTPRIO to 99: that will starve kernel threads
that need to run sometimes.
There is no way to fully isolate a core for 100% userspace
processing. Recent study in a previous slide...

RR SCHED_RR, same as FIFO but with a defined quantum

SCHED_RR is only useful with more than one task at the same priority.

SCHED_BATCH, ISO SCHED_ISO, IDL SCHED_IDLE

Change the policy programmatically, or with chrt (examples below)

# ps -emo pid,pcpu,psr,nice,cmd,rtprio,policy
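
Hedged chrt examples; the priority values and PID are illustrative:

# run a command under SCHED_FIFO at priority 10
chrt -f 10 ./myapp
# inspect, then change, the policy of a running process (PID 1234 is an example)
chrt -p 1234
chrt -r -p 5 1234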


Your hardware might be fooling you!

SMIs

BIOS power governors

Broken BIOS!


Memory & NUMA


Transparent Huge Pages

Standard page: 4 kB
Huge page: 2048 kB

512 times larger!
Less L1 TLB consumption
Memory intensive: WIN!

[root@rfreire ~]# (disable THP)
[root@rfreire ~]# time memhog 1g
real 0m0.600s
user 0m0.186s
sys  0m0.412s

[root@rfreire ~]# (enable THP)
[root@rfreire ~]# time memhog 1g
real 0m0.303s
user 0m0.199s
sys  0m0.100s

Memory Tuning: Transparent Hugepages

Introduced in RHEL 6.0
Anonymous memory only (swappable, can be disabled)
Can coexist with traditional hugepages
Does not require application support (anonymous memory)
RHEL 6.2 added counters, explained in transhuge.txt:
# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 2018
thp_fault_alloc 7302
thp_fault_fallback 0
thp_collapse_alloc 401
thp_collapse_alloc_failed 0
thp_split 21
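
Checking and toggling THP at runtime; the redhat_ prefix in this path is an assumption based on the RHEL 6 kernel (upstream kernels use /sys/kernel/mm/transparent_hugepage):

# show the active mode (the bracketed value)
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
# disable, or re-enable, without rebooting
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled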


NUMA - What is it?

Non-Uniform Memory Access

CPU-Bound

Central memory controller?

Access Cost


NUMA

Multi-socket/multi-core architecture used for scaling

RHEL5/6 are completely NUMA-aware
Additional, significant performance gains come from
enforcing NUMA locality.

How do you enforce NUMA locality?

numactl -c1 -m1 ./command
The command executes on CPUs in socket 1,
and memory allocations are served out of memory node 1.

NUMA automation is an area of significant
research and investment by both Red Hat and the
community:
AutoNUMA, schedNUMA, numad (inspection sketch below)
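
Before pinning anything, inspect the topology; both tools ship with the numactl package:

# nodes, their CPUs, and free memory per node
numactl --hardware
# per-node allocation counters; numa_miss indicates remote allocations
numastat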

NUMA

[Diagram: NUMA node topology]

NUMA Topology and the PCI Bus

A server may have more than one PCI bus.

Optimal performance reduces/eliminates inter-node cross-talk: install the
NIC in the slot local to the node that your application will run on. Use
systemtap numa_faults.stp. irqbalance will learn this soon.
In the case below, the NIC is on PCI bus 0001; CPUs 1,3,5,7 are local
to that PCI slot.

lspci output:
0001:06:00.0 Ethernet controller: Solarflare Communications SFC9020 [Solarstorm]
# cat /sys/devices/pci0000\:00/0000\:00\:00.0/local_cpulist
0,2,4,6
# cat /sys/devices/pci0001\:40/0001\:40\:00.0/local_cpulist
1,3,5,7
# dmesg | grep "NUMA node"
pci_bus 0000:00: on NUMA node 0 (pxm 0)
pci_bus 0001:40: on NUMA node 1 (pxm 1)
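
For NICs there is a shortcut, assuming the interface is named eth0; a value of -1 means the platform did not report a node:

# NUMA node the network device hangs off
cat /sys/class/net/eth0/device/numa_node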


NUMA Topology and PCI Bus

[Diagram: PCI buses attached to different NUMA nodes]

Network


Network Determinism

Do you really need to use TCP?

If so, experiment with the TCP_NODELAY socket option (Nagle; sketch below)
From Wikipedia:

if there is new data to send
    if the window size >= MSS and available data is >= MSS
        send complete MSS segment now
    else
        if there is unconfirmed data still in the pipe
            enqueue data in the buffer until an acknowledge is received
            ^^^ latency ^^^ more noticeable in high-RTT/WAN environments
        else
            send data immediately
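
In application code this is a setsockopt(TCP_NODELAY) call; from the shell you can experiment with socat, assuming its nodelay address option (which sets TCP_NODELAY); the host and port here are illustrative:

# connect with Nagle disabled and compare round-trip latency
socat - TCP:server.example.com:5001,nodelay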


Buffer Bloat

Buffers are everywhere... www.bufferbloat.net

What is buffer bloat?

Latency caused by excessive buffering
A side-effect of ignoring latency in the race for
greater throughput

http://www.bufferbloat.net/projects/bloat/wiki/Introduction

Find out about your buffers.
Use 'ss -e' or 'netstat -nt' (sketch below)
NIC ring buffers and the new Byte Queue Limits

http://linuxplumbersconf.org/2011/ocw/sessions/171

https://lwn.net/Articles/454390/
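
Inspecting buffers at two layers; eth0 is an assumed interface name:

# per-socket details, including buffer/memory information
ss -e
# NIC ring buffer sizes, current vs. hardware maximum
ethtool -g eth0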


Multiqueue Networking (aka RSS)

Invented to allow Linux networking to scale along with the hardware
2-socket/8-core machines are extremely common; optimize for this use-case
A hash of src/dst IP:PORT determines the receiving CPU (per-queue IRQs below)
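
A multiqueue NIC shows one interrupt line per queue; eth0 is an assumed name:

# one IRQ per RX/TX queue on a multiqueue NIC
grep eth0 /proc/interrupts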

But the OS should handle all of this...

The Performance Group couldn't agree more!

Out-of-the-box performance experience is the highest priority:
Defaults work for the majority of use-cases
Auto-tuning where they don't
Hand-tuning as a last resort (RPS sketch below)
irqbalance is being taught about PCI bus locality
(local_cpulist)
The kernel already knows; RHEL6.3+ will set it for you.
numad can automatically balance NUMA node
utilization to avoid NUMA faults.
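
When hand-tuning is warranted, RPS steers receive processing to a CPU mask per queue; eth0 and the mask are illustrative:

# steer receive processing for queue 0 to CPUs 0-3 (bitmask 0xf)
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus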


Some useful resources

Performance Tuning Guide:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/Performance_Tuning_Guide/index.html
SystemTap:
https://access.redhat.com/site/solutions/5441
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/SystemTap_Beginners_Guide/index.html
Benchmarking tools:
https://access.redhat.com/site/solutions/173863
Tuned:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/Tuned.html
Seekwatcher & blktrace:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/ch06s03.html
Process schedulers:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Priorities_and_policies.html
Memory tuning:
https://access.redhat.com/site/solutions/16995
Numad:
https://access.redhat.com/site/articles/223693
RPS and RSS:
https://access.redhat.com/site/solutions/62869


THANK YOU!

Rodrigo Freire
[email protected]
http://people.redhat.com/rfreire/cce-perftun-bsb.pdf
