
The USE Method: Solaris Performance Checklist

The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.

In this post, I’ll provide an example of a USE-based metric list for the Solaris operating system (I’m writing this for later Solaris 10 or Oracle Solaris 11 systems; I’ll do illumos/SmartOS separately, later). This is primarily intended for system administrators of the physical systems.
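
Most of the metrics below are one-liners, so a first pass can be scripted. Here is a minimal sketch of such a wrapper (the script name and the particular commands chosen are my own illustration, not part of the checklist); it covers only a few of the system-wide checks, leaving per-process and DTrace follow-up for anything that looks suspicious:

    #!/usr/bin/sh
    # use-firstpass.sh: hypothetical quick first pass over a few of the
    # system-wide USE checks below; adjust commands and intervals to taste
    echo "== CPU saturation: load averages =="
    uptime
    echo "== CPU and memory utilization/saturation (vmstat) =="
    vmstat 1 5
    echo "== Disk I/O utilization/saturation (iostat) =="
    iostat -xnz 1 5
    echo "== Network interface error counters (netstat) =="
    netstat -i
    echo "== Hardware faults (FMA) =="
    fmadm faulty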

Physical Resources

component | type | metric
CPU | utilization | per-cpu: mpstat 1, “idl”; system-wide: vmstat 1, “id”; per-process: prstat -c 1 (“CPU” == recent), prstat -mLc 1 (“USR” + “SYS”); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
CPU | saturation | system-wide: uptime, load averages; vmstat 1, “r”; DTrace dispqlen.d (DTT) for a better “vmstat r”; per-process: prstat -mLc 1, “LAT”
CPU | errors | fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling)
Memory capacity | utilization | system-wide: vmstat 1, “free” (main memory), “swap” (virtual memory); per-process: prstat -c, “RSS” (main memory), “SIZE” (virtual memory)
Memory capacity | saturation | system-wide: vmstat 1, “sr” (bad now), “w” (was very bad); vmstat -p 1, “api” (anon page-ins == pain), “apo”; per-process: prstat -mLc 1, “DFL”; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname (see the one-liner after this table)
Memory capacity | errors | fmadm faulty and prtdiag for physical failures; fmstat -s -m cpumem-retire (ECC events); DTrace failed malloc()s
Network Interfaces | utilization | nicstat (latest version here); kstat; dladm show-link -s -i 1 interface
Network Interfaces | saturation | nicstat; kstat for whatever custom statistics are available (eg, “nocanputs”, “defer”, “norcvbuf”, “noxmtbuf”); netstat -s, retransmits
Network Interfaces | errors | netstat -i, error counters; dladm show-phys; kstat for extended errors, look in the interface and “link” statistics (there are often custom counters for the card)
Storage device I/O | utilization | system-wide: iostat -xnz 1, “%b”; per-process: DTrace iotop
Storage device I/O | saturation | iostat -xnz 1, “wait”; DTrace iopending (DTT), sdqueue.d (DTB)
Storage device I/O | errors | iostat -En; DTrace I/O subsystem, eg, ideerr.d (DTB), satareasons.d (DTB), scsireasons.d (DTB), sdretry.d (DTB)
Storage capacity | utilization | swap: swap -s; file systems: “df -h”; plus other commands depending on FS type
Storage capacity | saturation | not sure this one makes sense – once it’s full, ENOSPC
Storage capacity | errors | DTrace; /var/adm/messages file system full messages
Storage controller | utilization | iostat -Cxnz 1, compare to known IOPS/tput limits per-card
Storage controller | saturation | look for kernel queueing: sd (iostat “wait” again), ZFS zio pipeline
Storage controller | errors | DTrace the driver, eg, mptevents.d (DTB); /var/adm/messages
Network controller | utilization | infer from nicstat and known controller max tput
Network controller | saturation | see network interface saturation
Network controller | errors | kstat for whatever is there / DTrace
CPU interconnect | utilization | cpustat (CPC) for CPU interconnect ports, tput / max (eg, see the amd64htcpu script)
CPU interconnect | saturation | cpustat (CPC) for stall cycles
CPU interconnect | errors | cpustat (CPC) for whatever is available
Memory interconnect | utilization | cpustat (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters
Memory interconnect | saturation | cpustat (CPC) for stall cycles
Memory interconnect | errors | cpustat (CPC) for whatever is available
I/O interconnect | utilization | busstat (SPARC only); cpustat for tput / max if available; inference via known tput from iostat/nicstat/…
I/O interconnect | saturation | cpustat (CPC) for stall cycles
I/O interconnect | errors | cpustat (CPC) for whatever is available
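
One example from the table: the memory capacity saturation row suggests tracing vminfo:::anonpgin by execname. A minimal sketch of that one-liner (my phrasing; run it during the suspect workload and Ctrl-C to print the summary):

    # count anonymous page-ins by process name: a strong sign of memory pressure
    dtrace -n 'vminfo:::anonpgin { @[execname] = count(); }'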

  • CPU utilization: a single hot CPU can be caused by a single hot thread, or a mapped hardware interrupt. Relief of the bottleneck usually involves tuning to use more CPUs in parallel.
  • lockstat and plockstat are DTrace-based since Solaris 10 FCS.
  • vmstat “r”: this is coarse as it is only updated once per second.
  • CPC == CPU Performance Counters (aka “Performance Instrumentation Counters” (PICs), or “Performance Monitoring Events”), read via programmable registers on each CPU, by cpustat(1M) or the DTrace “cpc” provider. These have traditionally been hard to work with due to differences between CPUs, but are getting much easier with the PAPI standard. Still, expect to spend some quality time (days) with the processor vendor manuals (what “cpustat -h” tells you to read), and to post-process cpustat with awk or perl (a CPI example follows this list). See my short talk (video) about CPC (2010). (Many years ago, I made a toolkit including CPC scripts – CacheKit – that was too much work to maintain.)
  • Memory capacity utilization: interpreting vmstat’s “free” has been tricky across different Solaris versions (we documented it in the Perf & Tools book), due to different ways it was calculated, and tunables that affect when the system will kick-off the page scanner. It’ll also typically shrink as the kernel uses unused memory for caching (ZFS ARC).
  • Be aware that kstat can report bad data (so can any tool); there isn’t really a test suite for kstat data, and engineers can add new code paths and forget to add the counters.
  • DTT == DTraceToolkit scripts, DTB == DTrace book scripts.
  • CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
  • I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).
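
As an example of the cpustat post-processing mentioned in the CPC note above, the following sketch prints CPI per CPU per interval. It assumes your processor exposes the PAPI generic events PAPI_tot_cyc and PAPI_tot_ins (run cpustat -h to see what yours actually supports) and the usual time/cpu/event/pic0/pic1 output columns:

    # CPI per CPU per interval: cycles (pic0) divided by instructions (pic1)
    cpustat -c PAPI_tot_cyc,PAPI_tot_ins 1 | awk '
        $3 == "tick" && $5 > 0 { printf("%8.3f cpu%-3d CPI %.2f\n", $1, $2, $4 / $5) }'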

Software Resources

component | type | metric
Kernel mutex | utilization | lockstat -H (held time); DTrace lockstat provider
Kernel mutex | saturation | lockstat -C (contention); DTrace lockstat provider; spinning shows up with dtrace -n 'profile-997 { @[stack()] = count(); }'
Kernel mutex | errors | lockstat -E, eg, recursive mutex enter (other errors can cause kernel lockup/panic, debug with mdb -k)
User mutex | utilization | plockstat -H (held time); DTrace plockstat provider
User mutex | saturation | plockstat -C (contention); prstat -mLc 1, "LCK"; DTrace plockstat provider
User mutex | errors | DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... see pthread_mutex_lock(3C)
Process capacity | utilization | sar -v, “proc-sz”; kstat, “unix:0:var:v_proc” for max, “unix:0:system_misc:nproc” for current; DTrace (`nproc vs `max_nprocs)
Process capacity | saturation | not sure this makes sense; you might get queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full
Process capacity | errors | “can’t fork()” messages
Thread capacity | utilization | user-level: kstat, “unix:0:lwp_cache:buf_inuse” for current, prctl -n zone.max-lwps -i zone ZONE for max; kernel: mdb -k or DTrace, “nthread” for current, limited by memory
Thread capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (vmstat “sr”), else examine using DTrace/mdb.
Thread capacity | errors | user-level: pthread_create() failures with EAGAIN, EINVAL, …; kernel: thread_create() blocks for memory but won’t fail.
File descriptors | utilization | system-wide (no limit other than RAM); per-process: pfiles vs ulimit or prctl -t basic -n process.max-file-descriptor PID; a quicker check than pfiles is ls /proc/PID/fd | wc -l
File descriptors | saturation | does this make sense? I don’t think there is any queueing or blocking, other than on memory allocation.
File descriptors | errors | truss or DTrace (better) to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), …); see the sketch after this table
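
For the file descriptors errors row above, here is a DTrace sketch (my example; only open() variants are shown, add accept() and other fd-returning syscalls as needed) that counts processes hitting EMFILE:

    # count EMFILE errors from open()/open64() by process name
    dtrace -n 'syscall::open*:return /errno == EMFILE/ { @[execname] = count(); }'
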
  • lockstat/plockstat often drop events due to load; I often roll my own to avoid this using the DTrace lockstat/plockstat provider (examples in the DTrace book; a sketch follows this list).
  • File descriptor utilization: while other OSes have a system-wide limit, Solaris doesn’t (at least at the moment, this could change; see my writeup about it).
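
As an example of rolling your own with the lockstat provider (mentioned above), this sketch sums adaptive-mutex block time by kernel stack; it is illustrative rather than a replacement for lockstat(1M):

    # total adaptive-mutex block time (ns) by kernel stack; Ctrl-C to print
    dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'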

What’s Next

See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.

