
The USE Method: Solaris Performance Checklist

The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.

In this post, I’ll provide an example of a USE-based metric list for the Solaris operating system (I’m writing this for later Solaris 10 or Oracle Solaris 11 systems; I’ll do illumos/SmartOS separately, later). This is primarily intended for system administrators of the physical systems.
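
Most of the metrics below are one-liners, so a first pass can be scripted. Here is a minimal sketch of such a wrapper (the script name and the particular commands chosen are my own illustration, not part of the checklist); it covers only a few of the system-wide checks, leaving per-process and DTrace follow-up for anything that looks suspicious:

    #!/usr/bin/sh
    # use-firstpass.sh: hypothetical quick first pass over a few of the
    # system-wide USE checks below; adjust commands and intervals to taste
    echo "== CPU saturation: load averages =="
    uptime
    echo "== CPU and memory utilization/saturation (vmstat) =="
    vmstat 1 5
    echo "== Disk I/O utilization/saturation (iostat) =="
    iostat -xnz 1 5
    echo "== Network interface error counters (netstat) =="
    netstat -i
    echo "== Hardware faults (FMA) =="
    fmadm faulty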

Physical Resources

component | type | metric
CPU | utilization | per-cpu: mpstat 1, “idl”; system-wide: vmstat 1, “id”; per-process: prstat -c 1 (“CPU” == recent), prstat -mLc 1 (“USR” + “SYS”); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
CPU | saturation | system-wide: uptime, load averages; vmstat 1, “r”; DTrace dispqlen.d (DTT) for a better “vmstat r”; per-process: prstat -mLc 1, “LAT”
CPU | errors | fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling)
Memory capacity | utilization | system-wide: vmstat 1, “free” (main memory), “swap” (virtual memory); per-process: prstat -c, “RSS” (main memory), “SIZE” (virtual memory)
Memory capacity | saturation | system-wide: vmstat 1, “sr” (bad now), “w” (was very bad); vmstat -p 1, “api” (anon page-ins == pain), “apo”; per-process: prstat -mLc 1, “DFL”; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname (see the one-liner after this table)
Memory capacity | errors | fmadm faulty and prtdiag for physical failures; fmstat -s -m cpumem-retire (ECC events); DTrace failed malloc()s
Network Interfaces | utilization | nicstat (latest version here); kstat; dladm show-link -s -i 1 interface
Network Interfaces | saturation | nicstat; kstat for whatever custom statistics are available (eg, “nocanputs”, “defer”, “norcvbuf”, “noxmtbuf”); netstat -s, retransmits
Network Interfaces | errors | netstat -i, error counters; dladm show-phys; kstat for extended errors, look in the interface and “link” statistics (there are often custom counters for the card)
Storage device I/O | utilization | system-wide: iostat -xnz 1, “%b”; per-process: DTrace iotop
Storage device I/O | saturation | iostat -xnz 1, “wait”; DTrace iopending (DTT), sdqueue.d (DTB)
Storage device I/O | errors | iostat -En; DTrace I/O subsystem, eg, ideerr.d (DTB), satareasons.d (DTB), scsireasons.d (DTB), sdretry.d (DTB)
Storage capacity | utilization | swap: swap -s; file systems: “df -h”; plus other commands depending on FS type
Storage capacity | saturation | not sure this one makes sense – once it’s full, ENOSPC
Storage capacity | errors | DTrace; /var/adm/messages file system full messages
Storage controller | utilization | iostat -Cxnz 1, compare to known IOPS/tput limits per-card
Storage controller | saturation | look for kernel queueing: sd (iostat “wait” again), ZFS zio pipeline
Storage controller | errors | DTrace the driver, eg, mptevents.d (DTB); /var/adm/messages
Network controller | utilization | infer from nicstat and known controller max tput
Network controller | saturation | see network interface saturation
Network controller | errors | kstat for whatever is there / DTrace
CPU interconnect | utilization | cpustat (CPC) for CPU interconnect ports, tput / max (eg, see the amd64htcpu script)
CPU interconnect | saturation | cpustat (CPC) for stall cycles
CPU interconnect | errors | cpustat (CPC) for whatever is available
Memory interconnect | utilization | cpustat (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters
Memory interconnect | saturation | cpustat (CPC) for stall cycles
Memory interconnect | errors | cpustat (CPC) for whatever is available
I/O interconnect | utilization | busstat (SPARC only); cpustat for tput / max if available; inference via known tput from iostat/nicstat/…
I/O interconnect | saturation | cpustat (CPC) for stall cycles
I/O interconnect | errors | cpustat (CPC) for whatever is available
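
One example from the table: the memory capacity saturation row suggests tracing vminfo:::anonpgin by execname. A minimal sketch of that one-liner (my phrasing; run it during the suspect workload and Ctrl-C to print the summary):

    # count anonymous page-ins by process name: a strong sign of memory pressure
    dtrace -n 'vminfo:::anonpgin { @[execname] = count(); }'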

  • CPU utilization: a single hot CPU can be caused by a single hot thread, or a mapped hardware interrupt. Relief of the bottleneck usually involves tuning to use more CPUs in parallel.
  • lockstat and plockstat are DTrace-based since Solaris 10 FCS.
  • vmstat “r”: this is coarse as it is only updated once per second.
  • CPC == CPU Performance Counters (aka “Performance Instrumentation Counters” (PICs), or “Performance Monitoring Events”), read via programmable registers on each CPU, by cpustat(1M) or the DTrace “cpc” provider. These have traditionally been hard to work with due to differences between CPUs, but are getting much easier with the PAPI standard. Still, expect to spend some quality time (days) with the processor vendor manuals (what “cpustat -h” tells you to read), and to post-process cpustat with awk or perl (a CPI example follows this list). See my short talk (video) about CPC (2010). (Many years ago, I made a toolkit including CPC scripts – CacheKit – that was too much work to maintain.)
  • Memory capacity utilization: interpreting vmstat’s “free” has been tricky across different Solaris versions (we documented it in the Perf & Tools book), due to different ways it was calculated, and tunables that affect when the system will kick-off the page scanner. It’ll also typically shrink as the kernel uses unused memory for caching (ZFS ARC).
  • Be aware that kstat can report bad data (so can any tool); there isn’t really a test suite for kstat data, and engineers can add new code paths and forget to add the counters.
  • DTT == DTraceToolkit scripts, DTB == DTrace book scripts.
  • CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
  • I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).
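
As an example of the cpustat post-processing mentioned in the CPC note above, the following sketch prints CPI per CPU per interval. It assumes your processor exposes the PAPI generic events PAPI_tot_cyc and PAPI_tot_ins (run cpustat -h to see what yours actually supports) and the usual time/cpu/event/pic0/pic1 output columns:

    # CPI per CPU per interval: cycles (pic0) divided by instructions (pic1)
    cpustat -c PAPI_tot_cyc,PAPI_tot_ins 1 | awk '
        $3 == "tick" && $5 > 0 { printf("%8.3f cpu%-3d CPI %.2f\n", $1, $2, $4 / $5) }'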

Software Resources

component | type | metric
Kernel mutex | utilization | lockstat -H (held time); DTrace lockstat provider
Kernel mutex | saturation | lockstat -C (contention); DTrace lockstat provider; spinning shows up with dtrace -n 'profile-997 { @[stack()] = count(); }'
Kernel mutex | errors | lockstat -E, eg, recursive mutex enter (other errors can cause kernel lockup/panic, debug with mdb -k)
User mutex | utilization | plockstat -H (held time); DTrace plockstat provider
User mutex | saturation | plockstat -C (contention); prstat -mLc 1, "LCK"; DTrace plockstat provider
User mutex | errors | DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... see pthread_mutex_lock(3C)
Process capacity | utilization | sar -v, “proc-sz”; kstat, “unix:0:var:v_proc” for max, “unix:0:system_misc:nproc” for current; DTrace (`nproc vs `max_nprocs)
Process capacity | saturation | not sure this makes sense; you might get queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full
Process capacity | errors | “can’t fork()” messages
Thread capacity | utilization | user-level: kstat, “unix:0:lwp_cache:buf_inuse” for current, prctl -n zone.max-lwps -i zone ZONE for max; kernel: mdb -k or DTrace, “nthread” for current, limited by memory
Thread capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (vmstat “sr”), else examine using DTrace/mdb.
Thread capacity | errors | user-level: pthread_create() failures with EAGAIN, EINVAL, …; kernel: thread_create() blocks for memory but won’t fail.
File descriptors | utilization | system-wide (no limit other than RAM); per-process: pfiles vs ulimit or prctl -t basic -n process.max-file-descriptor PID; a quicker check than pfiles is ls /proc/PID/fd | wc -l
File descriptors | saturation | does this make sense? I don’t think there is any queueing or blocking, other than on memory allocation.
File descriptors | errors | truss or DTrace (better) to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), …); see the sketch after this table
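
For the file descriptors errors row above, here is a DTrace sketch (my example; only open() variants are shown, add accept() and other fd-returning syscalls as needed) that counts processes hitting EMFILE:

    # count EMFILE errors from open()/open64() by process name
    dtrace -n 'syscall::open*:return /errno == EMFILE/ { @[execname] = count(); }'
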
  • lockstat/plockstat often drop events due to load; I often roll my own to avoid this using the DTrace lockstat/plockstat provider (examples in the DTrace book; a sketch follows this list).
  • File descriptor utilization: while other OSes have a system-wide limit, Solaris doesn’t (at least at the moment, this could change; see my writeup about it).
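
As an example of rolling your own with the lockstat provider (mentioned above), this sketch sums adaptive-mutex block time by kernel stack; it is illustrative rather than a replacement for lockstat(1M):

    # total adaptive-mutex block time (ns) by kernel stack; Ctrl-C to print
    dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'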

What’s Next

See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.

