SarCheck(TM): Automated Analysis of AIX sar and ps data

(English text version 6.00.02)


This is an analysis of the data contained in the file octsar1. The data was collected from 10/06/2003 to 10/10/2003, from the RS/6000 IBM Model 7042/7043 (ED) system 'localhost'. There were 255 data records collected over 5 days used to produce this analysis. The operating system used to produce the sar report was Release 4.3 of AIX. The operating system as reported by /usr/bin/oslevel is AIX Release 4.3.3.0. The sysconf subroutine reports that 1 processor is configured and 1 processor is online. 64 megabytes of memory are present.

Data collected by the ps -elf command during 5 days between 10/06/2003 and 10/10/2003 will also be analyzed. This program will attempt to match the starting and ending times of the ps -elf data with those of the sar report file named octsar1.

SUMMARY

When the data was collected, no CPU bottleneck could be detected. A memory bottleneck was seen. No significant I/O bottleneck was seen.

At least one possible memory leak has been detected. See the Resource Analysis section for details.

NOTE: The file /opt/sarcheck/etc/sarcheck_parms was seen but no changes have been made to the thresholds used by SarCheck's rules and algorithms. This does not indicate a problem and the file is probably being used to control SarCheck's menu defaults.

RECOMMENDATIONS SECTION

All recommendations contained in this report are based solely on the conditions which were present when the performance data was collected. It is possible that conditions which were not present at that time may cause some of these recommendations to result in worse performance. To minimize this risk, analyze data from several different days, implement only regularly occurring recommendations, and implement them one at a time.

Additional memory may improve performance. If possible, borrow some memory for test purposes, and monitor system performance and resource utilization before and after its installation.

A CPU upgrade is not recommended because the current CPU had significant unused capacity.

No disk recommendations have been made because no bottleneck was seen.

RESOURCE ANALYSIS SECTION

Average CPU utilization (%usr + %sys) was only 0.6 percent. This indicates that spare CPU capacity exists. If any performance problems were seen during the entire monitoring period, they were not caused by a lack of CPU power. CPU utilization peaked at 23 percent during multiple time intervals.

The CPU was waiting for I/O (%wio) an average of 0.2 percent of the time. This statistic does not indicate the presence of an I/O bottleneck. The time that the system was waiting for I/O peaked at 11 percent during multiple time intervals.

Graph of CPU utilization

The CPU was idle (neither busy nor waiting for I/O) and had nothing to do an average of 99.2 percent of the time. If overall performance was good, this means that on average, the CPU was lightly loaded. If performance was generally unacceptable, the bottleneck may have been caused by remote file I/O which cannot be directly measured with sar and therefore cannot be considered by SarCheck.
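The CPU figures quoted above are related by simple arithmetic, which the following minimal sketch illustrates. The function names are hypothetical and are not part of SarCheck; the input values are the averages reported in the statistics table at the end of this report (%usr 0.5, %sys 0.1, %wio 0.2).

```python
def cpu_utilization(pct_usr: float, pct_sys: float) -> float:
    """Combined CPU utilization, defined in this report as %usr + %sys."""
    return pct_usr + pct_sys


def idle_percent(pct_usr: float, pct_sys: float, pct_wio: float) -> float:
    """Idle time: neither busy (%usr, %sys) nor waiting for I/O (%wio)."""
    return 100.0 - pct_usr - pct_sys - pct_wio


# Averages taken from the statistics table of this report.
avg_util = cpu_utilization(0.5, 0.1)
avg_idle = idle_percent(0.5, 0.1, 0.2)

print(f"average CPU utilization: {avg_util:.1f}%")
print(f"average idle time:       {avg_idle:.1f}%")
```

The results agree with the 0.6 percent utilization and 99.2 percent idle time stated in the text.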

The run queue had an average length of 1.1 which indicates that processes were generally not bound by latent demand for CPU resources. Average run queue length (when occupied) peaked at 4.0 from 11:50:00 to 12:00:01, on 10/07/2003. During that interval, the queue was occupied 0 percent of the time. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak CPU queuing, then the CPU may be a performance bottleneck. The following graph shows both run queue length and occupancy. The occupancy is shown as %runocc/100, where a run queue occupied 100 percent of the time would be shown as a vertical line reaching a height of 1.0.

Graph of run queue length

Modest buffer cache activity was seen in the sar -b data. This indicates that some process is using raw block or raw character devices and a small amount of activity is not unusual.

The average context switch rate (cswch/s) was 32.75 per second. The context switch rate (cswch/s) peaked at 195.0 per second from 15:10:01 to 15:20:01, on 10/06/2003. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak context switching, then a problem may be that too many processes were blocked for I/O or IPC.

The following statistics indicate that the system could benefit from additional memory and that noticeable performance degradation was likely.

There was no indication of swapped out processes in the ps -elf data. Processes which have been swapped out are usually found only on systems that have a very severe memory shortage.

The average number of page replacement cycles per second (cycle/s) was 0.00. If values greater than zero had been seen, a memory shortage might exist. Data from the cycle/s column does not indicate a lack of physical memory. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak replacement cycle activity, then a shortage of physical memory may be a performance bottleneck.

The average number of kernel threads waiting to be paged in (swpq-sz) was 1.31. The average number of kernel threads waiting to be paged in (swpq-sz) peaked at 2.0 during multiple time intervals. A more useful statistic is sometimes available by multiplying the swpq-sz data by the percent of time the queue was occupied. In this case, the average was 0.0009 and the peak was 0.05 from 07:00:00 to 07:10:00, on 10/06/2003. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst when the number of kernel threads waiting to be paged in was at its peak, then a shortage of physical memory may be a performance bottleneck.
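The occupancy weighting described above can be sketched as follows. This is an illustration only: the function name is hypothetical, and the 2.5 percent occupancy is an assumed value chosen for the example, since the per-interval %swpocc figures are not listed in this report.

```python
def weighted_queue_length(queue_size: float, pct_occupied: float) -> float:
    """Scale a 'when occupied' queue length by the fraction of time
    the queue was actually occupied (occupancy given in percent)."""
    return queue_size * pct_occupied / 100.0


# Assumed interval: swpq-sz of 2.0 with the queue occupied 2.5% of the time.
print(weighted_queue_length(2.0, 2.5))

# A queue that was never occupied contributes nothing, whatever its length.
print(weighted_queue_length(4.0, 0.0))
```

The same weighting explains why a run queue length of 4.0 in an interval with 0 percent occupancy does not indicate CPU queuing.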

The following graph shows any significant statistics relating to page replacement cycle rate, number of kernel threads waiting to be paged in, and number of swapped processes.

Graph of page replacement cycle rate and swap queue size

The average page out rate to the paging spaces was 0.28 per second. The paging space page out rate peaked at 13.25 from 12:18:34 to 12:19:49, on 10/10/2003. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst when the paging space page out rate was at its peak, then a shortage of physical memory may be a performance bottleneck. The following graph shows the rate of paging operations to the paging spaces.

Graph of paging space page in and page out rate

The current setting for maxpin (vmtune -M) leaves 12.57 megabytes of memory unpinnable. No recommendation made because no problem was seen.

No I/O bottleneck was seen in the sar statistics, therefore no changes are recommended for maxpgahead (vmtune -R).

The value of numclust (vmtune -c) is 1. If fast disk devices, disk arrays, or striped logical volumes are in use, the performance of disk writes could be improved by increasing this value. SarCheck does not have access to enough information about the system's disk devices to make any specific recommendation for tuning numclust.

The average rate of System V semaphore calls (sema/s) was 0.3 per second. System V semaphore activity (sema/s) peaked at a rate of 23.75 per second from 14:50:00 to 15:00:00, on 10/07/2003. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak semaphore activity, then that activity may be a performance bottleneck and application or database activity related to semaphore usage should be looked at more closely. No problems have been seen, and no changes have been recommended for System V semaphore parameters. Note that SarCheck only checks these parameters' relationships to each other since semaphore usage data is not available.

The average rate of System V message calls (msg/s) was 0.020 per second. No problems have been seen, and no changes have been recommended for System V message parameters. Note that SarCheck only checks these parameters' relationships to each other since message usage data is not available.

There were no times when enforcement of the process threshold limit (kproc-ov) prevented the creation of kernel processes. This indicates that no problems were seen in this area.

The ratio of exec to fork system calls was 0.89. This indicates that PATH variables are efficient.
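The exec to fork ratio can be derived from the exec/s and fork/s columns of the sar -c data. The sketch below is a minimal illustration; the function name and the two rates are hypothetical, chosen only so that the result matches the 0.89 reported above.

```python
from typing import Optional


def exec_fork_ratio(exec_per_sec: float, fork_per_sec: float) -> Optional[float]:
    """Ratio of exec to fork system calls, or None when no forks were seen."""
    if fork_per_sec == 0:
        return None  # ratio is undefined without any fork activity
    return exec_per_sec / fork_per_sec


# Hypothetical average rates (not taken from the report).
ratio = exec_fork_ratio(exec_per_sec=0.89, fork_per_sec=1.0)
print(f"exec/fork ratio: {ratio:.2f}")
```

A ratio well above 1 would suggest that many exec attempts are made per new process, which is commonly associated with long or poorly ordered PATH searches.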

The average system-wide local I/O rate as measured by the r+w/s column in the sar -d data was 0.38 per second. This I/O rate peaked at 26 per second from 10:00:00 to 10:10:00, on 10/10/2003.

Graph of Total Disk I/O rate

The following graph shows the average percent busy and service time for 2 disks. The lack of disk activity can be clearly seen in the %busy statistics for the disks.

Graph of up to 5 disks, not sorted by percent busy or service time

The -dtoo switch has been used to format disk statistics into the following table.

Disk Device Statistics

Disk Device   Average Percent Busy   Peak Percent Busy   Queue Depth when occupied   Average Service Time
hdisk0        0.11                   10.0                0.7                         0.0
cd0           0.00                   0.0                 0.0                         0.0

The disk device hdisk0 was busy an average of 0.11 percent of the time and had an average queue length of 0.7 (when occupied). This indicates that the device is not a performance bottleneck.

The disk device cd0 was busy an average of 0.00 percent of the time and had an average queue length of 0.0 (when occupied). This indicates that the device is not a performance bottleneck.

At multiple peak times on 10/10/2003 ps -elf data indicated that there were 53 processes present. This was the largest number of processes seen with ps -elf but it is not likely to be the absolute peak because the operating system does not store the true "high-water mark" for this statistic.

Graph of the number of processes present

The -ptbl switch has been used to format ps -elf data into the following table.

Interesting ps -elf data

Command                                      User  Process ID  Percent CPU  Average PRI  NI  Memory Growth    Memory Use
/usr/netscape/communicator/us/netscape_aix4  drw   14522       0.83         60.00        20  25884.0 kb/hr    26616 kb
/usr/lpp/X11/bin/X                           root  2154        0.37         60.43        20  839.0 kb/hr      6768 kb
/usr/netscape/communicator/us/netscape_aix4  drw   13482       0.09         60.00        20  14376.0 kb/hr    24676 kb
/usr/dt/bin/dtterm                           drw   11624       0.17         60.02        20  215.9 kb/hr      1184 kb
/usr/netscape/communicator/us/netscape_aix4  drw   16476       0.37         60.00        20  203018.2 kb/hr   24868 kb
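A memory growth rate in kb/hr can be estimated from successive ps samples of a process's size. The following is a minimal sketch assuming a simple first-to-last difference; the method SarCheck actually uses is not described in this report, and the function name and sample values are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple


def growth_kb_per_hour(samples: List[Tuple[datetime, float]]) -> float:
    """Estimate memory growth from (timestamp, size_kb) samples ordered by time,
    using the difference between the first and last sample."""
    (t_first, kb_first), (t_last, kb_last) = samples[0], samples[-1]
    hours = (t_last - t_first).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("samples must span a positive amount of time")
    return (kb_last - kb_first) / hours


# Hypothetical samples for one process (not taken from the report).
samples = [
    (datetime(2003, 10, 10, 12, 0), 20000.0),
    (datetime(2003, 10, 10, 12, 15), 22000.0),
    (datetime(2003, 10, 10, 12, 30), 24000.0),
]
rate = growth_kb_per_hour(samples)
print(f"{rate:.1f} kb/hr")
```

Because the rate is extrapolated to a full hour, a process observed over a short window can show an hourly growth figure larger than its total memory use.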

CAPACITY PLANNING SECTION

This section is designed to provide the user with a rudimentary linear capacity planning model and should be used for rough approximations only. These estimates assume that an increase in workload will affect the usage of all resources equally. These estimates should be used on days when the load is heaviest to determine approximately how much spare capacity remains at peak times.

WARNING: Data in this section may be inaccurate because the length of the average sampling interval was only 10.00 minutes. When the interval is less than 10 minutes, peak statistics are likely to underestimate the remaining amount of CPU or disk capacity.

Based on the limited data available in these sar reports, the system should be able to support a limited increase in workload at peak times before the first resource bottleneck affects performance. See the following paragraphs for additional information.

Graph of remaining room for growth

The CPU can support an increase in workload of at least 100 percent at peak times. Since page outs and/or swapping were detected, an increase in workload should be accompanied by an increase in memory. The busiest disk can support a workload increase of at least 100 percent at peak times. For more information on peak resource utilization, refer to the Resource Analysis section of this report.
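One way to picture a rudimentary linear capacity model is to ask how much the workload could grow before a resource's peak utilization reaches 100 percent. The sketch below is written under that assumption only; it is not SarCheck's actual algorithm, and the function name is hypothetical. The peak values are the 23 percent CPU peak and 10.0 percent disk busy peak reported in the Resource Analysis section.

```python
def workload_headroom_percent(peak_utilization_pct: float) -> float:
    """Percent increase in workload that would bring a resource from its
    observed peak utilization to 100%, assuming usage scales linearly."""
    if peak_utilization_pct <= 0:
        return float("inf")  # no measurable load, so no linear limit
    return (100.0 - peak_utilization_pct) / peak_utilization_pct * 100.0


cpu_headroom = workload_headroom_percent(23.0)   # peak CPU utilization
disk_headroom = workload_headroom_percent(10.0)  # peak %busy for hdisk0

print(f"CPU headroom:  about {cpu_headroom:.0f}%")
print(f"disk headroom: about {disk_headroom:.0f}%")
```

Under this assumption both resources have well over 100 percent headroom, which is consistent with the "at least 100 percent" figures above. Memory is not modeled here, and the report identifies it as the limiting resource.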

Please note: In no event can Aptitune Corporation be held responsible for any damages, including incidental or consequent damages, in connection with or arising out of the use or inability to use this software. All trademarks belong to their respective owners. Evaluation copy for: Your Company. This software expires on 02/19/2004 (mm/dd/yyyy). Code version: 6.00.02. Serial number: 00061729.

Thank you for trying this evaluation copy of SarCheck. To order a licensed version of this software, just type 'analyze -o' at the prompt to produce the order form, and follow the instructions.

(c) copyright 1995-2004 by Aptitune Corporation, Plaistow NH, USA, All Rights Reserved. http://www.sarcheck.com

Statistics for system localhost

Statistic                           Value                     Start of peak interval  End of peak interval  Date of peak interval
System ID on sar report             000481674C00
System ID of this system            000481674C00
System model number is              IBM Model 7042/7043 (ED)
Statistics collected from           10/06/2003
Statistics collected until          10/10/2003
Average CPU utilization             0.6%
Peak CPU utilization                23%                       Multiple peaks          Multiple peaks
Average user CPU utilization        0.5%
Average sys CPU utilization         0.1%
Average waiting for I/O             0.2%
Average run queue length            1.1
Peak run queue length               4.0                       11:50:00                12:00:01              10/07/2003
Average run queue occupancy         1.4%
Average swap queue length           0.00
Peak swap queue length              0.048                     Multiple peaks          Multiple peaks
Peak page replacement cycle rate    0.00
Max paging space page outs          13.25                     12:18:34                12:19:49              10/10/2003
Max paging space page ins           5.80                      12:20:01                12:20:46              10/10/2003
Max swapped processes seen by ps    0
Max number of processes seen by ps  53                        Multiple peaks
Average context switch rate         32.75/sec
Number of kproc overflows seen      0
Disk device w/highest peak          hdisk0
Avg pct busy for that disk          0.1%
Peak pct busy for that disk         10.0%                     10:00:00                10:10:00              10/10/2003
Approx CPU capacity remaining       100%+
Approx I/O bandwidth remaining      100%+
Can memory support add'l load       Limited