SarCheck(TM): Automated Analysis of AIX sar and ps data
(English text version 6.00.02)

This is an analysis of the data contained in the file octsar1. The data was collected from 10/06/2003 to 10/10/2003, from the RS/6000 IBM Model 7042/7043 (ED) system 'localhost'. There were 255 data records collected over 5 days used to produce this analysis. The operating system used to produce the sar report was Release 4.3 of AIX. The operating system as reported by /usr/bin/oslevel is AIX Release 4.3.3.0. The sysconf subroutine reports that 1 processor is configured and 1 processor is online. 64 megabytes of memory are present.

Data collected by the ps -elf command during 5 days between 10/06/2003 and 10/10/2003 will also be analyzed. This program will attempt to match the starting and ending times of the ps -elf data with those of the sar report file named octsar1.

SUMMARY

When the data was collected, no CPU bottleneck could be detected. No significant I/O bottleneck was seen. A change to at least one tunable parameter has been recommended. No impending capacity limits were noted by SarCheck's capacity planning feature. At least one possible memory leak has been detected. See the Resource Analysis section for details.

NOTE: The file /opt/sarcheck/etc/sarcheck_parms was seen but no changes have been made to the thresholds used by SarCheck's rules and algorithms. This does not indicate a problem and the file is probably being used to control SarCheck's menu defaults.

RECOMMENDATIONS SECTION

NOTE: The following recommendations are being made in an environment where a large amount of spare capacity existed. If the sar data used to produce this report comes from a time when activity was uncharacteristically light, these recommendations may not help performance when the system is busy.

All recommendations contained in this report are based solely on the conditions which were present when the performance data was collected.
It is possible that conditions which were not present at that time may cause some of these recommendations to result in worse performance. To minimize this risk, analyze data from several different days, implement only regularly occurring recommendations, and implement them one at a time.

Change the value of the thrashing threshold from 6 to 0 with the command '/usr/samples/kernel/schedtune -h 0'. This change disables the virtual memory manager's ability to suspend processes. If it causes performance degradation, add memory.

Changes made with schedtune are immediate and will last until the next reboot. If these changes improve performance, put the command in /etc/inittab or one of the /etc/rc scripts that are run when the system is booted.

A CPU upgrade is not recommended because the current CPU had significant unused capacity.

No disk recommendations have been made because no bottleneck was seen.

RESOURCE ANALYSIS SECTION

Average CPU utilization (%usr + %sys) was only 0.6 percent. This indicates that spare CPU capacity exists. If any performance problems were seen during the entire monitoring period, they were not caused by a lack of CPU power. CPU utilization peaked at 23 percent during multiple time intervals.

The CPU was waiting for I/O (%wio) an average of 0.2 percent of the time. This statistic does not indicate the presence of an I/O bottleneck. The time that the system was waiting for I/O peaked at 11 percent during multiple time intervals.

The CPU was idle (neither busy nor waiting for I/O) and had nothing to do an average of 99.2 percent of the time. If overall performance was good, this means that on average, the CPU was lightly loaded. If performance was generally unacceptable, the bottleneck may have been caused by remote file I/O which cannot be directly measured with sar and therefore cannot be considered by SarCheck.

The run queue had an average length of 1.1 which indicates that processes were generally not bound by latent demand for CPU resources.
Average run queue length (when occupied) peaked at 4.0 from 11:50:00 to 12:00:01, on 10/07/2003. During that interval, the queue was occupied 0 percent of the time. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak CPU queuing, then a performance bottleneck may be the CPU. There was no run queue activity to graph.

Modest buffer cache activity was seen in the sar -b data. This indicates that some process is using raw block or raw character devices and a small amount of activity is not unusual.

The average context switch rate (cswch/s) was 32.75 per second. The context switch rate (cswch/s) peaked at 195.0 per second from 15:10:01 to 15:20:01, on 10/06/2003. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak context switching, then a problem may be that too many processes were blocked for I/O or IPC.

The following statistics indicate that there was a surplus of memory. This is the ideal situation for best performance.

There was no indication of swapped out processes in the ps -elf data. This is to be expected on a system with a memory surplus.

The average number of page replacement cycles per second (cycle/s) was 0.00. If values greater than zero had been seen, a memory shortage might exist. Data from the cycle/s column does not indicate a lack of physical memory. Page replacement cycles should not occur on a system with a memory surplus.

The average number of kernel threads waiting to be paged in (swpq-sz) was 1.31. The average number of kernel threads waiting to be paged in (swpq-sz) peaked at 2.0 during multiple time intervals. A lack of activity in this resource is consistent with a system that has a memory surplus.

A recommendation has been made to disable thrashing control with schedtune -h.
More information about this recommendation can be found on page 74 of Rudy Chukran's "Accelerating AIX". If this change causes performance to get worse, more memory should be added to the system because the current 64 megabytes is not enough.

The current setting for maxpin (vmtune -M) leaves 12.57 megabytes of memory unpinnable. No recommendation made because no problem was seen.

No I/O bottleneck was seen in the sar statistics, therefore no changes are recommended for maxpgahead (vmtune -R).

The value of numclust (vmtune -c) is 1. If fast disk devices, disk arrays, or striped logical volumes are in use, the performance of disk writes could be improved by increasing this value. SarCheck does not have access to enough information about the system's disk devices to make any specific recommendation for tuning numclust.

The average rate of System V semaphore calls (sema/s) was 0.3 per second. System V semaphore activity (sema/s) peaked at a rate of 23.75 per second from 14:50:00 to 15:00:00, on 10/07/2003. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak semaphore activity, then that activity may be a performance bottleneck and application or database activity related to semaphore usage should be looked at more closely.

No problems have been seen, and no changes have been recommended for System V semaphore parameters. Note that SarCheck only checks these parameters' relationships to each other since semaphore usage data is not available.

The average rate of System V message calls (msg/s) was 0.020 per second. No problems have been seen, and no changes have been recommended for System V message parameters. Note that SarCheck only checks these parameters' relationships to each other since message usage data is not available.

There were no times when enforcement of the process threshold limit (kproc-ov) prevented the creation of kernel processes.
This indicates that no problems were seen in this area.

The ratio of exec to fork system calls was 0.89. This indicates that PATH variables are efficient.

The average system-wide local I/O rate as measured by the r+w/s column in the sar -d data was 0.38 per second. This I/O rate peaked at 26 per second from 10:00:00 to 10:10:00, on 10/10/2003.

The disk device hdisk0 was busy an average of 0.11 percent of the time and had an average queue length of 0.7 (when occupied). This indicates that the device is not a performance bottleneck.

The disk device cd0 was busy an average of 0.00 percent of the time and had an average queue length of 0.0 (when occupied). This indicates that the device is not a performance bottleneck.

At multiple peak times on 10/10/2003 ps -elf data indicated that there were 53 processes present. This was the largest number of processes seen with ps -elf but it is not likely to be the absolute peak because the operating system does not store the true "high-water mark" for this statistic.

A possible memory leak was seen in /usr/netscape/communicator/us/netscape_aix4, owned by drw, pid 14522. Between 11:00:00 and 11:20:00 on 10/09/2003, this process grew from 17988 to 26616 kb. Memory usage grew at an average rate of 25884.0 kb/hr during that interval.

A possible memory leak was seen in /usr/lpp/X11/bin/X, owned by root, pid 2154. Between 10:54:57 and 12:20:46 on 10/10/2003, this process grew from 5568 to 6768 kb. Memory usage grew at an average rate of 839.0 kb/hr during that interval.

A possible memory leak was seen in /usr/netscape/communicator/us/netscape_aix4, owned by drw, pid 13482. Between 10:30:00 and 11:00:00 on 10/07/2003, this process grew from 17488 to 24676 kb. Memory usage grew at an average rate of 14376.0 kb/hr during that interval.

A possible memory leak was seen in /usr/dt/bin/dtterm, owned by drw, pid 11624. Between 11:10:00 and 11:40:01 on 10/09/2003, this process grew from 1076 to 1184 kb.
Memory usage grew at an average rate of 215.9 kb/hr during that interval.

A possible memory leak was seen in /usr/netscape/communicator/us/netscape_aix4, owned by drw, pid 16476. Between 12:18:34 and 12:20:46 on 10/10/2003, this process grew from 17424 to 24868 kb. Memory usage grew at an average rate of 203018.2 kb/hr during that interval.

CAPACITY PLANNING SECTION

This section is designed to provide the user with a rudimentary linear capacity planning model and should be used for rough approximations only. These estimates assume that an increase in workload will affect the usage of all resources equally. These estimates should be used on days when the load is heaviest to determine approximately how much spare capacity remains at peak times.

WARNING: Data in this section may be inaccurate because the length of the average sampling interval was only 10.00 minutes. When the interval is less than 10 minutes, peak statistics are likely to underestimate the remaining amount of CPU or disk capacity.

Based on the limited data available in these sar reports, the system should be able to support a substantial increase in workload before impending CPU, memory, disk, or system table bottlenecks are seen. Run SarCheck regularly to detect bottlenecks before they impact performance.

Please note: In no event can Aptitune Corporation be held responsible for any damages, including incidental or consequent damages, in connection with or arising out of the use or inability to use this software. All trademarks belong to their respective owners.

Evaluation copy for: Your Company. This software expires on 02/19/2004 (mm/dd/yyyy). Code version: 6.00.02. Serial number: 00061729.

Thank you for trying this evaluation copy of SarCheck. To order a licensed version of this software, just type 'analyze -o' at the prompt to produce the order form, and follow the instructions.

(c) copyright 1995-2004 by Aptitune Corporation, Plaistow NH, USA, All Rights Reserved.
http://www.sarcheck.com

Statistics for system, localhost
System ID on sar report, 000481674C00
System ID of this system, 000481674C00
System model number is, IBM Model 7042/7043 (ED)
Statistics collected from, 10/06/2003
Statistics collected until, 10/10/2003
Average CPU utilization, 0.6%
Peak CPU utilization, 23%
Average user CPU utilization, 0.5%
Average sys CPU utilization, 0.1%
Average waiting for I/O, 0.2%
Average run queue length, 1.1
Peak run queue length, 4.0
Average run queue occupancy, 1.4%
Average swap queue length, 0.00
Peak swap queue length, 0.048
Peak page replacement cycle rate, 0.00
Max swapped processes seen by ps, 0
Max number of processes seen by ps, 53
Average context switch rate, 32.75/sec
Number of kproc overflows seen, 0
Disk device w/highest peak, hdisk0
Avg pct busy for that disk, 0.1%
Peak pct busy for that disk, 10.0%
Approx CPU capacity remaining, 100%+
Approx I/O bandwidth remaining, 100%+
Can memory support add'l load, Yes
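
The memory growth rates quoted for the possible memory leaks in the Resource Analysis section can be checked by hand. The sketch below is not SarCheck's own code; it assumes the rate is simply the change in process size divided by the elapsed time between the two ps -elf samples, scaled to one hour. The function name growth_rate_kb_per_hr and the layout of the samples list are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Timestamp layout used when combining the report's date and time fields.
TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def growth_rate_kb_per_hr(start, end, start_kb, end_kb):
    """Return average growth in kb/hr between two samples.

    Assumption: a simple linear rate, (size change) / (elapsed hours).
    """
    elapsed = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    elapsed_seconds = elapsed.total_seconds()
    if elapsed_seconds <= 0:
        raise ValueError("end time must be later than start time")
    return (end_kb - start_kb) * 3600.0 / elapsed_seconds


# (pid, start, end, start size in kb, end size in kb), copied from the report.
samples = [
    (14522, "10/09/2003 11:00:00", "10/09/2003 11:20:00", 17988, 26616),
    (2154, "10/10/2003 10:54:57", "10/10/2003 12:20:46", 5568, 6768),
    (13482, "10/07/2003 10:30:00", "10/07/2003 11:00:00", 17488, 24676),
    (11624, "10/09/2003 11:10:00", "10/09/2003 11:40:01", 1076, 1184),
    (16476, "10/10/2003 12:18:34", "10/10/2003 12:20:46", 17424, 24868),
]

for pid, start, end, start_kb, end_kb in samples:
    rate = growth_rate_kb_per_hr(start, end, start_kb, end_kb)
    print(f"pid {pid}: {rate:.1f} kb/hr")
```

Under this assumption the computed rates agree with the figures in the report. Note that the rate for pid 16476 is extrapolated from an interval of only 132 seconds, so the large hourly figure reflects a short burst of growth and not a sustained trend.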