This is an analysis of the data contained in the file /develop/cm/3daysar. The data was collected from 2007/04/02 to 2007/04/04, from the system 'gti6a003'. There were 795 data records collected over 3 days used to produce this analysis. The operating system used to produce the sar report was Release 5.2 of AIX. The system configuration data in the sar report indicated that 4.0 processors were configured. 4096 megabytes of memory were seen in the system configuration data.
The date format used in this report is yyyy/mm/dd. The date format was set in the sarcheck_parms file.
Data collected by the ps -elf command during 3 days between 2007/04/02 and 2007/04/04 will also be analyzed. This program will attempt to match the starting and ending times of the ps -elf data with those of the sar report file named 3daysar.
When the data was collected, no CPU bottleneck could be detected. A memory bottleneck was seen. No significant I/O bottleneck was seen. A change to at least one tunable parameter has been recommended.
At least one possible runaway process has been detected. A suspiciously large process has been detected. See the Resource Analysis section for details.
Some of the defaults used by SarCheck's rules have been overridden using the sarcheck_parms file. See the Custom Settings section of the report for more information.
All recommendations contained in this report are based on the conditions which were present when the performance data was collected. It is possible that conditions which were not present at that time may cause some of these recommendations to result in worse performance. To minimize this risk, analyze data from several different days, implement only regularly occurring recommendations, and implement them one at a time or as groups of related parameters.
Additional memory may improve performance. If possible, borrow some memory for test purposes, and monitor system performance and resource utilization before and after its installation.
NOTE: The following 3 vmo changes should be made all at once. These changes are based on information presented at the IBM System p Technical University.
Change the value of the lru_file_repage parameter from 1 to 0 with the command 'vmo -o lru_file_repage=0'. The -o flag changes the value of a parameter only until the next reboot. To make the change permanent, use the command 'vmo -p -o lru_file_repage=0'. The lru_file_repage parameter is used to change the algorithms used by the LRUD (page stealing daemon).
Change the value of the lru_poll_interval parameter from 0 to 10 with the command 'vmo -o lru_poll_interval=10'. The -o flag changes the value of a parameter only until the next reboot. To make the change permanent, use the command 'vmo -p -o lru_poll_interval=10'. The lru_poll_interval parameter is used to make the page stealing daemon more responsive.
Change the value of the minperm% parameter from 20 to 1 with the command 'vmo -o minperm%=1'. The -o flag changes the value of a parameter only until the next reboot. To make the change permanent, use the command 'vmo -p -o minperm%=1'.
This is the end of this set of vmo parameter changes that should be implemented together.
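Because the three vmo changes are meant to be applied together, it can help to generate the command lines in one place before running them. The following is a minimal sketch that only builds the command strings quoted above and executes nothing; the names `VMO_GROUP` and `vmo_commands` are illustrative and not part of SarCheck or AIX.

```python
# Parameter names and target values are copied from the recommendations above.
VMO_GROUP = {
    "lru_file_repage": 0,
    "lru_poll_interval": 10,
    "minperm%": 1,
}


def vmo_commands(params, permanent=False):
    """Build one vmo command string per parameter. Nothing is executed."""
    # '-o' lasts until the next reboot; '-p -o' makes the change permanent.
    flag = "-p -o" if permanent else "-o"
    return [f"vmo {flag} {name}={value}" for name, value in params.items()]


for cmd in vmo_commands(VMO_GROUP):
    print(cmd)
for cmd in vmo_commands(VMO_GROUP, permanent=True):
    print(cmd)
```

Reviewing the printed list before pasting it into a root shell keeps the group of changes together, as the note above requests.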
NOTE: The following changes to the maxfree and minfree parameters should be implemented at the same time.
Change the value of maxfree from 128 to 960 with the command 'vmo -o maxfree=960'. The -o flag changes the value of a parameter only until the next reboot. To make the change permanent, use the command 'vmo -p -o maxfree=960'. This change is recommended based on formulas discussed at IBM's pSeries Technical University. The recommended value for minfree is 480 and the value for maxpgahead was 8. The j2_maxPageReadAhead value used was 128. The value of lcpu reported by sar was 4.0. The number of memory pools seen was 1. The magnitude of this change has been limited to prevent the recommendation of very large changes. Changing this parameter in smaller increments is a much safer way to tune the system.
Change the value of minfree from 120 to 480 with the command 'vmo -o minfree=480'. The -o flag changes the value of a parameter only until the next reboot. To make the change permanent, use the command 'vmo -p -o minfree=480'. This change is recommended based on formulas discussed at IBM's pSeries Technical University. The following data was used in this calculation: The number of memory pools seen was 1. The value of lcpu reported by sar was 4.0. The number of active CPUs reported by sysconf is 1.
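The report lists the inputs to the minfree and maxfree calculations but not the formulas themselves. The sketch below assumes one commonly cited form (minfree = 120 x lcpu / memory pools, and maxfree = minfree + the larger read-ahead value x lcpu / memory pools); whether SarCheck uses exactly this form is an assumption, and the function names are illustrative.

```python
def assumed_minfree(lcpu, mempools):
    # Assumed form: 120 pages per logical CPU, divided across memory pools.
    return int(120 * lcpu / mempools)


def assumed_maxfree(minfree, maxpgahead, j2_max_read_ahead, lcpu, mempools):
    # Assumed form: minfree plus the larger read-ahead value per logical CPU.
    return int(minfree + max(maxpgahead, j2_max_read_ahead) * lcpu / mempools)


# Inputs quoted in the two recommendations above.
minfree = assumed_minfree(lcpu=4.0, mempools=1)
maxfree = assumed_maxfree(minfree, maxpgahead=8, j2_max_read_ahead=128,
                          lcpu=4.0, mempools=1)
print(minfree, maxfree)
```

Under these assumptions minfree comes out at the recommended 480, while the unlimited maxfree comes out somewhat above the recommended 960. That is consistent with the statement that the magnitude of the maxfree change was limited, although the report does not say how the limit is applied.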
Change the value of the j2_maxRandomWrite parameter from 0 to 128 with the command 'ioo -o j2_maxRandomWrite=128'. The -o flag changes the value of a parameter only until the next reboot. To make the change permanent, use the command 'ioo -p -o j2_maxRandomWrite=128'. This recommendation will reduce the number of pages of random writes that are allowed to collect in memory before they are flushed to disk by the write behind algorithm.
Change the scheduler's ratio of CPU penalty to recent CPU usage (the R value) from 16 to 5 with the command 'schedo -o sched_R=5'. The -o flag changes the value of a parameter only until the next reboot. To make the change permanent, use the command 'schedo -p -o sched_R=5'. This change should help the scheduler distinguish between background processes and those running as interactive foreground processes.
Fix the problems seen in the layout of the system's paging spaces. Exact recommendations require information about future plans, such as any physical volumes which may be added shortly or anticipated changes to system load and storage requirements. Here is the problem seen by this program:
Average CPU utilization (%usr + %sys) was only 46.0 percent. This indicates that spare CPU capacity exists. If any performance problems were seen during the entire monitoring period, they were not caused by a lack of CPU power. User CPU as measured by the %usr column in the sar -u data averaged 32.7 percent and system CPU (%sys) averaged 13.3 percent. The sys/usr ratio averaged 0.41 : 1. CPU utilization peaked at 73 percent from 15:45:01 to 15:50:00, on 2007/04/03. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak CPU utilization, then a performance bottleneck may be the CPU.
The CPU was waiting for I/O (%wio) an average of 3.3 percent of the time. This suggests that the system could have been occasionally I/O bound. The time that the system was waiting for I/O peaked at 58 percent from 02:40:24 to 02:45:00, on 2007/04/03. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period when the system was waiting for I/O, then a performance bottleneck may be caused by processes waiting for I/O.

The CPU was idle (neither busy nor waiting for I/O) and had nothing to do an average of 50.7 percent of the time. If overall performance was good, this means that on average, the CPU was lightly loaded. If performance was generally unacceptable, the bottleneck may have been caused by remote file I/O which cannot be directly measured with sar and therefore cannot be considered by SarCheck.
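The CPU figures quoted above are related by simple arithmetic, which can be checked as follows. This is only a sketch of how the averages fit together; the variable names are illustrative.

```python
# Averages quoted from the sar -u data above.
usr, sys_pct, wio = 32.7, 13.3, 3.3

busy = round(usr + sys_pct, 1)            # average CPU utilization
sys_usr_ratio = round(sys_pct / usr, 2)   # sys/usr ratio
idle = round(100.0 - busy - wio, 1)       # neither busy nor waiting for I/O

print(busy, sys_usr_ratio, idle)
```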
The run queue had an average length of 2.8 which indicates that processes were generally not bound by latent demand for CPU resources. Occasionally the average length of the run queue (when occupied) exceeded 4. The run queue was usually occupied, despite the lack of a significant run queue length. This condition is usually seen when the number of CPU-intensive processes is low. It is likely that the performance of these processes is closely related to CPU speed. Average run queue length (when occupied) peaked at 8.0 from 07:15:00 to 07:20:00, on 2007/04/04. During that interval, the queue was occupied 100 percent of the time. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak CPU queuing, then a performance bottleneck may be the CPU.
The following graph shows both the run queue length and occupancy. The occupancy is shown as %runocc/100, where a run queue occupied 100 percent of the time would be shown as a vertical line reaching a height of 1.0.

The minimum multiprogramming level (v_min_process in schedo) has been set to 2. This is a safe value for small configurations and may be low for larger configurations. This parameter is very dependent on workload and the correct value cannot be determined with sar and ps data. A memory shortage has been seen and a value which is too low may cause performance problems. More information can be found on the web by using your favorite search engine.
Reducing the scheduler's ratio of CPU penalty to recent CPU usage (the R value) from 16 to 5 should improve the performance of foreground processes if background jobs or long running non-interactive foreground jobs are using significant CPU resources. More information can be found on the web by using your favorite search engine.
The average rate of System V semaphore calls (sema/s) was 0.001 per second. No problems have been seen, and no changes have been recommended for System V semaphore parameters. Note that SarCheck only checks these parameters' relationships to each other since semaphore usage data is not available.
No System V message activity (msg/s) was seen. No problems have been seen, and no changes have been recommended for System V message parameters. Note that SarCheck only checks these parameters' relationships to each other since message usage data is not available.
There were no times when enforcement of the process threshold limit (kproc-ov) prevented the creation of kernel processes. This indicates that no problems were seen in this area.
The ratio of exec to fork system calls was 0.93. This indicates that PATH variables are efficient.
No buffer cache activity was seen in the sar -b data. This is normal for AIX systems, which typically do not use the traditional buffer cache.
There was no indication of swapped out processes in the ps -elf data. Processes which have been swapped out are usually found only on systems that have a very severe memory shortage.
The average number of page replacement cycles per second calculated from the vmstat -s data was 0.0006. The number of page replacement cycles per second (from vmstat -s) peaked at 0.0067 from 22:00:00 to 22:20:00, on 2007/04/03. This means that the page stealer was scanning memory at a rate of roughly 27 MB/sec during the peak. We are collecting data on this statistic and have not yet been able to quantify when this value is high enough to indicate a problem. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak replacement cycle activity, then a shortage of physical memory may be a performance bottleneck.
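The scan rate quoted above follows from the replacement cycle rate if one cycle is taken to mean one pass over all of physical memory. That interpretation is an assumption; the report does not state how the figure was derived.

```python
# Assumption: one page replacement cycle scans all of physical memory once.
physical_memory_mb = 4096       # from the system configuration data
peak_cycles_per_sec = 0.0067    # peak rate from the vmstat -s data

scan_rate_mb_per_sec = peak_cycles_per_sec * physical_memory_mb
print(round(scan_rate_mb_per_sec))
```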
The average number of kernel threads waiting to be paged in (swpq-sz) was 1.64. The average number of kernel threads waiting to be paged in (swpq-sz) peaked at 7.6 from 02:40:24 to 02:45:00, on 2007/04/03. When the peak was reached, the swap queue was occupied 91 percent of the time. A more useful statistic is sometimes available by multiplying the swpq-sz data by the percent of time the queue was occupied. In this case, the average was 0.17 and the peak was 6.92 from 02:40:24 to 02:45:00, on 2007/04/03. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst when the number of kernel threads waiting to be paged in was at its peak, then a shortage of physical memory may be a performance bottleneck.
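The occupancy-weighted peak described above can be reproduced from the two peak figures. This is a sketch only; the weighted average cannot be checked the same way because the average occupancy of the swap queue is not quoted in the text.

```python
# Peak interval figures quoted above.
peak_swpq_sz = 7.6        # queue length when occupied
peak_occupancy_pct = 91   # percent of time the swap queue was occupied

weighted_peak = peak_swpq_sz * peak_occupancy_pct / 100
print(round(weighted_peak, 2))
```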
The following graph shows any significant statistics relating to page replacement cycle rate, number of kernel threads waiting to be paged in, and number of swapped processes. The page cycle replacement rate has been calculated using the "revolutions of the clock hand" field reported by vmstat -s.

The average page out rate to the paging spaces was 4.12 per second. The paging space page out rate peaked at 81.75 from 18:20:00 to 18:40:01, on 2007/04/04. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst when the paging space page out rate was at its peak, then a shortage of physical memory may be a performance bottleneck. The following graph shows the rate of paging operations to the paging spaces.

There was 1 paging space seen with the lsps -a command. The size of the paging space was 4096 megabytes and the size of physical memory was 4096 megabytes. From 02:00:00 to 08:00:00 on 2007/04/02 paging space usage peaked at approximately 2457.0 megabytes, which is about 60 percent of the page space available.
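The paging space percentage quoted above is the peak usage divided by the paging space size, as this short check shows. The variable names are illustrative.

```python
paging_space_mb = 4096   # size reported by lsps -a
peak_used_mb = 2457.0    # approximate peak usage

peak_used_pct = 100.0 * peak_used_mb / paging_space_mb
print(round(peak_used_pct))
```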

The recorded setting for maxpin% leaves 815.10 megabytes of memory unpinnable. A memory-poor environment was seen even though most of the system's memory was unpinnable.

The average rate at which I/O was blocked because the kernel had to wait for a free bufstruct (called fsbuf in vmstat -v) was 0.02 per second. The peak rate was 1.76 per second from 08:00:00 to 08:20:00, on 2007/04/02. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst when the kernel had to wait for bufstructs, then a problem may be that bufstructs could not be allocated quickly enough to meet the I/O load. A recommendation to increase the number of bufstructs was not made because a memory-poor environment was seen.

The above graph shows when the rate of I/O blocking was highest. If these times are the ones when performance was poor, it may be possible to improve performance by increasing the appropriate number of buffers.
The average context switch rate (cswch/s) was 19024.98 per second. The context switch rate (cswch/s) peaked at 23652.0 per second from 13:40:00 to 13:45:00, on 2007/04/04. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak context switching, then a problem may be that too many processes were blocked for I/O or IPC.
A group of changes has been recommended to some of the VMM parameters that are managed with the vmo utility. These changes are based on new information presented at the IBM System p Technical University. The changes are a significant departure from previous IBM recommendations. In the past, IBM recommended reducing the values of maxperm%, maxclient%, and minperm%. The new recommendations use two new parameters whose names start with "lru" to control page stealing algorithms instead of using maxperm%, maxclient%, minperm%, strict_maxperm, and strict_maxclient. Those values should now be left at (or returned to) their defaults, except for the lowering of minperm%.
The following graph and table show the relationship between used memory, maxperm%, maxclient%, numperm, numclient, and minperm%. Because the values of maxperm% and maxclient% are the same, only one can be seen in the graph.

VMM Statistics

| Metric | Average | Range (%) |
|---|---|---|
| Memory in use: The percentage of memory being used for either file or non-file pages | 95.6% | 59.3 - 100.0 |
| Non-file: IBM frequently calls this 'computational memory'. | 58.8% | 49.0 - 66.7 |
| numperm: Memory which holds file pages (JFS, JFS2, NFS, etc.) | 32.2% | 0.9 - 49.2 |
| numclient: Memory which holds everything except JFS pages. | 36.8% | 2.6 - 51.0 |

| Parameter | Value |
|---|---|
| maxperm% | 80.0 |
| maxclient% | 80.0 |
| minperm% | 20.0 |

NOTE: It's unusual for the average value of numclient to be higher than numperm. Please verify that this is real by running vmo -o or vmtune from time to time and checking these values manually. This doesn't seem to indicate a problem, but it doesn't match the documentation.
No I/O bottleneck was seen in the sar statistics, therefore no changes are recommended for maxpgahead. The value of minpgahead was set to 2. This is the kind of small value that typically works best in most environments.
No I/O bottleneck was seen in the sar statistics, therefore no changes are recommended for j2_maxPageReadAhead. The value of j2_minPageReadAhead was set to 2. This is the kind of small value that typically works best in most environments.
The value of numclust is 1. If fast disk devices, disk arrays, or striped logical volumes are in use, the performance of disk writes could be improved by increasing this value. SarCheck does not have access to enough information about the system's disk devices to make any specific recommendation for tuning numclust.
The value of maxrandwrt was 0. This value causes random JFS writes to stay in RAM until a sync operation.
The value of j2_maxRandomWrite was 0. This value causes random JFS2 writes to stay in RAM until a sync operation. A change to the value of j2_maxRandomWrite has been recommended in order to ensure that there aren't enough writes to cause performance problems during a sync operation.
The average system-wide local I/O rate as measured by the r+w/s column in the sar -d data was 488.9 per second. This I/O rate peaked at 5119 per second from 15:55:00 to 16:00:00, on 2007/04/03. The average size of an I/O based on the r+w/s and blks/s columns was 10.2 blocks, or 1.3 pages. The iostat utility reports that 28.9 percent of disk data transferred were writes and the rest were reads.
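The conversion from blocks to pages quoted above can be reproduced under two assumptions that the report does not state explicitly: that sar -d counts 512-byte blocks and that a page is 4096 bytes.

```python
# Assumptions: 512-byte blocks in the sar -d data and 4096-byte pages.
BLOCK_BYTES = 512
PAGE_BYTES = 4096

avg_io_blocks = 10.2   # average I/O size from r+w/s and blks/s
avg_io_pages = avg_io_blocks * BLOCK_BYTES / PAGE_BYTES
print(round(avg_io_pages, 1))
```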

I/O pacing was not in use. A significant amount of fast I/O was seen to at least one disk device and the I/O rate peaked from 15:55:00 to 16:00:00, on 2007/04/03. Consider turning on I/O pacing if interactive performance or keyboard response problems were seen. This is a technique to limit the amount of I/O that a process can perform, typically as a way of preventing batch jobs from hurting interactive response time when high I/O rates are present.
The following graph shows the average/peak percent busy and average service time for up to 5 disks, sorted by percent busy.

Note: 14 disks were present. By default, the presence of more than 12 disks causes SarCheck to only report on the busiest disks. This is meant to control the verbosity of this report. To see all disks included in the report, use the -d option.
The -dtoo switch has been used to format disk statistics into the following table.
The following disk analysis has been sorted by the average percent of time the disk was busy.
Please note that if RAID devices were present, %busy statistics reported for them are likely to be inaccurate and should be viewed skeptically. The presence of a RAID device is generally invisible to the operating system and therefore invisible to this program.
The disk device hdisk0 was busy an average of 13.86 percent of the time and had an average queue length of 0.0 (when occupied). This indicates that the device is not a performance bottleneck. During the peak interval from 02:40:24 to 02:45:00, on 2007/04/03, the disk was 93.0 percent busy. Peak disk busy statistics can be used to help understand performance problems. If performance was worst when the disk was busiest, then a performance bottleneck may be that disk.
The disk device hdisk1 was busy an average of 11.90 percent of the time and had an average queue length of 0.0 (when occupied). This indicates that the device is not a performance bottleneck. During the peak interval from 02:40:24 to 02:45:00, on 2007/04/03, the disk was 86.0 percent busy.
The disk device hdisk11 was busy an average of 6.85 percent of the time and had an average queue length of 0.0 (when occupied). This indicates that the device is not a performance bottleneck. During the peak interval from 05:00:00 to 05:05:00, on 2007/04/02, the disk was 79.0 percent busy.
The disk device hdisk10 was busy an average of 4.46 percent of the time and had an average queue length of 0.0 (when occupied). This indicates that the device is not a performance bottleneck. During the peak interval from 02:05:01 to 02:10:00, on 2007/04/02, the disk was 58.0 percent busy.
The disk device hdisk4 was busy an average of 3.50 percent of the time and had an average queue length of 0.1 (when occupied). This indicates that the device is not a performance bottleneck. During the peak interval from 05:00:00 to 05:05:00, on 2007/04/02, the disk was 62.0 percent busy.
The disk device hdisk8 was busy an average of 3.40 percent of the time and had an average queue length of 0.1 (when occupied). This indicates that the device is not a performance bottleneck. During the peak interval from 05:00:00 to 05:05:00, on 2007/04/02, the disk was 62.0 percent busy.
Data collected by ps -elf indicated that at 17:20:00 on 2007/04/03 there was a peak of 207 processes present. This was the largest number of processes seen with ps -elf but it is not likely to be the absolute peak because the operating system does not store the true "high-water mark" for this statistic. There was an average of 174.4 processes present.

The -ptoo switch has been used to format ps -elf data into the following table.
Interesting ps -elf data

| Command | User | Process ID | Percent CPU | Average PRI | NI | Memory Growth | Memory Use |
|---|---|---|---|---|---|---|---|
| otherapp | ctmem | 1474654 | 13.77 | 60.00 | 20 | 0.00 MB/hr | 575.11 MB |
| mainapp | root | 1388792 | 72.29 | 112.33 | 20 | 0.71 MB/hr | 5.76 MB |
| mainapp | root | 1413302 | 62.04 | 90.67 | 20 | 0.39 MB/hr | 5.81 MB |
| mainapp | root | 1413308 | 56.97 | 95.71 | 20 | 0.32 MB/hr | 5.93 MB |
| mainapp | root | 1532110 | 72.53 | 109.33 | 20 | 0.21 MB/hr | 5.42 MB |

Unusually large process size seen in otherapp, owned by ctmem, pid 1474654. The size of this process was 575.11 mb.
CPU usage seen in mainapp, owned by root, pid 1388792. Between 02:00:00 and 02:40:00 on 2007/04/02, 1735 seconds of CPU time were used. CPU utilization by this process averaged 72.29 percent of a single processor during that interval. The nice value (NI) for this process was 20. The priority (PRI) for this process ranged from 97 to 120 and the average was 112.33.
CPU usage seen in mainapp, owned by root, pid 1413302. Between 18:20:05 and 20:00:01 on 2007/04/03, 3720 seconds of CPU time were used. CPU utilization by this process averaged 62.04 percent of a single processor during that interval. The nice value (NI) for this process was 20. The priority (PRI) for this process ranged from 82 to 97 and the average was 90.67.
CPU usage seen in mainapp, owned by root, pid 1413308. Between 20:20:00 and 22:20:00 on 2007/04/03, 4102 seconds of CPU time were used. CPU utilization by this process averaged 56.97 percent of a single processor during that interval. The nice value (NI) for this process was 20. The priority (PRI) for this process ranged from 78 to 113 and the average was 95.71.
CPU usage seen in mainapp, owned by root, pid 1532110. Between 18:40:01 and 19:20:00 on 2007/04/04, 1740 seconds of CPU time were used. CPU utilization by this process averaged 72.53 percent of a single processor during that interval. The nice value (NI) for this process was 20. The priority (PRI) for this process ranged from 93 to 120 and the average was 109.33.
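The per-process CPU percentages above are the CPU seconds consumed divided by the elapsed wall-clock time of each interval. The following sketch reproduces them from the quoted timestamps; the function name `cpu_percent` is illustrative.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S"


def cpu_percent(start, end, cpu_seconds):
    """Percent of a single processor used between two timestamps."""
    elapsed = (datetime.strptime(end, FMT)
               - datetime.strptime(start, FMT)).total_seconds()
    return round(100.0 * cpu_seconds / elapsed, 2)


# (pid, interval start, interval end, CPU seconds) as quoted above.
samples = [
    (1388792, "2007/04/02 02:00:00", "2007/04/02 02:40:00", 1735),
    (1413302, "2007/04/03 18:20:05", "2007/04/03 20:00:01", 3720),
    (1413308, "2007/04/03 20:20:00", "2007/04/03 22:20:00", 4102),
    (1532110, "2007/04/04 18:40:01", "2007/04/04 19:20:00", 1740),
]

results = {pid: cpu_percent(start, end, secs)
           for pid, start, end, secs in samples}
for pid, pct in results.items():
    print(pid, pct)
```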
This section is designed to provide the user with a rudimentary linear capacity planning model and should be used for rough approximations only. These estimates assume that an increase in workload will affect the usage of all resources equally. These estimates should be used on days when the load is heaviest to determine approximately how much spare capacity remains at peak times.
Based on the data available, the system cannot support an increase in workload at peak times without some loss of performance or reliability, and the bottleneck is likely to be disk I/O. Implementation of some of the suggestions in the recommendations section may help to increase the system's capacity.

In the above graph, the end of the memory bar is tapered because it represents more of an approximation than the others. The CPU can support an increase in workload of approximately 23 percent at peak times. Since a non-trivial level of page outs and/or swapping was detected, the amount of memory present will have trouble supporting an increase in workload of more than roughly 25 percent at peak times. The busiest disk can support a workload increase of approximately 0 percent at peak times. For more information on peak resource utilization, refer to the Resource Analysis section of this report.
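The report does not say how the spare capacity figures are computed. The CPU and disk figures are consistent with a linear model in which utilization is allowed to grow until it reaches a 90 percent ceiling; that ceiling is inferred from the numbers and is an assumption, as is the function name below.

```python
# Assumption: capacity is exhausted at 90 percent utilization. This value is
# inferred from the figures in the report, not stated by it.
ASSUMED_CEILING = 90.0


def headroom_percent(peak_utilization, ceiling=ASSUMED_CEILING):
    """Linear estimate of how much the workload can grow before the ceiling."""
    growth = (ceiling / peak_utilization - 1.0) * 100.0
    return max(0.0, round(growth, 1))


cpu_headroom = headroom_percent(73.0)    # peak CPU utilization
disk_headroom = headroom_percent(93.0)   # peak busy percent of hdisk0
print(cpu_headroom, disk_headroom)
```

A disk that already peaks above the assumed ceiling yields zero headroom, which matches the conclusion that disk I/O is the likely limit on growth.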
The default CPULIM threshold was changed in the sarcheck_parms file from 20.00 to 40.00 percent.
The default GRAPHDIR was changed with the -gd switch to /test1.
Please note: In no event can Aptitune Corporation be held responsible for any damages, including incidental or consequent damages, in connection with or arising out of the use or inability to use this software. All trademarks belong to their respective owners. Evaluation copy for: Your Company. This software expires on 2007/07/11 (yyyy/mm/dd). Code version: 6.03.09. Serial number: 48485757.
Thank you for trying this evaluation copy of SarCheck. To order a licensed version of this software, just type 'analyze -o' at the prompt to produce the order form, and follow the instructions.
(c) copyright 1995-2007 by Aptitune Corporation, Plaistow NH, USA, All Rights Reserved. http://www.sarcheck.com
Statistics for system gti6a003

| Statistic | Value | Start of peak interval | End of peak interval | Date of peak interval |
|---|---|---|---|---|
| System ID on sar report | 0005B8CA4C00 | | | |
| System ID of this system | 000481674C00 | | | |
| System model number is | IBM Model 7042/7043 (ED) | | | |
| Statistics collected from | 2007/04/02 | | | |
| Statistics collected until | 2007/04/04 | | | |
| Average CPU utilization | 46.0% | | | |
| Peak CPU utilization | 73% | 15:45:01 | 15:50:00 | 2007/04/03 |
| Average user CPU utilization | 32.7% | | | |
| Average sys CPU utilization | 13.3% | | | |
| Average waiting for I/O | 3.3% | | | |
| Average run queue length | 2.8 | | | |
| Peak run queue length | 8.0 | 07:15:00 | 07:20:00 | 2007/04/04 |
| Average run queue occupancy | 86.2% | | | |
| Average swap queue length | 0.17 | | | |
| Peak swap queue length | 6.9 | 02:40:24 | 02:45:00 | 2007/04/03 |
| Peak page replacement cycle rate | 0.01 | 01:00:00 | 01:05:03 | 2007/04/02 |
| Max paging space page outs | 81.75 | 18:20:00 | 18:40:01 | 2007/04/04 |
| Max paging space page ins | 41.93 | 02:40:12 | 03:00:01 | 2007/04/03 |
| Peak page stealer scan rate | 27 MB/sec | 22:00:00 | 22:20:00 | 2007/04/03 |
| Max swapped processes seen by ps | 0 | | | |
| Avg number of processes seen by ps | 174.4 | | | |
| Max number of processes seen by ps | 207 | 17:20:00 | | 2007/04/03 |
| Average % memory in use | 95.6% | | | |
| Average % non-file pages | 58.8% | | | |
| Average numperm value | 32.2% | | | |
| Average numclient value | 36.8% | | | |
| Average context switch rate | 19024.98/sec | | | |
| Number of kproc overflows seen | 0 | | | |
| Disk device w/highest peak | hdisk0 | | | |
| Avg pct busy for that disk | 13.9% | | | |
| Peak pct busy for that disk | 93.0% | 02:40:24 | 02:45:00 | 2007/04/03 |
| Avg I/Os blocked for fsbuf | 0.02/sec | | | |
| Peak I/Os blocked for fsbuf | 1.76/sec | 08:00:00 | 08:20:00 | 2007/04/02 |
| Approx CPU capacity remaining | 23.3% | | | |
| Approx I/O bandwidth remaining | 0.0% | | | |
| Can memory support add'l load | Limited | | | |