SarCheck(TM): Automated Analysis of SCO UNIX sar and ps data

(English text version 3.00)


This is an analysis of the data contained in the file sar23. The data was collected on 07/03/1997, from 09:00:04 to 19:00:51, from system 'snippy'. There were 10 data records used to produce this analysis. Operating system is SCO UNIX 3.2v4.2. 1 processor is present. 4 megabytes of memory are present.

Data collected by the ps -elf command on 07/03/1997 from 08:00:00 to 17:30:00, and stored in the file /usr/local/ps/19970703, will also be analyzed.

SUMMARY

When the data was collected, a moderate CPU bottleneck may have existed. A memory bottleneck was seen. At least one disk drive was busy enough to cause performance degradation. A change has been recommended to at least one tunable parameter. A possible runaway process has been detected. See the Resource Analysis Section for details. Limits to future growth have been noted in the Capacity Planning section.

RECOMMENDATIONS SECTION

All recommendations contained in this report are based solely on the conditions which were present when the performance data was collected. It is possible that conditions which were not present at that time may cause some of these recommendations to result in worse performance. To minimize this risk, analyze data from several different days, implement only regularly occurring recommendations, and implement them one at a time.

Because heavy CPU utilization was seen, adjusting process priorities with the nice(C) command, optimizing applications, or a CPU upgrade may help performance.

Additional memory may improve performance. If possible, borrow some memory for test purposes, and monitor system performance and resource utilization before and after its installation.

While buffer cache statistics indicate that more system buffers would reduce I/O and improve performance, increasing the number of system buffers is not recommended at this time due to the presence of a memory-poor condition. Consider adding memory before increasing the number of buffers.

Change the value of NHBUF from 512 to 256. This change will save 4096 bytes of memory. The parameter NHBUF can be changed by running the configure(ADM) utility and going to category 1.

Change the value of NPROC from 100 to 80. This change will save 7440 bytes of memory. The parameter NPROC can be changed by running the configure(ADM) utility and going to category 4.

Change the value of NINODE from 300 to 255. This change will save 7740 bytes of memory. The parameter NINODE can be changed by running the configure(ADM) utility and going to category 3. Please note that this recommendation leaves NINODE set to a value less than NFILE. While SCO documentation does not recommend this, many sites run this way with no problems.

Change the value of S5CACHEENTS from 600 to 660. This change is recommended because the namei cache hit ratio was only 82.1 percent. This change will use an additional 2400 bytes of memory. The parameter S5CACHEENTS can be changed by running the configure(ADM) utility and going to category 3. No change to S5HASHQS is recommended at this time.

Change the value of the NMPHEADBUF parameter from 150 to 180. This parameter is used to set the number of buffer headers. These structures keep track of outstanding scatter/gather requests. This change will use an additional 2160 bytes of memory. The parameter NMPHEADBUF can be changed by running the configure(ADM) utility and going to category 3.

Consider balancing the load on disk devices by moving some of the I/O from wd-0 to wd-1, which was only 11.0 percent busy. Please note that disk space statistics are not available in the sar -d report, and therefore have not been considered in these recommendations.

The implementation of all recommended changes will save 14716 bytes of memory. We do not recommend implementing all of these changes at once.
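
The net figure can be checked by summing the per-parameter memory effects quoted above. The following Python sketch is illustrative only and is not part of SarCheck; the dictionary name and layout are assumptions, while the byte counts are copied from the recommendations.

```python
# Memory effect of each recommended change, in bytes, as quoted in the
# recommendations above. Positive values are savings; negative values are
# additional memory used. The dictionary itself is illustrative.
memory_delta_bytes = {
    "NHBUF": 4096,         # 512 -> 256
    "NPROC": 7440,         # 100 -> 80
    "NINODE": 7740,        # 300 -> 255
    "S5CACHEENTS": -2400,  # 600 -> 660
    "NMPHEADBUF": -2160,   # 150 -> 180
}

net_saving = sum(memory_delta_bytes.values())
print(f"Net memory saved: {net_saving} bytes")
```

The three reductions save 19276 bytes and the two increases use 4560 bytes, which leaves the net saving of 14716 bytes.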

Once you use the configure(ADM) utility to change the value of parameters, you should relink the kernel and reboot the system in order to implement the changes. More information on the configure utility and relinking the kernel is available in the System Administrator's Reference.

RESOURCE ANALYSIS SECTION

The CPU was regularly more than 80 percent busy. This indicates that an intermittent CPU bottleneck may exist which can cause inconsistent performance. CPU utilization peaked at 98 percent from 12:00:06 to 13:00:21. Peak resource utilization statistics can be used to help understand performance problems. If performance was worst during the period of peak CPU utilization, then the performance bottleneck may be the CPU. A possible runaway process has been detected.

The run queue had an average depth of 2.0 and was regularly 3 or more. The run queue was usually not occupied, despite the frequent presence of a significant run queue depth. This is usually indicative of CPU activity which occurs in bursts. The run queue depth is the average number of processes which were ready to run.

More than 7 percent of the CPU's time was regularly spent waiting for disk I/O. This indicates the possibility of an intermittent I/O bottleneck. Disk statistics confirm the presence of an I/O bottleneck.

The cache hit ratio of logical reads was regularly less than 90 percent. Increasing the number of buffers would help to reduce I/O and improve performance, but this is not recommended due to the apparent presence of a memory shortage.

The size of NHBUF was unusually large. Due to the memory-poor condition of this system, the large hash buckets may not be worth the memory.

In the event of a system crash, an average of 20 seconds worth of data will be lost because it will not have been written to disk. This is controlled by the NAUTOUP and BDFLUSHR parameters. This statistic has been calculated using the formula: NAUTOUP + (BDFLUSHR / 2).
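
The formula can be written as a small function. This is a minimal sketch: the function name is hypothetical, and the sample values of NAUTOUP and BDFLUSHR are assumptions chosen only because they reproduce the 20-second figure. The report does not state the values actually configured on this system.

```python
def average_data_at_risk_seconds(nautoup: float, bdflushr: float) -> float:
    """Average amount of unwritten data, in seconds, lost in a crash.

    Implements the formula quoted in the report:
    NAUTOUP + (BDFLUSHR / 2).
    """
    return nautoup + bdflushr / 2


# Hypothetical settings, not taken from the report.
print(average_data_at_risk_seconds(nautoup=10, bdflushr=20))
```

Lowering either parameter reduces the exposure, at the cost of more frequent writes to disk.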

The ratio of exec to fork system calls was 1.19. This indicates that PATH variables are efficient.

The namei cache hit ratio was only 82.1 percent. S5CACHEENTS and possibly S5HASHQS values should be increased until the percent of hits averages 90 percent or above.

At least one indication of a memory shortage was seen in the following statistics: Data collected with ps -elf shows that the sched daemon used 1 second of CPU time. This indicates a memory shortage. Data collected with ps -elf shows that the vhand daemon used 7 seconds of CPU time. This indicates a possible memory shortage, which is confirmed by other statistics related to memory utilization.

Some of the swap area was used during the monitoring period, confirming that the system is memory-poor.

The average number of free pages usually did not stray far above the value of GPGSHI. This indicates that vhand, the page stealing daemon, was usually active. The memory-poor condition seen on this system has resulted in increased CPU overhead as well as additional disk activity.

Direct disk access was seen during the monitoring period, and NPBUF, the number of control blocks, was sufficient to meet the system's needs.

The rate of scatter/gather requests (mpbuf/s) was regularly greater than one per second, which frequently indicates that a disk adapter does not support the scatter/gather performance optimization. If this is the case, performance may be improved by upgrading to a disk adapter which supports scatter/gather.

At times, none of the buffer headers used to keep track of scatter/gather requests were available. This causes requests to be individually sent to the disk adapter, bypassing the performance gains of scatter/gather. Recommendations for increasing the number of buffer headers have been made in the recommendations section.

Scatter/gather is an I/O optimization technique which groups together filesystem requests which are physically adjacent to each other. The recommendation for increasing NMPHEADBUF, the number of buffer headers, is unusually small because swapping was seen, indicating the system was memory-poor.

NINODE is set to a value less than NFILE. SCO documentation does not recommend this, but many sites run this way with no problems.

All system tables were less than 80 percent full. This indicates that the system tables are not in danger of filling, but grossly oversized tables can waste memory. If any changes to table size are needed, the changes will be in the recommendations section. Peak table usage statistics (max used/table size) as reported by sar: Process table: 20/100. Inode table: 57/300. Open file table: 41/375. Lock Table: 1/100.
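
The peak usage figures can be turned into percentages to confirm that every table stayed below the 80 percent threshold. This Python sketch is illustrative only; the variable names are assumptions, and the numbers are the peak values listed above.

```python
# Peak table usage (max used, table size) as reported by sar above.
peak_usage = {
    "Process table": (20, 100),
    "Inode table": (57, 300),
    "Open file table": (41, 375),
    "Lock table": (1, 100),
}

FULL_THRESHOLD_PERCENT = 80  # threshold named in the report

percent_full = {
    name: 100 * used / size for name, (used, size) in peak_usage.items()
}

for name, pct in percent_full.items():
    status = "ok" if pct < FULL_THRESHOLD_PERCENT else "near full"
    print(f"{name}: {pct:.1f}% full ({status})")
```

No table exceeded 20 percent of its size at peak, which is why the tables are described as oversized rather than at risk of filling.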

The size of MAXUP is currently 50, which is sufficiently smaller than NPROC. The sar utility reported that the value of NPROC was 100.

The process table, controlled by the NPROC parameter, was much larger than necessary. Reducing its size to 50 would save 18600 bytes of memory. A significant reduction in NPROC would have to be accompanied by a reduction in MAXUP. This is a fairly drastic example and not a specific recommendation to change the table size. SarCheck will make a recommendation in the Recommendations Section if it is trying to solve a specific problem.

The inode table, controlled by the NINODE parameter, was much larger than necessary. Reducing its size to 114 would save 31992 bytes of memory. This is a fairly drastic example and not a specific recommendation to change the table size. SarCheck will make a recommendation in the Recommendations Section if it is trying to solve a specific problem.

The file table, controlled by the NFILE parameter, was much larger than necessary. Even though sar statistics indicate that this is unlikely to have any specific impact on performance, reducing its size to 100 would save 3300 bytes of memory. This is a fairly drastic example and not a specific recommendation to change the table size. SarCheck will make a recommendation in the Recommendations Section if it is trying to solve a specific problem.

The lock table, controlled by the FLCKREC parameter, was much larger than necessary. Even though sar statistics indicate that this is unlikely to have any specific impact on performance, reducing its size to 50 would save 1700 bytes of memory. This is a fairly drastic example and not a specific recommendation to change the table size. SarCheck will make a recommendation in the Recommendations Section if it is trying to solve a specific problem.
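
The four examples above imply a fixed memory cost per table entry. The Python sketch below derives that cost from the quoted figures and cross-checks it against the NPROC and NINODE changes in the recommendations section. The per-entry sizes are inferred from this report's arithmetic, not taken from SCO documentation, and the variable names are assumptions.

```python
# (current size, example reduced size, bytes saved) as quoted above.
examples = {
    "NPROC": (100, 50, 18600),
    "NINODE": (300, 114, 31992),
    "NFILE": (375, 100, 3300),
    "FLCKREC": (100, 50, 1700),
}

# Bytes per table entry implied by each example (an inference, not a
# documented structure size).
bytes_per_entry = {
    name: saved / (current - reduced)
    for name, (current, reduced, saved) in examples.items()
}

for name, size in bytes_per_entry.items():
    print(f"{name}: {size:g} bytes per entry")

# Cross-check against the recommended changes:
# NPROC 100 -> 80 and NINODE 300 -> 255.
nproc_saving = (100 - 80) * bytes_per_entry["NPROC"]
ninode_saving = (300 - 255) * bytes_per_entry["NINODE"]
print(f"NPROC saving: {nproc_saving:g}, NINODE saving: {ninode_saving:g}")
```

The cross-check reproduces the 7440 and 7740 byte savings quoted in the recommendations section, so the drastic examples and the actual recommendations use the same per-entry cost.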

The size of S5HASHQS was unusually large. Due to the memory-poor condition of this system, the large hash buckets may not be worth the memory.

The device wd-0 was busy an average of 63.0 percent of the time and had an average queue depth of 3.1 (when occupied). This indicates that the device was likely to be a performance bottleneck. During the peak interval from 15:00:51 to 16:00:06, the disk was 84.4 percent busy. Peak disk busy statistics can be used to help understand performance problems. If performance was worst during the period when the disk was busiest, then the performance bottleneck may be that disk. The average service time reported for this device and its accompanying disk subsystem was 10.3 milliseconds. This is relatively fast. Service time is the delay between the time a request was sent to a device and the time that the device signaled completion of the request.

The device wd-1 was busy an average of 11.0 percent of the time and had an average queue depth of 1.4 (when occupied). This indicates that the device is not a performance bottleneck. The average service time reported for this device and its accompanying disk subsystem was 12.6 milliseconds. This service time is acceptable.

CPU usage seen in spinloop, owned by don, pid 9533. Between 12:00:01 and 14:30:00, 7482 seconds of CPU time were used. CPU utilization by this process averaged 83.14 percent during that interval.
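
The average utilization follows from dividing the CPU time used by the elapsed time in the interval. A minimal Python sketch of that arithmetic, using the figures above (the variable names are assumptions):

```python
from datetime import datetime

# Interval and CPU time quoted in the report for pid 9533.
start = datetime.strptime("12:00:01", "%H:%M:%S")
end = datetime.strptime("14:30:00", "%H:%M:%S")
cpu_seconds = 7482

elapsed_seconds = (end - start).total_seconds()
utilization_percent = 100 * cpu_seconds / elapsed_seconds

print(f"{elapsed_seconds:.0f} s elapsed, {utilization_percent:.2f}% CPU")
```

The interval is 8999 seconds long, so 7482 seconds of CPU time corresponds to 83.14 percent of a single processor.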

CAPACITY PLANNING SECTION

This section is designed to provide the user with a rudimentary linear capacity planning model and should be used for rough approximations only. These estimates assume that an increase in workload will affect the usage of all resources equally. These estimates should be used on days when the load is heaviest to determine approximately how much spare capacity remains at peak times.

Based on the limited data available in this single sar report, the system cannot support an increase in workload at peak times without some loss of performance or reliability, and the bottleneck is likely to be CPU utilization. Implementation of some of the suggestions in the recommendations section may help to increase the system's capacity.

The CPU cannot support any increase in workload without performance degradation at peak times. Since paging and/or swapping were detected, any increase in workload should be accompanied by an increase in memory. The busiest disk can support a workload increase of approximately 7 percent at peak times. For more information on peak CPU and disk utilization, refer to the Resource Analysis section of this report.

All system tables measured by sar -v can hold at least twice as many entries as were seen.

Please note: In no event can Aurora Software Inc. be held responsible for any damages, including incidental or consequent damages, in connection with or arising out of the use or inability to use this software. All trademarks belong to their respective owners.

This is beta quality software and is to be used only in conjunction with a beta test program. This software is likely to contain defects and its recommendations should be regarded skeptically. This software is provided for the exclusive use of: Your Company. This software expires on 09/30/1997 (mm/dd/yyyy). Code version: 3.00. Serial number: 00026146.

(c) copyright 1994-1997 by Aurora Software Inc., Plaistow NH, USA, All Rights Reserved. http://www.sarcheck.com