This article is the second article in our series about SharePoint. In our first article we provided a few SharePoint basic system performance counters that you can use to monitor the overall health of the server itself that is running SharePoint; a requirement before you can monitor SharePoint effectively and detect and identify possible bottlenecks. In this article we’ll discuss SharePoint monitoring more in depth and provide some best practices.
SharePoint 2010 offers a number of built-in monitoring features for diagnosing problems and performance. By default diagnostic logging is enabled and while most of the time these default settings will be sufficient, there might be times when you want to make changes to these settings. For example, if you are making a major change to your environment, you can configure a more verbose level of logging to track more closely if everything is working as planned.
In addition to diagnostic logging, SharePoint uses scheduled jobs to collect monitoring data related to the health and usage of your environment. The collected data includes performance counters, event log data, Web analysis reports, and administrative reports. Note that you can change the schedule of these so-called Timer jobs in SharePoint to collect data more often or less frequent.
The table below lists the settings that can help you, as an administrator, to manually configure the performance counters and other settings.
|Event Log Flooding Protection||Disabled||The default value is Enabled. It can be disabled to collect as much monitoring data as possible. For normal operations, it should be enabled.|
|Timer Job Schedule|
|Microsoft SharePoint Foundation Usage Data Import||5 minutes||The default value is 30 minutes. Lowering this setting imports the data into the usage database more frequently, and is particularly useful when troubleshooting. For normal operations, it should be 30 minutes.|
|Enable all diagnostic providers||Enabled||The default value is Disabled except for the “Search Health Monitoring – Trace Events” provider. These providers collect health data for various features and components. For normal operations, you may want to revert to the default.|
|Set “job-diagnostics-performance-counter-wfe-provider” and “job-diagnostics-performance-counter-sql-provider” Schedule Intervals||1 minute||The default value is 5 minutes. Lowering this setting can poll data more frequently, and is particularly useful when troubleshooting. For normal operations, it should be 5 minutes.|
|Enable stack tracing for content requests||Enabled||The default value is Disabled. Enabling this setting allows diagnosis of content requests failures using the process stack trace. For normal operations, it should be disabled.|
|Enable the Developer Dashboard||Enabled||The default value is Disabled. Enabling this setting allows diagnosis of slow pages, or other problems by using the Developer Dashboard. For normal operations, and once troubleshooting is no longer necessary, it should be disabled.|
|Usage Data Collection|
|Content Import UsageContent Export UsagePage RequestsFeature UseSearch Query UseSite Inventory Usage
|Enabled||Enabling the logging of this set of counters allows you to collect more usage data across the environment and to better understand the traffic patterns in the environment.|
You can add the performance counters listed below if you are using the usage database. These counters are logged automatically and by default have an interval of 30 minutes. Performance counters are added to the usage database using the PowerShell cmdlett Add-SPDiagnosticsPerformanceCounter. For example to add the Processor Time counter you would enter the following command:
Add-SPDiagnosticsPerformanceCounter -Category “Processor” -Counter “% Processor Time” -Instance “_Total” –WebFrontEnd
Note that you would only have to enter this command on one of the Web servers, that is, if you have more than one Web server in your SharePoint environment.
The table below lists the available system counters that can be added to the usage database using the Add-SPDiagnosticsPerformanceCounter command.
|% Processor Time||This shows processor usage over a period of time. If this is consistently too high, you may find performance is adversely affected. Remember to count “Total” in multiprocessor systems. You can measure the utilization on each processor as well, to ensure balanced performance between cores.|
|– Avg. Disk Queue Length||This shows the average number of both read and write requests that were queued for the selected disk during the sample interval. A bigger disk queue length may not be a problem as long as disk reads/writes are not suffering and the system is working in a steady state without expanding queuing.|
|Avg. Disk Read Queue Length||The average number of read requests that are queued.|
|Avg. Disk Write Queue Length||The average number of write requests that are queued.|
|Disk Reads/sec||The number of reads to disk per second.|
|Disk Writes/sec||The number of writes to disk per second.|
|– Available Mbytes||This shows the amount of physical memory available for allocation. Insufficient memory will lead to excessive use of the page file and an increase in the number of page faults per second.|
|– Cache Faults/sec||This counter shows the rate at which faults occur when a page is sought in the file system cache and is not found. This may be a soft fault, when the page is found in memory, or a hard fault, when the page is on disk.The effective use of the cache for read and write operations can have a significant effect on server performance. You must monitor for increased cache failures, indicated by a reduction in the Async Fast Reads/sec or Read Aheads/sec.|
|– Pages/sec||This counter shows the rate at which pages are read from or written to disk to resolve hard page faults. If this rises, it indicates system-wide performance problems.|
|– % Used and % Used Peak||The server paging file, sometimes called the swap file, holds “virtual” memory addresses on disk. Page faults occur when a process has to stop and wait while required “virtual” resources are retrieved from disk into memory. These will be more frequent if the physical memory is inadequate.|
|– Total Bytes/sec||This is the rate at which data is sent and received via the network interface card. You may need to investigate further if this rate is over 40-50 percent network capacity. To fine-tune your investigation, monitor Bytes received/sec and Bytes Sent/sec.|
|– Working Set||This counter indicates the current size (in bytes) of the working set for a given process. This memory is reserved for the process, even if it is not in use.|
|– % Processor Time||This counter indicates the percentage of processor time that is used by a given process.|
|Thread Count (_Total)||The current number of threads.|
|Requests Total||The total number of requests since the service was started.|
|Requests Queued||Microsoft SharePoint Foundation 2010 provides the building blocks for HTML pages that are rendered in the user browser over HTTP. This counter shows the number of requests waiting to be processed.|
|Request Wait Time||The number of milliseconds that the most recent request waited in the queue for processing. As the number of wait events increases, users will experience degraded page-rendering performance.|
|Requests Rejected||The total number of requests not executed because of insufficient server resources to process them. This counter represents the number of requests that return a 503 HTTP status code, indicating that the server is too busy.|
|Requests Executing (_Total)||The number of requests currently executing.|
|Requests/Sec (_Total)||The number of requests executed per second. This represents the current throughput of the application. Under constant load, this number should remain within a certain range, barring other server work (such as garbage collection, cache cleanup thread, external server tools, and so on).|
|.NET CLR Memory|
|# Gen 0 Collections||Displays the number of times the generation 0 objects (that is, the youngest, most recently allocated objects) are garbage collected since the application started. This number is useful as a ratio of #Gen 0: #Gen 1: #Gen 2 to make sure that the number of Gen 2 collections does not greatly exceed Gen 0 collections, optimally by a factor of 2.|
|# Gen 1 Collections||Displays the number of times the generation 1 objects are garbage collected since the application started.|
|# Gen 2 Collections||Displays the number of times the generation 2 objects are garbage collected since the application started. The counter is incremented at the end of a generation 2 garbage collection (also called a full garbage collection).|
|% Time in GC||Displays the percentage of elapsed time that was spent performing a garbage collection since the last garbage collection cycle. This counter usually indicates the work done by the garbage collector to collect and compact memory on behalf of the application. This counter is updated only at the end of every garbage collection. This counter is not an average; its value reflects the last observed value. This counter should be under 5% in normal operation.|
SQL Server Counters
In addition to system and Web related performance counters, you should monitor the SQL server objects and counters as well. The list below shows what counters are available.
|Objects and Counters||Description|
|General Statistics||This object provides counters to monitor general server-wide activity, such as the number of current connections and the number of users connecting and disconnecting per second from computers running an instance of SQL Server.|
|User Connections||This counter shows the amount of user connections on your instance of SQL Server. If you see this number rise by 500 percent from your baseline, you may see a performance reduction.|
|Databases||This object provides counters to monitor bulk copy operations, backup and restore throughput, and transaction log activities. Monitor transactions and the transaction log to determine how much user activity is occurring in the database and how full the transaction log is becoming. The amount of user activity can determine the performance of the database and affect log size, locking, and replication. Monitoring low-level log activity to gauge user activity and resource usage can help you to identify performance bottlenecks.|
|Transactions/sec||This counter shows the amount of transactions on a given database or on the entire SQL Server instance per second. This number is to help you create a baseline and to help you troubleshoot issues.|
|Locks||This object provides information about SQL Server locks on individual resource types.|
|Number of Deadlocks/sec||This counter shows the number of deadlocks on the SQL Server per second. This should normally be 0.|
|Average Wait Time (ms)||This counter shows the average amount of wait time for each lock request that resulted in a wait.|
|Lock Wait Time (ms)||This counter shows the total wait time for locks in the last second.|
|Lock Waits/sec||This counter shows the number of locks per second that could not be satisfied immediately and had to wait for resources.|
|Latches||This object provides counters to monitor internal SQL Server resource locks called latches. Monitoring the latches to determine user activity and resource usage can help you to identify performance bottlenecks.|
|Average Latch Wait Time (ms)||This counter shows the average latch wait time for latch requests that had to wait.|
|Latch Waits/sec||This counter shows the number of latch requests per second that could not be granted immediately.|
|SQL Statistics||This object provides counters to monitor compilation and the type of requests sent to an instance of SQL Server. Monitoring the number of query compilations and recompilations and the number of batches received by an instance of SQL Server gives you an indication of how quickly SQL Server is processing user queries and how effectively the query optimizer is processing the queries.|
|SQL Compilations/sec||This counter indicates the number of times the compile code path is entered per second.|
|SQL Re-Compilations/sec||This counter indicates the number of times statement recompiles are triggered per second.|
|Plan Cache||This object provides counters to monitor how SQL Server uses memory to store objects such as stored procedures, ad hoc and prepared Transact-SQL statements, and triggers.|
|Cache Hit Ratio||This counter indicates the ratio between cache hits and lookups for plans.|
|Buffer Cache||This object provides counters to monitor how SQL Server uses memory to store data pages, internal data structures, and the procedure cache, as well as counters to monitor the physical I/O as SQL Server reads and writes database pages.|
|Buffer Cache Hit Ratio||This counter shows the percentage of pages found in the buffer cache without having to read from disk. The ratio is the total number of cache hits divided by the total number of cache lookups since an instance of SQL Server was started.|
This represents the entire list of available system objects and performance counters that you can add to your SharePoint 2010 monitoring solution, and we’ll admit it is a quite extensive number of counters. In the next article of our SharePoint series we’ll discuss how to detect bottlenecks, what counters to use, and what to do to resolve bottlenecks.
See also SharePoint Performance Monitoring article.