Wednesday, April 24, 2013

Analyze GemFire Performance using VSD

 

VSD – Visual Statistics Display

In distributed systems, data management is a very complex problem space, and GemFire handles it very well. Even so, we need a way to keep an eye on GemFire's performance, resource usage, and runtime state over time. Hyperic is one option for that.

Put simply, this is how VSD differs from Hyperic for monitoring:

Hyperic – monitoring tool with alert system

VSD – monitoring tool without alert system

VSD is a visual tool for analyzing GemFire statistics. It reads the statistics from special archive files created by GemFire and renders them as graphs for analysis. Unlike Hyperic, it is not a real-time online monitoring tool, so it has no real-time monitoring or alerting capabilities. On the other hand, it is the most powerful tool for examining the state of a vFabric GemFire system, because it provides access to all the statistics that GemFire collects. No real-time monitoring tool can do that, as the amount of statistics that GemFire collects is prohibitive for real-time collection in a distributed system.

VSD gives a complete view into the state of a GemFire process for performance analysis, and helps track down problems through offline analysis of the statistics gathered by the cluster. It is also helpful any time we need to verify the runtime state of a distributed system, for example upon startup or after data loading: to make sure that all the nodes are present and see one another, that all the entries are loaded and well balanced across the nodes, or that the JVM heaps have enough headroom.

Below are the important statistics that are useful in verifying the state of a distributed system, including its configuration, resource usage, and throughput for different operations.

Setting the configuration properties for VSD:

To have GemFire write the statistics archive that VSD reads, set statistic-sampling-enabled=true and statistic-archive-file=myStats.gfs. Because collecting statistics at the default sampling rate of 1s does not affect performance, it should always be turned on: during development, during testing, and in production.

There is a special category of statistics called time-based statistics that can be very useful in troubleshooting and assessing performance of some GemFire operations, but they should be used with caution because their collection can affect performance. They can be enabled using the property enable-time-statistics=true.
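
As a rough illustration, the three properties mentioned above could be rendered into a gemfire.properties fragment with a small helper like the one below. The property names come from this post; the function name and its defaults are assumptions made for the sketch, and time-based statistics are left off unless requested because of their possible performance cost.

```python
# Sketch: render the statistics-related settings as gemfire.properties text.
# Property names are taken from the text above; the helper itself is
# hypothetical and not part of GemFire.

def render_stats_properties(archive_file="myStats.gfs", time_stats=False):
    """Return gemfire.properties lines that turn on statistics archiving."""
    props = {
        "statistic-sampling-enabled": "true",
        "statistic-archive-file": archive_file,
        # Time-based statistics can affect performance, so they stay off
        # unless the caller explicitly asks for them.
        "enable-time-statistics": "true" if time_stats else "false",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    print(render_stats_properties(), end="")
```

Calling render_stats_properties(time_stats=True) would produce the same fragment with enable-time-statistics=true, which is probably best kept for troubleshooting sessions.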

Once a distributed system is up and running, every GemFire instance will have its own statistics file. You can copy all the statistics files into one directory so that you can easily load them into VSD.
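
A minimal sketch of that copy step follows, assuming each member writes its .gfs archive into its own working directory. The directory layout and the function name are hypothetical. Each copy is prefixed with the member directory name, since members configured with the same statistic-archive-file value would otherwise overwrite one another in the shared directory.

```python
import shutil
from pathlib import Path


def collect_stat_files(member_dirs, target_dir):
    """Copy every *.gfs file from each member directory into target_dir.

    Each copy is prefixed with its member directory name so that archives
    with the same file name (e.g. myStats.gfs) do not collide.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for member_dir in map(Path, member_dirs):
        for stat_file in sorted(member_dir.glob("*.gfs")):
            destination = target / f"{member_dir.name}-{stat_file.name}"
            shutil.copy2(stat_file, destination)
            copied.append(destination)
    return copied
```

For example, collect_stat_files(["/data/server1", "/data/server2"], "/tmp/vsd-stats") (paths made up for illustration) would leave server1-myStats.gfs and server2-myStats.gfs side by side, ready to be loaded into VSD.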

You can always find VSD under the tools folder of the GemFire 7 product directory, as shown in the screenshot below.

image

Detailed documentation is available here if you need it.

Analyzing the Data

Once you have VSD running and the statistics archives loaded, it will be populated with lots of interesting data, as shown in the screenshot below.

image

Runtime Configuration

 

  • The number of peer nodes in the system:
    • DistributionStats:nodes. This value should be the same for every node in the system.
  • The number of clients and client connections for each server: CacheServerStats: currentClients and currentClientConnections.
  • The number of data entries:
    • CachePerfStats:entries. Each region has its own CachePerfStats instance per JVM, named RegionStats-<region name>, or RegionStats-partition-<region name> for partitioned regions. Its entries statistic is the number of entries for that region in the JVM.
    • DiskRegionStatistics (a per-region statistic category about the region's disk use): entriesInVM and entriesOnlyOnDisk show the number of entries in the JVM (which may also be on disk) and the number of entries that are only on disk, respectively.
  • Partitioned Region Configuration: One of the main parameters of Partitioned Region (PR) configuration is the primary bucket distribution. To make sure that the primary buckets for a PR are evenly distributed, check the PartitionedRegionStats.primaryBucketCount statistic for each partition. This statistic shows the number of primary buckets in a partition.
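
The two balance checks above (the same nodes value everywhere, evenly spread primary buckets) come down to simple comparisons once the values have been read off the VSD charts. The sketch below assumes the readings were copied by hand into dictionaries keyed by member name; the helper names and the sample numbers are illustrative, not real measurements.

```python
def nodes_consistent(nodes_by_member):
    """True when every member reports the same DistributionStats:nodes value."""
    return len(set(nodes_by_member.values())) <= 1


def primary_bucket_spread(primaries_by_member):
    """Difference between the largest and smallest primaryBucketCount."""
    counts = list(primaries_by_member.values())
    if not counts:
        return 0
    return max(counts) - min(counts)


if __name__ == "__main__":
    # Hypothetical readings copied from VSD for three members.
    nodes = {"server1": 3, "server2": 3, "server3": 2}
    primaries = {"server1": 38, "server2": 37, "server3": 38}
    print("node view consistent:", nodes_consistent(nodes))
    print("primary bucket spread:", primary_bucket_spread(primaries))
```

A spread of 0 or 1 suggests the primaries are about as even as they can be; what counts as an acceptable spread beyond that is a judgment call for the deployment.
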

Resources

The resources that are vital for normal operation and performance are memory, file descriptors (most importantly sockets, then files), CPU, network, and disk (when disk operations, such as overflow and persistence, are involved). The following stats cover all those:

  • Memory: There are several stats categories that show memory usage, for different types and granularity of memory.
    • Heap: the VMMemoryUsageStats:vmHeapMemoryStats statistics are all about heap usage, as are the memory statistics under VMStats:vmStats: freeMemory, totalMemory, and maxMemory.
    • Non-heap memory: VMMemoryUsageStats:vmNonHeapMemoryStats.
    • System-wide memory stats as reported by the OS: The OS statistic category (e.g. LinuxSystemStats on Linux) includes various system-level memory statistics, such as freeMemory, which shows the free memory on the host (as opposed to memory related to the JVM process), physicalMemory (total physical memory on the host), and paging-related statistics (pagesSwappedIn, pagesSwappedOut, unallocatedSwap).
    • Client and gateway queue sizes: while not actual resources, these queues may be responsible for increased memory usage, so it's good to keep them in mind when investigating memory issues. The client queue stats are in the ClientSubscriptionStats category: eventsQueued and eventsRemoved. The difference between the two is the current queue size. The gateway queue stats are in the GatewayStatistics (GatewaySenderStatistics as of GemFire 7.0) category: eventQueueSize is the size of the queue.
  • File Descriptors: file descriptor related statistics are captured in the category VMStats: fdsOpen and fdLimit show the number of open file descriptors and the limit on file descriptors for the host, respectively.
  • CPU: The CPU usage is captured in the OS statistic category, e.g. LinuxSystemStats. The statistic cpuActive shows the percentage of the total available CPU time that has been used in a non-idle state.
  • System load: The OS statistic category (e.g. LinuxSystemStats) includes the loadAverage1, loadAverage5, and loadAverage15 statistics, which show the average system load over 1, 5, and 15 minutes.
  • Network: The OS stats also include network-related statistics for received (recv) and transmitted (xmit) traffic: recvBytes, xmitBytes, recvErrors, and xmitErrors. Note that some of these statistics may be incorrect in GemFire versions prior to 6.6.2 due to a bug that is fixed in GemFire 6.6.2.
  • Disk: DiskDirStatistics:diskSpace shows the amount of disk space used for GemFire disk storage on a given disk. The entriesOnlyOnDisk and entriesInVM statistics from DiskRegionStatistics, mentioned above, are useful for determining the distribution of data between memory and disk for regions that use disk overflow or persistence.
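
Several of the resource statistics above are most useful when combined. The sketch below shows the arithmetic for three of them, using values read off VSD by hand. The helper names are made up for illustration, and the heap calculation assumes that freeMemory, totalMemory, and maxMemory in VMStats carry the same meaning as the corresponding java.lang.Runtime values.

```python
def client_queue_size(events_queued, events_removed):
    """Current client queue size: eventsQueued minus eventsRemoved."""
    return events_queued - events_removed


def heap_headroom(free_memory, total_memory, max_memory):
    """Bytes the heap can still grow into before reaching maxMemory.

    Assumes Runtime-style semantics: used heap = totalMemory - freeMemory.
    """
    used = total_memory - free_memory
    return max_memory - used


def fd_usage_ratio(fds_open, fd_limit):
    """Fraction of the file descriptor limit currently in use."""
    return fds_open / fd_limit


if __name__ == "__main__":
    # Hypothetical sample values, not real measurements.
    print("client queue size:", client_queue_size(1500, 1200))
    print("heap headroom (bytes):", heap_headroom(200, 1000, 4000))
    print("fd usage ratio:", fd_usage_ratio(512, 1024))
```

A client queue size that keeps growing, or a file descriptor ratio approaching 1, would be a reason to look more closely at the corresponding charts in VSD.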
