Steps to Add a Meter

  1. Clarify the Purpose of the Meter
  2. Define the Measurement
  3. Design the Meter
  4. Evaluate Sources of Information
  5. Instrument the Code

Each step is described more fully in the sections below.

We strongly advise making every attempt to complete each step before proceeding to the next.

Rationale.

We have made each of these assumptions numerous times. Each time, we came to regret the assumption. The "obvious" place to measure often turns out to be incomplete, incorrect, or otherwise inappropriate. 

We acknowledge that it can be difficult or impossible to complete each step before proceeding to the next. We strongly advise making the attempt.

Clarify the Purpose of the Meter

Clarify the purpose for adding this meter to Geode.

Rationale. Clarifying the purpose of the meter will help you:

Define the Measurement

Describe as precisely as possible the quantity to measure.

Key elements of the measurement to define:

See below for guidance about each element.

Audience focus. Define the measurement entirely in terms that the audience understands. Define the measurement in such a way that the audience can easily understand which measurements relate most directly to their current goals, questions, and challenges.

Rationale. Defining these elements will help you:

Define the Attribute and the Entity to Measure

Each meter measures some attribute of some entity. Identify and describe the attribute and the type of entity.

The attribute. The attribute is the characteristic that we want to measure about some entity.

Example: A geode.cache.entries gauge measures the number of entries in some local region.

Example: A jvm.memory.used gauge measures the amount of memory in use from some pool of memory.

The type of entity. The entity is the thing whose attribute we want to measure.

Example: Each geode.cache.entries gauge reports the number of entries in a particular local region: That part of a region served by a particular cache server. The measurement is about that local region. The type of entity in this case is local region. We use a region tag to identify the particular local region measured by each meter.

Example: Each jvm.memory.used gauge measures the amount of memory in use in a particular pool of memory. The measurement is about that pool of memory. The type of entity in this case is pool of memory. We use an id tag to identify the particular pool of memory measured by each meter.

The smallest entity. In general, we want to identify the smallest entity that we are interested in that has the attribute. This gives users insight into the attribute at the finest granularity of interest. Because we will add tags that identify interesting scopes that encompass the entity, the user's monitoring system can aggregate the meters from the individually measured entities to compute metrics for the measured attribute in larger scopes.

Example: Each geode.cache.entries gauge reports the number of entries in a particular local region. Later we will add tags to each meter that identify larger scopes, and the user can use those tags to aggregate these local region entry counts. The user can, for example:
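The following sketch illustrates this kind of tag-based aggregation in plain Java. It models what a monitoring system might do with the sampled gauge values; it does not use Geode or Micrometer. The class and method names, the "member" tag key, and the sample values are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of what a monitoring system does with tagged measurements.
// The "member" tag and the sample values are hypothetical.
class EntryCountAggregation {

    // One sampled gauge value together with the tags that identify its scopes.
    static final class Sample {
        final Map<String, String> tags;
        final double value;

        Sample(Map<String, String> tags, double value) {
            this.tags = tags;
            this.value = value;
        }
    }

    // Sums sample values, grouped by the value of the given tag key.
    // Assumes every sample carries the given tag key.
    static Map<String, Double> sumBy(String tagKey, List<Sample> samples) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sample s : samples) {
            totals.merge(s.tags.get(tagKey), s.value, Double::sum);
        }
        return totals;
    }

    // Three local-region entry counts spread over two hypothetical members.
    static List<Sample> exampleSamples() {
        return List.of(
            new Sample(Map.of("region", "orders", "member", "server-1"), 100),
            new Sample(Map.of("region", "orders", "member", "server-2"), 150),
            new Sample(Map.of("region", "customers", "member", "server-1"), 40));
    }

    public static void main(String[] args) {
        // Entries per region across all members: {customers=40.0, orders=250.0}
        System.out.println(sumBy("region", exampleSamples()));
        // Entries per member across all regions: {server-1=140.0, server-2=150.0}
        System.out.println(sumBy("member", exampleSamples()));
    }
}
```

Because each gauge measures the smallest entity of interest, the same samples can be regrouped by any tag without changing the instrumentation.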

What affects the attribute. Identify the operations, events, or conditions that cause the attribute to change. Focus on causes that are meaningful to your audience. The user will use your meter to make inferences about those operations, events, and conditions. Identifying these causes will be useful as you identify potential instrumentation sites.

Define the Scope of the Measurement

Each measurement measures within one or more scopes of interest to your audience. To identify those scopes, look for various boundaries that encompass the entity being measured, or in which the entity participates.

Example: Each geode.cache.entries gauge measures within several scopes:

As the example shows, there are several kinds of boundaries to consider:

The example also shows that:

Audience focus. Identify scopes of interest to your audience: those scopes that your audience may wish to use to select and sort measurements for display and analysis. Of particular interest are the scopes that help to identify the attribute being measured.

Rationale. Defining the scope of the measurement will help you:

Define the Conditions of the Measurement

You may wish to report measurements selectively, either by reporting a measurement only in certain circumstances, or by reporting a given measurement differently in different circumstances.

Key questions:

Deciding whether to measure. You may wish to measure the attribute, or to report a measurement, only under certain conditions.

Example: As we initially defined the geode.function.executions timer, we intended to report only executions of user-defined functions, and not functions defined internally by Geode. Though we have not implemented this distinction, it is an example of the kind of distinction we considered.

Choosing among meters. You may wish to create multiple meters for the same attribute, and select among them to record measurements in different circumstances.

Example: Geode defines two geode.cache.gets timers for each region. One timer reports cache hits, and one reports cache misses. Together these two meters report all get operations on the region.

Example: Geode defines two geode.function.executions timers for each function. One timer reports successful executions, and one reports failed executions. Together these two meters report all executions of the function.

Rationale. Defining the selection criteria for the measurement will help you:

Design the Meter

Select the Type of Meter

Select the general type of meter you want to use to report measurements:

Select the category of meter that best suits the nature of the measurement.

The Micrometer library defines Java interfaces and classes that represent several variations of these categories. For details, see Instrument the Code, below.

Name the Meter

Identify the attribute. Name each meter in a way that clearly identifies the attribute it measures.

Example: jvm.memory.used identifies that the gauge reports some amount of JVM memory used.

Example: geode.function.executions identifies that the timer reports the number and durations of function executions.

Example: geode.cache.entries identifies that the gauge reports a number of entries.

Consider (with caution) identifying the entity type. Consider including the entity type in the name, though it is often (or usually) better to omit it.

Example: geode.function.executions identifies that the meter reports executions of a function. Executions is the attribute being reported. Function is the type of entity whose executions are being reported.

Before including the entity type in the meter name, consider:

Example: We considered (and rejected) geode.cache.region.entries, which would identify that the meter reports not on the cache as a whole, but on a particular region. In the end, we decided that the region tag sufficed to identify the kind of entity whose entry count the meter reports.

Style. After reviewing the naming conventions of meters packaged with Micrometer, we have adopted these style guidelines for naming meters:

Describe the Meter

Concisely describe the meter, including all key details of your definition.

Example (geode.cache.gets): "Total time and count for GET requests from Java or native clients."

Note how this description identifies an important boundary of measurement: It measures only those GET requests from Java clients and native clients. Including such details in your description helps your audience understand what is included in the measurement and what is excluded.

Identify the Unit of Measure

If the unit of measure is not obvious from the meter name, identify the unit of measure.

Define Tags

A tag is a key/value pair that represents some detail about the source or circumstances of a measurement.

General advice:

Example: The geode.cache.gets meter has these tags:

The geode.cache.gets meter also has these pre-defined tags, which Geode automatically adds to every meter:

Example: The jvm.memory.used meter (defined by Micrometer) has these tags:

Pre-defined tags. Geode's metrics framework automatically adds several tags to each meter:

You do not need to add these tags yourself.

Tag names and values. Micrometer does not allow null tag keys or null tag values. Some meter registry implementations do not allow empty tag values.
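One way to guard against unusable tag values is to substitute a placeholder before the tag is attached to a meter. The helper below is a minimal sketch of that idea. TagValues, orPlaceholder, and the "unknown" placeholder are hypothetical names chosen for illustration; they are not part of Geode or Micrometer.

```java
import java.util.Objects;

// Hypothetical helper, not part of Geode or Micrometer.
class TagValues {

    // Returns a tag value that is neither null nor empty, substituting the
    // given placeholder when the candidate value is unusable.
    static String orPlaceholder(String candidate, String placeholder) {
        Objects.requireNonNull(placeholder, "placeholder");
        if (placeholder.isEmpty()) {
            throw new IllegalArgumentException("placeholder must not be empty");
        }
        return (candidate == null || candidate.isEmpty()) ? placeholder : candidate;
    }

    public static void main(String[] args) {
        System.out.println(orPlaceholder("orders", "unknown")); // prints "orders"
        System.out.println(orPlaceholder("", "unknown"));       // prints "unknown"
        System.out.println(orPlaceholder(null, "unknown"));     // prints "unknown"
    }
}
```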

Meter ID = name + tags. A meter is identified not only by its name, but by its name and its tags. Thus each combination of name and tags creates a distinct meter.
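The effect of this identity rule can be modeled in a few lines of plain Java. The sketch below is not Micrometer's own ID class. It only shows that two meters with the same name and different tag values are distinct, while the order in which tags are supplied does not matter. The class name, the idOf method, and the "orders" region name are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of meter identity; not Micrometer's own ID class.
class MeterIdentity {

    // Builds a canonical identity string from a meter name and its tags.
    // Tags are sorted by key so that the order they were supplied in does not matter.
    static String idOf(String name, Map<String, String> tags) {
        return name + new TreeMap<String, String>(tags).toString();
    }

    public static void main(String[] args) {
        String hit = idOf("geode.cache.gets", Map.of("region", "orders", "result", "hit"));
        String miss = idOf("geode.cache.gets", Map.of("region", "orders", "result", "miss"));
        String hitAgain = idOf("geode.cache.gets", Map.of("result", "hit", "region", "orders"));

        System.out.println(hit);                  // geode.cache.gets{region=orders, result=hit}
        System.out.println(hit.equals(miss));     // false: same name, different tag value
        System.out.println(hit.equals(hitAgain)); // true: same name and same tags
    }
}
```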

Combinations of tag keys. Within a single meter registry, make sure that every meter with a given name has exactly the same set of tag keys:

This restriction arises from certain meter registry implementations, such as Micrometer's PrometheusMeterRegistry, that users may wish to use to publish Geode's meters to external monitoring systems.

Note that it is specifically the PrometheusMeterRegistry, and not Prometheus itself, that enforces the restriction. Prometheus appears to allow similarly-named meters to have different sets of tag keys. This means it is permissible (by Prometheus, at least) for tag keys to differ between Geode instances.

We have not tested other monitoring systems to verify whether they similarly allow tag keys to differ between Geode instances.

Note also that this restriction applies only to the set of tag keys. Tag values may vary freely from meter to meter.
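A small guard can catch violations of the tag-key rule early, for example in a unit test. The following sketch is one possible approach under stated assumptions. TagKeyChecker and its check method are hypothetical and are not part of Geode or Micrometer. The guard compares only tag keys, never tag values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical guard, not part of Geode or Micrometer.
class TagKeyChecker {
    // The tag keys first seen for each meter name.
    private final Map<String, Set<String>> keysByName = new HashMap<>();

    // Records the tag keys for a meter name, or throws if a meter with the
    // same name was already seen with a different set of tag keys.
    void check(String meterName, Set<String> tagKeys) {
        Set<String> keys = new TreeSet<>(tagKeys);
        Set<String> known = keysByName.putIfAbsent(meterName, keys);
        if (known != null && !known.equals(keys)) {
            throw new IllegalArgumentException(
                meterName + " already uses tag keys " + known + ", not " + keys);
        }
    }

    public static void main(String[] args) {
        TagKeyChecker checker = new TagKeyChecker();
        checker.check("geode.cache.gets", Set.of("region", "result"));
        checker.check("geode.cache.gets", Set.of("result", "region")); // same keys: accepted
        try {
            checker.check("geode.cache.gets", Set.of("region")); // "result" key is missing
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```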

Evaluate Sources of Information

General advice (details TBD):

Instrument the Code

General advice (details in sections below):

Select a Meter Implementation

Micrometer defines a number of meter types. See the Micrometer documentation for details. Geode adds several custom meter types (noted below) that associate meters with stats.

Choose the appropriate meter implementation depending on:

Counters. A counter represents a monotonically increasing quantity. Each counter has a count() method that reports its measured value.

Gauges. A gauge represents a quantity that can go up or down. Each gauge has a value() method that reports its measured value.

Timers. A timer represents both the total number of occurrences of some event and the total duration of those events. Each timer has a count() method that reports the number of events and a totalTime() method that reports the total duration of events.

Place the Meter in a Stats Class

Encapsulate meters in stats classes. Create and register meters only in stats classes. Interact with meters only in stats classes. Use stats classes to manage the lifetime of meters.

Rationale. Much existing Geode code already uses one or more domain-specific stats classes for instrumentation. Placing meters in existing stats classes avoids complicating the domain code with additional instrumentation noise.

Even if no relevant stats class exists, creating a new stats class to encapsulate meters allows the instrumented code to focus on reporting domain events (e.g. reporting a get operation just finished) rather than on the non-domain details of what and how to measure. And adding a stats class allows instrumenting the code using an already ubiquitous style of instrumentation.
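The sketch below shows the shape of such a stats class in plain Java. To keep it runnable with only the JDK, it uses a small stand-in timer instead of a Micrometer Timer. RegionGetStats, SimpleTimer, and endGet are hypothetical names, not Geode's actual classes or methods. The domain code reports only that a get finished and whether it was a hit. The stats class decides which meter records the event.

```java
import java.util.concurrent.atomic.LongAdder;

// Plain-Java model of a stats class. SimpleTimer is a stand-in for a
// Micrometer Timer so that the sketch runs with only the JDK.
class RegionGetStats {

    // Stand-in timer: accumulates an event count and a total duration.
    static final class SimpleTimer {
        private final LongAdder count = new LongAdder();
        private final LongAdder totalNanos = new LongAdder();

        void record(long nanos) {
            count.increment();
            totalNanos.add(nanos);
        }

        long count() { return count.sum(); }
        long totalNanos() { return totalNanos.sum(); }
    }

    // One timer per outcome, corresponding to result=hit and result=miss.
    final SimpleTimer hitTimer = new SimpleTimer();
    final SimpleTimer missTimer = new SimpleTimer();

    // Domain code reports the event; the stats class decides what to measure.
    void endGet(long durationNanos, boolean hit) {
        (hit ? hitTimer : missTimer).record(durationNanos);
    }

    public static void main(String[] args) {
        RegionGetStats stats = new RegionGetStats();
        stats.endGet(1_500, true);
        stats.endGet(2_500, true);
        stats.endGet(9_000, false);
        System.out.println(stats.hitTimer.count());      // 2
        System.out.println(stats.hitTimer.totalNanos()); // 4000
        System.out.println(stats.missTimer.count());     // 1
    }
}
```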

Adding or changing a stats class. It is uncommon for an existing stats method to know exactly the information required for a new meter.

Obtain the Meter Registry

During cache creation, Geode automatically creates and configures its meter registry. The registry is managed by a "metrics service" owned by the InternalDistributedSystem. You can obtain the meter registry through the InternalDistributedSystem or, for convenience, from the InternalCache:

The code you are instrumenting, or the stats class in which you are adding the meters, may also offer access to Geode's meter registry.

Add the Meter to Geode's Meter Registry

Each meter type includes a builder that you can use to progressively define a meter, then register the defined meter with the meter registry.

Timer example:

Timer cacheGetsHitTimer = Timer.builder("geode.cache.gets")
    .description("Total time and count for GET requests from Java or native clients.")
    .tag("region", region.getName())
    .tag("result", "hit")
    .register(meterRegistry);

Gauge example:

Gauge entriesGauge = Gauge.builder("geode.cache.entries", region::getLocalSize)
        .description("Current number of entries in the region.")
        .tag("region", region.getName())
        .tag("data.policy", region.getDataPolicy().toString())
        .baseUnit("entries")
        .register(meterRegistry);

Note that when you build a Gauge, you must tell it how to make a measurement. In this example, line 1 configures the gauge to use a Supplier<Number> (defined by the region::getLocalSize method reference) to measure the entry count.

An alternate builder() method takes a T object and a ToDoubleFunction<T>, and creates its own measurement supplier that applies the given function to the given object.

FunctionCounter and FunctionTimer are configured similarly. You must tell them how to make their measurements.
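Conceptually, the alternate Gauge builder() form described above adapts an object and a measuring function into a measurement supplier. The sketch below illustrates only that adaptation, using JDK types. It is not Micrometer's implementation, which may differ in details such as how it holds the measured object. GaugeSuppliers and supplierFor are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.function.ToDoubleFunction;

// Conceptual sketch only; not Micrometer's implementation.
class GaugeSuppliers {

    // Adapts an object and a measuring function into a measurement supplier.
    // Each call to get() re-applies the function, so the value stays current.
    static <T> Supplier<Number> supplierFor(T measured, ToDoubleFunction<T> measure) {
        return () -> measure.applyAsDouble(measured);
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        Supplier<Number> entryCount = supplierFor(entries, List::size);

        entries.add("a");
        entries.add("b");
        System.out.println(entryCount.get()); // 2.0
    }
}
```

The supplier is evaluated each time a measurement is taken, so the gauge reports the current state of the measured object rather than its state at registration time.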

LegacyStatCounter example:

Counter eventsReceivedCounter = LegacyStatCounter.builder("geode.gateway.receiver.events")
        .longStatistic(stats, eventsReceivedId)
        .description("total number events across the batched received by this GatewayReceiver")
        .baseUnit("operations")
        .register(meterRegistry);

Note that line 2 links the LegacyStatCounter to a specific statistic (eventsReceivedId) in a specific Statistics instance (stats).

The LegacyStatCounter builder also has a doubleStatistic method that links the counter to a double stat.

LegacyStatTimer is configured similarly, using builder methods that allow you to forward its count and duration increments to associated long or double stats.

Manage the Meter's Lifetime

Give each meter the same lifetime as the entity whose attributes it measures:

Example: Each geode.cache.entries meter reports the number of entries in a given region. Each region's geode.cache.entries meter should be registered when the region is created and removed from the registry and closed when the region is destroyed.

Rationale: Each meter consumes memory. Publishing each meter consumes CPU cycles. In long-running systems, where the measured objects come and go, leftover meters can accumulate, consuming an increasing amount of memory and CPU time.

Geode's use of Micrometer allows the user to publish the measurements to external monitoring systems for long-term storage. As a result, it is unnecessary for Geode to retain meters that measure objects that no longer exist.
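The lifetime rule can be modeled in plain Java as a pair of callbacks tied to the entity's creation and destruction. In this sketch a map stands in for the meter registry, and a LongSupplier stands in for a gauge. RegionEntryGauges, regionCreated, and regionDestroyed are hypothetical names, not Geode's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Plain-Java model of meter lifetime. The map stands in for a meter registry.
class RegionEntryGauges {
    private final Map<String, LongSupplier> gaugesByRegion = new ConcurrentHashMap<>();

    // Called when a region is created: register its gauge.
    void regionCreated(String regionName, LongSupplier entryCount) {
        gaugesByRegion.put(regionName, entryCount);
    }

    // Called when a region is destroyed: remove its gauge so that
    // leftover gauges cannot accumulate.
    void regionDestroyed(String regionName) {
        gaugesByRegion.remove(regionName);
    }

    int registeredGaugeCount() {
        return gaugesByRegion.size();
    }

    public static void main(String[] args) {
        RegionEntryGauges gauges = new RegionEntryGauges();
        // Simulate many short-lived regions.
        for (int i = 0; i < 1_000; i++) {
            String name = "temp-" + i;
            gauges.regionCreated(name, () -> 0L);
            gauges.regionDestroyed(name);
        }
        System.out.println(gauges.registeredGaugeCount()); // 0
    }
}
```

Without the regionDestroyed call, the same loop would leave 1,000 gauges registered, which is the accumulation the rationale above warns about.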

Avoid Redundant Meters

Do not create meters whose values can be derived from other meters. For example:

Rationale. External monitoring systems can compute these derived values from the series of measurements over time.
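As an illustration, assume the derived values in question are quantities such as a mean duration or an event rate. The sketch below shows how a monitoring system could compute both from a timer's count and total time, so that Geode need not publish them as separate meters. DerivedMetrics and its method names are hypothetical, and the numbers are made up.

```java
// Plain-Java sketch of values a monitoring system can derive by itself.
class DerivedMetrics {

    // Mean duration per event, from a timer's total time and count.
    static double meanSeconds(double totalSeconds, long count) {
        return count == 0 ? 0.0 : totalSeconds / count;
    }

    // Events per second between two samples of a counter or timer count.
    static double ratePerSecond(long earlierCount, long laterCount, double elapsedSeconds) {
        return (laterCount - earlierCount) / elapsedSeconds;
    }

    public static void main(String[] args) {
        // 300 events taking 6 seconds in total.
        System.out.println(meanSeconds(6.0, 300));
        // Count rose from 1000 to 1300 over 60 seconds.
        System.out.println(ratePerSecond(1_000, 1_300, 60)); // 5.0
    }
}
```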