Status

Current state: Under Discussion

Discussion thread: here

JIRA: CASSANDRA-13460

Voting thread: TBD

Released: TBD

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).



Scope

The scope of this CEP is to design a robust way to expose diagnostic events in virtual tables, and to provide a pluggable mechanism to persist diagnostic events to an arbitrary sink, with a reference implementation of a bin logger based on Chronicle queues.

Goals

This CEP meets its goals if all of the following are possible:

  • select diagnostic events from a virtual table
  • see which event classes and event types can be subscribed to
  • subscribe to an event class, either for all its types or just for some type, and unsubscribe again
  • unsubscribe from all events
  • get the raw diagnostic data into a vtable column for the respective event
  • provide a pluggable way to specify a consumer
  • provide a reference implementation of a persistent consumer storing events in a bin log
  • provide a CLI tool for inspecting the content of such a bin log
  • configure the maximum number of events per event class held in memory via cassandra.yaml (currently hard-coded to 200)

Motivation

Diagnostic events are currently retrievable via JMX only. Exposing them over CQL would make them accessible to less technically savvy users and make the user experience more comfortable, lowering the barrier for Cassandra users to see more of Cassandra's state and raising the overall observability capabilities Cassandra offers.

The current solution also lacks any persistence layer. Persistence of diagnostic events is necessary for offline investigation of what events a node fired before it was shut down or crashed. For now, when a node goes down, its events are no longer accessible because they are stored only in memory. We think that a persistence layer would enable post-mortem / forensic investigation into what state a node was in before it went offline, for whatever reason.

Audience

techops, operators and Cassandra developers

Approach

Firstly, we need to design how diagnostic events from the diagnostic events framework should be rendered in virtual tables. After that, we will cover the implementation details of doing so.

Proposed Changes

Virtual tables

We suggest adding these virtual tables:

system_views.diagnostic_events_types

cassandra@cqlsh> describe system_views.diagnostic_events_types ;

/*
Warning: Table system_views.diagnostic_events_types is a virtual table and cannot be recreated with CQL.
Structure, for reference:
VIRTUAL TABLE system_views.diagnostic_events_types (
    class text,
    type text,
    PRIMARY KEY (class, type)
) WITH CLUSTERING ORDER BY (type ASC)
    AND comment = 'List of diagnostic events and their event types available for subscription.';
*/

system_views.diagnostic_events

This virtual table serves as the view of stored diagnostic events.

"value" column contains a string representation of a DiagnosticEvent object.

cassandra@cqlsh> DESCRIBE system_views.diagnostic_events;
/*
Warning: Table system_views.diagnostic_events is a virtual table and cannot be recreated with CQL.
Structure, for reference:
  VIRTUAL TABLE system_views.diagnostic_events (
  class text,
  timestamp timestamp,
  type text,
  value text,
  PRIMARY KEY (class, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC)
AND comment = 'Diagnostic events';
*/

Imagine that a user wants to create a role with a password which is not valid. If an operator is subscribed to guardrail events, the failed attempt will be visible in this virtual table. All failures and warnings from guardrails will be visible here.

This is just a representative example; other diagnostic events are fired upon various actions Cassandra performs, and all of them would be present here as well.

cassandra@cqlsh> CREATE ROLE st3 WITH PASSWORD = 'abc' AND LOGIN = true;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Password was not set as it violated configured password strength policy. To fix this error, the following has to be resolved: Password must be 8 or more characters in length. Password must contain 1 or more uppercase characters. Password must contain 1 or more digit characters. Password must contain 1 or more special characters. Password matches 1 of 4 character rules, but 3 are required. You may also use 'GENERATED PASSWORD' upon role creation or alteration."

cassandra@cqlsh> expand on;
EXPAND set to ON
cassandra@cqlsh> select * from system_views.diagnostic_events;

@ Row 1
-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
class      | org.apache.cassandra.db.guardrails.GuardrailEvent
timestamp  | 2024-07-16 18:49:44.357000+0000
type       | FAILED
value      | {name=password, message=Guardrail password violated: [INSUFFICIENT_DIGIT, INSUFFICIENT_CHARACTERISTICS, TOO_SHORT, INSUFFICIENT_SPECIAL, INSUFFICIENT_UPPERCASE]}


These types of queries will be possible:

select * from system_views.diagnostic_events;

select * from system_views.diagnostic_events where class = 'org.apache.cassandra.db.guardrails.GuardrailEvent' and type = 'FAILED';

select * from system_views.diagnostic_events where class = 'org.apache.cassandra.db.guardrails.GuardrailEvent' and timestamp >= '2024-07-16 18:48:02+0000' and timestamp <= '2024-07-16 18:55:04+0000' and type = 'FAILED';

select * from system_views.diagnostic_events where class = 'org.apache.cassandra.db.guardrails.GuardrailEvent' and timestamp >= '2024-07-16 18:48:02+0000' and timestamp <= '2024-07-16 18:55:04+0000';

select * from system_views.diagnostic_events where class = 'org.apache.cassandra.db.guardrails.GuardrailEvent' and timestamp >= '2024-07-16 18:48:02+0000' and type = 'FAILED';


We can either TRUNCATE the table or DELETE by specifying the event class. Events will be removed from the underlying in-memory store.

TRUNCATE system_views.diagnostic_events;

DELETE FROM system_views.diagnostic_events WHERE class = 'org.apache.cassandra.db.guardrails.GuardrailEvent';

system_views.diagnostic_events_subscriptions

The second table shows which events we are subscribed to. Without a subscription, the respective diagnostic event is not emitted, hence it will not appear in system_views.diagnostic_events.

There are two levels at which we can subscribe. The first is the class level. The second is more granular and narrows the subscription down to a particular type of an event class.

cassandra@cqlsh> describe system_views.diagnostic_events_subscriptions ;

/*
Warning: Table system_views.diagnostic_events_subscriptions is a virtual table and cannot be recreated with CQL.
Structure, for reference:
VIRTUAL TABLE system_views.diagnostic_events_subscriptions (
  class text,
  type text,
  PRIMARY KEY (class, type)
) WITH CLUSTERING ORDER BY (type ASC)
AND comment = 'Diagnostic events subscriptions';
*/

We subscribe via INSERT. Because type is a clustering column, we need to specify both class and type upon insertion.

INSERT INTO system_views.diagnostic_events_subscriptions (class, type) VALUES ( 'org.apache.cassandra.db.guardrails.GuardrailEvent', 'WARNED');

cassandra@cqlsh> SELECT * FROM system_views.diagnostic_events_subscriptions ;

@ Row 1
-------+---------------------------------------------------
 class | org.apache.cassandra.db.guardrails.GuardrailEvent
 type  | WARNED

(1 rows)
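
Under the hood, such an INSERT would translate to a subscription on the existing DiagnosticEventService. The following Java sketch illustrates the two subscription levels; it assumes the current org.apache.cassandra.diag API shapes and the GuardrailEventType enum shown in the examples above, and the vtable-to-service glue itself is illustrative only:

import java.util.function.Consumer;

import org.apache.cassandra.db.guardrails.GuardrailEvent;
import org.apache.cassandra.db.guardrails.GuardrailEvent.GuardrailEventType;
import org.apache.cassandra.diag.DiagnosticEventService;

public class SubscriptionExample
{
    public static void main(String[] args)
    {
        // A consumer in the Java API sense; the real consumer appends events to the in-memory store.
        Consumer<GuardrailEvent> consumer = event -> System.out.println(event.toMap());

        DiagnosticEventService service = DiagnosticEventService.instance();

        // Type-level subscription, as in the INSERT above with type = 'WARNED'.
        service.subscribe(GuardrailEvent.class, GuardrailEventType.WARNED, consumer);

        // Class-level subscription, as in an INSERT with type = 'all'.
        service.subscribe(GuardrailEvent.class, consumer);
    }
}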

If we want to subscribe to all event types of an event class, we use the special value 'all'.

INSERT INTO system_views.diagnostic_events_subscriptions (class, type) VALUES ( 'org.apache.cassandra.db.guardrails.GuardrailEvent', 'all');

cassandra@cqlsh> SELECT * FROM system_views.diagnostic_events_subscriptions ;

@ Row 1
-------+---------------------------------------------------
 class | org.apache.cassandra.db.guardrails.GuardrailEvent
 type  | FAILED

@ Row 2
-------+---------------------------------------------------
 class | org.apache.cassandra.db.guardrails.GuardrailEvent
 type  | WARNED

(2 rows)

If we want to subscribe to all event classes with all their event types, we use the 'all' value for class as well:

cassandra@cqlsh> INSERT INTO system_views.diagnostic_events_subscriptions (class, type) VALUES ( 'all', 'all');
cassandra@cqlsh> SELECT count(*) FROM system_views.diagnostic_events_subscriptions ;

@ Row 1
-------+-----
 count | 103

We can unsubscribe from a specific event type by DELETE-ing the corresponding row:

cassandra@cqlsh> DELETE FROM system_views.diagnostic_events_subscriptions WHERE class = 'org.apache.cassandra.db.guardrails.GuardrailEvent' AND TYPE = 'WARNED';
cassandra@cqlsh> SELECT * FROM system_views.diagnostic_events_subscriptions WHERE class = 'org.apache.cassandra.db.guardrails.GuardrailEvent';

@ Row 1
-------+---------------------------------------------------
 class | org.apache.cassandra.db.guardrails.GuardrailEvent
 type  | FAILED

We can unsubscribe from an event class regardless of event type by removing the whole partition, where the event class is the partition key:

cassandra@cqlsh> DELETE FROM system_views.diagnostic_events_subscriptions WHERE class = 'org.apache.cassandra.db.guardrails.GuardrailEvent';
cassandra@cqlsh> SELECT * FROM system_views.diagnostic_events_subscriptions WHERE  class = 'org.apache.cassandra.db.guardrails.GuardrailEvent';

 class | type
-------+------

(0 rows)

Finally, we can unsubscribe from everything by executing TRUNCATE:

cassandra@cqlsh> TRUNCATE system_views.diagnostic_events_subscriptions ;
cassandra@cqlsh> select count(*) from system_views.diagnostic_events_subscriptions ;

@ Row 1
-------+---
 count | 0

(1 rows)

Implementation details

It is important to realize that the events we see in the virtual table are not the ones which are persisted in a bin log. Currently, before this CEP is implemented, the diagnostic framework works as follows: there is an in-memory event store, DiagnosticEventMemoryStore, implementing the DiagnosticEventStore interface. This store keeps events in memory only, and there are as many instances of DiagnosticEventMemoryStore as there are event classes we have subscribed to; these stores are instantiated lazily, on demand, when the first event of the respective diagnostic event class is about to be persisted (in memory).
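
A minimal sketch of that behavior follows (simplified for illustration; the real DiagnosticEventMemoryStore keys events by a monotonically increasing id and is more careful about concurrency and reads):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.cassandra.diag.DiagnosticEvent;

// Simplified model: one bounded store per subscribed event class,
// created lazily when the first event of that class arrives.
public class PerClassEventStores
{
    private static final int MAX_EVENTS_PER_CLASS = 200; // hard-coded today, to be moved to cassandra.yaml

    private final Map<Class<?>, ConcurrentSkipListMap<Long, DiagnosticEvent>> stores = new ConcurrentHashMap<>();
    private final AtomicLong ids = new AtomicLong();

    public void store(DiagnosticEvent event)
    {
        ConcurrentSkipListMap<Long, DiagnosticEvent> store =
            stores.computeIfAbsent(event.getClass(), cls -> new ConcurrentSkipListMap<>());

        store.put(ids.incrementAndGet(), event);

        // Evict the oldest events once the per-class cap is exceeded.
        while (store.size() > MAX_EVENTS_PER_CLASS)
            store.pollFirstEntry();
    }
}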

Since we want to base our solution on Chronicle queues, to be aligned with FQL and audit logging: the way the bin logs work (e.g. for audit logs) is that there is a pluggable mechanism where a user can specify a custom implementation of a logger in cassandra.yaml, through which these events are stored into a bin log.

If we used the same approach here for diagnostic events (persisting them into a bin log only), we would have a hard time "querying the events back" from the bin log into a virtual table. We would essentially need to replay the log on every select, filtering out what we do not need based on the class / type used in the query, and so on. This is a very clunky way of doing it and not desirable, because there is already an existing mechanism for event retrieval via JMX which reads these events from memory. Also, bin log files might have a custom rolling cycle, where an individual bin log might in theory grow to gigabytes. If a bin log rolls, are we going to replay all logs, or just the newest one? If we were to select all diagnostic events from it, would we need to process megabytes or gigabytes of data every single time? Are we going to cap it?

The diagnostic event service is built around the concept of "consumers". A consumer is provided upon subscription, and events then flow into these consumers. For now, there is one Consumer (in the Java API sense), a method which puts all events into the above-mentioned DiagnosticEventStore-s. Our proposed solution is to continue using the concept of consumers, but with two of them: one consumer would insert events into an implementation of IDiagnosticLogger called InMemoryDiagnosticLogger, and the second consumer would be another implementation of IDiagnosticLogger called BinDiagnosticLogger. The abstraction of consumers was already in place; we just generalize this idea - one consumer for storing to memory, another for storing to a bin log. Upon querying system_views.diagnostic_events, we would fetch all events from the first consumer - from memory - and never query the bin log one. The bin log would act as an append-only structure which Cassandra never reads at runtime; inspection is possible only offline (or by following a bin log for new entries online, as is similarly done for FQL / audit logs).
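
A sketch of the proposed split follows. The names IDiagnosticLogger, InMemoryDiagnosticLogger and BinDiagnosticLogger come from this CEP; the dispatcher class and its wiring are illustrative only:

import java.util.function.Consumer;

import org.apache.cassandra.diag.DiagnosticEvent;

// Both sinks implement the same interface and are registered as
// independent consumers of the diagnostic event service.
interface IDiagnosticLogger
{
    boolean isEnabled();
    void accept(DiagnosticEvent event);
}

class DiagnosticEventDispatcher implements Consumer<DiagnosticEvent>
{
    private final IDiagnosticLogger memoryLogger; // backs system_views.diagnostic_events and JMX reads
    private final IDiagnosticLogger binLogger;    // append-only bin log, never read at runtime

    DiagnosticEventDispatcher(IDiagnosticLogger memoryLogger, IDiagnosticLogger binLogger)
    {
        this.memoryLogger = memoryLogger;
        this.binLogger = binLogger;
    }

    @Override
    public void accept(DiagnosticEvent event)
    {
        if (memoryLogger.isEnabled())
            memoryLogger.accept(event); // queried back by virtual tables
        if (binLogger.isEnabled())
            binLogger.accept(event);    // persisted for offline inspection
    }
}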

The already existing in-memory storage of diagnostic events is limited in its capacity. There is a hard limit of 200 events per event class. If we have, let's say, 100 event classes, there will be at most 20,000 diagnostic events stored - that is, if we subscribe to all classes - while the bin log sink would not have any limit on its size.

The current implementation has no way to reset / clear all events from memory. As of now, when the diagnostic framework is disabled, the events stay in memory. We will provide a way to reset / clean up the events stored in memory. Events will be wiped only on demand, either via CQL or via JMX. When the diagnostics framework is disabled, we still want to see the events which are there, even though no new events are coming anymore. We preserve the current behavior; we just provide a way to remove all events in case an operator explicitly wants that.

To summarize, there are two layers we operate on - the in-memory store and the bin log store:

  • We want to be able to have bin log sink turned off but keep in-memory diagnostics working.
  • When we enable whole diagnostics framework, we want to automatically store events not only in memory but in a bin log as well (if it is configured in cassandra.yaml and enabled).
  • We want to disable all diagnostics (in-memory and bin log sink) by one operation.

As an example, a node would have diagnostic_events_enabled: true to enable the whole machinery. In addition, it would have

diagnostic_logging_options:
  enabled: false
  logger:
    - class_name: BinDiagnosticLogger
  diagnostic_log_dir: /tmp/diagnostics

in cassandra.yaml, but the bin log sink would be disabled (enabled is false).

When a node starts, the in-memory sink would be active, so a user who subscribes to the events will start to see them in virtual tables, but nothing would be written to the bin log.

A user can execute "nodetool enablediagnosticlog" (or call the respective JMX method) to dynamically start using the bin log as a sink as well. Under the hood, this logger would be just another consumer of events. A user could stop using the bin log as a sink via nodetool disablediagnosticlog or via a JMX operation.

In other words, the concept of in-memory storage stays; we would just add the possibility to use a bin log as a sink, and the two could be managed independently.

Nodetool commands

nodetool enablediagnosticlog

This command would enable both bin log (if configured in cassandra.yaml) and in-memory log.

It will be possible to turn on only in-memory logging with the --skip-persistent-log flag, and to turn on only persistent logging with the --skip-memory-log flag. When both arguments are specified, nothing is enabled. When --use-node-config is specified, the parameters specified in cassandra.yaml will be used instead of the arguments specified on the command line. If that flag is not used, the values on the command line (or default values when not specified) will be used. The same parameters as for e.g. enableauditlog will be available. It will not be possible to override diagnostic_logs_dir, similarly to how the directory of audit logs cannot be overridden.

nodetool disablediagnosticlog

This command would disable both bin log (when enabled) and in-memory log.

It will be possible to keep logging into memory by specifying the --keep-in-memory-log flag. It will be possible to clean all in-memory log events by specifying the --clean flag.

 command                                                    | action
------------------------------------------------------------+-------------------------------------------------------------------------------------------
 nodetool disablediagnosticlog --keep-in-memory-log --clean | Turns off persistent logging only and cleans events in memory.
 nodetool disablediagnosticlog --keep-in-memory-log         | Turns off persistent logging; in-memory logging stays enabled and no events are cleaned.
 nodetool disablediagnosticlog                              | Disables both in-memory and persistent logging but keeps events in memory.
 nodetool disablediagnosticlog --clean                      | Cleans diagnostic events from memory and stops both in-memory and persistent logging.

nodetool getdiagnosticlog

This command would get the configuration of the diagnostic bin log, similarly to what is done for the audit log, for example.

$ nodetool getdiagnosticlog
enabled              true                -  this reflects DatabaseDescriptor.diagnostic_events_enabled
memory_enabled       false               -  state of in-memory logging
persistent_enabled   true                -  state of persistent logging
event_class_capacity 200                 -  maximum number of events per event class for the in-memory logger
logger               BinDiagnosticLogger -  other configuration properties for persistent logging
diagnostic_logs_dir  /tmp/diagnostics   
archive_command                         
roll_cycle           HOURLY             
block                true               
max_log_size         17179869184        
max_queue_weight     268435456          
max_archive_retries  10   

Pluggability

Section similar to audit log / fql will be added to cassandra.yaml where a user can specify a custom logger implementation.

#diagnostic_logging_options:
#  enabled: false
#  logger:
#    - class_name: BinDiagnosticLogger
#  diagnostic_log_dir:
#  roll_cycle: HOURLY
#  block: true
#  max_queue_weight: 268435456 # 256 MiB
#  max_log_size: 17179869184 # 16 GiB
#  # archive command is "/path/to/script.sh %path" where %path is replaced with the file being rolled:
#  archive_command:
#  max_archive_retries: 10
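
As an illustration of the pluggability, a user could implement a trivial logger which writes events to the regular log instead of a bin log. This is a hypothetical implementation of the IDiagnosticLogger interface sketched earlier; the exact constructor contract (e.g. receiving the options map from cassandra.yaml) would mirror what audit logging does with its pluggable loggers:

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.cassandra.diag.DiagnosticEvent;

// Hypothetical custom sink, referenced from cassandra.yaml as
//   logger:
//     - class_name: Slf4jDiagnosticLogger
public class Slf4jDiagnosticLogger implements IDiagnosticLogger
{
    private static final Logger logger = LoggerFactory.getLogger(Slf4jDiagnosticLogger.class);

    public Slf4jDiagnosticLogger(Map<String, String> options)
    {
        // options would carry the logger-specific keys from diagnostic_logging_options
    }

    @Override
    public boolean isEnabled()
    {
        return true;
    }

    @Override
    public void accept(DiagnosticEvent event)
    {
        logger.info("{} {} {}", event.getClass().getName(), event.getType(), event.toMap());
    }
}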

Offline inspection of diagnostic events

There will be a CLI tool in tools/bin/diagnosticlogviewer; its functionality is logically the same as that of the similar tool tools/bin/auditlogviewer.

Class diagrams

We provide class diagrams to show the suggested class hierarchy. This CEP also consolidates the hierarchy of loggers, both audit and diagnostic ones.

The FQL logger, while also using a bin log, will not share the hierarchy shown below when this CEP is delivered, but it might be the subject of further refactoring in the future, as that is just a coding exercise.

New or Changed Public Interfaces

ILogger - core interface of the hierarchy.

IDiagnosticLogger - interface which all diagnostic loggers need to implement.

AbstractBinLogger - base of all bin loggers with core functionality.

AbstractMessage - static inner class of AbstractBinLogger. This class is meant to be extended in concrete bin log implementations. There will be two implementations - Message in BinAuditLogger and Message in BinDiagnosticLogger.
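
Expressed as a skeleton (method sets and signatures are indicative, not final):

import org.apache.cassandra.diag.DiagnosticEvent;

// Indicative skeleton of the consolidated logger hierarchy.
interface ILogger
{
    boolean isEnabled();
}

interface IDiagnosticLogger extends ILogger
{
    void accept(DiagnosticEvent event);
}

abstract class AbstractBinLogger implements ILogger
{
    // Shared Chronicle queue plumbing (roll cycle, weighted queue, archiving) lives here.

    abstract static class AbstractMessage
    {
        // Extended by BinAuditLogger.Message and BinDiagnosticLogger.Message.
    }
}

class BinDiagnosticLogger extends AbstractBinLogger implements IDiagnosticLogger
{
    static class Message extends AbstractMessage
    {
        // Serializes one DiagnosticEvent into the bin log format.
    }

    public boolean isEnabled()
    {
        return true; // in reality driven by cassandra.yaml / nodetool enablediagnosticlog
    }

    public void accept(DiagnosticEvent event)
    {
        // Append a Message built from the event to the Chronicle queue.
    }
}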

There will be new MBean methods in DiagnosticEventServiceMBean as well as methods using them via NodeProbe and similar.
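
The exact method set is not specified by this CEP yet; purely as a hypothetical illustration, the additions could look along these lines:

// Hypothetical additions to DiagnosticEventServiceMBean; names are illustrative, not final.
public interface DiagnosticEventServiceMBeanAdditions
{
    /** Remove all in-memory events, or only those of a given event class (TRUNCATE / DELETE analogue). */
    void cleanEvents();
    void cleanEvents(String eventClass);

    /** Enable or disable the persistent bin log sink independently of the in-memory one. */
    void enableDiagnosticLog(boolean persistent, boolean inMemory);
    void disableDiagnosticLog(boolean keepInMemoryLog, boolean clean);
}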

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users?
    • Diagnostic bin log is disabled by default. It is enabled when configured in cassandra.yaml or enabled via JMX.
  • If we are changing behavior how will we phase out the older behavior?
    • the only behavioral change is that when the diagnostic log is disabled, we unsubscribe from all events as well. It means that when a user enables logging again, they will need to re-subscribe. This is subject to change and a matter of agreement.
  • If we need special migration tools, describe them here.
    • not applicable
  • When will we remove the existing behavior?
    • not applicable

Test Plan

The implementation should be tested by standard means of unit tests and/or distributed tests when necessary.