Status

Current state: Under discussion

Discussion thread: https://cwiki.apache.org/confluence/display/SOLR/SIP-3+Solr-specific+log+and+thread+indexing+and+UI

JIRA: SOLR-14121 - Getting issue details... STATUS

Released: none

Motivation

When finding trouble to shoot in Solr, we spend a lot of time looking at log files and stack traces. And collecting them. And writing scripts to parse them. And looking at multi-gigabyte log files hoping the problem catches our eye. Wouldn't it be nice to, you know, use some kind of searching program that would let us navigate all that more easily?

Public Interfaces

There needs to be a UI that allows dynamic faceting (i.e. the ability to choose "facet on node, collection, core, etc." as well as text search of the message. For crude prototyping I used the /browse (velocity) templating engine, but that's going away.

One decision will be to provide a completely separate UI or build this into the Admin UI.

Proposed Changes

I created a prototype groovy script that uses regexes and then used the browse handler to see what's possible. Here are some relevant bits of information from that prototyping. I'll be happy to show anyone interested what it looks like if they promise not to criticize the UI Ping me (Erick Erickson) if you'd like to discuss this.

Log files

In my prototype there are several bits of information to extract that are very useful:

log level (INFO,WARN, etc)
core/replica
collection
exceptions
- First line recognized. For instance, being able to facet on "the first from Solr" is extremely useful. Say I have 1,000 exceptions. Seeing that 950 of them are on the same line in Solr helps me quickly focus on the important part. Or, conversely, note that those 950 exceptions are all query parsing errors and I should ignore them
- The entire exception should be indexed as a single document, not one-document-per-line
QTime
- doing interval faceting on QTimes is very useful.
Queries
- Splitting the query up on all the "&" and putting it in a multi-line text field is very useful.
Timestamp

Thread dumps

In one case we had a situation where we had over 1,000 threads. By allowing faceting on the first line that we recognized (say that contained org.apache.lucene or org.apache.solr) we were able to very quickly understand what to focus on. The algorithm is simple, "put the first string in the thread dump that matches one of these patterns in a string field".

NOTE: this has a lot in common with indexing exceptions from a log file.

Collecting the data

For both thread dumps and log file indexing, it's critical that each document be a "unit" that may extend over multiple lines. Thread dumps and exceptions are obvious "units".

There are two modes that come to mind:

Live troubleshooting where we can consume system resources could get this information dynamically. That would require new commands, perhaps an admin operation "index all the logs" or "index a thread dump from all running Solr instances"
Any support organization, whether internal or external, has the problem of post-processing log files provided by their users. To support that mode we need a way to ingest from a filesystem. It may not be acceptable to put the additional burden of this processing on a live system so offloading the parsing process, especially for log files, is desirable. Also, a batch mode would allow this to be used with older versions of Solr.

Automatically

I'm imagining an admin API call that:

Queried Zookeeper to find all Solr instances and sent commands that could perform the following:
- Take a thread dump, parse it, and send it to a collection
- Index the log files, parse them, and send them to a collection
- NOTE: the collection could be external to the cluster.
"Somehow" restrict what was collected and indexed. Some ideas that spring to mind:
- When indexing logfiles, restrict the indexed data to a:
  - time range
  - collection
  - node
  - ???

Manually (batch mode)

As above, while it's nice to have a command that says "go out to your live system and do all this work", to be effective for support organizations and to be acceptable for live installations, it's also necessary to collect this information from a live system and ship it off somewhere to be analyzed. So we need the ability to do the same work from files sitting on the filesystem.

Also, the automatic mode probably wouldn't be available until 9x. Having a batch mode would allow ingestion of log files from any version of Solr, subject to the ability to parse the log messages.

Log formats

This is a bit of a sticky wicket. I used regexes in my prototype to allow for the fact that users can modify their log file format. So the format for the Solr-specific information is variable. Users can alter the full or partial class specification (e.g. "o.a.l.class" or "org.apache.lucene.class"), the time format, the position of the level, etc. Three options come to mind:

Only support the log file format created with the default configs
Support the default log file format OOB, but provide the ability to specify other patterns via configuration (how? Perhaps by providing the log4j2 config file)
Query the log4j format and figure it all out automatically (I have no idea how to do this yet).

This could be phased of course. It would be a huge help for many clients to just support the default logfile formats.

Configuration

Details TBD. See the above for "Log formats" for one set of issues.

The above section that outlines faceting on the first recognized line of a thread dump could be extended if it were possible to configure a "list of interesting packages". By that I mean an arbitrary string that got its own entry in the document to allow faceting. Imagine, for instance, that a user has a custom component. In addition to "org.apache.lucene" and "org.apache.solr", being able to facet on the first line of a thread dump or exception that mentioned "org.my.custom.component" would be useful.

How to configure for static ingestion, with all the variants of log formats is a question. Perhaps require the associated log4j2 configuration file the be available?

Combining with Streaming Expressions

Joel Bernstein has been working a complimentary effort involving Streaming Expressions and some of the visualization tools. The two efforts are complimentary, should we combine them? In some sense, Joel's effort helps identify that there is a problem in the first place and when it occurred. This effort is more trying to find the cause of a problem.

Miscellaneous

It's useful to have a "batch" for each set of data ingested. Say you have three different sets of logfiles from different days. It's useful to see aggregate information over all three days, but it's also useful to have a convenient way to examine each individually as well as delete documents you no longer care about.

Scaling at very large installations may be an issue. There can be billions of log records, we need a way to selectively index only what we need in these situations.

Remember that log files can be automatically zipped up when rotated, we need to handle this.

The log directory can contain other files (gc logs etc). Should we skip those? Hey! Let's index those too for analysis! (actually, that last isn't as facetious as it sounds)

Should we recursively descend the log directory in automatic mode (I don't think so).

Should we recursively descend the directory in batch mode? If so, how to include/exclude files?

Compatibility, Deprecation, and Migration Plan

TBD is whether this has a stand-alone version that does not require tight integration with Solr. I know that sounds odd given the mention of how Solr-specific this is, but the prototype I built is a Groovy script that indexes via http exclusively, it has no dependency on Solr. By not using SolrJ etc, we can make this independent of the Solr version, making it useful for the large installed base in batch mode. Additionally, maintenance would be easier.

TBD: is whether the UI can be built into the Admin UI or is completely separate. If it was separate, all we'd have to build into Solr is the ability to handle the "Automatic" indexing in the section above. That is an API to index the parsed log file or thread dump to a specified URL (or even a SolrJ connection since that would be embedded in Solr). How to share the parsing code in that case?

Test Plan

TBD, let's nail down whether this is worth pursuing before this level of detail.

Rejected Alternatives

One question that immediately pops out is "why not use Logstash etc"? So far there are several reasons, although if they're bogus we shouldn't re-invent the wheel:

You have to know what you're looking for in order to find it, these tools are also hit-or-miss in terms of finding what's important.
What this proposal has in mind is Lucene/Solr specific.
It's important to have multi-line support.

Space shortcuts

Page tree

Status

Motivation

Public Interfaces

Proposed Changes

Log files

Thread dumps

Collecting the data

Automatically

Manually (batch mode)

Log formats

Configuration

Combining with Streaming Expressions

Miscellaneous

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives

5 Comments

David Smiley

Erick Erickson

Bram Van Dam

Erick Erickson

Alexandre Rafalovitch

Space shortcuts

Page tree

SIP-3 Solr-specific log and thread indexing and UI

Status

Motivation

Public Interfaces

Proposed Changes

Log files

Thread dumps

Collecting the data

Automatically

Manually (batch mode)

Log formats

Configuration

Combining with Streaming Expressions

Miscellaneous

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives

5 Comments

David Smiley

Erick Erickson

Bram Van Dam

Erick Erickson

Alexandre Rafalovitch