Status
Current state: Under discussion
Discussion thread: https://cwiki.apache.org/confluence/display/SOLR/SIP-3+Solr-specific+log+and+thread+indexing+and+UI
JIRA:
-
SOLR-14121Getting issue details...
STATUS
Released: none
Motivation
When finding trouble to shoot in Solr, we spend a lot of time looking at log files and stack traces. And collecting them. And writing scripts to parse them. And looking at multi-gigabyte log files hoping the problem catches our eye. Wouldn't it be nice to, you know, use some kind of searching program that would let us navigate all that more easily?
Public Interfaces
There needs to be a UI that allows dynamic faceting (i.e. the ability to choose "facet on node, collection, core, etc." as well as text search of the message. For crude prototyping I used the /browse (velocity) templating engine, but that's going away.
One decision will be to provide a completely separate UI or build this into the Admin UI.
Proposed Changes
I created a prototype groovy script that uses regexes and then used the browse handler to see what's possible. Here are some relevant bits of information from that prototyping. I'll be happy to show anyone interested what it looks like if they promise not to criticize the UI Ping me (Erick Erickson) if you'd like to discuss this.
Log files
In my prototype there are several bits of information to extract that are very useful:
- log level (INFO,WARN, etc)
- core/replica
- collection
- exceptions
- First line recognized. For instance, being able to facet on "the first from Solr" is extremely useful. Say I have 1,000 exceptions. Seeing that 950 of them are on the same line in Solr helps me quickly focus on the important part. Or, conversely, note that those 950 exceptions are all query parsing errors and I should ignore them
- The entire exception should be indexed as a single document, not one-document-per-line
- QTime
- doing interval faceting on QTimes is very useful.
- Queries
- Splitting the query up on all the "&" and putting it in a multi-line text field is very useful.
- Timestamp
Thread dumps
In one case we had a situation where we had over 1,000 threads. By allowing faceting on the first line that we recognized (say that contained org.apache.lucene or org.apache.solr) we were able to very quickly understand what to focus on. The algorithm is simple, "put the first string in the thread dump that matches one of these patterns in a string field".
NOTE: this has a lot in common with indexing exceptions from a log file.
Collecting the data
For both thread dumps and log file indexing, it's critical that each document be a "unit" that may extend over multiple lines. Thread dumps and exceptions are obvious "units".
There are two modes that come to mind:
- Live troubleshooting where we can consume system resources could get this information dynamically. That would require new commands, perhaps an admin operation "index all the logs" or "index a thread dump from all running Solr instances"
- Any support organization, whether internal or external, has the problem of post-processing log files provided by their users. To support that mode we need a way to ingest from a filesystem. It may not be acceptable to put the additional burden of this processing on a live system so offloading the parsing process, especially for log files, is desirable. Also, a batch mode would allow this to be used with older versions of Solr.
Automatically
I'm imagining an admin API call that:
- Queried Zookeeper to find all Solr instances and sent commands that could perform the following:
- Take a thread dump, parse it, and send it to a collection
- Index the log files, parse them, and send them to a collection
- NOTE: the collection could be external to the cluster.
- "Somehow" restrict what was collected and indexed. Some ideas that spring to mind:
- When indexing logfiles, restrict the indexed data to a:
- time range
- collection
- node
- ???
- When indexing logfiles, restrict the indexed data to a:
Manually (batch mode)
As above, while it's nice to have a command that says "go out to your live system and do all this work", to be effective for support organizations and to be acceptable for live installations, it's also necessary to collect this information from a live system and ship it off somewhere to be analyzed. So we need the ability to do the same work from files sitting on the filesystem.
Also, the automatic mode probably wouldn't be available until 9x. Having a batch mode would allow ingestion of log files from any version of Solr, subject to the ability to parse the log messages.
Log formats
This is a bit of a sticky wicket. I used regexes in my prototype to allow for the fact that users can modify their log file format. So the format for the Solr-specific information is variable. Users can alter the full or partial class specification (e.g. "o.a.l.class" or "org.apache.lucene.class"), the time format, the position of the level, etc. Three options come to mind:
- Only support the log file format created with the default configs
- Support the default log file format OOB, but provide the ability to specify other patterns via configuration (how? Perhaps by providing the log4j2 config file)
- Query the log4j format and figure it all out automatically (I have no idea how to do this yet).
This could be phased of course. It would be a huge help for many clients to just support the default logfile formats.
Configuration
Details TBD. See the above for "Log formats" for one set of issues.
The above section that outlines faceting on the first recognized line of a thread dump could be extended if it were possible to configure a "list of interesting packages". By that I mean an arbitrary string that got its own entry in the document to allow faceting. Imagine, for instance, that a user has a custom component. In addition to "org.apache.lucene" and "org.apache.solr", being able to facet on the first line of a thread dump or exception that mentioned "org.my.custom.component" would be useful.
How to configure for static ingestion, with all the variants of log formats is a question. Perhaps require the associated log4j2 configuration file the be available?
Combining with Streaming Expressions
Joel Bernstein has been working a complimentary effort involving Streaming Expressions and some of the visualization tools. The two efforts are complimentary, should we combine them? In some sense, Joel's effort helps identify that there is a problem in the first place and when it occurred. This effort is more trying to find the cause of a problem.
Miscellaneous
It's useful to have a "batch" for each set of data ingested. Say you have three different sets of logfiles from different days. It's useful to see aggregate information over all three days, but it's also useful to have a convenient way to examine each individually as well as delete documents you no longer care about.
Scaling at very large installations may be an issue. There can be billions of log records, we need a way to selectively index only what we need in these situations.
Remember that log files can be automatically zipped up when rotated, we need to handle this.
The log directory can contain other files (gc logs etc). Should we skip those? Hey! Let's index those too for analysis! (actually, that last isn't as facetious as it sounds)
Should we recursively descend the log directory in automatic mode (I don't think so).
Should we recursively descend the directory in batch mode? If so, how to include/exclude files?
Compatibility, Deprecation, and Migration Plan
TBD is whether this has a stand-alone version that does not require tight integration with Solr. I know that sounds odd given the mention of how Solr-specific this is, but the prototype I built is a Groovy script that indexes via http exclusively, it has no dependency on Solr. By not using SolrJ etc, we can make this independent of the Solr version, making it useful for the large installed base in batch mode. Additionally, maintenance would be easier.
TBD: is whether the UI can be built into the Admin UI or is completely separate. If it was separate, all we'd have to build into Solr is the ability to handle the "Automatic" indexing in the section above. That is an API to index the parsed log file or thread dump to a specified URL (or even a SolrJ connection since that would be embedded in Solr). How to share the parsing code in that case?
Test Plan
TBD, let's nail down whether this is worth pursuing before this level of detail.
Rejected Alternatives
One question that immediately pops out is "why not use Logstash etc"? So far there are several reasons, although if they're bogus we shouldn't re-invent the wheel:
- You have to know what you're looking for in order to find it, these tools are also hit-or-miss in terms of finding what's important.
- What this proposal has in mind is Lucene/Solr specific.
- It's important to have multi-line support.
5 Comments
David Smiley
Why is this a SIP and not just some tool/hack separate from Solr? I suppose it's okay nonetheless... even if it's some side-tool; there's something to be said for soliciting input if we want it to appeal to many.
At first, I thought where you were going with this might have been (another) tool to help us understand the test output on CI servers. That is something that needs further help.
Erick Erickson
Because we don't give much help to operations people trying keep Solr running and/or analyze problems and it's about time we paid some attention to that issue. Do note that I mentioned making it a package rather than something built into the core.
This issue could easily expand to having an "operations support package" as time passes, this is a first step. Whether it actually expands to something more is TBD.
We need to look at the issue of supporting operations more seriously. Support for developers/testing is important, don't get me wrong. That said, having tools to help operations people to do some of their own troubleshooting rather than be faced with trying to make sense of 10G of log files or thread dumps from 50 servers that are painful to collect in the first place would be an enormous help to people actually using Solr day to day.
Bram Van Dam
I suggest including the following fields in addition to the ones you have above:
In my experience this can be very useful when tracking down hard to find bugs.
Erick Erickson
Thanks for your thoughts. Thread is certainly a good one
As for the other two, that "just happens" (assuming we're talking exceptions here). It's a little hard to visualize from just words, but one thing that's specified is "a list of patterns to bucket on in exceptions".
So let's say may stacktrace is something like:
then the first line of the stack trace that matches any of the patterns is indexed as a string. So if I'd specified
"org.apache.solr", then the entire line, including line number would be indexed and facetable:
If I'd specified both "org.apache.solr" and "org.apache.lucene", then the line:
would be indexed, but not the SolrQueryParserBase line. So you get a facet count for the first line that matches any pattern.
And when faceting on this field, it's pretty in-your-face where most of your exceptions are. Unsurprisingly, it's often a zillion query parsing exceptions, but it's easy enough to filter them out with a query.
In my prototype, there's a lot of hovering over the facet because the facet is unreadably long so you have to hover over the facet, then decide whether to expand. Some re-arranging would help there.
What's unclear to me is whether the patterns should be regexes or just substrings, but that's a detail at this point. Note that the parsing is sensitive to the format specified in the log4j file, so they need to be configurable. Ditto for what people are interested in, i.e. I might have a custom plugin that I want to have a pattern match on.
This works for thread dumps too.
Hmmmm. There's some chance for a false positive here though if the stack trace differs after the first line matched. It'd be interesting to, say, make part of the line a hash the N lines following the first one matched as part of the facet, or maybe hash all the consecutive lines after the match that contain any of the specified patterns. I like the second option best I think.
Alexandre Rafalovitch
I did something with thread dumps a very very long time. So long ago it is not even on the web anymore, but in the Wayback machine. Basically, I parsed dumped stack traces and visualized the locks to detect the deadlocks and livelocks. Back then it was XML, XSLT, and Graphviz. These days, we could probably do it with Solr, nested records, stream expresssions and Graph format. Am I on the right track there? I am a tiny bit confused because log analysis and stack trace analysis are two different things in my mind.