Date: November 18th, 11am PST
Agenda / Discussion Topics
The agenda is largely based on themes that Mark Miller raised recently. Mark raised a lot of ideas, though not always coherently. He gave some WIP code to his colleagues and held virtual meetings with them and with me (David). In this meeting, we will attempt to distill the major ideas, both technical and community/process.
- Big picture: "Solr Cloud 2" isn't a rewrite; it's fixing a hundred little things that are greater than the sum of their parts. The exceptions are pervasive use of Curator and substantial work in the Overseer.
- Surprising insight: Seek performance and you may achieve stability – a more important goal
- Our instability is difficult to diagnose because the problems are hard to reproduce
- Focusing on performance tightens the timings and exposes real/existing problems that used to occur rarely
- Me: Implies we have benchmarks running regularly, similar to the Lucene nightly benchmarks
- Another outcome may be code clarity / reasoning?
- More use of Apache Curator
- Not using Curator is tech debt. Less code for us to maintain; simpler code remains. Curator is faster & safer than our feeble attempts, generally.
- Let's get familiar with it
- TBD if Mark left notes on what becomes of it.
- Need continuous attention to cleanup. Sometimes we don't log enough, sometimes too much.
- Separate log configs depending on what work you are focusing on.
- Colored logs.
- We don't finish efforts. Finishing means completely removing the old stuff.
- Hurts the community; fewer contributors due to our complexity
- Code reviews. Do we change policy?
- Should we adopt a formal decision process for proposing major changes/APIs to Solr, aka "Solr Implementation Proposal (SIP)"?
- Code level, e.g. Javadocs
- High level. new Dev guide somewhere (different from the Solr Ref Guide!)
Attended by: Andrzej Bialecki, Anshum Gupta, Cassandra Targett, Chris Hostetter, David Smiley, Erick Erickson, Gus Heck, Ishan Chattopadhyaya, Jan Hoydahl, Jason Gerlowski, Mike D. (Apple), Noble Paul, Shawn Heisey, Scott Blum, Tomas F. Lobbe, Yonik Seeley
Duration: 95 minutes.
Mark’s WIP Code
We heard from Tomas, Anshum, and Mike D. (Mark Miller’s colleagues), who spoke with Mark at length and have his work in progress.
- Mark shared a large code dump with his colleagues.
- He’s not comfortable sharing it in this state as it’s too work-in-progress. His colleagues are going to respect this and not simply share it as-is.
- His colleagues are actively working together on teasing out separate improvements that will each get their own JIRA issue and code. This is hard work and it will occur rather slowly over time (probably more than a month). And when each issue is filed with code, it will usually be WIP. Rarely will it be something immediately committable.
Migrating to Curator is a great thing for many reasons (see agenda), including performance, though it’s not a singular solution for any/most SolrCloud problems. There are probably no drawbacks, but it’s work. Changing this (or many SolrCloud internals, for that matter) causes tests to break (Mark said), and it’ll take time to fix such tests.
- SolrCloud “clings to leadership” much more than it should.
- SolrCloud over-uses the Overseer for too many functions that could be done without it. We’ll probably always want an Overseer though.
- Noble’s work on Smart State Caching (SOLR-13951) will help.
- Sometimes writing directly to ZooKeeper (helped with Curator recipes) is sufficient.
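As an illustration of the “write directly to ZooKeeper via Curator” idea, here is a minimal, hypothetical sketch (the connection string and paths are illustrative, and this is not Solr code; it assumes a ZooKeeper server running at localhost:2181):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkWriteSketch {
  public static void main(String[] args) throws Exception {
    // Curator handles connection management and retries for us.
    try (CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3))) {
      client.start();
      client.blockUntilConnected();
      // Write a small piece of state directly -- no Overseer round-trip.
      client.create().orSetData().creatingParentsIfNeeded()
            .forPath("/demo/state", "hello".getBytes());
    }
  }
}
```

The point of the sketch is that Curator's fluent API plus its retry policies replace a fair amount of hand-rolled connection and retry code.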
Solr doesn’t have much service protection. If you create thousands of collections, it’ll lock up and become inoperable. Scott reported that if you boot up a 100+ node cluster, SolrCloud won’t get to a happy state; currently you need to start the nodes gradually. A well-written service won’t lock up; it will make the client wait and/or give the client an error. The autoscaling framework is supposed to help; it’s a start, and AB is working on that somewhat. It’s probably not the only answer here.
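The “make the client wait and/or give the client an error” behavior described above can be sketched with a plain semaphore-based admission gate. This is illustrative only, using only the JDK; it is not Solr's actual request handling:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Illustrative admission gate: bounded concurrency with a short wait,
// then a fast rejection -- instead of an unbounded queue that locks up.
public class AdmissionGate {
  private final Semaphore permits;
  private final long maxWaitMs;

  public AdmissionGate(int maxConcurrent, long maxWaitMs) {
    this.permits = new Semaphore(maxConcurrent);
    this.maxWaitMs = maxWaitMs;
  }

  /** Returns true if the caller may proceed; false means "try again later". */
  public boolean tryEnter() throws InterruptedException {
    return permits.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS);
  }

  /** Callers that entered must call this when done. */
  public void exit() {
    permits.release();
  }
}
```

A caller that gets `false` would be told to back off (e.g. an HTTP 503 with Retry-After), which keeps the service responsive under load rather than wedged.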
Ishan is working on addressing the need for continuously running benchmarks [SOLR-13933]. Having such benchmarks is rather foundational for the theme of performance improvements. And that, perhaps surprisingly, helps us achieve stability.
Mark believes in putting tight time limits on tests that shouldn’t take long. He used this approach while working on his improvements. Smiley suspects this may only be useful in local dev, but not in CI, where overloaded virtual machines could be quite slow. Furthermore, he believes the objective there can be addressed better via benchmarks.
Much was said but unclear what action to take here; it’s a bike-shed topic. Separate concerns depending on the audience -- production users or us developers? Hoss reminded us of the LogLevel annotation and suggested it’d be neat if the level could be automatically set to debug based on the package of the test.
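For reference, Solr's test framework already supports per-test log levels via the @LogLevel annotation that Hoss mentioned; his suggestion would make the DEBUG default automatic based on the test's package. A usage sketch (the test class name is hypothetical):

```java
// @LogLevel comes from Solr's test framework (org.apache.solr.util.LogLevel);
// levels are applied for the test and restored afterwards.
@LogLevel("org.apache.solr.cloud=DEBUG;org.apache.solr.cloud.overseer=TRACE")
public class MyCloudTest extends SolrCloudTestCase {
  // test methods here run with the levels above
}
```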
The overarching theme of what Mark raised is perhaps tech debt. Some miscellaneous things to add here: We should spend more effort removing old things (Smiley cares about this). And rather than continuing to maintain lots of functionality ourselves, we hope the plugin system will lead to a future where Solr needn’t absorb everything, nor make everything an official contrib.
We want to get reviews, even extremely superficial reviews that might not look at the code but do look at the description and comments about the state of the code. Apparently ASF “RTC” policy suggests 3 binding votes are required, which is of course an extremely high bar and not palatable. Even without formally changing policy, we’d like to try out a 6-month period of behaving this way for all but the most trivial of changes. Smiley takes an action item to make a specific proposal soon.
Major Change Proposals
See Kafka’s “KIP” as an example. This interests us, but it’s very unfamiliar to us. We want to try it out. Perhaps SOLR-13951 would be worth experimenting with this on. Perhaps a Confluence page is the right place to put the text? It was argued that Google Docs is more collaborative, e.g. inline commenting. Hoss argued Confluence has this too, but it might need to be enabled. Today, without a major change proposal mechanism, some JIRA issues are onerous to decipher. Irrespective of this, Hoss advocated that we continuously update our JIRA issue descriptions to be useful during the course of the issue, especially at its conclusion.
We agreed we need several layers of docs: Javadocs, a Developer Guide, and a User Guide. Javadocs are clearly in the code, and we want more of them! For the Developer Guide, it’s currently undecided whether we prefer Confluence or asciidoc/markdown in a dedicated directory in our code repo.
We really liked meeting to discuss these matters. Gus and Jan proposed doing this quarterly timed to occur near when the ASF board reports are due so that we can discuss anything to add.
- Mark's colleagues to introduce Mark's code piece by piece into new JIRA issues over time
- Ishan to introduce a periodic benchmark system
- Noble to try out a "Solr Improvement Proposal" or some-such in a new initiative pertaining to ZK / clusterstate matters.
- David to propose a code review proposal to discuss on the dev list
- David to organize the next meeting near March 1st (before ASF board report being due that month)