Date: November 18th, 11am PST

Agenda / Discussion Topics

The agenda is largely based on themes that Mark Miller raised recently.  Mark raised many points, though not always coherently.  He gave some WIP code to his colleagues and held virtual meetings with them and with me (David).  In this meeting, we will attempt to distill the major ideas, both technical and community/process.

  • Technical
    • Big picture:  "Solr Cloud 2" isn't a rewrite; it's fixing a hundred little things that are greater than the sum of their parts.  The exceptions are pervasive use of Curator and major work on the Overseer.
    • Surprising insight: Seek performance and you may achieve stability – a more important goal
      • Our instability is difficult to diagnose because the problems are hard to reproduce
      • Focusing on performance tightens the timings and exposes real/existing problems that used to occur rarely
      • Me: This implies we need benchmarks running regularly, similar to Lucene's nightly benchmarks
      • Another outcome may be code clarity / reasoning?
    • More use of Apache Curator
      • Not using Curator is tech debt.  Less code for us to maintain; simpler code remains.  Curator is generally faster and safer than our own attempts.
      • Let's get familiar with it
    • Overseer
      • TBD if Mark left notes on what becomes of it.
    • Logging
      • Need continuous attention to cleanup.  Sometimes we don't log enough, sometimes too much.
      • Separate log configs depending on what work you are focusing on.
      • Colored logs.
    • Tech-Debt
      • We don't finish efforts.  Finishing means completely removing the old stuff.
      • Hurts the community; fewer contributors due to our complexity
  • Community

Meeting Notes

Attended by: Andrzej Bialecki, Anshum Gupta, Cassandra Targett, Chris Hostetter, David Smiley, Erick Erickson, Gus Heck, Ishan Chattopadhyaya, Jan Hoydahl, Jason Gerlowski, Mike D. (Apple), Noble Paul, Shawn Heisey, Scott Blum, Tomas F. Lobbe, Yonik Seeley

Duration: 95 minutes.

Mark’s WIP Code

We heard from Tomas, Anshum, and Mike D. (Mark Miller’s colleagues), who spoke with Mark at length and have his work in progress.

  • Mark shared a large code dump with his colleagues.
  • He’s not comfortable sharing it in this state as it’s too work-in-progress.  His colleagues are going to respect this and not simply share it as-is.
  • His colleagues are actively working together on teasing out separate improvements that will each get their own JIRA issue and code.  This is hard work and it will occur rather slowly over time (probably more than a month). And when each issue is filed with code, it will usually be WIP.  Rarely will it be something immediately committable.

Apache Curator

Migrating to Curator is a great thing for many reasons (see agenda), including performance, though it’s not a singular solution for any/most SolrCloud problems.  There are probably no drawbacks, but it is work.  Changing this (or many SolrCloud internals, for that matter) causes tests to break (Mark said), and it’ll take time to fix such tests.
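As a sense of what "less code for us to maintain" can look like, here is a hedged sketch of leader election via Curator's LeaderLatch recipe, the kind of logic SolrCloud currently hand-rolls against raw ZooKeeper. The connection string and ZK path are illustrative, not actual Solr values:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ElectionSketch {
  public static void main(String[] args) throws Exception {
    // Curator handles connection loss and retries for us.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Hypothetical election path; LeaderLatch queues candidates in ZK.
    try (LeaderLatch latch = new LeaderLatch(client, "/solr/overseer_elect")) {
      latch.start();
      latch.await();   // blocks until this node is elected leader
      // ... perform leader duties; closing the latch relinquishes leadership
    } finally {
      client.close();
    }
  }
}
```

The hand-rolled equivalent must watch sequential ephemeral nodes, handle session expiry, and re-enter the election correctly; Curator's recipe does all of that in a few lines.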


Overseer

  • The Overseer “clings to leadership” much more than it should.
  • SolrCloud over-uses the Overseer for too many functions that could be done without it.  We’ll probably always want an Overseer though.

Service Protection

Solr doesn’t have much service protection.  If you create thousands of collections, it’ll lock up and become inoperable.  Scott reported that if you boot up a 100+ node cluster, SolrCloud won’t reach a happy state; currently you need to start the nodes gradually.  A well-written service won’t lock up; it will make the client wait and/or return an error.  The autoscaling framework is supposed to help; it’s a start, and Andrzej is working on that somewhat.  It’s probably not the only answer here.
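The "wait or error, never wedge" behavior can be sketched with plain admission control. This is illustrative only (not Solr code); the permit count and method are hypothetical stand-ins for an expensive operation like collection creation:

```java
import java.util.concurrent.Semaphore;

public class AdmissionControlSketch {
  // Hypothetical cap on concurrent expensive requests.
  private static final Semaphore PERMITS = new Semaphore(4);

  static String createCollection(String name) {
    if (!PERMITS.tryAcquire()) {
      // A well-behaved service rejects (or queues) excess load; it never locks up.
      return "503 Service Unavailable: too many concurrent requests";
    }
    try {
      return "created " + name;   // stand-in for the real work
    } finally {
      PERMITS.release();
    }
  }

  public static void main(String[] args) {
    System.out.println(createCollection("test"));
  }
}
```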


Benchmarks

Ishan is working on addressing the need for continuously running benchmarks [SOLR-13933].  Having such benchmarks is rather foundational for the theme of performance improvements.  And that, perhaps surprisingly, helps us achieve stability.
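For nightly runs to be comparable over time, measurements need warm-up and an outlier-resistant statistic. A toy sketch of that idea (not SOLR-13933 itself; names are illustrative):

```java
import java.util.Arrays;
import java.util.function.IntSupplier;

public class BenchSketch {
  static long medianNanos(IntSupplier work, int warmup, int runs) {
    for (int i = 0; i < warmup; i++) work.getAsInt();   // JIT warm-up
    long[] samples = new long[runs];
    for (int i = 0; i < runs; i++) {
      long t0 = System.nanoTime();
      work.getAsInt();
      samples[i] = System.nanoTime() - t0;
    }
    Arrays.sort(samples);
    return samples[runs / 2];   // median resists GC/scheduler outliers
  }

  public static void main(String[] args) {
    IntSupplier sum = () -> java.util.stream.IntStream.range(0, 100_000).sum();
    System.out.println("median ns: " + medianNanos(sum, 5, 21));
  }
}
```

A real harness (e.g. JMH-based) also defeats dead-code elimination and tracks results across commits, which is the hard part of a nightly benchmark system.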


Test Times

Mark believes in tightly limiting the run time of tests that shouldn’t take long.  He used this while working on his improvements.  Smiley suspects this approach may only be useful in local dev but not in CI, where overloaded virtual machines could be quite slow.  Furthermore, he believes the objective there can be addressed better via benchmarks.
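The mechanics of a tight per-test time budget can be sketched without any test framework: run the body on a separate thread and fail fast when the budget is exceeded (a minimal illustration, not Mark's actual approach):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeBoxSketch {
  static void runWithin(Runnable test, long millis) throws Exception {
    ExecutorService ex = Executors.newSingleThreadExecutor();
    try {
      // Future.get throws TimeoutException if the body overruns its budget.
      ex.submit(test).get(millis, TimeUnit.MILLISECONDS);
    } finally {
      ex.shutdownNow();   // interrupt a runaway test body
    }
  }

  public static void main(String[] args) throws Exception {
    runWithin(() -> {}, 1000);   // a fast body passes
    try {
      runWithin(() -> {
        try { Thread.sleep(5000); } catch (InterruptedException ignored) {}
      }, 100);
    } catch (TimeoutException expected) {
      System.out.println("timed out as expected");
    }
  }
}
```

Smiley's concern maps directly onto the `millis` budget: a value tuned for a fast local machine may spuriously fail on an overloaded CI VM.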


Logging

Much was said, but it's unclear what action to take here; it’s a bike-shed topic.  There are separate concerns depending on the audience: production users or us developers?  Hoss reminded us of the LogLevel annotation and suggested it’d be neat if the level could be automatically set to debug based on the package of the test.
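For reference, the annotation Hoss mentioned looks roughly like this in a test; the exact package and syntax should be checked against the Solr test framework:

```java
import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.util.LogLevel;

// Bump one package to DEBUG for just this test class.
@LogLevel("org.apache.solr.cloud=DEBUG")
public class MyCloudTest extends SolrTestCaseJ4 {
  // Hoss's suggestion: the framework could derive the DEBUG package
  // automatically from the test's own package, making this annotation
  // unnecessary in the common case.
}
```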

Tech Debt

The overarching theme of what Mark raised is perhaps tech debt.  Some miscellaneous things to add here: We should spend more effort removing old things (Smiley cares about this).  And rather than continuing to maintain lots of functionality ourselves, we hope the plugin system will lead to a future where Solr needn’t absorb everything, or needn’t host everything as official contribs.

Code Reviews

We want to get reviews, even extremely superficial reviews that might not look at the code but do look at the description and comments about the state of the code.  Apparently ASF “RTC” policy suggests 3 binding votes are required, which is of course an extremely high bar and not palatable.  Even without formally changing our policy, we’d like to try a 6-month period of behaving this way for all but the most trivial of changes.  Smiley takes an action item to make a specific proposal soon.

Major Change Proposals

See Kafka’s “KIP” as an example.  This interests us, but it’s very unfamiliar to us.  We want to try it out; perhaps SOLR-13951 would be worth experimenting with.  Perhaps a Confluence page is the right place to put the text?  It was argued that Google Docs is more collaborative, e.g. inline commenting; Hoss argued Confluence has this too, though it might need to be enabled.  Today, without a major change proposal mechanism, some JIRA issues are onerous to decipher.  Irrespective of this, Hoss advocated that we continuously update our JIRA issue descriptions to stay useful during the course of the issue, especially at its conclusion.


Documentation

We agreed we need several layers of docs: Javadocs, Developer Guide, User Guide.  Javadocs are clearly in the code, and we want more of them!  For the Developer Guide, it’s currently unknown whether we prefer Confluence or asciidoc/markdown in a dedicated directory in our code repo.

Closing Remarks

We really liked meeting to discuss these matters.  Gus and Jan proposed doing this quarterly, timed to occur near when the ASF board reports are due, so that we can discuss anything to add.

Action items

  • Mark's colleagues to introduce Mark's code piece by piece into new JIRA issues over time
  • Ishan to introduce a periodic benchmark system
  • Noble to try out a "Solr Improvement Proposal" or some-such in a new initiative pertaining to ZK / clusterstate matters.
  • David to propose a code review proposal to discuss on the dev list
  • David to organize the next meeting near March 1st (before ASF board report being due that month)


  1. I was having side conversations as well, so coherent is partly a matter of perspective. I’m done here guys, don’t worry, but I want to leave you in the best light I can.

    No one really understands this system. It’s not understandable by a human. Even the parts people understand, they don’t actually understand what’s happening at a detailed level. I have to spend A LOT of time to understand anything - and I have to add logging like I’m building from scratch. I did a lot of dumb and inefficient things when I started putting this together - tons of it still happens. You and I didn’t even know it; our logging is almost useless, in my useless opinion.

    I’ve explained to my colleagues how to fix this. Maybe they understood me, maybe they didn’t. Step one, allow the system to stop properly, even on failure. Short tests, short logs. Make the logging good. Make the tests fast. Start understanding what’s happening. Make it less stupid. Make it less blocking. Make it faster. Fix the bugs that fall out. Add tests that are not insane. If any of you tried to build this and started with the insane randomized and non-basic tests we have, you would get nowhere. Basic tests need to show evidence of basic working before you see if something can survive a meat grinder. I’ve written a novel to my teammates. They can share what they find useful.

    The basics are - no one understands this system; it’s buggy and stupid and inefficient. We all know how to build though - if you don’t turn this into something YOU would build, you will lose it. Right now, the best devs won’t touch the core, because that’s what the best devs do. The cowboy devs, also valuable people, will touch the core. This is problematic for obvious reasons. Good luck guys! Tons more, but if you just do this, you don’t need my spoon feeding.

  2. And guys - seriously, our tests can be as solid and fast as Lucene’s. I’ve seen that. So at this point, after this long, it’s Solr negligence that has Lucene devs rightly pissed at the state of our email failures. You guys can chase your tails with bad apples and little fixes all you want. We have been doing it for a decade. You are just hiding the screams. It’s negligence. Lucene should kick Solr out at this point unless you address it.