Status

Current state: Under Discussion

Discussion threads:


JIRA: SOLR-14726, many others, TBD

Released: TBD (target 9.0)

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast). Confluence supports inline comments that can also be used.


Motivation

As Solr has grown, the examples have become a mix of ancient documents, kitchen-sink additions with a complicated and often confusing interplay of definitions, and left-over configurations that conflict with or are out of sync with the documentation. The "more info" links mostly point into the legacy wiki, which is two generations of redirects behind the current Reference Guide. Solutions were introduced with a fix-the-pain approach, which has also created magic paths or pushed demonstration configurations into consolidated defaults. New features are often not demonstrated, because adding a new example requires understanding the existing ones.

The default configuration files have grown to ridiculous sizes, with much of that size being commented-out, out-of-date defaults and explanations (that should be, or even already are, in the Reference Guide) or comments that will go away on the first API-driven rewrite:

Line Counts

                  As shipped   No comments   % reduction
solrconfig.xml    1226         212           82%
managed-schema    1031         523           49%
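
The page does not say how these figures were measured. As a rough way to reproduce this kind of count, here is a minimal Python sketch that strips XML comments and counts what is left. Counting only non-blank lines and truncating the percentage to a whole number are assumptions of this sketch, and the function names are illustrative only.

```python
import re

# Matches XML comments, including ones that span several lines.
XML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def count_lines(text: str) -> int:
    """Count the non-blank lines in text (assumption: blank lines are ignored)."""
    return sum(1 for line in text.splitlines() if line.strip())


def count_lines_without_comments(text: str) -> int:
    """Strip XML comments, then count the non-blank lines that remain."""
    return count_lines(XML_COMMENT.sub("", text))


def percent_reduction(before: int, after: int) -> int:
    """Reduction from before to after, truncated to a whole percentage."""
    return (before - after) * 100 // before if before else 0
```

To use it, read a configuration file into a string (for example with pathlib.Path.read_text) and pass it to both counting functions. Exact totals will depend on the Solr version and on how blank lines are treated.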


For solrconfig.xml, this bloated configuration is both confusing for people trying to identify significant configuration entries and potentially dangerous; for example, remote streaming was enabled by default until recently.

For managed-schema, all comments go away on the first rewrite, making them completely unsuitable for any significant educational purpose.

Similarly, our out-of-the-box filesystem layout is a legacy, incrementally grown setup that differs from the lessons we learned in Docker/service/3rd-party/production layouts (logical locations for solr.in.sh, logs, pid, solr.home, live vs non-live directories). We also have magic syntax around creating examples that hides just enough internal machinery to make it very hard to run those examples multiple times and to understand what is happening when things go wrong. This especially hits those who try to run multiple Solr instances on the same machine.

The examples themselves are out of date, demonstrate legacy features (techproducts), and sometimes (films) have become less viable because the external source of interesting data has disappeared. The examples also do not provide enough records/fields to show advanced Solr capabilities, or even basic nested ones. One example (schemaless) is no longer different from a standard core created with one or two commands, apart from the behind-the-scenes logging magic that will not work outside of the example directory.

Some of the other examples are going away as part of other initiatives (DIH). Others demonstrate features that we strongly recommend against in production; we then spend a lot of time on the mailing list advising people to undo what they learned from our own default schemas (Tika integration, schemaless mode as the default chain).

Finally, the recent attempt at a getting started guide with an initial focus on the cloud setup may have made it harder to comprehend what Solr is actually doing and, again because of the magical nature of the examples directory, not easily reproducible.

Altogether, this leaves new users confused about getting examples running, understanding what they are actually running, learning about the latest features of Solr, and knowing how to apply what they learn from the example configurations to their own. They are also going into production with kitchen-sink configurations that everybody is afraid to modify.

Public Interfaces

This will affect all the examples. It may affect some of the directories, startup scripts, documentation, and tests.

Proposed Changes

  1. Go through the default configuration files line by line.
    1. Ensure that any documentation and explanation not yet in the Reference Guide are moved there. Delete any significant passages and replace them with Ref Guide links to ensure a single source of truth (SOLR-11875, SOLR-14841, SOLR-14834)
    2. Delete any default blocks that do not use parameter substitutions and point instead to the RefGuide for the section and to the API for the real defaults, as appropriate
    3. Delete legacy sections that 'no longer work' (e.g. jmx, possibly EditorialMarkerFactory)
    4. Delete workaround explanations for those migrating from Solr versions prior to Solr 7? (Document them in the RefGuide?)
  2. Review the current state of directory layouts
    1. Compare:
      1. Out-of-the-box for default install
      2. Out-of-the-box example install and hacks (e.g. in bin/solr)
      3. serviceinstall scripts
      4. Docker setup (SOLR-11245)
      5. Existing issues: SOLR-13035, SOLR-6671
    2. Clarify naming for locations of:
      1. Static O/S global part of running solr
      2. Writable O/S global part of running solr (only pid file or more?)
      3. Server/Node level information (start.in.sh?, logs? configsets? solr.xml) - there may be several of these on a physical server, such as in the cloud example. Or put all of those in solr.home and have cores one level lower under coreRootDirectory (in solr.xml, but see SOLR-14097)
      4. Collection/Core level information (core.properties)
      5. Individual directories per core (conf, data) - some of these already can be in other locations
  3. Refactor example directory and associated commands to reduce magic
    1. This mainly affects log configuration, logging directory locations, and figuring out which directory is above solr home
    2. May also involve exploration about configsets and environmental override directories
  4. Create new examples (SOLR-10329; testable? SOLR-11352)
    1. Create a base learning config that is either based on the default or is an even simpler one of its own (SOLR-13652)
    2. Set up a new dataset (https://www.fakenamegenerator.com can generate 100k records with many interesting fields under a CC license, https://creativecommons.org/licenses/by-sa/3.0/us/, similar to the CC license already used by the films example)
      1. Split records into different formats to demonstrate XML, CSV, multiple JSONs, nested records, etc
    3. Create a number of additive configurations+examples that augment the base configuration to demonstrate specific features with point precision
    4. Move non-essential schema definitions (e.g. languages) from the default into an alternative schema (new kitchen-sink). Whether this should be copy/paste XML or API commands is to be explored (SOLR-11033)
    5. Update documentation to use new examples to demonstrate features that used to use older configsets
    6. Use short names for analyzer/filter/tokenizer wherever possible (SOLR-13691) - make sure they are easily discoverable in documentation as well
  5. Rewrite the Getting Started guide to focus on the simplest path through
    1. Start from standalone mode
    2. Explain what is happening with cross-references for more details (teach troubleshooting skills early)
    3. Use the API as much as possible, but not at the cost of readability/comprehension
    4. Demonstrate recent APIs/features
    5. Build up to the cloud example
  6. Bigger changes that need further discussion
    1. Delete ALL DIH examples in bulk - DONE (SOLR-14066, SOLR-14783)
    2. Delete Tika configuration and refer to the manual for configuration and warning (SOLR-13973)
    3. Move schemaless mode into learning chain (SOLR-14701, SOLR-11741)
    4. Delete (refactor) techproducts example and its files (but what about tests?)
    5. Delete Velocity example (SOLR-14065)
    6. V2 vs V1 API for examples (V2 is not available for standalone mode in 8.6.1)
    7. post tool vs curl
    8. Interplay with Admin UI changes in progress (e.g. how much to leverage/demonstrate it)
    9. Neither default nor techproducts is a realistic production schema - a whole separate but related discussion (Jira exists?)
    10. It seems that even though Velocity/DIH/others have been deprecated, they have not actually been removed from code/documentation for 9.0 yet. Are there Jiras for that already?
  7. Other cleanup
    1. Fix the dead/legacy wiki.apache.org links (SOLR-14834)
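
As a starting point for the link cleanup in the last item, the remaining legacy links could be listed with a small script. The following is a minimal Python sketch, not an existing Solr tool: the function names, the choice of file suffixes, and the managed-schema file name filter are assumptions made for illustration. Only the wiki.apache.org host comes from the item above.

```python
import re
from pathlib import Path

# Matches http or https links into the legacy wiki; stops at whitespace,
# quotes, angle brackets, or a closing parenthesis.
LEGACY_LINK = re.compile(r"https?://wiki\.apache\.org/[^\s\"'<>)]*")


def find_legacy_links(text):
    """Return (line_number, url) pairs for each legacy wiki link in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_LINK.finditer(line):
            hits.append((number, match.group(0)))
    return hits


def scan_tree(root, suffixes=(".xml",), names=("managed-schema",)):
    """Yield (path, line_number, url) for matching files under root.

    The suffixes and names filters are assumptions; adjust them to the
    files that actually need checking.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and (path.suffix in suffixes or path.name in names):
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, url in find_legacy_links(text):
                yield path, number, url
```

Iterating over scan_tree for a checkout directory and printing each result gives a worklist of file, line, and URL. The sketch only finds the links; it does not check whether the redirects behind them still resolve.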

Compatibility, Deprecation, and Migration Plan

  • Existing users will only be affected when they look at examples again to learn additional features
  • The directory locations may change, but possibly in a very minor way. If 3rd party tools hardcode paths, this may need a call-out
  • Tests use both the default and techproducts schemas. They would need to be migrated

Security considerations

This proposal should either not affect security or possibly improve it.

Test Plan

All existing tests should run. Additional tests may be needed?

Rejected Alternatives

The current status is broken in 100 different small ways. The discussions and attempts to fix them are happening in parallel efforts, but they approach it from a functional (rather than critical path) point of view. Because these are separate efforts, their priority and impact are often not fully appreciated without a higher-level critical path discussion.

It may be possible to create just a minimal learning schema and/or a couple of examples, but this would still not address the fact that, once a person tries to add new functionality or test new features, they are no longer supported. Nor will it address kitchen-sink production deploys.

Related previous explorations and feature tests

Learning vs Production vs kitchen sink setup

Learning config

  • Should be as small as possible and still load in both standalone and cloud configurations
  • Every line should have a purpose and be explained with RefGuide references
  • managed-schema should be ordered for reading comprehension (fieldType, related fields, uniqueKey declaration next to ID)
  • Additional examples should layer on top of learning schema to demonstrate different features
  • schemaless mode (to be rewritten to be learning mode) is a separate example
  • Related issues:

Production config

  • managed-schema should be minimal to allow users to include what is actually needed
  • solrconfig.xml
    • should be fairly comprehensive, but obscure defaults and detailed explanations should live in the RefGuide. From experience, nobody updates the schema files unless forced to (it still points to the wiki)
    • there should be some easy way to tell, from the nested structure of solrconfig.xml, where a new configuration needs to go (or focus on configoverlay and the config API, if it is fixed)

Kitchen sink config

  • Is there a point in having a kitchen sink config that is basically a reference of field type definitions? That's where all the language variants could go.
  • managed-schema points
    • having kitchen-sink default configset allows us to put some inline comments that make no sense in either production or learning schema as their files may get rewritten on use
    • may be write locked to clearly indicate it is not for real use
    • kitchen sink may be the only one with commented out analyzer lines


Lessons learned

From DIH Cleanup (SOLR-14783)

  • To get DIH to work, we had to add permissions into solr/server/etc/security.policy, which is very low-level. Is it going to be an issue? Do we need a way for packages to explain such needs on install? Are there more examples like that? Also, it is great that somebody commented it properly, otherwise it would just be sitting there forever

