Status

Current state: "Accepted"

Discussion thread: https://lists.apache.org/thread/n1hyo9wod5mqc02sh388dlzr2k29qmhn

JIRA: SOLR-16727

Solr Operator Issue #536

Released:

  • Solr 9.3
  • Solr Operator v0.8.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast). Confluence supports inline comments that can also be used.

Motivation

The Solr Autoscaling framework was deprecated in later 8.x versions, and removed in the 9.0 release.

Given that Kubernetes has Autoscaling support built-in via the HorizontalPodAutoscaler, and Solr has official support for running on Kubernetes, it seems like a good fit for the next iteration of Solr Autoscaling.

This autoscaling implementation is not meant to replace every part of the removed Autoscaling framework.

The first goal is to support scaling Solr nodes up and down, and moving replicas after a scaling event to spread load evenly.

Public Interfaces

Solr Interfaces

Three public interface changes are needed: one API addition, one API change, and one new PlacementPlugin method.

A new superclass, the "OrderedNodePlacementPlugin", will contain the balancing and placement logic for all built-in PlacementPlugins.
Built-in PlacementPlugins will now extend OrderedNodePlacementPlugin, which requires implementing one method:

protected abstract Map<Node, WeightedNode> getBaseWeightedNodes(
    PlacementContext placementContext,
    Set<Node> nodes,
    Iterable<SolrCollection> relevantCollections,
    boolean skipNodesWithErrors)
    throws PlacementException;

This method lets each PlacementPlugin return a mapping of Node to the relevant WeightedNode for that implementation.

WeightedNode is an abstract class that each PlacementPlugin extending OrderedNodePlacementPlugin must implement. It determines how replicas should be placed for that PlacementPlugin. Nodes with lower weights will have replicas placed on them, and Nodes with higher weights will have replicas taken off of them.
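
To make this concrete, below is a minimal sketch of a WeightedNode whose weight is simply the node's replica count, so replicas flow from fuller nodes toward emptier ones. The override names (calcWeight, addProjectedReplicaWeights, removeProjectedReplicaWeights), the constructor shape, and the imports are assumptions for illustration, not a confirmed Solr API:

// A minimal sketch, NOT the actual Solr API: the overridden method names
// and the constructor are assumptions. Weight = replica count, so nodes
// with fewer replicas are preferred targets for new replicas.
import org.apache.solr.cluster.Node;
import org.apache.solr.cluster.Replica;

public class ReplicaCountWeightedNode extends OrderedNodePlacementPlugin.WeightedNode {
  private int replicaCount;

  public ReplicaCountWeightedNode(Node node, int initialReplicaCount) {
    super(node); // assumed constructor taking the underlying Node
    this.replicaCount = initialReplicaCount;
  }

  @Override
  public int calcWeight() {
    return replicaCount; // lower weight => replicas are placed here first
  }

  @Override
  protected void addProjectedReplicaWeights(Replica replica) {
    replicaCount++; // placing a replica makes this node heavier
  }

  @Override
  protected void removeProjectedReplicaWeights(Replica replica) {
    replicaCount--; // removing a replica makes this node lighter
  }
}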

API

Balance Replicas:

v2:

PUT /api/cluster/replicas/balance
{
  "nodes": [],                 // Optional
  "waitForFinalState": false,
  "async": "async"
}

Migrate Replicas: (An extension of the existing ReplaceNode command)

v2:

POST /api/cluster/replicas/migrate
{
  "sourceNodes": [],
  "targetNodes": [],           // Optional, defaults to liveNodes that are not sourceNodes
  "waitForFinalState": false,
  "async": "async"
}
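
As a purely hypothetical example (the node names and async ID are illustrative), the Solr Operator might vacate two pods before a scale-down with:

POST /api/cluster/replicas/migrate
{
  "sourceNodes": ["solr-1.solr-headless:8983_solr", "solr-2.solr-headless:8983_solr"],
  "async": "scale-down-2-pods"
}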

PlacementPlugin

public interface BalanceRequest extends ModificationRequest {}

public interface PlacementPlugin {

  BalancePlan computeBalancing(
      BalanceRequest balanceRequest, PlacementContext placementContext)
      throws PlacementException, InterruptedException;
}


This method will compute a list of replicas to be balanced across the given nodes.

The BalanceReplicas command can then take this list, create the new replicas using logic similar to the ReplaceNode command, and delete the old replicas afterwards.

Solr Operator Interfaces

SolrCloud CRD:

spec:
  ...
  scaling:
    populatePodsOnScaleUp: true
    vacatePodsOnScaleDown: true

If the user wants the Solr Operator to move replicas around when scaling up or down, they will use the "scaling.populatePodsOnScaleUp" and "scaling.vacatePodsOnScaleDown" options.


Having the Solr Operator manage the HPA for users will not be a part of this SIP. HPAs are highly specific to each user's usage of Solr, and Kubernetes makes it easy to point an HPA at a SolrCloud.
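
For reference, a user-managed HPA pointed at a SolrCloud could look like the following sketch. The resource names, metric, and thresholds are illustrative assumptions, not recommendations:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-solrcloud-hpa   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: solr.apache.org/v1beta1
    kind: SolrCloud
    name: example               # hypothetical SolrCloud name
  minReplicas: 3
  maxReplicas: 9
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative threshold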

Since the Operator will not control the HPA, it cannot disable it during rolling restarts and other maintenance operations. However, due to the new LockedCluster operations logic, we can make sure that scaling does not take place during other cluster maintenance.

Proposed Changes

This feature will require changes to both Solr and the Solr Operator. Since the Solr Operator supports a range of Solr versions, this will not be available for Solr Operator users until they upgrade to a version of Solr that implements this SIP.

Solr Changes

The two main APIs that the Solr Operator would need to call in Solr to implement this functionality are:

  • Move replicas off of a node, because it will no longer be in use
    • This already exists, and has been improved in SOLR-15803 to optimally place all replicas from a node across the cluster.
    • A new command, MigrateReplicas, will be added that the Solr Operator can use in the future to move replicas off of multiple nodes at once (SOLR-16855).
  • Move replicas onto a node, because it is now a part of the cluster
    • Instead of just moving replicas onto a node, we will introduce an API to balance replicas across a set of nodes, or a whole SolrCloud (SOLR-16806).

In order to implement this logic, we need the new interfaces and methods in the placement package described above. Since there are four different built-in PlacementPlugins, this feature would otherwise need to be implemented separately for each of them. Instead, we will rewrite the existing PlacementPlugins to extend OrderedNodePlacementPlugin, which implements both computePlacements and computeBalancing. Each PlacementPlugin will then implement a node weighting that determines where replicas should be placed or moved; a simplified sketch of that weighting-based balancing follows.
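
The sketch below is a hedged, self-contained illustration of the core idea, not the actual OrderedNodePlacementPlugin code: all names are hypothetical, the weight is simplified to a replica count, and replicas are repeatedly moved from the heaviest node to the lightest node until the spread cannot improve.

// Illustrative sketch of weight-based balancing; NOT the actual Solr code.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

final class BalanceSketch {
  /** Hypothetical minimal view of a node: its replicas and a weight. */
  static final class WeightedNode {
    final String name;
    final Deque<String> replicas = new ArrayDeque<>();
    WeightedNode(String name) { this.name = name; }
    int getWeight() { return replicas.size(); } // weight = replica count here
  }

  /** Returns a list of (replica, fromNode, toNode) moves; assumes a non-empty node list. */
  static List<String[]> computeBalancing(List<WeightedNode> nodes) {
    List<String[]> moves = new ArrayList<>();
    while (true) {
      WeightedNode lightest =
          Collections.min(nodes, Comparator.comparingInt(WeightedNode::getWeight));
      WeightedNode heaviest =
          Collections.max(nodes, Comparator.comparingInt(WeightedNode::getWeight));
      // Stop when moving a replica would not strictly reduce the imbalance.
      if (heaviest.getWeight() - lightest.getWeight() <= 1) {
        break;
      }
      String replica = heaviest.replicas.pop();
      lightest.replicas.push(replica);
      moves.add(new String[] {replica, heaviest.name, lightest.name});
    }
    return moves;
  }
}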

Solr Operator Changes

The Solr Operator would need the following changes:

  • If enabled, on scale-down of the StatefulSet, first move replicas off of the pods that will be deleted.
  • If enabled, on scale-up of the StatefulSet, afterwards move replicas onto the pods that have been created.
  • During Cloud maintenance, disable scaling activity.


Compatibility, Deprecation, and Migration Plan

  • This feature will require changes to both Solr and the Solr Operator. Since the Solr Operator supports a range of Solr versions, this will not be available for Solr Operator users until they upgrade to a version of Solr that implements this SIP.
  • Existing users of the Solr Operator will see these new features used by default.
    • The populatePods option requires a new Solr version (9.3), so the Operator will skip the logic if it receives an error indicating that the SolrCloud does not support the command.
  • The new MigrateReplicas command is only available in Solr 9.3+, so the ReplaceNode command will be used for now. Later, the Operator can try MigrateReplicas and fall back on ReplaceNode if necessary.

Security considerations

No security concerns.

Test Plan

For the Solr APIs, we can use unit tests to test for dispersion for BalanceReplicas and MigrateReplicas, like the current ReplaceNode tests.

For the Solr Operator, we will use the e2e testing framework to test that replicas are moved on node scale-up and scale-down.
This is similar to how the tests for SolrClouds with ephemeral data are done (replicas have to be moved on pod deletion, since the data is ephemeral).

Rejected Alternatives

No alternatives have been rejected yet.
