This Confluence has been LDAP enabled, if you are an ASF Committer, please use your LDAP Credentials to login. Any problems file an INFRA jira ticket please.

Page tree
Skip to end of metadata
Go to start of metadata

I.Problem Statement:

Adding inverted indices to frequently used dimensions is very helpful to cut down query latency. Currently, whether or not applying an inverted index is purely the decision of system admin. A lot of manual work can go into this and there are several drawbacks:

  1. The problem of choosing a proper set of dimensions to apply inverted indices on can be very time consuming. A system admin has to pull up the logs and go through queries for most frequently used dimensions. Without established tool and methodology this can also be fault-prone.
  2. Furthermore, the approach in 1 does not guarantee inverted indices found are optimal or targeting the right queries (say, one wants to decrease the latency slow running queries with >90th% latency)
  3. There is an on average 100% storage penalty for adding inverted index to a specific column. And there is a non-zero cost1 of applying inverted index in the filtering phase, meaning there is a "sweet spot" for the number of inverted index.
  4. For a lot of use cases with and-connected predicates, it is most important to pick the one dimension with best selectivity, which has the best effect in cutting down the time spent in lookup phase. Over optimizing beyond this point will not be cost efficient.
  5. It is desirable for the system to swiftly give a recommendation upon changing of query pattern and data pattern.


1.Latency Modeling:

Latency (execution time) exposed in broker log is comprised of:

  1. Queueing Time
  2. Lookup Time
  3. Aggregation Time
  4. Random Noise

1 and 4 are relatively hard to predict and irrelevant to our ultimate goal of inverted index recommendation, so we want to get rid of them from our query stats. By replaying real queries on test node with low QPS, the queueing time is minimized. Now, we want to correlate the entriesScannedInFilter with lookup time and entriesScannedPostFilter with aggregation time using a linear model, which can be expressed as:

We use minimum binning to minimize noise. Which is essentially dividing the space of entriesScannedInFilter×entriesScannedPostFilter into small buckets of a×b, and for all the {(entriesScannedInFilter, entriesScannedPostFilter): ExecutionTime} falling into a given bucket, we select the one with minimum ExecutionTime.

The experiment on 6 use cases show good fitting results. We can establish entriesScannedInFilter is a good indicator of the latency contributed by lookup phase. With this established, we can analyze the potential of improvement we can make by calculate the contribution of entriesScannedInFilter to whole execution time. More importantly, this can function as  a good weight for our recommendation algorithm inn the following next section.

2.Parser-based Scoring:

From the discussion above, we can conclude that to recommend a proper inverted index based on data characteristics and query pattern, the following observations on different aspects of the problem are helpful:

  1.  From the aspect of all the queries, if a dimension is very frequently used across all queries, it is a good candidate for inverted index, because many queries can be optimized by applying only one inverted index.
  2.  From the perspective of data characteristics, it is desirable to add inverted index on columns with high selectivity. Because the evaluation of inverted index is always prioritized, picking a column with high selectivity for inverted index will often significantly reduce the numEntriesScannedInFilter.
  3. From the perspective of a single query, to better leverage the cardinality information, we should also consider the structure of the queries, namely the AND-connected and OR-connected clauses. Because for AND-connected clause, the scan range will be significantly reduced by applying inverted index on only one of the dimensions, whereas in OR-connected clauses it can only cut the number of entries scanned for that particular dimension.

With these observations, we design an algorithm based on the PQL parser generated by Antlr. For each clause, we can apply two functions, key and S, which are defined recursively as following:

at top level, the set of columns in key will be extracted, whose corresponding votes will be increase by:

After processing all the queries, the columns will be ranked by their votes, generating a sequence of inverted index recommendation. 

III.Tool Design:

1.Query Source:

This tool looks for patterns in query text and execution details to give recommendations. This piece of information is exposed at broker level. Users can choose to write log files to local broker disk or publish log events to streaming platforms. Regardless of the data pipeline, following are necessary fields in the execution log for this tool to come up with a recommendation.


      "name": "query",

      "type": "string",

      "description": "The query text received, in Pinot Query Language.”



      "name": "queryProcessingDuration",

      "type": "long",

      "description": "Time taken to process the query, in milli-seconds”



      "name": "scannedEntriesInFilterCount",

      "type": "long",

      "description": "The number of entries that had to be scanned during the filter phase."



      "name": "scannedEntriesPostFilterCount",

      "type": "long",

      "description": "Number of entries scanned in the post-filter phase"


tuner.query.src.QuerySrc is a generic interface to accommodate for different input pipelines. On top of this, LogQuerySrcImpl is implemented to process lines in a log file, meanwhile calling BrokerLogParserImpl to generate IndexSuggestQueryStatsImpl objects wrapping the above fields. These IndexSuggestQueryStatsImpl objects will be feed to the analysis strategy.

2.Metadata Source:

To leverage the characteristic of data, this tool will access the and index_map file for each segment. Specifically, the following fields are useful:


      "name": "cardinality",

      "type": "long",

      "description": "The cardinality of a given segment, for a given table and a given column.”



      "name": "totalDocs",

      "type": "long",

      "description": "Total number of documents(rows) in a given segment, for a given table and a given column.”



      "name": "isSorted",

      "type": "long",

      "description": "If the data is sorted in a given segment, for a given table and a given column."



      "name": "totalNumberOfEntries",

      "type": "long",

      "description": "The total number of entries (values/multi-values) of a given segment, for a given table and a given column."



      "name": "invertedIndexSize",

      "type": "long",

      "description": "The size of inverted index in a given segment (if applicable), for a given table and a given column."


MetaManager is the generic interface for accessing the metadata. For now only the (weighted) average of cardinality, (weighted) average of entries per multi-value, total number of segments sorted, and inverted index size are used. The  implementation, JsonFileMetaManagerImpl is to fetch and provide these information from a file, as we have yet to store all these info in ZK or implement an API to access this information. An additional command CollectSegmentMetadata is provided to generate such packed metadata (in json format) from the tarred segments in controller. 

3.Segment Metadata Collector:

A tool is written to extract the relevant fields in tarred segments and pack them into a json file. The tool will work on the segment directory of controller which has a structure of segmentDirectory/tableNameWithoutType/segment. The and index_map files will be decompressed and parsed. The segment level properties are structured as /tableNameWithoutType/columnName/segmentName/property. And the sum of cardinality×totalDocs, totalDocs, isSortedtotalNumberOfEntries, invertedIndexSize, segmentCount will also be calculated and stored as column level metadata. This file will be used as described in 2.

4.Tool Structure

The tool is based on a driver which can provide high efficient multithreading capability. The tuning applications are build on top of this driver. This structure can be easily translate to Map-Reduce for scalable and realtime processing of query logs.


Get percentile report on numEntriesScannedInFilter:

$ ./ EntriesScannedQuantileReport -h
Usage: EntriesScannedQuantileReport
	-log                      <String>                      : Path to broker log file. (required=true)
	-tables                                                 : Comma separated list of table names to work on without type (unset run on all tables) (required=false)
	-help                                                   : Print this message. (required=false) 
$ ./ EntriesScannedQuantileReport -log ~/Documents/userTablesBrokerLog.log

Collect and pack in Json:

$ ./ CollectMetadataForIndexTuning -h
Usage: CollectMetadataForIndexTuning
	-out                      <String>                      : An empty directory to work on, for tmp files and output metadata.json file,must have r/w access (required=true)
	-segments                 <String>                      : The directory containing /tableNamesWithoutType/{tarred segments} (required=true)
	-tables                                                 : Comma separated list of table names to work on without type (unset to run on all tables) (required=false)
	-help                                                   : Print this message. (required=false)
$ ./ CollectMetadataForIndexTuning -out ~/tmpDirectory -segments /.../pinot/pinot_segments/pinot/ -tables userTable1

Get inverted index recommendation:

$ ./ IndexTuner -h
Usage: IndexTunerCommand
	-metadata                 <String>                      : Path to packed metadata file (json), CollectMetadataForIndexTuning can be used to create this. (required=true)
	-log                      <String>                      : Path to broker log file. (required=true)
	-index                    <inverted/sorted>             : Select target index. (required=true)
	-strategy                 <freq/parser>                 : Select tuning strategy. (required=true)
	-selectivityThreshold     <long>                        : Selectivity threshold (>1), default to 1,  (required=false)
	-entriesScannedThreshold  <long>                        : Log lines with numEntriesScannedInFilter below this threshold will be excluded. (required=false)
	-numQueriesThreshold      <long>                        : Tables with log lines scanned threshold will be excluded. (required=false)
	-tables                                                 : Comma separated list of table names to work on without type (unset run on all tables) (required=false)
	-untrackedInvertedIndex                                 : "{\"tabelNameWithoutType1\": [\"colName1\",\"colName2\"]}" (required=false)
	-help                                                   : Print this message. (required=false)
$ ./ IndexTuner -index inverted -strategy parser -log ~/Documents/userTablesBrokerLog.log -metadata ~/tmpDirectory/metadata.json -untrackedInvertedIndex "{\"userTable1\":[\"untrackedColumn1\"]}" -selectivityThreshold 10 -tables userTable1

III.Future Improvement:

1. Adding desirableNumEntriesScannedInFilter:

To complete the picture of how many inverted index we should add, Sunitha Beeram  suggests that we can use a threshold desirableNumEntriesScannedInFilter, with which we can estimate by adding how many inverted index can we lower the actual numEntriesScannedInFilter to the desirableNumEntriesScannedInFilter. We could also potentially use a feed-back controlled incremental inverted index application with this threshold. Such process will make the model strongly coherent with SLA.

2.Known Issue with numEntriesScannedInFilter cut-down estimation for large multi-value columns.

Right now in the model, we are using:

to estimate the very part of number of entries scanned in filter we can cut down by adding an inverted index. There is actually a known bias in this estimation. A more accurate form will be:

However the number of entries scanned in filter is not exposed in column granularity right now, so this cannot be built. The impact of this for MV columns is already mitigated by some level by not adding the average number of entries per doc into the denominator of S. But it will be ideal if we can expose this and make accurate prediction. And this should be  doable without too much overhead. Thanks to Mayank Shrivastava for bringing up this in my presentation.


  1.  It can potentially hurt the performance if we use inverted index for very long range query, where a lot of bitmaps are or-ed.
  • No labels


  1. This is awesome Jia Guo. I liked the systematic approach taken to deduce some of the conclusions. Few questions/suggestions

    •  what are key and S representing. 
    • Is it possible to come up with how much would the latency improve overall? approximate is fine. That will help in deciding if the additional storage is worth it. 
    • Is it possible to extend this to say when star-tree will be useful? Think about optimizing entriesScannedPostFilter instead of entriedScannedInFilter.

    1. Hello Kishore,

      1. S is representing Selectivity, basically (S-1)/S is the portion of range of docs we can shrink by adding inverted index on a column.

      2. It is possible and it is already in EntriesScannedQuantileReport. But because we are not exposing the pure execution/lookup time, this estimation will mostly not be made because the model cannot fit queueing/noise. After we expose that filtering time metric it will be possible.

      3. I haven't thought much about star-tree right now, but it is a very promising future work.


  2. Can we use enhance this to suggest inverted indices purely based on the cardinality of each column without looking at the logs?. This can be something user can run on a sample data. We can assume that all columns are queried equally or take the columns frequently queried as inputs.

    1. Without looking at the logs, we can base the method on:

      1. Apply inverted index on high selectivity columns.
      2. Apply inverted index on multi-value columns with large number of entries per doc.

      And the things we can leverage are:

      1. The characteristics of data, which include the current fields in metadata. But looking at the actual data distribution may have privacy related issue.
      2. The semantic meaning of the column name, for ex xxx-id can be a good candidate.

      But again this method can be inaccurate without the knowledge of log.