Overview

Currently the gfsh lucene search command uses the default StringQueryProvider, which supports syntax such as:

 
gfsh> search lucene --name=personIndex --region=/Person --queryString=john* --defaultField=name


StringQueryProvider cannot express complex searches that combine several conditions, especially when one of them is a numeric range. For example:

search for a person whose name must contain "john" and whose salary should be between 750000 and 800000.

To meet this requirement, we will pass a points config map, with one entry per numeric field, to the built-in StringQueryProvider.

Approach

Lucene's StandardQueryParser can parse the syntax (see the section "Gfsh command line syntax" for details) if it is told which fields are numeric.

 

The numeric fields can be Integer, Long, Float, or Double.

To do that, the parser must be given a PointsConfigMap that maps each numeric field name to its type.

Our index contains the list of indexed fields, and the serializer keeps meta-info about each field's type. We read this field-to-type mapping, build the PointsConfigMap from it, and set it on the parser.
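
As a rough illustration of the step above, the following Java sketch turns a field-to-type mapping into per-field numeric configs. It uses only the JDK: `NumericFieldConfig` is a hypothetical stand-in for Lucene's `PointsConfig`, and the `fieldTypes` map stands in for the serializer meta-info, whose real shape is not settled in this proposal. With Lucene on the classpath, each entry would presumably become a `PointsConfig(format, type)`, and the resulting map would be passed to `StandardQueryParser.setPointsConfigMap`.

```java
import java.text.NumberFormat;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class NumericFieldConfigSketch {

  /** Hypothetical stand-in for Lucene's PointsConfig: a number format plus the numeric type. */
  static final class NumericFieldConfig {
    final NumberFormat format;
    final Class<? extends Number> type;

    NumericFieldConfig(NumberFormat format, Class<? extends Number> type) {
      this.format = format;
      this.type = type;
    }
  }

  /**
   * Builds a field-name to numeric-config map from a field-name to type map.
   * The input map is an assumption: it represents whatever meta-info the serializer exposes.
   */
  static Map<String, NumericFieldConfig> buildConfigMap(Map<String, Class<?>> fieldTypes) {
    Map<String, NumericFieldConfig> result = new LinkedHashMap<>();
    for (Map.Entry<String, Class<?>> entry : fieldTypes.entrySet()) {
      Class<?> type = entry.getValue();
      if (type == Integer.class || type == Long.class) {
        // Whole-number fields are parsed with an integer format.
        result.put(entry.getKey(), new NumericFieldConfig(
            NumberFormat.getIntegerInstance(Locale.ROOT), type.asSubclass(Number.class)));
      } else if (type == Float.class || type == Double.class) {
        // Floating-point fields are parsed with a general number format.
        result.put(entry.getKey(), new NumericFieldConfig(
            NumberFormat.getNumberInstance(Locale.ROOT), type.asSubclass(Number.class)));
      }
      // Non-numeric fields (for example String) get no entry and stay text fields.
    }
    return result;
  }

  public static void main(String[] args) {
    // Field names taken from the personIndex example in this page.
    Map<String, Class<?>> fieldTypes = new LinkedHashMap<>();
    fieldTypes.put("name", String.class);
    fieldTypes.put("revenue", Integer.class);
    fieldTypes.put("revenue_float", Float.class);
    fieldTypes.put("revenue_double", Double.class);
    fieldTypes.put("revenue_long", Long.class);

    for (Map.Entry<String, NumericFieldConfig> e : buildConfigMap(fieldTypes).entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue().type.getSimpleName());
    }
  }
}
```

Fields without a numeric type are simply left out of the map, so they keep being handled as ordinary text fields.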

Challenge

  • The meta-info is stored inside each serializer and there is no generic interface to read it, so the code needs refactoring.
  • HeterogeneousLuceneSerializer, which keeps the meta-info in its private mappers, must be explicitly specified as the default serializer.
  • FlatFormatSerializer is implemented differently and has no mappers data structure.
  • PDX data types are parsed by pdxMapper; how to obtain their meta-info is still unknown.

Gfsh command line syntax

There is no change to gfsh. The existing gfsh parameters already support the numeric query syntax.

# create index with 4 numeric fields
gfsh> create lucene index --name=personIndex --region=/Person --field=name,email,address,revenue,revenue_float,revenue_double,revenue_long


# find an exact match for a numeric field
gfsh>search lucene --region=/Person --name=personIndex --queryString="revenue=763000" --defaultField=name
 key   |                                                                                                     | score
------ | --------------------------------------------------------------------------------------------------- | -----
key763 | Person{name='Tom763 Zhou', email='tzhou763@example.com', revenue=763000, homepage='Page{id=763, c.. | 1


# use 2 SHOULD conditions, which is equivalent to "A OR B"
gfsh>search lucene --region=/Person --name=personIndex --queryString="revenue=763000 revenue=764000" --defaultField=name
 key   |                                                                                                     | score
------ | --------------------------------------------------------------------------------------------------- | -----
key763 | Person{name='Tom763 Zhou', email='tzhou763@example.com', revenue=763000, homepage='Page{id=763, c.. | 1
key764 | Person{name='Tom764 Zhou', email='tzhou764@example.com', revenue=764000, homepage='Page{id=764, c.. | 1


# use 2 MUST conditions, which is equivalent to "A AND B". Lucene recognizes "+" as MUST
gfsh>search lucene --region=/Person --name=personIndex --queryString="+revenue>763000 +revenue<766000" --defaultField=name
 key   |                                                                                                     | score
------ | --------------------------------------------------------------------------------------------------- | -----
key765 | Person{name='Tom765 Zhou', email='tzhou765@example.com', revenue=765000, homepage='Page{id=765, c.. | 1
key764 | Person{name='Tom764 Zhou', email='tzhou764@example.com', revenue=764000, homepage='Page{id=764, c.. | 1


# >=, <= are valid syntax for inclusive condition
gfsh>search lucene --region=/Person --name=personIndex --queryString="+revenue>=763000 +revenue<=766000" --defaultField=name
 key   |                                                                                                     | score
------ | --------------------------------------------------------------------------------------------------- | -----
key766 | Person{name='Tom766 Zhou', email='tzhou766@example.com', revenue=766000, homepage='Page{id=766, c.. | 1
key765 | Person{name='Tom765 Zhou', email='tzhou765@example.com', revenue=765000, homepage='Page{id=765, c.. | 1
key764 | Person{name='Tom764 Zhou', email='tzhou764@example.com', revenue=764000, homepage='Page{id=764, c.. | 1
key763 | Person{name='Tom763 Zhou', email='tzhou763@example.com', revenue=763000, homepage='Page{id=763, c.. | 1

# Another way to specify range query
gfsh>search lucene --region=/Person --name=personIndex --queryString="revenue:[763000 TO 766000]" --defaultField=name

# Query on float, double, and long fields. All 4 numeric types (integer, float, double, long) are supported
gfsh>search lucene --region=/Person --name=personIndex --queryString="revenue_float:[763000.0 TO 766000.0]" --defaultField=name
gfsh>search lucene --region=/Person --name=personIndex --queryString="revenue_double:[763000 TO 766000]" --defaultField=name
gfsh>search lucene --region=/Person --name=personIndex --queryString="revenue_long:[763000 TO 766000]" --defaultField=name

# Combination query to return a subset
gfsh>search lucene --region=/Person --name=personIndex --queryString="+revenue_long:[763000 TO 766000] +revenue_float:[762000 TO 765000]" --defaultField=name
 key   |                                                                                                     | score
------ | --------------------------------------------------------------------------------------------------- | -----
key765 | Person{name='Tom765 Zhou', email='tzhou765@example.com', revenue=765000, homepage='Page{id=765, c.. | 1
key764 | Person{name='Tom764 Zhou', email='tzhou764@example.com', revenue=764000, homepage='Page{id=764, c.. | 1
key763 | Person{name='Tom763 Zhou', email='tzhou763@example.com', revenue=763000, homepage='Page{id=763, c.. | 1


# Lucene recognizes "-" as NOT. A NOT condition removes matching entries from the results.
gfsh>search lucene --region=/Person --name=personIndex --queryString="revenue<2000 revenue>9997000 -name=Tom9998*" --defaultField=name
 key   |                                                                                                     | score
------ | --------------------------------------------------------------------------------------------------- | -----
key0   | Person{name='Tom0 Zhou', email='tzhou0@example.com', revenue=0, homepage='Page{id=0, content="Hel.. | 1
key1   | Person{name='Tom1 Zhou', email='tzhou1@example.com', revenue=1000, homepage='Page{id=1, content=".. | 1
key9999| Person{name='Tom9999 Zhou', email='tzhou9999@example.com', revenue=9999000, homepage='Page{id=999.. | 1
Note: name=Tom9998* is equivalent to name:Tom9998* and to Tom9998* (when defaultField is name)


# Query with a numeric field in a JSON object
gfsh>search lucene --region=/Person --name=personIndex --queryString="revenue=400000" --defaultField=name
 key   |                                                                                                     | score
------ | --------------------------------------------------------------------------------------------------- | -----
key400 | Person{name='Tom400 Zhou', email='tzhou400@example.com', revenue=400000, homepage='Page{id=400, c.. | 1
json1  | PDX[8776019,__GEMFIRE_JSON]{revenue=400000, address=PDX[16524384,__GEMFIRE_JSON]{city=New York, p.. | 1


# Query on a numeric field in a nested object
gfsh>search lucene --region=/Customer --name=customerIndex --queryString="+contacts.revenue:[763000 TO 766000] +revenue:[762000 TO 765000]" --defaultField=name
Note: Both conditions take effect, so 3 (not 4) Customer objects are displayed
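
The way SHOULD, MUST ("+") and NOT ("-") clauses combine in the queries above can be modelled with a small JDK-only Java sketch. It does not use Lucene; `Person`, `Clause`, and `matches` are hypothetical names, the prefix condition is approximated with `startsWith`, and the sample data assumes entry i has revenue i * 1000, as the listings above suggest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class BooleanClauseSketch {

  enum Occur { SHOULD, MUST, MUST_NOT }

  /** Minimal stand-in for an indexed Person entry. */
  static final class Person {
    final String name;
    final int revenue;

    Person(String name, int revenue) {
      this.name = name;
      this.revenue = revenue;
    }
  }

  /** One clause of a query: how it occurs plus the condition it tests. */
  static final class Clause {
    final Occur occur;
    final Predicate<Person> condition;

    Clause(Occur occur, Predicate<Person> condition) {
      this.occur = occur;
      this.condition = condition;
    }
  }

  /** Every MUST has to hit, no MUST_NOT may hit; without any MUST, at least one SHOULD has to hit. */
  static boolean matches(Person p, List<Clause> clauses) {
    boolean hasMust = false;
    boolean anyShouldHit = false;
    for (Clause c : clauses) {
      boolean hit = c.condition.test(p);
      switch (c.occur) {
        case MUST:
          hasMust = true;
          if (!hit) return false;
          break;
        case MUST_NOT:
          if (hit) return false;
          break;
        default: // SHOULD
          anyShouldHit |= hit;
      }
    }
    return hasMust || anyShouldHit;
  }

  /** Returns the names of all people matching the clauses, in input order. */
  static List<String> search(List<Person> people, List<Clause> clauses) {
    List<String> names = new ArrayList<>();
    for (Person p : people) {
      if (matches(p, clauses)) names.add(p.name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<Person> people = new ArrayList<>();
    for (int i = 0; i < 10000; i++) {
      people.add(new Person("Tom" + i + " Zhou", i * 1000));
    }
    // Models: revenue<2000 revenue>9997000 -name=Tom9998*
    List<Clause> query = Arrays.asList(
        new Clause(Occur.SHOULD, p -> p.revenue < 2000),
        new Clause(Occur.SHOULD, p -> p.revenue > 9997000),
        new Clause(Occur.MUST_NOT, p -> p.name.startsWith("Tom9998")));
    System.out.println(search(people, query));
  }
}
```

Run against this sample data, the modelled NOT query keeps Tom0, Tom1, and Tom9999 and drops Tom9998, which mirrors the NOT listing above.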

Java API

There is no change to the Java API.

