Initial Changes
The lucene.version in gradle/dependency-versions.properties was changed:
From:
lucene.version = 6.4.1
To:
lucene.version = 7.1.0
Compilation
With this change, compilation failed with:
geode/open/geode-lucene/src/main/java/org/apache/geode/cache/lucene/internal/repository/IndexRepositoryImpl.java:125: error: incompatible types: possible lossy conversion from long to int
    totalHits = docs.totalHits;
The org.apache.lucene.search.TopDocs totalHits field changed from int to long.
That required a few changes.
The org.apache.geode.cache.lucene.internal.repository.IndexRepositoryImpl query method was changed:
From:
int totalHits = 0;
To:
long totalHits = 0;
The org.apache.geode.cache.lucene.internal.LuceneIndexStats endRepositoryQuery signature was changed:
From:
public void endRepositoryQuery(long start, final int totalHits)
To:
public void endRepositoryQuery(long start, final long totalHits)
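For context, here is a minimal sketch of how the two changes fit together. The surrounding code is an assumption for illustration (the real IndexRepositoryImpl.query does more work); only the widened types come from the changes above.

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

// Sketch only: not the actual Geode source.
class QueryStatsSketch {
  interface Stats { // stand-in for LuceneIndexStats
    void endRepositoryQuery(long start, long totalHits);
  }

  static void query(Directory dir, Query query, int limit, Stats stats) throws IOException {
    long start = System.nanoTime();
    long totalHits = 0; // int would no longer compile: TopDocs.totalHits is a long in Lucene 7
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs docs = searcher.search(query, limit);
      totalHits = docs.totalHits;
    }
    stats.endRepositoryQuery(start, totalHits); // the stats signature is widened to long as well
  }
}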
Unit Test Execution
Issue
org.apache.geode.cache.lucene.internal.repository.IndexRepositoryImplJUnitTest queryShouldUpdateStats failed with:
org.apache.geode.cache.lucene.internal.repository.IndexRepositoryImplJUnitTest > queryShouldUpdateStats FAILED
    Argument(s) are different! Wanted:
    luceneIndexStats.endRepositoryQuery(
        <any long>,
        (Integer) 2
    );
    -> at org.apache.geode.cache.lucene.internal.repository.IndexRepositoryImplJUnitTest.queryShouldUpdateStats(IndexRepositoryImplJUnitTest.java:163)
    Actual invocation has different arguments:
    luceneIndexStats.endRepositoryQuery(
        0L,
        (Long) 2L
    );
    -> at org.apache.geode.cache.lucene.internal.repository.IndexRepositoryImpl.query(IndexRepositoryImpl.java:136)
       at org.apache.geode.cache.lucene.internal.repository.IndexRepositoryImplJUnitTest.queryShouldUpdateStats(IndexRepositoryImplJUnitTest.java:163)
Resolution
The org.apache.geode.cache.lucene.internal.repository.IndexRepositoryImplJUnitTest queryShouldUpdateStats test was changed:
From:
verify(stats, times(1)).endRepositoryQuery(anyLong(), eq(2));
To:
verify(stats, times(1)).endRepositoryQuery(anyLong(), eq(2l));
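As a small, self-contained illustration of why the original matcher stopped matching (the Stats interface and values here are stand-ins, not the Geode classes): once the parameter is a long, Mockito records the argument as a Long, so the Integer matcher produced by eq(2) never matches.

import static org.mockito.Mockito.*;

class EndRepositoryQueryMatcherSketch {
  interface Stats { // stand-in for LuceneIndexStats
    void endRepositoryQuery(long start, long totalHits);
  }

  public static void main(String[] args) {
    Stats stats = mock(Stats.class);
    stats.endRepositoryQuery(0L, 2L);

    // Matches: the expected value is a long, just like the recorded argument.
    verify(stats, times(1)).endRepositoryQuery(anyLong(), eq(2L));

    // The original expectation, eq(2), is reported as "Argument(s) are different!"
    // because (Integer) 2 does not equal (Long) 2L.
  }
}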
Issue
Several org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit tests failed with the exception below, including:
- luceneQueryReturnsCorrectResultsAfterClientAndServersAreRolledOver
- luceneQueryReturnsCorrectResultAfterTwoLocatorsWithTwoServersAreRolled
- luceneQueryReturnsCorrectResultsAfterServersRollOverOnPartitionRegion
org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit > luceneQueryReturnsCorrectResultsAfterClientAndServersAreRolledOver[0] FAILED
    org.apache.geode.test.dunit.RMIException: While invoking org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit$$Lambda$204/1077331008.run in VM 2 running on Host 192.168.2.6 with 4 VMs
        at org.apache.geode.test.dunit.VM.invoke(VM.java:393)
        at org.apache.geode.test.dunit.VM.invoke(VM.java:363)
        at org.apache.geode.test.dunit.VM.invoke(VM.java:308)
        at org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit.verifyLuceneQueryResultInEachVM(LuceneSearchWithRollingUpgradeDUnit.java:553)
        at org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit.putSerializableObjectAndVerifyLuceneQueryResult(LuceneSearchWithRollingUpgradeDUnit.java:500)
        at org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit.luceneQueryReturnsCorrectResultsAfterClientAndServersAreRolledOver(LuceneSearchWithRollingUpgradeDUnit.java:264)
    Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit.verifyLuceneQueryResults(LuceneSearchWithRollingUpgradeDUnit.java:536)
        at org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit.lambda$verifyLuceneQueryResultInEachVM$b83f705c$2(LuceneSearchWithRollingUpgradeDUnit.java:553)
    Caused by: java.lang.IllegalArgumentException: Could not load codec 'Lucene62'. Did you forget to add lucene-backward-codecs.jar?
    Caused by: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene62' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Lucene70, Asserting, CheapBastard, FastCompressingStoredFields, FastDecompressionCompressingStoredFields, HighCompressionCompressingStoredFields, DummyCompressingStoredFields, SimpleText]
Resolution
The following line was added to geode-lucene/build.gradle:
testRuntime 'org.apache.lucene:lucene-backward-codecs:' + project.'lucene.version'
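For context, this is roughly where the added line sits in the geode-lucene/build.gradle dependencies block. The neighboring entries are illustrative only, not the exact file contents:

dependencies {
    // existing Lucene dependencies (illustrative)
    compile 'org.apache.lucene:lucene-core:' + project.'lucene.version'
    compile 'org.apache.lucene:lucene-analyzers-common:' + project.'lucene.version'
    compile 'org.apache.lucene:lucene-queryparser:' + project.'lucene.version'

    // added so tests can still read indexes written with the Lucene 6 'Lucene62' codec
    testRuntime 'org.apache.lucene:lucene-backward-codecs:' + project.'lucene.version'
}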
Issue
org.apache.geode.cache.lucene.internal.distributed.DistributedScoringJUnitTest uniformDistributionProducesComparableScores failed with:
org.apache.geode.cache.lucene.internal.distributed.DistributedScoringJUnitTest > uniformDistributionProducesComparableScores FAILED
    org.junit.ComparisonFailure: expected:<key-[6]> but was:<key-[5]>
        at org.junit.Assert.assertEquals(Assert.java:115)
        at org.junit.Assert.assertEquals(Assert.java:144)
        at org.apache.geode.cache.lucene.internal.distributed.DistributedScoringJUnitTest.uniformDistributionProducesComparableScores(DistributedScoringJUnitTest.java:119)
Analysis
This test compares the scores of query results from one IndexRepositoryImpl against the scores of query results from three IndexRepositoryImpls. The test expects the ordering to be the same, but it is not.
In Lucene 6.4.1, logging shows the scores like:
Single IndexRepositoryImpl Result:
    singleQueryResult=[key=key-0 score=1.8132632, key=key-3 score=1.2614386, key=key-6 score=1.0530728, key=key-5 score=1.0299048, key=key-7 score=0.63357]
Three IndexRepositoryImpl Result:
    collector1QueryResult=[key=key-0 score=1.8866812]
    collector2QueryResult=[key=key-3 score=1.1131656, key=key-5 score=0.5000039]
    collector3QueryResult=[key=key-6 score=0.59086174, key=key-7 score=0.33355096]
    mergedResult=[key=key-0 score=1.8866812, key=key-3 score=1.1131656, key=key-6 score=0.59086174, key=key-5 score=0.5000039, key=key-7 score=0.33355096]
In Lucene 7.1.0, logging shows the scores like:
Single IndexRepositoryImpl Result:
    singleQueryResult=[key=key-0 score=1.9968563, key=key-3 score=1.2946333, key=key-5 score=1.1341828, key=key-6 score=1.0530728, key=key-7 score=0.73058176]
Three IndexRepositoryImpl Result:
    collector1QueryResult=[key=key-0 score=2.0834162]
    collector2QueryResult=[key=key-3 score=1.13994, key=key-5 score=0.5442147]
    collector3QueryResult=[key=key-6 score=0.59086174, key=key-7 score=0.3901917]
    mergedResult=[key=key-0 score=2.0834162, key=key-3 score=1.13994, key=key-6 score=0.59086174, key=key-5 score=0.5442147, key=key-7 score=0.3901917]
The difference is that key-5 and key-6 are ordered differently in the single IndexRepositoryImpl query than in the multiple IndexRepositoryImpl query. This is likely because each IndexRepositoryImpl scores hits using only its own index statistics, so the merged ordering is not guaranteed to match the single-index ordering.
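As a standalone illustration of that effect (this is not the Geode test; the documents and field name are made up), the same documents indexed into one directory versus split across two shards are scored with different statistics (docFreq, docCount, average field length), so a score-based merge can order keys differently:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ShardScoringSketch {
  static void index(Directory dir, String... texts) throws Exception {
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      for (String text : texts) {
        Document doc = new Document();
        doc.add(new TextField("body", text, Field.Store.YES));
        writer.addDocument(doc);
      }
    }
  }

  static void search(String label, Directory dir) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      System.out.println(label);
      for (ScoreDoc hit : searcher.search(new TermQuery(new Term("body", "lucene")), 10).scoreDocs) {
        System.out.println("  " + searcher.doc(hit.doc).get("body") + " score=" + hit.score);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Directory single = new RAMDirectory();
    index(single, "lucene upgrade", "lucene lucene scoring", "geode region");
    search("single index:", single);

    Directory shard1 = new RAMDirectory();
    Directory shard2 = new RAMDirectory();
    index(shard1, "lucene upgrade");
    index(shard2, "lucene lucene scoring", "geode region");
    search("shard 1:", shard1);
    search("shard 2:", shard2);
    // Merging the per-shard hits by raw score can order keys differently than the
    // single-index search, which is what the test observes after the upgrade.
  }
}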
Issue
org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit luceneQueryReturnsCorrectResultsAfterServersRollOverOnPartitionRegion failed with:
[info 2017/10/18 16:14:35.167 PDT <RMI TCP Connection(1)-192.168.2.6> tid=0x13] Got result: EXCEPTION_OCCURRED java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit.verifyLuceneQueryResults(LuceneSearchWithRollingUpgradeDUnit.java:536) at org.apache.geode.cache.lucene.LuceneSearchWithRollingUpgradeDUnit.lambda$verifyLuceneQueryResultInEachVM$b83f705c$2(LuceneSearchWithRollingUpgradeDUnit.java:553) Caused by: org.apache.geode.InternalGemFireError: Unable to create index repository at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.lambda$computeRepository$0(AbstractPartitionedRepositoryManager.java:118) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager$$Lambda$35/248730969.apply(Unknown Source) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.computeRepository(AbstractPartitionedRepositoryManager.java:108) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.getRepository(AbstractPartitionedRepositoryManager.java:137) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.getRepositories(AbstractPartitionedRepositoryManager.java:91) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.distributed.LuceneQueryFunction.execute(LuceneQueryFunction.java:105) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.internal.cache.PartitionedRegionDataStore.executeOnDataStore(PartitionedRegionDataStore.java:2911) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.internal.cache.partitioned.PartitionedRegionFunctionStreamingMessage.operateOnPartitionedRegion(PartitionedRegionFunctionStreamingMessage.java:98) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:332) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in 
org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:1108) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in java.lang.Thread.run(Thread.java:745) at org.apache.geode.distributed.internal.ReplyException.handleAsUnexpected(ReplyException.java:89) at org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.waitForCacheOrFunctionException(FunctionStreamingResultCollector.java:465) at org.apache.geode.internal.cache.partitioned.PRFunctionStreamingResultCollector.getResult(PRFunctionStreamingResultCollector.java:91) at org.apache.geode.cache.lucene.internal.LuceneQueryImpl.findTopEntries(LuceneQueryImpl.java:121) at org.apache.geode.cache.lucene.internal.LuceneQueryImpl.findKeys(LuceneQueryImpl.java:69) ... 31 more Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(segments_2)): 7 (needs to be between 4 and 6) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:216) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:302) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:286) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.IndexRepositoryFactory.computeIndexRepository(IndexRepositoryFactory.java:84) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.PartitionedRepositoryManager.computeRepository(PartitionedRepositoryManager.java:42) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.lambda$computeRepository$0(AbstractPartitionedRepositoryManager.java:116) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager$$Lambda$35/248730969.apply(Unknown Source) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.computeRepository(AbstractPartitionedRepositoryManager.java:108) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.getRepository(AbstractPartitionedRepositoryManager.java:137) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.getRepositories(AbstractPartitionedRepositoryManager.java:91) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.cache.lucene.internal.distributed.LuceneQueryFunction.execute(LuceneQueryFunction.java:105) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in 
org.apache.geode.internal.cache.PartitionedRegionDataStore.executeOnDataStore(PartitionedRegionDataStore.java:2911) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.internal.cache.partitioned.PartitionedRegionFunctionStreamingMessage.operateOnPartitionedRegion(PartitionedRegionFunctionStreamingMessage.java:98) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:332) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:1108) at Remote Member '192.168.2.6(57006)<v3>:32773(version:GEODE 1.2.0)' in java.lang.Thread.run(Thread.java:745)
Analysis
This test does not fail every time.
The test:
- creates a LuceneIndex on 3 older-version servers
- rolls each server one at a time and, for each:
  - creates the region
  - executes a query
  - verifies the results
The LuceneQueryImpl.findKeys method causes LuceneQueryFunction to be invoked in an older member. That older member tries to create an IndexWriter and fails.
The exception in the older member is:
[vm2_v120] [info 2017/10/25 16:33:49.576 PDT <Pooled Waiting Message Processor 7> tid=0x78] Exception thrown while constructing Lucene Index for bucket:53 for file region:/__PR/_B__index#__aRegion.files_53 [vm2_v120] [warn 2017/10/25 16:33:49.577 PDT <Pooled Waiting Message Processor 7> tid=0x78] Task failed with exception [vm2_v120] org.apache.geode.InternalGemFireError: Unable to create index repository [vm2_v120] at org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.lambda$computeRepository$0(AbstractPartitionedRepositoryManager.java:118) [vm2_v120] at org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager$$Lambda$35/1749028657.apply(Unknown Source) [vm2_v120] at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853) [vm2_v120] at org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.computeRepository(AbstractPartitionedRepositoryManager.java:108) [vm2_v120] at org.apache.geode.cache.lucene.internal.LuceneBucketListener.lambda$afterPrimary$0(LuceneBucketListener.java:40) [vm2_v120] at org.apache.geode.cache.lucene.internal.LuceneBucketListener$$Lambda$34/1771106111.run(Unknown Source) [vm2_v120] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [vm2_v120] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [vm2_v120] at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662) [vm2_v120] at org.apache.geode.distributed.internal.DistributionManager$6$1.run(DistributionManager.java:989) [vm2_v120] at java.lang.Thread.run(Thread.java:745) [vm2_v120] Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(segments_2)): 7 (needs to be between 4 and 6) [vm2_v120] at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:216) [vm2_v120] at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:302) [vm2_v120] at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:286) [vm2_v120] at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938) [vm2_v120] at org.apache.geode.cache.lucene.internal.IndexRepositoryFactory.computeIndexRepository(IndexRepositoryFactory.java:84) [vm2_v120] at org.apache.geode.cache.lucene.internal.PartitionedRepositoryManager.computeRepository(PartitionedRepositoryManager.java:42) [vm2_v120] at org.apache.geode.cache.lucene.internal.AbstractPartitionedRepositoryManager.lambda$computeRepository$0(AbstractPartitionedRepositoryManager.java:116) [vm2_v120] ... 10 more
The SegmentInfos.readCommit method is reading an int from a FileIndexInput, which is obtained from RegionDirectory.openInput:
public IndexInput openInput(String name, IOContext context) throws IOException {
  ensureOpen();
  // Look up the file by name in the region-backed file system
  final File file = fs.getFile(name);
  return new FileIndexInput(name, file);
}
The FileSystem.getFile method does a get on the fileAndChunkRegion:
final File file = (File) fileAndChunkRegion.get(name);
I'm not sure which File is being requested, but all of the activity on that bucket is from the current-version vm0:
[vm0] [warn 2017/10/25 16:33:37.613 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1.fdx [vm0] [warn 2017/10/25 16:33:37.620 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1.fdt [vm0] [warn 2017/10/25 16:33:37.672 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1.nvd [vm0] [warn 2017/10/25 16:33:37.682 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1.nvm [vm0] [warn 2017/10/25 16:33:37.697 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1.nvd [vm0] [warn 2017/10/25 16:33:37.699 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1.nvm [vm0] [warn 2017/10/25 16:33:37.703 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1.fdt [vm0] [warn 2017/10/25 16:33:37.706 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1.fdx [vm0] [warn 2017/10/25 16:33:37.708 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1_Lucene50_0.doc [vm0] [warn 2017/10/25 16:33:37.711 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1_Lucene50_0.pos [vm0] [warn 2017/10/25 16:33:37.714 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1_Lucene50_0.tim [vm0] [warn 2017/10/25 16:33:37.717 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1_Lucene50_0.tip [vm0] [warn 2017/10/25 16:33:37.721 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1_Lucene50_0.tim [vm0] [warn 2017/10/25 16:33:37.724 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1_Lucene50_0.tip [vm0] [warn 2017/10/25 16:33:37.725 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1_Lucene50_0.doc [vm0] [warn 2017/10/25 16:33:37.729 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1_Lucene50_0.pos [vm0] [warn 2017/10/25 16:33:37.730 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1.fnm [vm0] [warn 2017/10/25 16:33:37.733 PDT 
<Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1.fnm [vm0] [warn 2017/10/25 16:33:37.737 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1.cfs [vm0] [warn 2017/10/25 16:33:37.739 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1.cfe [vm0] [warn 2017/10/25 16:33:37.750 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1.cfe [vm0] [warn 2017/10/25 16:33:37.752 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1.cfs [vm0] [warn 2017/10/25 16:33:37.754 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=_1.si [vm0] [warn 2017/10/25 16:33:37.758 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=_1.si [vm0] [warn 2017/10/25 16:33:37.778 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.createFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=pending_segments_2 [vm0] [warn 2017/10/25 16:33:37.788 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=pending_segments_2 [vm0] [warn 2017/10/25 16:33:37.789 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.renameFile fileAndChunkRegion=_B__index#__aRegion.files_53; source=pending_segments_2 [vm0] [warn 2017/10/25 16:33:37.790 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.updateFile fileAndChunkRegion=_B__index#__aRegion.files_53; key=pending_segments_2 [vm0] [warn 2017/10/25 16:33:37.792 PDT <Event Processor for GatewaySender_AsyncEventQueue_index#_aRegion_3> tid=0x5a] FileSystem.putIfAbsentFile fileAndChunkRegion=_B__index#__aRegion.files_53; name=segments_2
So it looks like a newer-version server running Lucene 7 puts a File into the fileAndChunkRegion, and an older-version server running Lucene 6 gets it out and fails to read it.
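To make the failure mode concrete, here is a simplified, hypothetical sketch of the version check that fails. The real check lives in Lucene's CodecUtil/SegmentInfos (see the stack traces above), not in Geode; the constants simply mirror the "7 (needs to be between 4 and 6)" message.

import java.io.IOException;
import org.apache.lucene.index.IndexFormatTooNewException;
import org.apache.lucene.store.DataInput;

// Hypothetical illustration only: an older member running Lucene 6 reads the
// segments_2 file (fetched via the fileAndChunkRegion) and rejects the newer format.
final class SegmentFormatCheckSketch {
  static final int MIN_SUPPORTED = 4; // oldest format a Lucene 6.4.1 reader accepts
  static final int MAX_SUPPORTED = 6; // newest format a Lucene 6.4.1 reader accepts

  static int readFormatVersion(DataInput in) throws IOException {
    int version = in.readInt(); // 7 when segments_2 was written by Lucene 7.1.0
    if (version > MAX_SUPPORTED) {
      // This is the IndexFormatTooNewException seen in the older member's log.
      throw new IndexFormatTooNewException(in, version, MIN_SUPPORTED, MAX_SUPPORTED);
    }
    return version;
  }
}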
I added the luceneQueryReturnsCorrectResultsAfterClientAndServersAreRolledOverAllBucketsCreated test to LuceneSearchWithRollingUpgradeDUnit, which reproduces this issue every time.
The main sequence in this test that causes the failure is:
- start two servers with the old version using Lucene 6
- roll one server to the new version using Lucene 7
- do puts into primary buckets in the new server, which creates entries in the fileAndChunk region in the Lucene 7 format
- stop the new-version server, which causes the old-version server to become primary for those buckets
- do a query, which causes the IndexFormatTooNewException to be thrown
Hydra Test Execution
The lucene.bt and backwardCompatibilityLucene.bt runs had only one hung test, which I don't think was related to Lucene 7.x.
lucene/serialParReg.conf P 00:11:14 closed/pivotalgf-assembly/build/hydraTest/serialParReg-1023-192645 lucene/concParReg.conf P 00:24:33 closed/pivotalgf-assembly/build/hydraTest/concParReg-1023-193804 lucene/serialParRegPersist.conf P 00:15:20 closed/pivotalgf-assembly/build/hydraTest/serialParRegPersist-1023-200244 lucene/serialParRegIncremental.conf P 00:10:49 closed/pivotalgf-assembly/build/hydraTest/serialParRegIncremental-1023-201809 lucene/concParRegPersist.conf P 00:22:47 closed/pivotalgf-assembly/build/hydraTest/concParRegPersist-1023-202902 lucene/concParRegIncremental.conf P 00:12:30 closed/pivotalgf-assembly/build/hydraTest/concParRegIncremental-1023-205153 lucene/concParRegBridge.conf P 00:25:06 closed/pivotalgf-assembly/build/hydraTest/concParRegBridge-1023-210430 lucene/concParRegBridge.conf P 00:25:05 closed/pivotalgf-assembly/build/hydraTest/concParRegBridge-1023-212942 lucene/concParRegBridge.conf P 00:25:46 closed/pivotalgf-assembly/build/hydraTest/concParRegBridge-1023-215452 lucene/concParRegBridge.conf P 00:23:51 closed/pivotalgf-assembly/build/hydraTest/concParRegBridge-1023-222043 lucene/concParRegBridgePersist.conf P 00:24:18 closed/pivotalgf-assembly/build/hydraTest/concParRegBridgePersist-1023-224441 lucene/concParRegBridgePersist.conf P 00:23:09 closed/pivotalgf-assembly/build/hydraTest/concParRegBridgePersist-1023-230905 lucene/concParRegBridgePersist.conf P 00:22:24 closed/pivotalgf-assembly/build/hydraTest/concParRegBridgePersist-1023-233220 lucene/concParRegBridgePersist.conf P 00:22:15 closed/pivotalgf-assembly/build/hydraTest/concParRegBridgePersist-1023-235449 lucene/serialParRegHA.conf P 00:10:56 closed/pivotalgf-assembly/build/hydraTest/serialParRegHA-1024-001710 lucene/concParRegHA.conf P 00:32:02 closed/pivotalgf-assembly/build/hydraTest/concParRegHA-1024-002811 lucene/serialParRegHA.conf P 00:11:25 closed/pivotalgf-assembly/build/hydraTest/serialParRegHA-1024-010017 lucene/concParRegHA.conf P 00:32:37 closed/pivotalgf-assembly/build/hydraTest/concParRegHA-1024-011147 lucene/serialParRegHAPersist.conf P 00:13:38 closed/pivotalgf-assembly/build/hydraTest/serialParRegHAPersist-1024-014428 lucene/serialParRegHAPersist.conf P 00:13:36 closed/pivotalgf-assembly/build/hydraTest/serialParRegHAPersist-1024-015811 lucene/concParRegHAPersist.conf P 00:19:42 closed/pivotalgf-assembly/build/hydraTest/concParRegHAPersist-1024-021152 lucene/concParRegHAPersist.conf P 00:20:06 closed/pivotalgf-assembly/build/hydraTest/concParRegHAPersist-1024-023138 lucene/concParRegHABridge.conf P 00:34:45 closed/pivotalgf-assembly/build/hydraTest/concParRegHABridge-1024-025150 lucene/concParRegHABridge.conf P 00:33:29 closed/pivotalgf-assembly/build/hydraTest/concParRegHABridge-1024-032640 lucene/concParRegHABridge.conf P 00:33:07 closed/pivotalgf-assembly/build/hydraTest/concParRegHABridge-1024-040015 lucene/concParRegHABridge.conf P 00:33:54 closed/pivotalgf-assembly/build/hydraTest/concParRegHABridge-1024-043328 lucene/concParRegHABridgePersist.conf P 00:21:58 closed/pivotalgf-assembly/build/hydraTest/concParRegHABridgePersist-1024-050729 lucene/concParRegHABridgePersist.conf P 00:21:55 closed/pivotalgf-assembly/build/hydraTest/concParRegHABridgePersist-1024-052931 lucene/concParRegRecover.conf P 00:13:42 closed/pivotalgf-assembly/build/hydraTest/concParRegRecover-1024-055135 lucene/concParRegRecover.conf P 00:15:49 closed/pivotalgf-assembly/build/hydraTest/concParRegRecover-1024-060521 lucene/concParRegRecover.conf P 00:14:59 
closed/pivotalgf-assembly/build/hydraTest/concParRegRecover-1024-062115 lucene/luceneQueriesWithRebalance.conf P 00:06:02 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithRebalance-1024-063618 lucene/luceneQueriesWithRebalance.conf P 00:06:53 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithRebalance-1024-064226 lucene/luceneQueriesWithRebalancePersist.conf P 00:11:01 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithRebalancePersist-1024-064923 lucene/luceneQueriesWithRebalancePersist.conf P 00:09:16 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithRebalancePersist-1024-070030 lucene/luceneQueriesWithHA.conf P 00:12:03 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithHA-1024-070951 lucene/luceneQueriesWithHA.conf H 00:16:37 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithHA-1024-072200 lucene/luceneQueriesWithHA.conf P 00:12:38 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithHA-1024-073844 lucene/luceneQueriesWithHA.conf P 00:12:56 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithHA-1024-075128 lucene/luceneQueriesWithHA.conf P 00:13:05 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithHA-1024-080428 lucene/luceneQueriesWithHAPersist.conf P 00:14:38 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithHAPersist-1024-081740 lucene/luceneQueriesWithHAPersist.conf P 00:15:38 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithHAPersist-1024-083223 lucene/luceneQueriesWithHAPersist.conf P 00:15:51 closed/pivotalgf-assembly/build/hydraTest/luceneQueriesWithHAPersist-1024-084807 lucene/concDestroyIndex.conf P 00:12:10 closed/pivotalgf-assembly/build/hydraTest/concDestroyIndex-1024-090403 lucene/backwardCompatibility/concParRegBridge.conf P 00:25:50 closed/pivotalgf-assembly/build/hydraTest/concParRegBridge-1024-091618 lucene/backwardCompatibility/concParRegBridge.conf P 00:25:09 closed/pivotalgf-assembly/build/hydraTest/concParRegBridge-1024-094213 lucene/backwardCompatibility/concParRegBridgePersist.conf P 00:23:35 closed/pivotalgf-assembly/build/hydraTest/concParRegBridgePersist-1024-100726 lucene/backwardCompatibility/concParRegBridgePersist.conf P 00:24:43 closed/pivotalgf-assembly/build/hydraTest/concParRegBridgePersist-1024-103106 lucene/backwardCompatibility/concParRegHABridge.conf P 00:33:06 closed/pivotalgf-assembly/build/hydraTest/concParRegHABridge-1024-105556 lucene/backwardCompatibility/concParRegHABridgePersist.conf P 00:21:18 closed/pivotalgf-assembly/build/hydraTest/concParRegHABridgePersist-1024-112907
The errors.txt file contained:
THREAD vm_0_thr_7_accessor1_mclaren_11724 Subthread Dynamic Client VM Stopper
HANG Timeout during dynamic action: Failed to stop vm_3 within 300 seconds
hydra.HydraTimeoutException: Failed to stop vm_3 within 300 seconds
    at hydra.ClientMgr.killClientVm(ClientMgr.java:969)
    at hydra.ClientMgr.stopClientVm(ClientMgr.java:819)
    at hydra.ClientMgr._stopClientVm(ClientMgr.java:761)
    at hydra.ClientMgr$2.run(ClientMgr.java:720)
    at java.lang.Thread.run(Thread.java:745)
--------------------------------------------------------------------------------
The thread dump contained these two threads that look interesting:
"vm_0_thr_7_accessor1_mclaren_11724 Subthread" #7932 daemon prio=5 os_prio=0 tid=0x00007fd5e0003800 nid=0x2c56 in Object.wait() [0x00007fd6f2aeb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at hydra.DynamicActionUtil.waitForDynamicActionToComplete(DynamicActionUtil.java:79) - locked <0x00000000ec1b8740> (a hydra.DynamicActionUtil$PidSignal) at hydra.DynamicActionUtil.runActionThread(DynamicActionUtil.java:53) at hydra.ClientVmMgr.reserveAndStop(ClientVmMgr.java:567) at hydra.ClientVmMgr.stop(ClientVmMgr.java:534) at hydra.ClientVmMgr.stop(ClientVmMgr.java:390) at util.StopStartVMs.run(StopStartVMs.java:86) at java.lang.Thread.run(Thread.java:745) "vm_0_thr_7_accessor1_mclaren_11724" #7928 daemon prio=5 os_prio=0 tid=0x00007fd6f8006000 nid=0x2c47 in Object.wait() [0x00007fd74962f000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1245) - locked <0x00000000ec1bb1b0> (a hydra.HydraSubthread) at java.lang.Thread.join(Thread.java:1319) at util.StopStartVMs.joinStopStart(StopStartVMs.java:400) at util.StopStartVMs.stopStartVMs(StopStartVMs.java:307) at parReg.ParRegTest.cycleVMsNoWait(ParRegTest.java:5029) at lucene.LuceneTest.stopStartVMs(LuceneTest.java:641) at lucene.LuceneTest.HydraTask_stopStartVMs(LuceneTest.java:637) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at hydra.MethExecutor.execute(MethExecutor.java:182) at hydra.MethExecutor.execute(MethExecutor.java:150) at hydra.TestTask.execute(TestTask.java:191) at hydra.RemoteTestModule$1.run(RemoteTestModule.java:212)