Graph Traversal | Apache Solr Reference Guide 6.6

Graph traversal with streaming expressions uses the gatherNodes function to perform a breadth-first graph traversal.

The gatherNodes function can be combined with the scoreNodes function to provide recommendations. gatherNodes can also be combined with the wider streaming expression library to perform complex operations on gathered node sets.

gatherNodes traversals are distributed within a SolrCloud collection and can span collections.

gatherNodes is designed for use cases that involve zooming into a neighborhood in the graph and performing precise traversals to gather node sets and aggregations. In these types of use cases gatherNodes will often provide sub-second performance. Some sample use cases are provided later in the document.

This document assumes a basic understanding of graph terminology and streaming expressions. You can begin exploring graph traversal concepts with this Wikipedia article. More details about streaming expressions are available in this Guide, in the section Streaming Expressions.

Basic Syntax

We’ll start with the most basic syntax and slowly build up more complexity. The most basic syntax for gatherNodes is:

gatherNodes(emails,
            walk="johndoe@apache.org->from",
            gather="to")

Let’s break down this simple expression.

The first parameter, emails, is the collection being traversed. The second parameter, walk, maps a hard-coded node ID ("johndoe@apache.org") to a field in the index (from). This will return all the edges in the index that have johndoe@apache.org in the from field.

The gather parameter tells the function to gather the values in the `to `field. The values that are gathered are the node IDs emitted by the function.

In the example above the nodes emitted will be all of the people that "johndoe@apache.org" has emailed.

The walk parameter also accepts a list of root node IDs:

gatherNodes(emails,
            walk="johndoe@apache.org, janesmith@apache.org->from",
            gather="to")

The gatherNodes function above finds all the edges with "johndoe@apache.org" or "janesmith@apache.org" in the from field and gathers the to field.

Like all Streaming Expressions, you can execute a gatherNodes expression by sending it to the /stream handler. For example:

curl --data-urlencode 'expr=gatherNodes(emails,
                                        walk="johndoe@apache.org, janesmith@apache.org->from",
                                        gather="to")' http://localhost:8983/solr/emails/stream

The output of this expression would look like this:

{
  "result-set": {
    "docs": [
      {
        "node": "slist@campbell.com",
        "collection": "emails",
        "field": "to",
        "level": 1
      },
      {
        "node": "catherine.pernot@enron.com",
        "collection": "emails",
        "field": "to",
        "level": 1
      },
      {
        "node": "airam.arteaga@enron.com",
        "collection": "emails",
        "field": "to",
        "level": 1
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 44
      }
    ]
  }
}

All of the tuples returned have the node field. The node field contains the node IDs gathered by the function. The collection, field, and level of the traversal are also included in the output.

Notice that the level is "1" for each tuple in the example. The root nodes are level 0 (in the example above, the root nodes are "johndoe@apache.org, janesmith@apache.org") By default the gatherNodes function emits only the *leaf nodes* of the traversal, which is the outer-most node set. To emit the root nodes you can specify the scatter parameter:

gatherNodes(emails,
            walk="johndoe@apache.org->from",
            gather="to",
            scatter="branches, leaves")

The scatter parameter controls whether to emit the branches with the leaves. The root nodes are considered "branches" because they are not the outer-most level of the traversal.

When scattering both branches and leaves the output would like this:

{
  "result-set": {
    "docs": [
      {
        "node": "johndoe@apache.org",
        "collection": "emails",
        "field": "node",
        "level": 0
      },
      {
        "node": "slist@campbell.com",
        "collection": "emails",
        "field": "to",
        "level": 1
      },
      {
        "node": "catherine.pernot@enron.com",
        "collection": "emails",
        "field": "to",
        "level": 1
      },
      {
        "node": "airam.arteaga@enron.com",
        "collection": "emails",
        "field": "to",
        "level": 1
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 44
      }
    ]
  }
}

Now the level 0 root node is included in the output.

Aggregations

gatherNodes also supports aggregations. For example:

gatherNodes(emails,
            walk="johndoe@apache.org, janesmith@apache.org->from",
            gather="to",
            count(*))

The expression above finds the edges with "johndoe@apache.org" or "janesmith@apache.org" in the from field and gathers the values from the to field. It also aggregates the count for each node ID gathered.

A gathered node could have a count of 2 if both "johndoe@apache.org" and "janesmith@apache.org" have emailed the same person. Node sets contain a unique set of nodes, so the same person won’t appear twice in the node set, but the count will reflect that it appeared twice during the traversal.

Edges are uniqued as part of the traversal so the count will not reflect the number of times "johndoe@apache.org" emailed the same person. For example, personA might have emailed personB 100 times. These edges would get uniqued and only be counted once. But if person personC also emailed personB this would increment the count for personB.

The aggregation functions supported are count(*), sum(field), min(field), max(field), and avg(field). The fields being aggregated should be present in the edges collected during the traversal. Later examples (below) will show aggregations can be a powerful tool for providing recommendations and limiting the scope of traversals.

Nesting gatherNodes functions

The gatherNodes function can be nested to traverse deeper into the graph. For example:

gatherNodes(emails,
            gatherNodes(emails,
                        walk="johndoe@apache.org->from",
                        gather="to"),
            walk="node->from",
            gather="to")

In the example above the outer gatherNodes function operates on the node set collected from the inner gatherNodes function.

Notice that the inner gatherNodes function behaves exactly as the examples already discussed. But the walk parameter of the outer gatherNodes function behaves differently.

In the outer gatherNodes function the walk parameter works with tuples coming from an internal streaming expression. In this scenario the walk parameter maps the node field to the from field. Remember that the node IDs collected from the inner gatherNodes expression are placed in the node field.

Put more simply, the inner expression gathers all the people that "johndoe@apache.org" has emailed. We can call this group the "friends of johndoe@apache.org". The outer expression gathers all the people that the "friends of johndoe@apache.org" have emailed. This is a basic friends-of-friends traversal.

This construct of nesting gatherNodes functions is the basic technique for doing a controlled traversal through the graph.

Cycle Detection

The gatherNodes function performs cycle detection across the entire traversal. This ensures that nodes that have already been visited are not traversed again. Cycle detection is important for both limiting the size of traversals and gathering accurate aggregations. Without cycle detection the size of the traversal could grow exponentially with each hop in the traversal. With cycle detection only new nodes encountered are traversed.

Cycle detection does not cross collection boundaries. This is because internally the collection name is part of the node ID. For example the node ID "johndoe@apache.org", is really emails/johndoe@apache.org. When traversing to another collection "johndoe@apache.org" will be traversed.

Filtering the Traversal

Each level in the traversal can be filtered with a filter query. For example:

gatherNodes(emails,
            walk="johndoe@apache.org->from",
            fq="body:(solr rocks)",
            gather="to")

In the example above only emails that match the filter query will be included in the traversal. Any Solr query can be included here. So you can do fun things like geospatial queries, apply any of the available query parsers, or even write custom query parsers to limit the traversal.

Root Streams

Any streaming expression can be used to provide the root nodes for a traversal. For example:

gatherNodes(emails,
            search(emails, q="body:(solr rocks)", fl="to", sort="score desc", rows="20")
            walk="to->from",
            gather="to")

The example above provides the root nodes through a search expression. You can also provide arbitrarily complex, nested streaming expressions with joins, etc., to specify the root nodes.

Notice that the walk parameter maps a field from the tuples generated by the inner stream. In this case it maps the to field from the inner stream to the from field.

Skipping High Frequency Nodes

It’s often desirable to skip traversing high frequency nodes in the graph. This is similar in nature to a search term stop list. The best way to describe this is through an example use case.

Let’s say that you want to recommend content for a user based on a collaborative filter. Below is one approach for a simple collaborative filter:

Find all content userA has read.
Find users whose reading list is closest to userA. These are users with similar tastes as userA.
Recommend content based on what the users in step 2 have read, that userA has not yet read.

Look closely at step 2. In large graphs, step 2 can lead to a very large traversal. This is because userA may have viewed content that has been viewed by millions of other people. We may want to skip these high frequency nodes for two reasons:

A large traversal that visit millions of unique nodes is slow and takes a lot of memory because cycle detection is tracked in memory.
High frequency nodes are also not useful in determining users with similar tastes. The content that fewer people have viewed provides a more precise recommendation.

The gatherNodes function has the maxDocFreq param to allow for filtering out high frequency nodes. The sample code below shows steps 1 and 2 of the recommendation:

 gatherNodes(logs,
             search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:view", qt="/export"),
             walk="articleID->articleID",
             gather="userID",
             fq="action:view",
             maxDocFreq="10000",
             count(*)))

In the example above, the inner search expression searches the logs collection and returning all the articles viewed by "user1". The outer gatherNodes expression takes all the articles emitted from the inner search expression and finds all the records in the logs collection for those articles. It then gathers and aggregates the users that have read the articles. The maxDocFreq parameter limits the articles returned to those that appear in no more then 10,000 log records (per shard). This guards against returning articles that have been viewed by millions of users.

Tracking the Traversal

By default the gatherNodes function only tracks enough information to do cycle detection. This provides enough information to output the nodes and aggregations in the graph.

For some use cases, such as graph visualization, we also need to output the edges. Setting trackTraversal="true" tells gatherNodes to track the connections between nodes, so the edges can be constructed. When trackTraversal is enabled a new ancestors property will appear with each node. The ancestors property contains a list of node IDs that pointed to the node.

Below is a sample gatherNodes expression with trackTraversal set to true:

gatherNodes(emails,
            gatherNodes(emails,
                        walk="johndoe@apache.org->from",
                        gather="to",
                        trackTraversal="true"),
            walk="node->from",
            trackTraversal="true",
            gather="to")

Cross-Collection Traversals

Nested gatherNodes functions can operate on different SolrCloud collections. This allow traversals to "walk" from one collection to another to gather nodes. Cycle detection does not cross collection boundaries, so nodes collected in one collection will be traversed in a different collection. This was done deliberately to support cross-collection traversals. Note that the output from a cross-collection traversal will likely contain duplicate nodes with different collection attributes.

Below is a sample gatherNodes expression that traverses from the "emails" collection to the "logs" collection:

gatherNodes(logs,
            gatherNodes(emails,
                        search(emails, q="body:(solr rocks)", fl="from", sort="score desc", rows="20")
                        walk="from->from",
                        gather="to",
                        scatter="leaves, branches"),
            walk="node->user",
            fq="action:edit",
            gather="contentID")

The example above finds all people who sent emails with a body that contains "solr rocks". It then finds all the people these people have emailed. Then it traverses to the logs collection and gathers all the content IDs that these people have edited.

Combining gatherNodes With Other Streaming Expressions

The gatherNodes function can act as both a stream source and a stream decorator. The connection with the wider stream expression library provides tremendous power and flexibility when performing graph traversals. Here is an example of using the streaming expression library to intersect two friend networks:

            intersect(on="node",
                      sort(by="node asc",
                           gatherNodes(emails,
                                       gatherNodes(emails,
                                                   walk="johndoe@apache.org->from",
                                                   gather="to"),
                                       walk="node->from",
                                       gather="to",
                                       scatter="branches,leaves")),
                       sort(by="node asc",
                            gatherNodes(emails,
                                        gatherNodes(emails,
                                                    walk="janedoe@apache.org->from",
                                                    gather="to"),
                                        walk="node->from",
                                        gather="to",
                                        scatter="branches,leaves")))

The example above gathers two separate friend networks, one rooted with "johndoe@apache.org" and another rooted with "janedoe@apache.org". The friend networks are then sorted by the node field, and intersected. The resulting node set will be the intersection of the two friend networks.

Sample Use Cases

Calculate Market Basket Co-occurrence

It is often useful to know which products are most frequently purchased with a particular product. This example uses a simple market basket table (indexed in Solr) to store past shopping baskets. The schema for the table is very simple with each row containing a basketID and a productID. This can be seen as a graph with each row in the table representing an edge. And it can be traversed very quickly to calculate basket co-occurrence, even when the graph contains billions of edges.

Here is the sample syntax:

top(n="5",
    sort="count(*) desc",
    gatherNodes(baskets,
                random(baskets, q="productID:ABC", fl="basketID", rows="500"),
                walk="basketID->basketID",
                fq="-productID:ABC",
                gather="productID",
                count(*)))

Let’s break down exactly what this traversal is doing.

The first expression evaluated is the inner random expression, which returns 500 random basketIDs, from the baskets collection, that have the productID "ABC". The random expression is very useful for recommendations because it limits the traversal to a fixed set of baskets, and because it adds the element of surprise into the recommendation. Using the random function you can provide fast sample sets from very large graphs.
The outer gatherNodes expression finds all the records in the baskets collection for the basketIDs generated in step 1. It also filters out productID "ABC" so it doesn’t show up in the results. It then gathers and counts the productID’s across these baskets.
The outer top expression ranks the productIDs emitted in step 2 by the count and selects the top 5.

In a nutshell this expression finds the products that most frequently co-occur with product "ABC" in past shopping baskets.

Using the scoreNodes Function to Make a Recommendation

This use case builds on the market basket example above that calculates which products co-occur most frequently with productID:ABC. The ranked co-occurrence counts provide candidates for a recommendation. The scoreNodes function can be used to score the candidates to find the best recommendation.

Before diving into the syntax of the scoreNodes function it’s useful to understand why the raw co-occurrence counts may not produce the best recommendation. The reason is that raw co-occurrence counts favor items that occur frequently across all baskets. A better recommendation would find the product that has the most significant relationship with productID ABC. The scoreNodes function uses a term frequency-inverse document frequency (TF-IDF) algorithm to find the most significant relationship.

How It Works

The scoreNodes function assigns a score to each node emitted by the gatherNodes expression. By default the scoreNodes function uses the count(*) aggregation, which is the co-occurrence count, as the TF value. The IDF value for each node is fetched from the collection where the node was gathered. Each node is then scored using the TF*IDF formula, which provides a boost to nodes with a lower frequency across all market baskets.

Combining the co-occurrence count with the IDF provides a score that shows how important the relationship is between productID ABC and the recommendation candidates.

The scoreNodes function adds the score to each node in the nodeScore field.

Example Syntax

top(n="1",
    sort="nodeScore desc",
    scoreNodes(top(n="50",
                   sort="count(*) desc",
                   gatherNodes(baskets,
                               random(baskets, q="productID:ABC", fl="basketID", rows="500"),
                               walk="basketID->basketID",
                               fq="-productID:ABC",
                               gather="productID",
                               count(*)))))

This example builds on the earlier example "Calculate market basket co-occurrence".

Notice that the inner-most top function is taking the top 50 products that co-occur most frequently with productID ABC. This provides 50 candidate recommendations.
The scoreNodes function then assigns a score to the candidates based on the TF*IDF of each node.
The outer top expression selects the highest scoring node. This is the recommendation.

Recommend Content Based on Collaborative Filter

In this example we’ll recommend content for a user based on a collaborative filter. This recommendation is made using log records that contain the userID and articleID and the action performed. In this scenario each log record can be viewed as an edge in a graph. The userID and articleID are the nodes and the action is an edge property used to filter the traversal.

Here is the sample syntax:

top(n="5",
    sort="count(*) desc",
    gatherNodes(logs,
                top(n="30",
                    sort="count(*) desc",
                    gatherNodes(logs,
                                search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:read", qt="/export"),
                                walk="articleID->articleID",
                                gather="userID",
                                fq="action:read",
                                maxDocFreq="10000",
                                count(*))),
                walk="node->userID",
                gather="articleID",
                fq="action:read",
                count(*)))

Let’s break down the expression above step-by-step.

The first expression evaluated is the inner search expression. This expression searches the logs collection for all records matching "user1". This is the user we are making the recommendation for.

There is a filter applied to pull back only records where the "action:read". It returns the articleID for each record found. In other words, this expression returns all the articles "user1" has read.
The inner gatherNodes expression operates over the articleIDs returned from step 1. It takes each articleID found and searches them against the articleID field.

Note that it skips high frequency nodes using the maxDocFreq param to filter out articles that appear over 10,000 times in the logs. It gathers userIDs and aggregates the counts for each user. This step finds the users that have read the same articles that "user1" has read and counts how many of the same articles they have read.
The inner top expression ranks the users emitted from step 2. It will emit the top 30 users who have the most overlap with user1’s reading list.
The outer gatherNodes expression gathers the reading list for the users emitted from step 3. It counts the articleIDs that are gathered.

Any article selected in step 1 (user1 reading list), will not appear in this step due to cycle detection. So this step returns the articles read by the users with the most similar readings habits to "user1" that "user1" has not read yet. It also counts the number of times each article has been read across this user group.
The outer top expression takes the top articles emitted from step 4. This is the recommendation.

Protein Pathway Traversal

In recent years, scientists have become increasingly able to rationally design drugs that target the mutated proteins, called oncogenes, responsible for some cancers. Proteins typically act through long chains of chemical interactions between multiple proteins, called pathways, and, while the oncogene in the pathway may not have a corresponding drug, another protein in the pathway may. Graph traversal on a protein collection that records protein interactions and drugs may yield possible candidates. (Thanks to Lewis Geer of the NCBI, for providing this example).

The example below illustrates a protein pathway traversal:

gatherNodes(proteins,
            gatherNodes(proteins,
                        walk="NRAS->name",
                        gather="interacts"),
            walk="node->name",
            gather="drug")

Let’s break down exactly what this traversal is doing.

The inner gatherNodes expression traverses in the proteins collection. It finds all the edges in the graph where the name of the protein is "NRAS". Then it gathers the proteins in the interacts field. This gathers all the proteins that "NRAS" interactions with.
The outer gatherNodes expression also works with the proteins collection. It gathers all the drugs that correspond to proteins emitted from step 1.
Using this stepwise approach you can gather the drugs along the pathway of interactions any number of steps away from the root protein.

Exporting GraphML to Support Graph Visualization

In the examples above, the gatherNodes expression was sent to Solr’s /stream handler like any other streaming expression. This approach outputs the nodes in the same JSON tuple format as other streaming expressions so that it can be treated like any other streaming expression. You can use the /stream handler when you need to operate directly on the tuples, such as in the recommendation use cases above.

There are other graph traversal use cases that involve graph visualization. Solr supports these use cases with the introduction of the /graph request handler, which takes a gatherNodes expression and outputs the results in GraphML.

GraphML is an XML format supported by graph visualization tools such as Gephi, which is a sophisticated open source tool for statistically analyzing and visualizing graphs. Using a gatherNodes expression, parts of a larger graph can be exported in GraphML and then imported into tools like Gephi.

There are a few things to keep mind when exporting a graph in GraphML:

The /graph handler can export both the nodes and edges in the graph. By default, it only exports the nodes. To export the edges you must set trackTraversal="true" in the gatherNodes expression.
The /graph handler currently accepts an arbitrarily complex streaming expression which includes a gatherNodes expression. If the streaming expression doesn’t include a gatherNodes expression, the /graph handler will not properly output GraphML.
The /graph handler currently accepts a single arbitrarily complex, nested gatherNodes expression per request. This means you cannot send in a streaming expression that joins or intersects the node sets from multiple gatherNodes expressions. The /graph handler does support any level of nesting within a single gatherNodes expression. The /stream handler does support joining and intersecting node sets, but the /graph handler currently does not.

Sample Request

curl --data-urlencode 'expr=gatherNodes(enron_emails,
                                        gatherNodes(enron_emails,
                                                    walk="kayne.coulter@enron.com->from",
                                                    trackTraversal="true",
                                                    gather="to"),
                                        walk="node->from",
                                        scatter="leaves,branches",
                                        trackTraversal="true",
                                        gather="to")' http://localhost:8983/solr/enron_emails/graph

Sample GraphML Output

<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph id="G" edgedefault="directed">
     <node id="kayne.coulter@enron.com">
           <data key="field">node</data>
           <data key="level">0</data>
           <data key="count(*)">0.0</data>
     </node>
     <node id="don.baughman@enron.com">
           <data key="field">to</data>
           <data key="level">1</data>
           <data key="count(*)">1.0</data>
     </node>
     <edge id="1"  source="kayne.coulter@enron.com"  target="don.baughman@enron.com"/>
     <node id="john.kinser@enron.com">
           <data key="field">to</data>
           <data key="level">1</data>
           <data key="count(*)">1.0</data>
    </node>
    <edge id="2"  source="kayne.coulter@enron.com"  target="john.kinser@enron.com"/>
    <node id="jay.wills@enron.com">
          <data key="field">to</data>
          <data key="level">1</data>
          <data key="count(*)">1.0</data>
    </node>
    <edge id="3"  source="kayne.coulter@enron.com"  target="jay.wills@enron.com"/>
</graph></graphml>

Stream Evaluator Reference Parallel SQL Interface