WordPress Recommendations with Neo4j – Part 4: PageRank with APOC Procedures

Since the 3.0 release, Neo4j has supported Procedures. Unlike Unmanaged Extensions, which are called via the REST API, Procedures can be invoked directly from a Cypher statement. The values a procedure yields, whether nodes, relationships or other arbitrary values, can then be used in the rest of the statement.

CALL my.procedure() YIELD node, score
SET node.score = score

Unfortunately for me, Procedures, like Unmanaged Extensions, are written in Java. Luckily for me and the Neo4j community as a whole, Neo has already done a lot of the heavy lifting and combined a huge number of utilities and algorithms into a single, easy-to-install package: APOC Procedures.

APOC Procedures

APOC – the technician of the Nebuchadnezzar in the Matrix, “Awesome Package of Components” or “Awesome Procedures on Cypher” depending on who you ask – comes with over 200 procedures out of the box, and thanks to the awesome Neo4j community that number is constantly growing. APOC procedures range from simple utilities, such as loading JSON from a URL or working with dates and times, all the way to complex graph algorithms including betweenness centrality and Dijkstra’s shortest path algorithm.

Installing APOC

Installing APOC is surprisingly easy. Head over to the GitHub repository, find the appropriate release for your version of Neo4j (3.1.x, 3.2.x), download the apoc-3.x.jar file, place it in the plugins directory of your Neo4j install and restart the server.

Once installed, open the Neo4j Browser at :7474 and run the following query to get a list of all procedures with their names and descriptions.

CALL apoc.help("apoc") YIELD name, text

The full documentation can be found at https://neo4j-contrib.github.io/neo4j-apoc-procedures/


PageRank, named after Google co-founder Larry Page, is the algorithm used by Google to rank its search engine results. The algorithm uses the number and quality of links to a node within our graph to provide a centrality score: the higher, the better. PageRank is a variant of what is known as Eigenvector Centrality, where the rank of a node is affected by the rank of adjacent nodes. Relationships with higher scoring nodes contribute more to the score of a node than connections with lower scoring nodes.
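To make the idea concrete, here is a minimal power-iteration sketch in plain Python. It is not APOC's implementation: the function name `pagerank`, the toy node names and the edge directions are hypothetical, and the damping factor of 0.85 is simply the value conventionally used with PageRank.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [t for s, t in edges if s == n] for n in nodes}
    # Start with the total score of 1.0 split evenly between all nodes.
    scores = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(iterations):
        # Every node receives the same small "random jump" share each round.
        new_scores = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            # A node without outgoing links spreads its score over all nodes.
            targets = out_links[source] or nodes
            share = damping * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical toy graph: both posts have exactly two incoming links,
# but one of post_b's links comes from the well-connected post_a.
edges = [
    ("user_1", "post_a"),
    ("user_2", "post_a"),
    ("user_3", "post_b"),
    ("post_a", "post_b"),
]
scores = pagerank(edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.4f}")
```

In this toy graph post_b ends up ranked above post_a even though both have two incoming links, because one of post_b's links comes from a node that itself scores highly.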

In our recommendation graph, the algorithm will count the connections to categories and, most importantly, the links between Users and Posts to rank nodes by importance.

One gotcha of this procedure in APOC is that, although we can specify the nodes we would like scores for, the procedure will compute scores for the entire graph at once. In reality this is a good thing, because the score of each node depends on the scores of the nodes around it, so a ranking is only meaningful when the whole graph is taken into account. It can, however, be very memory intensive on larger graphs.

We can view a list of the PageRank procedures by running the following query.

CALL apoc.help("apoc.algo.pagerank") YIELD name, text

name | text
apoc.algo.pageRank | CALL apoc.algo.pageRank(nodes) YIELD node, score – calculates page rank for given nodes
apoc.algo.pageRankStats | CALL apoc.algo.pageRankStats({iterations:_,types:_,write:true,…}) YIELD nodeCount – calculates page rank on graph for given nodes and potentially writes back
apoc.algo.pageRankWithConfig | CALL apoc.algo.pageRankWithConfig(nodes,{iterations:_,types:_}) YIELD node, score, info – calculates page rank for given nodes
apoc.algo.pageRankWithCypher | CALL apoc.algo.pageRankWithCypher({iterations, node_cypher, rel_cypher, write, property, numCpu}) – calculates page rank based on cypher input


To run PageRank, we need to provide a collection of nodes to the apoc.algo.pageRank procedure.

MATCH (p:Post) WITH COLLECT(p) as posts
CALL apoc.algo.pageRank(posts) YIELD node, score
RETURN node.title, score

This query will return results similar to those below.

node.title | score
WordPress Recommendations with Neo4j | 0.8957
Quick TDD setup with Node, ES6, Gulp and Mocha | 0.6485
ES6 Import & Export – A beginners guide | 0.5532
ES6 Promises – 5 Things I Wish I’d Known | 0.2284
2,100 startups in 1 building? | 0.2124

We can also return scores for Posts in a particular Category by tweaking the MATCH clause of the query above before passing the collection of nodes to the procedure.

MATCH (c:Category {slug:"neo4j"})<-[:HAS_TAXONOMY]-(p:Post) WITH COLLECT(p) as posts
CALL apoc.algo.pageRank(posts) YIELD node, score
RETURN node.title, score

Configuring PageRank

By default, the apoc.algo.pageRank procedure uses all relationships to calculate a ranking. Our dataset includes taxonomy tags that at this stage may not be useful for ranking our nodes.

The apoc.algo.pageRankWithConfig procedure allows us to restrict the relationship types used. It accepts two arguments: the nodes to calculate scores for, and a configuration object. To specify the relationship types we would like to include, we provide them as a pipe-delimited string. We can also set the number of iterations to run while computing the scores.

MATCH (p:Post) WITH COLLECT(p) AS posts
CALL apoc.algo.pageRankWithConfig(posts, {iterations:3, types:"VISITED"}) YIELD node, score
RETURN node, score
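The effect of restricting relationship types can be sketched in plain Python. Again, this is an illustration rather than APOC's implementation: the helper names `pagerank` and `pagerank_with_config`, the toy edges, the 0.85 damping factor and the default of 20 iterations are all assumptions. Only the `iterations` and `types` keys and the pipe-delimited format are taken from the query above.

```python
def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over (source, target) edges."""
    out_links = {n: [t for s, t in edges if s == n] for n in nodes}
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_scores = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            # Nodes without outgoing links spread their score over all nodes.
            targets = out_links[source] or nodes
            share = damping * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


def pagerank_with_config(typed_edges, config):
    """Keep only edges whose type appears in the pipe-delimited 'types' string."""
    nodes = sorted({n for s, _, t in typed_edges for n in (s, t)})
    types = config.get("types")
    allowed = set(types.split("|")) if types else None  # None means "all types"
    edges = [(s, t) for s, rel, t in typed_edges
             if allowed is None or rel in allowed]
    return pagerank(nodes, edges, iterations=config.get("iterations", 20))


# Hypothetical toy data: visits from users plus taxonomy links to one category.
typed_edges = [
    ("user_1", "VISITED", "post_a"),
    ("user_2", "VISITED", "post_a"),
    ("user_1", "VISITED", "post_b"),
    ("post_a", "HAS_TAXONOMY", "category_neo4j"),
    ("post_b", "HAS_TAXONOMY", "category_neo4j"),
    ("post_c", "HAS_TAXONOMY", "category_neo4j"),
]

visited_only = pagerank_with_config(typed_edges, {"iterations": 3, "types": "VISITED"})
all_types = pagerank_with_config(typed_edges, {"iterations": 3})

print("VISITED only:", max(visited_only, key=visited_only.get))
print("all types:   ", max(all_types, key=all_types.get))
```

With only VISITED edges the top node in this toy graph is post_a. Once the taxonomy links are included, the shared category node takes over, which is the kind of skew the types setting lets us avoid.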

You can view the full documentation for the PageRank algorithm procedure at https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_pagerank_algorithm

Other Useful Procedures

PageRank is just the tip of the iceberg when it comes to APOC procedures. We could also use the following procedures to get some interesting insight into our graph.

  • apoc.algo.closeness and apoc.algo.betweenness – calculate the connectivity and centrality of users and articles.
  • apoc.spatial.* – Use the User’s location to provide location-based recommendations or weight results by distance.
  • apoc.es.* – Integrate results with Elasticsearch for better search capabilities.

Check out the Official User Guide and Repository for more inspiration.