Loading Twitter Data into Neo4j with APOC

For the Graph Hack at this year's GraphConnect, the Dead Pony Club and I aimed to combine candidate datasets with donation data, Twitter and fake news sources to find out which politicians, if any, were directly influenced by fake news.

My task for the evening was to pull Twitter information into our graph. Being relatively new to the APOC library, I was genuinely surprised by how easily this could be done with a couple of queries. APOC comes with the apoc.load.json and apoc.load.jsonParams procedures, which make it really easy to call third-party APIs from Cypher and load the results directly into the graph without writing an application.

Getting a valid Access Token

The first step was to get a valid access token that would be used to identify the user and grant access to the API.   You can create your own application by logging into Twitter Application Management.  Once you have an application, head to the Keys and Access Tokens tab and take the Consumer Key and Consumer Secret.  These should be Base64 encoded in {key}:{secret} format and then posted to Twitter’s API as a Basic authorization header.

curl -X POST "https://api.twitter.com/oauth2/token" \
  -H "Authorization: Basic <Base64EncodedValue>" \
  -d "grant_type=client_credentials"
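The <Base64EncodedValue> placeholder can be produced in any language. As a minimal Python sketch (the function name and the credentials are made up for illustration and are not part of the original setup):

```python
import base64


def basic_auth_header(consumer_key: str, consumer_secret: str) -> str:
    """Build the Authorization header value: 'Basic ' + base64('key:secret')."""
    raw = f"{consumer_key}:{consumer_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder credentials for illustration only, not real keys.
header = basic_auth_header("my-consumer-key", "my-consumer-secret")
print(header)
```

The part after "Basic " is the value to paste into the curl command.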

The API will return a bearer token that can then be used in all of our queries.

{"token_type":"bearer","access_token":"AAAAAAAAAAAAAAAAAAAAAIivBQ..."}
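If you are scripting this step rather than copying the token by hand, the response can be parsed in a few lines. A small Python sketch, assuming the response body has the two fields shown above (the helper name and the sample token are invented):

```python
import json


def extract_bearer_token(body: str) -> str:
    """Return access_token from the token response, checking token_type first."""
    data = json.loads(body)
    if data.get("token_type") != "bearer":
        raise ValueError(f"unexpected token_type: {data.get('token_type')!r}")
    return data["access_token"]


# Shortened, made-up token standing in for the real response body.
sample = '{"token_type":"bearer","access_token":"EXAMPLE-TOKEN"}'
token = extract_bearer_token(sample)
print(token)
```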

There are more elegant ways of setting this up for use in a Cypher query, but for brevity I ended up setting a parameter in the Neo4j Browser. This also closely mimics the way you would run a Cypher query from an application using parameters.

:param token: <access_token property>

Retrieving a Tweet

I wanted to write a Cypher query that could be as close to a real-world implementation as possible. I also wanted the dataset to grow organically. With that in mind, the best approach was to create a query that would find any tweets in the database with an ID but no inward :TWEETED relationship, pull their details from the Twitter API and write them into the graph.

I picked a tweet that I knew had a lot of information attached to it (a retweet of another tweet, with retweets of its own, mentions and hashtags) and created that node in the database.

CREATE (t:Tweet {id_str:"847738850990411776"})

The next step was to use APOC to get the information from the Twitter API. As I mentioned earlier, the APOC library comes with two procedures for pulling JSON. apoc.load.json can be used where only a simple HTTP GET request is required, but as I needed to provide a bearer token I had to use apoc.load.jsonParams. As called here, jsonParams takes three arguments: the URL, a map of request headers and the request payload in string format (null when there is no body to send).

WITH
'https://api.twitter.com/1.1/statuses/show.json?id=' as url,
{token} as token // set in our :param token:'...' query
MATCH (t:Tweet)
WHERE NOT (t)<-[:TWEETED]-()
CALL apoc.load.jsonParams(url + t.id_str, {Authorization:"Bearer "+token},null) yield value
RETURN value
"value"
{"coordinates":null,"retweeted":false,…,"entities":{"hashtags":[{"text":"graphconnect","indices":[68,81]}],…}
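For readers more used to HTTP clients than APOC, the call above amounts to a GET request in which the Authorization entry of the second argument is sent as a request header. A rough Python equivalent of that request, built but not sent (the token is a placeholder, and this says nothing about how APOC issues the request internally):

```python
import urllib.request

BASE_URL = "https://api.twitter.com/1.1/statuses/show.json?id="


def build_status_request(id_str: str, token: str) -> urllib.request.Request:
    """Mirror the three arguments passed to apoc.load.jsonParams:
    the URL, a map of headers, and a payload (None means a plain GET)."""
    return urllib.request.Request(
        BASE_URL + id_str,
        headers={"Authorization": "Bearer " + token},
        data=None,
    )


# Placeholder token; the tweet ID is the one created earlier in the post.
req = build_status_request("847738850990411776", "EXAMPLE-TOKEN")
print(req.get_method(), req.full_url)
```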

Straight away, I was able to pull the API output into the value variable. From there it was just a case of working through the JSON output line by line and mapping it into the query. I found that Mark Needham’s Graphing the ‘My name is…’ Twitter Meme provided a good starting point, but it was falling over when I tried to import tweets that were not replies. Instead I broke the query down into sections and built them up into a single query. To populate any replies I could just run the query again.

WITH t, value AS status, value.user AS user, value.entities AS entities

// Update Tweet Properties
SET t.text = status.text, t.created_at = status.created_at, t.retweet_count = status.retweet_count, t.favorite_count = status.favorite_count

After using the WITH clause to pull out the information I needed from the response, I first updated the tweet node with the properties returned by the API.

The next step was to create a relationship with the authoring user.  Firstly I needed to create the User within the database.  By using MERGE I was able to create a new :User node where none exists or simply update the node’s properties if the node already exists before relating the User to the Tweet.

// Create Author
MERGE (u:User {screen_name:user.screen_name})
SET u.name = user.name, u.friends_count = user.friends_count, u.followers_count = user.followers_count, u.picture=user.profile_image_url
MERGE (u)-[:TWEETED]->(t)

The API result comes with an entities object that lists any hashtags included in the tweet, any users mentioned in the tweet and any links, along with their position in the text. Because each of these is an array, they are easy to create in the graph: each array can be passed to a sub-statement using the FOREACH clause. This approach can be used for hashtags, mentions and URLs.

// Create Hashtags
FOREACH (h in entities.hashtags
| MERGE (ht:Hashtag {name:h.text}) MERGE (t)-[:MENTIONS_HASHTAG]->(ht))

// Mentions
FOREACH (m in entities.user_mentions
| MERGE (mu:User {screen_name:m.screen_name}) MERGE (t)-[:MENTIONS_USER]->(mu))

// URLs
FOREACH (m in entities.urls
| MERGE (mu:URL {url:m.url}) MERGE (t)-[:MENTIONS_URL]->(mu))
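To make the shape of the data concrete, here is a small Python sketch that walks a hand-written entities object the way the three FOREACH clauses do and lists the nodes each one would merge. The sample values are invented; only the field names and relationship types come from the queries above.

```python
from typing import Any, Dict, List, Tuple

# (entities key, node label, source field, node property, relationship type)
MAPPINGS = [
    ("hashtags", "Hashtag", "text", "name", "MENTIONS_HASHTAG"),
    ("user_mentions", "User", "screen_name", "screen_name", "MENTIONS_USER"),
    ("urls", "URL", "url", "url", "MENTIONS_URL"),
]


def planned_merges(entities: Dict[str, Any]) -> List[Tuple[str, str, str, str]]:
    """Return (relationship, label, property, value) for every entity found."""
    merges = []
    for key, label, field, prop, rel in MAPPINGS:
        # Missing keys are treated as empty lists (a choice made for this sketch).
        for item in entities.get(key, []):
            merges.append((rel, label, prop, item[field]))
    return merges


sample_entities = {
    "hashtags": [{"text": "graphconnect", "indices": [68, 81]}],
    "user_mentions": [{"screen_name": "example_user"}],
    "urls": [],
}
for merge in planned_merges(sample_entities):
    print(merge)
```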

Creating relationships to quoted tweets and replied-to tweets in a single query caused a few headaches. As the information is not always present, attempting to run a MERGE on a null value meant that early queries would fall over. Luckily, with a little hack I was able to create a statement that iterates through a one-element array when the value exists; where the information has not been provided, Cypher iterates through an empty array and the inner query is never executed.

This approach was used for both Quoted Statuses and Replies.

// Quoted Status?
FOREACH (s in
CASE WHEN status.quoted_status_id IS NOT NULL
THEN [status.quoted_status]
ELSE [] END
| MERGE (qt:Tweet {id_str:s.id_str}) MERGE (t)-[:QUOTES_TWEET]->(qt) )

// Reply to
FOREACH (s_id_str in
CASE WHEN status.in_reply_to_status_id_str IS NOT NULL
THEN [status.in_reply_to_status_id_str]
ELSE [] END
| MERGE (qt:Tweet {id_str:s_id_str}) MERGE (t)-[:IN_REPLY_TO]->(qt) )
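Stripped of Cypher syntax, the trick is to wrap the optional value in a list that is either empty or has exactly one element, then loop over it. A small Python illustration with invented IDs (the helper names are hypothetical):

```python
from typing import List, Optional, Tuple


def as_list(value: Optional[str]) -> List[str]:
    """Equivalent of CASE WHEN value IS NOT NULL THEN [value] ELSE [] END."""
    return [value] if value is not None else []


def reply_edges(status: dict) -> List[Tuple[str, str, str]]:
    """Return IN_REPLY_TO edges for a status: one if it is a reply, else none."""
    edges = []
    for parent_id in as_list(status.get("in_reply_to_status_id_str")):
        # This body runs once for a reply and not at all otherwise.
        edges.append((status["id_str"], "IN_REPLY_TO", parent_id))
    return edges


print(reply_edges({"id_str": "2", "in_reply_to_status_id_str": "1"}))
print(reply_edges({"id_str": "3", "in_reply_to_status_id_str": None}))
```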

Growing the Graph

The query is designed to pick up any Tweet node without a relationship to an author and pull the information in. By design, any tweets that the API response mentions are created with only an id_str so they will be picked up the next time the query is run.
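To see why repeated runs grow the graph, here is a toy Python simulation. The dictionaries and the FAKE_API table are stand-ins invented for this sketch; no Neo4j or Twitter call is made.

```python
# Fake API: tweet id -> (author, ids of tweets it quotes or replies to).
FAKE_API = {
    "1": ("alice", ["2"]),
    "2": ("bob", ["3"]),
    "3": ("carol", []),
}


def run_once(authors: dict) -> int:
    """One pass of the import: fill in every tweet that has no author yet.

    `authors` maps tweet id -> author, with None meaning 'stub node only'.
    Returns the number of tweets that were filled in on this pass.
    """
    pending = [tid for tid, author in authors.items() if author is None]
    for tid in pending:
        author, referenced = FAKE_API[tid]
        authors[tid] = author
        for ref in referenced:
            # Referenced tweets are created as stubs for the next pass.
            authors.setdefault(ref, None)
    return len(pending)


graph = {"1": None}  # the single seed tweet
passes = []
while True:
    filled = run_once(graph)
    if filled == 0:
        break
    passes.append(filled)
print(passes, graph)
```

Each pass only touches the stubs left behind by the previous one, so the import can be re-run on a schedule until nothing new turns up.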

The same approach works for user information: find newly created users, for example by searching for nodes without an updated_at property or whose record is out of date, and pull their details in from the Twitter API. Queries can also be written that call the API to fetch a user’s timeline and create those Tweet nodes. Regular calls to find who a user follows and is followed by will also help to grow the graph organically.
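The selection rule for users can be sketched in Python as follows. The seven-day threshold, the fixed date and the sample users are arbitrary choices for illustration; in practice the same condition would sit in the WHERE clause of a Cypher query.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=7)  # arbitrary threshold chosen for this sketch


def needs_refresh(updated_at: Optional[datetime], now: datetime) -> bool:
    """True when a user has never been fetched or was fetched too long ago."""
    if updated_at is None:
        return True
    return now - updated_at > MAX_AGE


# Arbitrary fixed "current time" so the example is reproducible.
now = datetime(2017, 5, 20, tzinfo=timezone.utc)
users = {
    "new_user": None,                        # no updated_at property yet
    "stale_user": now - timedelta(days=30),  # record is out of date
    "fresh_user": now - timedelta(days=1),   # recently updated
}
to_fetch = [name for name, ts in users.items() if needs_refresh(ts, now)]
print(to_fetch)
```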

The flexibility of Cypher, coupled with APOC’s ability to talk to APIs, means that creating a graph based on third-party data is extremely easy.