Neo4j is a popular graph database management system for storing, managing, and querying data as graphs. It is based on the property graph model and is designed to handle complex, highly interconnected data. Unlike traditional relational databases, Neo4j stores data as nodes and relationships, which allows flexible and efficient querying of relationships and patterns within the data. Besides a user-friendly query language (Cypher), it offers several plugins that help with visualizing graphs and performing data science operations on networks.
This blog explains the findings of various queries run on a Twitch streamer dataset in Neo4j. To jump straight to clustering, see the section Clustering using GDS.
Setting Up
Loading the dataset
The dataset consists of over 150,000 nodes and 6,500,000 edges, and it was evident that Cypher’s native LOAD CSV clause would take too long (if it didn’t throw an OutOfMemory error first). After some trial and error, we found that Neo4j’s neo4j-admin database import command set up the network in under a minute, seemingly regardless of the system’s processing power.
neo4j-admin database import full --skip-duplicate-nodes --nodes=import/nodes_header.csv,import/large_twitch_features.csv --relationships=import/edges_header.csv,import/large_twitch_edges.csv --overwrite-destination=true --verbose
The above command loads the network of 168,114 nodes and 6,797,557 edges in around 20 seconds, with all the node attributes intact. In comparison, LOAD CSV was still loading edges after 40 minutes.
Note the nodes_header.csv and edges_header.csv files in the command. Each column name in a header row needs its data type written alongside it (for example, views:int) for the data to be imported correctly, since entries are treated as strings by default. The importer also expects the node ID column to be marked with :ID and the two relationship endpoint columns with :START_ID and :END_ID.
Note that this command loads the attributes as node properties, not labels. To convert a property to a label, see the Clustering using GDS section.
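Once the import finishes, a quick sanity check is to compare the totals in the database with the 168,114 nodes and 6,797,557 edges reported by the importer. The following is a minimal sketch that assumes the import targeted the database currently selected in Neo4j Browser; the aliases are only illustrative.

```cypher
// Count all nodes in the database.
MATCH (n)
RETURN count(n) AS nodes;

// Count all relationships, each one once via its stored direction.
MATCH ()-[r]->()
RETURN count(r) AS relationships;

// Eyeball a few rows to confirm the properties arrived with sensible values.
MATCH (n)
RETURN n.numeric_id, n.views
LIMIT 5;
```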
Cypher Queries
Cypher is Neo4j’s query language. It is used extensively in data science because it handles complex data relationships well and can extract meaningful insights from large datasets. Cypher is designed to be both intuitive and expressive, so data scientists can write complex queries that others can easily understand.
The simplest Cypher query would be
match (n) return n
This command returns all the nodes of the network. Note that Neo4j Browser has a configurable limit on how many nodes it displays for one query. The number of nodes visible on-screen will stay under this limit, so not all nodes will be shown.
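In practice it is often easier to cap the result in the query itself than to rely on the Browser's display limit. A small sketch, where the limit of 25 is an arbitrary choice:

```cypher
// Return a small sample of nodes instead of the whole network.
MATCH (n)
RETURN n
LIMIT 25
```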
For this dataset, the top 10 nodes by number of connections (counted here as outgoing relationships) can be obtained by:
match (s)-[]->(t) return s.numeric_id, count(t) as connections order by connections desc limit 10
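Note that the pattern above follows relationships in their stored direction only, so each node is credited just with its outgoing edges. If the edges are meant to be undirected (an assumption about this dataset, where an edge would represent a mutual connection), a variant that ignores direction might look like this:

```cypher
// Omit the arrow head so relationships are matched in either direction.
// count(DISTINCT t) counts each neighbour once, even if two nodes
// happen to be linked by more than one relationship.
MATCH (s)--(t)
RETURN s.numeric_id AS streamer, count(DISTINCT t) AS connections
ORDER BY connections DESC
LIMIT 10
```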
To rank by number of views instead, the Cypher command would be
match (n) return n.numeric_id, n.views as views order by n.views desc limit 10
Clustering using GDS
Clustering is the process of grouping similar nodes together based on certain criteria. Neo4j provides a plugin called Graph Data Science (GDS) that includes several clustering algorithms under “Community Detection”. We formed clusters on this dataset using GDS’s Louvain community detection algorithm, which produced 19 clusters.
To do this, the network first has to be projected into GDS’s in-memory graph catalog as a named “graph”. This can be done using the following command
CALL gds.graph.project.cypher(
'twitch',
'MATCH (n)
RETURN
id(n) AS id,
n.views AS views',
'MATCH (n)-[]->(m) RETURN id(n) AS source, id(m) AS target'
)
YIELD
graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels
This stores a graph named “twitch” in memory for the current runtime, along with the specified node property (views).
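To confirm that the projection exists, and to free its memory once the analysis is over, the GDS graph catalog procedures can be used. A minimal sketch, assuming the graph name 'twitch' from the projection above:

```cypher
// Inspect the projected graph in the catalog.
CALL gds.graph.list('twitch')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount;

// Run this only after all GDS algorithms have finished:
// it removes the in-memory projection and releases its memory.
CALL gds.graph.drop('twitch')
YIELD graphName
RETURN graphName;
```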
Louvain clustering can be invoked using
call gds.louvain.write('twitch', {writeProperty:'louvain'})
This runs the Louvain algorithm and saves each node’s community ID as a node property named “louvain”. This property can be converted to a node label (to display clusters separately) with the APOC plugin, using the following
match (n)
call apoc.create.addLabels([id(n)], [toString(n.louvain)])
yield node
with node remove node.louvain return node
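Because the new labels are numeric strings, they have to be wrapped in backticks when used in a pattern. A short sketch, where the community ID 12 is a hypothetical value (the actual IDs depend on the Louvain run):

```cypher
// Fetch a sample of nodes from one cluster; `12` is a placeholder label.
MATCH (n:`12`)
RETURN n
LIMIT 25
```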
The clusters along with their sizes can be viewed using
match (n) return labels(n) as label, count(labels(n)) as size
order by size desc
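If only the number of communities is of interest, GDS can report it without writing anything back to the database. A minimal sketch, assuming the 'twitch' projection is still in the graph catalog:

```cypher
// Run Louvain in stats mode: nothing is written to the nodes.
CALL gds.louvain.stats('twitch')
YIELD communityCount, modularity
RETURN communityCount, modularity
```

The count returned here may differ slightly from an earlier write run, since Louvain results are not guaranteed to be identical across runs.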
That’s all the exploration done on the Twitch Gamers dataset using Neo4j. Thank you for reading!