In my previous blog, we saw about neo4j with some basic example. In this post, let’s see an overview graph algorithms in neo4j.
Here is a list of the many algorithms that Neo4j uses in its graph analytics platform, along with an explanation of what they do.
These algorithms determine the importance of distinct nodes in a network:
- PageRank – Estimates a current node’s importance from its linked neighbors and then again from their neighbors. A node’s rank is derived from the number and quality of its transitive links to estimate influence. PageRank is used in quite a few ways to estimate importance and influence. It’s used to suggest Twitter accounts to follow and for general sentiment analysis. PageRank is also used in machine learning to identify the most influential features for extraction.
- Betweenness Centrality – Measures the number of shortest paths (first found with breadth-first search) that pass through a node. Nodes that most frequently lie on shortest paths have higher betweenness centrality scores and are the bridges between different clusters. It is often associated with the control over the flow of resources and information. Betweenness centrality applies to a wide range of problems in network science and is used to pinpoint bottlenecks or likely attack targets in communication and transportation networks. Betweenness Centrality has also be used to evaluate information flows between multiplayer online gamers and expertise sharing communities of physicians.
- Closeness Centrality – Measures how central a node is to all its neighbours within its cluster. Nodes with the shortest paths to all other nodes are assumed to be able to reach the entire group the fastest. Closeness centrality is applicable in a number of resources, communication, and behavioural analysis, especially when interaction speed is significant. It has been used for identifying the best location of new public services for maximum accessibility. In social network analysis, it is used to find people with the ideal social network location for faster dissemination of information.
These algorithms evaluate how a group is clustered or partitioned:
- Louvain – Measures the quality (i.e. presumed accuracy) of a community grouping by comparing its relationship density to a suitably defined random network. It’s often used to evaluate the organization of complex networks and community hierarchies in particular. It’s also useful for initial data preprocessing in unsupervised machine learning. Louvain is used to evaluate social structures on Twitter, LinkedIn, and YouTube. It’s used in fraud analytics to evaluate whether a group has just a few bad behaviors or is acting as a fraud ring that would be indicated by a higher relationship density than average.
- Label Propagation – Spreads labels based on neighbourhood majorities as a means of inferring clusters. This extremely fast graph partitioning requires little prior information and is widely used in large-scale networks for community detection. It’s a key method for understanding the organisation of a graph and is often a primary step in other analysis. Label propagation has diverse applications, from understanding consensus formation in social communities to identifying sets of proteins that are involved together in a process (functional modules) for biochemical networks. It’s also used in semi- and unsupervised machine learning as an initial pre-processing step.
- (Weakly) Connected Components – Finds groups of nodes where each node is reachable from any other node in the same group, regardless of the direction of relationships. It provides near constant-time operations (independent of input size) to add new groups, merge existing groups, and determine whether two nodes are in the same group. Union-find/connected components is often used in conjunction with other algorithms, especially for high-performance grouping. As a preprocessing step for undirected graphs, it helps quickly identify disconnected groups.
- Strongly Connected Components – Locates groups of nodes where each node is reachable from every other node in the same group following the direction of relationships. It’s often applied from a depth-first search. Strongly connected is often used to enable running other algorithms independently on an identified cluster. As a preprocessing step for directed graphs, it helps quickly identify disconnected groups. In retail recommendations, it helps identify groups with strong affinities that then are used for suggesting commonly preferred items to those within that group who have not yet purchased the item.
- Triangle Count / Clustering Coefficient – Measures how many nodes have triangles and the degree to which nodes tend to cluster together. The average clustering coefficient is 1 when there is a clique and 0 when there are no connections. For the clustering coefficient to be meaningful, it should be significantly higher than a version of the network where all of the relationships have been shuffled randomly.
These algorithms help find the shortest path or evaluate the availability and quality of routes:
- Minimum Weight Spanning Tree – Calculates the paths along a connected tree structure with the smallest value (weight of the relationship such as cost, time, or capacity) associated with visiting all nodes in the tree. Minimum weight spanning tree is widely used for network designs: least-cost logical or physical routing such as laying cable, fastest garbage collection routes, capacity for water systems, efficient circuit designs, and much more.
- All Pairs Shortest Path – Calculates a shortest path forest (group) containing all shortest paths between the nodes in the graph. It’s commonly used for understanding alternate routing when the shortest route is blocked or becomes sub-optimal. All-pairs shortest path is used to evaluate alternate routes for situations, such as a freeway backup or network capacity. It’s also key in logical routing to offer multiple paths; for example, call routing alternatives.
- Single Source Shortest Path – Calculates a path between a node and all other nodes whose summed value (weight of relationships such as cost, distance, time, or capacity) to all other nodes is minimal. Single-source shortest path is often applied to automatically obtain directions between physical locations, such as driving directions via Google Maps.