
7.3. Modeling Data with Property Graphs

You may recall from Section 2.3.2.2 on page 78 that graphs were introduced in passing as a means of representing, analyzing, and visualizing social network data from Facebook. This section provides a more thorough discussion and serves as a primer for graph computing. Although it is still a bit under the radar, the graph computing landscape is developing rapidly because graphs are a very natural abstraction for modeling many phenomena in the real world. Graphs offer a flexibility in data representation that is especially hard to beat during data experimentation and analysis when compared to other options, such as relational databases. Graph-centric analyses are certainly not a panacea for every problem, but an understanding of how to model your data with graph structures is a powerful addition to your toolkit.

A general introduction to graph theory is beyond the scope of this chapter, and the discussion that follows simply attempts to provide a gentle introduction to key concepts as they arise. You may enjoy the short YouTube video “Graph Theory—An Introduction!” if you’d like to accumulate some general background knowledge before proceeding.

The remainder of this section introduces a common kind of graph called a property graph, which we’ll use to model GitHub data as an interest graph by way of a Python package called NetworkX. A property graph is a data structure that represents entities as nodes and the relationships between those entities as edges. Each node has a unique identifier, a map of properties that are defined as key/value pairs, and a collection of edges. Likewise, each edge connects a pair of nodes, can be uniquely identified, and can contain properties of its own.
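The definition above can be sketched with nothing more than Python dictionaries. The following is a minimal illustration of the concept, not a description of how NetworkX is implemented. The helper names (add_node, add_edge, out_edges) and the sample data (a user called alice who stars a repository called example-repo) are hypothetical and chosen only for this sketch.

```python
# A minimal, hypothetical property graph built from plain dictionaries.
# This illustrates the concept only; it is not NetworkX's implementation.

nodes = {}  # node identifier -> map of node properties
edges = {}  # (source identifier, target identifier) -> map of edge properties


def add_node(node_id, **props):
    """Create the node if it is missing, then merge in any properties."""
    nodes.setdefault(node_id, {}).update(props)


def add_edge(source, target, **props):
    """Create a directed edge, creating either endpoint if it is missing."""
    add_node(source)
    add_node(target)
    edges.setdefault((source, target), {}).update(props)


def out_edges(node_id):
    """Return the identifiers of all edges that leave the given node."""
    return [edge_id for edge_id in edges if edge_id[0] == node_id]


# Hypothetical sample data: a user who has starred a repository
add_node('alice', type='user')
add_node('example-repo', type='repo', lang='Python')
add_edge('alice', 'example-repo', type='stars')

print(nodes['alice'])
print(edges[('alice', 'example-repo')])
print(out_edges('alice'))
```

Keying the edge map by a (source, target) tuple means an edge is identified by the nodes it connects, and each node and edge carries its own independent property map.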

Figure 7-2 shows a trivial example of a property graph with two nodes that are uniquely identified by X and Y, with an undescribed relationship between them. This particular graph is called a digraph because its edges are directed. Edges need not be directed unless the directionality of an edge carries meaning in the domain being modeled.

Figure 7-2. A trivial property graph with directed edges

Expressed in code with NetworkX, a trivial property graph could be constructed as shown in Example 7-4. (You can use pip install networkx to install this package if you aren’t using the book’s turnkey virtual machine.)

288 | Chapter 7: Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More

Example 7-4. Constructing a trivial property graph

# Written for Python 2 and the NetworkX 1.x API (print statements, g.node, nx.info)
import networkx as nx

# Create a directed graph
g = nx.DiGraph()

# Add an edge to the directed graph from X to Y
g.add_edge('X', 'Y')

# Print some statistics about the graph
print nx.info(g)
print

# Get the nodes and edges from the graph
print "Nodes:", g.nodes()
print "Edges:", g.edges()
print

# Get node properties
print "X props:", g.node['X']
print "Y props:", g.node['Y']

# Get edge properties
print "X=>Y props:", g['X']['Y']
print

# Update a node property
g.node['X'].update({'prop1' : 'value1'})
print "X props:", g.node['X']
print

# Update an edge property
g['X']['Y'].update({'label' : 'label1'})
print "X=>Y props:", g['X']['Y']

Sample output from the example follows:

Name:
Type: DiGraph
Number of nodes: 2
Number of edges: 1
Average in degree: 0.5000
Average out degree: 0.5000

Nodes: ['Y', 'X']
Edges: [('X', 'Y')]

X props: {}
Y props: {}
X=>Y props: {}

X props: {'prop1': 'value1'}

X=>Y props: {'label': 'label1'}

In this particular example, the add_edge method of the digraph adds an edge from a node that’s uniquely identified by X to a node that’s uniquely identified by Y, resulting in a graph with two nodes and one edge between them. In terms of its unique identifier, this edge would be represented by the tuple (X, Y), since both nodes that it connects are uniquely identified themselves. Be aware that adding an edge from Y back to X would create a second edge in the graph, and this second edge could contain its own set of edge properties. In general, you wouldn’t want to create this second edge, since you can get a node’s incoming or outgoing edges and effectively traverse the edge in either direction, but there may be some situations in which it is more convenient to explicitly include the additional edge.
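To make the point about direction concrete, here is a short sketch. It assumes NetworkX is installed and uses only the DiGraph methods add_edge, out_edges, in_edges, number_of_edges, and get_edge_data. The label values are arbitrary, and the list(...) wrappers are there because the return types of out_edges and in_edges differ between NetworkX releases.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge('X', 'Y', label='label1')

# A single directed edge can already be reached from either endpoint
outgoing_from_x = list(g.out_edges('X'))
incoming_to_y = list(g.in_edges('Y'))
print(outgoing_from_x)
print(incoming_to_y)

# Adding an edge from Y back to X creates a second, independent edge
g.add_edge('Y', 'X', label='label2')
print(g.number_of_edges())

# Each edge keeps its own set of properties
print(g.get_edge_data('X', 'Y'))
print(g.get_edge_data('Y', 'X'))
```

Because the X-to-Y edge is visible both as an outgoing edge of X and as an incoming edge of Y, the reverse edge is needed only when the relationship from Y to X means something different or carries different properties.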

The degree of a node in a graph is the number of edges incident to it, and for a directed graph, there is a notion of in degree and out degree, since edges have direction. The average in degree and average out degree values provide a normalized score for the graph: the mean number of incoming and outgoing edges per node. In this particular case, the directed graph has a single directed edge, so there is one node with an outgoing edge and one node with an incoming edge.

The in and out degree of a node are fundamental concepts in graph theory. Assuming you know the number of vertices in the graph, the average degree provides a measure of the graph’s density: the number of actual edges compared to the number of possible edges if the graph were fully connected. In a fully connected graph, each node is connected to every other node, and in the case of a directed graph, this means that all nodes have incoming edges from all other nodes.
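One way to make the relationship between average degree and density concrete is the small calculation below. It assumes a directed graph with no self-loops and no parallel edges, so that n nodes allow at most n * (n - 1) edges. The function name directed_density is made up for this sketch.

```python
def directed_density(num_nodes, num_edges):
    """Ratio of actual edges to possible edges in a simple directed graph."""
    possible_edges = num_nodes * (num_nodes - 1)
    if possible_edges == 0:
        # A graph with fewer than two nodes has no possible edges
        return 0.0
    return num_edges / float(possible_edges)


# The graph from Example 7-4: two nodes and one edge
num_nodes, num_edges = 2, 1
average_in_degree = num_edges / float(num_nodes)

density = directed_density(num_nodes, num_edges)
print(density)

# The same value follows from the average degree and the node count
print(average_in_degree / (num_nodes - 1))
```

Under these assumptions, dividing the average in degree by one less than the number of nodes gives the same density as dividing the edge count by the number of possible edges.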

You calculate the average in degree for an entire graph by summing the values of each node’s in degree and dividing the total by the number of nodes in the graph, which is 1 divided by 2 in Example 7-4. The average out degree is computed the same way, except that the sum of each node’s out degree is the value divided by the number of nodes in the graph. When you’re considering an entire directed graph, there will always be an equal number of incoming edges and outgoing edges because each edge connects exactly two nodes,[1] leaving one and entering the other, so the average in degree and average out degree values for the entire graph will be the same.

1. A more abstract version of a graph called a hypergraph contains hyperedges that can connect an arbitrary number of vertices.
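The calculation just described can be sketched in a few lines of plain Python. The function name average_degrees and the three-node sample graph are assumptions made for illustration. The sketch also assumes that every node appears in at least one edge, since isolated nodes cannot be recovered from an edge list alone.

```python
def average_degrees(edge_list):
    """Return (average in degree, average out degree) for directed edges."""
    in_degree = {}
    out_degree = {}
    for source, target in edge_list:
        # Make sure both endpoints appear in both tallies
        for node in (source, target):
            in_degree.setdefault(node, 0)
            out_degree.setdefault(node, 0)
        out_degree[source] += 1
        in_degree[target] += 1
    num_nodes = float(len(in_degree))
    return (sum(in_degree.values()) / num_nodes,
            sum(out_degree.values()) / num_nodes)


# The single edge from Example 7-4
print(average_degrees([('X', 'Y')]))

# A hypothetical three-node graph with three edges
print(average_degrees([('X', 'Y'), ('X', 'Z'), ('Y', 'Z')]))
```

Every edge adds one to the out degree of its source and one to the in degree of its target, which is why the two averages always agree.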

In the general case, the maximum values for average in and out degree in a graph are one less than the number of nodes in the graph. Take a moment to convince yourself that this is the case by considering the number of edges that are necessary to fully connect all of the nodes in a graph.
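If you would like to check this numerically, the sketch below builds every possible directed edge among n nodes with itertools.permutations and computes the resulting average in degree. It assumes no self-loops and no parallel edges, and the function name max_average_in_degree is hypothetical.

```python
from itertools import permutations


def max_average_in_degree(num_nodes):
    """Average in degree of a fully connected directed graph."""
    nodes = range(num_nodes)
    # Every ordered pair of distinct nodes is one directed edge
    edges = list(permutations(nodes, 2))
    return len(edges) / float(num_nodes)


for n in (2, 3, 10):
    print("%d nodes: %.1f" % (n, max_average_in_degree(n)))
```

A fully connected digraph on n nodes has n * (n - 1) edges, so dividing by n leaves n - 1 incoming edges per node on average.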

In the next section, we’ll construct an interest graph using these same property graph primitives and illustrate these additional methods at work on real-world data. First, take a moment to explore by adding some nodes, edges, and properties to the graph. The NetworkX documentation provides a number of useful introductory examples that you can also explore if this is one of your first encounters with graphs and you’d like some extra instruction as a primer.

The Rise of Big Graph Databases

This chapter introduces property graphs, a versatile data structure that can be used to model complex networks with nodes and edges as simple primitives. We’ll be modeling the data according to a flexible graph schema that’s based largely on natural intuition, and for a narrowly focused domain, this pragmatic approach is often sufficient. As we’ll see throughout the remainder of this chapter, property graphs provide flexibility and versatility in modeling and querying complex data.

NetworkX, the Python-based graph toolkit used throughout this book, provides a powerful toolbox for modeling property graphs. Be aware, however, that NetworkX is an in-memory graph library: it holds the entire graph in working memory. The limit of what it can do for you is bounded by the amount of working memory that you have on the machine on which you are running it. In many situations, you can work around the memory limitation by constraining the domain to a subset of the data or by using a machine with more working memory. In an increasing number of situations involving “big data” and its burgeoning ecosystem that largely involves Hadoop and NoSQL databases, however, in-memory graphs are simply not an option.

There is a nascent ecosystem of so-called “big graph databases” that leverage NoSQL databases for storage and provide property graph semantics that are worth noting.

Titan, a promising front-runner, and other big graph databases that adopt the property graph model present an opportunity for a departure from the semantic web stack as we know it, involving technologies such as RDF Schema, OWL, and SPARQL. The idea behind the technologies involved in the semantic web stack is that they provide a mechanism for representing and consolidating data from more complex domains, thereby making it possible to arrive at a standardized vocabulary that can be meaningfully queried. Unfortunately, there has been a litany of historical challenges involved in applying these technologies at web scale. One of Titan’s promising key claims is that it is designed to effectively manage the memory hierarchy in order to scale well.

It will be exciting to see what the future holds as big graph databases based upon NoSQL databases and the property graph model fuse with the ideas and technologies involved in a more traditional semantic web toolchain. The next chapter introduces some current web innovations involving microformats that are a step in the general direction of a more semantic web; it ends with a brief example that demonstrates inferencing on a simple graph with a small subset of technology akin to the semantic web stack.
