7.4. Analyzing GitHub Interest Graphs

7.4.4. Using Nodes as Pivots for More Efficient Queries

Another characteristic of the data to consider is the popularity of programming languages that are employed by users. It could be the case that users star projects that are implemented in programming languages that they are at least loosely interested in and able to use themselves. Although we have the data and the tools to analyze users and popular programming languages with our existing graph, our schema currently has a shortcoming. Since a programming language is modeled as an attribute on a repository, it is necessary to scan all of the repository nodes and either extract or filter by this attribute in order to answer nontrivial questions.

For example, if we wanted to know which programming languages a user programs in using the current schema, we’d need to look up all of the repositories that user gazes at, extract the lang properties, and compute a frequency distribution. This doesn’t seem too cumbersome, but what if we wanted to know how many users program in a particular programming language? Although the answer is computable with the existing schema, it requires a scan of every repository node and a count of all of the incoming “gazes” edges. With a modification to the graph schema, however, answering this question could be as simple as accessing a single node in the graph. The modification would involve creating a node in the graph for each programming language that has incoming programs edges that connect users who program in that language, and outgoing implements edges that connect repositories.
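To make the difference in cost concrete, here is a minimal sketch that uses plain Python dictionaries as a stand-in for the graph rather than the NetworkX API used in this chapter's examples. The toy users, repositories, and helper names (users_by_scan, users_by_pivot) are hypothetical and exist only for illustration.

```python
# Toy data (hypothetical): repo -> lang property, and user -> starred repos.
repo_lang = {"repo_a": "Python", "repo_b": "JavaScript", "repo_c": "Python"}
gazes = {"alice": ["repo_a", "repo_b"], "bob": ["repo_c"], "carol": ["repo_b"]}


def users_by_scan(language):
    """Original schema: inspect the lang property of every starred repository."""
    return {user
            for user, repos in gazes.items()
            for repo in repos
            if repo_lang[repo] == language}


# Revised schema: build one pivot entry per language, once, up front.
programs = {}  # language -> set of users with a "programs" edge into it
for user, repos in gazes.items():
    for repo in repos:
        programs.setdefault(repo_lang[repo], set()).add(user)


def users_by_pivot(language):
    """Revised schema: a single lookup on the language node."""
    return programs.get(language, set())


print(sorted(users_by_scan("Python")))
print(sorted(users_by_pivot("Python")))
```

Both functions return the same set of users. The difference is that the scan touches every "gazes" edge on each query, whereas the pivot pays that cost once when the language entries are built.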

Figure 7-6 illustrates our final graph schema, which incorporates programming languages as well as edges between users, repositories, and programming languages. The overall effect of this schema change is that we’ve taken a property of one node and created an explicit relationship in the graph that was previously implicit. From the standpoint of completeness, there is no new data, but the data that we do have can now be computed on more efficiently for certain queries. Although the schema is fairly simple, the universe of possible graphs adhering to it that could be easily constructed and mined for valuable knowledge is immense.

The nice thing about having a single node in the graph that corresponds to a programming language, as opposed to representing a programming language as a property on many nodes, is that a single node acts as a natural point of aggregation. Central points of aggregation can greatly simplify many kinds of queries, such as finding maximal cliques in the graph, as described in Section 2.3.2.2 on page 78. For example, finding the maximal clique of users who all follow one another and program with a particular language can be more efficiently computed with NetworkX’s clique detection algorithms since the requirement of a particular programming language node in the clique significantly constrains the search.
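The constraining effect of a language node can be sketched without NetworkX. The example below is an illustration only: it substitutes a small hand-written Bron–Kerbosch enumeration for NetworkX's clique routines, and the follows and programs data, along with the maximal_cliques helper, are hypothetical. The idea is to shrink the candidate set to the users attached to one language node before searching for cliques among mutual followers.

```python
def maximal_cliques(adj):
    """Yield the maximal cliques of an undirected graph {node: set(neighbors)}.

    Basic Bron-Kerbosch enumeration without pivoting (fine for small inputs).
    """
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Hypothetical directed "follows" edges and "programs" edges.
follows = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "dave": {"alice"},
}
programs = {"Python": {"alice", "bob", "carol"}, "Go": {"alice", "dave"}}

# Pivot on the language node: only its users can be in the clique.
candidates = programs["Python"]

# Keep only mutual follows among the candidates.
adj = {u: {v for v in follows.get(u, set())
           if v in candidates and u in follows.get(v, set())}
       for u in candidates}

largest = max(maximal_cliques(adj), key=len)
print(sorted(largest))
```

Because dave is not attached to the Python node, he is dropped before the search begins, which is the kind of pruning that makes a language node a useful pivot.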

Figure 7-6. A graph schema that includes GitHub users, repositories, and programming languages

Example 7-14 introduces some sample code that constructs the updates as depicted in the final graph schema. Because all of the information that we need to construct the additional nodes and edges is already present in the existing graph (since we have already stored the programming language as a property on the repository nodes), no additional requests to the GitHub API are necessary.

Example 7-14. Updating the graph to include nodes for programming languages

# Iterate over all of the repos, and add edges for programming languages
# for each person in the graph. We'll also add edges back to repos so that
# we have a good point to "pivot" upon.

repos = [n
         for n in g.nodes_iter()
         if g.node[n]['type'] == 'repo']

for repo in repos:
    lang = (g.node[repo]['lang'] or "") + "(lang)"

    stargazers = [u
                  for (u, r, d) in g.in_edges_iter(repo, data=True)
                  if d['type'] == 'gazes'
                 ]

    for sg in stargazers:
        g.add_node(lang, type='lang')
        g.add_edge(sg, lang, type='programs')
        g.add_edge(lang, repo, type='implements')

312 | Chapter 7: Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More

Our final graph schema is capable of answering a variety of questions. A few questions that seem ripe for investigation at this point include:

• Which languages do particular users program with?

• How many users program in a particular language?

• Which users program in multiple languages, such as Python and JavaScript?

• Which programmer is the most polyglot (programs with the most languages)?

• Is there a higher correlation between particular languages? (For example, given that a programmer programs in Python, is it more likely that this same programmer also programs in JavaScript or with Go based upon the data in this graph?)

Example 7-15 provides some sample code that is a good starting point for answering most of these questions, and others like them.

Example 7-15. Sample queries for the final graph

# Imports repeated here so that the listing stands alone; g is the graph
# built up in the preceding examples.
import networkx as nx
from operator import itemgetter

# Some stats
print nx.info(g)
print

# What languages exist in the graph?
print [n
       for n in g.nodes_iter()
       if g.node[n]['type'] == 'lang']
print

# What languages do users program with?
print [n
       for n in g['ptwobrussell(user)']
       if g['ptwobrussell(user)'][n]['type'] == 'programs']

# What is the most popular programming language?
print "Most popular languages"
print sorted([(n, g.in_degree(n))
              for n in g.nodes_iter()
              if g.node[n]['type'] == 'lang'],
             key=itemgetter(1), reverse=True)[:10]
print

# How many users program in a particular language?
python_programmers = [u
                      for (u, l) in g.in_edges_iter('Python(lang)')
                      if g.node[u]['type'] == 'user']
print "Number of Python programmers:", len(python_programmers)
print

javascript_programmers = [u
                          for (u, l) in g.in_edges_iter('JavaScript(lang)')
                          if g.node[u]['type'] == 'user']
print "Number of JavaScript programmers:", len(javascript_programmers)
print

# What users program in both Python and JavaScript?
print "Number of programmers who use JavaScript and Python"
print len(set(python_programmers).intersection(set(javascript_programmers)))

# Programmers who use JavaScript but not Python
print "Number of programmers who use JavaScript but not Python"
print len(set(javascript_programmers).difference(set(python_programmers)))

# XXX: Can you determine who is the most polyglot programmer?

Sample output follows:

Name:
Type: DiGraph
Number of nodes: 48952
Number of edges: 174180
Average in degree: 3.5582
Average out degree: 3.5582

[u'PHP(lang)', u'Clojure(lang)', u'ActionScript(lang)', u'Logtalk(lang)',
 u'Scilab(lang)', u'Processing(lang)', u'D(lang)', u'Pure Data(lang)',
 u'Java(lang)', u'SuperCollider(lang)', u'Julia(lang)', u'Shell(lang)',
 u'Haxe(lang)', u'Gosu(lang)', u'JavaScript(lang)', u'CLIPS(lang)',
 u'Common Lisp(lang)', u'Visual Basic(lang)', u'Objective-C(lang)',
 u'Delphi(lang)', u'Objective-J(lang)', u'PogoScript(lang)', u'Scala(lang)',
 u'Smalltalk(lang)', u'DCPU-16 ASM(lang)', u'FORTRAN(lang)', u'ASP(lang)',
 u'XML(lang)', u'Ruby(lang)', u'VHDL(lang)', u'C++(lang)', u'Python(lang)',
 u'Perl(lang)', u'Assembly(lang)', u'CoffeeScript(lang)', u'Racket(lang)',
 u'Groovy(lang)', u'F#(lang)', u'Opa(lang)', u'Fantom(lang)', u'Eiffel(lang)',
 u'Lua(lang)', u'Puppet(lang)', u'Mirah(lang)', u'XSLT(lang)', u'Bro(lang)',
 u'Ada(lang)', u'OpenEdge ABL(lang)', u'Fancy(lang)', u'Rust(lang)',
 u'C(lang)', '(lang)', u'XQuery(lang)', u'Vala(lang)', u'Matlab(lang)',
 u'Apex(lang)', u'Awk(lang)', u'Lasso(lang)', u'OCaml(lang)',
 u'Arduino(lang)', u'Factor(lang)', u'LiveScript(lang)', u'AutoHotkey(lang)',
 u'Haskell(lang)', u'HaXe(lang)', u'DOT(lang)', u'Nu(lang)', u'VimL(lang)',
 u'Go(lang)', u'ABAP(lang)', u'ooc(lang)', u'TypeScript(lang)',
 u'Standard ML(lang)', u'Turing(lang)', u'Coq(lang)', u'ColdFusion(lang)',
 u'Augeas(lang)', u'Verilog(lang)', u'Tcl(lang)', u'Nimrod(lang)',
 u'Elixir(lang)', u'Ragel in Ruby Host(lang)', u'Monkey(lang)',
 u'Kotlin(lang)', u'C#(lang)', u'Scheme(lang)', u'Dart(lang)', u'Io(lang)',
 u'Prolog(lang)', u'Arc(lang)', u'PowerShell(lang)', u'R(lang)',
 u'AppleScript(lang)', u'Emacs Lisp(lang)', u'Erlang(lang)']

[u'JavaScript(lang)', u'Java(lang)', u'Python(lang)']
Most popular languages
[(u'JavaScript(lang)', 851), (u'Python(lang)', 715), ('(lang)', 642),
 (u'Ruby(lang)', 620), (u'Java(lang)', 573), (u'C(lang)', 556),
 (u'C++(lang)', 508), (u'PHP(lang)', 477), (u'Shell(lang)', 475),
 (u'Objective-C(lang)', 435)]

Number of Python programmers: 715

Number of JavaScript programmers: 851

Number of programmers who use JavaScript and Python
715
Number of programmers who use JavaScript but not Python
136

Although the graph schema is conceptually simple, the number of edges has increased by nearly 50% because of the additional programming language nodes! As we see from the output for a few sample queries, there are quite a large number of programming languages in use, and JavaScript and Python top the list. The primary source code for the original repository of interest is written in Python, so the emergence of JavaScript as a more popular programming language among users may be indicative of a web development audience. Of course, it is also the case that JavaScript is just a popular programming language, and there is often a high correlation between JavaScript as a client-side language and Python as a server-side language. Ruby is also a popular programming language and is commonly used for server-side web development. It appears fourth in the list. The appearance of '(lang)' as the third most popular language indicates that 642 users have starred at least one repository to which GitHub could not assign a programming language; in aggregate, those repositories rolled up into this single category.
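Two of the questions posed earlier, who the most polyglot programmer is and whether one language makes another more likely, reduce to simple counting once every user has explicit programs edges. The following is a minimal sketch over hypothetical toy data held in a plain dictionary (user to set of languages), not over the actual graph g.

```python
from collections import Counter

# Hypothetical "programs" edges: user -> set of language nodes.
programs = {
    "alice": {"Python", "JavaScript", "Go"},
    "bob": {"Python", "JavaScript"},
    "carol": {"JavaScript"},
    "dave": {"Python"},
}

# Most polyglot: the user with the most outgoing "programs" edges.
most_polyglot = max(programs, key=lambda user: len(programs[user]))

# Given that a user programs in Python, how often does each other
# language co-occur?
python_users = [u for u, langs in programs.items() if "Python" in langs]
co_counts = Counter(lang
                    for u in python_users
                    for lang in programs[u]
                    if lang != "Python")
share = {lang: n / len(python_users) for lang, n in co_counts.items()}

print(most_polyglot)
print(sorted(share.items()))
```

On the real graph, the same counts would come from the out-degree of each user node restricted to programs edges. Keep in mind that a programs edge only means the user starred a repository implemented in that language, so these figures measure interest rather than demonstrated skill.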

The possibilities are immense for analyzing a graph that expresses people’s interests in other people, open source projects in repositories, and programming languages. Whatever analysis you choose to do, think carefully about the nature of the problem and extract only the relevant data from the graph for analysis, either by zeroing in on a set of nodes to extract with a NetworkX graph’s subgraph method, or by filtering out nodes by type or frequency threshold.
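As a rough illustration of filtering by type and by frequency threshold, the sketch below keeps only user nodes and language nodes that at least two users program in. It uses a plain node dictionary and edge list as a stand-in for the graph. The data, the threshold of 2, and the induced_subgraph helper are all hypothetical; with NetworkX, the graph's subgraph method would play the role of that helper.

```python
from collections import Counter

# Hypothetical toy graph: node -> type, plus a list of directed edges.
nodes = {"alice": "user", "bob": "user", "carol": "user",
         "Python": "lang", "Go": "lang", "repo_a": "repo"}
edges = [("alice", "Python"), ("bob", "Python"), ("carol", "Go"),
         ("alice", "repo_a"), ("Python", "repo_a")]


def induced_subgraph(keep):
    """Return only the edges whose two endpoints are both in keep."""
    keep = set(keep)
    return [(u, v) for (u, v) in edges if u in keep and v in keep]


# Frequency threshold: language nodes with at least 2 incoming edges.
in_degree = Counter(v for (u, v) in edges if nodes[v] == "lang")
popular_langs = {lang for lang, n in in_degree.items() if n >= 2}

# Type filter: all user nodes.
users = {n for n, kind in nodes.items() if kind == "user"}

sub_edges = induced_subgraph(users | popular_langs)
print(sub_edges)
```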

A bipartite analysis of users and programming languages would likely be a worthwhile endeavor, given the nature of the relationship between users and programming languages. A bipartite graph involves two disjoint sets of vertices that are connected by edges between the sets. You could easily remove repository nodes from the graph at this point to drastically enhance the efficiency of computing global graph statistics (the number of edges would decrease by over 100,000).
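One possible form such a bipartite analysis could take is a projection onto the language side: once repository nodes are gone, count how many users each pair of languages has in common. The sketch below assumes hypothetical toy programs edges held in a plain list and is not drawn from the graph built in this chapter.

```python
from collections import Counter
from itertools import combinations

# Hypothetical user -> language "programs" edges left after dropping repos.
programs = [("alice", "Python"), ("alice", "JavaScript"),
            ("bob", "Python"), ("bob", "JavaScript"),
            ("carol", "Go"), ("carol", "Python")]

users = {u for (u, _) in programs}
langs = {lang for (_, lang) in programs}

# Bipartite check: the two vertex sets must not overlap.
is_bipartite = users.isdisjoint(langs)

# Group languages by user, then count shared users per language pair.
langs_by_user = {}
for user, lang in programs:
    langs_by_user.setdefault(user, set()).add(lang)

shared_users = Counter()
for user_langs in langs_by_user.values():
    for pair in combinations(sorted(user_langs), 2):
        shared_users[pair] += 1

print(is_bipartite)
print(shared_users.most_common())
```

Each key in shared_users is an alphabetically ordered pair of languages, and its value is the number of users connected to both, which gives a simple weight for edges in a language-to-language graph.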
