Full-text report, Proceedings of the 9th Scientific Conference, University of Science, VNU-HCM
VII-O-1
A SHALLOW APPROACH FOR QUERYING GRAPH DATABASE
Dương Quang Hưng, Nguyễn Minh Nhựt, Nguyễn Trần Minh Thư, Bùi Đắc Thịnh
Information System Department
University of Science, Ho Chi Minh City
ABSTRACT
The rapid growth of information system applications, driven by diverse human demands, has created essential requirements for data storage. NoSQL databases, the most common alternative to the traditional data models used to store structured data, are applied to improve the performance of systems with scalable databases. Among them, graph databases are responsible for storing and querying data that consists of graph nodes and links, which can be regarded as large, scalable data. In this paper, we analyze the pros and cons of graph databases in comparison with traditional data models, and we build an experimental scenario to evaluate the time efficiency of query processing. The evaluation on real data crawled from an operating information system shows why choosing a graph database is a justifiable decision for scalable data.
Keywords: NoSQL; graph database; graph model; scalable data model.
INTRODUCTION
Relational databases have been around for many decades and are the preferred database technology for most traditional data storage and retrieval applications [8]. They are usually queried with SQL, a declarative query language. Many analyses find that relational databases are generally efficient when the data does not contain many relationships; queries over heavily related data require join operations between large tables and cost a great deal of time. Although there have been different approaches such as XML or object databases, they have all been absorbed by most relational database management systems (RDBMSs) [1,2,9]. Recently, there has been a major shift in data stores called the NoSQL movement, driven by the challenge of reading and writing big data with high performance, together with the development of the Internet and cloud computing [1,9]. NoSQL still has many definitions that try to capture its core themes. In [9], the authors defined NoSQL as a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility. The most important difference between NoSQL and SQL is that NoSQL is free of joins and schemas. NoSQL allows data to be created without an entity model and extracted without joins, which are considered the most time-consuming operations.
Unlike relational databases, NoSQL uses a variety of data store types: simple key-value stores; column-family stores, an extension of the column in relational databases; graph stores, used to represent relationships; and document stores, used for variable data [2,9]. Among them, the graph database is the most appropriate solution for problems with dense relationships. As a system of nodes and relationships, a graph store is used to tackle typical problems such as social networks, fraud detection, or relationship-heavy data, where graphs are truly one of the most useful structures for modeling objects and links [1,2,5,8,9]. In a graph store, pairs of nodes are linked by relationships, and both nodes and relationships have their own properties, which are stored in key-value fields [9].
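The node, relationship, and property structure described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how Neo4j stores data; the class names `Node`, `Relationship`, and `PropertyGraph`, and the sample property values, are assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str
    properties: dict = field(default_factory=dict)  # key-value fields


@dataclass
class Relationship:
    start: int
    end: int
    rel_type: str
    properties: dict = field(default_factory=dict)  # relationships carry properties too


class PropertyGraph:
    """Hypothetical in-memory property graph: nodes plus typed, directed links."""

    def __init__(self):
        self.nodes = {}      # node_id -> Node
        self.outgoing = {}   # node_id -> list of Relationship

    def add_node(self, node_id, label, **properties):
        node = Node(node_id, label, dict(properties))
        self.nodes[node_id] = node
        self.outgoing.setdefault(node_id, [])
        return node

    def add_relationship(self, start, end, rel_type, **properties):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        rel = Relationship(start, end, rel_type, dict(properties))
        self.outgoing[start].append(rel)
        return rel

    def neighbours(self, node_id, rel_type):
        """Follow relationships of one type directly, without any join."""
        return [r.end for r in self.outgoing[node_id] if r.rel_type == rel_type]


# Sample data (made up for illustration).
graph = PropertyGraph()
graph.add_node(1, "USER", name="Nam")
graph.add_node(2, "FOODCOURT", name="Example Court")
graph.add_relationship(1, 2, "LIKE", source="example")
```

Because each node keeps its own list of outgoing relationships, following a link is a direct lookup rather than a join between tables, which is the property the comparison in this paper relies on.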
In this paper, we present a shallow approach to querying a graph data store on real data crawled from an operating information system. In an initial experiment, we evaluate the time efficiency of common and advanced queries on two database management systems, one representing relational databases and one representing NoSQL databases. In addition, we deploy an information system that uses a graph database to demonstrate the feasibility of our application and data.
The remainder of this paper is organized as follows. Section II presents related work on NoSQL and on graph data stores in particular. Section III describes the approach to querying data in a graph database. Section IV presents experiments on the crawled real data. Finally, Section V presents the conclusion.
RELATED WORK
There have been many studies investigating alternative storage systems to relational databases. In a sense, NoSQL is the blanket term for them. Under this term, many projects such as Cassandra, BigTable, CouchDB, Voldemort, and Dynamo have been presented and are increasingly widely used [1,8]. BigTable [12] is a large-scale, fast, distributed database system created and used by Google. Cassandra is an open-source, distributed key-value data store developed by Facebook [13], while Project Voldemort is LinkedIn's large-scale, persistent hash table, which works in a distributed environment and is designed mainly to handle failures [14].
ISBN: 978-604-82-1375-6
More recently, newer projects have appeared. Redis [11] is suitable for providing high-performance computing on small amounts of data but not for big data storage. MongoDB is a hybrid between relational and non-relational databases. It supports a range of complex data types and a powerful query language that covers most single-table query functions, together with effective indexing, which is claimed to make access 10 times faster than MySQL [10].
While [1] identified four main features that distinguish NoSQL from relational databases, namely concurrent reading and updating, support for mass storage and access requirements, easy scalability, and low cost [1,8,9], the author of [7] claimed that there are just two possible reasons to move from relational databases to NoSQL: performance and flexibility. These judgments are reasonably accurate for business problems with massively complex relationships between objects, such as social networking, rules-based engines, and mashups. In these cases, a graph system is the most suitable for quickly analyzing complex network structures, and even for mining patterns [8,9].
A graph store represents any complex network problem as a graph that contains nodes (vertices), relationships (edges), and their properties. A relationship can be thought of as the connection between real-world objects [9]. The authors of [9] also pointed out that queries in graph data stores are similar to traversing nodes in an ordinary graph: for example, what is the shortest path between two nodes, or which nodes have nearest neighbours with given properties.
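As an illustration of the traversal-style queries mentioned above, the following sketch finds a shortest path between two nodes with a breadth-first search. It is a generic textbook technique written for this example, not the algorithm used by any particular graph database; the function name `shortest_path`, the dictionary-based adjacency format, and the sample graph are assumptions.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one shortest path (fewest links) from source to target, or None.

    adjacency maps each node to an iterable of its neighbours (assumed format).
    """
    if source == target:
        return [source]
    parents = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in parents:
                continue
            parents[neighbour] = current
            if neighbour == target:
                # Walk back through the parents to rebuild the path.
                path = [target]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Sample directed graph (made up for illustration).
example_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
example_path = shortest_path(example_graph, "a", "e")
```

The search only touches nodes reachable from the source, so its cost depends on the neighbourhood explored rather than on the total size of the data set.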
Although graph data stores can address these problems, there are still few experiments that compare graph data stores with relational databases. In [1], the authors only gave some options for judging which properties NoSQL is well adapted to. The authors of [8] achieved results on specific aspects: they designed experiments comparing MySQL [16], a representative relational database, with Neo4j [1, 3, 4], a representative NoSQL database. The experiments, based on a predefined set of queries, evaluated processing speed on both data store management systems. However, the data consists of random characters (8K or 32K) and is not real-world data.
Compared with previous work, our work makes the following contributions to the assessment of the NoSQL movement:
First, we evaluate time efficiency and compare a relational database management system with a graph data store system. The evaluation is performed on real-world data, crawled from an operating information system, using meaningful graph queries.
Second, we build and deploy another information system that uses a graph data store and graph queries to illustrate the feasibility of using a graph data store in practice.
To our knowledge, this is one of the first works that uses real-world data to compare the performance of a relational database and a NoSQL database, specifically MSSQL Server [17] and the Neo4j graph database [1, 3, 4].
A SHALLOW APPROACH FOR QUERYING DATA
To compare time efficiency, we selected specific database management systems, one relational and one graph-based. Based on related work and our technical knowledge, we chose MSSQL Server as the representative relational database and Neo4j as the representative graph (NoSQL) database.
The process for benchmarking the two systems is as follows. First, we build a crawler to collect real-world data from Foody (http://www.foody.vn), a system with more than one million users. The data describes a social network in which food courts are the nuclei. Then, based on our knowledge of the data schema, we create two schemas, one for each database management system (MSSQL Server and Neo4j). Next, the crawled data goes through ETL processing [15] before being loaded into the complete databases. The experiments are run on these databases with the same predefined, meaningful queries, which are suitable and essential for real applications.
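The transformation step of such a pipeline can be sketched as follows. The paper does not describe its ETL rules, so this is only a plausible illustration: the function name `rows_to_graph`, the column names `id`, `name`, `user_id`, and `friend_id`, and the cleaning rules (trimming names, dropping self-links and dangling references, treating friendship as undirected) are all assumptions.

```python
def rows_to_graph(user_rows, friend_rows):
    """Turn relational-style rows into USER nodes and undirected FRIEND links.

    user_rows:   iterable of dicts with hypothetical keys "id" and "name"
    friend_rows: iterable of dicts with hypothetical keys "user_id", "friend_id"
    """
    nodes = {}
    for row in user_rows:
        nodes[row["id"]] = {"label": "USER", "name": row["name"].strip()}

    links = set()
    for row in friend_rows:
        a, b = row["user_id"], row["friend_id"]
        if a == b or a not in nodes or b not in nodes:
            continue  # assumed rule: drop self-links and dangling references
        links.add((min(a, b), max(a, b)))  # assumed rule: store each pair once
    return nodes, sorted(links)


# Sample rows (made up for illustration).
sample_users = [{"id": 1, "name": " Nam "}, {"id": 2, "name": "An"}]
sample_friends = [
    {"user_id": 1, "friend_id": 2},
    {"user_id": 2, "friend_id": 1},   # duplicate of the pair above
    {"user_id": 1, "friend_id": 1},   # self-link
    {"user_id": 2, "friend_id": 7},   # unknown user
]
sample_nodes, sample_links = rows_to_graph(sample_users, sample_friends)
```

The same cleaned rows can feed both target schemas, which keeps the two databases comparable.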
Table 1 gives brief descriptions of the objects.
Table 1. Object description in data
| Object name (Vietnamese) | Object name (English) | Description |
|---|---|---|
| THANHVIEN | USER | Member's information |
| DIADIEMANUONG | FOODCOURT | Food court information |
| BANBE | FRIEND | Information of Member's friend |
| BINHLUAN | COMMENT | Member's comment for a venue |
| DANHGIA | RATING | Member's rating point for a venue |
| DIADIEM_MONAN | FOODCOURT_FOOD | Relationship between COURT and FOOD |
| MONDACTRUNG | TYPICALFOOD | Food information, corresponding to some courts |
| THANHVIEN_CHECKIN_DIADIEM | USER_CHECKIN_FOODCOURT | Relationship between USER and COURT, related to action "Check-in" |
| THANHVIEN_LIKE_DIADIEM | USER_LIKE_FOODCOURT | Relationship between USER and COURT, related to action "Like" |
EXPERIMENT
To evaluate the time efficiency of queries on the two database management systems, we predefined three queries corresponding to existing problems on data with dense relationship networks. The queries are presented in Table 2. We measured the running time on a machine with a 2.1 GHz Core i3 CPU and 2 GB of RAM. Execution time is measured in milliseconds (ms).
Table 2. Experiment queries on food court data
| ID | Query |
|---|---|
| 1 | Finding friends of friend in variety of depth-level |
| 2 | Browsing food courts that friends used to check-in, like, comment or rate, with the given properties |
| 3 | Suggesting food courts that followed a pattern (User used to come X then coming Y) |
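Measuring execution time in milliseconds over repeated runs can be done with a small harness like the one below. This is a generic sketch and not the measurement tool used in the experiments; the function name `time_query_ms`, the `run_query` callable, and the default of five repetitions are assumptions.

```python
import time


def time_query_ms(run_query, repetitions=5):
    """Run a zero-argument callable several times; return each duration in ms."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()   # monotonic, high-resolution clock
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


# Stand-in workload; in practice run_query would submit the same query
# to the database under test.
sample_timings = time_query_ms(lambda: sum(range(10_000)), repetitions=3)
```

Keeping every individual timing, rather than only an average, makes it possible to see whether a system is stable from one execution to the next.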
Data
The experiments use the full data set presented above. Table 3 gives the number of records for each object. The same data is used in MSSQL Server and Neo4j.
Table 3. Data records used for the experiment
| Object name (Vietnamese) | Object name (English) | Number of records |
|---|---|---|
| THANHVIEN | USER | 41881 |
| DIADIEMANUONG | VENUE | 16633 |
| BANBE | FRIEND | 86746 |
| BINHLUAN | COMMENT | 28148 |
| DANHGIA | RATING | 14345 |
| DIADIEM_MONAN | VENUE_CUISINE | 17571 |
| MONDACTRUNG | TYPICALCUISINE | 14045 |
| THANHVIEN_CHECKIN_DIADIEM | USER_CHECKIN_VENUE | 3886 |
| THANHVIEN_LIKE_DIADIEM | USER_LIKE_VENUE | 13980 |
Query
As presented above, the experiments evaluate the two systems on the three queries in turn. Each of the following subsections describes the purpose of one query and analyzes its execution time (ms).
i. Query 1: Finding friends of friends at varying depth levels.
This query finds the friends of a given user, along with their properties, up to a given depth level. For example, suppose the current user's name is Nam and the query looks for all friends of Nam at a given depth level. If the depth level equals 2, the query finds not only the friends of Nam but also all friends of friends of Nam. The experimental result on the two database systems is presented in Figure 1. The execution time of this query in Neo4j tends to stay stable as the depth level increases, while in MSSQL Server it increases rapidly when the depth level reaches 5. Figure 2 shows that the execution time in Neo4j does grow with the depth level, but only slightly in comparison with MSSQL Server.
Figure 1. First query's experimental result on time costing (x-axis: depth level, 1 to 5; y-axis: time in ms; series: RDBMS, Graph Datastore)
Figure 2. First query's experiment on Neo4j with high depth-level (x-axis: depth level, 6 to 10; series: Graph Datastore)
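The logic of Query 1 can be sketched as a depth-limited breadth-first traversal over the friendship links. This is an illustration of the idea only, not the Cypher or SQL used in the experiments; the function name `friends_within_depth`, the adjacency format, and every user name other than Nam are assumptions.

```python
from collections import deque


def friends_within_depth(friends, user, max_depth):
    """Return {friend: depth} for everyone reachable within max_depth links.

    friends maps each user to an iterable of direct friends (assumed format).
    """
    depth_of = {user: 0}
    queue = deque([user])
    while queue:
        current = queue.popleft()
        if depth_of[current] == max_depth:
            continue  # do not expand beyond the requested depth
        for other in friends.get(current, ()):
            if other not in depth_of:
                depth_of[other] = depth_of[current] + 1
                queue.append(other)
    del depth_of[user]  # the starting user is not their own friend
    return depth_of


# Sample friendship network (made up for illustration).
sample_network = {
    "Nam": ["An", "Binh"],
    "An": ["Nam", "Chi"],
    "Binh": ["Nam"],
    "Chi": ["An", "Dung"],
    "Dung": ["Chi"],
}
depth_two = friends_within_depth(sample_network, "Nam", 2)
```

Each extra depth level only expands the frontier of newly reached users. In a relational schema, by contrast, a straightforward formulation needs one more self-join of the friendship table per level.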
ii. Query 2: Browsing food courts that friends have checked in at, liked, commented on, or rated, with the given properties.
This query lists all food courts related to the current user's friends (through check-in, like, comment, or rating) that have several given properties. The result is presented in Figure 3, where we measured the execution time over repeated executions. Besides being about 20 times faster than MSSQL Server on this query, Neo4j also shows stable execution times.
Figure 3. Second query's experimental result on time costing (x-axis: execution times, 1st to 5th; y-axis: time in ms; series: RDBMS, Graph Datastore)
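A minimal sketch of the logic behind Query 2 follows: walk from the user to each friend, then to the food courts that friend interacted with, and keep those matching the requested properties. The function name `venues_from_friends`, the data layout, and the property names `district` and `cuisine` are assumptions made for the example, not the schema used in the experiments.

```python
def venues_from_friends(friends, interactions, venues, user, actions=None, **required):
    """List venue ids that friends of `user` interacted with and that match
    all `required` property values.

    friends:      user -> iterable of direct friends
    interactions: user -> iterable of (action, venue_id) pairs
    venues:       venue_id -> dict of properties
    actions:      optional set restricting which actions count
    """
    matches = set()
    for friend in friends.get(user, ()):
        for action, venue_id in interactions.get(friend, ()):
            if actions is not None and action not in actions:
                continue
            properties = venues[venue_id]
            if all(properties.get(key) == value for key, value in required.items()):
                matches.add(venue_id)
    return sorted(matches)


# Sample data (made up for illustration).
q2_friends = {"Nam": ["An", "Binh"]}
q2_interactions = {
    "An": [("like", 10), ("rate", 11)],
    "Binh": [("checkin", 10), ("comment", 12)],
}
q2_venues = {
    10: {"district": "1", "cuisine": "pho"},
    11: {"district": "3", "cuisine": "pho"},
    12: {"district": "1", "cuisine": "banh mi"},
}
q2_result = venues_from_friends(q2_friends, q2_interactions, q2_venues, "Nam", district="1")
```

The work done is proportional to the number of friends and their interactions, not to the size of the whole interaction table.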
iii. Query 3: Suggesting food courts that follow a pattern (users who went to X then went to Y).
In the real world, people often want suggestions before making a decision. We assume that when user A has visited food court X, user A tends to visit food court Y next, and so on. With a large amount of data, such patterns emerge, and this query is used to suggest them to users. The properties of the "next" food court are also listed. In this case, we explore how the execution time increases on each database system when more criteria (the check-in, like, and comment actions) are included. The result is presented in Figure 4 and Figure 5. When the query includes more criteria, the execution time naturally increases on both database systems, but the stability of Neo4j is clearly evident.
Figure 4. Third query's experimental result on time costing according to included criteria (x-axis: number of criteria, 4 to 6; y-axis: time in ms; series: RDBMS, Graph Datastore)
Figure 5. Third query's experimental result on time costing on Neo4j (x-axis: number of criteria, 4 to 7; y-axis: time in ms; series: Graph Datastore)
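The "visited X, then Y" pattern behind Query 3 can be illustrated by counting which food court most often follows a given one in users' visit histories. This is a simple frequency count chosen for the example; the paper does not specify how its patterns are computed, and the function name `suggest_next`, the ordered visit lists, and the sample data are assumptions.

```python
from collections import Counter


def suggest_next(visits_by_user, current_venue, top_n=3):
    """Suggest venues that users most often visited right after current_venue.

    visits_by_user maps each user to a list of venues in visit order (assumed).
    """
    followers = Counter()
    for visits in visits_by_user.values():
        # Pair every visit with the one that immediately follows it.
        for previous, following in zip(visits, visits[1:]):
            if previous == current_venue and following != current_venue:
                followers[following] += 1
    return [venue for venue, _ in followers.most_common(top_n)]


# Sample visit histories (made up for illustration).
sample_visits = {
    "u1": ["X", "Y", "Z"],
    "u2": ["X", "Y"],
    "u3": ["X", "Z"],
}
sample_suggestions = suggest_next(sample_visits, "X")
```

Adding more criteria, such as likes or comments in addition to check-ins, would mean building the visit lists from more relationship types before counting.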
Application development
Based on the characteristics of the crawled data and several functional and non-functional requirements, we developed an information system application that uses ASP.NET MVC [18] and Neo4j Community server [3, 4], aiming to show the feasibility of this approach to storing and querying large, scalable data.
The application is deployed as a website that serves the same purpose as Foody but focuses on advanced queries that exploit the capabilities of the graph data store.
CONCLUSION
In this paper, we presented a shallow approach to querying data in a graph database and compared it with a relational database. We also described the advantages and disadvantages of the graph database Neo4j in comparison with MSSQL Server as a case study. A graph database fits scalable data that can be represented as nodes and links between them. The experiments show that the graph database is considerably more efficient than the relational database when queries are complex and require join operations between objects. As a drawback, for simple queries or data with sparse relationships, the relational database still performs well compared with the graph database. Graph databases are therefore best suited to large-scale, dense data. One reason for the difference is that a relational database enforces many constraints on the data, which a graph data store does not treat as important at run time.
However, our research still has limitations, such as the specific SQL and NoSQL systems chosen: here, MSSQL Server for SQL and Neo4j for NoSQL. For a more objective view, the comparison should cover a broader set of systems on the real data we have already crawled. Moreover, the application we built should be deployed in practice to gather feedback as the data grows.
REFERENCES
[1]. Han, Jing, et al. "Survey on NoSQL database." Pervasive computing and applications (ICPCA), 2011 6th
international conference on. IEEE, 2011.
[2]. Robinson, Ian, Jim Webber, and Emil Eifrem. Graph databases. O'Reilly Media, Inc., 2013.
[3]. Miller, Justin J. "Graph Database Applications and Concepts with Neo4j." (2013).
[4]. Partner, Jonas, Aleksa Vukotic, and Nicki Watt. Neo4j in Action. O'Reilly Media, 2013.
[5]. Neubauer, Peter. "Graph databases, NOSQL and Neo4j." (2010).
[6]. Holzschuher, Florian, and René Peinl. "Performance of graph query languages: comparison of cypher,
gremlin and native access in neo4j." Proceedings of the Joint EDBT/ICDT 2013 Workshops. ACM, 2013.
[7]. Stonebraker, Michael. "SQL databases v. NoSQL databases." Communications of the ACM 53.4 (2010):
10-11.
[8]. Vicknair, Chad, et al. "A comparison of a graph database and a relational database: a data provenance
perspective." Proceedings of the 48th annual Southeast regional conference. ACM, 2010.
[9]. McCreary, Dan, and Ann Kelly. "Making Sense of NoSQL." Greenwich, Conn.: Manning Publications
(2013).
[10]. Banker, Kyle. MongoDB in action. Manning Publications Co., 2011.
[11]. Carlson, Josiah L. Redis in Action. Manning Publications Co., 2013.
[12]. Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on
Computer Systems (TOCS) 26.2 (2008): 4.
[13]. Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM
SIGOPS Operating Systems Review 44.2 (2010): 35-40.
[14]. Sumbaly, Roshan, et al. "Serving large-scale batch computed data with project voldemort." Proceedings
of the 10th USENIX conference on File and Storage Technologies. USENIX Association, 2012.
[15]. Karakasidis, Alexandros, Panos Vassiliadis, and Evaggelia Pitoura. "ETL queues for active data
warehousing." Proceedings of the 2nd international workshop on Information quality in information
systems. ACM, 2005.
[16]. MySQL: the world's most popular open source database. MySQL AB, 1995.
[17]. Mistry, Ross, and Stacia Misner. Introducing Microsoft® MSSQL server® 2012. O'Reilly Media, Inc., 2012.
[18]. Esposito, Dino. Programming Microsoft ASP.NET MVC. Pearson Education, 2011.