A SHALLOW APPROACH FOR QUERYING GRAPH DATABASE

Full-text report, Proceedings of the 9th Scientific Conference, University of Science, VNU-HCM
VII-O-1

Dương Quang Hưng, Nguyễn Minh Nhựt, Nguyễn Trần Minh Thư, Bùi Đắc Thịnh
Information System Department, University of Science, Ho Chi Minh City

ABSTRACT

The rapid growth of information system applications, driven by diverse human demands, has created essential requirements for data storage. NoSQL databases, the most common alternative to the traditional data models used to store structured data, are applied to improve the performance of systems with scalable databases. Among them, graph databases are responsible for storing and querying data organized as graph nodes and links, which can grow to a very large scale. In this paper, we analyze the pros and cons of graph databases in comparison with traditional data models, and we build an experimental scenario to evaluate the time efficiency of query processing. The evaluation on real data crawled from an operating information system shows why choosing a graph database is a justifiable decision for scalable data.

Keywords: NoSQL; graph database; graph model; scalable data model.

INTRODUCTION

Relational databases have been around for many decades and are the preferred database technology for most traditional data storage and retrieval applications [8]. They typically use SQL, a declarative query language, to exploit such databases. According to many analyses, relational databases are generally efficient when the data does not contain many relationships; otherwise, queries require join operations between large tables, which cost a great deal of time.
Although different approaches such as XML or object databases have appeared, almost all of them have been absorbed by relational database management systems (RDBMSs) [1,2,9]. Recently, there have been many shifts in data stores, known as the NoSQL movement, driven by the challenge of reading and writing big data efficiently as the Internet and cloud computing developed [1,9]. NoSQL still has many definitions of its core themes. In [9], the authors defined NoSQL as a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility. The most important difference between NoSQL and SQL is that NoSQL is free of joins and schemas. NoSQL allows data to be created without an entity model and extracted without joins, which are considered the most time-consuming operations. Unlike relational databases, NoSQL uses a variety of data store types: simple key-value stores; column-family stores, an extension of the column in relational databases; graph stores, used to represent relationships; and document stores, used for variable data [2,9].

Among them, the graph database is the most appropriate solution for problems with dense relationships. As a system of nodes and relationships, a graph store is used for typical problems such as social networks, fraud detection, or relationship-heavy data, where graphs are among the most useful structures for modeling objects and links [1,2,5,8,9]. In a graph store, two nodes are linked by one or more relationships, and both nodes and relationships have their own properties, which are stored in key-value fields [9].

In this paper, we present a shallow approach to querying a graph data store on real data crawled from an operating information system. In the initial experiment, we evaluate the time efficiency of common and advanced queries on two database management systems, one representing relational databases and one representing NoSQL databases.
In addition, we deploy an information system that uses a graph database to demonstrate the feasibility of our application and data.

The remainder of this paper is organized as follows. First, Section II presents the related work on NoSQL and on graph data stores in particular. Next, we describe the approach to querying data in a graph database. Then, Section IV presents experiments on the crawled real data. Finally, the conclusion is presented in Section V.

RELATED WORK

There have been many studies investigating storage alternatives to relational databases, and NoSQL is in some ways the blanket term for them. Under this term, many projects such as Cassandra, BigTable, CouchDB, Voldemort, and Dynamo have been presented and are increasingly widely used [1,8]. BigTable [12] is a large-scale, fast, distributed database system created and used by Google. Cassandra is an open-source, distributed key-value data store developed by Facebook [13], while Project Voldemort is LinkedIn's large-scale, persistent hash table, which works in a distributed environment and is designed mainly to handle failures [14].

ISBN: 978-604-82-1375-6

More recently, some newer projects have appeared. Redis [11] is suitable for providing high-performance computing on small amounts of data but not for big data storage. MongoDB is a hybrid between relational and non-relational databases; it supports a range of complex data types and a powerful query language that covers most single-table query functions, together with effective indexing, which is claimed to make data access 10 times faster than in MySQL [10].
In [1], the authors pointed out four main features that distinguish NoSQL from relational databases: concurrent reading and updating, support for mass storage and access requirements, easy scalability, and low cost [1,8,9]. The author of [7], in contrast, claimed that there are only two possible reasons to move from relational databases to NoSQL: performance and flexibility. These judgments are reasonably accurate for business problems with massively complex relationships between objects, such as social networking, rules-based engines, and mashups. In these cases, a graph system is the most suitable choice for quickly analyzing complex network structures, and even for mining patterns [8,9].

A graph store represents any complex network problem as a graph that contains nodes (vertices), relationships (edges), and their properties. A relationship can be thought of as the connection between real-world objects [9]. The authors of [9] also pointed out that queries in graph data stores are similar to traversals of an ordinary graph: what is the shortest path between two nodes, which nodes have nearest neighbors with given properties, and so on.

Although graph data stores can address these problems, there are still few experiments that compare graph data stores with relational databases. In [1], the authors only gave some options to consider regarding the properties for which NoSQL is well adapted. The authors of [8] achieved results in specific aspects: they designed experiments comparing MySQL [16], a representative relational database, with Neo4j [1, 3, 4], a representative NoSQL database. The experiments, based on a predefined set of queries, evaluated the processing speed of both data store management systems. However, the data consisted of random characters (8K or 32K) and was not real-world data.
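As an illustration of the traversal-style queries attributed to [9] above, the following minimal sketch finds a shortest path between two nodes with breadth-first search. It is not taken from any of the cited systems: the function name `shortest_path`, the dictionary-based adjacency list, and the sample names are all assumptions made for this example.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was first reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in previous:
                continue  # already visited
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor links back to the start node.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical friendship graph, for illustration only.
friends = {
    "Nam": ["Lan", "Minh"],
    "Lan": ["Nam", "Hoa"],
    "Minh": ["Nam"],
    "Hoa": ["Lan", "Tuan"],
    "Tuan": ["Hoa"],
}

print(shortest_path(friends, "Nam", "Tuan"))
```

Breadth-first search visits nodes in order of their distance from the start node, so the first time the goal is reached the path found has the fewest relationships. A graph store can answer such a query by following stored relationships directly instead of joining tables.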
Compared with previous work, our work makes the following contributions to assessing the NoSQL movement. First, we evaluate time efficiency and compare a relational database management system with a graph data store system. The evaluation is carried out on real-world data, crawled from an operating information system, using meaningful graph queries. Second, we build and deploy another information system using a graph data store and graph queries to illustrate the feasibility of using a graph data store in practice. To our knowledge, this is one of the first works that exploits real-world data to compare the performance of a relational database and a NoSQL database, specifically MSSQL Server [17] and the Neo4j graph database [1, 3, 4].

A SHALLOW APPROACH FOR QUERYING DATA

To compare time efficiency, we selected specific database management systems of both the relational and the graph kind. Based on related work and our technical knowledge, we chose MSSQL Server as the representative relational database and Neo4j as the representative graph (NoSQL) database.

The process for benchmarking the two systems is as follows. First, we build a crawler to collect real-world data from Foody (http://www.foody.vn), a system with more than one million users. The data describes a social network in which food courts are the nuclei; Table 1 gives brief descriptions of its objects. Then, based on our knowledge of the data schema, we create two schemas, one for each database management system (MSSQL Server and Neo4j). Next, the crawled data goes through ETL processing [15] before being loaded into the two databases. The experiments are run on these databases with the same predefined, meaningful queries, which are suitable and essential for real applications.
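To make the two-schema setup more concrete, the toy sketch below stores the same friendship records once as a relational table and once as a graph adjacency structure, and asks both for the friends of a user's friends. It is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for MSSQL Server, a dictionary of sets stands in for Neo4j, and the table name, column names, function names, and sample records are hypothetical rather than taken from the crawled Foody data.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (user, friend) records; not taken from the crawled data.
FRIEND_ROWS = [
    ("Nam", "Lan"),
    ("Nam", "Minh"),
    ("Lan", "Hoa"),
    ("Minh", "Hoa"),
    ("Hoa", "Tuan"),
]


def relational_friends_of_friends(rows, user):
    """Relational schema: one friend table; the second hop needs a self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE friend (user_name TEXT, friend_name TEXT)")
    # Store both directions so that friendship is symmetric.
    conn.executemany(
        "INSERT INTO friend VALUES (?, ?)",
        rows + [(b, a) for a, b in rows],
    )
    result = conn.execute(
        """
        SELECT DISTINCT f2.friend_name
        FROM friend AS f1
        JOIN friend AS f2 ON f2.user_name = f1.friend_name
        WHERE f1.user_name = ? AND f2.friend_name <> ?
        """,
        (user, user),
    ).fetchall()
    conn.close()
    return {name for (name,) in result}


def graph_friends_of_friends(rows, user):
    """Graph schema: adjacency sets; the second hop is a direct traversal."""
    adjacency = defaultdict(set)
    for a, b in rows:
        adjacency[a].add(b)
        adjacency[b].add(a)
    found = set()
    for friend in adjacency[user]:
        found |= adjacency[friend]  # follow each friend's own relationships
    found.discard(user)
    return found


print(relational_friends_of_friends(FRIEND_ROWS, "Nam"))
print(graph_friends_of_friends(FRIEND_ROWS, "Nam"))
```

Both functions return the same set of users. In the relational version each additional hop adds another self-join on the friend table, whereas the graph version only follows the stored relationships of the nodes already reached, which is the difference the benchmark is designed to measure.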
Table 1. Object description in data

| Object name (vi-VN) | Object name (en-US) | Description |
|---|---|---|
| THANHVIEN | USER | Member's information |
| DIADIEMANUONG | FOODCOURT | Food court information |
| BANBE | FRIEND | Information about a member's friends |
| BINHLUAN | COMMENT | Member's comment on a venue |
| DANHGIA | RATING | Member's rating points for a venue |
| DIADIEM_MONAN | FOODCOURT_FOOD | Relationship between COURT and FOOD |
| MONDACTRUNG | TYPICALFOOD | Food information, corresponding to some courts |
| THANHVIEN_CHECKIN_DIADIEM | USER_CHECKIN_FOODCOURT | Relationship between USER and COURT, related to the action "Check-in" |
| THANHVIEN_LIKE_DIADIEM | USER_LIKE_FOODCOURT | Relationship between USER and COURT, related to the action "Like" |

EXPERIMENT

To evaluate the time efficiency of queries on the two database management systems, we predefined three queries corresponding to existing problems on network data with dense relationships. The queries are presented in Table 2. We measured the running times on a machine with a 2.1 GHz Core i3 CPU and 2 GB of RAM. Execution time is measured in milliseconds (ms).

Table 2. Experiment queries on food court data

| ID | Query |
|---|---|
| 1 | Finding friends of friends at various depth levels |
| 2 | Browsing food courts that friends have checked in to, liked, commented on, or rated, with given properties |
| 3 | Suggesting food courts that follow a pattern (users who went to X then went to Y) |

Data

The data for the experiments is the full data set presented above. Table 3 gives the number of records of each object. The same data is used in MSSQL Server and Neo4j.

Table 3. Data records used for the experiment

| Object name (vi-VN) | Object name (en-US) | Number of records |
|---|---|---|
| THANHVIEN | USER | 41881 |
| DIADIEMANUONG | VENUE | 16633 |
| BANBE | FRIEND | 86746 |
| BINHLUAN | COMMENT | 28148 |
| DANHGIA | RATING | 14345 |
| DIADIEM_MONAN | VENUE_CUISINE | 17571 |
| MONDACTRUNG | TYPICALCUISINE | 14045 |
| THANHVIEN_CHECKIN_DIADIEM | USER_CHECKIN_VENUE | 3886 |
| THANHVIEN_LIKE_DIADIEM | USER_LIKE_VENUE | 13980 |

Query

As presented above, the experiments evaluate the two systems on three queries in turn. In each subsection below, the purpose of each query and its time performance (ms) are described and analyzed.

i. Query 1: Finding friends of friends at various depth levels.

This query finds friends, along with their properties, for a given user and depth level. For example, suppose the current user's name is Nam; the query aims to find all friends of Nam up to the given depth level. If the depth level equals 2, the query finds not only the friends of Nam but also all friends of friends of Nam. The experimental result of this query on the two database systems is presented in Figure 1. The time cost of this query in Neo4j tends to remain stable as the depth level increases, while the time cost in MSSQL Server increases rapidly when the depth level reaches 5. Figure 2 also shows that the time cost in Neo4j grows with the depth level, but only slightly in comparison with MSSQL Server.

Figure 1. First query's experimental result on time cost (x-axis: depth level, 1 to 5; y-axis: time in ms; series: RDBMS and Graph Datastore)

Figure 2. First query's experiment on Neo4j with high depth levels (x-axis: depth level, 6 to 10; y-axis: time; series: Graph Datastore)

ii. Query 2: Browsing food courts that friends have checked in to, liked, commented on, or rated, with given properties.
This query lists all food courts related to the current user's friends (through check-in, like, comment, or rating) with several given properties. The result is presented in Figure 3, where we measured the time cost over repeated executions. Apart from being about 20 times faster than MSSQL Server on this query, Neo4j also shows stable execution times.

Figure 3. Second query's experimental result on time cost (x-axis: execution, 1st to 5th; y-axis: time in ms; series: RDBMS and Graph Datastore)

iii. Query 3: Suggesting food courts that follow a pattern (users who went to X then went to Y).

In the real world, people often need suggestions before making a decision. We assume that when user A has visited food court X, user A tends to visit food court Y, and so on. With large data, such patterns emerge, and this query is used to suggest them to users. The properties of the "next" food court are also listed. In this case, we explore how the time cost increases for each database system when more criteria (the actions check-in, like, comment) are included. The result is presented in Figure 4 and Figure 5. When the query includes more criteria, the time cost naturally increases on both database systems, but Neo4j's stability is clearly evident.

Figure 4. Third query's experimental result on time cost according to included criteria (x-axis: number of criteria, 4 to 6; y-axis: time in ms; series: RDBMS and Graph Datastore)

Figure 5. Third query's experimental result on time cost on Neo4j (x-axis: number of criteria, 4 to 7; y-axis: time in ms; series: Graph Datastore)

Application development

Based on the characteristics of the crawled data and several functional and non-functional requirements, we developed an information system application that uses ASP.NET MVC [18] and Neo4j Community server [3, 4], aiming to show the feasibility of an approach to storing and querying large-scale data. The application is deployed as a website that serves the same purpose as Foody but focuses on advanced queries that exploit the capabilities of the graph data store.

CONCLUSION

In this paper, we presented a shallow approach to querying data in a graph database and compared it with a relational database. We also described the advantages and disadvantages of the graph database Neo4j in comparison with MSSQL Server as a case study. A graph database is compatible with scalable data that can be represented as nodes and links between them. Experiments show that the graph database is considerably more effective than the relational database when queries are complex and require join operations between objects. As a drawback, for simple queries or data with sparse relationships, the relational database still shows high performance compared with the graph database. The graph database is therefore best suited to large-scale, dense data. One of the reasons is that a relational database enforces many constraints on data, which are not considered important in real time in a graph data store.

However, our research still has some limitations, such as the specific SQL and NoSQL interfaces used; in this case, they are MSSQL Server for SQL and Neo4j for NoSQL. To obtain an objective view, the comparison should cover a set of interfaces on the real data that we have crawled. Moreover, the application we built should be deployed in practice to gather feedback as the scalable data grows.

REFERENCES

[1]. Han, Jing, et al. "Survey on NoSQL database." Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on. IEEE, 2011.
[2]. Robinson, Ian, Jim Webber, and Emil Eifrem. Graph Databases. O'Reilly Media, Inc., 2013.
[3]. Miller, Justin J. "Graph Database Applications and Concepts with Neo4j." (2013).
[4]. Partner, Jonas, Aleksa Vukotic, and Nicki Watt. Neo4j in Action. O'Reilly Media, 2013.
[5]. Neubauer, Peter. "Graph databases, NOSQL and Neo4j." (2010).
[6]. Holzschuher, Florian, and René Peinl. "Performance of graph query languages: comparison of Cypher, Gremlin and native access in Neo4j." Proceedings of the Joint EDBT/ICDT 2013 Workshops. ACM, 2013.
[7]. Stonebraker, Michael. "SQL databases v. NoSQL databases." Communications of the ACM 53.4 (2010): 10-11.
[8]. Vicknair, Chad, et al. "A comparison of a graph database and a relational database: a data provenance perspective." Proceedings of the 48th Annual Southeast Regional Conference. ACM, 2010.
[9]. McCreary, Dan, and Ann Kelly. Making Sense of NoSQL. Greenwich, Conn.: Manning Publications, 2013.
[10]. Banker, Kyle. MongoDB in Action. Manning Publications Co., 2011.
[11]. Carlson, Josiah L. Redis in Action. Manning Publications Co., 2013.
[12]. Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.
[13]. Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.
[14]. Sumbaly, Roshan, et al. "Serving large-scale batch computed data with Project Voldemort." Proceedings of the 10th USENIX Conference on File and Storage Technologies. USENIX Association, 2012.
[15]. Karakasidis, Alexandros, Panos Vassiliadis, and Evaggelia Pitoura. "ETL queues for active data warehousing." Proceedings of the 2nd International Workshop on Information Quality in Information Systems. ACM, 2005.
[16]. MySQL: the world's most popular open source database. MySQL AB, 1995.
[17]. Mistry, Ross, and Stacia Misner. Introducing Microsoft® MSSQL server® 2012. O'Reilly Media, Inc., 2012.
[18]. Esposito, Dino. Programming Microsoft ASP.NET MVC. Pearson Education, 2011.
