Data Modeling and Query Processing for Online Social Networking Services

by
© Sun Yang

A thesis submitted for the degree of Master of Science
School of Computing
National University of Singapore
2011

Abstract

Web 2.0 has boosted the proliferation of online social networking services. Online Social Networking Sites (SNSs) have become a fast-growing business on the Internet. Hundreds of millions of individual users create online profiles and share personal information with their friends on these sites, which facilitates a high level of user personalization and user inter-communication. The users publish their creations, called User Generated Content (UGC), such as bookmarks, pictures, videos and blog posts, to entertain others or to be entertained by other users' contributions. Popular online social networking sites therefore possess huge web communities and contain enormous collections of content generated by their users. Consequently, the dramatically growing online social networking data is becoming more and more complex, heterogeneous and temporal, and it is increasingly challenging to manage such data.

In the past decades, the database research community has proposed various database models as conceptual frameworks that provide the foundations for solving data management problems in a specific domain. However, as far as we know, existing database models, query languages and access methods do not offer adequate and native support for the representation, management, querying and especially inter-operability of online social networking data. To meet this challenge, we choose to move beyond the traditional approach.

In this thesis we present the concept and design of an expressive standard graph data model, which gives clients easy control over data. We also provide a detailed illustration of the operators, SNG-Algebra, and the query language SNGQL designed for the new graph database system. The graphical formalism in this thesis for online social networking
data offers high expressibility and adequate modeling power.

Acknowledgements

I wish to specially thank my supervisor, Professor TAY Yong Chiang, who has supported me throughout my thesis with his patience and knowledge whilst allowing me the room to work in my own way. His sound suggestions and good teaching have been invaluable. I would also like to thank everybody who was important to the successful realization of this thesis, and I apologize that I could not mention each of them personally.

Sun Yang

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Objective
1.3 Contribution
1.4 Overview
2 Related Works
2.1 Data Model
2.1.1 Relational Model
2.1.2 NoSQL
2.1.2.1 Network Model
2.1.2.2 Object-oriented Model
2.1.2.3 Semi-structured Data Model
2.1.2.4 Graph Model
2.2 Resource Description Framework
2.3 Graph database
2.3.1 Graph Query Language
2.3.2 Graph Query Processing
3 Data Model and Operators
3.1 Graph Model
3.1.1 Notations
3.1.2 Model Definition
3.1.2.1 Node Definition
3.1.2.2 Edge Definition
3.1.2.3 Graph Definition
3.1.3 Constraints
3.2 Operators of SNG-Algebra
3.2.1 Operator Composition
4 SNGQL
4.1 Data Definition
4.2 Data Manipulation
4.3 Data Retrieval
4.3.1 Requirements
4.3.2 The Basic Form of A SNGQL Query
4.3.3 SNGQL Query Examples
5 Query Processing
5.1 Query Translation
5.2 Pattern Matching
5.2.1 Problem Definition
5.2.2 Normal Form for Pattern
5.2.3 Graph Indexes
5.2.4 Case Study
6 Conclusion
6.1 Future Work
Bibliography

List of Figures

3.1 Supertype-Subtype Tree
3.2 A small subset of an online social networking data graph
3.3 Example of Pattern Mapping
3.4 Merge-On Two Patterns
3.5 Merge-On Two Sets of components
3.6 The signature of the neighborhood operator
3.7 Example for Concatenation
3.8 Example for Composition
4.1 Syntax for Data
Definition Language
4.2 Statements for defining sample database
4.3 Example: insertion of a user node
4.4 Basic Query Form
4.5 Schema Graph
5.1 Pattern Normal Form
5.2 Query Pattern from Example 4.4

List of Tables

3.1 Notations Used Throughout
5.1 Summarization of Operators

Chapter 1 Introduction

The popularity of online social networks has reached an unprecedented height. Online Social Network Sites (SNSs) have become a fast-growing business on the Internet. Recently we have witnessed the dramatic growth of a number of such web services, including Flickr (http://www.flickr.com), Del.icio.us (http://delicious.com), MySpace (http://www.myspace.com) and Facebook (http://www.facebook.com). Through these sites hundreds of millions of users create their online profiles and share personal information with their friends. They publish data items called User Generated Content (UGC), such as bookmarks, pictures, videos and blog posts. For instance, major movie studios can place trailers for their new movies on YouTube (http://www.youtube.com); US presidential candidates run online political campaigns on Facebook; and individuals upload songs, pictures, and blogs to their MySpace pages, all hoping to reach millions of online users. Indeed, the video sharing website ...

We define the size of a pattern P as an ordered pair, size(P) = ⟨|N(P)|, |E(P)|⟩, whose two elements are the number of nodes and the number of edges in P, respectively. For two graph query patterns P1 and P2: if |N(P1)| < |N(P2)|, then size(P1) < size(P2); if |N(P1)| = |N(P2)| and |E(P1)| < |E(P2)|, then size(P1) < size(P2). The patterns we discuss here are at least of size ⟨2,1⟩. Matching a pattern with a single node is trivial and thus not discussed. When there is no match for the query, the answer set is empty.

Generally, assuming a graph pattern P has N nodes, and the size of the original graph database D
(number of nodes contained) is M, an exhaustive search must try, for each node n ∈ N(P), every possible one-to-one correspondence to a node u ∈ V(D), so the worst-case time complexity of the exhaustive search algorithm can be O(M^N) [48, 11].

Specifically, the classification mechanism (Figure 3.1) of our data model already induces a partition on the nodes of the graph database. This category system forms the (ideal) basis for a structural index that supports query processing in the language. In this case, for a graph pattern P with N nodes, assume the nodes of the database are classified into n subtypes containing m_1, m_2, ..., m_n nodes respectively. The size of the exhaustive search space is then reduced to ∏_{i=1}^{N} m_i, where the factor m_i for the i-th node of P is the total number of database nodes with the same subtype as that node. The complexity is reduced somewhat, but the matching process is still rather time-consuming, so a novel index mechanism is needed.

5.2.2 Normal Form for Pattern

Every pattern query is rewritten to a normal form for processing. In SNGQL, more complex graph patterns can be formed by combining smaller patterns in various ways. As shown in Figure 5.1, a query pattern P can be represented as the union of a certain number (n ≥ 1) of sub-patterns, each of which comes from the composition of several paths. A path can be trivial (a single edge) or obtained by concatenating a certain number of trivial paths. One pattern may have multiple normal forms (i.e., the normal form of a query pattern may not be unique). This does not affect pattern matching.

Figure 5.1: Pattern Normal Form

Proof: We prove this theorem by induction on the size (number of nodes and edges) of the pattern, denoted by ⟨|N(P)|, |E(P)|⟩. If the size of the pattern P is ⟨2,1⟩, i.e., the pattern is a trivial path, the claim obviously holds. Suppose the theorem holds for size(P) = ⟨n, e⟩. When the size of the pattern increases, there are two basic cases:

(I) |N(P)| = n+1, |E(P)| = e+i, that is, adding
one more node and i (i ≥ 1) edges to the query pattern. There are two and only two ways for this node to be connected with the existing pattern (|N(P)| = n): 1) it is concatenated to at least one of the paths within the existing pattern; 2) it cannot be concatenated to existing paths, but can be composed with at least one of them. Either of these two cases can easily be represented by the normal form above, so the claim holds.

(II) The number of nodes remains the same, i.e., |N(P)| = n, and one edge is added between two existing nodes, i.e., |E(P)| = e+1. Clearly, this added trivial path can be connected with existing edges in two different ways: 1) concatenation; 2) composition. In both situations we can use the normal form to represent it, so the claim holds.

5.2.3 Graph Indexes

In a traditional relational DBMS, an index is created on an attribute in order to locate tuples with particular attribute values quickly. In our DBMS, such a value index alone is not sufficient, since the topological structure is as important as the attribute values. Thus, we design indexes on graph structure for our system, i.e., a topological index, or GPattern, which incorporates pointers to the nodes in the graphs that match certain basic patterns. Our primary goal is to develop a clear understanding of the design space of a structural index that reflects the native structure of graphs and the expressivity of graph query languages. In this section, we mainly focus on a novel method of indexing graph databases, called GPattern, in order to accelerate pattern matching queries.

Recording paths of all lengths is generally infeasible: such path indexes are likely to degrade performance, since the set of paths in a large graph database is usually huge. We notice that, even for a huge data graph in the database, the number of different patterns for paths of a given length is usually small. GPattern comprises two major forms of patterns: 1) the pattern of size ⟨2,1⟩
(two nodes and one edge); 2) the pattern of the path(s) of length ... For each of these two kinds of patterns, GPattern maintains a set of starting nodes of the corresponding mappings in the graph. These two structures are ideal candidates for indexing since they cover the two simplest kinds of path patterns, which can be used as basic structures to construct patterns of any size according to the normal forms. As basic graph indexing units, (i) they avoid an exhaustive enumeration procedure as in [46] and [19]; (ii) they are very space-efficient, which is very important for scalability in large social networks; (iii) they preserve local pattern information for the nodes, which is especially useful for pruning the search space before costly pattern matching.

Given a social networking data graph G and a query pattern P, the general procedure for graph pattern matching using our indexes is as follows:
1) Transform the query pattern P into normal form.
2) According to the distributive law (Proposition 3), with the help of the index, get the matching set from the data graph for each path of the query pattern P.
3) Compute the final result set through the composition or union operations on the filtered sets.

5.2.4 Case Study

Consider the query in Example 4.4. We get a relatively complicated query pattern (see Figure 5.2) from the in statement. The pattern of the query contains four path expressions:
path 1: user A ⟨create⟩ photo A
path 2: user A ⟨create⟩ comment B ⟨attached to⟩ photo B
path 3: user B ⟨create⟩ photo B
path 4: user B ⟨create⟩ comment A ⟨attached to⟩ photo A

Figure 5.2: Query Pattern from Example 4.4

Path 1 and path 3 are trivial paths (i.e., edges). Path 2 is obtained by concatenating the edge (user A ⟨create⟩ comment B) and the edge (comment B ⟨attached to⟩ photo B). Similarly, path 4 is obtained by concatenating two edges. The whole pattern equals "path 1 + path 2 + path 3 + path 4", "path ... + path ... + path ... +
path 3", or "path ... + path ... + path ... + path 4", and so on. According to Proposition 1, the order of composition does not affect the processing of the pattern match. Then, according to Proposition 3,

Λ(path 1 + path 2 + path 3 + path 4) = Λ(path 1) + Λ(path 2) + Λ(path 3) + Λ(path 4).

Thus, with the help of GPattern, we get the matching sets for path 1, path 2, path 3 and path 4 respectively. Finally, we obtain the result set of matchings of the query pattern from the union of the four filtered matching sets for the paths.

Chapter 6 Conclusion

It is hard to deny the booming popularity of social networking sites, which facilitate a high level of user personalization and user inter-communication. On the other hand, the problem of database representation is complicated by the fact that no single model of reality may be appropriate for all users and problem domains. Specifically, classic data models have been criticized for their lack of semantics, the flatness of the permitted data structures, the difficulty users have in capturing data connectivity, and the difficulty of modeling complex objects. Meanwhile, the limited expressive power of current query languages has motivated the search for models that allow better representation of more complex applications. Thus, considering the nature of online social networking data, we are required to rethink some aspects of traditional solutions.

Graphs have high expressive power to model complicated structures. Formalizing graphs as a graph-theoretic data model defines a useful database. In this thesis, we describe a natural and easy-to-use graph model to represent and manipulate online social network related data, which is the first step towards developing the new database management system. We think of the online social network sites as one whole graph, where the nodes are three types of data elements (Actors, Objects and Concepts) and the edges correspond to the binary relationships between the nodes. We first demonstrate that the said data model is a
natural candidate for the management of online social network data. We also describe in detail the collection of operators that are available to operate on the graph, and investigate the power of the query language SNGQL. The rapid increase in the size and complexity of graph-structured data, as well as the NP-complete nature of subgraph isomorphism, has raised the need for efficient processing of graph database queries. So we introduce SNG-Algebra and the structural index GPattern in the last chapter. All of these are in the application domain of online social network services.

To conclude, we feel that an explicit graph model for online social networks is very desirable for several reasons:

Graphs are a flexible and natural way of representing online social networking data. Graphs closely model the interconnected nature of this data, balancing the relative importance of data units and interrelationships. A graph model provides a database system interface with explicit semantics, which allows users to model more directly the way they think about the problems to solve.

Graphs have a solid foundation in mathematics and computer science. Graph theory is supported by a tremendous amount of formal analysis and study. The system can use efficient graph algorithms designed to utilize the special graph data structures. A graph-theoretic data model supports the operators needed for social network databases, including algorithms for answering complex queries.

Queries can refer directly to the graph structure. For example, paths can be defined (they are not present in any of the other models), and they are the interesting entities in most networks. The query language provides an intuitive, flexible graph-based formalism even for non-expert database users.

Our investigation of the new graph data model and graph databases suggests that graphs are a good representation for online social network services and that graph databases are sufficient to help meet the current and near future needs for the
social web.

6.1 Future Work

Our work provides a generic foundation of semantic data management for online social networking services. It should be pointed out that the study of the graph data model and the SNGQL language for online social networking services, and of their implications, is a continuing effort, and that details of the design should be considered preliminary. We believe that there is a variety of interesting work to be done in improving the model and in developing an intelligent query planner that has knowledge of graph operators, so that it can tune performance for different operational characteristics. For example, SNG-Algebra and SNGQL currently do not contain a transitive closure operator (e.g., to test whether there is a path between two nodes). We will include this kind of operator in the next step.

Additionally, we choose to implement the whole database management system from scratch, rather than building an extension to an existing DBMS to handle online social network data. Building our own complete DBMS allows us full control over all components of the system.

Bibliography

[1] A graph query language and its query processing. In Proceedings of the 15th International Conference on Data Engineering, ICDE ’99, pages 572–, Washington, DC, USA, 1999. IEEE Computer Society.
[2] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: from relations to semistructured data and XML. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000.
[3] A. V. Aho and J. D. Ullman. The universality of data retrieval languages. In POPL, pages 110–120, 1979.
[4] R. Angles and C. Gutiérrez. Survey of graph database models. ACM Comput. Surv., 40(1), 2008.
[5] C. Beeri and Y. Kornatzky. A logical query language for hypertext systems. In ECHT, pages 67–80, 1990.
[6] C. Berge. Graphs and Hypergraphs. Elsevier Science Ltd, 1985.
[7] E. Bertino and B. C. Ooi. The indispensability of dispensable indexes. IEEE Transactions on Knowledge and Data Engineering, 11:17–27, 1999.
[8] J. Biskup, U. Räsch, and H. Stiefeling. An extension
of SQL for querying graph relations. Comput. Lang., 15(1):65–82, 1990.
[9] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In SIGMOD Conference, pages 310–321, 2002.
[10] P. Buneman. Semistructured data. In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, PODS ’97, pages 117–121, New York, NY, USA, 1997. ACM.
[11] H. Bunke, T. Glauser, and T.-H. Tran. An efficient implementation of graph grammars based on the rete matching algorithm. In Proceedings of the 4th International Workshop on Graph-Grammars and Their Application to Computer Science, pages 174–189, London, UK, 1991. Springer-Verlag.
[12] L. Chen, A. Gupta, and M. E. Kurul. Stack-based algorithms for pattern matching on DAGs. In Proceedings of the 31st international conference on Very large data bases, VLDB ’05, pages 493–504. VLDB Endowment, 2005.
[13] J. Cheng, Y. Ke, W. Ng, and A. Lu. Fg-index: towards verification-free query processing on graph databases. In SIGMOD Conference, pages 857–872, 2007.
[14] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast graph pattern matching. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 913–922, Washington, DC, USA, 2008. IEEE Computer Society.
[15] E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970.
[16] M. P. Consens and A. O. Mendelzon. Expressing structural hypertext queries in graphlog. In Hypertext, pages 269–292, 1989.
[17] C. Faloutsos, K. S. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 118–127, New York, NY, USA, 2004. ACM.
[18] M. R. Garey and D. S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.
[19] R. Giugno and D. Shasha. Graphgrep: A fast and universal method for querying graphs. In ICPR (2), pages 112–115, 2002.
[20] R. H. Güting.
Graphdb: Modeling and querying graphs in databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 297–308, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[21] M. Gyssens, J. Paredaens, and D. V. Gucht. A graph-oriented object database model. In PODS, pages 417–424, 1990.
[22] H. He and A. K. Singh. Closure-tree: An index structure for graph queries. In ICDE, page 38, 2006.
[23] H. He and A. K. Singh. Graphs-at-a-time: query language and access methods for graph databases. In SIGMOD Conference, pages 405–418, 2008.
[24] L. B. Holder, D. J. Cook, and S. Djoko. Substructure discovery in the subdue system. In Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pages 169–180, 1994.
[25] H. Jiang, H. Wang, P. S. Yu, and S. Zhou. Gstring: A novel approach for efficient search in graph databases. In ICDE, pages 566–575, 2007.
[26] J. J. Jung and J. Euzenat. Towards semantic social networks. In ESWC, pages 267–280, 2007.
[27] W. Kim, J. Banerjee, H.-T. Chou, and J. F. Garza. Object-oriented database support for CAD. Computer-Aided Design, 22(8):469–479, 1990.
[28] Y. Koren, S. C. North, and C. Volinsky. Measuring and extracting proximity in networks. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 245–255, New York, NY, USA, 2006. ACM.
[29] S. Leinhardt. Social networks: a developing paradigm. 1977.
[30] G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou. Ease: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD Conference, pages 903–914, 2008.
[31] M. S. Martín and C. Gutierrez. Representing, querying and transforming social networks with RDF/SPARQL. In ESWC, pages 293–307, 2009.
[32] P. Mika. Social Networks and the Semantic Web, volume of Semantic Web And Beyond Computing for Human Experience. Springer, 2007.
[33] S. Mitra, A. Bagchi, and A. K. Bandyopadhyay. Design of a data model for social network applications. J. Database Manag., 18(4):51–79, 2007.
[34] S. B. Navathe. Evolution of data modeling for databases. Commun. ACM, 35:112–123, September 1992.
[35] P. P.-S. Chen. The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1:9–36, 1976.
[36] H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow., 1:364–375, August 2008.
[37] D. Suciu. An overview of semistructured data. SIGACT News, 29:28–38, December 1998.
[38] R. W. Taylor and R. L. Frank. Codasyl data-base management systems. ACM Comput. Surv., 8(1):67–103, 1976.
[39] F. W. Tompa. A data model for flexible hypertext database systems. ACM Trans. Inf. Syst., 7(1):85–100, 1989.
[40] H. Tong and C. Faloutsos. Center-piece subgraphs: problem definition and fast solutions. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 404–413, New York, NY, USA, 2006. ACM.
[41] H. Tong, C. Faloutsos, B. Gallagher, and T. Eliassi-Rad. Fast best-effort pattern matching in large attributed graphs. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, pages 737–746, New York, NY, USA, 2007. ACM.
[42] H. Tong, B. Gallagher, C. Faloutsos, and T. Eliassi-Rad. Fast best-effort pattern matching in large attributed graphs. In KDD, pages 737–746, 2007.
[43] C. Vicknair, X. Nan, Y. Chen, and D. Wilkins. A comparison of a graph database and a relational database: a data provenance perspective. Access, page 1, 2010.
[44] D. J. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton, New York, 2003.
[45] D. W. Williams, J. Huan, and W. Wang. Graph database indexing using structured graph decomposition. In ICDE, 2007.
[46] X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent structure-based approach. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, SIGMOD ’04, pages 335–346, New York, NY, USA, 2004. ACM.
[47] L. Zou, L. Chen, and M. T. Özsu. Distancejoin:
Pattern match query in a large graph database. PVLDB, 2(1):886–897, 2009.
[48] A. Zündorf. Graph pattern matching in progres. In Selected papers from the 5th International Workshop on Graph Grammars and Their Application to Computer Science, pages 454–468, London, UK, 1996. Springer-Verlag.

... conventional data and social networking data is that conventional data focuses on entities and attributes, whereas social networking data focuses on entities and their inter-relationships. Therefore, the ...
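
Illustrative sketch (not part of the thesis text). The matching procedure of Section 5.2.3 and the search-space estimate that precedes Section 5.2.2 can be made concrete with a small Python example. Everything in it is an assumption made for illustration: the toy graph, the function names build_edge_index, match_path and search_space, and the choice to store full (source, target) pairs per edge pattern, whereas GPattern as described keeps sets of starting nodes. It only shows how an index over size-⟨2,1⟩ patterns lets a path be matched by concatenating indexed edges, and how typing the nodes shrinks the exhaustive search space from M^N to a product of subtype sizes.

```python
from collections import defaultdict
from math import prod

# Hypothetical toy data graph: node id -> subtype, plus labeled directed edges.
NODE_TYPE = {
    "u1": "user", "u2": "user",
    "p1": "photo", "p2": "photo",
    "c1": "comment", "c2": "comment",
}
EDGES = [
    ("u1", "create", "p1"),
    ("u2", "create", "p2"),
    ("u1", "create", "c1"),
    ("u2", "create", "c2"),
    ("c1", "attached to", "p2"),
    ("c2", "attached to", "p1"),
]


def build_edge_index(node_type, edges):
    """Map each size-<2,1> pattern (src type, label, dst type) to its matches."""
    index = defaultdict(set)
    for src, label, dst in edges:
        index[(node_type[src], label, node_type[dst])].add((src, dst))
    return index


def match_path(index, types, labels):
    """Match a path pattern by concatenating indexed edge matches.

    types  -- node subtypes along the path, e.g. ["user", "comment", "photo"]
    labels -- edge labels between consecutive nodes (one fewer than types)
    """
    # Start from all matches of the first edge pattern.
    partial = [(src, dst) for src, dst in index[(types[0], labels[0], types[1])]]
    # Extend each partial match with edges that start where it currently ends.
    for i in range(1, len(labels)):
        step = index[(types[i], labels[i], types[i + 1])]
        partial = [p + (dst,) for p in partial for src, dst in step if src == p[-1]]
    return sorted(partial)


def search_space(node_type, pattern_types):
    """Return (untyped, typed) exhaustive search-space sizes for a pattern.

    untyped is M**N; typed is the product over pattern nodes of the number
    of database nodes sharing that pattern node's subtype.
    """
    per_type = defaultdict(int)
    for subtype in node_type.values():
        per_type[subtype] += 1
    untyped = len(node_type) ** len(pattern_types)
    typed = prod(per_type[t] for t in pattern_types)
    return untyped, typed


if __name__ == "__main__":
    idx = build_edge_index(NODE_TYPE, EDGES)
    # A trivial path (single edge), like path 1 of the case study.
    print(match_path(idx, ["user", "photo"], ["create"]))
    # A concatenation of two edges, like path 2 of the case study.
    print(match_path(idx, ["user", "comment", "photo"], ["create", "attached to"]))
    print(search_space(NODE_TYPE, ["user", "comment", "photo"]))
```

On this toy graph the three-node pattern user, comment, photo has an untyped search space of 6^3 = 216 candidate assignments, against 2 · 2 · 2 = 8 once the subtype partition is used. Composing the per-path match sets into matches of the whole query pattern (step 3 of the procedure) is left out of the sketch.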