Graph-based Modeling and Query Optimization for Heterogeneous IoT Data

Doctoral Dissertation

Graph-based Modeling and Query Optimization for Heterogeneous IoT Data

Department of Electronics and Computer Engineering
Graduate School, Chonnam National University

Van-Quyet Nguyen

August 2019

Contents

List of Figures
List of Tables
Acknowledgements
Abstract
1 Introduction
    1.1 Motivation
    1.2 Research Problems and Objectives
    1.3 Contributions
    1.4 Dissertation Structure
    1.5 Dissertation Publications
2 Background
    2.1 IoT Data Characteristics
        2.1.1 Heterogeneity
        2.1.2 Highly Connected Data
        2.1.3 Dynamic Changes
        2.1.4 Massive Real-time Data
    2.2 IoT Architecture Stack
        2.2.1 Things Layer
        2.2.2 Communication Layer
        2.2.3 Data Layer
        2.2.4 Application Layer
    2.3 IoT Data Management
        2.3.1 Collecting IoT Data
        2.3.2 Storing IoT Data
        2.3.3 Analyzing IoT Data
    2.4 Graph Querying
        2.4.1 Reachability Queries
        2.4.2 Regular Path Queries
        2.4.3 Shortest Path Queries
    2.5 Comparison of Relational Databases and Graph Databases for Heterogeneous IoT Data
        2.5.1 Comparison of Key Features
        2.5.2 Experimental Evaluation
3 Graph-based Modeling for Heterogeneous IoT Data
    3.1 A Graph-based View on IoT Data
        3.1.1 Things Graph
        3.1.2 Spatial Graph
        3.1.3 Social Graph
    3.2 Definition of Graph Models
    3.3 IoT Graph Data Modeling
        3.3.1 Definition of Graph Data Modeling
        3.3.2 Graph-based Modeling for IoT Data
    3.4 Analysis of Heterogeneous IoT Graph Data
    3.5 Enriching Graph Models with Location Trustiness
        3.5.1 Calculating Trustiness of Location based on Sensor Data
        3.5.2 Weighted Graph Models using Trustiness of Location: A Case Study on Resilient Network Provisioning
4 An Optimization Technique for Regular Path Queries on Large Graphs
    4.1 Introduction
    4.2 Related Work
    4.3 Preliminaries
        4.3.1 Definition and Categorization of Regular Path Queries
        4.3.2 Evaluating RPQs
    4.4 Unit-Subquery Cost Matrix (USCM)
    4.5 Estimating the Searching Cost of RPQs with USCM
        4.5.1 Estimating the Searching Cost of a Concatenation RPQ
        4.5.2 Estimating the Searching Cost of an Alternation RPQ
        4.5.3 Estimating the Searching Cost of a Kleene Star RPQ
        4.5.4 Estimating the Searching Cost of a Highly Complex RPQ
    4.6 Estimating Result Size of RPQs with USCM
        4.6.1 Estimating Result Size of a Concatenation RPQ
        4.6.2 Estimating Result Size of an Alternation RPQ
        4.6.3 Estimating Result Size of a Kleene Star RPQ
        4.6.4 Estimating Result Size of a Highly Complex RPQ
    4.7 Efficient Parallel Evaluation of RPQs using Estimated Cost
        4.7.1 Estimating Parallel Evaluation Cost
        4.7.2 Parallel Evaluation of RPQs based on Minimum Estimated Evaluation Cost
    4.8 Experimental Evaluation
        4.8.1 Evaluation Settings
        4.8.2 Experimental Results
    4.9 Summary
5 A Scalable Approach for Shortest Path Queries on Large Dynamic Graphs
    5.1 Introduction
    5.2 Related Work
    5.3 Emergency Evacuation System for Large Smart Buildings
        5.3.1 Overview of System Architecture
        5.3.2 Smart Indicators
        5.3.3 Smart Guidance Agents
        5.3.4 Global Coordinator
    5.4 LCDT-based Weighted Graph Model for Providing Situation Awareness
    5.5 A Distributed Approach for Shortest Path Queries in Evacuation Routing
    5.6 Caching Strategy for Dynamic Evacuation Routes
        5.6.1 Observation that Motivates Caching
        5.6.2 A Caching Strategy
        5.6.3 Updating Evacuation Routes using Caches
    5.7 Evaluation
    5.8 Summary
6 Conclusion
    6.1 Summary of the Dissertation
    6.2 Future Work
        6.2.1 Predicting Congestion in Large Smart Buildings for Emergency Evacuation
        6.2.2 A Framework for Multiple-query Optimization of RPQs
References
Abstract in Korean
Appendices

List of Figures

2.1 A general IoT architecture stack
2.2 Performance comparison between relational database and graph database on Sakila dataset
2.3 Performance comparison between relational database and graph database on Gnutella dataset
3.1 A conceptual view of IoT data
3.2 Weighted Graph
3.3 Node-Labeled Graph
3.4 Edge-Labeled Graph
3.5 Property Graph
3.6 The format of nodes and edges in the property graph
3.7 An example illustrates the relationship between temperature and trust value
3.8 A model of sensors distribution on the geo-mapping matrix
3.9 Ordinary setting of primary/backup paths of a flow
3.10 Scenario of large scale disaster
4.1 Example of query automaton with different types of RPQ
4.2 Example of a directed graph representing a social shopping network
4.3 A tree representing all possible paths satisfying an RPQ with a Kleene Star operator at
4.4 A tree representing possible paths satisfying a complex RPQ
4.5 Response time comparison of parallel RPQs evaluation on different graphs
4.6 Comparing the evaluation cost between USCM-based and AUT, TRL approaches
4.7 Response time comparison for parallel RPQs evaluation with varied graph sizes
4.8 Comparison of response time with varied number of subqueries
4.9 Accuracy evaluation with different parameters of graph
4.10 Accuracy evaluation with different parameters of query
5.1 Overview of system architecture
5.2 Comparison of the effectiveness among evacuation methods with scenario disaster event happened at normal regions in Donald Bren Hall
5.3 Comparison of the effectiveness among evacuation methods with scenario disaster event happened at critical regions in Donald Bren Hall
5.4 Comparison of the effectiveness among evacuation methods with scenario disaster event happened at critical regions in Ten-Story Building
5.5 Comparison of the effectiveness among evacuation methods with scenario disaster event happened at normal regions in Ten-Story Building
5.6 The impact of update interval on evacuation system

List of Tables

2.1 Performance of SQL queries on Sakila dataset
2.2 Performance of Cypher queries on Sakila dataset
2.3 Performance of SQL queries on Gnutella dataset
2.4 Performance of Cypher queries on Gnutella dataset
3.1 Node Types Description
3.2 Edge Types Description
3.3 Analysis of IoT Graph Characteristics
4.1 Example of USCM for a graph of social shopping network

Abstract in Korean (English translation)

Graph-based Modeling and Query Optimization for Heterogeneous IoT Data
Van-Quyet Nguyen
Department of Electronics and Computer Engineering, Graduate School, Chonnam National University
(Advisor: Kyungbaek Kim (김경백))

In the Internet of Things (IoT) environment, entities with different attributes and characteristics will cooperate with a high degree of connectivity. Specifically, not only mechanical and electronic devices but also other entities such as people, locations, and applications will be connected to one another. For enterprises that want to seize the opportunities of new IoT services, understanding and managing these connections among IoT entities plays an important role. The existing approach to storing and querying IoT data uses an RDBMS (Relational Database Management System) such as MySQL or MSSQL. However, an RDBMS is neither flexible nor sufficient for handling highly connected heterogeneous IoT data, because such data has very complex relationships that require nested queries and complex join queries. Fortunately, graph databases have recently been developed for storing and analyzing highly connected data.

This dissertation studies graph-based modeling and query optimization for heterogeneous IoT data. First, we studied how heterogeneous IoT data should be represented in a graph model. The proposed graph model combines the social graph, the spatial graph, and the things graph, and integrates the relationships among them into a single graph. We analyzed how the graph characteristics change for heterogeneous IoT data generated by combining different types of graph data. We also propose an estimation model for enriching the graph model based on the IoT data itself. In an IoT system, using the node and edge properties obtained from raw data does not always solve the actual problem at hand. For example, many fire sensors are deployed in the corridors of a smart building, but when a fire occurs, an important question is how to tell from the sensor data which corridors have a high danger intensity. In this dissertation, we propose a model for estimating a higher-level property (namely, the trustiness of location) to enrich the graph model.

Second, obtaining knowledge from heterogeneous IoT data requires solving the key problem of querying large graphs. Regular path queries (RPQs) are a common way to explore patterns in graph databases. Existing automata-based approaches to evaluating path queries on large graphs incur very high costs for large graph sizes and/or highly complex path queries. Recently, a threshold rare label based approach has been shown to be effective on large graphs. However, the rare label based approach cannot always guarantee the minimum evaluation cost of a path query when applied to parallel computing, so there is still room for improvement. In this dissertation, we propose a cost-based optimization technique that minimizes the searching and joining costs in the parallel processing of path queries on large graphs.

Finally, we studied a scalable technique for processing shortest path queries on large graphs built from massive data in a dynamically changing environment. To this end, we considered evacuation route generation in large smart buildings as a case study. Here, the network of smart indicators is regarded as a graph. Each edge in the graph is assigned a weight using disaster conditions (e.g., danger intensity) and building conditions (e.g., corridor capacity). A simple approach to finding evacuation routes in a large smart building is distributed shortest path querying: a local process generates the evacuation routes for each floor of the building. However, this approach gives insufficient consideration to global-view information across the whole building, such as the danger intensity of each location and crowd congestion. For example, evacuees who follow an evacuation route down from a higher floor may run into dangerous areas or crowd congestion when they move to lower floors. Another limitation of existing approaches is that obtaining information on disaster damage and crowd congestion takes a long time, so effective evacuation routes may be updated late. Accordingly, we propose a scalable route generation method for creating and updating emergency evacuation routes that takes into account the current disaster situation and building conditions, which change dynamically during the evacuation.

Appendix A
Queries on Sakila dataset

A.1 SQL Queries

Q1
SELECT * FROM rental WHERE staff_id = '1';

Q2
SELECT * FROM payment WHERE staff_id = '2';

Q3
SELECT * FROM payment WHERE DATE(payment_date) = '2005-08-23';

Q4
SELECT * FROM rental WHERE staff_id >= '1' AND staff_id = '1' AND staff_id = '2005-08-23' AND DATE(payment_date) = '2005-08-23';

Q10
SELECT a.actor_id, a.first_name, a.last_name, c.name, COUNT(fc.category_id)
FROM film f, film_category fc, category c, film_actor fa, actor a
WHERE f.film_id = fc.film_id
  AND fc.category_id = c.category_id
  AND f.film_id = fa.film_id
  AND a.actor_id = fa.actor_id
GROUP BY a.actor_id, a.first_name, a.last_name, c.name

Q11
SELECT f.film_id, f.title, COUNT(r.rental_id)
FROM film f, film_category fc, category c, store s, inventory i, rental r, payment p
WHERE f.film_id = fc.film_id
  AND fc.category_id = c.category_id
  AND f.film_id = i.film_id
  AND i.store_id = s.store_id
  AND i.inventory_id = r.inventory_id
  AND p.rental_id = r.rental_id
GROUP BY f.film_id, f.title

Q12
SELECT f.film_id, f.title, c.name, SUM(p.amount) as amount
FROM film f, film_category fc, category c, store s, inventory i, rental r, payment p
WHERE f.film_id = fc.film_id
  AND fc.category_id = c.category_id
  AND f.film_id = i.film_id
  AND i.store_id = s.store_id
  AND i.inventory_id = r.inventory_id
  AND p.rental_id = r.rental_id
GROUP BY f.film_id, f.title, c.name
HAVING amount > 200

A.2 Cypher Queries

Q1
MATCH (r:rental)-[:ApprovedBy]->(s:staff {staff_id: '1'})
RETURN r

Q2
MATCH (r:payment)-[:PayFor]->(s:staff {staff_id: '2'})
RETURN r

Q3
MATCH (p:payment)
WHERE date(left(p.payment_date,10)) = date('2005-08-23')
RETURN p

Q4
MATCH (r:rental)-[:ApprovedBy]->(s:staff)
WHERE s.staff_id >= '1' AND s.staff_id (s:staff)
WHERE s.staff_id >= '1' AND s.staff_id date('2005-08-23') AND date(left(p.payment_date,10)) < date('2009-08-23')
RETURN p

Q7
MATCH (f:film)
OPTIONAL MATCH (f) (c:category)
OPTIONAL MATCH (a:actor) (f)
RETURN f.title, c.name, a.first_name

Q8
MATCH (f:film)
OPTIONAL MATCH (f) (c:category)
OPTIONAL MATCH (a:actor) (f)
OPTIONAL MATCH (i:inventory) (f)
OPTIONAL MATCH (s:store)-[:Has]->(i)
OPTIONAL MATCH (r:rental)-[:Rent]->(i)
OPTIONAL MATCH (p:payment)-[:PayTo]->(r)
RETURN f.title, c.name, i.inventory_id, s.store_id, r.rental_id, p.payment_id

Q9
MATCH (f:film)
OPTIONAL MATCH (f) (c:category)
OPTIONAL MATCH (a:actor) (f)
OPTIONAL MATCH (i:inventory) (f)
OPTIONAL MATCH (s:store)-[:Has]->(i)
OPTIONAL MATCH (r:rental)-[:Rent]->(i)
OPTIONAL MATCH (p:payment)-[:PayTo]->(r)
OPTIONAL MATCH (s)-[:LocatedAt]->(ad:address)
OPTIONAL MATCH (ad) (ct:city)
OPTIONAL MATCH (ct) (ctr:country)
WHERE date(left(p.payment_date,10)) = date('2005-08-23')
RETURN f.title, c.name, i.inventory_id, s.store_id, ad.address, ct.city, ctr.country, r.rental_id, p.payment_id

Q10
MATCH (f:film)
OPTIONAL MATCH (f) (c:category)
OPTIONAL MATCH (a:actor) (f)
RETURN a.actor_id, a.first_name, c.name, count(*)

Q11
MATCH (f:film)
OPTIONAL MATCH (f)-[:BelongTo]->(i:inventory)
OPTIONAL MATCH (r:rental)-[cc:Rent]->(i)
RETURN f.film_id, f.title, count(cc)

Q12
MATCH (f:film)
OPTIONAL MATCH (f) (c:category)
OPTIONAL MATCH (f)-[:BelongTo]->(i:inventory)
OPTIONAL MATCH (r:rental)-[cc:Rent]->(i)
OPTIONAL MATCH (p:payment)-[:PayTo]->(r)
WITH f, c, sum(TOFLOAT(p.amount)) as total
WHERE total > 200
RETURN f.film_id, f.title, c.name, total

Appendix B
Queries on Gnutella dataset

B.1 SQL Queries

Q1
SELECT * FROM Links6 WHERE FromNodeId = '100';

Q2
SELECT * FROM Links6 WHERE ToNodeId = '900';

Q3
SELECT * FROM Links6 WHERE FromNodeId = '100' AND ToNodeId = '15605';

Q4
SELECT * FROM Links WHERE FromNodeId >= '100' AND FromNodeId = '200' AND FromNodeId = '200' AND ToNodeId 10

B.2 Cypher Queries

Q1
MATCH link=((h1:host {host_id: '100'})-[:ConnectTo]->(h2:host))
RETURN link

Q2
MATCH link=((h1:host)-[:ConnectTo]->(h2:host {host_id: '900'}))
RETURN link

Q3
MATCH link=((h1:host {host_id: '100'})-[:ConnectTo]->(h2:host {host_id: '15605'}))
RETURN link

Q4
MATCH link=(h1:host)-[:ConnectTo]->(h2:host)
WHERE ToInteger(h1.host_id) >= 100 AND ToInteger(h1.host_id) (h2:host)
WHERE ToInteger(h1.host_id) >= 200 AND ToInteger(h1.host_id) (h2:host)
WHERE ToInteger(h1.host_id) >= 200 AND ToInteger(h2.host_id) (h2:host)
RETURN h1.host_id, h2.host_id

Q8
MATCH link=(h1:host)-[*1..3]->(h2:host)
RETURN h1.host_id, h2.host_id

Q9
MATCH link=(h1:host)-[*1..4]->(h2:host)
RETURN h1.host_id, h2.host_id

Q10
MATCH link=(h1:host)-[:ConnectTo]->(h2:host)
RETURN h1.host_id, COUNT(link)

Q11
MATCH link=(h1:host)-[:ConnectTo]->(h2:host)
RETURN h2.host_id, COUNT(link)

Q12
MATCH link=(h1:host)-[:ConnectTo]->(h2:host)
WITH COUNT(link) as c, h2
WHERE c > 10
RETURN h2.host_id, c
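
As a rough illustration of what the B.2 Cypher queries compute, the Python sketch below mirrors Q10 to Q12 (link counts per host) and the variable-length patterns of Q8 and Q9 on a tiny in-memory edge list. The edge data and all function names are hypothetical and do not come from the dissertation. The path enumeration assumes Cypher's usual rule that a relationship is not traversed twice within one matched path. It is a sketch of the query semantics only, not of how a graph database evaluates them.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each pair is one directed ConnectTo link
# (source host_id, target host_id). Not taken from the Gnutella dataset.
EDGES = [("100", "200"), ("100", "300"), ("200", "300"), ("300", "900"), ("200", "900")]


def out_degree(edges):
    """Q10-style count: outgoing links per source host."""
    return Counter(src for src, _ in edges)


def in_degree(edges):
    """Q11-style count: incoming links per target host."""
    return Counter(dst for _, dst in edges)


def popular_targets(edges, threshold):
    """Q12-style filter: target hosts with more than `threshold` incoming links."""
    return {host: n for host, n in in_degree(edges).items() if n > threshold}


def bounded_paths(edges, min_hops, max_hops):
    """Q8/Q9-style match: one (start, end) pair per path of min_hops..max_hops links.

    Assumption: a link may appear at most once in a path, mirroring Cypher's
    relationship-uniqueness rule for a single pattern.
    """
    adjacency = defaultdict(list)
    for index, (src, dst) in enumerate(edges):
        adjacency[src].append((index, dst))

    results = []

    def walk(start, node, used):
        # `used` holds the indices of the links already on the current path.
        if len(used) >= min_hops:
            results.append((start, node))
        if len(used) == max_hops:
            return
        for index, nxt in adjacency[node]:
            if index not in used:
                walk(start, nxt, used | {index})

    hosts = sorted({src for src, _ in edges} | {dst for _, dst in edges})
    for start in hosts:
        walk(start, start, frozenset())
    return results


if __name__ == "__main__":
    print(out_degree(EDGES))
    print(popular_targets(EDGES, 1))
    print(len(bounded_paths(EDGES, 1, 3)))
```

On this toy data, widening the hop range from 1..2 to 1..3 adds only one more path; on a large, densely connected graph the number of matching paths can grow far faster as the upper bound increases.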
