Motif discovery in time series using the SCRIMP++ algorithm



TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES AND TABLES
INTRODUCTION
    Problem statement
    Research objectives
    Research subject and scope
    Approach and research methods
    Practical significance of the thesis
CHAPTER 1: OVERVIEW OF TIME SERIES DATA
    1.1 Overview
    1.2 Motif discovery in time series data
    1.3 Thesis structure
CHAPTER 2: BACKGROUND
    2.1 Basic concepts
        2.1.1 Time series
        2.1.2 Sliding window
        2.1.3 Subsequence
        2.1.4 Pattern matching
            2.1.4.1 Trivial match
            2.1.4.2 Non-trivial match
        2.1.5 Time series database
        2.1.6 Definitions of motif
    2.2 Distance measures
        2.2.1 Minkowski distance
        2.2.2 Dynamic time warping
        2.2.3 Matrix profile (MP)
    2.3 Representative motif discovery methods
CHAPTER 3: MOTIF DISCOVERY BASED ON THE SCRIMP++ ALGORITHM
    3.1 Introduction
    3.2 The SCRIMP algorithm
    3.3 The preSCRIMP algorithm
    3.4 The improved SCRIMP++ algorithm
CHAPTER 4: EXPERIMENTAL EVALUATION
    4.1 Experimental environment and data
    4.2 Evaluation criteria
    4.3 Experimental cases
        4.3.1 Random Walk data
            4.3.1.1 Fixed time series length, varying subsequence length (time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points) on the Random Walk dataset
            4.3.1.2 Fixed subsequence length, varying time series length (time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; fixed subsequence lengths: 64, 128, 256 points) on the Random Walk dataset
        4.3.2 Seismology data
            4.3.2.1 Fixed time series length, varying subsequence length (time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points) on the Seismology dataset
            4.3.2.2 Fixed subsequence length, varying time series length (time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; fixed subsequence lengths: 64, 128, 256 points) on the Seismology dataset
        4.3.3 Insect EPG data
            4.3.3.1 Fixed time series length, varying subsequence length (time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points) on the Insect EPG dataset
            4.3.3.2 Fixed subsequence length, varying time series length (time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; fixed subsequence lengths: 64, 128, 256 points) on the Insect EPG dataset
CHAPTER 5: CONCLUSIONS AND RECOMMENDATIONS
    5.1 Results achieved
    5.2 Limitations
    5.3 Future work

LIST OF FIGURES AND TABLES

Figure 1.1: Plot of a time series [1]
Figure 1.2: Example of a motif in a time series [3]
Figure 2.1: Sliding window over time series data [7]
Figure 2.2: Matching of two subsequences C and M cut from the time series T [6]
Figure 2.3: Two subsequences of the time series T that are trivial matches [6]
Figure 2.4: K-motif [8]
Figure 2.5: A time series containing motifs [6]
Figure 2.6: Illustration of two similar time series
Figure 2.7: Dynamic Time Warping distance versus Euclidean distance [13]
Figure 2.8: Relationship between the distance matrix, Matrix distances and the Matrix profile [14]
Figure 2.9: Example of the Matrix profile index of a time series
Figure 2.10: Matrix distances and Matrix profile [14]
Figure 2.11: Example of a matrix profile
Figure 2.12: Example of a matrix profile
Figure 2.13: Trivial matches are excluded from the matrix profile; the DAA region marks the region of trivially matching subsequences
Figure 2.14: The two branches of motif search in time series data
Figure 3.1: The SCRIMP++ algorithm is built on two algorithms, PreSCRIMP and SCRIMP [22]
Table 3.1: Convolution computation [22]
Figure 3.2: Example of convolution computation [22]
Table 3.2: The SCRIMP algorithm [22]
Figure 3.3: Diagonal evaluation in the SCRIMP algorithm [22]
Figure 3.4: Diagonal evaluation in the SCRIMP algorithm [22]
Figure 3.5: The Consecutive Neighborhood Preserving (CNP) property
Figure 3.6: Sampling interval s
Table 3.3: The MASS algorithm
Table 3.4: The preSCRIMP algorithm [22]
Table 3.5: Algorithm for determining the number of iterations
Table 3.6: The improved SCRIMP++ algorithm
Figure 4.1: Program interface
Table 4.1: Annotations for the program interface
Table 4.2: Positions of the subsequences found. Time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points
Table 4.3: Distance and execution time for the motif subsequence pairs. Time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points
Figure 4.2: Results of the three algorithms for finding motifs on RandomWalk data, time series length: 10000 points, subsequence length: 512 points
Figure 4.3: Detailed motif results of the three algorithms SCRIMP, SCRIMP++ and improved SCRIMP++ on RandomWalk data, time series length: 10000 points, subsequence length: 512 points. Top: the motif in the time series; bottom: motif detail
Figure 4.4: Comparison of execution time on the RandomWalk dataset, time series length: 10000 points, subsequence lengths: 64, 128, 256, 512, 1024 points
Figure 4.5: Detailed motif results of the three algorithms SCRIMP, SCRIMP++ and improved SCRIMP++ on RandomWalk data, time series length: 10000 points, subsequence length: 256 points. Top: the motif in the time series; bottom: motif detail
Table 4.4: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 64 points
Table 4.5: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 64 points
Figure 4.6: Comparison of execution time on the RandomWalk dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 64 points
Table 4.6: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 128 points
Table 4.7: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 128 points
Figure 4.7: Comparison of execution time on the RandomWalk dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 128 points
Figure 4.8: Detailed motif results of the three algorithms SCRIMP, SCRIMP++ and improved SCRIMP++ on RandomWalk data, time series length: 20000 points, subsequence length: 128 points. Top: the motif in the time series; bottom: motif detail
Table 4.8: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 256 points
Table 4.9: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 256 points
Figure 4.9: Comparison of execution time on the RandomWalk dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 256 points
Figure 4.10: Detailed motif results of the three algorithms SCRIMP, SCRIMP++ and improved SCRIMP++ on RandomWalk data, time series length: 5000 points, subsequence length: 256 points. Top: the motif in the time series; bottom: motif detail
Table 4.10: Positions of the subsequences found. Time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points
Table 4.11: Distance and execution time for the motif subsequence pairs. Time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points
Figure 4.11: Results of the three algorithms for finding motifs on Seismology data, time series length: 10000 points, subsequence length: 512 points
Figure 4.12: Detailed results of the three algorithms SCRIMP, SCRIMP++ and improved SCRIMP++ for finding motifs on Seismology data, time series length: 10000 points, subsequence length: 512 points. Top: the motif in the time series; bottom: motif detail
Figure 4.13: Comparison of execution time on the Seismology dataset, time series length: 10000 points, subsequence lengths: 64, 128, 256, 512, 1024 points
Table 4.12: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000,
8000, 15000, 20000, 25000, 30000 points; subsequence length: 64 points
Table 4.13: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 64 points
Figure 4.14: Comparison of execution time on the Seismology dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 64 points
Table 4.14: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 128 points
Table 4.15: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 128 points
Figure 4.15: Comparison of execution time on the Seismology dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 128 points
Table 4.16: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 256 points
Table 4.17: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 256 points
Figure 4.16: Comparison of execution time on the Seismology dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 256 points
Table 4.18: Positions of the subsequences found. Time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points
Table 4.19: Distance and execution time for the motif subsequence pairs. Time series length: 10000 points; subsequence lengths: 64, 128, 256, 512, 1024 points
Figure 4.17: Results of the three algorithms for finding motifs on Insect EPG data, time series length: 10000 points, subsequence length: 1024 points
Figure 4.18: Detailed results of the three algorithms SCRIMP, SCRIMP++ and improved SCRIMP++ for finding motifs on Insect EPG data, time series length: 10000 points, subsequence length: 1024
points. Top: the motif in the time series; bottom: motif detail
Figure 4.19: Comparison of execution time on the Insect EPG dataset, time series length: 10000 points, subsequence lengths: 64, 128, 256, 512, 1024 points
Table 4.20: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 64 points
Table 4.21: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 64 points
Figure 4.20: Comparison of execution time on the Insect EPG dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 64 points
Table 4.22: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 128 points
Table 4.23: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 128 points
Figure 4.21: Comparison of execution time on the Insect EPG dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 128 points
Table 4.24: Positions of the subsequences found. Time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 256 points
Table 4.25: Execution time and distance of the subsequence pairs found, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points; subsequence length: 256 points
Figure 4.22: Comparison of execution time on the Insect EPG dataset, time series lengths: 1000, 2000, 3000, 5000, 8000, 15000, 20000, 25000, 30000 points, subsequence length: 256 points

INTRODUCTION

Problem statement

With the current development of science and technology and the Industry 4.0 revolution, the volume of data grows considerably every day. We nevertheless face the situation of being "data rich but information poor": data are abundant, yet exploiting them effectively to extract information remains a problem that researchers are still working to solve. Many approaches have been proposed for mining time series data, for example motif discovery and anomaly detection in large data, whose results can then be applied to forecasting, decision support, and similar problems. Research on algorithms that mine information from large data effectively is therefore an urgent need. One problem of particular interest to researchers is motif discovery in time series data. It is applied in many fields and real-life applications, such as using motifs to forecast stock prices, signature verification, music plagiarism detection, and time series clustering and classification.

Research objectives

The objective of the thesis is to study and implement the SCRIMP++ algorithm for motif discovery in time series data, and to investigate improvements that speed up SCRIMP++.

Research subject and scope

- Research subject: time series data, motifs in time series data, and published research results on motif discovery in time series data.
- Research scope: time series, the motif discovery problem on time series, and the SCRIMP++ algorithm.

Approach and research methods

- Review prior related research.
- Evaluate the effectiveness of the methods.
- Run experiments to verify the results.
- Study the literature, apply theoretical models, and validate them experimentally.

Practical significance of the thesis

This work provides a foundation for further research on motif discovery in time series data mining. It can also serve as reference material for interested researchers.

CHAPTER 5: CONCLUSIONS AND RECOMMENDATIONS

This chapter summarizes the results achieved in this thesis, its contributions, its limitations, and directions for future work.

5.1 Results achieved

During the research, the thesis achieved the following results:
- Studied the SCRIMP and SCRIMP++ algorithms, which were recently introduced for the motif discovery problem.
- Proposed ideas for improving the SCRIMP++ algorithm in order to reduce its execution time.
- Successfully implemented three motif discovery algorithms for time series data: SCRIMP, SCRIMP++, and improved SCRIMP++.

Experiments on three datasets (RandomWalk, Seismology, and Insect EPG), with many combinations of time series length and subsequence length, show that improved SCRIMP++ gives relatively accurate results and runs faster than the other two algorithms. The improved SCRIMP++ algorithm can therefore be applied to motif search in data.

5.2 Limitations

The thesis still has some limitations:
- The motif length must be declared by the user and found by trial runs.
- Experiments were not run on many different datasets.
- In some cases, the positions of the motifs found are not entirely accurate.
- The thesis improves the SCRIMP++ algorithm, but its title does not make this improvement clear.

5.3 Future work

The thesis improves the SCRIMP++ algorithm through two main ideas: limiting the number of iterations and changing the order in which subsequences are processed. However, some issues noted in the limitations above remain to be addressed in the future:
- Suggesting the motif length.
- Experimenting on more, larger datasets.
- Improving the method to find motifs more accurately.

REFERENCES

[1] R. Hyndman, "Time Series Data Library," [Online]. Available: http://www.datamarket.com
[2] E. Keogh and S. Kasetty, "On the need for time series data mining benchmarks: A survey and empirical demonstration," in the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 23-26, 2002.
[3] A. Mueen, E. Keogh, Q. Zhu, S. Cash and B. Westover, "Exact Discovery of Time Series Motifs," SIAM International Conference on Data Mining (SDM09), 2009.
[4] E. Tufte, "The visual display of quantitative information," Graphics Press, Cheshire, Connecticut.
[5] A. Mueen, E. Keogh, Q. Zhu, S. Cash and B. Westover, "Exact Discovery of Time Series Motifs," University of California.
[6] B. Chiu, E. Keogh and S. Lonardi, "Probabilistic Discovery of Time Series Motifs," Proceedings of the 9th International Conference on Knowledge Discovery and Data Mining, pp. 493-498, 2003.
[7] P. Salmon, J. C. Olivier, K. J. Wessels, W. Kleynhans, F. v. d. Bergh and K. C. Steenkamp, "Unsupervised Land Cover Change Detection: Meaningful Sequential Time Series Analysis," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 327-335, 2011.
[8] J. Lin, E. Keogh, S. Lonardi and P. Patel, "Finding Motifs in Time Series," Proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD, 2002.
[9] Deza,
Elena and M. M. Deza, "Encyclopedia of Distances," Springer, p. 94, 2009.
[10] E. Keogh, "Mining Shape and Time Series Databases with Symbolic Representations," in Tutorial of the 13th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2007), pp. 12-15, 2007.
[11] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," 2nd ed., Morgan Kaufmann Publishers, 2006.
[12] N. Beckmann, H. Kriegel, R. Schneider and B. Seeger, "The R*-tree: An efficient and robust access method for points and rectangles," in Proc. of 1990 ACM SIGMOD Conf., Atlantic City.
[13] E. Keogh, A. W. Fu, L. Y. H. Lau and C. A. Ratanamahatana, "Scaling and Time Warping in Time Series Querying," The VLDB Journal, pp. 899-921, 2008.
[14] Y. Zhu, C. M. Yeh, Z. Zimmerman, K. Kamgar and E. Keogh, "Matrix Profile XI: SCRIMP++: Time Series Motif Discovery at Interactive Speeds," in IEEE International Conference on Data Mining (ICDM), 2018.
[15] C. M. Yeh, N. Kavantzas and E. Keogh, "Meaningful Multidimensional Motif Discovery," in ICDM, 2017.
[16] N. C. Castro and P. J. Azevedo, "Significant motifs in time series," Statistical Analysis and Data Mining, pp. 35-53, 2012.
[17] C. D. Truong and D. T. Anh, "An Efficient Method for Discovering Motif in Large Time Series," Proc. of 5th Asian Conference on Intelligent Information and Database Systems (ACIIDS 2013), pp. 135-145, 2013.
[18] B. Liu, L. Jianqiang and C. Cheng, "Efficient Motif Discovery for Large-Scale Time Series in Healthcare," IEEE Transactions on Industrial Informatics, vol. 11, no. 3, pp. 583-590, 2015.
[19] Y. Zhu and Z. Zimmerman, "Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins," IEEE ICDM, 2016.
[20] N. Castro and P. Azevedo, "Multiresolution Motif Discovery in Time Series," in Proceedings of the SIAM International Conference on Data Mining (SDM 2010), pp. 665-676, 2010.
[21] C. M. Yeh, Y. Zhu and H. D. Anh, "Matrix Profile I: All Pairs Similarity Joins for Time Series," IEEE ICDM, 2016.
[22] Y. Zhu, C. M. Yeh, Z.
Zimmerman, K. Kamgar and E. Keogh, "Matrix Profile XI: SCRIMP++: Time Series Motif Discovery at Interactive Speeds," in ICDM, 2018.
[23] K. Pariwatthanasak and C. A. Ratanamahatana, "Time Series Motif Discovery Using" Springer Nature Singapore, 2019.
[24] W. Ball, "Other questions on probability," in Mathematical Recreations and Essays, 1960, p. 45.
[25] UCR, "The UCR Matrix Profile Page," [Online]. Available: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html
[26] Y. Mohammad and T. Nishida, "Exact Discovery of Length-Range Motifs," Intelligent Information and Database Systems, vol. 8398 of the series Lecture Notes in Computer Science, pp. 23-32, 2014.
[27] Y. Gao, J. Lin and H. Rangwala, "Iterative Grammar-Based Framework for Discovering Variable-Length Time Series Motifs," in IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, CA, USA, 2016.
[28] M. Linardi, Y. Zhu, T. Palpanas and E. Keogh, "Matrix Profile X: VALMOD - Scalable Discovery of Variable-Length Motifs in Data Series," in SIGMOD, 2018.
[29] J. Lin, E. Keogh, L. S. Hee and H. V. Herle, "Finding the most unusual time series," in Knowledge and Information Systems, 2006.
[30] D. C. Truong and T. D. Anh, "An efficient method for motif and anomaly detection in time series based on clustering," International Journal of Business Intelligence and Data Mining, vol. 10, pp. 356-377, 2015.
[31] [Online]. Available: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

International conference on AITC 2020

DETECTING SEISMIC FREQUENCY OCCUR BASED ON MOTIF DISCOVERY APPROACH

Tran Thi Dung (1), Nguyen Thanh Son (2)
(1) University of Transport and Communications, Campus in HCMC
(2) University of Technology and Education HCMC

Abstract: Detection and prediction of seismic events are important problems with extensive applications in many areas, such as predicting earthquakes and tsunamis in geography and architecture, and especially in the traffic construction domain. Research into effective, highly accurate methods for seismic prediction and detection is a hot trend both in Vietnam and worldwide. The
survey phase of seismic characteristics before starting a traffic construction project is vital. The subsequent tasks include time series classification, frequent sequence pattern recognition, anomaly detection, and time series prediction. Motif detection in time series data has received significant attention in the data mining community since its inception, mainly because motif discovery is both meaningful and more likely to succeed on big data. In this paper, motif detection is used to predict the most frequent seismic frequency. This method is applied in many fields, particularly in problems with massive data volumes, where it achieves high efficiency. Experimental results show the robustness of our method.

Keywords: Time series, Motif, SCRIMP++, time series data, seismology

INTRODUCTION

Seismology is a profoundly important field of geophysics in general and of earthquake science in particular. High seismic intensity is one of the main causes of damage to traffic constructions and of harm to people. Seismic research plays an important role in transportation construction: it helps in understanding the behavior of various types of structures subjected to earthquake loads, and how the occupants of those structures can be protected in the event of an earthquake. Seismic exploration is therefore a necessary task when constructing transport works. Because visually examining the data recorded from devices is inefficient, it is desirable to predict seismic activity reliably from the data [1]. Traditional seismic measurement methods, such as refraction seismic, reflection seismic, and fluorescence, are inadequate for detecting seismic events that occur frequently. In recent years, machine learning has been used to tackle seismic prediction. Many articles address seismic and building applications, such as earthquake detection [2] using a convolutional neural network and earthquake prediction [3] using the temporal sequence of
historic seismic activities in combination with machine learning classifiers. However, the problem of detecting how often a seismic frequency occurs has not been studied specifically.

Time-series data mining comprises a group of intelligent techniques used to "mine" valuable information and knowledge from time-series datasets. A time series is a collection of observations made chronologically. Time-series data are characterized by large data size, high dimensionality, and the need for continuous updating. Time series motifs are pairs of individual time series, or subsequences of a longer time series, that are very similar to each other. Currently, motif mining problems are not only being actively researched, developed, and deployed, but are also related to other time series problems. Many problems have applied motif discovery approaches, such as understanding customers' habits, finding items with the same sales cycle, detecting copyright infringement, plagiarism detection, and seismic data forecasting. Therefore, this paper investigates how frequent seismic frequencies can be found by applying motif detection to time series.

METHODOLOGY

2.1 Background

Definition 1: Time series. A time series T = (t1, t2, ..., tn) is an ordered set of n real-valued numbers observed over time [4].

Definition 2: Sliding window. Given a time series T of length n, to extract the subsequences of length m we slide a window of length m over T from left to right, one point at a time; each window position identifies one subsequence C [5].

Definition 3: Subsequence. In Definitions 3 and 4 only, following the notation of [4], m denotes the length of the time series and n the length of a subsequence. Given a time series T = (t1, t2, ..., tm), a subsequence of length n of T is a sequence Ti,n = (ti, ti+1, ..., ti+n-1) with 1 ≤ i ≤ m-n+1 [4].

Definition 4: Subsequence motif. The subsequence motif is the pair of non-trivially matching subsequences {Ti,n, Tj,n} of a time series T that are most similar to each other. In other words, {Ti,n, Tj,n} is the subsequence motif if, for all a, b, i, j: Dist(Ti,n, Tj,n) ≤
Dist(Ta,n, Tb,n), |i-j| ≥ w and |a-b| ≥ w, where w > 0 [4]. Note that the parameter w in this definition excludes trivial matches between subsequences [5], and Dist(Ci, Cj) is a meaningful distance measure between two time series.

The motif of a time series database S is the pair of distinct time series {Ti, Tj}, i ≠ j, in S with the smallest distance. That is, for all x, y with x ≠ y and i ≠ j, DISTANCE(Ti, Tj) ≤ DISTANCE(Tx, Ty) [6].

Definition 5: The Matrix distances vector Di corresponding to the subsequence Ti,m and the time series T is the vector of Euclidean distances between Ti,m and every subsequence of length m in T. That is, Di = [di,1, di,2, ..., di,n-m+1], where di,j (1 ≤ j ≤ n-m+1) is the distance between Ti,m and Tj,m [7].

Definition 6: The Matrix profile P of a time series T is the vector of Euclidean distances between each subsequence of T and its nearest neighbor in T, where the nearest neighbor of a subsequence is the subsequence at the smallest distance from it. That is, P = [min(D1), min(D2), ..., min(Dn-m+1)], where Di (1 ≤ i ≤ n-m+1) is the Matrix distances vector corresponding to the subsequence Ti,m and the time series T [7].

Figure 1 shows the relationship between the distance matrix, Matrix distances, and the Matrix profile. Each element di,j of the distance matrix is the distance between Ti,m and Tj,m (1 ≤ i, j ≤ n-m+1) in the time series T.

Figure 1: Relationship between Matrix distances and Matrix profiles [7].
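
Definitions 5 and 6 can be illustrated with a brute-force sketch in Python. This is not the method of [7]: it recomputes every distance directly, uses 0-based indices instead of the 1-based indices of the definitions, and uses plain (not z-normalized) Euclidean distance. The function names and the `exclusion` parameter, which plays the role of w in Definition 4, are assumptions introduced only for this illustration.

```python
import math


def distance_profile(T, i, m):
    """Definition 5 (0-based): distances from T[i:i+m] to every subsequence of length m."""
    query = T[i:i + m]
    return [math.dist(query, T[j:j + m]) for j in range(len(T) - m + 1)]


def matrix_profile(T, m, exclusion=None):
    """Definition 6 (0-based): distance from each subsequence to its nearest neighbor.

    `exclusion` is a hypothetical parameter: neighbors whose start index is within
    `exclusion` positions of i are skipped as trivial matches.
    """
    if exclusion is None:
        exclusion = m // 4  # assumed default, not taken from the paper
    P = []
    for i in range(len(T) - m + 1):
        D = distance_profile(T, i, m)
        candidates = [d for j, d in enumerate(D) if abs(i - j) > exclusion]
        P.append(min(candidates) if candidates else math.inf)
    return P


if __name__ == "__main__":
    series = [0, 1, 0, 5, 0, 1, 0, 7]
    print(matrix_profile(series, 3, exclusion=1))
```

The sketch needs on the order of n^2 distance evaluations of m points each, which is the cost that the algorithms discussed in Section 2.2 aim to reduce.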
The i-th element of the Matrix profile P gives the Euclidean distance between the subsequence Ti,m and its nearest neighbor in the time series T. However, it does not indicate where that nearest neighbor is located, so the Matrix profile index is introduced:

Definition 7: The Matrix profile index I of a time series T is a vector of integers I = [I1, I2, ..., In-m+1], where Ii = j if di,j = min(Di) [7].

Figure 2: Example of a Matrix profile index of a time series [7].

The position of the minimum value in each column is stored in the Matrix profile index.

Definition 8: 1NN-join function. The first-nearest-neighbor (1NN) join function θ1NN(A[i], B[j]) between two subsequences A[i] and B[j] returns "True" if B[j] is the nearest neighbor of A[i]. The 1NN-join function is a similarity join operator applied to two sets of all subsequences; with it, we can create the AB similarity join set:

Definition 9: The AB similarity join JAB is the set of pairs of each subsequence in A with its nearest neighbor in B, and vice versa.

Definition 10: The join Matrix profile PABBA is an array of the Euclidean distances of each pair in JABBA. The length of PABBA is 2 × (n - L) + 2, which is twice the length of PAB.

2.2 ALGORITHM

The motif search problem in time series data is divided into two branches: exact search (Exact Motif) and approximate search (Approximate Motif). Each branch has its own advantages and disadvantages. Depending on the research goal (improving efficiency or improving accuracy in finding motifs), an appropriate method is selected. The SCRIMP++ algorithm combines two algorithms, PreSCRIMP and SCRIMP [7]. PreSCRIMP is an approximate motif search algorithm with complexity O(n^2 log n / s), where s is the sampling interval. SCRIMP is an exact search algorithm with complexity O(n^2). SCRIMP++ uses the PreSCRIMP algorithm as a preprocessing step on the time series; PreSCRIMP has the
ability to detect motifs in the time series, but it finds only an approximate Matrix Profile. That approximate Matrix Profile then serves as input to the SCRIMP algorithm, which computes the exact Matrix Profile. This is the idea behind the SCRIMP++ algorithm [7].

Figure 3: The SCRIMP++ algorithm is built on two algorithms, PreSCRIMP and SCRIMP [7].

2.2.1 SCRIMP algorithm

Before describing the SCRIMP algorithm, we review the z-normalized distance di,j between two subsequences Ti,m and Tj,m (z-normalization normalizes the amplitude of the values in the time series):

\[
d_{i,j} = \sqrt{2m\left(1 - \frac{Q_{i,j} - m\mu_i\mu_j}{m\sigma_i\sigma_j}\right)} \tag{2.1}
\]

where:
+ m is the length of the subsequence
+ Q_{i,j} is the dot product (convolution) of the points in Ti,m and Tj,m
+ \mu_i is the mean of Ti,m
+ \mu_j is the mean of Tj,m
+ \sigma_i is the standard deviation of Ti,m
+ \sigma_j is the standard deviation of Tj,m

The sliding dot product computation takes a query Q and the time series T as input, and outputs the dot product between Q and every subsequence of T.

Normalizing the input data is essential in the motif detection problem: it makes the data in the time series comparable during the computation. In the SCRIMP algorithm, the normalization statistics are computed at the beginning and are then used in the subsequent steps. Earlier motif detection algorithms also normalized the data before computation, but they separated the work into two distinct steps: normalization and motif detection. Separating the two steps requires an additional loop, which consumes more resources. Here, a sliding window of length m (the subsequence length) moves over the time series T, and normalization is performed at each slide. Normalizing at the moment each subsequence is extracted saves time, because we remove a loop
that would otherwise cut out the subsequences, store them, and normalize each one. The SCRIMP algorithm is presented in Table 1 below:

Table 1: SCRIMP algorithm [7]

SCRIMP algorithm
Input: A time series T and a subsequence length m
Output: Matrix profile P and matrix profile index I of time series T

n ← length of time series T
Calculate µ, σ of the time series T with the subsequence length m
Initialize initial values: P ← infs, I ← ones
Orders ← RandPerm(m/4+1 : n-m+1)   // evaluate the diagonals in random order
for k in Orders
    for i ← 1 to n-m+2-k
        if i = 1
            q ← DotProduct(T1,m, Tk,m)
        else
            q ← q - t(i-1)·t(i+k-2) + t(i+m-1)·t(i+k+m-2)
        end if
        d ← CalculateDistance(q, µi, σi, µi+k-1, σi+k-1)   (formula 2.1)
        if d
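
The diagonal traversal of Table 1 can be sketched in pure Python as follows. This is a hedged illustration, not the implementation of [7]. Indices are 0-based, so the loop variable `k` is the offset between the two subsequences rather than the 1-based diagonal number. Because the listing above breaks off after the distance computation, the update of P and I is an assumption derived from Definitions 6 and 7. The function name `scrimp`, the `seed` parameter, and the guard `max(1, m // 4)` are likewise assumptions, and the sketch assumes no subsequence is constant (otherwise σ = 0 and formula 2.1 is undefined).

```python
import math
import random


def scrimp(T, m, seed=0):
    """Sketch of SCRIMP: exact matrix profile P and index I via diagonal traversal."""
    n = len(T)
    count = n - m + 1  # number of subsequences
    # Mean and (population) standard deviation of every subsequence.
    mu = [sum(T[i:i + m]) / m for i in range(count)]
    sigma = [math.sqrt(sum((x - mu[i]) ** 2 for x in T[i:i + m]) / m)
             for i in range(count)]
    P = [math.inf] * count
    I = [-1] * count
    # Offsets below m/4 are skipped as trivial matches, mirroring
    # RandPerm(m/4+1 : n-m+1) in the 1-based pseudocode.
    first = max(1, m // 4)
    orders = list(range(first, count))
    random.Random(seed).shuffle(orders)  # evaluate the diagonals in random order
    for k in orders:
        q = 0.0
        for i in range(count - k):
            j = i + k
            if i == 0:
                # First cell of the diagonal: full dot product.
                q = sum(T[a] * T[k + a] for a in range(m))
            else:
                # Remaining cells: O(1) update from the previous cell.
                q = q - T[i - 1] * T[j - 1] + T[i + m - 1] * T[j + m - 1]
            corr = (q - m * mu[i] * mu[j]) / (m * sigma[i] * sigma[j])
            d = math.sqrt(max(2 * m * (1 - corr), 0.0))  # formula (2.1)
            # Assumed update step: keep the nearest neighbor for both subsequences.
            if d < P[i]:
                P[i], I[i] = d, j
            if d < P[j]:
                P[j], I[j] = d, i
    return P, I


if __name__ == "__main__":
    rng = random.Random(42)
    walk = [0.0]
    for _ in range(199):
        walk.append(walk[-1] + rng.uniform(-1, 1))
    P, I = scrimp(walk, 16)
    best = min(range(len(P)), key=P.__getitem__)
    print("motif pair starts at", best, "and", I[best])
```

Each diagonal costs one full dot product plus constant-time updates, which is where the O(n^2) total cost quoted in Section 2.2 comes from. The smallest entry of P and the corresponding entry of I give the start positions of the motif pair.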

Date posted: 04/12/2021, 11:49