Automatic Image Captioning Based on CNN, RNN and LSTM

MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF EDUCATION

Võ Đình Hùng

AUTOMATIC IMAGE CAPTIONING BASED ON CNN, RNN AND LSTM

MASTER'S THESIS IN COMPUTER SCIENCE
Major: Computer Science
Code: 8480101
SCIENTIFIC SUPERVISOR: Dr. TRẦN NGỌC BẢO

Ho Chi Minh City, 2020

DECLARATION

I declare that the thesis "Automatic Image Captioning Based on CNN, RNN and LSTM" is my own research, carried out under the guidance of my supervisor and not copied from anyone else. All materials consulted, built upon, or quoted in the thesis are listed in the references. I take full responsibility for this declaration.

Ho Chi Minh City, ... 2021
Student
Võ Đình Hùng

ACKNOWLEDGMENTS

I sincerely thank the lecturers of Ho Chi Minh City University of Education, especially those of the Computer Science department, for their dedicated teaching, their help, and the favorable conditions they created throughout my studies, which allowed me to complete this thesis.

I am deeply grateful to Dr. Trần Ngọc Bảo, who devoted himself to supervising this work and who helped and advised me throughout the research and the writing of the thesis.

I also sincerely thank my fellow graduate students of cohort 27 at Ho Chi Minh City University of Education for their help during my studies and during the work on this thesis.

Thank you very much!
TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
    Reasons for choosing the topic
    Scientific and practical basis of the topic
    Structure of the thesis
CHAPTER 1. THEORETICAL BACKGROUND: CONCEPTS AND MODELS IN DEEP LEARNING
    1.1 CNN (Convolutional Neural Network)
        1.1.1 The concept of a convolutional neural network
        1.1.2 CNN architecture
        1.1.3 Issues in CNNs
        1.1.4 Training the model
    1.2 THE YOLO MODEL
        1.2.1 How YOLO works
        1.2.2 Details of the YOLO model
        1.2.3 YOLO object detection with CNN
        1.2.4 YOLO architecture
    1.3 RNN (Recurrent Neural Network)
        1.3.1 The concept of an RNN
        1.3.2 Training an RNN
        1.3.3 Extended versions of RNN
    1.4 LSTM (Long Short-Term Memory network)
        1.4.1 Introduction to LSTM
        1.4.2 The LSTM model
CHAPTER 2. THE AUTOMATIC IMAGE CAPTIONING PROBLEM BASED ON CNN, RNN & LSTM
    2.1 Problem definition
    2.2 Idea of the solution
    2.3 Implementation process
        2.3.1 Object detection
        2.3.2 Image captioning
        2.3.3 Text to speech
CHAPTER 3: EXPERIMENTS
    3.1 DATA AND EXPERIMENTAL TOOLS
        3.1.1 Data
        3.1.2 Tools used
    3.2 EXPERIMENTS
        3.2.1 Experimental implementation of the model
        3.2.2 Evaluating model accuracy
        3.2.3 Experimental results
CONCLUSION AND FUTURE WORK
REFERENCES

LIST OF FIGURES

Figure 1.1 Convolution layer with a filter [11]
Figure 1.2 Computation with max pooling [11]
Figure 1.3 CNN weights [11]
Figure 1.4 Example images from CIFAR-10 [11]
Figure 1.6 Effect of the weights on the loss function [11]
Figure 1.7 Learning rate [11]
Figure 1.8 Feature filters [11]
Figure 1.9 Bounding box with coordinates x, y, w, h and confidence score [Internet source]
Figure 1.10 Detecting the center of an object [12]
Figure 1.11 Bounding box prediction [12]
Figure 1.12 Darknet-19 architecture [Internet source]
Figure 1.13 Darknet-53 structure [Internet source]
Figure 1.14 Diagram of a layered artificial neural network model [Internet source]
Figure 1.15 Information processing in an RNN [Internet source]
Figure 1.16 RNN and long-term dependencies [13]
Figure 1.17 The repeating module in an RNN contains a single layer [13]
Figure 1.18 The repeating module in an LSTM contains four layers [13]
Figure 1.19 The LSTM cell state resembles a conveyor belt [13]
Figure 1.20 LSTM gates are built from a sigmoid function and a multiplication operator [13]
Figure 1.21 LSTM focus i [13]
Figure 1.22 LSTM focus c [13]
Figure 1.23 LSTM focus o [13]
Figure 1.24 LSTM network model [13]
Figure 2.1 Image captioning model for the problem
Figure 2.2 Image model
Figure 2.3 Language model
Figure 2.4 Model of the problem
Figure 2.5 Model layers
Figure 3.1 A Flickr8k image
Figure 3.2 Model interface
Figure 3.3 Result when Vietnamese is selected

LIST OF ABBREVIATIONS

No.  Abbreviation  Meaning
1    AI            Artificial Intelligence
2    ANN           Artificial Neural Network
3    API           Application Programming Interface
4    CNN           Convolutional Neural Network
5    CSLT          Cơ sở lý thuyết (theoretical background)
6    DL            Deep Learning
7    GPU           Graphics Processing Unit
8    IOT           Internet of Things
9    KLT           Kanade-Lucas-Tomasi, a computer vision algorithm
10   LSTM          Long Short-Term Memory network
11   ReLU          Rectified Linear Unit
12   RNN           Recurrent Neural Network
13   YOLO          You Only Look Once, an image-processing framework

INTRODUCTION

Reasons for choosing the topic

In recent years we have witnessed remarkable achievements in computer vision. Large-scale image-processing systems at Facebook, Google, and Amazon have brought intelligent features into their products, such as recognizing users' faces, self-driving cars, and autonomous delivery drones.

Computers now offer great computing power at mainstream prices, so researchers can easily test artificial intelligence theories that were nearly out of reach years ago. Together with open-source software, the wave of artificial intelligence has surged in recent years, with many applications in daily life. For these reasons, studying and researching computer vision is of practical significance. Many countries already apply computer vision in daily life, for example SkyNet in China and automatic content-moderation systems.

Scientific and practical basis of the topic

Today, in the digital era, computers are an indispensable part of scientific research and daily life. However, because computer systems rest on classical theory (sets, binary logic), despite their great computing power and high accuracy they can only follow programs made of algorithms written in advance by programmers, and cannot yet reason or create on their own.

The convolutional neural network (CNN) is one of the most modern deep learning models. CNNs are widely used in intelligent systems because they combine high accuracy with fast computation. For this reason they are very strong at image processing and are widely applied in computer vision to problems related to object recognition.

The Long Short-Term Memory network (LSTM) is a special kind of RNN (recurrent neural network) capable of learning long-range dependencies. LSTM was introduced by Hochreiter & Schmidhuber (1997) and was subsequently refined and popularized; because it works effectively on many different problems, it has become widely used. LSTMs are designed to avoid the long-term dependency problem. Remembering information over long periods is their default behavior, not something they have to be trained to do. In other words, the network retains information on its own, without any intervention.

With advances in semiconductor technology, computers keep getting smaller and consuming less energy while growing more powerful. As a result, smart devices with high-resolution cameras, large memory, and powerful processors are present everywhere in daily life: smartphones, digital cameras, dashboard cameras, and so on. In addition, with the boom of the Internet of Things (IoT), many more smart devices are appearing, such as self-driving cars and autonomous delivery drones. Using artificial intelligence to exploit image data on smart devices is clearly becoming a trend.

From images captured by cameras, webcams, and similar devices, we can detect the objects in an image and produce an automatic caption as speech in many different languages. With the development of AI (Artificial Intelligence), visually impaired people can recognize surrounding objects by voice from images captured by a camera or webcam, which makes moving around easier for them.

Based on these observations and the suggestion of my supervisor, I decided to choose "AUTOMATIC IMAGE CAPTIONING BASED ON CNN, RNN AND LSTM" as the research topic for my master's thesis.

Scope of the topic

The thesis studies automatic image captioning on the Flickr8k image dataset.

CHAPTER 3: EXPERIMENTS

This chapter carries out experiments with the automatic image captioning model using CNN, RNN, and LSTM, and describes the experimental data and the tools used in the system.

3.1 DATA AND EXPERIMENTAL TOOLS

3.1.1 Data

The dataset used in this thesis is Flickr8k. It contains 8000 images: 6000 for the training set, 1000 for the dev (validation) set, and 1000 for the test set.

The download contains two folders: Flicker8k_Dataset and Flicker8k_Text. Flicker8k_Dataset contains the images, each named with a distinct id. Flicker8k_Text contains:
• Flickr_8k.testImages, Flickr_8k.devImages, Flickr_8k.trainImages: the ids of the images used for testing, validation, and training.
• Flickr8k.token: the captions of the images; each image has 5 captions.

For example, one image has the following 5 captions:
• A child in a pink dress is climbing up a set of stairs in an entry way
• A girl going into a wooden building
• A little girl climbing into a wooden playhouse
• A little girl climbing the stairs to her playhouse
• A little girl in a pink dress going into a wooden cabin

Figure 3.1 A Flickr8k image

One image with 5 captions yields 5 distinct training samples: (image, caption 1), (image, caption 2), (image, caption 3), (image, caption 4), (image, caption 5). The training set therefore has 6000 * 5 = 30000 samples.

3.1.2 Tools used

To test the model, this thesis combines open-source libraries and existing tools for data processing, model training, and prediction.

TensorFlow is an open-source library for developing machine learning applications. Applications are built using graphs to organize the flow of operations and tensors to represent data. It provides an application programming interface (API) in Python,
while its lower-level functions are implemented in C++. It offers features that allow rapid prototyping and deployment of machine learning models on highly heterogeneous computing platforms.

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and its extensive standard library are freely available in source form to all.

3.2 EXPERIMENTS

3.2.1 Experimental implementation of the model

• Import the required libraries (Keras, TensorFlow)

    from keras.preprocessing import sequence
    from keras.models import Sequential
    import numpy as np
    import pandas as pd
    import pickle
    from keras.preprocessing import image
    import keras
    from keras import backend
    from keras.models import load_model
    import time
    from PIL import Image
    from flask import Flask, request
    import tensorflow as tf
    import os

• Read the Flickr image data and create the files trainimgs.txt and testimgs.txt

    # DataPath must point to the dataset root; its definition is not shown in this listing.
    token_dir = DataPath + "Flickr8k_text/Flickr8k.token.txt"
    image_captions = open(token_dir).read().split('\n')
    caption = {}
    for i in range(len(image_captions) - 1):
        id_capt = image_captions[i].split("\t")
        # strip the trailing #0, #1, #2, #3, #4 from the image id in the token file
        id_capt[0] = id_capt[0][:len(id_capt[0]) - 2]
        if id_capt[0] in caption:
            caption[id_capt[0]].append(id_capt[1])
        else:
            caption[id_capt[0]] = [id_capt[1]]

    # Create the files "trainimgs.txt" and "testimgs.txt".
    # The start/end markers wrapped around each caption were stripped from the
    # extracted text; "<start>" and "<end>" are assumed here.
    train_imgs_id = open(DataPath + "Flickr8k_text/Flickr_8k.trainImages.txt").read().split('\n')[:-1]
    train_imgs_captions = open(DataPath + "Flickr8k_text/trainimgs.txt", 'w')
    for img_id in train_imgs_id:
        for captions in caption[img_id]:
            desc = "<start> " + captions + " <end>"
            train_imgs_captions.write(img_id + "\t" + desc + "\n")
            train_imgs_captions.flush()
    train_imgs_captions.close()

    test_imgs_id = open(DataPath + "Flickr8k_text/Flickr_8k.testImages.txt").read().split('\n')[:-1]
    test_imgs_captions = open(DataPath + "Flickr8k_text/testimgs.txt", 'w')
    for img_id in test_imgs_id:
        for captions in caption[img_id]:
            desc = "<start> " + captions + " <end>"
            test_imgs_captions.write(img_id + "\t" + desc + "\n")
            test_imgs_captions.flush()
    test_imgs_captions.close()

• Object detection

The YOLOv3 model is used with pre-trained weights. We load the pre-trained model weights from the file yolo3.weights. The modules are implemented as follows:

    import numpy as np
    from numpy import expand_dims
    from keras.models import load_model, Model
    from keras.preprocessing.image import load_img
    from keras.preprocessing.image import img_to_array
    from matplotlib import pyplot
    from matplotlib.patches import Rectangle

    DATA_PATH = 'data'
    IMG_SAVED_PATH = 'static/'


    class BoundBox:
        def __init__(self, xmin, ymin, xmax, ymax, objness=None, classes=None):
            self.xmin = xmin
            self.ymin = ymin
            self.xmax = xmax
            self.ymax = ymax
            self.objness = objness
            self.classes = classes
            self.label = -1
            self.score = -1

        def get_label(self):
            if self.label == -1:
                self.label = np.argmax(self.classes)
            return self.label

        def get_score(self):
            if self.score == -1:
                self.score = self.classes[self.get_label()]
            return self.score


    def _sigmoid(x):
        return 1. / (1 + np.exp(-x))


    def decode_netout(netout, anchors, obj_thresh, net_h, net_w):
        grid_h, grid_w = netout.shape[:2]
        nb_box = 3  # three anchor boxes per scale (six anchor values each)
        netout = netout.reshape((grid_h, grid_w, nb_box, -1))
        nb_class = netout.shape[-1] - 5  # 4 box coordinates + 1 objectness score
        boxes = []
        netout[..., :2] = _sigmoid(netout[..., :2])
        netout[..., 4:] = _sigmoid(netout[..., 4:])
        netout[..., 5:] = netout[..., 4][..., np.newaxis] * netout[..., 5:]
        netout[..., 5:] *= netout[..., 5:] > obj_thresh
        for i in range(grid_h * grid_w):
            row = i / grid_w
            col = i % grid_w
            for b in range(nb_box):
                # 4th element is objectness score
                objectness = netout[int(row)][int(col)][b][4]
                # if (objectness.all() ...
                # [The text breaks off here. The rest of decode_netout and the
                #  definitions of correct_yolo_boxes and bbox_iou, both called
                #  below, are missing from this listing.]


    # The "def" line of do_nms is also missing from the text; the signature below
    # is inferred from the call do_nms(boxes, 0.5) in detect_object.
    def do_nms(boxes, nms_thresh):
        if len(boxes) > 0:
            nb_class = len(boxes[0].classes)
        else:
            return
        for c in range(nb_class):
            sorted_indices = np.argsort([-box.classes[c] for box in boxes])
            for i in range(len(sorted_indices)):
                index_i = sorted_indices[i]
                if boxes[index_i].classes[c] == 0:
                    continue
                for j in range(i + 1, len(sorted_indices)):
                    index_j = sorted_indices[j]
                    if bbox_iou(boxes[index_i], boxes[index_j]) >= nms_thresh:
                        boxes[index_j].classes[c] = 0


    # load and prepare an image
    def load_image_pixels(filename, shape):
        # load the image to get its shape
        image = load_img(filename)
        width, height = image.size
        # load the image with the required size
        image = load_img(filename, target_size=shape)
        # convert to numpy array
        image = img_to_array(image)
        # scale pixel values to [0, 1]
        image = image.astype('float32')
        image /= 255.0
        # add a dimension so that we have one sample
        image = expand_dims(image, 0)
        return image, width, height


    # get all of the results above a threshold
    def get_boxes(boxes, labels, thresh):
        v_boxes, v_labels, v_scores = list(), list(), list()
        # enumerate all boxes
        for box in boxes:
            # enumerate all possible labels
            for i in range(len(labels)):
                # check if the threshold for this label is high enough
                if box.classes[i] > thresh:
                    v_boxes.append(box)
                    v_labels.append(labels[i])
                    v_scores.append(box.classes[i] * 100)
                    # don't break, many labels may trigger for one box
        return v_boxes, v_labels, v_scores


    def draw_boxes(filename, v_boxes, v_labels, v_scores):
        # load the image
        data = pyplot.imread(filename)
        # plot the image
        pyplot.imshow(data)
        # get the context for drawing boxes
        ax = pyplot.gca()
        # plot each box
        for i in range(len(v_boxes)):
            box = v_boxes[i]
            # get coordinates
            y1, x1, y2, x2 = box.ymin, box.xmin, box.ymax, box.xmax
            # calculate width and height of the box
            width, height = x2 - x1, y2 - y1
            # create the shape
            rect = Rectangle((x1, y1), width, height, fill=False, color='white')
            # draw the box
            ax.add_patch(rect)
            # draw text and score in top left corner
            label = "%s (%.3f)" % (v_labels[i], v_scores[i])
            pyplot.text(x1, y1, label, bbox=dict(facecolor='green', alpha=0.8))
        # save the plot instead of showing it
        # pyplot.figure(figsize=(12,9))
        pyplot.axis('off')
        pyplot.savefig(filename[:-4] + '_detected' + filename[-4:],
                       pad_inches=0, bbox_inches='tight', transparent=True)
        # pyplot.show()
        pyplot.clf()
        # drop the 7-character 'static/' prefix from the returned name
        return filename[7:-4] + '_detected' + filename[-4:]


    def init_model():
        model = load_model(DATA_PATH + '/model.h5')
        # works around the "... is not an element of this graph." error when loading the model
        model._make_predict_function()
        print("Object detection model loaded")
        return model


    def detect_object(model, image_name):
        # view yolov3 model
        # model.summary()
        # print(len(model.layers))
        # print(model.layers[250].output)
        # new_model = extract_model(model)
        # define the expected input shape for the model
        input_w, input_h = 416, 416
        # define our new photo
        image_path = IMG_SAVED_PATH + image_name
        # load and prepare image
        image, image_w, image_h = load_image_pixels(image_path, (input_w, input_h))
        # make prediction
        yhat = model.predict(image)
        # summarize the shape of the list of arrays
        print([a.shape for a in yhat])
        # feature = new_model.predict(image)
        # print("feature", feature.shape)
        # define the anchors
        anchors = [[116, 90, 156, 198, 373, 326],
                   [30, 61, 62, 45, 59, 119],
                   [10, 13, 16, 30, 33, 23]]
        # define the probability threshold for detected objects
        class_threshold = 0.6
        boxes = list()
        for i in range(len(yhat)):
            # decode the output of the network
            boxes += decode_netout(yhat[i][0], anchors[i], class_threshold, input_h, input_w)
        # correct the sizes of the bounding boxes for the shape of the image
        correct_yolo_boxes(boxes, image_h, image_w, input_h, input_w)
        # suppress non-maximal boxes
        do_nms(boxes, 0.5)
        # define the labels
        labels = ["person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train", "truck",
                  "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench",
                  "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra",
                  "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
                  "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
                  "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup",
                  "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange",
                  "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "sofa",
                  "pottedplant", "bed", "diningtable", "toilet", "tvmonitor", "laptop", "mouse",
                  "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
                  "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
                  "hair drier", "toothbrush"]
        # get the details of the detected objects
        v_boxes, v_labels, v_scores = get_boxes(boxes, labels, class_threshold)
        # summarize what we found
        for i in range(len(v_boxes)):
            print(v_labels[i], v_scores[i])
        # draw what we found and save to file
        detected_image = draw_boxes(image_path, v_boxes, v_labels, v_scores)
        return detected_image


    def extract_model(model):
        # re-structure the model
        # model1 = load_model(DATA_PATH + '/model.h5')
        # model1.layers.pop()
        new_model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
        # model._make_predict_function()
        # new_model._make_predict_function()
        # summarize
        # print(new_model.summary())
        print(new_model.layers[-1].get_config())
        return new_model

• Image captioning

Image captioning is a task that involves both computer vision and natural language processing. We describe what is happening in an image as text, in many different languages. The implementation is as follows:

    import pickle
    import numpy as np
    from keras.preprocessing import sequence, image
    from keras.models import load_model, Model
    from keras.applications.inception_v3 import InceptionV3

    DATA_PATH = 'data'
    IMAGE_PATH = 'static/'


    def preprocess_input(x):
        # scale pixel values from [0, 255] to [-1, 1]
        x /= 255.
        x -= 0.5
        x *= 2.
        return x


    def preprocess(image_path):
        img = image.load_img(image_path, target_size=(299, 299))
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        x = preprocess_input(x)
        return x


    def get_caption_model():
        model = load_model(DATA_PATH + '/imageCaption20.h5')
        print("Image captioning model loaded")
        # model.summary()
        return model


    def get_encode_img_model():
        model = InceptionV3(weights=DATA_PATH + '/inception_v3_weights_tf_dim_ordering_tf_kernels.h5')
        # use the output of the second-to-last layer as the image encoding
        new_input = model.input
        new_output = model.layers[-2].output
        model_new = Model(new_input, new_output)
        print('Encode image model loaded')
        return model_new


    def encode(model, image):
        image = preprocess(image)
        temp_enc = model.predict(image)
        temp_enc = np.reshape(temp_enc, temp_enc.shape[1])
        return temp_enc


    def processing():
        vocab = pickle.load(open(DATA_PATH + '/vocab.p', 'rb'))
        print(len(vocab))
        word_idx = {val: index for index, val in enumerate(vocab)}
        idx_word = {index: val for index, val in enumerate(vocab)}
        return word_idx, idx_word


    def predict_captions(encode_img_model, image_caption_model, image_name):
        # The start/end markers were stripped from the extracted text;
        # "<start>" and "<end>" are assumed here.
        start_word = ["<start>"]
        max_length = 40
        word_idx, idx_word = processing()
        encode_img = encode(encode_img_model, IMAGE_PATH + image_name)
        while True:
            now_caps = [word_idx[i] for i in start_word]
            now_caps = sequence.pad_sequences([now_caps], maxlen=max_length, padding='post')
            e = encode_img
            preds = image_caption_model.predict([np.array([e]), np.array(now_caps)])
            word_pred = idx_word[np.argmax(preds[0])]
            start_word.append(word_pred)
            # keep predicting the next word until the end marker is predicted
            # or the caption length exceeds max_length (40)
            if word_pred == "<end>" or len(start_word) > max_length:
                break
        return ' '.join(start_word[1:-1])

3.2.2 Evaluating model accuracy

    import matplotlib.pyplot as plt

    # history2 holds the training history; the training code is not shown in this listing.
    # Plot training & validation accuracy values
    plt.plot(history2.history['acc'])
    plt.plot(history2.history['val_acc'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='lower right')
    plt.show()

Evaluating model loss

    # Plot training & validation loss values
    plt.plot(history2.history['loss'])
    plt.plot(history2.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='upper right')
    plt.show()

3.2.3 Experimental results

Model interface

Figure 3.2 Model interface

• Choose the image to caption with Browse
• Choose the output language: Vietnamese, English, Spanish, or Japanese
• Click Process to run

Result: when Vietnamese is selected, the result is as follows:

Figure 3.3 Result when Vietnamese is selected

The image
on the left is the original image. The image on the right shows the detected objects. The caption, delivered as speech and as a line of text, is "Hai cầu thủ bóng đá chơi bóng đá sân" ("Two football players playing football on the field").

CONCLUSION AND FUTURE WORK

Conclusion

This thesis set out to apply artificial intelligence to everyday needs, supporting people in simple tasks and contributing to the Fourth Industrial Revolution (Industry 4.0). With the goal of adapting deep learning models to build an automatic image captioning system, the author has completed the thesis with the following results:
- An overview of deep learning models, with a detailed presentation of neural networks, CNN, RNN, LSTM, and YOLO.
- A program that generates automatic image captions as text and as speech in multiple languages.
- Captions that are generated successfully can help visually impaired people move around more easily.

However, owing to limits of time and knowledge, the thesis still has shortcomings that the author needs to keep studying:
- The sample dataset is still small; larger data sources need to be added.
- The program has so far been implemented on deep network architectures for study and research purposes only; applying it in practice requires larger data and more research time to complete the system.

Future work

Artificial neural networks have many practical applications, so the topic has many possible directions for development toward a more comprehensive system that exploits more information. The thesis has met its initial goal of building an automatic image captioning system, but many limitations remain:
- More training data is needed so that the deep network becomes more reliable and performs more effectively.
- Practical needs should be studied in order to improve the program and to re-implement the deep network architecture so that it works well with large databases.

REFERENCES

[1] Xu, Kelvin et al. Show, attend and tell: neural image caption generation with visual attention. arXiv:1502.03044, February 2015.
[2] Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang and Yuille, Alan. Deep captioning with multimodal recurrent neural networks. arXiv:1412.6632, December 2014.
[3] Alex Graves (2012), Supervised Sequence Labelling with Recurrent Neural Networks, Studies in Computational Intelligence, Springer.
[4] https://www.tensorflow.org/tutorials/text/image_captioning, accessed 08/01/2021.
[5] https://github.com/Faizan-E-Mustafa/Image-Captioning, accessed 10/01/2021.
[6] http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture10.pdf, accessed 10/01/2021.
[7] https://pythonprogramminglanguage.com/text-to-speech/, accessed 10/01/2021.
[8] https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b, accessed 11/01/2021.
[9] https://towardsdatascience.com/object-detection-using-yolov3-using-keras-80bf35e61ce1, accessed 10/01/2021.
[10] https://www.tensorflow.org/tutorials/text/image_captioning, accessed 07/01/2021.
[11] http://nhiethuyettre.me/mang-no-ron-tich-chap-convolutional-neural-network/, accessed 02/01/2018.
[12] https://pbcquoc.github.io/yolo/, accessed 02/01/2021.
[13] https://dominhhai.github.io/vi/2017/10/what-is-lstm/, accessed 02/01/2021.
[14] https://arxiv.org/abs/1411.4555, accessed 10/06/2020.
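Section 3.1.1 describes Flickr8k.token as holding several captions per image, each of which becomes its own (image, caption) training sample. The sketch below illustrates that grouping and pairing on a few in-memory lines. It is a minimal illustration, not the thesis code: the file names img_001.jpg and img_002.jpg, the third caption, and the helper names group_captions and training_pairs are hypothetical, and the line layout "image id, '#', caption index, tab, caption" is assumed from the listing in Section 3.2.1.

```python
from collections import defaultdict

# Hypothetical sample lines in the assumed "<image id>#<n>\t<caption>" layout.
sample_token_lines = [
    "img_001.jpg#0\tA girl going into a wooden building",
    "img_001.jpg#1\tA little girl climbing into a wooden playhouse",
    "img_002.jpg#0\tA dog runs across the grass",
]


def group_captions(lines):
    """Map each image id to the list of its captions."""
    captions = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines such as a trailing newline
        image_tag, text = line.split("\t", 1)
        image_id = image_tag.rsplit("#", 1)[0]  # drop the "#<n>" suffix
        captions[image_id].append(text)
    return dict(captions)


def training_pairs(captions, image_ids):
    """Expand every image into one (image id, caption) pair per caption."""
    return [(img, cap) for img in image_ids for cap in captions[img]]


captions = group_captions(sample_token_lines)
pairs = training_pairs(captions, ["img_001.jpg"])

# With 6000 training images and 5 captions each, the same expansion gives:
expected_full_train_size = 6000 * 5
```

Splitting on the last "#" rather than cutting a fixed two characters keeps the parsing correct even if a caption index ever had more than one digit.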

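The object detection listing in Section 3.2.1 calls bbox_iou inside its non-maximum suppression loop, but the definition of bbox_iou does not appear in this text. The sketch below shows one common way to compute intersection over union and to apply the same per-class suppression rule. It is an assumption-laden illustration rather than the thesis implementation: the Box dataclass is a hypothetical stand-in for BoundBox, and interval_overlap is a helper introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Axis-aligned box with per-class scores (hypothetical stand-in for BoundBox)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    classes: List[float] = field(default_factory=list)


def interval_overlap(a_min, a_max, b_min, b_max):
    """Length of the overlap between [a_min, a_max] and [b_min, b_max]."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def bbox_iou(box1, box2):
    """Intersection over union of two boxes; 0.0 when the union is empty."""
    inter_w = interval_overlap(box1.xmin, box1.xmax, box2.xmin, box2.xmax)
    inter_h = interval_overlap(box1.ymin, box1.ymax, box2.ymin, box2.ymax)
    intersect = inter_w * inter_h
    area1 = (box1.xmax - box1.xmin) * (box1.ymax - box1.ymin)
    area2 = (box2.xmax - box2.xmin) * (box2.ymax - box2.ymin)
    union = area1 + area2 - intersect
    return intersect / union if union > 0 else 0.0


def do_nms(boxes, nms_thresh):
    """Per class, zero the score of any box that overlaps a higher-scoring box."""
    if not boxes:
        return
    for c in range(len(boxes[0].classes)):
        order = sorted(range(len(boxes)), key=lambda k: -boxes[k].classes[c])
        for i, index_i in enumerate(order):
            if boxes[index_i].classes[c] == 0:
                continue  # already suppressed for this class
            for index_j in order[i + 1:]:
                if bbox_iou(boxes[index_i], boxes[index_j]) >= nms_thresh:
                    boxes[index_j].classes[c] = 0


boxes = [
    Box(0, 0, 10, 10, [0.9]),
    Box(1, 1, 11, 11, [0.8]),    # overlaps the first box heavily
    Box(20, 20, 30, 30, [0.7]),  # disjoint from both
]
do_nms(boxes, 0.5)
kept = [b.classes[0] for b in boxes]
```

In this example the first two boxes share an intersection of 81 and a union of 119, so their IoU exceeds the 0.5 threshold and the lower-scoring one is suppressed, while the disjoint third box is kept.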
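The predict_captions function in Section 3.2.1 builds a caption word by word, always taking the most probable next word until an end marker appears or a length limit is reached. The sketch below isolates that greedy decoding loop so it can be run without a trained model. Everything model-related here is hypothetical: the five-word vocabulary, the "<start>" and "<end>" markers, and toy_next_word_probs, which stands in for the call to image_caption_model.predict.

```python
import numpy as np

# Hypothetical toy vocabulary; the thesis loads its vocabulary from vocab.p.
vocab = ["<start>", "a", "dog", "runs", "<end>"]
word_idx = {w: i for i, w in enumerate(vocab)}
idx_word = {i: w for i, w in enumerate(vocab)}


def toy_next_word_probs(token_ids):
    """Stand-in for the caption model: favors the next vocabulary entry."""
    probs = np.full(len(vocab), 0.01)
    probs[min(token_ids[-1] + 1, len(vocab) - 1)] = 1.0
    return probs / probs.sum()


def greedy_caption(next_word_probs, max_length=40):
    """Append the most probable word until "<end>" or the length limit."""
    tokens = [word_idx["<start>"]]
    while len(tokens) <= max_length:
        next_id = int(np.argmax(next_word_probs(tokens)))
        tokens.append(next_id)
        if idx_word[next_id] == "<end>":
            break
    words = [idx_word[t] for t in tokens[1:]]  # drop the start marker
    if words and words[-1] == "<end>":
        words = words[:-1]                     # drop the end marker if present
    return " ".join(words)


caption = greedy_caption(toy_next_word_probs)
```

Unlike the listing in Section 3.2.1, which always drops the last generated word, this sketch removes the final word only when it is the end marker, so a caption cut off by the length limit keeps all of its words. Greedy decoding is the simplest choice; beam search is a common alternative when caption quality matters more than speed.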