(ESSAY) Automatic Image Captioning Based on CNN, RNN, and LSTM

MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF EDUCATION

Võ Đình Hùng

AUTOMATIC IMAGE CAPTIONING BASED ON CNN, RNN, AND LSTM

MASTER'S THESIS IN COMPUTER SCIENCE

Major: Computer Science
Code: 8480101

SCIENTIFIC SUPERVISOR: Dr. TRẦN NGỌC BẢO

Ho Chi Minh City, 2020

DECLARATION

I hereby declare that the thesis "Automatic Image Captioning Based on CNN, RNN, and LSTM" is my own research, carried out under the guidance of my supervisor, and is not copied from anyone else's work. The materials that the thesis consults, builds on, and cites are all listed in the references. I take full responsibility for this declaration.

Ho Chi Minh City, 2021
Student
Võ Đình Hùng

ACKNOWLEDGMENTS

I sincerely thank the lecturers of Ho Chi Minh City University of Education, especially those of the Computer Science department, who taught me with dedication, helped me, and created the best conditions for me throughout my studies so that I could complete this thesis.

I express my deep gratitude to Dr. Trần Ngọc Bảo, who provided dedicated scientific guidance, help, and advice throughout the research and the completion of this thesis.

I also sincerely thank my fellow graduate students of cohort 27 at Ho Chi Minh City University of Education for their help during my studies and in carrying out this work.

Thank you very much!
TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
  Reasons for choosing the topic
  Scientific and practical basis of the topic
  Structure of the thesis
CHAPTER 1. THEORETICAL BACKGROUND: CONCEPTS AND MODELS IN DEEP LEARNING
  1.1 CNN (Convolutional neural network)
    1.1.1 The concept of the convolutional neural network
    1.1.2 CNN architecture
    1.1.3 Issues with CNNs
    1.1.4 Training the model
  1.2 THE YOLO MODEL
    1.2.1 How YOLO works
    1.2.2 Details of the YOLO model
    1.2.3 YOLO object detection with CNN
    1.2.4 YOLO architecture
  1.3 RNN (Recurrent Neural Network)
    1.3.1 The concept of the RNN
    1.3.2 Training an RNN
    1.3.3 Extended versions of the RNN
  1.4 LSTM (Long Short-Term Memory network)
    1.4.1 Introduction to LSTM
    1.4.2 The LSTM model
CHAPTER 2. THE PROBLEM OF AUTOMATIC IMAGE CAPTIONING BASED ON CNN, RNN & LSTM
  2.1 Problem definition
  2.2 Idea of the approach
  2.3 Procedure
    2.3.1 Object detection
    2.3.2 Image captioning
    2.3.3 Converting text to ...
CHAPTER 3. MODEL EXPERIMENTS
  3.1 DATA AND EXPERIMENTAL TOOLS
    3.1.1 Data
    3.1.2 Tools used
  3.2 EXPERIMENTS
    3.2.1 Implementing the experimental model
    3.2.2 Evaluating model accuracy
    3.2.3 Experimental results
CONCLUSION AND FUTURE WORK
REFERENCES

LIST OF FIGURES

Figure 1.1 Illustration of a convolution layer
Figure 1.2 Computation with the ... method
Figure 1.3 CNN weights [11]
Figure 1.4 Example images from CIFA...
Figure 1.6 Effect of the weights on ...
Figure 1.7 Learning rate [1...
Figure 1.8 Feature filters ...
Figure 1.9 Illustration of ... box ... confidence [Source: Internet]
Figure 1.10 Illustration of ...
Figure 1.11 Predicting bounda...
Figure 1.12 Architecture of Darkne...
Figure 1.13 Structure of Dar...
Figure 1.14 Diagram of the ... model
Figure 1.15 Processing of ...
Figure 1.16 RNN dependencies ...
Figure 1.17 Repeating modules ...
Figure 1.18 Repeating modules ...
Figure 1.19 Cell state of LS...
Figure 1.20 State gate ...
Figure 1.21 LSTM focus i [1...
Figure 1.22 LSTM focus c [...
Figure 1.23 LSTM focus o [...
Figure 1.24 LSTM model ...
Figure 2.1 Image captioning model ...
Figure 2.2 Image model ...
Figure 2.3 Language model ...
Figure 2.4 Model of the problem: ...
Figure 2.5 Layer model ...
Figure 3.1 Flickr8k images
Figure 3.2 Model interface
Figure 3.3 Result when Vietnamese is selected

LIST OF ABBREVIATIONS

No.  Abbreviation
1    AI
2    ANN
3    API
4    CNN
5    CSLT
6    DL
7    GPU
8    IOT
9    KLT
10   LSTM
11   ReLU
12   RNN
13   YOLO

INTRODUCTION

Reasons for choosing the topic

In recent years we have witnessed many remarkable achievements in computer vision. Large-scale image processing systems at Facebook, Google, and Amazon have brought intelligent features into their products, such as recognizing users' faces, self-driving cars, and autonomous delivery drones.

Computers now offer great computing power at an affordable price, so researchers can easily test artificial intelligence theories themselves, something that was nearly impossible many years ago. Together with open source software, the wave of artificial intelligence has surged recently, with many applications in daily life. For these reasons, studying computer vision has practical significance. Many countries already apply computer vision in everyday life, for example China's SkyNet and automatic content moderation systems.

Scientific and practical basis of the topic

In today's digital era, computers are an indispensable part of scientific research and daily life. However, because computer systems are built on classical theory (sets, binary logic), they only execute programs made of algorithms written in advance by programmers, however large their computing capacity and however high their precision. They cannot yet reason or create on their own.

The convolutional neural network (CNN) is a modern deep learning model. CNNs are used in many intelligent systems because they combine high accuracy with fast computation. For that reason CNNs are strong at image processing and are widely applied in computer vision to problems related to object recognition.

Long Short-Term Memory networks (LSTM) are a special kind of RNN (recurrent neural network) capable of learning long-range dependencies. LSTM was introduced by Hochreiter & Schmidhuber (1997) and was later refined and popularized; because it works effectively on many different problems, it has become increasingly common. LSTM is designed to avoid the long-term dependency problem. Retaining information over long time spans is its default behavior rather than something that requires special training; its internal structure does this without extra intervention.

With advances in semiconductor technology, computers keep getting smaller and consuming less power while becoming more powerful. As a result, smart devices with high-resolution cameras, large memory, and powerful processors are present everywhere in daily life: smartphones, digital cameras, dashcams, and so on. In addition, with the boom of the Internet of Things (IoT), many more smart devices are appearing, such as self-driving cars and autonomous delivery drones. Using artificial intelligence to exploit image data on smart devices is therefore likely to become a trend.

From images captured by cameras, webcams, and similar devices, we can detect the objects in the image and produce an automatic caption as speech in several languages. With the development of AI (Artificial Intelligence), visually impaired people can recognize the objects around them through speech generated from images captured by a camera or webcam, which makes moving around easier for them.

Based on these observations and on my supervisor's suggestion, I chose the topic "AUTOMATIC IMAGE CAPTIONING BASED ON CNN, RNN, AND LSTM" for my master's thesis.

Scope of the topic

The thesis studies automatic image captioning on the Flickr 8k image dataset.

from flask import Flask, request
import tensorflow as tf
import os

• Read the Flickr image data and create the files trainimgs.txt and testimgs.txt

token_dir = DataPath + "Flickr8k_text/Flickr8k.token.txt"
image_captions = open(token_dir).read().split('\n')
caption = {}
for i in range(len(image_captions) - 1):
    id_capt = image_captions[i].split("\t")
    # strip the "#0" ... "#4" caption index from the image id in the tokens file
    id_capt[0] = id_capt[0][:len(id_capt[0]) - 2]
    if id_capt[0] in caption:
        caption[id_capt[0]].append(id_capt[1])
    else:
        caption[id_capt[0]] = [id_capt[1]]

# Create the files "trainimgs.txt" and "testimgs.txt"
train_imgs_id = open(DataPath + "Flickr8k_text/Flickr_8k.trainImages.txt").read().split('\n')[:-1]
train_imgs_captions = open(DataPath + "Flickr8k_text/trainimgs.txt", 'w')
for img_id in train_imgs_id:
    for captions in caption[img_id]:
        # wrap each caption in start and end tokens
        desc = "<start> " + captions + " <end>"
        train_imgs_captions.write(img_id + "\t" + desc + "\n")
        train_imgs_captions.flush()
train_imgs_captions.close()

test_imgs_id = open(DataPath + "Flickr8k_text/Flickr_8k.testImages.txt").read().split('\n')[:-1]
test_imgs_captions = open(DataPath + "Flickr8k_text/testimgs.txt", 'w')
for img_id in test_imgs_id:
    for captions in caption[img_id]:
        desc = "<start> " + captions + " <end>"
        test_imgs_captions.write(img_id + "\t" + desc + "\n")
        test_imgs_captions.flush()
test_imgs_captions.close()

• Object detection

The YOLOv3 model uses pre-trained weights. We load the pre-trained model weights from the file yolo3.weights. The modules are implemented as follows:

import numpy as np
from numpy import expand_dims
from keras.models import load_model, Model
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from matplotlib import pyplot
from matplotlib.patches import Rectangle

DATA_PATH = 'data'
IMG_SAVED_PATH = 'static/'


class BoundBox:
    def __init__(self, xmin, ymin, xmax, ymax, objness=None, classes=None):
        self.xmin = xmin
        self.ymin = ymin
        self.xmax = xmax
        self.ymax = ymax
        self.objness = objness
        self.classes = classes
        self.label = -1
        self.score = -1

    def get_label(self):
        if self.label == -1:
            self.label = np.argmax(self.classes)
        return self.label

    def get_score(self):
        if self.score == -1:
            self.score = self.classes[self.get_label()]
        return self.score


def _sigmoid(x):
    return 1. / (1. + np.exp(-x))


def decode_netout(netout, anchors, obj_thresh, net_h, net_w):
    grid_h, grid_w = netout.shape[:2]
    # three anchor boxes per grid cell
    nb_box = 3
    netout = netout.reshape((grid_h, grid_w, nb_box, -1))
    # each box holds x, y, w, h, objectness, then the class scores
    nb_class = netout.shape[-1] - 5
    boxes = []
    netout[..., :2] = _sigmoid(netout[..., :2])
    netout[..., 4:] = _sigmoid(netout[..., 4:])
    netout[..., 5:] = netout[..., 4][..., np.newaxis] * netout[..., 5:]
    netout[..., 5:] *= netout[..., 5:] > obj_thresh
    for i in range(grid_h * grid_w):
        row = i // grid_w
        col = i % grid_w
        for b in range(nb_box):
            # 4th element is objectness score
            objectness = netout[int(row)][int(col)][b][4]
            if objectness <= obj_thresh:
                continue
            # ... (the rest of decode_netout, and the definitions of
            # correct_yolo_boxes and bbox_iou, are missing from the source)


def do_nms(boxes, nms_thresh):
    if len(boxes) > 0:
        nb_class = len(boxes[0].classes)
    else:
        return
    for c in range(nb_class):
        # indices of the boxes, sorted by descending score for class c
        sorted_indices = np.argsort([-box.classes[c] for box in boxes])
        for i in range(len(sorted_indices)):
            index_i = sorted_indices[i]
            if boxes[index_i].classes[c] == 0:
                continue
            for j in range(i + 1, len(sorted_indices)):
                index_j = sorted_indices[j]
                # suppress the lower-scoring box when the overlap is too large
                if bbox_iou(boxes[index_i], boxes[index_j]) >= nms_thresh:
                    boxes[index_j].classes[c] = 0


# load and prepare an image
def load_image_pixels(filename, shape):
    # load the image to get its shape
    image = load_img(filename)
    width, height = image.size
    # load the image with the required size
    image = load_img(filename, target_size=shape)
    # convert to numpy array
    image = img_to_array(image)
    # scale pixel values to [0, 1]
    image = image.astype('float32')
    image /= 255.0
    # add a dimension so that we have one sample
    image = expand_dims(image, 0)
    return image, width, height


# get all of the results above a threshold
def get_boxes(boxes, labels, thresh):
    v_boxes, v_labels, v_scores = list(), list(), list()
    # enumerate all boxes
    for box in boxes:
        # enumerate all possible labels
        for i in range(len(labels)):
            # check if the threshold for this label is high enough
            if box.classes[i] > thresh:
                v_boxes.append(box)
                v_labels.append(labels[i])
                v_scores.append(box.classes[i] * 100)
                # don't break, many labels may trigger for one box
    return v_boxes, v_labels, v_scores


def draw_boxes(filename, v_boxes, v_labels, v_scores):
    # load the image
    data = pyplot.imread(filename)
    # plot the image
    pyplot.imshow(data)
    # get the context for drawing boxes
    ax = pyplot.gca()
    # plot each box
    for i in range(len(v_boxes)):
        box = v_boxes[i]
        # get coordinates
        y1, x1, y2, x2 = box.ymin, box.xmin, box.ymax, box.xmax
        # calculate width and height of the box
        width, height = x2 - x1, y2 - y1
        # create the shape
        rect = Rectangle((x1, y1), width, height, fill=False, color='white')
        # draw the box
        ax.add_patch(rect)
        # draw text and score in top left corner
        label = "%s (%.3f)" % (v_labels[i], v_scores[i])
        pyplot.text(x1, y1, label, bbox=dict(facecolor='green', alpha=0.8))
    # show the plot
    # pyplot.figure(figsize=(12,9))
    pyplot.axis('off')
    pyplot.savefig(filename[:-4] + '_detected' + filename[-4:], pad_inches=0, bbox_inches='tight', transparent=True)
    # pyplot.show()
    pyplot.clf()
    # return the saved file name without the leading 'static/' prefix
    return filename[7:-4] + '_detected' + filename[-4:]


def init_model():
    model = load_model(DATA_PATH + '/model.h5')
    # Fixed bug "<tensor> is not an element of this graph." when loading model
    model._make_predict_function()
    print("Object detection model loaded")
    return model


def detect_object(model, image_name):
    # view yolov3 model
    # model.summary()
    # print(len(model.layers))
    # print(model.layers[250].output)
    # new_model = extract_model(model)
    # define the expected input shape for the model
    input_w, input_h = 416, 416
    # define our new photo
    image_path = IMG_SAVED_PATH + image_name
    # load and prepare image
    image, image_w, image_h = load_image_pixels(image_path, (input_w, input_h))
    # make prediction
    yhat = model.predict(image)
    # summarize the shape of the list of arrays
    print([a.shape for a in yhat])
    # feature = new_model.predict(image)
    # print("feature", feature.shape)
    # define the anchors
    anchors = [[116, 90, 156, 198, 373, 326], [30, 61, 62, 45, 59, 119], [10, 13, 16, 30, 33, 23]]
    # define the probability threshold for detected objects
    class_threshold = 0.6
    boxes = list()
    for i in range(len(yhat)):
        # decode the output of the network
        boxes += decode_netout(yhat[i][0], anchors[i], class_threshold, input_h, input_w)
    # correct the sizes of the bounding boxes for the shape of the image
    correct_yolo_boxes(boxes, image_h, image_w, input_h, input_w)
    # suppress non-maximal boxes
    do_nms(boxes, 0.5)
    # define the labels (the 80 COCO classes, in the order of the model's class indices)
    labels = ["person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train", "truck",
              "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench",
              "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
              "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard",
              "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
              "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana",
              "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake",
              "chair", "sofa", "pottedplant", "bed", "diningtable", "toilet", "tvmonitor", "laptop", "mouse",
              "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator",
              "book", "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"]
    # get the details of the detected objects
    v_boxes, v_labels, v_scores = get_boxes(boxes, labels, class_threshold)
    # summarize what we found
    for i in range(len(v_boxes)):
        print(v_labels[i], v_scores[i])
    # draw what we found and save to file
    detected_image = draw_boxes(image_path, v_boxes, v_labels, v_scores)
    return detected_image


def extract_model(model):
    # re-structure the model
    # model1 = load_model(DATA_PATH + '/model.h5')
    # model1.layers.pop()
    new_model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # model._make_predict_function()
    # new_model._make_predict_function()
    # summarize
    # print(new_model.summary())
    print(new_model.layers[1].get_config())
    return new_model

• Image captioning

Image captioning is a task that involves both computer vision and natural language processing. We describe what is happening in an image as text in several languages. The implementation is as follows:

import pickle
import numpy as np
from keras.preprocessing import sequence, image
from keras.models import load_model, Model
from keras.applications.inception_v3 import InceptionV3

DATA_PATH = 'data'
IMAGE_PATH = 'static/'


def preprocess_input(x):
    # scale pixel values from [0, 255] to [-1, 1]
    x /= 255.
    x -= 0.5
    x *= 2.
    return x


def preprocess(image_path):
    img = image.load_img(image_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return x


def get_caption_model():
    model = load_model(DATA_PATH + '/imageCaption20.h5')
    print("Image captioning model loaded")
    # model.summary()
    return model


def get_encode_img_model():
    model = InceptionV3(weights=DATA_PATH + '/inception_v3_weights_tf_dim_ordering_tf_kernels.h5')
    # use the output of the second-to-last layer as the image encoding
    new_input = model.input
    new_output = model.layers[-2].output
    model_new = Model(new_input, new_output)
    print('Encode image model loaded')
    return model_new


def encode(model, image):
    image = preprocess(image)
    temp_enc = model.predict(image)
    temp_enc = np.reshape(temp_enc, temp_enc.shape[1])
    return temp_enc


def processing():
    vocab = pickle.load(open(DATA_PATH + '/vocab.p', 'rb'))
    print(len(vocab))
    word_idx = {val: index for index, val in enumerate(vocab)}
    idx_word = {index: val for index, val in enumerate(vocab)}
    return word_idx, idx_word


def predict_captions(encode_img_model, image_caption_model, image_name):
    start_word = ["<start>"]
    max_length = 40
    word_idx, idx_word = processing()
    encode_img = encode(encode_img_model, IMAGE_PATH + image_name)
    # keep predicting the next word until "<end>" is predicted
    # or the caption is longer than max_length (40)
    while True:
        now_caps = [word_idx[i] for i in start_word]
        now_caps = sequence.pad_sequences([now_caps], maxlen=max_length, padding='post')
        e = encode_img
        preds = image_caption_model.predict([np.array([e]), np.array(now_caps)])
        word_pred = idx_word[np.argmax(preds[0])]
        start_word.append(word_pred)
        if word_pred == "<end>" or len(start_word) > max_length:
            break
    return ' '.join(start_word[1:-1])

3.2.2 Evaluating model accuracy

import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.plot(history2.history['acc'])
plt.plot(history2.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='lower right')
plt.show()

Evaluating model loss

# Plot training & validation loss values
plt.plot(history2.history['loss'])
plt.plot(history2.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper right')
plt.show()

3.2.3 Experimental results

Model interface

Figure 3.2 Model interface

• Choose the image to caption with Browse.
• Choose the output language: Vietnamese, English, Spanish, or Japanese.
• Press Process to run.

Result: when Vietnamese is selected, the output is as follows:

Figure 3.3 Result when Vietnamese is selected

The image on the left is the original image. The image on the right shows the detected objects. The caption, delivered as speech and as a line of text, is “Hai cầu thủ bóng đá chơi bóng đá sân” ("Two soccer players playing soccer on the field").

CONCLUSION AND FUTURE WORK

Conclusion

The thesis started from the idea of applying artificial intelligence to everyday needs, supporting people in simple tasks and contributing to the fourth industrial revolution. With the goal of improving deep learning models and using them to build an automatic image captioning system, the author completed the thesis with the following results:

- An overview of deep learning models, with a detailed presentation of neural network models, CNN, RNN, LSTM, and YOLO.
- A program that generates image captions automatically as text and as speech in several languages.
- When a caption is generated successfully, it can help visually impaired people move around more easily.

However, because of limits in time and knowledge, the thesis still has shortcomings that the author needs to keep studying:

- The sample dataset is still small, and larger data sources need to be added. The program has so far been implemented on deep network architectures for study and research purposes only; applying it in practice requires larger datasets and more research time to complete the system.

Future work

Artificial neural networks have many practical applications, and the topic can be developed in many directions toward a more comprehensive system that exploits more information. The thesis achieved its initial goal of building an automatic image captioning system, but many limitations remain:

- More training data is needed so that the deep learning model becomes more reliable and works more effectively.
- Real-world needs should be studied in order to improve the program and to re-implement the deep network architectures studied so that they work well with large databases.

REFERENCES

[1] Xu, Kelvin, et al. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044, February 2015.
[2] Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan. Deep captioning with multimodal recurrent neural networks. arXiv:1412.6632, December 2014.
[3] Graves, Alex (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, Springer.
[4] https://www.tensorflow.org/tutorials/text/image_captioning, accessed 08/01/2021.
[5] https://github.com/Faizan-E-Mustafa/Image-Captioning, accessed 10/01/2021.
[6] http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture10.pdf, accessed 10/01/2021.
[7] https://pythonprogramminglanguage.com/text-to-speech/, accessed 10/01/2021.
[8] https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b, accessed 11/01/2021.
[9] https://towardsdatascience.com/object-detection-using-yolov3-using-keras-80bf35e61ce1, accessed 10/01/2021.
[10] https://www.tensorflow.org/tutorials/text/image_captioning, accessed 07/01/2021.
[11] http://nhiethuyettre.me/mang-no-ron-tich-chap-convolutional-neural-network/, accessed 02/01/2018.
[12] https://pbcquoc.github.io/yolo/, accessed 02/01/2021.
[13] https://dominhhai.github.io/vi/2017/10/what-is-lstm/, accessed 02/01/2021.
[14] https://arxiv.org/abs/1411.4555, accessed 10/06/2020.
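The `do_nms` routine in the object detection listing calls `bbox_iou`, but the definition of that helper does not appear in the text. The following is a minimal sketch of how such a helper could look, assuming boxes carry `xmin`, `ymin`, `xmax`, and `ymax` attributes as `BoundBox` does. The names `Box` and `interval_overlap` are illustrative and are not taken from the thesis code.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for BoundBox: corner coordinates only."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def interval_overlap(a_min, a_max, b_min, b_max):
    """Length of the overlap between [a_min, a_max] and [b_min, b_max]."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def bbox_iou(box1, box2):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = interval_overlap(box1.xmin, box1.xmax, box2.xmin, box2.xmax)
    inter_h = interval_overlap(box1.ymin, box1.ymax, box2.ymin, box2.ymax)
    intersection = inter_w * inter_h
    area1 = (box1.xmax - box1.xmin) * (box1.ymax - box1.ymin)
    area2 = (box2.xmax - box2.xmin) * (box2.ymax - box2.ymin)
    union = area1 + area2 - intersection
    # Degenerate boxes give a zero union; treat that as no overlap.
    return intersection / union if union > 0 else 0.0


a = Box(0, 0, 2, 2)
b = Box(1, 1, 3, 3)
print(bbox_iou(a, b))  # intersection 1, union 7
```

With the threshold of 0.5 used in `do_nms(boxes, 0.5)`, these two boxes would both be kept, since their IoU of 1/7 is well below the threshold.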
... LSTM provides more control and therefore better results, but this comes with greater complexity and higher operating cost.

CHAPTER 2. THE PROBLEM OF AUTOMATIC IMAGE CAPTIONING BASED ON CNN, RNN & LSTM

2.1 Problem definition

• Input: Image ...
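The `predict_captions` function in the image captioning listing decodes greedily: at each step it feeds the words generated so far to the model and appends the single most probable next word, until the end token appears or the length limit is reached. The loop can be sketched without Keras as follows. Here `toy_model`, the small vocabulary, and the `NEXT` transition table are hypothetical stand-ins for the trained model and the `vocab.p` vocabulary, not part of the thesis code.

```python
# Hypothetical toy vocabulary; the thesis loads its vocabulary from vocab.p.
vocab = ["<start>", "<end>", "two", "players", "play", "soccer"]
word_idx = {w: i for i, w in enumerate(vocab)}
idx_word = {i: w for i, w in enumerate(vocab)}

# Hypothetical stand-in for the trained captioning model: the next word
# depends only on the last word generated so far.
NEXT = {"<start>": "two", "two": "players", "players": "play",
        "play": "soccer", "soccer": "<end>"}


def toy_model(sequence):
    """Return one score per vocabulary word for the next position."""
    probs = [0.01] * len(vocab)
    last_word = idx_word[sequence[-1]]
    probs[word_idx[NEXT[last_word]]] = 0.95
    return probs


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_decode(model, max_length=40):
    """Greedy decoding: always append the most probable next word."""
    words = ["<start>"]
    while len(words) <= max_length:
        sequence = [word_idx[w] for w in words]
        next_word = idx_word[argmax(model(sequence))]
        words.append(next_word)
        if next_word == "<end>":
            break
    # Drop the start token and, if present, the end token.
    if words[-1] == "<end>":
        words = words[:-1]
    return " ".join(words[1:])


caption = greedy_decode(toy_model)
print(caption)
```

Unlike the listing, which always drops the last generated word, this sketch removes the final token only when it is the end token, so a caption cut off by the length limit keeps all of its words. Greedy decoding is simple and fast; a beam search would keep several candidate captions per step, at a higher cost.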

Title	Automatic Image Captioning Based on CNN, RNN, and LSTM
Author	Võ Đình Hùng
Supervisor	Dr. Trần Ngọc Bảo
Institution	Ho Chi Minh City University of Education
Major	Computer Science
Type	Master's Thesis
Year of publication	2020
City	Ho Chi Minh City

Format
Pages	67
File size	1.05 MB