EXPERIMENTAL DATA AND TOOLS

3.1.1 Data:

The dataset used in this thesis is Flickr8k. It contains 8000 images: 6000 for the training set, 1000 for the dev set (validation set), and 1000 for the test set. The download contains two folders: Flicker8k_Dataset and Flicker8k_Text. Flicker8k_Dataset holds the images, each named by a distinct id. Flicker8k_Text contains:

• Flickr_8k.trainImages, Flickr_8k.devImages, Flickr_8k.testImages: the ids of the images used for training, validation, and testing.

• Flickr8k.token: the captions of the images; each image has 5 captions.

For example, the image below has 5 captions:

• A child in a pink dress is climbing up a set of stairs in an entry way.

• A girl going into a wooden building .

• A little girl climbing into a wooden playhouse .

• A little girl climbing the stairs to her playhouse .

• A little girl in a pink dress going into a wooden cabin .

Figure 3.1 An image from Flickr8k

An image with 5 captions yields 5 separate training samples: (image, caption 1), (image, caption 2), (image, caption 3), (image, caption 4), (image, caption 5). The training set therefore contains 6000 * 5 = 30000 samples.
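The pairing described above can be sketched in a few lines. This is only an illustration, not the thesis code: the `captions` dictionary, the image id `img_001.jpg`, and the helper `expand_pairs` are hypothetical names introduced here.

```python
# Hypothetical stand-in for the caption dictionary: image id -> list of 5 captions.
captions = {
    "img_001.jpg": [
        "A child in a pink dress is climbing up a set of stairs in an entry way .",
        "A girl going into a wooden building .",
        "A little girl climbing into a wooden playhouse .",
        "A little girl climbing the stairs to her playhouse .",
        "A little girl in a pink dress going into a wooden cabin .",
    ],
}


def expand_pairs(captions):
    """Flatten {image_id: [caption, ...]} into a list of (image_id, caption) pairs."""
    return [(img_id, cap) for img_id, caps in captions.items() for cap in caps]


pairs = expand_pairs(captions)
print(len(pairs))  # one image with 5 captions gives 5 samples

# The same rule applied to the whole training split.
n_train_images = 6000
captions_per_image = 5
print(n_train_images * captions_per_image)
```

Each pair is an independent training sample, so the same image features are reused five times with different target sentences.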

3.1.2 Tools used:

To test the model in this thesis, we combined open-source libraries and existing tools for data processing, model training, and prediction.

TensorFlow is an open-source library for developing machine learning applications. These applications are built using graphs to organize the flow of operations and tensors to represent the data. It provides an application programming interface (API) in Python, as well as a set of lower-level functions implemented in C++. It offers features for faster prototyping and implementation of machine learning models and applications on highly heterogeneous computing platforms.

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and its extensive standard library are freely available in source form to everyone.

3.2 EXPERIMENTS

3.2.1 Experimental setup of the model.

• Import the required libraries (Keras, TensorFlow).

from keras.preprocessing import sequence
from keras.models import Sequential
import numpy as np
import pandas as pd
import pickle
from keras.preprocessing import image
import keras
from keras import backend
from keras.models import load_model
import time
from PIL import Image
from flask import Flask, request
import tensorflow as tf
import os

• Read the Flickr image captions and create the files trainimgs.txt and testimgs.txt.

# DataPath must be defined beforehand as the dataset root folder, ending with a path separator.
token_dir = DataPath + "Flickr8k_text/Flickr8k.token.txt"
image_captions = open(token_dir).read().split('\n')
caption = {}
for i in range(len(image_captions) - 1):
    id_capt = image_captions[i].split("\t")
    # strip the "#0", "#1", "#2", "#3", "#4" suffix from the image id in the tokens file
    id_capt[0] = id_capt[0][:len(id_capt[0]) - 2]
    if id_capt[0] in caption:
        caption[id_capt[0]].append(id_capt[1])
    else:
        caption[id_capt[0]] = [id_capt[1]]

# Create two files named "trainimgs.txt" and "testimgs.txt".
train_imgs_id = open(DataPath + "Flickr8k_text/Flickr_8k.trainImages.txt").read().split('\n')[:-1]
train_imgs_captions = open(DataPath + "Flickr8k_text/trainimgs.txt", 'w')
for img_id in train_imgs_id:
    for captions in caption[img_id]:
        desc = "<start> " + captions + " <end>"
        train_imgs_captions.write(img_id + "\t" + desc + "\n")
        train_imgs_captions.flush()
train_imgs_captions.close()

test_imgs_id = open(DataPath + "Flickr8k_text/Flickr_8k.testImages.txt").read().split('\n')[:-1]
test_imgs_captions = open(DataPath + "Flickr8k_text/testimgs.txt", 'w')
for img_id in test_imgs_id:
    for captions in caption[img_id]:
        desc = "<start> " + captions + " <end>"
        test_imgs_captions.write(img_id + "\t" + desc + "\n")
        test_imgs_captions.flush()
test_imgs_captions.close()
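The parsing step can be tried without the dataset on disk. The sketch below assumes the token file layout implied by the listing, one `<image id>#<n>`, a tab, then the caption per line. The image id `example_image.jpg` and the helpers `parse_tokens` and `wrap` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Two sample lines in the assumed token layout: "<image id>#<n>\t<caption>".
# The image id is a made-up placeholder, not a real file from the dataset.
sample = (
    "example_image.jpg#0\tA girl going into a wooden building .\n"
    "example_image.jpg#1\tA little girl climbing into a wooden playhouse .\n"
)


def parse_tokens(text):
    """Group captions by image id, dropping the '#<n>' suffix."""
    captions = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines such as a trailing newline
        image_ref, caption = line.split("\t", 1)
        image_id = image_ref.rsplit("#", 1)[0]
        captions[image_id].append(caption)
    return dict(captions)


def wrap(caption):
    """Add the start/end markers used when writing the training file."""
    return "<start> " + caption + " <end>"


captions = parse_tokens(sample)
print(captions["example_image.jpg"])
print(wrap(captions["example_image.jpg"][0]))
```

Splitting on the last `#` rather than slicing off two characters keeps the sketch working even if a caption index had more than one digit.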

• Object detection

The YOLOv3 model uses pre-trained weights. We download the pre-trained model weights as the file yolo3.weights. The modules used in the implementation are as follows:

import numpy as np
from numpy import expand_dims
from keras.models import load_model, Model
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from matplotlib import pyplot
from matplotlib.patches import Rectangle

DATA_PATH = 'data'
IMG_SAVED_PATH = 'static/'

class BoundBox:
    def __init__(self, xmin, ymin, xmax, ymax, objness=None, classes=None):
        self.xmin = xmin
        self.ymin = ymin
        self.xmax = xmax
        self.ymax = ymax
        self.objness = objness
        self.classes = classes
        self.label = -1
        self.score = -1

    def get_label(self):
        if self.label == -1:
            self.label = np.argmax(self.classes)
        return self.label

    def get_score(self):
        if self.score == -1:
            self.score = self.classes[self.get_label()]
        return self.score

def _sigmoid(x):
    return 1. / (1. + np.exp(-x))

def decode_netout(netout, anchors, obj_thresh, net_h, net_w):
    grid_h, grid_w = netout.shape[:2]
    nb_box = 3
    netout = netout.reshape((grid_h, grid_w, nb_box, -1))
    nb_class = netout.shape[-1] - 5
    boxes = []
    netout[..., :2] = _sigmoid(netout[..., :2])
    netout[..., 4:] = _sigmoid(netout[..., 4:])
    netout[..., 5:] = netout[..., 4][..., np.newaxis] * netout[..., 5:]
    netout[..., 5:] *= netout[..., 5:] > obj_thresh
    for i in range(grid_h * grid_w):
        row = i // grid_w
        col = i % grid_w
        for b in range(nb_box):
            # 4th element is objectness score
            objectness = netout[row][col][b][4]
            # skip boxes whose objectness does not exceed the threshold
            if objectness <= obj_thresh:
                continue
            # first 4 elements are x, y, w, and h
            x, y, w, h = netout[row][col][b][:4]
            x = (col + x) / grid_w  # center position, unit: image width
            y = (row + y) / grid_h  # center position, unit: image height
            w = anchors[2 * b + 0] * np.exp(w) / net_w  # unit: image width
            h = anchors[2 * b + 1] * np.exp(h) / net_h  # unit: image height
            # last elements are class probabilities
            classes = netout[row][col][b][5:]
            box = BoundBox(x - w / 2, y - h / 2, x + w / 2, y + h / 2, objectness, classes)
            boxes.append(box)
    return boxes
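To make the box geometry in decode_netout concrete, the following sketch decodes a single prediction by hand. It is an illustration under stated assumptions rather than part of the thesis code: `decode_box` is a hypothetical helper, the raw outputs are made-up values, and the sigmoid is applied inside the helper (the listing applies it to the whole array beforehand).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, col, row, grid_w, grid_h,
               anchor_w, anchor_h, net_w, net_h):
    """Return (xmin, ymin, xmax, ymax) as fractions of the image size."""
    cx = (col + sigmoid(tx)) / grid_w        # box center, fraction of image width
    cy = (row + sigmoid(ty)) / grid_h        # box center, fraction of image height
    w = anchor_w * math.exp(tw) / net_w      # box width, fraction of image width
    h = anchor_h * math.exp(th) / net_h      # box height, fraction of image height
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


# Made-up raw outputs: zeros put the center in the middle of the cell
# and leave the anchor size unchanged (exp(0) == 1).
box = decode_box(0.0, 0.0, 0.0, 0.0, col=6, row=6, grid_w=13, grid_h=13,
                 anchor_w=116, anchor_h=90, net_w=416, net_h=416)
print([round(v, 3) for v in box])
```

With the center cell of a 13 x 13 grid and zero offsets, the box is centered on the image and has exactly the anchor's size relative to the 416 x 416 network input.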

def correct_yolo_boxes(boxes, image_h, image_w, net_h, net_w):
    new_w, new_h = net_w, net_h
    for i in range(len(boxes)):
        x_offset, x_scale = (net_w - new_w) / 2. / net_w, float(new_w) / net_w
        y_offset, y_scale = (net_h - new_h) / 2. / net_h, float(new_h) / net_h
        boxes[i].xmin = int((boxes[i].xmin - x_offset) / x_scale * image_w)
        boxes[i].xmax = int((boxes[i].xmax - x_offset) / x_scale * image_w)
        boxes[i].ymin = int((boxes[i].ymin - y_offset) / y_scale * image_h)
        boxes[i].ymax = int((boxes[i].ymax - y_offset) / y_scale * image_h)

def _interval_overlap(interval_a, interval_b):
    x1, x2 = interval_a
    x3, x4 = interval_b
    if x3 < x1:
        if x4 < x1:
            return 0
        else:
            return min(x2, x4) - x1
    else:
        if x2 < x3:
            return 0
        else:
            return min(x2, x4) - x3

def bbox_iou(box1, box2):
    intersect_w = _interval_overlap([box1.xmin, box1.xmax], [box2.xmin, box2.xmax])
    intersect_h = _interval_overlap([box1.ymin, box1.ymax], [box2.ymin, box2.ymax])
    intersect = intersect_w * intersect_h
    w1, h1 = box1.xmax - box1.xmin, box1.ymax - box1.ymin
    w2, h2 = box2.xmax - box2.xmin, box2.ymax - box2.ymin
    union = w1 * h1 + w2 * h2 - intersect
    return float(intersect) / union

def do_nms(boxes, nms_thresh):
    if len(boxes) > 0:
        nb_class = len(boxes[0].classes)
    else:
        return
    for c in range(nb_class):
        sorted_indices = np.argsort([-box.classes[c] for box in boxes])
        for i in range(len(sorted_indices)):
            index_i = sorted_indices[i]
            if boxes[index_i].classes[c] == 0:
                continue
            for j in range(i + 1, len(sorted_indices)):
                index_j = sorted_indices[j]
                if bbox_iou(boxes[index_i], boxes[index_j]) >= nms_thresh:
                    boxes[index_j].classes[c] = 0
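The effect of bbox_iou and do_nms can be checked on a toy input. The sketch below is a simplified, single-class version written for illustration: boxes are plain `(xmin, ymin, xmax, ymax)` tuples, the overlap uses a max/min formulation instead of `_interval_overlap`, and the function returns the indices that survive instead of zeroing class scores. The names `iou` and `nms` and the sample boxes are assumptions introduced here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thresh):
    """Return indices of the boxes kept for one class, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # keep a box only if it does not overlap too much with a better kept box
        if all(iou(boxes[i], boxes[k]) < thresh for k in kept):
            kept.append(i)
    return kept


# Made-up example: the first two boxes overlap heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]

print(round(iou(boxes[0], boxes[1]), 3))
print(nms(boxes, scores, thresh=0.5))
```

The first two boxes share an 81-unit intersection out of a 119-unit union, which is above the 0.5 threshold, so only the higher-scoring one is kept alongside the distant third box.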

# load and prepare an image
def load_image_pixels(filename, shape):
    # load the image to get its shape
    image = load_img(filename)
    width, height = image.size
    # load the image with the required size
    image = load_img(filename, target_size=shape)
    # convert to numpy array
    image = img_to_array(image)
    # scale pixel values to [0, 1]
    image = image.astype('float32')
    image /= 255.0
    # add a dimension so that we have one sample
    image = expand_dims(image, 0)
    # return the prepared pixels and the original size, as expected by detect_object
    return image, width, height

# get all of the results above a threshold
def get_boxes(boxes, labels, thresh):
    v_boxes, v_labels, v_scores = list(), list(), list()
    # enumerate all boxes
    for box in boxes:
        # enumerate all possible labels
        for i in range(len(labels)):
            # check if the threshold for this label is high enough
            if box.classes[i] > thresh:
                v_boxes.append(box)
                v_labels.append(labels[i])
                v_scores.append(box.classes[i] * 100)
                # don't break, many labels may trigger for one box
    return v_boxes, v_labels, v_scores

def draw_boxes(filename, v_boxes, v_labels, v_scores):
    # load the image
    data = pyplot.imread(filename)
    # plot the image
    pyplot.imshow(data)
    # get the context for drawing boxes
    ax = pyplot.gca()
    # plot each box
    for i in range(len(v_boxes)):
        box = v_boxes[i]
        # get coordinates
        y1, x1, y2, x2 = box.ymin, box.xmin, box.ymax, box.xmax
        # calculate width and height of the box
        width, height = x2 - x1, y2 - y1
        # create the shape
        rect = Rectangle((x1, y1), width, height, fill=False, color='white')
        # draw the box
        ax.add_patch(rect)
        # draw text and score in top left corner
        label = "%s (%.3f)" % (v_labels[i], v_scores[i])
        pyplot.text(x1, y1, label, bbox=dict(facecolor='green', alpha=0.8))
    # show the plot
    # pyplot.figure(figsize=(12,9))
    pyplot.axis('off')
    pyplot.savefig(filename[:-4] + '_detected' + filename[-4:], pad_inches=0,
                   bbox_inches='tight', transparent=True)
    # pyplot.show()
    # drop the 7-character 'static/' prefix from the returned file name
    return filename[7:-4] + '_detected' + filename[-4:]

def init_model():
    model = load_model(DATA_PATH + '/model.h5')
    # Fixed bug "<tensor> is not an element of this graph." when loading model
    model._make_predict_function()
    print("Object detection model loaded")
    return model

def detect_object(model, image_name):
    # view yolov3 model
    # model.summary()
    # print(len(model.layers))
    # print(model.layers[250].output)
    # new_model = extract_model(model)
    # define the expected input shape for the model
    input_w, input_h = 416, 416
    # define our new photo
    image_path = IMG_SAVED_PATH + image_name
    # load and prepare image
    image, image_w, image_h = load_image_pixels(image_path, (input_w, input_h))
    # make prediction
    yhat = model.predict(image)
    # summarize the shape of the list of arrays
    print([a.shape for a in yhat])
    # feature = new_model.predict(image)
    # print("feature", feature.shape)
    # define the anchors
    anchors = [[116, 90, 156, 198, 373, 326], [30, 61, 62, 45, 59, 119], [10, 13, 16, 30, 33, 23]]
    # define the probability threshold for detected objects
    class_threshold = 0.6
    boxes = list()
    for i in range(len(yhat)):
        # decode the output of the network
        boxes += decode_netout(yhat[i][0], anchors[i], class_threshold, input_h, input_w)
    # correct the sizes of the bounding boxes for the shape of the image
    correct_yolo_boxes(boxes, image_h, image_w, input_h, input_w)
    # suppress non-maximal boxes
    do_nms(boxes, 0.5)
    # define the labels
    labels = ["person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train", "truck", "boat",
              "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
              "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
              "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard",
              "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
              "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana",
              "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake",
              "chair", "sofa", "pottedplant", "bed", "diningtable", "toilet", "tvmonitor", "laptop",
              "mouse", "remote", "keyboard", "cell phone", "microwave", "oven", "toaster",
              "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier",
              "toothbrush"]
    # get the details of the detected objects
    v_boxes, v_labels, v_scores = get_boxes(boxes, labels, class_threshold)
    # summarize what we found
    for i in range(len(v_boxes)):
        print(v_labels[i], v_scores[i])
    # draw what we found and save to file
    detected_image = draw_boxes(image_path, v_boxes, v_labels, v_scores)
    return detected_image

def extract_model(model):
    # re-structure the model
    # model1 = load_model(DATA_PATH + '/model.h5')
    # model1.layers.pop()
    new_model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # model._make_predict_function()
    # new_model._make_predict_function()
    # summarize
    # print(new_model.summary())
    print(new_model.layers[-1].get_config())
    return new_model

• Image captioning

Image captioning is a task that involves both computer vision and natural language processing. It lets us describe what is happening in an image with text in several different languages. The implementation proceeds as follows:

import pickle
import numpy as np
from keras.preprocessing import sequence, image
from keras.models import load_model, Model
from keras.applications.inception_v3 import InceptionV3

DATA_PATH = 'data'
IMAGE_PATH = 'static/'

def preprocess_input(x):
    x /= 255.
    x -= 0.5
    x *= 2.
    return x

def preprocess(image_path):
    img = image.load_img(image_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return x
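The three in-place operations in preprocess_input map pixel values from [0, 255] to [-1, 1]. The sketch below applies the same arithmetic to single values; `scale_pixel` is a hypothetical helper introduced only for illustration.

```python
def scale_pixel(value):
    """Apply the same steps as preprocess_input to one value: /255, -0.5, *2."""
    return (value / 255.0 - 0.5) * 2.0


for v in (0, 127.5, 255):
    print(v, "->", scale_pixel(v))
```

Black (0) maps to -1, mid-gray (127.5) to 0, and white (255) to 1, which centers the input around zero before it is passed to the image encoder.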

def get_caption_model():
    model = load_model(DATA_PATH + '/imageCaption20.h5')
    print("Image captioning model loaded")
    # model.summary()
    return model

def get_encode_img_model():
    model = InceptionV3(weights=DATA_PATH + '/inception_v3_weights_tf_dim_ordering_tf_kernels.h5')
    new_input = model.input
    new_output = model.layers[-2].output
    model_new = Model(new_input, new_output)
    print('Encode image model loaded')
    return model_new

def encode(model, image):
    image = preprocess(image)
    temp_enc = model.predict(image)
    temp_enc = np.reshape(temp_enc, temp_enc.shape[1])
    return temp_enc

def processing():
    vocab = pickle.load(open(DATA_PATH + '/vocab.p', 'rb'))
    print(len(vocab))
    word_idx = {val: index for index, val in enumerate(vocab)}
    idx_word = {index: val for index, val in enumerate(vocab)}
    return word_idx, idx_word

def predict_captions(encode_img_model, image_caption_model, image_name):
    start_word = ["<start>"]
    max_length = 40
    word_idx, idx_word = processing()
    encode_img = encode(encode_img_model, IMAGE_PATH + image_name)
    while 1:
        now_caps = [word_idx[i] for i in start_word]
        now_caps = sequence.pad_sequences([now_caps], maxlen=max_length, padding='post')
        e = encode_img
        preds = image_caption_model.predict([np.array([e]), np.array(now_caps)])
        word_pred = idx_word[np.argmax(preds[0])]
        start_word.append(word_pred)
        # keep predicting the next word until the predicted word is <end>
        # or the caption length exceeds max_length (40)
        if word_pred == "<end>" or len(start_word) > max_length:
            break
    return ' '.join(start_word[1:-1])
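The loop in predict_captions is a greedy decoder: at each step it takes the single most likely next word and feeds the growing sentence back in. The sketch below isolates that control flow. It is an illustration only: `greedy_decode`, `toy_next_word`, and `SCRIPT` are hypothetical names, and the toy function simply replays a fixed sentence in place of the trained model's argmax prediction.

```python
def greedy_decode(next_word, max_length=40):
    """Append the predicted next word until <end> appears or max_length is exceeded."""
    words = ["<start>"]
    while True:
        word = next_word(words)
        words.append(word)
        if word == "<end>" or len(words) > max_length:
            break
    # Drop the <start> marker and the last appended token, as the listing does.
    return " ".join(words[1:-1])


# Hypothetical stand-in for the trained model: replays a fixed sentence.
SCRIPT = ["a", "dog", "runs", "<end>"]


def toy_next_word(words):
    """Return the scripted word for the current position (clamped to the last entry)."""
    return SCRIPT[min(len(words) - 1, len(SCRIPT) - 1)]


print(greedy_decode(toy_next_word))
```

Note that the final slice removes the last token whether or not it is `<end>`, so a caption cut off by the length limit loses its last predicted word.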

3.2.2 Evaluating model accuracy

# Plot training & validation accuracy values
# history2 is assumed to be the History object returned by the training call
import matplotlib.pyplot as plt

plt.plot(history2.history['acc'])
plt.plot(history2.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='lower right')
plt.show()

Evaluating model loss.

# Plot training & validation loss values
plt.plot(history2.history['loss'])
plt.plot(history2.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper right')
plt.show()

3.2.3 Experimental results

Model interface.

Figure 3.2 Model interface

• Select the image to caption using Browse.

• Select the output language: Vietnamese, English, Spanish, or Japanese.

• Click Process to run.

Results:

When Vietnamese is selected, the result is as follows:

Figure 3.3 Result with Vietnamese selected

The image on the left is the original image. The image on the right shows the detected objects. The caption is delivered as speech and as the text line “Hai cầu thủ bóng đá đang chơi bóng đá trên sân” ("Two soccer players are playing soccer on the field").

4 CONCLUSION AND FUTURE WORK

Conclusion

Motivated by the idea of applying artificial intelligence to everyday needs, supporting people in simple tasks and contributing to the fourth industrial revolution (Industry 4.0), and with the goal of improving deep learning network models to build an automatic image captioning system, the author has completed the thesis with the following results:

- Gave an overview of deep learning models and presented in detail neural network models, the CNN, RNN, and LSTM models, and the YOLO model.

- Built a program that automatically generates image captions as text and speech in several languages.

- Once captions are produced successfully, the system can help visually impaired people move around more easily.

However, because of limited time and knowledge, the thesis still has shortcomings that the author needs to continue studying:

- The sample dataset is still small; larger data sources need to be added.

- The program has so far been implemented on deep learning network architectures only for study and research purposes; applying it in practice requires a larger dataset and more research time to complete the system.

Future work

Artificial neural networks have many practical applications, so this topic has many directions for future development toward a more comprehensive system that exploits more information. Although the thesis met its initial objectives and successfully built an automatic image captioning system, it still has limitations, such as:

- More training data is needed so that the deep learning model is more reliable and works more effectively.

- Study real-world needs in order to improve the program, and re-implement the studied deep learning network architectures so that they work better with large databases.

