
8.1 Presenting the idea

Today's applications all aim to be as convenient as possible for their users. In particular, they use user behavior to predict trends, suggest searches, and recommend related products in order to increase sales and customer satisfaction. Designing a big data system that supports a behavior-based machine learning recommender is therefore a promising and necessary idea.

To realize this idea, we need an architecture model for the recommender system.

The model is as follows:


Figure 42: Architecture model of the Recommender System

The model consists of three main parts:

1. Apache Spark: the output of the data in the Data Lake (described in Chapter 4), and also the data input for the whole recommender system.

2. Filtering: comprises two approaches to processing the data, content-based filtering and collaborative filtering.

+ Content-based filtering: from a product's descriptive information, we can represent the product as an attribute vector, then use these vectors to learn a model for each user.

+ Collaborative filtering: works from a model of users' past behavior, such as transaction history, which products they showed interest in, which products they searched for, and so on. This model can exploit information that lies outside the scope of the item attributes.

3. Web App: a visual, user-friendly interface, backed by web APIs that return JSON data which has already been processed so that it can be served to general users.
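The content-based idea described above (products as attribute vectors, a per-user model learned from those vectors) can be sketched in a few lines. This is only an illustration under stated assumptions: the catalog, the three attribute columns, and the choice of "average of the vectors the user interacted with" as the user model are all hypothetical and are not taken from the thesis.

```python
import math

# Hypothetical catalog: each product is an attribute vector
# (columns: is_shoe, is_clothing, is_pet_product).
ITEM_VECTORS = {
    "bellies": [1.0, 0.0, 0.0],
    "canvas_shoes": [1.0, 0.0, 0.0],
    "cycling_shorts": [0.0, 1.0, 0.0],
    "dog_shampoo": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def user_profile(history):
    """Assumed user model: the mean attribute vector of the items in the history."""
    dims = len(next(iter(ITEM_VECTORS.values())))
    return [sum(ITEM_VECTORS[item][d] for item in history) / len(history) for d in range(dims)]


def recommend_by_content(history, n):
    """Rank items the user has not seen by similarity to the user profile."""
    profile = user_profile(history)
    candidates = [item for item in ITEM_VECTORS if item not in history]
    ranked = sorted(candidates, key=lambda item: cosine(profile, ITEM_VECTORS[item]), reverse=True)
    return ranked[:n]
```

Calling `recommend_by_content(["bellies"], 1)` ranks the other shoe product first, because it shares the attribute the user's history is built from. Collaborative filtering, by contrast, would ignore the attribute columns entirely and rely only on which users interacted with which items.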


8.2 Implementing the idea

8.2.1 Data mining part

Prepare the libraries as shown in the figure below by creating a requirements.txt file.

Then run the following command in a terminal to install the required libraries:

pip install -r requirements.txt

Create the file collaborativeFiltering.py to recommend products to users (using the Collaborative Filtering approach):

import numpy as np
from pathlib import Path
import pandas as pd


class ISGD:
    def __init__(self, behaviors=None, k=50, l2_reg=0.01, learn_rate=0.05, n_item=0, n_user=0):
        self.k = k                    # number of latent factors
        self.l2_reg = l2_reg          # L2 regularization weight
        self.learn_rate = learn_rate  # SGD step size

        if n_item == 0 and n_user == 0:
            # No saved state: build the model from the behavior log,
            # whose rows are (id, user, item).
            self.known_users = np.array([])
            self.known_items = np.array([])  # needed by update(); line not legible in the source
            ids = set()
            users = set()
            items = set()
            for id, u, i in behaviors:
                ids.add(id)
                users.add(u)
                items.add(i)
            self.ids = list(ids)
            self.users = list(users)
            self.items = list(items)
            n_user = len(users)
            n_item = len(items)
            print("n_user {}".format(n_user))
            print("n_item {}".format(n_item))
            self.n_user = n_user
            self.n_item = n_item
            # A holds the user factors, B the item factors.
            self.A = np.random.normal(0., 0.1, (n_user, self.k))
            self.B = np.random.normal(0., 0.1, (n_item, self.k))
            self.setHistoryMat(behaviors)
        else:
            # Saved state exists: reload the factors and lookup tables from disk.
            self.A = np.load('data_ISGD/data_A.npy')
            self.B = np.load('data_ISGD/data_B.npy')
            self.known_items = np.load('data_ISGD/data_known_items.npy')
            self.known_users = np.load('data_ISGD/data_known_users.npy')
            self.n_item = n_item
            self.n_user = n_user
            self.ids = list(np.load('data_ISGD/data_ids.npy'))
            self.users = list(np.load('data_ISGD/data_users.npy'))
            self.items = list(np.load('data_ISGD/data_items.npy'))

    def setHistoryMat(self, behaviors):
        # Replay the whole behavior log once, then persist the trained state.
        n_behaviors = behaviors.shape[0]
        for ri in range(n_behaviors):
            id, u, i = behaviors[ri]
            u_index = self.users.index(u)
            i_index = self.items.index(i)
            self.update(u_index, i_index, isFromSetHistoriMat=True)
        np.save(
            "data_ISGD/data_source.npy",
            np.array([
                self.k,
                self.l2_reg,
                self.learn_rate,
                self.n_user,
                self.n_item,
            ]))
        np.save("data_ISGD/data_ids.npy", self.ids)
        # data_A.npy is read back in __init__; its save call is not legible in the source.
        np.save("data_ISGD/data_A.npy", self.A)
        np.save("data_ISGD/data_B.npy", self.B)
        np.save("data_ISGD/data_known_users.npy", self.known_users)
        np.save("data_ISGD/data_known_items.npy", self.known_items)
        np.save("data_ISGD/data_users.npy", self.users)
        np.save("data_ISGD/data_items.npy", self.items)

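    def recommendItem(self, user_id, N):
        # Translate the external user id to a row index, then map the
        # recommended item indices back to external item ids.
        u_index = self.users.index(user_id)
        recos = self.recommend(u_index, N)
        result = []
        for i_index in recos:
            result.append(self.items[i_index])
        return result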

    def updateId(self, id, user_id, item_id):
        # The body of this check is not legible in the source; skipping events
        # that were already processed and recording new ones is assumed here.
        if id in self.ids:
            return
        self.ids.append(id)

        isAddUser = False
        if user_id not in self.users:
            # New user: add one row of random factors to A.
            self.users.append(user_id)
            self.A = np.concatenate((self.A, np.random.normal(0., 0.1, (1, self.k))))
            self.n_user = self.n_user + 1
            isAddUser = True

        isAddItem = False
        if item_id not in self.items:
            # New item: add one row of random factors to B.
            self.items.append(item_id)
            print(len(self.items))
            self.B = np.concatenate((self.B, np.random.normal(0., 0.1, (1, self.k))))
            self.n_item = self.n_item + 1
            isAddItem = True

        u_index = self.users.index(user_id)
        i_index = self.items.index(item_id)
        self.update(u_index, i_index)

        np.save("data_ISGD/data_ids.npy", self.ids)
        if isAddUser:
            np.save("data_ISGD/data_A.npy", self.A)
            np.save("data_ISGD/data_users.npy", self.users)
        if isAddItem:
            np.save("data_ISGD/data_B.npy", self.B)
            np.save("data_ISGD/data_items.npy", self.items)

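    def update(self, u_index, i_index, isFromSetHistoriMat=False):
        print("update: {} {}".format(u_index, i_index))
        if u_index not in self.known_users:
            self.known_users = np.append(self.known_users, u_index)
        u_vec = self.A[u_index]

        if i_index not in self.known_items:
            self.known_items = np.append(self.known_items, i_index)
        i_vec = self.B[i_index]

        # Positive-only feedback: the target rating of an observed event is 1.
        err = 1. - np.inner(u_vec, i_vec)
        self.A[u_index] = u_vec + self.learn_rate * (err * i_vec - self.l2_reg * u_vec)
        # Symmetric item-factor step; this line is not legible in the source.
        self.B[i_index] = i_vec + self.learn_rate * (err * u_vec - self.l2_reg * i_vec)

        if isFromSetHistoriMat != True:
            # Persist after every online update (setHistoryMat saves once at the end instead).
            np.save("data_ISGD/data_A.npy", self.A)
            np.save("data_ISGD/data_B.npy", self.B)
            np.save("data_ISGD/data_known_users.npy", self.known_users)
            np.save("data_ISGD/data_known_items.npy", self.known_items)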

    def recommend(self, u_index, N):
        if u_index not in self.known_users:
            return []  # raise ValueError("Error: the user is not known.")

        recos = []
        # Score every item by how far its predicted rating is from 1; smaller is better.
        scores = np.abs(1. - np.dot(np.array([self.A[u_index]]), self.B.T)).reshape(self.B.shape[0])

        cnt = 0
        for i_index in np.argsort(scores):
            recos.append(i_index)
            cnt += 1
            if cnt == N:
                break

        print('Recommend item(s):', recos, 'for user index', u_index)
        return recos
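To make the update rule in the class above easier to follow, here is a minimal, dependency-free sketch of a single incremental SGD step, replayed many times on one (user, item) event. It assumes the same target of 1 for an observed interaction and updates both factor vectors; the seed, the factor count `k = 4`, and the number of replays are hypothetical values chosen only for the demonstration.

```python
import random


def isgd_step(u_vec, i_vec, learn_rate=0.05, l2_reg=0.01):
    """One incremental SGD step for a single positive (user, item) event.

    The target rating is 1 (the interaction happened); both factor
    vectors are moved so that the error 1 - <u_vec, i_vec> shrinks.
    """
    err = 1.0 - sum(u * i for u, i in zip(u_vec, i_vec))
    new_u = [u + learn_rate * (err * i - l2_reg * u) for u, i in zip(u_vec, i_vec)]
    new_i = [i + learn_rate * (err * u - l2_reg * i) for u, i in zip(u_vec, i_vec)]
    return new_u, new_i, err


rng = random.Random(0)  # fixed seed so the run is repeatable
k = 4                   # hypothetical number of latent factors
user = [rng.gauss(0.0, 0.1) for _ in range(k)]
item = [rng.gauss(0.0, 0.1) for _ in range(k)]

errors = []
for _ in range(200):  # replay the same event many times
    user, item, err = isgd_step(user, item)
    errors.append(abs(err))

print("first error:", round(errors[0], 3), "last error:", round(errors[-1], 3))
```

The error starts close to 1, because the random initial factors have a near-zero inner product, and settles at a small positive value that the L2 term keeps it from reaching zero. In the class above the same step is applied once per incoming event rather than in a loop.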

Create the file contentBase.py to recommend similar products for a given product (using the content-based filtering approach):

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import findspark
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

findspark.init()
conf = SparkConf().setAppName("dacn").setMaster("local")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)

# Parquet files carry their own schema; the header= option of the original
# call (its value is not legible in the source) is omitted here.
spark_df = sqlContext.read.parquet("hdfs://localhost:9000/topics/dbserver1.inventory.product")

df = spark_df.toPandas()
# Drop duplicated index labels and renumber rows so that they line up
# with the rows of the similarity matrix.
df = df[~df.index.duplicated(keep='last')].reset_index(drop=True)
print(df.columns)

df['description'] = df['description'].fillna('')
# The vectorizer construction is not legible in the source; default settings are assumed.
tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(df['description'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# For every product, keep the ids of the 98 most similar other products.
results = {}
for idx, row in df.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], df['id'][i]) for i in similar_indices]
    results[row['id']] = similar_items[1:]  # skip the first entry, the product itself
print('done!')

def item(id):
    # Short display name of a product: the part of product_name before ' - '.
    return df.loc[df['id'] == id]['product_name'].tolist()[0].split(' - ')[0]


def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    result = []
    recs = results[item_id][:num]
    for rec in recs:
        result.append(rec[1])  # rec is (similarity score, product id)
    return result
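The script above relies on scikit-learn for the TF-IDF weighting and on `linear_kernel` for the similarity matrix. The following dependency-free sketch shows the same idea on three hypothetical product descriptions, so the ranking logic can be inspected without Spark or scikit-learn. It uses a smoothed IDF and L2 normalisation similar in spirit to the `TfidfVectorizer` defaults, but it is not a re-implementation of that class; the descriptions and function names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical product descriptions keyed by product id.
DESCRIPTIONS = {
    1: "women cycling shorts solid cotton",
    2: "women cycling shorts printed",
    3: "dog shampoo all purpose",
}


def tfidf_vectors(docs):
    """Return {doc_id: {term: weight}} with L2-normalised TF-IDF weights."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        # Smoothed IDF, log((1 + n) / (1 + df)) + 1, so no term gets weight zero.
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
        # An empty description has norm 0; fall back to 1.0 to avoid dividing by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors[doc_id] = {t: w / norm for t, w in weights.items()}
    return vectors


def similarity(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def similar_products(item_id, vectors, num):
    """Return the ids of the `num` products most similar to `item_id`."""
    scores = [(similarity(vectors[item_id], vec), other)
              for other, vec in vectors.items() if other != item_id]
    scores.sort(reverse=True)
    return [other for _, other in scores[:num]]


VECTORS = tfidf_vectors(DESCRIPTIONS)
```

Because the vectors are normalised to unit length, a plain dot product already is the cosine similarity, which is the reason a linear kernel is sufficient on TF-IDF output. For the sample data, `similar_products(1, VECTORS, 1)` returns the other cycling shorts.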

Create an app.py file with the content below to set up the API for the application.

Load the data from HDFS through Spark:

# Get product and user from HDFS

findspark.init()
conf = SparkConf().setAppName("dacn").setMaster("local")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)

# Parquet files carry their own schema; the header= option of the original
# calls (its value is not legible in the source) is omitted here.
spark_df = sqlContext.read.parquet("hdfs://localhost:9000/topics/dbserver1.inventory.product")
spark_user_df = sqlContext.read.parquet("hdfs://localhost:9000/topics/dbserver1.inventory.customer")
spark_behavior_df = sqlContext.read.parquet("hdfs://localhost:9000/topics/dbserver1.inventory.behavior")

Preprocess the data before passing it on for processing:

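df = spark_df.toPandas()
df = df[~df.index.duplicated(keep='last')]
# Y_data is not defined in this section; its column 1 is used as the
# product ids and its column 0 as the user ids.
df = df[df['id'] < int(np.max(Y_data[:, 1])) + 1]
print(df.columns)

df_user = spark_user_df.toPandas()
df_user = df_user[~df_user.index.duplicated(keep='last')]
df_user = df_user[df_user['id'] < int(np.max(Y_data[:, 0])) + 1]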

For ISGD, we persist the previously trained state so that it can be reused on the next run, so it has to be loaded as follows:


isgd = None
try:
    # Reuse the saved hyper-parameters and sizes if a previous run stored them.
    data_source = np.load('data_ISGD/data_source.npy')
    k = int(data_source[0])
    l2_reg = data_source[1]
    learn_rate = data_source[2]
    n_user = int(data_source[3])
    n_item = int(data_source[4])
    print("try success")
    isgd = ISGD(behaviors=None, k=k, l2_reg=l2_reg, learn_rate=learn_rate,
                n_user=n_user, n_item=n_item)
except Exception:
    # No usable saved state: train from the behavior table in HDFS.
    print("try dont success")
    k = 50
    Path("data_ISGD").mkdir(parents=True, exist_ok=True)
    spark_behavior_df = sqlContext.read.parquet("hdfs://localhost:9000/topics/dbserver1.inventory.behavior")
    behavior = spark_behavior_df.toPandas()
    print(behavior.shape)
    behavior = pd.DataFrame({'id': behavior['id'],
                             'customerid': behavior['customerid'],
                             'productid': behavior['productid']})
    behavior = behavior.sort_values('id')
    print(behavior)
    behavior = behavior.to_numpy().astype(int)  # np.int was removed from recent NumPy releases
    # As in the original, the model is built with k=2 even though k is set to 50 above.
    isgd = ISGD(behaviors=behavior, k=2)
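The load-or-rebuild pattern used above can be isolated in a small sketch. It is not the thesis code: it stores the state as JSON instead of `.npy`, the file name `data_source.json` and the helper `load_or_build` are hypothetical, and `build` stands in for training ISGD from the behavior table. The one deliberate difference is that only a missing file triggers a rebuild, whereas a broad `except` would also hide unrelated errors.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(state_dir, build):
    """Return saved state if present, otherwise build it and save it.

    `build` is a zero-argument callable that produces the state dict; it
    stands in for training the model from the behavior table.
    """
    state_file = Path(state_dir) / "data_source.json"  # hypothetical file name
    try:
        with state_file.open() as fh:
            return json.load(fh), "loaded"
    except FileNotFoundError:
        # Only the "no saved state yet" case falls through to a rebuild;
        # other errors (for example a corrupt file) are allowed to surface.
        Path(state_dir).mkdir(parents=True, exist_ok=True)
        state = build()
        with state_file.open("w") as fh:
            json.dump(state, fh)
        return state, "built"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data_ISGD"
    first = load_or_build(target, lambda: {"k": 50, "l2_reg": 0.01, "learn_rate": 0.05})
    second = load_or_build(target, lambda: {"k": 2})
    print(first[1], second[1])
```

On the second call the builder is never invoked, so the state from the first run is what comes back. This mirrors how the application avoids retraining from HDFS once the `data_ISGD` directory has been populated.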

Add the routes with Flask so that the relevant parties can call the API to retrieve information:

@app.route('/searchProduct/<product_name>', methods=['GET'])
def searchProduct(product_name):
    # Case-insensitive match of the query against the product names.
    stringFilterDF = df[df['product_name'].str.contains(product_name, flags=re.IGNORECASE)]
    json = stringFilterDF.to_json(orient='table')
    return json

# [The decorator and function header of the next route are missing from the source text; its body follows.]
    print(listId)
    product = df[df['id'].isin(listId)]
    jsonData = product.to_json(orient='table')
    return jsonData

@app.route('/changeRecommendRealtime/<user_id>/<product_id>', methods=['GET'])
def changeRecommendRealtime(user_id, product_id):
    # Feed one new (user, product) interaction into the online model.
    isgd.updateId(id=-1, user_id=int(user_id), item_id=int(product_id))
    return {
        "result": "ok"
    }

@app.route('/getRecommend/<user_id>', methods=['POST'])
def getRecommend(user_id):
    if request.method == 'POST':
        # rs is the recommender object created elsewhere in app.py (not shown in this section).
        listId = rs.pred_for_user(user_id=int(user_id))
        product = df[df['id'].isin(listId)]
        json = product.to_json(orient='table')
        return json
    else:
        print("none")

@app.route('/getCustomer/<user_name>', methods=['GET'])
@cross_origin()
def getCustomerWithName(user_name):
    # Case-insensitive match of the query against the customers' last names.
    stringFilterDF = df_user[df_user['last_name'].str.contains(user_name, flags=re.IGNORECASE)]
    json = stringFilterDF.to_json(orient='table')
    return json
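The search routes above pass the URL segment straight to `str.contains`, which by default interprets it as a regular expression. A query containing characters such as `(` can therefore raise an error or match more than intended. The sketch below shows the same case-insensitive filter on plain Python data with the query escaped first; the sample rows and the helper names `search_products` and `to_json` are hypothetical.

```python
import json
import re

# Hypothetical rows standing in for the product table.
PRODUCTS = [
    {"id": 1, "product_name": "Alisha Solid Women's Cycling Shorts"},
    {"id": 2, "product_name": "AW Bellies"},
    {"id": 3, "product_name": "Sicons All Purpose Dog Shampoo (500 ml)"},
]


def search_products(query, rows=PRODUCTS):
    """Case-insensitive substring search that treats the query as literal text."""
    pattern = re.compile(re.escape(query), flags=re.IGNORECASE)
    return [row for row in rows if pattern.search(row["product_name"])]


def to_json(rows):
    """Serialise the matches the way an endpoint might return them."""
    return json.dumps({"data": rows})
```

With pandas, a comparable effect can be had by calling `str.contains(query, case=False, regex=False)`, which skips regular-expression parsing altogether.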

Initialize the consumer to read data from Kafka:

topic = KAFKA_TOPIC
conf = {'bootstrap.servers': "localhost:9092",
        'group.id': "default",
        'auto.offset.reset': 'earliest'}

# Create Consumer instance
# 'auto.offset.reset=earliest' to start reading from the beginning of the
# topic if no committed offsets exist
consumer = Consumer(conf)

# Subscribe to topic
consumer.subscribe([topic])


def setAppListener():
    # Process messages
    total_count = 0
    print("setAppListener")

    def listener():
        print("try AppListener")
        try:
            while True:
                msg = consumer.poll(1.0)
                if msg is None:
                    # No message within the timeout; poll again.
                    continue
                elif msg.error():
                    print('error: {}'.format(msg.error()))
                else:
                    record_key = msg.key()
                    record_value = msg.value()
                    k = avro_serde.key.deserialize(record_key)
                    v = avro_serde.value.deserialize(record_value)
                    print(v['id'])
                    print(v['customerid'])
                    print(v['productid'])
                    # Feed the new behavior event into the online model.
                    isgd.updateId(id=v['id'], user_id=v['customerid'], item_id=v['productid'])
        except KeyboardInterrupt:
            print("something wrong")
        finally:
            # Leave group and commit final offsets
            print("listener close")
            consumer.close()

    thread = threading.Thread(target=listener)
    thread.start()
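The control flow of the listener above (poll, skip empty polls, hand each event to the model, always close the consumer) can be tried out without a Kafka broker. The sketch below replaces the real consumer with a hypothetical `StubConsumer` backed by an in-memory queue and uses a `threading.Event` to stop the loop. That last point is a design suggestion rather than the thesis implementation: in Python, `KeyboardInterrupt` is delivered to the main thread, so a background listener thread needs another stop signal.

```python
import queue
import threading


class StubConsumer:
    """Hypothetical stand-in for a Kafka consumer, backed by an in-memory queue."""

    def __init__(self, messages):
        self._queue = queue.Queue()
        for message in messages:
            self._queue.put(message)
        self.closed = False

    def poll(self, timeout):
        """Return the next message, or None if nothing arrives within `timeout` seconds."""
        try:
            return self._queue.get(timeout=timeout)
        except queue.Empty:
            return None

    def close(self):
        self.closed = True


def run_listener(consumer, handle, stop_event, poll_timeout=0.05):
    """Poll until `stop_event` is set, passing each message to `handle`."""
    try:
        while not stop_event.is_set():
            message = consumer.poll(poll_timeout)
            if message is None:
                continue
            handle(message)
    finally:
        consumer.close()  # always release the consumer, even if `handle` raises


events = [
    {"id": 1, "customerid": 10, "productid": 100},
    {"id": 2, "customerid": 11, "productid": 101},
]
seen = []
stop = threading.Event()
consumer = StubConsumer(events)


def handle(message):
    # In the application this is where the model update would be called.
    seen.append((message["customerid"], message["productid"]))
    if len(seen) == len(events):
        stop.set()  # in this demo, stop once every queued event is handled


worker = threading.Thread(target=run_listener, args=(consumer, handle, stop))
worker.start()
worker.join(timeout=5)
```

In the real application the stop event would be set from the main thread at shutdown, for example in a handler that runs when the Flask process exits.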


Run the software by running app.py:

python app.py

8.2.2 UI part

Create a demo UI page that shows the product list. It has three main functions, including recommending products to the user through the API /getProductRelative/${product_id}


Figure 43: Product recommendations using Collaborative Filtering

And showing related products through the API

/getProductRelative/${product_id}


Figure 44: Product recommendations using Content-Based Filtering


8.3 Architecture model of the entire system

In summary, the whole project can be condensed into a single model as follows:

Figure 45: Architecture model of the entire system (the diagram includes PostgreSQL, Dremio, and the Recommender System)


CHAPTER 9 CONCLUSION AND FUTURE DEVELOPMENT

9.1 Results achieved

- Clarified the definition of Big Data and the domain concepts one needs to know.

- Gave a picture, from general to detailed, of the big data systems currently on the market.

- Analyzed the core components needed to build an on-premises big data system.

- Designed a big data system in detail (to a deployable level) comprising Apache Hadoop, Apache Spark, Apache Kafka, and Confluent.

- Clarified the definition of a recommender system and the algorithms and approaches that need to be studied.

- Analyzed the algorithms and approaches and identified the most suitable approach for the system.

- Designed a recommender system in detail (to a deployable level) comprising Content-based Filtering and Collaborative Filtering.

9.2 Remarks

9.2.1 Advantages

- Covers all aspects of the Big Data field.

- Gives an overview of, and a direction within, the Data Science field.

- Applies the Big Data model to a real-world problem in a recommender system.

- The data system consists of many modules that can easily be added or removed.

- Covers all aspects of recommender systems.

- Successfully built a recommender system with both approaches (Content-based Filtering and Collaborative Filtering).

- Developed real-time data ingestion from Kafka.

- Successfully built a recommender system that runs in real time.

9.2.2 Disadvantages

- There are no benchmark results comparing the data frameworks with one another.

- The application interface still needs better UI/UX to give users the best possible experience.

- The performance and speed of the software still need to be improved.

9.3 Future development

- Further develop the data ingestion from Kafka.

- Apply a fingerprint algorithm to trace customers over an unbounded data set.

- Build Neural Collaborative Filtering.