Course recommendation: Building a course recommendation system

DOCUMENT INFORMATION

Basic information

Format
Pages	31
Size	3.6 MB

Content

A course recommendation system built with Apache Nutch, Apache Solr, and NLP. The system uses Apache Nutch to crawl the data and indexes it in Apache Solr. Metadata from Apache Solr serves as the data, and NLP is applied to produce the recommendations.

Online Course Search
Professor Kim Kyoung-Yun

Content
• Introduction: Online Course Search Engine
• Crawling by Apache Nutch
• Indexing in Apache Solr
• Analysis data
• Functions
• UI design
• Results
• Discussion

Introduction
• Project domain: Online Course
• URLs: https://www.coursera.org/
• Concept map: [figure]

Introduction: Scenario flow chart
[Figure: scenario flow chart, including the case "The words were not trained"]

1. Introduction: Architecture
[Figure: architecture diagram of the course recommendation system. Recoverable labels:]
• Components: Recommendation, Data Collection, Crawler, Apache Solr, Document Index, Topic modeling, Topics, Query parser, Query processing (correct, split), Calculating semantic similarity score, Ranking
• csv file: Title, Offer by, Level, Description, Skill, Rating
• Input from the user: [Number (int or str)] + keywords, for example "six machine learning python" or "python computer vision"
• Result: Title, Score, URL, Description

1. Introduction: Architecture
[Figure: second architecture diagram, for papers. Recoverable labels:]
• Sources: https://www.researchgate.net/, https://arxiv.org/, https://www.mdpi.com/, https://link.springer.com/
• Components: Crawler, Data Collection, Document index, Extracting metadata, Topic modeling (used in filter condition), Topics, Query processing (correct, split), Calculating semantic similarity score, ranking
• Metadata connected to each paper: affiliation, journal, author, book, conference, publisher
• Input from the user: keywords, for example "Human balance estimation"
• Result (paper): Title, URL, Abstract, Author, conference/journal/book name, publisher, publish_date

Crawling by Apache Nutch
• The seed.txt file:
    https://www.coursera.org
• The regex-urlfilter file includes:
    +^https://www.coursera.org/
    +^https://in.coursera.org/

Open a Cygwin terminal and go to {nutch_home}.
Crawl:
    bin/crawl -i -s urls crawl
Dump to file:
• bin/nutch readdb crawl/crawldb -stats >stats.txt
• bin/nutch readdb crawl/crawldb/ -dump db
• bin/nutch readlinkdb crawl/linkdb/ -dump link
• bin/nutch readseg -dump crawl/segments/{segment_folder} crawl/segments/{segment_folder}_dump -nocontent -nofetch -noparse -noparsedata -noparsetext

The crawl directory contains:
1) CrawlDb: all the links known to Nutch
2) LinkDb: for each URL, the known links pointing to it (source URL and anchor text)
3) Segment: the list of URLs to be crawled or being crawled

Indexing in Solr
    bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segments_folder}/ -filter -normalize -deleteGone

Functions
• search_documents_by_keywords(keywords, num_docs, keywords_neg=None, return_documents=True, use_index=False, ef=None)
Semantic search of documents using keywords. The documents most semantically similar to the combination of the keywords are returned. If negative keywords are provided, the returned documents will be semantically dissimilar to those words. Too many keywords, or certain combinations of words, may give strange results. This method computes an average vector of all the keyword vectors (negative keywords are subtracted) and returns the documents closest to the resulting vector.
Parameters:
• keywords (list of str): positive keywords used to search for semantically similar documents
• keywords_neg (list of str, optional): negative keywords used to search for semantically dissimilar documents
• num_docs (int): number of documents to return
• return_documents (bool, optional, default True): determines whether the documents are returned. If they were not saved in the model, they are not returned either.
Return: documents, doc_scores, doc_ids

• correcting_word(keyword)
Jaccard distance, the complement of the Jaccard coefficient, measures the dissimilarity between two sample sets. We get the Jaccard distance by subtracting the Jaccard coefficient from 1. Equivalently, we can divide the difference between the sizes of the union and the intersection of the two sets by the size of the union.
For example: pyth -> python, machne -> machine

• get_recommendation(query)
-> Splitting query -> Correcting query -> Calculating score of query and documents -> Display result: "number" courses (title, url, score and description)
query: [number (str or int)] + keywords

Behavior:
• Shows "number" results if the number of results found is >= number; shows all results found if it is < number.
• If the query is misspelled, the system guesses the intended query and displays the results. For example: pyt learning -> python learning
• Input: number can be str or int (one or 1, two or 2)
• If the keywords were not trained in the model, the system cannot display a result.
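
The query format described above, an optional count written as a word or a digit followed by keywords, can be sketched as a small parser. This is only an illustration: `split_query`, `NUMBER_WORDS`, and `default_count` are hypothetical names, and the fallback count used when no number is given is an assumption, since the slides do not state it.

```python
# Hypothetical sketch of the "Splitting query" step: [number] + keywords.
# NUMBER_WORDS and default_count are assumptions, not taken from the slides.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def split_query(query, default_count=5):
    """Return (count, keywords) parsed from a free-text query."""
    tokens = query.lower().split()
    if not tokens:
        return default_count, []
    first = tokens[0]
    if first.isdigit():                 # count given as digits, e.g. "2"
        return int(first), tokens[1:]
    if first in NUMBER_WORDS:           # count given as a word, e.g. "six"
        return NUMBER_WORDS[first], tokens[1:]
    return default_count, tokens        # no leading count in the query


print(split_query("six machine learning python"))
# -> (6, ['machine', 'learning', 'python'])
print(split_query("python computer vision"))
# -> (5, ['python', 'computer', 'vision'])
```

Keeping the count separate from the keywords lets the later steps (correction, scoring) work on the keyword list alone.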
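
The Jaccard-distance correction described for correcting_word can be illustrated with a minimal sketch. The slides do not say which sets are compared, so this version assumes sets of character bigrams; `correct_word`, `char_ngrams`, and `VOCABULARY` are hypothetical names, and the vocabulary is a toy list rather than the project's real one.

```python
# Hypothetical sketch of spelling correction by Jaccard distance.
# Assumption: each word is represented as its set of character bigrams.
VOCABULARY = ["python", "machine", "learning", "vision", "computer"]


def char_ngrams(word, n=2):
    """Set of character n-grams of a word (the whole word if it is too short)."""
    word = word.lower()
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard_distance(a, b):
    """(|A ∪ B| - |A ∩ B|) / |A ∪ B|, which equals 1 - Jaccard coefficient."""
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / len(union)


def correct_word(keyword, vocabulary):
    """Return the vocabulary word with the smallest Jaccard distance."""
    grams = char_ngrams(keyword)
    return min(vocabulary, key=lambda w: jaccard_distance(grams, char_ngrams(w)))


print(correct_word("pyth", VOCABULARY))    # -> python
print(correct_word("machne", VOCABULARY))  # -> machine
```

For "pyth" and "python" the bigram sets share 3 of 5 distinct bigrams, so the distance is 1 - 3/5 = 0.4, while every other word in the toy vocabulary shares none and sits at distance 1.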
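
The keyword search described above (average the keyword vectors, subtract negative keywords, return the closest documents) can be sketched with toy data. This is not the library's implementation: the two-dimensional vectors, the names `combine_keywords` and `search_by_keywords`, the use of cosine similarity as the closeness measure, and dividing by the total keyword count are all assumptions made for illustration.

```python
import math

# Toy, hand-made embeddings (assumptions for illustration only).
WORD_VECTORS = {"python": [1.0, 0.0], "vision": [0.0, 1.0], "learning": [0.7, 0.7]}
DOC_VECTORS = {"course_a": [0.9, 0.1], "course_b": [0.1, 0.9], "course_c": [0.6, 0.6]}


def combine_keywords(word_vectors, keywords, keywords_neg=()):
    """Average the keyword vectors; negative keywords are subtracted."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in keywords:
        for i, x in enumerate(word_vectors[word]):  # KeyError if word is untrained
            total[i] += x
    for word in keywords_neg:
        for i, x in enumerate(word_vectors[word]):
            total[i] -= x
    count = len(keywords) + len(keywords_neg)
    return [x / count for x in total]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_by_keywords(doc_vectors, word_vectors, keywords, num_docs, keywords_neg=()):
    """Return up to num_docs (doc_id, score) pairs, best match first."""
    query = combine_keywords(word_vectors, keywords, keywords_neg)
    scores = {doc_id: cosine(query, vec) for doc_id, vec in doc_vectors.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:num_docs]  # fewer pairs if fewer documents exist


print(search_by_keywords(DOC_VECTORS, WORD_VECTORS, ["python", "vision"], 2))
```

Two behaviors from the slides fall out of this shape: asking for more results than exist simply returns all of them, and a keyword missing from the vector table raises `KeyError`, which mirrors the note that untrained keywords cannot produce a result.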

Posted: 15/03/2023, 12:47