A Study of Data Collection Methods in Social Network Mining

BANKING ACADEMY OF VIETNAM (HỌC VIỆN NGÂN HÀNG)
FACULTY OF MANAGEMENT INFORMATION SYSTEMS

UNDERGRADUATE GRADUATION THESIS

A STUDY OF DATA COLLECTION METHODS IN SOCIAL NETWORK MINING

Supervisor: Dr. Đinh Trọng Hiếu
Student: Đỗ Huyền Trang
Student ID: 16A4040195
Class: HTTTB
Cohort: K16
Program: Full-time undergraduate

Hanoi, May 2017

ACKNOWLEDGEMENTS

First of all, I would like to express my deep gratitude to Dr. Đinh Trọng Hiếu, who directly guided and advised me and gave me dedicated help throughout the time I worked on this thesis. Thanks to him, I was introduced to a great deal of knowledge that will be very useful in my future work.

I also sincerely thank my friends, whose help and contributions kept me motivated.

I would like to thank my internship host, the staff of Công ty TNHH giải pháp Hệ thống thông tin Kiến trúc doanh nghiệp, who supported me wholeheartedly and gave me the time and advice I needed to complete this thesis.

Because this was my first research on the topic, I had little experience, and time was limited, the thesis inevitably contains shortcomings. I look forward to comments and feedback from my teachers so that I can improve the work and draw useful lessons for its further development.

Thank you sincerely!
STUDENT
Đỗ Huyền Trang

DECLARATION

I declare that the results presented in this report are the product of my own research and study. Throughout the report, the material presented is either my own work or compiled from multiple sources. All references have a clear origin and are properly cited. I accept full responsibility for this declaration and any disciplinary action prescribed by the regulations.

STUDENT
Đỗ Huyền Trang

COMMENTS
(from the supervisor)

As the Internet becomes ever more widely used, social networks have become a communication channel with a strong influence on many people's lives. Collecting, analyzing and exploiting social network data to obtain information and knowledge that is useful for production and business is a problem that arises from the real needs of organizations and enterprises.

The student Đỗ Huyền Trang chose the thesis topic "A Study of Data Collection Methods in Social Network Mining", a meaningful topic that stems from current practical needs.

While working on the thesis, with a serious attitude, a desire to improve, eagerness to learn and initiative in study and research, the student grasped the basic architecture and understood the steps to carry out and the major problems to solve when analyzing social networks.

In the thesis report, the student studied techniques for collecting data from two popular social networks, Facebook and Twitter; designed a database and stored the collected data; and then visualized the data with Tableau.

The thesis is presented with a reasonable structure and form.

Conclusion: I recommend that Đỗ Huyền Trang be allowed to defend the thesis before the committee to receive the degree of Bachelor of Management Information Systems.

Hanoi, 2 June 2017
Supervisor
Đinh Trọng Hiếu

TABLE OF CONTENTS

INTRODUCTION
CHAPTER I: THEORETICAL FOUNDATIONS OF SOCIAL NETWORK MINING
1.1 Social networks: a promising data source for data analysis research
1.1.1 The concepts of social networks and social media
1.1.2 The influence of social networks on the real world
1.2 Theoretical foundations of social network mining
1.2.1 What is social network mining?
1.2.2 The social network mining process
1.2.3 Research methods in social network mining
1.2.4 Social network analysis techniques and practical applications
1.2.5 Challenges in social network mining
CHAPTER II: DATA COLLECTION METHODS FOR SOCIAL NETWORKS
2.1 Overview of social network data
2.1.1 Characteristics of social network data
2.1.2 Social network data and big data
2.2 Methods for collecting data on the Internet
2.2.1 Collecting data through an application programming interface (API)
2.2.2 Collecting data page by page with a web scraper
2.3 Methods for collecting data from Twitter
2.3.1 Overview of Twitter
2.3.2 Collecting Twitter data
2.4 Methods for collecting data from Facebook
2.4.1 Overview of Facebook
2.4.2 Collecting Facebook data
CHAPTER III: EXPERIMENTS IN COLLECTING SOCIAL NETWORK DATA AND EVALUATION OF RESULTS
3.1 Experimental approach
3.1.1 Using Python to build the social network data collection programs
3.1.2 Using SQLite to store the data
3.1.3 Visualizing the returned data with Tableau
3.2 Twitter data collection experiments
3.2.1 Collecting user data with the Twitter REST API
3.2.2 Collecting tweets with the Twitter Search API
3.2.3 Collecting tweets with the Twitter Streaming API
3.2.4 Overall evaluation of the Twitter data collection application
3.3 Facebook data collection experiments
3.4 Conclusions from the experiments
CONCLUSION
REFERENCES
APPENDICES

LIST OF ABBREVIATIONS

No.	Abbreviation	Full term	Note
1	BBS	Bulletin Board System	Message board system
2	API	Application Programming Interface
3	URL	Uniform Resource Locator
4	HTML	HyperText Markup Language
5	XHTML	Extensible HyperText Markup Language
6	HTTP	HyperText Transfer Protocol
7	DOM	Document Object Model
8	SMS	Short Message Services
9	REST	Representational State Transfer	A set of rules for building web service applications
10	DBMS	Database Management System
11	JSON	JavaScript Object Notation	A fast data interchange format
12	XML	eXtensible Markup Language

LIST OF TABLES

Table 2.1: Overview of the Twitter Streaming API
Table 3.1: Results of data collection through the REST API
Table 3.2: Results of data collection through the Search API
Table 3.3: Collecting posts with the Graph API

LIST OF FIGURES

Figure 1.1: Key figures on social network usage worldwide
Figure 1.2: The data mining process
Figure 1.3: The social network mining process
Figure 2.1: Communicating with a server through an API
Figure 2.2: Creating a Twitter application to obtain OAuth credentials and API access
Figure 2.3: Flow of a user request to the REST API
Figure 2.4: How the Streaming API works
Figure 2.5: Tweet content returned as JSON
Figure 2.6: Account data returned as JSON
Figure 2.7: The Graph API Explorer
Figure 2.8: How to obtain an access token
Figure 2.9: The social graph
Figure 2.10: Results returned by Facebook in JSON format
Figure 3.1: Data model for the data collected from Twitter
Figure 3.2: Quantitative analysis of profile data
Figure 3.3: Map of follower counts by geography
Figure 3.4: Word cloud of discussion trends on studying abroad, worldwide
Figure 3.5: Word cloud of discussion trends on studying abroad, in Vietnam
Figure 3.6: Tagging tweet records with sentiment labels
Figure 3.7: Media indicators derived from sentiment analysis of tweets
Figure 3.8: Media management dashboard for businesses
Figure 3.9: Data model for the data collected from Facebook

Reference edges:

Edge	Description
friends	The user's friends
likes	Pages the user has liked
post	All status updates that the user posted, was tagged in, or that appear on the user's timeline
family	The user's family relationships
movie	Movies the user likes
photos	Photos the user uploaded or was tagged in
books	Books listed on the user's profile

4.2 Post data

Field	Data type	Description
id	String	Identifier of the post
created_time	Datetime	Time the post was published
from	Profile	Information about the person who posted
link	String	Link attached to the post
message	String	Text content of the post
object_id	String	Identifier of the photo or video attached to the post
place	Place	Geographic location attached to the post
story	String	User actions that Facebook itself treats as updates
type	{link, status, photo, video, offer}	Type of post
updated_time	Datetime	Time the post was last edited

Reference edges:

Edge	Description
likes	People who liked the post
reactions	People who reacted to the post, for example: love, like, wow
comments	Comments on the post
attachments	Media attached to the post

APPENDIX 5: TWITTER API QUERY OPERATORS

Each operator is followed by its description.

watching now
    Contains both "watching" and "now". This is the default operator.
"happy hour"
    Contains the exact phrase "happy hour".
love OR hate
    Contains "love" or "hate" (or both).
beer -root
    Contains "beer" but not "root".
#haiku
    Contains the hashtag "haiku".
from:Twitterapi
    Tweets posted by @Twitterapi.
to:NASA
    Tweets replying to the NASA account.
@NASA
    Tweets that mention the NASA account.
superhero since:2015-12-21
    Contains "superhero" and was posted on or after 2015-12-21 (year-month-day).
puppy until:2015-12-21
    Contains "puppy" and was posted before 2015-12-21.
movie -scary :)
    Contains "movie" but not "scary", with a positive attitude.
flight :(
    Contains "flight", with a negative attitude.
traffic ?
    Contains "traffic" and asks a question.

APPENDIX 6: SOURCE CODE OF THE DATA COLLECTION PROGRAMS

6.1 Twitter data collection programs

The Twitter data collection programs that use the REST API and the Search API have the same structure; they differ only in the functions that collect and store the data.

Required libraries:
- Twython
- Sqlite3
- Configparser
- Requests

Source code for creating the database

import argparse
import os
import platform
import sqlite3
import sys
import time

import twython
from tqdm import tqdm

# checkRateLimit, UserTimelineIDs, ids and the API settings (AppKey, AppSecret,
# StatusesCount, ...) are defined elsewhere in the program and are not listed here.

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--id_Twitter', nargs='+',
                    help="the ID's of the Twitter pages you want to scrape")
args = parser.parse_args()
if len(sys.argv) == 1:
    args.id_Twitter = input("Enter the Twitter account name: ")
# nargs='+' yields a list; use the first account for the database file name
ScreenName = args.id_Twitter[0] if isinstance(args.id_Twitter, list) else args.id_Twitter

if platform.system() == 'Windows':
    db_location = os.path.normpath('TwitterDB_%s.db' % ScreenName)
else:
    db_location = r'TwitterDB_%s.db' % ScreenName

objectsCreate = {'User': 'CREATE TABLE IF NOT EXISTS UserTimeline ('
                         'user_id int, '
                         'user_name text, '
                         'screen_name text, '
                         'user_description text, '
                         'user_location text, '
                         'user_url text, '
                         'user_created_datetime text, '
                         'user_language text, '
                         'user_timezone text, '
                         'user_utc_offset real, '
                         'user_friends_count real, '
                         'user_followers_count real, '
                         'user_statuses_count real, '
                         'user_favourites_count real, '
                         'tweet_id int, '
                         'tweet_id_str text, '
                         'tweet_text text, '
                         'tweet_retweeted text, '
                         'tweet_retweet_count real, '
                         'tweet_favorite_count real, '
                         'tweet_user_mentions text, '
                         'tweet_hashtags text, '
                         'tweet_created_timestamp text, '
                         'PRIMARY KEY(tweet_id, user_id))'}

# Create the database
db_is_new = not os.path.exists(db_location)
with sqlite3.connect(db_location) as conn:
    if db_is_new:
        print("Creating database schema on " + db_location + " database \n")
        for t in objectsCreate.items():
            try:
                conn.executescript(t[1])
            except sqlite3.OperationalError as e:
                print(e)
                conn.rollback()
                sys.exit(1)
            else:
                conn.commit()
    else:
        print('Database already exists, bailing out ')

Source code for collecting profile data with the REST API

def getUserTimelineFeeds(StatusesCount, MaxTweetsAPI, ScreenName, IncludeRts,
                         ExcludeReplies, AppKey, AppSecret):
    Twitter = twython.Twython(AppKey, AppSecret, oauth_version=2)
    try:
        ACCESS_TOKEN = Twitter.obtain_access_token()
    except twython.TwythonAuthError as e:
        print(e)
        sys.exit(1)
    try:
        Twitter = twython.Twython(AppKey, access_token=ACCESS_TOKEN)
        print('Acquiring Twitter feed for user "{0}" '.format(ScreenName))
        params = {'count': StatusesCount, 'screen_name': ScreenName,
                  'include_rts': IncludeRts, 'exclude_replies': ExcludeReplies}
        AllTweets = []
        checkRateLimit(limittypecheck='usertimeline')
        NewTweets = Twitter.get_user_timeline(**params)
        if not NewTweets:
            print('No user timeline tweets found for "{0}" account, exiting now '.format(ScreenName))
            sys.exit()
        ProfileTotalTweets = NewTweets[0]['user']['statuses_count']
        TweetsToProcess = min(ProfileTotalTweets, MaxTweetsAPI)
        oldest = NewTweets[0]['id']
        progressbar = tqdm(total=TweetsToProcess, leave=1)
        # Page backwards: each request asks for tweets with id <= oldest
        while len(NewTweets) > 0:
            checkRateLimit(limittypecheck='usertimeline')
            NewTweets = Twitter.get_user_timeline(**params, max_id=oldest)
            AllTweets.extend(NewTweets)
            oldest = AllTweets[-1]['id'] - 1
            if len(NewTweets) != 0:
                progressbar.update(len(NewTweets))
        progressbar.close()
        # Skip tweets that are already stored
        AllTweets = [tweet for tweet in AllTweets if tweet['id'] not in UserTimelineIDs]
        for tweet in AllTweets:
            if 'retweeted_status' in tweet:
                retweeted_status = 'THIS IS A RETWEET'
            else:
                retweeted_status = ''
            entities_hashtags, entities_mentions = [], []
            for hashtag in tweet['entities']['hashtags']:
                if 'text' in hashtag:
                    entities_hashtags.append(hashtag['text'])
            for at in tweet['entities']['user_mentions']:
                if 'screen_name' in at:
                    entities_mentions.append(at['screen_name'])
            entities_mentions = ", ".join(entities_mentions)
            entities_hashtags = ", ".join(entities_hashtags)
            # The table created above is named UserTimeline
            conn.execute(
                "INSERT OR IGNORE INTO UserTimeline "
                "(user_id, user_name, screen_name, user_description, "
                "user_location, user_url, user_created_datetime, "
                "user_language, user_timezone, "
                "user_utc_offset, user_friends_count, user_followers_count, "
                "user_statuses_count, user_favourites_count, "
                "tweet_id, tweet_id_str, tweet_text, tweet_retweeted, "
                "tweet_retweet_count, tweet_favorite_count, "
                "tweet_user_mentions, tweet_hashtags, tweet_created_timestamp) "
                "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
                (
                    tweet['user']['id'],
                    tweet['user']['name'],
                    tweet['user']['screen_name'],
                    tweet['user']['description'],
                    tweet['user']['location'],
                    tweet['user']['url'],
                    time.strftime('%Y-%m-%d %H:%M:%S',
                                  time.strptime(tweet['user']['created_at'],
                                                '%a %b %d %H:%M:%S +0000 %Y')),
                    tweet['user']['lang'],
                    tweet['user']['time_zone'],
                    tweet['user']['utc_offset'],
                    tweet['user']['friends_count'],
                    tweet['user']['followers_count'],
                    tweet['user']['statuses_count'],
                    tweet['user']['favourites_count'],
                    tweet['id'],
                    tweet['id_str'],
                    str(tweet['text'].replace("\n", "")),
                    retweeted_status,
                    tweet['retweet_count'],
                    tweet['favorite_count'],
                    entities_mentions,
                    entities_hashtags,
                    time.strftime('%Y-%m-%d %H:%M:%S',
                                  time.strptime(tweet['created_at'],
                                                '%a %b %d %H:%M:%S +0000 %Y')),
                ))
        conn.commit()
    except Exception as e:
        print(e)
        sys.exit(1)

Source code for collecting tweets with the Search API

def getData(StatusesCount, MaxTweetsAPI, Keyword, Lang, Result_type,
            AppKey, AppSecret, max_id=None):
    Twitter = twython.Twython(AppKey, AppSecret, oauth_version=2)
    try:
        ACCESS_TOKEN = Twitter.obtain_access_token()
    except twython.TwythonAuthError as e:
        print(e)
        sys.exit(1)
    try:
        Twitter = twython.Twython(AppKey, access_token=ACCESS_TOKEN)
        print('Acquiring Twitter feed for keyword "{0}" '.format(Keyword))
        checkRateLimit(limittypecheck='usertimeline')
        NewTweets = Twitter.search(q=Keyword, count=StatusesCount, lang=Lang,
                                   result_type=Result_type, max_id=max_id)
        if NewTweets is None:
            print('No tweets found for "{0}" keyword, exiting now '.format(Keyword))
            sys.exit()
        return NewTweets
    except Exception as e:
        print(e)
        sys.exit(1)


def writeData(self, NewTweets):
    # Store every tweet of one Search API response in the UserTimeline table
    for tweet in NewTweets['statuses']:
        if 'retweeted_status' in tweet:
            retweeted_status = 'THIS IS A RETWEET'
        else:
            retweeted_status = ''
        entities_hashtags, entities_mentions = [], []
        for hashtag in tweet['entities']['hashtags']:
            if 'text' in hashtag:
                entities_hashtags.append(hashtag['text'])
        for at in tweet['entities']['user_mentions']:
            if 'screen_name' in at:
                entities_mentions.append(at['screen_name'])
        entities_mentions = ", ".join(entities_mentions)
        entities_hashtags = ", ".join(entities_hashtags)
        conn.execute(
            "INSERT OR IGNORE INTO UserTimeline "
            "(user_id, user_name, screen_name, user_description, "
            "user_location, user_url, user_created_datetime, "
            "user_language, user_timezone, "
            "user_utc_offset, user_friends_count, user_followers_count, "
            "user_statuses_count, user_favourites_count, "
            "tweet_id, tweet_id_str, tweet_text, tweet_retweeted, "
            "tweet_retweet_count, tweet_favorite_count, "
            "tweet_user_mentions, tweet_hashtags, tweet_created_timestamp) "
            "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
            (
                tweet['user']['id'],
                tweet['user']['name'],
                tweet['user']['screen_name'],
                tweet['user']['description'],
                tweet['user']['location'],
                tweet['user']['url'],
                time.strftime('%Y-%m-%d %H:%M:%S',
                              time.strptime(tweet['user']['created_at'],
                                            '%a %b %d %H:%M:%S +0000 %Y')),
                tweet['user']['lang'],
                tweet['user']['time_zone'],
                tweet['user']['utc_offset'],
                tweet['user']['friends_count'],
                tweet['user']['followers_count'],
                tweet['user']['statuses_count'],
                tweet['user']['favourites_count'],
                tweet['id'],
                tweet['id_str'],
                str(tweet['text'].replace("\n", "")),
                retweeted_status,
                tweet['retweet_count'],
                tweet['favorite_count'],
                entities_mentions,
                entities_hashtags,
                time.strftime('%Y-%m-%d %H:%M:%S',
                              time.strptime(tweet['created_at'],
                                            '%a %b %d %H:%M:%S +0000 %Y')),
            ))
    conn.commit()


class Scrape:
    def getTweetSearch(self):
        # ids holds the search keywords; it is defined elsewhere in the program
        for n, Keyword in enumerate(ids):
            print("\rprocessing id %s/%s" % (n + 1, len(ids)), end=' ')
            sys.stdout.flush()
            NewTweets = getData(StatusesCount, MaxTweetsAPI, Keyword, Lang,
                                Result_type, AppKey, AppSecret, max_id=None)
            if not NewTweets:
                continue
            if len(NewTweets['statuses']) == 0:
                print("THERE WERE NO STATUSES RETURNED MOVING TO NEXT ID")
                continue
            writeData(self, NewTweets)
            last_status = NewTweets['statuses'][-1]
            min_id = last_status['id']
            max_id = min_id - 1
            print('THIS IS THE min_id IN THE CURRENT SET OF TWEETS: ', max_id)
            if len(NewTweets['statuses']) > 1:
                print("THERE WAS AT LEAST 1 STATUS ON THE FIRST PAGE! NOW MOVING TO GRAB EARLIER TWEETS")
                count = 2  # page counter; page 1 was fetched above (start value assumed)
                while count < 101:
                    print(" XXXXXX STARTING PAGE", count)
                    NewTweets = getData(StatusesCount, MaxTweetsAPI, Keyword, Lang,
                                        Result_type, AppKey, AppSecret, max_id)
                    if not NewTweets:
                        break
                    elif not NewTweets['statuses']:
                        break
                    last_status = NewTweets['statuses'][-1]
                    min_id = last_status['id']
                    max_id = min_id - 1
                    print('THIS IS THE min_id IN THE CURRENT SET OF TWEETS: ', max_id)
                    writeData(self, NewTweets)
                    print(" XXXXXX FINISHED WITH PAGE", len(NewTweets['statuses']), count)
                    if not len(NewTweets['statuses']) > 0:
                        print(" > WE'VE REACHED THE LAST PAGE!!!! MOVING TO NEXT ID")
                        break
                    count += 1
                    if count > 100:
                        print("WE'RE AT PAGE 100!!!!!")
                        break

The Twitter data collection program that uses the Streaming API requires the following libraries:
- Tweepy
- Json
- Dataset

Source code of the Streaming API collection program

import json

import dataset
import tweepy  # tweepy.StreamListener requires Tweepy 3.x; it was removed in Tweepy 4.0
from sqlalchemy.exc import ProgrammingError

# CONNECTION_STRING, TABLE_NAME, TRACK_TERMS and the TWITTER_* credentials
# are settings defined elsewhere in the program.
db = dataset.connect(CONNECTION_STRING)
number = 0


class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        global number
        number += 1
        print(number)
        if status.retweeted:
            return
        print(status.text)
        if hasattr(status, 'retweeted_status'):
            retweeted = 'THIS IS A RETWEET'
        else:
            retweeted = ''
        # geo and coordinates are nested objects; store them as JSON text
        geo = status.geo
        coords = status.coordinates
        if geo is not None:
            geo = json.dumps(geo)
        if coords is not None:
            coords = json.dumps(coords)
        table = db[TABLE_NAME]
        try:
            table.insert(dict(
                user_id=status.user.id,
                user_name=status.user.name,
                screen_name=status.user.screen_name,
                user_description=status.user.description,
                user_location=status.user.location,
                user_url=status.user.url,
                user_created_datetime=status.user.created_at,
                user_language=status.user.lang,
                user_timezone=status.user.time_zone,
                user_utc_offset=status.user.utc_offset,
                user_friend_count=status.user.friends_count,
                user_followers_count=status.user.followers_count,
                tweet_id=status.id,
                tweet_id_str=status.id_str,
                tweet_text=status.text,
                tweet_retweeted=retweeted,
                tweet_retweet_count=status.retweet_count,
                tweet_favorite_count=status.favorite_count,
                tweet_created_timestamp=status.created_at,
                coordinates=coords,
                geo=geo,
            ))
        except ProgrammingError as err:
            print(err)

    def on_error(self, status_code):
        # 420 means the client is being rate limited; returning False disconnects
        if status_code == 420:
            return False


auth = tweepy.OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_SECRET)
auth.set_access_token(TWITTER_KEY, TWITTER_SECRET)
api = tweepy.API(auth)
stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=TRACK_TERMS)

6.2 Facebook data collection program

Required libraries:
- Facebook-sdk
- Simplejson
- Sqlite3

Source code for creating the database

import sqlite3 as lite
from datetime import datetime
from urllib.request import urlopen

import facebook
import simplejson


class FacebookScraper:  # class name assumed; the class statement is missing from the appendix

    def __init__(self, access_token, db_path, id_list):
        """Connects to Facebook Graph API and creates an SQLite database with four
        tables for Posts, Comments, Post_likes and People if not exists.

        Takes three arguments:
        access_token: your own Facebook access token that you can get on
            https://developers.facebook.com/tools/explorer/
        db_path: the path of the SQLite database where you want to store the data
        id_list: ID's of the Facebook pages you want to scrape
        """
        self.access_token = access_token
        self.db_path = db_path
        self.id_list = id_list
        g = facebook.GraphAPI(self.access_token, version='2.3')
        self.g = g
        # connect to the database
        con = lite.connect(self.db_path)
        self.con = con
        with con:
            # create a cursor
            cur = con.cursor()
            self.cur = cur
            # create the table for posts
            cur.execute(
                "CREATE TABLE IF NOT EXISTS Posts(post_id TEXT PRIMARY KEY, status_id TEXT, content TEXT, "
                "published_date TEXT, last_comment_date TEXT, post_type TEXT, status_type TEXT, "
                "post_link TEXT, link TEXT, video_link TEXT, picture_link TEXT, link_name TEXT, link_caption TEXT, "
                "link_description TEXT, comment_count INTEGER, share_count INTEGER, like_count INTEGER, "
                "love_count INTEGER, wow_count INTEGER, haha_count INTEGER, sad_count INTEGER, angry_count INTEGER, "
                "mentions_count INTEGER, mentions TEXT, location TEXT, date_inserted TEXT)")

Source code for writing the data

    def get_reactions(self, post_id, access_token):
        """Gets reactions for a post."""
        base = "https://graph.facebook.com/v2.6"
        node = "/%s" % post_id
        reactions = "/?fields=" \
                    "reactions.type(LIKE).limit(0).summary(total_count).as(like)," \
                    "reactions.type(LOVE).limit(0).summary(total_count).as(love)," \
                    "reactions.type(WOW).limit(0).summary(total_count).as(wow)," \
                    "reactions.type(HAHA).limit(0).summary(total_count).as(haha)," \
                    "reactions.type(SAD).limit(0).summary(total_count).as(sad)," \
                    "reactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
        parameters = "&access_token=%s" % access_token
        url = base + node + reactions + parameters
        # retrieve data
        with urlopen(url) as response:
            data = simplejson.loads(response.read())
        return data

    def write_data(self, d):
        """Writes data from the given Facebook page in SQLite database separated in four
        tables for posts, comments, likes and people.

        Takes the converted JSON of a Facebook page from given feed as argument."""
        date_inserted = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        messages = d['data']
        self.no_messages += len(messages)
        for message in messages:
            published_date = message['created_time']
            published_date = datetime.strptime(
                published_date, "%Y-%m-%dT%H:%M:%S+%f").strftime("%Y-%m-%d %H:%M:%S")
            post_type = message['type']
            post_id = message['id']
            org_id = post_id.split('_')[0]
            status_id = post_id.split('_')[1]
            post_link = 'https://www.facebook.com/%s/posts/%s' % (org_id, status_id)
            location = '' if 'place' not in message else str(message['place'])
            link = '' if 'link' not in message else str(message['link'])
            link_name = '' if 'name' not in message else message['name']
            link_caption = '' if 'caption' not in message else message['caption']
            link_description = '' if 'description' not in message else message['description']
            content = '' if 'message' not in message else message['message'].replace('\n', ' ')
            status_type = '' if 'status_type' not in message else message['status_type']
            picture_link = '' if 'picture' not in message else message['picture']
            video_link = '' if 'source' not in message else message['source']
            share_count = 0 if 'shares' not in message else message['shares']['count']
            # Reactions are only requested for posts published after 2016-02-24
            reaction_data = self.get_reactions(post_id=post_id, access_token=self.access_token) \
                if published_date > '2016-02-24 00:00:00' else {}
            love_count = 0 if 'love' not in reaction_data else reaction_data['love']['summary']['total_count']
            wow_count = 0 if 'wow' not in reaction_data else reaction_data['wow']['summary']['total_count']
            haha_count = 0 if 'haha' not in reaction_data else reaction_data['haha']['summary']['total_count']
            sad_count = 0 if 'sad' not in reaction_data else reaction_data['sad']['summary']['total_count']
            angry_count = 0 if 'angry' not in reaction_data else reaction_data['angry']['summary']['total_count']
            like_count = 0 if 'like' not in reaction_data else reaction_data['like']['summary']['total_count']
            post_data = (
                post_id, status_id, content, published_date, post_type, status_type,
                post_link, link, video_link, picture_link, link_name, link_caption,
                link_description, share_count, like_count, love_count, wow_count,
                haha_count, sad_count, angry_count, location, date_inserted)
            # Posts has 26 columns but only 22 values are collected here,
            # so the target columns are named explicitly.
            self.cur.execute(
                "INSERT OR IGNORE INTO Posts "
                "(post_id, status_id, content, published_date, post_type, status_type, "
                "post_link, link, video_link, picture_link, link_name, link_caption, "
                "link_description, share_count, like_count, love_count, wow_count, "
                "haha_count, sad_count, angry_count, location, date_inserted) "
                "VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, "
                "?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", post_data)
        self.con.commit()

Source code for fetching the data page by page

    def scrape(self):
        for feed in self.id_list:
            try:
                d = self.g.get_connections(feed, 'feed')
            except Exception as e:
                print("Error reading feed id %s, exception: %s" % (feed, e))
                continue
            self.no_messages = 0
            count = 1
            print("Scraping page %s of feed id %s" % (count, feed))
            self.write_data(d)
            try:
                # A missing 'paging' key or 'next' link raises here and ends this feed
                paging = d['paging']
                if 'next' in paging:
                    next_page_url = paging['next']
                while next_page_url:
                    count += 1
                    print("Scraping page %s" % count)
                    try:
                        # convert json into nested dicts and lists
                        with urlopen(next_page_url) as response:
                            d = simplejson.loads(response.read())
                    except Exception as e:
                        print("Error reading id %s, exception: %s" % (feed, e))
                        # stop paging this feed instead of retrying the same URL forever
                        break
                    if len(d['data']) == 0:
                        print("There aren't any other posts. Scraping of feed id %s is done! " % feed)
                        break
                    self.write_data(d)
                    if 'paging' in d and 'next' in d['paging']:
                        next_page_url = d['paging']['next']
                    else:
                        break
            except Exception:
                if self.no_messages > 0:
                    print("There aren't any other pages. Scraping of feed id %s is done! " % feed)
                else:
                    print("There is nothing to scrape. Perhaps the id you provided is a personal page.")
                continue
        self.con.close()

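The REST API and Search API programs in Appendix 6.1 both walk backwards through the results by sending max_id = (oldest id received so far) - 1 with each new request. The sketch below isolates that paging pattern so it can be run without Twitter credentials. It is only an illustration under stated assumptions: fetch_page is a hypothetical stand-in for a call such as Twitter.search or Twitter.get_user_timeline, and FAKE_TWEETS is invented sample data, not output from the thesis.

```python
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, int]

# Invented sample data (assumption): ten tweets with ids 10..1, newest first.
FAKE_TWEETS: List[Tweet] = [{"id": i} for i in range(10, 0, -1)]


def fetch_page(count: int, max_id: Optional[int] = None) -> List[Tweet]:
    """Hypothetical stand-in for an API call.

    Returns at most `count` tweets, newest first, whose id is <= max_id.
    """
    eligible = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return eligible[:count]


def collect_all(fetch: Callable[..., List[Tweet]], count: int,
                max_pages: int = 100) -> List[Tweet]:
    """Collect tweets page by page, moving towards older tweets."""
    collected: List[Tweet] = []
    max_id: Optional[int] = None
    for _ in range(max_pages):
        page = fetch(count=count, max_id=max_id)
        if not page:
            # An empty page means there is nothing older left to fetch.
            break
        collected.extend(page)
        # max_id is inclusive, so step one below the oldest id already seen
        # to avoid receiving that tweet again on the next page.
        max_id = page[-1]["id"] - 1
    return collected


tweets = collect_all(fetch_page, count=3)
print(len(tweets), tweets[0]["id"], tweets[-1]["id"])  # prints: 10 10 1
```

With a real client, fetch_page would wrap the API call and return the list of statuses; the max_pages cap plays the same role as the 100-page limit in getTweetSearch.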
Title	A Study of Data Collection Methods in Social Network Mining
Author	Đỗ Huyền Trang
Supervisor	Dr. Đinh Trọng Hiếu
Institution	Banking Academy of Vietnam (Học viện Ngân hàng)
Major	Management Information Systems
Type	Undergraduate graduation thesis
Year of publication	2017
City	Hanoi

Format
Pages	92
File size	2 MB