Data Capture and Extraction with C# Succinctly
By Ed Freitas
Foreword by Daniel Jebaraj

Copyright © 2016 by Syncfusion Inc.
2501 Aerial Center Parkway, Suite 200
Morrisville, NC 27560 USA
All rights reserved.

Important licensing information. Please read.

This book is available for free download from www.syncfusion.com on completion of a registration form. If you obtained this book from any other source, please register and download a free copy from www.syncfusion.com. This book is licensed for reading only if obtained from www.syncfusion.com. This book is licensed strictly for personal or educational use. Redistribution in any form is prohibited. The authors and copyright holders provide absolutely no warranty for any information provided. The authors and copyright holders shall not be liable for any claim, damages, or any other liability arising from, out of, or in connection with the information in this book. Please do not use this book if the listed terms are unacceptable. Use shall constitute acceptance of the terms listed.

SYNCFUSION, SUCCINCTLY, DELIVER INNOVATION WITH EASE, ESSENTIAL, and .NET ESSENTIALS are the registered trademarks of Syncfusion, Inc.

Technical Reviewer: Zoran Maksimovic
Copy Editor: John Elderkin
Acquisitions Coordinator: Morgan Weston, marketing coordinator, Syncfusion, Inc.
Proofreader: Darren West, content producer, Syncfusion, Inc.

Table of Contents

About the Author
Acknowledgements
Introduction
Chapter 1 Extracting Data from Emails
    Introduction
    Understanding emails
    MailKit basics
    Parsing emails
    Demo program
    Using IMAP
    Demo program source code
Chapter 2 Extracting Data from Screenshots
    Introduction
    Understanding formats
    OpenCV basics
    Parsing screenshots
    Demo program
    Summary
    Complete demo program source code
Chapter 3 Extracting Data from the Web
    Introduction
    Understanding REST & HTTP requests
    Parsing JSON responses
    Demo program
    Summary
    Complete demo program source code
Chapter 4 Extracting Meaning from Text
    Introduction
    Understanding contextualization
    Common data types & RegEx
    Identifying entities
    Summary

The Story behind the Succinctly Series of Books

Daniel Jebaraj, Vice President, Syncfusion, Inc.

Staying on the cutting edge

As many of you may know, Syncfusion is a provider of software components for the Microsoft platform. This puts us in the exciting but challenging position of always being on the cutting edge. Whenever platforms or tools are shipping out of Microsoft, which seems to be about every other week these days, we have to educate ourselves, quickly.

Information is plentiful but harder to digest

In reality, this translates into a lot of book orders, blog searches, and Twitter scans. While more information is becoming available on the Internet, and more and more books are being published, even on topics that are relatively new, one aspect that continues to inhibit us is the inability to find concise technology overview books. We are usually faced with two options: read several 500+ page books or scour the web for relevant blog posts and other articles. Just like everyone else who has a job to do and customers to serve, we find this quite frustrating.

The Succinctly series

This frustration translated into a deep desire to produce a series of concise technical books that would be targeted at developers working on the Microsoft platform. We firmly believe, given the background knowledge such developers have, that most topics can be translated into books that are between 50 and 100 pages. This is exactly what we resolved to accomplish with the Succinctly series. Isn’t everything wonderful born out of a deep desire to change things for the better?
The best authors, the best content

Each author was carefully chosen from a pool of talented experts who shared our vision. The book you now hold in your hands, and the others available in this series, are a result of the authors’ tireless work. You will find original content that is guaranteed to get you up and running in about the time it takes to drink a few cups of coffee.

Free forever

Syncfusion will be working to produce books on several topics. The books will always be free. Any updates we publish will also be free.

Free? What is the catch?

There is no catch here. Syncfusion has a vested interest in this effort. As a component vendor, our unique claim has always been that we offer deeper and broader frameworks than anyone else on the market. Developer education greatly helps us market and sell against competing vendors who promise to “enable AJAX support with one click” or “turn the moon to cheese!”

Let us know what you think

If you have any topics of interest, thoughts, or feedback, please feel free to send them to us at succinctly-series@syncfusion.com. We sincerely hope you enjoy reading this book and that it helps you better understand the topic of study. Thank you for reading.

Please follow us on Twitter and “Like” us on Facebook to help us spread the word about the Succinctly series!
About the Author

Ed Freitas works as a consultant. He was recently involved in analyzing 1.6 billion rows of data using Redshift (Amazon Web Services) in order to gather valuable insights on client patterns. Ed holds a master’s degree in computer science, and he enjoys soccer, running, travelling, and life hacking. You can reach him at Edfreitas.me.

Acknowledgements

My thanks to all the people who contributed to this book, especially Hillary Bowling, Tres Watkins, and Darren West, the Syncfusion team that helped make it a reality. Thanks also to Manuscript Manager Darren West and Technical Editor Zoran Maksimovic, who thoroughly reviewed the book’s organization, code quality, and accuracy. My colleagues Simon, Neil, Josh, and John Robert acted as technical reviewers and provided many helpful suggestions regarding correctness, coding style, readability, and implementation alternatives. Thank you all.

Introduction

The world around us is filled with information. Valuable data is locked in silos such as emails, screenshots, and the web. Capturing and extracting that information in order to process it, make sense of it, and use it to help us make better and informed decisions should be fun and stimulating.

This book will provide an overview of capturing and extracting data from various sources in an easy and comprehensible way, using open source technologies available to anyone. However, these technologies are not replacements for, nor intended to compete with, specialized commercial tools that provide a much broader range of possibilities and are case specific and fine-tuned for particular scenarios.

You will also gain an understanding of the methods, techniques, and libraries used in data extraction, which can lead to valuable insights and help you become a better manager, operate your business more effectively, and create a competitive advantage in the business world.

For readers with knowledge of C#, this book will offer exciting glimpses into what is technically possible without
in-depth analyses of each topic. The techniques presented here, along with the clear, concise, and easy-to-follow examples provided, will provide a good head start on understanding what is feasible with data capture and extraction in C#.

Have fun!

Chapter 1 Extracting Data from Emails

Introduction

Email has become a pillar of our modern and connected society, and it now serves as a primary means of communication. Because each email is filled with valuable information, data extraction has emerged as a worthwhile skill set for developers in today’s business world. If you can parse an email and extract data from it, companies that automate processes, e.g., helpdesk systems, will value your expertise.

An email can be divided into several parts: subject, body, attachments, sender, and receiver(s). We should also note that the headers section reveals important information about the mail servers involved in the process of sending and receiving an email.

Before addressing how we can extract information from each part of an email, we should understand that a mailbox can be viewed as a semistructured database that does not use a native querying language (e.g., SQL) to extract information.

Email
    Headers
    Contents
    Sender (From)
    Receiver (To)
    CC (List of one or more Receivers visible to the main Receiver)
    BCC (List of one or more Receivers not visible to the main Receiver)
    Subject
    Body Content
    Attachments (If any)
        Attachment
        Attachment N

Table 1: A Typical Email Structure

[...]

            if (xAttributes != null && xAttributes.Length > 0)
            {
                int i = 0;
                List<double> rlts = new List<double>();

                foreach (string xAtrrib in xAttributes)
                {
                    string xCol = (xColls != null && xColls.Length > 0 &&
                        xColls.Length == xAttributes.Length) ? xColls[i] : String.Empty;

                    rlts.Add(CalcProbXA(smoothing, xAtrrib, A, xAttributes.Length, xCol, aCol));
                    i++;
                }

                rlts.Add(ProbA(A, aCol));

                // Multiply all partial probabilities together.
                double tmp = 0;
                int cnt = 0;

                foreach (double r in rlts)
                {
                    tmp = (cnt == 0) ? r : tmp * r;
                    cnt++;
                }

                res = tmp;
            }

            return res;
        }

        // P(female | X) = [P(finance | female) * P(
        [...]
            && gColls != null && gColls.Length > 0)
            {
                if (G.Length == gColls.Length)
                {
                    int i = 0;

                    foreach (string group in G)
                    {
                        denominator += PProbAX(smoothing, group, xAttributes, xColls, gColls[i]);
                        i++;
                    }
                }
            }

            if (denominator > 0)
                res = nonimator / denominator;

            return res;
        }

        public virtual void Dispose(bool disposing)
        {
            if (!this.disposed)
            {
                if (disposing)
                {
                    dataset = null;
                }
            }

            this.disposed = true;
        }

        public void Dispose()
        {
            this.Dispose(true);
            GC.SuppressFinalize(this);
        }
    }
}

In Code Listing 20, the Bayes engine’s principal method is BayesAX, which first calculates the numerator, i.e., PP(female | X). Then the denominator is calculated, i.e., [PP(female | X) + PP(male | X)], and the method returns the final result, i.e., P(female | X) = [PP(female | X)] / [PP(female | X) + PP(male | X)].

Method BayesAX has the following parameters: A, G, gColls, xAttributes, xColls, aCol, and smoothing.

A represents the group (A) for which the probability will be calculated (in the given example, A is female).

G represents a string array of the possible groups (in the given example, these groups are female and male).

The parameter gColls represents a string array of the names of the columns in which the female and male groups are found (in the given example, the column name for both female and male is Gender).

The parameter xAttributes represents a string array of the X values for calculating probability (in the given example, finance, senior, and

[...]

            0)
            {
                foreach (string t in tmp)
                {
                    if (t.Count(x => x == '/') == 2)
                    {
                        res.Add(t.Substring(0, t.LastIndexOf("/") - 1));
                    }
                }
            }

            return res.ToArray();
        }

        public NER()
        {
            string root = @"D:\Temp\NER\classifiers";

            Classifier = CRFClassifier.getClassifierNoExceptions(
                root + @"\english.all.3class.distsim.crf.ser.gz");
        }

        ~NER()
        {
            this.Dispose(false);
        }

        public string[] Recognize(string txt)
        {
            return ParseResult(Classifier.classifyToString(txt));
        }

        public virtual void Dispose(bool disposing)
        {
            if (!this.disposed)
            {
                if (disposing)
                {
                    Classifier = null;
                }
            }

            this.disposed = true;
        }

        public void Dispose()
        {
            this.Dispose(true);
            GC.SuppressFinalize(this);
        }
    }
}

// Wrapper class around the Stanford NER Implementation
using System;

namespace TextProcessing
{
    public class NerExample
    {
        public static void nerExample()
        {
            using (NER n = new NER())
            {
                string[] res = n.Recognize("I went to Stanford, which is located in California");

                if (res != null && res.Length > 0)
                {
                    foreach (string r in res)
                    {
                        Console.WriteLine(r);
                    }
                }
            }
        }
    }
}

// Main Program
using System;
using TextProcessing;

namespace DataCaptureExtraction
{
    class Program
    {
        static void Main(string[] args)
        {
            NerExample.nerExample();
        }
    }
}

The most important part of the code is the call to CRFClassifier.getClassifierNoExceptions, which receives the location on disk of the classifier definitions file (english.all.3class.distsim.crf.ser.gz). Within Recognize, the Classifier.classifyToString method of the Stanford NER is invoked, and the results are parsed. This will produce the output we see in Figure 15.

Figure 15: Stanford NER C# Implementation Output

Using the input string “I went to Stanford, which is located in California,” the Stanford NER C# program can recognize two named entities: Stanford (which is an organization) and California (which is a location).

Summary

Extracting meaning from text is a fascinating topic, whether we are examining how to extract specific data types, recognize entities, or classify words within text. When you are able to make sense of extracted data, you have access to a powerful tool that can help you improve, accelerate, and automate business processes. In fact, organizations can streamline and improve a practically unlimited range of processes, from spam filters to text classification and beyond. We’ve only scratched the surface of what is possible with powerful C# code implementations.

Keep in mind that the techniques I have presented in this book are
recommended for concept testing rather than production usage. We have focused on quick implementation of what might be achieved from a conceptual point of view, and these techniques do not compete with or undermine any commercial offerings. I encourage you to also consider the diverse range of commercial products that have powerful APIs and are professionally supported.

Thank you for reading. I hope this material has helped broaden your view on data capture and extraction with C#.

The complete Visual Studio project source code can be downloaded from this URL: http://1drv.ms/1Q72Utm

[...]

... Tesseract and to perform data extraction and OCR on TIFF with CCITT Group IV compression screenshots. EmguCV allows OpenCV functions to be called from native .NET code written in C#, VB, VC++, or even IronPython. EmguCV is also compatible with Mono from Xamarin and can run on Windows, Mac OS X, Linux, iOS, and Android. You can install EmguCV as a NuGet package using Visual Studio. Because there are several...

... such as typing and data entry. As companies and individuals increasingly automate their internal processes, extracting information from screenshots, which avoids manual data entry and typing, becomes ever more important in the business world. The process of reading screenshots and extracting valuable information is often called Capture or Extraction. Extracting the words, numbers, or text contained within...

... extract data from each element and make meaning of it. We will address this later in Chapter 1. Keep in mind that these elements will always contain data: Headers, Contents, To, Sender, and Receiver. These are essential; without them an email cannot be relayed. However, other email elements, such as CC, BCC, Subject, Body, Content and Attachments, might not contain data. This chapter will address how to connect...
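To make the distinction between always-present and optional email elements more concrete, here is a minimal sketch of how one might check which optional parts actually carry data before trying to extract anything from them. The EmailParts and EmailPartInspector types are hypothetical names introduced only for illustration; they are not part of MailKit or of the book's demo code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for the parts of an email (NOT a MailKit type).
public sealed class EmailParts
{
    public string Sender { get; set; } = "";
    public List<string> To { get; } = new List<string>();
    public List<string> Cc { get; } = new List<string>();
    public List<string> Bcc { get; } = new List<string>();
    public string Subject { get; set; } = "";
    public string Body { get; set; } = "";
    public List<string> Attachments { get; } = new List<string>();
}

public static class EmailPartInspector
{
    // Returns the names of the optional parts that actually hold data,
    // so that extraction code can skip the empty ones.
    public static List<string> PartsWithData(EmailParts mail)
    {
        if (mail == null) throw new ArgumentNullException(nameof(mail));

        var found = new List<string>();

        if (mail.Cc.Count > 0) found.Add("CC");
        if (mail.Bcc.Count > 0) found.Add("BCC");
        if (!string.IsNullOrWhiteSpace(mail.Subject)) found.Add("Subject");
        if (!string.IsNullOrWhiteSpace(mail.Body)) found.Add("Body");
        if (mail.Attachments.Count > 0) found.Add("Attachments");

        return found;
    }
}
```

Calling EmailPartInspector.PartsWithData on a message that has only a sender, one receiver, and a subject would return a list containing just "Subject". In a real program, the same checks would presumably be made against whatever message type the mail library provides.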
Contents of a Received Email

Figure 3 depicts a connection to a POP3 server and a particular email inbox. Based on the contents of the emails found, it inspects each email's subject, body, and attachment contents, extracting all keywords and checking for any keywords that match a specific predefined set of words (e.g., support, marketing, or invoice). If so, the connection sends an automated response using...

        private const string cImapUserName = "test@imapserver.com";
        private const string cImapPwd = "1234";
        private const string cImapMailServer = "mail.imapserver.com";
        private const int cImapPort = 993;

        private const string cSmtpUserName = "test@smtpserver.com";
        private const string cSmtpPwd = "1234";
        private const string cSmtpMailServer = "mail.smtpserver.com";
        private const int cSmptPort = 465;

        public static void ShowPop3Subjects()...

EmguCV NuGet package, you won’t get the Emgu.CV.OCR namespace, which is essential for working with Tesseract. That means we need access to the Tesseract engine. In order to access the necessary file Emgu.CV.OCR.dll, we must download and install the full EmguCV setup, which can be found here: https://sourceforge.net/projects/emgucv/files/emgucv/3.0.0/libemgucv-windows-universal3.0.0.2157.zip/download?use_mirror=freefr&r=&use_mirror=freefr...

rest of the process is essentially the same, and each email is represented by a MimeMessage object. To learn more about MailKit, go to the project’s GitHub website and check the code examples and further documentation.

Demo program source code

The following Code Listing contains complete source code for all the examples previously described using MailKit.

Code Listing 8: Demo Program Source Code Using MailKit...

(OpenCV) is a C++ cross-platform library that was designed for use in implementing computer vision solutions (face detection, recognition of patterns in images, etc.).
You can learn more about it on the OpenCV Wikipedia entry and from the OpenCV website. Because OpenCV is a native (nonmanaged) C++ library, there is a .NET cross-platform wrapper called EmguCV that we will use to interact with Tesseract and...

treat each element and predict the type of data we can expect to extract. In order to connect to a mail server and extract data, we will be using a cross-platform C# library called MailKit.

MailKit basics

MailKit is a cross-platform .NET library for IMAP, POP3, and SMTP built on top of MimeKit. MailKit was developed by Jeffrey Stedfast from Xamarin, and more information about it can be found at https://components.xamarin.com/view/mailkit...

        cPopPort = 110;

        public static void ShowPop3Subjects()
        {
            using (EmailParser ep = new EmailParser(cPopUserName, cPopPwd,
                cPopMailServer, cPopPort))
            {
                ep.OpenPop3();
                ep.DisplayPop3Subjects();
                ep.ClosePop3();
            }
        }
    }
}

// Program.cs: Show Subjects from POP3 Messages
using EmailProcessing;

namespace DataCaptureExtraction
{
    class Program
    {
        static void Main(string[] args)
        {
            EmailExample.ShowPop3Subjects();
        }
    }
}
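Returning to the keyword check described earlier for Figure 3 (inspecting subject, body, and attachment text for words such as support, marketing, or invoice), the following is a minimal sketch of how such matching might be done on plain strings. The KeywordMatcher class and its method names are assumptions made for illustration; the book's own EmailParser class is not shown in this excerpt and may work differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper, not taken from the book's demo code.
public static class KeywordMatcher
{
    // Splits text into lowercase words; any character that is not a
    // letter or digit is treated as a separator.
    public static List<string> ExtractWords(string text)
    {
        var words = new List<string>();
        if (string.IsNullOrEmpty(text)) return words;

        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                words.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0) words.Add(current.ToString());

        return words;
    }

    // Returns the distinct keywords (expected in lowercase) that occur in
    // any of the given email parts, in alphabetical (ordinal) order.
    public static List<string> FindMatches(IEnumerable<string> emailParts,
        ISet<string> keywords)
    {
        var matches = new SortedSet<string>(StringComparer.Ordinal);

        foreach (string part in emailParts)
        {
            foreach (string word in ExtractWords(part))
            {
                if (keywords.Contains(word)) matches.Add(word);
            }
        }

        return matches.ToList();
    }
}
```

With the keyword set { support, marketing, invoice } and the parts "Invoice overdue" and "Please contact SUPPORT about the invoice.", FindMatches returns invoice and support. A non-empty result is the condition under which the automated response described for Figure 3 would be triggered.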