CAPSTONE PROJECT 1 REPORT

Subject: Voice recognition Artificial Intelligence
Class: TT01
Semester: 231
Group members:
Trần Hoàng Lâm – 1951161
Đỗ Hạo Nam – 1911640
Tutor: PhD Phạm Việt Cường

Table of Contents

Project Overview
Chapter 1: Overview about Artificial Intelligence and Speech Recognition
1.1 Artificial Intelligence
1.1.1 Definition
1.1.2 AI Algorithms
1.1.3 Natural Language Processing
1.1.4 Applications
1.2 Speech Recognition
1.2.1 Overview
1.2.2 Types of speech recognition
1.2.3 Speech recognition working principle
1.2.4 Applications
1.2.6 Alexa's working principle
Chapter 2: Overview of Python and how to use the Python language for speech recognition AI
2.1 Python language
2.1.1 Overview of Python language
2.1.2 Python running software
2.2 Libraries
2.2.1 Introduction to Pip in Python
2.2.2 How to use Python language for virtual assistant
CONCLUSION
REFERENCES

Project Overview

This project report presents an overview of Artificial Intelligence, Speech Recognition and its applications. The first chapter covers AI theory and describes the speech recognition process and its diverse uses. The next chapter covers the Python programming language and building a virtual assistant using Python.

I Project Objective

To understand:
- Fundamentals of speech recognition and AI
- Applications in many fields and the working principle of speech recognition
- Python programming language
To build a smart assistant using Python.

II Abstract

Nowadays, with the day-by-day development of science and technology, speech recognition has become a fast-growing technology. Its applications play an important role in different fields and bring huge benefits to humanity. Not everyone is born with full physical abilities: some people have vision problems or cannot use their hands. Speech recognition is a significant solution that helps them in many tasks. From the factors mentioned
above, this project aims to develop an assistant that can help with some of the problems faced by people with disabilities.

III Project scope

- Overview of artificial intelligence and speech recognition with their applications
- Practice building smart assistant models with the Python programming language

Chapter 1: Overview about Artificial Intelligence and Speech Recognition

1.1 Artificial Intelligence

1.1.1 Definition

Fig 1.1 Artificial intelligence

Artificial Intelligence (AI) is intelligence programmed by humans, with the goal of helping computers automate behaviors and be as intelligent as humans. Specifically, AI is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience.

However, today's AI is not smart enough to think and make decisions independently. Most systems today do not come close to the concept of real AI. They only work according to predefined algorithms and under human control.

AI is generally divided into two types: narrow (or weak) AI and general AI, also known as AGI or strong AI.

Artificial Narrow Intelligence (ANI), sometimes known as "weak AI", is a term used to describe artificial intelligence systems built to handle a limited set of simple tasks. Such a system is designed to perform a single function, such as internet search, face recognition or speech recognition, under various constraints and limitations. Narrow AI is very popular at present, especially in virtual assistants such as Siri and Cortana, and is often used in search engines, disease detection and recommender systems.

Artificial General Intelligence (AGI), known as "strong AI", allows machines to apply knowledge and skills in different contexts. Theoretically, AGI is the highest level of artificial intelligence today. At this level, AI
would have feelings of happiness and sadness like humans. Currently, AGI is where we are heading, but it is still in its very nascent stages. The human brain is so complex that it is not yet possible to create a model that replicates the connections in this biological network. However, more advanced areas such as natural language processing and computer vision are bridging the gap between ANI and AGI.

1.1.2 AI Algorithms

1.1.2.1 Machine Learning

Fig 1.2 Machine learning

Machine Learning grew into a field of its own in the late 1980s and early 1990s. It is a subfield of AI that allows computers to act and make decisions based on data in order to perform a certain task. Machine learning programs are algorithms designed in such a way that they can learn and improve over time when exposed to new data. The term refers to a computer learning by itself to improve at the task it is performing: it covers any system in which the computer's performance on a task gets better after completing that task many times.

An everyday example of machine learning is the Google search engine. When we search for something on Google, it returns a lot of results. If we spend a lot of time looking at the returned results, or click on a link to continue reading, Google recognizes that this person spent a lot of time (e.g. minutes) on the returned information, which means the information is useful and relevant to this person. If we only glance at the returned results for a few seconds, Google realizes that these results are not suitable for the searcher, and it automatically adjusts the results for that searcher's following searches.

1.1.2.2 Deep Learning

True to its name, Deep Learning is a family of algorithms that helps computers learn deeply from a huge amount of information. A deep learning model studies examples from existing data; it is then fed new data and processes it according to what it learned from the existing data. The
more data it is given, the more accurately the algorithm will do its job. A system based on deep learning will be able to find a certain rule even if it has not encountered it before. Deep learning uses artificial neural networks (ANN) to analyze data at various levels of detail, using algorithms that simulate the human nervous system and learn from a large amount of provided data to solve specific problems. A Deep Learning algorithm performs the task multiple times, each time refining it a bit to improve the result.

Fig 1.3 Artificial neural network

An ANN is a type of software architecture modeled on the human brain. Computer neural networks are links of electronic neurons capable of processing and classifying information. They are organized in layers, and each layer is responsible for its own task; together their results form an overall picture. For example, when a neural network is trained to process images of different objects, it will seek to pick out the objects in these images. Each layer of the neural network will detect certain characteristics such as shape, color, gender or age.

Fig 1.4 Animal recognition problem

The animal recognition problem is a basic example of a Deep Learning algorithm. The computer's task is to recognize whether the given image shows a dog or a cat. For example, if we teach the computer to recognize the image of a cat, we build many layers into the ANN. Each layer is able to identify a specific feature of the cat, such as whiskers, claws or legs. We then show the machine thousands of different cat pictures, indicating that each is a cat, and likewise thousands of non-cat pictures, indicating that each is not a cat. Once the ANN has looked through all the pictures, its layers gradually learn to recognize whiskers, claws and legs.

Deep learning is very popular on today's major platforms such as Facebook, Lazada and Tiki, as these platforms all have very strong recommendation systems that significantly increase user interaction.
Specifically, they rely on the data that users generate when using and interacting with internet-connected devices to suggest more products they will like and to recommend advertising.

1.1.2.3 Deep Learning and Machine Learning

In fact, Deep Learning is a subfield of Machine Learning. In a larger system, other Machine Learning components can use the information extracted by Deep Learning to make decisions. For example, in autonomous driving, after Deep Learning has learned where a car is, where a pedestrian is, where a tree is and the distance to the objects in front, Machine Learning is applied to the driving decisions. It combines and analyzes the information provided by Deep Learning. When the car faces an obstacle, a slowdown command is issued. At an intersection, when it notices a yellow light, it commands the car to slow down, and it stops the car at a red light.

1.1.3 Natural Language Processing

Fig 1.5 NLP in AI

Natural language processing (NLP) is a combination of computational linguistics with statistical, machine learning and deep learning models, which helps the computer understand human natural language, whether spoken or written. There are two main phases in NLP: data preprocessing and algorithm development.

Data preprocessing prepares textual data and "cleanses" it for machine analysis. Preprocessing puts the data in a usable format and highlights the features in the text that the algorithm will work with. After the data is preprocessed, an algorithm is developed to process it.

Hidden Markov Model: In speech recognition, this model compares each part of the waveform with the parts before and after it, and compares it with a dictionary of waveforms to understand what is being said. The Hidden Markov Model (HMM) is a model in which a series of emissions is observed, while the sequence of states that the model passes through to generate those emissions is unknown. Hidden Markov model analysis attempts to recover the sequence of states from the observed data.

Fig 1.6 Hidden Markov model

1.1.4 Applications

Fig 1.7 AI in eCommerce

AI in eCommerce: AI technology creates a
recommendation engine through which you can interact better with your customers. These recommendations are made based on their browsing history and interests. This, in turn, improves your customer service and branding.

Fig 1.8 AI in Healthcare

AI in Healthcare: AI technology helps to build advanced machines capable of detecting diseases and identifying cancer cells. Moreover, a combination of historical data and medical intelligence can be used to invent new drugs.

Fig 1.9 AI in Agriculture

AI in Agriculture: AI technology can identify soil defects and nutrient deficiencies based on computer vision, robotics and machine learning applications. AI can also be used to build bots that harvest crops with higher efficiency and performance than human labor.

1.2 Speech Recognition

1.2.1 Overview

Speech recognition is a pattern recognition process whose main purpose is to classify the input, a speech signal, into a sequence of patterns that were previously learned and stored in memory. The patterns can be recognition units, words or phonemes. If these patterns are immutable and constant, speech recognition becomes a simple matter of comparing the speech data to be recognized with the patterns that have already been learned and stored in memory.

Speech recognition has become well known and widely used with the rise of AI and smart assistants, such as Alexa (Amazon), Siri (Apple) and Cortana (Microsoft).

Fig 1.10 Well-known smart assistants

Speech recognition systems allow consumers to interact with technology just by talking to it, enabling hands-free requests, reminders and other simple tasks.

1.2.2 Types of speech recognition

Speech recognition systems can be categorized into several types based on their capability to recognize words and on the list of words they have. Some types of speech recognition are listed below.

1.2.2.1 Isolated speech recognition

In isolated word recognition, one word or a list of words is separated by pauses, which means that the system requires one utterance at a time to
understand.

1.2.2.2 Connected speech recognition

A connected word system is much the same as an isolated word system, but it allows only a small pause between separate words and utterances. It can be considered a form of planned speech.

1.2.2.3 Continuous speech recognition

A continuous speech system allows natural utterances. However, the content must be identified by the computer using special methods.

1.2.2.4 Spontaneous speech recognition

Spontaneous speech is the opposite of prepared speech: it sounds natural and is not rehearsed. Spontaneous speech systems can handle some features of natural speech, such as hesitation sounds, words run together and slight stutters.

1.2.3 Speech recognition working principle

Fig 1.11 Basic speech recognition system

A basic speech recognition system includes:

Voice Recorder: It consists of a microphone, which converts the audio wave signal into an electrical signal, and an Analog-to-Digital converter (ADC), which samples and digitizes the analog signal to obtain discrete data that the computer can understand. The ADC's quantization step approximates a continuous range of values with a finite set of discrete levels.

Acoustic Model: based on Hidden Markov Models and ANN, working on the audio signal to find its phonemes and graphemes. It is created by using voice

Python's dynamic nature also causes Python to be slow, because it has to do extra work while executing code. Therefore, Python is not used for purposes where speed is an important aspect of the project.

Not Memory Efficient: Python makes some compromises to provide simplicity to developers, and Python programs consume a lot of memory. This can be a disadvantage when building an application for which memory optimization matters.

Weak in Mobile Computing: Python is commonly used in server-side programming. It is rarely seen on the client side or in mobile applications, because it is memory-inefficient and slower than other languages.

Database Access: Programming in Python is easy and hassle-free. But when
interacting with a database, it lags behind. The database access layer of Python is primitive and underdeveloped compared to common technologies such as JDBC and ODBC. Large enterprises need smooth interaction with complex legacy data, and as a result Python is less often used by enterprises.

Runtime Errors: Python is a dynamically typed language, so the data type of a variable can change at any time. A variable that contains an integer may later contain a string, which can lead to run-time errors. Therefore, Python programmers need to test their applications thoroughly.

2.1.1.4 Command lines and control structures

Statements in Python are simple, since no symbol is required to end them. Python also contains control structures and keywords, such as:
- Conditional structures: if, elif, else
- Loop structures: while, for
- class: class variables are declared when a class is being constructed
- def: defines a function

2.1.1.5 Applications

YouTube, Instagram and Quora are among the many websites using Python. Most of Dropbox's source code is written in Python.

2.1.1.6 Summary

Python is a simple, versatile and complete programming language, suitable for everyone from beginners to professionals. There are some drawbacks, but the advantages outweigh the disadvantages. Even Google counts Python among its leading programming languages.

2.1.2 Python running software

We can type Python commands directly in the Terminal.

Fig 2.1: using Terminal to run python

However, to make coding more convenient, we can run Python in an Integrated Development Environment (IDE). An IDE integrates compilers or interpreters, which lets you execute your code directly while you are still writing the program. Examples include Visual Studio, Eclipse, Xcode, Android Studio and Sublime Text 3, among others. Moreover, when coding with this software, we only need to save files with the .py extension, and then we can run the programs from Command Prompt.

2.2 Libraries

2.2.1 Introduction to Pip in python

Pip, often expanded as "Pip Installs Packages" or "Preferred Installer Program", is the package installer
for Python. You can use pip to install packages from the Python Package Index (PyPI) and other indexes. Pip allows you to install, reinstall or uninstall PyPI packages with one simple command: pip. Pip is already included in recent versions of Python.

To use pip, first open a command line:
- For Windows, open Command Prompt by typing "cmd" in Run or directly in a folder.
- For macOS, press "Command + Space" and search for Terminal.
- For Linux, open a terminal with "Ctrl + Alt + T".

After that, install packages from PyPI like this: pip install package-name

Fig 2.2: installing package for Python

Here, package-name can be replaced by the name of any library package supported by Python, such as PyGame (graphics manipulation), Pillow (image processing), Matplotlib (2D plotting) or SpeechRecognition. Besides installable packages, Python itself already ships with many usable packages (e.g. tkinter for graphical interfaces, datetime for time handling).

To remove a package: pip uninstall package-name, then confirm with yes or no (y/n).

2.2.2 How to use python language for virtual assistant

2.2.2.1 Preparations

System requirements: Python 3.10.4, Sublime Text

Libraries:
- speech_recognition: recognizing voice
- time, datetime: time management
- wikipedia: searching Wikipedia
- webbrowser, selenium, webdriver_manager, urllib: accessing the web and the browser (Chrome)
- pyttsx3: converting text into sound
- requests: crawling information from websites
- smtplib: sending email over the SMTP protocol
- re: regular expressions
- os, sys, ctypes: accessing and handling system files
- json: handling the JSON data type
- youtube_search: searching for videos on YouTube

2.2.2.2 Building a virtual assistant

2.2.2.2.1 Import needed libraries:

urllib2 was replaced by urllib.request and urllib.error in Python 3; however, we still import it as urllib2 to avoid misunderstanding.

We represent each use of the virtual assistant by a function. Each function either returns a value or only performs commands, depending on its use.
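The import step described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the project's actual import block: the module names come from the library list in 2.2.2.1, while the helper missing_modules() and the two list names are hypothetical. It checks which libraries can be found before the assistant starts, and it shows the urllib2 alias mentioned above.

```python
import importlib.util
import urllib.request as urllib2  # Python 3 module imported under the old urllib2 name

# Import names taken from the library list in 2.2.2.1.
# The list names below are hypothetical and used only for this sketch.
STANDARD_MODULES = ["time", "datetime", "webbrowser", "urllib", "smtplib",
                    "re", "os", "sys", "ctypes", "json"]
THIRD_PARTY_MODULES = ["speech_recognition", "wikipedia", "selenium",
                       "webdriver_manager", "pyttsx3", "requests",
                       "youtube_search"]


def missing_modules(names):
    """Return the module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(STANDARD_MODULES + THIRD_PARTY_MODULES)
    if missing:
        # These would have to be installed with pip before running the assistant.
        print("Not installed yet:", ", ".join(missing))
    else:
        print("All libraries are available.")
```

Keep in mind that the name used with pip can differ from the import name; for example, speech_recognition is installed with pip install SpeechRecognition.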
2.2.2.2.2 Functions:

I Converting text to sound

The first function we need converts a text to sound and plays the sound.

Fig 2.3: speak() function

We used the pyttsx3 library to convert the text into sound and play it.

II Converting sound to text

This is the second basic function, alongside converting text to sound. For this, we use the supporting functions get_audio() and stop().

Fig 2.4: get_audio() function

In get_audio(), we used the speech_recognition library to recognize sound and convert it into text. The sound is captured from your microphone by the listen function of speech_recognition.Recognizer and saved in audio. By setting phrase_time_limit, we also give the function a time limit after which it ends the recording, in case the microphone has problems and the system cannot stop by itself. The data in audio is recognized as English by r.recognize_google, converted into text and saved in text. If r.recognize_google can recognize the sound data in audio, get_audio() returns text. Otherwise, if r.recognize_google cannot recognize the data, get_audio() returns a value that signals failure (this is the case when the assistant does not understand what we say, and get_audio() is then run again).

Fig 2.5: stop() function

The stop() function is only used for speaking the text "See you again!" using the speak() function mentioned above.

Fig 2.6: get_text() function

The get_text() function makes the system listen for sound input up to a maximum number of times or until the system recognizes it. We used a for loop; if the value of text differs from the failure value, get_text() returns text.lower() (lower() converts uppercase letters into lowercase).

III Greeting

Fig 2.7: hello() function

This function creates some basic conversation between us
and the virtual assistant, such as greetings or questions about the assistant's information. Here, we call it the hello function. We used the date_time variable to store the current time; this variable is then compared with fixed time thresholds to choose the suitable text.

IV Checking current date, time

Fig 2.8: get_time() function

This function is quite simple: we used the datetime library to get the current time and store it in the now variable. If the assistant detects the word "time", "hour" or "now" in the text variable, it speaks the current time. Otherwise, if the word "day" or "date" is in the text variable, it speaks the current date.

V Open website, application

Fig 2.9: open_application() function

If the text variable contains a special keyword such as google, word or excel, we use os.startfile() to open the corresponding application from the system.

Fig 2.10: open_website() function

We used the re.search() function (the search function of the re regular expression module) to extract the domain, which comes after the word "open", and then combined it with the string "http://www." to create the complete URL of the website. After that, we used webbrowser.open(url) to open the requested website. If re.search() finds the domain for the URL, the virtual assistant runs the rest of open_website() and the function returns True. On the other hand, when the domain cannot be found, open_website() does nothing further and returns False.

VI Sending email

Fig 2.11: SMTP chart

SMTP (Simple Mail Transfer Protocol) is the protocol mainly used for sending email; for accessing the mail server's data there are the IMAP and POP protocols. An SMTP server is a service that allows you to send email in large quantities and at high speed.

Fig 2.12: send_email() function

We used the smtplib library to send email with smtp