
Design and manufacture a pet robot


DOCUMENT INFORMATION

Basic information

Title Design and Manufacture a Pet Robot
Authors Nguyen Dang Duy Tan, Nguyen Ngoc Nhan, Le Pham Hoang Thuong
Supervisor Master Tuong Phuoc Tho
University Ho Chi Minh City University of Technology and Education
Major Mechanical Engineering Technology
Document type Graduation Thesis
Year 2024
City Ho Chi Minh City
Format
Number of pages 101
File size 6.87 MB

Structure

  • CHAPTER 1 INTRODUCTION
    • 1.1. Introduction
    • 1.2. Scientific and Practical Relevance
    • 1.3. Rationale
    • 1.4. Research Objectives
    • 1.5. Research Method
  • CHAPTER 2 METHODOLOGY
    • 2.1. Idea for Function Capabilities
    • 2.2. Approach Method
      • 2.2.1. Structure Definition
      • 2.2.2. Functionalities Development Strategy
  • CHAPTER 3 THEORETICAL BASIS
    • 3.1. Mechanical Basis: Mathematical Model
      • 3.1.1. Kinetic Model
      • 3.1.2. DC Motors Model
    • 3.2. AI Model Theoretical Basis
      • 3.2.1. Speech Recognition and Communication System
      • 3.2.2. Vision
      • 3.2.3. Auto Navigation
      • 3.2.4. Creative Purpose
  • CHAPTER 4 DEVELOPMENT
    • 4.1. Visual and Mechanical Development
      • 4.1.1. Visual Development
      • 4.1.2. Components Arrangement
      • 4.1.3. Strength Analysis
      • 4.1.4. Motors Calculation
    • 4.2. Electrical Development
      • 4.2.1. Power Calculation
      • 4.2.2. PCB Design
    • 4.3. Low-level and High-level Controller Development
      • 4.3.1. Low-level Controller
      • 4.3.2. High-level Controller
    • 4.4. Software Development
      • 4.4.1. Speech Recognition and Communication System
      • 4.4.2. Vision
      • 4.4.3. Auto Navigation
      • 4.4.4. Creative Purpose
  • CHAPTER 5 EXPERIMENT AND RESULT
    • 5.1. Prototype Overview
    • 5.2. Hardware Experimenting
      • 5.2.1. Experiment Strategy
      • 5.2.2. Result
    • 5.3. Software Experimenting
      • 5.3.1. Experiment Strategy
      • 5.3.2. Result
  • CHAPTER 6 CONCLUSION
    • 6.1. Conclusion
    • 6.2. Future Development

Content


INTRODUCTION

Introduction

Robotics and artificial intelligence are pivotal in the ongoing technological revolution, driving significant changes and challenges across multiple sectors. AI-powered robots possess the capability to learn, automate tasks, and make informed decisions based on collected data. This advancement has unlocked new opportunities in healthcare, manufacturing, and education. A key trend is the emergence of user-friendly robots designed for effective collaboration and interaction with humans.

Sophia, the humanoid robot created by Hanson Robotics, combines a lifelike appearance with advanced AI capabilities, allowing her to generate thoughts and speech and perform independent actions. Through her interactions with humans, Sophia enhances her autonomy and develops emotions that resemble those of humans.

Figure 1.1 Close-up of Sophia wearing an Ao Dai and talking in Vietnam (Source: Tuoi Tre)

Pet robots, a specialized branch of interactive robotics, are gaining traction in mental healthcare due to their unique designs and AI-driven intelligence. These innovative machines offer an alternative approach to traditional therapy, engaging users with novel gestures and fostering emotional connections for those who wish to have pets but are allergic or concerned about the inconvenience of keeping pets.

EMO, a compact AI robot created by LivingAI, exemplifies advanced robotics by operating independently, engaging with humans, and adapting to its environment. It has garnered positive reviews from robot enthusiasts globally, quickly gaining popularity, particularly in Vietnam.

Figure 1.2 Robot Emo interacts with a person when it is being cuddled

Scientific and Practical Relevance

The pet robot industry has garnered significant attention from major technology companies globally, particularly in developed regions like the United States, Japan, and Europe, where substantial investments are being made in research and development. Numerous innovative products and projects have emerged within this dynamic field.

Robotic dogs and cats have emerged as successful companions for the elderly in nursing homes, providing comfort and interaction. These battery-operated pet robots are designed to resemble real pets in both appearance and size, featuring sensors that enable mobility and user interaction. When engaged, they can blink, nod, and produce realistic barking or meowing sounds, fostering a natural and enjoyable experience for users.

A study in Japan investigated the impact of pet robots on patients and individuals with disabilities, revealing notable enhancements in emotional expression, communication, overall satisfaction, and a decrease in feelings of loneliness among the participants.

Engaging in activities with pet-type robots can enhance the quality of life and alleviate loneliness in elderly individuals. This study compares the benefits of pet-type robot therapy to traditional animal-assisted therapy (AAT), emphasizing the potential of robots as a safe alternative in settings where interaction with live animals may increase the risk of infection.

Figure 1.3 The AIBO robot helps reduce the loneliness of the elderly

Pet robots offer significant educational benefits by enhancing skills like problem-solving, creativity, and social interaction. Additionally, they encourage critical thinking about the societal implications of technology.

Figure 1.4 Petoi Robot, a programmable open source robot dog for STEM education

Advancements in AI have significantly improved human-robot interaction, with machine learning algorithms allowing robots to adapt to individual user preferences for more personalized experiences. Natural language processing (NLP) enhances communication by enabling voice commands and context-aware responses. Additionally, progress in computer vision equips robots with the ability to recognize and interpret human gestures, emotions, and facial expressions. Furthermore, reinforcement learning enables robots to learn from human feedback, optimizing their actions in real time.

Figure 1.5 Tasks a robot can do with AI

In conclusion, pet robots have demonstrated immense potential in various fields, particularly in providing therapeutic benefits. These robots, designed to interact and engage with humans, provide more than companionship; they significantly enhance users' well-being by alleviating anxiety and memory loss. Their effectiveness is highlighted by the reduced reliance on certain medications due to interactions with these robots. As technology advances, we can expect the development of more sophisticated pet robots that will further transform healthcare and personal wellness. The success of these robotic companions demonstrates the transformative power of robotics and its potential to improve our lives.

Rationale

Recent advancements in the research and design of user-interactive robots in Vietnam have led to the development of various types, including humanoid and flower robots. However, the country still lags behind in the realm of pet robots, which are often viewed merely as recreational toys. This highlights the urgent need for research focused on designing intelligent pet robots that can foster companionship, offering both theoretical and practical benefits. The outcomes of this research could serve as a valuable addition to mental healthcare practices in Vietnam.

Research Objectives

• Main objective: To research, design, and manufacture a pet robot

▪ Design a pet robot with two rear wheels and two joints on either side of the body to change the angle of the wheel arms on each side

▪ Use embedded programming to control the robot’s position, posture, and face expression to create cute gestures

▪ Process images and sounds for interaction and to follow human commands

▪ Make interesting and thrilling features.

Research Method

• Develop mechanics and physical structure for the robot

Robots are designed to gather information from their environment, process this data, and respond accordingly. To achieve this, it is essential to identify the key components, including electrical circuits, materials, and software programs that facilitate these functions.

▪ Integrate sensors such as cameras, touch sensors, sound sensors so that the robot can recognize the surrounding environment and interact with the user

▪ Develop algorithms to process sensor data

▪ Use artificial intelligence models to improve the robot’s understanding and response capabilities

• Develop a user-friendly program for the robot to interact with humans

METHODOLOGY

Idea for Function Capabilities

To enhance a robot's lifelike qualities and foster user connections, incorporating appropriate features is essential. By enabling voice recognition, emotional expression, and natural interaction, robots can create a sense of personality and thoughtfulness. This thoughtful selection of functionalities not only promotes user acceptance but also positions the robot as both a practical tool and a reliable companion.

The list below contains all the features we decided on the strategy to develop for our robot:

The robot utilizes a speech recognition and communication system to collect sound signals and interpret user commands or questions, whether online or offline. By leveraging the ChatGPT API or the offline VC-02 module, it processes these inputs to generate appropriate verbal or gestural responses.

Emotion expression capabilities: Integrating mechanisms for the robot to express emotions through facial expressions using the LCD and body language by moving legs, ears, or the whole body

The goal is to enhance the robot's situational awareness by enabling it to recognize user faces and detect objects in its environment, thereby improving its ability to interact effectively with surrounding elements.

Auto navigation: By leveraging computer vision, the robot will navigate autonomously in its environment, avoid obstacles, and look around curiously like a pet

The robot serves as a valuable educational companion, assisting users with general information inquiries. Furthermore, it functions as a content creation tool, transforming textual prompts into images through its text-to-image model.

Users can engage in physical interaction with the robot by touching or gently caressing its head. This tactile engagement is detected by sensors, which relay the information to the microcontroller for processing. Such a mechanism facilitates a more natural and adaptable connection with the robot, fostering an intimate and captivating interaction experience.

Approach Method

The appearance of a pet robot plays a vital role in its evaluation, with designs that are widely accepted across cultures and genders being preferred. Modern consumers gravitate towards adorable and friendly aesthetics, as seen in both real-life products like the Loona and EMO robots and fictional characters such as WALL-E and Baymax. Research indicates that a cute appearance significantly enhances customer evaluations, purchase intentions, and the likelihood of positive word-of-mouth recommendations. Furthermore, robots that resemble animals, characterized by rounder, shorter, and smaller features, are perceived as more adorable, encouraging users to engage with them for longer periods.

We opted for a design that mimics a four-limbed animal without creating a fully functional robot with four limbs, aiming to reduce production costs and streamline development. Inspired by the iconic WALL-E robot, our pet robot features a rounded rectangular body, two curved limbs powered by high-torque RC servos, and wheels at the back for smooth movement. Additionally, we incorporated two robotic ears, which are revolute links attached to the head, enhancing its animal-like appearance.

The final design of the robot, depicted in Figure 2.1, features curved blocks and spiked back wheels, giving it a sturdy yet cute appearance. The front wheels function solely as caster wheels, designed with small friction surfaces to enhance kinematic performance. While primarily functional, the front wheels also contribute to the aesthetic, evoking the essence of four-limbed creatures. Overall, our design seeks to establish this pet robot as a unique robotic lifeform, distinct from any real animals.

In this part, for each function, we determine the crucial components and how to create the function.

Speech recognition and communication system: there are basically three steps for the robot to hear, understand, and respond.

• Speech Input and Recognition: The robot collects sound signals through a microphone and uses speech recognition to convert them into text. This can be done offline with the VC-02 module developed by Shenzhen Ai-Thinker Technology Co., Ltd. This low-cost module can automatically detect wake words or simple commands, allowing users to interact with the robot through voice requests. When connected to the internet, the voice data is recorded and transmitted as byte data to a computer, where it is processed using the Whisper model, a state-of-the-art speech-to-text model that converts spoken words into text.

Figure 2.2 VC-02 Ai Thinker Intelligent Offline Speech Recognition Module

Natural Language Processing (NLP) techniques are essential for analyzing the text to understand user intent and meaning, enabling accurate responses. By utilizing the ChatGPT API, we effectively generate text responses that align with user inquiries.

The robot generates appropriate responses based on user intent, delivering them through speech, visuals, or gestures. Utilizing a text-to-speech model, it produces a suitable voice for effective communication.

The robot's emotional expression is enhanced through face animation displayed on an LCD monitor, complemented by movable ears operated by small RC servos and two legs. Furthermore, it incorporates sound effects during its actions to enrich its overall expressiveness.

Research shows that people prefer iconic designs for their straightforwardness and clarity in communication. Additionally, incorporating a wide range of emotions and facial expressions in robots is crucial for showcasing emotional diversity, which enhances their realism and boosts user engagement.

Figure 2.4 Multiple emotions and states of our robot

Body language plays a crucial role in conveying emotions and intentions effectively. To ensure our robot's body language aligns seamlessly with its facial expressions and verbal communication, we meticulously design its movements. Initially, we develop animations that represent distinct stages or emotions. Next, we record the voiceover while referring to the animation. Lastly, we carefully choose the movements for the ears and legs to match the animation.

The camera captures images and sends them to our laptop, which serves as a server for face recognition data to help the robot track users. Additionally, the robot's object detection capabilities allow it to exhibit curiosity; when it identifies an object, it interacts with it in a playful manner.

By leveraging the vision, the camera serves as a means for the robot to avoid obstacles by employing the Depth Estimation model and using Fuzzy control algorithms

Figure 2.5 Depth map generated by Depth estimation model

Generative artificial intelligence (generative AI) is revolutionizing creativity by enabling the creation of images, music, and other artistic content, fostering innovation and exploration of new ideas. This technology allows users to experiment with various styles and expand the limits of traditional creativity. Our robot leverages generative AI to generate images based on specific user requests, utilizing a combination of speech-to-text and text-to-image models for a seamless experience that allows for image generation through voice commands.

THEORETICAL BASIS

Mechanical Basis: Mathematical Model

To develop a mathematical model for a mobile robot that operates solely when both legs are on the ground, we must analyze two key components: the general kinetic model of the mobile robot and the specifications of its DC motors. This approach simplifies the robot's movement to that of a basic differential mobile robot.

The robot operates as a differential drive mobile robot, with the left and right wheel speeds denoted as 𝜔𝐿 and 𝜔𝑅, respectively. Given the wheel radius 𝑅, we can determine the linear velocity of each wheel.

Then, using the instantaneous center of rotation, we have the linear and angular velocity:
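The equations themselves did not survive extraction; for a differential-drive base they take the standard form below, where the wheel separation (written here as L_b) is an assumed symbol not named in this extract:

```latex
% per-wheel linear velocities and the robot's linear/angular velocity (ICR method)
v_L = R\,\omega_L, \qquad v_R = R\,\omega_R
\qquad\Longrightarrow\qquad
v = \frac{v_R + v_L}{2}, \qquad \omega = \frac{v_R - v_L}{L_b}
```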

The final step in constructing the mathematical model is establishing the relationship between the input signal and the motor's output speed. This relationship is defined by the ratio \( k_s \), which correlates the input signal to the output speed, and the parameter \( \tau \), which directly influences the motor's settling time. Consequently, the motor's transfer function can be formulated based on these variables.
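As a sketch of what this transfer function looks like, a DC motor driven this way is commonly modeled as a first-order system in \( k_s \) and \( \tau \); the specific values identified later in Section 4.3 are not reproduced in this extract:

```latex
G(s) \;=\; \frac{\Omega(s)}{U(s)} \;=\; \frac{k_s}{\tau s + 1}
```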

AI Model Theoretical Basis

There are five functions, each with its own distinct model. In this section, we will explore each function and its theoretical foundational model.

3.2.1 Speech Recognition and Communication System

3.2.1.1 Speech Input and Recognition Model

a) Speech recognition definition:

Speech recognition, or automatic speech recognition (ASR), enables users to control electronic devices using their voice instead of traditional input methods like keyboards or buttons. This technology transforms spoken words and phrases into formats that machines can understand, allowing for hands-free operation. Typically, speech recognition systems are optimized for single-user interactions.

Sampling: In signal processing, sampling is the reduction of a continuous-time signal to a discrete-time signal. The sampling rate is the number of samples taken per second.

To effectively capture a signal, it is essential to have a minimum of two samples per cycle, one for the positive segment and one for the negative segment. While increasing the number of samples enhances accuracy, sampling fewer than two per cycle can lead to significant information loss. According to the Nyquist theorem, the sampling rate should be at least twice the highest frequency present in the signal to ensure accurate representation.

Most human-audible frequencies are below 10,000 Hz, so a sampling rate of 20,000 Hz is considered adequate. However, for telephone audio, which filters out frequencies above 4,000 Hz, a sampling rate of 8,000 Hz is suitable, and for microphone audio, 16,000 Hz is commonly used.
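Written out, the Nyquist condition referenced above is simply:

```latex
f_s \;\ge\; 2 f_{\max}
\qquad\text{e.g. } f_s = 16\,000\ \text{Hz covers content up to } 8\,000\ \text{Hz}
```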

Quantization involves the conversion of real numbers into integer values, specifically storing amplitude values as either 8-bit integers (from -128 to 127) or 16-bit integers (from -32768 to 32767). In this process, each sample at time 𝑛 is represented as 𝑥[𝑛].

Windowing is a crucial step in audio processing that involves extracting features from the waveform's quantized representation. Each segment of audio taken from the waveform is referred to as a frame. This process is defined by three key parameters: the window size or frame size, the frame stride or shift (which determines the offset between consecutive windows), and the shape of the window itself.

Figure 3.3 Showing a 25ms rectangular window with a 10ms stride

To extract the signal, we multiply the value of the signal at time index 𝑛, 𝑥[𝑛], by the value of the window at time index 𝑛, 𝑤[𝑛]:

𝑦[𝑛] = 𝑤[𝑛] · 𝑥[𝑛] (3.4)

The shape of the window is rectangular; however, a rectangular window truncates the signal at the boundary points. To avoid this, we often use a Hamming window, which shrinks the signal values to zero at the edges of the window, avoiding discontinuities. Let's see the equation:
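The equation itself was not carried over in this extract; the standard Hamming window it refers to is:

```latex
w[n] =
\begin{cases}
0.54 - 0.46\,\cos\!\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1\\[6pt]
0, & \text{otherwise}
\end{cases}
```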

Figure 3.4 Windowing a sine wave with a rectangular window or Hamming window

The Discrete Fourier Transform (DFT) is essential for extracting feature information from a signal window, enabling the analysis of discrete sequences or signal samples effectively.

The Discrete Fourier Transform (DFT) processes an input sequence 𝑥[𝑛]…𝑥[𝑚] to produce complex numbers 𝑋[𝑘] for each of the N discrete frequency bands, representing the amplitude and phase of the original signal's frequency components. By plotting the amplitude across the frequency domain, we can visualize the signal's spectrum. An example illustrates a 25ms Hamming window and its corresponding spectrum calculated using the DFT.
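For reference, the DFT of one N-sample window mentioned above has the standard form:

```latex
X[k] \;=\; \sum_{n=0}^{N-1} x[n]\, e^{-j\,2\pi k n / N}, \qquad k = 0, 1, \dots, N-1
```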

Figure 3.5 (a) A 25 ms Hamming-windowed portion of a signal from the vowel [iy] and (b) its spectrum computed by a DFT

The Mel filter bank and logarithmic scaling are essential for analyzing audio signals, as the Discrete Fourier Transform (DFT) reveals energy across frequency bands. Human perception of sound is not uniform; we are more sensitive to lower frequencies and less sensitive to higher ones. To address this, the Mel scale was introduced, with the Mel serving as a unit of pitch, allowing for a more accurate representation of auditory perception. The Mel spectrum is derived using a specific formula that transforms frequency data into a scale that aligns with human hearing.

Finally, we take the log of each value in the Mel spectrum
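The Mel-scale conversion referred to above is the standard formula, followed by the logarithm of each filter-bank energy; here H_m[k] denotes the m-th triangular Mel filter:

```latex
\operatorname{mel}(f) \;=\; 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
\text{log-Mel}_m \;=\; \log\!\Big(\textstyle\sum_{k} |X[k]|^2\, H_m[k]\Big)
```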

Figure 3.6 Audio information has been clarified by using the Mel filter and log

c) Speech recognition process:

Speech enhancement: Aims to improve speech quality by using various algorithms

The goal of speech enhancement is to enhance the clarity and overall quality of degraded speech signals through audio signal processing techniques. A key focus of this field is noise reduction, which is crucial for various applications including mobile phones, VoIP, teleconferencing systems, speech recognition, speaker diarization, and hearing aids.

Feature extraction involves transforming the speech waveform into a parametric representation that reduces the data rate for efficient processing and analysis. High-quality features are essential for achieving accurate classification outcomes.

Acoustic modeling: Acoustic models are used to represent the relationship between the extracted features and phonemes, which are the basic units of speech sounds [12]

Language models play a crucial role in phonetic unit recognition by analyzing the probabilistic relationships between words in a language. This enables the identification of the most probable word sequences based on the provided acoustic input. Additionally, the Whisper speech recognition model exemplifies advanced techniques in this domain.

In this process, we use the Whisper model, an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web [22]

The Whisper architecture utilizes an end-to-end encoder-decoder Transformer model to process audio input. It breaks down audio into small segments, transforms them into log-Mel spectrograms, and feeds them into an encoder. The decoder is trained to generate corresponding text captions, incorporating special tokens that enable the model to perform various tasks, including language identification, phrase-level timestamps, multilingual speech transcription, and translation into English.

The five available model sizes include four English-only versions, each designed to balance speed and accuracy. The models differ in memory requirements and inference speed compared to the large model, with actual performance varying based on hardware and other factors.

Table 3.1 The parameters of the Whisper model

Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed
tiny   | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x
base   | 74 M       | base.en            | base               | ~1 GB         | ~16x
small  | 244 M      | small.en           | small               | ~2 GB         | ~6x
medium | 769 M      | medium.en          | medium              | ~5 GB         | ~2x
large  | 1550 M     | N/A                | large               | ~10 GB        | 1x
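As a usage illustration tied to Table 3.1, a minimal transcription sketch with the open-source whisper package is shown below; the model size and file name are illustrative choices, not taken from the thesis:

```python
import whisper

# Pick a size from Table 3.1 to trade accuracy against speed and memory.
model = whisper.load_model("base")

# transcribe() runs the full pipeline (log-Mel features, encoder, decoder)
# and returns the decoded text plus per-segment timestamps.
result = model.transcribe("command.wav")
print(result["text"])
```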

3.2.1.2 Text Generation with Large Language Model

a) Text generation and Large Language Model definition

Text generation is an AI-driven process that creates written content by mimicking human language patterns and styles. This technology produces coherent and meaningful text, resembling natural communication, and has become increasingly important across various domains such as natural language processing, content creation, customer service, and coding assistance.

Large language models (LLMs) are advanced foundation models designed to process vast datasets, enabling them to comprehend and generate natural language as well as various content types. Their development reflects significant advancements in artificial intelligence, allowing them to perform diverse tasks effectively.

Language is a sophisticated system of human expression governed by unique grammatical rules, while machines require advanced artificial intelligence algorithms to effectively understand and communicate in human language.

DEVELOPMENT

Visual and Mechanical Development

To obtain the previously shown hardware of the robot with appropriate mobility and endurance, there are certain steps to go through

When designing a pet robot, the initial step is to determine its appearance, drawing inspiration from the iconic WALL-E and the commercially available Loona robot. The Loona robot offers distinct design advantages that can enhance the overall user experience.

• Two long limbs resemble a quadruped pet

• Look highly similar to a pet with movable ears

• Main color is white with some gray features, giving a friendly feeling

We admire the inspiring features of Loona and aspire to incorporate similar qualities into our robot. However, we do not anticipate achieving the same level of compactness as Loona in our design.

The design of a robot utilizing a Raspberry Pi 4 is influenced by the module's size and power consumption. Inspired by WALL-E, the robot features continuous tracks on either side and a cube-shaped main body that serves as a practical container. The rectangular body allows for efficient arrangement of components, enhancing the robot's functionality.

To enhance the lifelike appearance of our robot, we incorporated a neck similar to that of WALL-E and Loona, allowing for a more realistic design. This addition includes a chin that users can rub, mimicking the comforting gesture of rubbing a cat's chin.

The design process of the robot, illustrated in Figure 4.3, begins with a basic form composed of cubes. Next, we implement a fillet process to introduce appropriate curves with varying radii. Ultimately, the robot's body is split vertically, with the lower section serving as a container for power components and the upper section dedicated to elements that convey expression.

Designing the robot's side limbs presents significant challenges, particularly in the placement of components. It is crucial to maintain a distance between the wheels and the body during leg movement, while also accommodating the DC motors within the legs. Additionally, the size of the legs must be proportionate to the body to ensure a compact design.

We began by establishing a general shape with precise dimensions, as illustrated in Figure 4.4. To accommodate the DC motor, we designed the back part of the leg to be thicker while tapering the front section to minimize weight and enhance design balance. Finally, we refined the leg's appearance by applying fillets to achieve the final shape.

The design of the robot's ears, while seemingly minor, plays a crucial role in its overall aesthetics. We aim for an elegant curvature and appropriate size by first creating a curved plate as the base shape. To accommodate the microphone sensor and LED, we incorporate a gradual thickness that tapers towards the edges. The ear's shape is then refined through filleting, and the contrasting colors between the top and bottom sections enhance its visual appeal.

To enhance our robot's performance, we customized the front wheel to minimize friction during rotation, which can lead to kinetic failure. This design features a fixed wheel equipped with a caster ball for improved maneuverability. Additionally, we refined the edges with chamfering to enhance the overall aesthetics and functionality.

For such a compact setup, we need an appropriate arrangement of components and wiring. Components need to be placed in a space-saving manner while maintaining minimal distances between them.

Figure 4.7 Bottom Transparent View (without PCB)

The assembly of the major components is designed for efficiency, with the power block positioned at the front to accommodate the actuating RC servos at the back. To facilitate the installation of the L298N motor driver, the RC servos are elevated slightly. The switch and charging port are strategically placed behind the L298N, remaining below the top surface of the RC servos to allow the main PCB to sit on top. Additionally, the camera and speaker are cleverly positioned in front of the power block, with a bowtie added to conceal the camera's external parts. The remaining space is reserved for wires and connectors, ensuring a neat and organized setup.

The spacious design of the robot's head facilitates easy arrangement of its internal components. The front features a screen, while the Raspberry Pi 4 is securely mounted at the back. SG90 motors are positioned on either side, complemented by touch sensors located on the head and chin. The ears are strategically placed at an appropriate height, housing PCBs that contain the microphone sensors and LEDs.

Figure 4.10 Stress Analysis of Leg Support Plate

We install a lightweight yet highly durable 1mm INOX support plate in each leg of the robot, which has an estimated weight of 2.5kg. This results in each plate bearing a total load of 12.5N, and simulations confirm that the plate can easily withstand this weight.

Figure 4.11 Stress Analysis of Bottom Body

The body must support the total weight of the head and its internal components, with an estimated load of 20N. Although the lower part suffers significant stress compared to the walls, the body can still withstand the load without any noticeable deformation.

When calculating the requirements for the DC motors, two key factors must be considered: torque and power. The motors need to generate sufficient torque to enable the robot to overcome friction and achieve the desired acceleration. To begin, it is essential to calculate the necessary force that the robot must exert.

To achieve a desired acceleration of 2 m/s² with a mass of 2.5 kg, while overcoming a rolling friction coefficient of 0.1 under a gravitational acceleration of 9.8 m/s², a necessary force of

F = m·a + μ·m·g = 2.5 × 2 + 0.1 × 2.5 × 9.8 = 7.45 (N)

is required. Given that the wheels have a radius of 0.0425 m and the load is shared by two motors, the essential torque produced by each motor is

τ = F·R / 2 = 7.45 × 0.0425 / 2 = 0.158 (N·m) ≈ 1.5 (kgf·cm) (4.2)

About the motor speed, for the robot to reach 0.5 m/s, the rotational speed required is

ω = v / R = 0.5 / 0.0425 = 11.76 (rad/s) ≈ 113 (RPM) (4.3)

With all of these parameters, the required power of the DC motor is

P = τ·ω = 0.158 × 11.76 ≈ 1.86 (W)

Therefore, the chosen motor JGY370, with a rated torque of 1.5 kgf·cm, rated speed of 150 RPM, and maximum power of 15.6 W, satisfies the robot's requirements.

Electrical Development

To accommodate the robot's compact size, a reliable and efficient electrical system is essential. This system includes a versatile power supply that adapts to varying consumption rates, regulators for the microcontrollers, actuators, and peripherals, along with durable connections designed to handle the operating current of all components.

4.2.2 Power Supply and Regulator Choosing

The power consumption summary gives us requirements of the power supply:

• Minimum discharge rate: 12.06A (with safety factor of 0.6, the recommended discharge rate for the power supply is 20A)

A compact and efficient solution for the power needs is achieved by connecting three high-discharge-rate 18650-sized lithium-ion cells in series, resulting in a total voltage of 12.6V. With each cell providing a capacity of 2000mAh, the overall energy capacity of this battery pack reaches 25.2Wh. Additionally, the setup includes a 5A regulator for powering the microcontrollers and small servos at 5V, along with another 5A regulator designed for the actuating servos at 7.4V.

To know whether the capacity is appropriate for the robot to operate, we need to calculate the estimated average consumption

Table 4.2 Average Power Consumption Calculation

The whole system uses 21.65W on average; considering 10-20% power loss in the wires, we have a final average power of 23-25W, resulting in roughly one hour of operation from the battery's energy capacity.
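The runtime figure follows from simple arithmetic on the numbers above:

```latex
E = 12.6\ \text{V} \times 2\ \text{Ah} = 25.2\ \text{Wh},
\qquad
t \approx \frac{25.2\ \text{Wh}}{25\ \text{W}} \approx 1\ \text{hour}
```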

To compute the trace width, we have to define the internal operating temperature of the PCB to obtain a correct result. Given that the ambient temperature is 25°C, the temperature rise ΔT is 10°C, and the copper thickness is 1 oz/ft², we first calculate the needed cross-sectional area of the trace:

Area = (I / (k · ΔT^b))^(1/c) (4.7)

where k = 0.048, b = 0.44, c = 0.725. Then, the width is computed:

Width = Area / (1.378 · thickness)

The complete results for each type of trace are as follows:

30kgf.cm power servo transmission

DC motor power transmission (single)

ESP32 & VC-02 power supply (single)

Raspberry Pi 4 power supply (single)

Actual trace widths on PCBs may vary due to factors like available space and the width of adjacent traces Despite these variations, the traces are designed to function correctly, as a safety margin has been incorporated into the calculations.

4.2.3.2 PCB Components Choosing & Complete Design

Connections on the PCB must be considered carefully because overloading components can cause an intensive temperature rise and component failure, which lead to short-circuiting and fire. Therefore, when choosing components to install on the PCB, we should follow the components' recommended operating conditions.

30kgf.cm power servo transmission (single)

SG90 servo power transmission (single)

Signal pins 0.05 0.1 2.54mm Headers/PH2.0

ESP32 & VC-02 power supply (single)

Low-level and High-level Controller development

To effectively control the JGY370 motor, we will utilize the MATLAB System Identification Toolbox to determine the motor's transfer function. This will involve sampling data generated by the ESP32, which reads the motor's encoder by counting the number of pulses over a specific time interval. The speed of the motor can be calculated using the encoder's data.
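The speed formula itself is not reproduced in this extract; counting N pulses over the sampling interval Δt, it takes the usual form below, where PPR (encoder pulses per revolution) and the gearbox ratio i are assumed symbols rather than values given here:

```latex
\omega \;=\; \frac{2\pi\, N}{\mathrm{PPR} \cdot i \cdot \Delta t} \quad (\text{rad/s})
```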

Figure 4.12 confirms the correctness of the formula, as the maximum speed read approximates the maximum speed given by the manufacturer.

Figure 4.12 Input PWM signal and the corresponding motor angular speed (sampling time = 0.05s)

Putting the input signal and the result into MATLAB System Identification Tool, we have the transfer function of each motor:

To effectively design a PID controller, it's essential to establish the desired rise time and settling time for the system's response. While a rapid adaptation is ideal, the physical limitations of the system must be considered. Analyzing the collected data reveals that the input signal transitioned from -255 to 255 within 0.1 seconds, allowing the motor speed to approach its maximum positive value swiftly, stabilizing fully in an additional 0.1 seconds. Consequently, the rise time is determined to be 0.1 seconds, and the settling time is set at 0.3 seconds to ensure a smoother acceleration of the motor speed. These insights lead to the finalization of the PID parameters.
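A minimal sketch of the resulting discrete PID update at the 0.05 s sampling time; the firmware itself runs on the ESP32, and the gains below are placeholders rather than the finalized parameters, which are not reproduced in this extract:

```python
class PID:
    def __init__(self, kp, ki, kd, dt=0.05, out_min=-255, out_max=255):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured):
        error = setpoint - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(self.out_min, min(self.out_max, out))  # clamp to the PWM range

# Example: drive the measured wheel speed toward 10 rad/s (placeholder gains)
pid = PID(kp=8.0, ki=40.0, kd=0.1)
pwm = pid.update(setpoint=10.0, measured=7.5)
```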

Applying the PID controller, the result produced is:

Figure 4.13 Control result after PID application (sampling time = 0.05s)

RC servos operate by interpreting PWM signals to adjust the shaft's rotation angle. Typically, these generic servos respond to a 50Hz PWM signal with pulse widths between 500µs and 2400µs, translating to a duty cycle of 2.5% to 12%. This pulse width directly correlates to a rotation range of 0 to 180 degrees.

The generic servo library in the Arduino IDE disrupts the interrupt function, causing issues with the DC motor controllers. To address this, we utilize the same function for generating PWM signals for both the servos and the DC motors. However, specific parameters must be pre-defined before implementation. The PWM range is 9.5%, and to achieve a resolution of 1° or higher, this range must correspond to at least 180 discrete values.

Therefore, we choose the resolution of 12-bit (4096 values), and the low and high limit of the control value will be 103 to 491
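The limit values follow from the duty-cycle arithmetic; a small sketch of that arithmetic is shown below (in Python for illustration only, since the actual code runs in the Arduino environment on the ESP32):

```python
PERIOD_US = 20_000            # 50 Hz PWM -> 20 ms period
RESOLUTION = 4096             # 12-bit timer

def pulse_to_counts(pulse_us: float) -> float:
    return pulse_us / PERIOD_US * RESOLUTION

low, high = pulse_to_counts(500), pulse_to_counts(2400)
print(low, high)              # 102.4 and 491.52 -> control limits of about 103 and 491

def angle_to_counts(angle_deg: float) -> int:
    """Map 0-180 degrees linearly onto the 103-491 counter range (>= 1 degree resolution)."""
    return round(103 + (491 - 103) * angle_deg / 180.0)
```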

The linear velocity 𝑣 and angular velocity 𝜔 will be computed by a Fuzzy Logic Controller, where the inputs of the controller are the distances in several directions measured by the depth estimation process.

Figure 4.14 Fuzzy Logic Control Flowchart

Figure 4.15 Fuzzification graph and showcase (left proximity = 0.9, front proximity = 1.0, right proximity = 0.9, 𝑣 = 0, 𝜔 = 69%)

The input distances are normalized to a range of [0.0, 1.0], representing the farthest and nearest points detectable by the depth estimation model. The output speeds, including both linear and angular velocities, are scaled from 0 to 100%. As the showcase in Figure 4.15 demonstrates, when the three measured directions (left, front, right) are all blocked, the robot will stop moving linearly and proceed to rotate to find a direction with no obstacles.
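A simplified sketch of this behaviour is shown below; the membership shapes, rule set, and scaling are illustrative assumptions, not the exact design of Figures 4.14 and 4.15:

```python
def near(p):   # membership of "near" for a proximity in [0, 1] (1.0 = nearest)
    return max(0.0, min(1.0, (p - 0.5) / 0.4))

def far(p):    # membership of "far"
    return max(0.0, min(1.0, (0.9 - p) / 0.4))

def fuzzy_step(left, front, right):
    """Return (v %, w %) from three normalized proximities."""
    go_forward = far(front)                                   # rule: path ahead is clear
    blocked    = max(near(front), near(left), near(right))    # rule: something is close

    v = 100.0 * go_forward * (1.0 - near(front))              # stop linear motion when blocked ahead
    w = 100.0 * blocked                                       # rotate more the closer the obstacle
    direction = -1.0 if left > right else 1.0                 # turn away from the nearer side
    return v, direction * w

# Showcase similar to Figure 4.15: everything blocked -> v = 0, pure rotation
print(fuzzy_step(0.9, 1.0, 0.9))
```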

The fuzzy logic controller generates outputs for linear velocity (𝑣) and angular velocity (𝜔), but the computed motor speed can sometimes exceed the motor's maximum physical capability. To address this, we prioritize direction control by limiting the linear velocity based on the angular velocity.

Let the maximum motor speed be 𝜔_max. Using (3.2), we obtain (4.14); then, combining (4.13) and (4.14), we get the limit of the linear velocity.
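The missing expressions can be sketched from the kinematics in Section 3.1, with the wheel separation again written as L_b (an assumed symbol):

```latex
\omega_{R} \;=\; \frac{2v + \omega L_b}{2R} \;\le\; \omega_{\max}
\qquad\Longrightarrow\qquad
v \;\le\; R\,\omega_{\max} - \frac{|\omega|\,L_b}{2}
```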

Software development

With all the features decided, we came up with an overview diagram of the robot system:

Figure 4.16 Overview of robot system

MQTT is a lightweight messaging protocol that facilitates efficient communication between robots and their control systems, enabling real-time data exchange while minimizing resource strain. This technology allows for the seamless transmission of images and critical information, empowering robots to make quick, informed decisions.
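A minimal sketch of that link with the paho-mqtt client; the broker address, topic names, and JPEG payload are illustrative assumptions, since the extract does not list them:

```python
import paho.mqtt.client as mqtt

BROKER = "192.168.1.10"        # assumed address of the laptop running the broker/server

def on_message(client, userdata, msg):
    # e.g. velocity commands computed on the laptop and sent back to the robot
    print("command:", msg.topic, msg.payload.decode())

client = mqtt.Client()          # paho-mqtt 1.x style; on >= 2.0 pass mqtt.CallbackAPIVersion.VERSION1
client.on_message = on_message
client.connect(BROKER, 1883)
client.subscribe("robot/cmd")

with open("frame.jpg", "rb") as f:            # a camera frame captured on the Raspberry Pi
    client.publish("robot/camera", f.read())  # MQTT payloads can carry binary data

client.loop_forever()
```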

Utilizing a powerful GPU computer for data analysis enables complex artificial intelligence algorithms to run efficiently. This capability is essential for the robot to swiftly process and interpret information, facilitating its ability to navigate environments and detect obstacles.

By leveraging a powerful GPU to handle complex computational tasks, we can streamline the robot's onboard processor, ensuring it remains efficient while delivering advanced intelligence and functionality. The following subsections explore the key features of the robot and the development processes behind them.

4.4.1 Speech Recognition and Communication System

To enhance user interaction with our robot, we have implemented a wake word system that identifies when assistance is needed, preventing unnecessary data processing and resource waste. The VC-02 module enables efficient offline voice recognition, allowing the robot to understand 150 commands in English or Chinese at a low cost. Users can easily set up and optimize these commands by accessing the training software on Ai-Thinker's homepage. To initialize the VC-02 commands, visit http://voice.ai-thinker.com/#/login and select the appropriate configuration before configuring the control commands through front-end signal processing.

Figure 4.17 Set up the configuration for input audio processing

The default configuration options are as follows:

• Microphone configuration: The default is single MIC because the VC-02 module integrates only one microphone

• Distance recognition: Choose between far-field or near-field sound recognition

• AEC echo cancellation: When enabled, it can filter out the echo caused by the VC-02's own audio output

• Steady-state noise reduction: When enabled, it can reduce environmental noise

After that, we configure the pins of the VC-02. There are two steps: first, configure the default level for each pin; then, configure the output level for each pin:

Figure 4.18 Control details for each pin

A basic command has these options below:

• Command word: which is the command we want the user to use to communicate with the robot

• Control type: the specific pin that will be triggered when it recognizes a command

There are two types of actions that can be triggered by a pin: settings, which allow you to configure the pin for low or high levels, and pulse, which enables the pin to generate customizable pulses with a specified period.

When users greet the pet robot by saying "Hello Ivy" or "Hey Ivy," the robot identifies the wake word and prepares to receive commands. With the integration of the VC-02, default commands such as "Sing a song," "Go around," and "Shut down" are programmed into the robot, ensuring it functions effectively even without a Wi-Fi connection.

The online communication system of the robot, enabled by a Wi-Fi connection, allows it to chat with users similarly to ChatGPT. Users can initiate conversations with the robot, and this feature can also be activated from offline mode. By connecting a microphone to the Raspberry Pi, the robot captures the user's voice and utilizes the Whisper model to convert audio files into text through the OpenAI API. This cloud-based service supports over 90 languages, enhancing the robot's ability to communicate effectively and adaptively with users from diverse linguistic backgrounds.

Once we obtain the text from the Whisper model, we create a chatbot system using the ChatGPT API. The ChatGPT API generates a response based on the transcribed text.

To preserve chat history, we add both the user's input and the bot's response after each interaction, ensuring that recent exchanges are included when generating new responses, thereby providing essential context for the AI model.

To achieve a desirable response from the robot, such as being cute, helpful, or concise, we adjust the model's system role This system role consists of a developer-written message that directs the bot's interpretation and responses during conversations, allowing it to prioritize these instructions over the ongoing dialogue.

Meet Ivy, an innovative pet robot created by three students in Vietnam. This cool AI is designed to explore the world with curiosity, drawing images and answering questions. Ivy is here to assist users with a variety of tasks, always keeping responses concise at around 70 words. While some text may contain spelling errors due to transcription, Ivy is adept at deciphering the intended meaning.
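A compact sketch of this transcribe-then-chat loop, assuming the openai (>= 1.0) Python client; the model names, system prompt wording, and helper name are illustrative stand-ins, not the thesis's exact code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [{"role": "system",
            "content": "You are Ivy, a cute pet robot. Keep answers around 70 words."}]

def ask_ivy(wav_path: str) -> str:
    # 1) Speech to text with the hosted Whisper model
    with open(wav_path, "rb") as audio:
        text = client.audio.transcriptions.create(model="whisper-1", file=audio).text

    # 2) Chat completion with the running history so the reply keeps context
    history.append({"role": "user", "content": text})
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```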

Figure 4.19 Basic illustration of ChatGPT chatbot system

In addition to engaging in conversation, the ChatGPT API offers a function calling feature, allowing us to receive requests for function execution instead of traditional message responses. This dynamic approach enables the model to intelligently determine the appropriate moments to invoke specific functions to perform tasks or retrieve information.

Our robot has the capability to create any image you request, simplifying the process of generating art. Unlike traditional programming, which requires anticipating every possible variation in requests and often leads to unsatisfactory results, our approach utilizes function calling. This allows the model to accurately interpret your request and invoke a painting function, ensuring that the final output closely matches your expectations.
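A sketch of how such a painting function can be exposed through function calling, again assuming the openai client; the tool schema and the DALL-E call are illustrative, since the extract does not name the text-to-image model used:

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "draw_image",
        "description": "Generate a picture from a text description",
        "parameters": {
            "type": "object",
            "properties": {"prompt": {"type": "string"}},
            "required": ["prompt"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Can you draw an astronaut on a horse?"}],
    tools=tools,
)

call = resp.choices[0].message.tool_calls[0]        # the model chose to call draw_image
prompt = json.loads(call.function.arguments)["prompt"]
image = client.images.generate(model="dall-e-2", prompt=prompt, size="512x512")
print(image.data[0].url)                            # URL of the generated picture
```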

After obtaining the desired text output, we utilize the tts-1 model from the OpenAI API to convert the message into speech. This process generates a sound file that is played through a speaker connected to the Raspberry Pi. Although the default voice is impressively human-like, it tends to be monotonous. To improve the speech output, we customize the voice by raising its pitch and slightly increasing its speed, infusing the robot's speech with more liveliness and energy.
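A short sketch of the speech step, assuming the openai client; the voice name and speed value are illustrative, and the pitch shift mentioned above would be applied to the saved file afterwards with a separate audio tool:

```python
from openai import OpenAI

client = OpenAI()
speech = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    speed=1.1,                       # slightly faster for a livelier delivery
    input="Hello! I'm Ivy. What would you like to do today?",
)
speech.stream_to_file("reply.mp3")   # played back through the speaker on the Raspberry Pi
```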

Figure 4.20 Flowchart of communication system

Utilizing the YOLOv8 vision function, we enhance object recognition by leveraging a model pre-trained on the COCO dataset, allowing for high-accuracy perception of various surroundings. The robot can identify different animals and mimic their sounds, creating an engaging and educational experience for users. This capability fosters interactions with animals like cows, cats, dogs, and ducks, making the robot an enjoyable companion for all ages. Additionally, the model incorporates facial recognition to identify its owner and employs an object-tracking algorithm, enabling the robot to follow the user seamlessly.
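A minimal detection sketch with the ultralytics YOLOv8 API and COCO-pretrained weights; the animal-to-sound mapping is an illustrative stand-in for the behaviour described above:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # nano model pretrained on COCO
results = model("camera_frame.jpg")        # accepts file paths, arrays, or streams

ANIMAL_SOUNDS = {"cow": "moo", "cat": "meow", "dog": "woof", "bird": "tweet"}
for box in results[0].boxes:
    label = model.names[int(box.cls)]      # class index -> COCO class name
    if label in ANIMAL_SOUNDS:
        print(f"Detected a {label} -> play '{ANIMAL_SOUNDS[label]}' sound")
```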

Figure 4.21 Face recognition with object tracking

Our advanced vision system utilizes an image-to-text model through the OpenAI API, allowing the robot to articulate its visual surroundings. This technology transforms visual information into descriptive text, serving multiple functions, including offering verbal descriptions for visually impaired individuals and narrating actions and scenes for educational applications.
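A sketch of the image-to-text call, assuming the openai client and a vision-capable chat model; the model name and prompt are illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()

with open("camera_frame.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what the robot is seeing in one sentence."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```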

Figure 4.22 Input image and the text received by image-to-text model

EXPERIMENT AND RESULT

Prototype Overview

After assembly, here is our prototype of the pet robot:

The robot features a simple and sleek design with soft, rounded shapes that create a friendly appearance. Its large blue eyes serve as a focal point for user interaction, fostering a sense of connection. The clean white coloring enhances its approachable aesthetic, complemented by a bowtie that seamlessly integrates the camera. Additionally, the RGB ears are set to an idle blue color, matching the eyes. Below are the specifications of this prototype.

In terms of functionalities, here are the details of our achievements:

Functionality | Availability | Completion | Remaining Problems/Potential Improvement
Planar Movement | Yes | 80% | Occasional movement block when the caster ball gets stuck in floor grooves
Leg Movement | Yes | 95% | Occasional servo jittering due to high-speed response
Ear Movement | Yes | 100% | Further synchronization with leg movement may be developed
| Yes | 95% | Commands may occasionally be ignored due to unclear pronunciation
| Yes | 100% | A more varied set of emotional expressions may be developed
| Yes | 100% | Multiple other APIs should be tested to find the one with the best performance
Chat with users | Yes | 100% | Reduce the time to receive chat replies
| Yes | 100% | Further entertainment during waiting time can be included
Follow one user | Yes | 80% | Occasional misjudgment may occur due to internet transmission delay

Further inspection into the battery life:

Usage Case Description Battery Life

Continuous camera usage, wheel movements and leg movements 30 minutes

High Constant camera usage, planar movements command every 10 seconds and leg lifting and lowering every 20 seconds

50% of camera usage, 50% of speaker usage, movements every minute 50 minutes

Medium 1 minute of respective behaviors: wheel movements, leg movements, describe what it sees, then rest for 30 seconds

No camera usage, movements every two minutes, offline answers only, interaction every minute

Low Movements limited to integrated ones into other commands, interactions every two minutes

No wheels and legs movements, only ears movements and occasional interaction

Hardware Experimenting

• Planar movement: Forward, backward, rotate

• Record the number of stalls or failures out of 100 attempts on a flat surface

• Leg lifting test: lift 10°, 20° and 30° respectively

• Leg holding test: hold at the lifting angle for 2s

• Repeat the experiment 10 times for each angle, each motor, record the successful and failing cases

• Move each ear to 0°, 45°, 90°, 135° and 180° for 100 times

• Make each ear wave randomly around 90° for 2s, 100 times

Table 5.4 Results of DC Motor Testing

Type of movement Number of tests Number of successes Success Rate

• The success rate of flat movements is 87-95%, thanks to the high gear ratio of the DC motor with a gearbox

Failures often occur when the front caster wheel and rear wheel spikes become lodged in floor grooves. To address this issue, it is recommended to design specialized tires rather than relying on existing tire options.

Overall, the wheels work, with occasional failures.

5.2.2.2 RC Servo Motor 30kgf.cm at the Leg

Table 5.5 Results of Actuating RC Servo Motor Testing

Mode Angle Side Success Rate

• For the forward and backward rotation movements, although the legs achieved the angle, there was a shaking phenomenon when stopping

In the leg lifting and holding experiments, the motor effectively achieved the desired angle during leg lifts; however, it occasionally struggled to maintain that angle for the full 2 seconds during the holding phase. Consequently, the success rate for holding the leg was slightly lower than that for lifting it.

Figure 5.2 The robot lifts each leg

Overall, there were no failures in leg movements, but occasional jittering occurs.

5.2.2.3 SG90 Motor at the Ear

Table 5.6 Results of SG90 Motor Testing

Mode Case Side Success Rate

Rotating at each angle 0°, 45°, 90°, 135°, 180° Left 100 %

• When rotating to each angle, the motor operates completely stably

• During oscillation, due to the high-speed operation, there is a rare shaking phenomenon when stopping at the 90° angle

Overall, ear movements work excellently.

Software Experimenting

5.3.1.1 Speech recognition and communication system

• Test the wake word detection accuracy in various noise environments (quiet room, moderate noise, loud noise)

• Record the number of successful detections out of 100 attempts in each environment

• Use a set of 50 predefined phrases of varying lengths and complexities

• Measure the time taken to transcribe each phrase and calculate the accuracy rate by comparing transcriptions to the original phrases

• Simulate different conversational scenarios (e.g., casual chat, asking for help, storytelling)

• Measure response time from input to output and evaluate the relevance and appropriateness of responses using a scoring system (1-5 scale)

• Use a set of 30 common objects from the COCO dataset. Measure detection accuracy when identifying each object

• Perform tracking tests where the robot follows a person walking

• Test the recognition accuracy with a database of 5 faces (including the robot's owner). Measure the accuracy rate in different lighting conditions

• Let the robot move around with obstacles and see if it can avoid them

• Watch how the robot moves in open environments

• Provide the robot with 20 different image prompts (e.g., "astronaut on a horse")

• Measure the time taken to generate each image and evaluate the quality of the generated images

The voice recognition system excels in quiet environments, effectively detecting wake words and default commands. However, in noisy settings, the robot may struggle to hear, requiring users to speak louder and articulate their words more clearly.

The robot boasts impressive hearing capabilities, enabling it to swiftly transcribe short sentences into text, typically within 700ms, influenced by factors such as sentence length and Wi-Fi stability. To maintain quick responses without compromising quality, we utilize the ChatGPT API role system for brevity. However, a significant challenge is the text-to-speech processing time, which can reach up to 6 seconds for approximately 100 words. To mitigate this, we show animations of the robot "thinking" during processing. Looking ahead, we are considering options to present text in real time, allowing users to follow along seamlessly.

Figure 5.3 The robot looks at the book to find the answer

The robot possesses the ability to express a wide range of emotions, including happiness, sadness, disgust, excitement, surprise, and curiosity. However, it's important to note that these emotional expressions are based on predefined patterns and responses, lacking genuine understanding or interpretation of emotions.

Figure 5.4 The robot is happy when being caressed and angry when being hit on the head

Face recognition, object detection, object tracking: The robot can detect the owner's face, track, and follow them effectively. Additionally, it can recognize over 30 everyday objects and interact with them playfully.

The robot is equipped with auto navigation capabilities, allowing it to move independently and occasionally avoid obstacles in its path. However, its limited camera viewing angle restricts its ability to detect and evade sudden obstacles approaching from the sides.

Education and creative expression are enhanced by a robot capable of generating images from the user's imagination. However, the process can occasionally fall short of expectations, and it typically takes about 8 seconds to produce a high-quality image. To mitigate the wait time, we have introduced an animation of the robot creating the final picture, keeping users engaged while they wait for the result.

CONCLUSION

Conclusion

The team has successfully developed an innovative pet robot that not only moves and performs various actions but also observes its environment, avoids obstacles, and recognizes individuals. This interactive companion engages users in conversation, assists them, and even creates images to spark creativity. With its cute and attractive design, featuring smoothly moving ears that enhance its lively appearance, the robot is sure to capture the hearts of users.

Despite advancements in robotic technology, challenges remain in the development of autonomous robots. Currently, these robots are unable to make independent decisions and depend heavily on pre-programmed scenarios. Additionally, certain features of their vision systems, like environmental awareness, are not fully leveraged, resulting in limited interactive capabilities.

Future development

• Strive to make the robot's movements smoother

• Use and integrate OpenAI API more effectively to combine the imaging system and chatbot system, enabling the robot to make more autonomous decisions

• Optimize the hardware to make the robot more compact

• Equip the robot with a camera with a wider viewing angle for better obstacle observation

[1] Kanamori M, Suzuki M, Tanaka M “Maintenance and improvement of quality of life among elderly patients using a pet-type robot” Nihon Ronen Igakkai zasshi Japanese Journal of Geriatrics 2002 Mar;39(2):214-218 DOI: 10.3143/geriatrics.39.214 PMID: 11974948

[2] Obaigbena, A., Lottu, O A., Ugwuanyi, E D., Jacks, B S., Sodiya, E O., Daraojimba, O D., & Lottu, O A (2024) AI and human-robot interaction: A review of recent advances and challenges GSC Advanced Research and Reviews, 18(2), 321–330 https://doi.org/10.30574/gscarr.2024.18.2.0070

[3] Luo, C (2023) A voice recognition sensor and voice control system in an intelligent toy robot system Journal of Sensors, 2023, 1–8 https://doi.org/10.1155/2023/4311745

[4] Leyzberg, D., Avrunin, E., Liu, J., & Scassellati, B (2011) Robots that express emotion elicit better human teaching HRI ’11: Proceedings of the 6th International Conference on Human-robot Interaction https://doi.org/10.1145/1957656.1957789

[5] Berque, D., et al. (2021). Cross-cultural design and evaluation of robot prototypes incorporating kawaii (cute) attributes. In Cross-Cultural Design: Applications in Cultural Heritage, Tourism, Autonomous Vehicles, and Intelligent Agents (HCII 2021). Springer.

[6] Lv, X., Luo, J., Liu, Y., Lu, Y., & Li, C. (2022). Is cuteness irresistible? The impact of cuteness on customers' intentions to use AI applications. Tourism Management, 90, 104472. https://doi.org/10.1016/j.tourman.2021.104472

[7] Lehmann, H., Sureshbabu, A V., Parmiggiani, A., & Metta, G (2016) Head and face design for a new humanoid service robot In Lecture notes in computer science (pp 382–391) https://doi.org/10.1007/978-3-319-47437-3_37

[8] Zabala, U., Rodriguez, I., Martínez-Otzeta, J M., & Lazkano, E (2021) Expressing Robot Personality through Talking Body Language Applied Sciences, 11(10), 4639 https://doi.org/10.3390/app11104639

[9] Pahwa, Rohit, Tanwar, Harion, & Sharma, Sachin (2020). Speech Recognition System: A review. International Journal of Future Generation Communication and Networking, 13, 2547-2559.

[10] J. Benesty, S. Makino, J. Chen (Eds.), Speech Enhancement, pp. 1-8. Springer, 2005.

[11] Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiking; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). “The Speed Submission to DIHARD II: Contributions & Lessons Learned”. arXiv:1911.02388 [eess.AS].

[12] S Ajibola Alim and N Khair Alang Rashid, ‘Some Commonly Used Speech Feature Extraction Algorithms’, From Natural to Artificial Intelligence - Algorithms and Applications IntechOpen, Dec 12, 2018 doi: 10.5772/intechopen.80419

[13] Kortli Y, Jridi M, Al Falou A, Atri M Face Recognition Systems: A Survey Sensors 2020; 20(2):342 https://doi.org/10.3390/s20020342

[14] Napoléon, T.; Alfalou, A Pose invariant face recognition: 3D model from single photo Opt Lasers Eng 2017, 89, 150–161

[15] Vinay, A.; Hebbar, D.; Shekhar, V.S.; Murthy, K.B.; Natarajan, S Two novel detector-descriptor based approaches for face recognition using sift and surf Procedia Comput Sci 2015, 70, 185–197

[16] Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F Z., Daniele, A F., Mostajabi, M., Basart, S., Walter, M R., & Shakhnarovich, G (2019, August 1) DIODE: a dense indoor and outdoor DEpth dataset arXiv.org https://arxiv.org/abs/1908.00463

[17] Ali, S., Ravi, P., Williams, R., DiPaola, D., & Breazeal, C. (2024). Constructing Dreams Using Generative AI. Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23268-23275. https://doi.org/10.1609/aaai.v38i21.30374

[18] Stanislav Frolov, Tobias Hinz, Federico Raue, Jörn Hees, Andreas Dengel, Adversarial text-to-image synthesis: A review, Neural Networks, Volume 144, 2021, Pages 187-209, ISSN 0893-6080, https://doi.org/10.1016/j.neunet.2021.07.019

[19] Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., et al. Generative adversarial nets. Advances in Neural Information Processing Systems (2014), pp. 2672-2680.

[20] Mirza M., Osindero S. Conditional generative adversarial nets (2014). arXiv:1411.1784.

[21] Liliane Momeni, et al., “Seeing wake words: Audio-visual Keyword Spotting”, 2020.

[22] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via Large-Scale Weak Supervision. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2212.04356

[23] Introducing Whisper. (n.d.). OpenAI. https://openai.com/index/whisper/

[24] Awan, A. A. (2023, May 24). What is Text Generation? https://www.datacamp.com/blog/what-is-text-generation

[25] Zhao, W. X., et al. (2023). A Survey of Large Language Models. arXiv:2303.18223.

[26] Tan, X., Qin, T., Soong, F K., & Liu, T (2021) A survey on Neural Speech Synthesis arXiv (Cornell University) https://doi.org/10.48550/arxiv.2106.15561

[27] Zhats. (2023, July 14). How Text-to-Speech models work: Theory and practice. It-Jim. https://www.it-jim.com/blog/how-text-to-speech-models-work-theory-and-practice/

[28] Sutskever, I., Vinyals, O., & Le, Q V (2014) Sequence to Sequence Learning with Neural Networks arXiv (Cornell University) https://doi.org/10.48550/arxiv.1409.3215

[29] Hu, D (2018, November 12) An introductory survey on attention mechanisms in NLP problems arXiv.org https://arxiv.org/abs/1811.05544


[30] Wolfe, C R., PhD (2023, September 4) Language model training and inference: From concept to code Deep (Learning) Focus https://cameronrwolfe.substack.com/p/language-model-training-and-inference

[31] Yenduri, G., Ramalingam, M., Selvi, G. C., Supriya, Y., Srivastava, G., Maddikunta, P. K. R., Raj, G. D., Jhaveri, R. H., Prabadevi, B., Wang, W., Vasilakos, A. V., & Gadekallu, T. R. (2024). GPT (Generative Pre-trained Transformer) – a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access, 1. https://doi.org/10.1109/access.2024.3389497

[32] Wikipedia contributors (2024, May 21) ChatGPT Wikipedia https://en.wikipedia.org/wiki/ChatGPT

[33] OpenAI. (2022, November 30). Introducing ChatGPT. https://openai.com/index/chatgpt/

[34] Fedir Zadniprovskyi. Faster Whisper transcription with CTranslate2. https://github.com/SYSTRAN/faster-whisper

[35] Chuang, C., Yao, C., & Wu, S. (2020). Study on Fast Charging Method of Series Connected Lithium-Ion Battery Strings with Intelligent Control. 2020 International Conference on Fuzzy Theory and Its Applications. https://doi.org/10.1109/ifuzzy50310.2020.9297813
