Questions and answers management
Question management screen: To ensure the safety of the content, the admin can review it before displaying it to the user. The management screen allows the admin to see all existing questions and to preview, confirm, or delete them.
When viewing a specific question, the admin can moderate the content of its answers.
We provide functions to browse and delete replies, and to filter replies by all, approved, and unapproved status, making it easier for administrators to moderate the content.
Activity diagram when a user creates a question or answer:
Figure 4-42 Question preview and answer management
(Activity diagram steps: choose question → display notification messages to the user → publish it on the user's page.)
Contributions management
Contributions management screen: To ensure the safety of the content, the admin can review each contribution. The management screen allows the admin to see all contributions from users; after checking that the image belongs to the species, the admin can add the image to that species.
(Screenshot: contributions management screen with columns #, Thumbnail, Sender name, Sender information, Discovery location, Notes, Confirmation, Creation date, and Actions.)
Activity diagram when a user contributes an image to the system:
AUTOMATED FLOWER CLASSIFICATION
Flower classification based on Vision Transformer (ViT)
The Vision Transformer (ViT) model was introduced in a research paper published as a conference paper at ICLR 2021, titled “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. It was developed and published by Neil Houlsby, Alexey Dosovitskiy, and 10 more authors of the Google Research Brain Team.
Transformer models have become the de-facto standard in natural language processing (NLP). In computer vision research, there has recently been a rise in interest in Vision Transformers (ViTs) and Multilayer Perceptrons (MLPs).
In 2021, the Vision Transformer (ViT) emerged as a competitive alternative to convolutional neural networks (CNNs), which are currently state of the art in computer vision and therefore widely used in different image recognition tasks. ViT exhibits extraordinary performance when trained on enough data, surpassing the performance of a comparable state-of-the-art CNN while using 4x fewer computational resources. ViT achieves remarkable results compared to CNNs while requiring fewer computational resources for pre-training. The ViT is a visual model based on the architecture of a transformer originally designed for text-based tasks. These transformers have achieved high success rates in NLP and are now also applied to images for image recognition tasks. A CNN works on pixel arrays, whereas ViT splits the image into visual tokens: it divides an image into fixed-size patches, embeds each of them, and includes positional embeddings as input to the transformer encoder. Moreover, ViT models outperform CNNs by almost four times in terms of computational efficiency and accuracy.
The architecture of the model consists of 3 main components:
- Linear Projection of Flattened Patches
- Transformer Encoder
- MLP Head for classification
5.3.4.3 Linear Projection of Flattened Patches
Patch Embedding: ViT divides an image into a grid of square patches. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, ViT reshapes the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW / P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses a constant latent vector size $D$ through all of its layers, so ViT flattens the patches and maps them to $D$ dimensions with a trainable linear projection (Eq 5-3). ViT refers to the output of this projection as the patch embeddings.
¹² An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
$z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \dots;\, x_p^N E] + E_{pos}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D}, \; E_{pos} \in \mathbb{R}^{(N+1) \times D}$ (Eq 5-3)
Figure 5-9 The intuition of the Linear Projection block before feeding into the Encoder¹³
Class Embedding: Similar to BERT's [class] token, ViT prepends a learnable embedding to the sequence of embedded patches, whose state at the output of the Transformer encoder ($z_L^0$) serves as the image representation $y$. Both during pre-training and fine-tuning, the classification head is attached to $z_L^0$.
¹³ https://blog.paperspace.com/vision-transformers/
Figure 5-10 N is the number of patches from the image¹⁴
Positional Embeddings: Position embeddings are added to the patch embeddings to retain positional information. ViT uses standard learnable 1D position embeddings. The resulting sequence of embedding vectors serves as input to the encoder.
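To make Eq 5-3 concrete, below is a minimal PyTorch sketch of the embedding stage (patch flattening, trainable linear projection, learnable class token, and positional embeddings). The module name and hyperparameter values (224x224 images, 16x16 patches, D = 768) are illustrative assumptions, not taken from the original implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of Eq 5-3: flatten patches, project to D dims,
    prepend a learnable [class] token, and add positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2            # N = HW / P^2
        patch_dim = in_channels * patch_size * patch_size      # P^2 * C
        self.proj = nn.Linear(patch_dim, dim)                  # trainable projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos

    def forward(self, x):                                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # split into non-overlapping P x P patches and flatten each to P^2 * C
        x = x.unfold(2, P, P).unfold(3, P, P)                  # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, P^2*C)
        x = self.proj(x)                                       # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)                 # (B, 1, D)
        z0 = torch.cat([cls, x], dim=1) + self.pos_embed       # Eq 5-3
        return z0
```

For a 224x224 RGB image with 16x16 patches, N = 196 and the output z0 has shape (batch, 197, 768).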
¹⁴ https://blog.paperspace.com/vision-transformers/
¹⁵ https://blog.paperspace.com/vision-transformers/
5.3.4.4 Transformer Encoder
The sequence of patch embeddings is fed as input to a standard Transformer encoder, which consists of the following components:
- Multi-Head Self-Attention Layer (MSA): concatenates all the attention outputs linearly to the right dimensions. The many attention heads help capture local and global dependencies in an image.
- Multi-Layer Perceptron (MLP) Layer: this layer contains two layers with a Gaussian Error Linear Unit (GELU) non-linearity.
- Layer Norm (LN): this is added prior to each block, as it does not introduce any new dependencies between the training images, thereby helping to improve the training time and overall performance.
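A minimal sketch of one encoder block combining the three components above (pre-norm LayerNorm, multi-head self-attention, and a two-layer GELU MLP), using PyTorch's built-in nn.MultiheadAttention; the dimensions and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: LN -> MSA -> residual, then LN -> MLP(GELU) -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):                   # z: (B, N+1, D)
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h)    # multi-head self-attention
        z = z + attn_out                    # residual connection
        z = z + self.mlp(self.norm2(z))     # MLP block with residual connection
        return z
```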
Self-Attention layer: This layer is the main component of each block in the Transformer encoder.
¹⁶ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Self-Attention Layer (Parameters: $W_Q$, $W_K$, $W_V$)
The input to the self-attention layer is a sequence $X = [x_1, x_2, x_3, \dots, x_m]$.
The output of the self-attention layer is a context vector $C$ containing the most important information of the input sequence, $C = [c_1, c_2, c_3, \dots, c_m]$.
The parameters of this layer include $W_Q$, $W_K$, $W_V$.
Below is a figure showing how the self-attention layer works:
Step 1: For each $x_i$ of the input $X$, compute the values $q_i$, $k_i$, $v_i$ according to the formulas $q_i = W_Q x_i$, $k_i = W_K x_i$, $v_i = W_V x_i$.
Step 2: Calculate the alignment scores corresponding to $x_i$ according to the formula $\alpha_i = \mathrm{softmax}(K^T q_i)$.
¹⁷ https://viblo.asia/p/vision-transformer-for-image-classification-ORNZqV7810n
Step 3: Calculate the context vector $c_i$ corresponding to $x_i$ according to the formula $c_i = \alpha_{1i} v_1 + \alpha_{2i} v_2 + \dots + \alpha_{mi} v_m$.
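The three steps above can be written directly in a few lines of PyTorch. In this sketch the matrices $W_Q$, $W_K$, $W_V$ are modeled as linear layers without bias; the dimension is an illustrative assumption, and no scaling factor is applied, matching the formulas as given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttention(nn.Module):
    """Sketch of steps 1-3: q_i = W_Q x_i, k_i = W_K x_i, v_i = W_V x_i,
    alpha_i = softmax(K^T q_i), c_i = sum_j alpha_ji * v_j."""
    def __init__(self, dim):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                  # x: (m, dim) -- sequence of m vectors
        q = self.W_q(x)                    # step 1: queries
        k = self.W_k(x)                    # step 1: keys
        v = self.W_v(x)                    # step 1: values
        scores = q @ k.t()                 # step 2: alignment scores, shape (m, m)
        alpha = F.softmax(scores, dim=-1)  # step 2: softmax over the keys
        c = alpha @ v                      # step 3: context vectors, shape (m, dim)
        return c
```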
Multi-head attention is simply a stack of self-attention layers. For example, if a multi-head attention layer has $l$ self-attention heads and the output of each self-attention head has size $d \times m$, then the output of the multi-head attention will be $(l \cdot d) \times m$.
Multi-Head Self-Attention Layer
MLP Head: This part is simply an MLP (multilayer perceptron) block that takes as input the context vector $c$ returned from the Transformer encoder and outputs the final result as the probabilities corresponding to the classes.
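As a minimal sketch, the classification head can be a single linear layer over the class-token representation followed by a softmax over the 102 flower classes; the tensor below is a random placeholder and the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative classification head: maps the class-token vector to class probabilities.
mlp_head = nn.Linear(768, 102)       # D -> number of classes

z_L = torch.randn(8, 197, 768)       # example encoder output (batch, N+1, D)
logits = mlp_head(z_L[:, 0])         # use the [class] token representation
probs = logits.softmax(dim=-1)       # probabilities over the 102 classes
```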
¹⁸ https://viblo.asia/p/vision-transformer-for-image-classification-ORNZqV7810n
(Figure: example classification output showing probabilities over classes such as bird, car, cat, dog, fox, jet, snake, and tiger.)
The overall architecture of the vision transformer model is given as follows in a step-by-step manner:
- Split an image into patches (fixed sizes)
- Create lower-dimensional linear embeddings from these flattened image patches
- Feed the sequence as an input to a state-of-the-art transformer encoder
- Pre-train the ViT model with image labels, which is then fully supervised on a big dataset
- Fine-tune on the downstream dataset for image classification
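As a hedged illustration of this pipeline, the sketch below loads a ViT pre-trained on ImageNet through the timm library and replaces its head for the 102 flower classes; timm is assumed to be installed, and the exact model variant is an assumption rather than the one used in this work.

```python
import timm
import torch.nn as nn

# Assumption: timm is available and provides this pre-trained ViT variant.
model = timm.create_model("vit_base_patch16_224", pretrained=True)

# Replace the classification head for the 102-class flower dataset.
model.head = nn.Linear(model.head.in_features, 102)
```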
5.3.5 Implementation
The proposed method was trained on the “102 Category Flower Data” using the Adam optimizer with a learning rate policy in which the learning rate decreases when learning stagnates for a period of time (i.e., 'patience'). The following hyperparameters were used for training: learning rate = 1e-4, patience = 5, together with the chosen number of epochs and batch size.
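A minimal sketch of the described training setup, assuming a PyTorch model and dataloaders (model, train_loader, val_loader, num_epochs) already exist; the scheduler reduces the learning rate when the validation loss stops improving for 'patience' epochs.

```python
import torch
import torch.nn as nn

# Assumed to exist elsewhere: model, train_loader, val_loader, num_epochs.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Reduce the learning rate when the monitored metric (validation loss) stagnates.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=5)

for epoch in range(num_epochs):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, labels in val_loader:
            val_loss += criterion(model(images), labels).item()
    scheduler.step(val_loss)  # decrease the learning rate if validation loss has stagnated
```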
The deep learning classifier framework has been implemented using Python and PyTorch on an Intel(R) Core(TM) i7 2.2 GHz processor. In addition, the experiments were executed using an NVIDIA Quadro RTX 6000 graphical processing unit (GPU) and 32 GB of RAM.
5.3.6 Overall performance evaluation
In this section, we analyze the effectiveness of our proposed framework in light of the results of the experiments conducted. As discussed earlier, the experimental study uses the Oxford Flowers 102 dataset, which consists of 102 flower categories commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images. The images have large scale, pose, and light variations. In addition, there are categories with large variations within the category and several very similar categories.
The dataset is divided into a training set, a validation set, and a test set. The training set and validation set contain 818 images each, and the test set consists of the remaining 6552 images (a minimum of 20 per class). For the experimental setup, all images were scaled to a size of 224 x 224 pixels.
We used powerful pre-trained architectures: the CNN DenseNet121 and ViT were used for classification of flowers. These networks have achieved dramatic success in a wide range of computer vision tasks and hence were chosen in this study. It is worth noting that these models were originally trained on the large-scale labeled dataset ImageNet and later fine-tuned on the “102 Category Flower Data”. The last layer in these models has been removed and a new Fully Connected (FC) layer inserted with an output size of 102, representing the 102 different classes. In the resulting models, only the final FC layer is trained, whereas the other layers are initialized with pre-trained weights.
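A hedged sketch of this transfer-learning setup for the DenseNet121 branch using torchvision: load ImageNet weights, freeze the backbone, and replace the final FC layer with a 102-class classifier. The exact training configuration used in this work may differ; this only illustrates the head replacement described above.

```python
import torch.nn as nn
from torchvision import models

# Load DenseNet121 with ImageNet weights and freeze all pre-trained layers.
densenet = models.densenet121(pretrained=True)
for param in densenet.parameters():
    param.requires_grad = False

# Replace the classifier with a new FC layer for the 102 flower classes;
# only this layer will be trained, the rest keep their pre-trained weights.
densenet.classifier = nn.Linear(densenet.classifier.in_features, 102)
```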
The result overview of the pre-trained CNN models and the ViT model is tabulated in the table below:
Method | Running Time | Top-1 Accuracy | Detection Time on Test Set (seconds)
DenseNet121 with data augmentation | 28 min 5 sec | 89% | 6.6
DenseNet121 without data augmentation | 18 min 1 sec | 88% | 6.4

Table 5-1 Performance of proposed methods
Plots of the loss on the training and validation sets over training epochs are shown below:
Figure 5-17 Plots of loss on the training and validation sets over training epochs (three panels: DenseNet121 without data augmentation, DenseNet121 with data augmentation, and ViT; each shows the training loss and validation loss curves).
5.3.7 Conclusion
According to the results obtained, we decided to use the Vision Transformer model to build the flower recognition function for our application.
5.4 Integrate into the system
Figure 5-18 Flower classification server-client architecture: the client sends form data containing the flower image, and the server responds with data containing the name of the flower and some example images.
Step 1: The user adds an image and clicks “Nhận dạng” (Identify).
Figure 5-19 Add an image for the classification task (screen “Nhận dạng / Đóng góp cho VNCREATURES”).
Figure 5-20 Results for flower classification task.
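A hedged sketch of the client side of Figure 5-18: posting the flower image as multipart form data to the deployed recognition service and reading back the predicted name and example images. The endpoint path ("/predict") and the response fields ("name", "images") are assumptions for illustration, not the documented API.

```python
import requests

# Hypothetical endpoint on the deployed recognition service (path is an assumption).
URL = "https://vncreatures.herokuapp.com/predict"

with open("flower.jpg", "rb") as f:
    # Send the image as multipart form data, as in the server-client diagram.
    response = requests.post(URL, files={"file": f})

result = response.json()
print(result.get("name"))    # assumed field: predicted flower name
print(result.get("images"))  # assumed field: example image URLs
```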
CHAPTER 6 DEPLOY AND TEST THE SYSTEM
We use Heroku integration with GitHub to deploy the image recognition service: https://vncreatures.herokuapp.com/docs
We use an Apache host to deploy the ReactJS application and the Slim 4 API application. The host information:
The public website URL: http://vnnew.vncreatures.net/
We used testing software to measure the current performance of the system; the results are shown below.
CHAPTER 7 CONCLUSION AND FUTURE DEVELOPMENT
The project applies information technology to biodiversity and conservation, a problem that is still relatively new in Vietnam.
The project is the premise for a system aiming to become the largest biodiversity data and lookup center in Vietnam.
The project opens up multidisciplinary cooperation between professional researchers and technology groups from the laboratory.
The project is deployed in several stages, with each completed stage immediately serving the public interest. Tools such as automatic identification of creatures through photos, a network of volunteers, and e-learning about forest creatures prepare the foundation for the next steps.
The project is free and voluntary, serving society.
A practical web application running 24/7 serves the community for looking up and referring to data on Vietnam's forest biodiversity.
The system includes information on more than 3532 creatures, pictures of 107 wood samples, images of 36 kinds of animal footprints, and information on more than 31 national parks in the territory of Vietnam.
When the information system comes into service, it will display the information sponsor logo of the Institute of Applied Technology of Thu Dau Mot University, which also supports the cost of maintaining the infrastructure running the system. The technology sponsor logo of the Faculty of Information Systems, University of Information Technology will also be displayed on the product when it comes into operation, contributing to the prestige of the university and affirming its students' abilities.
Continue to add information about organisms that are still missing. Add more articles about the beauty of nature and introduce Vietnam's national parks and conservation areas.
Build a community of scientists, biologists, and people who love nature, who together contribute valuable knowledge and beautiful images of Vietnam's biodiversity.
Applying modern machine learning technologies, the system can classify and search for biological species through images.
Build a more powerful search engine by allowing users to enter keywords that describe more details about the organism, such as life form, leaf type, and flower type. Update the user interface to be more beautiful and powerful.
Search creatures by national park: linking the data tables together can help users search for species present in a given national park, or find which national parks a given creature exists in.
Apply 3D modeling: Using 3D modeling technology to describe creatures in the most realistic way.