Strategy
Our proposed system for scene text detection employs STRAug [22], a compre-
hensive data augmentation library specifically designed for the task of scene text recognition. In this work, we evaluate the efficacy of STRAug on scene text detection task.
Springer Nature 2021 IXIEX template
10 Enhancing Scene Text Detection With A Novel Contrastive...
3.4.1 Categories of STRAug Functions
STRAug offers a selection of 36 unique augmentation functions, each address- ing different facets of text appearance to cater to the broad range of challenges encountered in real-world scene text recognition. These functions are logically categorised into eight distinct groups: Warp, Geometry, Noise, Blur, Weather, Camera, Pattern, and Process.
Warp. This category encompasses functions that transform the shape of text characters within an image, including horizontal and vertical shearing, as well as local warping.
Geometry. The functions in this category apply geometric transforma- tions to the text, such as rotation, perspective, and affine transformations, aiding the model in adapting to text in various orientations and distortions.
Noise. Functions in this category introduce a variety of noise types to the images, including Gaussian noise, salt-and-pepper noise, and speckle noise. This allows the model to learn to recognise text under diverse noise conditions.
Blur. This category’s functions apply different types of blur to the images, such as Gaussian blur, motion blur, and defocus blur, simulating effects of camera shake, object motion, and out-of-focus images.
Weather. Functions in this category simulate weather-related conditions that could impact scene text recognition, such as rain, fog, and snow.
Camera. The functions in this category replicate the effects of various camera settings, such as exposure, brightness, contrast, and saturation, which can all influence the appearance of text in images.
Pattern. Functions in this category introduce patterns to the images, such
as stripes, checkerboards, and halftone dots, challenging the model’s ability to recognise text on textured backgrounds.
Process. This category of functions simulate image processing operations, such as resizing, cropping, and sharpening, that could be applied to images prior to inputting them into the model.
3.4.2 Employing STRAug
To maximize the advantages of STRAug, we implement the RandAugment strategy. Mathematically, the operation of the RandAugment strategy can be represented as follows:
Given an image J, the set of augmentation functions A = Aj, 4a,..., An where ứ is the number of augmentation types included, and the corresponding probabilities P = p,, po, ..., Pn, the augmentation process can be formulated as:
n
Taug = [[ 4"? mp) (1)
i=1
where Igug is the augmented image, Al) is the randomly selected aug-
mentation function of type i, m is the randomly chosen magnitude within
a predefined range, and p; is the probability of augmentation type i being
Springer Nature 2021 IXIEX template
Enhancing Scene Text Detection With A Novel Contrastive... 11
applied. The selection of Aw and m is determined by a uniform random
distribution.
A comprehensive series of experiments were conducted to evaluate the effi- cacy of the RandAugment strategy. We compared the performance of our model trained with the RandAugment strategy to models trained without data augmentation and those trained with conventional augmentation tech- niques. The results unambiguously demonstrated that the integration of the RandAugment strategy significantly boosts the overall accuracy of our scene text detection system, particularly when confronted with challenging input
text images.
4 Experiment
4.1 ICDAR15 Dataset
The experimental design underpinning our research hinges on two pivotal ele- ments: the meticulous selection of the training and testing dataset, and the adoption of appropriate evaluation metrics. These parameters are instrumen- tal in determining the robustness and effectiveness of our scene text detection
system.
In training our model, we have selected the ICDAR2015 [23] dataset, a highly regarded benchmark dataset provided by the International Conference
on Document Analysis and Recognition (ICDAR). Lauded for its diversity
and complexity, the dataset encompasses 1000 training images and 500 test images, each with a resolution of 720 x 1280 pixels. These images, derived from
a variety of everyday scenes and annotated with oriented text boxes, offer a robust training platform for our model. The richness of the dataset, combined with the high-resolution images, allows the model to capture intricate text details, thus enhancing its detection capabilities.
The ICDAR2015 dataset aligns perfectly with our experimental objec- tives, providing an authentic representation of diverse real-world scenarios. The wide-ranging text styles, fonts, sizes, orientations, and lighting conditions present in the dataset ensure the model garners a comprehensive under- standing, thereby augmenting its ability to generalize effectively to unseen data.
4.2 Training Configuration
The architecture of our model is anchored on two pivotal components: the CLIPResNet backbone and the Cascade Double Head Mask RCNN.
The CLIPResNet backbone, an adaptation of the ResNet model, is specifi- cally designed for the Contrastive Language-Image Pretraining (CLIP) model. This backbone is meticulously engineered to extract robust visual features from input images, which are then utilized for the task of scene text detection. The CLIPResNet backbone is initialized with the weights of the oCLIP model,
Springer Nature 2021 IXIEX template
12 Enhancing Scene Text Detection With A Novel Contrastive...
as in 1, which has been pre-trained on a comprehensive dataset. This pre- training phase endows the backbone with an in-depth understanding of visual and textual data, thereby enhancing its capacity to extract relevant features from scene text images.
The distinction between the CLIPResNet and the traditional ResNet archi- tectures primarily lies in two key areas: the pooling layer and the bottleneck
structure.
In the conventional ResNet design, a max pooling layer is positioned at the end of the input stem. However, the CLIPResNet architecture replaces this max pooling layer with an average pooling layer. This modification is a strategic alignment with the architectural framework employed in the CLIP model.
In terms of the bottleneck structure, the CLIPResNet introduces a notable alteration. After the second convolution layer within the bottleneck of CLIPResNet, an additional average pooling layer is incorporated. This layer, characterized by a kernel size of 2 and a stride of 2, is appended as a plu- gin when the input stride exceeds 1. Unlike the traditional ResNet design, the stride for each convolution layer in the CLIPResNet is consistently set to 1. This represents a significant departure from the bottleneck structure of the traditional ResNet model.
The training of our model is conducted in two stages: pre-training and fine- tuning. During the pre-training stage, the CLIPResNet backbone is trained
on the SynthText dataset using the oCLIP model. This process facilitates the backbone in learning a rich set of features that are beneficial for understand- ing both visual and textual data. In the fine-tuning stage, the entire system, encompassing the CLIPResNet backbone and the Cascade Double Head Mask RCNN, is fine-tuned on the ICDAR2015 dataset. This fine-tuning process enables the model to adapt to the specific task of scene text detection, thereby improving its performance on this task.
The implementation of our model was accomplished using the PyTorch framework. The model was trained on a machine equipped with a NVIDIA Tesla V100 GPU. The learning rates for the pre-training and fine-tuning stages were set to le — 4 and 2e — 3, respectively. The number of epochs for the pre-training and fine-tuning stages were set to 100 and 160, respectively. The batch size was set to 32.
5 Results and Discussion
Table 2 and 1 provide a comprehensive presentation of the results derived from our experiments. In this section, we conduct a comparative anal. of these results against existing state-of-the-art methodologies and offer an insightful interpretation of the implications of our findings. In order to align our results more closely with practical, real-world scenarios, all testing and evaluation carried out in this study utilized an RTX2080ti GPU—a hardware setup often found in regular computational systems.
Springer Nature 2021 TX template
Enhancing Scene Text Detection With A Novel Contrastive... 13
Evaluated on the challenging ICDAR2015 dataset, our model exhibits supe-
rior performance. It achieves an impressive precision rate of 90.03%, a recall rate of 83.05%, and an H-mean score of 86.5%. As illustrated in Table 1, our model evidently surpasses the base Cascade Mask R-CNN system, demon-
strating substantial improvements in precision (>7.18 percentage points), recall (>1.15 percentage points), and H-mean (>4.13 percentage points). This
advancement in performance underscores the efficacy of the innovations we propose—specifically, the incorporation of a CLIP-ResNet backbone and the application of a Cascade Double Head Mask R-CNN for text detection.
When evaluating the results in the context of other well-regarded methods, Cas-Dou Mask-RCNN and Clip Cas-Dou Mask-RCNN stand out even more
significantly. The Frame Per Second (FPS) values of our methods, while lower
than certain high-speed methods such as PANet and DBNet, are not the low- est within the table, still outpacing FCENet, which operates at 8 FPS. It is important to stress again that real-world applications often necessitate a fine balance between speed and accuracy, and our methods champion the latter.
In terms of computational complexity, as indicated by FLOPS, our meth- ods are indeed on the higher end of the scale, with Cas-Dou Mask-RCNN
at 2.088TFLOPs and Clip Cas-Dou Mask-RCNN at 1.997TFLOPs. However, these metrics represent model complexity and not execution speed per se. A high FLOPS value suggests that our models, despite being computationally complex, are able to deliver high-accuracy results, likely due to their capacity
to handle more intricate operations. As advancements in processing units and hardware accelerators continue, such computational complexity is becoming less of a constraint.
Examining the Params metric, our methods have more parameters than most other methods, indicating more complex models with more parameters
to learn from. However, with a larger number of parameters to learn, our mod- els are capable of understanding more nuanced patterns from the data, which likely contributes to the superior performance in precision and H-mean. Even though these models are more complex and potentially require more compu- tational resources, the improvement in performance outcomes can justify this increased complexity.
The commendable performance of our system suggests that the combina- tion of CLIP pre-training with ResNet50 and the use of Cascade Double Head Mask R-CNN for scene text detection indeed confers significant benefits. The marked increase in precision indicates our model’s enhanced ability to accu- rately distinguish between text and non-text regions. While the recall score isn’t the highest, it’s still competitive. Moreover, the superior H-mean score, which represents a balanced measure of precision and recall, illustrates our model’s proficiency in both metrics—an essential requirement for real-world applications.
Building on these observations, Table 2 presents a comprehensive eval- uation of the effectiveness of our proposed RandAug image augmentation technique when integrated with various scene text detection methodologies,
Springer Nature 2021 TX template
14 Enhancing Scene Text Detection With A Novel Contrastive...
Table 1 Comparative Performance of Various Scene Text Detection Methods.
Methods P R H FPS | Params | FLOPS
PANet [24] 84.55 73.23 78.48 32 24.809M | 52.004G
PSENet [25] 83.96 76.36 79.98 14.4 | 29.216M 0.133T
TextSnake[26] 82.6 84.9 80.4 10.8 | 36.356M | 54.303G
MaskRCNN [27] 86.44 77.66 81.82 16.1 | 44.396M 0.25T
DBNet [28] 87.44 82.76 85.04 29.6 25.41M 46.254G FCENet [29] 82.43 88.34 | 85.28 8 26.256M | 40.746G DBNet DCN [28] 87.84 83.15 85.43 28.2 | 26.281M | 35.528G Cascade-Mask RCNN [30] 82.85 81.90 82.37 16.1 77.325M 1.814T
Cas-Dou Mask-RCNN (Ours) 85.00 81.85 83.99 14 81.913M 2.088T
Clip Cas-Dou Mask-RCNN (Ours) | 90.03 | 83.77 | 86.78 14 81.317M 1.997T
This table provides a comparative evaluation of different scene text detection methods, including our proposed model, based on precision (P), recall (R), and H-mean score (H).
Our model demonstrates superior performance, achieving competitive precision, recall, and the highest H-mean score among the compared methods.
including MaskRCNN, Cascade-Mask RCNN, and Cascade-Doublehead-Mask RCNN.. A detailed examination of the results reveals the significant impact of the RandAug method on the performance of the Cascade-Doublehead-Mask RCNN model. When trained on the ICDAR2015 dataset with the integration
of RandAug, the model’s precision improves from 85.00% to 87.44%, and the H-mean score sees a noticeable increase from 83.99% to 85.07%. This improve-
ment in precision and H-mean score underscores the potency of the RandAug method in enhancing the model’s ability to accurately detect text regions within a scene.
Table 2 Assessment of the Efficacy of RandAug Methods.
Method Training set P R H
MaskRCNN ICDAR2015[23] Train 86.44 77.66 81.82
Cascade-Mask RCNN | ICDAR2015 82.85 81.90 82.37
Cas-Dou-Mask RCNN | ICDAR2015 85.00 81.85 83.99
Cas-Dou-Mask RCNN | ICDAR2015 RandAug | 87.44 82.76 85.07
Proposed method ICDAR2015 89.23 83.00 86.01
Proposed method ICDAR2015 RandAug | 90.03 | 83.77 | 86.78
This table offers a comparative evaluation of various scene text detection methodologies, encompassing MaskRCNN, Cascade-Mask RCNN, and Cascade-Doublehead(Cas-Dou) Mask RCNN , both with and without the integration of our proposed RandAug image augmen- tation technique. The performance metrics employed for this evaluation include precision, recall, and the H-mean score. The results underscore the potency of our RandAug method
in significantly bolstering the performance of the Cascade-Doublehead-Mask RCNN model.
The results provide strong validation of the advantages inherent in our proposed method, demonstrating its potential to not only enhance the per-
formance of the model but also to redefine the benchmarks of state-of-the-art scene text detection. This paves the way for further exploration and exam- ination of these enhancements and their potential applications, indicating a promising direction for future research.
Springer Nature 2021 IXIEX template
Enhancing Scene Text Detection With A Novel Contrastive... 15
6 Conclusion and Future Work
This research introduced a ground-breaking approach to scene text detection
by harnessing the strengths of a CLIP-pretrained ResNet50 backbone com- bined with a Cascade Double Head Mask R-CNN. Rigorously tested on the ICDAR2015 dataset, our model outperformed the base Cascade Mask R-CNN model, achieving a precision of 90.031%, a recall of 83.051%, and an H-mean score of 86.51%. This outcome attests to the potential of our model to redefine the state-of-the-art in scene text detection. The implications of our research are substantial, not only in the specific domain of scene text detection but also in other areas of computer vision. The superior performance of our model could expand its applicability across various sectors, including autonomous vehicles, robotics, and augmented reality. Moving forward, potential research directions include extending the model to handle multi-lingual and complex scripts, exploring the integration of CLIP pre-training with other backbone models, and investigating the adaptability of the Cascade Double Head Mask R-CNN for other object detection tasks. In conclusion, our study presents a novel, effective solution for scene text detection, setting a new performance benchmark on the ICDAR2015 dataset. It not only pushes the current limits
of computer vision but also opens up exciting pathways for future research.
Statements and Declarations
Funding. This research was supported by The VNUHCM-University o Information Technology’s Scientific Research Support Fund.
Conflict of Interests. The authors declare that they have no conflict o
interests.
Data availability statements. In this study, the authors made use of the publicly available ICDAR2015 dataset [23].
Authors’ Contributions. Mr. Bao Tran and Mr. Minh Dinh collaborate:
on conceiving, designing, and conducting experiments for the model. Mr. Bao Tran integrated the double head architecture and introduced STRAug for scene text detection. Mr. Minh Dinh proposed using ResNet-oClip and lec the writing of the article. Mr. Doanh C. Bui contributed to the writing an reviewing process of the article, providing valuable support in ensuring its academic quality and rigor. Mr. Nguyen D. Vo, as the research manager, writing-review, edited and provided essential non-academic support, ensuring timely completion.
References
[1] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S.,
Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transfer- able visual models from natural language supervision. In: International
16
lì
5]
ot
Springer Nature 2021 IXIEX template
Enhancing Scene Text Detection With A Novel Contrastive...
Conference on Machine Learning, pp. 8748-8763 (2021). PMLR
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 770-778 (2016)
Xue, C., Zhang, W., Hao, Y., Lu, S., Torr, P.H., Bai, S.: Language mat- ters: A weakly supervised vision-language pre-training approach for scene text detection and spotting. In: Computer Vision-ECCV 2022: 17th Euro- pean Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part
XXVII, pp. 284-302 (2022). Springer
Deng, D., Liu, H., Li, X., Cai, D.: Pixellink: Detecting scene text via instance segmentation. In: Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 32 (2018)
Tian, Z., Shu, M., Lyu, P., Li, R., Zhou, C., Shen, X., Jia, J.: Learning shape-aware embedding for scene text detection. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4234-4243 (2019)
Ye, J., Chen, Z., Liu, J., Du, B.: Textfusenet: Scene text detection with richer fused features. In: IJCAI, vol. 20, pp. 516-522 (2020)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in
neural information processing systems 32 (2019)
Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder represen- tations from transformers. arXiv preprint arXiv:1908.07490 (2019)
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.-J., Chang, K.-W.: Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-vl: A uni- versal encoder for vision and language by cross-modal pre-training. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11336-11344 (2020)
Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint
arXiv:1908.08530 (2019)
Springer Nature 2021 IXIEX template
Enhancing Scene Text Detection With A Novel Contrastive... 17
[13] Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng,
14
16
17
18
23
Y., Liu, J.: Uniter: Universal image-text representation learning. In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow,
UK, August 23-28, 2020, Proceedings, Part XXX, pp. 104-120 (2020).
Springer
Chiou, M.-J., Zimmermann, R., Feng, J.: Visual relationship detection with visual-linguistic knowledge from multimodal representations. IEEE Access 9, 50441-50451 (2021). https://doi.org/10.1109/ACCESS.2021. 3069041
Hao, W., Li, C., Li, X., Carin, L., Gao, J.: Towards learning a generic agent for vision-and-language navigation via pre-training. Conference on
Computer Vision and Pattern Recognition (CVPR) (2020)
Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: A recur- rent vision-and-language bert for navigation. In: Proceedings of the
EEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1643-1653 (2021)
Murahari, V., Batra, D., Parikh, D., Das, A.: Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In: Computer Vision — ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28,
2020, Proceedings, Part XVIII, pp. 336-352. Springer, Berlin, Heidel- erg (2020). https://doi.org/10.1007/978-3-030-58523-5_20. https://doi. org/10.1007/978-3-030-58523-5_20
Wang, Y., Joty, S., Lyu, M.R., King, I., Xiong, C., Hoi, $.C.: Vd- ert: A unified vision and dialog transformer with bert. arXiv preprint arXiv:2004.13278 (2020)
Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y.M.: Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond
empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
Atienza, R.: Data augmentation for scene text recognition. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision, pp. 1561-1570 (2021)
Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation
in natural images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2315-2324 (2016)