A Hybrid Deep Learning Architecture for Sentence Unit Detection

Duy-Cat Can∗†, Thi-Nga Ho‡ and Eng-Siong Chng†‡
∗ Faculty of Information Technology, University of Engineering and Technology, VNUH, Vietnam
† Temasek Laboratories@NTU, Nanyang Technological University, Singapore
‡ School of Computer Science and Engineering, Nanyang Technological University, Singapore
catcd@vnu.edu.vn, ngaht@ntu.edu.sg, ASESChng@ntu.edu.sg

Abstract—Automatic speech recognition systems currently deliver an unpunctuated sequence of words, which is difficult for humans to read and degrades the performance of downstream natural language processing tasks. In this paper, we propose a hybrid approach to Sentence Unit Detection, in which the focus is on adding the full stop [.] to unstructured text. Our model profits from the advantages of two dominant deep learning architectures: (i) the ability of a bidirectional Long Short-Term Memory network to learn long dependencies in both directions; (ii) the ability of a Convolutional Neural Network to capture local context. We also empirically study the training objective of our networks using an extra loss and further investigate the impact of each model component on the overall result. Experiments conducted on two large-scale datasets demonstrate that the proposed architecture outperforms the separate single-network models by a substantial margin of 1.82-1.91% in F1.
Availability: the source code and models are available at https://github.com/catcd/LSTM-CNN-SUD

Keywords-Sentence Unit Detection; Punctuation; Recurrent Neural Networks; Long Short-Term Memory; Convolutional Neural Network

I. INTRODUCTION

Recent years have witnessed tremendous progress in automatic speech recognition (ASR). However, the text transcript generated by current recognition systems is still simply a stream of words without punctuation or segmentation. Generally, the human readability of the transcript can be considerably improved by the presence of punctuation marks [1], and segmenting the text at punctuation positions also increases the accuracy of post-processing tasks such as question answering or machine translation.

In this work, we approach the Sentence Unit Detection (SUD) problem as sequential tagging, which aims to segment the sequence of words by labeling the end of each sentence with a full stop mark. In addition, we further try to predict fine-grained punctuation marks.

In the last decade, deep learning methods have produced state-of-the-art results in many natural language processing (NLP) tasks by using multiple hidden layers to learn robust representations of the data. The two most typical deep neural networks (DNNs) are the Convolutional Neural Network (CNN) [2] and the Recurrent Neural Network (RNN) [3] with Long Short-Term Memory (LSTM) units [4]. The CNN is good at capturing n-gram features in a flat structure and has proven effective in NLP [5]. The RNN performs effectively on sequential data and, together with its many variants, appears in several state-of-the-art NLP systems [6].

In this work, we present an analysis of a neural architecture that takes advantage of these two DNNs. The hybrid model benefits from the far-context capturing ability of a multilayer bidirectional LSTM and the local features extracted by a CNN. Compared with prior work, the contributions of this paper can be summarized as follows:
• We propose a hybrid LSTM-CNN model and show that it is effective in detecting sentence units and punctuation marks on two corpora.
• We demonstrate the effectiveness of our proposed extra loss for training the model.
• We handle the class imbalance problem by using a weighted cross-entropy loss.
II. RELATED WORK

There has been a considerable amount of effort on constructing computational models to detect sentence units in unpunctuated text. Most recent works can be divided into two categories: methods based on hand-crafted features and methods based on automatically extracted features.

Previous research frequently used lexical features such as bag-of-words or n-gram models [7]; these techniques have been compared with ConvNets by Zhang and LeCun [8]. For training, some state-of-the-art SUD approaches chose decision trees [9] or Conditional Random Fields (CRFs) [10] as the classifier. These approaches take only traditional lexical features as input, which depend on expert knowledge and therefore usually cannot generalize well because of the expensive human effort involved.

In recent years, many studies have explored other possibilities by combining word embeddings [11] with various DNN architectures to learn features without prior knowledge. Che et al. [11] applied a CNN model on purely lexical input, with pre-trained word vectors as the only features. Tilk and Alumäe [12] presented a two-stage RNN-based model using LSTM units and showed the performance of the LSTM on the SUD task. Many enhancements and techniques have also been used to improve the LSTM, such as a CRF layer on the LSTM output [13, 14] or an attention mechanism [15]. Meanwhile, some recent experiments have shown the effectiveness of stacking a CNN on the output of an LSTM for relation classification and sentiment analysis [16, 17]. Inspired by these experiments, we attempt this combination in a different way for sentence unit detection in this paper.

III. PROPOSED MODEL

Figure 1. An overview of the proposed model.

Figure 1 depicts the overall architecture of our proposed model. The unpunctuated input text is passed through an embedding layer that generates the input vectors for our neural network. Along the sequence of words, two recurrent neural networks with Long Short-Term Memory units are applied to learn hidden representations of the words in the embedded space, one in each direction. A convolution layer is also applied to capture local features from each word and its neighbors. A multi-softmax layer is placed on the output of the previous phase for classification. During the training stage, the hidden states of the LSTM and the local features from the CNN are concatenated, and a fine-grained softmax layer performs a (K + 1)-class classification. Additionally, two coarse-grained softmax classifiers on the LSTM and CNN outputs are used to perform binary classifications. During the testing stage, the final (K + 1)-class distribution is the one provided by the fine-grained classifier. The details of each layer are described below.
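To make the data flow concrete, the following PyTorch-style sketch illustrates how a multilayer bidirectional LSTM and a CNN over local windows can feed one fine-grained and two coarse-grained classifiers. This is not the authors' released implementation; the layer sizes, number of LSTM layers, filter window sizes, and activation are illustrative assumptions.

```python
# Illustrative sketch of the hybrid LSTM-CNN tagger (not the paper's released code).
# All hyperparameters below (emb_dim, hidden, filters, window sizes) are assumptions.
import torch
import torch.nn as nn

class HybridSUDModel(nn.Module):
    def __init__(self, num_classes, emb_dim=300, hidden=128,
                 conv_filters=64, window_sizes=(3, 5)):
        super().__init__()
        # num_classes = K + 1 (K punctuation types plus the non-punctuated class).
        # Multilayer bidirectional LSTM over the embedded word sequence.
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # One 1-D convolution per window size (2n + 1) to capture local context.
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_dim, conv_filters, k, padding=k // 2)
            for k in window_sizes
        ])
        local_dim = conv_filters * len(window_sizes)
        # Fine-grained (K + 1)-class classifier on the concatenation [h_t ; l_t].
        self.fine = nn.Linear(2 * hidden + local_dim, num_classes)
        # Coarse-grained binary classifiers on h_t and l_t separately.
        self.coarse_h = nn.Linear(2 * hidden, 2)
        self.coarse_l = nn.Linear(local_dim, 2)

    def forward(self, embeddings):          # (batch, seq_len, emb_dim)
        h, _ = self.lstm(embeddings)        # (batch, seq_len, 2 * hidden)
        x = embeddings.transpose(1, 2)      # (batch, emb_dim, seq_len)
        l = torch.cat([torch.relu(conv(x)) for conv in self.convs], dim=1)
        l = l.transpose(1, 2)               # (batch, seq_len, local_dim)
        fine_logits = self.fine(torch.cat([h, l], dim=-1))
        return fine_logits, self.coarse_h(h), self.coarse_l(l)
```

At test time only the fine-grained logits would be decoded, matching the description above; the two coarse-grained heads contribute only to the training objective.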
A. Embeddings

In the Embeddings layer, each word in the input sequence is transformed into a vector by looking up the embedding matrix W_e ∈ R^(d×|V|), where d is the dimension of a vector and V is the vocabulary of all words we consider. The embedding matrix is generated with a pre-trained word embedding model, which learns word vectors that capture hidden information about a language, such as word analogies or semantics, from the external context of each word. In this paper, we use the fastText word embedding model [18], which is trained on Wikipedia data.

B. Features extraction

1) Multilayer Bidirectional Long Short-Term Memory: To take advantage of the sequential nature of the data, we make use of a Recurrent Neural Network [3] with Long Short-Term Memory units [4], which have been demonstrated to be effective at capturing long-term dependencies. A common LSTM unit is composed of four components: a memory cell c_t, an input gate i_t, an output gate o_t, and a forget gate f_t. The hidden state h_t is calculated from the current input x_t together with the previous hidden state h_{t-1} and memory cell c_{t-1}, as follows:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)          (1)
g_t = tanh(W_c x_t + U_c h_{t-1} + b_c)       (2)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)          (3)
c_t = i_t ◦ g_t + f_t ◦ c_{t-1}               (4)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)          (5)
h_t = o_t ◦ tanh(c_t)                         (6)

where σ denotes the sigmoid function and ◦ denotes the entry-wise product.

2) Local features with Convolutional Neural Network: To improve on the LSTM model, we use a CNN [2] layer to capture the context features around each word. We use several filter region sizes for this CNN layer, which allows the CNN to capture wider ranges of n-grams. Local features l_t for the t-th word in a context of 2n neighbors are extracted with a convolution filter of size d × (2n + 1), i.e.,

l_t = f(W_conv x_{t-n:t+n} + b_conv)          (7)

where W_conv is the weight matrix of the convolution layer, b_conv is the bias for the hidden state vector, x_{t-n:t+n} is the stack of the 2n + 1 word vectors from position (t - n) to (t + n), and f is a non-linear activation function.
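The recurrences (1)-(6) and the window convolution (7) can be transcribed directly. The NumPy sketch below is meant only to mirror the equations; the dictionary-keyed weights, the choice of ReLU for the non-linearity f, and the assumption that the sequence is padded so the window stays in range are all illustrative.

```python
# Direct transcription of Eqs. (1)-(7); shapes and the ReLU choice are assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by gate name ('i', 'g', 'f', 'o')."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # Eq. (1)
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # Eq. (2)
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # Eq. (3)
    c_t = i_t * g_t + f_t * c_prev                           # Eq. (4)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # Eq. (5)
    h_t = o_t * np.tanh(c_t)                                 # Eq. (6)
    return h_t, c_t

def local_features(X, t, n, W_conv, b_conv):
    """Eq. (7): convolve a (2n + 1)-word window centred on position t.
    Assumes X is padded so that the window does not run off the sequence."""
    window = X[t - n:t + n + 1].reshape(-1)            # stack of 2n + 1 word vectors
    return np.maximum(0.0, W_conv @ window + b_conv)   # ReLU as the non-linearity f
```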
C. Multi-softmax classifier

A fine-grained softmax classifier is used to predict a (K + 1)-class distribution y_t for each word,

y_t = softmax(W_f [h_t ⊕ l_t] + b_f)          (8)

where W_f is the transformation matrix and b_f is the bias vector. The fine-grained classifier makes use of a representation that combines bidirectional information with local features. This (K + 1)-class distribution then becomes the final prediction in the decoding phase.

Two coarse-grained softmax classifiers are applied to h_t and l_t separately, with a linear transformation, to give the binary distributions y_t^h and y_t^l respectively, i.e.,

y_t^h = softmax(W_h h_t + b_h)                (9)
y_t^l = softmax(W_l l_t + b_l)                (10)

where W_h and W_l are transformation matrices and b_h and b_l are bias vectors. Classifying the t-th word into two coarse classes (punctuated vs. non-punctuated) strengthens the model's ability to judge the fine class.

D. Objective function and learning method

The two binary softmax classifiers are used to estimate the probability that a word is punctuated. The (K + 1)-class softmax classifier is used to estimate the probability of the punctuation type that the word belongs to. For a single word in a data sample, the training objective is the penalized cross-entropy of the three classifiers, given by

L = - Σ_{i=0}^{K} u_{ti} log y_{ti} - Σ_{i=0}^{1} v_{ti} log y_{ti}^h - Σ_{i=0}^{1} v_{ti} log y_{ti}^l + λ‖θ‖²          (11)

where u_t ∈ R^(K+1) and v_t ∈ R^2 are the one-hot represented ground truths, θ is the set of model parameters to be learned, and λ is a regularization coefficient. The gradients with respect to the model parameters θ can be efficiently computed via backpropagation through the network structure. To minimize L, we apply mini-batch gradient descent with the Adam optimizer [19] in our experiments.
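One possible realization of the penalized cross-entropy in (11), summing the fine-grained loss, the two coarse-grained losses, and an L2 penalty, is sketched below. This is again only a sketch in PyTorch, not the released training code; applying λ to the squared L2 norm of all parameters and using mean-reduced cross-entropy are assumptions.

```python
# Sketch of the training objective in Eq. (11); lam and the reduction are assumptions.
import torch
import torch.nn.functional as F

def extra_loss(fine_logits, coarse_h_logits, coarse_l_logits,
               fine_labels, coarse_labels, model, lam=1e-4):
    """fine_labels take values in {0, ..., K}; coarse_labels in {0, 1}
    (non-punctuated vs. punctuated)."""
    num_fine = fine_logits.size(-1)
    # Fine-grained (K + 1)-class cross-entropy.
    loss = F.cross_entropy(fine_logits.view(-1, num_fine), fine_labels.view(-1))
    # Extra loss: two coarse-grained binary cross-entropies.
    loss = loss + F.cross_entropy(coarse_h_logits.view(-1, 2), coarse_labels.view(-1))
    loss = loss + F.cross_entropy(coarse_l_logits.view(-1, 2), coarse_labels.view(-1))
    # L2 penalty on the model parameters.
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return loss + lam * l2
```

The resulting loss would then be minimized with mini-batch gradient descent and Adam (e.g., torch.optim.Adam(model.parameters())), matching the optimizer stated above.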
IV. EXPERIMENTAL EVALUATION

A. Datasets

We evaluate our LSTM-CNN model on two benchmark datasets: RT-03-04¹ and a subset of the MGB Challenge data [20]. The details of the two datasets are shown in Table I.

¹ MDE Training Data Speech: LDC2004S08 and LDC2005S16.

The RT-03-04 corpus consists of transcripts and annotations of 40 hours of English Broadcast News (BN) and Conversational Telephone Speech (CTS) audio data. For this dataset, we predict a full stop [.] label for each word that is at the end of a full sentence. The model is fine-tuned on the training set with validation on the development set, and the results are reported on the test set, which is kept hidden from the model.

Our subset of the MGB data includes approximately 1,340 hours out of 1,600 hours of broadcast audio taken from four BBC TV channels over seven weeks. Since the provided dataset does not contain punctuation marks, we use the original and preprocessed subtitles to obtain the correct boundaries for each unit. In these experiments, we predict a fine-class label for each word, including full stop [.], comma [,], question mark [?], exclamation mark [!] and three dots [...]. For MGB, we separate ten percent of the training set for validation and report the results on the development set.

Table I. SUMMARY OF THE TWO BENCHMARK DATASETS

                            RT-03-04                 MGB
                        Train    Dev    Test      Train      Dev
  Examples               5359    359     275     284436     6433
  Non-punctuated       299693  18454   12808    5722741   112190
  Full stop [.]         25573   1508    1162     582004    12368
  Comma [,]                 -      -       -     384732     8519
  Question mark [?]         -      -       -     109757     2714
  Exclamation mark [!]      -      -       -      71598     2061
  Three dots [...]          -      -       -      30607      918

B. Performance of the LSTM-CNN model

We conduct the training and testing process 20 times and report the averaged results. For evaluation, the predicted labels are compared to the gold annotations using the standard precision (P), recall (R), and F1 score metrics. Tables II and III show the performance of our proposed model and its variants on the two benchmark datasets.

Table II. RESULTS ON THE RT-03-04 DATASET (values in parentheses are standard deviations over 20 runs)

  Model         P        R       F1
  CNN         82.28    52.45    64.06 (±1.54)
  LSTM        81.92    66.70    73.53 (±1.02)
  LSTM-CNN    79.56    70.36    74.68 (±0.36)
  LSTM-CNN+   81.47    70.24    75.44 (±0.27)

Table III. RESULTS ON THE MGB DATASET (values in parentheses are standard deviations over 20 runs)

  Model       Micro P   Micro R   Micro F1         Macro F1
  CNN          67.88     35.62    46.72 (±0.47)    36.39 (±0.61)
  LSTM         65.62     58.73    61.98 (±0.09)    45.19 (±0.61)
  LSTM-CNN     65.20     60.70    62.87 (±0.11)    48.09 (±0.25)
  LSTM-CNN+    67.65     60.6     63.80 (±0.04)    49.49 (±0.36)

On both the RT-03-04 and MGB datasets, the combined LSTM-CNN model outperforms the single-network models that use only a CNN or only an LSTM. On RT-03-04, the local features extracted by the CNN increase the recall of the LSTM model by 3.66% and the F1 by 1.15%. The results on the MGB dataset are similar: the recall and the micro-average F1 increase by 1.97% and 0.89%, respectively. The standard deviations over 20 runs are 0.36 on RT-03-04 and 0.11 on MGB (micro-average).

In addition, adding the extra loss of the two coarse-grained softmax classifiers to the final training objective boosts F1 by a further 0.76% and 0.93% on the two datasets, respectively. Our LSTM-CNN+ model is more stable and outperforms the other models by a large margin.

Table IV compares the results per class on the MGB dataset. The LSTM model, despite its long-dependency information, fails to predict minor classes, such as three dots, in the test data. By contrast, the CNN model, with its local features, is able to capture the features needed to resolve this problem. The improvement of the proposed combined model on the minor classes is therefore both remarkable and understandable.

Table IV. F1 OF EACH LABEL ON THE MGB DATASET

  Model        [.]      [,]      [?]      [!]     [...]
  CNN         59.91    38.52    34.99    29.04    19.50
  LSTM        74.89    56.86    71.21    19.66     3.35
  LSTM-CNN    74.45    57.96    69.78    27.99    10.25
  LSTM-CNN+   74.91    59.21    70.87    29.49    12.96

C. Handling data imbalance

Since all models achieve much higher precision than recall, we make one additional adjustment to our best-performing models on RT-03-04: we reduce the contribution of the "Non-punctuated" class to the objective function. As a result, more borderline cases are classified as "Punctuated", which balances precision and recall.

Figure 2. Investigating the impact of weighted loss on the RT-03-04 dataset using the LSTM-CNN+ model.

Figure 2 shows that as the weight of the "Non-punctuated" class increases, the precision increases and the recall decreases. The best result achieved is an F1 of 76.68% (P = 75.57%, R = 77.82%) with a ratio between the two classes of 0.35-0.65.
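The reweighting described above can be expressed as a class-weighted cross-entropy. A minimal sketch, assuming PyTorch, assuming class index 0 is "Non-punctuated" and index 1 is "Punctuated", and using the best-performing 0.35-0.65 ratio reported above, is:

```python
# Weighted cross-entropy used to rebalance precision and recall on RT-03-04;
# the class ordering [Non-punctuated, Punctuated] is an assumption.
import torch
import torch.nn.functional as F

def weighted_loss(logits, labels):
    weights = torch.tensor([0.35, 0.65])   # best ratio reported above
    return F.cross_entropy(logits.view(-1, 2), labels.view(-1), weight=weights)
```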
V. CONCLUSION

In this paper, we have presented a novel sentence unit detection model that combines two dominant deep learning networks. The proposed model takes advantage of the ability of the LSTM to capture long-range dependencies on multiple time scales and the ability of the CNN to learn local features in the context of neighboring words. Experiments on two datasets showed improvements of the LSTM-CNN model over the traditional LSTM model for all punctuation types. The overall F1 scores were improved by 1.91% and 1.82% on the two datasets, respectively, and the standard deviations over 20 runs were reduced significantly. The most significant improvements were achieved when adding the extra loss in the training phase. Several experiments were conducted to verify the rationality and effectiveness of the model's components and the proposed techniques. The results also demonstrated the robustness of our model, which can automatically adapt to different types of data, from telephone speech to broadcast news, with different label schemata. In addition, our proposed model scales to perform well on both small (40 hours) and large (1,340 hours) corpora. The experiments also highlighted a limitation of our model with respect to the data imbalance problem. We aim to address this problem, as well as further extensions of our model, in future work. Future research includes the use of a richer set of prosodic input representations and the training of a new English word embedding model on spoken-language text.

ACKNOWLEDGMENT

This research is supported by the National Research Foundation Singapore under its AI Singapore Programme [Award No.: AISG-100E-2018-006]. We also thank the anonymous reviewers for their comments and suggestions.

REFERENCES

[1] D. A. Jones, F. Wolf, E. Gibson, E. Williams, E. Fedorenko, D. A. Reynolds, and M. Zissman, "Measuring the readability of automatic speech-to-text transcripts," in Eighth European Conference on Speech Communication and Technology, 2003.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1746–1751.
[6] Q. Qian, M. Huang, J. Lei, and X. Zhu, "Linguistically regularized LSTM for sentiment classification," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1, 2017, pp. 1679–1689.
[7] N. Ueffing, M. Bisani, and P. Vozila, "Improved models for automatic punctuation prediction for spoken and written text," in INTERSPEECH, 2013, pp. 3097–3101.
[8] X. Zhang and Y. LeCun, "Text understanding from scratch," arXiv preprint arXiv:1502.01710, 2015.
[9] A. Stolcke et al., "Automatic detection of sentence boundaries and disfluencies based on recognized words," in Fifth International Conference on Spoken Language Processing, 1998.
[10] X. Wang, H. T. Ng, and K. C. Sim, "Dynamic conditional random fields for joint sentence boundary and punctuation prediction," in INTERSPEECH, 2012.
[11] X. Che, C. Wang, H. Yang, and C. Meinel, "Punctuation prediction for unsegmented transcript based on word vector," in LREC, 2016.
[12] O. Tilk and T. Alumäe, "LSTM for punctuation restoration in speech transcripts," in INTERSPEECH, 2015.
[13] C. Xu, L. Xie, G. Huang, X. Xiao, E. S. Chng, and H. Li, "A deep neural network approach for sentence boundary detection in broadcast news," in INTERSPEECH, 2014.
[14] K. Xu, L. Xie, and K. Yao, "Investigating LSTM for punctuation prediction," in International Symposium on Chinese Spoken Language Processing. IEEE, 2016, pp. 1–5.
[15] O. Tilk and T. Alumäe, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration," in INTERSPEECH, 2016, pp. 3047–3051.
[16] H. Q. Le, D. C. Can, S. T. Vu, T. H. Dang, M. T. Pilehvar, and N. Collier, "Large-scale exploration of neural relation classification architectures," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2266–2277.
[17] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2014, pp. 655–665.
[18] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[20] P. Bell et al., "The MGB challenge: Evaluating multi-genre broadcast media recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 687–693.