Candidate Term Boundary Conflict Reduction Method for Chinese Geological Text Segmentation
Release time:2023-04-03
Journal:Applied Sciences
Key Words:Chinese word segmentation; information entropy; degree of freedom of terms; zero-sample; single geological text
Abstract:Although Chinese word segmentation (CWS) relies heavily on computing power to train huge models and on human work to label corpora, models and algorithms are still not accurate enough, especially for segmentation in a specific domain. In this study, a high-degree-of-freedom-priority candidate term boundary conflict reduction method (HFCR) is proposed to remove the need to set thresholds manually in segmentation based on information entropy. We quantify the uncertainty of the left and right character connections of candidate terms and then arrange them in descending order for local comparisons to determine term boundaries. Dynamic numerical comparisons are adopted instead of a manually and arbitrarily chosen threshold. Experiments show that the average F1-value of CWS for Chinese geological text is higher than 95% and the F1-value for Chinese general datasets is higher than 87%. Compared with representative tokenizers and the SOTA model, our method performs better: it resolves the term boundary conflict problem well and performs strongly on single geological text segmentation without any samples or labels.
Co-author:Zhiyong Guo
First Author:Yu Tang
Indexed by:Journal paper
Correspondence Author:Jiqiu Deng*
Document Type:J
Volume:13
Issue:7
Page Number:4516
Translation or Not:no
Date of Publication:2023-04-02
Included Journals:SCI
Links to published journals:https://www.mdpi.com/2076-3417/13/7/4516
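The abstract describes scoring candidate terms by the uncertainty of their left and right neighboring characters and ranking them in descending order. The following is a minimal sketch of that general idea only, not the authors' HFCR implementation. It assumes that "degree of freedom" means the smaller of the left and right neighbor entropies, which is a common convention and is not confirmed by this record. The function names `entropy`, `degree_of_freedom`, and `rank_candidates`, and the `max_len` parameter, are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighboring characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def degree_of_freedom(text, term):
    """Assumed definition: min(left entropy, right entropy) around `term`."""
    left, right = Counter(), Counter()
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        if start > 0:
            left[text[start - 1]] += 1   # character just before this occurrence
        if end < len(text):
            right[text[end]] += 1        # character just after this occurrence
        start = text.find(term, start + 1)
    return min(entropy(left), entropy(right))


def rank_candidates(text, max_len=4):
    """Score every n-gram of 2..max_len characters, highest score first."""
    grams = {text[i:i + n]
             for n in range(2, max_len + 1)
             for i in range(len(text) - n + 1)}
    scored = [(degree_of_freedom(text, g), g) for g in grams]
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Calling `rank_candidates(document_text)` on a single document returns `(score, candidate)` pairs with the most freely combining candidates first. The local comparison step that resolves overlapping candidates into final term boundaries is not specified in the abstract and is deliberately left out of this sketch.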