Candidate Term Boundary Conflict Reduction Method for Chinese Geological Text Segmentation

  • Release time:2023-04-03

  • Journal:Applied Sciences

  • Key Words:Chinese word segmentation; information entropy; degree of freedom of terms; zero-sample; single geological text

  • Abstract:Although Chinese word segmentation (CWS) relies heavily on computing power to train large models and on human labor to label corpora, existing models and algorithms remain insufficiently accurate, especially for segmentation in a specific domain. In this study, a high-degree-of-freedom-priority candidate term boundary conflict reduction method (HFCR) is proposed to remove the need to set thresholds manually in segmentation based on information entropy. We quantify the uncertainty of the left and right character connections of candidate terms, then arrange the candidates in descending order and compare them locally to determine term boundaries. Dynamic numerical comparisons replace a manually and arbitrarily chosen threshold. Experiments show that the average F1-value of CWS is higher than 95% on Chinese geological text and higher than 87% on general Chinese datasets. Compared with representative tokenizers and the SOTA model, our method performs better: it resolves the term boundary conflict problem well and performs strongly on segmentation of a single geological text without any samples or labels.

  • Co-author:Zhiyong Guo

  • First Author:Yu Tang

  • Indexed by:Journal paper

  • Correspondence Author:Jiqiu Deng*

  • Document Type:J

  • Volume:13

  • Issue:7

  • Page Number:4516

  • Translation or Not:no

  • Date of Publication:2023-04-02

  • Included Journals:SCI

  • Links to published journals:https://www.mdpi.com/2076-3417/13/7/4516
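
The abstract describes scoring candidate terms by the uncertainty of their left and right neighboring characters and then resolving overlaps by comparing scores rather than applying a fixed threshold. The sketch below illustrates that general idea only; it is not the authors' HFCR implementation. The function names, the `max_len` parameter, the use of `min(left, right)` entropy as the degree of freedom, the `^`/`$` edge symbols, and the greedy overlap rule are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def boundary_entropy(neighbors):
    """Shannon entropy (in bits) of a Counter of neighboring characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in neighbors.values())


def degrees_of_freedom(text, max_len=4):
    """Map each candidate n-gram (2..max_len chars) to min(left, right) entropy."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for size in range(2, max_len + 1):
        for start in range(len(text) - size + 1):
            end = start + size
            gram = text[start:end]
            # Assumption: text edges count as distinct boundary symbols.
            left[gram][text[start - 1] if start > 0 else "^"] += 1
            right[gram][text[end] if end < len(text) else "$"] += 1
    return {g: min(boundary_entropy(left[g]), boundary_entropy(right[g]))
            for g in left}


def segment(text, max_len=4):
    """Greedy, threshold-free segmentation: higher-scoring candidates win overlaps."""
    dof = degrees_of_freedom(text, max_len)
    spans = []
    for size in range(2, max_len + 1):
        for start in range(len(text) - size + 1):
            score = dof[text[start:start + size]]
            if score > 0:  # candidates seen in only one context carry no signal
                spans.append((score, size, start))
    # Descending score; ties go to the longer candidate, then the earlier one.
    spans.sort(key=lambda s: (-s[0], -s[1], s[2]))

    taken = [False] * len(text)
    end_of = {}
    for _score, size, start in spans:
        if not any(taken[start:start + size]):
            for i in range(start, start + size):
                taken[i] = True
            end_of[start] = start + size

    # Characters not covered by any accepted candidate become single tokens.
    tokens, i = [], 0
    while i < len(text):
        stop = end_of.get(i, i + 1)
        tokens.append(text[i:stop])
        i = stop
    return tokens


if __name__ == "__main__":
    print(segment("abXabYabZ"))  # ['ab', 'X', 'ab', 'Y', 'ab', 'Z']
```

In the toy string, only `ab` appears in more than one context, so it is the only candidate with nonzero entropy on both sides and is kept as a term at each occurrence. Real geological text would need a far larger input for the neighbor statistics to be meaningful.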
