    Geological Entity Recognition Based on the ELMO-CNN-BiLSTM-CRF Model

    Chu Deping, Wan Bo, Li Hong, Fang Fang, Wang Run

    Citation: Chu Deping, Wan Bo, Li Hong, Fang Fang, Wang Run, 2021. Geological Entity Recognition Based on ELMO-CNN-BiLSTM-CRF Model. Earth Science, 46(8): 3039-3048. doi: 10.3799/dqkx.2020.309


    doi: 10.3799/dqkx.2020.309
    Funding:

    National Key Research and Development Program of China, 2016YFB0502300

    China Geological Survey project, 12120114074001

    Details
      Author biography:

      Chu Deping (born 1997), male, master's degree; research interest: geological big data mining. ORCID: 0000-0003-3577-4973. E-mail: Chudeping_2019@cug.edu.cn

      Corresponding author:

      Wan Bo, ORCID: 0000-0003-2387-5419. E-mail: wanbo@cug.edu.cn

    • Chinese Library Classification (CLC) number: P628.4


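    • Abstract: Geological entities are the key, core information in geological texts, and recognizing them accurately is an important prerequisite for geological information extraction and mining. This paper designs an ELMO-CNN-BiLSTM-CRF model: a deep BiLSTM-CRF neural network is built on pre-trained character vectors, and dynamic word features and character-level word features are added to compensate for the lack of specificity in character vectors. This improves the recognition of complex polysemous terms in geological texts and the extraction of local features of geological entities. The model was evaluated on the "Exploration Geological Report of the Xiongcun Copper Deposit, Xietongmen County, Tibet Autonomous Region"; its precision, recall, and F1 score were 95.15%, 95.26%, and 95.21%, respectively. The experiments show that, compared with the BiLSTM-CRF and CNN-BiLSTM-CRF models, the proposed model performs better at geological entity recognition on a small corpus and can effectively recognize long geological entity terms and geological polysemous words.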
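
The CRF layer named in the model title picks a tag sequence jointly instead of tag by tag. The following is a minimal sketch of that idea, not the authors' implementation: Viterbi decoding over BIOES tags for a single GEO type, where transitions score 0 when the BIOES scheme allows them and minus infinity otherwise. A trained CRF would learn real-valued transition scores; the emission scores, `viterbi_decode`, `allowed`, and `fill` below are hypothetical names and made-up numbers used only for illustration.

```python
from typing import Dict, List

TAGS = ["O", "B-GEO", "I-GEO", "E-GEO", "S-GEO"]
NEG_INF = float("-inf")


def allowed(prev: str, curr: str) -> bool:
    """Hard BIOES constraint between two adjacent tags."""
    if prev.startswith(("B-", "I-")):  # inside an entity: must continue or end
        return curr.startswith(("I-", "E-"))
    return curr == "O" or curr.startswith(("B-", "S-"))  # outside: stay out or start


def viterbi_decode(emissions: List[Dict[str, float]]) -> List[str]:
    """Highest-scoring tag path; transitions score 0 if allowed, -inf otherwise."""
    # A sequence cannot start in the middle or at the end of an entity.
    score = {t: NEG_INF if t.startswith(("I-", "E-")) else emissions[0][t] for t in TAGS}
    backpointers: List[Dict[str, str]] = []
    for em in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for curr in TAGS:
            best, best_prev = max((score[p], p) for p in TAGS if allowed(p, curr))
            new_score[curr] = best + em[curr]
            pointer[curr] = best_prev
        score = new_score
        backpointers.append(pointer)
    # A sequence cannot end with an unfinished entity.
    last = max((t for t in TAGS if not t.startswith(("B-", "I-"))), key=lambda t: score[t])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


def fill(scores: Dict[str, float]) -> Dict[str, float]:
    """Give every tag a score, defaulting to 0.0."""
    return {t: scores.get(t, 0.0) for t in TAGS}


# Made-up per-character scores for a four-character input (illustrative only).
emissions = [
    fill({"O": 2.0}),
    fill({"B-GEO": 2.0, "O": 0.5}),
    fill({"O": 1.0, "I-GEO": 0.9}),
    fill({"E-GEO": 2.0, "O": 0.1}),
]
greedy = [max(TAGS, key=lambda t: em[t]) for em in emissions]
decoded = viterbi_decode(emissions)
print("greedy :", greedy)
print("viterbi:", decoded)
```

In this toy input the per-character argmax yields an invalid sequence (a B-GEO followed directly by O), while the constrained decoder returns a well-formed entity.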

       

    • Fig. 1. Named entity recognition workflow based on ELMO-CNN-BiLSTM-CRF

      Fig. 2. Encoding scheme of the bidirectional long short-term memory network

      Fig. 3. Effect of the ELMO feature vector dimension

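      Table 1. Sample geological text content

      No. | Example sentence
      1 | 麻木下组主要岩性为灰-灰白色中-厚层状结晶灰岩,含燧石结核大理岩,夹薄层泥灰岩和钙质角岩… (The Mamuxia Formation consists mainly of gray to grayish-white, medium- to thick-bedded crystalline limestone and chert-nodule-bearing marble, intercalated with thin-bedded marl and calcareous hornfels…)
      2 | 拉嘎组(C2P1l): 该组以含砾碎屑岩为特征. 下部为灰白、灰黄色中层细粒岩屑石英砂岩、长石石英砂岩、含砾泥质长石石英粉砂岩… (Laga Formation (C2P1l): this formation is characterized by pebble-bearing clastic rocks. The lower part consists of grayish-white and grayish-yellow, medium-bedded, fine-grained lithic quartz sandstone, feldspathic quartz sandstone, and pebble-bearing argillaceous feldspathic quartz siltstone…)
      3 | F1、F2断裂都具有早期韧性剪切、晚期脆性变形的特征,而且F1断裂韧性剪切特征更为明显… (The F1 and F2 faults both show early ductile shearing and late brittle deformation, and the ductile shearing is more pronounced in the F1 fault…)
      4 | 如钙碱性岩多见于褶皱区,碱性岩多见于断裂区等… (For example, calc-alkaline rocks occur mostly in fold zones, while alkaline rocks occur mostly in fault zones…)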

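      Table 2. Model parameters

      Component | Hyperparameter | Value
      CNN | Window size | 3
      CNN | Number of filters | 30
      ELMO | Projection dimension | 40
      LSTM | State size | 200
      LSTM | Initial state | 0.0
      LSTM | Peepholes |
      Other | Dropout rate | 0.5
      Other | Batch size | 10
      Other | Learning rate | 0.001
      Other | Gradient clipping | 5.0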
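
To make the CNN rows of the parameter table concrete, here is a minimal pure-Python sketch of a character-level convolution with window size 3 and 30 filters, followed by max-over-time pooling. It is an illustration under assumptions, not the paper's network: the embedding size (`EMBED_DIM = 8`), the zero padding, the random weights, the absence of bias and nonlinearity, and the names `char_cnn_features` and `random_embeddings` are all hypothetical.

```python
import random
from typing import List

WINDOW = 3        # CNN window size (from the parameter table)
NUM_FILTERS = 30  # number of filters (from the parameter table)
EMBED_DIM = 8     # assumption: the input embedding size is not given in the table

Matrix = List[List[float]]


def char_cnn_features(embeddings: Matrix, filters: List[Matrix]) -> List[float]:
    """Slide every filter over the sequence and max-pool over positions."""
    pad_row = [0.0] * len(embeddings[0])
    half = WINDOW // 2
    padded = [pad_row] * half + embeddings + [pad_row] * half
    features = []
    for filt in filters:  # each filter has shape WINDOW x EMBED_DIM
        responses = []
        for start in range(len(padded) - WINDOW + 1):
            window = padded[start:start + WINDOW]
            responses.append(sum(
                w * x
                for filt_row, emb_row in zip(filt, window)
                for w, x in zip(filt_row, emb_row)
            ))
        features.append(max(responses))  # max-over-time pooling
    return features


rng = random.Random(0)
filters = [[[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(WINDOW)]
           for _ in range(NUM_FILTERS)]


def random_embeddings(n_chars: int) -> Matrix:
    """Stand-in for learned character embeddings (hypothetical values)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(n_chars)]


short = char_cnn_features(random_embeddings(2), filters)   # e.g. a 2-character term
long = char_cnn_features(random_embeddings(11), filters)   # e.g. an 11-character entity
print(len(short), len(long))
```

Because of the pooling step, the feature vector has one value per filter regardless of how many characters the term contains, which is what lets it be concatenated with fixed-size embeddings.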

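      Table 3. Geological entity categories and examples

      Entity type | Examples
      Entity object (GEO) | 雅鲁藏布江缝合带 (Yarlung Zangbo suture zone), 班公湖-怒江缝合带 (Bangong Lake-Nujiang suture zone), 狮泉河-纳木错断裂带 (Shiquanhe-Nam Co fault zone), 冈底斯-念青唐古拉地体 (Gangdese-Nyainqentanglha terrane), 花岗岩 (granite), 白朗蛇绿岩带 (Bailang ophiolite belt), etc.
      Geological age (TIME) | 第四纪 (Quaternary), 震旦纪 (Sinian), 前寒武纪 (Precambrian), 古生代 (Paleozoic), etc.
      Geological process (PROCESS) | 大理岩化 (marmorization), 绢云母化 (sericitization), 铜矿化 (copper mineralization), etc.
      Other geological indicators (OTHERS) | 品位 (grade), 倾角 (dip angle), 产状 (attitude), etc.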

      Table 4. BIOES labeling sample

      Character | Label | Character | Label
       | O |  | O
       | O |  | O
       | O |  | B-GEO
       | O |  | I-GEO
       | O |  | E-GEO
       | O |  | O
       | B-GEO |  | O
       | E-GEO |  | O
      . | O |  | O
       | O |  | O
       | O | . | O
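
The BIOES scheme tags each character as Begin, Inside, or End of a multi-character entity, Single for a one-character entity, or Outside. A minimal sketch of converting between entity spans and BIOES tags follows. The helper names `spans_to_bioes` and `bioes_to_spans` are hypothetical, the sentence is borrowed from the recognition examples on this page, and typing 泥化 as PROCESS is an assumption made by analogy with 绢云母化.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def spans_to_bioes(length: int, spans: List[Span]) -> List[str]:
    """Turn character-offset spans into one BIOES tag per character."""
    tags = ["O"] * length
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"
            continue
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{etype}"
        tags[end - 1] = f"E-{etype}"
    return tags


def bioes_to_spans(tags: List[str]) -> List[Span]:
    """Recover spans from BIOES tags; unfinished entities are dropped."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, etype))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i + 1, etype))
            start = None
        elif prefix == "O":
            start = None
    return spans


text = "常具有强烈的绢云母化及泥化"
entities = [("绢云母化", "PROCESS"), ("泥化", "PROCESS")]  # types assumed for illustration
spans = [(text.index(e), text.index(e) + len(e), t) for e, t in entities]
tags = spans_to_bioes(len(text), spans)
for char, tag in zip(text, tags):
    print(char, tag)
```

Round-tripping the tags through `bioes_to_spans` returns the original spans, which is a quick sanity check when preparing a labeled corpus.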

      Table 5. Training results of different models

      Iterations | Model | Precision | Recall | F1
      100 | BiLSTM-CRF | 86.51% | 87.24% | 86.87%
      100 | CNN-BiLSTM-CRF | 92.49% | 91.01% | 91.74%
      100 | ELMO-CNN-BiLSTM-CRF | 94.83% | 94.39% | 94.61%
      200 | BiLSTM-CRF | 87.36% | 88.70% | 88.03%
      200 | CNN-BiLSTM-CRF | 93.24% | 90.48% | 91.84%
      200 | ELMO-CNN-BiLSTM-CRF | 94.95% | 94.57% | 94.76%
      500 | BiLSTM-CRF | 89.61% | 87.76% | 88.68%
      500 | CNN-BiLSTM-CRF | 92.17% | 91.13% | 91.64%
      500 | ELMO-CNN-BiLSTM-CRF | 95.15% | 95.26% | 95.21%
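
F1 is the harmonic mean of precision and recall, so the F1 column can be cross-checked against the other two. The sketch below recomputes it from the rounded percentages in the training results table; `f1_score` is a hypothetical helper name, and small differences from the reported values are expected because the inputs are already rounded to two decimals.

```python
from typing import List, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (iterations, model, precision %, recall %, reported F1 %), copied from the table.
rows: List[Tuple[int, str, float, float, float]] = [
    (100, "BiLSTM-CRF", 86.51, 87.24, 86.87),
    (100, "CNN-BiLSTM-CRF", 92.49, 91.01, 91.74),
    (100, "ELMO-CNN-BiLSTM-CRF", 94.83, 94.39, 94.61),
    (200, "BiLSTM-CRF", 87.36, 88.70, 88.03),
    (200, "CNN-BiLSTM-CRF", 93.24, 90.48, 91.84),
    (200, "ELMO-CNN-BiLSTM-CRF", 94.95, 94.57, 94.76),
    (500, "BiLSTM-CRF", 89.61, 87.76, 88.68),
    (500, "CNN-BiLSTM-CRF", 92.17, 91.13, 91.64),
    (500, "ELMO-CNN-BiLSTM-CRF", 95.15, 95.26, 95.21),
]

gaps = []
for iterations, model, p, r, reported in rows:
    recomputed = f1_score(p, r)
    gaps.append(abs(recomputed - reported))
    print(f"{iterations:>3} {model:<20} recomputed={recomputed:.2f} reported={reported:.2f}")

max_gap = max(gaps)
print(f"largest gap: {max_gap:.3f} percentage points")
```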

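      Table 6. Selected recognition examples from the ELMO-CNN-BiLSTM-CRF model

      Source text | Annotated entities | Recognized entities
      …主要对雄村铜矿体进行了较为详细的研究… (…a fairly detailed study was made mainly of the Xiongcun copper orebody…) | 铜矿体 (copper orebody) | 铜矿 (copper ore)
      …花岗闪长岩岩脉断裂形成层状地形… (…faulting of granodiorite dikes formed layered terrain…) | 花岗闪长岩岩脉 (granodiorite dike) | 花岗闪长岩岩脉
      …断裂南侧较老地层及局部超基性岩覆于北侧… (…the older strata and local ultrabasic rocks on the south side of the fault overlie the north side…) | 断裂 (fault), 地层 (strata), 超基性岩 (ultrabasic rock) | 断裂, 地层, 超基性岩
      …常具有强烈的绢云母化及泥化… (…commonly shows strong sericitization and argillization…) | 绢云母化 (sericitization), 泥化 (argillization) | 绢云母化, 泥化
      …中心相带以石英二长岩为主,边缘相带为花岗岩、斑状黑云母角闪石花岗岩… (…the central facies zone is dominated by quartz monzonite, and the marginal facies zone consists of granite and porphyritic biotite-hornblende granite…) | 石英二长岩 (quartz monzonite), 花岗岩 (granite), 斑状黑云母角闪石花岗岩 (porphyritic biotite-hornblende granite) | 中心相带 (central facies zone), 石英二长岩, 花岗岩, 斑状黑云母角闪石花岗岩
      …磨棱岩化花岗岩基本保留原岩特征… (…the mylonitized granite largely retains the characteristics of the protolith…) | 磨棱岩化花岗岩 (mylonitized granite) | 岩化 (fragment), 花岗岩 (granite)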
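
The recognition examples show three kinds of error: a truncated boundary (铜矿 for 铜矿体), a spurious entity (中心相带), and a long entity split into pieces (岩化 and 花岗岩). A minimal sketch of exact-match entity scoring on these six rows follows. It compares entity strings per sentence rather than character offsets, which is a simplification, and because the rows are hand-picked illustrations the resulting numbers say nothing about the model's overall accuracy.

```python
from typing import List, Set, Tuple

# (annotated entities, recognized entities) for each row of the examples table.
examples: List[Tuple[Set[str], Set[str]]] = [
    ({"铜矿体"}, {"铜矿"}),
    ({"花岗闪长岩岩脉"}, {"花岗闪长岩岩脉"}),
    ({"断裂", "地层", "超基性岩"}, {"断裂", "地层", "超基性岩"}),
    ({"绢云母化", "泥化"}, {"绢云母化", "泥化"}),
    ({"石英二长岩", "花岗岩", "斑状黑云母角闪石花岗岩"},
     {"中心相带", "石英二长岩", "花岗岩", "斑状黑云母角闪石花岗岩"}),
    ({"磨棱岩化花岗岩"}, {"岩化", "花岗岩"}),
]

# An entity counts as correct only if the recognized string matches exactly.
true_positives = sum(len(gold & pred) for gold, pred in examples)
n_gold = sum(len(gold) for gold, _ in examples)
n_pred = sum(len(pred) for _, pred in examples)

precision = true_positives / n_pred
recall = true_positives / n_gold
f1 = 2 * precision * recall / (precision + recall)

print(f"matched {true_positives} of {n_gold} annotated, {n_pred} recognized")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under exact matching, a partially correct boundary such as 铜矿 counts as both a false positive and a false negative, which is why boundary errors on long entities are costly.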
    Publication history
    • Received: 2020-09-17
    • Published online: 2021-09-14
    • Issue date: 2021-08-15
