Geological Entity Recognition Based on ELMO-CNN-BiLSTM-CRF Model
-
Abstract: Geological entities are the key information in geological texts, and recognizing them accurately is a prerequisite for geological information extraction and mining. This paper designs an ELMO-CNN-BiLSTM-CRF model: a deep BiLSTM-CRF neural network built on pre-trained character vectors and augmented with dynamic (context-dependent) word features and character-level word features. These additions compensate for the lack of specificity in character vectors, improving both the recognition of complex polysemous terms in geological text and the extraction of local features of geological entities. The model was evaluated on the Exploration Geological Report of the Xiongcun Copper Mine, Xietongmen County, Xizang Autonomous Region, where it reached a precision of 95.15%, a recall of 95.26% and an F1 value of 95.21%. The experiments show that, compared with the BiLSTM-CRF and CNN-BiLSTM-CRF models, this model performs better at geological entity recognition on a small-scale corpus and can effectively recognize long geological entity terms and geological polysemous words.
-
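Table 1. Sample geological text content

| No. | Example sentence |
| --- | --- |
| 1 | The main lithology of the Mamuxia Formation is gray to grayish-white, medium- to thick-bedded crystalline limestone and chert-nodule-bearing marble, interbedded with thin-bedded marl and calcareous hornfels… |
| 2 | Laga Formation (C2P1l): this formation is characterized by pebbly clastic rocks. The lower part consists of grayish-white and grayish-yellow, medium-bedded, fine-grained lithic quartz sandstone, feldspathic quartz sandstone, and pebbly argillaceous feldspathic quartz siltstone… |
| 3 | Faults F1 and F2 both show early ductile shearing and late brittle deformation, and the ductile shear features of fault F1 are more pronounced… |
| 4 | For example, calc-alkaline rocks are mostly found in fold zones, while alkaline rocks are mostly found in fault zones… |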
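Table 2. Model parameters

| Layer | Hyperparameter | Value |
| --- | --- | --- |
| CNN | Window size | 3 |
| CNN | Number of filters | 30 |
| ELMO | Projection dimension | 40 |
| LSTM | State size | 200 |
| LSTM | Initial state | 0.0 |
| LSTM | Peepholes | None |
| Other | Dropout rate | 0.5 |
| Other | Batch size | 10 |
| Other | Learning rate | 0.001 |
| Other | Gradient clipping | 5.0 |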
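As a small illustration, the values in Table 2 could be collected in one configuration object. This is only a sketch: the class name `ModelConfig` and all field names are hypothetical, and the derived BiLSTM output size assumes the usual concatenation of the forward and backward states, which this page does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters copied from Table 2 (field names are hypothetical)."""

    # CNN over the characters of each word
    cnn_window_size: int = 3
    cnn_num_filters: int = 30
    # ELMO
    elmo_projection_dim: int = 40
    # LSTM
    lstm_state_size: int = 200
    lstm_initial_state: float = 0.0
    lstm_use_peepholes: bool = False
    # Other training settings
    dropout_rate: float = 0.5
    batch_size: int = 10
    learning_rate: float = 0.001
    gradient_clip: float = 5.0

    @property
    def bilstm_output_dim(self) -> int:
        # Assumption: forward and backward states are concatenated.
        return 2 * self.lstm_state_size


config = ModelConfig()
print(config)
print("BiLSTM output size (assumed concatenation):", config.bilstm_output_dim)
```

Freezing the dataclass keeps the settings from being changed by accident during an experiment run.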
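Table 3. Classification of geological entities and related samples

| Entity type | Samples |
| --- | --- |
| Entity object (GEO) | Yarlung Zangbo suture zone, Bangong Lake-Nujiang suture zone, Shiquanhe-Namco fault zone, Gangdese-Nyainqentanglha terrane, granite, Bailang ophiolite belt, etc. |
| Geological age (TIME) | Quaternary, Sinian, Precambrian, Paleozoic, etc. |
| Geological process (PROCESS) | Marbleization, sericitization, copper mineralization, etc. |
| Other geological indicators (OTHERS) | Grade, dip angle, attitude, etc. |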
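Table 4. BIOES labeled sample

| Character | Tag | Character | Tag |
| --- | --- | --- | --- |
| 有 | O | 围 | O |
| 时 | O | 绕 | O |
| 包 | O | 黄 | B-GEO |
| 裹 | O | 铜 | I-GEO |
| 少 | O | 矿 | E-GEO |
| 量 | O | 呈 | O |
| 石 | B-GEO | 环 | O |
| 英 | E-GEO | 带 | O |
| . | O | 分 | O |
| 有 | O | 布 | O |
| 的 | O | . | O |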
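The labeling scheme in Table 4 can be reproduced with a short function. This is a minimal sketch, not the authors' annotation tool: the function name `bioes_tags` and the span format are hypothetical, and the sentence is reassembled on the assumption that the left column pair of Table 4 is read before the right one. The two GEO entities are 石英 (quartz) and 黄铜矿 (chalcopyrite).

```python
from typing import List, Tuple


def bioes_tags(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag every character of `text` with a BIOES label.

    `spans` holds (start, end, label) triples with `end` exclusive.
    Characters outside every span are tagged "O".
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if end - start == 1:
            # A one-character entity gets the single tag S.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"          # beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # inside
            tags[end - 1] = f"E-{label}"        # end
    return tags


# Sentence reassembled from Table 4 (assumed reading order: left columns first).
sentence = "有时包裹少量石英.有的围绕黄铜矿呈环带分布."
# Character offsets of the two GEO entities in that sentence.
spans = [(6, 8, "GEO"), (13, 16, "GEO")]

tags = bioes_tags(sentence, spans)
for char, tag in zip(sentence, tags):
    print(char, tag)
```

The S tag does not occur in the Table 4 sample but belongs to the BIOES scheme, so the sketch handles one-character entities as well.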
Table 5. Training results of different models

| Epochs | Model | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| 100 | BiLSTM-CRF | 86.51% | 87.24% | 86.87% |
| 100 | CNN-BiLSTM-CRF | 92.49% | 91.01% | 91.74% |
| 100 | ELMO-CNN-BiLSTM-CRF | 94.83% | 94.39% | 94.61% |
| 200 | BiLSTM-CRF | 87.36% | 88.70% | 88.03% |
| 200 | CNN-BiLSTM-CRF | 93.24% | 90.48% | 91.84% |
| 200 | ELMO-CNN-BiLSTM-CRF | 94.95% | 94.57% | 94.76% |
| 500 | BiLSTM-CRF | 89.61% | 87.76% | 88.68% |
| 500 | CNN-BiLSTM-CRF | 92.17% | 91.13% | 91.64% |
| 500 | ELMO-CNN-BiLSTM-CRF | 95.15% | 95.26% | 95.21% |
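The F1 column of Table 5 is consistent with the harmonic mean of precision and recall. The sketch below recomputes it from the rounded percentages in the table; small differences in the second decimal are expected because the published precision and recall are themselves rounded. The function name `f1_score` and the tuple layout are illustrative choices only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, epochs, precision %, recall %, reported F1 %) copied from Table 5.
reported = [
    ("BiLSTM-CRF", 100, 86.51, 87.24, 86.87),
    ("CNN-BiLSTM-CRF", 100, 92.49, 91.01, 91.74),
    ("ELMO-CNN-BiLSTM-CRF", 100, 94.83, 94.39, 94.61),
    ("BiLSTM-CRF", 200, 87.36, 88.70, 88.03),
    ("CNN-BiLSTM-CRF", 200, 93.24, 90.48, 91.84),
    ("ELMO-CNN-BiLSTM-CRF", 200, 94.95, 94.57, 94.76),
    ("BiLSTM-CRF", 500, 89.61, 87.76, 88.68),
    ("CNN-BiLSTM-CRF", 500, 92.17, 91.13, 91.64),
    ("ELMO-CNN-BiLSTM-CRF", 500, 95.15, 95.26, 95.21),
]

for model, epochs, p, r, f1 in reported:
    recomputed = f1_score(p, r)
    print(f"{model:22s} {epochs:4d}  reported={f1:.2f}  recomputed={recomputed:.2f}")

# Largest gap between the reported and the recomputed F1, in percentage points.
max_gap = max(abs(f1_score(p, r) - f1) for _, _, p, r, f1 in reported)
print("largest gap:", round(max_gap, 4))
```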
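Table 6. Selected recognition examples from the ELMO-CNN-BiLSTM-CRF model

| Source text | Annotated entities | Recognized entities |
| --- | --- | --- |
| …主要对雄村铜矿体进行了较为详细的研究… (…a fairly detailed study was made mainly of the Xiongcun copper orebody…) | 铜矿体 (copper orebody) | 铜矿 (copper ore) |
| …花岗闪长岩岩脉断裂形成层状地形… (…faulting of the granodiorite dike formed layered terrain…) | 花岗闪长岩岩脉 (granodiorite dike) | 花岗闪长岩岩脉 (granodiorite dike) |
| …断裂南侧较老地层及局部超基性岩覆于北侧… (…the older strata and local ultrabasic rocks on the south side of the fault overlie the north side…) | 断裂、地层、超基性岩 (fault, strata, ultrabasic rock) | 断裂、地层、超基性岩 (fault, strata, ultrabasic rock) |
| …常具有强烈的绢云母化及泥化… (…often shows strong sericitization and argillization…) | 绢云母化、泥化 (sericitization, argillization) | 绢云母化、泥化 (sericitization, argillization) |
| …中心相带以石英二长岩为主,边缘相带为花岗岩、斑状黑云母角闪石花岗岩… (…the central facies zone is dominated by quartz monzonite, and the marginal facies zone consists of granite and porphyritic biotite hornblende granite…) | 石英二长岩、花岗岩、斑状黑云母角闪石花岗岩 (quartz monzonite, granite, porphyritic biotite hornblende granite) | 中心相带、石英二长岩、花岗岩、斑状黑云母角闪石花岗岩 (central facies zone, quartz monzonite, granite, porphyritic biotite hornblende granite) |
| …磨棱岩化花岗岩基本保留原岩特征… (…the mylonitized granite largely retains the features of the original rock…) | 磨棱岩化花岗岩 (mylonitized granite) | 岩化、花岗岩 (fragment "岩化", granite) |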
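To show how entity-level precision, recall and F1 are typically computed, the sketch below scores the six rows of Table 6 by exact string match. It rests on several assumptions: the six rows are only a hand-picked sample, so the resulting numbers are not comparable to Table 5; entities are matched as strings per sentence rather than by character position; and the function name `entity_prf` is hypothetical.

```python
from typing import List, Set, Tuple


def entity_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 with exact string matching.

    `gold` and `pred` hold one set of entity strings per sentence.
    """
    correct = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Annotated entities from Table 6, one set per row.
gold = [
    {"铜矿体"},
    {"花岗闪长岩岩脉"},
    {"断裂", "地层", "超基性岩"},
    {"绢云母化", "泥化"},
    {"石英二长岩", "花岗岩", "斑状黑云母角闪石花岗岩"},
    {"磨棱岩化花岗岩"},
]
# Recognized entities from Table 6, in the same row order.
pred = [
    {"铜矿"},
    {"花岗闪长岩岩脉"},
    {"断裂", "地层", "超基性岩"},
    {"绢云母化", "泥化"},
    {"中心相带", "石英二长岩", "花岗岩", "斑状黑云母角闪石花岗岩"},
    {"岩化", "花岗岩"},
]

precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under exact matching, a truncated boundary such as 铜矿 for 铜矿体 counts as both a missed entity and a spurious one, which is why boundary errors on long entity names lower precision and recall at the same time.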