Debris Flow Susceptibility and Its Reliability Based on Random Forest and GIS
Abstract: The models most widely used for GIS-based debris-flow susceptibility (DFS) assessment have clear limitations. Models based on classical statistics (e.g. information value, weight of evidence, and certainty factors) require the debris-flow conditioning factors to be mutually independent, and the factor weights depend on how each factor is divided into classes. Linear machine-learning models may fail on nonlinear classification problems, whereas the hyper-parameters of common nonlinear models are difficult to tune. Random forest (RF) overcomes most of these shortcomings but has rarely been applied to DFS assessment. This article investigates RF-based DFS assessment and evaluates its reliability, using 26 conditioning factors and four models whose hyper-parameters are tuned by Bayesian optimization: RF, linear support vector machine (LSVM), radial basis function support vector machine (RBF-SVM), and quadratic discriminant analysis (QDA). DFS is first assessed with a modified five-fold cross-validation method. The reliability of the assessment is then examined under different combinations of conditioning factors, using the RF relative-weight ranking, and under random sample splits, using the Monte Carlo method. The results show that 21 of the 26 conditioning factors indicate differences in the debris-flow-forming environment between the low-susceptibility and relatively high-susceptibility zones delineated by RF. The RF relative-weight ranking effectively identifies the locally optimal factor combination for each of the four models. The uncertainty caused by random sample splitting is greatest in the medium susceptibility zone (0.4~0.6) and can be reduced by increasing the proportion of samples used for model building and by improving the susceptibility model. RF achieves AUC = 0.86, overall accuracy = 0.79, F1 score = 0.66, and Brier score = 0.14, and the reliability of these indices is the best among the four models. RF can therefore be a preferred model for quantitative DFS assessment.
Fig. 8. Distribution of the prediction performance indices
a. LSVM (a=1.43, c=91, s=0.825, μ=[0.82301, 0.82379]); RBF-SVM (a=6.08, c=90.62, s=0.82, μ=[0.83053, 0.83092]); QDA (a=2.41, c=490, s=0.79, μ=[0.81685, 0.81701]); RF (a=8.09, c=131, s=0.85, μ=[0.86010, 0.86036]).
b. LSVM (a=216.9, c=25, s=0.16, μ=[0.17017, 0.17029]); RBF-SVM (a=28.1, c=28, s=0.15, μ=[0.15527, 0.15541]); QDA (a=139.7, c=30, s=0.16, μ=[0.17417, 0.17428]); RF (a=18.7, c=38, s=0.14, μ=[0.14063, 0.14074]).
The parentheses after each model give the parameters of the corresponding distribution: a and c are shape parameters, s is the scale parameter, the location parameter is 0 in all cases, and μ is the 95% confidence interval of the mean.
Table 1. Summary of conditioning factors
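Table 2. Confusion matrices of the 4 models

| Model | Actual class | Predicted: non-debris flow | Predicted: debris flow |
|---|---|---|---|
| LSVM (linear support vector machine) | Non-debris flow | 915 | 135 |
| LSVM (linear support vector machine) | Debris flow | 202 | 318 |
| RBF-SVM (RBF support vector machine) | Non-debris flow | 979 | 71 |
| RBF-SVM (RBF support vector machine) | Debris flow | 285 | 235 |
| QDA (quadratic discriminant analysis) | Non-debris flow | 835 | 215 |
| QDA (quadratic discriminant analysis) | Debris flow | 162 | 358 |
| RF (random forest) | Non-debris flow | 931 | 119 |
| RF (random forest) | Debris flow | 204 | 316 |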
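Table 3. Classification performance of the models

| Susceptibility model | Overall accuracy (%) | Debris-flow precision (%) | Debris-flow recall (%) | F1 score (%) | AUC (%) |
|---|---|---|---|---|---|
| LSVM | 78.54 | 70.20 | 61.15 | 65.36 | 81.4 |
| RBF-SVM | 77.32 | 76.80 | 45.19 | 56.90 | 82.8 |
| QDA | 75.99 | 62.48 | 68.85 | 65.51 | 81.7 |
| RF | 79.43 | 72.64 | 60.77 | 66.18 | 85.9 |
| Completely random | 50.00 | 33.00 | 50.00 | 39.75 | 50.0 |

Note: overall accuracy = number of correctly classified units / total number of units; debris-flow precision = number of correctly predicted debris-flow units / number of units predicted as debris flow; debris-flow recall = number of correctly predicted debris-flow units / number of actual debris-flow units; F1 = 2 × precision × recall / (precision + recall).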
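The metric definitions in the note to Table 3 can be checked directly against the counts in Table 2. The following is a minimal sketch, not the authors' code: the dictionary name `CONFUSION` and the function `classification_metrics` are illustrative, and "positive" is assumed to mean a debris-flow unit.

```python
# Confusion-matrix counts copied from Table 2, stored as (TN, FP, FN, TP),
# where "positive" is assumed to mean a debris-flow unit.
CONFUSION = {
    "LSVM": (915, 135, 202, 318),
    "RBF-SVM": (979, 71, 285, 235),
    "QDA": (835, 215, 162, 358),
    "RF": (931, 119, 204, 316),
}


def classification_metrics(tn, fp, fn, tp):
    """Return overall accuracy, precision, recall and F1 as percentages."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total      # correctly classified / all units
    precision = tp / (tp + fp)        # correct among predicted debris flow
    recall = tp / (tp + fn)           # correct among actual debris flow
    f1 = 2 * precision * recall / (precision + recall)
    values = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    return {name: round(100 * value, 2) for name, value in values.items()}


if __name__ == "__main__":
    for model, counts in CONFUSION.items():
        print(model, classification_metrics(*counts))
```

Run on the four matrices, this reproduces the accuracy, precision, recall and F1 columns of Table 3; AUC cannot be recovered from a confusion matrix because it depends on the full ranking of predicted scores.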
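Table 4. Locally optimal combination of conditioning factors for each model

| Model | Factor combination | AUC increase | Brier score decrease |
|---|---|---|---|
| LSVM | Factors ranked 1~11 by relative weight | 1.8% | 0.7% |
| RBF-SVM | Factors ranked 1~21 by relative weight | 1.0% | 1.4% |
| QDA | Factors ranked 1~12 by relative weight | 0.4% | 7.0% |
| RF | Factors ranked 1~12 by relative weight | 0.4% | 1.7% |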
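Each combination in Table 4 is a prefix of the factors ordered by RF relative weight. The text here does not spell out the search procedure, so the sketch below shows only one plausible reading: score every top-k prefix of the ranked list and keep the best. The function `best_top_k`, the factor names, and the toy scoring function are all hypothetical; in practice `score_fn` would train a model on the given factors and return a validation score such as AUC.

```python
def best_top_k(ranked_factors, score_fn):
    """Score every prefix of the ranked factor list and return the best one.

    ranked_factors: factor names sorted by decreasing relative weight.
    score_fn: callable taking a list of factors and returning a score
              where higher is better (for example a validation AUC).
    """
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_factors) + 1):
        score = score_fn(ranked_factors[:k])
        # Strict comparison: on ties the smaller factor set is kept.
        if score > best_score:
            best_k, best_score = k, score
    return ranked_factors[:best_k], best_score


# Hypothetical illustration only: each factor adds a fixed amount to a
# baseline score of 0.5, and low-ranked factors slightly hurt the score.
TOY_GAIN = {"f1": 0.20, "f2": 0.10, "f3": 0.05, "f4": -0.01, "f5": -0.02}


def toy_score(factors):
    return 0.5 + sum(TOY_GAIN[name] for name in factors)


if __name__ == "__main__":
    subset, score = best_top_k(list(TOY_GAIN), toy_score)
    print(subset, round(score, 3))
```

Because only prefixes of one ranking are searched, the result is a local optimum rather than the best of all possible factor subsets, which matches the "locally optimal" wording of the table caption.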
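Table 5. Mean evaluation indices of 2 000 susceptibility assessments

| Susceptibility model | Overall accuracy (%) | Debris-flow precision (%) | Debris-flow recall (%) | F1 score (%) | AUC (%) | Brier score |
|---|---|---|---|---|---|---|
| LSVM | 78.10 | 69.63 | 60.20 | 64.57 | 82.3 | 0.176 |
| RBF-SVM | 76.80 | 75.17 | 44.77 | 56.09 | 83.1 | 0.155 |
| QDA | 76.11 | 62.65 | 69.00 | 65.67 | 81.7 | 0.174 |
| RF | 79.30 | 72.86 | 59.76 | 65.66 | 86.0 | 0.140 |

Note: with 2 000 samples, the 95% confidence interval of each index mean is resolved to the fourth decimal place. This is precise enough for using the models and comparing them with one another, so the table gives the means directly rather than as confidence intervals.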
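The note to Table 5 relies on how narrow a confidence interval of the mean becomes after 2 000 repetitions. The sketch below illustrates this with a normal-approximation interval (mean ± 1.96 × standard error). How the intervals were actually computed is not stated here, so the formula is an assumption, and the AUC values are synthetic stand-ins generated for illustration, not data from the study.

```python
import math
import random
import statistics


def mean_ci95(values):
    """Return the mean and a normal-approximation 95% CI of the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, (mean - half_width, mean + half_width)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for 2 000 Monte Carlo AUC values; the centre
    # (0.86) and spread (0.003) are assumptions chosen for illustration.
    aucs = [rng.gauss(0.86, 0.003) for _ in range(2000)]
    mean, (low, high) = mean_ci95(aucs)
    print(f"mean={mean:.5f}, 95% CI=[{low:.5f}, {high:.5f}]")
```

The interval width shrinks in proportion to 1/√n, so at n = 2 000 even a run-to-run spread of a few thousandths leaves an interval only a few ten-thousandths wide.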
