Project: Tianchi Jinnan Digital Manufacturing

[Track 1]
URL: https://tianchi.aliyun.com/competition/entrance/231695/information

Feature 01_feature observation

INFO:__main__:columns_info:
         type  count count_rate  unique_count                                         unique_set         mean         std      min      25%       50%       75%        max
A1       int   1396      1.000             3                 [(300, 1377), (200, 13), (250, 6)]   298.853868   10.130552  200.000  300.000   300.000   300.000   300.0000
A10      int   1396      1.000             4    [(100, 658), (102, 416), (101, 298), (103, 24)]   100.861032    0.905198  100.000  100.000   101.000   102.000   103.0000
A11   object   1396      1.000            94  [(9:00:00, 251), (17:00:00, 247), (1:00:00, 15...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A12    float   1396      1.000             9  [(103.0, 684), (102.0, 406), (104.0, 140), (10...   102.641834    0.915387   98.000  102.000   103.000   103.000   107.0000
A13    float   1396      1.000             3                [(0.2, 1394), (0.15, 1), (0.12, 1)]     0.199907    0.002524    0.120    0.200     0.200     0.200     0.2000
A14   object   1396      1.000            92  [(10:00:00, 251), (18:00:00, 248), (2:00:00, 1...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A15    float   1396      1.000            10  [(104.0, 695), (103.0, 309), (105.0, 220), (10...   103.829370    0.963639  100.000  103.000   104.000   104.000   109.0000
A16   object   1396      1.000            94  [(11:00:00, 250), (19:00:00, 248), (3:00:00, 1...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A17    float   1396      1.000            13  [(105.0, 688), (104.0, 177), (106.0, 171), (10...   104.766905    1.401446   89.000  104.000   105.000   105.000   108.0000
A18    float   1396      1.000             2                            [(0.2, 1395), (0.1, 1)]     0.199928    0.002676    0.100    0.200     0.200     0.200     0.2000
A19      int   1396      1.000             6  [(200, 906), (300, 459), (100, 27), (150, 2), ...   231.067335   50.478071  100.000  200.000   200.000   300.000   350.0000
A2     float     42      0.030             2                                      [(125.0, 42)]   125.000000    0.000000  125.000  125.000   125.000   125.000   125.0000
A20   object   1396      1.000           159  [(11:00-12:00, 239), (19:00-20:00, 233), (3:00...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A21    float   1393      0.998            13  [(50.0, 1254), (40.0, 63), (30.0, 42), (35.0, ...    48.707825    4.976531   20.000   50.000    50.000    50.000    90.0000
A22    float   1396      1.000             4     [(9.0, 1216), (10.0, 174), (8.0, 5), (3.5, 1)]     9.117120    0.369152    3.500    9.000     9.000     9.000    10.0000
A23    float   1393      0.998             4                 [(5.0, 1391), (10.0, 1), (4.0, 1)]     5.002872    0.136638    4.000    5.000     5.000     5.000    10.0000
A24   object   1395      0.999            92  [(12:00:00, 258), (20:00:00, 244), (4:00:00, 1...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A25   object   1396      1.000            15  [(80, 542), (70, 527), (78, 186), (79, 79), (7...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A26   object   1394      0.999            89  [(13:00:00, 265), (21:00:00, 248), (5:00:00, 1...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A27    float   1396      1.000            13  [(73.0, 630), (78.0, 282), (75.0, 209), (72.0,...    74.396848    3.044490   45.000   73.000    73.000    77.000    80.0000
A28   object   1396      1.000           157  [(13:00-14:00, 243), (21:00-22:00, 234), (5:00...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A3     float   1354      0.970             4           [(405.0, 1336), (270.0, 12), (340.0, 6)]   403.515510   13.348093  270.000  405.000   405.000   405.000   405.0000
A4       int   1396      1.000             4      [(700, 1336), (980, 42), (470, 12), (590, 6)]   705.974212   53.214754  470.000  700.000   700.000   700.000   980.0000
A5    object   1396      1.000            67  [(6:00:00, 269), (14:00:00, 260), (22:00:00, 1...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A6     float   1396      1.000            39  [(29.0, 493), (30.0, 198), (21.0, 197), (24.0,...    28.287751    6.742765   17.000   24.000    29.000    30.000    97.0000
A7    object    149      0.107            76  [(12:40:00, 13), (15:40:00, 7), (7:00:00, 5), ...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A8     float    149      0.107             9  [(80.0, 118), (73.0, 16), (74.0, 8), (82.0, 3)...    78.818792    2.683920   70.000   80.000    80.000    80.000    82.0000
A9    object   1396      1.000            95  [(8:00:00, 252), (16:00:00, 248), (0:00:00, 15...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
B1     float   1386      0.993            22  [(320.0, 751), (300.0, 127), (350.0, 113), (34...   334.452742  105.120753    3.500  320.000   320.000   330.000  1200.0000
B10   object   1152      0.825           181  [(10:30-12:00, 166), (18:30-20:00, 165), (2:30...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
B11   object    547      0.392            38  [(20:00-21:00, 154), (12:00-13:00, 140), (4:00...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
B12    float   1395      0.999             5  [(1200.0, 777), (800.0, 584), (900.0, 20), (40...  1020.215054  205.920155  400.000  800.000  1200.000  1200.000  1200.0000
B13    float   1395      0.999             4               [(0.15, 1388), (0.03, 6), (0.06, 1)]     0.149419    0.008213    0.030    0.150     0.150     0.150     0.1500
B14      int   1396      1.000            21  [(400, 740), (420, 329), (440, 226), (460, 35)...   410.403295   26.018410   40.000  400.000   400.000   420.000   460.0000
B2     float   1394      0.999             4                [(3.5, 1374), (0.15, 19), (3.6, 1)]     3.454412    0.388585    0.150    3.500     3.500     3.500     3.6000
B3     float   1394      0.999             3                            [(3.5, 1393), (3.6, 1)]     3.500072    0.002678    3.500    3.500     3.500     3.500     3.6000
B4    object   1396      1.000           178  [(14:00-15:00, 240), (22:00-23:00, 212), (6:00...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
B5    object   1395      0.999            61  [(15:00:00, 245), (23:00:00, 215), (7:00:00, 1...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
B6       int   1396      1.000            36  [(80, 640), (65, 299), (60, 155), (79, 51), (7...    72.065186    9.161986   40.000   65.000    78.000    80.000    80.0000
B7    object   1396      1.000            58  [(17:00:00, 263), (1:00:00, 211), (9:00:00, 16...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
B8     float   1395      0.999            26  [(45.0, 1085), (40.0, 142), (50.0, 44), (28.0,...    43.709677    4.338396   20.000   45.000    45.000    45.000    73.0000
B9    object   1396      1.000           178  [(17:00-18:30, 189), (9:00-10:30, 158), (1:00-...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
rate   float   1396      1.000            73  [(0.902, 305), (0.93, 128), (0.890999999999999...     0.923244    0.030880    0.624    0.902     0.925     0.943     1.0008
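
A minimal sketch of how a summary like columns_info can be assembled, assuming pandas and the train_data frame (build_columns_info is a name I made up; dtype names will read int64/float64 rather than the shortened int/float shown above):
import pandas as pd

def build_columns_info(df: pd.DataFrame) -> pd.DataFrame:
    info = pd.DataFrame({
        'type': df.dtypes.astype(str),
        'count': df.count(),                           # non-null rows per column
        'count_rate': (df.count() / len(df)).round(3),
        'unique_count': df.nunique(),
        # most frequent (value, count) pairs, as shown in unique_set above
        'unique_set': [list(df[c].value_counts().items()) for c in df.columns],
    })
    # numeric summary stats; object columns stay NaN, as in the table
    return info.join(df.describe().T[['mean', 'std', 'min', '25%', '50%', '75%', 'max']])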


target_corr
       pearson  spearman      mine
rate  1.000000  1.000000  0.998080
B14   0.478892  0.675522  0.835437
B12   0.392409  0.432610  0.522049
B6    0.365125  0.403446  0.462616
A10   0.350775  0.392945  0.438394
A19  -0.217527 -0.247848  0.377268
A27  -0.175551 -0.252845  0.360564
B1    0.102545  0.072208  0.285754
A6    0.026943  0.119990  0.268452
A15   0.206822  0.249648  0.258671
A12   0.254165  0.304103  0.243930
A17   0.177628  0.222954  0.241899
B8    0.174779  0.213332  0.187707
A22  -0.171821 -0.189534  0.142401
A8    0.210669  0.102360  0.139327
A2         NaN       NaN  0.117767
A21   0.108092  0.129685  0.108338
A4   -0.213740 -0.169453  0.074376
A1    0.021058  0.012813  0.052154
B13  -0.064316 -0.063643  0.036630
A3    0.023490  0.012256  0.036331
B2   -0.096040 -0.098150  0.034360
B3   -0.008892 -0.007674  0.033967
A23   0.027710  0.051047  0.030690
A13   0.011695  0.019792  0.004721
A18   0.008885  0.007691  0.003226
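
A sketch of how the three correlation columns could be computed, assuming scipy for Spearman and the minepy package for the 'mine' (MIC) column; treat the minepy usage as an assumption, not what was actually run:
import pandas as pd
from scipy.stats import spearmanr
from minepy import MINE

def target_corr(df: pd.DataFrame, target: str) -> pd.DataFrame:
    rows = {}
    for col in df.select_dtypes('number').columns:
        pair = df[[col, target]].dropna()
        mine = MINE()
        mine.compute_score(pair[col].values, pair[target].values)
        rows[col] = {
            'pearson': pair[col].corr(pair[target]),                   # linear correlation
            'spearman': spearmanr(pair[col], pair[target]).correlation,
            'mine': mine.mic(),                                        # maximal information coefficient
        }
    return pd.DataFrame(rows).T.sort_values('mine', ascending=False)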


The float columns are basically discrete:
       type  count count_rate  unique_count                                         unique_set         mean         std      min      25%       50%       75%        max
A18   float   1396      1.000             2                            [(0.2, 1395), (0.1, 1)]     0.199928    0.002676    0.100    0.200     0.200     0.200     0.2000
A2    float     42      0.030             2                                      [(125.0, 42)]   125.000000    0.000000  125.000  125.000   125.000   125.000   125.0000
A13   float   1396      1.000             3                [(0.2, 1394), (0.15, 1), (0.12, 1)]     0.199907    0.002524    0.120    0.200     0.200     0.200     0.2000
B3    float   1394      0.999             3                            [(3.5, 1393), (3.6, 1)]     3.500072    0.002678    3.500    3.500     3.500     3.500     3.6000
B2    float   1394      0.999             4                [(3.5, 1374), (0.15, 19), (3.6, 1)]     3.454412    0.388585    0.150    3.500     3.500     3.500     3.6000
A22   float   1396      1.000             4     [(9.0, 1216), (10.0, 174), (8.0, 5), (3.5, 1)]     9.117120    0.369152    3.500    9.000     9.000     9.000    10.0000
A23   float   1393      0.998             4                 [(5.0, 1391), (10.0, 1), (4.0, 1)]     5.002872    0.136638    4.000    5.000     5.000     5.000    10.0000
B13   float   1395      0.999             4               [(0.15, 1388), (0.03, 6), (0.06, 1)]     0.149419    0.008213    0.030    0.150     0.150     0.150     0.1500
A3    float   1354      0.970             4           [(405.0, 1336), (270.0, 12), (340.0, 6)]   403.515510   13.348093  270.000  405.000   405.000   405.000   405.0000
B12   float   1395      0.999             5  [(1200.0, 777), (800.0, 584), (900.0, 20), (40...  1020.215054  205.920155  400.000  800.000  1200.000  1200.000  1200.0000
A12   float   1396      1.000             9  [(103.0, 684), (102.0, 406), (104.0, 140), (10...   102.641834    0.915387   98.000  102.000   103.000   103.000   107.0000
A8    float    149      0.107             9  [(80.0, 118), (73.0, 16), (74.0, 8), (82.0, 3)...    78.818792    2.683920   70.000   80.000    80.000    80.000    82.0000
A15   float   1396      1.000            10  [(104.0, 695), (103.0, 309), (105.0, 220), (10...   103.829370    0.963639  100.000  103.000   104.000   104.000   109.0000
A27   float   1396      1.000            13  [(73.0, 630), (78.0, 282), (75.0, 209), (72.0,...    74.396848    3.044490   45.000   73.000    73.000    77.000    80.0000
A21   float   1393      0.998            13  [(50.0, 1254), (40.0, 63), (30.0, 42), (35.0, ...    48.707825    4.976531   20.000   50.000    50.000    50.000    90.0000
A17   float   1396      1.000            13  [(105.0, 688), (104.0, 177), (106.0, 171), (10...   104.766905    1.401446   89.000  104.000   105.000   105.000   108.0000
B1    float   1386      0.993            22  [(320.0, 751), (300.0, 127), (350.0, 113), (34...   334.452742  105.120753    3.500  320.000   320.000   330.000  1200.0000
B8    float   1395      0.999            26  [(45.0, 1085), (40.0, 142), (50.0, 44), (28.0,...    43.709677    4.338396   20.000   45.000    45.000    45.000    73.0000
A6    float   1396      1.000            39  [(29.0, 493), (30.0, 198), (21.0, 197), (24.0,...    28.287751    6.742765   17.000   24.000    29.000    30.000    97.0000
rate  float   1396      1.000            73  [(0.902, 305), (0.93, 128), (0.890999999999999...     0.923244    0.030880    0.624    0.902     0.925     0.943     1.0008


columns_info[columns_info['type']=='object']
       type  count count_rate  unique_count                                         unique_set  mean  std  min  25%  50%  75%  max
A11  object   1389      1.000            94  [(9:00:00, 251), (17:00:00, 245), (1:00:00, 15...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
A14  object   1389      1.000            92  [(10:00:00, 251), (18:00:00, 246), (2:00:00, 1...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
A16  object   1389      1.000            94  [(11:00:00, 250), (19:00:00, 246), (3:00:00, 1...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
A20  object   1389      1.000           158  [(11:00-12:00, 239), (19:00-20:00, 233), (3:00...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
A24  object   1389      1.000            91  [(12:00:00, 259), (20:00:00, 243), (4:00:00, 1...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
A26  object   1389      1.000            87  [(13:00:00, 267), (21:00:00, 248), (5:00:00, 1...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
A28  object   1389      1.000           157  [(13:00-14:00, 243), (21:00-22:00, 234), (5:00...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
A5   object   1389      1.000            67  [(6:00:00, 269), (14:00:00, 260), (22:00:00, 1...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
A9   object   1389      1.000            95  [(8:00:00, 252), (16:00:00, 246), (0:00:00, 15...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
B4   object   1389      1.000           177  [(14:00-15:00, 240), (22:00-23:00, 212), (6:00...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
B5   object   1389      1.000            60  [(15:00:00, 246), (23:00:00, 215), (7:00:00, 1...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
B7   object   1389      1.000            58  [(17:00:00, 263), (1:00:00, 211), (9:00:00, 16...   NaN  NaN  NaN  NaN  NaN  NaN  NaN
B9   object   1389      1.000           175  [(17:00-18:30, 189), (9:00-10:30, 157), (1:00-...   NaN  NaN  NaN  NaN  NaN  NaN  NaN

Feature 02_drop features with high missing rates or a single dominant value

Drop features with a valid-value rate < 0.90:
drop_columns
Index(['A2', 'A7', 'A8', 'B10', 'B11'], dtype='object')

Drop single-valued features (see the sketch below):
A1, A13, A18, A23, A3, A4, B13, B2, B3
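
A minimal sketch of both drop rules against the columns_info table above (the 0.90 cutoff is from the text; the 0.95 dominance cutoff is my assumption, though it reproduces the list above):
# rule 1: valid-value rate below 0.90
low_coverage = columns_info.index[columns_info['count_rate'] < 0.90]

# rule 2: one value dominates the column
def top_value_share(row):
    return row['unique_set'][0][1] / row['count']   # count of the most frequent value

single_valued = columns_info.index[columns_info.apply(top_value_share, axis=1) > 0.95]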

Feature 03_feature visualization (corr_pic)

1, in the corr_pic plots the target mean is basically equal across feature values; no strong association found
2, the nominally float columns are in fact mostly discrete
3, dummy-encode float features with unique_count < 20; leave those with > 20 alone for now
4, clean the single dirty value in A25 and convert the column to int (sketch below)
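
A minimal sketch of the A25 cleanup, assuming the dirty cell is a non-numeric string (coerce-then-mode is my choice, not necessarily the exact fix used):
import pandas as pd

# the dirty entry becomes NaN, gets the column mode, and the column ends up int
a25 = pd.to_numeric(train_data['A25'], errors='coerce')
train_data['A25'] = a25.fillna(a25.mode()[0]).astype(int)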

Feature 04_first baseline

Base strategy 01
1, drop features with valid-value rate < 90%: ['A2', 'A7', 'A8', 'B10', 'B11']
2, drop single-valued features: ['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23', 'B3', 'B13']
3, fill missing values with the column mode: ['A21', 'A23', 'A24', 'A26', 'A3', 'B1', 'B2', 'B3', 'B12', 'B13', 'B5', 'B8']
4, dummy-encode float features with unique_count < 20; leave those with > 20 untouched for now
5, predict with the resulting float-derived features (a pipeline sketch follows the results table)
feature_columns = list(dummy_result_columns) + list(cannot_dummy_columns)
                                                name                  model          mean           std
0  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...  DecisionTreeRegressor -9.625545e-04  7.795378e-05
1  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...  RandomForestRegressor -7.542296e-04  6.568240e-05
2  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...           XGBRegressor -7.024208e-04  8.961907e-05
3  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...                    SVR -1.951809e-03  9.253026e-05
4  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...       LinearRegression -8.229440e+16  1.645888e+17
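
A minimal sketch of steps 1-4, assuming train_data is the frame and 'rate' is the target column; the dummy_result_columns / cannot_dummy_columns names mirror the line above, everything else comes from the listed steps:
import pandas as pd

drop_cols = ['A2', 'A7', 'A8', 'B10', 'B11',                             # step 1
             'A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23', 'B3', 'B13']   # step 2
train_data = train_data.drop(columns=drop_cols)

# step 3: mode imputation on whatever still has gaps
for col in train_data.columns[train_data.isna().any()]:
    train_data[col] = train_data[col].fillna(train_data[col].mode()[0])

# step 4: dummy-encode low-cardinality floats, keep the rest untouched
float_cols = train_data.select_dtypes('float').columns.drop('rate')
can_dummy = [c for c in float_cols if train_data[c].nunique() < 20]
cannot_dummy_columns = [c for c in float_cols if train_data[c].nunique() >= 20]
dummies = pd.get_dummies(train_data[can_dummy].astype(str),
                         prefix=[c + '_dummy' for c in can_dummy])
dummy_result_columns = dummies.columns
train_data = pd.concat([train_data, dummies], axis=1)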


Base strategy 01_change 01
6, drop rows with target < 0.85 (inserted after step 3 and before step 4 above)

                                                name                  model          mean           std
0  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...  DecisionTreeRegressor -8.219052e-04  6.105489e-05
1  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...  RandomForestRegressor -6.369389e-04  4.182528e-05
2  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...           XGBRegressor -5.989824e-04  4.076728e-05
3  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...                    SVR -8.555181e-04  4.585744e-05
4  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...       LinearRegression -6.930078e+16  1.386016e+17

The errors all did decrease, by roughly 1e-04.

Aside: computing the baseline error
mean_squared_error(train_data[target_column],[train_data[target_column].mean() for i in range(0,train_data.shape[0])])
0.0008223730641064133

In other words, always predicting the target's mean gives an MSE of 0.00082, a rough yardstick for how far a model is from blind guessing.
Our current level is basically blind guessing.

Feature 05_time feature processing

Typical formats:
9:00:00: B7, B5, A9, A5, A26, A24, A16, A14, A11
11:00-12:00: B9, B4, A28, A20

Extracted features:
Option 1: start time (hour)
Option 2: start hour, end hour, duration in hours (see the sketch below)
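
A minimal parsing sketch, assuming clean strings in the two formats above (function names are mine; missing values are not handled here):
def start_hour(t: str) -> int:
    # "9:00:00" -> 9 ; "11:00-12:00" -> 11
    return int(t.split('-')[0].split(':')[0])

def period_hours(t: str):
    # "11:00-12:00" -> (start, end, duration), wrapping periods that cross midnight
    start, end = (int(p.split(':')[0]) for p in t.split('-'))
    return start, end, (end - start) % 24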

Appended at the end of the baseline pipeline.

Score before processing (at this point the features are only the dummies derived from the float features):
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
                                                name                  model          mean           std
0  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...  DecisionTreeRegressor -8.049540e-04  7.960457e-05
1  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...  RandomForestRegressor -6.537043e-04  3.725424e-05
2  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...           XGBRegressor -5.961219e-04  3.902166e-05
3  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...                    SVR -8.561289e-04  4.623989e-05
4  71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...       LinearRegression -6.955097e+16  1.391019e+17

Time features alone, then time features combined with the original float features:
                                                name                  model          mean           std
0  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...  DecisionTreeRegressor -8.652548e-04  7.786419e-05
1  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...  RandomForestRegressor -6.603347e-04  4.249767e-05
2  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...           XGBRegressor -6.457705e-04  3.340577e-05
3  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...                    SVR -8.561289e-04  4.623989e-05
4  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...       LinearRegression -7.675846e-03  1.384563e-02
5  92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...  DecisionTreeRegressor -8.042192e-04  5.273865e-05
6  92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...  RandomForestRegressor -5.910289e-04  3.373143e-05
7  92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...           XGBRegressor -5.538508e-04  2.729725e-05
8  92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...                    SVR -8.561289e-04  4.623989e-05
9  92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d...       LinearRegression -1.082012e+15  2.148065e+15

The plots likewise show no big differences; the target mean is basically flat across the time values.

2, convert the time features to int in 10-minute units
                                                name                  model          mean           std
0  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...  DecisionTreeRegressor -9.235728e-04  6.219508e-05
1  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...  RandomForestRegressor -6.644564e-04  5.259317e-05
2  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...           XGBRegressor -6.250570e-04  2.938126e-05
3  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...                    SVR -8.561289e-04  4.623989e-05
4  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...       LinearRegression -7.507477e-04  8.794647e-05
5  177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...  DecisionTreeRegressor -7.938051e-04  3.984663e-05
6  177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...  RandomForestRegressor -5.772725e-04  3.244304e-05
7  177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...           XGBRegressor -5.519323e-04  3.310014e-05
8  177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...                    SVR -8.561289e-04  4.623989e-05
9  177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...       LinearRegression -3.866811e+15  7.706426e+15

3, set cannot_dummy_columns to empty, i.e. dummy-encode all float features
                                                name                  model          mean           std
0  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...  DecisionTreeRegressor -8.925815e-04  3.946585e-05
1  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...  RandomForestRegressor -6.580392e-04  5.113999e-05
2  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...           XGBRegressor -6.250570e-04  2.938126e-05
3  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...                    SVR -8.561289e-04  4.623989e-05
4  21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_...       LinearRegression -7.507477e-04  8.794647e-05
5  174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...  DecisionTreeRegressor -8.027813e-04  7.862450e-05
6  174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...  RandomForestRegressor -5.781245e-04  2.876818e-05
7  174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...           XGBRegressor -5.493320e-04  3.110661e-05
8  174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...                    SVR -8.561289e-04  4.623989e-05
9  174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm...       LinearRegression -7.316864e+15  1.463373e+16

Feature 06_two-strategy comparison: ref01

ref01 strategy:
1, drop single-category columns: ['B3', 'B13', 'A13', 'A18', 'A23']
2, drop columns with a missing rate over 90%: A1, A2, A3, A4, B2
3, train: keep only rows with target > 0.87
4, all_data.fillna(-1)
5, convert time categories to float (/3600): ['A5', 'A7', 'A9', 'A11', 'A14', 'A16', 'A24', 'A26', 'B5', 'B7']
6, compute time differences (/3600): ['A20', 'A28', 'B4', 'B9', 'B10', 'B11']
7, convert all features to int form
8, data sample (样本id and target get dropped):
train.head(5)
          样本id  A5  A6  A7  A8  A9  A10  A11  A12  A14  A15  A16  A17  A19  A20  A21  A22  A24  A25  A26  A27  A28  B1  B4  B5  B6  B7  B8  B9  B10  B11  B12  B14  target
0  sample_1528   0   0   0   0   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0   0   0   0   0   0   0   0    0    0    0    0   0.879
1  sample_1698   1   1   0   0   1    1    1    1    1    1    1    1    1    1    0    0    1    1    1    1    1   1   0   1   1   1   0   0    0    1    1    0   0.902
2   sample_639   1   1   0   0   1    2    1    1    1    1    1    1    1    0    0    0    1    2    1    1    1   1   0   1   1   2   0   0    0    1    1    0   0.936
3   sample_483   2   0   0   0   2    0    2    0    2    0    2    0    1    0    0    1    2    3    2    2    1   2   0   2   0   3   0   0    0    0    0    0   0.902
4   sample_617   3   1   0   0   3    1    3    1    3    1    3    1    1    1    0    0    3    1    3    1    1   1   0   3   1   4   0   0    0    1    1    1   0.983
9, evaluation:
from sklearn.model_selection import cross_val_score

for model in [DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor(), SVR(), LinearRegression()]:
    # newer sklearn versions spell this scorer 'neg_mean_squared_error'
    score = cross_val_score(model, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
    print(model.__class__.__name__ + ' ' + str(score.mean()) + ' ' + str(score.std()))

DecisionTreeRegressor   -0.0003171201297343733 3.808162841960937e-05
RandomForestRegressor   -0.000224005974881151 2.2218174576281365e-05
XGBRegressor            -0.00021295863145713904 2.0333409455864207e-05
SVR                     -0.0009419628520902001 4.666016023199918e-05
LinearRegression        -0.0006419399676652643 4.0137774687448616e-05

10, bin the target into 5 buckets, dummy-encoded as 5 columns
11, for each categorical feature (the data columns above):
for f1 in categorical_columns:
    for f2 in target_bucket_columns:       # the 5 dummy columns from step 10
        order_label = train.groupby([f1])[f2].mean()
        col_name = f1 + '_' + f2 + '_mean'   # yields names like 'B12_intTarget_0.0_mean'
        for df in [train, test]:
            df[col_name] = df[f].map(order_label)  # the logic error is right here (f vs f1)
This is odd code.
Oddity 1: df[f].map is a slip, it should be f1; we can fix that later.
Oddity 2: the original author reports the score dropped after fixing it; we'll judge later by how large the gap is.

12, drop the 5 bucket columns (they existed only for these statistics and are then discarded)
13, final features:
train.columns
Index(['A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A14', 'A15',
       ...
       'B12_intTarget_0.0_mean', 'B12_intTarget_1.0_mean', 'B12_intTarget_2.0_mean', 'B12_intTarget_3.0_mean', 'B12_intTarget_4.0_mean', 'B14_intTarget_0.0_mean', 'B14_intTarget_1.0_mean', 'B14_intTarget_2.0_mean', 'B14_intTarget_3.0_mean', 'B14_intTarget_4.0_mean'], dtype='object', length=192)
train.head(5)
   A5  A6  A7  A8  A9  A10  A11  A12  A14  A15  A16  A17  A19  A20  A21  A22  A24  A25  A26  A27  A28  B1  B4  B5  B6  B7  B8  B9  B10  B11  B12  B14  A5_intTarget_0.0_mean  A5_intTarget_1.0_mean  A5_intTarget_2.0_mean  A5_intTarget_3.0_mean  A5_intTarget_4.0_mean  A6_intTarget_0.0_mean  A6_intTarget_1.0_mean  A6_intTarget_2.0_mean  A6_intTarget_3.0_mean  A6_intTarget_4.0_mean  A7_intTarget_0.0_mean  A7_intTarget_1.0_mean  A7_intTarget_2.0_mean  A7_intTarget_3.0_mean  A7_intTarget_4.0_mean  A8_intTarget_0.0_mean  A8_intTarget_1.0_mean  A8_intTarget_2.0_mean  A8_intTarget_3.0_mean  A8_intTarget_4.0_mean  A9_intTarget_0.0_mean  A9_intTarget_1.0_mean  A9_intTarget_2.0_mean  A9_intTarget_3.0_mean  A9_intTarget_4.0_mean  A10_intTarget_0.0_mean  A10_intTarget_1.0_mean  A10_intTarget_2.0_mean  A10_intTarget_3.0_mean  A10_intTarget_4.0_mean  A11_intTarget_0.0_mean  A11_intTarget_1.0_mean  A11_intTarget_2.0_mean  A11_intTarget_3.0_mean  A11_intTarget_4.0_mean  A12_intTarget_0.0_mean  A12_intTarget_1.0_mean  \
0   0   0   0   0   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0   0   0   0   0   0   0   0    0    0    0    0               0.285714               0.285714               0.428571               0.000000               0.000000               0.157895               0.210526               0.438596               0.105263               0.070175               0.161526               0.284903               0.396916               0.090097               0.056818               0.161526               0.284903               0.396916               0.090097               0.056818               0.058824               0.470588               0.235294               0.117647               0.117647                0.177743                0.272025                0.387944                0.092736                0.052550                0.058824                0.470588                0.235294                0.117647                0.117647                0.177500                 0.26500
1   1   1   0   0   1    1    1    1    1    1    1    1    1    1    0    0    1    1    1    1    1   1   0   1   1   1   0   0    0    1    1    0               0.285714               0.285714               0.428571               0.000000               0.000000               0.157895               0.210526               0.438596               0.105263               0.070175               0.161526               0.284903               0.396916               0.090097               0.056818               0.161526               0.284903               0.396916               0.090097               0.056818               0.058824               0.470588               0.235294               0.117647               0.117647                0.177743                0.272025                0.387944                0.092736                0.052550                0.058824                0.470588                0.235294                0.117647                0.117647                0.177500                 0.26500
2   1   1   0   0   1    2    1    1    1    1    1    1    1    0    0    0    1    2    1    1    1   1   0   1   1   2   0   0    0    1    1    0               0.285714               0.285714               0.428571               0.000000               0.000000               0.157895               0.210526               0.438596               0.105263               0.070175               0.161526               0.284903               0.396916               0.090097               0.056818               0.161526               0.284903               0.396916               0.090097               0.056818               0.058824               0.470588               0.235294               0.117647               0.117647                0.177743                0.272025                0.387944                0.092736                0.052550                0.058824                0.470588                0.235294                0.117647                0.117647                0.177500                 0.26500
3   2   0   0   0   2    0    2    0    2    0    2    0    1    0    0    1    2    3    2    2    1   2   0   2   0   3   0   0    0    0    0    0               0.285714               0.285714               0.428571               0.000000               0.000000               0.157895               0.210526               0.438596               0.105263               0.070175               0.161526               0.284903               0.396916               0.090097               0.056818               0.161526               0.284903               0.396916               0.090097               0.056818               0.058824               0.470588               0.235294               0.117647               0.117647                0.177743                0.272025                0.387944                0.092736                0.052550                0.058824                0.470588                0.235294                0.117647                0.117647                0.177500                 0.26500
4   3   1   0   0   3    1    3    1    3    1    3    1    1    1    0    0    3    1    3    1    1   1   0   3   1   4   0   0    0    1    1    1               0.130769               0.307692               0.369231               0.096154               0.080769               0.160243               0.292089               0.383367               0.093306               0.064909               0.250000               0.250000               0.500000               0.000000               0.000000               0.169492               0.220339               0.432203               0.110169               0.059322               0.134146               0.300813               0.394309               0.085366               0.077236                0.107383                0.325503                0.402685                0.097315                0.067114                0.130612                0.306122                0.391837                0.085714                0.077551                0.141593                 0.29646
   A12_intTarget_2.0_mean  A12_intTarget_3.0_mean  A12_intTarget_4.0_mean  A14_intTarget_0.0_mean  A14_intTarget_1.0_mean  A14_intTarget_2.0_mean  A14_intTarget_3.0_mean  A14_intTarget_4.0_mean  A15_intTarget_0.0_mean  A15_intTarget_1.0_mean  A15_intTarget_2.0_mean  A15_intTarget_3.0_mean  A15_intTarget_4.0_mean  A16_intTarget_0.0_mean  A16_intTarget_1.0_mean  A16_intTarget_2.0_mean  A16_intTarget_3.0_mean  A16_intTarget_4.0_mean  A17_intTarget_0.0_mean  A17_intTarget_1.0_mean  A17_intTarget_2.0_mean  A17_intTarget_3.0_mean  A17_intTarget_4.0_mean  A19_intTarget_0.0_mean  A19_intTarget_1.0_mean  A19_intTarget_2.0_mean  A19_intTarget_3.0_mean  A19_intTarget_4.0_mean  A20_intTarget_0.0_mean  A20_intTarget_1.0_mean  A20_intTarget_2.0_mean  A20_intTarget_3.0_mean  A20_intTarget_4.0_mean  A21_intTarget_0.0_mean  A21_intTarget_1.0_mean  A21_intTarget_2.0_mean  A21_intTarget_3.0_mean  A21_intTarget_4.0_mean  A22_intTarget_0.0_mean  A22_intTarget_1.0_mean  A22_intTarget_2.0_mean  A22_intTarget_3.0_mean  \
0                 0.39250                0.095000                0.057500                0.055556                0.500000                0.222222                0.111111                0.111111                0.179402                0.275748                0.388704                0.079734                0.053156                0.058824                0.470588                0.235294                0.117647                0.117647                0.195402                0.258621                0.425287                0.074713                0.028736                0.147903                0.275938                0.412804                0.092715                0.055188                0.171815                0.281853                0.386100                0.090734                0.055985                0.158743                0.283642                0.398872                 0.09025                0.058824                0.153527                0.289627                0.400000                0.087967
1                 0.39250                0.095000                0.057500                0.055556                0.500000                0.222222                0.111111                0.111111                0.179402                0.275748                0.388704                0.079734                0.053156                0.058824                0.470588                0.235294                0.117647                0.117647                0.195402                0.258621                0.425287                0.074713                0.028736                0.147903                0.275938                0.412804                0.092715                0.055188                0.171815                0.281853                0.386100                0.090734                0.055985                0.158743                0.283642                0.398872                 0.09025                0.058824                0.153527                0.289627                0.400000                0.087967
2                 0.39250                0.095000                0.057500                0.055556                0.500000                0.222222                0.111111                0.111111                0.179402                0.275748                0.388704                0.079734                0.053156                0.058824                0.470588                0.235294                0.117647                0.117647                0.195402                0.258621                0.425287                0.074713                0.028736                0.147903                0.275938                0.412804                0.092715                0.055188                0.171815                0.281853                0.386100                0.090734                0.055985                0.158743                0.283642                0.398872                 0.09025                0.058824                0.153527                0.289627                0.400000                0.087967
3                 0.39250                0.095000                0.057500                0.055556                0.500000                0.222222                0.111111                0.111111                0.179402                0.275748                0.388704                0.079734                0.053156                0.058824                0.470588                0.235294                0.117647                0.117647                0.195402                0.258621                0.425287                0.074713                0.028736                0.147903                0.275938                0.412804                0.092715                0.055188                0.171815                0.281853                0.386100                0.090734                0.055985                0.158743                0.283642                0.398872                 0.09025                0.058824                0.153527                0.289627                0.400000                0.087967
4                 0.40118                0.082596                0.067847                0.130081                0.304878                0.394309                0.085366                0.077236                0.164978                0.280753                0.390738                0.088278                0.068017                0.130081                0.304878                0.394309                0.085366                0.077236                0.160294                0.302941                0.388235                0.080882                0.060294                0.160535                0.283166                0.399108                0.088071                0.061315                0.154676                0.281775                0.401679                0.091127                0.062350                0.000000                1.000000                0.000000                 0.00000                0.000000                0.211765                0.217647                0.388235                0.105882
   A22_intTarget_4.0_mean  A24_intTarget_0.0_mean  A24_intTarget_1.0_mean  A24_intTarget_2.0_mean  A24_intTarget_3.0_mean  A24_intTarget_4.0_mean  A25_intTarget_0.0_mean  A25_intTarget_1.0_mean  A25_intTarget_2.0_mean  A25_intTarget_3.0_mean  A25_intTarget_4.0_mean  A26_intTarget_0.0_mean  A26_intTarget_1.0_mean  A26_intTarget_2.0_mean  A26_intTarget_3.0_mean  A26_intTarget_4.0_mean  A27_intTarget_0.0_mean  A27_intTarget_1.0_mean  A27_intTarget_2.0_mean  A27_intTarget_3.0_mean  A27_intTarget_4.0_mean  A28_intTarget_0.0_mean  A28_intTarget_1.0_mean  A28_intTarget_2.0_mean  A28_intTarget_3.0_mean  A28_intTarget_4.0_mean  B1_intTarget_0.0_mean  B1_intTarget_1.0_mean  B1_intTarget_2.0_mean  B1_intTarget_3.0_mean  B1_intTarget_4.0_mean  B4_intTarget_0.0_mean  B4_intTarget_1.0_mean  B4_intTarget_2.0_mean  B4_intTarget_3.0_mean  B4_intTarget_4.0_mean  B5_intTarget_0.0_mean  B5_intTarget_1.0_mean  B5_intTarget_2.0_mean  B5_intTarget_3.0_mean  B5_intTarget_4.0_mean  B6_intTarget_0.0_mean  B6_intTarget_1.0_mean  \
0                0.058921                0.090909                0.454545                0.363636                0.090909                0.000000                0.130435                0.260870                0.434783                0.000000                0.130435                0.090909                0.454545                0.363636                0.090909                0.000000                0.086957                0.304348                0.413043                0.043478                0.130435                0.153846                0.246154                0.410256                0.097436                0.071795               0.133929               0.339286               0.375000               0.080357               0.062500               0.158046               0.286398               0.396552               0.090038               0.059387               0.117647               0.176471               0.529412               0.176471               0.000000               0.180272               0.238095
1                0.058921                0.090909                0.454545                0.363636                0.090909                0.000000                0.130435                0.260870                0.434783                0.000000                0.130435                0.090909                0.454545                0.363636                0.090909                0.000000                0.086957                0.304348                0.413043                0.043478                0.130435                0.153846                0.246154                0.410256                0.097436                0.071795               0.133929               0.339286               0.375000               0.080357               0.062500               0.158046               0.286398               0.396552               0.090038               0.059387               0.117647               0.176471               0.529412               0.176471               0.000000               0.180272               0.238095
2                0.058921                0.090909                0.454545                0.363636                0.090909                0.000000                0.130435                0.260870                0.434783                0.000000                0.130435                0.090909                0.454545                0.363636                0.090909                0.000000                0.086957                0.304348                0.413043                0.043478                0.130435                0.153846                0.246154                0.410256                0.097436                0.071795               0.133929               0.339286               0.375000               0.080357               0.062500               0.158046               0.286398               0.396552               0.090038               0.059387               0.117647               0.176471               0.529412               0.176471               0.000000               0.180272               0.238095
3                0.058921                0.090909                0.454545                0.363636                0.090909                0.000000                0.130435                0.260870                0.434783                0.000000                0.130435                0.090909                0.454545                0.363636                0.090909                0.000000                0.086957                0.304348                0.413043                0.043478                0.130435                0.153846                0.246154                0.410256                0.097436                0.071795               0.133929               0.339286               0.375000               0.080357               0.062500               0.158046               0.286398               0.396552               0.090038               0.059387               0.117647               0.176471               0.529412               0.176471               0.000000               0.180272               0.238095
4                0.058824                0.127572                0.308642                0.382716                0.090535                0.082305                0.159851                0.284387                0.390335                0.100372                0.055762                0.129032                0.318548                0.383065                0.084677                0.076613                0.151515                0.271132                0.408293                0.095694                0.066986                0.161996                0.282837                0.401051                0.087566                0.057793               0.152203               0.292390               0.397864               0.089453               0.061415               0.196581               0.188034               0.410256               0.102564               0.085470               0.120930               0.316279               0.376744               0.093023               0.083721               0.154088               0.273585
   B6_intTarget_2.0_mean  B6_intTarget_3.0_mean  B6_intTarget_4.0_mean  B7_intTarget_0.0_mean  B7_intTarget_1.0_mean  B7_intTarget_2.0_mean  B7_intTarget_3.0_mean  B7_intTarget_4.0_mean  B8_intTarget_0.0_mean  B8_intTarget_1.0_mean  B8_intTarget_2.0_mean  B8_intTarget_3.0_mean  B8_intTarget_4.0_mean  B9_intTarget_0.0_mean  B9_intTarget_1.0_mean  B9_intTarget_2.0_mean  B9_intTarget_3.0_mean  B9_intTarget_4.0_mean  B10_intTarget_0.0_mean  B10_intTarget_1.0_mean  B10_intTarget_2.0_mean  B10_intTarget_3.0_mean  B10_intTarget_4.0_mean  B11_intTarget_0.0_mean  B11_intTarget_1.0_mean  B11_intTarget_2.0_mean  B11_intTarget_3.0_mean  B11_intTarget_4.0_mean  B12_intTarget_0.0_mean  B12_intTarget_1.0_mean  B12_intTarget_2.0_mean  B12_intTarget_3.0_mean  B12_intTarget_4.0_mean  B14_intTarget_0.0_mean  B14_intTarget_1.0_mean  B14_intTarget_2.0_mean  B14_intTarget_3.0_mean  B14_intTarget_4.0_mean
0               0.394558               0.102041               0.071429               0.111111               0.333333               0.444444               0.111111               0.000000               0.151206               0.286642               0.397032               0.089981               0.064007               0.167331               0.285857               0.392430               0.085657               0.060757                 0.15762                0.286013                0.395616                0.087683                0.063674                0.171463                0.264988                0.411271                0.087530                0.049161                0.179443                0.270035                0.397213                0.090592                0.047038                0.171196                0.305707                0.366848                0.088315                0.058424
1               0.394558               0.102041               0.071429               0.111111               0.333333               0.444444               0.111111               0.000000               0.151206               0.286642               0.397032               0.089981               0.064007               0.167331               0.285857               0.392430               0.085657               0.060757                 0.15762                0.286013                0.395616                0.087683                0.063674                0.171463                0.264988                0.411271                0.087530                0.049161                0.179443                0.270035                0.397213                0.090592                0.047038                0.171196                0.305707                0.366848                0.088315                0.058424
2               0.394558               0.102041               0.071429               0.111111               0.333333               0.444444               0.111111               0.000000               0.151206               0.286642               0.397032               0.089981               0.064007               0.167331               0.285857               0.392430               0.085657               0.060757                 0.15762                0.286013                0.395616                0.087683                0.063674                0.171463                0.264988                0.411271                0.087530                0.049161                0.179443                0.270035                0.397213                0.090592                0.047038                0.171196                0.305707                0.366848                0.088315                0.058424
3               0.394558               0.102041               0.071429               0.111111               0.333333               0.444444               0.111111               0.000000               0.151206               0.286642               0.397032               0.089981               0.064007               0.167331               0.285857               0.392430               0.085657               0.060757                 0.15762                0.286013                0.395616                0.087683                0.063674                0.171463                0.264988                0.411271                0.087530                0.049161                0.179443                0.270035                0.397213                0.090592                0.047038                0.171196                0.305707                0.366848                0.088315                0.058424
4               0.410377               0.091195               0.064465               0.117647               0.176471               0.411765               0.117647               0.176471               0.232558               0.255814               0.325581               0.139535               0.046512               0.000000               0.333333               0.666667               0.000000               0.000000                 0.16318                0.255230                0.447699                0.071130                0.046025                0.145349                0.312016                0.374031                0.098837                0.065891                0.150065                0.289780                0.401035                0.089263                0.063389                0.148936                0.240122                0.431611                0.106383                0.063830
14, evaluation results
He can run lgb directly, but cross_val_score fails for me (the native lightgbm API has train() rather than fit()).
At this point X_train/y_train also can't be fed straight into sklearn models (type errors),
and cross_val_score can't wrap native lgb either (no fit method).
Cleaning the bad rows:
# build the row mask once, then filter X and y together
mask = np.logical_not(np.any(np.isnan(X_train), axis=1))
X_train, y_train = X_train[mask], y_train[mask]

DecisionTreeRegressor   -0.00029905497736250873 3.00256432609409e-05
RandomForestRegressor   -0.00020760366902823285 2.6380383953972225e-05
XGBRegressor            -0.0001981767607924398 2.808226334622625e-05
SVR                     -0.0008958782405495207 5.1208639280237106e-05
LinearRegression        -48030001685423.28 96060003370846.56
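
As an aside: lightgbm's sklearn-compatible wrapper does expose fit(), so it works with cross_val_score, unlike the native train() API complained about above (a sketch, not what either of us actually ran):
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score

# the wrapper behaves like any other sklearn regressor
score = cross_val_score(LGBMRegressor(), X_train, y_train,
                        scoring='neg_mean_squared_error', cv=5)
print('LGBMRegressor', score.mean(), score.std())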


Results after fixing the bugs in the reference
bug 1: in step 11,
df[col_name] = df[f].map(order_label)   # the logic error
becomes: df[col_name] = df[f1].map(order_label)
Step 14 results after the fix:
DecisionTreeRegressor   -0.0003388445203694478 2.5746253315096493e-05
RandomForestRegressor   -0.00022420747637809046 2.0379324751299676e-05
XGBRegressor            -0.00018904401979792253 2.739032634668878e-05
SVR                     -0.0009419628520902001 4.666016023199918e-05
LinearRegression        -2.9286245046440532e+16 5.85713898888668e+16
As we can see, DecisionTreeRegressor, SVR, etc. all got worse; only XGBRegressor improved slightly.

bug 2 (on top of the bug 1 fix):
step 1 deletes the single-category columns
['B3', 'B13', 'A13', 'A18', 'A23']
A23    float   1393      0.998             4                 [(5.0, 1391), (10.0, 1), (4.0, 1)]     5.002872    0.136638    4.000    5.000     5.000     5.000    10.0000
A18    float   1396      1.000             2                            [(0.2, 1395), (0.1, 1)]     0.199928    0.002676    0.100    0.200     0.200     0.200     0.2000
A13    float   1396      1.000             3                [(0.2, 1394), (0.15, 1), (0.12, 1)]     0.199907    0.002524    0.120    0.200     0.200     0.200     0.2000
B13    float   1395      0.999             4               [(0.15, 1388), (0.03, 6), (0.06, 1)]     0.149419    0.008213    0.030    0.150     0.150     0.150     0.1500
B3     float   1394      0.999             3                            [(3.5, 1393), (3.6, 1)]     3.500072    0.002678    3.500    3.500     3.500     3.500     3.6000


Step 2 was described as deleting features with 90% missing, but what it actually deleted were more near-single-valued features (most values concentrated on one value):
         type  count count_rate  unique_count                                         unique_set         mean         std      min      25%       50%       75%        max
B2     float   1394      0.999             4                [(3.5, 1374), (0.15, 19), (3.6, 1)]     3.454412    0.388585    0.150    3.500     3.500     3.500     3.6000
A3     float   1354      0.970             4           [(405.0, 1336), (270.0, 12), (340.0, 6)]   403.515510   13.348093  270.000  405.000   405.000   405.000   405.0000
A4       int   1396      1.000             4      [(700, 1336), (980, 42), (470, 12), (590, 6)]   705.974212   53.214754  470.000  700.000   700.000   700.000   980.0000
A2     float     42      0.030             2                                      [(125.0, 42)]   125.000000    0.000000  125.000  125.000   125.000   125.000   125.0000
A1       int   1396      1.000             3                 [(300, 1377), (200, 13), (250, 6)]   298.853868   10.130552  200.000  300.000   300.000   300.000   300.0000

So this step deletes the same kind of columns as step 1; by that logic it should also have deleted:
['A2', 'A7', 'A8', 'B11']
(my own drop list: ['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23', 'B3', 'B13'])

Compare the remaining three, 'A7', 'A8', 'B11':
B11   object    547      0.392            38  [(20:00-21:00, 154), (12:00-13:00, 140), (4:00...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A7    object    149      0.107            76  [(12:40:00, 13), (15:40:00, 7), (7:00:00, 5), ...          NaN         NaN      NaN      NaN       NaN       NaN        NaN
A8     float    149      0.107             9  [(80.0, 118), (73.0, 16), (74.0, 8), (82.0, 3)...    78.818792    2.683920   70.000   80.000    80.000    80.000    82.0000

I also think these should be deleted.
Step 14 results after this change:
DecisionTreeRegressor   -0.0003162703674004905 3.201639535961476e-05
RandomForestRegressor   -0.0002238364737549146 1.9140884255549184e-05
XGBRegressor            -0.00019055478967780663 2.6732679575302225e-05
SVR                     -0.0009419628520902001 4.666016023199918e-05
LinearRegression        -8.346887378848443e+16 1.6606005966762346e+17

Overall: essentially no impact.

Feature 07_two-strategy comparison: self

Base strategy 01 (the ref01 bug2 version)
1, drop features with valid-value rate < 90%: ['A2', 'A7', 'A8', 'B10', 'B11']
2, drop single-valued features: ['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23', 'B3', 'B13']
The above mirrors ref01's steps 1-2;
difference: B10 is additionally dropped.
3, fill missing values with the column mode: ['A21', 'A23', 'A24', 'A26', 'A3', 'B1', 'B2', 'B3', 'B12', 'B13', 'B5', 'B8']
4, treat rows with target < 0.87 as outliers
The above mirrors ref01's steps 3-4;
difference: missing values are filled with the mode rather than -1.
5, special handling of the dirty data in A25
6, one-hot encode: ['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12']
7, clean the dirty data in each object feature
8,
time_features = ['B7', 'B5', 'A9', 'A5', 'A26', 'A24', 'A16', 'A14', 'A11']
period_features = ['B9', 'B4', 'A28', 'A20']
time features: category converted to int (10-minute units)
time differences: computed in the same 10-minute units
time_features: 1 -> 1
period_features: 1 -> 3 (start, end, duration; see the sketch below)

The above corresponds to ref01's steps 5-7;
differences: ref01 turns each period feature into a single time-difference feature, here each becomes 3 features.
ref01 also processes B10, but here B10 was dropped.
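
A minimal sketch of the 1 -> 3 expansion, assuming period strings like '11:00-12:00' with missing values already filled (the *_factor_sh/eh/pd suffixes match the feature list below; the midnight wrap-around is my assumption):
import pandas as pd

def expand_period(col: pd.Series, prefix: str) -> pd.DataFrame:
    def to_units(t):
        h, m = t.split(':')[:2]
        return int(h) * 6 + int(m) // 10             # 10-minute units
    start = col.str.split('-').str[0].map(to_units)
    end = col.str.split('-').str[1].map(to_units)
    return pd.DataFrame({
        prefix + '_factor_sh': start,
        prefix + '_factor_eh': end,
        prefix + '_factor_pd': (end - start) % (24 * 6),   # duration, wraps past midnight
    })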

Results:
feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'B7_factor_hh', 'B5_factor_hh', 'A9_factor_hh', 'A5_factor_hh', 'A26_factor_hh', 'A24_factor_hh', 'A16_factor_hh', 'A14_factor_hh', 'A11_factor_hh', 'B9_factor_sh', 'B9_factor_eh', 'B9_factor_pd', 'B4_factor_sh', 'B4_factor_eh', 'B4_factor_pd', 'A28_factor_sh', 'A28_factor_eh', 'A28_factor_pd', 'A20_factor_sh', 'A20_factor_eh', 'A20_factor_pd']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
                                                name                  model      mean       std
0  31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000809  0.000056
1  31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000585  0.000030
2  31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000543  0.000023
3  31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000943  0.000045
4  31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000641  0.000079



########## Closing the gap
1, period_features: change the transformation to 1 -> 1
Results:
feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'B7_factor_hh', 'B5_factor_hh', 'A9_factor_hh', 'A5_factor_hh', 'A26_factor_hh', 'A24_factor_hh', 'A16_factor_hh', 'A14_factor_hh', 'A11_factor_hh', 'B9_factor_pd', 'B4_factor_pd', 'A28_factor_pd', 'A20_factor_pd']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
                                                name                  model      mean       std
0  23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000800  0.000042
1  23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000572  0.000017
2  23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000537  0.000031
3  23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000943  0.000045
4  23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000649  0.000088



2, keep B10, filling missing values with '00:00-00:00' (one-liner below)
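
A one-line sketch, assuming the combined frame is named all_data (the name is my placeholder):
# a missing B10 period becomes a zero-length interval
all_data['B10'] = all_data['B10'].fillna('00:00-00:00')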

feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'B7_factor_hh', 'B5_factor_hh', 'A9_factor_hh', 'A5_factor_hh', 'A26_factor_hh', 'A24_factor_hh', 'A16_factor_hh', 'A14_factor_hh', 'A11_factor_hh', 'B9_factor_pd', 'B4_factor_pd', 'A28_factor_pd', 'A20_factor_pd', 'B10_factor_pd']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
                                                name                  model      mean       std
0  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000787  0.000053
1  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000563  0.000034
2  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000537  0.000029
3  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000943  0.000045
4  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000640  0.000083

The results got slightly better.

Apart from these 2 points, my program should now be logically identical to ref01.
Yet ref01 reaches 0.0002 by step 11, while I'm still at 0.0005.

Remaining details: ref01 keeps times as floats in h.m form, mine are integers.
1, null handling: fillna(-1), later encoded as a label of its own
2, encoding: mine is a 10-minute int, ref01 uses label encoding

Reworking along these 2 details (point 1 is skipped: nulls are so rare they are basically negligible).
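
A sketch of the label-encoding detail, assuming sklearn's LabelEncoder over the stringified time columns (time_cols is a placeholder for the time/period feature names):
from sklearn.preprocessing import LabelEncoder

for col in time_cols:
    all_data[col] = LabelEncoder().fit_transform(all_data[col].astype(str))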

feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'B7_factor_hh', 'B5_factor_hh', 'A9_factor_hh', 'A5_factor_hh', 'A26_factor_hh', 'A24_factor_hh', 'A16_factor_hh', 'A14_factor_hh', 'A11_factor_hh', 'B9_factor_pd', 'B4_factor_pd', 'A28_factor_pd', 'A20_factor_pd', 'B10_factor_pd']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
                                                name                  model      mean       std
0  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000785  0.000035
1  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000582  0.000032
2  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000537  0.000029
3  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000943  0.000045
4  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000611  0.000052

Hmm, still no change. Let's just run these two feature groups through ref01's own processing functions.

feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'A5', 'A9', 'A14', 'A16', 'A11', 'A24', 'A26', 'B5', 'B7', 'A20', 'A28', 'B4', 'B9', 'B10']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
                                                name                  model      mean       std
0  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000838  0.000029
1  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000585  0.000031
2  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000565  0.000045
3  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000943  0.000045
4  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000667  0.000056

Still no change.
So work backwards: at which step does the difference appear?
Next, align the fillna approach and the one-hot encoding approach.
All the details now look identical, yet the score still does not move:
                                                name                  model      mean       std
0  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000827  0.000066
1  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000587  0.000025
2  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000562  0.000043
3  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000942  0.000047
4  24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000668  0.000055

Found the cause: the earlier feature set was missing these few columns; after adding them back, the score reaches the 0.0002 level.
feature_columns.extend(['A10','A19','A25','B14','B6'])
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
Out[2]:
                                                name                  model      mean       std
0  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000341  0.000047
1  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000228  0.000032
2  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000217  0.000024
3  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000942  0.000047
4  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000645  0.000042

These 5 features are all int-typed, which is why they had been left unprocessed.
Rolling their processing back to my own code:
Out[4]:
                                                name                  model      mean       std
0  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000285  0.000033
1  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000200  0.000023
2  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000206  0.000021
3  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000942  0.000047
4  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000542  0.000092

Dropping the float-to-one-hot conversion:
Out[2]:
                                                name                  model      mean       std
0  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000280  0.000025
1  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000206  0.000023
2  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000205  0.000021
3  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000942  0.000047
4  29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000544  0.000087


Restoring the 1-to-3 expansion of the time-period features (the sh/eh/pd columns) and dropping the encoding of the pd features:
Out[2]:
                                                name                  model      mean       std
0  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000280  0.000040
1  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000205  0.000019
2  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000204  0.000017
3  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000942  0.000047
4  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000550  0.000108

Dropping the one-hot operation on the single raw features: Out[2]:
                                                name                  model      mean       std
0  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000282  0.000046
1  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000207  0.000023
2  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000204  0.000017
3  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000942  0.000047
4  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000551  0.000109


RFE feature selection with XGB:
list(zip(feature_columns,rfecv.support_))
Out[18]:
[('B7_factor_hh', True),
 ('B5_factor_hh', False),
 ('A9_factor_hh', True),
 ('A5_factor_hh', True),
 ('A26_factor_hh', False),
 ('A24_factor_hh', True),
 ('A16_factor_hh', True),
 ('A14_factor_hh', False),
 ('A11_factor_hh', False),
 ('B9_factor_sh', False),
 ('B9_factor_eh', True),
 ('B9_factor_pd', True),
 ('B4_factor_sh', False),
 ('B4_factor_eh', False),
 ('B4_factor_pd', True),
 ('A28_factor_sh', False),
 ('A28_factor_eh', True),
 ('A28_factor_pd', False),
 ('A20_factor_sh', False),
 ('A20_factor_eh', False),
 ('A20_factor_pd', False),
 ('B10_factor_sh', True),
 ('B10_factor_eh', False),
 ('B10_factor_pd', False),
 ('A10', True),
 ('A19', False),
 ('A25', True),
 ('B14', True),
 ('B6', True)]
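The retained subset can then be read straight off the support mask, e.g.:

# Keep only the columns RFECV marked as supported (rfecv is a fitted
# sklearn.feature_selection.RFECV object, as in the submission-01 snippet below).
selected_columns = [col for col, keep in zip(feature_columns, rfecv.support_) if keep]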

特征08_Reorganizing a clean baseline version

Initial score:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns])
Out[2]:
                                                name                  model      mean       std
0  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  DecisionTreeRegressor -0.000278  0.000043
1  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...  RandomForestRegressor -0.000207  0.000023
2  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...           XGBRegressor -0.000204  0.000017
3  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...                    SVR -0.000942  0.000047
4  39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A...       LinearRegression -0.000551  0.000109

Changes:
1. Fold the int-type handling into the main pipeline.
2. Standardize the usage of feature_org and feature_handle.
Score:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1...  DecisionTreeRegressor -0.000285  0.000047
1  39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1...  RandomForestRegressor -0.000204  0.000030
2  39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1...           XGBRegressor -0.000199  0.000021
3  39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1...                    SVR -0.000942  0.000047
4  39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1...       LinearRegression -0.000551  0.000109


Test 01: keep single-valued columns instead of dropping them, filling NaNs with mode():
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[4]:
                                                name                  model      mean       std
0  57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'...  DecisionTreeRegressor -0.000287  0.000032
1  57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'...  RandomForestRegressor -0.000201  0.000029
2  57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'...           XGBRegressor -0.000200  0.000021
3  57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'...                    SVR -0.000942  0.000047
4  57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'...       LinearRegression -0.002282  0.003443

The scores drop (LR especially), so dropping single-valued columns is the better choice.
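What test 01 amounts to, as a sketch (single_value_columns is my hypothetical list of the near-constant columns such as A1, A13, A18):

for col in single_value_columns:
    # Fill NaNs with the most frequent value instead of dropping the column.
    train_data[col] = train_data[col].fillna(train_data[col].mode()[0])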

Test 02: dummy-encode the int features with few distinct values.
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...  DecisionTreeRegressor -0.000282  0.000034
1  49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...  RandomForestRegressor -0.000206  0.000024
2  49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...           XGBRegressor -0.000196  0.000022
3  49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...                    SVR -0.000942  0.000047
4  49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...       LinearRegression -0.000546  0.000111

Dummy-encoding all int features:
Out[2]:
                                                name                  model      mean       std
0  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  DecisionTreeRegressor -0.000295  0.000026
1  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  RandomForestRegressor -0.000205  0.000013
2  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...           XGBRegressor -0.000190  0.000020
3  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...                    SVR -0.000942  0.000047
4  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...       LinearRegression -0.000244  0.000026
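The A10_dummy_100-style column names match pandas' default prefix behaviour; a sketch of the conversion (assuming this is how the dummies were built):

import pandas as pd

# One indicator column per distinct int value, e.g. A10 in
# {100, 101, 102, 103} -> A10_dummy_100 ... A10_dummy_103.
dummies = pd.get_dummies(all_data['A10'], prefix='A10_dummy')
all_data = all_data.join(dummies)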

Changing the int fillna value to -1: not suitable.
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  DecisionTreeRegressor -0.000285  0.000025
1  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  RandomForestRegressor -0.000209  0.000017
2  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...           XGBRegressor -0.000190  0.000020
3  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...                    SVR -0.000942  0.000047
4  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...       LinearRegression -0.000244  0.000026


Test 03: floats with <10 unique values get dummies; those with >10 keep their original form.
Out[2]:
                                                name                  model          mean           std
0  132['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  DecisionTreeRegressor -2.909480e-04  3.179915e-05
1  132['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  RandomForestRegressor -2.110361e-04  1.625678e-05
2  132['A10_dummy_100', 'A10_dummy_101', 'A10_dum...           XGBRegressor -1.904408e-04  2.114660e-05
3  132['A10_dummy_100', 'A10_dummy_101', 'A10_dum...                    SVR -9.419629e-04  4.666016e-05
4  132['A10_dummy_100', 'A10_dummy_101', 'A10_dum...       LinearRegression -1.537115e+15  2.166613e+15
Conclusion: scores degrade (LR blows up).


Dropping them all (leaving the float features out entirely):
Out[2]:
                                                name                  model          mean           std
0  107['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  DecisionTreeRegressor -2.905839e-04  2.038019e-05
1  107['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  RandomForestRegressor -2.293711e-04  1.696688e-05
2  107['A10_dummy_100', 'A10_dummy_101', 'A10_dum...           XGBRegressor -2.014942e-04  2.127065e-05
3  107['A10_dummy_100', 'A10_dummy_101', 'A10_dum...                    SVR -9.419629e-04  4.666016e-05
4  107['A10_dummy_100', 'A10_dummy_101', 'A10_dum...       LinearRegression -1.587048e+13  2.039310e+13

Restoring the default handling: use the floats as-is.
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  DecisionTreeRegressor -0.000291  0.000036
1  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  RandomForestRegressor -0.000210  0.000024
2  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...           XGBRegressor -0.000190  0.000020
3  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...                    SVR -0.000942  0.000047
4  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...       LinearRegression -0.000244  0.000026

特征09_Special-feature handling

######  B14,clip(380,max)
Approach: analyze up front; if the clipped version looks fine, replace the original B14 feature with it.

FeatureTools.get_column_corr(train_data,target_column,feature_columns=['B14','B14_factor_clip'])
INFO:root:column_corr:
                  type  count count_rate  unique_count                                         unique_set        mean        std      min      25%      50%      75%       max
rate             float   1381      1.000            64  [(0.902, 305), (0.93, 128), (0.890999999999999...    0.924277   0.028407    0.871    0.902    0.925    0.943    1.0008
B14                int   1381      1.000            19  [(400, 736), (420, 329), (440, 226), (460, 35)...  410.913831  25.222039   40.000  400.000  400.000  420.000  460.0000
B14_factor_clip    int   1381      1.000            12  [(400, 736), (420, 329), (440, 226), (460, 35)...  412.260681  17.557032  380.000  400.000  400.000  420.000  460.0000
Out[2]:
                  pearson  spearman      mine
rate             1.000000  1.000000  0.997135
B14              0.462216  0.667185  0.835802
B14_factor_clip  0.635610  0.667125  0.835802
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['B14'],['B14_factor_clip']])
Out[3]:
                   name                  model      mean       std
0              1['B14']  DecisionTreeRegressor -0.000364  0.000036
1              1['B14']  RandomForestRegressor -0.000364  0.000033
2              1['B14']           XGBRegressor -0.000364  0.000036
3              1['B14']                    SVR -0.000942  0.000047
4              1['B14']       LinearRegression -0.000648  0.000123
5  1['B14_factor_clip']  DecisionTreeRegressor -0.000369  0.000031
6  1['B14_factor_clip']  RandomForestRegressor -0.000369  0.000030
7  1['B14_factor_clip']           XGBRegressor -0.000369  0.000031
8  1['B14_factor_clip']                    SVR -0.000942  0.000047
9  1['B14_factor_clip']       LinearRegression -0.000481  0.000040


As shown, the new feature is better; use B14_factor_clip.
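A sketch of how I read the clip notation (plain pandas clip; reverse(n) below presumably denotes a bound taken from the other end of the value range):

# B14, clip(380, max): clip only the lower tail at 380.
train_data['B14_factor_clip'] = train_data['B14'].clip(lower=380)
# A19, clip(200, 400): clip both tails.
train_data['A19_factor_clip'] = train_data['A19'].clip(200, 400)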

###### B6,clip(reverse(9),max)
print(FeatureTools.get_column_corr(train_data,target_column,feature_columns=['B6','B6_factor_clip']))
print(FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['B6'],['B6_factor_clip']]))
                 pearson  spearman      mine
rate            1.000000  1.000000  0.997135
B6              0.375929  0.401743  0.466027
B6_factor_clip  0.404206  0.402282  0.466027
                  name                  model      mean       std
0              1['B6']  DecisionTreeRegressor -0.000670  0.000037
1              1['B6']  RandomForestRegressor -0.000671  0.000040
2              1['B6']           XGBRegressor -0.000666  0.000039
3              1['B6']                    SVR -0.000942  0.000047
4              1['B6']       LinearRegression -0.000694  0.000054
5  1['B6_factor_clip']  DecisionTreeRegressor -0.000674  0.000042
6  1['B6_factor_clip']  RandomForestRegressor -0.000673  0.000043
7  1['B6_factor_clip']           XGBRegressor -0.000673  0.000043
8  1['B6_factor_clip']                    SVR -0.000942  0.000047
9  1['B6_factor_clip']       LinearRegression -0.000676  0.000049

Conclusion: adopt it.

###### A19,clip(200,400)
print(FeatureTools.get_column_corr(train_data,target_column,feature_columns=['A19','A19_factor_clip']))
print(FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['A19'],['A19_factor_clip']]))

                  pearson  spearman      mine
rate             1.000000  1.000000  0.997135
A19             -0.220994 -0.250017  0.376382
A19_factor_clip -0.283522 -0.295258  0.376382
                   name                  model      mean       std
0              1['A19']  DecisionTreeRegressor -0.000718  0.000053
1              1['A19']  RandomForestRegressor -0.000717  0.000051
2              1['A19']           XGBRegressor -0.000717  0.000052
3              1['A19']                    SVR -0.000942  0.000047
4              1['A19']       LinearRegression -0.000769  0.000060
5  1['A19_factor_clip']  DecisionTreeRegressor -0.000743  0.000055
6  1['A19_factor_clip']  RandomForestRegressor -0.000743  0.000055
7  1['A19_factor_clip']           XGBRegressor -0.000743  0.000055
8  1['A19_factor_clip']                    SVR -0.000942  0.000047
9  1['A19_factor_clip']       LinearRegression -0.000743  0.000055

Conclusion: basically no impact on the tree models, and LR improves; adopt it.


###### A27,clip(reverse(13),reverse(19))
print(FeatureTools.get_column_corr(train_data,target_column,feature_columns=['A27','A27_factor_clip']))
print(FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['A27'],['A27_factor_clip']]))
                  pearson  spearman      mine
rate             1.000000  1.000000  0.997135
A27             -0.174947 -0.251648  0.355126
A27_factor_clip -0.215349 -0.251601  0.355126
                   name                  model      mean       std
0              1['A27']  DecisionTreeRegressor -0.000685  0.000046
1              1['A27']  RandomForestRegressor -0.000687  0.000045
2              1['A27']           XGBRegressor -0.000685  0.000046
3              1['A27']                    SVR -0.000942  0.000047
4              1['A27']       LinearRegression -0.000783  0.000063
5  1['A27_factor_clip']  DecisionTreeRegressor -0.000685  0.000046
6  1['A27_factor_clip']  RandomForestRegressor -0.000684  0.000047
7  1['A27_factor_clip']           XGBRegressor -0.000684  0.000046
8  1['A27_factor_clip']                    SVR -0.000942  0.000047
9  1['A27_factor_clip']       LinearRegression -0.000770  0.000060



Final result (end of section):
Out[2]:
                                                name                  model      mean       std
0  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...  DecisionTreeRegressor -0.000294  0.000025
1  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...  RandomForestRegressor -0.000209  0.000014
2  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...           XGBRegressor -0.000191  0.000019
3  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...                    SVR -0.000942  0.000047
4  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...       LinearRegression -0.000248  0.000023

Reference comparison: score with none of the replacements applied:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  DecisionTreeRegressor -0.000290  0.000030
1  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...  RandomForestRegressor -0.000215  0.000026
2  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...           XGBRegressor -0.000190  0.000020
3  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...                    SVR -0.000942  0.000047
4  117['A10_dummy_100', 'A10_dummy_101', 'A10_dum...       LinearRegression -0.000244  0.000026
The score difference looks minor, but the clipped set has noticeably fewer distinct feature values, so the replacements are worth keeping.


特征10_Target-rate binning (the "magic" features)

1. Standard handling: applied only to columns with unique count < 15 and non-object dtype (a sketch of the construction follows the scores below).
train_data = all_data.loc[train_ids]
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['B14'],mean_columns])
Out[35]:
                                                name                  model      mean       std
0                                           1['B14']  DecisionTreeRegressor -0.000369  0.000031
1                                           1['B14']  RandomForestRegressor -0.000369  0.000032
2                                           1['B14']           XGBRegressor -0.000369  0.000031
3                                           1['B14']                    SVR -0.000942  0.000047
4                                           1['B14']       LinearRegression -0.000481  0.000040
5  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...  DecisionTreeRegressor -0.000371  0.000031
6  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...  RandomForestRegressor -0.000369  0.000031
7  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...           XGBRegressor -0.000369  0.000031
8  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...                    SVR -0.000942  0.000047
9  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...       LinearRegression -0.000369  0.000033

LR clearly drops; the other models are barely affected.
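A sketch of how these columns are built, reconstructed from the generated names (B14_to_B14_intTarget_0_mean); the 5-bin cut is an assumption:

import pandas as pd

# Bin the target rate into integer classes and one-hot them.
train_data['intTarget'] = pd.cut(train_data[target_column], 5, labels=False)
indicators = pd.get_dummies(train_data['intTarget'], prefix='intTarget')
train_data = train_data.join(indicators)
# For each class indicator, map every B14 value to its per-value mean.
for cls in indicators.columns:  # e.g. 'intTarget_0'
    order_label = train_data.groupby('B14')[cls].mean()
    train_data['B14_to_B14_%s_mean' % cls] = train_data['B14'].map(order_label)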

Combined effect:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model          mean           std
0  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...  DecisionTreeRegressor -2.972214e-04  3.689761e-05
1  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...  RandomForestRegressor -2.131609e-04  1.459160e-05
2  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...           XGBRegressor -1.929303e-04  2.213503e-05
3  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...                    SVR -9.419629e-04  4.666016e-05
4  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...       LinearRegression -4.433353e+14  6.487895e+14

Reference comparison, before this processing:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...  DecisionTreeRegressor -0.000295  0.000030
1  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...  RandomForestRegressor -0.000219  0.000024
2  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...           XGBRegressor -0.000191  0.000019
3  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...                    SVR -0.000942  0.000047
4  69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm...       LinearRegression -0.000248  0.000023

2. Process every feature in train_data:
train_data = all_data.loc[train_ids]
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['B14'],mean_columns])
Out[9]:
                                                name                  model      mean       std
0                                           1['B14']  DecisionTreeRegressor -0.000369  0.000031
1                                           1['B14']  RandomForestRegressor -0.000370  0.000031
2                                           1['B14']           XGBRegressor -0.000369  0.000031
3                                           1['B14']                    SVR -0.000942  0.000047
4                                           1['B14']       LinearRegression -0.000481  0.000040
5  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...  DecisionTreeRegressor -0.000370  0.000031
6  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...  RandomForestRegressor -0.000369  0.000030
7  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...           XGBRegressor -0.000369  0.000031
8  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...                    SVR -0.000942  0.000047
9  5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i...       LinearRegression -0.000369  0.000033

Final result:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model          mean           std
0  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...  DecisionTreeRegressor -2.919802e-04  3.105376e-05
1  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...  RandomForestRegressor -2.144352e-04  2.369516e-05
2  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...           XGBRegressor -1.929303e-04  2.213503e-05
3  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...                    SVR -9.419629e-04  4.666016e-05
4  74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_...       LinearRegression -4.433353e+14  6.487895e+14
Little difference from the unprocessed run, except that LR becomes far worse.

Conclusion: do not adopt.


Aside:
03. Repair the wrong mapping inside the magic-feature code and observe the effect:
train_data[col_name] = train_data['B14'].map(order_label)
=>train_data[col_name] = train_data[f1].map(order_label)
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[3]:
                                                name                  model          mean           std
0  209['B14_to_B14_intTarget_0_mean', 'B14_to_B14...  DecisionTreeRegressor -2.911757e-04  3.106674e-05
1  209['B14_to_B14_intTarget_0_mean', 'B14_to_B14...  RandomForestRegressor -2.093248e-04  2.677998e-05
2  209['B14_to_B14_intTarget_0_mean', 'B14_to_B14...           XGBRegressor -1.929303e-04  2.213503e-05
3  209['B14_to_B14_intTarget_0_mean', 'B14_to_B14...                    SVR -9.419629e-04  4.666016e-05
4  209['B14_to_B14_intTarget_0_mean', 'B14_to_B14...       LinearRegression -2.595957e+15  2.975308e+15

Again no big change, which suggests this whole treatment is simply not effective.

04. Keep the unique<15 / non-object filter, but also retain the grouping function and add per-category mean columns:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me...  DecisionTreeRegressor -0.000296  0.000032
1  80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me...  RandomForestRegressor -0.000208  0.000019
2  80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me...           XGBRegressor -0.000194  0.000022
3  80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me...                    SVR -0.000942  0.000047
4  80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me...       LinearRegression -0.000379  0.000158

Overall: little difference from no processing.

特征11_Sample id

01. After taking id mod 500, a clear periodicity appears (a sketch of the derived columns follows the table below).
FeatureTools.get_column_corr(train_data,target_column,feature_columns=['id','id_mode500','id_div500','id_mode500_diff250'])
Out[3]:
                     pearson  spearman      mine
rate                1.000000  1.000000  0.997135
id                  0.063930  0.067572  0.571742
id_mode500         -0.036691 -0.042827  0.225065
id_mode500_diff250 -0.204026 -0.153917  0.176602
id_div500           0.075036  0.079958  0.059548
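My reading of these derived columns, as a sketch (id_mode500 taken as id mod 500, and diff250 as the distance from the cycle midpoint):

train_data['id_mode500'] = train_data['id'] % 500
train_data['id_div500'] = train_data['id'] // 500
train_data['id_mode500_diff250'] = (train_data['id_mode500'] - 250).abs()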

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['id'],['id','id_mode500','id_div500'],['id','id_mode500','id_div500','id_mode500_diff250'],['id','id_mode500','id_mode500_diff250']])
Out[5]:
                                                 name                  model      mean       std
0                                             1['id']  DecisionTreeRegressor -0.000978  0.000056
1                                             1['id']  RandomForestRegressor -0.000719  0.000034
2                                             1['id']           XGBRegressor -0.000498  0.000039
3                                             1['id']                    SVR -0.000942  0.000047
4                                             1['id']       LinearRegression -0.000805  0.000059
5                  3['id', 'id_mode500', 'id_div500']  DecisionTreeRegressor -0.000962  0.000057
6                  3['id', 'id_mode500', 'id_div500']  RandomForestRegressor -0.000719  0.000046
7                  3['id', 'id_mode500', 'id_div500']           XGBRegressor -0.000502  0.000031
8                  3['id', 'id_mode500', 'id_div500']                    SVR -0.000942  0.000047
9                  3['id', 'id_mode500', 'id_div500']       LinearRegression -0.000806  0.000056
10  4['id', 'id_mode500', 'id_div500', 'id_mode500...  DecisionTreeRegressor -0.000961  0.000056
11  4['id', 'id_mode500', 'id_div500', 'id_mode500...  RandomForestRegressor -0.000699  0.000046
12  4['id', 'id_mode500', 'id_div500', 'id_mode500...           XGBRegressor -0.000502  0.000031
13  4['id', 'id_mode500', 'id_div500', 'id_mode500...                    SVR -0.000942  0.000047
14  4['id', 'id_mode500', 'id_div500', 'id_mode500...       LinearRegression -0.000687  0.000049
15        3['id', 'id_mode500', 'id_mode500_diff250']  DecisionTreeRegressor -0.000962  0.000051
16        3['id', 'id_mode500', 'id_mode500_diff250']  RandomForestRegressor -0.000716  0.000033
17        3['id', 'id_mode500', 'id_mode500_diff250']           XGBRegressor -0.000502  0.000031
18        3['id', 'id_mode500', 'id_mode500_diff250']                    SVR -0.000942  0.000047
19        3['id', 'id_mode500', 'id_mode500_diff250']       LinearRegression -0.000687  0.000049

15              2['id_mode500', 'id_mode500_diff250']  DecisionTreeRegressor -0.001172  0.000114
16              2['id_mode500', 'id_mode500_diff250']  RandomForestRegressor -0.001027  0.000093
17              2['id_mode500', 'id_mode500_diff250']           XGBRegressor -0.000709  0.000054
18              2['id_mode500', 'id_mode500_diff250']                    SVR -0.000942  0.000047
19              2['id_mode500', 'id_mode500_diff250']       LinearRegression -0.000688  0.000058

The best option is to keep all 4 id features.
Final effect:

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  73['id', 'id_mode500', 'id_div500', 'id_mode50...  DecisionTreeRegressor -0.000251  0.000022
1  73['id', 'id_mode500', 'id_div500', 'id_mode50...  RandomForestRegressor -0.000161  0.000012
2  73['id', 'id_mode500', 'id_div500', 'id_mode50...           XGBRegressor -0.000150  0.000016
3  73['id', 'id_mode500', 'id_div500', 'id_mode50...                    SVR -0.000942  0.000047
4  73['id', 'id_mode500', 'id_div500', 'id_mode50...       LinearRegression -0.000225  0.000015


Aside:
02: keep only the 3 features 'id', 'id_mode500', 'id_mode500_diff250':
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  72['id', 'id_mode500', 'id_mode500_diff250', '...  DecisionTreeRegressor -0.000260  0.000027
1  72['id', 'id_mode500', 'id_mode500_diff250', '...  RandomForestRegressor -0.000159  0.000014
2  72['id', 'id_mode500', 'id_mode500_diff250', '...           XGBRegressor -0.000150  0.000016
3  72['id', 'id_mode500', 'id_mode500_diff250', '...                    SVR -0.000942  0.000047
4  72['id', 'id_mode500', 'id_mode500_diff250', '...       LinearRegression -0.000225  0.000015

About the same; keep all 4 after all.

04. Keep only id:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1...  DecisionTreeRegressor -0.000259  0.000027
1  70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1...  RandomForestRegressor -0.000156  0.000018
2  70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1...           XGBRegressor -0.000148  0.000017
3  70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1...                    SVR -0.000942  0.000047
4  70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1...       LinearRegression -0.000248  0.000022



05: with all 4 id features kept, test switching off the clip processing of the other features:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model          mean           std
0  121['id', 'id_mode500', 'id_div500', 'id_mode5...  DecisionTreeRegressor -2.435147e-04  1.663540e-05
1  121['id', 'id_mode500', 'id_div500', 'id_mode5...  RandomForestRegressor -1.541371e-04  9.708604e-06
2  121['id', 'id_mode500', 'id_div500', 'id_mode5...           XGBRegressor -1.504820e-04  1.578562e-05
3  121['id', 'id_mode500', 'id_div500', 'id_mode5...                    SVR -9.419629e-04  4.666016e-05
4  121['id', 'id_mode500', 'id_div500', 'id_mode5...       LinearRegression -2.666657e+12  4.283183e+12

By comparison, removing clip does improve the scores slightly, but given how the feature count balloons (73 -> 121), keep the clip operation for now.

特征12_Process time-difference features

Feature sequence:
time_series_list = ['A5', 'A9', 'A11', 'A14', 'A16', 'A20', 'A24', 'A26', 'A28', 'B4', 'B5', 'B7', 'B9', 'B10']
train_data=all_data.loc[train_ids]
FeatureTools.get_column_corr(train_data,target_column,feature_columns=diff_columns)
Out[3]:
               pearson  spearman      mine
rate          1.000000  1.000000  0.997135
B9_B10_diff   0.225574 -0.048804  0.501285
A24_A26_diff  0.103007  0.406307  0.480982
A5_A9_diff    0.070042  0.255253  0.363899
A16_A20_diff -0.165845 -0.323325  0.362730
A26_A28_diff -0.252735 -0.307826  0.355320
B5_B7_diff   -0.132855 -0.189066  0.307381
B4_B5_diff   -0.088838 -0.227567  0.247909
A20_A24_diff -0.024127  0.225474  0.198716
B7_B9_diff   -0.142460 -0.259364  0.176202
A28_B4_diff   0.069055  0.165902  0.155110
A9_A11_diff  -0.024438 -0.060434  0.025192
A11_A14_diff -0.000033 -0.018616  0.018966
A14_A16_diff  0.010135 -0.016737  0.011144
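Roughly how the diff columns come about (a sketch; I assume each timestamp was first converted to an hour value, here in hypothetical '_hour' columns, with negative gaps wrapped across midnight):

for a, b in zip(time_series_list[:-1], time_series_list[1:]):
    diff = all_data[b + '_hour'] - all_data[a + '_hour']
    # A later step that lands before the earlier one crossed midnight.
    all_data['%s_%s_diff' % (a, b)] = diff.where(diff >= 0, diff + 24)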

Final effect:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[4]:
                                                name                  model      mean       std
0  86['id', 'id_mode500', 'id_div500', 'id_mode50...  DecisionTreeRegressor -0.000269  0.000038
1  86['id', 'id_mode500', 'id_div500', 'id_mode50...  RandomForestRegressor -0.000159  0.000019
2  86['id', 'id_mode500', 'id_div500', 'id_mode50...           XGBRegressor -0.000155  0.000018
3  86['id', 'id_mode500', 'id_div500', 'id_mode50...                    SVR -0.000942  0.000047
4  86['id', 'id_mode500', 'id_div500', 'id_mode50...       LinearRegression -0.000247  0.000029

Reference comparison: no diff features:
Out[1]:
                                                name                  model      mean       std
0  73['id', 'id_mode500', 'id_div500', 'id_mode50...  DecisionTreeRegressor -0.000258  0.000017
1  73['id', 'id_mode500', 'id_div500', 'id_mode50...  RandomForestRegressor -0.000153  0.000016
2  73['id', 'id_mode500', 'id_div500', 'id_mode50...           XGBRegressor -0.000150  0.000016
3  73['id', 'id_mode500', 'id_div500', 'id_mode50...                    SVR -0.000942  0.000047
4  73['id', 'id_mode500', 'id_div500', 'id_mode50...       LinearRegression -0.000225  0.000015

Reference comparison: using only ['B9_B10_diff','A24_A26_diff']:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model      mean       std
0  75['id', 'id_mode500', 'id_div500', 'id_mode50...  DecisionTreeRegressor -0.000264  0.000035
1  75['id', 'id_mode500', 'id_div500', 'id_mode50...  RandomForestRegressor -0.000162  0.000019
2  75['id', 'id_mode500', 'id_div500', 'id_mode50...           XGBRegressor -0.000153  0.000018
3  75['id', 'id_mode500', 'id_div500', 'id_mode50...                    SVR -0.000942  0.000047
4  75['id', 'id_mode500', 'id_div500', 'id_mode50...       LinearRegression -0.000226  0.000015

Conclusion: useless; adding no diff features is best.

特征13_Train/test distribution differences

Focus on A21 and A22.

train_data['A21'].value_counts()
Out[6]:
50.0    1254
40.0      63
30.0      42
35.0      15
20.0       7
60.0       4
45.0       2
55.0       2
80.0       1
70.0       1
25.0       1
90.0       1
Name: A21, dtype: int64
test_data['A21'].value_counts()
Out[7]:
50    135
40      8
30      3
35      2
25      2
Name: A21, dtype: int64


train_data['A22'].value_counts()
Out[8]:
9.0     1216
10.0     174
8.0        5
3.5        1
Name: A22, dtype: int64
test_data['A22'].value_counts()
Out[9]:
9     131
10     19
Name: A22, dtype: int64

Check the per-value target means for A22:
pd.DataFrame(train_data[['A22',target_column]].groupby(by=['A22'])[target_column].mean()).join(train_data['A22'].value_counts())
Out[16]:
          rate   A22
A22
3.5   0.902000     1
8.0   0.939800     5
9.0   0.925468  1216
10.0  0.907345   174

特征14_Conversion tests on the second data batch

1. Inspect the features that get median filling:
fillna_columns = ['A21',  'A24', 'A26',  'B1', 'B12', 'B5', 'B8']
FeatureTools.get_columns_info(all_data,fillna_columns)
Out[3]:
       type  count  nan_count count_rate  unique_count                                         unique_set         mean         std    min    25%     50%     75%     max
A21   float   1543          3      0.998            13  [(50.0, 1389), (40.0, 71), (30.0, 45), (35.0, ...    48.690862    4.954759   20.0   50.0    50.0    50.0    90.0
A24  object   1545          1      0.999            94  [(12:00:00, 289), (20:00:00, 268), (4:00:00, 1...          NaN         NaN    NaN    NaN     NaN     NaN     NaN
A26  object   1544          2      0.999            91  [(13:00:00, 297), (21:00:00, 271), (5:00:00, 1...          NaN         NaN    NaN    NaN     NaN     NaN     NaN
B1    float   1535         11      0.993            22  [(320.0, 835), (300.0, 142), (350.0, 130), (37...   335.336482  107.244534    3.5  320.0   320.0   330.0  1200.0
B12   float   1545          1      0.999             5  [(1200.0, 859), (800.0, 646), (900.0, 24), (40...  1019.805825  206.114344  400.0  800.0  1200.0  1200.0  1200.0
B5   object   1545          1      0.999            63  [(15:00:00, 276), (23:00:00, 236), (7:00:00, 1...          NaN         NaN    NaN    NaN     NaN     NaN     NaN
B8    float   1545          1      0.999            26  [(45.0, 1204), (40.0, 157), (50.0, 47), (28.0,...    43.711327    4.328276   20.0   45.0    45.0    45.0    73.0

Missing data is in fact quite rare, so no other special handling is warranted.

2. Everywhere else fillna is used, NaNs are filled with a distinct -1. The relevant NaN ratios are tiny, so nothing further is needed.

3. Cancel the clip on B14, B6, A19, A27:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[3]:
                                                name                  model          mean           std
0  121['id', 'id_mode500', 'id_div500', 'id_mode5...  DecisionTreeRegressor -2.528032e-04  2.835551e-05
1  121['id', 'id_mode500', 'id_div500', 'id_mode5...  RandomForestRegressor -1.578689e-04  1.122595e-05
2  121['id', 'id_mode500', 'id_div500', 'id_mode5...           XGBRegressor -1.504820e-04  1.578562e-05
3  121['id', 'id_mode500', 'id_div500', 'id_mode5...                    SVR -9.419629e-04  4.666016e-05
4  121['id', 'id_mode500', 'id_div500', 'id_mode5...       LinearRegression -9.161214e+12  1.511062e+13

Keep the cancellation.
Test dummy-encoding all float features:

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model          mean           std
0  264['id', 'id_mode500', 'id_div500', 'id_mode5...  DecisionTreeRegressor -2.517986e-04  2.809433e-05
1  264['id', 'id_mode500', 'id_div500', 'id_mode5...  RandomForestRegressor -1.574401e-04  1.260128e-05
2  264['id', 'id_mode500', 'id_div500', 'id_mode5...           XGBRegressor -1.495908e-04  1.611149e-05
3  264['id', 'id_mode500', 'id_div500', 'id_mode5...                    SVR -9.419629e-04  4.666016e-05
4  264['id', 'id_mode500', 'id_div500', 'id_mode5...       LinearRegression -9.540508e+12  1.439639e+13

Floats with <10 unique values dummied, >10 left unchanged:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
                                                name                  model          mean           std
0  136['id', 'id_mode500', 'id_div500', 'id_mode5...  DecisionTreeRegressor -2.381848e-04  1.889620e-05
1  136['id', 'id_mode500', 'id_div500', 'id_mode5...  RandomForestRegressor -1.573015e-04  1.784003e-05
2  136['id', 'id_mode500', 'id_div500', 'id_mode5...           XGBRegressor -1.499761e-04  1.690648e-05
3  136['id', 'id_mode500', 'id_div500', 'id_mode5...                    SVR -9.419629e-04  4.666016e-05
4  136['id', 'id_mode500', 'id_div500', 'id_mode5...       LinearRegression -6.445922e+13  1.014251e+14

4. On top of 3, skip the RFECV step:
  mode           mse            r2                                     best_estimator
0   cv  6.776363e-04  1.596780e-01  LinearSVR(C=0.1, dual=True, epsilon=0.01, fit_...
1   cv  1.414319e-04  8.246133e-01  RandomForestRegressor(bootstrap=False, criteri...
2   cv  7.663601e+20 -9.503465e+23  LinearRegression(copy_X=True, fit_intercept=Tr...
3   cv  1.504837e-04  8.133884e-01  GradientBoostingRegressor(alpha=0.9, criterion...
4   cv  3.571991e-04  5.570452e-01  ElasticNetCV(alphas=None, copy_X=True, cv='war...
5   cv  1.381657e-04  8.286637e-01  XGBRegressor(base_score=0.5, booster='gbtree',...

Submissions 08, 09 and 10 are built on this.


5. On top of 4, restore RFECV and raise the tuning-search trial count to 100.
RFECV with XGB as the estimator:
INFO:common.gscvTools:ret_df:
  mode       mse        r2                                     best_estimator
0   cv  0.000255  0.684052  LinearRegression(copy_X=True, fit_intercept=Tr...
1   cv  0.000136  0.831958  XGBRegressor(base_score=0.5, booster='gbtree',...
2   cv  0.000752  0.067833  LinearSVR(C=25.0, dual=True, epsilon=0.01, fit...
3   cv  0.000376  0.533659  ElasticNetCV(alphas=None, copy_X=True, cv=None...
4   cv  0.000133  0.835663  RandomForestRegressor(bootstrap=True, criterio...
5   cv  0.000139  0.827897  GradientBoostingRegressor(alpha=0.9, criterion...

RFECV with RFR as the estimator:
  mode       mse        r2                                     best_estimator
0   cv  0.000142  0.823842  XGBRegressor(base_score=0.5, booster='gbtree',...
1   cv  0.000140  0.825954  RandomForestRegressor(bootstrap=True, criterio...
2   cv  0.000220  0.727425  LinearRegression(copy_X=True, fit_intercept=Tr...
3   cv  0.000138  0.829087  GradientBoostingRegressor(alpha=0.99, criterio...
4   cv  0.000900 -0.115566  LinearSVR(C=5.0, dual=True, epsilon=0.01, fit_...
5   cv  0.000357  0.557045  ElasticNetCV(alphas=None, copy_X=True, cv=None...

Without RFECV:
  mode           mse            r2                                     best_estimator
0   cv  1.378900e-04  8.290056e-01  RandomForestRegressor(bootstrap=False, criteri...
1   cv  1.463699e-04  8.184899e-01  GradientBoostingRegressor(alpha=0.95, criterio...
2   cv  1.182782e+22 -1.466742e+25  LinearRegression(copy_X=True, fit_intercept=Tr...
3   cv  1.409546e-04  8.252052e-01  XGBRegressor(base_score=0.5, booster='gbtree',...
4   cv  7.924101e-04  1.734945e-02  LinearSVR(C=5.0, dual=True, epsilon=0.01, fit_...
5   cv  3.571991e-04  5.570452e-01  ElasticNetCV(alphas=None, copy_X=True, cv=None...

Skipping RFECV is best.

6. Without RFECV, add the prediction of a KNN regressor fitted on the id feature and use it to replace id:
INFO:common.gscvTools:ret_df:
  mode           mse            r2                                     best_estimator
0   cv  1.252570e-04  8.446715e-01  RandomForestRegressor(bootstrap=False, criteri...
1   cv  2.016451e+22 -2.500557e+25  LinearRegression(copy_X=True, fit_intercept=Tr...
2   cv  2.929551e-04  6.367128e-01  ElasticNetCV(alphas=None, copy_X=True, cv=None...
3   cv  1.431421e-04  8.224926e-01  XGBRegressor(base_score=0.5, booster='gbtree',...
4   cv  6.831870e-04  1.527946e-01  LinearSVR(C=1.0, dual=True, epsilon=0.01, fit_...
5   cv  1.286914e-04  8.404125e-01  GradientBoostingRegressor(alpha=0.99, criterio...

3-model ensemble (xgb, gbr, rfr), submission 12:
INFO:__main__:StackingRegressor meta_model_scores:{'XGBRegressor': -0.000137, 'SVR': -0.000942, 'LinearRegression': -0.000134, 'Ridge': -0.000183, 'LinearSVR': -0.000129}
INFO:__main__:StackingCVRegressor meta_model_scores:{'XGBRegressor': -0.000133, 'SVR': -0.000942, 'LinearRegression': -0.000125, 'Ridge': -0.000195, 'LinearSVR': -0.000128}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'XGBRegressor': -0.000131, 'SVR': -0.000942, 'LinearRegression': -0.000126, 'Ridge': -0.000197, 'LinearSVR': -0.000132}

CV stacking with the LR meta-model: 0.000125 locally; online: 0.00008351.

特征15_Cross-checking someone else's pipeline against mine

Theirs: ref02.
1. Their features + their models:
lgb:CV score: 0.00012217
xgb:CV score: 0.00012019
stacking score 0.0001159681469760565

2. My features + their models:
lgb:CV score: 0.00014090
xgb:CV score: 0.00012429
stacking score 0.00012452678699482447

Custom model-evaluation scoring:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

scoring = 'neg_mean_squared_error'
cv = 5
models = [DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor(), LinearRegression()]
rows = []
for model in models:
    print(model.__class__.__name__)
    # 5-fold CV score per model on the shared train matrix.
    scores = cross_val_score(model, X_train, y_train, scoring=scoring, cv=cv)
    rows.append({'name': 'xxxx', 'model': model.__class__.__name__,
                 'mean': scores.mean(), 'std': scores.std()})
score_df = pd.DataFrame(rows, columns=['name', 'model', 'mean', 'std'])
print('score_df:\n%s' % score_df)

3. Their features + my scoring code:
score_df:
   name                  model      mean       std
0  xxxx  DecisionTreeRegressor -0.000219  0.000049
1  xxxx  RandomForestRegressor -0.000140  0.000016
2  xxxx           XGBRegressor -0.000137  0.000017
3  xxxx       LinearRegression -0.008746  0.001601

4. My features + my scoring code:
score_df:
   name                  model          mean           std
0  xxxx  DecisionTreeRegressor -2.645024e-04  2.672804e-05
1  xxxx  RandomForestRegressor -1.488141e-04  7.269487e-06
2  xxxx           XGBRegressor -1.408158e-04  5.988981e-06
3  xxxx       LinearRegression -2.253568e+13  2.939562e+13

From 1 & 2: their features fit their models better.
From 3 & 4: their features also score better under my own evaluation code.
Taken together: their features are simply better than mine.

5. Try their features combined with my own parameter search and ensembling methods.

My quick evaluation code gives:
score_df:
   name                  model      mean       std
0  xxxx  DecisionTreeRegressor -0.000213  0.000035
1  xxxx  RandomForestRegressor -0.000139  0.000014
2  xxxx           XGBRegressor -0.000137  0.000017
3  xxxx       LinearRegression -0.008746  0.001601

Best models and parameters without RFECV:
  mode       mse         r2                                     best_estimator
0   cv  0.042856 -52.144454  LinearSVR(C=25.0, dual=True, epsilon=0.01, fit...
1   cv  0.008748  -9.847654  LinearRegression(copy_X=True, fit_intercept=Tr...
2   cv  0.000126   0.844056  GradientBoostingRegressor(alpha=0.99, criterio...
3   cv  0.000128   0.841737  RandomForestRegressor(bootstrap=True, criterio...
4   cv  0.000365   0.547830  ElasticNetCV(alphas=None, copy_X=True, cv=None...
5   cv  0.000123   0.847924  XGBRegressor(base_score=0.5, booster='gbtree',...

RFECV approach: results too noisy; put on hold.

特征16_Error analysis

Observing continuous-vs-continuous relationships:
all_data_sort=all_data.sort_values(by=['id'])
all_data_sort['%s_tmp1'%target_column]=all_data_sort[target_column].replace(0,np.nan).fillna(method='ffill')

plt.scatter(all_data_sort['id'],all_data_sort['%s_tmp1'%target_column].rolling(20).mean());plt.show()
plt.plot(all_data_sort['%s_tmp1'%target_column].rolling(20).mean());plt.show()

plt.plot(all_data_sort[target_column]);plt.show()
plt.plot(all_data_sort[all_data_sort[target_column]>0][target_column]);plt.show()
plt.plot(all_data_sort[all_data_sort[target_column]>0][target_column].rolling(20).mean());plt.show()


Error analysis: observations on the training set.
Sort the rows by the feature to analyze:
train_data['y_predict']=y_predict
train_data_sort=train_data.sort_values(by=['id'])

Relationship between the feature under analysis and the target:
plt.scatter(train_data_sort['id'],train_data_sort[target_column]);plt.show()
plt.scatter(train_data_sort['id'],train_data_sort['y_predict']);plt.show()

Point views, raw series:
plt.plot(train_data_sort[target_column]);plt.show()
plt.plot(train_data_sort['y_predict']);plt.show()
plt.plot(train_data_sort['y_diff']);plt.show()

Line views: all data, valid data, and the smoothed valid data:
plt.plot(train_data_sort[target_column].rolling(20).mean());plt.show()
plt.plot(train_data_sort['y_predict'].rolling(20).mean());plt.show()
plt.plot(train_data_sort['y_diff'].rolling(20).mean());plt.show()

Scatter of target_column against the error term y_diff:
plt.scatter(train_data_sort[target_column],train_data_sort['y_diff']);plt.show()


First, check how well id_predict performs:
mean_squared_error(train_data[target_column],train_data[target_column])
Out[70]: 0.0

mean_squared_error(train_data[target_column],train_data[target_column].rolling(5).mean().fillna(method='bfill'))
Out[71]: 0.000640307076611152
mean_squared_error(train_data[target_column],train_data['id_predict'])
Out[72]: 0.00038063561661347604
mean_squared_error(train_data[target_column],train_data[target_column].rolling(2).mean().fillna(method='bfill'))
Out[74]: 0.0003902581028240407
So the predicted id_predict is actually fairly reliable, roughly on par with a 2-sample moving average; note that a rolling mean will always raise the MSE relative to the original series.

New ideas:
1. Use interpolation to fill id_predict on the test data.
Interpolation actually makes things worse: both 20-fold and 3-fold sit around 0.0015, versus roughly 0.0008 for plain mean filling,
so it is not viable; ffill-style filling at 20-fold gives about 0.0009, also poor.

2. Still use KNN, but let id_predict fill only the test rows; on train, id_predict = target. Sketched below.
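A sketch of idea 2 under these assumptions (5 neighbours is arbitrary; id_predict is the column used in the error analysis above):

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train_data[['id']], train_data[target_column])
# Train rows keep the true target; only test rows get the KNN estimate.
train_data['id_predict'] = train_data[target_column]
test_data['id_predict'] = knn.predict(test_data[['id']])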

3. Treat B14 with a similar scheme.
Quick feature scoring (the quick evaluation snippet from 特征15):
score_df:
   name                  model          mean           std
0  xxxx  DecisionTreeRegressor -2.495737e-04  1.781087e-05
1  xxxx  RandomForestRegressor -1.430621e-04  1.691274e-05
2  xxxx           XGBRegressor -1.296319e-04  9.291542e-06
3  xxxx       LinearRegression -1.645531e+13  2.947468e+13
Model parameter search:
  mode           mse            r2                                     best_estimator
0   cv  8.359894e-04 -3.669229e-02  LinearSVR(C=0.1, dual=True, epsilon=0.01, fit_...
1   cv  2.929551e-04  6.367127e-01  ElasticNetCV(alphas=None, copy_X=True, cv=None...
2   cv  1.248960e-04  8.451191e-01  XGBRegressor(base_score=0.5, booster='gbtree',...
3   cv  1.644659e+13 -2.039506e+16  LinearRegression(copy_X=True, fit_intercept=Tr...
4   cv  1.168506e-04  8.550961e-01  GradientBoostingRegressor(alpha=0.99, criterio...
5   cv  1.163944e-04  8.556618e-01  RandomForestRegressor(bootstrap=False, criteri...

Ensemble parameter search:
  mode       mse        r2                                     best_estimator
0   cv  0.000035  0.956183  ElasticNetCV(alphas=None, copy_X=True, cv=None...
1   cv  0.000034  0.957520  LinearRegression(copy_X=True, fit_intercept=Tr...
2   cv  0.000037  0.953728  GradientBoostingRegressor(alpha=0.8, criterion...
3   cv  0.000048  0.940388  LinearSVR(C=5.0, dual=True, epsilon=0.01, fit_...
4   cv  0.000038  0.952951  RandomForestRegressor(bootstrap=True, criterio...
5   cv  0.000037  0.954310  XGBRegressor(base_score=0.5, booster='gbtree',...

Final choice: the LR ensemble, local 0.000034 (submission 16); submitted online.

集成01_mlxtend

Single-model performance:
  mode       mse        r2                                     best_estimator
0   cv  0.000188  0.766592  GradientBoostingRegressor(alpha=0.85, criterio...
--1   cv  0.001061 -0.315957  LinearSVR(C=0.01, dual=True, epsilon=0.01, fit...
--2   cv  0.000403  0.500398  ElasticNetCV(alphas=None, copy_X=True, cv='war...
3   cv  0.000192  0.762168  RandomForestRegressor(bootstrap=False, criteri...
4   cv  0.000241  0.700835  LinearRegression(copy_X=True, fit_intercept=Tr...
5   cv  0.000191  0.762554  XGBRegressor(base_score=0.5, booster='gbtree',...

Ensembling the best 4 models:
len(candidate_models)
Out[17]: 4
INFO:__main__:StackingRegressor meta_model_scores:{'Ridge': -0.00022, 'LinearSVR': -0.000204, 'LinearRegression': -0.000207, 'XGBRegressor': -0.000205, 'SVR': -0.000942}
INFO:__main__:StackingCVRegressor meta_model_scores:{'Ridge': -0.000231, 'LinearSVR': -0.000189, 'LinearRegression': -0.000185, 'XGBRegressor': -0.000188, 'SVR': -0.000942}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'LinearSVR': -0.000187, 'LinearRegression': -0.00018, 'Ridge': -0.000232, 'SVR': -0.000942, 'XGBRegressor': -0.000186}
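These scores come from sweeping candidate meta-regressors over the same stacked base learners. A minimal sketch of one configuration, assuming candidate_models holds the 4 tuned base regressors (get_StackingCV_OOF is my own out-of-fold variant, not shown):

from mlxtend.regressor import StackingCVRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Out-of-fold stacking: base predictions feed an LR meta-model.
stack = StackingCVRegressor(regressors=list(candidate_models),
                            meta_regressor=LinearRegression(), cv=5)
scores = cross_val_score(stack, X_train, y_train,
                         scoring='neg_mean_squared_error', cv=5)
print(scores.mean(), scores.std())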

Finally adopt the LR meta-model from get_StackingCV_OOF.
Generated output: 提交02.txt

Submissions

提交01_单算法_RFECV(XGB)_XGB_000187

File: 提交01_单算法_RFECV(XGB)_XGB_000187.csv
Approach: single model; RFECV feature filtering with an XGB estimator, then XGB as the final model:
from sklearn.feature_selection import RFECV
from xgboost import XGBRegressor

rfecv = RFECV(estimator=XGBRegressor(),  # base learner whose importances drive the elimination
              step=1,  # features removed per iteration
              cv=5,
              scoring='neg_mean_squared_error',  # evaluation metric
              verbose=0,
              n_jobs=-1
              ).fit(X_train, y_train)
best_rfe_model = rfecv
X_train = best_rfe_model.transform(X_train)
X_test = best_rfe_model.transform(X_test)
tmp_df = GSCVTools.best_modelAndParam_reg(X_train, None, y_train, None)
best_model = tmp_df.loc[2, 'best_estimator']  # XGBRegressor, 0.000187
Online: 0.000138

提交02_集成_RFECV(XGB)_4算法_集成CVOOF(LR)_000185

提交03_单算法_RFECV(XGB)_RFR_000138

INFO:root:scores of each RFECV-filtered feature set on each model:
                        LinearRegression  LinearSVR  RandomForestRegressor  XGBRegressor
XGBRegressor                  -0.000267  -0.001267              -0.000150     -0.000146
RandomForestRegressor         -0.000225  -0.001068              -0.000170     -0.000146
LinearSVR                     -0.000282  -0.000307              -0.000274     -0.000276
LinearRegression              -0.000224  -0.001264              -0.000191     -0.000180

Selection: XGBRegressor

INFO:common.gscvTools:ret_df:
  mode       mse        r2                                     best_estimator
0   cv  0.000267  0.669200  LinearRegression(copy_X=True, fit_intercept=Tr...
1   cv  0.000144  0.821609  XGBRegressor(base_score=0.5, booster='gbtree',...
2   cv  0.000138  0.828979  RandomForestRegressor(bootstrap=False, criteri...
--3   cv  0.000797  0.012188  LinearSVR(C=10.0, dual=True, epsilon=0.01, fit...
--4   cv  0.000406  0.496977  ElasticNetCV(alphas=None, copy_X=True, cv='war...
5   cv  0.000148  0.816161  GradientBoostingRegressor(alpha=0.95, criterio...

Best single model: RandomForestRegressor at 0.000138; online score: 0.00009482.

Ensemble: the 4 best models.
INFO:__main__:StackingRegressor meta_model_scores:  {'LinearSVR': -0.000144, 'SVR': -0.000942, 'Ridge': -0.000174, 'LinearRegression': -0.000149, 'XGBRegressor': -0.000156}
INFO:__main__:StackingCVRegressor meta_model_scores:{'LinearSVR': -0.000141, 'SVR': -0.000942, 'Ridge': -0.000186, 'LinearRegression': -0.000135, 'XGBRegressor': -0.000135}
INFO:__main__:get_StackingCV_OOF meta_model_scores: {'LinearSVR': -0.000138, 'SVR': -0.000942, 'Ridge': -0.000184, 'LinearRegression': -0.000132, 'XGBRegressor': -0.000131}

Best: get_StackingCV_OOF with XGB, 0.000131; online score: 0.00008971.

提交05_单算法_rfecv(XGB)_RF(HYTools)_000134


INFO:common.hyperoptTools:ret_df:
  mode       mse         r2                                     best_estimator                                              other
4   cv  0.000148   0.816798  Pipeline(memory=None,\n     steps=[('randomfor...  {'learner': (DecisionTreeRegressor(criterion='...
5   cv  0.000151   0.812444  Pipeline(memory=None,\n     steps=[('minmaxsca...  {'learner': XGBRegressor(base_score=0.5, boost...
2   cv  0.000267   0.669208  Pipeline(memory=None,\n     steps=[('standards...  {'learner': ElasticNet(alpha=6.740345196335565...
3   cv  0.000573   0.289710  Pipeline(memory=None,\n     steps=[('pca', PCA...  {'learner': SVR(C=412.7656895152976, cache_siz...
0   cv  0.000683   0.153145  Pipeline(memory=None,\n     steps=[('normalize...  {'learner': SVR(C=0.6522903726264077, cache_si...
1   cv  0.011210 -12.901622  Pipeline(memory=None,\n     steps=[('normalize...  {'learner': SGDRegressor(alpha=1.2944476618533...

Analyzing the xgb result: it is not as good as my own randomized grid search.
tmp_df['best_estimator'][1]
:Pipeline(memory=None,
     steps=[('normalizer', Normalizer(copy=True, norm='l1')), ('sgdregressor', SGDRegressor(alpha=1.2944476618533275e-06, average=False,
       early_stopping=False, epsilon=24.444822037999185,
       eta0=1.2861442205377238e-05, fit_intercept=True,
       l1_ratio=0.3394045112936167, learning_rate='invs...True, tol=5.4490942553738105e-05, validation_fraction=0.1,
       verbose=False, warm_start=False))])
Standalone prediction with preprocessing=[]:
Result: 0.000276

Standalone prediction with preprocessing left at the default (#preprocessing=None):
Result: 0.000159

xgb
Pipeline(memory=None,
     steps=[('normalizer', Normalizer(copy=True, norm='l1')), ('svr', SVR(C=0.6522903726264077, cache_size=512, coef0=0.0, degree=1,
  epsilon=0.00841190976846431, gamma='auto', kernel='linear',
  max_iter=10852978.0, shrinking=False, tol=0.00014583123854438414,
  verbose=False))])

Standalone prediction with preprocessing=[]:
Result: 0.000140
Standalone prediction with the default (#preprocessing=None):
Result: 0.000139

Clearly the outcome has a large random component; extend the hyperopt search time and trial count:
INFO:common.hyperoptTools:ret_df:
  mode       mse        r2                                     best_estimator                                              other
3   cv  0.000134  0.833866  Pipeline(memory=None,\n     steps=[('randomfor...  {'learner': (DecisionTreeRegressor(criterion='...
4   cv  0.000138  0.829081  Pipeline(memory=None,\n     steps=[('xgbregres...  {'learner': XGBRegressor(base_score=0.5, boost...
2   cv  0.000267  0.669218  Pipeline(memory=None,\n     steps=[('elasticne...  {'learner': ElasticNet(alpha=3.807842541114131...
1   cv  0.000267  0.669214  Pipeline(memory=None,\n     steps=[('lasso', L...  {'learner': Lasso(alpha=6.933484177963561e-06,...
0   cv  0.000413  0.488276  Pipeline(memory=None,\n     steps=[('svr', SVR...  {'learner': SVR(C=0.0001510031509641054, cache...

tmp_df['best_estimator'][3]
Out[4]:
Pipeline(memory=None,
     steps=[('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=0.6445046040302925, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=3, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=699, n_jobs=1,
           oob_score=False, random_state=1, verbose=False,
           warm_start=False))])
tmp_df['best_estimator'][4]
Out[5]:
Pipeline(memory=None,
     steps=[('xgbregressor', XGBRegressor(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.9827719764498459,
       colsample_bytree=0.7267517612763672, gamma=0.001208164431091291,
       learning_rate=0.005837587884215306, max_delta_step=0, max_depth=10,
       min_child_weight=9, missing=na...590563218164,
       scale_pos_weight=1, seed=4, silent=True,
       subsample=0.7183586790188525))])

Single-model choice: randomforestregressor, 0.000134.
Online score: 0.00009485

Ensemble:
array([Pipeline(memory=None,
     steps=[('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=0.6445046040302925, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=3, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=699, n_jobs=1,
           oob_score=False, random_state=1, verbose=False,
           warm_start=False))]),
       Pipeline(memory=None,
     steps=[('xgbregressor', XGBRegressor(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.9827719764498459,
       colsample_bytree=0.7267517612763672, gamma=0.001208164431091291,
       learning_rate=0.005837587884215306, max_delta_step=0, max_depth=10,
       min_child_weight=9, missing=na...590563218164,
       scale_pos_weight=1, seed=4, silent=True,
       subsample=0.7183586790188525))])], dtype=object)

INFO:__main__:StackingRegressor meta_model_scores:{'XGBRegressor': -0.000146, 'LinearSVR': -0.00014, 'Ridge': -0.000237, 'SVR': -0.000942, 'LinearRegression': -0.000152}
INFO:__main__:StackingCVRegressor meta_model_scores:{'XGBRegressor': -0.000134, 'LinearSVR': -0.000135, 'Ridge': -0.000252, 'SVR': -0.000942, 'LinearRegression': -0.000132}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'XGBRegressor': -0.000136, 'LinearSVR': -0.000139, 'Ridge': -0.000376, 'SVR': -0.000942, 'LinearRegression': -0.000136}

Best: StackingCVRegressor with a LinearRegression meta-model, 0.000132.
Self-test score of the submitted version: 0.000134 (the value returned by the estimator's score method, which can differ from the cross-validation figure).
Online: 0.00009062
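
The three meta-model tables above can be reproduced by looping candidate meta-regressors over the stacked base models. A sketch assuming mlxtend's StackingCVRegressor, with rf_pipe/xgb_pipe standing for the two tuned base pipelines printed above (train_X/train_y as before):

from mlxtend.regressor import StackingCVRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVR, SVR
from xgboost import XGBRegressor

meta_models = {'XGBRegressor': XGBRegressor(), 'LinearSVR': LinearSVR(),
               'Ridge': Ridge(), 'SVR': SVR(),
               'LinearRegression': LinearRegression()}

scores = {}
for name, meta in meta_models.items():
    # rf_pipe, xgb_pipe: the two tuned base pipelines from the array above
    stack = StackingCVRegressor(regressors=(rf_pipe, xgb_pipe),
                                meta_regressor=meta, cv=5)
    s = cross_val_score(stack, train_X, train_y,
                        scoring='neg_mean_squared_error', cv=5)
    scores[name] = round(s.mean(), 6)
print(scores)  # compare with the INFO lines above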

Submission 08_single algorithm_XGBRegressor 0.000138

Single algorithms
INFO:common.gscvTools:ret_df:
  mode           mse            r2                                     best_estimator
0   cv  6.776363e-04  1.596780e-01  LinearSVR(C=0.1, dual=True, epsilon=0.01, fit_...
1   cv  1.414319e-04  8.246133e-01  RandomForestRegressor(bootstrap=False, criteri...
2   cv  7.663601e+20 -9.503465e+23  LinearRegression(copy_X=True, fit_intercept=Tr...
3   cv  1.504837e-04  8.133884e-01  GradientBoostingRegressor(alpha=0.9, criterion...
4   cv  3.571991e-04  5.570452e-01  ElasticNetCV(alphas=None, copy_X=True, cv='war...
5   cv  1.381657e-04  8.286637e-01  XGBRegressor(base_score=0.5, booster='gbtree',...


array([RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,
           max_features=0.5, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=5, min_samples_split=5,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=6,
             max_features=0.6000000000000001, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=8, min_samples_split=3,
             min_weight_fraction_leaf=0.0, n_estimators=200,
             n_iter_no_change=None, presort='auto', random_state=None,
             subsample=0.7000000000000002, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False),
       XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=8, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.7500000000000001)], dtype=object)


INFO:__main__:StackingRegressor meta_model_scores:{'Ridge': -0.000186, 'XGBRegressor': -0.000155, 'LinearSVR': -0.000151, 'LinearRegression': -0.000169, 'SVR': -0.000942}
INFO:__main__:StackingCVRegressor meta_model_scores:{'Ridge': -0.000201, 'XGBRegressor': -0.000138, 'LinearSVR': -0.000138, 'LinearRegression': -0.000136, 'SVR': -0.000942}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'Ridge': -0.0002, 'XGBRegressor': -0.000139, 'LinearSVR': -0.000135, 'LinearRegression': -0.000132, 'SVR': -0.000942}
Best: OOF stacking with a LinearRegression meta-model; online 0.00008888
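
get_StackingCV_OOF is a project helper; a minimal sketch of the idea it appears to implement, scoring a meta-model on out-of-fold base predictions:

import numpy as np
from sklearn.model_selection import cross_val_predict, cross_val_score

def stacking_oof_score(base_models, meta_model, X, y, cv=5):
    # Column i holds base_models[i]'s out-of-fold predictions, so the
    # meta-model never sees a prediction made on its own training fold.
    oof = np.column_stack([cross_val_predict(m, X, y, cv=cv)
                           for m in base_models])
    return cross_val_score(meta_model, oof, y,
                           scoring='neg_mean_squared_error', cv=cv).mean()

# e.g. stacking_oof_score([rfr, gbr, xgb], LinearRegression(), train_X, train_y)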



Raise each algorithm's parameter-search count to 100 iterations, and build submission 10 on the result (a RandomizedSearchCV-style sketch follows the results below).
INFO:common.gscvTools:ret_df:
  mode           mse            r2                                     best_estimator
0   cv  1.234367e-03 -5.307122e-01  LinearSVR(C=0.5, dual=True, epsilon=0.01, fit_...
1   cv  1.372586e-04  8.297886e-01  XGBRegressor(base_score=0.5, booster='gbtree',...
2   cv  7.663601e+20 -9.503465e+23  LinearRegression(copy_X=True, fit_intercept=Tr...
3   cv  1.390731e-04  8.275385e-01  RandomForestRegressor(bootstrap=True, criterio...
4   cv  1.372350e-04  8.298178e-01  GradientBoostingRegressor(alpha=0.95, criterio...
5   cv  3.571991e-04  5.570452e-01  ElasticNetCV(alphas=None, copy_X=True, cv='war...


INFO:__main__:StackingRegressor meta_model_scores:{'LinearRegression': -0.000147, 'Ridge': -0.00019, 'SVR': -0.000942, 'LinearSVR': -0.000135, 'XGBRegressor': -0.000143}
INFO:__main__:StackingCVRegressor meta_model_scores:{'LinearRegression': -0.000133, 'Ridge': -0.000203, 'SVR': -0.000942, 'LinearSVR': -0.000137, 'XGBRegressor': -0.000134}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'LinearRegression': -0.000133, 'Ridge': -0.000205, 'SVR': -0.000942, 'LinearSVR': -0.000136, 'XGBRegressor': -0.000131}

Best: OOF stacking with an XGBRegressor meta-model; self-test 0.0001317283727497388, online: 0.00008679
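
The 100-iteration search could be a plain RandomizedSearchCV per algorithm; a sketch with illustrative distributions (the common.gscvTools wrapper is not shown; train_X/train_y as before):

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions={'n_estimators': randint(100, 800),
                         'max_depth': randint(3, 11),
                         'learning_rate': uniform(0.005, 0.2),
                         'subsample': uniform(0.5, 0.5)},
    n_iter=100,  # search count raised to 100 draws per algorithm
    scoring='neg_mean_squared_error', cv=5, random_state=1)
search.fit(train_X, train_y)
print(-search.best_score_, search.best_params_)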

Submission 12:
replace the id with the id values predicted by knn (sketch below)
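
A hedged sketch of this id trick, assuming the numeric sample id is used as a feature and a KNN regressor maps the other features to it (the frames and column names below are placeholders, not the competition data):

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Placeholder frames standing in for the prepared train/test sets:
# train keeps a numeric id column plus features; test gets a knn-predicted id.
train = pd.DataFrame({'id_num': np.arange(100.0),
                      'f1': np.random.rand(100), 'f2': np.random.rand(100)})
test = pd.DataFrame({'f1': np.random.rand(20), 'f2': np.random.rand(20)})

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train[['f1', 'f2']], train['id_num'])     # features -> numeric id
test['id_num'] = knn.predict(test[['f1', 'f2']])  # knn-predicted id values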

Submission 14_ensemble (parameter test)_3 algorithms_rfr ensemble 0.0005


Add parameter-tuning tests for the ensemble algorithms (see the sketch after the results below).
  mode       mse        r2                                     best_estimator
0   cv  0.000052  0.935504  RandomForestRegressor(bootstrap=True, criterio...
1   cv  0.000053  0.934192  XGBRegressor(base_score=0.5, booster='gbtree',...
2   cv  0.000054  0.932460  LinearRegression(copy_X=True, fit_intercept=Tr...
3   cv  0.000056  0.930052  LinearSVR(C=0.5, dual=True, epsilon=0.01, fit_...
4   cv  0.000055  0.932196  ElasticNetCV(alphas=None, copy_X=True, cv=None...
5   cv  0.000052  0.935667  GradientBoostingRegressor(alpha=0.99, criterio...

Final choice: the RandomForestRegressor ensemble.
Submission 14, online 0.00007808
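
The tuning round behind this table can be read as grid-searching each candidate ensemble learner on the OOF feature matrix; a sketch for the winning RandomForestRegressor (the base models, the grid, and train_X/train_y are illustrative assumptions):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_predict
from xgboost import XGBRegressor

# OOF feature matrix from three base models, as in the earlier OOF sketch.
base_models = [RandomForestRegressor(random_state=1),
               GradientBoostingRegressor(random_state=1),
               XGBRegressor()]
oof = np.column_stack([cross_val_predict(m, train_X, train_y, cv=5)
                       for m in base_models])

# Illustrative grid, not the competition grid.
param_grid = {'n_estimators': [100, 200, 500],
              'max_features': [0.5, 0.7, 1.0],
              'min_samples_leaf': [1, 3, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(oof, train_y)
print(search.best_params_, -search.best_score_)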

Result: top 2.5% (68/2682)


Project