Samples with no label assignment using multilabel random forest in scikit-learn


I am using scikit-learn's RandomForestClassifier to predict multiple labels for a set of documents. Each document has 50 features, no document is missing any features, and each document has at least one label associated with it.


    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=20).fit(x_train, y_train)
    preds = clf.predict(x_test)

However, I have noticed that after prediction there are samples that are assigned no labels at all, even though those samples are not missing label data.

    >>> y_test[0,:]
    array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> preds[0,:]
    array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
            0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
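A quick way to count how many test samples end up with an empty label row (a minimal sketch using the preds array from above):

    import numpy as np

    # preds has shape (n_samples, n_labels); a row of all zeros means
    # the forest assigned no label at all to that sample
    empty = (preds.sum(axis=1) == 0)
    print('samples with no predicted label:', empty.sum())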

The results of predict_proba align with those of predict:

    >>> probas = clf.predict_proba(x_test)
    >>> for label in probas:
    ...     print (label[0][0], label[0][1])
    (0.80000000000000004, 0.20000000000000001)
    (0.94999999999999996, 0.050000000000000003)
    (0.94999999999999996, 0.050000000000000003)
    (1.0, 0.0)
    (1.0, 0.0)
    (1.0, 0.0)
    (0.94999999999999996, 0.050000000000000003)
    (0.90000000000000002, 0.10000000000000001)
    (1.0, 0.0)
    (1.0, 0.0)
    (0.94999999999999996, 0.050000000000000003)
    (1.0, 0.0)
    (0.94999999999999996, 0.050000000000000003)
    (0.84999999999999998, 0.14999999999999999)
    (0.90000000000000002, 0.10000000000000001)
    (0.90000000000000002, 0.10000000000000001)
    (1.0, 0.0)
    (0.59999999999999998, 0.40000000000000002)
    (0.94999999999999996, 0.050000000000000003)
    (0.94999999999999996, 0.050000000000000003)
    (1.0, 0.0)

Each tuple above shows the marginal probabilities for one label on the first sample, and in every case the higher probability has been assigned to the label not appearing. My understanding of decision trees was that at least one label has to be assigned to each sample when predicting, which leaves me a bit confused.
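If my reading is right, predict is effectively thresholding each label's marginal probability at 0.5 independently, which would produce exactly these empty rows. A minimal sketch of that assumption (probas and preds as above):

    import numpy as np

    # probas is a list with one (n_samples, 2) array per label;
    # column 1 is the probability that the label is present.
    # Thresholding each label independently at 0.5 should reproduce predict().
    manual = np.column_stack([p[:, 1] > 0.5 for p in probas]).astype(float)
    print((manual == preds).all())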

Is it expected behavior for a multilabel decision tree / random forest to be able to assign no labels to a sample?

Update 1

The features of each document are probabilities of belonging to a topic, according to a topic model.

    >>> x_train.shape
    (99892L, 50L)
    >>> x_train[3,:]
    array([  5.21079651e-01,   1.41085893e-06,   2.55158446e-03,
             5.88421331e-04,   4.17571505e-06,   9.78104112e-03,
             1.14105667e-03,   7.93964896e-04,   7.85177346e-03,
             1.92635026e-03,   5.21080173e-07,   4.04680406e-04,
             2.68261102e-04,   4.60332012e-04,   2.01803955e-03,
             6.73533276e-03,   1.38491129e-03,   1.05682475e-02,
             1.79368409e-02,   3.86488757e-03,   4.46729289e-04,
             8.82488825e-05,   2.09428702e-03,   4.12810745e-02,
             1.81651561e-03,   6.43641626e-03,   1.39687081e-03,
             1.71262909e-03,   2.95181902e-04,   2.73045908e-03,
             4.77474778e-02,   7.56948497e-03,   4.22549636e-03,
             3.78891036e-03,   4.64685435e-03,   6.18710017e-03,
             2.40424583e-02,   7.78131179e-03,   8.14288762e-03,
             1.05162547e-02,   1.83166124e-02,   3.92332202e-03,
             9.83870257e-03,   1.16684231e-02,   2.02723299e-02,
             3.38977762e-03,   2.69966332e-02,   3.43221675e-02,
             2.78571022e-02,   7.11067964e-02])

The label data was formatted using MultiLabelBinarizer and looks like:

    >>> y_train.shape
    (99892L, 21L)
    >>> y_train[3,:]
    array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
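For completeness, the binarization step looks roughly like this (the raw label lists here are made up for illustration):

    from sklearn.preprocessing import MultiLabelBinarizer

    # hypothetical raw labels: each document carries at least one topic id
    raw_labels = [[1, 7], [0], [3, 5, 20], [1]]

    mlb = MultiLabelBinarizer(classes=range(21))
    y = mlb.fit_transform(raw_labels)   # shape (n_documents, 21), 0/1 entries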

Update 2

The output of predict_proba above suggested that the assigning of no classes might be an artifact of the trees voting on labels (there are 20 trees, and the probabilities are approximately multiples of 0.05). However, using a single decision tree, I still find there are samples that are assigned no labels. The output looks similar to the predict_proba output above, in that for each sample there is a probability that a given label is or is not assigned to that sample. This seems to suggest that at some point the decision tree is turning the problem into binary classification, though the documentation says the tree takes advantage of label correlations.
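The single-tree check was along these lines (a minimal sketch, using the same x/y arrays as above):

    from sklearn.tree import DecisionTreeClassifier

    # a single tree, so no ensemble voting is involved
    tree = DecisionTreeClassifier().fit(x_train, y_train)
    tree_preds = tree.predict(x_test)

    # rows of all zeros still appear, so the effect is not just vote averaging
    print('empty rows:', (tree_preds.sum(axis=1) == 0).sum())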

Answer

This can happen if the train and test data are scaled differently, or are otherwise drawn from different distributions (e.g., if the tree learned to split on values that occur in train but don't occur in test).
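A quick way to sanity-check for such a mismatch (just comparing per-feature summary statistics; a sketch, not a rigorous test):

    import numpy as np

    # large gaps between per-feature statistics of train and test
    # hint at a scaling mismatch or distribution shift
    print('max mean gap:', np.abs(x_train.mean(axis=0) - x_test.mean(axis=0)).max())
    print('max std gap: ', np.abs(x_train.std(axis=0) - x_test.std(axis=0)).max())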

You could inspect the trees to try to get a better understanding of what's happening. To do this, look at the DecisionTreeClassifier instances in clf.estimators_ and visualize their .tree_ properties (for example, using sklearn.tree.export_graphviz()).
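For example, something along these lines (writing one .dot file per tree; the filename pattern is just for illustration):

    from sklearn.tree import export_graphviz

    # dump each tree in the forest to a Graphviz .dot file for inspection,
    # e.g. render afterwards with: dot -Tpng tree_0.dot -o tree_0.png
    for i, estimator in enumerate(clf.estimators_):
        export_graphviz(estimator, out_file='tree_%d.dot' % i)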

