Feature names mismatch with XGBoost model

aidandmorrison opened this issue 5 years ago

I get an error when calling explain() on an explainer built with an XGB model:

Error in predict.xgb.Booster(x, newdata = newdata, reshape = TRUE, ...) :
Feature names stored in object and newdata are different!

I thought I might need to convert the data to a matrix, but that didn't help. The problem seems to be that the (Intercept) column appears in the model but not in the explainer, though perhaps I did something clumsy when reading in the data. I haven't found a quick fix on Stack Overflow so far. Is this a bug? I'd be very grateful for a way past this so that I can use lime on my XGBoost models. Thanks!

MWE:

library(pacman)
p_load(tidyverse)
p_load(xgboost)
p_load(Matrix)
p_load(lime)

### Prepare data with partition
df <- mtcars %>% rownames_to_column()
length <- df %>% nrow()
df_train <- df %>% select(-rowname) %>% head((length-10))
df_test <- df %>% select(-rowname) %>% tail(10)

### Transform data into matrix objects for XGboost
train <- list(sparse.model.matrix(~., data = df_train %>% select(-vs)), (df_train$vs %>% as.factor()))
names(train) <- c("data", "label")
test <- list(sparse.model.matrix(~., data = df_test %>% select(-vs)), (df_test$vs %>% as.factor()))
names(test) <- c("data", "label")
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)


### Train model
watchlist <- list(train=dtrain, test=dtest)
mod_xgb_tree <- xgb.train(data = dtrain,  booster = "gbtree", eta = .1, nrounds = 15, watchlist = watchlist)

### Check prediction works
output <- predict(mod_xgb_tree, test$data) %>% tibble()

### attempt lime explanation
explainer <- df_train %>% select(-vs) %>% lime(model = mod_xgb_tree)  ### works, no error or warning
explanation <- df_test %>% select(-vs) %>% explain(explainer, n_features = 4) ### error, Features stored names in `object` and `newdata` are different!

names_test <- test$data@Dimnames[[2]]  ### 10 names
names_mod <- mod_xgb_tree$feature_names ### 11 names
names_explainer <- explainer$feature_type %>% enframe() %>% pull(name) ### 11 names


### see whether pre-processing helps
my_preprocess <- function(df){
  data <- df %>% select(-vs)
  label <- df$vs
  
  test <<- list(sparse.model.matrix( ~ ., data = data), label)
  names(test) <<- c("data", "label")
  
  dtest <- xgb.DMatrix(data = test$data, label=test$label)
  dtest
}

explanation <- df_test %>% explain(explainer, preprocess = my_preprocess(), n_features = 4) ### Error in feature_distribution[[i]] : subscript out of bounds

### check that the preprocessing is working ok
dtest_check <- df_test %>% my_preprocess()
output_check <- predict(mod_xgb_tree, dtest_check)

jessiehsieh · Answer 1 · Fri May 31 2019 20:35:02 GMT+0800 (China Standard Time)

Hi, I have the same issue using lime in Python to explain an XGBoost model. I trained an XGBClassifier on a DataFrame with feature names. XGB_model.predict_proba() works as expected with x_train and x_test.

But when I try to explain the model with lime using the following code, it tells me "training data did not have the following fields: f131, f110, f4, f47.....".

explainer = lime.lime_tabular.LimeTabularExplainer(x_train,\
                                                   mode = 'classification', \
                                                   feature_names=x_train.columns, \
                                                   class_names='Chance of default', \
                                                   discretize_continuous=False)      # this line works


explainer.explain_instance(x_train,XGB_model.predict_proba,top_labels=3)    # this line throws the error

@aidandmorrison have you got your issue fixed?

Thomas Lin Pedersen · Answer 2 · Wed Jun 12 2019 02:50:59 GMT+0800 (China Standard Time)

You are removing vs in your lime calls, so yes, the names will differ. You should pass in the same data that was used to train the model.

@jessiehsieh It is very unlikely that the Python and R versions share the same underlying bug, as the two implementations are completely separate.

Aidan Morrison · Answer 3 · Thu Jun 13 2019 08:30:43 GMT+0800 (China Standard Time)

@thomasp85 Thanks for your response! I'm still confused, however: I thought I had carefully removed vs from the training data as well. I also tried not removing it from the lime call, as you suggested, and still got the same error. Have I missed something else really obvious?

@jessiehsieh Sorry I didn't get to this earlier. I still have this problem, but I agree it seems unlikely that the same error is in both implementations.

Thomas Lin Pedersen · Answer 4 · Thu Jun 13 2019 17:56:04 GMT+0800 (China Standard Time)

So, I've found a bug in using preprocessors with data.frame explanations, which is fixed now. This was not the cause of your problem, though (but you'll need the fix to get it to work).

Compare this working code with yours:

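### Packages used below (loaded via pacman in the original MWE)
library(tidyverse)
library(xgboost)
library(Matrix)
library(lime)

df <- mtcars %>% rownames_to_column()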
length <- df %>% nrow()
df_train <- df %>% select(-rowname) %>% head((length-10))
df_test <- df %>% select(-rowname) %>% tail(10)

### Transform data into matrix objects for XGboost
train <- list(sparse.model.matrix(~., data = df_train %>% select(-vs)), (df_train$vs %>% as.factor()))
names(train) <- c("data", "label")
test <- list(sparse.model.matrix(~., data = df_test %>% select(-vs)), (df_test$vs %>% as.factor()))
names(test) <- c("data", "label")
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)


### Train model
watchlist <- list(train=dtrain, test=dtest)
mod_xgb_tree <- xgb.train(data = dtrain,  booster = "gbtree", eta = .1, nrounds = 15, watchlist = watchlist)

my_preprocess <- function(df){
  xgb.DMatrix(sparse.model.matrix( ~ ., data = df))
}
explainer <- df_train %>% select(-vs) %>% lime(model = mod_xgb_tree, preprocess = my_preprocess)
explanation <- df_test %>% select(-vs) %>% explain(explainer, n_features = 4)

The preprocessor is passed to lime(), not explain().
The same data format must be passed to both lime() and explain().
my_preprocess() doesn't have access to vs and doesn't need it; it just needs to convert the data.frame into an xgb.DMatrix.
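
To see why the raw data.frame and the trained model disagree on feature names, it can help to compare the column names directly. The following is a minimal sketch that needs only the Matrix package, not xgboost or lime. The helper to_design() is a hypothetical stand-in for the preprocessing step without the xgb.DMatrix() wrapper; it is not part of lime or of the code above.

```r
library(Matrix)

# Same split as in the thread, with vs already dropped
df <- mtcars[, setdiff(names(mtcars), "vs")]
df_train <- head(df, nrow(df) - 10)
df_test  <- tail(df, 10)

# Hypothetical helper: the design-matrix step of the preprocessor,
# without wrapping the result in xgb.DMatrix()
to_design <- function(d) sparse.model.matrix(~ ., data = d)

# Column names the model is trained on
train_names <- colnames(to_design(df_train))

# Columns present in the design matrix but absent from the raw data.frame
missing_in_raw <- setdiff(train_names, names(df_train))
print(missing_in_raw)

# After the same preprocessing, test columns match the training columns
same_after_preprocess <- identical(colnames(to_design(df_test)), train_names)
print(same_after_preprocess)
```

The formula ~ . adds an (Intercept) column that the data.frame does not have, so a model trained on the design matrix expects one more feature than the raw data provides. Passing the conversion as preprocess to lime() means explain() applies the same step to new data before it reaches the model.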