dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Home Page: https://xgboost.readthedocs.io/en/stable/

Can't reproduce default MSE loss function

marcohkm opened this issue · comments

Hello

I aim to train two XGBoost models:

1. One using the built-in `reg:squarederror` (MSE) loss function.
2. Another using a custom loss function designed to mimic MSE.

Despite my custom loss function being theoretically identical to MSE, the predictions from the two models differ. Here is the code illustrating the issue:

```python
import random

import numpy as np
import xgboost as xgb

# Fixing seeds for reproducibility
np.random.seed(42)
random.seed(42)


def custom_loss_function(preds, dtrain):
    labels = dtrain.get_label()
    errors = preds - labels
    grad = 2 * errors
    hess = np.ones_like(grad) * 2
    return grad, hess


def xgboost_model_with_custom_loss(params, n, num_boost_round):
    # select_features, df_train and df_test come from my own data pipeline
    features = select_features.sort_values(by='Rank_XGBoost', ascending=True)['Feature'].values.tolist()[:n]

    X_train, X_test = df_train[features], df_test[features]
    y_train, y_test = df_train['Y'], df_test['Y']

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    params['seed'] = 42

    # Training with custom loss function
    model1 = xgb.train(params, dtrain, num_boost_round=num_boost_round, obj=custom_loss_function)
    y_pred1 = model1.predict(dtest)

    # Training with default MSE loss function
    params['objective'] = 'reg:squarederror'
    model2 = xgb.train(params, dtrain, num_boost_round=num_boost_round)
    y_pred2 = model2.predict(dtest)

    # Comparing predictions
    difference = y_pred2 - y_pred1
    print(f"Difference: {difference[:10]}")

    return y_pred2, y_pred1
```

Example usage:

```python
params = {
    'max_depth': 4,
    'eta': 0.01,
    'min_child_weight': 9,
    'subsample': 0.5,
    'alpha': 100,
    'colsample_bytree': 0.3,
}

n = 10
num_boost_round = 100

y_pred2, y_pred1 = xgboost_model_with_custom_loss(params, n, num_boost_round)
```

The returned predictions:

```
(array([-0.00540604,  0.00381444, -0.0029127 , ..., -0.00581155,
         0.00467715,  0.0053932 ], dtype=float32),
 array([1.3519797, 1.3480775, 1.3533579, ..., 1.3520167, 1.3488159,
        1.3469104], dtype=float32))
```

How can I fix this?

Thanks in advance