Learning with XGBoost

Kaggle is hosting the Mercedes-Benz Greener Manufacturing competition. The dataset is small and the task is relatively simple, so it makes for a good quick weekend diversion.

As usual, some pre-processing is required before modeling. In this case, the categorical variables need to be one-hot encoded.
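A minimal sketch of that step, assuming the competition data has been loaded from the Kaggle download (the file name and the use of pandas are my assumptions):

```python
import pandas as pd

# Load the competition training data (file name assumed from the Kaggle download).
train = pd.read_csv("train.csv")

# One-hot encode every object-typed (categorical) column;
# get_dummies leaves the numeric columns untouched.
cat_cols = train.select_dtypes(include="object").columns
train = pd.get_dummies(train, columns=list(cat_cols))
```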

For modeling, I am using XGBoost this time.
Tuning its parameters is a bit involved; the official XGBoost documentation has the complete parameter list, and there are tuning guides with example code available online.

For cross-validation, I am using 5-fold.
One advantage of the library is that if you provide a validation set, it incrementally prints the evaluation metric of the current model on that set, which makes it easy to spot over-fitting and under-fitting at runtime.
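A sketch of both ideas, continuing from the one-hot-encoded frame above (the metric and parameter values here are illustrative placeholders, not my exact settings):

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Features/target from the one-hot-encoded frame built above.
y = train["y"]
X = train.drop(columns=["ID", "y"])

# Hold out a validation set so XGBoost can report its metric every round.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dvalid = xgb.DMatrix(X_val, label=y_val)

params = {"objective": "reg:squarederror", "eta": 0.05,
          "max_depth": 4, "eval_metric": "rmse"}

# The watchlist makes train() print train/valid RMSE each round, so a
# widening gap between the two (over-fitting) is visible at runtime.
model = xgb.train(params, dtrain, num_boost_round=300,
                  evals=[(dtrain, "train"), (dvalid, "valid")])

# 5-fold cross-validation over the full data with the same parameters.
dfull = xgb.DMatrix(X, label=y)
cv_results = xgb.cv(params, dfull, num_boost_round=300, nfold=5,
                    verbose_eval=50, seed=0)
```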

In my code, I mainly tune subsample=0.8 and lambda=10 to avoid over-fitting, though a grid search over the parameters would be a more systematic approach.
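A minimal sketch of that configuration, reusing the DMatrix objects from the previous snippet (every value besides subsample and lambda is an illustrative placeholder, not necessarily my final setting):

```python
# Regularised run: subsample and lambda are the two settings discussed
# above; the rest are placeholders carried over from the earlier sketch.
params = {
    "objective": "reg:squarederror",
    "eta": 0.05,
    "max_depth": 4,
    "subsample": 0.8,   # train each tree on 80% of the rows
    "lambda": 10,       # L2 penalty on leaf weights
    "eval_metric": "rmse",
}

model = xgb.train(params, dtrain, num_boost_round=500,
                  evals=[(dtrain, "train"), (dvalid, "valid")],
                  early_stopping_rounds=50)
```

For the grid-search route, the scikit-learn wrapper xgboost.XGBRegressor can be plugged into sklearn.model_selection.GridSearchCV with a grid over subsample and reg_lambda (the wrapper's name for lambda).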

I have also tried a neural net with Keras, but found that deeper networks were generally unstable. My guess is that the dataset is too small for a deep network with that many parameters.
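For reference, a sketch of the kind of Keras model I mean (the layer sizes are illustrative, not my exact architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected regressor. With a few hundred one-hot input
# features, even this modest stack has tens of thousands of weights,
# which is a lot relative to the few thousand training rows.
model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(X.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```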
