Table of contents
- What is Gradient Boosting?
- What is XGBoost?
- XGBoost in R
- Pros and Cons?
What is Gradient Boosting?
Before diving deep into XGBoost, let us first understand Gradient Boosting. Boosting trains a sequence of weak learners (predictors with not-so-great accuracy), each on the dataset or on a random sample of it. There is one catch, though: we give more weight to the data points that were misclassified by the previous learners. This may not make much sense yet, so let us clear it up with the help of an example and a visualization.
Example: Suppose that you are given a binary classification problem. Let us also assume that we are using decision tree stumps (small decision trees) as our weak learners. A weight is assigned to each data point (all equal initially), and the error is the sum of the weights of the misclassified examples.
Step 1: Here, we have a weak predictor \(h_1\) which classifies the data points.
Step 2: Now, we increase the weights of the points misclassified by \(h_1\) and learn another predictor \(h_2\) on the data points (this can also be a sample of the data points). In other words, \(h_2\) is trained to fit the residual \(y - h_1\), the part of the target that \(h_1\) gets wrong.
Step 3: The current hypothesis is \(H = h_1 + h_2\). Now, we learn another predictor \(h_3\) on the data points and try to fit the residual \(y - (h_1 + h_2)\).
We perform these steps up to, say, \(m\) times, each time giving boosted importance to the misclassified points and trying to reduce the residual. Our final hypothesis is a weighted sum of these weak learners: \(H = w_1 h_1 + w_2 h_2 + \dots + w_m h_m\)
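The residual-fitting loop described in the steps above can be sketched in a few lines of base R. This is only a toy illustration under stated assumptions, not how XGBoost is implemented: it uses squared-error regression on a single feature instead of the classification example, and the names `fit_stump`, `predict_stump`, `boost`, as well as the values of `m` and `eta`, are hypothetical choices made for this sketch.

```r
# Toy boosting with depth-1 stumps and squared loss (illustrative only).
# fit_stump, predict_stump and boost are hypothetical helper names.

fit_stump <- function(x, r) {
  # Candidate thresholds: midpoints between consecutive sorted unique x values
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  best <- list(sse = Inf)
  for (thr in thresholds) {
    left <- mean(r[x <= thr])    # prediction on the left side of the split
    right <- mean(r[x > thr])    # prediction on the right side of the split
    sse <- sum((r - ifelse(x <= thr, left, right))^2)
    if (sse < best$sse) {
      best <- list(thr = thr, left = left, right = right, sse = sse)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$thr, stump$left, stump$right)
}

boost <- function(x, y, m = 20, eta = 0.5) {
  H <- rep(mean(y), length(y))      # initial hypothesis: a constant prediction
  mse <- numeric(m)
  for (i in seq_len(m)) {
    residual <- y - H               # what the current hypothesis still gets wrong
    h <- fit_stump(x, residual)     # weak learner fitted to the residual
    H <- H + eta * predict_stump(h, x)
    mse[i] <- mean((y - H)^2)       # training error after i weak learners
  }
  list(fitted = H, mse = mse)
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.1)
fit <- boost(x, y)
plot(fit$mse, type = "l", xlab = "boosting round", ylab = "training MSE")
```

Each round only has to correct what the previous rounds left over, which is why the training error curve keeps falling as more stumps are added.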
Visualization
Below, three weak predictors combine to become a strong one. Also notice that each subsequent predictor tries to classify correctly what the previous predictor misclassified.
Ques: How do I set m?
Well, \(m\) is a hyperparameter and is tuned using cross-validation.
What kind of predictors can ‘h’ be?
There are mathematical restrictions (differentiability requirements), which in gradient boosting apply to the loss function being minimized. Generally, decision tree stumps are used as the predictors.
I highly recommend reading Abhishek Ghose’s answer for a mathematical yet intuitive explanation, and this for a visualization of the process.
What is XGBoost?
XGBoost stands for Extreme Gradient Boosting. It is a library for building fast, high-performance gradient-boosted tree models for supervised learning. Parallel computation behind the scenes is a large part of what makes it fast. It has become very popular in recent years thanks to its versatility, scalability and efficiency. Like any other ML algorithm, it has its pros and cons, but the pros easily outweigh the cons, provided we understand the algorithm well and have an intuition for proper parameter tuning. Its huge popularity on Kaggle supports this.
How to use XGBoost?
There are library implementations of XGBoost in all major data analysis languages. We just have to train the model and tune its parameters. Below is the series of steps to follow:
- Load your dataset.
- Prepare your data to contain only numeric features (yes, XGBoost works only with numeric features).
- Split the data into train and test sets.
- Train the model and tune the parameters.
- Use your model to predict on the test data.
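The steps above could look roughly like this in R. It is a minimal sketch under assumptions that are not part of the original post: it uses the built-in `mtcars` data with `am` as a binary label, treats `cyl` as a categorical feature to show the numeric conversion, and picks a 70/30 split and parameter values purely for illustration.

```r
library(xgboost)

# 1. Load a dataset (mtcars ships with R; choosing it is an assumption of this sketch)
df <- mtcars
df$cyl <- factor(df$cyl)   # pretend cyl is a categorical feature

# 2. Make every feature numeric: model.matrix one-hot encodes the factor column
X <- model.matrix(am ~ . - 1, data = df)
y <- df$am                 # binary label (0/1)

# 3. Split into train and test sets (70/30 is an arbitrary choice)
set.seed(42)
idx <- sample(nrow(X), size = floor(0.7 * nrow(X)))
dtrain <- xgb.DMatrix(X[idx, ], label = y[idx])
dtest <- xgb.DMatrix(X[-idx, ], label = y[-idx])

# 4. Train the model (parameter values are illustrative, not tuned)
params <- list(objective = "binary:logistic", eta = 0.3, max_depth = 2)
model <- xgb.train(params = params, data = dtrain, nrounds = 20)

# 5. Predict on the held-out test data (probabilities of class 1)
prob <- predict(model, dtest)
```

With so few rows this is only meant to show the shape of the workflow; proper tuning would use cross-validation.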
Major hyperparameters in XGBoost
- silent [default = 0] : 0 prints running messages, 1 enables silent mode.
- booster [default = gbtree] : Booster to use: gbtree (tree-based) or gblinear (linear).
- eta [default = 0.3] : Controls the learning rate. It scales the contribution of each tree by a factor of eta when it is added to the current approximation. It is used to prevent overfitting. A lower eta makes the model more robust to overfitting, but nrounds should then be higher.
- nrounds : Number of iterations for boosting (m, as explained in the Gradient Boosting part above).
- max_depth [default = 6] : Maximum depth of a tree.
- gamma : Minimum loss reduction required to make a further partition on a leaf node of the tree. Larger values make the algorithm more conservative.
- min_child_weight [default = 1] : Minimum sum of instance weights needed in a child. If the tree partition step results in a leaf node whose sum of instance weights is less than min_child_weight, the building process gives up further partitioning. The larger the value, the more conservative the algorithm will be.
- subsample [default = 1] : Fraction of the data to be used for growing trees. It shortens computation and makes the algorithm less prone to overfitting. It is advisable to tune this together with eta and nrounds.
- colsample_bytree [default = 1] : Subsample ratio of columns when constructing each tree.
- objective : Specifies the learning task and the corresponding learning objective. A self-defined function can be passed. Commonly used values are:
  - reg:linear : Linear regression (default; called reg:squarederror in newer XGBoost releases).
  - reg:logistic : Logistic regression.
  - binary:logistic : Logistic regression for binary classification. Outputs a probability.
  - binary:logitraw : Logistic regression for binary classification. Outputs the score before the logistic transformation.
  - multi:softmax : Multiclass classification using the softmax objective. Each class is represented by a number from 0 to num_class - 1.
- base_score : The initial prediction score of all instances.
- eval_metric : Evaluation metrics for validation data. A self-defined function can be passed. By default, the metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking).
Note: This is not an exhaustive list, but it covers the major parameters.
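To make the role of eta concrete, the update it controls can be written out as a short equation. This is a sketch in the notation of the Gradient Boosting section above; \(H_k\) (the hypothesis after \(k\) trees) and \(H_0\) (the initial prediction, cf. base_score) are labels introduced here for illustration.

\[ H_k = H_{k-1} + \eta\, h_k, \quad k = 1, \dots, m \qquad\Longrightarrow\qquad H_m = H_0 + \eta \sum_{k=1}^{m} h_k \]

With the default \(\eta = 0.3\), each new tree contributes only 30% of its fitted correction, which is why a smaller eta generally needs more boosting rounds to reach the same fit.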
XGBoost in R
# Install xgboost from CRAN using install.packages("xgboost")
# Use the xgboost library for this post.
require(xgboost)
## Loading required package: xgboost
## Warning: package 'xgboost' was built under R version 3.3.2
# For plotting
require(Ckmeans.1d.dp)
## Loading required package: Ckmeans.1d.dp
## Warning: package 'Ckmeans.1d.dp' was built under R version 3.3.2
# Needed by xgb.plot.tree
require(DiagrammeR)
## Loading required package: DiagrammeR
## Warning: package 'DiagrammeR' was built under R version 3.3.2
# We will use the agaricus dataset that ships with the xgboost package.
# Agaricus comes with separate train and test datasets,
# but usually you need to split the data into training and test sets yourself.
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test
# Setting hyperparameters
param <- list("objective" = "binary:logistic",
              "eval_metric" = "logloss",
              "eta" = 1,         # Can play around with this
              "max_depth" = 2    # Dataset is not that complex
)
# xgb.cv is used for cross-validation and for finding the best value of a hyperparameter.
# Here, we will find out the best value for nrounds.
best_cv <- xgb.cv(params = param,
                  data = as.matrix(train$data),
                  label = train$label,
                  nfold = 10,
                  nrounds = 100,
                  verbose = FALSE)
# Use nrounds where the error curve has its minimum (50 in this case)
plot(log(best_cv$evaluation_log$test_logloss_mean), type = 'l')
# Use nrounds = 50
final_model <- xgboost(data = as.matrix(train$data),
                       label = train$label,
                       params = param,
                       nrounds = 50,
                       verbose = FALSE)
# Predict with this model
pred <- predict(final_model, test$data)
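Because the model is trained with binary:logistic, the predictions are probabilities rather than class labels. A minimal sketch of turning them into labels and measuring the error follows; the probabilities and labels here are made-up values so that the snippet runs on its own, and the 0.5 cutoff is an assumption. With the model above, `pred` and `test$label` would take their place.

```r
# Hypothetical predicted probabilities and true labels (stand-ins for pred and test$label)
prob <- c(0.92, 0.08, 0.61, 0.35)
label <- c(1, 0, 0, 0)

# Convert probabilities to 0/1 class labels using a 0.5 cutoff (an assumed threshold)
pred_class <- as.numeric(prob > 0.5)

# Fraction of misclassified examples
error_rate <- mean(pred_class != label)
print(error_rate)

# Confusion table of predicted versus actual classes
print(table(predicted = pred_class, actual = label))
```

Only the third example is misclassified here, so the error rate is 0.25.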
# Print the importance matrix (importance of variables in classification).
# Get the feature names (the column names of the training matrix)
feature_name <- dimnames(train$data)[[2]]
importance_matrix <- xgb.importance(feature_name, model = final_model)
xgb.plot.importance(importance_matrix[1:10,])
# You can also plot the decision tree
# xgb.plot.tree(feature_names = feature_name, model = final_model, n_first_tree = 1)
Pros and Cons?
Pros
- Extremely fast (parallel computation).
- Highly efficient.
- Versatile (can be used for classification, regression or ranking).
- Can be used to extract variable importance.
- Does not require feature engineering such as missing-value imputation, scaling or normalization.
- Winning solution for many Kaggle competitions.
- Heavily used in industry due to its scalability.
Cons
- Only works with numeric features.
- Leads to overfitting if hyperparameters are not tuned properly.