Introduction to XGBoost
In the arsenal of machine learning algorithms, XGBoost is the analogue of a nuclear weapon. It has recently become very popular in the data science community, largely because of its heavy use in winning Kaggle solutions.
This post aims to give an informal introduction to XGBoost and its implementation in R.
Prerequisites: Decision trees and a beginner’s understanding of R.
Table of contents
 What is Gradient Boosting?
 What is XGBoost?
 XGBoost in R.
 Pros and Cons?
 Applications
What is Gradient Boosting?
Before diving deep into XGBoost, let us first understand Gradient Boosting.
Boosting trains a sequence of weak learners (predictors with modest accuracy), each on the dataset or on a random sample of it. There is one catch, though: we give more weight to the data points that were misclassified by the previous learners.
I know this may make little sense right now. Let us clear it up with the help of an example and a visualization.
Example:
Suppose that you are given a binary classification problem. Let us also assume that we are using decision tree stumps (small decision trees) as our weak learners. A weight is assigned to each data point (all equal initially), and the error is the sum of the weights of the misclassified examples.
Step:1
Here, we will have a weak predictor \(h_1\) which classifies the data points.
Step:2
Now, we will increase the weights of the points misclassified by \(h_1\) and learn another predictor \(h_2\) on the data points (this can also be a sample of the data points). In other words, we are trying to minimize the \(residual\) = \(y - h_1\).
Step:3
Current hypothesis: \(H\) = \(h_1\) + \(h_2\)
Now, we will learn another predictor \(h_3\) on the data points and try to minimize the \(residual\) = \(y - (h_1 + h_2)\).
We will perform these steps up to, say, \(m\) times, each time giving boosted importance to the misclassified points and trying to reduce the residual.
Our final hypothesis will be a weighted sum of these weak learners: \(H\) = \(w_1\)\(h_1\) + \(w_2\)\(h_2\) + … + \(w_m\)\(h_m\)
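The residual-fitting loop described in the steps above can be sketched in a few lines of base R. This is only a minimal illustration under simplifying assumptions: it uses squared-error regression on one numeric feature, every weak learner gets the same weight (a shrinkage factor called eta here), and the helpers fit_stump and predict_stump as well as the toy data are made-up names, not part of any library.

```r
# Minimal gradient boosting for squared loss with decision stumps (base R only).
# fit_stump, predict_stump and the toy data are hypothetical, for illustration.

# Fit a one-split regression stump to targets r on a single numeric feature x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct x values
  best <- list(sse = Inf)
  for (cut in cuts) {
    left <- x <= cut
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - pred)^2)
    if (sse < best$sse) {
      best <- list(sse = sse, cut = cut,
                   left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

set.seed(1)
x <- runif(200, 0, 6)
y <- sin(x) + rnorm(200, sd = 0.1)

m <- 50                          # number of boosting rounds
eta <- 0.3                       # weight given to every weak learner
H <- rep(mean(y), length(y))     # initial hypothesis: a constant
train_mse <- numeric(m)

for (i in seq_len(m)) {
  residual <- y - H                     # what the current hypothesis still gets wrong
  h <- fit_stump(x, residual)           # weak learner fitted to the residual
  H <- H + eta * predict_stump(h, x)    # add its weighted contribution to H
  train_mse[i] <- mean((y - H)^2)
}

print(round(train_mse[c(1, 10, 50)], 4))
```

Each round adds one stump to the running hypothesis, so the training error can only go down as m grows; whether the test error also goes down is a separate question.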
Visualization
Below, there are three weak predictors and they are combining to become a strong one. Also notice that each subsequent predictor tries to classify correctly what the last predictor misclassified.
Ques: How do I set m?
Well, \(m\) is a hyperparameter and is tuned using cross-validation.
What kind of predictors can ‘h’ be?
There are mathematical restrictions (differentiability requirements, which in gradient boosting apply to the loss function being minimized). Generally, decision tree stumps are used.
I highly recommend reading Abhishek Ghose’s answer for a mathematical-cum-intuitive explanation, and this for a visualization of the process.
What is XGBoost?
XGBoost stands for Extreme Gradient Boosting. It is a supervised learning algorithm, and also a library for developing fast, high-performance gradient-boosted tree models. Parallel computation behind the scenes is what makes it this fast. It has been very popular in recent years due to its versatility, scalability and efficiency.
Like any other ML algorithm, this too has its own pros and cons. But fortunately pros easily outweigh cons given we have an astute understanding of the algorithm and an intuition for proper parameter tuning. This has been proven by its huge popularity on Kaggle.
How to use XGBoost?
There are library implementations of XGBoost in all major data analysis languages. We just have to train the model and tune its parameters. Below is the series of steps to follow:
 Load your dataset.
 Prepare your data to contain only numeric features (yes, XGBoost works only with numeric features).
 Split the data into train and test sets.
 Train the model and tune the parameters.
 Evaluate your model on the test data.
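Steps 2 and 3 above can be sketched with base R alone. The data frame below is a made-up toy example (the column names and values are assumptions for illustration); it shows one common way of turning a categorical column into numeric indicator columns with model.matrix and then making a random train/test split.

```r
# Toy data frame with one categorical column; all names and values are made up.
df <- data.frame(
  cap_color = factor(c("red", "brown", "red", "white",
                       "brown", "white", "red", "brown")),
  stalk_len = c(5.1, 7.3, 4.8, 6.0, 7.9, 5.5, 4.9, 8.1),
  label     = c(1, 0, 1, 0, 0, 0, 1, 0)
)

# Step 2: one-hot encode the factor so that every feature is numeric.
# "- 1" drops the intercept, so each level of the factor gets its own column.
X <- model.matrix(~ . - 1, data = df[, c("cap_color", "stalk_len")])
y <- df$label

# Step 3: random 75/25 split into train and test sets.
set.seed(42)
train_idx <- sample(nrow(X), size = floor(0.75 * nrow(X)))
X_train <- X[train_idx, , drop = FALSE]
y_train <- y[train_idx]
X_test  <- X[-train_idx, , drop = FALSE]
y_test  <- y[-train_idx]

print(colnames(X))
print(dim(X_train))
print(dim(X_test))
```

The resulting numeric matrices are the kind of input the training functions shown later in this post expect.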
A burning question here is “Which are the major hyperparameters in case of XGBoost?”
Major hyperparameters in XGBoost
General parameters:
 silent [default = 0] : 0 for printing running messages, 1 for silent mode.
 booster [default = gbtree] : Booster to use: gbtree (tree based) or gblinear (linear function).
Booster parameters:
 eta [default = 0.3] : Controls the learning rate. It scales the contribution of each tree by a factor of eta when it is added to the current approximation, which helps prevent overfitting. A lower eta makes the model more robust to overfitting, but nrounds should then be higher.
 nrounds : Number of iterations for boosting (m as explained in the Gradient Boosting part above).
 max_depth [default = 6] : Maximum depth of a tree.
 gamma [default = 0] : Minimum loss reduction required to make a further partition on a leaf node of the tree. Larger values make the algorithm more conservative.
 min_child_weight [default = 1] : Minimum sum of instance weight needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. The larger the value, the more conservative the algorithm.
 subsample [default = 1] : Fraction of the data used for growing each tree. Values below 1 shorten computation and make the algorithm less prone to overfitting. It is advisable to tune this together with nrounds and eta.
 colsample_bytree [default = 1] : Subsample ratio of columns when constructing each tree.
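In R, the booster parameters above are typically collected in a named list that is later handed to the training function. The sketch below only builds such a list; the specific values are illustrative assumptions, not tuned recommendations.

```r
# Illustrative starting values only; these numbers are assumptions, not advice.
params <- list(
  booster          = "gbtree",
  eta              = 0.1,   # smaller than the default 0.3, so more rounds are needed
  max_depth        = 6,
  gamma            = 0,
  min_child_weight = 1,
  subsample        = 0.8,   # each tree sees 80% of the rows
  colsample_bytree = 0.8    # each tree sees 80% of the columns
)
nrounds <- 200              # passed separately from the list in the R interface

str(params)
```

Keeping nrounds outside the list mirrors how the code later in this post passes it as its own argument.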
Task parameters:
 objective : Specifies the learning task and the corresponding learning objective. A self-defined function can also be passed. Commonly used values are:
  reg:linear : Linear regression (default).
  reg:logistic : Logistic regression.
  binary:logistic : Logistic regression for binary classification. Outputs probability.
  binary:logitraw : Logistic regression for binary classification. Outputs the score before the logistic transformation.
  multi:softmax : Multiclass classification using the softmax objective. Each class is represented by a number, which should be from 0 to num_class - 1.
 base_score [default = 0.5] : The initial prediction score of all instances.
 eval_metric : Evaluation metric for validation data. A self-defined function can also be passed. Defaults: rmse for regression, error for classification, mean average precision for ranking.
Note: This is not an exhaustive list, but it covers all the major ones.
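To make the difference between binary:logitraw and binary:logistic concrete, here is a small base R sketch. The raw scores and labels are made-up numbers, and the 0.5 threshold is the usual convention for the error metric; the formulas for error and logloss are the standard textbook ones rather than a copy of the library's internals.

```r
# Raw scores of the kind binary:logitraw would output (made-up values).
raw_score <- c(-2.0, -0.3, 0.4, 3.1)
label     <- c(0, 1, 1, 1)

# binary:logistic reports the logistic (sigmoid) transformation of the raw score.
prob <- plogis(raw_score)            # same as 1 / (1 + exp(-raw_score))

# error: fraction of wrong predictions when thresholding the probability at 0.5.
pred_class <- as.numeric(prob > 0.5)
error <- mean(pred_class != label)

# logloss: negative mean log-likelihood of the true labels.
logloss <- -mean(label * log(prob) + (1 - label) * log(1 - prob))

print(round(prob, 3))
print(error)
print(round(logloss, 3))
```

Here the second point has a negative raw score but a positive label, so one prediction out of four is wrong and the error is 0.25.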
XGBoost in R
    # Install xgboost from CRAN using install.packages("xgboost")
    # Use the xgboost library for this post.
    require(xgboost)
    ## Loading required package: xgboost
    ## Warning: package 'xgboost' was built under R version 3.3.2

    # For plotting
    require(Ckmeans.1d.dp)
    ## Loading required package: Ckmeans.1d.dp
    ## Warning: package 'Ckmeans.1d.dp' was built under R version 3.3.2

    require(DiagrammeR)
    ## Loading required package: DiagrammeR
    ## Warning: package 'DiagrammeR' was built under R version 3.3.2
    # We will use the agaricus dataset that ships with the xgboost package.
    # Agaricus comes with separate train and test datasets,
    # but usually you need to split the data into training and test sets yourself.
    data(agaricus.train, package = 'xgboost')
    data(agaricus.test, package = 'xgboost')
    train <- agaricus.train
    test <- agaricus.test
    # Setting hyperparameters
    param <- list("objective" = "binary:logistic",
                  "eval_metric" = "logloss",
                  "eta" = 1,         # Can play around with this
                  "max_depth" = 2    # Dataset is not that complex
    )

    # xgb.cv is used for cross-validation and finding the best value of a hyperparameter.
    # Here, we will find the best value for nrounds.
    # xgb.DMatrix wraps the sparse feature matrix and the labels together.
    best_cv <- xgb.cv(params = param,
                      data = xgb.DMatrix(train$data, label = train$label),
                      nfold = 10,
                      nrounds = 100,
                      verbose = FALSE)

    # Use nrounds where the error curve has its minimum (50 in this case)
    plot(log(best_cv$evaluation_log$test_logloss_mean), type = 'l')
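Instead of reading the minimum off the plot, the round with the lowest cross-validated loss can also be picked programmatically. The sketch below is self-contained and assumes the xgboost package is installed; it uses fewer folds and rounds than the code above purely to keep it quick, and the variable names (dtrain, cv, best_nrounds) are my own.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

params <- list(objective = "binary:logistic", eval_metric = "logloss",
               eta = 1, max_depth = 2)

set.seed(123)   # the folds are drawn at random
cv <- xgb.cv(params = params, data = dtrain, nrounds = 30, nfold = 5,
             verbose = FALSE)

# Row of the evaluation log (one row per boosting round) with the lowest
# mean test logloss across the folds.
best_nrounds <- which.min(cv$evaluation_log$test_logloss_mean)
print(best_nrounds)
```

Because the folds are random, the selected round can change from run to run unless the seed is fixed.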
    # Use nrounds = 50
    # xgb.train takes the data as an xgb.DMatrix (features and labels together).
    final_model <- xgb.train(params = param,
                             data = xgb.DMatrix(train$data, label = train$label),
                             nrounds = 50,
                             verbose = 0)

    # Predict with this model
    pred <- predict(final_model, test$data)
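With binary:logistic the predictions are probabilities, so they have to be turned into class labels before they can be compared with the true labels. The following self-contained sketch assumes the xgboost package is installed and uses a 0.5 threshold, which is a common convention rather than something fixed by the model; the small number of rounds and the variable names are choices made here for illustration.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

model <- xgb.train(
  params = list(objective = "binary:logistic", eta = 1, max_depth = 2),
  data = dtrain, nrounds = 10, verbose = 0
)

prob <- predict(model, dtest)             # predicted probability of class 1
pred_label <- as.numeric(prob > 0.5)      # threshold at 0.5 (an assumption)
test_error <- mean(pred_label != agaricus.test$label)
print(test_error)
```

The printed value is the fraction of test mushrooms that were misclassified.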
    # Print importance matrix (importance of variables in classification).
    # Get the feature names
    feature_name <- dimnames(train$data)[[2]]
    importance_matrix <- xgb.importance(feature_name, model = final_model)
    xgb.plot.importance(importance_matrix[1:10, ])
    # You can also plot the decision tree
    # xgb.plot.tree(feature_names = feature_name, model = final_model, n_first_tree = 1)
Pros and Cons?
Pros:
 Extremely fast (parallel computation).
 Highly efficient.
 Versatile (Can be used for classification, regression or ranking).
 Can be used to extract variable importance.
 Does not require much preprocessing (missing value imputation, scaling and normalization).
Cons:
 Works only with numeric features.
 Leads to overfitting if hyperparameters are not tuned properly.
Applications
 Winning solution for many Kaggle competitions.
 Heavily used in industry due to its scalability.
Additional resources:
I highly recommend going through the links below for an in-depth understanding of the maths behind this algorithm.