
In the arsenal of Machine Learning algorithms, XGBoost is the analogue of a nuclear weapon. It has recently become very popular in the Data Science community, largely because of its heavy use in winning Kaggle solutions.
This post aims to give an informal introduction to XGBoost and its implementation in R.
Pre-requisites: Decision tree and beginner’s understanding of R.

2. What is XGBoost?
3. XGBoost in R.
4. Pros and Cons?
5. Applications

Before diving deep into XGBoost, let us first understand Gradient Boosting.
Boosting means training a sequence of weak learners (predictors with not so great accuracy), each on our dataset or on a random sample of it. There is one catch though: each new learner gives more weight to the data points that were misclassified by the previous learners.
I know this may not make much sense yet. Let us clear it up with the help of an example and a visualization.

Example:

Suppose that you are given a binary classification problem. Let us also assume that we are using decision stumps (one-level decision trees) as our weak learners. Each data point is assigned a weight (all equal initially), and the error is the sum of the weights of the misclassified examples.

##### Step:1

Here, we will have a weak predictor $$h_1$$ which classifies the data points.

##### Step:2

Now, we will increase the weights of the points misclassified by $$h_1$$ and learn another predictor $$h_2$$ on the data points (this can also be a sample of the data points). In other words, we are trying to minimize the $$residual$$ = $$y$$ - $$h_1$$.

##### Step:3

Current Hypothesis: $$H$$ = $$h_1$$ + $$h_2$$. Now, we will learn another predictor $$h_3$$ on the data points and try to minimize the $$residual$$ = $$y$$ - ($$h_1$$ + $$h_2$$).

We will repeat these steps up to, say, $$m$$ times, each time giving boosted importance to the misclassified points and trying to reduce the residual.
Our final hypothesis will be a weighted sum of these weak learners. $$H$$ = $$w_1$$$$h_1$$ + $$w_2$$$$h_2$$ + … + $$w_m$$$$h_m$$
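
To make the steps above concrete, here is a minimal sketch in plain Python (chosen only so that the snippet runs without any packages). It makes several simplifying assumptions: a one-dimensional regression problem, squared-error loss, and all weights $$w_i$$ equal to 1, so that each stump is simply fitted to the current residual. The helper names `fit_stump` and `boost` and the toy data are made up for illustration and are not part of any library.

```python
# Minimal sketch: boosting on residuals with decision stumps (squared-error loss).
# fit_stump and boost are hypothetical helper names, not library functions.

def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature that minimises squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:                      # candidate thresholds
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, m):
    """Fit m stumps, each one on the residual left by the stumps before it."""
    learners = []
    preds = [0.0] * len(xs)                             # H starts as the zero function
    for _ in range(m):
        residuals = [y - p for y, p in zip(ys, preds)]  # y - (h_1 + ... + h_k)
        h = fit_stump(xs, residuals)
        learners.append(h)
        preds = [p + h(x) for p, x in zip(preds, xs)]
    return lambda x: sum(h(x) for h in learners)        # H = h_1 + ... + h_m


# Toy data (assumed for illustration only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]


def mse(model):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


err_1 = mse(boost(xs, ys, 1))
err_3 = mse(boost(xs, ys, 3))
err_10 = mse(boost(xs, ys, 10))
print(err_1, err_3, err_10)
```

On this toy data the training error shrinks as more stumps are added, which is the behaviour the three steps describe.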

Visualization

Below, there are three weak predictors and they are combining to become a strong one. Also notice that each subsequent predictor tries to classify correctly what the last predictor misclassified.

Ques: How do I set m?

Well, $$m$$ is a hyperparameter and is tuned using cross-validation.
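
As a rough illustration of what such tuning could look like, the sketch below runs k-fold cross-validation over a few candidate values of $$m$$ in plain Python. The booster is a toy one (stumps fitted to residuals, squared-error loss), and every name in it (`fit_stump`, `boost`, `cv_error`, `candidates`) as well as the data is an assumption made for this example, not something XGBoost provides.

```python
# Minimal sketch: picking the number of boosting rounds m by k-fold cross-validation.
# All helper names are hypothetical; the learner is a toy residual-fitting booster.

def fit_stump(xs, rs):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, m):
    """Sum of m stumps, each fitted to the residual of the previous ones."""
    learners, preds = [], [0.0] * len(xs)
    for _ in range(m):
        h = fit_stump(xs, [y - p for y, p in zip(ys, preds)])
        learners.append(h)
        preds = [p + h(x) for p, x in zip(preds, xs)]
    return lambda x: sum(h(x) for h in learners)


def cv_error(xs, ys, m, k=4):
    """Mean squared error on held-out folds, averaged over all data points."""
    total = 0.0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = boost([xs[i] for i in train], [ys[i] for i in train], m)
        total += sum((ys[i] - model(xs[i])) ** 2 for i in test)
    return total / len(xs)


xs = list(range(24))
ys = [(x / 4.0) ** 2 for x in xs]          # smooth, noise-free toy target
candidates = [1, 2, 5, 10, 20]             # values of m to compare
errors = {m: cv_error(xs, ys, m) for m in candidates}
best_m = min(errors, key=errors.get)       # m with the lowest held-out error
print(errors, best_m)
```

With noisy real data the held-out error usually stops improving, or gets worse, beyond some value of $$m$$, and that is the value cross-validation would pick.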

What kind of predictors can ‘h’ be?

There are mathematical restrictions on it (differentiability requirements). Generally, decision tree stumps are used.

I highly recommend Abhishek Ghose’s answer for a mathematical-cum-intuitive explanation, and this for a visualization of the process.

## What is XGBoost?

XGBoost stands for Extreme Gradient Boosting. It is a supervised learning algorithm, implemented as a library for building fast, high-performance gradient boosted tree models. Parallel computation behind the scenes is what makes it so fast. It has become very popular in recent years due to its versatility, scalability and efficiency.
Like any other ML algorithm, it has its own pros and cons. Fortunately, the pros easily outweigh the cons, provided we understand the algorithm well and have an intuition for proper parameter tuning. Its huge popularity on Kaggle bears this out.

How to use XGBoost?

There are library implementations of XGBoost in all major data analysis languages. We just have to train the model and tune its parameters. Below is the series of steps to follow:

2. Prepare your data to contain only numeric features (yes, XGBoost works only with numeric features).
3. Split the data into train and test sets.
4. Train the model and tune the parameters.
5. Deploy your model on test data.
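
The data preparation steps in this list can be sketched as follows. This is only an illustration in plain Python with no XGBoost involved: the records, the column names and the helpers `one_hot` and `train_test_split` are all invented for the example. It shows one common way (one-hot encoding) to turn a categorical column into numeric features, followed by a random train/test split.

```python
import random

# Hypothetical toy records with one numeric and one categorical feature.
rows = [
    {"age": 25, "city": "Pune", "label": 0},
    {"age": 32, "city": "Mumbai", "label": 1},
    {"age": 47, "city": "Delhi", "label": 1},
    {"age": 51, "city": "Pune", "label": 0},
    {"age": 38, "city": "Delhi", "label": 0},
    {"age": 29, "city": "Mumbai", "label": 1},
    {"age": 44, "city": "Pune", "label": 1},
    {"age": 36, "city": "Delhi", "label": 0},
]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per level."""
    levels = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for level in levels:
            new[f"{column}_{level}"] = 1 if r[column] == level else 0
        encoded.append(new)
    return encoded


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


numeric_rows = one_hot(rows, "city")       # every remaining value is numeric
train, test = train_test_split(numeric_rows)
print(len(train), len(test))
print(numeric_rows[0])
```

The model would then be trained on `train` only, with `test` kept aside for the final check.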

A burning question here is “Which are the major hyperparameters in case of XGBoost?”

Major hyperparameters in XGBoost

#### General parameters:

1. silent [default = 0] : 0 for printing running messages, 1 for silent mode.
2. booster [default = gbtree] : Booster to use: gbtree (tree based), gblinear (linear function).

#### Booster parameters:

1. eta [default = 0.3] : It controls the learning rate. It scales the contribution of each tree by a factor of eta when the tree is added to the current approximation. It is used to prevent overfitting. A lower eta makes the model more robust to overfitting, but nround then has to be higher.

2. nround: Number of iterations for boosting. (m as explained in Gradient Boosting part above).

3. max_depth [default: 6]: Maximum depth of tree.

4. gamma [default : 0]: Minimum loss reduction required to make a further partition on a leaf node of the tree. Larger values make the algorithm more conservative.

5. min_child_weight [default : 1] : Minimum sum of instance weight needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. The larger, the more conservative the algorithm will be.

6. subsample [default : 1] : Fraction of data to be used for growing trees. It shortens computation and makes the algorithm less prone to overfitting. It is advisable to tune this together with nround and eta.

7. colsample_bytree [default : 1] : Subsample ratio of columns when constructing each tree.
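
To see why a lower eta calls for a higher nround, consider the deliberately idealised sketch below (plain Python, no XGBoost). It assumes a single target value and pretends that every "tree" fits the current residual perfectly, so the only thing slowing convergence is the shrinkage factor. The function `rounds_needed` and its `tol` and `target` arguments are hypothetical names for this example.

```python
def rounds_needed(eta, tol=0.01, target=1.0):
    """Count boosting rounds until the remaining residual falls below tol.

    Toy setting: every tree predicts the current residual exactly, and its
    contribution is scaled by eta before being added to the prediction.
    """
    prediction, rounds = 0.0, 0
    while abs(target - prediction) >= tol:
        tree_output = target - prediction      # a perfect fit to the residual
        prediction += eta * tree_output        # shrinkage step
        rounds += 1
    return rounds


for eta in (1.0, 0.3, 0.1, 0.01):
    print(eta, rounds_needed(eta))
```

In this setting the residual shrinks by a factor of (1 - eta) per round, so smaller values of eta need many more rounds to reach the same fit. Real trees do not fit the residual perfectly, which is exactly why taking small steps helps against overfitting.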

#### Learning task parameters:

1. objective : Specify the learning task and the corresponding learning objective. A self-defined function can be passed. Commonly used are:
• reg:linear: Linear regression (default; renamed reg:squarederror in newer XGBoost releases).
• reg:logistic: Logistic regression.
• binary:logistic: Logistic regression for binary classification. Outputs probability.
• binary:logitraw: Logistic regression for binary classification. Outputs score before logistic transformation.
• multi:softmax: Used for doing multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to num_class – 1.

2. base_score [default : 0.5] : The initial prediction score of all instances.

3. eval_metric : Evaluation metrics for validation data. A self-defined function can be passed. Default: (rmse for regression, error for classification, mean average precision for ranking).
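
The difference between the objectives and metrics listed above is easier to see with the underlying arithmetic. The sketch below is plain Python written for this post, not XGBoost code, and the function names are only illustrative. It shows the logistic transformation that separates binary:logitraw (raw score) from binary:logistic (probability), the softmax step behind multi:softmax, and the rmse and error metrics, with error counted using a 0.5 probability threshold.

```python
import math


def sigmoid(score):
    """Logistic transformation: raw score -> probability."""
    return 1.0 / (1.0 + math.exp(-score))


def softmax(scores):
    """Turn one raw score per class into class probabilities."""
    top = max(scores)                               # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rmse(y_true, y_pred):
    """Root mean squared error, the usual regression metric."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def error_rate(y_true, probs, threshold=0.5):
    """Fraction of wrong binary predictions after thresholding probabilities."""
    preds = [1 if p > threshold else 0 for p in probs]
    return sum(1 for t, p in zip(y_true, preds) if t != p) / len(y_true)


raw_score = 0.0
print(sigmoid(raw_score))                           # probability for a raw score of 0

class_probs = softmax([1.0, 2.0, 3.0])
predicted_class = class_probs.index(max(class_probs))   # class index in 0 .. num_class - 1
print(class_probs, predicted_class)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(error_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```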

Note: This is not an exhaustive list, but it covers all the major ones.

## Pros and Cons?

### Pros:

• Extremely fast (parallel computation).
• Highly efficient.
• Versatile (Can be used for classification, regression or ranking).
• Can be used to extract variable importance.
• Does not require much feature engineering (missing value imputation, scaling and normalization).

### Cons:

• Only works with numeric features.
• Leads to overfitting if hyperparameters are not tuned properly.

## Applications

• Winning solution for many Kaggle competitions.
• Heavily used in industry due to its scalability.