APPLICATION OF ANALYSIS OF VARIANCE (ANOVA) IN FEATURE SELECTION¶
- AYAN KUNDU
Feature selection is one of the important topics in data science. It is extremely important in machine learning because it is a fundamental technique for directing the modelling effort toward the variables that are most efficient and effective for a given machine learning system.
What is feature selection?¶
In some datasets there are features that are either redundant with other features or irrelevant in the context of the problem. Deleting those features does not hamper model accuracy much, but keeping them makes the model more complex. So we select a subset of the features from the dataset; this process is known as feature selection.
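As a toy illustration (the column names below are made up for this example): a feature that is just a rescaled copy of another is redundant, a feature that carries no information about the target is irrelevant, and feature selection simply keeps a subset of the original columns.

import pandas as pd

# Hypothetical toy data: temp_f is a rescaled copy of temp_c (redundant),
# and customer_id carries no information about sales (irrelevant)
df = pd.DataFrame({
    'temp_c': [10.0, 15.0, 20.0, 25.0],
    'temp_f': [50.0, 59.0, 68.0, 77.0],   # = temp_c * 9/5 + 32
    'customer_id': [101, 102, 103, 104],
    'sales': [12, 18, 25, 31],            # target
})

# Feature selection keeps a subset of the original feature columns
features = df[['temp_c']]
target = df['sales']
print(features.columns.tolist())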
What is the importance of feature selection?¶
Machine learning works on a simple rule: if you put garbage in, you will get garbage out. Unnecessary features make the model more complex, so it is essential to select the right features when the number of features is large. Including unnecessary features in the model may also result in overfitting.
Feature selection methods aid in our mission to create an accurate predictive model. They help by choosing features that give as good or better accuracy while requiring less data. They can be used to identify and remove unneeded, irrelevant and redundant attributes that do not contribute to the accuracy of a predictive model, or may in fact decrease it. An empirical bias/variance analysis as feature selection progresses indicates that the most accurate feature set corresponds to the best bias-variance trade-off point for the learning algorithm.
What are the different types of feature selection method?¶
Various methodologies and techniques can be used to select the optimum feature space that will give the best accuracy. The most common families are listed below, with a brief sketch after the list.
- Filter method
- Wrapper method
- Embedded method
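As a rough sketch of how the three families differ in scikit-learn (the particular estimators below are chosen purely for illustration):

# Sketch of the three families on the Iris data (estimator choices are illustrative only)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

x, y = load_iris(return_X_y=True)

# Filter: score each feature with a statistical test and keep the k best
filter_sel = SelectKBest(f_classif, k=2).fit(x, y)

# Wrapper: repeatedly fit a model and eliminate the weakest features
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(x, y)

# Embedded: let the model's own feature importances decide (default: above-average importance)
embedded_sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0)).fit(x, y)

for name, sel in [('filter', filter_sel), ('wrapper', wrapper_sel), ('embedded', embedded_sel)]:
    print(name, sel.get_support())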
Can we apply ANOVA for feature selection?¶
In the filter-based feature selection method we use statistical tools to select the features with the best predictive power. We choose an appropriate statistical test that provides a score for each feature column. The features with the best scores are included in the model, while the other features are kept in the dataset but not used for the analysis.
ANOVA is a statistical test that examines whether there is a significant difference between the means of several groups. ANOVA partitions the total variability in the sample data into two components: variation within the classes and variation between the classes. The total variability in the dataset is described by the total sum of squares, so
Total sum of squares (SST) = Between-group sum of squares (SSA) + Within-group sum of squares (SSE)
The between-group sum of squares is also known as the treatment sum of squares, and the within-group sum of squares is also known as the error sum of squares. The ratio of SSA to SST tells us the proportion of the total variance in the dataset that is explained by the grouping; the features for which this proportion is largest should be retained. Suppose there are a total of $K$ treatments (groups) under a feature and treatment $i$ has $n_i$ observations, so the total number of observations is $N=\sum_{i=1}^{K} n_i$. Then
$$F\text{-statistic}=\frac{SSA/(K-1)}{SSE/(N-K)}$$
$$p\text{-value}=P\left[F_{K-1,\;N-K}>F\text{-statistic}\right]$$
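Before moving on, these formulas can be checked numerically. The sketch below groups a single Iris feature by the target class, computes SST, SSA and SSE by hand, and compares the resulting F-statistic and p-value with scipy's one-way ANOVA (the choice of the first feature is arbitrary):

# A hand-rolled check of the formulas above on one Iris feature,
# grouping the feature values by the target class
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

x, y = load_iris(return_X_y=True)
feature = x[:, 0]                                   # sepal length, chosen arbitrarily
groups = [feature[y == c] for c in np.unique(y)]

grand_mean = feature.mean()
N, K = feature.size, len(groups)

sst = ((feature - grand_mean) ** 2).sum()                          # total sum of squares
ssa = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)   # between-group (treatment)
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within-group (error)

f_stat = (ssa / (K - 1)) / (sse / (N - K))
p_value = stats.f.sf(f_stat, K - 1, N - K)          # P[F(K-1, N-K) > F-statistic]

print(np.isclose(sst, ssa + sse))                   # True: SST = SSA + SSE
print(f_stat, p_value)
print(stats.f_oneway(*groups))                      # should agree with f_stat and p_value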
The F-statistic examines whether, when we group the numerical feature by the target vector, the means of the groups are significantly different. Features are ranked by sorting them in ascending order of p-value; if ties occur, they are sorted by the F-statistic in descending order. Features can then be labeled 'important', 'marginal' and 'unimportant' for values above 0.998, between 0.997 and 0.998, and below 0.997, respectively.
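For the Iris data this ranking can be produced directly from the scores that scikit-learn's f_classif returns (np.lexsort uses its last key as the primary sort key, so passing (-F, p) sorts by p ascending and breaks ties by F descending):

# Rank all four Iris features by p-value (ascending), ties broken by F-statistic (descending)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()
f_scores, p_values = f_classif(iris.data, iris.target)

order = np.lexsort((-f_scores, p_values))   # primary key: p ascending; secondary: F descending
for i in order:
    print(iris.feature_names[i], f_scores[i], p_values[i])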
What can be done if the F-statistic is not a good measure for classification?¶
[Figure: two features, one separating the classes along the horizontal axis and one along the vertical axis]
In the figure above, the horizontal feature separates the classes better than the vertical one, so it has a higher value of the F-statistic. In some cases, however, none of the individual features is good enough for classification, i.e. no single feature's F-statistic is good enough to separate the classes. In that case we can compute the F-statistic as a function of the data projected onto an axis that is not one of the original variables (the Fisher discriminant), i.e. we score the projection of the classes onto that axis.
[Figure: projection of the classes onto the Fisher discriminant axis]
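One way to realize this idea in scikit-learn is Linear Discriminant Analysis, which finds such a Fisher-style discriminant axis. The sketch below uses two deliberately weak synthetic features (the data-generating numbers are made up for this illustration): neither feature alone earns a large F-statistic, but the projection onto the discriminant axis does.

# When no single feature separates the classes well, score a projection instead
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
base = rng.normal(size=n)
# The classes differ only along the direction x1 + x2, so neither feature alone separates them
x1 = base + 0.2 * y + rng.normal(scale=0.1, size=n)
x2 = -base + 0.2 * y + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

print('per-feature F:', f_classif(X, y)[0])

# Project onto the (Fisher) discriminant axis and score the projected feature
proj = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print('F of projection:', f_classif(proj, y)[0])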
Although the Iris dataset has only four features, I have demonstrated the process using Python just for reference.
# Load the necessary libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# Load iris data
iris = load_iris()
# Create features and target
x = iris.data
y = iris.target
iris
# Create a SelectKBest object to select the two features with the best ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=2)
# Apply the SelectKBest object to the features and target
X_kbest = fvalue_selector.fit_transform(x, y)
# Show results
print('Original number of features:', x.shape[1])
print('Reduced number of features:', X_kbest.shape[1])
X_kbest
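Since the selector has already been fitted, its scores_, pvalues_ and get_support attributes show exactly which features were kept (this continues the cell above):

# Inspect the ANOVA scores assigned by SelectKBest and the features it kept
print('F-scores :', fvalue_selector.scores_)
print('p-values :', fvalue_selector.pvalues_)
print('kept     :', [iris.feature_names[i] for i in fvalue_selector.get_support(indices=True)])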
# Implementing Random Forest on the original dataset and calculating accuracy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)
model = RandomForestClassifier(n_estimators=130, max_features=None)
model.fit(x_train, y_train)
print('Train accuracy:', model.score(x_train, y_train))
pred = model.predict(x_test)
accuracy = accuracy_score(y_test, pred)
print('Test accuracy:', accuracy)
# Implementing Random Forest on the dataset after feature selection and calculating accuracy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

x = X_kbest
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)
model = RandomForestClassifier(n_estimators=150, max_features=None)
model.fit(x_train, y_train)
print('Train accuracy:', model.score(x_train, y_train))
pred = model.predict(x_test)
accuracy = accuracy_score(y_test, pred)
print('Test accuracy:', accuracy)
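As a side note, a single 80/20 split of 150 rows can give somewhat noisy accuracy numbers; a cross-validated comparison (a quick sketch reusing iris and X_kbest from the cells above) gives a steadier picture:

# Optional: 5-fold cross-validated accuracy on the full and the reduced feature set
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

cv_model = RandomForestClassifier(n_estimators=130, max_features=None, random_state=100)
print('All four features :', cross_val_score(cv_model, iris.data, iris.target, cv=5).mean())
print('Best two features :', cross_val_score(cv_model, X_kbest, iris.target, cv=5).mean())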
I have used Random Forest on both the original dataset and the dataset obtained after selecting the optimum features, and it can be seen that the model accuracy is not hampered by deleting two features. (*The Iris dataset is very simple and does not really need feature selection; it is used here only for demonstration.)
P.S. Feature extraction is different from feature selection: feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features.
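A minimal contrast between the two, using PCA for extraction and SelectKBest for selection (illustrative only):

# Selection returns a subset of the original columns; extraction (here PCA) builds new ones
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

x, y = load_iris(return_X_y=True)

selected = SelectKBest(f_classif, k=2).fit_transform(x, y)   # two of the original features
extracted = PCA(n_components=2).fit_transform(x)             # two brand-new features

print(x[0], '->', selected[0])     # selected values are copies of original columns
print(x[0], '->', extracted[0])    # extracted values are combinations of all columns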