Feature selection techniques

Why we need to select features:

The performance of a machine learning or statistical model depends on the choice of algorithm, the features used, and the model-selection process. A dataset may contain many features, but not all of them are important, so we remove irrelevant features to boost model performance, especially for high-dimensional datasets. Feature selection is also called variable or attribute selection. Removing features reduces the dimensionality of the dataset, which makes the model run faster. Features should be selected so that the model predicts with greater accuracy without becoming too complex to interpret.

Here we remove features which (a short pandas sketch of the first two checks follows this list):
1. have a high percentage of NaN or missing values.
2. have low variance.
3. are highly correlated with another feature, in which case we can drop either one of the pair.
4. have very low correlation with the target variable.
5. are not statistically significant.
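
A minimal sketch of the first two checks, assuming a small illustrative DataFrame and arbitrary cut-offs (50% missing, near-zero variance):

import numpy as np
import pandas as pd

# Hypothetical data: the column names and thresholds below are illustrative only.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29],
    "salary": [50000, np.nan, np.nan, np.nan, 61000, np.nan],  # mostly missing
    "const":  [1, 1, 1, 1, 1, 1],                              # zero variance
    "target": [0, 1, 0, 1, 1, 0],
})

# 1. Drop columns where more than 50% of the values are missing.
nan_ratio = df.isna().mean()
df = df.drop(columns=nan_ratio[nan_ratio > 0.5].index)

# 2. Drop feature columns with (near-)zero variance.
variances = df.drop(columns="target").var(numeric_only=True)
df = df.drop(columns=variances[variances < 1e-8].index)

print(df.columns.tolist())  # ['age', 'target']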

Types of feature selection:

There are three types of feature selection:
1. Filter methods.
2. Wrapper methods.
3. Embedded methods.

Filter methods:

Filter methods are also called single-factor analysis: each feature is scored on its own with a statistical measure, independently of any model.

Information gain:

Information gain measures the reduction in entropy, i.e. how much information a variable or feature carries about the target. The higher the information gain, the more important the feature is. In feature selection, information gain is often referred to as mutual information. It is typically calculated with a categorical feature and a categorical target.
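
A minimal sketch using scikit-learn's mutual_info_classif on the iris data (the dataset choice is only for illustration; its features are numeric, which the function also handles):

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Higher scores mean the feature removes more uncertainty (entropy) about the target.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in zip(load_iris().feature_names, scores):
    print(f"{name}: {score:.3f}")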

Variance threshold:

Variance threshold removes every feature whose variance does not meet a specified threshold. By default it removes features with zero variance. If a feature has the same value in every row, it has no variance and nothing can be learned from it.
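
A minimal sketch with scikit-learn's VarianceThreshold on a toy matrix whose middle column is constant:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the second column is constant, so its variance is zero.
X = np.array([
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.4],
])

# threshold=0.0 is the default: only zero-variance features are removed.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # [ True False  True]
print(X_reduced.shape)         # (3, 2)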

Chi square:

The chi-square statistic measures the difference between observed and expected values. It is calculated when both the feature and the target are categorical, and it is used to test whether the feature depends on the target. If the chi-square value is less than the critical chi-square value, the feature is independent of the target variable. We therefore keep the features that are most strongly dependent on the target.
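
A minimal sketch with SelectKBest and chi2 from scikit-learn (chi2 expects non-negative feature values such as counts or one-hot encoded categories; iris is used purely as a demo, and k=2 is arbitrary):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the two features with the largest chi-square statistic,
# i.e. the ones most strongly dependent on the target.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)        # chi-square statistic per feature
print(selector.get_support())  # mask of the selected features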

ANOVA or F-test:

Analysis of variance (ANOVA) checks whether the mean values of two or more groups differ from each other. The F value is calculated when we have a numeric feature and a categorical target. Like the chi-square statistic, it measures how strongly the feature depends on the target: the higher the value, the more important the feature.
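
A minimal sketch using scikit-learn's f_classif, again on the iris data for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

# Numeric features, categorical target (the iris species).
X, y = load_iris(return_X_y=True)

f_scores, p_values = f_classif(X, y)
for name, f, p in zip(load_iris().feature_names, f_scores, p_values):
    # A larger F statistic (and smaller p-value) means the class means differ more,
    # so the feature separates the target groups better.
    print(f"{name}: F = {f:.1f}, p = {p:.3g}")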

Correlation:

Correlation measures the linear relationship between variables. We can use it when both the feature and the target are numeric. We select features with high correlation, regardless of whether it is positive or negative.
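
A minimal sketch on synthetic numeric data; the 0.5 cut-off on the absolute correlation is an arbitrary illustration:

import numpy as np
import pandas as pd

# Synthetic data: only x1 and x2 actually drive the target.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["target"] = 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of each feature with the target.
corr_with_target = df.corr()["target"].drop("target")

# Keep features with a strong linear relationship, positive or negative.
selected = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()
print(corr_with_target)
print("selected:", selected)  # likely ['x1', 'x2']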


Wrapper methods:

Here the analyst usually chooses the p-value, the F statistic or the partial F statistic as the criterion for selecting and eliminating variables. Any significance level can be used, but a p-value threshold of 0.05 is the most common. These methods are often used to build multiple linear regression models.

Forward selection:

In forward selection the model starts with no features. The most significant feature is selected, the model is fitted with it, then the next most significant feature is added, and so on; finally the model with the highest accuracy is kept. Once a feature has entered the model it stays in.
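
A minimal sketch with scikit-learn's SequentialFeatureSelector, which scores each candidate feature by cross-validated accuracy rather than by a p-value; the estimator, the dataset and n_features_to_select=5 are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Start with no features and greedily add the one that improves
# cross-validated accuracy the most, stopping at 5 features.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())  # mask of the selected features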

Backward elimination:

In backward elimination the model starts with all the independent variables. The least significant variable is removed, the model is refitted with the remaining variables, and the process repeats until every remaining variable is significant; finally the model with the highest accuracy is kept. Once a feature has been eliminated it stays out. This is the fastest of the three wrapper methods.
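
A minimal sketch of p-value-based backward elimination with statsmodels OLS on synthetic data; the 0.05 threshold and the column names are illustrative:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic regression data: only x1 and x2 truly drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 * X["x1"] - 3 * X["x2"] + rng.normal(scale=0.5, size=200)

features = list(X.columns)
while features:
    model = sm.OLS(y, sm.add_constant(X[features])).fit()
    p_values = model.pvalues.drop("const")
    worst = p_values.idxmax()
    if p_values[worst] <= 0.05:   # every remaining feature is significant
        break
    features.remove(worst)        # drop the least significant feature and refit

print("selected:", features)      # likely ['x1', 'x2']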

Stepwise selection:

Stepwise selection is a combination of forward selection and backward elimination. At each step all variables are evaluated for their unique contribution: a variable that has entered the model may later be removed, and an eliminated variable may enter again.
For example, consider three significant variables x, y and z.
First x is included in the model, and then y.
As soon as y enters, x's contribution drops and x is no longer significant, so x is eliminated and z is included; y and z remain significant.
x is then evaluated again and may re-enter the model.
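
A minimal sketch using the third-party mlxtend library, whose SequentialFeatureSelector with floating=True re-checks already-selected features at each step; it scores by cross-validated accuracy rather than p-values, and k_features=5 is an arbitrary choice:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# forward=True adds features one at a time; floating=True re-evaluates the
# features already in the model and drops any that stopped helping,
# which is the stepwise (forward + backward) behaviour described above.
sfs = SFS(
    LogisticRegression(max_iter=5000),
    k_features=5,
    forward=True,
    floating=True,
    scoring="accuracy",
    cv=5,
)
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)  # indices of the selected features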

Embedded or shrinkage methods:

This approach combines ideas from the filter and wrapper methods. Embedded methods are used by algorithms that have feature selection built in; the most common examples are LASSO (L1 regularization) and ridge (L2 regularization).
For example, if the data contain three features, the model can be fitted with each feature individually and with every possible combination of the three, and the subset of features that gives the highest accuracy is selected.
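
A minimal sketch of the LASSO route with scikit-learn's SelectFromModel; the diabetes dataset and alpha=0.1 are illustrative choices:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# L1 regularization shrinks the coefficients of unhelpful features to exactly zero,
# so the fitted model itself performs the feature selection.
X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)         # Lasso is sensitive to feature scale

selector = SelectFromModel(Lasso(alpha=0.1))  # alpha is an illustrative choice
selector.fit(X, y)

print(selector.estimator_.coef_)   # coefficients driven to zero mark dropped features
print(selector.get_support())      # mask of the features the Lasso keeps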










