This code implements a Random Forest Classifier to predict employee attrition in an HR dataset. The goal is to assess and understand the factors that contribute to employee turnover.
Importing Necessary Libraries:
- pandas: Used for data manipulation and analysis.
- matplotlib.pyplot: For data visualization.
- train_test_split: Splits the dataset into training and testing sets.
- LabelEncoder: Encodes categorical labels into numerical values.
- RandomForestClassifier: Implements the Random Forest classification algorithm.
- confusion_matrix: Tabulates correct and incorrect predictions for each class, showing where the classification model makes its errors.
- accuracy_score: Computes the accuracy of the model.
- SMOTE: Synthetic Minority Over-sampling Technique for dealing with imbalanced data.
Data Loading and Preprocessing:
- df: Creates a copy of the dataset for manipulation.
- LabelEncoder: Encodes the 'Attrition' column from text labels to numerical values (0 for 'No', 1 for 'Yes').
- pd.get_dummies(): Converts categorical columns into binary (dummy) variables.
- Dropping columns: Removes columns 'EmployeeCount' and 'EmployeeNumber' as they are not relevant for prediction.
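The preprocessing steps above can be sketched roughly as follows. The original dataset file is not shown here, so the small `data` frame and its 'Age' and 'Department' columns are hypothetical stand-ins; only 'Attrition', 'EmployeeCount' and 'EmployeeNumber' come from the description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows standing in for the HR dataset (assumption).
data = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "Department": ["Sales", "Research", "Research", "Sales"],
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [1, 2, 4, 5],
})

# Work on a copy so the loaded data stays untouched.
df = data.copy()

# LabelEncoder sorts the classes alphabetically: 'No' -> 0, 'Yes' -> 1.
df["Attrition"] = LabelEncoder().fit_transform(df["Attrition"])

# Turn the remaining text columns into binary dummy columns.
df = pd.get_dummies(df)

# Drop the constant and identifier columns.
df = df.drop(columns=["EmployeeCount", "EmployeeNumber"])

print(df.columns.tolist())
```

Encoding 'Attrition' first keeps it as a single 0/1 column, so pd.get_dummies() only expands the other categorical columns.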
Data Splitting and Balancing:
- X: Features (independent variables).
- y: Target variable (attrition).
- SMOTE: Oversamples the minority class with synthetic examples so that both attrition classes are equally represented.
- train_test_split: Splits the dataset into training and testing sets.
Model Training:
- RandomForestClassifier: Initializes a Random Forest classifier with 100 decision trees and fits it to the training set.
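The training step might look like the sketch below. Only the 100 trees come from the description; the synthetic training data, the random_state and the look at feature_importances_ (one possible way to see which factors the forest relies on) are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data (assumption): the label depends on the first two features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# 100 decision trees, as described above; random_state is only for repeatability.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, one value per feature, summing to 1.
importances = model.feature_importances_
print(importances)
```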
Model Evaluation:
- y_pred: Predicts attrition for the test set.
- accuracy_score: Calculates the accuracy of the model and prints the result.
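A short sketch of the evaluation step is shown here on synthetic data. The data, split size and random_state values are assumptions, so the printed accuracy will not match the figure reported for the HR dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the HR dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict labels for the held-out rows and compare them with the true labels.
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {acc:.2%}")
print(cm)  # rows are true classes, columns are predicted classes
```

The accuracy equals the sum of the confusion matrix diagonal divided by the number of test rows, so the matrix shows which class the remaining errors fall in.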
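Accuracy Score:
- Our model achieves an accuracy score of 92.71% on the test set. Because SMOTE was applied before the split, the test set includes synthetic samples, so accuracy on unseen real data may be lower.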
Summary:
This code performs data preprocessing, model training, and evaluation using a Random Forest Classifier to predict employee attrition. The accuracy score at the end measures the model's performance in predicting attrition based on the given dataset.





