Importing essential libraries for EDA (Exploratory Data Analysis):
NumPy: For numerical operations.
Pandas: For data manipulation.
Matplotlib.pyplot: For basic plotting.
Importing the dataset with pd.read_csv reads a CSV file into a Pandas DataFrame. Calling df.head() then displays the first five rows by default, giving a quick overview of the dataset's structure and content before any analysis starts.
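As a rough sketch of this step: the post does not name the CSV file, so the snippet below reads a few made-up rows from an in-memory string instead. The column names and values are illustrative assumptions, not the real dataset.

```python
import io

import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt  # add this once you start plotting

# Hypothetical stand-in for the real file: a few invented rows held in memory.
# With an actual file you would pass its path to pd.read_csv instead.
csv_text = """Age,Attrition,Department,MonthlyIncome
30,No,Sales,4000
45,No,Human Resources,9000
28,Yes,Sales,3500
"""

df = pd.read_csv(io.StringIO(csv_text))

# head() shows the first five rows by default (here there are only three).
print(df.head())
```

pd.read_csv accepts a file path or any file-like object, which is why the in-memory string works here.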
Segmenting your dataset into numerical and categorical columns is a fundamental EDA task. Numerical columns contain quantitative data, such as integers or floating-point numbers, while categorical columns hold qualitative data, often in the form of labels or categories. This separation lets you apply analysis and visualization techniques suited to each data type.
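One possible way to do this split is select_dtypes, sketched below on a toy frame. The values are made up and the column names are assumptions.

```python
import pandas as pd

# Toy frame with invented values; column names are illustrative.
df = pd.DataFrame({
    "Age": [30, 45, 28],
    "MonthlyIncome": [4000.0, 9000.0, 3500.0],
    "Department": ["Sales", "Human Resources", "Sales"],
    "Attrition": ["No", "No", "Yes"],
})

# Numeric dtypes (ints, floats) go to one list, everything else to the other.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Keep in mind that a column stored as numbers can still be categorical in meaning (for example a coded rating), so the dtype-based split is only a starting point.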
Assessing the presence of null or missing values within your dataset's features is a vital EDA action. Null values can distort analysis and modeling, so identifying and handling them appropriately is essential. You can use df.isnull().sum() to count the null values in each feature, or df.isnull().any() to flag which features contain at least one null; df.isnull().any().any() reduces that to a single True/False for the whole dataset. This process helps ensure data quality and informs decisions on how to address missing data.
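A minimal sketch of these checks, using a toy frame with deliberately missing entries (all values invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing entries.
df = pd.DataFrame({
    "Age": [30, np.nan, 28],
    "Department": ["Sales", None, "Sales"],
    "MonthlyIncome": [4000, 9000, 3500],
})

null_counts = df.isnull().sum()            # number of nulls per column
has_nulls = df.isnull().any()              # one True/False flag per column
any_null_at_all = df.isnull().any().any()  # single flag for the whole frame

print(null_counts)
print(has_nulls)
print(any_null_at_all)
```

Note that df.isnull().any() returns one flag per column; chaining a second .any() is what collapses it to a single answer.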
Examining the shape of your dataset using df.shape provides the number of rows and columns, giving you an overview of its size. Meanwhile, df.info() offers essential details about the dataset, such as data types, non-null counts, and memory usage. These actions are essential for understanding the dataset's dimensions and content during EDA.
Describing the dataset using df.describe() provides summary statistics for each numerical attribute: the count, mean, standard deviation, minimum, quartiles, and maximum. These statistics offer insights into the central tendency, variability, and range of your data, aiding in the initial assessment of its distribution during EDA.
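The size and summary checks can be sketched together on a small toy frame (invented values, assumed column names):

```python
import pandas as pd

# Toy frame; values are invented for illustration.
df = pd.DataFrame({
    "Age": [30, 40, 50],
    "MonthlyIncome": [3000, 5000, 7000],
    "Department": ["Sales", "Sales", "Human Resources"],
})

print(df.shape)  # (number of rows, number of columns)
df.info()        # dtypes, non-null counts, memory usage

# describe() summarizes only the numeric columns by default.
summary = df.describe()
print(summary)
```

If you also want counts and most-frequent values for the text columns, df.describe(include="all") is one option.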
To find the total attrition ratio in a company, use df.Attrition.value_counts() / len(df) * 100. value_counts() counts the employees in each Attrition category (those who left and those who stayed); dividing by len(df), the total number of employees in the DataFrame, and multiplying by 100 turns each count into a percentage. The percentage for the employees who left is the attrition ratio.
An attrition ratio of 16.122449% means that approximately 16.12% of the employees in your dataset have left the company. This is a significant metric for HR and management to monitor when addressing employee turnover.
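To check the arithmetic, the sketch below rebuilds just the Attrition column from totals stated elsewhere in this post: 133 + 92 + 12 = 237 leavers across the three departments, and 882 + 588 = 1470 employees overall. The "Yes"/"No" labels are an assumption about how the column is coded.

```python
import pandas as pd

# Rebuild only the Attrition column from totals given in the post.
# The "Yes"/"No" labels are assumed, not confirmed by the source.
left, total = 237, 1470
df = pd.DataFrame({"Attrition": ["Yes"] * left + ["No"] * (total - left)})

# Percentage of employees in each Attrition category.
attrition_pct = df.Attrition.value_counts() / len(df) * 100
print(attrition_pct)

# Equivalent shortcut: let pandas do the division.
attrition_pct_alt = df.Attrition.value_counts(normalize=True) * 100
```

The "Yes" entry comes out at about 16.12%, matching the figure quoted above, and the two percentages sum to 100.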
Identifying unique values within dataset attributes is important for gaining insights into data diversity. The unique() method in Pandas allows you to extract all distinct values in a specific attribute, while nunique() provides the count of unique values. This exploration helps reveal the range and variety of information contained within each attribute, aiding in data understanding during EDA.
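A short sketch of both methods on a toy frame (invented rows; the Gender column name is an assumption):

```python
import pandas as pd

# Toy frame with invented rows.
df = pd.DataFrame({
    "Department": ["Sales", "Human Resources", "Sales", "Research & Development"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

departments = df["Department"].unique()     # distinct values, in order of appearance
n_departments = df["Department"].nunique()  # how many distinct values
per_column = df.nunique()                   # distinct-value count for every column

print(departments)
print(n_departments)
print(per_column)
```

Calling nunique() on the whole DataFrame is a quick way to spot columns with very few distinct values, which are often categorical.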
- The "Research & Development" department has a total of 133 attrition cases.
- The "Sales" department has a total of 92 attrition cases.
- The "Human Resources" department has a total of 12 attrition cases.
These numbers are the counts of employees who have left (experienced attrition) within each department. Such insights can help HR and management understand attrition patterns across the company and identify departments that need attention.
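Per-department counts like these could be produced roughly as follows. This is a sketch on toy data, assuming the columns are named Department and Attrition and that leavers are marked "Yes".

```python
import pandas as pd

# Toy frame; the column names and the "Yes"/"No" coding are assumptions.
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Human Resources",
                   "Research & Development", "Research & Development"],
    "Attrition": ["Yes", "No", "No", "Yes", "Yes"],
})

# Keep only the employees who left, then count them per department.
leavers = df[df["Attrition"] == "Yes"]
attrition_by_dept = leavers["Department"].value_counts()
print(attrition_by_dept)
```

A department with no leavers simply does not appear in the result, so a missing entry means zero attrition cases in that group.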

- 1043 employees are categorized as "Travel_Rarely."
- 277 employees are categorized as "Travel_Frequently."
- 150 employees are categorized as "Non-Travel."
This breakdown shows how many employees fall into each category of business travel in your dataset. Such information is valuable for understanding the travel patterns of employees within the company.
- Non-Travel employees:
  - 12 employees left.
  - 138 employees stayed.
- Travel_Frequently employees:
  - 69 employees left.
  - 208 employees stayed.
- Travel_Rarely employees:
  - 156 employees left.
  - 887 employees stayed.
These numbers show the attrition counts for employees based on their business travel frequency. This breakdown provides valuable insights into how attrition is distributed among employees with different travel patterns.
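A breakdown of this kind can be sketched with pd.crosstab. The data below is a toy sample, and the BusinessTravel column name is an assumption (only the category labels appear in the post).

```python
import pandas as pd

# Toy frame; column names are assumed, category labels follow the post.
df = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Rarely", "Travel_Frequently",
                       "Non-Travel", "Travel_Frequently"],
    "Attrition": ["No", "Yes", "Yes", "No", "No"],
})

# Rows: travel category. Columns: attrition status. Cells: employee counts.
travel_vs_attrition = pd.crosstab(df["BusinessTravel"], df["Attrition"])
print(travel_vs_attrition)

# Same table as row percentages, which makes groups of different size comparable.
row_pct = pd.crosstab(df["BusinessTravel"], df["Attrition"], normalize="index") * 100
print(row_pct)
```

The percentage view is often more informative than raw counts, since the travel categories differ a lot in size.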
- The "Laboratory Technician" role has the highest attrition count.
- The "Research Director" role has the lowest attrition count.
This indicates that more laboratory technicians than research directors left the company in your dataset. Because a raw count also depends on how many people hold each role, compare it with the per-role attrition ratio before concluding that one role is more likely to leave. Understanding attrition patterns by job role can help HR and management target retention strategies or investigate issues within specific roles.
The highest paying positions in the company are typically found in the 'Manager' and 'Research Director' job roles. This reflects the competitive compensation associated with these managerial and leadership positions.
- The company has a total of 882 male employees.
- The company has a total of 588 female employees.
This data provides an understanding of the distribution of male and female employees within the organization. In this case, there are more male employees than female employees, which can be useful for diversity and inclusion analysis and initiatives within the company.
- Sales Executives have an attrition ratio of approximately 17.48%.
- Research Scientists have an attrition ratio of approximately 16.10%.
- Laboratory Technicians have the highest attrition ratio of these three roles, at approximately 23.94%.
These attrition ratios indicate the percentage of employees who left (attrited) within each specific job role. They can be valuable for understanding attrition patterns within the company and may help identify areas that require further attention or retention efforts.
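One way to compute a per-role ratio is to average a True/False "left" flag within each role. The sketch below uses toy data; the JobRole column name and the "Yes"/"No" coding are assumptions.

```python
import pandas as pd

# Toy frame: 4 laboratory technicians (1 left) and 2 research directors (0 left).
df = pd.DataFrame({
    "JobRole": ["Laboratory Technician"] * 4 + ["Research Director"] * 2,
    "Attrition": ["Yes", "No", "No", "No", "No", "No"],
})

# True where the employee left; the mean of a boolean column is the share of True.
left = df["Attrition"].eq("Yes")
ratio_by_role = left.groupby(df["JobRole"]).mean() * 100
print(ratio_by_role.sort_values(ascending=False))
```

Unlike a raw count, this ratio is normalized by the number of people in each role, so large and small roles can be compared directly.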
- The "Human Resources" department has an average monthly income of approximately 6654.51.
- The "Research & Development" department has an average monthly income of approximately 6281.25.
- The "Sales" department has the highest average monthly income, at approximately 6959.17.
These figures reflect the average income levels for employees within each department, helping to understand salary structures and disparities among different parts of the company.
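Department averages like these are typically a groupby followed by mean. A sketch on invented numbers, assuming the columns are called Department and MonthlyIncome:

```python
import pandas as pd

# Toy frame with invented incomes; column names are assumptions.
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Human Resources"],
    "MonthlyIncome": [6000, 8000, 5000],
})

# Average income per department, highest first.
avg_income = (
    df.groupby("Department")["MonthlyIncome"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_income)
```

Swapping "Department" for a job-role column would give the per-role comparison in the same way.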
Adding a new feature that records each employee's age at the time of joining makes that information available for analysis. The mean age at joining is approximately 29.91 years. Additional statistics, such as the standard deviation and minimum age, describe how joining ages are distributed.
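The post does not show how this feature was built. One plausible construction, assumed here, is current age minus years at the company; the column names Age, YearsAtCompany, and AgeAtJoining are hypothetical, and the values are toy data.

```python
import pandas as pd

# Toy frame; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "Age": [30, 45, 28],
    "YearsAtCompany": [5, 20, 0],
})

# Assumed definition: age when joining = current age - years at the company.
df["AgeAtJoining"] = df["Age"] - df["YearsAtCompany"]

# Summary statistics (mean, std, min, ...) for the new feature.
stats = df["AgeAtJoining"].describe()
print(stats)
```

If the dataset defines tenure differently, the subtraction would need to be adjusted accordingly.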
Analyzing correlations within the dataset involves assessing how numerical attributes relate to one another. Commonly used measures, like the Pearson correlation coefficient, quantify the strength and direction of linear relationships between variables on a scale from -1 to 1. Identifying correlations helps reveal which attributes move together and can inform feature selection and predictive modeling.
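A minimal sketch of a Pearson correlation matrix, restricted to the numeric columns. The frame is toy data and the column names are assumptions.

```python
import pandas as pd

# Toy frame; TotalWorkingYears is built to rise in lockstep with Age.
df = pd.DataFrame({
    "Age": [25, 35, 45, 55],
    "TotalWorkingYears": [2, 12, 22, 32],
    "DistanceFromHome": [10, 3, 8, 1],
    "Department": ["Sales", "Sales", "Human Resources", "Sales"],
})

# Drop non-numeric columns first, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr(method="pearson")
print(corr)
```

In this toy example Age and TotalWorkingYears are perfectly linearly related, so their coefficient is 1; real data will rarely be that clean. A heatmap of the matrix is a common next step if you want a visual overview.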
Insight from EDA Analysis:
"The attrition ratio among Laboratory Technicians (approximately 23.94%) is notably higher than the company-wide 16.12%, indicating a potential area of concern for employee retention strategies. Additionally, Sales Executives, who belong to the department with the highest average monthly income, still show an attrition ratio of approximately 17.48% and may require targeted efforts to ensure their job satisfaction and retention. Addressing these findings can help improve overall workforce stability and performance."















