
IS5312 Mini Project-2

This project involves analyzing an HR dataset containing employee information using Python. Students will practice data manipulation, exploratory data analysis, and predictive modeling skills. The tasks include reading data files, descriptive statistics, data visualization, and partitioning the data for attrition prediction modeling. Students are to submit a Jupyter Notebook with all code and outputs as well as a 4 page report on the analysis results. Completing optional tasks can provide a 10% bonus to the final grade.

Uploaded by

lengbiao111
Copyright
© All Rights Reserved

IS5312 Analytical Programming with Python

Project Description

Updated on October 24, 2023

1 Project Objectives
In this project, you will practice manipulating data files, processing data, conducting
exploratory data analysis, and making predictions based on data. This project has two
objectives:
Ø By conducting this project, students can review and comprehensively practice
most Python programming skills learned from this course:

• Numbers and variables
• Input and output
• Relational operations
• Strings
• List and tuple
• Set and dictionary
• If-else control flow
• For and while loop control flow
• File processing
• Functions
• Module/Package
• Class
• Numpy
• Pandas
• Visualization/Matplotlib

Ø We also include tasks that go slightly beyond the points listed above but remain
reasonably difficult, specifically Task 3.3. The rationale is that, when
programming, it is common to encounter problems you have never seen before,
especially given the rapid development of programming technology and tools.
Hence, students need to be trained to solve new problems creatively. Task 3.3
is designed to exercise creative problem-solving skills on unfamiliar
programming tasks. With the help of references from books, papers, and the
Internet, students can solve these tasks successfully.

2 Data Description
This project has two data files. The first, variables.txt, contains detailed
definitions of the variables in the data set. The second, HR_Analytics.csv, offers a
comprehensive and varied view of an organization's employees. It contains 1470
observations and 35 variables. The variables can be further classified into 4 types:
Personal factors, Financial factors, Job-related factors, and Attrition factors. The
Personal factors consist of demographic factors such as age and gender. The
Financial factors include employees' salary-related factors such as Monthly Income
and Hourly Rate. The Job-related factors include variables related to the job
characteristics, while the Attrition factors include only NumCompaniesWorked and
Attrition.

3 Tasks
You need to write Python programs to finish the following tasks; manual
manipulations do not count toward your score. Tasks labeled Optional may be
attempted or skipped freely. They do not count toward the total score, but each
optional task completed successfully earns a grade bonus of 10%.

3.1 Read data from txt & csv file (30%)

Ø (6%) Please write code to read data stored in the file named variables.txt.

Ø (10%) Please write code to delete the brief introduction in the first several
lines of the file, delete the Definition and Types columns, and keep only the
Variable Names as a list.

Ø (8%) Please read the data in HR_Analytics.csv as a dataframe, and add column
names to the dataframe using the Variable Names list above.

Ø (6%) Please store the combined dataset into a new CSV file named
dataforanalysis.csv.
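
The steps of Task 3.1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: the layout of variables.txt is not given here, so the sketch assumes a tab-separated table whose header row starts with "Variable", and it assumes HR_Analytics.csv has no header row. The helper extract_variable_names and the small sample strings are hypothetical stand-ins for the real files.

```python
import io
import pandas as pd

def extract_variable_names(text, header_keyword="Variable", sep="\t"):
    """Return the first column of the table that follows the intro lines.

    Assumption: the table starts at the first line beginning with
    `header_keyword`, and columns are separated by `sep`.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Everything before the header row is treated as introduction text.
    start = next(i for i, line in enumerate(lines)
                 if line.startswith(header_keyword))
    # Skip the header row and keep only the first column of each row.
    return [line.split(sep)[0].strip() for line in lines[start + 1:]]

# Tiny stand-in for variables.txt (the real layout may differ).
sample_txt = (
    "This file describes the data set.\n"
    "\n"
    "Variable Name\tDefinition\tType\n"
    "Age\tAge of the employee\tPersonal\n"
    "Attrition\tWhether the employee left\tAttrition\n"
    "MonthlyIncome\tMonthly salary\tFinancial\n"
)
names = extract_variable_names(sample_txt)

# Stand-in for HR_Analytics.csv, assumed here to have no header row.
sample_csv = "41,Yes,5993\n49,No,5130\n"
df = pd.read_csv(io.StringIO(sample_csv), header=None, names=names)

# Save the named dataframe without the pandas row index.
df.to_csv("dataforanalysis.csv", index=False)
print(df)
```

With the real files, the sample strings would be replaced by open("variables.txt").read() and pd.read_csv("HR_Analytics.csv", header=None, names=names), adjusting the separator and header keyword to the actual file.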

3.2 Exploratory data analysis (60%)

To get an overview of the distribution of the data in the dataset, you need to
compute descriptive statistics:

Ø (10%) Please find all numerical variables, conduct descriptive statistics and
draw histograms of the variables.

Ø (8%) Please find and print all the values of all categorical variables, as
Figure 1 shows (partial example).
Figure 1 Partial example of categorical variables
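
One possible way to separate numerical from categorical variables is pandas' select_dtypes. The sketch below is a minimal illustration on a small made-up dataframe; the values are invented, and the column names are only assumed to resemble those of the real data set.

```python
import pandas as pd

# Small made-up sample; values are invented for illustration only.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
    "Department": ["Sales", "Research & Development",
                   "Research & Development", "Sales"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Split the columns by dtype.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Descriptive statistics (count, mean, std, quartiles) for numeric columns.
summary = df[numeric_cols].describe()
print(summary)

# One histogram per numeric column; in a notebook the figure renders inline.
axes = df[numeric_cols].hist(bins=10, figsize=(8, 3))

# Distinct values of each categorical column.
categorical_values = {col: df[col].unique().tolist()
                      for col in categorical_cols}
for col, values in categorical_values.items():
    print(f"{col}: {values}")
```

Note that select_dtypes only looks at the stored dtype, so a categorical variable coded as integers would be counted as numerical here and may need to be handled by hand.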

You are also required to explore the valuable information from the dataset, such
as:

Ø (6%) Please calculate the average monthly income for each education level and
draw a line chart to show how average monthly income varies with educational
attainment.

Ø (8%) Please calculate the turnover rates by department and gender, print the
results in a table, and draw a bar chart to show the turnover of employees of
different genders in different departments.

Ø (8%) Please calculate the number of employees in each department whose
monthly income is higher than the average monthly income of the whole company,
and draw a pie chart to show the distribution.

Ø (8%) Please create a pie chart to show the proportion of different levels of
monthly income in the attrition group, like Figure 2. You can choose any color
combination for the chart.
Figure 2 Attrition by income group
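
The grouping tasks above (average income by education level, turnover by department and gender, above-average earners per department) can be approached with pandas groupby. The following is a rough sketch on invented data. It assumes the columns are named Education, MonthlyIncome, Department, Gender and Attrition, and that "turnover rate" means the share of employees whose Attrition value is "Yes"; both are assumptions to check against variables.txt.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample; column names are assumptions about the real data set.
df = pd.DataFrame({
    "Education": [1, 1, 2, 2, 3, 3],
    "MonthlyIncome": [2000, 3000, 4000, 5000, 6000, 10000],
    "Department": ["Sales", "Sales", "Sales", "HR", "HR", "HR"],
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "Yes"],
})

# Average monthly income per education level.
income_by_education = df.groupby("Education")["MonthlyIncome"].mean()

# Turnover rate: share of "Yes" in Attrition per department and gender.
left = df["Attrition"].eq("Yes")
turnover = left.groupby([df["Department"], df["Gender"]]).mean().unstack()
print(turnover)

# Employees per department earning more than the company-wide average.
company_mean = df["MonthlyIncome"].mean()
above_avg = df.loc[df["MonthlyIncome"] > company_mean,
                   "Department"].value_counts()

# Draw each result on its own axes so the charts do not overlap.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
income_by_education.plot(kind="line", marker="o", ax=ax1,
                         title="Average income by education")
turnover.plot(kind="bar", ax=ax2, title="Turnover rate")
above_avg.plot(kind="pie", autopct="%1.0f%%", ax=ax3,
               title="Above-average earners")
```

The income-level pie chart of Figure 2 is not sketched here because the income brackets are not specified in the text.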

Ø (12%) Given an age group list ['<25', '25-35', '35-45', '45-60'], please add a
column named age_group to the dataframe, assigning each employee to one of the
groups in the list, then draw a pie chart to display the distribution of employee
counts by age group. Calculate and print a table showing, within each age group,
the number of employees who left within one year without getting a promotion.
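
Binning ages into the given groups can be done with pd.cut. The sketch below is one possible reading, not the required one: the group boundaries are assumed to be left-closed (25 falls in '25-35', 35 in '35-45'), the data are invented, and the column YearsAtCompany used for "left within one year" is a hypothetical name. The promotion condition is left out because the relevant column is not named in this description.

```python
import pandas as pd

# Invented sample; YearsAtCompany is a hypothetical column name.
df = pd.DataFrame({
    "Age": [22, 25, 34, 35, 44, 45, 59],
    "Attrition": ["Yes", "No", "Yes", "Yes", "No", "Yes", "No"],
    "YearsAtCompany": [1, 5, 1, 3, 10, 0, 20],
})

labels = ["<25", "25-35", "35-45", "45-60"]
# Assumed boundaries: left-closed bins [0,25), [25,35), [35,45), [45,61).
bins = [0, 25, 35, 45, 61]
df["age_group"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)

# Employee counts per age group, in the order of the label list.
counts = df.groupby("age_group", observed=False).size()
print(counts)
# counts.plot(kind="pie", autopct="%1.1f%%") would draw the pie chart.

# Assumed filter: employees who left with at most one year of tenure.
early_leavers = df[(df["Attrition"] == "Yes") & (df["YearsAtCompany"] <= 1)]
table = early_leavers.groupby("age_group", observed=False).size()
print(table)
```

Passing observed=False keeps age groups with zero matching employees in the table, which makes the result easier to compare across groups.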

3.3 Partitioning data and predicting attrition (10%+20% bonus optional)

Ø (Optional, 10%) To make the subsequent analysis more convenient, transform
the categorical variables to a numerical type and print the first 5 lines (you
can try LabelEncoder).

Ø (10%) Partition the data set into a train data set and a test data set. The train
data set should contain about 80% of all data points and the test data set the
remaining 20%. Print their rows.

Ø (Optional, 10%) Predict attrition based on the other variables in the file. The
prediction model can be a decision tree or any other feasible one. Evaluate the
performance and print the accuracy (you can try accuracy_score).
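
The three steps of Task 3.3 could be chained as in the sketch below, which uses scikit-learn's LabelEncoder, train_test_split, DecisionTreeClassifier and accuracy_score. It is a minimal illustration on 20 invented rows, not a tuned model; the column names and the random_state value are arbitrary assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# 20 invented rows; the real data set has 1470 observations.
df = pd.DataFrame({
    "Age": list(range(21, 41)),
    "Department": ["Sales" if i % 2 == 0 else "HR" for i in range(20)],
    "Attrition": ["Yes" if i % 4 == 0 else "No" for i in range(20)],
})

# Encode every non-numeric column as integer codes.
encoded = df.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])
print(encoded.head())

# Split features and target, then hold out 20% of the rows for testing.
X = encoded.drop(columns="Attrition")
y = encoded["Attrition"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape[0], X_test.shape[0])

# Fit a decision tree and measure accuracy on the held-out rows.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")
```

LabelEncoder assigns integer codes in sorted label order, which is convenient for tree models. For models that treat the codes as magnitudes, one-hot encoding may be a better fit.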

4 Submission Files
To obtain scores of the project, you need to submit these files:

Ø An .html file exported from the Jupyter Notebook, containing all your source
code and the running results in the order of the tasks. To name this file, please
follow this format: studentNameStudentNumberprojectcode.html. For example, if
you are CHAN Wai Ting and your student No. is 55664332, then your submitted
source code file should be named CHANWaiTing55664332projectcode.html.
Please put all source code into one file for ease of grading.

Ø The CSV file generated during Task 3.1.

Ø A report on the results of exploratory data analysis (Task 3.2) and attrition
prediction (Task 3.3), of at most four pages. To name this file, please follow
this format: studentNameStudentNumberprojectreport.pdf. For example, if you
are CHAN Wai Ting and your student No. is 55664332, then your submitted
report file should be named CHANWaiTing55664332projectreport.pdf.

References
Karanth, M. (2020). Tabular summary of HR analytics dataset. [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.4088439
