IS5312 Mini Project-2
Project Description
1 Project Objectives
In this project, you will practice manipulating data files, processing data, conducting
exploratory data analysis, and making predictions based on data. This project has two
objectives:
Ø By conducting this project, students can review and comprehensively practice
most Python programming skills learned from this course:
Ø We also include tasks slightly beyond the points listed above but still of
reasonable difficulty, specifically Task 3.3. The rationale is that, when
programming, it is common to come across problems you have never seen
before, especially given the rapid development of programming
technology and tools. Hence, students need practice in solving new
problems creatively. Task 3.3 is designed to exercise these
creative problem-solving skills on unfamiliar programming tasks. With the
help of references from books, papers, and the Internet, students can solve these
tasks successfully.
2 Data Description
This project has two data files. The first, variables.txt, contains detailed
definitions of the variables in the data set. The second,
HR_Analytics.csv, records a comprehensive and varied set of data about an
organization's employees. It contains 1470 observations and 35 variables. The
variables can be further classified into four types: Personal factors, Financial
factors, Job-related factors, and Attrition factors. The Personal factors consist
of demographic factors such as age and gender. The Financial factors include
employees' salary-related factors such as Monthly Income and Hourly Rate. The
Job-related factors include variables related to job characteristics, while the
Attrition factors include only NumCompaniesWorked and Attrition.
3 Tasks
You need to write Python programs to complete the following tasks; manual
manipulation does not count toward your score. For tasks labeled Optional, students
may freely decide whether to do them. Optional tasks do not count toward the total
score, but each successfully completed optional task earns a grade bonus of 10%.
Ø (10%) Please write code to delete the brief introductory content in the first
several lines of the file and delete the Definition and Types columns, keeping
only the Variable Names as a list.
Ø (6%) Please store the combined dataset into a new CSV file named
dataforanalysis.csv.
Ø (10%) Please find all numerical variables, compute descriptive statistics, and
draw histograms of the variables.
Ø (8%) Please find and print all the values of all categorical variables, as
Figure 1 shows (partial example).
Figure 1 Partial example of categorical variables
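One possible way to separate numerical from categorical variables is sketched below. It is only an illustration on a tiny stand-in data frame: the column names (Age, MonthlyIncome, Department, Attrition) and the sample values are assumptions made for the example, and the real data would be loaded from dataforanalysis.csv instead.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values); in the project the frame
# would come from pd.read_csv("dataforanalysis.csv").
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
    "Department": ["Sales", "Research & Development",
                   "Research & Development", "Sales"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Split the columns by dtype: numeric versus everything else.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Descriptive statistics (count, mean, std, quartiles, ...) per numeric column.
summary = df[numeric_cols].describe()
print(summary)

# Distinct values of every categorical column.
categorical_values = {col: sorted(df[col].unique()) for col in categorical_cols}
for col, values in categorical_values.items():
    print(f"{col}: {values}")
```

With matplotlib installed, calling `df[numeric_cols].hist()` would draw one histogram per numeric column.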
You are also required to explore valuable information in the dataset, such
as:
Ø (6%) Please calculate the average monthly income for each education level and
draw a line chart to show how average monthly income varies with educational
attainment.
Ø (8%) Please calculate the turnover rates by department and gender, print the
results in a table, and draw a bar chart to show the turnover of employees of
different genders in different departments.
Ø (8%) Please create a pie chart to show the proportion of different levels of
monthly income in the attrition group, like Figure 2. You can choose any color
combination for the chart.
Figure 2 Attrition by income group
Ø (12%) Given an age group list ['<25', '25-35', '35-45', '45-60'], please add a
column named age_group to the dataframe, filling in its values according to the
age group list, then draw a pie chart to display the distribution of employee
counts by age group. Calculate and print a table showing, within each age group,
the number of employees who left within one year if they did not get a promotion.
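Two of the exploratory steps above, the turnover rate by department and gender and the age_group column, can be sketched as follows. This is a minimal illustration on made-up rows, not a reference solution: the column names (Age, Gender, Department), the helper column `left`, and the bin edges are assumptions. In particular, the task does not say which group a boundary age such as 35 belongs to; the sketch assumes left-closed bins, so 35 falls into '35-45'.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real data set.
df = pd.DataFrame({
    "Age": [22, 30, 35, 41, 52, 28],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Department": ["Sales", "Sales", "Sales", "HR", "HR", "Sales"],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "Yes"],
})

# Turnover rate = share of employees with Attrition == "Yes" in each group.
df["left"] = df["Attrition"].eq("Yes")
turnover = df.groupby(["Department", "Gender"])["left"].mean().unstack()
print(turnover)

# Assign each employee to an age group.
# Assumption: bins are left-closed, i.e. [0, 25), [25, 35), [35, 45), [45, 61).
labels = ["<25", "25-35", "35-45", "45-60"]
df["age_group"] = pd.cut(df["Age"], bins=[0, 25, 35, 45, 61],
                         right=False, labels=labels)

# Employee counts per age group.
counts = df["age_group"].value_counts()
print(counts)
```

With matplotlib installed, `turnover.plot(kind="bar")` and `counts.plot(kind="pie")` would produce the corresponding charts.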
Ø (10%) Partition the data set into a train data set and a test data set. The train
data set should contain about 80% of all data points and the test data set the
remaining 20%. Print their rows:
Ø (Optional, 10%) Predict attrition based on the other variables in the file. The
prediction model can be a decision tree or any other feasible one. Evaluate the
performance and print the accuracy. Here is an example (you can try
accuracy_score):
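The 80/20 partition can be sketched with pandas alone, as below. The frame here is a stand-in with the same number of rows as the data set (1470); the `row_id` column and the `random_state` value are assumptions chosen only to make the example reproducible.

```python
import pandas as pd

# Stand-in for the 1470-row data set; only the row count matters here.
# "row_id" is a hypothetical column used purely for illustration.
df = pd.DataFrame({"row_id": range(1470)})

# Draw 80% of the rows at random for training; the rest form the test set.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print("train rows:", len(train))
print("test rows:", len(test))
```

If scikit-learn is available, `sklearn.model_selection.train_test_split(df, test_size=0.2)` is a common alternative, and `sklearn.metrics.accuracy_score(y_true, y_pred)` compares predicted labels with true ones for the optional prediction task.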
4 Submission Files
To receive a score for the project, you need to submit these files:
Ø An exported .html file containing all your source code and the running results,
in the order of the tasks, from the Jupyter Notebook. To name this file, please follow
this format: studentNameStudentNumberprojectcode.html. For example, if you
are CHAN Wai Ting and your student No. is 55664332, then your submitted
source code file should be named CHANWaiTing55664332projectcode.html.
Please put all source code into one file for ease of grading.
Ø A report on the results of exploratory data analysis (Task 3.2) and attrition
prediction (Task 3.3), of at most four pages. To name this file, please follow
this format: studentNameStudentNumberprojectreport.pdf. For example, if you
are CHAN Wai Ting and your student No. is 55664332, then your submitted
report file should be named CHANWaiTing55664332projectreport.pdf.
References
Karanth, M. (2020). Tabular summary of HR analytics dataset. [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.4088439