COMSATS UNIVERSITY ISLAMABAD
Department of Computer Science
Assignment No. 2
Course: Data Mining (DSC306)    Total marks: 10
[CLO 2: Apply pre-processing and classification techniques to solve classification problems of
moderate complexity.]
Applying Pre-processing and Classification Techniques
Objective:
The purpose of this assignment is to apply pre-processing and classification techniques to solve
classification problems of moderate complexity. Students will gain hands-on experience with data
preparation, feature selection, model training, and evaluation.
1. Data Selection:
• Choose a dataset that presents a classification problem of moderate complexity. This could
be from sources like UCI Machine Learning Repository, Kaggle, or any other relevant
source.
• Provide a brief description of the dataset, including the number of instances, features, and
the target variable.
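As a starting point, the dataset description could be produced directly in the notebook. The sketch below is only an illustration: it assumes scikit-learn's built-in breast cancer dataset as a stand-in for whatever dataset you choose, and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Stand-in dataset (an assumption for illustration); replace with your own.
data = load_breast_cancer(as_frame=True)
df = data.frame  # all feature columns plus a "target" column

n_instances, n_columns = df.shape
n_features = n_columns - 1  # every column except the target

# Count instances per class and map the numeric codes to class names.
counts = df["target"].value_counts().sort_index()
class_counts = {name: int(counts[i]) for i, name in enumerate(data.target_names)}

# Total number of missing cells across the whole table.
total_missing = int(df.isna().sum().sum())

print(f"Instances: {n_instances}")
print(f"Features:  {n_features}")
print(f"Class distribution: {class_counts}")
print(f"Missing values: {total_missing}")
```

Reporting the class distribution alongside the instance and feature counts also shows early on whether the classes are imbalanced, which affects the choice of evaluation metrics later.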
2. Data Pre-processing:
• Data Cleaning: Handle missing values, remove duplicates, and correct inconsistencies in
the dataset.
• Data Transformation: Normalize or standardize the data as necessary. Convert categorical
variables into numerical format using techniques such as one-hot encoding or label
encoding.
• Feature Selection: Identify and select relevant features that contribute to the classification
task. You can use techniques like correlation analysis, recursive feature elimination, or
feature importance from tree-based models.
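The three pre-processing steps above can be sketched on a small example. The table below is made up purely for illustration (its column names and values are hypothetical), and median imputation, one-hot encoding, standardization, and random forest importance are just one possible combination of the listed techniques.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing values, a duplicate row,
# and inconsistent category spellings.
raw = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 32, 58, 47, 36],
    "income": [30000, 52000, 45000, np.nan, 52000, 88000, 61000, 39000],
    "city":   ["Lahore", "lahore ", "Karachi", "Islamabad",
               "lahore ", "Karachi", "Islamabad", "Lahore"],
    "label":  [0, 1, 0, 1, 1, 1, 1, 0],
})

# --- Data cleaning ---
clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.title()   # correct inconsistencies
clean = clean.drop_duplicates().reset_index(drop=True)  # remove duplicates
for col in ["age", "income"]:
    clean[col] = clean[col].fillna(clean[col].median()) # handle missing values

# --- Data transformation ---
X = pd.get_dummies(clean.drop(columns="label"), columns=["city"], dtype=float)
y = clean["label"]
num_cols = ["age", "income"]
X[num_cols] = StandardScaler().fit_transform(X[num_cols])  # zero mean, unit variance

# --- Feature selection ---
# Absolute correlation of each feature with the target.
correlation = X.corrwith(y).abs().sort_values(ascending=False)
# Feature importance from a tree-based model.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)
selected = importance.sort_values(ascending=False).head(3).index.tolist()

print(correlation)
print("Selected features:", selected)
```

For brevity, this sketch fits the imputation and scaling on the full table. In your own notebook, consider splitting the data first and fitting these steps on the training set only, so that no information from the test set leaks into pre-processing.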
3. Model Selection and Training:
• Apply at least two classification models (Decision Trees, Random Forest, Support Vector
Machines, or Neural Networks).
• Split your dataset into training and testing sets (e.g., 80/20 split).
• Train the selected models on the training set.
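A minimal sketch of the splitting and training steps is given below. It assumes scikit-learn's built-in breast cancer dataset as a placeholder, and the two models and their settings (a depth-limited decision tree and an RBF-kernel SVM) are illustrative choices, not required ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset (an assumption); substitute your own features and labels.
X, y = load_breast_cancer(return_X_y=True)

# 80/20 split; stratify keeps the class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    # SVMs are sensitive to feature scale, so scaling is part of the pipeline
    # and is fitted on the training data only.
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42)),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: train={model.score(X_train, y_train):.3f} "
          f"test={test_accuracy[name]:.3f}")
```

Comparing training and test accuracy side by side gives a first indication of overfitting, for example when a deep decision tree scores much higher on the training set than on the test set.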
4. Model Evaluation:
• Evaluate the performance of your models using appropriate metrics such as accuracy,
precision, recall, F1-score, and ROC-AUC.
• Create confusion matrices for each model to visualize performance.
• Discuss the strengths and weaknesses of each model based on the evaluation metrics.
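The metrics listed above could be computed along the following lines. This is a sketch under assumptions: the dataset is again scikit-learn's breast cancer data, the two models are arbitrary examples, and `evaluate` is a hypothetical helper, not a library function.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def evaluate(model, X_test, y_test):
    """Return a dict of metrics and the confusion matrix for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    # ROC-AUC needs scores, not hard labels: use the probability of class 1.
    y_score = model.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }
    # Rows are true classes, columns are predicted classes.
    return metrics, confusion_matrix(y_test, y_pred)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = evaluate(model, X_test, y_test)
    metrics, cm = results[name]
    print(name, {k: round(v, 3) for k, v in metrics.items()})
    print(cm)
```

In a notebook, `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` from `sklearn.metrics` can draw the matrix as a plot, provided matplotlib is installed.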
5. Hyperparameter Tuning:
• For one of the models, perform hyperparameter tuning using techniques like Grid Search
or Random Search to optimize performance.
• Report the best parameters and the resulting performance metrics.
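Grid Search could be set up roughly as follows. The sketch assumes an SVM inside a scaling pipeline and a deliberately small, illustrative parameter grid; the dataset and the choice of F1 as the scoring metric are also assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# make_pipeline names each step after its class in lowercase ("svc"),
# so SVC parameters are addressed as "svc__<parameter>".
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01],
    "svc__kernel": ["rbf", "linear"],
}

# 5-fold cross-validation on the training set only; the test set stays untouched.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))

# GridSearchCV refits the best configuration on the full training set,
# so it can be evaluated on the held-out test set directly.
y_pred = search.predict(X_test)
test_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
```

For larger search spaces, `RandomizedSearchCV` from the same module samples a fixed number of configurations instead of trying every combination.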
6. Conclusion:
• Summarize your findings, including which model performed best and why.
• Discuss any challenges faced during the pre-processing and modeling phases and how you
overcame them.
Deliverables:
• A well-documented Jupyter Notebook containing:
  • Code for each step of the assignment.
  • Visualizations where applicable (e.g., plots for data distribution, confusion matrices).
  • Comments explaining your thought process and decisions made throughout the
    assignment.
• (Optional) A written report (2-3 pages) summarizing your approach, findings, and conclusions.
Evaluation Criteria:
Your assignment will be evaluated based on the following criteria:
1. Dataset Selection and Description (1 point):
• Appropriateness of the chosen dataset for a moderate-complexity classification problem.
• Clarity and completeness of the dataset description.
2. Data Pre-processing (1 point):
• Effectiveness of data cleaning methods applied.
• Appropriateness of data transformation techniques used.
• Justification for feature selection methods and the relevance of selected features.
3. Model Selection and Training (3 points):
• Justification for the choice of classification algorithms.
• Correct implementation of data splitting and model training.
4. Model Evaluation (3 points):
• Use of appropriate evaluation metrics and clarity in presenting results.
• Quality of confusion matrices and analysis of model performance.
• Depth of discussion regarding the strengths and weaknesses of each model.
5. Hyperparameter Tuning (2 points):
• Effectiveness of the hyperparameter tuning process.
• Clarity in reporting the best parameters and their impact on model performance.
6. Conclusion and Reporting:
• Clarity and depth of the summary of findings.
• Insightfulness in discussing challenges faced and solutions implemented.
• Overall organization and professionalism of the written report and code documentation.