25306: Python for Data Science
Course Objective:
To explore programming skills relevant to data science and gain knowledge of various libraries
and packages, such as NumPy, Pandas, and Matplotlib, required for data analysis, data
visualization, natural language processing, and machine learning.
Course Outcomes:
1. Master the use of NumPy and Pandas for efficient data manipulation.
2. Develop skills in data cleaning, preprocessing, and transformation.
3. Visualize data insights using Matplotlib and Seaborn.
4. Conduct exploratory data analysis (EDA) and statistical testing.
5. Apply machine learning models using Scikit-learn.
6. Build and present complete data analysis pipelines for business use cases.
Unit I: NumPy for Business Applications: Introduction to NumPy arrays vs Python lists - Array
creation, indexing, slicing, reshaping - Performing arithmetic operations across arrays -
Broadcasting and vectorized operations - Real-world applications: cost simulations, quantity
forecasting, margin calculations - Mini-case: Using NumPy to simulate product-level discounts
across large volumes
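As an illustration of the Unit I mini-case, the following is a minimal sketch of broadcasting applied to product-level discounts. The prices, discount tiers, and quantities are hypothetical values chosen for the example, not course data.

```python
import numpy as np

# Hypothetical data: unit prices for 4 products and 3 discount tiers.
prices = np.array([100.0, 250.0, 40.0, 80.0])   # shape (4,)
discounts = np.array([0.0, 0.10, 0.25])         # shape (3,)
quantities = np.array([10, 4, 50, 20])          # units sold per product

# Broadcasting: (4, 1) * (1, 3) -> (4, 3) table of discounted unit prices,
# one row per product and one column per discount tier, with no Python loop.
discounted = prices[:, np.newaxis] * (1.0 - discounts[np.newaxis, :])

# Total revenue under each discount tier (sum over products).
revenue_per_tier = (discounted * quantities[:, np.newaxis]).sum(axis=0)
print(revenue_per_tier)
```

The same expression scales to millions of products because the arithmetic runs inside NumPy rather than in an explicit loop over a Python list.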
Unit II: Data Handling and Transformation with Pandas: Loading, Filtering, and Merging
Datasets - Reading CSV files from local storage - DataFrames vs Series - Indexing, slicing,
selecting specific rows/columns - Filtering rows using conditions (e.g., “Region = East”) - Merging
customer and sales data (inner, outer joins) - Sorting, ranking, and rearranging business data. Data
Cleaning and Preprocessing: Handling missing values: dropping vs imputing - Detecting and
fixing outliers - Converting data types (e.g., strings to dates) - Replacing and renaming columns -
String operations on categorical data (e.g., formatting product names) - Encoding categorical
variables for future modeling
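The Unit II topics of type conversion, imputation, merging, and filtering can be sketched as follows. The two small tables, their column names, and the choice of median imputation are assumptions made for the example; in practice the data would be loaded with pd.read_csv.

```python
import pandas as pd

# Hypothetical customer and sales tables.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["East", "West", "East"],
})
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 4],
    "amount": [120.0, None, 75.0, 60.0],
    "order_date": ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-01"],
})

# Convert strings to dates, then impute the missing amount with the median.
sales["order_date"] = pd.to_datetime(sales["order_date"])
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# Inner join keeps only customers with sales; outer join keeps every row.
inner = customers.merge(sales, on="customer_id", how="inner")
outer = customers.merge(sales, on="customer_id", how="outer")

# Filter rows with a condition: sales from the East region only.
east = inner[inner["region"] == "East"]
print(east)
```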
Unit III: Data Visualization with Matplotlib and Seaborn: Plotting bar charts, line charts,
scatterplots - Creating histograms and boxplots - Designing dashboards for business KPIs -
Customizing chart aesthetics (colors, labels, grids) - Using Seaborn’s heatmap and pairplot for
correlation and pattern detection - Case: Visualize product-wise and region-wise sales trends.
Exploratory Data Analysis (EDA): Descriptive statistics: mean, median, std, quantiles -
Grouping and aggregating (e.g., sales by region, product category) - Pivot tables in Pandas -
Detecting patterns and seasonality using plots - Correlation matrix and business implications
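A short sketch of the EDA steps listed above (descriptive statistics, grouping, and pivot tables) might look like this. The sales table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "category": ["Toys", "Books", "Toys", "Books", "Toys"],
    "amount": [100.0, 50.0, 80.0, 70.0, 60.0],
})

# Descriptive statistics: count, mean, std, min, quartiles, max.
summary = sales["amount"].describe()

# Grouping and aggregating: total sales by region.
by_region = sales.groupby("region")["amount"].sum()

# Pivot table: regions as rows, categories as columns, summed amounts.
pivot = sales.pivot_table(
    index="region", columns="category", values="amount",
    aggfunc="sum", fill_value=0,
)
print(pivot)
```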
Unit IV: Statistical Analysis and Forecasting: Statistical Testing using SciPy and
statsmodels - Basics of statistical inference - One-sample and two-sample t-tests - Chi-square test for
categorical comparisons - ANOVA for comparing multiple groups (e.g., region-wise averages) -
Interpretation of p-values and business implications
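The two-sample t-test and ANOVA named in Unit IV can be sketched with scipy.stats as below. The three regional samples are invented for illustration, and the use of Welch's unequal-variance t-test is a choice made for the example rather than something the syllabus prescribes.

```python
from scipy import stats

# Hypothetical average order values sampled from three regions.
east = [20.0, 22.0, 19.0, 24.0, 25.0]
west = [28.0, 30.0, 27.0, 31.0, 29.0]
south = [23.0, 21.0, 26.0, 24.0, 22.0]

# Two-sample t-test (Welch's variant, which does not assume equal variances).
t_stat, p_value = stats.ttest_ind(east, west, equal_var=False)

# One-way ANOVA: do the three regional means differ?
f_stat, p_anova = stats.f_oneway(east, west, south)

# A small p-value suggests the observed difference is unlikely under the
# null hypothesis of equal means; 0.05 is a conventional threshold.
print(f"t-test p = {p_value:.4f}, ANOVA p = {p_anova:.4f}")
```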
Unit V: Predictive Modeling: Introduction to Machine Learning with Scikit-learn - Supervised
vs unsupervised learning - Train/test split and cross-validation - Evaluation metrics: accuracy,
precision, recall, F1-score - First model: Logistic regression to predict binary outcomes (e.g.,
churn). Regression and Classification Models: Linear regression for sales prediction - Logistic
regression for classification - Decision Trees and Random Forests. Clustering and Customer
Segmentation: Unsupervised learning: K-means clustering - Customer segmentation based on
purchasing behavior. Time Series Forecasting with Pandas
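As a minimal sketch of the first-model workflow in Unit V (train/test split, logistic regression, evaluation metrics, cross-validation), the example below uses a synthetic dataset from make_classification as a stand-in for real churn data. The sample size, feature counts, and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary-outcome data standing in for a churn dataset.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)

# Hold out 25% of the rows for testing, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation metrics on the held-out test set.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}

# 5-fold cross-validation accuracy on the full dataset.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(metrics, cv_scores.mean())
```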
Textbook:
1. Python Data Science Handbook: Essential Tools for Working with Data, by Jake
VanderPlas, O’Reilly Media, 2017.
Reference Books:
1. Data Science From Scratch: First Principles with Python by Joel Grus, Second Edition,
2019, O’Reilly Media.
2. Python for Data Science by Mohd. Abdul Hameed, May 2021, Wiley.
3. Python for Data Science: A Crash Course for Data Science and Analysis, Python
Machine Learning and Big Data by Computer Science Academy.
4. Python for Data Science: The Ultimate Step-by-Step Guide to Python Programming by
Daniel, March 2021, O’Reilly.