Central European University
Welcome! This repository contains everything you need for this course: teaching materials, datasets, exercises, assignments, and resources for learning practical data skills.
Real data is messy. Real questions require SQL. Real work needs reproducible pipelines.
In this course, you'll learn the day-one skills that analysts and data scientists use every day:
- Tidy data principles – structuring data for analysis
- SQL with DuckDB – querying, aggregating, joining, and window functions
- JSON/API ingestion – turning messy nested data into clean tables
- Multi-stage pipelines – bronze → silver → gold patterns
- Validations as code – proving your data is correct
- Stakeholder communication – explaining what you found and why it matters
By the end, you'll handle 500K-row datasets, write production-quality queries, and build reproducible analysis pipelines.
Instructor: Eduardo Ariño de la Rubia
Email: [email protected]
Office: Room A104
Office Hours: Schedule at cal.com/earino
Format: Three full-day sessions (2 blocks per day)
Credits: 1.0 (runs alongside Coding 1 and Data Science 1)
📄 Full syllabus with dates: syllabus.md
1. Clone this repository:

   ```bash
   git clone https://github.com/earino/ECBS5294.git
   cd ECBS5294
   ```

2. Set up your environment:

   - Python 3.x
   - JupyterLab or VS Code
   - DuckDB (`pip install duckdb`)
   - Git

3. Test your setup:

   ```bash
   jupyter lab notebooks/day1_setup_check.ipynb
   ```

4. Bring your laptop to every class!
```
ECBS5294/
│
├── data/                                  # Teaching datasets (offline, provided)
│   ├── day1/                              # Dirty cafe sales data for practice
│   ├── day2/                              # Olist e-commerce marketplace data
│   └── day3/                              # Chicago + NYC government data + Olist subsets
│
├── notebooks/                             # Teaching notebooks and exercises
│   ├── day1/
│   │   ├── day1_block_a_tidy_foundations.ipynb
│   │   ├── day1_block_b_01_sql_foundations.ipynb
│   │   ├── day1_block_b_02_aggregations.ipynb
│   │   ├── day1_block_b_03_window_functions_primer.ipynb
│   │   ├── day1_block_b_04_window_functions_deep_dive.ipynb
│   │   ├── day1_exercise_tidy.ipynb
│   │   └── day1_setup_check.ipynb
│   ├── day2/
│   │   ├── day2_block_a_joins.ipynb
│   │   ├── day2_block_b_01_api_json_basics.ipynb
│   │   ├── day2_block_b_02_json_to_duckdb.ipynb
│   │   └── day2_exercise_joins.ipynb
│   └── day3/
│       ├── day3_block_a_pipelines_and_validations.ipynb
│       └── day3_exercise_mini_pipeline.ipynb
│
├── assignments/                           # Homework assignments with instructions
│   ├── hw1/                               # SQL single-table + window functions
│   ├── hw2/                               # JSON normalization + multi-table queries
│   └── hw3/                               # End-to-end data integration project
│
├── solutions/                             # Encrypted solution ZIPs (see below!)
│   ├── README.md                          # How to use encrypted solutions
│   └── solutions-*.zip                    # Password-protected (released after due dates)
│
├── references/                            # Quick references and cheat sheets
│   ├── tidy_data_checklist.md
│   ├── sql_quick_reference.md
│   ├── pipeline_patterns_quick_reference.md
│   ├── datasets/                          # Dataset documentation
│   ├── images/                            # Diagrams and visual aids
│   ├── papers/                            # Summaries of key papers
│   └── teaching/                          # Teacher notes (for your benefit!)
│
├── scripts/                               # Utility scripts
│   ├── decrypt_solution.py                # Extract encrypted solutions
│   ├── encrypt_solutions.py               # (Instructor use)
│   └── build_slides.sh                    # Generate HTML slides from Markdown
│
├── slides/                                # Marp-based presentation slides
│   ├── day1_kickoff.md                    # Course introduction (~10 min)
│   ├── day1_block_a_intro.md              # Optional: Tidy data intro (~3 min)
│   ├── day1_block_b_intro.md              # Optional: SQL intro (~3 min)
│   └── themes/                            # Custom CEU theme
│
├── syllabus.md                            # Full course syllabus with schedule
├── CLAUDE.md                              # Repository development guide
└── README.md                              # You are here!
```
Note: All course materials (Days 1-3) are complete and ready to use.
Good news: All solutions are in this repo from day one.
Smart design: They're password-protected until after deadlines.
- ✅ No anxiety about "losing" solutions
- ✅ Available immediately after due dates
- ✅ Encourages trying before looking
- ✅ Lets you verify your work once released
- Try the assignment first – Even partial attempts build understanding
- Check Moodle for password release dates – Posted after deadlines
- Decrypt when available:

  ```bash
  python scripts/decrypt_solution.py solutions/solutions-day1-blockA.zip
  ```
- Learn, don't copy – Type out solutions to build muscle memory
📖 Full guide: solutions/README.md
Day 1

- Block A: Tidy data foundations, primary keys, types, missing values
- Block B: SQL basics + window functions primer (ROW_NUMBER, LAG, moving averages)
- Assigned: Homework 1 (due start of Day 2)
Day 2

- Block A: SQL joins & relational modeling (INNER/LEFT, diagnosing join issues)
- Block B: JSON & APIs → tidy tables
- Assigned: Homework 2 (due start of Day 3)
Day 3

- Block A: Pipeline patterns + validations (bronze/silver/gold layers)
- Block B: In-class exam (paper/pen, 90 minutes, one A4 reference sheet)
- Assigned: Homework 3 (due one week after class)
📅 Specific dates: See syllabus.md for the current term's schedule.
| Component | Weight | What It Tests |
|---|---|---|
| Homework 1 (SQL + windows) | 20% | Query writing, aggregations, window functions |
| Homework 2 (JSON pipeline) | 25% | Data ingestion, normalization, persistence |
| Homework 3 (End-to-end) | 25% | Complete pipeline with validation + communication |
| In-class deliverables | 5% | Short exercises (completion-based) |
| In-class exam (Day 3) | 25% | SQL, joins, data thinking (paper/pen, closed-book) |
Late policy: −10% per 24 hours (48-hour max)
Median target: B+ (roughly ⅓ of class at A/A-)
- Tidy Data Checklist – Primary keys, types, missing values
- SQL Quick Reference – Syntax cheat sheet
- Pipeline Patterns Quick Reference – Bronze/silver/gold layers (sketched just below)
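To make the bronze → silver → gold pattern concrete, here's a minimal sketch in Python with DuckDB. Everything in it is a stand-in: the tiny CSV, the table and column names, and the cleaning rules are hypothetical, and the real pipelines are built in the Day 3 notebooks.

```python
import duckdb
from pathlib import Path

# Tiny stand-in CSV so the sketch runs end to end
# (hypothetical data; the real raw files live in data/)
Path("raw_sales.csv").write_text(
    "transaction_id,sale_date,amount\n"
    "t1,2024-01-01,3.50\n"
    "t1,2024-01-01,3.50\n"  # duplicate row
    ",2024-01-02,2.00\n"    # missing primary key
)

con = duckdb.connect("warehouse.duckdb")

# Bronze: raw ingest, no cleaning
con.execute("""
    CREATE OR REPLACE TABLE bronze_sales AS
    SELECT * FROM read_csv_auto('raw_sales.csv')
""")

# Silver: typed, deduplicated, keyless rows dropped (a documented choice)
con.execute("""
    CREATE OR REPLACE TABLE silver_sales AS
    SELECT DISTINCT
        transaction_id,
        CAST(sale_date AS DATE) AS sale_date,
        CAST(amount AS DOUBLE)  AS amount
    FROM bronze_sales
    WHERE transaction_id IS NOT NULL
""")

# Gold: aggregated, analysis-ready
con.execute("""
    CREATE OR REPLACE TABLE gold_daily_revenue AS
    SELECT sale_date, SUM(amount) AS revenue
    FROM silver_sales
    GROUP BY sale_date
""")

print(con.execute("SELECT * FROM gold_daily_revenue").fetchall())
```

Each layer is a real table you can inspect, so when a number looks wrong you can trace it back one stage at a time.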
- Arthur Turrell, Coding for Economists (selected chapters)
- The Carpentries (Unix, Git, Python episodes)
- DuckDB Documentation
- All notebooks in `notebooks/` – worked examples with explanations
- `references/teaching/` – Teacher notes (these show you what's important!)
- ✅ Restart & Run All before submitting – If it doesn't run cleanly, you'll lose points
- ✅ Use relative paths – `data/file.csv`, not `/Users/yourname/...`
- ✅ Commit often – Small, logical commits help you track changes
- Always identify and validate your primary key
- Handle NULL values consciously (they affect everything!)
- Use assertions as tests – Prove your data is correct (see the sketch below)
- Document your choices – Why did you handle missing values this way?
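Here's a minimal sketch of what those assertion checks can look like in DuckDB. The `orders` table and its `order_id` / `order_date` columns are toy stand-ins for a real dataset:

```python
import duckdb

con = duckdb.connect()

# Toy table; in the homework you'd load a real dataset from data/
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        (1, DATE '2017-01-03'),
        (2, DATE '2017-01-04'),
        (3, DATE '2017-01-04')
    ) AS t(order_id, order_date)
""")

# Primary key check: order_id must be unique and non-NULL
total = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
distinct_keys = con.execute(
    "SELECT COUNT(DISTINCT order_id) FROM orders"
).fetchone()[0]
assert total == distinct_keys, "order_id is not a valid primary key"

# NULL check on a column the analysis depends on
nulls = con.execute(
    "SELECT COUNT(*) FROM orders WHERE order_date IS NULL"
).fetchone()[0]
assert nulls == 0, f"{nulls} orders are missing order_date"
```

If any assertion fails, the pipeline stops loudly instead of silently producing wrong numbers.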
- Start small: Use `LIMIT 10` while developing queries
- Build incrementally: Add one clause at a time
- Format for readability: Uppercase keywords, indentation, one clause per line
- Check for NULLs: Always consider `IS NULL` / `IS NOT NULL`
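Putting those habits together, a development query might look like this (the `sales` table and its columns are toy placeholders):

```python
import duckdb

con = duckdb.connect()

# Toy table so the query runs standalone
con.execute("""
    CREATE TABLE sales AS
    SELECT * FROM (VALUES
        ('p1', 10.0), ('p1', 5.0), ('p2', NULL), ('p3', 7.5)
    ) AS t(product_id, amount)
""")

# Uppercase keywords, one clause per line, NULLs handled, LIMIT while developing
print(con.execute("""
    SELECT
        product_id,
        SUM(amount) AS total_revenue
    FROM sales
    WHERE amount IS NOT NULL
    GROUP BY product_id
    ORDER BY total_revenue DESC
    LIMIT 10
""").fetchall())
```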
- Windows KEEP row count (every row gets a value)
- GROUP BY COLLAPSES rows (aggregates reduce rows)
- You can't filter on window functions directly – use a subquery!
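A minimal illustration of all three points, using a toy table:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE daily_sales AS
    SELECT * FROM (VALUES
        (DATE '2024-01-01', 100.0),
        (DATE '2024-01-02', 150.0),
        (DATE '2024-01-03', 120.0)
    ) AS t(sale_date, revenue)
""")

# Window function: still 3 rows, each annotated with a value
print(con.execute("""
    SELECT sale_date, revenue,
           ROW_NUMBER() OVER (ORDER BY sale_date) AS day_number
    FROM daily_sales
""").fetchall())

# GROUP BY aggregate: collapses to 1 row
print(con.execute("SELECT SUM(revenue) AS total FROM daily_sales").fetchall())

# To filter on a window function, wrap it in a subquery
print(con.execute("""
    SELECT sale_date, revenue
    FROM (
        SELECT sale_date, revenue,
               ROW_NUMBER() OVER (ORDER BY sale_date DESC) AS rn
        FROM daily_sales
    ) AS ranked
    WHERE rn = 1
""").fetchall())
```

The `rn = 1` filter has to live outside the subquery because `WHERE` is evaluated before window functions.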
- Review teaching notebooks
- Check the quick references
- Read error messages carefully
- Start simpler (remove complexity, then add back)
- Ask for help! Office hours, email, WhatsApp
- ✅ Review course materials and documentation
- ✅ Discuss concepts with classmates (conceptually, not code)
- ✅ Ask instructor/TA for clarification
- ✅ Use official documentation (pandas, DuckDB, etc.)

- ❌ Use AI tools (ChatGPT, Claude, Copilot) for graded work
- ❌ Copy code from classmates
- ❌ Share your solutions with others
- ❌ Use solutions from previous years
Why? We need to assess YOUR understanding so we can help where you're stuck.
Policy: To align with parallel courses (Coding 1, Data Science 1), AI assistants are not permitted for homework or exams. Use them for personal study only.
By the end of this course, you will:
- Apply tidy data principles to messy real-world datasets
- Write SQL queries confidently (SELECT, WHERE, JOIN, GROUP BY, window functions)
- Ingest JSON/APIs and normalize into analysis-ready tables (see the sketch after this list)
- Build reproducible pipelines with validations as code
- Communicate findings to stakeholders with data dictionaries and clear narratives
- Develop performance intuition for data operations
- Handle real data problems (NULL values, type issues, missing data, duplicate keys)
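As a taste of the JSON work referenced above, here's a minimal sketch going from a made-up nested payload to a tidy DuckDB table (the payload and all names in it are hypothetical):

```python
import duckdb
import pandas as pd

# Made-up nested payload standing in for an API response
payload = [
    {"order_id": 1,
     "customer": {"id": "c1", "city": "Vienna"},
     "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]},
    {"order_id": 2,
     "customer": {"id": "c2", "city": "Budapest"},
     "items": [{"sku": "a", "qty": 5}]},
]

# Tidy: one row per order item, with order/customer fields repeated as columns
items = pd.json_normalize(
    payload,
    record_path="items",
    meta=["order_id", ["customer", "id"]],
)

# Persist into DuckDB so the rest of the analysis is plain SQL
con = duckdb.connect()
con.execute("CREATE TABLE order_items AS SELECT * FROM items")
print(con.execute("SELECT * FROM order_items").fetchall())
```

From here, everything else is the SQL you learn on Day 1.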
These are the skills you'll use on day one of any analyst or data science role.
Real data is messy. Embrace it.
This course doesn't hide the messiness – we work with dirty CSVs, inconsistent types, missing values, and 500K-row datasets because that's what you'll face in the real world.
Your job isn't to memorize syntax. It's to:
- Think clearly about data structure
- Make defensible choices about how to handle issues
- Document your work so others (and future you) understand
- Validate everything – trust, but verify!
By the end, you'll feel confident tackling real data problems, not just textbook examples.
Stuck on something? Don't suffer in silence!
- Office hours: By appointment (Room A104)
- Email: [email protected]
- WhatsApp: +34 654 69 13 63
- In class: Ask questions! Everyone else probably has the same question.
Rule of thumb: If you're stuck for more than 15 minutes, reach out.
Required:
- Python 3.x
- JupyterLab or VS Code
- DuckDB: `pip install duckdb`
- Git

Recommended:
- pandas: `pip install pandas`
- numpy: `pip install numpy`

Test your setup:

```bash
jupyter lab notebooks/day1_setup_check.ipynb
```

All teaching datasets are provided offline in this repo – no downloads needed!
This course moves fast, but that's because every minute is practical, hands-on skill-building. By the end of three sessions, you'll have built:
- A cleaned dataset from messy CSV data
- SQL queries on 500K+ row datasets
- A JSON-to-database pipeline
- An end-to-end analysis with validation and stakeholder communication
These are portfolio pieces. These are interview talking points. These are real skills.
Ready? Open `notebooks/day1_setup_check.ipynb` and make sure everything runs.
See you in class! 🚀
Questions about this README or the course? Contact the instructor.