Commit 65c5ccc

Vincent Le's Capstone Project (#2)
* Clean up
* Initialize project
* Finish DAG
* Visualize data using Streamlit
* Ochestrate the pipeline
* Add start and stop scripts
* Modified README file
* Added Dev Container Folder
* Stop ignoring data folder for deployment purpose
* Try updating base url
* Try updating base url
* If it's streamlit cloud, change base url
* Try again
* Revert
* Update README for live streamlit page
* Add diagram
* Minor fix to start script
* Remove devcontainer
* Modulize the transform functions
* Change scripts to utils, separate also tasks
* Update README file
1 parent 4fd2a2f commit 65c5ccc

31 files changed: +51597 −145 lines

VincentLeV/.gitignore

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
/__pycache__
__pycache__
logs
.env
servers.json

VincentLeV/README.MD

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# Vincent's Capstone: Banned Books ETL Pipeline

This capstone project demonstrates how to orchestrate scraping the PEN America website for banned-books data, cleaning and transforming the data, and loading it into CSV files. The data is then visualized using Streamlit.

## Tools

- Python (3.12)
- Pandas
- PostgreSQL
- Airflow (3.0.0+)
- Streamlit

---

## Project Flow

1. Scrape banned books data using Python
2. Clean the data using Pandas
3. Extract the data to CSV
4. Save the data to PostgreSQL
5. Visualize the data in Streamlit

Step 4 is redundant in this project, since I load the data from the CSV files, but I include it so that the project can scale later.
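The five steps above can be sketched as a plain-Python pipeline. This is only an illustration of the flow, not the project's actual DAG (which lives in `dags/get_banned_books.py`); the `scrape`, `clean`, and `extract_to_csv` function names and the sample rows are made up for the sketch.

```python
import pandas as pd

def scrape() -> pd.DataFrame:
    # Hypothetical stand-in for step 1: the real pipeline scrapes the
    # PEN America site; here we return a tiny hard-coded frame.
    return pd.DataFrame({
        "Title": ["Book A ", "Book A ", "Book B"],
        "State": ["Texas", "Florida", "Texas"],
        "Ban Status": ["banned", "banned", "banned pending investigation"],
    })

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    # Step 2: normalize whitespace and drop exact duplicate rows.
    cleaned = raw.apply(lambda col: col.str.strip())
    return cleaned.drop_duplicates()

def extract_to_csv(df: pd.DataFrame, path: str) -> None:
    # Step 3: persist the cleaned frame.
    df.to_csv(path, index=False)

def run_pipeline(path: str) -> pd.DataFrame:
    df = clean(scrape())
    extract_to_csv(df, path)
    # Step 4 (load into PostgreSQL) would go here, e.g. df.to_sql(...)
    # against the docker-compose Postgres service; step 5 is Streamlit.
    return df

result = run_pipeline("banned_books_sample.csv")
```

In the real project each of these stages is an Airflow task, so failures and retries are handled per step rather than for the whole script.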
![Diagram](./banned-books-pipeline-diagram.jpg)

---

## Project Structure

```
VincentLeV/
├── app/
│   ├── data/
│   │   └── banned_books/
│   │       ├── banned_books.csv       # The main dataset used for visualization
│   │       └── ...
│   ├── app.py                         # Streamlit home page
│   ├── ...
│   └── Dockerfile
├── config/
│   ├── generate_pgadmin_server.py     # Makes sure the server is ready in pgAdmin
│   └── ...
├── dags/
│   └── get_banned_books.py            # Airflow DAG that handles the data processing
├── tasks/
│   └── banned_books_taks.py           # Airflow tasks used in the get_banned_books DAG
├── utils/
│   ├── constants.py                   # Common constants used in util functions/tasks/DAG
│   └── transform_banned_books.py      # Util functions that clean and transform the data
├── docker-compose.yaml
├── start.sh                           # App start script
└── stop.sh                            # App stop script
```
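The `utils/` vs. `tasks/` split keeps the pandas transforms importable and testable without an Airflow environment. A minimal sketch of that pattern is below; the function name and column handling are hypothetical, not the actual contents of `transform_banned_books.py`, and the task is shown as a plain function where the real DAG would apply Airflow's `@task` decorator.

```python
import pandas as pd

# utils/: pure pandas transforms with no Airflow imports,
# so they can be unit-tested directly.
def normalize_ban_status(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["Ban Status"] = out["Ban Status"].str.lower().str.strip()
    return out

# tasks/: thin wrappers that the DAG registers as Airflow tasks.
def transform_task(df: pd.DataFrame) -> pd.DataFrame:
    return normalize_ban_status(df)

df = pd.DataFrame({"Ban Status": ["  Banned ", "BANNED"]})
statuses = transform_task(df)["Ban Status"].tolist()  # ["banned", "banned"]
```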
## Data Visualization

The Streamlit app is deployed here:

```
https://vincent-banned-books.streamlit.app/
```

## Setup and Run Locally

### For Running the Project the First Time

1. In the terminal, run these commands:
   ```bash
   cd VincentLeV
   ./start.sh
   ```
   The terminal will prompt for some variable inputs; type in the values you want.
2. After Docker has completed the process, navigate here to check out the Airflow processes:
   ```
   http://localhost:8080/
   ```
3. If everything runs well in step 2, the data is ready. Navigate to this page to check out the visualization:
   ```
   http://localhost:8502/
   ```

Out of curiosity, the data in PostgreSQL can be checked here:
```
http://localhost:5050/
```

pgAdmin is pre-loaded with a server under the name you entered at the prompt `Enter pgadmin server name`.

Log into the DB with the password you provided at the prompt `Enter postgres password`, and you will see the data.

### For Subsequent Runs

In the terminal, run these commands:
```bash
cd VincentLeV
docker compose up -d
```

## Stop the App

In the terminal, run this script from inside the `VincentLeV` folder:
```bash
./stop.sh
```

VincentLeV/app/Dockerfile

Lines changed: 16 additions & 0 deletions

@@ -0,0 +1,16 @@
FROM python:3.9-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    libpq-dev gcc build-essential --no-install-recommends && \
    rm -rf /var/lib/apt/lists/*

COPY . .

RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt

EXPOSE 8502

CMD ["streamlit", "run", "app.py", "--server.port=8502", "--server.address=0.0.0.0", "--server.fileWatcherType=poll"]

VincentLeV/app/app.py

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
import streamlit as st

st.set_page_config(
    page_icon=":books:",
    layout="wide",
)

pg = st.navigation([
    st.Page("home.py", title="Overview", icon=":material/home:"),
    st.Page("states_and_districts.py", title="States and Districts", icon=":material/location_on:"),
])

pg.run()

VincentLeV/app/data/banned_books/banned_books.csv

Lines changed: 34258 additions & 0 deletions
Large diffs are not rendered by default.

VincentLeV/app/data/banned_books/pen-2021-2022.csv

Lines changed: 2533 additions & 0 deletions
Large diffs are not rendered by default.

VincentLeV/app/data/banned_books/pen-2023-2024.csv

Lines changed: 10047 additions & 0 deletions
Large diffs are not rendered by default.

VincentLeV/app/data/banned_books/pen-2023.csv

Lines changed: 267 additions & 0 deletions
Large diffs are not rendered by default.

VincentLeV/app/home.py

Lines changed: 144 additions & 0 deletions

@@ -0,0 +1,144 @@
import os

import pandas as pd
import plotly.express as px
import streamlit as st

from utils import get_base_data_url, rank_dataframe

BASE_DATA_URL = get_base_data_url()
DATA_URL = os.path.join(BASE_DATA_URL, "banned_books.csv")

ban_colors = {
    "banned": "#bf0603",
    "banned from libraries and classrooms": "#ff6d00",
    "banned by restriction": "#0096c7",
    "banned pending investigation": "#ffea00"
}

@st.cache_data
def load_data(path: str):
    data = pd.read_csv(path)
    return data

def by_year_bar_chart(data: pd.DataFrame):
    year_status_counts = (
        data
        .groupby(["Year", "Ban Status"])
        ["Title"].nunique()
        .reset_index(name="Titles")
    )

    fig = px.bar(
        year_status_counts,
        x="Year",
        y="Titles",
        color="Ban Status",
        color_discrete_map=ban_colors,
        labels={"Titles": "Titles", "Year": "Year", "Ban Status": "Ban Status"},
        title="Banned Books by Year",
        barmode="group"
    )

    st.plotly_chart(fig)

def by_origin_of_challenge_bar_chart(data: pd.DataFrame):
    origin_status_counts = (
        data
        .groupby(["Origin of Challenge", "Ban Status"])
        ["Title"].nunique()
        .reset_index(name="Titles")
    )

    fig = px.bar(
        origin_status_counts,
        x="Origin of Challenge",
        y="Titles",
        color="Ban Status",
        color_discrete_map=ban_colors,
        labels={"Titles": "Titles", "Origin of Challenge": "Origin of Challenge", "Ban Status": "Ban Status"},
        title="Banned Books by Origin of Challenge",
        barmode="group"
    )

    st.plotly_chart(fig)

def top_5_banned_titles(data: pd.DataFrame):
    filtered_data = data[(data["Ban Status"] == "banned") | (data["Ban Status"] == "banned from libraries and classrooms")]
    title_counts = filtered_data.groupby(["Title", "Author"]).size().reset_index(name="Ban Count")
    top_titles = title_counts.sort_values(by="Ban Count", ascending=False).head(5)
    return top_titles[["Title", "Author", "Ban Count"]]

def top_5_challenged_titles(data: pd.DataFrame):
    filtered_data = data[(data["Ban Status"] == "banned by restriction") | (data["Ban Status"] == "banned pending investigation")]
    title_counts = filtered_data.groupby(["Title", "Author"]).size().reset_index(name="Ban Count")
    top_titles = title_counts.sort_values(by="Ban Count", ascending=False).head(5)
    return top_titles[["Title", "Author", "Ban Count"]]

def top_5_banned_authors(data: pd.DataFrame):
    filtered_data = data[(data["Ban Status"] == "banned") | (data["Ban Status"] == "banned from libraries and classrooms")]
    author_counts = filtered_data.groupby(["Author"]).size().reset_index(name="Count")
    top_authors = author_counts.sort_values(by="Count", ascending=False).head(5)
    return top_authors[["Author", "Count"]]

def top_5_challenged_authors(data: pd.DataFrame):
    filtered_data = data[(data["Ban Status"] == "banned by restriction") | (data["Ban Status"] == "banned pending investigation")]
    author_counts = filtered_data.groupby(["Author"]).size().reset_index(name="Count")
    top_authors = author_counts.sort_values(by="Count", ascending=False).head(5)
    return top_authors[["Author", "Count"]]

def display_data(data: pd.DataFrame):
    st.title("Overview of Banned Books in the US (2021-2024)")

    st.info('Data is pulled from https://pen.org/', icon="ℹ️")

    cols1 = st.columns([0.3, 0.7], vertical_alignment="center")

    unique_titles = data["Title"].nunique()
    cols1[0].markdown(f"<p style='text-align: center; font-size: 2.5rem; font-weight: bold;'>{unique_titles}</p>", unsafe_allow_html=True)
    cols1[0].markdown("<p style='text-align: center;'>books are banned between 2021 and 2024</p>", unsafe_allow_html=True)

    unique_states = data["State"].nunique()
    cols1[0].markdown(f"<p style='text-align: center; font-size: 2.1rem; font-weight: bold;'>{unique_states}</p>", unsafe_allow_html=True)
    cols1[0].markdown("<p style='text-align: center;'>states are involved</p>", unsafe_allow_html=True)

    unique_districts = data["District"].nunique()
    cols1[0].markdown(f"<p style='text-align: center; font-size: 2.1rem; font-weight: bold;'>{unique_districts}</p>", unsafe_allow_html=True)
    cols1[0].markdown("<p style='text-align: center;'>districts are involved</p>", unsafe_allow_html=True)

    with cols1[1]:
        by_year_bar_chart(data)

    by_origin_of_challenge_bar_chart(data)

    cols2 = st.columns([0.5, 0.5], vertical_alignment="center")

    with cols2[0]:
        st.subheader("Top 5 Most Banned Titles")
        top_titles = top_5_banned_titles(data)
        ranked_titles = rank_dataframe(top_titles, rank_column_name="Rank")
        st.dataframe(ranked_titles.set_index("Rank"))

    with cols2[1]:
        st.subheader("Top 5 Most Banned Authors")
        top_authors = top_5_banned_authors(data)
        ranked_authors = rank_dataframe(top_authors, rank_column_name="Rank")
        st.dataframe(ranked_authors.set_index("Rank"))

    cols3 = st.columns([0.5, 0.5], vertical_alignment="center")

    with cols3[0]:
        st.subheader("Top 5 Most Challenged Titles")
        top_challenged_titles = top_5_challenged_titles(data)
        ranked_challenged_titles = rank_dataframe(top_challenged_titles, rank_column_name="Rank")
        st.dataframe(ranked_challenged_titles.set_index("Rank"))

    with cols3[1]:
        st.subheader("Top 5 Most Challenged Authors")
        top_challenged_authors = top_5_challenged_authors(data)
        ranked_challenged_authors = rank_dataframe(top_challenged_authors, rank_column_name="Rank")
        st.dataframe(ranked_challenged_authors.set_index("Rank"))

display_data(load_data(DATA_URL))
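The aggregation behind `by_year_bar_chart` in home.py counts unique titles per (Year, Ban Status) pair, so repeated bans of the same title within a year do not inflate the bar. It can be checked on a tiny frame (the sample rows below are made up):

```python
import pandas as pd

data = pd.DataFrame({
    "Year": [2022, 2022, 2022, 2023],
    "Ban Status": ["banned", "banned", "banned", "banned"],
    "Title": ["A", "A", "B", "A"],
})

# Same shape as the chart's input: nunique() counts each title once
# per group, then reset_index names the count column.
year_status_counts = (
    data
    .groupby(["Year", "Ban Status"])
    ["Title"].nunique()
    .reset_index(name="Titles")
)
# 2022/banned has titles {A, B} -> 2; 2023/banned has {A} -> 1
```

Had the code used `.size()` instead of `["Title"].nunique()`, the 2022 bar would read 3, counting title A twice.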

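`rank_dataframe` is imported from the app's `utils` module, which is not part of this diff. A plausible minimal version, assuming it only prepends a 1-based rank column reflecting row order (a guess at the helper, not its actual source):

```python
import pandas as pd

def rank_dataframe(df: pd.DataFrame, rank_column_name: str = "Rank") -> pd.DataFrame:
    # Assumed behavior: add a 1-based rank column in row order, so
    # home.py can call .set_index("Rank") on the result for display.
    ranked = df.reset_index(drop=True).copy()
    ranked.insert(0, rank_column_name, range(1, len(ranked) + 1))
    return ranked

top = pd.DataFrame({"Author": ["X", "Y"], "Count": [9, 7]})
ranked = rank_dataframe(top)
```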
VincentLeV/app/requirements.txt

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
python-dotenv
pandas
streamlit
plotly
requests
beautifulsoup4
apache-airflow
apache-airflow-providers-common-sql
apache-airflow-providers-postgres
apache-airflow-providers-standard
