final report intro
INTRODUCTION
This project focuses on analysing and predicting forest fires in Algeria using data science
and machine learning techniques. The study involves collecting meteorological and
environmental data, performing exploratory data analysis (EDA), and developing
predictive models to assess fire risk.
Project Objective
Analyse historical forest fire data to identify trends and key contributing factors.
Develop a predictive model using machine learning techniques to assess fire risks.
Target Audience
Key Features
Paarsh Infotech Pvt Ltd provides more than website design and Software
development. Your business, web presence, and brand identity will be taken to the
next level. We also offer solutions for corporate Web Designing, Web Application
Development, Mobile Application Development, Software Development, Digital
Marketing, Software Testing, and many more.
Company Service:
Paarsh Infotech Pvt Ltd provides services such as web development, application
development, AI & ML, software development, and online corporate training.
Existing System
Satellite data (e.g., MODIS, VIIRS) to identify hotspots and burned areas after a
fire starts.
Weather stations that collect temperature, humidity, wind speed, and rainfall data.
Delayed response times, as detection often occurs after the fire has grown.
There is a growing need for a data-driven, predictive system to detect and prevent forest
fires in Algeria due to:
Delays in manual detection which hinder early response and containment efforts.
Availability of historical fire data and weather parameters, which can be used to
train machine learning models for accurate predictions.
Improve early warning and response, reducing damage and firefighting costs.
Support decision-makers with actionable insights for fire prevention and resource
allocation.
The scope of this project focuses on the development of a data-driven system to analyze,
predict, and visualize forest fire risks in Algeria using machine learning techniques. The
project encompasses the following key components:
Model Development
Model Optimization
o Connect with APIs for live weather data to enable dynamic predictions.
Software Requirement
Purpose                    Software
Programming Environment    Jupyter Notebook, Google Colab
Database                   MySQL
Hosting                    AWS
Development Tools          GitHub
PROPOSED SYSTEM
The primary purpose of this project is to analyze and predict forest fires in Algeria using
data science techniques. By leveraging machine learning models and historical
environmental data, the project aims to:
1. Problem Definition
2. Data Collection
Gather historical forest fire data (e.g., Algerian Forest Fire dataset).
3. Data Preprocessing
6. Model Evaluation
7. Prediction System
Use the trained model to predict fire occurrence based on new or test inputs.
8. Visualization (Optional)
Python is the core language used for data analysis and machine learning due to its
simplicity, extensive libraries, and community support.
a. Pandas
Helpful in loading datasets and performing operations like filtering, grouping, and
merging.
b. NumPy
Used for numerical computations and efficient handling of arrays and matrices.
d. Scikit-learn
Used for:
3. Jupyter Notebook
An interactive coding environment used for developing and presenting the entire
project.
Used to track changes in code, collaborate, and manage versions of the project.
Google Colab allows free GPU/TPU access and easy cloud-based execution.
As a data science intern, I actively contributed to all key phases of the Algerian Forest
Fire Detection and Prediction project. My responsibilities included:
Handled missing values and ensured the dataset was formatted correctly for
analysis.
Evaluated model performance using metrics like accuracy, precision, recall, and
F1-score.
Prepared visuals and findings to be included in the final project report and
presentation.
5. Team Collaboration
Contributed ideas and suggestions for improving model accuracy and data
handling
1.User
1.Data Collection
Inputs:
Functions:
2. Data Preprocessing
Functions:
3. Model Training
Functions:
Functions:
Functions:
1. Problem Definition – Define the goal: predict forest fires based on weather
data.
2. Data Collection – Gather data from weather APIs and historical fire records.
3. Data Preprocessing – Clean and prepare the data (handle missing values,
encode, normalize).
4. Exploratory Data Analysis (EDA) – Visualize and analyze data patterns and
feature relationships.
5. Model Building – Train ML models (e.g., Random Forest, SVM) on the data.
6. Model Evaluation – Test model accuracy using metrics like precision and
recall.
7. Conclusion / Recommendations – Summarize findings and suggest actions or
improvements.
3.3 ER Diagram
1. Initial Dataset
o Temperature
o Wind speed
o Humidity
o Rainfall
2. Cleaned Dataset
o Removing outliers
o Normalizing/scaling data
It reduces the number of features while keeping most of the variance in the data.
Helps improve:
o Model performance
o Training time
o Visualization (optional)
4. Cross Validation
o Each model is trained K times using a different fold for testing and the rest
for training
Goal: Ensure that the model does not overfit and is robust.
This stage tests multiple algorithms on the PCA-transformed dataset using cross-
validation:
LR (Logistic Regression):
o Good for high-dimensional data; finds the best margin between classes.
RF (Random Forest):
Each of these models is evaluated using the same validation technique to compare
performance.
6. Model
This final model is what will be used in the prediction system to output the fire
risk level.
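The PCA, cross-validation, and model comparison stages described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the data is a synthetic stand-in generated with make_classification, and the 95% variance threshold, the 5 folds, and the model settings are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned dataset (assumption, not the project data).
X, y = make_classification(
    n_samples=240, n_features=10, n_informative=4, random_state=0
)

# Candidate models; hyperparameters are illustrative defaults.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Scaling and PCA live inside the pipeline, so each fold fits them
    # on its own training part only.
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=0.95, svd_solver="full"),  # keep 95% of the variance
        model,
    )
    scores = cross_val_score(pipeline, X, y, cv=5)  # 5-fold cross-validation
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# The model with the highest mean score would be carried forward.
best_name = max(results, key=results.get)
print("Best model:", best_name)
```

Keeping the scaler and PCA inside the pipeline is a design choice: it prevents information from the held-out fold leaking into the transformation.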
The dataset includes 244 instances covering two regions of Algeria: the Bejaia
region in the northeast and the Sidi Bel-Abbes region in the northwest, with 122
instances for each region. The data covers the period from June 2012 to
September 2012. The dataset includes 11 attributes and 1 output attribute
(class). The 244 instances are classified as fire (138 instances) or not fire
(106 instances).
Attribute Information:
First, import the dataset and read it using pandas. The head() method returns the
first five rows, which helps in inspecting the columns present in the dataset.
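As a rough sketch of this step, the snippet below reads a CSV with pandas and previews it with head(). The three rows are a hypothetical sample embedded in the code so that the example is self-contained; the real project would read the dataset file instead.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real dataset has 244 rows.
csv_text = """day,month,year,Temperature,RH,Ws,Rain,Classes
1,6,2012,29,57,18,0.0,not fire
2,6,2012,29,61,13,1.3,not fire
3,6,2012,26,82,22,13.1,not fire
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows (fewer if the frame is shorter).
preview = dataset.head()
print(preview)
```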
This code snippet imports several essential Python libraries commonly used in data
analysis and visualization. import pandas as pd brings in the pandas library, which is used
for handling structured data (like tables) and provides powerful tools for data
manipulation and analysis. import numpy as np imports NumPy, a library that offers
support for large, multi-dimensional arrays and matrices, along with a collection of
mathematical functions to operate on them efficiently. import matplotlib.pyplot as plt
brings in matplotlib's plotting interface, allowing users to create static, interactive, and
animated visualizations in Python. import seaborn as sns imports seaborn, which is built
on top of matplotlib and provides a high-level interface for creating attractive and
informative statistical graphics. Lastly, %matplotlib inline is a magic command used in
Jupyter Notebooks that ensures that all plots generated by matplotlib will be displayed
directly within the notebook output cells, making it easier to view and interpret results
during exploratory data analysis.
The info() method provides a concise summary of the DataFrame: the number of
entries (range index), the number of columns, column labels, column data types,
the non-null value count in each column, and memory usage.
Data Cleaning: This line returns a new DataFrame containing only the rows of
dataset that have at least one missing (NaN) value. It is commonly used during
data cleaning to identify and possibly handle incomplete data.
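A minimal sketch of this kind of filter is shown below. The small DataFrame and its values are made up for illustration; only the isnull().any(axis=1) pattern is the point.

```python
import numpy as np
import pandas as pd

# Made-up sample with two incomplete rows.
dataset = pd.DataFrame({
    "Temperature": [29, 31, np.nan, 35],
    "RH": [57, np.nan, 61, 40],
    "Rain": [0.0, 0.2, 1.3, 0.0],
})

# isnull() marks missing cells; any(axis=1) flags rows with at least one.
rows_with_nan = dataset[dataset.isnull().any(axis=1)]
print(rows_with_nan)
```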
This code assigns values 0 and 1 to the "Region" column of a DataFrame based on
row indices, likely to represent two regions (e.g., coastal and inland) in a dataset
like Algerian forest fires. The result is stored in a variable df, which is now a
reference to the updated dataset
This line ensures that all values in the "Region" column of df are stored as
integers, which is useful for machine learning models, plotting, or logical
conditions that require numeric or categorical values in integer format
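The region labelling and integer conversion described above might look like the sketch below. The six-row frame and the split point of 3 are assumptions for illustration; in the project the split would follow the 122 rows recorded per region.

```python
import pandas as pd

# Made-up six-row frame standing in for the full dataset.
dataset = pd.DataFrame({"Temperature": [29, 30, 33, 35, 36, 31]})

split = 3  # hypothetical boundary between the two regions

# .loc slices by label and includes the end label, hence split - 1.
dataset.loc[: split - 1, "Region"] = 0
dataset.loc[split:, "Region"] = 1

# df is another name for the same object, not a copy.
df = dataset

# The new column starts out as float; store it as integers.
df["Region"] = df["Region"].astype(int)
print(df)
```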
This line cleans the DataFrame df by removing all rows with missing values and
then reindexing the rows to have a clean, continuous index. This is commonly
done before analysis or modeling to ensure data quality.
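A small sketch of this cleaning step, using made-up values:

```python
import numpy as np
import pandas as pd

# Made-up sample with two incomplete rows.
df = pd.DataFrame({
    "Temperature": [29, np.nan, 33, 35],
    "RH": [57, 61, np.nan, 40],
})

# dropna() removes rows with any missing value;
# reset_index(drop=True) renumbers the remaining rows from 0.
df = df.dropna().reset_index(drop=True)
print(df)
```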
df.iloc is used to access and manipulate data within a DataFrame using integer-
based indexing.
Drop the day, month, and year columns and encode the categories in the Classes
column. This exploratory step helps to understand the main characteristics of the
dataset and uncover patterns or relationships within it.
Plot density plots for all features: This code block applies the Seaborn style, then
creates and displays a set of histograms, one for each numeric column in
df_copy, using 50 bins and a large figure size to ensure clarity. It is typically used
in exploratory data analysis (EDA) to understand the distribution, skewness,
and outliers in the dataset.
1. Temperature:
2. RH (Relative Humidity):
3. Ws (Wind Speed):
Shows most wind speeds were moderate, with few very high or low wind
speeds.
4. Rain:
Extremely right-skewed.
Indicates that fine fuels (like leaves, grasses) were typically very dry—high
fire risk.
Also right-skewed.
Most values are concentrated at the lower end (0–20), meaning moderate
moisture content in duff layers (decomposed leaves/soil).
7. DC (Drought Code):
Right-skewed.
A few days had much drier conditions (tail extending beyond 100).
Indicates low to moderate fire spread potential on most days, but a few high-
risk days exist.
Also right-skewed.
Most values are low, meaning limited build-up of combustible material, but
some high outliers.
This code generates a pie chart that visually shows the proportion of days (or records) that
had a forest fire versus no fire, with labels and percentage values clearly displayed. It's a
good way to understand class imbalance, which is important in classification problems
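A pie chart of this kind could be produced roughly as follows. The class counts (138 fire, 106 not fire) are the ones stated in the dataset description; the figure size, title, and the non-interactive Agg backend are assumptions made so that the sketch runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Class labels rebuilt from the counts given in the dataset description.
classes = pd.Series(["fire"] * 138 + ["not fire"] * 106, name="Classes")

# Share of each class as a percentage of all records.
percentage = classes.value_counts(normalize=True) * 100
print(percentage.round(1))

fig, ax = plt.subplots(figsize=(7, 7))
ax.pie(percentage, labels=percentage.index, autopct="%1.1f%%")
ax.set_title("Share of fire vs. not fire records")
plt.close(fig)
```

With these counts the split is roughly 57% to 43%, so the classes are only mildly imbalanced.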
Correlation:
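The corr() method is used to find the pairwise correlation of all columns in a pandas
DataFrame. Any NaN values are automatically excluded. To ignore non-numeric
values, restrict the calculation to numeric columns (for example with
numeric_only=True in recent pandas versions).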
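As an illustration, the sketch below computes a correlation matrix on a made-up frame that also contains a text column. Selecting numeric columns first is one way to keep non-numeric values out of the calculation.

```python
import pandas as pd

# Made-up sample: temperature rises while humidity falls.
df = pd.DataFrame({
    "Temperature": [29, 31, 33, 35, 36],
    "RH": [70, 62, 55, 48, 41],
    "Classes": ["not fire", "not fire", "fire", "fire", "fire"],
})

# Keep numeric columns only, then compute pairwise Pearson correlation.
corr_matrix = df.select_dtypes(include="number").corr()
print(corr_matrix)
```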
It is observed that August and September had the highest number of forest fires in
both regions. From the monthly plot above, we can also see that most of the fires
happened in August, that very high fire counts occurred in only three months
(June, July, and August), and that fewer fires occurred in September.
0 = no correlation
sns.boxplot(...):
Shows:
plt.subplots(figsize=(13, 6))
Initializes a new figure for plotting with a width of 13 inches and height of 6
inches.
This code generates a grouped bar plot showing how fire and non-fire incidents vary
across months, styled with a clean grid and labeled axes/title. It’s useful for identifying
seasonal trends in fire occurrences.
This chart shows the number of occurrences per month (x='month') and splits the
bars by the 'Classes' column using different colors (hue='Classes'), likely
representing fire (Fire) and no fire (Not Fire) events.
The y-axis is labeled "Number of Fires" and the x-axis "Months", both with bold
font (weight='bold') to enhance visibility.
While the goal of the code is to show monthly fire patterns for Bejaia, the actual
plot includes data from all regions unless the data parameter is corrected.
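One way to correct this is to filter the frame to a single region before plotting, as sketched below. The sample rows are made up, and the assumption that Region 0 encodes Bejaia is mine. The monthly counts are computed with a cross-tabulation here; the filtered frame could equally be passed as the data argument of the plotting call.

```python
import pandas as pd

# Made-up sample covering both regions.
df = pd.DataFrame({
    "month": [6, 6, 7, 8, 8, 8],
    "Region": [0, 0, 0, 1, 0, 1],
    "Classes": ["not fire", "fire", "fire", "fire", "fire", "not fire"],
})

# Assumption: Region == 0 encodes Bejaia.
bejaia = df[df["Region"] == 0]

# Records per month and class, for the selected region only.
monthly_counts = pd.crosstab(bejaia["month"], bejaia["Classes"])
print(monthly_counts)
```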
Model Training:
Model training is the process of teaching a machine learning model to learn from data and make accurate
predictions on new, unseen data.
They've already been used to extract useful information (e.g., seasonal trends,
month-wise plots).
They're not needed anymore for model training or further analysis.
You want to reduce dimensionality by removing irrelevant or redundant columns.
You’ve already encoded or grouped them in another way (like converting to
datetime or grouping by season).
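Dropping the date columns can be sketched as below, on a made-up two-row frame:

```python
import pandas as pd

# Made-up sample with the three date columns and two features.
df = pd.DataFrame({
    "day": [1, 2],
    "month": [6, 6],
    "year": [2012, 2012],
    "Temperature": [29, 29],
    "RH": [57, 61],
})

# Remove the date columns; drop() returns a new frame.
df = df.drop(columns=["day", "month", "year"])
print(df.columns.tolist())
```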
Machine learning models require numeric inputs – they can't handle string
labels directly.
Converts a binary classification ("fire" vs "not fire") into numeric binary
labels (1 for fire, 0 for no fire).
Keeps the dataset consistent and clean for analysis, modeling, and visualization.
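A possible sketch of this encoding is shown below. The sample labels, including the trailing spaces, are made up to show that a substring match is tolerant of stray whitespace; the project's own encoding code may differ.

```python
import numpy as np
import pandas as pd

# Made-up labels, some with trailing whitespace.
df = pd.DataFrame({"Classes": ["not fire   ", "fire   ", "fire", "not fire"]})

# Rows containing "not fire" become 0, every other row becomes 1.
df["Classes"] = np.where(df["Classes"].str.contains("not fire"), 0, 1)
print(df["Classes"].tolist())
```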
This setup allows you to train a model to predict the FWI based on other weather and
environmental conditions.
test_size=0.25: Specifies that 25% of the data should be allocated to the test set
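For example, with 20 samples a 25% test split leaves 15 rows for training and 5 for testing. The arrays and the random_state value below are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples with 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```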
This function detects and returns the set of feature names (columns) in the dataset that are
highly correlated (positively or negatively) above the given threshold. This is
useful in data preprocessing to remove multicollinearity.
You're removing columns (features) that are highly correlated with others (correlation
coefficient > 0.85) from both the training (X_train) and testing (X_test) datasets. This
helps to:
Reduce overfitting.
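A helper of the kind described above, together with the column removal, might be written as follows. The function name correlation, the synthetic columns a, b, and c, and the reuse of training rows as a test frame are assumptions for illustration; only the 0.85 threshold comes from the text.

```python
import numpy as np
import pandas as pd


def correlation(dataset, threshold):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # abs() catches strong negative correlation as well.
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr


# Synthetic features: "b" is almost a copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
X_train = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=50),
    "c": rng.normal(size=50),
})
X_test = X_train.iloc[:10].copy()  # stand-in test frame with the same columns

# Find the redundant columns on the training data only, then drop from both.
corr_features = correlation(X_train, 0.85)
X_train = X_train.drop(columns=list(corr_features))
X_test = X_test.drop(columns=list(corr_features))
print(corr_features, X_train.columns.tolist())
```

Computing the correlated set from the training data alone, and then applying the same removal to the test data, keeps the test set from influencing feature selection.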
Distance-based models like k-NN, SVM, and clustering give misleading results
without scaling.
This code draws box plots to visually compare the distribution of features before and
after scaling, using matplotlib and seaborn:
Before scaling: raw feature values (could be on different scales)
After scaling: all features should have similar distributions (mean = 0, std = 1)
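The effect of standardization can be checked numerically, as in the sketch below. The feature values are made up; the scaler is fitted on the training data and then reused unchanged on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on different scales.
X_train = np.array([[20.0, 30.0], [25.0, 50.0], [30.0, 70.0], [35.0, 90.0]])
X_test = np.array([[27.5, 60.0]])  # equals the training means

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```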
r2_score: Tells how well the model explains the variance in the target variable.
Lasso Regression:
Fits multiple Lasso models with different alpha values using cross-validation.
Selects the best model and trains it on the full training data.
Predicts target values on the test set using the trained model with the best alpha
value.
Using LassoCV not only regularizes the model to prevent overfitting but also
automates the selection of the best alpha, making it more robust and reliable
than manually tuning the Lasso model.
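A minimal LassoCV sketch along these lines is given below. The data is synthetic (two informative features out of five), and the choice of 5 folds is an assumption; the project's own features and settings are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic regression data: only the first two features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=120)
X_train, X_test, y_train, y_test = X[:90], X[90:], y[:90], y[90:]

# Try a path of alpha values with 5-fold cross-validation, then refit
# on the full training set with the best alpha.
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)

y_pred = lasso_cv.predict(X_test)
print("best alpha:", lasso_cv.alpha_)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))
```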
r2_score: Measures how well predictions approximate the actual target values.
Trains the model using the scaled training features and target values.
Learns the best weights (coefficients) that minimize squared error plus L2 penalty.
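The Ridge training step can be sketched as follows, again on synthetic data. The alpha value of 1.0 is scikit-learn's default and is an assumption here, not necessarily the value used in the project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the project features.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + 0.1 * rng.normal(size=120)
X_train, X_test, y_train, y_test = X[:90], X[90:], y[:90], y[90:]

# Scale with statistics learned from the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge minimizes squared error plus alpha times the squared L2 norm
# of the coefficients.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

y_pred = ridge.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
score = r2_score(y_test, y_pred)
print("MAE:", mae, "R2:", score)
```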
The best possible R² score is 1.0, and it can be negative (because the model can be
arbitrarily worse). A constant model that always predicts the expected value of y,
disregarding the input features, would get a score of 0.0.
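These three cases can be verified on a tiny made-up example. The score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of y.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions give the best possible score.
perfect = r2_score(y_true, y_true)

# Always predicting the mean of y gives a score of zero.
constant = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Reversed predictions are worse than the mean, so the score is negative:
# residual sum of squares 20, total sum of squares 5, score 1 - 20/5 = -3.
worse = r2_score(y_true, y_true[::-1])

print(perfect, constant, worse)
```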
ElasticNet Regression:
This code demonstrates how to build and evaluate an ElasticNet model, which is
a regularized regression model that combines L1 (Lasso) and L2 (Ridge)
penalties. Let’s break it down step by step to understand what’s happening.
L1 regularization (Lasso): Encourages sparsity by shrinking some coefficients to
zero, effectively removing some features.
L2 regularization (Ridge): Shrinks the coefficients of all features, but none are
exactly zero.
Fits the ElasticNet model to the scaled training data (X_train_scaled) and the
target values (y_train).
The model will learn the best coefficients to minimize the loss function, which
includes both L1 and L2 penalties.
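A short ElasticNet sketch is given below. The data is synthetic, and alpha=0.1 with l1_ratio=0.5 (an even mix of the L1 and L2 penalties) are illustrative values, not the project's settings. The names X_train_scaled and y_train follow the text.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: only the first two features matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=120)
X_train, X_test, y_train, y_test = X[:90], X[90:], y[:90], y[90:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# alpha sets the overall penalty strength; l1_ratio balances L1 against L2.
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train_scaled, y_train)

y_pred = elastic.predict(X_test_scaled)
score = r2_score(y_test, y_pred)
print("coefficients:", elastic.coef_)
print("R2:", score)
```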
Application.py:
This code sets up a Flask web application to serve a machine learning model
for making predictions. The model is trained using Ridge regression, and the
input data is standardized using StandardScaler. The web application allows
users to input data through a form, and based on the input, the model predicts a
result.
Flask Initialization: Initializes a new Flask web application object.
__name__: Refers to the current module. Flask uses this to determine where to
look for the application's resources.
This renders the index.html template, which will be displayed to the user when
they visit the root URL (/) of the app. This page likely contains a form where
users will input their data.
request.form.get() retrieves the values entered by the user for each feature
(Temperature, RH, Ws, etc.). These are converted to floats since the form data is
typically in string format.
Data Scaling: The StandardScaler is used to scale the input data ([[Temperature,
RH, Ws, Rain, FFMC, DMC, ISI, Classes, Region]]) to the same scale as the
training data. This ensures the prediction is accurate since the model expects data
to be standardized.
Prediction: The Ridge regression model (ridge_model) then makes a prediction
on the scaled data using predict(). This returns a prediction for the target variable
based on the input features.
This Flask app allows users to input data via a web form, processes the data by
scaling it, and then makes a prediction using a pre-trained Ridge regression
model. The prediction is displayed on the web page, providing an interactive
experience for the user to see how their inputs influence the output.
This code is used to create a web application that allows users to interact with a
machine learning model via a simple web interface. The primary purpose of this
code is to deploy a trained machine learning model (in this case, a Ridge
regression model) into a web-based environment, where users can input new data
and receive predictions in real-time.
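The overall shape of such an application is sketched below. This is a hedged illustration rather than the project's Application.py: the /predictdata route, the inline template (used in place of index.html), and the scaler and model fitted on random stand-in data are all assumptions made so that the example is self-contained. A real deployment would load the saved scaler and Ridge model instead.

```python
import numpy as np
from flask import Flask, render_template_string, request
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Input features, in the order listed in the text.
FEATURES = ["Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "ISI",
            "Classes", "Region"]

# Stand-ins for the saved scaler and model (assumption: random demo data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, len(FEATURES)))
y_demo = X_demo @ rng.uniform(0, 1, size=len(FEATURES))
standard_scaler = StandardScaler().fit(X_demo)
ridge_model = Ridge().fit(standard_scaler.transform(X_demo), y_demo)

# Inline template used here instead of a separate index.html file.
PAGE = """
<form method="post" action="/predictdata">
  {% for name in features %}
  <input name="{{ name }}" placeholder="{{ name }}">
  {% endfor %}
  <button type="submit">Predict</button>
</form>
{% if result is not none %}<p>Prediction: {{ result }}</p>{% endif %}
"""

app = Flask(__name__)


@app.route("/")
def index():
    # Show the empty input form.
    return render_template_string(PAGE, features=FEATURES, result=None)


@app.route("/predictdata", methods=["POST"])  # hypothetical route name
def predict_datapoint():
    # Form values arrive as strings, so convert each one to float.
    values = [float(request.form.get(name)) for name in FEATURES]
    # Scale with the training statistics before predicting.
    scaled = standard_scaler.transform([values])
    result = float(ridge_model.predict(scaled)[0])
    return render_template_string(PAGE, features=FEATURES, result=result)
```

The sketch does no input validation, so a missing or non-numeric field would raise an error; a production version would need to handle that.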
Incomplete Data: In some regions, the historical fire records or environmental data
might be incomplete, leading to gaps in training the predictive model.
Real-time Prediction: The current model relies on historical data, which may not
always reflect current or near-future conditions. Integrating real-time data from
weather stations or satellite feeds could be challenging due to latency or data
access restrictions.
Weather Forecasting Accuracy: The model is heavily reliant on weather data. Any
inaccuracies or delays in weather forecasting could affect the predictions made by
the system.
3. Model Generalization
Satellite and Remote Sensing Limitations: Although satellite data can be helpful, it
may not always capture fires in remote areas accurately or in real time, leading to
delays in detection and prediction.
PROPOSED ENHANCEMENTS
CONCLUSION
This project successfully analyzed forest fire occurrences in Algeria using data science and
machine learning techniques. By leveraging meteorological and environmental datasets, we
identified key factors that contribute to wildfire risks. The study demonstrated that
variables such as temperature, humidity, wind speed, and drought indices significantly
impact fire outbreaks, reinforcing the importance of data-driven fire management strategies.
Machine learning models such as Random Forest, SVM, and Neural Networks were
implemented and evaluated for predictive accuracy. The results indicated that ensemble
models performed better in classifying fire occurrences, providing valuable insights for
early fire detection. The integration of feature engineering, data preprocessing, and model
optimization techniques played a crucial role in improving prediction accuracy.
Furthermore, the study highlights the increasing role of climate change in intensifying
forest fire risks. As temperatures rise and drought conditions persist, the frequency and
severity of wildfires are expected to increase. This underscores the need for continuous
monitoring and the adoption of advanced predictive models to mitigate fire hazards
effectively.
The findings of this research can assist policymakers, environmental agencies, and disaster
management teams in developing proactive strategies for fire prevention and mitigation. By
incorporating real-time weather data and extending the model to include satellite imagery,
future improvements can further enhance prediction accuracy and practical applications.
Overall, this project serves as a foundation for using data-driven approaches in wildfire
management and emphasizes the potential of artificial intelligence in addressing
environmental challenges.
BIBLIOGRAPHY