Machine Learning-Based Air Pollution Prediction Model
Abstract— Air pollution is a critical issue for both public health and the environment. Advance notice of pollution levels is vital, and air quality forecasts can play a crucial role in providing it. To measure pollution levels, experts rely on the air quality index (AQI). For this research, we gathered data on air pollution, specifically fine particles (PM2.5), sulphur dioxide (SO2), nitrogen dioxide (NO2), and carbon monoxide (CO), which were used as the primary dataset. Different machine learning models, including linear regression, lasso regression, random forest regression, and K-nearest neighbor regression, were then employed to analyze the collected data. The mean absolute error (MAE), mean squared error (MSE), root-mean-squared error (RMSE), and accuracy are used to evaluate the performance of the ML models. The comparison between these models is also discussed in this paper. Random forest regression gives the highest accuracy and the lowest RMSE among the models.

Keywords— Air Pollution Prediction, Feature Engineering, Air Quality Index, Machine Learning Models

I. INTRODUCTION

According to the Central Environmental Authority (CEA) in Sri Lanka, air pollution can cause several health effects such as headaches, vomiting, dizziness, and lung and heart diseases, as well as environmental effects such as acid rain. In Sri Lanka, the burning of organic waste products from the agriculture industry and petroleum refining are the main sources of air pollution. Particulate matter with a diameter of 2.5 micrometers or less is reported to be highly harmful to human health [2]. Several different gases cause air pollution, so a single index is needed to train the ML model. For that reason, we calculated the Air Quality Index (AQI) from all the gases we collected. We consider four major pollutants: nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), and the leading pollutant, fine particulate matter (PM2.5). Each pollutant has a sub-index and scales at a different level. The AQI uses a scale from 0 to 500 to evaluate the quality of the air, and it includes six levels: "Severely polluted," "Heavily polluted," "Moderately polluted," "Lightly polluted," "Moderate," and "Good" [3]. The purpose of these levels is to indicate the potential impact on human health and to offer a numeric point of reference for individuals engaging in outdoor activities. With the sub-index of each pollutant calculated, we take the maximum sub-index as the Air Quality Index.

Before the rise of machine learning, air pollution was predicted using probabilistic functions and statistics from previously collected data. With the advancement of machine learning, highly accurate ML models that build on probability functions and provide good predictions have been put into everyday operation. Machine learning has demonstrated remarkable resilience in predicting natural phenomena. Therefore, the application of Machine Learning (ML) techniques to air pollution level prediction is explored in this research.

The prediction of the AQI was carried out in this paper using supervised machine learning algorithms. Supervised learning includes a range of algorithms, such as Random Forest, Naive Bayes, Nearest Neighbor, Kernel SVM, SVM, and Linear Regression [4]. For this work, we selected four regression algorithms: Random Forest, K-Nearest Neighbor, Linear Regression, and Lasso Regression. To evaluate their accuracy, the RMSE, MSE, and MAE were calculated for each of these ML models. Python, Pandas, Matplotlib, and scikit-learn were used to implement these algorithms. The primary objectives of this research are to improve the procedures that have been used to predict AQI, to enhance our knowledge of AQI, and to understand the effects of poor air quality. The output of this research, the Air Quality Index and AQI bucket, lets individuals know how polluted the air is before they approach a city. After the COVID-19 situation and an increase in the number of machines and vehicles that release carbon dioxide into the air, mainly in the Colombo area, the AQI level has consistently exceeded unhealthy levels. Therefore, reducing air pollution in Sri Lanka is the key goal of this paper.

II. LITERATURE REVIEW

This section discusses the literature under two main categories: past related research on air pollution systems, and air pollution recording systems in different countries.

A. Related work on Air Pollution Prediction using Machine Learning

The authors of [5] introduced an air quality monitoring system that included both an assessment module and a forecasting module based on input features such as pollution levels, weather conditions, and chemical components derived from the WRF-Chem model. Using various groups of features and algorithms such as linear regression, SVM, and random forest, the authors conducted experiments on 74 cities in China. Their findings demonstrated that the combined model yielded more accurate predictions than the individual models.

In [6] the authors proposed an air pollution monitoring model for determining air pollutants using a data mining algorithm. The model was developed to measure the presence of CO, NO2, SO2, and O3 gases using a smart sensor micro-converter that downloads the pollutant levels to
Authorized licensed use limited to: Khulna Univ of Engineering & Technology - KUET. Downloaded on October 25,2024 at 08:08:19 UTC from IEEE Xplore. Restrictions apply.
• Pre-processing the data

Measurements are sometimes unavailable because of a lack of measuring equipment, most commonly defective sensors or transmission failures, or because required data points are missing. An algorithm cannot be trained on a dataset with missing values, so feature engineering (data pre-processing) was employed to improve performance. We implemented three main feature engineering techniques: imputation, scaling, and outlier handling. Boxplot visualization with pandas and seaborn was used to find outliers in the dataset. Figures 1 to 5 show the outliers of each feature separately.

• Boxplots with seaborn

Compared to other statistical graph methods, the box plot is particularly effective in visually presenting outliers. Outliers are data points that lie far from the rest of the values in a dataset. By highlighting outliers or unusual data points, the box plot allows us to quickly spot them and examine them more closely.

Fig. 5. Outliers of AQI dataset

C. Model Building

• Air Quality Index

The concentration of various pollutants is used to evaluate the level of air quality through the Air Quality Index (AQI). It reflects the level of air cleanliness or pollution and provides health recommendations depending on the AQI level. To determine the AQI, data on at least two pollutants is required, one of which must be PM2.5. Without PM2.5 data, the data is inadequate for computing the AQI.

We take the AQI value to be the maximum sub-pollutant index after calculating each sub-index. The sub-pollutant index for each pollutant is calculated from the measured 24-hourly average concentration of PM2.5, NO2, SO2, and CO at the monitoring location, with an 8-hourly average used in some cases [8] [11].
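The rule described above (one sub-index per pollutant, AQI as the maximum, PM2.5 mandatory, at least two pollutants) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the concentration edges in `CONC_EDGES` are hypothetical placeholders rather than an official breakpoint table, and linear interpolation within each band is an assumption, since the paper does not state how each sub-index is derived.

```python
# Index bands shared by all pollutants on the 0 to 500 AQI scale.
INDEX_BANDS = [(0, 50), (50, 100), (100, 200), (200, 300), (300, 400), (400, 500)]

# HYPOTHETICAL concentration edges per pollutant (illustrative placeholders,
# not an official breakpoint table). Seven edges define six bands.
CONC_EDGES = {
    "PM2.5": [0, 30, 60, 90, 120, 250, 500],
    "NO2": [0, 40, 80, 180, 280, 400, 800],
    "SO2": [0, 40, 80, 380, 800, 1600, 3200],
    "CO": [0, 1, 2, 10, 17, 34, 68],
}


def sub_index(pollutant, conc):
    """Map an averaged concentration to a sub-index by linear interpolation."""
    if conc < 0:
        raise ValueError("concentration must be non-negative")
    edges = CONC_EDGES[pollutant]
    for (c_lo, c_hi), (i_lo, i_hi) in zip(zip(edges, edges[1:]), INDEX_BANDS):
        if conc <= c_hi:
            return i_lo + (i_hi - i_lo) * (conc - c_lo) / (c_hi - c_lo)
    return 500.0  # above the last edge: cap at the top of the scale


def aqi(readings):
    """Return the AQI as the maximum sub-index, or None if data is inadequate.

    `readings` maps pollutant name to averaged concentration (None = missing).
    """
    available = {p: c for p, c in readings.items() if c is not None}
    # The text requires PM2.5 plus at least one other pollutant.
    if "PM2.5" not in available or len(available) < 2:
        return None
    return max(sub_index(p, c) for p, c in available.items())


print(aqi({"PM2.5": 45, "NO2": 20, "SO2": None, "CO": None}))
print(aqi({"NO2": 20, "SO2": 10}))
```

With these placeholder edges, the first call is dominated by the PM2.5 sub-index, and the second returns `None` because PM2.5 is missing.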
Fig. 1. Outliers of PM2.5 dataset
Fig. 2. Outliers of NO2 dataset
Fig. 3. Outliers of CO dataset
Fig. 4. Outliers of SO2 dataset

• Training Model

Once the data preprocessing and AQI value calculation are completed, the dataset is split into two groups: training and testing [5]. The proposed model's process is described in Figure 6.

Training phase: After analyzing the data in the dataset, the ML system employs the chosen ML algorithm to construct a model represented as a line or curve. In our case, we used 80% of the dataset for training.

Testing phase: The inputs are sent to the system, which is then checked for performance, and the accuracy is evaluated. We used the remaining 20% of the dataset for testing in all cases.

These are the algorithms used for training on the dataset:
• Linear Regression
• Lasso Regression
• K Nearest Neighbors Regression
• Random Forest Regression

The models were trained using regression machine learning algorithms, with default parameters used in each case. The implementation used Python-based libraries including scikit-learn and Pandas, and the PyCharm integrated development environment.

• Model Evaluation

Once the model training phase has been completed, the model is used to predict the AQI value on a preprocessed dataset. The best machine learning algorithm is chosen based on accuracy.
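The training and testing phases described above can be sketched with scikit-learn as follows. This is a hedged illustration only: the data is a synthetic stand-in (the paper's dataset is not reproduced here), the `random_state` values are added for repeatability, and because the paper does not define "accuracy" for a regression model, the sketch reports the R² returned by `score` as an assumed stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the pre-processed dataset (assumption): four
# pollutant columns and an AQI-like target taken as the row-wise maximum.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 300.0, size=(500, 4))  # placeholder PM2.5, NO2, SO2, CO
y = X.max(axis=1)

# 80% of the rows for training, 20% for testing, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The four regressors with library default parameters.
models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "KNN Regression": KNeighborsRegressor(),
    "Random Forest Regression": RandomForestRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": model.score(X_test, y_test),  # assumed stand-in for "accuracy"
    }

for name, m in results.items():
    print(f"{name}: MAE={m['MAE']:.2f} MSE={m['MSE']:.2f} "
          f"RMSE={m['RMSE']:.2f} R2={m['R2']:.4f}")
```

The numbers printed for this synthetic data say nothing about the paper's results; the sketch only shows the shape of the split, fit, predict, and score loop.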
There are three widely used evaluation metrics for regression problems:

• The MAE (Mean Absolute Error) is determined by taking the average of the absolute values of the errors.

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \tilde{y}_i \rvert \tag{1} $$

• The Mean Squared Error (MSE) is the average of all squared errors: it "punishes" greater errors and is frequently helpful in real-world applications.

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 \tag{2} $$

• The RMSE assesses the average deviation of predicted values from actual values by computing the square root of the mean of the squared errors, and it is expressed in the same units as y:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2} \tag{3} $$

Equations 1, 2, and 3 give the mathematical expressions for these metrics, where $y_i$ denotes the observed value of the i-th sample, $\tilde{y}_i$ represents the predicted value of the i-th sample, and $n$ is the total number of observations considered.

There are many research publications on prediction using ML algorithms, but many of them compare SVM, ID3, Random Forest, and regression algorithms using a single pollutant, based on the MSE and r2 parameters. The authors of [14] considered two major pollutants, SO2 and NO2, and used a linear regression model to predict the results. The authors of [1] used five machine learning methods and compared their results using RMSE values only. In this study, by contrast, four major pollutants and four different machine learning algorithms were used for prediction, and three main feature engineering techniques were used for data pre-processing. The results were predicted from the pollutants and then compared and described using four parameters (accuracy, MAE, MSE, and RMSE) to find the best result for each of the ML algorithms.

D. Machine Learning Models

• Linear Regression

Regression is a statistical approach used to model a target value based on independent predictors, often used for forecasting and determining cause-and-effect relationships between variables. The method of regression varies depending on the number and type of independent variables and the relationship between them and the dependent variable. The generic equation for linear regression is given as:

$$ Y = A + B X \tag{4} $$

In the above equation, the predicted value Y is obtained using A, B, and X. A is the intercept, and B is the coefficient derived by the regression model for the predictor X present in the data. The performance of the Linear Regression model can be evaluated through the graph depicted in Figure 7, which shows an accuracy of 89.23%.

• Lasso Regression

Lasso regression is a form of linear regression that uses the "shrinkage" approach, in which the regression coefficients are shrunk toward zero. This technique aids in eliminating insignificant independent variables from the model, resulting in more precise predictions with fewer variables. The Lasso regression model's performance is also noteworthy, with an accuracy of 88.95%, as displayed in Figure 8.

• K Nearest Neighbor (Regressor)

K-Nearest Neighbors (KNN) is one of the most popular and flexible machine learning algorithms, and it is simple to understand and implement. The value of K specifies the number of neighbors considered for making the prediction. Similarity between samples can be measured in various ways, such as by distance. KNN is a feature similarity-based approach that can be applied to both classification and regression problems. The KNN model has shown an accuracy of 92.65% when k=7, as demonstrated in Figure 9.

• Random Forest Regressor (RF)

Random forests (RFs) are a type of ensemble learning method that can be employed for numerous applications, including regression and classification. In the training phase, RFs create multiple decision trees and output either the mode of the classes predicted by the individual trees (for classification) or the average of their predictions (for regression) [16], [17]. The Random Forest model demonstrated an accuracy of 99.87%, as represented in Figure 10, which is considerably higher than the other regression models.

Fig. 6. Architecture of the proposed model
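To make the K-Nearest Neighbor regressor described above concrete, the following minimal sketch predicts a value as the mean target of the k closest training samples. It is an illustration under stated assumptions rather than the library routine used in the paper: Euclidean distance and an unweighted mean are assumed, `knn_predict` is a hypothetical helper name, and the toy data is made up.

```python
import math


def knn_predict(train_X, train_y, query, k=7):
    """Predict a value for `query` as the mean target of its k nearest points."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training samples")
    # Euclidean distance from the query to every training point (assumption).
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    # Unweighted average of the neighbors' target values.
    return sum(y for _, y in nearest) / k


# Toy one-feature data, made up for illustration.
train_X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
train_y = [0.0, 10.0, 20.0, 30.0, 100.0]

print(knn_predict(train_X, train_y, [1.2], k=3))
```

Here the three nearest samples to 1.2 are those at 1.0, 2.0, and 0.0, so the distant sample at 10.0 has no influence on the prediction. A larger k smooths the prediction at the cost of pulling in less similar samples.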
Fig. 7: Linear Regression Predicted vs Actual AQI Value
Fig. 8: Lasso Regression Predicted vs Actual AQI Value
Fig. 9: KNN Regression Predicted vs Actual AQI Value

IV. RESULTS AND DISCUSSION

The research used four machine learning models, namely Linear Regression, Lasso Regression, K-Nearest Neighbor, and Random Forest Regressor, to predict the AQI based on various features. Several metrics, including Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), and accuracy, were employed to evaluate the models' performance. To obtain a reliable regression model, a low RMSE value and high accuracy were considered crucial [15].

According to Table 1, we achieved 99.87% accuracy with the Random Forest Regressor. The RF model gives the highest accuracy and the lowest RMSE value of the four models, which identifies it as the model best suited to this prediction process. Figure 11 compares the RMSE values of the four machine learning models.

Many existing studies have used KNN, Linear Regression, and Random Forest algorithms to achieve improved accuracy. Where machine learning algorithms show low accuracy, this may be attributed to the incompleteness of the training dataset. Missing values and noisy features in the dataset can have a significant impact on the accuracy of the output.

The authors of [8] also used an ML model to predict the AQI. Their random forest regressor was only 80% accurate, whereas our model achieved 99.87% accuracy. We attribute this difference to the three primary feature engineering techniques we applied, which increased the quality of the training dataset and allowed the model to perform well on it.

Fig. 11: RMSE of Machine Learning Models
V. CONCLUSION

The objective of this research is to construct regression models utilizing supervised machine learning algorithms to forecast the air quality index (AQI) by analyzing four different air pollutants. To achieve the best results, three primary feature engineering techniques were implemented. The findings revealed that both the RF Regressor model and the KNN regressor produced satisfactory results, but the random forest regressor provided the highest accuracy and lowest root-mean-square error (RMSE) compared to the KNN regressor. Therefore, this model can be used to predict the AQI and identify the AQI range.

Future research will focus on enhancing this proposed model by developing a deep learning-based prediction model based on the day.

REFERENCES

[1] S. Simu et al., "Air Pollution Prediction using Machine Learning," in 2020 IEEE Bombay Section Signature Conference (IBSSC), 2020: IEEE, pp. 231-236.
[2] S. D. Senevirathne, "Air pollution: a case study of environmental pollution," Transport, vol. 746, no. 1484.6, p. 1484.56, 2003.
[3] M. H. Sowlat, H. Gharibi, M. Yunesian, M. T. Mahmoudi, and S. Lotfi, "A novel, fuzzy-based air quality index (FAQI) for air quality assessment," Atmospheric Environment, vol. 45, no. 12, pp. 2050-2059, 2011.
[4] U. M. Lanjewar and J. J. Shah, "Air Pollution Monitoring & Tracking System Using Mobile Sensors and Analysis of Data Using Data Mining," vol. 2, no. 6 December-2012, p. 23, 2012.
[5] X. Xi et al., "A comprehensive evaluation of air pollution prediction improvement by a machine learning method," in 2015 IEEE international conference on service operations and logistics, and informatics (SOLI), 2015: IEEE, pp. 176-181.
[6] S. Raipure and D. Mehetre, "Wireless sensor network based pollution monitoring system in metropolitan cities," in 2015 International Conference on Communications and Signal Processing (ICCSP), 2015: IEEE, pp. 1835-1838.
[7] M. R. Delavar et al., "A novel method for improving air pollution prediction based on machine learning approaches: a case study applied to the capital city of Tehran," ISPRS International Journal of Geo-Information, vol. 8, no. 2, p. 99, 2019.
[8] R. Adke, S. Bachhav, A. Bambale, and B. Wawre, "Air Pollution Prediction using Machine Learning," 2019.
[9] R. O. Sinnott and Z. Guan, "Prediction of air pollution through machine learning approaches on the cloud," in 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT), 2018: IEEE, pp. 51-60.
[10] O. Ileperuma, "Review of air pollution studies in Sri Lanka," Ceylon Journal of Science, vol. 49, no. 3, pp. 225-238, 2020.
[11] G. C. Khilnani and P. Tiwari, "Air pollution in India and related adverse respiratory health effects: past, present, and future directions," Current opinion in pulmonary medicine, vol. 24, no. 2, pp. 108-116, 2018.
[12] K. He, H. Huo, and Q. Zhang, "Urban air pollution in China: current status, characteristics, and progress," Annual review of energy and the environment, vol. 27, no. 1, pp. 397-431, 2002.
[13] D. L. Robinson, "Air pollution in Australia: Review of costs, sources and potential solutions," Health Promotion Journal of Australia, vol. 16, no. 3, pp. 213-220, 2005.
[14] S. Bali and M. Sengar, "Indian air quality prediction and analysis using machine learning," J Eng Sci, vol. 11, no. 5, pp. 554-557, 2020.
[15] C. Cortes and V. Vapnik, "Support-vector networks," Machine learning, vol. 20, no. 3, pp. 273-297, 1995.
[16] L. Breiman, "Random forests," Machine learning, vol. 45, no. 1, pp. 5-32, 2001.
[17] H. Liu, Q. Li, B. Yan, L. Zhang, and Y. Gu, "Bionic Electronic Nose Based on MOS Sensors Array and Machine Learning Algorithms Used for Wine Properties Detection," (in eng), Sensors (Basel), vol. 19, no. 1, Dec 22 2018, doi: 10.3390/s19010045.