Sms Spam

Les spams sms
Building an Effective Spam Detection System for Mobile Operators in the Democratic Republic of the Congo

1. Introduction

SMS spam and its impact on users and network resources


With the increasing use of mobile devices, the number of text messages sent every day has
grown sharply. According to Statista [1], a company that provides market and consumer data
on a wide range of topics, including digital media and technology, the number of mobile
messages sent worldwide in 2020 reached 3.5 trillion. In parallel, the proliferation of web
pages and social media messaging applications such as WhatsApp, Telegram, Snapchat,
Facebook, and Instagram has expanded the types of messages that phone users can send.
Messages now include not only text but also video and audio, adding depth and richness to
communication [2].
While email messages are commonly used for professional communication, in some regions
like the Democratic Republic of the Congo, people frequently rely on SIM cards provided by
telecommunication providers to access mobile messaging services. The diverse range of
services and interests that mobile messages encompass, from mobile banking to social media
communication, gaming, localization platforms, health, and more, has led to a significant
increase in the volume of messages being sent.
However, among the legitimate messages are others sent with malicious intent: deceiving
individuals into divulging personal information, spreading fake news, coercing money
transfers, issuing threats, or engaging in other harmful activities. Moreover, spam can also
degrade the voice service experience by increasing network traffic.

The relevance of developing a spam detection application for mobile operators in the DRC

In the realm of telecommunications, mobile devices are extensively utilized for sharing various
forms of communication, including texts, emails, and chats through targeted applications.
Specifically, Short Message Service (SMS) plays a crucial role in conveying personal and
professional information via mobile phones [3]. Initially limited to messages of at most 160
characters [4], SMS has evolved to accommodate longer texts and even multimedia content,
originating from other phone users, online applications, or network operators [5].
However, the accessibility and openness of SMS communication have also paved the way for
a subset of individuals to exploit this platform by disseminating deceptive or unwanted
messages. These messages often employ tactics such as false promises of monetary rewards,
requests for payments under false pretenses, or misleading job offers, among others. Such
fraudulent activities contribute to the proliferation of spam messages that can deceive
unsuspecting recipients.
Moreover, beyond the realm of deceptive content lies a more insidious threat posed by
scammers who exploit vulnerabilities within messaging systems to inject malware or install
mobile spyware on users’ devices. Notable instances like SimJacker [6] underscore the potential
risks that extend beyond message content to the very infrastructure of messaging systems.
Additionally, legitimate organizations that utilize network operators for marketing or
advertising purposes inadvertently provide an avenue for scammers to impersonate authorized
entities, further complicating the distinction between official and unofficial messages. This
confluence of challenges underscores the complexity of distinguishing between genuine and
fraudulent messages in the mobile communication landscape [7, 8].
Given the multifaceted nature of spam messages and the associated risks they pose, it is
imperative to devise solutions that aid and protect users from falling victim to malicious
activities. Addressing issues such as false market advertising, security threats, and unwanted
messages necessitates the development of robust tools and mechanisms to facilitate secure and
reliable messaging communication. By proactively tackling these challenges, stakeholders can
foster a safer and more trustworthy mobile communication environment in the Democratic
Republic of the Congo, where Swahili is used alongside French and English.

2. Background and Motivation

SMS Spam Prevalence in the Democratic Republic of Congo (DRC)

The proliferation of mobile communication in the Democratic Republic of Congo (DRC) has
brought both opportunities and challenges. While SMS (Short Message Service) has become a
powerful tool for communication and marketing, it has also led to an increase in unwanted SMS
spam. Here, we explore the prevalence of SMS spam in the DRC, its impact on users, and
potential strategies to mitigate this issue.

Between 2002 and 2007, the DRC's mobile market transformed from an oligopolistic structure
to one characterized by monopolistic competition. New players entered the market, and
established leaders like Vodacom and Airtel significantly influenced the landscape. As a result,
the mobile penetration rate reached 44.6% in 2014, surpassing the African average [9].

Mobile operators in the DRC have leveraged SMS as a communication and marketing channel.
Promotional messages, service alerts, and transaction notifications are routinely delivered via
SMS. However, this widespread use has also opened the door to spam. Users receive unsolicited
messages, often related to commercial offers, contests, or dubious services.

The challenges and impact of SMS spam fall into three groups. First, user experience: SMS
spam disrupts the user experience, causing annoyance and frustration, and legitimate messages
can get lost in the flood of unwanted content. Second, resource drain: spam consumes network
resources, affecting overall system efficiency and potentially increasing costs for operators.
Third, privacy concerns: some spam messages request personal information or promote
fraudulent schemes, posing privacy risks to recipients.

While the DRC faces SMS spam challenges, it is essential to compare its situation with that of
other countries. For instance, in Nigeria [10], each subscriber receives an average of 2.45 spam
messages per day. The majority of these are commercial in nature. However, only a small
fraction of recipients report spam, highlighting the need for more effective regulation and user
empowerment.

Addressing SMS spam in the DRC requires a multi-pronged approach. By combining
regulatory efforts, user awareness, and technological solutions, we can create a cleaner and
more user-friendly mobile communication environment.

Interest

For society, this work contributes to facilitating communication and reducing the impact of
spam messages, which are annoying and stressful for citizens.
Economically, it saves time for business employees by reducing concerns about threats and
unwanted messages, thanks to its filtering functions.
For scientists, this work is a reference for those who wish to gain skills and techniques at the
intersection of mobile network environments and machine learning operations.

3. Data Collection and Preprocessing

Description of the datasets used for training and evaluation

For this paper, we used two data sources: the Kaggle website [11, 12] for English and French,
and a survey for Swahili.
Kaggle is an online platform that hosts one of the largest global communities of data
scientists. With over 536,000 active members spanning 194 countries, it provides tools and
resources for learning, collaborating, and sharing data science work.
Kaggle also hosts a collection of datasets, including several related to spam detection. Two of
them are relevant here:
Spam Mails Dataset: This dataset, containing 5,572 SMS messages, is meticulously labeled as
either "ham" (legitimate) or "spam". Researchers and data enthusiasts can use this dataset to
develop and fine-tune spam detection models. The dataset includes a variety of text messages,
allowing practitioners to explore different patterns and features associated with spam.
Email Spam Classification Dataset: Another resource on Kaggle is a CSV file containing
information about 5,172 emails, categorized as either spam or not spam. Researchers can use
this dataset to build and evaluate spam filters, employing techniques such as Naive Bayes or
other machine learning algorithms.
We used the first dataset, in its English and French versions. The distribution of the raw data
is shown in Table 1.

| Language | Ham | Spam | Total |
|---|---|---|---|
| English | 4825 | 747 | 5572 |
| French | 4825 | 747 | 5572 |

Table 1. Raw data counts before any wrangling (Kaggle)

For Swahili, the survey yielded the data in Table 2:

| Language | Ham | Spam | Total |
|---|---|---|---|
| Swahili | 117 | 15 | 132 |

Table 2. Raw data counts before any wrangling (Swahili)

Data preparation

The data were prepared in five steps:

• Check the columns of the Kaggle dataset. There are five: labels, text, text_hi, text_de and
text_fr. The labels column contains spam or ham; text, text_hi, text_de and text_fr contain
the messages in English, Hindi, German and French respectively;
• Extract the relevant columns. We keep two pairs, (labels, text) and (labels, text_fr),
because our study focuses on three languages, two of which (English and French) are in the
dataset;
• Count the raw data (see Table 1);
• Check for null values;
• Drop duplicated values.
The same process was applied to the Swahili data. The results are shown in Table 3, Figure 1
and Figure 2.

| Language | Ham | Spam | Total |
|---|---|---|---|
| English | 4516 | 641 | 5157 |
| French | 4494 | 640 | 5134 |
| Swahili | 99 | 14 | 113 |

Table 3. Clean data counts after preprocessing
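
Table 3 shows that ham heavily outnumbers spam in all three languages. As a rough aid for reading any accuracy figure obtained on such data, the sketch below computes, from the counts in Table 3 only, the spam share and the accuracy that a trivial classifier would reach by always answering "ham". The function name `class_balance` is ours and is not part of the original project.

```python
# Clean counts copied from Table 3: language -> (ham, spam).
CLEAN_COUNTS = {
    "English": (4516, 641),
    "French": (4494, 640),
    "Swahili": (99, 14),
}


def class_balance(ham: int, spam: int) -> dict:
    """Return the spam share and the accuracy of an 'always ham' baseline."""
    total = ham + spam
    return {
        "total": total,
        "spam_share": spam / total,
        "majority_baseline_accuracy": ham / total,
    }


if __name__ == "__main__":
    for language, (ham, spam) in CLEAN_COUNTS.items():
        stats = class_balance(ham, spam)
        print(
            f"{language}: total={stats['total']}, "
            f"spam share={stats['spam_share']:.1%}, "
            f"always-ham accuracy={stats['majority_baseline_accuracy']:.1%}"
        )
```

On the full clean sets, this baseline is close to 88% for each language, so accuracy values near that level are best read together with a second metric such as AUC.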

Figure 1. Data: Raw vs Clean (bar chart of raw and clean message counts for English, French and Swahili; vertical axis from 0 to 6000)

Figure 2. Clean data: Ham vs Spam (bar chart of ham and spam counts for English, French and Swahili; vertical axis from 0 to 6000)
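
The five preparation steps described in this section can be sketched with pandas as follows. This is a minimal illustration and not the authors' script: the helper names `prepare_language` and `count_labels` are hypothetical, a tiny in-memory frame stands in for the Kaggle CSV (whose file name we do not assume), and dropping duplicates on the (label, message) pair is our assumption about how duplicates were defined. The column names `labels`, `text` and `text_fr` are the ones listed above.

```python
import pandas as pd


def prepare_language(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    """Keep (labels, message) pairs for one language, then drop nulls and duplicates."""
    pairs = df[["labels", text_column]].rename(columns={text_column: "message"})
    pairs = pairs.dropna(subset=["labels", "message"])  # remove rows with null values
    pairs = pairs.drop_duplicates()                     # remove repeated (label, message) rows
    return pairs.reset_index(drop=True)


def count_labels(df: pd.DataFrame) -> dict:
    """Count ham and spam rows, in the layout of Tables 1 to 3."""
    counts = df["labels"].value_counts()
    return {
        "ham": int(counts.get("ham", 0)),
        "spam": int(counts.get("spam", 0)),
        "total": int(len(df)),
    }


if __name__ == "__main__":
    # Toy stand-in for the multilingual dataset (illustrative rows only).
    raw = pd.DataFrame(
        {
            "labels": ["ham", "spam", "ham", "ham", "spam"],
            "text": ["See you at 5", "WIN cash now", "See you at 5", None, "WIN cash now"],
            "text_fr": [
                "On se voit à 17h",
                "GAGNEZ de l'argent",
                "On se voit à 17h",
                "Bonjour",
                "GAGNEZ de l'argent",
            ],
        }
    )
    for column in ("text", "text_fr"):
        print(column, "raw:", count_labels(raw), "clean:", count_labels(prepare_language(raw, column)))
```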

4. Tools and Feature Engineering

Tools and frameworks

Completing this project required various dependencies [13], tools, and frameworks. They fall
into two main areas: those integral to the core functionality (back-end) and those relevant to
the user side (front-end). The project also relied on tools for data analysis and predictive
modeling. These tools are organized in Table 4.
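
| Category | Tool | Role |
|---|---|---|
| Data analysis and Machine Learning libraries | NumPy | Scientific computing library |
| Data analysis and Machine Learning libraries | Pandas | Data manipulation and analysis library |
| Data analysis and Machine Learning libraries | Matplotlib | Python library used for 2D and 3D data visualization |
| Data analysis and Machine Learning libraries | Seaborn | Another Python library for statistical data visualization, built on Matplotlib |
| Data analysis and Machine Learning libraries | Scikit-learn | Machine learning library |
| Programming Languages (Front-end and Back-end) | Python (Django) | Python framework for developing web applications and APIs |
| Programming Languages (Front-end and Back-end) | HTML | Builds web pages by providing their structure and content |
| Programming Languages (Front-end and Back-end) | CSS | Styling the content |
| Programming Languages (Front-end and Back-end) | JS | Making web pages interactive and dynamic |

Table 4. Structuring the tools [13]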

Feature engineering

Feature engineering plays a crucial role in spam detection by selecting and creating relevant
features from the available data to enhance the performance of the spam detection model. In
the context of spam detection, features can include elements such as the frequency of specific
words in the message text, the presence of commonly associated spam keywords, the message
length, the presence of hyperlinks, the time of message sending, and the sender’s domain. By
utilizing these features, a machine learning model can be trained to more effectively identify
spam messages. The process of feature engineering thus improves the model’s ability to detect
spam by providing it with informative and discriminative information.
In this paper, we work with two elements: the labels column of the dataset, which is the
prediction target, and the word frequencies in each message, which are the features. Both are
handled with the scikit-learn (sklearn) Python library, specifically its LabelEncoder and
TfidfVectorizer classes [13].
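
To make the role of these two classes concrete, here is a minimal sketch using scikit-learn. The four messages are invented for illustration and do not come from the datasets, and the vectorizer is left at its default settings, which may differ from those used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Illustrative messages and labels (not taken from the datasets above).
messages = [
    "Congratulations, you won a cash prize, call now",
    "Are we still meeting for lunch today?",
    "Send your PIN to claim your free reward",
    "I will call you after the lecture",
]
labels = ["spam", "ham", "spam", "ham"]

# LabelEncoder maps class names to integers in sorted order: ham -> 0, spam -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# TfidfVectorizer turns each message into a sparse vector of TF-IDF word weights.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)

print(list(encoder.classes_))  # class names in encoded order
print(y.tolist())              # encoded labels
print(X.shape)                 # (number of messages, vocabulary size)
```

Each row of `X` is one message and each column is one word of the learned vocabulary, so the matrix can be passed directly to a scikit-learn classifier together with `y`.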

5. Model Selection, Evaluation and Optimisation

Different machine learning algorithms for spam classification

At this stage, the previous steps (data gathering, preparation, and exploration) are assumed to
be complete, although it is possible to return to them if the analysis requires it. What happens
during the training and testing stages? Since the clean data are held in a dataset with features
in columns and observations in rows, the chosen algorithm can be applied to it to generate a
model. The process by which this model learns relationships and patterns within the data is
what we call "training". Selecting a suitable ML algorithm therefore involves studying and
testing competing candidates. The choice is mainly guided by the kind of problem to be
solved, the number and types of features, and the kind of model that best suits the data [14].
Following these principles, here are the two main types of ML algorithms:
• Supervised models: Supervised ML algorithms are among those most often used in
intelligent systems. They take feature data as inputs and map them to the desired outputs
(each input has an associated label). The tasks that supervised models solve are listed in
Table 5;
• Unsupervised models: Unlike supervised models, which learn from labeled data,
unsupervised models are trained on unlabeled data. Their particularity lies in their ability to
learn from large and complex amounts of data. Their main goal is to find hidden patterns,
structures or relationships within the data, even when these are not proportional, which is
why they are categorized as non-linear models [15]. Table 6 provides an overview of these
tasks and the associated algorithms.


| Tasks | Algorithms | Genre and problem solving examples |
|---|---|---|
| Classification [16] | Naive Bayes, Logistic Regression, Support Vector Machine (SVM), Random Forest, Decision Trees, etc. | Categorize input data into predefined classes or labels. E.g.: Sentiment analysis |
| Regression [17] | Linear Regression, Polynomial Regression, Lasso Regression, Ridge Regression, Support Vector Regression (SVR), etc. | Predict a continuous numerical output value. E.g.: House price forecasting |
| Object Detection [18] | YOLO (You Only Look Once), Convolutional Neural Networks (CNNs), etc. | E.g.: Self-driving cars |
| Natural Language Processing (NLP) [19] | RNNs (Recurrent Neural Networks), LSTMs, etc. | Understand human language. E.g.: Language translation |
| Time Series Analysis [20] | ARIMA (Autoregressive Integrated Moving Average), Exponential Smoothing methods, Seasonal and Trend decomposition using Loess (STL), etc. | Predict future values in a time series. E.g.: Weather prediction |
| Deep Learning [21] | RNNs, NLP, GANs (Generative Adversarial Networks), etc. | Learn hierarchical representations from data, for deep prediction. E.g.: Image recognition |

Table 5. Which task for which supervised machine learning algorithm [13]

| Tasks | Algorithms | Genre and problem solving examples |
|---|---|---|
| Clustering [13] | K-Means Clustering, Hierarchical Clustering, DBSCAN, etc. | Group data points into clusters based on similarity, without prior knowledge of group labels. E.g.: Customer segmentation |
| Dimension Reduction [22] | PCA (Principal Component Analysis), t-Distributed Stochastic Neighbor Embedding (t-SNE), etc. | Reduce the number of features (dimensions) in a dataset while retaining essential information. E.g.: Image compression |
| Topic Modeling [23] | LDA (Latent Dirichlet Allocation), Non-Negative Matrix Factorization (NMF), etc. | Discover latent topics within a collection of documents. E.g.: Content recommendation |
| Data compression | PCA | Preprocess data to make it more manageable. E.g.: Data storage |

Table 6. Which task for which unsupervised machine learning algorithm

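Comparing different machine learning algorithms for spam classification

To compare the four classification algorithms (Naive Bayes, SVM, logistic regression, and
decision tree), we used AUC (Area Under the Curve) and accuracy. AUC results are shown in
Table 7 and Figure 3, and accuracy results in Table 8 and Figure 4.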
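
| Language | Bayes | SVM | Logistic regression | Decision tree | Ham | Spam |
|---|---|---|---|---|---|---|
| English | 99% | 99% | 99% | 90% | 4516 | 641 |
| French | 99% | 99% | 99% | 90% | 4494 | 640 |
| Swahili | 90% | 87% | 85% | 67% | 99 | 14 |

Table 7. AUC: Languages vs ML algorithms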

Figure 3. AUC: Languages vs ML algorithms (bar chart of AUC for Bayes, SVM, logistic regression and decision tree, grouped by English, French and Swahili)

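| Language | Bayes | SVM | Logistic regression | Decision tree | Ham | Spam |
|---|---|---|---|---|---|---|
| English | 96% | 97% | 95% | 96% | 4516 | 641 |
| French | 96% | 98% | 97% | 96% | 4494 | 640 |
| Swahili | 87% | 87% | 87% | 91% | 99 | 14 |

Table 8. Accuracy: Languages vs ML algorithms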

Figure 4. Accuracy: Languages vs ML algorithms (bar chart of accuracy for Bayes, SVM, logistic regression and decision tree, grouped by English, French and Swahili; vertical axis from 80% to 100%)

From Tables 7 and 8 and Figures 3 and 4, we note that the performance of a model depends
on the language to which it is applied. Specifically, the results for English and French are very
similar to each other and differ from those for Swahili, owing to the structure of the language
and the small number of Swahili messages. The models were trained on 80% of the data and
tested on the remaining 20%.
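
The comparison procedure described above (an 80/20 split, then accuracy and AUC for each algorithm) can be sketched as follows. Several points are assumptions on our part: the paper does not state which Naive Bayes or SVM variants were used, so `MultinomialNB` and `LinearSVC` are stand-ins; the split is stratified; and a small synthetic corpus replaces the real data, so the printed scores say nothing about Tables 7 and 8.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def build_toy_corpus():
    """Return (texts, labels) with 40 ham and 10 spam messages; label 1 marks spam."""
    ham_templates = [
        "see you at {} pm today",
        "please call me before {} tomorrow",
        "meeting moved to room {}",
        "lunch with the team at {}",
    ]
    spam_templates = [
        "win {} dollars now click the link",
        "claim your free prize with code {}",
    ]
    ham = [t.format(i) for i in range(10) for t in ham_templates]
    spam = [t.format(i) for i in range(5) for t in spam_templates]
    return ham + spam, [0] * len(ham) + [1] * len(spam)


def spam_scores(model, texts):
    """Continuous scores for the spam class, which AUC needs instead of hard labels."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(texts)[:, 1]
    return model.decision_function(texts)  # e.g. LinearSVC has no predict_proba


def compare_models(texts, labels, seed=0):
    """Train each candidate on 80% of the data and score it on the remaining 20%."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    candidates = {
        "Bayes": MultinomialNB(),
        "SVM": LinearSVC(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Decision tree": DecisionTreeClassifier(random_state=seed),
    }
    results = {}
    for name, estimator in candidates.items():
        # The vectorizer is fitted on the training split only, inside the pipeline.
        model = make_pipeline(TfidfVectorizer(), estimator)
        model.fit(x_train, y_train)
        results[name] = {
            "accuracy": accuracy_score(y_test, model.predict(x_test)),
            "auc": roc_auc_score(y_test, spam_scores(model, x_test)),
        }
    return results


if __name__ == "__main__":
    texts, labels = build_toy_corpus()
    for name, metrics in compare_models(texts, labels).items():
        print(f"{name}: accuracy={metrics['accuracy']:.2f}, auc={metrics['auc']:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps test messages out of the TF-IDF vocabulary, which avoids a subtle form of leakage between the two splits.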
Model optimisation

Tables 7 and 8 show that each model has characteristics that make it well suited to certain
data, so combining them is an appealing way to improve performance, provided that
over-fitting is kept under control. Two techniques were therefore used: Voting Classifier and
Grid Search.
• VotingClassifier: a versatile ensemble classifier that combines multiple base estimators to
make predictions. Its accuracy scores are given in Table 9.

| Language | Score |
|---|---|
| English | 98% |
| French | 98% |
| Swahili | 87% |

Table 9. VotingClassifier: Accuracy scores (en, fr and sw)

• GridSearchCV: it performs an exhaustive search over specified parameter values for an
estimator. It helps find the best combination of hyperparameters by evaluating the model’s
performance using cross-validation. We tested it on the Swahili data and obtained a score of
87%, exactly like VotingClassifier.
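
The two optimisation techniques above can be sketched with scikit-learn as shown below. The paper does not say which base estimators were combined or which hyperparameters were searched, so the three voters, the hard-voting rule, the parameter grid and the 5-fold cross-validation are all illustrative assumptions, and the synthetic corpus is a placeholder for the real data.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def build_toy_corpus():
    """Return (texts, labels) with 40 ham and 10 spam messages; label 1 marks spam."""
    ham_templates = [
        "see you at {} pm today",
        "please call me before {} tomorrow",
        "meeting moved to room {}",
        "lunch with the team at {}",
    ]
    spam_templates = [
        "win {} dollars now click the link",
        "claim your free prize with code {}",
    ]
    ham = [t.format(i) for i in range(10) for t in ham_templates]
    spam = [t.format(i) for i in range(5) for t in spam_templates]
    return ham + spam, [0] * len(ham) + [1] * len(spam)


def build_voting_pipeline(seed=0):
    """TF-IDF features followed by a majority vote over three base classifiers."""
    voter = VotingClassifier(
        estimators=[
            ("bayes", MultinomialNB()),
            ("logreg", LogisticRegression(max_iter=1000)),
            ("tree", DecisionTreeClassifier(random_state=seed)),
        ],
        voting="hard",  # each estimator casts one vote; the majority class wins
    )
    return Pipeline([("tfidf", TfidfVectorizer()), ("vote", voter)])


def tune_logistic_regression(texts, labels):
    """Exhaustive search over a small, assumed grid with 5-fold cross-validation."""
    pipeline = Pipeline(
        [("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))]
    )
    grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, grid, cv=5, scoring="accuracy")
    search.fit(texts, labels)
    return search


if __name__ == "__main__":
    texts, labels = build_toy_corpus()
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0
    )
    voting = build_voting_pipeline().fit(x_train, y_train)
    print("voting accuracy:", voting.score(x_test, y_test))
    search = tune_logistic_regression(x_train, y_train)
    print("best parameters:", search.best_params_)
    print("cross-validated accuracy:", search.best_score_)
```

With only 14 spam messages in the clean Swahili set, the number of cross-validation folds has to stay small enough for every fold to contain spam, which is one reason the fold count here should be treated as a tunable choice.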

6. Model Deployment and Integration

The deployment phase transitions the machine learning models from a development
environment to a production setting. For spam detection, this involves integrating the models
into an API that will act as the user interface. This API enables real-time analysis and
classification of SMS in English, French, and Swahili, providing immediate feedback to the
end-user.
The API serves as an intermediary between the machine learning models and the application
layer. It receives input SMS, processes them through the deployed models, and returns a spam
or non-spam verdict. The integration process requires careful planning to ensure that the API
can handle the expected load and provide low-latency responses. Figures 5 and 6 show the
model deployment and integration.
The user interface, powered by the API, must be intuitive and user-friendly. It should provide
clear options for users to submit SMS for analysis and display the results in an easily
understandable format. Additionally, it should offer the ability to learn from user feedback to
improve the accuracy of the spam detection models. To achieve that, we used Django, HTML
and CSS (Figure 7, [24]).
Figure 5. ML model

Figure 6. Deployment model


Figure 7. Testing the end-point for api Services
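
The request flow described in this section (receive an SMS, route it to the model for its language, return a verdict) can be sketched independently of any web framework. Everything named here is hypothetical: the JSON fields `message` and `language`, the function `handle_classify_request`, and the keyword rule `keyword_stub`, which merely stands in for a trained model's prediction. In the project itself this logic would sit behind a Django view; the actual endpoint is not documented in this paper.

```python
import json
from typing import Callable, Dict

# A classifier takes the message text and returns True when it is spam.
Classifier = Callable[[str], bool]


def keyword_stub(message: str) -> bool:
    """Placeholder for a trained model: flags a few obvious spam keywords."""
    return any(word in message.lower() for word in ("win", "prize", "free"))


def handle_classify_request(body: str, classifiers: Dict[str, Classifier]) -> dict:
    """Validate a JSON request body and return a JSON-serialisable response."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return {"status": 400, "error": "body must be valid JSON"}
    if not isinstance(payload, dict):
        return {"status": 400, "error": "body must be a JSON object"}

    message = payload.get("message")
    language = payload.get("language")
    if not isinstance(message, str) or not message.strip():
        return {"status": 400, "error": "'message' must be a non-empty string"}
    if not isinstance(language, str) or language not in classifiers:
        return {"status": 400, "error": f"'language' must be one of {sorted(classifiers)}"}

    # Route the message to the model trained for its language.
    is_spam = classifiers[language](message)
    return {"status": 200, "language": language, "verdict": "spam" if is_spam else "ham"}


if __name__ == "__main__":
    # One entry per supported language: English, French and Swahili.
    models = {"en": keyword_stub, "fr": keyword_stub, "sw": keyword_stub}
    request_body = json.dumps({"message": "You WIN a free prize", "language": "en"})
    print(handle_classify_request(request_body, models))
```

Keeping validation and routing in a plain function makes the behaviour easy to test without starting a server, and lets each language's model be swapped without touching the request handling.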

7. Study Limitations and Future research paths


Although this work comes with several advantages, it also has its limitations. The solution it
offers isn’t universal, as it’s primarily tailored to a specific region, especially the eastern part of
the Democratic Republic of Congo.
Furthermore, a portion of the data, which is a fundamental component of any machine
learning algorithm, was collected from an unofficial source, namely Kaggle. This data does
not reflect local interests and language cultures, as it consists predominantly of content in
French and English. Additionally, the other data source, the Swahili survey, was limited in
quantity, making it challenging to provide accurate estimates that represent the entire
population.
Moreover, the solution is designed to work with only three languages. Messages containing a
mixture of languages might therefore be difficult for the model to classify accurately. A more
advanced multilingual classifier is needed to address this issue, for instance by using more
advanced artificial intelligence techniques.

8. Conclusion
9. References

[1] Statista’s own team of researchers and analysts. Number of mobile messages worldwide
from 2019 to 2023 (in trillions). https://fr.statista.com/, 2020.
[2] Cori Faklaris and Sara Anne Hook. Oh, snap! the state of electronic discovery amid the rise
of snapchat, whatsapp, kik, and other mobile messaging apps. 2016.
[3] M Lavanya and KR Aruna. Sms spam detection using deep learning. International Journal
of Research Publication and Reviews, ISSN 2582-7421.
[4] Gwenael Le Bodic. Mobile messaging technologies and services: SMS, EMS and MMS.
John Wiley & Sons, 2005.
[5] Sunil Kumar Jangir, Manoj Kumar Sharma, and Pawan Kumar Gupta. Design and
implementation of sms gateway api for mobile communication networks. International
Journal of Computer Applications, 151(9):1–5, 2016.
[6] Catalin Cimpanu. Simjacker vulnerability exploited for surveillance by at least one nation-
state. ZDNet, 2019.
[7] Guangquan Chen, Weijun Wang, and Xuan Zhou. A survey on sms spam filtering
techniques. Journal of Network and Computer Applications, 80:149–159, 2017.
[8] Matti Leppäniemi and Heikki Karjaluoto. Mobile marketing: From marketing strategy to
mobile marketing campaign implementation. International Journal of Mobile Marketing,
3(1), 2008.
[9] Crispin Malingumu Syosyo. Analyse du marché des télécommunications mobiles en
République Démocratique du Congo : Dynamique du marché et stratégies des acteurs. HAL
open science, 2021.
[10] Oluwafemi Osho et al. Mobile spamming in Nigeria: An empirical survey.
ResearchGate, 2015.
[11] www.kaggle.com
[12] Kaggle : Tout ce qu'il faut savoir sur cette plateforme - DataScientest.com.
https://datascientest.com/kaggle-tout-ce-quil-a-savoir-sur-cette-plateforme.
[13] Christian Murhula Byabushi. Development of an interface application for detection of
spam on a mobile operator: Case study of Airtel, Vodacom and Orange. CATHOLIC
UNIVERSITY OF BUKAVU. Academic year: 2022-2023.
[14] H Wang, Z Lei, X Zhang, B Zhou, and J Peng. Machine learning basics.
Deep learning, pages 98–164, 2016.
[15] Memoona Khanum, Tahira Mahboob, Warda Imtiaz, Humaraia Abdul Ghafoor, and
Rabeea Sehar. A survey on unsupervised machine learning algorithms for automation,
classification and maintenance. International Journal of Computer Applications,
119(13), 2015.
[16] FY Osisanwo, JET Akinsola, O Awodele, JO Hinmikaiye, O Olakanmi, J Akinjobi, et
al. Supervised machine learning algorithms: classification and comparison. International
Journal of Computer Trends and Technology (IJCTT), 48(3):128–138, 2017.
[17] Dastan Maulud and Adnan M Abdulazeez. A review on linear regression comprehensive
in machine learning. Journal of Applied Science and Technology Trends, 1(4):140–147,
2020.
[18] Qiang Bai, Shaobo Li, Jing Yang, Qisong Song, Zhiang Li, and Xingxing Zhang.
Object detection recognition and robot grasping based on machine learning: A survey. IEEE
Access, 8:181855–181879, 2020.
[19] Maria Razno. Machine learning text classification model with nlp approach.
Computational Linguistics and Intelligent Systems, 2:71–73, 2019.
[20] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature
learning and deep learning for time-series modeling. Pattern recognition letters, 42:11–24,
2014.
[21] Christian Janiesch, Patrick Zschech, and Kai Heinrich. Machine learning and deep
learning. Electronic Markets, 31(3):685–695, 2021.
[22] Carlos Oscar Sánchez Sorzano, Javier Vargas, and A Pascual Montano. A survey of
dimensionality reduction techniques. arXiv preprint arXiv:1403.2877, 2014.
[23] Jipeng Qiang, Zhenyu Qian, Yun Li, Yunhao Yuan, and Xindong Wu. Short text topic
modeling techniques, applications, and performance: a survey. IEEE Transactions on
Knowledge and Data Engineering, 34(3):1427–1445, 2020.
[24] Link to the application (to be added).
