Sms Spam
The relevance of developing a spam detection application for mobile operators in the DRC
In telecommunications, mobile devices are widely used to exchange texts, emails, and chats
through dedicated applications. In particular, the Short Message Service (SMS) plays a crucial
role in conveying personal and professional information via mobile phones [3]. Initially limited
to messages of at most 160 characters [4], SMS has evolved to accommodate longer texts and even
multimedia content, originating from other phone users, online applications, or network
operators [5].
However, the accessibility and openness of SMS communication have also paved the way for
a subset of individuals to exploit this platform by disseminating deceptive or unwanted
messages. These messages often employ tactics such as false promises of monetary rewards,
requests for payments under false pretenses, or misleading job offers, among others. Such
fraudulent activities contribute to the proliferation of spam messages that can deceive
unsuspecting recipients.
Moreover, beyond the realm of deceptive content lies a more insidious threat posed by
scammers who exploit vulnerabilities within messaging systems to inject malware or install
mobile spyware on users’ devices. Notable instances like SimJacker [6] underscore the potential
risks that extend beyond message content to the very infrastructure of messaging systems.
Additionally, legitimate organizations that utilize network operators for marketing or
advertising purposes inadvertently provide an avenue for scammers to impersonate authorized
entities, further complicating the distinction between official and unofficial messages. This
confluence of challenges underscores the complexity of distinguishing between genuine and
fraudulent messages in the mobile communication landscape [7, 8].
Given the multifaceted nature of spam messages and the associated risks they pose, it is
imperative to devise solutions that aid and protect users from falling victim to malicious
activities. Addressing issues such as false market advertising, security threats, and unwanted
messages necessitates the development of robust tools and mechanisms to facilitate secure and
reliable messaging communication. By proactively tackling these challenges, stakeholders can
foster a safer and more trustworthy mobile communication environment in the Democratic
Republic of the Congo, which has the particularity of using Swahili.
The proliferation of mobile communication in the Democratic Republic of the Congo (DRC) has
brought both opportunities and challenges. While SMS has become a powerful tool for
communication and marketing, it has also led to an increase in unwanted SMS spam. Here, we
explore the prevalence of SMS spam in the DRC, its impact on users, and potential strategies to
mitigate this issue.
Between 2002 and 2007, the DRC's mobile market transformed from an oligopolistic structure
to one characterized by monopolistic competition. New players entered the market, and
established leaders like Vodacom and Airtel significantly influenced the landscape. As a result,
the mobile penetration rate reached 44.6% in 2014, surpassing the African average [9].
Mobile operators in the DRC have leveraged SMS as a communication and marketing channel.
Promotional messages, service alerts, and transaction notifications are routinely delivered via
SMS. However, this widespread use has also opened the door to spam. Users receive unsolicited
messages, often related to commercial offers, contests, or dubious services.
The challenges and impact of SMS spam can be grouped into three categories. First, user
experience: SMS spam disrupts the user experience, causing annoyance and frustration, and
legitimate messages can get lost in the flood of unwanted content. Second, resource drain:
spam consumes network resources, affecting overall system efficiency and potentially leading
to increased costs for operators. Third, privacy concerns: some spam messages request
personal information or promote fraudulent schemes, posing privacy risks to recipients.
While the DRC faces SMS spam challenges, it is essential to compare its situation with that of
other countries. For instance, in Nigeria [10], each subscriber receives an average of 2.45 spam
messages per day. The majority of these are commercial in nature. However, only a small
fraction of recipients report spam, highlighting the need for more effective regulation and user
empowerment.
Interest
For society, this work contributes to facilitating communication and reducing the impact of
spam messages, which are annoying and stressful for citizens.
Economically, it saves time for business employees by reducing concerns about threats and
unwanted messages, thanks to its filtering functions.
For scientists, this work is a reference for those who wish to gain skills and techniques in
machine learning operations applied to mobile network environments.
For this paper, we used two data sources: one from the Kaggle website [11, 12], for English and
French, and another from a survey, for Swahili.
Kaggle is an online platform that hosts the most extensive global community of data
scientists. With over 536,000 active members spanning 194 countries, Kaggle provides tools and
resources for data science work. Both seasoned professionals and beginners can use it to learn,
collaborate, and showcase their skills.
One useful aspect of Kaggle is its collection of datasets, including those related to spam
detection. Two of these datasets are relevant here:
Spam Mails Dataset: This dataset, containing 5,572 SMS messages, is meticulously labeled as
either "ham" (legitimate) or "spam". Researchers and data enthusiasts can use this dataset to
develop and fine-tune spam detection models. The dataset includes a variety of text messages,
allowing practitioners to explore different patterns and features associated with spam.
Email Spam Classification Dataset: Another valuable resource on Kaggle is the CSV file
containing information about 5,172 emails, categorized as either spam or not spam. Researchers
can use this dataset to build and evaluate spam filters, employing techniques such as Naive
Bayes or other machine learning algorithms.
We used the first dataset, in English and French; the distribution of the raw data is shown in
Table 1 below.
Languages   Ham    Spam   Total
English     4825   747    5572
French      4825   747    5572
Table 1. Raw data counts before any wrangling (Kaggle)
Data preparation
[Figure: two bar charts of Ham and Spam message counts for English, French and Swahili;
vertical axis from 0 to 5000]
Completing this project required various dependencies [13], tools, and frameworks. They fall
into two main areas: those integral to the core functionality (back-end) and those relevant to
the user side (front-end). The project also involved tools for data analysis and predictive
modeling. The structure of these tools is presented in Table 4.
Category                      Tools             Roles
Data analysis and             Numpy             Scientific computing library
Machine Learning libraries    Pandas            Data manipulation and analysis library
                              Matplotlib        Python library used for 2D and 3D data visualization
                              Seaborn           Another Python library for statistical data
                                                visualization, built on matplotlib
                              Scikit-learn      Machine learning library
Programming Languages         Django (Python)   Python framework for developing web applications
(Front-end and Back-end)                        and APIs
                              HTML              Defines the structure and content of web pages
                              CSS               Styles the content
                              JS                Makes web pages interactive and dynamic
Table 4. Structuring the tools [13]
Feature engineering
Feature engineering plays a crucial role in spam detection by selecting and creating relevant
features from the available data to enhance the performance of the spam detection model. In
the context of spam detection, features can include elements such as the frequency of specific
words in the message text, the presence of commonly associated spam keywords, the message
length, the presence of hyperlinks, the time of message sending, and the sender’s domain. By
utilizing these features, a machine learning model can be trained to more effectively identify
spam messages. The process of feature engineering thus improves the model’s ability to detect
spam by providing it with informative and discriminative features.
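The feature types listed above can be sketched in a few lines of Python. This is only an illustration of hand-crafted features in general, not the pipeline used in this paper: the keyword list, the regular expressions and the function name `extract_features` are assumptions made for the example.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
SPAM_KEYWORDS = {"free", "win", "winner", "prize", "cash", "urgent"}
# Very rough pattern for hyperlinks (an assumption, not a full URL grammar).
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def extract_features(message: str) -> dict:
    """Turn one SMS into a small dictionary of hand-crafted features."""
    words = re.findall(r"[a-zA-Z']+", message.lower())
    return {
        "length": len(message),                      # message length in characters
        "n_words": len(words),                       # number of alphabetic tokens
        "n_spam_keywords": sum(w in SPAM_KEYWORDS for w in words),
        "has_url": bool(URL_PATTERN.search(message)),
        "n_digits": sum(ch.isdigit() for ch in message),
    }


sample = "URGENT! Win a FREE prize at www.example.com"
features = extract_features(sample)
print(features)
```

Features such as sending time or sender domain would require message metadata, which a text-only dataset does not provide.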
For this paper, we use two relevant features: the label column of the dataset and the word
frequencies in each message. These features are processed with the scikit-learn (sklearn)
Python library, specifically with its LabelEncoder and TfidfVectorizer classes [13].
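A minimal sketch of how these two scikit-learn classes are typically applied is given below. The four messages and their labels are invented for illustration, and the default vectorizer settings are an assumption; the settings actually used in [13] are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy data invented for this example.
labels = ["ham", "spam", "ham", "spam"]
messages = [
    "See you at lunch tomorrow",
    "Win a free prize now",
    "Can you call me back",
    "Free cash, claim your prize",
]

# LabelEncoder maps class names to integers in sorted order: ham -> 0, spam -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# TfidfVectorizer turns each message into a sparse vector of weighted word frequencies.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)

print(list(encoder.classes_), list(y))
print(X.shape)  # (number of messages, vocabulary size)
```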
When this stage is reached, the previous steps (data gathering, preparation, and exploration)
are assumed to be complete, although it is possible to return to them if the analysis requires
it. What happens during the training and testing stages? Since all clean data are held in a
dataset with features in columns and data values in rows, the class implementing the chosen
algorithm can then be used to generate a model. The process by which this model learns
relationships and patterns within the data is what we call ’training’. Selecting a suitable ML
algorithm therefore involves studying and testing competing candidates. The choice is mainly
guided by the kind of problem to solve, the number and types of features, and the kind of model
that best suits the data [14].
So, according to those principles, here are the types of ML algorithms used:
Supervised models: Supervised ML algorithms are among the algorithms most often used in
intelligent systems. They take as input the data related to the features and map them
to the desired outputs (each input is paired with its expected output). The tasks that
supervised models solve are listed in Table 5;
Unsupervised models: Unlike supervised models, which learn from labeled data,
unsupervised models are trained on unlabeled data. Their particularity lies in their
ability to learn from large and complex amounts of data. Their main goal is to find hidden
patterns, structures, or relationships within the data, even when these are not
proportional. They are therefore categorized as non-linear models [15]. Table 6
provides an overview of these tasks and the associated algorithms employed to achieve
them.
To compare those algorithms, we used accuracy and AUC (Area Under the Curve). Results are
shown in Table 7 and Figure 3.
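To make the two metrics concrete, the sketch below computes them with scikit-learn on four hand-made predictions. The labels, scores and the 0.5 decision threshold are assumptions chosen for the example; AUC is taken here as the area under the ROC curve.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented ground truth (1 = spam, 0 = ham) and model scores for four messages.
y_true = [0, 0, 1, 1]
spam_scores = [0.1, 0.4, 0.35, 0.8]

# Accuracy needs hard decisions, so the scores are thresholded at 0.5 (an assumption).
y_pred = [int(score >= 0.5) for score in spam_scores]

# Accuracy: share of messages whose predicted class matches the true class.
accuracy = accuracy_score(y_true, y_pred)

# AUC: probability that a random spam message is scored above a random ham message.
auc = roc_auc_score(y_true, spam_scores)

print(accuracy, auc)  # 0.75 0.75
```

Three of the four thresholded predictions are correct, and three of the four (ham, spam) pairs are ranked in the right order, so both values equal 0.75 on this toy input.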
Languages   Bayes   SVM   Logistic regression   Decision tree   Ham    Spam
English     99%     99%   99%                   90%             4516   641
French      99%     99%   99%                   90%             4494   640
Swahili     90%     87%   85%                   67%             99     14
Table 7. AUC: Languages vs ML algorithms
[Two charts: scores in percent per ML algorithm (Bayes, SVM, Logistic regression, Decision
tree)]
By observing Tables 7, 8 and Figures 3, 4, we note that the performance of a model depends on
the language to which it is applied. Specifically, the results for English and French are very
similar to each other and differ from those for Swahili, owing to the structure of the language
and the small number of messages in Swahili. The models were trained on 80% of the data and
tested on the remaining 20%.
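The comparison can be sketched as follows with scikit-learn. This is a minimal illustration under stated assumptions, not the experiment itself: the twenty messages are invented, `LinearSVC` stands in for the SVM, and the random seed and default hyperparameters are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy corpus invented for illustration only (1 = spam, 0 = ham).
ham = [
    "see you at lunch", "call me when you arrive", "the meeting is at ten",
    "happy birthday my friend", "i am on my way home", "thanks for your help",
    "can you send the report", "dinner at eight tonight",
    "good morning how are you", "the bus is late again",
]
spam = [
    "win a free prize now", "claim your cash reward today",
    "free entry to win money", "urgent you won a prize",
    "congratulations free cash offer", "win money call now",
    "exclusive offer claim free gift", "you have won cash",
    "free prize waiting claim now", "urgent offer win a gift",
]
messages = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

# 80% training / 20% test; stratify keeps both classes in each part.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit TF-IDF on the training part only, then reuse it on the test part.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

models = {
    "Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # AUC needs a continuous score: class probability if available,
    # otherwise the signed distance to the decision boundary.
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "auc": roc_auc_score(y_test, scores),
    }

for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.2f}, auc={metrics['auc']:.2f}")
```

With only four test messages the printed scores are not meaningful; the point is the order of operations (split, fit the vectorizer on the training part, train, score).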
Model optimisation
Tables 7 and 8 show that each model has characteristics that make it suited to certain data.
Combining the models is therefore an appealing way to optimize further, provided over-fitting
is kept in check. Two techniques are used: Voting Classifier and Grid Search.
VotingClassifier: It is a versatile ensemble classifier that combines multiple base
estimators to make predictions.
Language Score
English 98%
French 98%
Swahili 87%
Table 9. VotingClassifier: Accuracy scores (en, fr and sw)
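A minimal sketch of how a voting ensemble and a grid search can be combined in scikit-learn follows. The base estimators, the hard-voting rule, the parameter grid and the twelve toy messages are all assumptions made for the example; the configuration behind Table 9 is not specified in this section.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy corpus invented for illustration only (1 = spam, 0 = ham).
messages = [
    "see you at lunch", "call me when you arrive", "the meeting is at ten",
    "thanks for your help", "dinner at eight tonight", "the bus is late again",
    "win a free prize now", "claim your cash reward today",
    "free entry to win money", "urgent you won a prize",
    "you have won cash", "free prize waiting claim now",
]
labels = [0] * 6 + [1] * 6

# Hard voting: each base estimator casts one vote and the majority class wins.
voting = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("vote", voting)])

# Grid search tries every combination below with 3-fold cross-validation.
# Parameter names follow the "<step>__<estimator>__<parameter>" convention.
param_grid = {
    "vote__lr__C": [0.1, 1.0, 10.0],
    "vote__nb__alpha": [0.5, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(messages, labels)

prediction = int(search.predict(["free cash prize"])[0])
print(search.best_params_, prediction)
```

Cross-validation inside the grid search is one way of keeping over-fitting in check, since every parameter combination is scored on data it was not trained on.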
The deployment phase transitions the machine learning models from a development
environment to a production setting. For spam detection, this involves integrating the models
into an API that will act as the user interface. This API enables real-time analysis and
classification of SMS in English, French, and Swahili, providing immediate feedback to the
end-user.
The API serves as an intermediary between the machine learning models and the application
layer. It receives input SMS, processes them through the deployed models, and returns a spam
or non-spam verdict. The integration process requires careful planning to ensure that the API
can handle the expected load and provide low-latency responses. Figures 5 and 6 show model
deployment and integration.
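The request flow described above can be sketched as a framework-independent handler that a web view (for example a Django view) could call. Everything in the sketch is hypothetical: the payload fields `language` and `message`, the function name `classify_sms`, the response format, and the keyword-based stand-in classifiers, which only take the place of the trained per-language models.

```python
from typing import Callable, Dict


def _keyword_stub(keywords) -> Callable[[str], int]:
    """Build a stand-in classifier: 1 (spam) if any keyword occurs, else 0 (ham)."""
    def predict(message: str) -> int:
        text = message.lower()
        return int(any(keyword in text for keyword in keywords))
    return predict


# One classifier per supported language; placeholders for the deployed models.
MODELS: Dict[str, Callable[[str], int]] = {
    "en": _keyword_stub(["free", "prize"]),
    "fr": _keyword_stub(["gratuit", "gagnez"]),
    "sw": _keyword_stub(["bure", "zawadi"]),
}


def classify_sms(payload: dict) -> dict:
    """Validate the request, route it to the right model, and return a verdict."""
    language = payload.get("language")
    message = payload.get("message", "")
    if language not in MODELS:
        return {"status": 400, "error": f"unsupported language: {language!r}"}
    if not isinstance(message, str) or not message.strip():
        return {"status": 400, "error": "empty message"}
    label = MODELS[language](message)
    return {
        "status": 200,
        "language": language,
        "verdict": "spam" if label else "ham",
    }


print(classify_sms({"language": "en", "message": "Claim your FREE prize"}))
print(classify_sms({"language": "fr", "message": "On se voit demain"}))
```

Keeping validation and model routing in a plain function makes the logic testable without a running web server; the view layer then only parses the request and serializes the returned dictionary.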
The user interface, powered by the API, must be intuitive and user-friendly. It should provide
clear options for users to submit SMS for analysis and display the results in an easily
understandable format. Additionally, it should offer the ability to learn from user feedback to
improve the accuracy of the spam detection models. To achieve that, we used Django, HTML
and CSS (Figure 7, [24]).
Figure 5. ML model
8. Conclusion
9. References
[1] Statista’s own team of researchers and analysts. Number of mobile messages worldwide
from 2019 to 2023 (in trillions). https://fr.statista.com/, 2020.
[2] Cori Faklaris and Sara Anne Hook. Oh, snap! the state of electronic discovery amid the rise
of snapchat, whatsapp, kik, and other mobile messaging apps. 2016.
[3] M Lavanya and KR Aruna. Sms spam detection using deep learning. www.ijrpr.com,
ISSN 2582-7421.
[4] Gwenael Le Bodic. Mobile messaging technologies and services: SMS, EMS and MMS.
John Wiley & Sons, 2005.
[5] Sunil Kumar Jangir, Manoj Kumar Sharma, and Pawan Kumar Gupta. Design and
implementation of sms gateway api for mobile communication networks. International
Journal of Computer Applications, 151(9):1–5, 2016.
[6] Catalin Cimpanu. Simjacker vulnerability exploited for surveillance by at least one nation-
state. ZDNet, 2019.
[7] Guangquan Chen, Weijun Wang, and Xuan Zhou. A survey on sms spam filtering
techniques. Journal of Network and Computer Applications, 80:149–159, 2017.
[8] Matti Leppäniemi and Heikki Karjaluoto. Mobile marketing: From marketing strategy to
mobile marketing campaign implementation. International Journal of Mobile Marketing,
3(1), 2008.
[9] Crispin Malingumu Syosyo. Analyse du marché des télécommunications mobiles en
République Démocratique du Congo : Dynamique du marché et stratégies des acteurs. HAL
open science. 2021.
[10] Oluwafemi Osho et al. Mobile spamming in Nigeria: An empirical survey.
ResearchGate. 2015.
[11] www.kaggle.com
[12] Kaggle : Tout ce qu'il faut savoir sur cette plateforme - DataScientest.com.
https://datascientest.com/kaggle-tout-ce-quil-a-savoir-sur-cette-plateforme.
[13] Christian Murhula Byabushi. Development of an interface application for detection of
spam on a mobile operator: Case study of Airtel, Vodacom and Orange. CATHOLIC
UNIVERSITY OF BUKAVU. Academic year: 2022-2023.
[14] H Wang, ZeZXeZBePJ Lei, X Zhang, B Zhou, and J Peng. Machine learning basics.
Deep learning, pages 98–164, 2016.
[15] Memoona Khanum, Tahira Mahboob, Warda Imtiaz, Humaraia Abdul Ghafoor, and
Rabeea Sehar. A survey on unsupervised machine learning algorithms for
automation, classification and maintenance. International Journal of Computer Applications,
119(13), 2015.
[16] FY Osisanwo, JET Akinsola, O Awodele, JO Hinmikaiye, O Olakanmi, J Akinjobi, et
al. Supervised machine learning algorithms: classification and comparison. International
Journal of Computer Trends and Technology (IJCTT), 48(3):128–138, 2017.
[17] Dastan Maulud and Adnan M Abdulazeez. A review on linear regression comprehensive
in machine learning. Journal of Applied Science and Technology Trends, 1(4):140–147,
2020.
[18] Qiang Bai, Shaobo Li, Jing Yang, Qisong Song, Zhiang Li, and Xingxing Zhang.
Object detection recognition and robot grasping based on machine learning: A survey. IEEE
access, 8:181855–181879, 2020.
[19] Maria Razno. Machine learning text classification model with nlp approach.
Computational Linguistics and Intelligent Systems, 2:71–73, 2019.
[20] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature
learning and deep learning for time-series modeling. Pattern recognition letters, 42:11–24,
2014.
[21] Christian Janiesch, Patrick Zschech, and Kai Heinrich. Machine learning and deep
learning. Electronic Markets, 31(3):685–695, 2021.
[22] Carlos Oscar Sánchez Sorzano, Javier Vargas, and A Pascual Montano. A survey of
dimensionality reduction techniques. arXiv preprint arXiv:1403.2877, 2014.
[23] Jipeng Qiang, Zhenyu Qian, Yun Li, Yunhao Yuan, and Xindong Wu. Short text topic
modeling techniques, applications, and performance: a survey. IEEE Transactions on
Knowledge and Data Engineering, 34(3):1427–1445, 2020.
[24] [Link to the application to be inserted here]