2024-48
Comparative Analysis of Conventional Approaches and AI-Powered Tools for Unit Testing within Web Application Development
Emad Aldeen Issawi, Osama Hajjouz
ISSN 1650-2884
LU-CS-EX: 2024-48
Keywords: unit tests, AI-based testing tools (AITT), manually implemented (MI) tests, evaluation criteria, multi-criteria decision-making (MCDM)
Acknowledgements
We would like to thank our supervisor at IKEA, Bahareh Mohajer, for helping us accomplish this
research. Her support facilitated a smooth and efficient process in accomplishing this thesis. We
also thank our supervisor at Lund University, Qunying Song, for guiding us through this thesis
during its minor and major milestones. Additionally, we are grateful to our examiner, Elizabeth
Bjarnason, for her crucial roadmap tips at the beginning of the thesis. Our thanks also go to Filip
Gustafsson for initiating this research study. We appreciate IKEA’s senior software engineers for
the interview on test quality evaluation and all their valuable feedback. Lastly, we thank the tech leader at IKEA for the interview on the challenges associated with adopting AI-based testing tools.
Contents
1 Introduction 9
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Definitions and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Research Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Research Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.7 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Related Work 15
2.1 Unit Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Unit Testing in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Unit Testing Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 Unit Testing Automation . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.4 Addressing Challenges in Manually Implemented Testing . . . . . . . 17
2.2 Good Practices of Unit Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 How To Structure a Unit Test . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 The Four Pillars of a Good Unit Test . . . . . . . . . . . . . . . . . . . 19
2.2.3 Using Coverage Metrics to Measure Test Quality . . . . . . . . . . . . 20
2.2.4 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.5 Unit Testing Principles . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.6 Assertions and Clean Tests . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Advancements and Impact of Automation in Software Testing . . . . . . . . . 24
2.4 Challenges of AI in Software Testing . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Multiple-Criteria Decision-Making . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.1 Assigning Weights to the Criteria . . . . . . . . . . . . . . . . . . . . . 28
2.5.2 Formula for Computing the WSM Score . . . . . . . . . . . . . . . . . 28
3 Method 29
3.1 Exploring the existing AITT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 GPT Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Codiumate - by CodiumAI . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 UnitTestAI - by Paterson Anaccius . . . . . . . . . . . . . . . . . . . . 33
3.1.4 TabNine - by Codota . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.5 PaLM API Unit Test Generator - by Zazmic . . . . . . . . . . . . . . . 34
3.1.6 GitHub Copilot - by GitHub . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.7 CodePal Unit-Tests Writer - by CodePal . . . . . . . . . . . . . . . . . 35
3.1.8 IKEA GPT - by IKEA . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Evaluating Unit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Interview with IKEA’s Engineers . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Using Multi-Criteria Decision-Making to Compare the AITT and MI Methods 40
3.4.1 Assigning Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.2 Assigning Score Value . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.3 Computing the WSM Score . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Result 43
4.1 Evaluation of The Existing MI Unit Tests . . . . . . . . . . . . . . . . . . . . . 43
4.2 Evaluation of the Selected AITT . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 UnitTestsAI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 PaLM API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.3 TabNine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.4 Codiumate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.5 GitHub Copilot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.6 CodePal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.7 IKEA GPT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Interview with IKEA’s Tech Leader . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Results of the Multi-Criteria Decision-Making Method . . . . . . . . . . . . . 71
4.4.1 Normalized Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.2 WSM Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Discussion 79
5.1 Performance of MI Unit Testing (RQ1) . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Performance of the AITT in Unit Testing (RQ2) . . . . . . . . . . . . . . . . . 80
5.3 Comparison between AITT and MI approaches (RQ1 & RQ2) . . . . . . . . . 81
5.4 Challenges of Using AITT (RQ3) . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.1 Subjectivity in the Pairwise Comparison Process of MCDM . . . . . . 84
5.5.2 Generalizability of Results . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.3 Influence of Subjective Evaluations . . . . . . . . . . . . . . . . . . . . 84
5.5.4 Dependency on Tool Evolution . . . . . . . . . . . . . . . . . . . . . . 85
5.5.5 Lack of underlying data . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.6 Ethical and Social Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.6.1 Privacy of Proprietary Code . . . . . . . . . . . . . . . . . . . . . . . . 85
6 Conclusion 87
References 89
Chapter 1
Introduction
This chapter provides an overview of the research, beginning with background information on
the research area, including an overview of Artificial Intelligence (AI) and Machine Learning
(ML) in software testing. AI refers to the simulation of human intelligence in machines designed to think and learn like humans [32], while ML refers to the study of algorithms and statistical models that enable computers to improve their performance on tasks by learning from data without being explicitly programmed [38]. Additionally, it includes a description of IKEA
AB’s web application ILA and the motivation for integrating AI. Following this, definitions
and terminology used in the thesis are made clear to ensure clarity and understanding for the
reader. Subsequently, the goal and the primary problem of the research are described, along with
the research questions. Furthermore, the method’s steps are provided, along with the achieved
results of the thesis. Moreover, the contributions of this study are outlined to underscore its
significance and potential impact. Finally, a structured outline of the thesis report is presented
to provide a road map for navigating through the subsequent chapters.
1.1 Background
Artificial Intelligence (AI) and Machine Learning (ML) integration into software testing improves efficiency while reducing time and cost [34]. AI-based testing tools are improving processes like test case generation and bug analysis [19][35]. Initial hypotheses suggest that AI-generated unit tests could enhance code coverage efficiency, though they may not match the quality of manually implemented tests due to developers' in-depth system knowledge.
The integration of AI and ML into software testing automates numerous complex tasks, significantly enhancing the efficiency of the testing process within a shorter time frame [34]. AI and ML technologies are introducing new ways of solving the challenges associated with software quality assurance. By utilizing AI techniques and solutions, AI-based testing tools have
1.3 Problem Statement
Unit testing, also referred to as component or module testing, is defined as a test that checks a small piece of code (a unit), does so quickly, and operates in an isolated manner. These core attributes distinguish unit tests from other forms of testing: the speed of execution, the limited scope of the code being tested, and the requirement that tests run independently of external influences or dependencies [33]. Unit tests are used to ensure that a particular section of code works correctly [28].
• RQ1: How do MI testing methods perform when testing web applications? Regarding:
- time to set up and learn to use the tool?
- time to prepare and perform the tests?
- the quality of the testing, including code coverage?
• RQ2: How do existing AITT perform when testing web applications? (same aspects as
listed in RQ1)
• RQ3: What are the challenges of using AITT in the context of web applications?
MI unit testing. Subsequently, we review 18 AITT and select 7 of them, providing detailed documentation for each. Following this, we analyze the code of ILA and condense the literature on best practices of unit testing into a list of 18 criteria to use in the evaluation process, with one additional criterion evaluated later during an interview with an IKEA engineer. Later, we evaluate the existing MI unit tests, followed by evaluating the AI-generated unit tests for each selected AITT. Moreover, we interview a tech leader at IKEA to address the challenges of adopting the AITT within IKEA's web development context. Finally, we use the Multi-Criteria Decision-Making approach to compare the AITT and MI evaluation results and draw a final ranking.
1.6 Contributions
This thesis provides an evaluation of AITT, comparing their strengths and weaknesses with MI
testing. It offers recommendations regarding the adoption of AITT to improve testing effi-
ciency, coverage, quality, and cost-effectiveness.
We empirically evaluate several AITT and provide valuable insights to both academia and
industry on how AITT can influence testing processes, including testing efficiency, test qual-
ity, and potential cost reductions. The thesis compares the strengths and weaknesses of AITT
with those of MI testing, covering aspects such as efficiency, coverage, quality of test cases, test
accuracy, test structure, and simplicity. Additionally, our thesis provides recommendations for
software developers in the ILA team regarding the future adoption of AITT, advising on the
selection of tools. These recommendations could potentially enable future developers to con-
centrate more on their development tasks rather than managing the entire testing process by
utilizing AITT, thus minimizing distractions and enhancing focus on development efforts.
1.7 Outline of Thesis
Chapter 2
Related Work
In this chapter, we delve into the evolution and practices of unit testing in software development,
highlighting advancements in automation, frameworks, and methodologies, while addressing
the challenges faced in manual testing and the adoption of these practices across various or-
ganizations. We examine good practices for structuring unit tests, the principles underlying
effective unit testing, and the use of coverage metrics to assess test quality. Additionally, we
emphasize the importance of maintaining accurate and clean tests. The chapter also explores
the impact and benefits of automation in software testing, the integration of AI, and the associated challenges. Lastly, the chapter provides a foundation for comparing unit testing tools and methodologies using a multiple-criteria decision-making approach that systematically evaluates complex scenarios.
2.2 Good Practices of Unit Testing
Xie et al. [57] discussed the advances in unit testing through parameterized unit testing
(PUT), which extends traditional unit tests by allowing tests to accept parameters. This ap-
proach separates the specification of external behaviors from the generation of internal test
inputs, facilitating higher code coverage and enabling tools to generate test inputs efficiently.
Their work highlights the integration of PUT with various testing frameworks and tools, offer-
ing a methodology that enhances developer testing practices by combining strong test oracles
with automated test generation.
Artho et al. [4] extended basic unit testing with innovations like low-overhead memory
leak detection, verified logging, and integrated coverage measurement. Their approach allowed
scaling unit tests for larger software components and utilizing unit test infrastructure for system-
wide tests. They found that systematic unit testing maintains code quality amidst frequent
developer changes and that their enhancements to unit testing practices could be adapted to
other frameworks.
practices in unit testing, emphasizing their importance in achieving high quality, maintainability, and efficiency. This section is summarized in a list of criteria in Section 3.2, which is later used for the evaluation of the unit tests in this thesis.
Vladimir [33] gave a good example of the AAA pattern, in which the arrange, act, and assert sections are separated in that order; see Figure 2.1.
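To make the pattern concrete, the following minimal sketch shows an AAA-structured test written for Vitest, the framework used later in this thesis. The sortLabels function and its behaviour are illustrative placeholders and are not taken from ILA.

import { describe, it, expect } from 'vitest';

// Illustrative function under test (a placeholder, not ILA code).
function sortLabels(labels: string[]): string[] {
  return [...labels].sort((a, b) => a.localeCompare(b));
}

describe('sortLabels', () => {
  it('returns the labels in alphabetical order', () => {
    // Arrange: prepare the input data.
    const labels = ['banana', 'apple', 'cherry'];

    // Act: call the unit under test (a single line).
    const result = sortLabels(labels);

    // Assert: verify the observed outcome.
    expect(result).toEqual(['apple', 'banana', 'cherry']);
  });
});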
This quality ensures that the tests do not become a hindrance to improving the code’s struc-
ture, readability, or performance. A test’s resistance to refactoring is measured by its ability to
avoid generating false positives, which occur when a test fails despite the application working
correctly from an end-user’s perspective.
Fast feedback: Unit tests should execute quickly to provide immediate insights into whether
the code changes have introduced any errors. Fast feedback is crucial for maintaining high de-
velopment velocity, as it allows developers to promptly identify and address issues, thereby re-
ducing the time spent debugging and validating changes.
Maintainability: The tests themselves should be easy to understand and modify. This in-
volves clear naming conventions, a well-organized structure, and minimal coupling between
tests and the code under test. Maintainable tests are easier to update alongside the code they
test, ensuring that the test suite remains effective and relevant over time.
Statement coverage is similar to line coverage where it measures the percentage of executable
statements in the code that have been executed by the tests. While they seem similar, line cover-
age and statement coverage can give slightly different results in languages where multiple state-
ments can be on a single line. Line coverage measures if a line is executed at all, while statement
coverage ensures each statement is executed [43]. To calculate the statement coverage percentage,
one would need to divide the number of executed statements by the total number of statements.
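Expressed as a formula, this is:

\[ \text{Statement coverage} = \frac{\text{number of executed statements}}{\text{total number of statements}} \times 100\% \]

For example, if the tests execute 45 of 50 statements, statement coverage is 90%.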
2.2.4 Accuracy
Enhancing test accuracy in software testing involves strategically managing false positives and
false negatives, focusing on regression protection to capture actual defects, and refactoring re-
sistance to prevent incorrect failure alerts [33].
Vladimir [33] outlined the concepts of test accuracy in software testing, particularly unit
testing, and how it is related to the concepts of true positives, true negatives, false positives
(type I error), and false negatives (type II error) as shown in Figure 2.2. Vladimir emphasized
two critical attributes of a good test: protection against regressions and resistance to refactoring.
Protection against regressions focuses on minimizing false negatives, meaning the test should
catch as many real bugs as possible. Resistance to refactoring is about minimizing false positives,
ensuring that the test does not fail when the functionality is correct but has been refactored.
To increase test accuracy, two methods are suggested:
1. Increasing the Signal: Improving the test’s ability to detect actual bugs (minimizing false
negatives).
2. Reducing the Noise: Ensuring the test does not raise unnecessary alarms (minimizing false
positives).
Figure 2.2: A reprinted figure that shows the relation between protection against regressions, which avoids false negatives (type II errors), and resistance to refactoring, which reduces false positives (type I errors) [33].
Buffardi et al. [9] composed a conference paper where they discuss measuring unit test accu-
racy through a dual corpus methodology. This method evaluates the performance of unit tests
by applying them to a set of both acceptable and unacceptable solutions. Accuracy is deter-
mined based on the unit tests’ ability to correctly classify these solutions—approving acceptable
solutions and rejecting unacceptable ones. This method not only focuses on fault detection (as
with traditional bug identification rates) but also highlights the ability of the unit tests to val-
idate correct implementations accurately. This comprehensive measure provides an enhanced
perspective on the effectiveness of unit tests in distinguishing between correct and incorrect
code implementations.
Accuracy Formula
The accuracy formula based on the paper [9] can be defined as follows:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Where:
• TP (True Positives)
• TN (True Negatives)
• FP (False Positives)
• FN (False Negatives)
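As a worked example with hypothetical counts (not taken from the thesis data): a test suite that yields 40 true positives, 45 true negatives, 5 false positives, and 10 false negatives has

\[ \text{Accuracy} = \frac{40 + 45}{40 + 45 + 5 + 10} = \frac{85}{100} = 85\%. \]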
A Test Should Fail: It is necessary for tests to be able to fail to prove their value. A test that
never fails does not contribute to the software’s quality assurance. Including assertions and using
a test’s failure history can assess a test’s ability to detect issues effectively [8].
Reliability: Tests should consistently yield the same results under unchanged conditions,
free from randomness or non-determinism. This principle ensures that tests can be relied upon
as a stable safety net for development, with randomness and non-determinism identified and
eliminated [8].
Happy vs. Sad Tests: This principle encourages testing both expected (happy path) and unex-
pected (sad path) behaviors to counteract confirmation bias and ensure comprehensive coverage.
This balanced approach helps uncover potential faults while verifying expected functionalities,
aiming for an equal number of happy and sad test cases [8].
Test (in)dependency / Test Isolation: Asserts that tests should run independently of each other
and in any order, without any dependencies that could affect their outcomes. This isolation
ensures that the addition of new tests or changes to existing ones does not impact the overall
test suite’s integrity [8].
2.3 Advancements and Impact of Automation in Software Testing
pacted user stories and requirements. This automation helps Quality Assurance (QA) engineers
avoid time wasted on false positive error reports [41].
Khaliq et al. [31] described the impact of AI on software testing in multiple ways. Firstly, when it comes to specifications, AI facilitates the automation of test case generation from
specifications, using techniques like Info-Fuzzy Networks (IFN) for inductive learning from ex-
ecution data, which helps in recovering missing specifications, designing regression tests, and
evaluating software outputs. Similar to what Pham et al. described, Khaliq et al. explained that
AI automates the creation of test cases that meet adequacy criteria, significantly impacting this
area. Various AI techniques, including genetic algorithms, ant colony optimization, and natural
language processing, are applied to generate test cases, improve software coverage, and create
efficient testing strategies. Also, AI approaches generate test inputs and data, improving test-
ing coverage and effectiveness. Techniques like genetic algorithms and ML models are utilized
to produce complex test data, such as images and sequences of actions for app testing. Zubair
added the possibility for AI to aid in solving the test oracle problem by automating the veri-
fication of correct behavior in software testing. Techniques include using neural networks and
ML algorithms to automatically generate test oracles and predict expected outputs. Another
important impact is Test Case Prioritization when AI arranges test execution orders to maxi-
mize defect detection early in the testing process. Techniques include reinforcement learning,
clustering, and ML models to prioritize based on factors like test history, failure rates, and test
case characteristics. Lastly, AI techniques can provide predictions on testing efforts and costs,
utilizing ML models to estimate efforts based on historical data and test suite characteristics
[31].
Souri et al. [19] employed a Multi-Objective Genetic Algorithm (MOGA) in the context of web application testing and found that MOGA can minimize test cases while maintaining
coverage and reducing costs.
Straub and Huber [50] introduced the Artificial Intelligence Test Case Producer (AITCP),
which optimizes human-generated scenarios using AI algorithms. Their findings indicate that
AI methods like AITCP can effectively test AI systems, proving particularly efficient for both
surface and airborne robots. Moreover, through the authors’ analysis it is evident that humans
perform slower when testing software compared to AI. Specifically, the AI-based testing tech-
nique (AITR) significantly outperforms manual human testing, especially in complex scenarios.
The results show that while human performance degrades significantly with the complexity of
the testing scenario, the AI maintains a consistent performance level, indicating a substantial
efficiency advantage of AI over manual testing in software evaluation.
Yerram et al. [58] discuss how AI-driven software testing leverages machine learning and
AI to automatically create test cases by analyzing software specifications and past testing data.
This approach enhances code coverage and fault detection, reducing the manual effort required
in test case design and allowing testers to focus on higher-level tasks.
Moreover, according to Porter et al. [42], a key application of machine learning and AI in
quality assurance is the automation of test generation which replaces the labor-intensive and
time-consuming manual creation of test cases. Traditional methods rely on requirements, speci-
fications, and domain expertise, but often miss critical edge cases or scenarios. Machine learning
addresses these limitations by automating test case creation.
Shajahan et al. [48] describe how AI-based testing enhances productivity and allows testers
to focus on more critical tasks. It improves test coverage and accuracy by identifying intricate
relationships within software systems, leading to more comprehensive test scenarios and better
fault detection.
Serra et al. [47] conducted an empirical study comparing the effectiveness of manual and
automatic unit test generation. The study found that while manually written tests tend to be
more readable and contextually accurate, automatic unit test generation can significantly reduce
the time and effort required for test creation. However, the automatically generated tests often
suffer from readability and maintenance issues.
2.5 Multiple-Criteria Decision-Making
infrastructure like Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) have
been beneficial; further work is needed to fully exploit their potential while designing techniques
that require less computation without sacrificing performance [31].
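In its standard form, the WSM score of the i-th alternative is the weighted sum of its scores across all $n$ criteria, with $w_j$ denoting the normalized weight of the j-th criterion:

\[ A_i^{\mathrm{WSM}} = \sum_{j=1}^{n} w_j \, a_{ij} \]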
where:
• $a_{ij}$ is the score value of the i-th alternative in terms of the j-th criterion (the calculation method for the score values of each criterion is mentioned in Subsection 3.4.2),
Chapter 3
Method
In this chapter, we describe the methodological steps in this thesis. We start by conducting a
literature review of articles discussing unit testing, good practices in unit testing, the impact
and challenges of unit testing, and the Multiple-Criteria Decision-Making approach. We search
through peer-reviewed articles, books, and studies, utilizing multiple databases including IEEE
Xplore and ACM Digital Library. Keywords such as "Unit testing," "AI tools," and "AI in software
testing" are employed. Initially, we review the abstracts to assess their relevance to the targeted
information. Subsequently, we read the selected articles and studies, documenting the needed
information. For books, we focus on relevant sections by using their table of contents.
Then, we review 18 AITT and filter them down to 7 tools capable of generating executable
unit tests. We summarize each tool’s documentation, covering its functionalities, features, and
security aspects. Following this, we analyze the code of ILA to familiarize ourselves with its
functions. The next step involves condensing the literature study on best practices of unit testing
into a list of 18 criteria. This list serves as the road map for the evaluation. Additionally, another
criterion is also taken into consideration, which is the quality of the generated test cases. We begin the evaluation by assessing the existing MI unit tests, followed by evaluating the AI-generated unit tests for each selected AITT. Afterwards, an interview is held with an IKEA engineer to obtain feedback on the quality of the AI-generated test cases, based on the typical considerations IKEA engineers use for assessing test case quality. Thereafter, we compare the evaluation results using the Multi-Criteria Decision-Making approach to rank the AITT along with the MI unit testing. Finally, we interview an IKEA tech leader to discuss the challenges associated with adopting each AITT. Figure 3.1 provides an overview of the entire method to aid readability.
3.1 Exploring the existing AITT
1. Codiumate - by CodiumAI
3. TabNine - by Codota
8. Cody AI - by Sourcegraph
A documentation study of these 7 promising AITT is conducted, covering their features, functionality, and security. Moving forward, the unit tests generated by these tools undergo further assessment based on the criteria outlined in Section 3.2.
Evolution of GPT
The initial version of the GPT models was launched with 110 million parameters, establishing
the groundwork for utilizing Transformers in language modeling tasks. Subsequently, an en-
hanced version, GPT-2, was released, featuring 1.5 billion parameters. This version proved more
effective, and capable of generating more coherent and contextually appropriate text. However,
OpenAI initially restricted its release due to concerns about its potential misuse in creating de-
ceptive content. With an impressive 175 billion parameters, GPT-3 represented a substantial
advance in the capacity of machines to produce human-like text. It is capable of performing
a broad array of tasks, such as translation, summarization, and question-answering, often re-
quiring minimal to no task-specific training [17]. GPT-3.5 was then introduced as an improved
version that addressed some of GPT-3’s limitations, enhancing its ability to understand and
produce text for particular applications. Finally, GPT-4, the most recent iteration of OpenAI’s
language models, demonstrates significant progress over its predecessors. GPT-4 is adept at pro-
cessing both text and images, marking a notable enhancement in generating computer code from
visual inputs [46].
• Automated Test Generation: Codiumate automates the generation of test cases, signifi-
cantly reducing the time and effort developers spend on testing. It covers a wide range of
scenarios to ensure comprehensive code coverage and helps identify potential issues early
in the development cycle.
• Code Explanation and Suggestions: The plugin offers in-depth explanations of code snip-
pets and provides suggestions for improvements. This feature aids developers in un-
derstanding complex code and implementing best practices, enhancing code quality and
maintainability.
• Customizable Testing Options: Users have the flexibility to customize test names, objec-
tives, and other parameters, making it easier to align tests with specific project require-
ments and standards.
Codiumate builds upon OpenAI’s GPT models and incorporates CodiumAI’s proprietary algo-
rithmic engine. This combination enables the plugin to understand context, generate relevant
test cases, and offer actionable insights for code enhancement. The approach involves distribut-
ing and chaining multiple prompts to create a diverse set of tests, efficiently gathering a broad
code context, and allowing interactive adjustments to the generated tests [14][13].
For the paid version, any data, including code, that is stored on the servers for troubleshooting purposes will automatically be deleted within 48 hours [18].
developers’ needs, offering highly relevant code suggestions that significantly reduce the time
spent on coding tasks. One of the key features of TabNine is its ability to generate unit tests for
an implemented code by right-clicking on a specific function and then selecting the "generate
unit tests" option. TabNine can even generate unit tests based on natural language comments,
making it a powerful tool for speeding up the development process [51] [53].
TabNine employs a sophisticated approach by integrating three distinct pre-trained Ma-
chine Learning models that collaborate seamlessly. Initially, it leverages a vast dataset compris-
ing over a billion lines of code sourced from open repositories. This primary model offers the
flexibility to run either locally or on cloud infrastructure, with a default preference for local
execution of code which can enhance the code privacy of the user.
TabNine ensures the privacy of user code by employing temporary processing, where user
code is only retained on the server for the duration of computing the desired result and is never
persisted. As users code or ask questions in chat, TabNine requests AI assistance from its cluster.
Requests include some code from the local IDE workspace as context to provide relevant answers.
This context is immediately deleted after generating the tests. TabNine creates a Retrieval-Augmented Generation (RAG) index
based on the code in the local workspace of the IDE for each user. The company develops its
models based on open-source code with permissive licenses, and no third-party Application
Programming Interfaces (APIs) are used [52]. All of this leads to high privacy of the processed
code. These privacy conditions apply to the two versions (Tabnine Protected) and (Tabnine +
Mistral), but not to the other versions that use GPT models, as they do not guarantee full privacy.
comments [24]. Additionally, GitHub Copilot extends its innovation to testing, simplifying
the creation of unit tests. Through Copilot Chat or right-clicking on a selected function, it
offers the capability to generate test case snippets, drawing from the open code or highlighted
snippets within the editor. This feature aids in swiftly crafting tests for specific functions, sug-
gesting potential inputs and expected outputs, and even asserting the correctness of functions
based on their context and semantics. Copilot Chat’s ability to recommend tests for edge cases
and boundary conditions—such as error handling and unexpected inputs—further bolsters code
robustness [22].
GitHub Copilot operates predominantly on the user’s local machine. It generates code sug-
gestions based on the context of the current file and any related files without sending this sen-
sitive information to external servers. This local processing ensures that private code remains
private, safeguarding against unauthorized access or leakage [23].
List of criteria:
3. Avoids if statements
8. Guarantees fast execution time (no longer than a few minutes) to ensure quick feedback
and facilitate a rapid development process.
9. Is easy to maintain with clear naming conventions, and a well-organized structure for the
test suite.
3.2 Evaluating Unit Tests
18. Is self-validating
We generate unit tests using each AITT, according to their usage instructions. Most AITT
offer the option to generate unit tests by right-clicking on the functions being tested. However,
IKEA GPT and CodePal are exceptions. IKEA GPT uses a natural language method where we
input the function to be tested and request unit tests, and the tool subsequently provides the
unit test as a result. Additionally, CodePal allows for the generation of unit tests by inputting
the tested functions on their website and selecting "generate". By using the Vitest Framework, we
are able to see the test outcome, including the number of test cases that failed, the number that passed, and the percentages for code coverage, statement coverage, and branch coverage. We assess both the MI unit tests and the AI-generated unit tests using the criteria mentioned above. The results of these evaluations are detailed in Section 4.1 and Section 4.2, respectively. The quality of the test cases is evaluated together with an IKEA engineer through an interview. This step is crucial for strengthening the evaluation process. Following this, the challenges associated with adopting each AITT are explored in discussions with an IKEA tech leader in another interview. The findings from these discussions are presented in Section 4.3.
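As an indication of how such coverage figures are obtained, a typical Vitest setup might look like the sketch below. This is an illustrative configuration assuming the @vitest/coverage-v8 provider, not the ILA project's actual configuration.

// vitest.config.ts (illustrative configuration, not ILA's actual setup)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',             // requires the @vitest/coverage-v8 package
      reporter: ['text', 'html'], // print a summary and write an HTML report
    },
  },
});

Running npx vitest run --coverage then reports, per file, the statement, branch, function, and line coverage percentages used for criteria C10 to C12, together with the numbers of passed and failed test cases.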
The terms that are used for evaluation in the tables in Chapter 4 are:
• Pass: The criterion is fulfilled (=100% of the test cases meet the criteria).
• Predominantly pass: The criterion is mostly fulfilled (≥75% to <100% of the test cases
meet the criteria).
• Partially pass: The criterion is partially fulfilled (≥25% to <75% of the test cases meet the criteria).
• Fail: The criterion is not fulfilled (<25% of the test cases meet the criteria).
• C1: We verify whether the test cases include the sections "Arrange", "Act", and "Assert".
The evaluation terms above are used for this assessment.
• C2: We verify that there are no cycles of the "Arrange", "Act", and "Assert" sections. The
evaluation terms above are used for this assessment.
• C3: We verify that there are no ’if’ statements in the test cases. The evaluation terms above
are used for this assessment.
• C4: We verify that each test case has only a one-line "Act" section. The evaluation terms above are used for this assessment.
• C5: We introduce a fault in the code and check the number of failing test cases out of the
total number of test cases. The evaluation terms above are used for this assessment.
• C6: We calculate the number of passing test cases when the functionality of the code is
correct. The evaluation terms above are used for this assessment.
• C7: We follow the accuracy formula in Subsection 2.2.4 by dividing the sum of the true positive and
true negative values by the sum of these values added to the false positive and false neg-
ative values. After calculating the percentage, we use the evaluation terms above for the
assessment of this criterion.
• C8: This criterion passes if the execution time is less than 2 minutes, otherwise it fails.
This ensures rapid feedback which does not hinder the developer from a fast continuous
development process as described in Subsection 2.2.2.
• C9: We check whether the test cases have clear naming conventions and a well-organized structure for the test suites. The evaluation terms above are used for this assessment.
• C10: We get a coverage percentage from the Vitest framework provided as an output in the
IDE that we are using (Visual Studio Code). This criterion passes if the coverage is 60% or
above, otherwise it fails. This ensures sufficient coverage as described in Subsection 2.2.3.
• C11: We get a coverage percentage from the Vitest framework provided as an output in the
IDE that we are using (Visual Studio Code). This criterion passes if the coverage is 60% or
above, otherwise it fails. This ensures sufficient coverage as described in Subsection 2.2.3.
• C12: We get a coverage percentage from the Vitest framework provided as an output in the
IDE that we are using (Visual Studio Code). This criterion passes if the coverage is 60% or
above, otherwise it fails. This ensures sufficient coverage as described in Subsection 2.2.3.
• C13: We check whether the test cases have a small test code size (less than 50 lines of code) and a low number of assertions (2 or fewer). The evaluation terms above are used for this assessment.
• C14: We verify that the test cases have a single purpose. The evaluation terms above are
used for this assessment.
• C15: We check that the output of the test cases remains the same each time the tests are run (without any modification of the source code). The evaluation terms above are used for this assessment.
• C16: We verify that there are Happy and Sad Tests included in the test suite. The evalua-
tion terms above are used for this assessment.
3.3 Interview with IKEA’s Engineers
• C17: We check that the results of the existing test cases do not depend on newly added test cases. In other words, no test is dependent on another test. The evaluation terms above are used for this assessment.
• C18: We verify that the test cases provide either a pass or a fail output, where no manual
monitoring or interpretation of the test results is needed. The evaluation terms above are
used for this assessment.
• How relevant is the test case’s aim to the functionality of the tested function?
• How relevant is the test case’s assertion to the aim of the test case?
• Are there enough test case variations to cover most of the possible scenarios?
The outcomes of the interview are presented with each AITT's evaluation results in Subsections 4.2.1 to 4.2.7. The outcomes are translated into numbers by first applying the evaluation method described in Section 3.2 and then assigning scores using the method mentioned in Subsection 3.4.2.
Likewise, a one-hour online interview is held with IKEA's tech leader to address the challenges associated with adopting AITT in IKEA's web development, after showing them the evaluation results. A tech leader/senior software engineer with 12 years of experience in the software engineering field is chosen for this interview. They are responsible for decision-making about the tools used in projects and have considerable expertise in incorporating new technologies into IKEA's systems. Correspondingly, several questions are prepared beforehand to steer the discussion, outlined as follows:
• Would you trust the unit tests generated using the evaluated AITT? Furthermore, would
you undertake any additional actions if you were to use them?
• Could the AITT be used while considering both the AITT privacy policy (including those
with a policy of storing data for 48 hours or those that claim not to store any data) and
IKEA’s policy?
• Are there any other challenges of introducing the selected and evaluated AITT into IKEA’s
web development context?
The outcomes of the interview are presented in Section 4.3. During the interviews, detailed
notes were taken to capture the key points and feedback provided by the engineers. These notes
were then reviewed and analyzed. The analysis involved identifying key points relevant to the
evaluation of the unit test cases and the associated challenges. These points were organized into
categories based on their type of challenge and relevance to the evaluation criteria. This step
facilitated an understanding of the engineers’ perspectives on the quality of the test cases and the
potential challenges of adopting AITT. These categorized points were then used in subsequent
comparison steps.
3.4 Using Multi-Criteria Decision-Making to Compare the AITT and MI Methods
import numpy as np

# Pairwise comparison matrix for the 19 criteria. The first five rows of the
# matrix, together with the opening lines of the listing, appear on the
# preceding page and are omitted here.
matrix = [
    [9, 9, 9, 9, 1, 1, 1, 3, 5, 3, 3, 3, 5, 7, 1, 3, 5, 5, 1],
    [9, 9, 9, 9, 1, 1, 1, 3, 5, 3, 3, 3, 5, 7, 1, 3, 5, 5, 1],
    [7, 7, 7, 7, 1/3, 1/3, 1/3, 1, 5, 1, 1, 1, 5, 5, 1, 3, 5, 5, 1],
    [3, 1, 3, 1, 1/5, 1/5, 1/5, 1/5, 1, 1/5, 1/5, 1/5, 1, 1, 1/7, 1/5, 1, 3, 1/9],
    [7, 7, 7, 7, 1/3, 1/3, 1/3, 1, 5, 1, 1, 1, 5, 5, 1/3, 1, 5, 7, 1/3],
    [7, 7, 7, 7, 1/3, 1/3, 1/3, 1, 5, 1, 1, 1, 5, 5, 1/3, 1, 5, 7, 1/3],
    [7, 7, 7, 7, 1/3, 1/3, 1/3, 1, 5, 1, 1, 1, 5, 5, 1/3, 1, 5, 7, 1/3],
    [3, 1, 3, 1, 1/5, 1/5, 1/5, 1/5, 1, 1/5, 1/5, 1/5, 1, 1, 1/7, 1/5, 3, 3, 1/9],
    [5, 5, 5, 1, 1/7, 1/7, 1/7, 1/5, 1, 1/5, 1/5, 1/5, 1, 1, 1/7, 1/7, 1, 3, 1/9],
    [9, 9, 9, 9, 1, 1, 1, 1, 7, 3, 3, 3, 7, 7, 1, 5, 7, 7, 1/3],
    [7, 7, 7, 7, 1/3, 1/3, 1/3, 1/3, 5, 1, 1, 1, 5, 7, 1/5, 1, 5, 5, 1],
    [5, 5, 5, 5, 1/5, 1/5, 1/5, 1/5, 1, 1/5, 1/5, 1/5, 1/3, 1, 1/7, 1/5, 1, 1, 1/9],
    [3, 3, 5, 5, 1/5, 1/5, 1/5, 1/5, 1/3, 1/7, 1/7, 1/7, 1/3, 1/3, 1/7, 1/5, 1, 1, 1/9],
    [9, 9, 9, 9, 1, 1, 1, 1, 9, 3, 3, 3, 9, 9, 3, 1, 9, 9, 1]
]


def calculate_weights(matrix):
    # Calculate the eigenvalues and eigenvectors
    eigen_vals, eigen_vecs = np.linalg.eig(matrix)

    # Find the index of the largest eigenvalue
    max_eigen_index = np.argmax(eigen_vals)

    # Get the principal eigenvector (the eigenvector corresponding to the largest eigenvalue)
    principal_eigenvector = eigen_vecs[:, max_eigen_index]

    # Normalize the weights
    normalized_weights = principal_eigenvector / principal_eigenvector.sum()

    return np.real(normalized_weights)


weights = calculate_weights(matrix)

print("Weights:")
print(weights)
print("\nSum of weights:")
print(weights.sum())
With the use of the above code, we are able to calculate the eigenvalues and eigenvectors
of our matrix. Next, we obtain the principal right eigenvector by choosing the eigenvector
corresponding to the largest eigenvalue. Lastly, we normalize this eigenvector to obtain the
normalized weights for our 19 criteria.
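In symbols, the code solves the eigenvalue problem for the pairwise comparison matrix $A$ and rescales its principal eigenvector so that the weights sum to one:

\[ A\,w = \lambda_{\max}\, w, \qquad w_j \leftarrow \frac{w_j}{\sum_{k=1}^{19} w_k} \]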
2. The total score for each criterion in each tool table is calculated by averaging the scores for
all files. This is done by adding the scores for each file and then dividing by the number
of files (4 in this case). The average score is then recorded in the tables in the Chapter 4
as the total score for the criteria across all files.
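The two computations described here and in Subsection 2.5.2 can be sketched as follows; the weights and scores are hypothetical placeholders, not the thesis data.

// Illustrative sketch of the scoring steps (hypothetical numbers).

// Normalized criteria weights w_j from the AHP step in Subsection 3.4.1.
const weights = [0.12, 0.08, 0.05]; // ...one entry per criterion

// Scores for one tool: one row per criterion, one entry per evaluated file (4 files).
const fileScores = [
  [3, 2, 3, 3], // criterion 1
  [2, 2, 1, 3], // criterion 2
  [3, 3, 3, 3], // criterion 3
];

// Step 1: average each criterion's score across the files.
const avgScores = fileScores.map(
  (scores) => scores.reduce((sum, s) => sum + s, 0) / scores.length,
);

// Step 2: WSM score = sum over criteria of weight * averaged score.
const wsmScore = weights.reduce((sum, w, j) => sum + w * avgScores[j], 0);

console.log('WSM score:', wsmScore);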
Chapter 4
Result
In this chapter, we present the results of the evaluations of the existing MI unit tests and gen-
erated unit tests for each AITT. Additionally, we include feedback from an interview with an
IKEA engineer on assessing the quality of the test cases for the generated unit tests, presented
for each AITT. Subsequently, we show the outcomes of an interview with IKEA's tech leader re-
garding the challenges associated with adopting each AITT. Finally, we present the results from
the Multi-Criteria Decision-Making method, with tables showing the normalized weights and
the final WSM scores for each alternative.
specific case, which involves 4 medium-sized TypeScript files, the engineer from IKEA reported
that implementing the unit tests required two workdays (approximately 16 hours). Moreover,
the performance of the MI unit tests averaged 11.9 seconds, as shown in Table 4.1 . This indicates
that the tests are quick in terms of execution time, which sustains a high development velocity
by allowing the developer to rapidly receive feedback and continue coding.
The test cases for the existing MI unit tests align closely with the functionality of the func-
tion being tested, incorporating relevant and essential assertions. The variation of test cases covers all possible scenarios with no redundancy. The success of these test cases often relies on the tester's expertise. Based on the approach used to assign scores in Subsection 3.4.2, the score for the quality of test cases is 3.
In Table 4.1 and Table 4.2 we see that the MI unit tests are implemented using the AAA
approach without any repeated AAA cycles or ’if’ statements. Each test is designed to fulfill
a single responsibility, maintain test independence, and be self-validating. However, some test
cases include multiple lines in the act section and multiple responsibilities. For instance, the test
case titled "Should return a valid design List array for one selected item when the user selects
one value, template, dataset or app Method" could be divided into three separate tests. The first
would verify the return of a valid design list array upon selecting a template. The second would
do so upon selecting a dataset, and the third upon selecting an app method. By doing so, we
could eliminate multiple lines in the act section while maintaining a single responsibility per
unit test. Overall, the structure is adequate, and the multiple lines in the act section do not
compromise the test results.
Moreover, Table 4.1 and Table 4.2 show that the MI unit tests are reliable, with no random values used, and test results remain consistent unless the source code is modified. Additionally, the MI unit tests demonstrate high measurement accuracy in terms of protection
against regression and resistance to refactoring. Consequently, they achieve a high accuracy
rate. According to IKEA’s software engineer, this outcome was anticipated because the tester
carefully considers the appropriate test cases to implement, which directly influences accuracy.
Moreover, the greater the time and effort invested in developing the test cases, the higher the
accuracy percentage will be.
Furthermore, the coverage in the MI unit tests is sufficient for line, code, and branch cov-
erage, exceeding 60%. The percentages achieved suffice to address the most critical test case
scenarios. However, the coverage could be improved if the developer concentrated on including
negative (sad path) tests where they were missing in some cases. Incorporating both positive
(happy path) and negative tests along with a high percentage of coverage can ensure compre-
hensive testing of the tested function’s full functionality.
Lastly, the test code for the MI unit tests has variables with clear names and well-defined
linked functions’ components. Additionally, the code includes a clear description of the test
cases that correspond to the purpose of the functionality. There are no redundant assertions that
would degrade the quality of the test cases. However, the size of the tests could be reduced by
creating shared test data, rather than initializing it in each test case. The degree of maintainability
can vary depending on an individual’s expertise. In our case, the maintainability is very good,
which positively influences the quality of the tests.
4.1 Evaluation of The Existing MI Unit Tests
Table 4.1: Evaluation of ILA Unit Tests Across Different Classes for MI
unit tests (Part 1)
Table 4.2: Evaluation of ILA Unit Tests Across Different Classes for MI
unit tests (Part 2)
4.2 Evaluation of the Selected AITT
4.2.1 UnitTestsAI
Table 4.3 and Table 4.4 display the results of the evaluation of the unit tests generated by
UnitTestAI for each criterion, with a final average score. A detailed analysis of the results is
presented in this section.
The generated unit tests were divided across four files: Save-labels has 4 unit tests, Sort-
labels has 7 unit tests, Update-DesignList-Validity has 5 unit tests, and lastly, FilterFunctions
has 7 unit tests.
Based on the thoughts of IKEA’s software engineers obtained during our interview to assess
the quality of the test cases, we found that each test case demonstrates relevance between the test
case’s aim and the functionality of the tested function, as well as relevance between the assertion
and the test case’s aim. However, the test cases lack variety in the "arrange" section (inputs for
the assertions), as they explore only two scenarios: either all inputs are correct or they are not.
Nonetheless, these issues should not prevent users from utilizing the tools, as they can generally
be fixed manually, ensuring excellent quality of the test cases without requiring substantial time.
Based on the approach used to assign scores in Subsection 3.4.2, the score for the quality of test cases is 2.
In Table 4.3 and Table 4.4, we see that UnitTestAI implements the AAA approach, main-
tains a single responsibility with a one-line act section, is self-validating, ensures test indepen-
dence, and avoids multiple cycles of AAA and ’if’ statements in most of the generated unit tests.
These results demonstrate the tool’s ability to generate tests with proper structural considera-
tion. Therefore, the user does not need to be concerned about or recheck the structure after the generation
process is complete. Moreover, the mentioned tables indicate that UnitTestsAI demonstrates
reliability, with no random values used, and test results remain consistent unless the source
code is modified. Additionally, the level of protection against regression and resistance to refac-
toring scored an average of 2.5/3 for both criteria, resulting in a high accuracy level, which in
turn indicates their safety for use in real-life projects. This level of test accuracy is sufficient to
demonstrate whether the function is correctly implemented in terms of functionality.
Furthermore, according to the measurements in the Table 4.3 and Table 4.4, UnitTestsAI
implements unit tests with sufficient coverage in terms of line, code, and branch coverage, ex-
ceeding 60%, by using various types of test cases and covering both happy and sad tests.
Additionally, Table 4.3 and Table 4.4 show that the tool features tests that are easy to main-
tain and understand in most cases where clear variable names are used with an organized code
structure, and the components of linked functions are well-defined. Moreover, there are no re-
dundant assertions, which could decrease the quality of the test cases. Additionally, UnitTestsAI generally has an appropriate code size, where little shortening is needed.
Table 4.3: Evaluation of ILA unit tests across different files using
UnitTestsAI tool (Part 1)
Table 4.4: Evaluation of ILA unit tests across different files using
UnitTestsAI tool (Part 2)
Table 4.5: Evaluation of ILA unit tests across different files using the PaLM tool (Part 1)
Table 4.6: Evaluation of ILA unit tests across different files using the PaLM tool (Part 2)
4.2.3 TabNine
Table 4.7 and Table 4.8 display the results of the evaluation of the unit tests generated by Tab-
Nine for each criterion, with a final average score. A detailed analysis of the results is presented
in this section.
The generated unit tests were divided across four files: Save-labels has 6 unit tests, Sort-
labels has 3 unit tests, Update-DesignList-Validity has 4 unit tests, and lastly, FilterFunctions
has 6 unit tests.
Based on the thoughts of IKEA’s software engineers obtained during our interview to assess
the quality of the test cases, we found that the unit tests generated for TabNine include a sig-
nificant number of irrelevant test cases. Consequently, the assertions in these test cases are also
irrelevant to their objectives. Some test cases fail to implement the arrange section with the
correct types of values (such as template names and template categories’ names), which results
in some failed test cases. Additionally, one file lacks test cases for certain functions that should
have been tested. The unit tests also do not offer sufficient variation in test scenarios. These nu-
merous issues cannot be resolved quickly manually, making the test cases unreliable. The quality
of test cases should not be overlooked; higher-quality tests more effectively fulfill the goals of
unit testing. Based on the approach used to assign scores in Subsection 3.4.2, the score for the quality of test cases is 0.
In Table 4.7 and Table 4.8, we observe that TabNine does not successfully implement the
AAA approach in two files. This failure arises because the tool merges the act and assert sec-
tions, which detracts from the clear structure of the unit test. Nevertheless, the tool maintains
a single responsibility with a one-line act section, is self-validating, ensures test independence,
and avoids multiple cycles of AAA and ’if’ statements in most of the generated unit tests. These
results demonstrate the tool’s ability to generate tests with good structural consideration but
with a partial lack of implementation of the AAA pattern. Therefore, the user must recheck the
AAA pattern after the generation process is complete. Moreover, the mentioned tables indicate
that TabNine demonstrates reliability, with no random values used. The level of resistance to refactoring is very high; however, the protection against regression is very low, with an average score of 1.25/3. This score indicates the tool's unreliability for use in real-life projects. Nevertheless, minor manual adjustments to the generated tests—such as correcting the data in the arrange section or modifying the assert section—could enhance these protection and resistance rates, improving accuracy to a sufficient level. Overall, the accuracy evaluation of the
tool suggests that it might be enough in certain situations—for instance, if a developer lacks
the time to implement unit tests, or the tests are not a top priority, yet the developer seeks to
roughly gauge the quality of the source code. In such cases, this level of test accuracy would be
improved with some modifications to the tests.
Furthermore, according to the data in Table 4.7 and Table 4.8, TabNine implements unit
tests with sufficient coverage in terms of lines, code, and branches, exceeding 60%, by generating
various types of test cases. Regarding the coverage and diversity of test cases, the same analysis as for PaLM applies to TabNine. On the other hand, one file had missing test cases for some functions that should have been tested.
Additionally, Table 4.7 and Table 4.8 show that TabNine features tests that are easy to main-
tain and understand in most cases, with clear variable names, organized code, and well-defined
components of linked functions. Moreover, there are no redundant assertions, which could de-
crease the quality of the test cases. Lastly, TabNine generally has good code sizes, although they
could be further reduced by creating shared data.
Table 4.7: Evaluation of ILA unit tests across different files using TabNine tool (Part 1)
Table 4.8: Evaluation of ILA unit tests across different files using TabNine tool (Part 2)
4.2.4 Codiumate
Table 4.9 and Table 4.10 display the results of the evaluation of the unit tests generated by
Codiumate for each criterion, with a final average score. A detailed analysis of the results is
presented in this section.
The generated unit tests were divided across four files: Save-labels has 29 unit tests, Sort-
labels has 26 unit tests, Update-DesignList-Validity has 67 unit tests, and lastly, FilterFunctions
has 62 unit tests.
Based on the feedback from IKEA's software engineers, obtained during our interview on test
case quality, we found that all the unit tests generated by Codiumate have relevant aims and
proper assertions. On the one hand, the unit tests offer sufficient variation to cover all test
scenarios, but on the other hand, the tool generates a large number of redundant test cases,
including some that are unnecessary. However, it is important to mention that it is the user
who decides which test cases to generate and how many. Nonetheless, these issues should not
prevent users from utilizing the tools, as they can generally be fixed manually, ensuring excellent
quality of the test cases without requiring substantial time. Based on the approach we use to
assign the scores in Subsection 3.4.2, the score for the quality of test cases is 2.
In Table 4.9 and Table 4.10, we see that Codiumate implements the AAA approach, main-
tains a single responsibility with a one-line act section, is self-validating, ensures test indepen-
dence, and avoids multiple cycles of AAA and ’if’ statements in most of the generated unit tests.
These results demonstrate the tool’s ability to generate tests with proper structural considera-
tion. Therefore, the user does not need to be concerned about or recheck the structure after the generation
process is complete. Moreover, the mentioned tables show that Codiumate demonstrates reli-
ability, with no random values used. The tool provides excellent protection against regression
with an average score of 2.75/3 and a below-average resistance-to-refactoring score of 1.75/3.
These scores indicate good accuracy but are insufficient for real-life scenarios. However, minor
manual adjustments to the generated tests—such as fixing the data in the arrange section or
modifying the assert section—could enhance these protection and resistance percentages, lead-
ing to sufficiently improved accuracy. Overall, this level of accuracy might be enough in certain
situations, for instance, if a developer has limited time for implementing unit tests or the tests
are not a top priority, yet the developer still wants to roughly assess the quality of the source
code. In such cases, this level of test accuracy would suffice with some manual modifications.
Furthermore, according to the measurements in Table 4.9 and Table 4.10, Codiumate imple-
ments unit tests with sufficient coverage in terms of line, code, and branch coverage, exceeding
60%, by employing various types of test cases covering both positive and negative scenarios.
Codiumate offers the option to continue generating test cases. This capability is beneficial as it
allows for broader coverage but also carries the risk of increasing test case redundancy, which
can make the test code less clear and harder to maintain for other developers.
Additionally, Table 4.9 and Table 4.10 indicate a good level of maintainability, where Codi-
umate features tests that are easy to maintain and understand in most cases, with clear variable
names, organized code, and well-defined components of linked functions. There are no redundant
assertions, which could otherwise decrease the quality of the test cases. However, Codiumate
struggles with excessive code length and redundancy, including unnecessary test cases. While
this issue does not affect the accuracy of results, it does reduce the quality of the tests and com-
plicates future modifications.
Table 4.9: Evaluation of ILA unit tests across different files using Codi-
umate tool (Part 1)
Table 4.10: Evaluation of ILA unit tests across different files using
Codiumate tool (Part 2)
Table 4.11: Evaluation of ILA unit tests across different files using
GitHub Copilot tool (Part 1)
Table 4.12: Evaluation of ILA unit tests across different files using
GitHub Copilot tool (Part 2)
4.2.6 CodePal
Table 4.13 and Table 4.14 display the results of the evaluation of the unit tests generated by Code-
Pal for each criterion, with a final average score. A detailed analysis of the results is presented
in this section.
The generated unit tests were divided across four files: Save-labels has 15 unit tests, Sort-labels
has 12, Update-DesignList-Validity has 15, and lastly, FilterFunctions has 39.
Based on the feedback from IKEA's software engineers, obtained during our interview on test
case quality, we found that the majority of the unit tests generated by CodePal are well aligned
with the intended aim and feature relevant assertions. Additionally, these unit tests exhibit
sufficient variation to encompass all test scenarios, though there are a few redundant test cases.
Based on the approach we use to assign the scores in Subsection 3.4.2, the score for the quality
of test cases is 3.
In Table 4.13 and Table 4.14, we see that CodePal implements the AAA approach, main-
tains a single responsibility with a one-line act section, is self-validating, ensures test indepen-
dence, and avoids multiple cycles of AAA and ’if’ statements in most of the generated unit tests.
These results demonstrate the tool’s ability to generate tests with proper structural considera-
tion. Therefore, the user does not need to be concerned about or recheck the structure after the generation
process is complete. Moreover, the mentioned tables indicate that CodePal demonstrates relia-
bility, using no random values. CodePal provides protection against regressions with an average
score of 1.5/3 and resistance to refactoring with an average score of 2.5/3. These scores lead to
satisfactory accuracy but are inadequate
for real-life applications. However, minor manual adjustments to the generated tests—such as
refining the data in the arrange section or modifying the assert section—could improve these
protection and resistance percentages, thereby enhancing accuracy to a sufficient level. Overall,
this accuracy level may suffice in certain circumstances—for instance, if a developer lacks the
time to implement unit tests or if these are not a high priority, yet the developer still wishes
to roughly gauge the quality of the source code. In such instances, with some manual
modifications, this level of test accuracy would be sufficient to demonstrate whether the
function is implemented correctly.
Furthermore, according to the measurements in Table 4.13 and Table 4.14, CodePal imple-
ments unit tests with sufficient coverage in terms of line, code, and branch coverage, exceeding
60%, by employing various types of test cases that cover both positive and negative scenarios.
Additionally, Table 4.13 and Table 4.14 reveal that the tool features tests that are easy to
maintain and understand in most cases, with clear variable names, organized code, and
well-defined components of linked functions. Moreover, there are no redundant assertions, which
could diminish the quality of the test cases. However, CodePal struggles with excessive code
length and with redundant or unnecessary test cases.
Table 4.13: Evaluation of ILA unit tests across different files using
CodePal tool (Part 1)
Table 4.14: Evaluation of ILA unit tests across different files using
CodePal tool (Part 2)
Table 4.15: Evaluation of ILA unit tests across different files using IKEA
GPT tool (Part 1)
Table 4.16: Evaluation of ILA unit tests across different files using IKEA
GPT tool (Part 2)
1. Would you trust the unit tests generated using the evaluated AITT? Furthermore, would you
undertake any additional actions if you were to use them? The tech leader would rely on the
unit tests generated by the AITT, provided they consistently show sufficient coverage. If
this condition is met, no further verification of these unit tests would be necessary. This
approach applies solely to unit tests, unlike integration tests, which require additional
verification steps.
2. Could the AITT be used while considering both the AITT privacy policy (including those with
a policy of storing data for 48 hours or those that claim not to store any data) and IKEA’s pol-
icy? The tech leader noted that sensitive data typically does not reside in the front-end of
web applications. This absence of sensitive data permits the use of tools like Codiumate
and CodePal in front-end components. Codiumate retains data for 48 hours in its paid
version for troubleshooting purposes, while CodePal's policy states that premium users are
not required to grant CodePal a license to use their code. However, it is not clearly stated
whether CodePal stores users' code, either temporarily or permanently. Other tools, such
as UnitTestAI, are also deemed acceptable, despite lacking detailed data-storage privacy
information, as long as there is no sensitive data in the front end. Nevertheless, direct
communication with tool providers is necessary to obtain specific policy details. Fur-
thermore, tools like PaLM, TabNine (free version), GitHub Copilot, and IKEA GPT are
considered safe for use in both front-end and back-end development at IKEA, as they do
not store data during use. However, verification of privacy policies is required from the
tool providers, except for IKEA GPT, which was developed internally.
3. Are there any other challenges of introducing the selected and evaluated AITT into IKEA’s web
development context? In response to the third question, the tech leader identified addi-
tional complications, such as the need for multiple team approvals, including the archi-
tecture team, which may consume a lot of time. Another challenge arises when the source
code changes, as corresponding adjustments to the unit tests then become necessary.
Modifying generated unit tests can be time-consuming, as developers must understand
and revise tests not originally written by them. Nonetheless, even in regular cases, it is
sometimes a different developer who modifies the MI unit tests, which amounts to the
same time consumption as modifying AI-generated unit tests. Lastly, developers would
not find it challenging to learn to use these tools because of the available documentation
and tutorial videos on the tools' pages.
4.4 Results of the Multi-Criteria Decision-Making Method
Criteria Weight
Includes the AAA pattern 0.01140775
Avoids multiple cycles of AAA steps 0.0087644
Avoids if statements 0.00722379
Contains a one-line act section 0.00999873
Provides protection against regressions 0.1096437
Offers resistance to refactoring 0.1096437
Ensures high test accuracy 0.1096437
Time to prepare and perform tests 0.07050637
Is easy to maintain 0.01683098
Achieves sufficient code coverage 0.0589628
Achieves sufficient statement coverage 0.0589628
Achieves sufficient branch coverage 0.0589628
Small test code size, and low number of assertions 0.01872291
Adheres to the single responsibility principle 0.01920049
Demonstrates reliability 0.11159307
Includes both Happy and Sad Tests 0.06018157
Ensures test independence 0.01999545
Is self-validating 0.01647219
Test case quality 0.1232828
Total: 1.0
[Pairwise comparison matrix of the 19 evaluation criteria, from which the normalized weights above were derived; entries use values from 1 to 9 and their reciprocals.]
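As a rough illustration only, the sketch below shows one common way of turning such a pairwise comparison matrix into normalized weights (column-normalize, then average each row); the exact procedure used in this thesis is the one described in Subsection 2.5.1, and the matrix values in the example are made up.

    // Approximate criterion weights from an n x n pairwise comparison matrix (Saaty-style scale):
    // normalize each column so it sums to 1, then average the rows; the resulting weights sum to 1.
    function weightsFromPairwise(matrix: number[][]): number[] {
      const n = matrix.length;
      const colSums = matrix[0].map((_, j) => matrix.reduce((sum, row) => sum + row[j], 0));
      return matrix.map(row => row.reduce((acc, value, j) => acc + value / colSums[j], 0) / n);
    }

    // Tiny three-criterion example with illustrative values (not taken from the thesis matrix).
    const example = [
      [1, 3, 5],
      [1 / 3, 1, 3],
      [1 / 5, 1 / 3, 1],
    ];
    console.log(weightsFromPairwise(example)); // approximately [0.63, 0.26, 0.11], summing to 1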
Figure 4.1: Various unit testing methods ranked by the WSM score
In Table 4.19 through Table 4.22, the first row lists the evaluation criteria, and the second
row illustrates the normalized weights corresponding to these criteria. Rows 3-10 show the 8
unit testing alternatives to be assessed and assigned a final WSM score.
For each unit testing alternative row, multiplication occurs between two numbers under
every criterion column. The first number is the normalized weight of a specific criterion, and
the second number is the score value that the unit testing alternative received for that specific
criterion (final column in the evaluation tables). For example, TabNine has a score of 1.5 for the
criterion Includes the AAA pattern (since it failed to completely implement the AAA approach,
see Subsection 4.2.3), and this criterion’s weight is 0.01140775, therefore "0.01140775∗1.5". PaLM
API has a score of 2.75 for this criterion since it implemented the AAA approach relatively better
than TabNine, thus "0.01140775∗2.75". The rest of the unit testing alternatives have the highest
score of 3 since they were able to fulfill this criterion flawlessly "0.01140775∗3", indicating that
these alternatives are equally good in this aspect.
These products are calculated for each criterion and then summed; this weighted sum, whose
formula is introduced in Subsection 2.5.2, is the WSM score. The overall WSM score for each
alternative is displayed in the column labeled 'WSM Score'.
Finally, the last column indicates the rank of each alternative, with rank 1 denoting the
best-performing unit testing approach.
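A minimal sketch of the calculation described above (the weight and the two score values are taken from the text; everything else is illustrative):

    // Weighted Sum Model: the WSM score is the sum over all criteria of (normalized weight * score).
    function wsmScore(weights: number[], scores: number[]): number {
      return weights.reduce((sum, weight, i) => sum + weight * scores[i], 0);
    }

    // Contribution of the criterion "Includes the AAA pattern" (weight 0.01140775):
    const aaaWeight = 0.01140775;
    console.log(aaaWeight * 1.5);  // TabNine's contribution: 0.017111625
    console.log(aaaWeight * 2.75); // PaLM API's contribution: 0.0313713125
    // Repeating this for all 19 criteria and summing the products gives each alternative's WSM score.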
Table 4.19: WSM Scores for the AITT and MI Method Across 19 Evaluation Criteria (Part 1/4)
Criteria: Includes the AAA pattern | Avoids multiple cycles of AAA steps | Avoids if statements | Contains a one-line act section | Provides protection against regressions
Weights: 0.01140775 | 0.0087644 | 0.00722379 | 0.00999873 | 0.1096437
MI method: 0.03422325 | 0.0262932 | 0.02167137 | 0.0124984125 | 0.3289311
Codiumate: 0.03422325 | 0.0262932 | 0.02167137 | 0.02999619 | 0.301520175
UnitTestAI: 0.03422325 | 0.0262932 | 0.02167137 | 0.02999619 | 0.301520175
TabNine: 0.017111625 | 0.0262932 | 0.02167137 | 0.02999619 | 0.137054625
PaLM API: 0.0313713125 | 0.0241021 | 0.02167137 | 0.0274965075 | 0.3289311
GitHub Copilot: 0.03422325 | 0.0262932 | 0.02167137 | 0.02999619 | 0.27410925
CodePal: 0.03422325 | 0.0262932 | 0.02167137 | 0.02999619 | 0.16446555
IKEA GPT: 0.03422325 | 0.0262932 | 0.02167137 | 0.02999619 | 0.2192874
Table 4.20: WSM Scores for the AITT and MI Method Across 19 Evaluation Criteria (Part 2/4)
Criteria: Offers resistance to refactoring | Ensures high test accuracy | Time to prepare and perform tests | Is easy to maintain | Achieves sufficient code coverage
Weights: 0.1096437 | 0.1096437 | 0.07050637 | 0.01683098 | 0.0589628
MI method: 0.3289311 | 0.3289311 | 0.21151911 | 0.046285195 | 0.1768884
Codiumate: 0.191876475 | 0.2192874 | 0.21151911 | 0.05049294 | 0.1768884
UnitTestAI: 0.27410925 | 0.3289311 | 0.21151911 | 0.05049294 | 0.1768884
TabNine: 0.3289311 | 0.2192874 | 0.21151911 | 0.05049294 | 0.1768884
PaLM API: 0.2192874 | 0.246698325 | 0.21151911 | 0.046285195 | 0.1621477
GitHub Copilot: 0.16446555 | 0.191876475 | 0.21151911 | 0.05049294 | 0.1621477
CodePal: 0.27410925 | 0.2192874 | 0.21151911 | 0.046285195 | 0.1768884
IKEA GPT: 0.246698325 | 0.2192874 | 0.21151911 | 0.05049294 | 0.1768884
Table 4.21: WSM Scores for the AITT and MI Method Across 19 Evaluation Criteria (Part 3/4)
Criteria: Achieves sufficient statement coverage | Achieves sufficient branch coverage | Small test code size, and low number of assertions | Adheres to the single responsibility principle | Demonstrates reliability
Weights: 0.0589628 | 0.0589628 | 0.01872291 | 0.01920049 | 0.11159307
MI method: 0.1768884 | 0.1621477 | 0.03744582 | 0.048001225 | 0.33477921
Codiumate: 0.1768884 | 0.1621477 | 0 | 0.05760147 | 0.33477921
UnitTestAI: 0.1768884 | 0.147407 | 0.05382836625 | 0.05760147 | 0.33477921
TabNine: 0.1768884 | 0.1621477 | 0.0514880025 | 0.05760147 | 0.33477921
PaLM API: 0.1621477 | 0.1326663 | 0.0421265475 | 0.05760147 | 0.33477921
GitHub Copilot: 0.1621477 | 0.13266635 | 0.0514880025 | 0.05760147 | 0.33477921
CodePal: 0.1768884 | 0.1621477 | 0.03744582 | 0.05760147 | 0.33477921
IKEA GPT: 0.1768884 | 0.1768884 | 0.0514880025 | 0.05760147 | 0.33477921
Table 4.22: WSM Scores for the AITT and MI Method Across 19 Evaluation Criteria (Part 4/4)
Criteria: Includes both Happy and Sad Tests | Ensures test independence | Is self-validating | Test case quality | WSM Score | Rank
Weights: 0.06018157 | 0.01999545 | 0.01647219 | 0.1232828 | - | -
MI method: 0.0752269625 | 0.05998635 | 0.04941657 | 0.3698484 | 2.8299128749999998 | 1
Codiumate: 0.18054471 | 0.05998635 | 0.04941657 | 0.2465656 | 2.53169852 | 5
UnitTestAI: 0.18054471 | 0.05998635 | 0.04941657 | 0.2465656 | 2.76266266125 | 2
TabNine: 0.12036314 | 0.05998635 | 0.04941657 | 0 | 2.2319168024999994 | 8
PaLM API: 0.090272355 | 0.05998635 | 0.04941657 | 0.2465656 | 2.4950722225 | 6
GitHub Copilot: 0.18054471 | 0.05998635 | 0.04941657 | 0.1232828 | 2.3187081475 | 7
CodePal: 0.18054471 | 0.05998635 | 0.04941657 | 0.3698484 | 2.633397545 | 3
IKEA GPT: 0.18054471 | 0.05998635 | 0.04941657 | 0.2465656 | 2.5705162974999993 | 4
Chapter 5
Discussion
In this chapter, we discuss the findings from our evaluation of MI and AITT unit testing ap-
proaches, focusing on performance, quality, and challenges. Our comparative analysis highlights
the strengths and weaknesses of each method, illustrating how advanced AI tools are narrowing
the gap with traditional manual testing. Moreover, we discuss the comparison of the evaluation
results, which reveal that while manually implemented (MI) testing remains the gold standard
due to its thoroughness and developer insight, several AI tools show promising potential in au-
tomating and enhancing the testing process. Lastly, challenges such as data type recognition, test
redundancy, and privacy concerns need to be addressed to fully integrate AITT into mainstream
software development practices.
Setting up the environment for MI unit testing involves downloading the required test framework, managing dependencies, and accessing the source
code. This process generally does not take much time. However, preparing the test data may
take longer if the developer is unfamiliar with the source code and its functionality.
Moreover, setting up the test environment is quick according to IKEA’s software engineer,
but implementing the unit tests depends on familiarity with the source code and the developer’s
expertise. Beyond the immediate results, it is important to consider the long-term impact of MI
unit testing on developer productivity. Based on what IKEA reported regarding the time to pre-
pare the tests, MI unit tests can be time-consuming, particularly in terms of setup and learning.
Frequent switching between writing code and tests can lead to cognitive fatigue, potentially
reducing overall productivity.
Based on the MI unit test evaluation results in Table 4.1 and Table 4.2, we find that the tests are
well structured and accurate: they adhere to the AAA pattern, and their test case aims are relevant
to the tested functions. The coverage is considered good since it exceeds 60%, but could be improved
with more negative tests. The test code is maintainable and well organized. Overall, the high-quality
test cases effectively cover relevant scenarios, with a fast average execution time of 11.9 seconds,
enabling rapid feedback and sustained development. However, we can-
not regard this evaluation of MI unit testing as a standard for all MI unit tests, as it varies among
developers as mentioned earlier. Therefore, there can be significant differences in the evaluation
results with other MI unit tests implemented by different developers with different years of ex-
perience. This suggests a need for good knowledge-sharing and documentation practices within
development teams. Creating a culture where documentation is prioritized and peer reviews
are a regular practice can ensure that knowledge about the codebase and best testing practices
is spread across the team. This can be particularly beneficial for new team members, who can
quickly get up to speed with the project’s testing standards and methodologies.
5.3 Comparison between AITT and MI approaches (RQ1 & RQ2)
The AITT have a low learning curve, and their documentation is generally sufficient. However,
the setup process involves the same requirements mentioned in Section 5.1, along with the
installation of the tool itself. Typically, an additional step includes registering and logging in,
which is a brief procedure.
Moreover, AITT significantly reduce the time required to prepare and conduct unit tests,
with most tests generated and executed within minutes, supporting high development velocity
as they provide developers with rapid feedback without impeding progress.
Based on the evaluation of AITT, most tools adhere to good unit test practices, although
some tools require attention. On the one hand, the accuracy of the generated tests varies sig-
nificantly, but on the other hand, most tools achieve sufficient coverage. Maintaining the tests
appears straightforward due to organized code and clear variable naming, although code size
and redundancy issues with some tools could complicate future maintenance efforts. The quality
of the test cases varies; some tools, such as CodePal and UnitTestAI, produce sufficient,
well-aligned test cases with proper assertions and could generate high-quality tests, as seen
in Table 4.3, Table 4.4, Table 4.13, and Table 4.14. However, other AITT have issues that make
them less reliable, such as Codiumate, which generated a high number of redundant test cases.
Lastly, while there are areas for improvement, particularly in enhancing test accuracy, coverage,
and quality, AITT offer significant advantages for developers looking to streamline unit test-
ing processes. With minor manual adjustments and skillful use of tool capabilities, these tools
can greatly aid in improving the quality and efficiency of software testing. Furthermore, the
findings revealed that AITT show great promise but sometimes lack the depth of understanding
seen in MI tests; TabNine, specifically, received a score of 0 for the quality of its test cases
based on the evaluation by IKEA's software engineer. Customizing and configuring
AI tools to better fit specific project needs can help bridge this gap. This could involve incor-
porating domain-specific knowledge into the AI models. For example, configuring AI tools to
recognize common patterns in the codebase or specific business logic can improve the relevance
and accuracy of the generated tests.
According to the WSM ranking, MI unit testing remains the top-performing approach; however, the
effectiveness of MI unit tests depends on the developer's expertise and knowledge in software
testing, introducing variability based on individual skill levels.
Notably, the tools UnitTestAI and CodePal, which secured second and third places respec-
tively, are not far behind MI unit tests in terms of their WSM scores. This closeness suggests
that these tools use advanced technologies that nearly match the effectiveness of MI unit testing
or serve as a strong supplement to it. With rapid advancements in AI, it is expected that these
tools will soon match the quality of MI unit tests. For teams seeking an optimal level of quality
for unit tests, these AI tools can produce tests that, with minor modifications, can compete with
the quality of MI unit tests, thus saving significant time and resources. In scenarios where satis-
factory unit tests are needed quickly, these tools can effectively replace MI unit tests, providing
a reasonable solution in a shorter timeframe with less effort.
Moreover, the tools IKEA GPT, Codiumate, and PaLM, which ranked fourth, fifth, and sixth
respectively, also display competitive WSM scores. These tools can serve as valuable supplements
to MI unit tests by generating an initial suite of unit tests, which developers can then enhance
by adding any missing test cases. This hybrid approach can potentially reduce the overall time
and resources required for unit test implementation, offering a balanced strategy for achieving
quality and efficiency.
Conversely, tools like GitHub Copilot and TabNine, which ranked seventh and eighth respec-
tively, pose challenges as effective supplements for developers. Initially using these tools might
require extensive modifications or issue resolutions, potentially resulting in longer development
times compared to MI unit tests. Therefore, these tools may not currently offer the same level
of utility as the higher-ranked AI tools or MI unit tests.
Furthermore, a relevant study mentioned in Chapter 2 by Straub and Huber [50] found that
AI-based testing techniques, like the AITCP, are significantly faster than manual testing. Their
analysis revealed that AI maintains a consistent performance level even in complex scenarios,
whereas human testers experience a significant slowdown. This demonstrates the superior effi-
ciency and speed of AI in software testing compared to manual methods. This aligns with our
findings, where the AITT were extremely fast in preparing and generating unit tests, in con-
trast to the MI method, which takes more time and depends on the individual’s expertise and
familiarity with the source code.
Last but not least, a relevant study mentioned in Chapter 2 by Serra et al. [47] found that
automatic testing tools achieve line coverage comparable to or even higher than manual tests.
This high coverage is due to the tools’ primary goal of optimizing test coverage on production
code. This demonstrates that automatic tools can surpass manual testing in ensuring thorough
code coverage. This also aligns with our findings, where some AITT were able to achieve a high
line coverage and in some cases even higher than manual testing.
5.4 Challenges of Using AITT (RQ3)
Practices such as careful test review, monitoring coverage, and ensuring data privacy are crucial
to overcoming these obstacles and effectively integrating AITT into software development workflows.
There are some challenges encountered when using the AITT. Firstly, some AITT occasion-
ally generate unit tests with incorrect data types or exclude some inputs. This can be seen with
tools like GitHub Copilot and Codiumate where a significant number of their generated test
cases did not pass, partly due to issues with the data types. This issue arises because the tools
cannot recognize custom, manually implemented data types that are not built into the program-
ming language itself. Although infrequent, it still poses a challenge. A study mentioned in
Chapter 2 by Ghosh et al. [20] discusses the importance of domain-specific knowledge in the
context of AI applications. The research mentions that AI models, particularly deep neural net-
works, often require large amounts of data for effective analysis and classification. The merging
of diverse datasets is emphasized as a key concept to improve model performance, indicating
the necessity for domain-specific knowledge and data to enhance AI capabilities. This is backed
by an article by Bajaj and Samal [6] where they discuss how AI models need to be trained on
datasets specific to the domain to ensure that test cases and bug identifications are relevant,
precise, and meaningful within the specific context of the software system. A potential solution
for this problem involves modifying the data type or importing data types from external files.
This solution is not time-consuming but needs to be kept in mind when using the AITT.
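As a hypothetical illustration of that fix (the Label interface, the module paths, and saveLabels are assumptions, not ILA code), a generated test that used a loose object literal for a project-specific type can be corrected by importing the type from the source files and annotating the test data with it:

    // Import the project-specific type the AITT could not infer, instead of guessing its shape.
    import { Label } from '../src/types/label';      // hypothetical custom type
    import { saveLabels } from '../src/save-labels'; // hypothetical function under test

    test('saves a valid label', () => {
      // Arrange: the type annotation makes mismatched or missing fields a compile-time error.
      const label: Label = { id: 1, name: 'Kitchen', valid: true };
      // Act
      const saved = saveLabels([label]);
      // Assert
      expect(saved).toHaveLength(1);
    });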
Another challenge involves tools like Codiumate, which have a click option to continue gen-
erating unit tests, and chat interaction tools like IKEA GPT, TabNine, or Copilot. The pri-
mary issue is determining when to stop generating new test cases, which sometimes leads to
redundancy. The solution involves continually monitoring coverage while generating unit tests,
which, although time-consuming, helps reduce redundancy. However, this could be complex, as
coverage is not the sole indicator of test quality; variation in test cases is crucial, as it encompasses
the possible testing scenarios (edge, sad, and happy scenarios). Therefore, an additional
measure is required which involves manually reviewing the generated tests to ensure they cover
all potential scenarios and maintain sufficient coverage (line, statement, and branch coverage).
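One lightweight way to keep that monitoring in place, assuming a Jest-based setup (the thesis does not tie the evaluation to a particular test runner), is to let the runner fail whenever coverage drops below the 60% level used in this evaluation, while scenario variation is still reviewed manually:

    // jest.config.ts: enforce the 60% line, statement, and branch coverage levels while
    // AI-generated tests are added; scenario variation and redundancy still need manual review.
    import type { Config } from 'jest';

    const config: Config = {
      collectCoverage: true,
      coverageThreshold: {
        global: {
          lines: 60,
          statements: 60,
          branches: 60,
        },
      },
    };

    export default config;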
We encountered no challenges regarding the learning process or setting up the tools since all
tools feature clear documentation, including usage guides. All tools are free, although some, like
Codiumate, offer paid upgrades for more secure versions.
Moreover, as per the tech leader we interviewed, privacy is a challenge with tools that lack
privacy information, posing a risk of data leakage. Tools that retain data for up to 48 hours could
be restricted to front-end web application usage. This aligns with the article by Bajaj and Samal
[6] which is mentioned in Chapter 2 regarding the ethical considerations in AI-based technol-
ogy, particularly AI-generated test cases. The authors highlight significant privacy and security
concerns, which gives the need for proper safeguards and compliance with privacy regulations
to protect sensitive data, such as source code, from potential breaches and misuse. However,
tools deemed secure could be utilized in both front-end and back-end parts of the web appli-
cation. There are no challenges in starting to use these tools, except for the time required to
receive approval from IKEA’s architect team. Another challenge arises when the source code
gets modified, requiring modifications to the generated unit tests. The developer would need
to understand the existing unit tests before making changes. However, this could be swiftly
resolved by generating new unit tests for the updated source code, although checking the coverage
and scenario variation is still required; this is nevertheless faster than altering MI unit tests.
Additionally, another related work is mentioned in Chapter 2 regarding the challenges of
AI in software testing. While Khaliq et al. [31] believe that AI techniques in software testing
often require significant computational resources, we have not encountered such an issue while
testing any of ILA’s files. In fact, we experienced the opposite, since the majority of the AITT
were simple extensions in the Visual Studio IDE which did not require significant computational
resources.
5.6 Ethical and Social Aspects
By reducing the time required to implement unit tests, AI-based tools enable developers to
concentrate more on core development tasks or allocate time to acquiring new knowledge and
skills. This shift potentially fosters a more dynamic learning environment where developers can
improve their testing proficiency by studying and refining the generated unit tests. Thus, AI-
driven testing tools not only streamline the development process but also promote continuous
learning and skill enhancement among developers.
Chapter 6
Conclusion
AITT advance software development by generating unit tests with a lower learning curve and
faster preparation, though test quality varies. While MI unit testing currently produces higher-
quality tests, it requires more effort. A hybrid approach combining AI-generated tests with
manual adjustments is recommended for optimal results. The research highlights AITT’s po-
tential and suggests further investigation into hybrid methods and extending evaluations to
various languages and frameworks.
In software development, there is a wide range of existing AITT to help developers generate
unit tests. Since there are not enough studies on the AI-based tools’ performance, this lack
of knowledge might make companies hesitant to use the tools, leading them to rely on MI unit
tests instead.
Our evaluation of MI unit tests (RQ1) shows they are of high quality, with quick setup but
variable implementation times depending on developers’ familiarity and expertise. The tests
are structured, accurate, and maintainable, with good coverage that could benefit from more
sad tests. Overall, the tests effectively cover relevant scenarios and offer rapid feedback for
sustained development.
Our evaluation of unit tests generated by AITT (RQ2) reveals several key insights. The tools
have a low learning curve, since they are supported by extensive documentation, and are effi-
cient in test preparation and execution. Most tools follow good unit test practices, though some
need improvement. While the accuracy of generated tests varies, coverage is generally sufficient.
Maintenance is facilitated by organized code and clear naming, despite potential issues with
code size and redundancy. Test case quality varies, with some tools producing reliable tests and
others less so. Overall, AITT significantly aid developers by simplifying and expediting unit
testing, especially with minor manual adjustments and skillful use of the tools to enhance test
accuracy, coverage, and quality.
The comparison of the AITT and MI methods found that MI unit testing is cur-
rently superior. Tools like UnitTestAI and CodePal are close to matching MI unit tests and can
provide a comprehensive unit test suite with few modifications. Conversely, IKEA GPT, Codiu-
mate, and PaLM, while not matching MI quality, can accelerate the testing process by generating
initial test suites that need further refinement. GitHub Copilot and TabNine require improve-
ments to produce reliable tests.
We addressed challenges in using AITT (RQ3), such as data type issues and redundancy in
generated unit tests with some tools, and the complexity of obtaining organizational approval
and privacy checks for proper AITT usage.
We recommend a hybrid approach, combining AI-generated unit tests with manual mod-
ifications and monitoring, to ensure efficiency, effectiveness, and high quality. For front-end
development without sensitive data, we recommend using UnitTestAI and CodePal with man-
ual modifications. For back-end development with sensitive data, we recommend using IKEA
GPT and Codiumate with manual modifications. This enhances testing efficiency and quality,
potentially reduces costs, and accelerates release cycles.
Our findings suggest future research in evaluating the selected AITT across different pro-
gramming languages to assess their versatility and adaptability to diverse development contexts.
Another direction for future research is investigating hybrid testing approaches using the selected AITT. This
could include a detailed analysis and evaluation of scenarios where a hybrid approach yields op-
timal testing efficiency and effectiveness.
Overall, this thesis explores the impact of the AITT on unit testing methods in software
development.
References
[2] Pavel Anselmo Alvarez, Alessio Ishizaka, and Luis Martínez. Multiple-criteria decision-
making sorting methods: A survey. Expert Systems with Applications, 183:115368, 2021.
[3] Carina Andersson. Exploring the software verification and validation process with focus
on efficient fault detection. 2003.
[4] Cyrille Artho and Armin Biere. Advanced unit testing: How to scale up a unit test frame-
work. In Proceedings of the 2006 international workshop on Automation of software test, pages
92–98, 2006.
[5] Martin Aruldoss, T Miranda Lakshmi, and V Prasanna Venkatesan. A survey on multi cri-
teria decision making methods and its applications. American Journal of Information Systems,
1(1):31–43, 2013.
[6] Yatin Bajaj and Manoj Kumar Samal. Accelerating software quality: Unleashing the power
of generative ai for automated test-case generation and bug identification. International
Journal for Research in Applied Science and Engineering Technology, 11(7), 2023.
[7] Antonia Bertolino. Software testing research: Achievements, challenges, dreams. In Future
of Software Engineering (FOSE ’07), pages 85–103, 2007.
[8] David Bowes, Tracy Hall, Jean Petric, Thomas Shippey, and Burak Turhan. How good are
my tests? In 2017 IEEE/ACM 8th Workshop on Emerging Trends in Software Metrics (WETSoM),
pages 9–14. IEEE, 2017.
[9] Kevin Buffardi, Pedro Valdivia, and Destiny Rogers. Measuring unit test accuracy. In
Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 578–
584, 2019.
[15] Ermira Daka and Gordon Fraser. A survey on unit testing practices and problems. In 2014
IEEE 25th International Symposium on Software Reliability Engineering, pages 201–211. IEEE,
2014.
[17] Luciano Floridi and Massimo Chiriatti. GPT-3: Its nature, scope, limits, and consequences.
Minds and Machines, 30:681–694, 2020.
[19] Jerry Gao, Chuanqi Tao, Dou Jie, and Shengqiang Lu. What is ai software testing? and
why. In 2019 IEEE International Conference on Service-Oriented System Engineering (SOSE),
pages 27–2709. IEEE, 2019.
[20] Sourodip Ghosh, Ahana Bandyopadhyay, Shreya Sahay, Richik Ghosh, Ishita Kundu, and
K.C. Santosh. Colorectal histology tumor detection using ensemble deep neural network.
Engineering Applications of Artificial Intelligence, 100:104202, 2021.
[26] IKEA. IKEA GPT. Internal website, accessible only to IKEA employees, 2024.
[27] Marko Ivanković, Goran Petrović, René Just, and Gordon Fraser. Code coverage at google.
In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference
and Symposium on the Foundations of Software Engineering, pages 955–963, 2019.
[28] Sukaina Izzat and Nada N Saleem. Software testing techniques and tools: A review. Journal
of Education and Science, 32(2):30–44, 2023.
[29] Janine Heinrichs. CodePal review: Can I instantly generate code with AI? https://www.unite.ai/codepal-review/. Accessed: 2024-03-23.
[30] Paul C Jorgensen. Software testing: a craftsman’s approach, Third Edition. Auerbach Publica-
tions, pages 3–12, 2013.
[31] Zubair Khaliq, Sheikh Umar Farooq, and Dawood Ashraf Khan. Artificial intelligence in
software testing: Impact, problems, challenges and prospect. arXiv preprint arXiv:2201.05371,
2022.
[32] Z. Khaliqa and S. Farooqa. Artificial intelligence in software testing: Impact, problems,
challenges and prospect. arXiv preprint arXiv:2201.05371, 2022.
[33] Vladimir Khorikov. Unit Testing Principles, Practices, and Patterns. Simon and Schuster, 2020.
[34] Mengyun Liu and K. Chakrabarty. Adaptive methods for machine learning-based testing of
integrated circuits and boards. 2021 IEEE International Test Conference (ITC), pages 153–162,
2021.
[35] Pan Liu, Zhenning Xu, and Jun Ai. An approach to automatic test case generation for unit
testing. In 2018 IEEE International Conference on Software Quality, Reliability and Security
Companion (QRS-C), pages 545–552, 2018.
[37] Robert C Martin. Clean code: a handbook of agile software craftsmanship. Pearson Education,
2009.
[38] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learn-
ing. MIT press, 2018.
[39] MSPoweruser. CodePal review: Is it the best all-in-one AI coding solution? https://mspoweruser.com/codepal-review/. Accessed: 2024-05-19.
[40] Robert E Noonan and Richard H Prosl. Unit testing frameworks. ACM SIGCSE Bulletin,
34(1):232–236, 2002.
[41] Phuoc Pham, Vu Nguyen, and Tien Nguyen. A review of ai-augmented end-to-end test
automation tools. In Proceedings of the 37th IEEE/ACM International Conference on Automated
Software Engineering, pages 1–4, 2022.
[42] Adam Porter, Cemal Yilmaz, Atif M Memon, Douglas C Schmidt, and Bala Natarajan.
Skoll: A process and infrastructure for distributed continuous quality assurance. IEEE
Transactions on Software Engineering, 33(8):510–525, 2007.
[43] Ramona Schwering and Jecelyn Yeen. Four common types of code coverage. https://web.dev/articles/ta-code-coverage. Accessed: 2024-05-07.
[44] Per Runeson. A survey of unit testing practices. IEEE software, 23(4):22–29, 2006.
[45] Thomas Saaty, Luis Vargas, and Cahyono St. The Analytic Hierarchy Process. July 2022.
[46] Katharine Sanderson. Gpt-4 is here: what scientists think. Nature, 615(7954):773, 2023.
[47] Domenico Serra, Giovanni Grano, Fabio Palomba, Filomena Ferrucci, Harald C Gall, and
Alberto Bacchelli. On the effectiveness of manual and automatic unit test generation: ten
years later. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories
(MSR), pages 121–125. IEEE, 2019.
[48] Mohamed Ali Shajahan. Fault tolerance and reliability in autosar stack development: Re-
dundancy and error handling strategies. Technology & Management Review, 3(1):27–45, 2018.
[49] Karuturi Sneha and Gowda M Malle. Research on software testing techniques and soft-
ware automation testing tools. In 2017 international conference on energy, communication, data
analytics and soft computing (ICECDS), pages 77–81. IEEE, 2017.
[50] Jeremy Straub and Justin Huber. A characterization of the utility of using artificial intel-
ligence to test two artificial intelligence systems. Computers, 2(2):67–87, 2013.
[51] Tabnine. The AI coding assistant that you control. https://www.tabnine.com/. Accessed: 2024-03-19.
[54] Hamed Taherdoost and Mitra Madanchian. Multi-criteria decision making (mcdm) meth-
ods and concepts. Encyclopedia, 3(1):77–87, 2023.
[56] Yi Wei, Bertrand Meyer, and Manuel Oriol. Is branch coverage a good measure of testing
effectiveness? Empirical Software Engineering and Verification: International Summer Schools,
LASER 2008-2010, Elba Island, Italy, Revised Tutorial Lectures, pages 194–212, 2012.
[57] Tao Xie, Nikolai Tillmann, and Pratap Lakshman. Advances in unit testing: theory and
practice. In Proceedings of the 38th international conference on software engineering companion,
pages 904–905, 2016.
[58] Sridhar Reddy Yerram, Suman Reddy Mallipeddi, Aleena Varghese, and Arun Kumar
Sandu. Human-centered software development: Integrating user experience (ux) design
and agile methodologies for enhanced product quality. Asian Journal of Humanity, Art and
Literature, 6(2):203–218, 2019.
Appendices
DEPARTMENT OF COMPUTER SCIENCE | LUNDS TEKNISKA HÖGSKOLA | PRESENTED 2024-06-13
MASTER'S THESIS
As web development grows increasingly complex, ensuring software quality has become
a key challenge. This thesis explores how AI-driven tools can revolutionize unit test-
ing—a vital step in software development—by comparing these tools with the manually
implemented testing method within the context of IKEA's ILA web application.
The shift toward AI solutions in companies like IKEA requires that software not only functions
correctly but also efficiently. Manual testing methods, while reliable, are often time-consuming.
AI-powered tools promise faster and potentially more thorough testing processes. In this study,
seven different AI-based unit testing tools were evaluated and compared against the manually
implemented testing method based on 19 criteria, including efficiency, quality, accuracy, and test
coverage. We found that AI-based testing tools offer the potential to speed up the unit testing
process by automating the generation of unit tests. While they excel in covering scenarios, some
do not always match the precision or depth provided by tests created manually by developers,
who can leverage intimate system knowledge. Surprisingly, one of the AI tools tested was able
to generate unit tests of nearly the same quality as those manually implemented.
The evaluation insights from this research could guide software development teams on how to
integrate AI-based testing tools effectively with existing testing protocols. This could lead to a
hybrid testing strategy, combining the speed of AI tools with the meticulousness of manual
testing. In the box below, you can find the tools that were evaluated. Maybe you want to try
one? This thesis not only benchmarks the current capabilities of AI-based testing tools in
enhancing software testing but also highlights the nuanced balance needed between automated
efficiency and human expertise, paving the way for faster software development.