0% found this document useful (0 votes)
61 views10 pages

Next Generation Flight Computers

Uploaded by

etrebnis
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views10 pages

Next Generation Flight Computers

Uploaded by

etrebnis
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Evaluation of the Multicore Performance

Capabilities of the Next Generation


Flight Computers
Marc Solé∗† Jannis Wolf† Ivan Rodriguez∗† Alvaro Jover†
Matina Maria Trompouki∗† Leonidas Kosmidis†∗ David Steenari‡
∗ † ‡
Universitat Politècnica de Catalunya (UPC) Barcelona Supercomputing Center (BSC) European Space Agency (ESA)

Abstract—Multicore architectures are currently under adop- tical threads executing in a synchronised way on all available
tion in the aerospace domain, however their software remains cores, can provide significant benefits, both at certification
single-threaded. In this paper we argue about the benefits level as well as in terms of the obtained performance.
offered by homogeneous parallel processing, both in terms of
performance, which is necessary for the implementation of We discuss how homogeneous parallel processing can re-
advanced functionalities, as well as in terms of certification and duce the effort required to achieve compliance to AMC 20-
in particular about mastering multicore interference. 193. Considering that avoiding the use of shared resources
We discuss the implementation details of this programming or mitigating their use is nearly impossible in multiprocessor
paradigm in avionics and space real-time operating systems. platforms, homogeneous parallel processing limits the Worst
We experimentally evaluate the performance benefits offered
by several high-performance multicore platforms which are Case Execution Time (WCET) analysis only to the interference
considered good candidates for next generation flight computers, among multiple instances of the same task, instead of any
using homogeneous parallel processing under a qualifiable real- possible combination of tasks that can execute at the same
time operating system used in the aerospace domain. Our time [5] [6] or pessimistic upperbounds of the single core
results indicate near to linear speed-ups compared to traditional execution time (i.e. 9× [1]).
sequential processing, showing the benefits of this approach.
Index Terms—Multicore, performance, interference, homoge- We discuss the implementation details of this paradigm
neous parallel processing, OpenMP, RTOS, RTEMS-SMP in the real-time operating systems used in the aerospace
domain. In addition, in order to demonstrate the performance
I. I NTRODUCTION benefits of this approach, we provide experimental evidence
Although the avionics and aerospace industries have re- of executing representative aerospace algorithms [7] on several
cently started migrating towards multicore hardware architec- multicore architectures which are considered for use in next
tures, at software level they currently use only single-threaded generation flight computers. Our evaluation uses OpenMP and
applications. While multicore platforms offer higher process- the space-qualified RTOS RTEMS-SMP, which is widely used
ing capacity than single core ones, i.e. a 4-core processor in the aerospace domain. Our results compare the parallel
can execute 4 different applications/tasks concurrently, the performance of the selected platforms showing near linear
actual aggregate performance is less than 4× due to resource speed-ups compared to their single-threaded performance,
sharing (i.e. shared caches, bus, memory) and intertask in- which can enable future advanced software capabilities.
terference. Prior research works [1] [2] have shown that the
II. M ULTICORE A DOPTION IN A EROSPACE
single-threaded performance of software running on 4-core
CPUs from different manufacturers can experience up to 9× A. Multicore Processors for Aerospace Systems
slowdown and 12× respectively. This not only complicates the Similar to all safety critical domains, the avionics and space
certification of airborne multicore systems according to AMC sectors are moving towards multicore architectures [8], both
20-193 [3] (formerly CAST-32A) [4], but also cannot satisfy for performance reasons, as well as due to the obsolescence
the performance required for advanced on-board processing. of high performance single core processors in the market.
In particular, single core performance is not sufficient for In the space domain, several processors particularly tar-
the implementation of systems with an increased level of geting this sector are available. Frontgrade Gaisler’s Next
autonomy relying on computationally expensive algorithms Generation Multiprocessor (NGMP), based on a quad-core
such as Machine Learning (ML) and vision-based processing. fault-tolerant LEON4 processor, has been available for over a
Therefore, there is a need for parallel processing in airborne decade, with its GR740 flight model being QML-Q and QML-
computers. V qualified in 2021 [9]. Currently, the octa-core GR765, based
In this paper, we argue that homogeneous parallel process- on both the SPARC-based LEON5 and RISC-V based NOEL-
ing, in which a demanding computation is spread among iden- V cores is under development by Frontgrade Gaisler. Both de-

© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse
of any copyrighted component of this work in other works. DOI: 10.1109/DASC58513.2023.10311151
velopments have been funded by the European Space Agency time
(ESA). The European Union has funded the development of CPU τ1 τ2 τ3 τ4 τ5 τ6
the quad-core DAHLIA processor [10], based on the ARM
Cortex-R52, which is currently adopted as a hard processor in
the NG-Ultra FPGA [11]. An open source multicore platform (a) Legacy single core task deployment.
based on NOEL-V, extended with short vector capabilities time
and an open source RISC-V Graphics Processing Unit (GPU)
CPU 0 τ1 τ3 τ5
prototype on an FPGA is under development [12] in the on-
going European Commission’s METASAT project [13]. CPU 1 τ2 τ4 τ6
In the US, the 4 core RAD5545 is available from BAE
(b) Multicore task deployment in multicores used in aerospace.
Systems [14] [15]. NASA is also funding the development
of Next-generation High Performance Spaceflight Computing
Fig. 1: Current multicore migration concept in the aerospace
(HPSC) processors. Boeing has been awarded a contract for
domain. Single-threaded tasks that were executed on the same
the development of an ARM-based multicore system [16],
CPU in a single core CPU are now distributed in multiple
with 8 ARM Cortex-A53 and dual core Cortex-R52 cores.
CPUs. Note the increase in the execution time of each task
Under the same program, Si-Five and Microchip were recently
due to multicore interference.
awarded a contract for the development of a multicore RISC-V
based processor for space [17].
In addition to these radiation-hardened processors specifi- pre-qualification of the space-profile subset of RTEMS-SMP
cally designed for space, radiation-tolerant and Commercial for the GR740 and GR712RC, and provides openly its pre-
Off-The-Shelf (COTS) multicore processors are considered qualification data package [28].
in lower cost missions for New Space [18]. The Xilinx
(now AMD) Zynq Ultrascale+ SoC (ZCU102) is an attractive C. Software Deployment in Multicores
platform combining a high performance multicore, with real- The shift from single core architectures to multicore ones
time processors and an FPGA, which is a common choice for in avionics is a disruptive move comparable to the one that
space missions [19]. Multicore SoC with embedded GPUs, the avionics sector experienced when migrating from federated
especially the ones developed for the automotive domain are to integrated architectures [29]. For this reason, the avionics
also good candidates for space [18], including the NVIDIA industry seeks solutions which are backwards compatible and
TX2 and Xavier, as well as AMD V1605B. incremental [25]. The IMA concept is based on space and
In avionics, both PowerPC and ARM-based solutions are time partitioning. In a single core environment, IMA allows
explored. The 8-core PowerPC NXP QorIQ P4080 [20] [21] the development of software partitions in isolation and their
and T2080 [22] are considered, as well as the 12-core NXP integration and deployment with other partitions in the final
T4240 [23]. However, there is a preference towards the T se- system, as if they were executing alone.
ries of NXP processors over the P ones, due to their increased When ARINC-653 Part 1 Supplement 4 added support for
performance and architectural improvements [24]. From the multicores, included a similar requirement; software developed
ARM-based multicore CPUs, NXP’s LS1088 which features for an ARINC-653 RTOS compliant with Part 1 Supplement
8 ARM Cortex A53 CPUs is considered by industry [25]. 3 to run on a single core processor, shall also run on a
single core of a multicore platform under an RTOS following
B. Multicore Support in Real-Time Operating Systems Part 1 Supplement 4 with the same behaviour. This allows
With the availability of multicore hardware, real-time oper- legacy avionics software to be migrated to multicore systems
ating systems (RTOSes) started including support for exploit- while maintaining portability. For this reason, this is the most
ing its capabilities. In the avionics domain, the implementation popular approach followed currently by both industry [21]
of Integrated Modular Avionics systems relies on an ARINC- and academia [30]. In the space domain, in non-partitioned
653 [26] compliant Real-Time Operating System. The update multicore operating systems such as RTEMS-SMP, the same
of the ARINC-653 Part 1 Supplement 3, Required Services as approach is also currently adopted in multicore deployments.
well as the later published Part 1 Supplement 4 added support Figure 1 depicts this concept. Single core tasks in RTEMS
in the specification for the support of multicores. All ARINC- or partitions in ARINC-653 RTOSes running on a single core
653 RTOS vendors are currently supporting multicores such (Figure 1a), are distributed across the multiple cores of a
as WindRiver VxWorks [24], DDC-I Deos [25], SYSGO multicore (Figure 1b). Theoretically, moving to an N-core
PikeOS [20] and fentISS LithOS [27] to name a few. multicore system, increases the capacity of the system by
While ARINC-653 RTOSes are also used in space, partic- N times. However, in practice this is not true. Note that in
ularly for their avionics part, other RTOSes are also widely Figure 1b, the execution time of each task/partition has been
used. In particular, the open source RTEMS is a com- increased, due to multicore interference.
mon choice used in numerous missions over the decades. Multicore interference is introduced due to resource sharing
Since version 5, RTEMS has obtained multicore support, between the different CPUs. This includes shared caches,
known as RTEMS-SMP. ESA has recently performed the i.e. the L2 cache or higher level caches, as well as DRAM.
Determining the execution time increase of each task/partition time
is a complex procedure, since it depends on the particular CPU τ1
characteristics of the tasks/partitions that are executing in
parallel and therefore are sharing resources.
Two tasks which are using the same resource, eg. the L2 (a) Single-threaded task.
cache, if they are executing at the same time in different CPUs time
will evict each other’s data from L2, causing them a large τ1,1
CPU 0
slowdown. However, if one of the two tasks fits in the private
L1 cache of their core, the slowdown will be smaller. CPU 1 τ1,2
In order to deal with the multicore contention problem,
(b) Parallel multithreaded task.
AMC 20-193 [3] requires the user to identify all the interfer-
ence channels that exist in the target multicore platform and
Fig. 2: Performance improvement of parallel tasks. The single
can affect the software behaviour, and apply mitigations for the
thread implementation of a demanding algorithm cannot meet
effects of contention (objective MCP_Resource_Usage_3).
its performance target, but the parallel implementation can.
There are two ways of mitigating the effects of multicore
interference. One approach is through robust partitioning. This
allows to compute the Worst Case Execution Time (WCET) interference, the task execution time is not reduced to the
of each task in isolation (objective MCP_Software_1). This half, but it is longer. Moreover, due to the non-determinism
includes different cache partitioning methods. Cache partition- introduced by resource interference, the tasks do not finish
ing can be performed either using way partitioning or cache exactly at the same time and neither with the same order.
colouring [31]. Another method for robust partitioning is to We call this execution model Homogeneous Parallelism,
take into account the effect of multicore interference in the because the tasks are executing the same code. Another work
timing analysis, in an independent way from all different co- used the term Data Parallel decomposition [31]. Homogeneous
runners, using worst case contention generation software [25]. parallelism is a very common parallel programming model,
However, robust partitioning can be pessimistic and reduce the both in multicore CPUs as well as in other types of parallel
obtained performance. For example, (static) cache partitioning processing. Variants of this programming model are known
and cache colouring, reduce the available L2 cache space also as single instruction, multiple data (SIMD) as well as
assigned to the software. If the working set of the software is single instruction, multiple threads (SIMT) used in GPUs.
larger than its L2 slice, its performance is degraded compared Apart from its performance benefits, homogeneous paral-
to its execution in isolation. For example, an application may lelism, when the parallel tasks are scheduled for execution
experience up to a 2.5× slowdown by partitioning the L2 in a synchronised way and in all available cores, has an
cache of a 4 core multicore [31]. additional benefit for certification. Since the co-runner tasks
The second approach to deal with resource interference is are fixed and identical, the generated contention has a fixed
to take into account the different co-runners, and therefore the intensity and characteristics at any point of their execution.
actual contention they generate. In this case, all combinations This means that each instance of the parallel task generates
of co-running tasks are taken into account [5] [6]. However, the same multicore contention and its execution time behaviour
the drawback of this approach is that the analysis needs to be is affected in the same way. For this reason, there is no need
repeated if the software or its schedule is changed. for neither contention mitigation nor taking into account the
effect in execution time received by running at the same time
III. H OMOGENEOUS PARALLELISM
with other software.
While maintaining single core partitions and distributing This means, that the parallel task can be analysed in isola-
them across the different cores facilitates the migration to tion, reducing significantly the multicore certification effort.
multicore architectures, it is not enough for providing the
required performance for the implementation of advanced A. Implementation of Homogeneous Parallelism in Aerospace
functionalities. Even if the combined performance of a multi- Systems
core is higher than a single core, a sequential task cannot take 1) Implementation under ARINC-653 Part 1 Supplement 4:
advantage of the multiple cores in the system. Until ARINC-653 Part 1 Supplement 3, which only supported
For example, modern and future aerospace systems will single core partitions, the implementation of homogeneous
require to perform an increasing amount of demanding compu- parallelism under an ARINC-653 RTOS required splitting the
tations, which cannot be satisfied with the single-threaded per- parallel processing in identical partitions, scheduled for exe-
formance of modern CPUs. Figure 3 shows such an example, cution in the different cores of the system. This parallelisation
in which a task is not able to meet its performance target on scheme has been demonstrated in an aerospace case study
a single core, using a sequential implementation (Figure 2a). by Thales Alenia Space [31], under the name Data Parallel
However, if the computation is divided in identical tasks which decomposition. Such implementation requires that each of the
perform the same operation but on different parts of the data, identical partitions processes a different part of the input data
the performance target can be met. Note that due to multicore and output data. For example, splitting the implementation in
two partitions requires that the first partition processes the first time
half of the data, and the second partition the second half. CPU 0 P1 P2,1 P1
ARINC-653 partitions offer both space and time partition- P2,2
CPU 1
ing. While the time partitioning in a multicore deployment
facilitates the synchronised execution of the identical parti- (a) Homogeneous parallelism implementation under ARINC-653 Part
tions on the different cores, which is required by the homo- 1 Supplement 4. At least two different partitions are required, which
geneous parallelism programming model, space partitioning exchange data through inter-partition communication mechanisms.
complicates the implementation of this task. Space partitioning time
means that one partition cannot access the memory of another
CPU 0 p1 p2,1 p1
partition, which is required for the implementation of the
homogeneous parallelism pattern, i.e. accessing the input and CPU 1 p2,2
the output data of the processing.
(b) Homogeneous parallelism implementation under ARINC-653 Part
Figure 3a shows the complexity of this implementation. 1 Supplement 5. A multicore partition with processes running in
First, the processing that was once implemented in a single different cores is used. The processes can exchange data through
partition, now needs to be decomposed in at least 2 different intra-partition communication mechanism, or directly access data,
types of partitions. The non-parallelisable part of the algorithm since they share the same address space.
remains in partition P1 , which needs to share the data to be
processed with the identical partitions P2,1 and P2,2 , through Fig. 3: Implementation of homogeneous parallelism in differ-
an inter-partition communication mechanism. Depending on ent variants of an ARINC-653 RTOS.
the access pattern of the particular parallelised algorithm, P1
needs to share only an exclusive part of the data with each
of the partitions, or the entire data with all partitions. The scheduling and core mapping can be used to achieve the
cost of the data communication depends on the actual inter- synchronised execution of the parallelised tasks.
partition communication primitive used (e.g. queuing port) A benefit compared to the previous approach is that since
and its particular implementation by the RTOS, the size of the processes share the same address space, intra-partition
the exchanged data and the number of partitions. Due to communication is significantly cheaper. Again the process p1
the ARINC-653 inter-partition communication semantics and which implements the sequential part of the software needs
space partitioning, the underlying implementation of inter- to pass the data to processes p2,1 and p2,2 , which execute in
partition communication usually involves actual data copies parallel. This can be done either using intra-partition APEX
instead of memory mapping, which increases their cost. In communication primitives or by directly accessing the shared
addition, the APEX inter-partition communication APIs re- data if they are defined as global data within the partition.
quire extra copies within the partitions in order to use the Similar to the previous example, the inter-process commu-
exchanged data. Moreover, since each partition operates in nication cost depends on the actual primitive used (e.g. black-
a distinct address space, the partitions are not able to take board) and its particular implementation by the RTOS, the size
advantage of the presence of a shared cache to speed up access of the exchanged data and the number of processes. Since
to shared data. the processes share the same address space, data transfers are
An additional drawback of the implementation of homoge- faster and can take advantage of the shared cache.
neous parallelism under Part 1 Supplement 3 is the coupling If no intra-partition communication mechanism is used, but
between different partitions. This is against the basic IMA the data are accessed through shared global memory, the cost is
principle which enables the development and verification of minimised, as well as the implementation complexity. In this
each partition in isolation. particular case, shared data accesses get a full performance
Therefore, despite its feasibility and benefits, the implemen- benefit from the shared cache, as well as consistent data
tation of homogeneous parallelisation under ARINC-653 Part through the processor’s cache coherence mechanism. If the
1 Supplement 4 can be challenging. parallel algorithm needs to write to a given memory position
2) Implementation under ARINC-653 Part 1 Supplement from multiple processes, atomic operations can be used.
5: The ARINC-653 Part 1 Supplement 5 published in 2019 While cache coherence is usually identified as a source
introduced the notion of a multicore partition, i.e. a partition of multicore interference [25] for multicore certification, note
with processes running in multiple cores. Note that the notion that in the cache of homogeneous parallelism this it does not
of process in ARINC-653 is different than the POSIX one, present a problem that needs to be mitigated. As mentioned
and it is equivalent to POSIX threads. Therefore, ARINC-653 earlier, under this execution paradigm, all parallel threads
processes share the same address space within a partition. (ARINC-653 processes) of the partition are executing exactly
This enables multithreaded processing within a partition. the same code and generate the same cache coherence traffic.
Figure 3b shows the implementation of homogeneous par- Therefore, both the effects of cache coherence and shared
allelism under this paradigm. Conceptually, the algorithm is cache contention are taken into account during their execution.
decomposed in the same way as before. However, now all In contrast to the implementation of homogeneous paral-
the processes belong in the same partition. Intra-partition lelism under Part 1 Supplement 4, multicore partitions allow
1 #pragma omp p a r a l l e l f o r d o t = vmlaq u8 ( d o t , d A 8x16 , B t r a n s p o s e d 8 x 1 6 ) ;
2 f o r ( i = 0 ; i < n ; i ++) {
3 f o r ( j = 0 ; j < n ; j ++) {
4 dot = 0; (a) NEON intrinsic example
5 f o r ( k = 0 ; k < n ; k ++) {
6 d o t += d A [ i * n+k ] * B t r a n s p o s e d [ j * n+k ] ; usum ( d o t , d A 8x4 , ” usmul ” , B t r a n s p o s e d 8 x 4 ) ;
7 }
8 d C [ i * n+ j ] = d o t ;
9 } (b) SPARROW intrinsic example.
10 }
Fig. 5: Examples of SIMD intrinsics implementing the dot
product operation shown in line 6 of Figure 4, over 16
Fig. 4: Excerpt of transposed matrix multiplication paralleli-
elements in NEON (a) and 4 elements in SPARROW [35]
sation using OpenMP from the GPU4S Bench [7] [34].
(b). The elements are 8-bit each and they are loaded from the
d_A and B_transposed arrays.
the parallel implementation of an algorithm to remain within
the boundaries of a single partition, facilitating in this way its
development and validation completely in isolation. parallel region is executed. When all the threads of the parallel
3) Implementation under RTEMS: Unlike ARINC-653 region finish their execution and reach a barrier, the data are
RTOSes, RTEMS does not implement space partitioning, and passed back to the main thread of execution (p1 in Figure 3).
therefore all tasks share the same flat address space. There- Unlike the manual implementation of homogeneous paral-
fore, homogeneous parallelism can be implemented similar lelism, OpenMP does not require the data to be in a global
to multicore partitions. RTEMS provides similar inter-task scope. The relevant data are passed using pointers (which
communication primitives which can be used if the data are would be discouraged in a manual implementation) or copied
chosen to be communicated between tasks, or the tasks can if there is a need to do so, since the runtime can take the
access directly the data if they defined as global data. optimal decision. OpenMP offers also options for specifying
In addition to this manual implementation using atomic accesses to certain variables or memory positions,
programmer-defined tasks, RTEMS-SMP offers seamless as well other information such as definition of reduction or
support for homogeneous parallelism using the OpenMP private variables per thread. In general, OpenMP supports all
programming model [32]. OpenMP is one of the oldest features required to parallelise any sequentially written code.
standardised parallel programming models, and the most In addition to homogeneous parallelism, OpenMP supports
widely used for shared memory parallel programming [33], a task parallelism, in which various parts of a sequential code
especially in high performance computing. OpenMP supports can be executed in parallel. However, since in the tasking
multiple types of parallelism. For homogeneous parallel subset the threads do not execute the same code, it does not
programming, which is the first one introduced in OpenMP offer the certification benefits we describe in this paper.
and most widely used, it supports loop parallelisation.
B. Vector/SIMD Processing
This is shown in the code listing 4, with the parallelisation of
a transposed matrix multiplication. Note that using OpenMP Vector instructions currently available in several processors
does not require any modification in the application source are another form of homogeneous parallelism. Examples of
code. The source code is simply annotated using #pragmas, this feature are the NEON instructions in ARM processors,
such as in line 1. If the code is compiled with a compiler which SSE and AVX instructions in Intel processors, the Altivec
does not support OpenMP or does not have the compilation instructions in PowerPC processors and the packed vector
option for OpenMP parallelisation enabled, this line is ignored, instructions and vector instructions of the RISC-V architecture.
and the code is compiled for sequential execution. Figure 6 shows the notion of SIMD performance improve-
This line instructs the compiler to execute the for loop ment. A computationally intensive part of the sequential code
in parallel. The number of threads by default match the can be accelerated using this type of instructions. These
number of the cores in the system, as it is required by operations can be manually accessed at C code level using
our homogeneous parallelism proposal, but it can also be compiler intrinsics, i.e. function-like constructs that translate to
configured. The threads can also be statically assigned to a single processor instruction and applies the same operation in
cores. When this line is encountered, OpenMP continues the multiple data elements. The code listings of Figure 5 show two
execution of that parallel region using all threads executed in examples of processor intrinsics implementing the dot product
parallel. Similar to Figure 3, the OpenMP runtime takes care operation in line 6 of the transposed matrix multiplication
of passing the relevant input to the forked threads, and then shown in Figure 4. In the case of ARM NEON, 16 multiply
collect their outputs when they are joined. For this reason, and accumulate operations of 8-bits are performed with a
OpenMP is known as fork-join programming model. single SIMD instruction. In the case of the SPARROW AI
In practice, the OpenMP threads are not dynamically forked Accelerator [36] which is used in the METASAT RISC-V
or destroyed, especially under RTEMS, since this would create based platform [37], the operation is applied over 4 elements.
a significant overhead for its real-time execution. Instead, the Since this is a language feature, it is also available both
threads are created at boot time, and then reused when a in sequential code and in RTOSes, such as ARINC-653 and
time SMP is executed natively. However, for the rest of the plat-
CPU τ1 forms we had difficulties to flash them and boot RTEMS-SMP.
Since the rest of the platforms are based on Linux, we have
done a series of modifications to allow near native, real-time
time performance such as using the real-time patches of the Linux
CPU τ1 Kernel and moving all interrupts and processes to the first
core. Moreover, we are using the KVM virtualisation solution
to execute RTEMS-SMP within a multicore virtual machine
spanning all cores of the system. Note that although there is
Fig. 6: Performance improvement using SIMD instructions to possibility for KVM to use cache partitioning [2] solutions, we
parallelise part of the sequential processing of a task. have not used this option, since as we argue in our proposal,
homogeneous parallelism does not require isolation.
RTEMS, provided that the latter provides support for it. A. Performance results
However, to our knowledge this feature is not used in avionics Figure 7 shows the performance of matrix multiplication for
or space systems yet, although it is a very attractive solution to single precision floating point and 8-bit integer data types, for
increase the guaranteed performance of critical systems [38]. the platforms under analysis in terms of MFLOPS and MOPS
SIMD instructions can be also combined with the homo- respectively (number of floating point or integer operations
geneous parallelism concept described in the previous sub- per second). For ZCU102, TX2 and GR740 there is no
sections to achieve higher performance. This can be done by performance difference between the two data types, but in the
manually written code using intrinsics. Code with intrinsics rest of the platforms int8 is faster: 2 times faster in the Xavier
can be annotated with pragmas, or code can be parallelised and 50% faster in the V1605B and 5 times in the METASAT
automatically by the compiler using the #pragma SIMD platform. The difference is similar also for other integer types
option of OpenMP. In the latter case, the compiler will try and depends on the number of integer and floating point units
to use SIMD instructions for that particular part of the code. and their latency. For example, METASAT’s NOEL-V core has
Regardless of the homogeneous parallelism scheme used 2 integer units but a single floating point unit (NanoFPUnv)
in combination with vector/SIMD instructions, the benefits which is not pipelined. Moreover, the performance seems to
remain the same. Since all threads execute exactly the same remain relatively constant regardless of the input size.
code, and their execution is synchronised and uses all cores The same trend is observed in multicore execution. All
in the system, the interference generated and received by each platforms achieve near ideal speed-up (3.8-3.97× and 1.9
thread is identical. Therefore, there is no need for mitigations. for Xavier’s 10W power mode which uses 2 CPUs), except
IV. E XPERIMENTAL E VALUATION the TX2 which has lower scaling (2.1-2.5× and 3.1-3.5×
In this section, we evaluate the performance benefits of ho- depending on the power mode). This is an indication that ho-
mogeneous parallelism in several multicore processors which mogeneous parallelism is effective and no cache interference
are considered for use in next generation flight computers. mitigation is required. In fact, in our optimised implementation
For the evaluation, we are using the open source GPU4S the second matrix is first transposed and then multiplied. This
Bench [7], known also as OBPMark Kernels [34], part of the makes our implementation cache-friendly. Moreover, since in
European Space Agency’s OBPMark benchmarking suite [39]. this parallelisation scheme each thread is computing the dot
The GPU4S Bench benchmarking suite contains parallelised product of a different row, but the different threads are reusing
implementations of several algorithmic building blocks which the elements of the second matrix, they benefit from finding
are commonly used in several applications domains in the them in the shared cache.
aerospace sector, according to a survey performed in all Figure 8 presents the results of 8-bit integer matrix multipli-
software divisions of Airbus Defence and Space in the GPU4S cation using SIMD instructions. In the case of ARM NEON,
ESA funded project [18]. Since we focus on multicore per- the speed-up of the Xavier over the sequential version is
formance, we are using the OpenMP version of the GPU4S relatively independent of the input size and ranges from 2.3×
Benchmarks, which we have ported to RTEMS-SMP. Our port in the 10W mode to 3.3-4× in the 15W mode. In the TX2
will be released in the official repository [34]. the speed-up starts from 9× for 1K×1K inputs, but gradually
We are using the latest development version of RTEMS- drops to 5× for 4K×4K. Despite that, NEON is faster than
SMP (RTEMS 6 at the time of writing) from the master branch multicore parallelisation in this platform, as well as in Xavier
of RTEMS, which includes support for ARM and RISC-V in the 10W mode. In the METASAT platform, SPARROW
architectures. However, for GR740 we are using the RTEMS instructions provide a 6× speed-up over the sequential version.
5 version from Frontgrade Gaisler. We are evaluating a subset When combined with OpenMP, SPARROW provides a 20×
of the multicore platforms presented in Section II-A which are speed-up over the sequential and 5× over the multicore
promising candidates for future flight computers. version. This is an 8 × improvement over the single core 8-
Table I shows the details and configurations of our evaluated bit integer performance of GR740 and 2× over its multicore
platforms. In the GR740 and METASAT platforms, RTEMS- performance. This is very promising for the performance of
TABLE I: Evaluated Multicore Platforms and configurations used.
Platform Node CPUs Frequency Shared Cache Memory Power
MHz Size
16nm 4 ARM A53 and
AMD ZCU102 1500 2MB 4GB 22.8W
FinFET 2 ARM R5 Disabled
16nm 4 ARM A57 (enabled)
NVIDIA TX2 1200 2MB 8 GB 7.5W
TSMC & 2 Denver
16nm 4 ARM A57 (enabled)
NVIDIA TX2 2000 2MB 8 GB 7.5W
TSMC & 2 Denver
12nm 4 ARM v8.2 Carmel L2 2MB (per pair)
NVIDIA Xavier 1200 32 GB 10W
TSMC (2 Enabled) L3: 4MB
12nm 4 ARM v8.2 Carmel L2 2MB (per pair)
NVIDIA Xavier 1200 32 GB 15W
TSMC (4 Enabled) L3: 4MB
4 x86 Ryzen
14nm
AMD V1605B (hyperthreading 1600 2MB 16 GB ∼15W
GF
disabled)
Frontgrade Gaisler 65nm
4 LEON4 250 2MB 256 MB 1.5W
GR740 ST
AMD VCU118 4 NOEL-V
METASAT 100 256KB 4GB 5.3W
FPGA + SPARROW

7000 10000
400
1024 2048 4096 1024 2048 4096 1024 2048
9000 350
6000
8000 300
5000
7000 250
M(FL)OPS

MOPS
4000 6000 200

MOPS
5000
3000 150
4000
100
2000 3000
175 50
1000 25 97 154 2000
65
25
97 44 13 17 64 0
39 3 3 13 16 1000

int8

int8

SPARROW
SPARROW omp
int8 omp

int8 omp
0
0
int8

int8

int8

int8
float

float

int8

float

float

int8

float

int8

int8
float
int8 omp

int8 omp
float

int8 omp

int8 omp
float
float omp

int8 omp

float omp

float omp

int8 omp

float omp

int8 omp

float omp

float omp

int8 omp
float omp

float omp

int8 int8 Neon int8 int8 Neon int8 int8 Neon int8 int8 Neon
omp Int8 omp Int8 omp Int8 omp Int8
7.5W 15W 10W 15W
23W 7.5W 15W 10W 15W 15W 1.5W 5.3W 1.5W 5.3W
TX2 Xavier
GR740 METASAT
ZCU102 TX2 Xavier V1605B GR740 METASAT

(a) NEON (b) SPARROW


Fig. 7: Matrix Multiplication performance results. Fig. 8: Matrix multiplication results with SIMD instructions.

the upcoming GR765 potentially coupled with SPARROW, (2× difference). The highest performance is achieved by the
considering that the METASAT platform is implemented in V1605B both in single core as well as in multicore setting. The
an FPGA, has a lower frequency, and uses lower performance achieved speed-ups are lower for this benchmark. Again the
components than the ones that GR765 will use, such as the Xavier and the V1605B have the smaller improvement over
non-pipelined floating point unit and the less performant L2C- the sequential version, since they have processors with higher
Lite shared cache from the GPL GRLIB IP library. single-threaded performance. The speed-up of the ZCU102
Sliding FFT: Figure 9 shows the results of a sliding FFT ranges from 2.8-3.6×, while for the GR740 and METASAT it
applied for a window of 128 samples, used for processing is steady over 3.5×.
ADS-B data. In this benchmark we notice that the performance 2D Correlation: In terms of image correlation, which is
increases with the input size increase for the most powerful shown in Figure 10, multicore performance provides small
platforms. The best performance is achieved by NVIDIA improvements, from 1.2-1.75×. The platform with the higher
Xavier for the largest input size in the multicore configu- performance is NVIDIA Xavier at the 15W power mode. In
ration. Interestingly, the performance achieved by GR740 is this case also the ZCU102 and the GR740 are quite close,
very close to the one of the ZCU102, although its power with a 2.5× difference.
consumption and its frequency are much lower. Moreover, as 2D Convolution: Figure 12 shows the results of image
in the case of matrix multiplication very good speed-ups are convolution with a 3×3 filter. In each of the platforms, the
observed. In the lower performance platforms, the speed-ups performance remains the same regardless of the increase in the
are above 3×, with the maximum scaling achieved by GR740 input size and the multicore speed-up is consistently close to
and METASAT, close to 3.9×. However, Xavier and V1605B linear. NVIDIA Xavier in 15W is faster, followed by V1605B.
have moderate multicore speed-ups between 1.6-2.6×. GR740’s performance is also similar to ZCU102, which is
FIR: Finite Impulse Response (FIR) filtering results are 50% faster.
presented in Figure 11, for a filter with 64 taps. Again CIFAR-10 Inference: Figure 13 presents the results of an
GR740’s performance is in the same range with ZCU102 inference pipeline for the CIFAR-10 dataset. The pipeline
1600 300
1400 1024 2048 4096 1024 2048 4096
250
1200

MFLOPS
MFLOPS
1000 200
800 150
600 100
400 216 215
200 5555 2727 50 88 55
7 7 11
0 0
float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float
omp omp omp omp omp omp omp omp omp omp omp omp omp omp omp omp
23W 7.5W 15W 10W 15W 15W 1.5W 5.3W 23W 7.5W 15W 10W 15W 15W 1.5W 5.3W
ZCU102 TX2 Xavier V1605B GR740 METASAT ZCU102 TX2 Xavier V1605B GR740 METASAT

Fig. 9: Sliding FFT with a 128-sample window. Fig. 11: Finite Impulse Response filtering with 64 taps.

2000 800
1024 2048 4096 700 1024 2048 4096
1500 600
MFLOPS

MFLOPS
500
1000 400
300
500 200
39 5659
39 4 4 55 100 1919 5 5 1617
0 0
float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float
omp omp omp omp omp omp omp omp omp omp omp omp omp omp omp omp
23W 7.5W 15W 10W 15W 15W 1.5W 5.3W 23W 7.5W 15W 10W 15W 15W 1.5W 5.3W
ZCU102 TX2 Xavier V1605B GR740 METASAT ZCU102 TX2 Xavier V1605B GR740 METASAT

Fig. 10: 2D Correlation. Fig. 12: 2D Convolution with a 3 × 3 kernel.

consists of several different types of neural network layers by Tyler Lovelly at SHREC [40]. Similar to the GPU4S
such as convolutions, ReLU (rectifier linear unit), max pool- Bench [7] / OBPMark Kernels [34], it was based on a survey
ing, local response normalisation, fully connected layers and performed for common operations used in space process-
softmax. The inference performance is measured in frames per ing [41]. Moreover, similar to GPU4S Bench it supports
second (FPS) and it remains steady regardless of the number of different data types (8, 16 and 32 bit integers and single
images. The AMD V1605B achieves the best performance in and double floating point) and several programming models
this task, followed by NVIDIA TX2 and Xavier, all at the 15W (sequential, OpenMP, CUDA). It consists of nine kernels:
power mode. The ZCU102 is 2.2× faster than the GR740. matrix multiplication, matrix addition, matrix convolution,
All platforms provide speed-ups higher than 3×, however the matrix transpose, a Sobel filter, matrix transpose as well as
NVIDIA Xavier shows a speed-up below 2×. Kepler’s equation and Clohessy–Wiltshire equations.
Moreover, the SHREC GKSuite [40] exists which contains
V. R ELATED W ORK space applications, similar to ESA’s OBPMark [39]. It con-
Our proposal for homogeneous parallelism was first im- tains 7 image processing applications: color search, histogram
plemented by Thales Alenia Space in [31], under the name equalizer, image difference calculator, Mandelbrot set fractal
Data Parallel decomposition. However, in their implementation generator, Sobel filter, bilinear thumbnailer and image tiler.
homogeneous parallelism did not use all available cores in the Lovelly et al. [41] benchmarked the GR740, RAD5545 and
system, but instead two of them. For this reason, it required Boeing’s HPSC as well as other space processing technologies
to take into account multicore contention with the rest of and space-grade FPGAs using SHREC SpaceBench, including
the partitions scheduled in the other cores. Moreover, their OpenMP parallelised versions. Later, Gretok et al. evaluated
implementation was under Part 1 Supplement 4, using identical the performance of RAD5545 and Boeing’s HPSC [42]. Since
single core partitions, which require higher implementation the actual space processors were under development at that
effort and have higher overhead as discussed in Section III-A1. time, their performance was estimated using COTS processors
Our proposal of homogeneous parallelism implemented on with similar characteristics.
an RTOS compliant with ARINC-653 Part 1 Supplement 5 has Cannizzaro and George [43] used SHREC SpaceBench for
some similarities with Melani et al. [30], which also allows the evaluation of a RISC-V processor under irradiation and
the execution of multiprocessor partitions using all the cores of compared the performance to the ARM cores of the Xilinx
the system. However, their proposal of synchronous partition Zynq 7020, which is the predecessor of Zynq Ultrascale+.
switches, focuses on the migration of legacy uniprocessor Cannizzaro [40] evaluated the same platforms also without
partitions to multicores. Our work differs on the fact that we irradiation using OpenMP and NEON versions. Belloch et
only allow homogeneous parallelism within the partitions, so al. [44] evaluated the performance of the Zynq Ultrascale+
that we do not have to mitigate multiprocessor interference. using matrix multiplication. Landauer and Lovelly evaluated
In terms of benchmarking the performance of aerospace the CPU and GPU performance of a radiation tolerant NVIDIA
processors, there are several similar works which have used TK1 [45] using SHREC Space Bench. Towfic et al. pre-
parallel OpenMP and NEON implementations. Several works sented results of benchmarking at the ISS Qualcomm’s Snap-
use the SHREC Space Bench benchmarking suite, developed dragon [46] used in Mars Ingenuity Helicopter. A set of full
8000
7000 1024 2048 4096 almost linear scaling. This is not only because contention is
6000
5000 restricted to tasks executing the same code, but also due to the
FPS

4000
3000
2000
fact that these tasks share the last level cache and reuse the
215 32 118 121
1000
0
215 32 data fetched by other tasks. Regarding the performance details,
float float float float float float float float float float float float float float float float
omp omp omp omp omp omp omp omp
the boards with 15W power mode, NVIDIA Xavier, NVIDIA
23W 7.5W 15W 10W 15W 15W 1.5W 5.3W TX2 and AMD V1605B provided the highest multicore perfor-
ZCU102 TX2 Xavier V1605B GR740 METASAT mance. On the lower end of the spectrum, ZCU102 and GR740
provide similar performance, with ZCU102 being 2-3× faster,
Fig. 13: CIFAR-10 Inference but using much higher frequency, higher power consumption
and a smaller lithography node, not purposely designed for
space applications. However, GR740 is the only multicore
applications developed at JPL as well as application kernels from the benchmarked ones that is qualified for space use and
were used, implementing NEON and OpenCL parallelisation. its harsh environment, even in institutional missions, while the
Jover et al. evaluated the multicore and GPU performance rest can be used only in the avionics sector and Low Earth
of NVIDIA Xavier and AMD V1605B for CCSDS 121 and Orbit (LEO) for New Space Missions.
122 compression [47], using a preliminary version of the Finally, a special mention is worth on the early prototype of
benchmarks which are now part of ESA’s OBPMark open the METASAT platform which is currently under development
source benchmarking suite [39]. Rodriguez evaluated NVIDIA on an FPGA. Based on Gaisler’s NOEL-V, it provides an
TX2, Xavier, HiKey and AMD V1605 [48] using a port of early preview of what can be expected by GR740’s successor.
ESA’s Euclid NIR software to these GPU platforms. Rodriguez Enhanced with SPARROW, a SIMD accelerator unit for AI,
et al. [12] evaluated the multicore OpenMP and GPU per- it provides significant speed-ups, similar to ARM’s NEON.
formance of a superset of the platforms we cover in this Considering also its lower frequency and the use of the lower
work such as NVIDIA TX2, Xavier and Orin, AMD V1605B performance GPL components of the GRLIB IP library com-
and Zynq Ultrascale+ ZCU102 and NXP’s LS1046A and pared to the high performance ones used in the commercial
LX2160A, using OBPMark’s image processing benchmark. offerings of GR740 and its successor, it is expected to provide
A major difference between these previous benchmarking significant performance in an ASIC implementation.
studies and ours is that we are executing parallel OpenMP
benchmarks under RTEMS-SMP instead of Linux. This makes ACKNOWLEDGMENTS
our evaluation more relevant to the aerospace domain. More- The authors thank Frontgrade Gaisler for providing access
over, our results experience less noise from the Linux op- to a GR740 model, upon which the relevant results were
erating system, and ensure homogeneous parallel executions collected. This work was supported by the European Union’s
without pre-emptions, as the one we describe in our proposal. Horizon Europe programme under the METASAT project
Another difference between the works which use the SHREC (grant agreement 101082622). In addition, it was supported
Space Bench, SHREC GKSuite and JPL software and ours by ESA through the 4000136514/21/NL/GLC/my co-funded
is that our work relies on the equivalent GPU4S Bench and PhD activity ”Mixed Software/Hardware-based Fault-tolerance
OBPMark benchmarking suites, which are open source, and Techniques for Complex COTS System-on-Chip in Radia-
allow reproducibility of the results, as well as fair comparisons tion Environments” and the GPU4S (GPU for Space) ESA-
to be performed among different aerospace platforms. funded project. It was also partially supported by the Span-
ish Ministry of Economy and Competitiveness under grants
VI. C ONCLUSIONS PID2019-107255GB-C21 and IJC-2020-045931-I ( Spanish
In this paper we discussed the concept of the homoge- State Research Agency / Agencia Española de Investigación
neous parallelism and its benefits in the aerospace domain, (AEI) / http://dx.doi.org/10.13039/501100011033 ) and by the
both for performance and certification. We argue that this Department of Research and Universities of the Government
paradigm, in which all the available cores in a multicore of Catalonia with a grant to the CAOS Research Group (Code:
execute the same code in a synchronous manner, simplifies 2021 SGR 00637).
multicore analysis, since the effects of contention are the
R EFERENCES
same for all. We discussed implementation aspects of this
computation model both for the ARINC-653 RTOS as well [1] M. Fernández, R. Gioiosa, E. Quiñones, L. Fossati, M. Zulianello,
and F. J. Cazorla, “Assessing the suitability of the NGMP multi-
as for the space-qualified RTEMS RTOS, including a special core processor in the space domain,” in International Conference on
case of homogeneous parallelism, vector instructions. Finally, Embedded Software (EMSOFT), 2012.
we evaluated several multicore platforms which are considered [2] H. Kim and R. R. Rajkumar, “Predictable shared cache management for
multi-core real-time virtualization,” ACM Trans. Embed. Comput. Syst.,
candidates for next generation flight computers, including the vol. 17, no. 1, Dec 2017.
space-qualified GR740, using the GPU4S Benchmarking suite [3] EASA, AMC 20-193 Use of multi-core processors (MCPs), 2022.
parallelised with OpenMP, under RTEMS. [4] Certification Authorities Software Team (CAST), “Multi-core Proces-
sors,” November 2016.
Our results indicate that the homogeneous parallelism can [5] L. Mutuel et al., “Limitations of interference analyses on multicore
provide significant speed-ups over single core execution, with processors,” in Digital Avionics Systems Conference (DASC), 2017.
[6] X. Jean et al, “Assurance of Multicore Processors: Limits on Interference [26] ARINC, ARINC Specification 653: Avionics Application Software Stan-
Analysis,” FAA Final report, Tech. Rep. DOT/FAA/TC-19/24, 2020. dard Interface, Aeronautical Radio, Inc, 2013.
[7] I. Rodriguez, L. Kosmidis, J. Lachaize, O. Notebaert, and [27] P. Gómez Molinero et al., “De-RISC: Dependable Real-time RISC-
D. Steenari, “GPU4S Bench: Design and Implementation V Infrastructure for Safety-critical Space and Avionics Computer
of an Open GPU Benchmarking Suite for Space On-board Systems,” in Data Systems In Aerospace (DASIA), 2021. [Online].
Processing,” Universitat Politècnica de Catalunya, Tech. Rep. Available: https://doi.org/10.5281/zenodo.5707537
UPC-DAC-RR-CAP-2019-1, https://www.ac.upc.edu/app/research- [28] RTEMS SMP QDP. ESA, 2022. [Online]. Available: https://rtems-
reports/public/html/research center index-CAP-2019,en.html. qual.io.esa.int
[8] J. Pérez-Cerrolaza, R. Obermaisser, J. Abella, F. J. Cazorla, K. Grüttner, [29] K. Hoyme and K. Driscoll, “SAFEbus,” in 11th Digital Avionics Systems
I. Agirre, H. Ahmadian, and I. Allende, “Multi-core devices for safety- Conference (DASC), 1992.
critical systems: A survey,” ACM Comput. Surv., vol. 53, no. 4, 2021. [30] A. Melani et al., “A scheduling framework for handling integrated modu-
[9] CAES, “Final Report - GR740 Next Generation Microprocessor lar avionic systems on multicore platforms,” in International Conference
Flight Models (NGMP Phase 3),” ESA, Tech. Rep. NGMPFM-FR-1, on Embedded and Real-Time Computing Systems and Applications
2021, http://microelectronics.esa.int/finalreport/D3.18-NGMPFM-FR-1- (RTCSA), 2017.
1-0 NGMP Final Report.pdf. [31] P. Bretault et al., “Approaching Parallelization of Payload Software Ap-
[10] J.-L. Poupat, “DAHLIA: Very High Performance Microprocessor for plication on ARM Multicore Platforms,” in Data Systems in Aerospace
Space Applications,” in ESA Workshop on Avionics, Data, Control and (DASIA), 2015.
Software Systems (ADCSS), 2017. [32] RTEMS, OpenMP, RTEMS, 2015.
[11] O. Notebaert et al, “NG-Ultra validation and on-board processing board [33] L. Dagum and R. Menon, “OpenMP: an industry standard API for
development,” in Data Systems In Aerospace (DASIA), 2021. shared-memory programming,” IEEE Computational Science and En-
[12] I. Rodrı́guez-Ferrández et al, “Evaluating the Computational Capabilities gineering, vol. 5, no. 1, pp. 46–55, 1998.
of Embedded Multicore and GPU Platforms for On-Board Image Pro- [34] D. Steenari et al., “On-Board Processing Benchmarks,” 2021,
cessing,” in European Data Handling and Data Processing Conference http://obpmark.github.io/.
for Space (EDHPC), 2023. [35] M. Solé and L. Kosmidis, “Compiler Support for an AI-Oriented SIMD
[13] L. Kosmidis, A. J. Calderón, A. Álvarez Suárez, S. Sinisi, E. Göhler, Extension of a Space Processor,” Ada Lett., vol. 42, no. 1, 2022.
P. Gómez Molinero, A. Hönle, A. Jover Álvarez, L. Lazzara, [36] M. S. Bonet and L. Kosmidis, “SPARROW: A Low-Cost Hardware/Soft-
M. Masmano Tello, P. Onaindia, T. Poggi, I. Rodrı́guez Ferrández, ware Co-designed SIMD Microarchitecture for AI Operations in Space
M. Solé Bonet, G. Stazi, M. M. Trompouki, A. Ulisse, V. Di Valerio, Processors,” in Design, Automation & Test in Europe Conference &
J. Wolf, and I. Yarza, “METASAT: Modular Model-Based Design Exhibition (DATE), 2022.
and Testing for Applications in Satellites,” in Embedded Computer [37] L. Kosmidis, M. Solé Bonet, I. Rodriguez-Ferrández, J. Wolf, and
Systems: Architectures, Modeling, and Simulation - 22nd International M. M. Trompouki, “The METASAT Hardware Platform: A High-
Conference (SAMOS), ser. Lecture Notes in Computer Science, 2023. Performance Multicore, AI SIMD and GPU RISC-V Platform for On-
[14] R. Berger et al., “Quad-core radiation-hardened system-on-chip power board Processing,” in European Data Handling and Data Processing
architecture processor,” in 2015 IEEE Aerospace Conference, 2015. Conference for Space (EDHPC), 2023.
[15] D. Saridakis et al., “RAD55xx Platform SoC,” in Flight Software [38] R. Roger et al., “Vector extensions in COTS processors to increase
Workshop (FSW), 2016, https://flightsoftware.jhuapl.edu/files/2016/Day- guaranteed performance in real-time systems,” ACM Trans. Embed.
2/Day-2-13-Saridakis.pdf. Comput. Syst., vol. 22, no. 2, 2023.
[16] W. Powell, “High-Performance Spaceflight Computing (HPSC) Project [39] D. Steenari, L. Kosmidis, I. Rodrı́guez-Ferrández, A. Jover-
Overview,” in Radiation Hardened Electronics Technology (RHET) Álvarez, and K. Förster, “OBPMark (On-Board Processing
Conference, 2018. Benchmarks) - Open Source Computational Performance Benchmarks
[17] ——, “NASA’s Vision for Spaceflight Computing,” in ESA Workshop for Space Applications,” in 2nd European Workshop on On-
on Avionics, Data, Control and Software Systems (ADCSS), 2022. Board Data Processing (OBDP), 2021. [Online]. Available:
[18] L. Kosmidis, I. Rodriguez, A. Jover-Alvarez, S. Alcaide, J. Lachaize, https://doi.org/10.5281/zenodo.5638577
O. Notebaert, A. Certain, and D. Steenari, “GPU4S: Major Project [40] M. J. Cannizzaro, “RISC-V Benchmarking for Onboard Sensor Pro-
Outcomes, Lessons Learnt and Way Forward,” in Design, Automation cessing,” Master’s thesis, University of Pittsburgh, 2019, https://d-
and Test in Europe Conference and Exhibition (DATE), 2021. scholarship.pitt.edu/40400/1/cannizzaromichaeljames etdPitt2021.pdf.
[19] A. Pawlitzki and F. Steinmetz, “multiMIND - high performance [41] T. M. Lovelly et al., “Benchmarking Analysis of Space-Grade Central
processing system for robust NewSpace payloads,” in 2nd Euro- Processing Units and Field-Programmable Gate Arrays,” Journal of
pean Workshop on On-Board Data Processing (OBDP2021), 2021, Aerospace Information Systems, vol. 15, no. 8, 2018.
https://doi.org/10.5281/zenodo.5521502. [42] E. W. Gretok et al., “Comparative Benchmarking Analysis of Next-
[20] L. Kosmidis, C. Maxim, V. Jégu, F. Vatrinet, and F. J. Cazorla, Generation Space Processors,” in IEEE Aerospace Conference, 2019.
“Industrial Experiences with Resource Management Under Software [43] M. J. Cannizzaro and A. D. George, “Evaluation of RISC-V Silicon
Randomization in ARINC653 Avionics Environments,” in International Under Neutron Radiation,” in 2023 IEEE Aerospace Conference, 2023.
Conference on Computer-Aided Design (ICCAD), 2018. [44] J. A. Belloch et al., “Evaluating the computational performance of
[21] A. Agrawal et al., “Contention-Aware Dynamic Memory Bandwidth the Xilinx Ultrascale+ EG Heterogeneous MPSoC,” J. Supercomput.,
Isolation with Predictability in COTS Multicores: An Avionics Case vol. 77, no. 2, 2021.
Study,” in Euromicro Conference on Real-Time Systems, (ECRTS), 2017. [45] D. C. Landauer and T. M. Lovelly, “Performance Evaluation of the
[22] R. Pujol et al., “Empirical Evidence for MPSoCs in Critical Systems: Radiation-Tolerant NVIDIA Tegra K1 System-on-Chip,” in 2023 IEEE
The Case of NXP’s T2080 Cache Coherence,” in Design, Automation Space Computing Conference (SCC), 2023.
& Test in Europe Conference & Exhibition (DATE), 2021. [46] Z. Towfic et al., “Benchmarking and Testing of Qualcomm Snapdragon
[23] N. Sensfelder et al., “On How to Identify Cache Coherence: Case of the System-on-Chip for JPL Space Applications and Missions,” in IEEE
NXP QorIQ T4240,” in Euromicro Conference on Real-Time Systems Aerospace Conference (AeroConf), 2022.
(ECRTS), 2020. [47] A. Jover-Alvarez, I. Rodrı́guez, L. Kosmidis, and D. Steenari, “Space
[24] D. Radack et al., “Civil certification of multi-core processing systems Compression Algorithms Acceleration on Embedded Multi-Core and
in commercial avionics,” in Safety-critical Systems Symposium, 2018. GPU Platforms,” Ada Lett., vol. 42, no. 1, dec 2022.
[25] S. H. VanderLeest and D. C. Matthews, “Incremental Assurance of [48] I. Rodrı́guez Ferrández, “An On-board Algorithm Imple-
Multicore Integrated Modular Avionics (IMA),” in Digital Avionics mentation on an Embedded GPU: A Space Case Study,”
Systems Conference (DASC), 2021. Master’s thesis, Universitat Politècnica de Catalunya, 2021,
https://upcommons.upc.edu/handle/2117/344892.

You might also like