A Framework For Fault Tolerance in RISC-V
Abstract—Microcontrollers require protection against transient and permanent faults when being utilized for safety-critical and highly reliable applications. Fail-safe Dual Core Lockstep architectures are widely used in the automotive domain; the aerospace domain utilizes fail-functional TMR or higher redundancy. This work incorporates fault tolerance techniques of those domains into a framework for RISC-V processors. The implemented fault tolerance components are highly configurable to satisfy various dependability requirements. The cost of the applied fault tolerance mechanisms is evaluated for both an FPGA and an ASIC implementation. Fault injection tests prove the effectiveness of error detection and cover both transient and permanent faults in logic and memories. New methods are introduced to minimize the error detection latency and achieve a reduction of up to 79%.

Index Terms—Error Correction Codes, Fault Injection, Redundancy, Permanent Errors, Transient Errors.

I. INTRODUCTION

The RISC-V Instruction Set Architecture (ISA) has gained considerable attention in both academic research and commercial use. It is already commonly found in low-performance systems such as IoT devices. The performance and efficiency of openly available implementations is constantly improving, which enables the adoption of RISC-V in further application domains. Strong growth in the computer, consumer, and industrial markets is therefore expected for RISC-V [1]. RISC-V also brings several advantages for embedded systems: no license fees are charged for the ISA, and it provides freedom for custom extensions. Those extensions can be tailored to the needs of an embedded system and thereby achieve an energy-efficient speedup. However, RISC-V is so far rarely found in the automotive and aerospace domains with their high dependability requirements. Adopting RISC-V in those domains requires compliance with the corresponding safety standards, which has not yet been addressed sufficiently by existing implementations. This work develops and evaluates fault tolerance concepts for RISC-V designs, mitigating both transient and permanent faults. The applied error detection and correction mechanisms are the foundation for compliance with safety standards such as ISO 26262, IEC 61508, and DO-178B. Moreover, the redundancy schemes applied in this work mitigate radiation effects and enable RISC-V to be used in space.

Contribution: This work fully protects both RV32I and RV64I RISC-V designs against transient and permanent errors. Error Correction Codes (ECCs) are applied to caches and other memories. Logic and registers are protected through n-modular redundancy (NMR) and deep level error detectors. The framework created for this purpose makes the named techniques available for any RISC-V processor being integrated into Chipyard¹ and has been tested for both in-order (Rocket) and out-of-order (BOOM) implementations. It offers multiple configuration options to meet different dependability requirements. Furthermore, methods for timely error detection are developed and evaluated through extensive fault injection tests.

The remainder of this paper is organized as follows. Sect. II introduces existing approaches for fault tolerant RISC-V processors. Sect. III presents the proposed fully protected RISC-V architecture and discusses the utilized methods; some of the implemented mechanisms minimize error detection latency or optimize resource utilization. Sect. IV breaks down the costs caused by fault tolerance. The effectiveness of error detection is verified by fault injection tests (Sect. V).

¹https://chipyard.readthedocs.io/en/latest/

II. RELATED WORK

The free and open character of RISC-V lowers the barrier for custom modifications such as the addition of fault tolerance features. Several works already enhance existing RISC-V implementations or create new designs for reliable computing.

The Thales Group and Antmicro AB developed a TMR RISC-V demonstrator with three redundant RV32I Rocket cores, a voter, a fault injector, and a system monitor [2]. The feasibility of Single Event Effect (SEE) mitigation for space applications is shown by means of a minimal design without caches. TMR is the only applied fault tolerance technique; ECCs are not implemented. Several other works also utilize the Rocket implementation for fault tolerance in software [3] or hardware. This includes classic TMR [4], [5] and more unconventional approaches, e.g., a heterogeneous lockstep system with a Rocket and an Arm Cortex-A9 core [6]. Another RISC-V lockstep processor is presented in [7]. It builds upon an interleaved pipeline and aims for fast error recovery and reduced die size.

Klessydra [8] is based on the PULPino microcontroller system with its Zero-riscy and RI5CY cores. It explores several fault tolerance techniques such as spatial and temporal redundancy, ECCs, and watchdog monitoring. A detailed fault resilience analysis for a TMR variant of Klessydra is performed in [9].
SHAKTI-F [10] achieves SEE tolerance for a 5-stage in-order microprocessor while keeping area and performance penalties low. It protects pipeline registers with SEC-DED and proposes a recomputation with complemented operands for functional units. The protection of memories such as caches is not addressed.

Technolution B.V. elaborates fault tolerance techniques for a RISC-V softcore [11] and their implications on the memory architecture [12]. The works differentiate an NMR domain for pipeline resources and an ECC domain for memories and the register file. Similarly, [13] and [14] enhance an area-optimized RISC-V implementation with Hamming codes on the register file and replicate the ALU as TMR.

The first fault tolerant RISC-V cores are already marketed commercially. Cobham Gaisler AB offers fault tolerance features for its NOEL-V implementation, which is also used within the H2020 SELENE [15] and De-RISC [16]–[18] projects. Microsemi provides a commercial RISC-V system for reliable computing. The Mi-V RISC-V ecosystem [19] contains several RV32 soft cores that can be deployed on their reliable PolarFire, RTG4, and IGLOO2 devices.

The Fraunhofer IPMS developed an ASIL-D ready certified RISC-V IP with respective component FMEA and safety manual documentation [20]. This EMSA5-FS processor core implements the RV32E base in a 5-stage pipeline. It features a light-weight implementation without any cache hierarchy and can be configured in double or triple modular redundancy. ECC is applied for communication buses, and a safety manager provides supervision capabilities.

The existing works are limited to rather light-weight RV32I cores and address only one specific RISC-V implementation each. This work aims for a more generic approach. It implements fault tolerance for several RISC-V implementations, including both RV32I and RV64I variants that are capable of booting UNIX-based operating systems. Furthermore, the developed framework provides many configuration options and allows a RISC-V core to be customized for given dependability requirements.

III. FAULT TOLERANT RISC-V ARCHITECTURE

The overall fault tolerant RISC-V architecture is depicted in Fig. 1. The following subsections describe the individual fault tolerance features and components (A–E) in detail. Most features are configurable to satisfy different dependability requirements. This allows a dependable RISC-V system to be tailored by adding as much redundancy as necessary while keeping area overhead, increased power consumption, and performance losses low.

Some details may vary from this general illustration for specific RISC-V implementations. E.g., the BOOM processor holds additional ECC protected memory arrays for the Branch Target Buffer (BTB) and GShare predictor, which are not shown here. The CVA6 core has a native AXI interface, which needs to be bridged to the TileBus for compatibility with the Chipyard framework. Again, this is not shown here.

[Fig. 1. Fault tolerant RISC-V architecture. Diagram labels: redundant RISC-V CPU with NMRbars (A, B1, B2), DLD (C1, C2), ECC protected L2 TLB, L1I, and L1D (D), internal and external FTM (E), connected via TileBus, Interrupts, and SystemBus.]

A. Redundant CPU

Voters and comparators evaluate the outputs of replicated components for equality. This requires redundant modules to run in lockstep synchronously, for which some adaptations are necessary. In particular, Rocket and BOOM avoid initializing sequential elements wherever possible to reduce the routing complexity of resets, resulting in smaller resource utilization and simpler timing closure. However, some uninitialized memories and registers cause non-deterministic behavior. This has been observed for the L1 caches and the branch prediction unit in particular. While this does not affect the order of executed instructions and correct program execution, it results in non-deterministic Last Level Cache (LLC) transactions. The order of fetched instructions and data loads/stores varies with initialization values and defeats a comparison and voting on system bus level.

This work initializes additional registers only where necessary for synchronous behavior. In many cases, it is sufficient to enforce a correct generation of the valid signal of buses and bundles. Compared to a globally applied reset (e.g., as utilized for Arm Dual Core Lockstep processors [21]), this selective reset approach has superior silicon area utilization and power efficiency.

B. N-Modular Redundancy Crossbar (NMRbar)

Inputs and outputs of redundant components are managed by n-modular redundancy crossbars (NMRbars). As shown in Fig. 1, NMRbars are inserted for the TileBus interfaces and in-/outgoing interrupts (B1). Furthermore, NMRbars are instantiated optionally at memory interfaces (B2). Each NMRbar compares, arbitrates, and votes on the outputs of redundant components and replicates inputs.

It supports static, dynamic, and hybrid redundancy concepts. Static (or passive) redundancy, such as TMR, replicates a module and votes on the outputs. It masks errors and achieves forward error recovery. Dynamic (or active) redundancy first detects errors and then performs actions to remove the fault. Detection is performed through comparison of duplicated modules.
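To make the voting and comparison concrete, the following minimal Chisel sketch shows a word-level majority voter with mismatch detection of the kind an NMRbar applies to replicated outputs. It is an illustration only, not the framework's NMRbar: the flat UInt interface and the module name are assumptions, and arbitration, input replication, and the bus-level signalling are omitted.

    import chisel3._

    // Illustrative sketch (not the framework's NMRbar): bitwise 2-out-of-3 majority
    // voting with mismatch detection on three replicated word-level outputs.
    class MajorityVoter3(width: Int) extends Module {
      val io = IO(new Bundle {
        val in    = Input(Vec(3, UInt(width.W)))  // outputs of the three redundant modules
        val out   = Output(UInt(width.W))         // voted (error-masked) output
        val error = Output(Bool())                // mismatch indication, e.g., for the FTM
      })
      // Bitwise majority masks a fault in any single module (forward error recovery)
      io.out := (io.in(0) & io.in(1)) | (io.in(1) & io.in(2)) | (io.in(0) & io.in(2))
      // Two pairwise comparisons suffice to flag any single-module mismatch
      io.error := (io.in(0) =/= io.in(1)) || (io.in(1) =/= io.in(2))
    }

For a DMR (fail-safe) configuration the same structure reduces to a pure comparator: the voter is dropped and only the error signal is used.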
[Figure: NMRbar structure with master and checker RISC-V CPUs connected between an input NMRbar and an output NMRbar via comparator, voter, and error monitor; caption not recovered.]

    class NMRbarParams(
      n_modules: Int = 3,
      n_spares: Int = 0,
      has_detector: Vector[Boolean] = Vector(false, false, false),
      is_checker: Vector[Boolean] = Vector(false, false, false)
    )
Listing 1. NMRbar parameterization. Given default: TMR configuration.

C. Deep Level Detector (DLD)

The advantage of Dual Core Lockstep (DCLS) architectures with comparison on output ports is its simplicity. The same applies for TMR on component level. However, some errors are only detected after several tens of millions of clock cycles [22]. This jeopardizes timing guarantees and complicates rollback procedures, because the error might have manifested already prior to the last software checkpoint.

[Fig. 3. Deep Level Detector generating a fingerprint of internal registers. Diagram: 4,976 monitored RISC-V core registers are reduced by the XOR / H3 fingerprint generator to 166 bits.]
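As an illustration of the XOR compaction indicated in Fig. 3, the following Chisel sketch folds a wide vector of monitored register bits into a narrow fingerprint that redundant cores can exchange and compare. It is a sketch under assumptions: the monitored registers are taken as one pre-concatenated UInt, and the module and port names are illustrative rather than those of the implemented DLD.

    import chisel3._

    // Illustrative XOR fingerprint generator (cf. Fig. 3): fold the monitored register
    // bits into fpWidth bits by XOR-ing fpWidth-sized slices. The register selection
    // and concatenation are assumed to happen outside this module.
    class XorFingerprint(monitoredBits: Int, fpWidth: Int) extends Module {
      val io = IO(new Bundle {
        val regs        = Input(UInt(monitoredBits.W))   // concatenated monitored registers
        val fingerprint = Output(UInt(fpWidth.W))        // compacted value compared across cores
      })
      // Slice the monitored bits into fpWidth-wide chunks (the last one may be narrower)
      val slices = (0 until monitoredBits by fpWidth).map { lo =>
        io.regs(math.min(lo + fpWidth, monitoredBits) - 1, lo)
      }
      // XOR-reduce all slices into the fingerprint
      io.fingerprint := slices.reduce(_ ^ _)
    }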
    class DLDParams(
      xor: Boolean = true,
      h3: Boolean = false,
      h3_j: Int = 3,
      reg2fp_ratio: Int = 30
    )
Listing 2. DLD parameterization. Given default: XOR selected.

Listing 2 summarizes the DLD configuration options. xor and h3 fingerprints can be activated separately. The reg2fp_ratio defines the reduction of registers to fingerprint width. The given example reduces a total of 4,976 monitored CPU registers by a factor of 30 to a fingerprint width of 166 bits. h3_j defines the number of H3 matrix columns and allows balancing the number of expected fingerprint collisions against hardware complexity.

The selection (and number) of CPU registers being processed into a fingerprint affects the efficiency of the DLD as well as the performance and area overhead. This work keeps intrusion low and excludes high-fanout registers and timing-critical resources (e.g., the register file) from the selection. An optimized selection is future work and could minimize the number of monitored registers and upper-bound the detection latency.

In contrast to the other fault tolerance techniques, the DLD is currently implemented for the Rocket core only.
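For the h3 option, each fingerprint bit can be formed as the parity of a pseudo-randomly selected subset of monitored bits, in the style of H3 universal hashing [24]; this lowers the collision probability compared to the plain XOR chain. The sketch below is illustrative only: the selection matrix is drawn from a fixed seed at elaboration time, and the framework's actual matrix construction and its h3_j column parameter are not modeled.

    import chisel3._
    import scala.util.Random

    // Illustrative H3-style fingerprint generator: fingerprint bit i is the parity of
    // the monitored bits selected by a pseudo-random row of a binary matrix.
    class H3Fingerprint(monitoredBits: Int, fpWidth: Int, seed: Int = 42) extends Module {
      val io = IO(new Bundle {
        val regs        = Input(UInt(monitoredBits.W))
        val fingerprint = Output(UInt(fpWidth.W))
      })
      private val rnd = new Random(seed)
      // One random bit-selection mask per fingerprint bit, fixed at hardware elaboration
      private val rows: Seq[BigInt] = Seq.fill(fpWidth)(BigInt(monitoredBits, rnd))
      private val bits: Seq[Bool] = rows.map { mask =>
        (io.regs & mask.U(monitoredBits.W)).xorR   // parity over the selected subset
      }
      io.fingerprint := VecInit(bits).asUInt
    }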
D. Error Correction Codes (ECC)

The Rocket processor instantiates 5 separate memory arrays (L1I tag/data, L1D tag/data, and L2 TLB). The BOOM processor requires 4 additional arrays (BTB tag/data, Bi-Modal Table, and TAGE entries), resulting in a total of 9 arrays. All arrays are protected against bit flips by an ECC enhanced memory, which has been introduced in [25] and provides several configuration options (Listing 3). Parity (without correction capability, but with only low resource overhead), SEC, and SEC-DED are selectable as ecc_code. The block_size is customizable. An optional hardware scrubbing with configurable scrubbing_interval prevents error accumulation. Multi-bit errors can be mitigated through an interleaving option, which shuffles the bits of different blocks.

    class ECCParams(
      ecc_code: Code = SECDEDCode,
      block_size: Int = 8,
      scrubbing: Boolean = false,
      scrubbing_interval: UInt = 4,
      interleaving: Boolean = false,
      replication: Boolean = false
    )
Listing 3. ECC memory parameterization.

Protecting memory arrays with ECC allows them to be excluded from redundancy replication. This design decision has its pros and cons, which is why the proposed framework offers it as a configuration option (replication). Replicated arrays simplify the redundancy checker (here the NMRbar) because the number of supervised signals is reduced. Most commercial DCLS processors, such as the NXP S32 platform [26], follow this design principle. On the other hand, non-replicated arrays reduce the ASIC area or FPGA block RAM resource utilization. This opens up the possibility for great resource savings, particularly for ASIC designs (see Sect. IV). Furthermore, the redundancy check on the array interfaces acts as an additional error propagation boundary. It detects and/or masks faults early before they manifest in the system bus interface.

E. Fault Tolerance Monitor (FTM)

After the occurrence of an error, further system operation and tolerance of additional faults is limited. Therefore, an erroneous RISC-V CPU needs to resynchronize as soon as possible, for which the FTM provides the necessary features. An interrupt indicates the occurrence of errors to each core. Status registers provide additional information about the error type and the affected components and are embedded into the Control and Status Register (CSR) address space. An external FTM variant provides the information via SPI and can be accessed by an external system supervisor.

IV. COST OF FAULT TOLERANCE

The Rocket processor [27] is one of the most prominent RISC-V implementations [28]. The "big" variant resembles an RV64I implementation and is utilized for the following detailed evaluation of fault tolerance techniques.

Fig. 4 depicts the resource overhead for both an FPGA and an ASIC implementation of four different configurations. FPGA resources are generated for the Xilinx UltraScale+ architecture²; ASIC area is generated for the GlobalFoundries 22FDX technology³.

²Computed with the Vivado Design Suite 2019.2 and default synthesis and implementation settings.
³Synthesis results computed with Genus Synthesis Solution version 19.14-s108_1 for the INVECAS 12T BASE standard cell library with a nominal voltage of 0.8 V. Memories created with the INVECAS memory compiler. Clock gating and medium power optimization enabled.

• DMR1 with duplicated memories.
• DMR2 with non-duplicated but parity protected memories.
• TMR1 with triplicated memories.
• TMR2 with non-triplicated but SEC-DED protected memories.

The DMR configurations provide examples for fail safe architectures. A parity code is selected to protect memories, because error detection is sufficient for fail safe behavior. The TMR configurations provide examples for fail operational architectures. SEC-DED is selected to protect memories, because fail operational behavior requires error correction capabilities.

The DMR2 and TMR2 configurations present the possible area savings for non-replicated but ECC protected memories. The values apply for a Rocket configuration with 16 KB L1I and L1D caches, in which memory macros account for 49% of the total ASIC area.
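The four evaluated configurations can be captured compactly as data. The following self-contained Scala sketch uses simplified stand-ins for the parameter classes of Listings 1 and 3; the type and field names here are illustrative, not the framework's.

    object EvaluatedConfigs {
      sealed trait EccCode
      case object Parity extends EccCode   // detection only, lowest overhead (fail safe)
      case object SecDed extends EccCode   // correction capability (fail operational)

      // ecc = None stands for protection by plain replication of the memory arrays
      case class MemProtection(replicate: Boolean, ecc: Option[EccCode])
      case class FtConfig(nModules: Int, mem: MemProtection)

      val dmr1 = FtConfig(nModules = 2, mem = MemProtection(replicate = true,  ecc = None))
      val dmr2 = FtConfig(nModules = 2, mem = MemProtection(replicate = false, ecc = Some(Parity)))
      val tmr1 = FtConfig(nModules = 3, mem = MemProtection(replicate = true,  ecc = None))
      val tmr2 = FtConfig(nModules = 3, mem = MemProtection(replicate = false, ecc = Some(SecDed)))
    }

In the actual framework these choices correspond to NMRbarParams.n_modules (Listing 1) and to ECCParams.ecc_code and ECCParams.replication (Listing 3).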
[Fig. 4: Resource / area overhead [%] of the DMR1, DMR2, TMR1, and TMR2 configurations for the FPGA and ASIC implementations.]

[Figure: fault injection test setup on a Zynq UltraScale+ XCZU7EV. Processing System (PS): Arm Cortex-A53 quad-core APU with DDR4. Programmable Logic (PL): pblock containing the fault tolerant RISC-V system, SEM controller, and configuration memory (CMEM); connections include Reset, Interrupt, FTM, UART, SPI, JTAG, and SD.]
C. Workload Characterization

The EEMBC AutoBench™ 2.0 Performance Benchmark Suite is utilized as a representative workload. It characterizes common applications of the safety-critical automotive domain. Faults are injected during separate executions of each of the 12 incorporated kernels. While the resulting error detection coverage is similar for all kernels, the detection latency correlates with the kernel execution time. In particular, the maximum detection latency is only upper bounded by the execution time.

Table II gives an overview of the execution times of each kernel processing a dataset of 4 KB. Values are reported for a processor clock frequency of 100 MHz. Note that these numbers should not be taken as a benchmark score, because they include verification times. Verification is enabled for a detection of possible Silent Data Corruption (SDC). Fault injection times are evenly distributed within the respective kernel execution times.

TABLE II
EXECUTION TIMES FOR AUTOBENCH 2.0 KERNELS.

Kernel      Execution Time [us]   Max. Detection Latency [us]
a2time01    5,139                 5,118
aifirf01    9,794                 9,754
bitmnp01    5,560                 5,504
canrdr01    1,463                 1,418
idctrn01    7,328                 7,261
iirflt01    7,869                 7,847
matrix01    5,605                 5,557
pntrch01    97,379                95,367
puwmod01    5,518                 5,482
rspeed01    4,876                 4,842
tblook01    988                   970
ttsprk01    594                   572

D. Analysis of Error Detection Coverage

An extensive fault injection campaign with a total of more than 1 million runs and at least 10,000 observed errors per fault type provides the results for coverage and detection latency. The test setup allows three error types to be observed:

• Timeout: The kernel execution does not terminate. It is sometimes indicated by a raised exception. For the presented fault injection campaign, a generous timeout of 3 seconds has been chosen.
• Silent Data Corruption: The kernel execution terminates, but the verification reports an incorrect data output.
• Silent CPU Mismatch: The outputs of a CPU (here SystemBus or Interrupt) deviate from a golden reference, but no other effect has been observed. This error type includes deviating branch predictions and caching.

If no error has been observed at the end of the execution, the fault is assumed to be irrelevant due to masking. A masked fault can still be detected by fault tolerance components. Hence, the number of detected errors may be greater than the number of observed errors. The number of runs and the percentages of observed and detected errors are summarized in Table III.

TABLE III
NUMBER OF RUNS AND RATIO OF OBSERVED AND DETECTED ERRORS.

                              Logic      Logic      Memory     Memory
Component                     transient  permanent  transient  permanent
Number of runs                531,652    214,448    318,544    93,817
observed
  Timeouts                    1.07%      1.87%      0.67%      5.50%
  Silent Data Corruption      0.72%      0.58%      1.17%      4.18%
  Silent CPU Mismatch         1.32%      2.24%      1.32%      3.13%
  Total Observed              3.12%      4.69%      3.16%      12.81%
detected
  B1: CPU NMRbar              3.09%      4.61%      3.16%      12.78%
  B2: Mem. NMRbar             3.22%      4.88%      2.67%      12.41%
  C1: DLD (XOR)               4.80%      6.64%      6.04%      16.99%
  C2: DLD (H3)                7.91%      10.08%     5.85%      16.84%
  D: ECC                      0.09%      0.19%      6.74%      17.64%
  Total Detected              8.56%      11.03%     6.74%      17.64%

Permanent errors are present during the complete execution time, while transient errors typically appear for one clock cycle only. The longer presence increases the chance of a resulting failure and lowers the masking factor. This explains the higher percentage of observations and detections for permanent errors.

Only a small fraction of logic faults produces observable errors (transient: 3.12%; permanent: 4.69%) for two reasons: 1) Faults are injected into CMEM bits that are marked as essential by the FPGA toolchain. Not all of them are critical for the RISC-V design; [30] reports a ratio of critical to essential bits of 54%. 2) Many faults are masked before they cause abnormal behavior or propagate to the system bus.

The deeper the level of the error detection mechanism, the more errors are detected. For instance, the DLD (H3) detects more than twice as many logic errors as the NMRbars. Many of these additionally reported errors are false alarms, because they are masked before they provoke an observable behavior.

Table IV shows the coverage of the fault tolerance components for observable errors. The high number of runs ensures tight 95% confidence intervals; they reach at most 0.88 percentage points below and at most 0.86 percentage points above the reported coverages. While the CPU NMRbar (B1) counts fewer detected errors than any other fault tolerance component⁴, it achieves the highest coverage. The reason is that the CPU NMRbar (B1) supervises the complete RISC-V CPU, while the memory NMRbar (B2) and the DLD (C) are limited to a subset of internal states and registers.

A very small number of logic errors are observed but not detected by any fault tolerance component (e.g., some timeouts). These errors stay undetected because of a common cause, such as a fault in the clock tree or the reset signal. Furthermore, the fault injection area contains several IOs, which are a single point of failure. Hence, no fault tolerance component achieves complete coverage for logic faults. On the other hand, timeouts, SDCs, and mismatches on CPU outputs resulting from memory faults are completely covered by the ECC decoder, DLD, and CPU NMRbar. The memory NMRbar (B2) does not detect all memory errors because it monitors output mismatches in the direction 'CPU to memories', but not 'memories to CPU'.

⁴Only the ECC decoder counts fewer errors during logic fault injection tests, because it does not cover logic errors.
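The per-run outcome classification used in this campaign (masked, timeout, silent data corruption, silent CPU mismatch) can be expressed as a small decision function. The following Scala sketch is illustrative only, with hypothetical field names rather than the campaign's actual post-processing:

    object RunClassification {
      sealed trait Outcome
      case object Masked               extends Outcome // no observable effect; fault assumed masked
      case object Timeout              extends Outcome // kernel did not terminate within 3 s
      case object SilentDataCorruption extends Outcome // kernel finished, verification reports wrong output
      case object SilentCpuMismatch    extends Outcome // CPU outputs deviate from the golden reference

      def classify(finished: Boolean, resultCorrect: Boolean, matchesGoldenRun: Boolean): Outcome =
        if (!finished) Timeout
        else if (!resultCorrect) SilentDataCorruption
        else if (!matchesGoldenRun) SilentCpuMismatch
        else Masked
    }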
The H3 fingerprint does not achieve significantly superior coverage compared to the XOR chain, because each test run injects single-bit faults only. It is expected that the advantage of H3 will become visible in the case of multi-bit faults.

TABLE IV
COVERAGE OF FAULT TOLERANCE COMPONENTS.

                    Logic      Logic      Memory     Memory
Component           transient  permanent  transient  permanent
B1: CPU NMRbar      99.28%     98.27%     100.0%     100.0%
B2: Mem. NMRbar     97.20%     93.77%     73.07%     90.77%
C1: DLD (XOR)       97.53%     94.48%     100.0%     100.0%
C2: DLD (H3)        97.66%     94.62%     100.0%     100.0%
D: ECC              1.07%      1.73%      100.0%     100.0%

E. Average Error Detection Latency

Fig. 6 shows the average detection latency, i.e., the number of clock cycles elapsed between fault injection and error detection, of observable errors. The average error detection latency is significant; e.g., for memory faults it is in the order of magnitude of 100,000 clock cycles. Permanent faults tend to …

[Fig. 6. Average error detection latency. Left: logic errors (B1: CPU NMRbar, B2: Mem. NMRbar, C1: DLD (XOR), C2: DLD (H3)). Right: memory errors (B1: CPU NMRbar, D: ECC, C1: DLD (XOR), C2: DLD (H3)). Y-axis: clock cycles (0 to 2.5 × 10^5), for transient and permanent faults.]

F. Error Detection Distribution

Reliability theory defines a failure density function and a failure distribution. Error detection is also a stochastic process and can be modeled with a probability distribution. Similarly to reliability theory [31], we utilize an error detection density function e_d(t) and an error detection distribution E_d(t). Here, e_d(t) is given by the ratio of the number of detected errors occurring in a time interval to the number of all observed errors, divided by the length of the time interval:

    e_d(t_i) = [n_detect(t_i) − n_detect(t_i + Δt_i)] / (N_observed · Δt_i)    (2)

The error detection distribution E_d(t) specifies the ratio of detected errors over time; it is calculated by integration of e_d(t):

    E_d(t) = ∫_0^t e_d(τ) dτ    (3)

The error detection distribution E_d(t) is fitted best with a cumulative Weibull distribution starting at an intercept c_inst, which represents the ratio of instantly detected errors. It saturates at the maximum coverage c_sat (as given in Table IV) and follows a complementary stretched exponential function:

    E_d(t) = c_sat − (c_sat − c_inst) · exp(−(λt)^k)    (4)

Fig. 7 plots the error detection distribution E_d(t) exemplarily for transient logic errors. Adding the DLD with the XOR algorithm (C1) to the CPU NMRbar (B1) has several benefits, which is illustrated by the uppermost curve. The ratio of instantly detected errors (intercept c_inst) is increased from 0.6620 to 0.9309. The cumulative Weibull distribution approaches its asymptotic maximum c_sat more quickly, which indicates an earlier error detection and can be quantified by the k and λ parameters. The shape parameter k < 1 indicates a decreasing error detection rate over time; hence, minimizing k is desirable. The addition of the DLD reduces the shape parameter k from 0.50 to 0.40. The scale parameter λ represents the error detection rate and should be maximized. It increases from 2.58 × 10⁻⁵ to 7.72 × 10⁻⁵ per clock cycle.

[Fig. 7: error detection distribution E_d(t) for transient logic errors; y-axis: detected errors / total observed errors.]
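Eq. (4) can be evaluated directly with the parameters reported above. The following Scala sketch compares the fitted detection distribution for the CPU NMRbar alone (B1) against B1 plus the XOR DLD (C1) for transient logic errors; c_sat for the combined curve is not reported separately, so the B1 coverage from Table IV is reused here purely for illustration.

    object ErrorDetectionFit extends App {
      // Fitted model of Eq. (4): E_d(t) = c_sat - (c_sat - c_inst) * exp(-(lambda * t)^k)
      case class EdModel(cSat: Double, cInst: Double, lambda: Double, k: Double) {
        def apply(t: Double): Double =
          cSat - (cSat - cInst) * math.exp(-math.pow(lambda * t, k))
      }

      // Parameters reported for transient logic errors (c_sat of the combined curve assumed)
      val b1Only   = EdModel(cSat = 0.9928, cInst = 0.6620, lambda = 2.58e-5, k = 0.50)
      val b1AndDld = EdModel(cSat = 0.9928, cInst = 0.9309, lambda = 7.72e-5, k = 0.40)

      // Ratio of detected errors after 10,000 and 100,000 clock cycles
      for (t <- Seq(1e4, 1e5))
        println(f"t = ${t}%.0f cycles: B1 = ${b1Only(t)}%.3f, B1 + DLD(XOR) = ${b1AndDld(t)}%.3f")
    }

With the same saturation value, the larger intercept and scale parameter of the combined configuration translate into a substantially earlier detection over the whole time range.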
VI. CONCLUSION

This work demonstrates how RISC-V implementations can be designed to be fault tolerant. The resulting framework allows developers, researchers, and students to generate various redundancy schemes without much effort, as shown in Listings 1, 2, and 3. As examples, DMR and TMR configurations are evaluated for fail-safe and fail-functional system behavior, respectively. Memory arrays can optionally be excluded from replication, allowing for large resource savings. Protection against memory errors is ensured by various ECCs. An additional Deep Level Detector (DLD) significantly reduces the error detection latency.

Future work aims to bound the maximum error detection latency, which could be achieved by adding cyclic software checks or an improved DLD. The quality of different hash polynomials will be evaluated. Furthermore, the fault tolerance framework will be extended to the remaining SoC components such as the system bus and the periphery.

ACKNOWLEDGMENT

This work is part of BMBF FKZ 16ES1003 "KI-PRO".

REFERENCES

[1] RISC-V Market Analysis: The New Kid on the Block, https://semico.com/content/risc-v-market-analysis-new-kid-block, SEMICO Research Corporation, Nov. 2019, accessed: 2021-12-10.
[2] Triple-Modular-Redundancy RISC-V Demonstrator, https://github.com/ThalesGroup/TMR/blob/master/RISC-V-demonstrator--docs.pdf, Antmicro, accessed: 2021-12-10.
[3] B. James, H. Quinn, M. Wirthlin, and J. Goeders, "Applying compiler-automated software fault tolerance to multiple processor platforms," IEEE Trans. on Nuclear Science, vol. 67, no. 1, pp. 321–327, 2020.
[4] A. B. de Oliveira, L. A. Tambara, F. Benevenuti, L. A. C. Benites, N. Added, V. A. P. Aguiar, N. H. Medina, M. A. G. Silveira, and F. L. Kastensmidt, "Evaluating soft core RISC-V processor in SRAM-based FPGA under radiation effects," IEEE Transactions on Nuclear Science, vol. 67, no. 7, pp. 1503–1510, 2020.
[5] L. A. Aranda, N.-J. Wessman, L. Santos, A. Sánchez-Macián, J. Andersson, R. Weigand, and J. A. Maestro, "Analysis of the critical bits of a RISC-V processor implemented in an SRAM-based FPGA for space applications," Electronics, vol. 9, no. 1, 2020.
[6] C. Rodrigues, I. Marques, S. Pinto, T. Gomes, and A. Tavares, "Towards a heterogeneous fault-tolerance architecture based on arm and RISC-V processors," in IECON 2019 - 45th Annual Conference of the IEEE Industrial Electronics Society, vol. 1, 2019, pp. 3112–3117.
[7] M. T. Sim and Y. Zhuang, "A dual lockstep processor system-on-a-chip for fast error recovery in safety-critical applications," in IECON 2020 - The 46th Annual Conference of the IEEE Industrial Electronics Society, 2020, pp. 2231–2238.
[8] L. Blasi and F. Vigli, "The first space-qualified klessydra RISC-V microcontroller to be launched on a satellite," in RISC-V Workshop Zurich Proceedings. RISC-V International, 2019.
[9] M. Barbirotta, A. Mastrandrea, F. Menichelli, F. Vigli, L. Blasi, A. Cheikh, S. Sordillo, F. Di Gennaro, and M. Olivieri, "Fault resilience analysis of a risc-v microprocessor design through a dedicated uvm environment," in 2020 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2020.
[10] S. Gupta, N. Gala, G. S. Madhusudan, and V. Kamakoti, "Shakti-f: A fault tolerant microprocessor architecture," in 2015 IEEE 24th Asian Test Symposium (ATS), Nov 2015, pp. 163–168.
[11] W. F. Heida, "Towards a fault tolerant RISC-V softcore," Master's thesis, TU Delft, 2016.
[12] A. A. Verhage, "A fault tolerant memory architecture for a RISC-V softcore," Master's thesis, TU Delft, 2016.
[13] D. A. Santos, L. M. Luza, C. A. Zeferino, L. Dilillo, and D. R. Melo, "A low-cost fault-tolerant RISC-V processor for space systems," in 2020 15th Design Technology of Integrated Systems in Nanoscale Era (DTIS), 2020, pp. 1–5.
[14] D. A. Santos, L. M. Luza, L. Dilillo, C. A. Zeferino, and D. R. Melo, "Reliability analysis of a fault-tolerant risc-v system-on-chip," Microelectronics Reliability, vol. 125, p. 114346, 2021.
[15] C. Hernàndez, J. Flieh, R. Paredes, C.-A. Lefebvre, I. Allende, J. Abella, D. Trillin, M. Matschnig, B. Fischer, K. Schwarz, J. Kiszka, M. Rönnbäck, J. Klockars, N. McGuire, F. Rammerstorfer, C. Schwarzl, F. Wartet, D. Lüdemann, and M. Labayen, "Selene: Self-monitored dependable platform for high-performance safety-critical systems," in 2020 23rd Euromicro Conference on Digital System Design (DSD), 2020, pp. 370–377.
[16] N.-J. Wessman, F. Malatesta, J. Andersson, P. Gomez, M. Masmano, V. Nicolau, J. L. Rhun, G. Cabo, F. Bas, R. Lorenzo, O. Sala, D. Trilla, and J. Abella, "De-risc: the first risc-v space-grade platform for safety-critical systems," in 2021 IEEE Space Computing Conference (SCC), 2021, pp. 17–26.
[17] J. L. Rhun, V. Nicolau, A. Garcia-Vilanova, J. Andersson, and S. Alcaide, "De-risc: Launching risc-v into space," in European Workshop on On-Board Data Processing (OBDP2021), 6 2021.
[18] F. G. Molinero, M. Masmano, V. Nicolau, N.-J. Wessman, J. Andersson, J. L. Rhun, G. Cabo, S. Alcaide, P. Benedicte, and J. Abella, "De-RISC - Dependable Real-time RISC-V Infrastructure for Safety-critical Space and Avionics Computer Systems," in Data Systems in Aerospace (DASIA) 2021, 2021.
[19] "Mi-v risc-v ecosystem," https://www.microsemi.com/product-directory/fpga-soc/5210-mi-v-embedded-ecosystem, 2022, accessed: 2022-05-11.
[20] M. Pietzsch, "Risc-v processor core for functional safety," Fraunhofer Institute for Photonic Microsystems IPMS, Tech. Rep. 0621002, 6 2021.
[21] J. Yiu, "Design of SoC for High Reliability Systems with Embedded Processors," in Embedded World Conference 2015, 2015.
[22] C. Hernandez and J. Abella, "Timely error detection for effective recovery in light-lockstep automotive systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 11, pp. 1718–1729, 2015.
[23] X. Iturbe, B. Venu, E. Ozer, J.-L. Poupat, G. Gimenez, and H.-U. Zurek, "The arm triple core lock-step (tcls) processor," ACM Trans. Comput. Syst., vol. 36, no. 3, Jun. 2019.
[24] J. Carter and M. N. Wegman, "Universal classes of hash functions," Journal of Computer and System Sciences, vol. 18, no. 2, pp. 143–154, 1979.
[25] A. Dörflinger, Y. Guan, S. Michalik, S. Michalik, J. Naghmouchi, and H. Michalik, "Ecc memory for fault tolerant risc-v processors," in Architecture of Computing Systems – ARCS 2020, A. Brinkmann, W. Karl, S. Lankes, S. Tomforde, T. Pionteck, and C. Trinitis, Eds. Cham: Springer International Publishing, 2020, pp. 44–55.
[26] "S32 automotive platform," https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/s32-automotive-platform:S32, 2021, accessed: 2021-12-10.
[27] K. Asanović, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas, A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo, and A. Waterman, "The rocket chip generator," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, Apr 2016.
[28] A. Dörflinger, M. Albers, B. Kleinbeck, Y. Guan, H. Michalik, R. Klink, C. Blochwitz, A. Nechi, and M. Berekovic, "A comparative survey of open-source application-class RISC-V processor implementations," in Proceedings of the 18th ACM International Conference on Computing Frontiers, ser. CF '21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 12–20.
[29] UltraScale Architecture Soft Error Mitigation Controller, 3rd ed., https://www.xilinx.com/support/documentation/ip_documentation/sem_ultra/v3_1/pg187-ultrascale-sem.pdf, Xilinx, Inc., 2021, accessed: 2021-12-10.
[30] H. Michel, H. Guzmán-Miranda, A. Dörflinger, H. Michalik, and M. A. Echanove, "Seu fault classification by fault injection for an fpga in the space instrument sophi," in 2017 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), 2017, pp. 9–15.
[31] M. L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. John Wiley and Sons, Ltd, 2002.