
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Cryptographic Accelerators for Digital Signature Based on Ed25519

Mojtaba Bisheh-Niasar, Reza Azarderakhsh, Member, IEEE, and Mehran Mozaffari-Kermani, Senior Member, IEEE

Abstract: This article presents highly optimized implementations of the Ed25519 digital signature algorithm [Edwards curve digital signature algorithm (EdDSA)]. This algorithm significantly improves the execution time without sacrificing security, compared to existing digital signature algorithms. Although EdDSA is employed in many widely used protocols, such as TLS and SSH, there appear to be extremely few hardware implementations that focus only on EdDSA. Hence, we propose two different field-programmable gate array (FPGA)-based EdDSA implementations, i.e., efficient and high-performance Ed25519 architectures applicable for a security level comparable to AES-128. Our proposed efficient Ed25519 scheme achieves an improvement of more than 84% compared to the best previous work by reducing the required area. It also incorporates more than 8× speedup. Furthermore, our proposed high-performance architecture shows a 21× speedup with more than 6200 digital signature algorithms per second, showing a significant improvement in terms of utilized area × time on a Xilinx Zynq-7020 FPGA. Finally, the effective side-channel countermeasures are embedded in our proposed designs, which also outperform the previous works.

Index Terms: Ed25519, Edwards curve digital signature algorithm (EdDSA), elliptic curve cryptography, hardware implementation, side channel.

Manuscript received December 30, 2020; revised April 1, 2021; accepted May 1, 2021. This work was supported in part by the National Institute of Standards and Technology (NIST) under Grant 60NANB16D246, in part by NSF under Grant 1801341, and in part by the Army Research Office (ARO) under Grant W911NF-17-1-0311. (Corresponding author: Mojtaba Bisheh-Niasar.)

Mojtaba Bisheh-Niasar and Reza Azarderakhsh are with the Department of Computer & Electrical Engineering and Computer Science (CEECS), Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: [email protected]; [email protected]).

Mehran Mozaffari-Kermani is with the Department of Computer Science and Engineering (CSE), University of South Florida, Tampa, FL 33620 USA (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2021.3077885.

Digital Object Identifier 10.1109/TVLSI.2021.3077885

1063-8210 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Robert Gordon University. Downloaded on May 28,2021 at 00:26:23 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Edwards curve digital signature algorithm (EdDSA), developed by Bernstein et al. [1], has gained prominent attention among the existing digital signature algorithms due to its fast operations without affecting the required security. Ed25519, as the most popular instance of EdDSA, is widely used as a digital signature method to guarantee the validity of the communications. On the other hand, the elliptic curve digital signature algorithm (ECDSA) is no longer suitable for embedded devices due to its vulnerability against side-channel analysis (SCA) attacks [2], [3]. Hence, most HTTPS websites are switching to Ed25519, which is suitable for higher level security requirements and, at the same time, addresses some backdoor issues [4] in other ECDSA constructions.

Although most current cryptosystems will be broken by quantum computing based on Shor's algorithm [5], the transition to postquantum cryptography (PQC) includes an emerging field called hybrid systems [6], requiring both classic and PQC [7]. Hence, designing high-security ECC-based digital signatures for different applications is crucial. EdDSA is notable for high speed and constant-time implementations and was quickly implemented as a part of the TLS and OpenSSH protocols [8]. Hence, it has to be implemented in various platforms subject to the performance requirement of the target application, such as constrained IoT devices. However, EdDSA has not been sufficiently studied, especially in the field of hardware implementation based on field-programmable gate arrays (FPGAs). Therefore, investigation of the hardware implementation of this algorithm is required considering the advantages of FPGA-based designs to exploit parallelism, which leads to improvements in the efficiency of the overall system.

There are two main solutions to enable the hardware-based digital signature algorithm in the constrained IoT: 1) an HW/SW approach to cope with embedded constraints and 2) a pure HW method that implements all instructions in hardware. The HW/SW method makes the design smaller, slower, and more controllable/programmable compared to pure HW schemes. Although the pure HW approach leads to better performance, HW/SW can be a better choice for IoT devices since it provides flexibility to switch security levels based on performance targets. In [9], the comparison of the CryptoCell API over nRF52840 as an internal HW/SW solution and the external cryptochip ATECC608A as a pure HW solution is thoroughly studied. Furthermore, to address higher security needs, new NIST and IETF recommendations make Curve448 suitable for higher level security requirements [10], [11]. Hence, implementing an HW/SW architecture brings the required flexibility among different security levels, while a general architecture can be implemented in HW and controlled by instruction set processors such that the hardware remains flexible to a great extent, which is beyond the scope of this work.

A. Related Work

As one of the first FPGA-based works in ECC-based digital signature, Glas et al. [12] proposed an architecture for 128-bit security to integrate into a vehicle-to-vehicle communication system. Furthermore, Panjwani [13] presented a scalable hardware implementation in prime fields over NIST recommended field sizes up to 521 bit, employing a hardware–software codesign approach. The work of Vliegen et al. [14] introduced a compact core over the NIST P-256 curve resistant against simple power analysis (SPA) attacks. Moreover, Zhang and Bai [15] proposed a core with a 128-bit security level over the SM2 curve.

Recently, a number of hardware implementations have been introduced to implement an elliptic curve point multiplication (ECPM) core over Curve25519. Sasdrich and Güneysu [16] proposed the first Curve25519 implementation using a DSP-based single-core architecture. This work has been extended by adding side-channel countermeasures in [17] and [18] to provide an evaluation against common physical attacks. In [19], fast and compact implementations of ECPM were proposed. This architecture employs a semisystolic bit-serial multiplier and carry-compact addition to provide a high-performance architecture. The work of Koppermann et al. [20], [21] presented a high-speed prime field multiplier with a latency of 92 μs for a point multiplication. In addition, in [22], a low-latency ECPM was proposed employing a pipelined arithmetic architecture on FPGA and ASIC platforms. It should be noted that FPGA implementations of Curve25519 in the literature cannot be directly compared to ours because the ECPM core in EdDSA occupies more resources for implementing the hash core and modulo-L reduction. Furthermore, it requires more time for a point multiplication since this architecture is reused for nonmodular multiplication and modulo-L reduction.

A non-DSP-based Ed25519 point multiplication core was presented by Mehrabi and Doche [23] using the double-and-add algorithm. Hence, this architecture is a nonconstant-time core vulnerable to SPA attacks. Notably, the reported area does not include all the required modules for providing a digital signature, such as the hash function and modulo-L reduction. We find that SHA-512 increases the utilized area of Ed25519 by almost 25%. Moreover, Turan and Verbauwhede [24] proposed an Ed25519 architecture combined with the X25519 key exchange. This design targets resource-constrained devices on a Zynq SoC. Turan and Verbauwhede [24, Sec. 3.3] claimed that the cost of computing using restricted-X coordinates of a point on the Montgomery curve is more than extended coordinates on the twisted Edwards curve due to the complexity of coordinate conversion. Therefore, the core works over the twisted Edwards curve. Besides, although side-channel countermeasures are considered for the ECPM core, the authors do not include a resistant SHA-512 core, allowing vulnerability against SCA, as shown in [25].

Based on the aforementioned discussions, the tradeoff explorations between resource utilization and performance to implement an efficient Ed25519 implementation from different optimization perspectives have not been thoroughly studied. Particularly, designing a unified architecture consisting of physical protection against SCA in all submodules to perform secure key generation, signature generation, and signature verification is required. Besides, employing the fast and efficient Karatsuba-based multiplier for designing a high-performance Ed25519 architecture should be investigated. Eventually, the signature computation cost over the Edwards domain compared to the Montgomery domain for a highly parallel design should be investigated.

B. Contributions

To the best of our knowledge, there appear to be very few hardware implementations that focus only on Ed25519 and make the best of all its features. In this work, we present two different architectures, i.e., efficient and high-performance designs of Ed25519, considering different performance levels for time-constrained and area-constrained applications.

Our contributions to this work are listed as follows.

1) We propose a new approach for implementing the EdDSA accelerator on FPGA. We analyze the computation of the restricted-X coordinates of a point on the Montgomery curve with additional coordinate conversion and design two novel, highly parallel hardware architectures based on these algorithms. In this article, we show how to leverage the advantages of computation over the Montgomery curve while implementing Ed25519 accelerator circuits so that the true benefits of the accelerator circuits can be achieved.

2) We explore the tradeoffs of area and performance to accomplish different optimization perspectives. We demonstrate various optimization techniques in order to achieve an overall optimization in terms of efficiency, including the parallelization, resource sharing, redundant number presentation, adoption of distributed RAM and ROM blocks, and interleaved architecture, which achieves above 84% efficiency improvement of the area–time product compared to the leading FPGA implementations.

3) We instantiate the proposed architecture in a Xilinx Zynq-7020 FPGA and provide performance evaluations. The effective countermeasures against SCA are embedded to enhance the resistance of the proposed architectures against timing, SPA, and differential power analysis (DPA) attacks.

The remainder of this article is organized as follows. Section II presents the background. Section III presents our proposed architectures. The experimental results and comparison are given in Section IV. We conclude this article in Section V.

II. PRELIMINARIES

A. Background

A point $P = (x, y)$ lies on a twisted Edwards curve $E$ if it belongs to the set $E = \{(x, y) \in \mathbb{F}_p \times \mathbb{F}_p : ax^2 + y^2 = 1 + dx^2y^2\}$. Ed25519 is a type of Schnorr signature employing (twisted) Edwards curves developed by Bernstein et al. [1]. Ed25519 includes three different phases, i.e., key generation, signing, and verifying. In the key generation, KeyGen($s$) takes a parameter $s$ and computes a signing key $sk$ and a public key $pk$ with associated message space $M$. In signing, a signature $(R, S)$ is generated by Sign($sk$, $m$), taking an $sk$ and a message $m \in M$. The signature $(R, S)$ can be verified by Verify($pk$, $m$, $R$, $S$) considering the public key $pk$ and message $m \in M$. The Appendix gives these algorithms. For details, we refer interested readers to [26]. Moreover, the Ed25519 curve is birationally equivalent to a Montgomery curve called Curve25519, introduced by Bernstein [27] in 2006.

For group arithmetic based on Ed25519, the computation can be performed on extended homogeneous coordinates [1], [26]. A mapping between affine coordinates $(x, y)$ and extended coordinates $(X, Y, Z, T)$ for a point $P$ is defined by $x = X/Z$, $y = Y/Z$, and $x \times y = T/Z$.
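To make the three phases and the extended-coordinate mapping concrete, the following is a minimal software sketch in Python, assuming the RFC 8032 [26] encoding conventions. It derives the base point from $y = 4/5$, adds points with the unified extended-coordinate addition that the text states as (1), and runs KeyGen, Sign, and Verify. All function names (`expand_secret`, `point_mul`, `keygen`, and so on) are illustrative choices rather than names from the article, and the scalar multiplication here is a plain variable-time double-and-add, so it says nothing about the constant-time hardware proposed in this work.

```python
import hashlib

p = 2**255 - 19
L = 2**252 + 27742317777372353535851937790883648493   # group order
d = -121665 * pow(121666, p - 2, p) % p                # curve constant (a = -1)

def sha512_int(*parts):
    return int.from_bytes(hashlib.sha512(b"".join(parts)).digest(), "little")

def point_add(P, Q):
    # Points are (X, Y, Z, T) with x = X/Z, y = Y/Z, x*y = T/Z.
    A = (P[1] - P[0]) * (Q[1] - Q[0]) % p
    B = (P[1] + P[0]) * (Q[1] + Q[0]) % p
    C = 2 * d * P[3] * Q[3] % p
    D = 2 * P[2] * Q[2] % p
    E, F, G, H = B - A, D - C, D + C, B + A
    return (E * F % p, G * H % p, F * G % p, E * H % p)

def point_mul(s, P):
    # Variable-time double-and-add: for illustration only, not side-channel safe.
    Q = (0, 1, 1, 0)                                   # neutral element
    while s > 0:
        if s & 1:
            Q = point_add(Q, P)
        P = point_add(P, P)
        s >>= 1
    return Q

def point_equal(P, Q):
    # Compare x and y without inversions by cross-multiplying with Z.
    return ((P[0] * Q[2] - Q[0] * P[2]) % p == 0 and
            (P[1] * Q[2] - Q[1] * P[2]) % p == 0)

def recover_x(y, sign):
    if y >= p:
        return None
    x2 = (y * y - 1) * pow(d * y * y + 1, p - 2, p) % p
    if x2 == 0:
        return None if sign else 0
    x = pow(x2, (p + 3) // 8, p)                       # square root for p = 5 mod 8
    if (x * x - x2) % p != 0:
        x = x * pow(2, (p - 1) // 4, p) % p
    if (x * x - x2) % p != 0:
        return None
    return p - x if (x & 1) != sign else x

g_y = 4 * pow(5, p - 2, p) % p
g_x = recover_x(g_y, 0)
G = (g_x, g_y, 1, g_x * g_y % p)                       # base point, extended form

def compress(P):
    zinv = pow(P[2], p - 2, p)
    x, y = P[0] * zinv % p, P[1] * zinv % p
    return (y | ((x & 1) << 255)).to_bytes(32, "little")

def decompress(s):
    y = int.from_bytes(s, "little")
    sign, y = y >> 255, y & ((1 << 255) - 1)
    x = recover_x(y, sign)
    return None if x is None else (x, y, 1, x * y % p)

def expand_secret(sk):
    h = hashlib.sha512(sk).digest()
    a = int.from_bytes(h[:32], "little")
    a = (a & ((1 << 254) - 8)) | (1 << 254)            # clamp the secret scalar
    return a, h[32:]

def keygen(sk):
    a, _ = expand_secret(sk)
    return compress(point_mul(a, G))

def sign(sk, msg):
    a, prefix = expand_secret(sk)
    A = compress(point_mul(a, G))
    r = sha512_int(prefix, msg) % L                    # deterministic nonce
    R = compress(point_mul(r, G))
    h = sha512_int(R, A, msg) % L
    return R + ((r + h * a) % L).to_bytes(32, "little")

def verify(pk, msg, sig):
    if len(pk) != 32 or len(sig) != 64:
        return False
    A, R = decompress(pk), decompress(sig[:32])
    S = int.from_bytes(sig[32:], "little")
    if A is None or R is None or S >= L:
        return False
    h = sha512_int(sig[:32], pk, msg) % L
    return point_equal(point_mul(S, G), point_add(R, point_mul(h, A)))
```

A typical use is `pk = keygen(sk)` for a 32-byte seed `sk`, then `sig = sign(sk, msg)` and `verify(pk, msg, sig)`. Note that the point arithmetic never leaves extended coordinates until `compress` performs the single inversion needed to return to affine form.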


Let $P_1 = (X_1, Y_1, Z_1, T_1)$ and $P_2 = (X_2, Y_2, Z_2, T_2)$; $P_3 = P_1 + P_2$ can be computed using the following formula:

$$
\begin{aligned}
A &= (Y_1 - X_1)\cdot(Y_2 - X_2), \quad B = (Y_1 + X_1)\cdot(Y_2 + X_2)\\
C &= 2d\cdot T_1\cdot T_2, \quad D = 2Z_1\cdot Z_2, \quad E = B - A\\
F &= D - C, \quad G = D + C, \quad H = B + A\\
X_3 &= E\cdot F, \quad Y_3 = G\cdot H, \quad T_3 = E\cdot H, \quad Z_3 = F\cdot G.
\end{aligned}
\tag{1}
$$

Hisil et al. [28] introduced an efficient unified point addition and a dedicated point doubling formula. Hamburg [29] suggested a method for mixed readdition using extended coordinates. However, an efficient computation can be performed using the restricted-X coordinate on the Montgomery curve. In that case, the Y-coordinate of the result must be recovered using the method proposed by Okeya and Sakurai [30]. Eventually, the achieved point should be mapped to twisted Edwards space.

B. Side-Channel Protection

Although both EdDSA and ECDSA rely on an ephemeral and secret random number to sign a message, the ECDSA procedure does not specify how this random number is generated. Hence, the security of ECDSA is based on the quality of random number generators (RNGs) and how to implement them securely. In contrast, EdDSA employs a hash function to generate this number in a secret but deterministic way.

ECDSA vulnerability against SCA has been shown in several research works [2], [3]. Recently, Aranha et al. [31] showed that ECDSA over 192- and 160-bit elliptic curves can be broken by exploiting even less than one bit of leakage. Several countermeasures, including Z-coordinate randomization and constant-time implementation of the group law, are suggested to avoid these vulnerabilities [31].

Constant-time and secret-independent computations are popular countermeasures against timing and SPA attacks, respectively. Simple point randomization [32] provides protection against DPA attacks using a random value, whereas the scalar multiplication output is not changed. Let $B = (X, Y, Z)$ be the base point representation in projective coordinates and $\lambda \in \mathbb{Z}_p \setminus \{0\}$ be a random number. The base point can be altered such that $B_r = (\lambda \cdot X, \lambda \cdot Y, \lambda \cdot Z) = (\lambda \cdot X, \lambda \cdot Y, \lambda)$ for $Z = 1$, which yields a different representation of the same point, since $x_B = X/Z = (\lambda \cdot X)/(\lambda \cdot Z) \bmod p$ and $y_B = Y/Z = (\lambda \cdot Y)/(\lambda \cdot Z) \bmod p$.

A continuous point randomization approach can be applied to the projective coordinate representation of points after each iteration of the Montgomery ladder. This approach was implemented in [18].

Samwel et al. [25] proposed an attack on Ed25519 by measuring the power consumption of approximately 4000 traces. This work also suggested a countermeasure that removes the deterministic property of the signature.

III. TARGET ARCHITECTURES FOR Ed25519

This article introduces two different architectures for Ed25519, i.e., high-performance and efficient schemes, and discusses their primitives to achieve the considered optimization objectives. The arithmetic multiplier unit in the high-performance scheme is derived from our previous work presented in [19].

A. Design I: High-Performance Architecture

To design a high-performance Ed25519 scheme, we need to accelerate the scalar multiplication procedure as the most time-consuming part of the signature algorithm, particularly its modular multiplication unit. Hence, we design a low-latency modular multiplier followed by an interleaved reduction.

In this scheme, the full width of 255 bit is implemented to minimize data transition latency and maximize parallelization within the arithmetic logic unit (ALU). Therefore, loading and storing data take only one cycle to accelerate ALU throughput. Addition/subtraction between two operands is performed in 255-bit data width in one clock cycle. Moreover, the interleaved reduction is performed at the cost of one additional cycle in a pipeline fashion.

1) Modular Multiplication: Different multiplication approaches, such as Schoolbook or Toom-3, are investigated for resource and area optimization, while the Karatsuba multiplication consumes fewer resources and less time than the other mentioned multipliers [22]. The Karatsuba multiplication can be performed for $n$-bit integers $A$ and $B$ such that $C = A \cdot B = (a_1\phi + a_0)\cdot(b_1\phi + b_0) = a_1b_1\phi^2 + a_0b_0 + ((a_1 + a_0)\cdot(b_1 + b_0) - a_1b_1 - a_0b_0)\phi$, where $A = (a_1\phi + a_0)$, $B = (b_1\phi + b_0)$, $\phi = 2^{n/2}$, and $A, B, C \in GF(p)$.

Hence, we implement different levels of the Karatsuba multiplication to investigate their efficiency in terms of $A \cdot T$, where $A$ and $T$ are the required resources and time, respectively. By applying the $k$-level Karatsuba multiplication, an $n \times n$-bit multiplier is broken into $3^k$ multipliers, each performing an $(n/2^k) \times (n/2^k)$-bit multiplication. Therefore, the maximum level of the consecutive Karatsuba multiplication before decreasing performance can be four levels due to DSP block specifications.

Our modular arithmetic units for the proposed high-performance design are illustrated in Fig. 1. In this scheme, a 255 × 255-bit multiplication is decomposed into 81 16 × 16-bit multipliers in four consecutive levels. All partial products are computed simultaneously in one cycle. An addition tree is designed in a backward direction to merge the products and build the final result. The pipelined multiplier has five stages, of which three are required for the multiplication and the remaining ones for the interleaved reduction in a pipeline fashion. Hence, the full five cycles are incurred only by the first multiplication; thereafter, a 255 × 255-bit product becomes available every cycle. The proposed scheduling for performing a Montgomery ladder step is depicted in Fig. 2.

2) Mod p Reduction: Employing the Karatsuba multiplication in the first level can also be used for implementing the fast modular reduction to optimize computations. This multiplication includes two main stages: breaking inputs and merging the results. The breaking stage decomposes $A$ into $a_1$, $a_0$, and $a_1 + a_0$ ($B$ is decomposed similarly to $A$), and the merging stage computes the addition between $C_2 = a_1b_1$, $C_0 = a_0b_0$, and $C_1 = (a_1 + a_0)\cdot(b_1 + b_0)$, where $\phi = 2^{256/2} = 2^{128}$ in the first level. Since $2p = 2^{256} - 38$, i.e., $2^{256} \equiv 38 \pmod p$, the merging stage in the first level of the Karatsuba multiplication can be used for the fast reduction such that $C = A \cdot B = (a_1 2^{128} + a_0)\cdot(b_1 2^{128} + b_0) = a_1b_1 2^{256} + a_0b_0 + ((a_1 + a_0)\cdot(b_1 + b_0) - a_1b_1 - a_0b_0)2^{128} \equiv 38C_2 + C_0 + (C_1 - C_2 - C_0)2^{128} \pmod p$. Hence, the computed $C$ can be presented in 387 bit. Thus, the first reduction stage reduces the width of the obtained result from 512 to 387 bit, which increases our expected performance. Suppose that $C$ is presented in two parts, $C_l$ and $C_h$, which are its first 255 bit and remaining 132 bit, such that $C = C_h 2^{255} + C_l$. Therefore, the subsequent reduction stage is applied to $C$ such that $C' = 19C_h + C_l$. In addition, $C'' = C' - p$ is computed in the case of $C' > p$, and the output is chosen between $C'$ and $C''$ considering the subtraction borrow flag.

Fig. 1. Highly parallel modular multiplier in the high-performance scheme.

B. Design II: Efficient Architecture

Fig. 3 shows the lower level arithmetic operations for our proposed efficient architecture. In this architecture, we consider decreasing the required resources as the main optimization objective, while the area–time factor is simultaneously improved. Furthermore, DSP components as the critical resource in FPGA significantly affect architecture performance. Therefore, improvement of the $A_d \times T$ metric should be considered as another vital factor to describe efficiency, where $A_d$ is the number of employed DSPs.

In this scheme, the data width of 128 bit is implemented within the ALU to decrease the critical path delay (CPD). However, in the modular reduction unit, a redundant representation providing 8 more bits, i.e., 136 bit, is implemented to avoid the cost of carry propagation between digits. Moreover, addition/subtraction between two operands, i.e., $C = A \pm B$, is performed in 128-bit data width, which takes two clock cycles. Hence, the carry is propagated between digits employing a register. Furthermore, the reduction stage performs $C' = C \mp p$ at the cost of two additional cycles. Both $C$ and $C'$ are stored in the memory unit, and the correct result is determined by a flag obtained from the previous carry/borrow.

1) Modular Multiplication: Modular multiplication can be computed by four 128 × 128-bit partial products, i.e., $a_0b_0$, $a_0b_1$, $a_1b_0$, and $a_1b_1$. Operands can be read from the memory unit in a cycle to feed two input registers. Then, four multiplications are consecutively performed for these required products. For example, $a_0b_0$ is computed by $a_{00}b_{00}$, $a_{00}b_{01}$, $a_{01}b_{00}$, and $a_{01}b_{01}$, where $a_0 = a_{01}2^{64} + a_{00}$ and $b_0 = b_{01}2^{64} + b_{00}$.

The centerpiece of the modular multiplication unit is a 64 × 64 pipelined schoolbook multiplier implemented by 16 DSPs. The architecture of our proposed multiplication core is illustrated in Fig. 3. In order to accumulate the partial products, a 256-bit register and a 128-bit adder are designed. Thus, the partial product is accumulated with the upper half of the register. Furthermore, according to the sequence of multiplications, i.e., starting from $a_{00}b_{00}$, then $a_{00}b_{01}$, $a_{01}b_{00}$, and eventually $a_{01}b_{01}$, the register is shifted downward by 64 bit before accumulating the second and fourth partial products. Thus, when the pipeline stages are full, 64 × 64-, 128 × 128-, and 255 × 255-bit multiplications complete every one, four, and 16 cycles, respectively. The proposed scheduling for performing a Montgomery ladder step is depicted in Fig. 4.

Furthermore, the field inversion is based on Fermat's little theorem (FLT) together with the addition chain method, executing 254 squarings and 11 multiplications. We also utilize an additional dedicated ROM for performing inversion to decrease the required size in the main ROM.

2) Mod p Reduction: Modular multiplication is interleaved by a reduction unit, which accumulates partial products to perform a fast reduction. As mentioned earlier, modular multiplication performs a 255 × 255-bit multiplication in four sequential partial products, which takes 16 clock cycles. Thus, the modular reduction unit is fed by the multiplier every four cycles to implement a fast reduction as follows:

$$
\begin{aligned}
C = A \cdot B &= (a_1 2^{128} + a_0)\cdot(b_1 2^{128} + b_0)\\
&= a_1b_1 2^{256} + a_0b_0 + a_0b_1 2^{128} + a_1b_0 2^{128}\\
&\equiv 38C_3 + C_0 + C_1 2^{128} + C_2 2^{128} \pmod p
\end{aligned}
\tag{2}
$$

where $C_3 = a_1b_1$, $C_2 = a_1b_0$, $C_1 = a_0b_1$, and $C_0 = a_0b_0$.

In order to diminish the cost of carry propagation, a redundant representation is employed in the proposed reduction architecture. This unit uses several registers and adders with a 136-bit datapath, providing 8 more bits for each digit. A first register pair, i.e., $R_1$ and $R_2$, takes partial products from the multiplier, and the accumulated data are computed using the second pair, i.e., $S_1$ and $S_2$. According to the sequence of multiplications, i.e., starting from $C_3$, then $C_0$, $C_1$, and, eventually, $C_2$, multiplication with the small integer $38 = (100110)_2$ is performed using the shift and addition approach in the first four-cycle period. Then, the accumulated data are stored in the second register pair to be added to $C_0$. After that, the S-registers are shifted downward by 136 bit, and the accumulation is continued until the last partial product is added. Hence, the result represented in the S-registers is ready for the last stage of reduction.

The accumulated result is prepared in the S-registers for the last stage of reduction by applying a shift such that $C = S_2 2^{255} + S_1 2^{128} + S_0 \equiv 19S_2 + S_1 2^{128} + S_0 \pmod p$. In order to have an efficient implementation, again, a multiplication with the small integer $19 = (10011)_2$ is performed using the shift and addition approach, which takes three additional cycles. According to the described scheduling, 16 cycles are required to perform these operations. Hence, the rest of the operations, taking four additional cycles, are stored in the T-registers, which frees $R$ and $S$ for newly arriving data. The next two cycles are used to accumulate $19S_2$ with $S_1 2^{128} + S_0$. Then, two cycles are required for the modulus $p$ computation.
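The arithmetic behind (2) and the final fold by 19 can be checked with a short software model. The sketch below is a minimal Python illustration, assuming reduced inputs below $2^{255}$; it follows the partial-product order and the shift-and-add multiplications by 38 and 19 described in the text, but it does not model the R, S, and T registers, the 136-bit redundant digits, or the cycle schedule. The function name `mul_mod_p` is an illustrative choice.

```python
P = 2**255 - 19
MASK128 = (1 << 128) - 1
MASK255 = (1 << 255) - 1

def mul_mod_p(a, b):
    """Multiply a and b modulo p using four 128 x 128-bit partial products."""
    a1, a0 = a >> 128, a & MASK128
    b1, b0 = b >> 128, b & MASK128

    # Partial products named as in (2).
    c3, c2, c1, c0 = a1 * b1, a1 * b0, a0 * b1, a0 * b0

    # 2^256 = 2p + 38, so C3 * 2^256 is congruent to 38 * C3.
    # 38 = (100110)_2 is applied by shift and add.
    c3_times_38 = (c3 << 5) + (c3 << 2) + (c3 << 1)
    acc = c3_times_38 + c0 + ((c1 + c2) << 128)

    # Last stage: split at bit 255 and fold the upper part with weight 19,
    # since 2^255 = p + 19. 19 = (10011)_2 is again applied by shift and add.
    hi, lo = acc >> 255, acc & MASK255
    folded = (hi << 4) + (hi << 1) + hi + lo

    # For reduced inputs, folded < 2p, so one conditional subtraction suffices.
    return folded - P if folded >= P else folded
```

For example, `mul_mod_p(P - 1, P - 1)` evaluates to 1, as expected for $(-1)^2 \bmod p$. Replacing the general division by two folds with small constants is what lets the hardware reduce with shifts and additions only.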

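The FLT-based inversion mentioned in Section III-B computes $z^{-1} = z^{p-2} \bmod p$. The text gives only the operation counts (254 squarings and 11 multiplications). The Python sketch below uses the addition chain commonly found in software Curve25519 implementations, which reaches exactly those counts; whether the ROM routine in the proposed design uses this same chain is an assumption. The helper names `sqr_n` and `mul` are illustrative.

```python
P = 2**255 - 19

def invert(z):
    """Return (z^(p-2) mod p, operation counts) using a fixed addition chain."""
    counts = {"sqr": 0, "mul": 0}

    def sqr_n(x, n):
        # n consecutive modular squarings
        for _ in range(n):
            x = x * x % P
            counts["sqr"] += 1
        return x

    def mul(x, y):
        counts["mul"] += 1
        return x * y % P

    z2 = sqr_n(z, 1)                              # z^2
    z9 = mul(sqr_n(z2, 2), z)                     # z^9
    z11 = mul(z9, z2)                             # z^11
    z_5_0 = mul(sqr_n(z11, 1), z9)                # z^(2^5 - 1)
    z_10_0 = mul(sqr_n(z_5_0, 5), z_5_0)          # z^(2^10 - 1)
    z_20_0 = mul(sqr_n(z_10_0, 10), z_10_0)       # z^(2^20 - 1)
    z_40_0 = mul(sqr_n(z_20_0, 20), z_20_0)       # z^(2^40 - 1)
    z_50_0 = mul(sqr_n(z_40_0, 10), z_10_0)       # z^(2^50 - 1)
    z_100_0 = mul(sqr_n(z_50_0, 50), z_50_0)      # z^(2^100 - 1)
    z_200_0 = mul(sqr_n(z_100_0, 100), z_100_0)   # z^(2^200 - 1)
    z_250_0 = mul(sqr_n(z_200_0, 50), z_50_0)     # z^(2^250 - 1)
    result = mul(sqr_n(z_250_0, 5), z11)          # z^(2^255 - 21) = z^(p - 2)
    return result, counts
```

Because the sequence of squarings and multiplications is fixed and independent of $z$, a chain of this kind maps naturally onto a stored instruction routine and runs in a constant number of field operations.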

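For the high-performance design of Section III-A, the $k$-level Karatsuba decomposition and the reduction merged into its first level can be modeled as follows. This is a minimal Python sketch under stated assumptions: the recursion counts leaf multiplications to confirm the $3^k$ figure (81 for four levels), Python integers silently absorb the extra carry bit of $a_1 + a_0$ that real 16 × 16-bit multipliers would have to handle, and the function names `karatsuba` and `karatsuba_mul_mod_p` are illustrative. It does not model the pipeline or the adder tree of Fig. 1.

```python
P = 2**255 - 19

def karatsuba(a, b, n, levels, leaves):
    """Multiply a and b, splitting n-bit operands `levels` times."""
    if levels == 0:
        leaves[0] += 1          # one base multiplier (a DSP block in hardware)
        return a * b
    half = n // 2
    mask = (1 << half) - 1
    a1, a0 = a >> half, a & mask
    b1, b0 = b >> half, b & mask
    c2 = karatsuba(a1, b1, half, levels - 1, leaves)
    c0 = karatsuba(a0, b0, half, levels - 1, leaves)
    c1 = karatsuba(a1 + a0, b1 + b0, half, levels - 1, leaves)
    return (c2 << (2 * half)) + ((c1 - c2 - c0) << half) + c0

def karatsuba_mul_mod_p(a, b, levels=4):
    """Return (a * b mod p, number of leaf multiplications)."""
    leaves = [0]
    mask = (1 << 128) - 1
    a1, a0 = a >> 128, a & mask
    b1, b0 = b >> 128, b & mask
    c2 = karatsuba(a1, b1, 128, levels - 1, leaves)
    c0 = karatsuba(a0, b0, 128, levels - 1, leaves)
    c1 = karatsuba(a1 + a0, b1 + b0, 128, levels - 1, leaves)

    # First-level merge with the reduction folded in: 2^256 = 38 (mod p).
    c = 38 * c2 + c0 + ((c1 - c2 - c0) << 128)

    # Second stage: C' = 19 * C_h + C_l, using 2^255 = 19 (mod p).
    c_h, c_l = c >> 255, c & ((1 << 255) - 1)
    c_prime = 19 * c_h + c_l

    # C'' = C' - p; a negative value plays the role of the borrow flag.
    c_dprime = c_prime - P
    return (c_dprime if c_dprime >= 0 else c_prime), leaves[0]
```

Calling `karatsuba_mul_mod_p(a, b)` with the default four levels returns the reduced product together with a leaf count of 81, matching the number of 16 × 16-bit multipliers quoted for the high-performance scheme.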

Fig. 2. Proposed Montgomery ladder scheduling in the high-performance architecture.

Fig. 5. Message digest creation for SHA-512.


Fig. 3. Lower level arithmetic operations in the proposed fully pipelined
efficient Ed25519 scheme. ai and bi are read from memory unit, and ci or (ci1 , 2) Mod L Reduction and Nonmodular Multiplication: A
ci0 ) is stored to memory unit.
512-bit scalar achieved from hash function should be reduced
by modulus L, where L is a 253-bit value. In order to
implement a constant-time reduction, we design consecutive
rounds, which is repeated three times to make sure that result
is reduced completely. Let x have 512-bit length, which can be
shown by x = x 1 2256 + x 0 . The group order can be presented
by L = 2252 + l0 , where l0 has 125 bit. The first round of
reduction is performed such that

Fig. 4. Proposed Montgomery ladder scheduling in the efficient architecture. x mod L ≡ x 1 2256 + x 0 ≡ x 1 24 × (L − l0 ) + x 0
≡ −x 1 · l0 24 + x 0 . (3)
C. Ed25519 Design Considerations In (3), a 256 × 125-bit nonmodular multiplication should
be performed, which utilizes the already provided modular
1) Hash Unit: According to RFC 8032 [26], SHA-512 is
multiplier. Then, the product is shifted by 4 bits and subtracted
recommended by the standard to use in Ed25519. It takes
from x 0 .
arbitrary inputs in 1024-bit chunks and provides 512-bit out-
For the next round, let x  = x 1 2252 + x 0 ; hence, x 1 and
put. In general, hash computation does not take considerable
x 0 have 134 and 252 bit, respectively. The reduction can be
latency compared to ECPM. Therefore, lightweight hardware
performed as follows:
architecture is implemented for efficient architecture, which
utilizes minimum resources. x  mod L ≡ x 1 2252 + x 0 ≡ x 1 × (L − l0 ) + x 0
Fig. 5 illustrates message-digest creation for N-block mes- ≡ −x 1 · l0 + x 0 . (4)
sage. As one can see, the main part of SHA-512 is the com-
pressor core, which works iteratively, i.e., 80 times repeated Performing (4) results in a 260-bit-long value. Therefore,
compressing for each 1024-bit chunk of input. the third round must be performed similar to the second round,
In order to minimize CPD, the entire data path is designed leading to a 253-bit-long value.
64-bit. In addition, we use the optimal number of registers 3) Double-Point Multiplication: Two scalar multiplications
employing a dedicated finite state machine and resource shar- are required for a verification procedure. The verifying algo-
ing approach to decrease the utilized resources and complexity. rithm can be revised to improve efficiency, including two main

Authorized licensed use limited to: Robert Gordon University. Downloaded on May 28,2021 at 00:26:23 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

advantages: 1) employing double-point multiplication to halve TABLE I


total latency and 2) skipping a decompression. In addition, I MPLEMENTATION R ESULTS FOR D IFFERENT L EVELS OF K ARATSUBA
both scalars in the verifying algorithm are not secret. Hence,
a nonconstant-time execution can be used for fast verification.
We use a modified version of Strauss’ trick, a special case
referred to as “Shamir’s trick,” presented in [33].
4) SCA Countermeasures: Different SCA countermeasures
are embedded in the proposed designs to provide enhanced
architecture against SPA and DPA. Designing an RNG is not
in the scope of this study, so we assume that the randomized
numbers are provided externally. Besides, since the scalar in
the verifying procedure is not secret, the SCA countermeasures
are not applied in this phase.
Each iteration of the ECPM algorithm requires one point
addition and one point doubling per ladder step independent
of the current key bit value. Furthermore, other executed
operations are performed in a constant number of clock
cycles. Therefore, considering a constant-time and secret-
independent execution for designing our proposed schemes,
our architectures are inherently resistant to timing and
SPA attacks.
Base point randomization is achieved using the randomized
base point Br = (λ · X, λ · Y, λ) in projective coordinates.
We assume that Br is externally delivered to the ECPM
core. Moreover, implementing variable-base-point architecture
leads to achieving base point randomization without any cost.
We can perform two more modular multiplications to
rerandomize the Montgomery ladder outputs. This continuous
point randomization increases the Montgomery ladder latency
and, consequently, the total latency.
DPA-resistant SHA-512 can be achieved by padding the
key as proposed in [25]. In this method, the design requires
128 bits of fresh randomness for padding the key such that the
first 1024-bit block is composed of the random value. Note
that this algorithm is not compatible with the existing
definition of EdDSA and destroys the deterministic signature
property. However, since a full arithmetic/Boolean masked
architecture for SHA-512 is too costly, future implementations
might instead use SHA-3, which allows much more robust and
easier countermeasures [25].

IV. IMPLEMENTATION, RESULTS, AND COMPARISON

The FPGA used in our implementation is the Xilinx Zynq-7020;
the designs are synthesized and implemented with Xilinx
Vivado 2018.2. All given results are obtained
post-place-and-route (PAR).
Table I shows the different implementations of the Karatsuba
multiplier and the performance comparison results. Applying
the Karatsuba multiplication has a significant effect on
efficiency in terms of A · T, where A and T are the utilized
area and total time, respectively. According to this table,
the one-level Karatsuba multiplication has an efficiency equal
to 1166 slice × s. Moreover, increasing the number of applied
levels of the Karatsuba multiplication increases the CPD of
the architecture because the subsequent addition tree grows.
Applying the second through the fourth level improves
efficiency by 20%, 5%, and 3% compared to the previous level,
respectively. Furthermore, the four-level Karatsuba
multiplication achieves a speedup factor of 3.6×, 2×, and 1.4×
compared to one to three levels, respectively.

Fig. 6. Proposed architecture for high-performance and efficient Ed25519.

A. Top-Level Architecture

The top-level architecture used in our schemes is illustrated
in Fig. 6 and is composed of three stages: 1) the top stage
includes the FSM, controller, and ROM; 2) the lower stage
consists of the field ALU; and 3) the middle stage includes the
hash function, reduction handlers, memory unit, and secret key
buffer.
The FSM determines the state of the core and the required
address number for the controller. The controller/ROM
includes the main routines (fixed sequences of instructions)
for point multiplication, double-point multiplication,
inversion, and modulus L according to the architecture.
Furthermore, the controller includes hand-optimized routines
for all the operations required for computing a signature
algorithm, such as enabling/disabling the modules, setting
their required addresses, and handling their interfaces.

B. Implementation Results and Comparison

Table II summarizes the resource utilization for our proposed
Ed25519 architectures, broken down into the required
components for our unprotected scheme. Our proposed
high-performance Ed25519 architecture utilizes 9.7k slices and
81 DSPs, while the efficient Ed25519 architecture reduces the
utilized slices and DSPs by 70% and 80%, respectively,
compared to our Design I. Thus, Design II requires only 2.8k
slices and 16 DSPs to perform the Ed25519 signature algorithm.
Neither design occupies any Block RAM; both are implemented
using the distributed memory.
The latency requirements of all operations are reported
in Table III. Since the architecture works in a parallel
fashion, the total latency is less than the sum of the
latencies of the individual modules. To compare with the
state-of-the-art modular multipliers in GF(p25519), as listed
in Table IV, our

high-performance and efficient modular multiplications require
five and 32 cycles occupying 81 and 16 DSPs, respectively.
However, our parallel designs can significantly compensate for
the required clock cycles such that a modular multiplication
can be performed in one and 16 cycles in our Designs I and II,
respectively. In [22], a low-latency multiplier was proposed
for key exchange, requiring three cycles by using a register
bank and 182 DSPs. We also introduced a low-latency
architecture in our previous work [19] using the register bank
and requiring three cycles for key exchange. However, to
develop the Ed25519 scheme in this article, we use the RAM
module since the controller, hash, ALU, and modulus L
reduction work with the memory unit. The architectures
designed in [20] and [21] require ten and eight cycles
utilizing 260 and 175 DSPs. The multiplier proposed in [24]
needs 33 cycles, of which 16 cycles are for multiplications and
the rest are for the reduction, utilizing 15 DSPs.

TABLE II
IMPLEMENTATION RESULTS IN TERMS OF UTILIZATION REQUIREMENTS

TABLE III
FPGA IMPLEMENTATION RESULTS IN TERMS OF CLOCK CYCLES

TABLE IV
IMPLEMENTATION RESULTS FOR POINT MULTIPLIER IN GF(p25519)

Our proposed high-performance architecture follows the
reduction algorithm of [22], which uses one level of the
Karatsuba multiplication, but applies the following
modifications. First, we make use of the true dual-port
capabilities of the RAM modules instead of a register bank to
decrease the required resources and avoid high fan-out
circuits. Second, we implement four consecutive levels of the
Karatsuba multiplication, enabling our design to save 55.5% of
the DSPs utilized in [22] while still allowing processing in a
pipeline fashion. Third, our architecture performs both modular
and nonmodular multiplication in order to avoid any additional
resources for performing signature algorithms.
Besides, our efficient scheme can operate at 136 MHz,
while the maximum operating frequency of our high-performance
design drops, as expected, to 73 MHz due to the increased
number of Karatsuba levels. Hence, an unprotected ECPM can be
performed in almost 126 and 356 μs in our proposed
high-performance and efficient architectures, respectively.
Thus, our Design II provides a tradeoff between time and area
by decreasing almost 75% of occupied resources at the cost of
nearly three times more required time.

TABLE V
PERFORMANCE RESULTS FOR UNPROTECTED AND PROTECTED SCHEME AGAINST DPA (RESULTS ARE REPORTED FOR A 1024-bit MESSAGE)

Table V reports the performance results for three algorithms
in EdDSA: key generation, signing, and verifying. For the
unprotected scheme, Designs I and II can generate 6276 and
2279 keys/s. Furthermore, they can sign 6293 and 2293
128-byte messages/s. Moreover, 5112 and 1507 128-byte
messages can be verified every second employing our
high-performance and efficient architectures, respectively.
Note that increasing the size of the message increases the
total latency such that each 1024-bit chunk adds 80 cycles. The
proposed protected Designs I and II require 0.18 and 0.50 ms,
respectively, to sign a message, while the verification does
not need to be protected.
Complete signature and verification implementations with
certificate handling are scarce in the literature. Hence, a
direct comparison of the area utilization and performance is
difficult. Nevertheless, we put our results in context with
other relevant works to give the reader a quick overview of
other designs and architectures.
Table VI reports area and performance results for several
digital signature schemes. As one can see, our proposed
high-performance architecture achieves 27×, 21×, and 19×
better performance for key generation, signing, and verifying
operations compared to [24], respectively. However, our
Design I is larger than this work and utilizes 3× and 5× more
slice and DSP resources, respectively. Furthermore, Turan and
Verbauwhede [24] have 10×, 8×, and 6× more delay than
our efficient Ed25519 scheme for key generation, signing,
and verifying, respectively, while ours occupies a similar DSP
count and uses 11% fewer slices. Hence, the superiority
of computation over the Montgomery domain compared to
the Edwards domain is shown despite coordinate conversion
overheads.
Therefore, Design I has 89%, 86%, and 84% improvements
in terms of A · T (Slice_count × Time) for key generation,
signing, and verifying algorithm compared to [24], respectively.

TABLE VI
COMPARISON OF DIFFERENT DESIGNS FOR THE DIGITAL SIGNATURE ALGORITHM

TABLE VII
EdDSA PARAMETERS FOR Ed25519 [26]

Algorithm 1 EdDSA Key Generation Operations [26]

Algorithm 2 EdDSA Signing Operations [26]

Algorithm 3 EdDSA Verifying Operations [26]
Moreover, Design II shows a significant improvement in terms
of A · T, i.e., 91%, 88%, and 84% for key generation, signing,
and verifying algorithm compared to [24], respectively.
Considering the importance of utilized DSPs in FPGA-based
architectures, we also present a comparison in terms of
A_d · T (DSP_count × Time). Our Design I improves efficiency
in terms of A_d · T by 81%, 76%, and 73% for key generation,
signing, and verifying algorithm compared to [24],
respectively. Furthermore, our proposed Design II improves
efficiency in this metric by 90%, 87%, and 82% for key
generation, signing, and verifying algorithm compared to [24],
respectively.
Moreover, the work in [23] proposed a nonconstant-time
point multiplication core for Ed25519. Although it can
compute 1838 ECPM per second, the architecture is vulnerable
to SPA. Notably, the reported area does not include all the
modules required for providing a digital signature, such as
the hash function and modulus L reduction. We observe that
SHA-512 increases the utilized area in Ed25519 by almost 25%.
Applying DPA countermeasures decreases efficiency due to
executing more operations. However, our protected Design I
(and II) improves efficiency by almost 95% (96%) and 89% (94%)
in terms of A · T and A_d · T for signing a message.
To compare different implementation approaches,
some software-based and heterogeneous configurations of
Ed25519 are listed in Table VI. With a variety of performance
optimizations in hardware implementations, the throughput
is significantly increased compared with software-based

implementation and heterogeneous computing. Hence, our
design achieves almost 40 times speedup compared to [36].

V. CONCLUSION

In this article, we have proposed hardware design strategies
for the recently proposed Edwards curve digital signature
Ed25519 on a Xilinx Zynq-7020 FPGA, including advanced
protection against side-channel attacks. The proposed
architectures achieve above 84% efficiency improvement
of the area–time product using pipelined architecture and
interleaved multiplication. Our high-performance and efficient
architectures compute more than 6200 and 2200 signings and
5100 and 1500 verifications per second, respectively. We also
show the design can outperform recently presented works
using only moderate resource requirements.

APPENDIX

Ed25519 has some critical parameters shown in Table VII.
EdDSA algorithms are described in Algorithms 1–3,
respectively. According to [26], an integer S is encoded,
enc(S), in little-endian convention. In addition, when
an element P = (x, y) is encoded, its y-coordinate should be
encoded first, and then, its most significant bit is substituted by
the least significant bit of its x. The dom(x, y) string function
is blank for Ed25519.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their
comments.

REFERENCES

[1] D. J. Bernstein, N. Duif, T. Lange, P. Schwabe, and B. Yang, “High-speed high-security signatures,” in Proc. 13th Int. Workshop, Nara, Japan, Sep./Oct. 2011, pp. 124–142.
[2] A. C. Aldaya, C. P. García, and B. B. Brumley, “From A to Z: Projective coordinates leakage in the wild,” Cryptol. ePrint Arch., Tech. Rep. 2020/432, 2020.
[3] K. Ryan, “Return of the hidden number Problem: A widespread and novel key extraction attack on ECDSA and DSA,” Trans. Cryptograph. Hardw. Embedded Syst., vol. 2019, no. 1, pp. 146–168, Nov. 2018.
[4] D. J. Bernstein and T. Lange. (2011). Security Dangers of the Nist Curves. [Online]. Available: https://www.hyperelliptic.org/tanja/vortraege/20130531.pdf
[5] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proc. 35th Annu. Symp. Found. Comput. Sci., Santa Fe, NM, USA, Nov. 1994, pp. 124–134.
[6] N. Bindel, U. Herath, M. McKague, and D. Stebila, “Transitioning to a quantum-resistant public key infrastructure,” in Proc. IACR, 2017, p. 460.
[7] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, “High-speed NTT-based polynomial multiplication accelerator for CRYSTALS-Kyber post-quantum cryptography,” Cryptol. ePrint Arch., Tech. Rep. 2021/563, 2021.
[8] (2020). Things That Use Ed25519. [Online]. Available: https://ianix.com/pub/ed25519-deployment.html
[9] P. Kietzmann, L. Boeckmann, L. Lanzieri, T. C. Schmidt, and M. Wählisch, “A performance study of crypto-hardware in the low-end IoT,” in Proc. IACR, 2021, p. 58.
[10] M. Bisheh Niasar, R. Azarderakhsh, and M. Mozaffari Kermani, “Efficient hardware implementations for elliptic curve cryptography over Curve448,” in Proc. 21st Int. Conf. Cryptol. India, Bangalore, India, Dec. 2020, pp. 228–247.
[11] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, “Area-time efficient hardware architecture for signature based on Ed448,” IEEE Trans. Circuits Syst. II, Exp. Briefs, early access, Mar. 23, 2021, doi: 10.1109/TCSII.2021.3068136.
[12] B. Glas, O. Sander, V. Stuckert, K. D. Müller-Glaser, and J. Becker, “Prime field ECDSA signature processing for reconfigurable embedded systems,” Int. J. Reconfigurable Comput., vol. 2011, Oct. 2011, Art. no. 836460.
[13] B. Panjwani, “Scalable and parameterized hardware implementation of elliptic curve digital signature algorithm over prime fields,” in Proc. Int. Conf. Adv. Comput., Commun. Informat. (ICACCI), Sep. 2017, pp. 211–218.
[14] J. Vliegen et al., “A compact FPGA-based architecture for elliptic curve cryptography over prime fields,” in Proc. 21st IEEE Int. Conf. Appl.-Specific Syst., Architectures Processors, 2010, pp. 313–316.
[15] D. Zhang and G. Bai, “High-performance implementation of SM2 based on FPGA,” in Proc. 8th IEEE Int. Conf. Commun. Softw. Netw. (ICCSN), Jun. 2016, pp. 718–722.
[16] P. Sasdrich and T. Güneysu, “Efficient elliptic-curve cryptography using curve25519 on reconfigurable devices,” in Proc. 10th Int. Symp., D. Goehringer, M. D. Santambrogio, J. M. P. Cardoso, and K. Bertels, Eds., Vilamoura, Portugal, 2014, pp. 25–36.
[17] P. Sasdrich and T. Güneysu, “Implementing Curve25519 for side-channel-protected elliptic curve cryptography,” ACM Trans. Reconfigurable Technol. Syst., vol. 9, no. 1, pp. 1–15, Nov. 2015.
[18] P. Sasdrich and T. Güneysu, “Exploring RFC 7748 for hardware implementation: Curve25519 and Curve448 with side-channel protection,” J. Hardw. Syst. Secur., vol. 2, no. 4, pp. 297–313, Dec. 2018.
[19] M. Bisheh Niasar, R. El Khatib, R. Azarderakhsh, and M. Mozaffari-Kermani, “Fast, small, and area-time efficient architectures for key-exchange on Curve25519,” in Proc. IEEE 27th Symp. Comput. Arithmetic (ARITH), Jun. 2020, pp. 72–79.
[20] P. Koppermann, F. De Santis, J. Heyszl, and G. Sigl, “X25519 hardware implementation for low-latency applications,” in Proc. Euromicro Conf. Digit. Syst. Design, P. Kitsos, Ed., Limassol, Cyprus, 2016, pp. 99–106.
[21] P. Koppermann, F. De Santis, J. Heyszl, and G. Sigl, “Low-latency X25519 hardware implementation: Breaking the 100 microseconds barrier,” Microprocessors Microsyst., vol. 52, pp. 491–497, Jul. 2017.
[22] R. Salarifard and S. Bayat-Sarmadi, “An efficient low-latency point-multiplication over Curve25519,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 10, pp. 3854–3862, Oct. 2019.
[23] M. A. Mehrabi and C. Doche, “Low-cost, low-power FPGA implementation of ED25519 and CURVE25519 point multiplication,” Information, vol. 10, no. 9, p. 285, Sep. 2019.
[24] F. Turan and I. Verbauwhede, “Compact and flexible FPGA implementation of Ed25519 and X25519,” ACM Trans. Embedded Comput. Syst., vol. 18, no. 3, pp. 1–21, 2019.
[25] N. Samwel, L. Batina, G. Bertoni, J. Daemen, and R. Susella, “Breaking Ed25519 in WolfSSL,” Cryptol. ePrint Arch., Tech. Rep. 2017/985, 2017.
[26] S. Josefsson and I. Liusvaara, Edwards-Curve Digital Signature Algorithm (EdDSA), document RFC 8032, 2017, pp. 1–60.
[27] D. J. Bernstein, “Curve25519: New Diffie-Hellman speed records,” in Proc. 9th Int. Conf. Theory Pract. Public-Key Cryptogr., M. Yung, Y. Dodis, A. Kiayias, and T. Malkin, Eds., New York, NY, USA, 2006, pp. 207–228.
[28] H. Hisil, K. K.-H. Wong, G. Carter, and E. Dawson, “Twisted edwards curves revisited,” Cryptol. ePrint Arch., Tech. Rep. 2008/522, 2008.
[29] M. Hamburg, “Fast and compact elliptic-curve cryptography,” in Proc. IACR, 2012, p. 309.
[30] K. Okeya and K. Sakurai, “Efficient elliptic curve cryptosystems from a scalar multiplication algorithm with recovery of the Y-coordinate on a montgomery-form elliptic curve,” in Proc. Int. Workshop, Paris, France, May 2001, pp. 126–141.
[31] D. F. Aranha, F. R. Novaes, A. Takahashi, M. Tibouchi, and Y. Yarom, “Ladderleak: Breaking ECDSA with less than one bit of nonce leakage,” Cryptol. ePrint Arch., Tech. Rep. 2020/615, 2020.
[32] J. Coron, “Resistance against differential power analysis for elliptic curve cryptosystems,” in Proc. Cryptograph. Hardw. Embedded Syst., Ç. K. Koç and C. Paar, Eds., Worcester, MA, USA, 1999, pp. 292–302.
[33] P. Schwabe. (Sep. 2013). Scalar-Multiplication Algorithms. [Online]. Available: https://cryptojedi.org/peter/data/eccss-20130911b.pdf
[34] M. Bisheh Niasar, R. Azarderakhsh, and M. Mozaffari Kermani, “Optimized architectures for elliptic curve cryptography over Curve448,” in Proc. IACR, 2020, p. 1338.
[35] M. Scott, “On the deployment of curve based cryptography for the Internet of Things,” in Proc. IACR, 2020, p. 514.
[36] H. Fujii and D. F. Aranha, “Curve25519 for the cortex-M4 and beyond,” in Proc. 5th Int. Conf. Cryptol. Inf. Secur. Latin Amer., Havana, Cuba, Sep. 2017, pp. 109–127.
[37] D. Bernstein and T. Lange. EBACS: ECRYPT Benchmarking of Cryptographic Systems. Accessed: Mar. 22, 2021. [Online]. Available: https://bench.cr.yp.to
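
The encoding rules summarized in the Appendix (little-endian integers; the y-coordinate with its most significant bit replaced by the least significant bit of x) can be sketched as follows. The function names are hypothetical, the 32-byte length is the Ed25519 case of [26], and the sample values are toy numbers rather than a point on the curve.

```python
def encode_scalar(s):
    """Encode an integer as 32 bytes in little-endian order."""
    return s.to_bytes(32, "little")


def encode_point(x, y):
    """Encode (x, y): y in little-endian, top bit replaced by the LSB of x."""
    data = bytearray(y.to_bytes(32, "little"))
    data[31] = (data[31] & 0x7F) | ((x & 1) << 7)
    return bytes(data)


def decode_point_fields(data):
    """Recover y and the least significant bit of x from a 32-byte encoding."""
    y = int.from_bytes(data, "little") & ((1 << 255) - 1)
    x_lsb = data[31] >> 7
    return y, x_lsb


enc = encode_point(3, 5)              # toy values (assumption), not a curve point
y_back, x_lsb = decode_point_fields(enc)
```

Decoding a full point additionally requires recovering x from y and the stored bit via the curve equation, which is the decompression step mentioned for verification and is omitted here.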
