High Performance ECDSA Over F(2^n) Based On Java With Hardware Acceleration
{ernst@vlsi, birgit@cdc}.informatik.tu-darmstadt.de
Keywords: Public Key Cryptography, ECDSA, Java and JCA, VHDL Model Generator,
FPGA-based Hardware Acceleration
Abstract
Many E-Commerce applications are characterized by their demand for confidential data exchange
via public communication networks (e.g., the Internet). These data exchanges must be protected
from fraudulent access by third parties. One way to do so is to use public-key cryptosystems based on
elliptic curves. They are gaining acceptance because they provide high security despite their
small key sizes. We introduce an elliptic-curve-based crypto provider, featuring ECDSA, within the
Java Cryptography Architecture for the sake of flexibility and platform independence. Furthermore, we
present an FPGA-based CryptoProcessor, which raises the performance significantly. The design of the
CryptoProcessor is supported by a custom VHDL model generator.
1 Introduction
Many E-Commerce applications are characterized, for instance, by their demand for confidential data
exchange via public communication networks (e.g., Internet). These data exchanges must be protected
from fraudulent access by third parties. The basic technology, which can warrant this kind of protection,
is known as Public-Key Cryptography.
Besides the widely used RSA method, public-key methods based on elliptic curves (EC) have
gained importance because they are believed to give higher security per key bit, i.e., one can
work with shorter keys [1] (1024 RSA bits are considered equivalent to 160 EC bits). The smaller key
size permits a more cost-efficient implementation and higher throughput.
This article describes an implementation of the ECDSA digital signature algorithm over F(2^n) based
on Java and the Java Cryptography Architecture (JCA), which is a de-facto standard platform for
public-key software implementations. The basic operation in EC cryptography is the point
multiplication (k·P). This is a complex operation, and its computation is very time consuming. In
essence, the time required for the computation of k·P determines the performance of EC algorithms
like ECDSA. The performance of pure software implementations is not sufficient to support ECDSA
within server-based cryptosystems (e.g., online banking servers).
To overcome this problem, an Elliptic-Curve CryptoProcessor, which implements the k·P
multiplication in hardware, was developed at our institute. This hardware implementation is based
on a reconfigurable logic device (FPGA) mounted on a PCI card, so that the system integration can be
done easily via the PCI interface. We show that the performance of the ECDSA algorithm can be
enhanced significantly by the use of this processor.
The mathematical background for finite fields and elliptic curves is explained in the following
section. Section 3 considers the ECDSA algorithm. The implementation of the proposed CryptoProcessor
together with the corresponding design flow is illustrated in Section 4, followed by conclusions.
2 Background
In this section, we introduce some basic notations and definitions. We start by introducing the finite
field F(2^n), its representation and its arithmetic, and then go on to elliptic curves over F(2^n). In this
paper we discuss neither the generation of elliptic curves nor the security aspects. For the latter
please consult [2].
a normal base for each positive integer n. If n is not divisible by 8, then F(2^n) has a Gaussian normal
base, for which the multiplication is simpler and faster than for non-Gaussian normal bases.
The type T of a normal base is an integer which measures the complexity of the multiplication
operation of that base. The smaller the type T, the lower the complexity and the more efficient the
operation. Bases with type T = 1 or T = 2 are called optimal normal bases.
A field F(2^n) has a Gaussian normal base of type T if and only if each of the following conditions is true:
• n is not divisible by 8
• p = T · n + 1 is a prime
• gcd(n, T · n/k) = 1, where k is the multiplicative order of 2 modulo p, i.e. the smallest positive
integer with 2^k = 1 mod p.
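The three conditions can be checked mechanically. The following Java sketch is for illustration only: the class and method names (`GnbCheck`, `hasGaussianNormalBase`) are hypothetical and not part of any provider described in this paper, k is taken to be the multiplicative order of 2 modulo p, and the simple trial-division primality test is an assumption that is adequate only for the small values of p that occur here.

```java
public class GnbCheck {
    /** Trial-division primality test; adequate for the small p = T*n + 1 used here. */
    static boolean isPrime(int p) {
        if (p < 2) return false;
        for (int d = 2; (long) d * d <= p; d++) {
            if (p % d == 0) return false;
        }
        return true;
    }

    /** Multiplicative order of 2 modulo an odd prime p. */
    static int orderOfTwo(int p) {
        int k = 1;
        long v = 2 % p;
        while (v != 1) {
            v = (2 * v) % p;
            k++;
        }
        return k;
    }

    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    /** Checks the three existence conditions for a type-t Gaussian normal base of F(2^n). */
    static boolean hasGaussianNormalBase(int n, int t) {
        if (n < 2 || t < 1 || n % 8 == 0) return false;   // condition 1
        int p = t * n + 1;
        if (!isPrime(p)) return false;                    // condition 2
        int k = orderOfTwo(p);                            // k divides p - 1 = t * n
        return gcd(n, t * n / k) == 1;                    // condition 3
    }

    public static void main(String[] args) {
        for (int n : new int[] {18, 173, 191, 270}) {
            System.out.println("n = " + n + ", type 2 base exists: "
                    + hasGaussianNormalBase(n, 2));
        }
    }
}
```

The field sizes in `main` are the key sizes mentioned later in this paper; the check itself makes no statement about their cryptographic suitability.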
We will represent an element
\[ \alpha = \alpha_0 \Theta^{2^0} + \alpha_1 \Theta^{2^1} + \alpha_2 \Theta^{2^2} + \cdots + \alpha_{n-1} \Theta^{2^{n-1}} \]
by the bitstring (α0 α1 α2 . . . αn−1). In this work we only use optimal normal bases of type T = 2
and therefore keep the explanations at this level.
Let α = (α0 α1 α2 . . . αn−1) and β = (β0 β1 β2 . . . βn−1) be two elements of F(2^n). Then the
sum γ = α + β is
γ = (α0 + β0 α1 + β1 α2 + β2 . . . αn−1 + βn−1). (1)
So the addition is simply an addition of the coefficients αi + βi mod 2, i.e. a bitwise XOR, and this
field operation can be done very fast.
The multiplication is more complicated. First compute the sequence S(1), S(2), . . . , S(p − 1),
with p = 2n + 1, as follows:
1. Set m ← 1.
2. For i from 0 to n − 1 do
(a) S(m) ← i.
(b) m ← 2m mod p.
3. Set m ← p − 1 and repeat step 2.
Given a field F(2^n) with the Gaussian normal base B of type T = 2 and two elements α, β ∈ F(2^n),
the first coefficient γ0 of the product γ = α · β is
\[ \gamma_0 = \sum_{k=1}^{p-2} \alpha_{S(k+1)} \, \beta_{S(p-k)} . \tag{2} \]
The other coefficients of the product are obtained from the formula for γ0 by cyclically shifting the
subscripts of α and β modulo n: for γi, every subscript is increased by i modulo n. In practice the
sequence S can be stored in an integer matrix of size 2 × n.
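The sequence S and formula (2) can be sketched in Java as follows. This is a minimal illustration and not the arithmetic of the provider described later: the class name `OnbType2` and the representation of a field element as an `int[]` of bits (index i holds αi) are assumptions made for readability, and the caller is assumed to pass an n for which a type 2 base exists.

```java
import java.util.Arrays;

public class OnbType2 {
    /** Builds S(1..p-1) for p = 2n + 1; index 0 of the returned array is unused. */
    static int[] sequence(int n) {
        int p = 2 * n + 1;
        int[] s = new int[p];
        for (int start : new int[] {1, p - 1}) {   // first pass from m = 1, second from m = p - 1
            int m = start;
            for (int i = 0; i < n; i++) {
                s[m] = i;
                m = (2 * m) % p;
            }
        }
        return s;
    }

    /** Addition in F(2^n): coefficient-wise XOR. */
    static int[] add(int[] a, int[] b) {
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) c[i] = a[i] ^ b[i];
        return c;
    }

    /** Multiplication: formula (2) for coefficient 0, subscripts shifted by i for coefficient i. */
    static int[] multiply(int[] a, int[] b) {
        int n = a.length;
        int p = 2 * n + 1;
        int[] s = sequence(n);
        int[] c = new int[n];
        for (int i = 0; i < n; i++) {
            int bit = 0;
            for (int k = 1; k <= p - 2; k++) {
                bit ^= a[(s[k + 1] + i) % n] & b[(s[p - k] + i) % n];
            }
            c[i] = bit;
        }
        return c;
    }

    public static void main(String[] args) {
        int[] a = {1, 0, 1, 1, 0};   // elements of F(2^5), p = 11
        int[] b = {0, 1, 1, 0, 1};
        System.out.println(Arrays.toString(multiply(a, b)));
    }
}
```

In a normal base the element 1 is the all-ones bitstring, so multiplying by it must return the other operand unchanged; this is a convenient sanity check for any implementation of formula (2).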
The inversion of an element β of F(2^n) is even more expensive: for any nonzero element β ∈ F(2^n),
\[ \beta^{-1} = \beta^{2^n - 2} . \]
There are several algorithms that perform the inversion more efficiently than straightforward
squaring and multiplication (see, e.g., [3]). Even so, each inversion needs several field multiplications.
In contrast to these operations, squares and square roots can be computed efficiently, especially
in hardware, since these operations are cyclic shifts:
Let α = (α0 α1 . . . αn−2 αn−1) be an element of F(2^n). Then
\[ \alpha^2 = (\alpha_{n-1}\; \alpha_0\; \alpha_1 \ldots \alpha_{n-2}) \]
and
\[ \sqrt{\alpha} = (\alpha_1 \ldots \alpha_{n-2}\; \alpha_{n-1}\; \alpha_0) . \]
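As a small illustration of why these two operations are cheap, the following Java sketch implements them as rotations of the coefficient array. The class name `OnbShift` and the `int[]` bit representation (index i holds αi) are assumptions for illustration only.

```java
import java.util.Arrays;

public class OnbShift {
    /** Squaring: (a0 a1 ... a(n-1)) becomes (a(n-1) a0 a1 ... a(n-2)). */
    static int[] square(int[] a) {
        int n = a.length;
        int[] r = new int[n];
        for (int i = 0; i < n; i++) r[(i + 1) % n] = a[i];
        return r;
    }

    /** Square root: (a0 a1 ... a(n-1)) becomes (a1 ... a(n-1) a0). */
    static int[] sqrt(int[] a) {
        int n = a.length;
        int[] r = new int[n];
        for (int i = 0; i < n; i++) r[i] = a[(i + 1) % n];
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 0, 1, 1, 0};
        System.out.println(Arrays.toString(square(a)));
        System.out.println(Arrays.toString(sqrt(square(a))));
    }
}
```

Since both methods only move bits, a hardware realization needs no logic gates beyond wiring or a shift register, which is the property the text refers to.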
The equation
\[ y^2 + xy = x^3 + ax^2 + b, \tag{3} \]
with x, y, a and b ∈ F(2^n), is called the Weierstrass equation for the field F(2^n). An elliptic curve E
over F(2^n) is the set of pairs (x, y) ∈ F(2^n) × F(2^n) solving equation (3), where a, b ≠ 0.
In the following we will denote a curve E over the field F(2^n) as E(F(2^n)).
The points, along with a point at infinity (denoted by O) and an inner operation called point addition,
form an additive group, where O is the neutral element [6]. The order r of this group is the number of
points on the curve, including the point O. By Hasse's bound, the order r of an elliptic curve over the
field F(q) is approximately q:
\[ q - 2\sqrt{q} + 1 \le r \le q + 2\sqrt{q} + 1 . \]
For a proof see [6].
The point addition is defined as follows: Let P = (x0, y0) and Q = (x1, y1), with x0, y0, x1, y1 ∈
F(2^n). Then R = (x2, y2) = P + Q, x2, y2 ∈ F(2^n), with
R = P + O = P and
R = P + (−P) = O,
and otherwise
\[ x_2 = a + \lambda^2 + \lambda + x_0 + x_1 \quad \text{and} \tag{5} \]
\[ y_2 = (x_1 + x_2)\lambda + x_2 + y_1, \quad \text{where} \tag{6} \]
\[ \lambda = \frac{y_0 - y_1}{x_0 - x_1} \quad \text{for } P \neq Q \text{ and} \tag{7} \]
\[ \lambda = x_1 + \frac{y_1}{x_1} \quad \text{for } P = Q. \tag{8} \]
These formulas hold exclusively for curves over the field F(2^n). For general addition rules see for
example [7] or [6].
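The addition rules can be tried out on a toy field. The sketch below is an illustration under several assumptions and not the implementation discussed in this paper: it works in F(2^5) with a type 2 optimal normal base, packs a field element into the low five bits of an `int`, represents O as `null`, inverts by multiplying the n − 1 successive squares of β, and uses the standard characteristic-2 form y2 = (x1 + x2)λ + x2 + y1 for the y coordinate. All names (`BinaryCurveAffine`, `mul`, `inv`, `add`, `onCurve`) are hypothetical.

```java
public class BinaryCurveAffine {
    static final int N = 5, P = 2 * N + 1;
    static final int ONE = (1 << N) - 1;        // the element 1 is the all-ones bitstring
    static final int[] S = new int[P];          // sequence S(1..p-1), index 0 unused
    static {
        for (int start : new int[] {1, P - 1}) {
            int m = start;
            for (int i = 0; i < N; i++) { S[m] = i; m = (2 * m) % P; }
        }
    }

    static int bit(int v, int i) { return (v >> (i % N)) & 1; }

    /** Field multiplication in the type 2 optimal normal base. */
    static int mul(int a, int b) {
        int c = 0;
        for (int i = 0; i < N; i++) {
            int g = 0;
            for (int k = 1; k <= P - 2; k++) g ^= bit(a, S[k + 1] + i) & bit(b, S[P - k] + i);
            c |= g << i;
        }
        return c;
    }

    static int sqr(int a) { return mul(a, a); }

    /** Inversion as b^(2^n - 2), i.e. the product of b^2, b^4, ..., b^(2^(n-1)). */
    static int inv(int b) {
        int t = b, r = ONE;
        for (int i = 1; i < N; i++) { t = sqr(t); r = mul(r, t); }
        return r;
    }

    /** Affine point addition; points are {x, y}, null is the point at infinity O. */
    static int[] add(int[] p, int[] q, int a) {
        if (p == null) return q;
        if (q == null) return p;
        int x0 = p[0], y0 = p[1], x1 = q[0], y1 = q[1];
        int lambda;
        if (x0 != x1) {
            lambda = mul(y0 ^ y1, inv(x0 ^ x1));     // P != Q; subtraction is XOR
        } else if (y0 != y1 || x1 == 0) {
            return null;                             // Q = -P, result is O
        } else {
            lambda = x1 ^ mul(y1, inv(x1));          // P = Q
        }
        int x2 = a ^ sqr(lambda) ^ lambda ^ x0 ^ x1;
        int y2 = mul(x1 ^ x2, lambda) ^ x2 ^ y1;
        return new int[] {x2, y2};
    }

    /** Tests y^2 + xy = x^3 + a x^2 + b. */
    static boolean onCurve(int[] p, int a, int b) {
        if (p == null) return true;
        int x = p[0], y = p[1], xx = sqr(x);
        return (sqr(y) ^ mul(x, y)) == (mul(xx, x) ^ mul(a, xx) ^ b);
    }
}
```

A field this small offers no security; it only allows every pair of curve points to be enumerated, so that closure of the addition can be verified exhaustively.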
Since the points form an additive group, there is no inner group operation like the multiplication.
Even so, repeated point additions like
\[ \underbrace{P + P + \cdots + P}_{r \text{ times}} = r \cdot P = R, \]
with P, R ∈ E(F(2^n)), are sometimes considered as one operation called point multiplication. With
this operation we obtain a problem analogous to the discrete logarithm problem (DLP) over finite fields:
Let P and R be points on the curve E(F(2^n)) and r an integer with r · P = R. Then r is the discrete
logarithm of R to the base P. Therefore, cryptographic algorithms based on discrete logarithms over
finite fields can be modified into algorithms based on the discrete logarithm problem of a group of points
(ECDLP), which is believed to be very hard to solve. The currently best algorithm attacking
the ECDLP is the Pohlig-Hellman algorithm, which has exponential complexity. Therefore it is assumed
to be safe to use finite fields of size 2^160 for cryptographic algorithms based on the ECDLP, whereas
for the most common algorithm, RSA, which is based on the factorization problem, a size of 2^1024 is
recommended.
As we have seen, each point addition or doubling requires one inversion, which is very expensive
(see 2.1). Therefore we avoid this costly operation in point arithmetic by using projective
coordinates as proposed in [3].
Let P^2(F(2^n)) be the projective plane over F(2^n). Then one projective representation of the
Weierstrass equation has the following form:
\[ \frac{y^2}{z^2} + \frac{xy}{z^2} = \frac{x^3}{z^3} + a\,\frac{x^2}{z^2} + b, \]
where x* = x/z and y* = y/z are the affine coordinates and where two points Q = (xQ, yQ, zQ) and
R = (xR, yR, zR) on the same elliptic curve E are equal if and only if
\[ \frac{x_R}{z_R} = \frac{x_Q}{z_Q} \quad \text{and} \quad \frac{y_R}{z_R} = \frac{y_Q}{z_Q} . \]
Then we can add two points as follows:
Let P = (x1, y1, z1) and Q = (x2, y2, z2), where P, Q ≠ O and P ≠ −Q. Then R = P + Q = (x3, y3, z3)
is, for P ≠ Q,
\[ x_3 = AD, \qquad y_3 = CD + A^2 (B x_1 + A y_1), \qquad z_3 = A^3 z_1 z_2 \tag{9} \]
and, for P = Q,
\[ x_3 = AB, \qquad y_3 = x_1^4 A + B (x_1^2 + y_1 z_1 + A), \qquad z_3 = A^3 . \tag{10} \]
Figure 1: Double-and-Add algorithm
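Since the figure itself is not reproduced here, the following Java sketch shows one common form of the technique, the left-to-right binary double-and-add. It is not necessarily the exact variant of Figure 1. To stay self-contained it is written against an arbitrary additive group passed in as a `BinaryOperator`; the demonstration uses integers modulo 1009 as a stand-in group instead of elliptic curve points, which is purely an assumption for illustration, as are the names `DoubleAndAdd` and `multiply`.

```java
import java.math.BigInteger;
import java.util.function.BinaryOperator;

public class DoubleAndAdd {
    /**
     * Computes k * point for k >= 0 by scanning the bits of k from the most
     * significant to the least significant one: double in every step, add the
     * point whenever the current bit is set.
     */
    static <G> G multiply(BigInteger k, G point, G neutral, BinaryOperator<G> add) {
        G result = neutral;
        for (int i = k.bitLength() - 1; i >= 0; i--) {
            result = add.apply(result, result);        // point doubling
            if (k.testBit(i)) {
                result = add.apply(result, point);     // point addition
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in group: integers modulo 1009 under addition.
        BinaryOperator<Integer> addMod = (x, y) -> (x + y) % 1009;
        System.out.println(multiply(BigInteger.valueOf(77), 5, 0, addMod));
    }
}
```

For an n-bit scalar this needs n doublings and, on average, about n/2 additions, which is why the cost of a single point addition or doubling dominates the k·P performance.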
3 ECDSA
ECDSA is a digital signature algorithm derived from DSA and based on the ECDLP (see 2.2). It is
by now widely known and accepted, since it has been included in the IEEE standard P1363 [3] and in
ANSI X9.62 [4]. In the following sections we introduce ECDSA as specified in [3] and its use, and we
illustrate why we chose Java as the implementation platform.
3.3 ECDSA Signature Generation
Let f be a message representative, an integer with f ≥ 0. The digital signature generated by ECDSA
is a pair (c, d) with 1 ≤ c, d < r, which is computed as follows:
1. Generate a one-time key pair (u, V) with the same set of EC domain parameters used for the
generation of (s, W), where u is a random integer in the range [1, r − 1] and V = u · G =
(xV, yV). Again, since ord G = r and u < r, V ≠ O.
2. Convert xV to an integer i.
3. Compute an integer c = i mod r; if c = 0, go to step 1.
4. Compute an integer d = u^−1 (f + sc) mod r; if d = 0, go to step 1.
(c, d) is the digital signature of the message representative f associated with its EC domain parameters
and the private key s.
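Steps 3 and 4 are plain modular arithmetic and can be sketched with `java.math.BigInteger`. The sketch assumes that the point multiplication V = u · G has already been carried out elsewhere and that xV has been converted to the integer i; the class and method names (`EcdsaSignSteps`, `signatureFrom`) are hypothetical, and the retry in step 1 is signalled by returning `null`, with d = 0 taken as the retry condition of step 4.

```java
import java.math.BigInteger;

public class EcdsaSignSteps {
    /**
     * Returns {c, d} for message representative f, private key s, one-time key u,
     * integer i derived from xV and group order r. Returns null when c = 0 or
     * d = 0, in which case a fresh one-time key pair has to be generated.
     */
    static BigInteger[] signatureFrom(BigInteger f, BigInteger s, BigInteger u,
                                      BigInteger i, BigInteger r) {
        BigInteger c = i.mod(r);                                   // step 3
        if (c.signum() == 0) return null;
        BigInteger d = u.modInverse(r)                             // step 4
                .multiply(f.add(s.multiply(c)))
                .mod(r);
        if (d.signum() == 0) return null;
        return new BigInteger[] {c, d};
    }

    public static void main(String[] args) {
        // Toy numbers only; a real group order r has at least 160 bits.
        BigInteger[] sig = signatureFrom(BigInteger.valueOf(5), BigInteger.valueOf(3),
                BigInteger.valueOf(4), BigInteger.valueOf(7), BigInteger.valueOf(11));
        System.out.println("c = " + sig[0] + ", d = " + sig[1]);
    }
}
```

`modInverse` requires u and r to be coprime, which holds for every u in [1, r − 1] when r is prime.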
gives us the required facilities with its Java Cryptography Architecture (JCA) and its extension Java
Cryptography Extension (JCE):
The JCA refers to a framework for accessing and developing cryptographic functionality for the Java
platform. It was first introduced in JDK 1.1 and has since been extended to include, along with the JCE,
APIs (Application Programming Interfaces) for digital signatures, message digests, encryption,
key exchange, and Message Authentication Codes (MAC). The JCA includes a provider architecture that
allows for multiple and interoperable cryptography implementations.
Furthermore, this architecture defines plain interfaces between user applications and provider
implementations. A user can rely upon a fixed sequence of operations to use any of the primitives
named above. We will demonstrate the user interface with an example for key pair generation,
signature generation and verification for ECDSA. Let us assume that there is already a set of EC
domain parameters, stored in an instance ps of ParameterSpec. This example does not claim
completeness or syntactical correctness; it is for illustration only.
Key Pair Generation:
1. KeyPairGenerator kpg = KeyPairGenerator.getInstance("ECDSA");
2. kpg.initialize(ps);
3. KeyPair kp = kpg.generateKeyPair();
Signature Generation (assuming the message is stored in a byte array msg):
1. Signature sig = Signature.getInstance("ECDSA");
2. sig.initSign(kp.getPrivate());
3. sig.update(msg); byte[] sigBytes = sig.sign();
Signature Verification (assuming the public key is already stored in an instance pk of PublicKey):
1. Signature sig = Signature.getInstance("ECDSA");
2. sig.initVerify(pk);
3. sig.update(msg); boolean ok = sig.verify(sigBytes);
The scheme is more or less always the same: Step (1) - obtain an instance of the class, which
implements the desired primitive, step (2) - initialize this instance and step (3) - carry out your intended
operation.
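For readers who want to run the three-step scheme, here is a self-contained sketch against the providers bundled with a current JDK. It is an assumption-laden stand-in and not the cdcProvider: the algorithm names "EC" and "SHA256withECDSA" and the prime-field curve "secp256r1" are those of the standard JDK provider, not the F(2^n) curves treated in this paper, and the class name `JcaEcdsaDemo` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.security.spec.ECGenParameterSpec;

public class JcaEcdsaDemo {
    static KeyPair generateKeyPair() {
        try {
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("EC");   // step 1: obtain instance
            kpg.initialize(new ECGenParameterSpec("secp256r1"));         // step 2: initialize
            return kpg.generateKeyPair();                                // step 3: operate
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] sign(PrivateKey key, byte[] message) {
        try {
            Signature sig = Signature.getInstance("SHA256withECDSA");
            sig.initSign(key);
            sig.update(message);
            return sig.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean verify(PublicKey key, byte[] message, byte[] signature) {
        try {
            Signature sig = Signature.getInstance("SHA256withECDSA");
            sig.initVerify(key);
            sig.update(message);
            return sig.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair kp = generateKeyPair();
        byte[] message = "transfer 100 EUR".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(kp.getPrivate(), message);
        System.out.println("signature valid: " + verify(kp.getPublic(), message, signature));
    }
}
```

Because the application only names an algorithm and never a concrete implementation class, a different provider, such as one backed by a hardware accelerator, can be substituted without changing this code.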
The classes KeyPairGenerator and Signature are so-called engine classes, which provide
the interface to the functionality of a specific type of cryptographic service (independent of a particular
cryptographic algorithm). Each engine class defines API methods that give applications access to the
specific type of cryptographic service it provides. The actual implementations are those for specific
algorithms. The application interfaces supplied by an engine class are implemented in terms of a
"Service Provider Interface" (SPI): For each engine class there is an abstract SPI class, which defines
the methods that a provider must implement.
So the interfaces between user applications and the provider are well defined. For detailed
information please see [9].
Now that we have explained what Java allows us to do, let us introduce our cdcProvider [5]:
The cdcProvider is a powerful toolkit for the Java Cryptography Architecture. It provides cryptographic
modules that can be plugged into every application that is built on top of the JCA. The cdcProvider is
split into three parts: the CDCStandardProvider, the CDCECProvider and the CDCNFProvider.
Part of the CDCECProvider is ECDSA, which is still under construction. Once finished, it will work on
elliptic curves over large prime fields (a first version of this part can already be downloaded), over finite
fields of characteristic 2 in both optimal normal base and polynomial base representation, and in the
future perhaps over Optimal Extension Fields (OEF), see [13].
A first version of the F(2^n) arithmetic with ONBs in Java is already finished and plugged into a test
version of the CDCECProvider, so that ECDSA works within the JCA over F(p), p prime, and over
F(2^n) in pure software. But to gain high performance, which is absolutely necessary for clients like
banks, we replaced the software of the critical part, namely the point multiplication, by hardware. Our
current state is an algorithm-optimized Java implementation of the EC arithmetic, exploiting the use
of projective coordinates and a sliding window method with built-in NAFs (for NAFs, see [15]).
Since server-based cryptosystems like online banking servers depend on high-performance
implementations, we applied hardware acceleration to the most critical part, the point multiplication.
4 Hardware Acceleration
Due to the immense computational effort for the k·P computation, high-performance hardware
implementations, like the CryptoProcessor described below, are necessary in order to support the use
of EC methods in server-based cryptosystems (e.g., online banking servers). The performance of
software implementations is not sufficient for this kind of application because the n-bit finite field
operations have to be mapped to a processor with a fixed word length (e.g., Intel Pentium, 32 bit),
which introduces an immense computational overhead. As mentioned before, detailed information about
leading software implementations can be found in [15].
Figure 2: Architecture of the CryptoProcessor
For applications such as the proposed CryptoProcessor the most important optimization goals are
throughput and area; the modeling of multi-cycle operations is not of primary concern. To overcome
the deficiencies of current algorithmic synthesis tools, it has become clear that another approach is
needed.
deficiency, it has become clear that a custom VHDL model generator has to be developed, which was
done at our institute.
With such a generator we are able to build CryptoProcessors for various key sizes, so we are well
prepared to support future increases in key size. Especially when targeting FPGA implementations
there are some additional benefits from the generator approach. The available FPGA resources can be
used optimally, because the number of Massey-Omura multipliers (radix) is not fixed. This in turn
is the prerequisite for achieving maximal performance from a specific FPGA. Since the generated VHDL
descriptions are not bound to a particular FPGA family, the rapid advance in FPGA technology can be
directly transferred into better performance. When targeting recent FPGA architectures (e.g. Xilinx
Virtex Family) the expected k·P performance is comparable to standard cell ASIC implementations.
But ASIC implementations normally support only one fixed key size. The use of reconfigurable logic
enables EC methods with variable key sizes on the same hardware.
With respect to design quality and validation there is another benefit from the generator approach.
An implementation for a small key size (e.g. 18 bit) can be used for exhaustive tests. This is necessary
to be sure that the model generator itself is correct. For real-world crypto applications key sizes
n ≥ 160 are required, but at this order of magnitude truly exhaustive tests are not possible.
Table 1: k·P performance of software and hardware implementations

Target Platform                                           Key Size (bit)   k·P Operations per second
C/C++ Software [15] (Intel PentiumPro, 200 MHz)           191              48
FPGA Hardware [16] (XCV400E, 76.7 MHz)                    167              4762
FPGA CryptoProcessor (XC4085XLA, 36 MHz)                  173              568
FPGA CryptoProcessor (XC4085XLA, 36 MHz)                  191              431
FPGA CryptoProcessor (XC4085XLA, 34 MHz)                  270              146
FPGA CryptoProcessor (XCV400E, 120 MHz)                   173              6816 (estimated)
Using our generator approach on top of a VHDL-based design flow with an XC4085XLA-FPGA as
target device leads to a 270-bit CryptoProcessor design including 3 Massey-Omura single-bit multipliers
(radix 3). This design has a CLB utilization of 82% and runs at 34 MHz. The resulting performance
is 146 k·P operations per second, as summarized in Tab. 1.
Two other versions of the CryptoProcessor, supporting key sizes of 191 and 173 bits respectively, have
been mapped to the same XC4085XLA-FPGA. This shows how the generator makes it easy to trade
security for performance. The 191-bit version of the processor with radix 5 has a CLB utilization of 69%
and runs at 36 MHz. The resulting performance is 431 k·P operations per second. The 173-bit
version with radix 6 also runs at 36 MHz and has a CLB utilization of 66%. This results in 568
k·P operations per second. Please note that the achievable CLB utilization decreases when the radix,
which denotes the number of Massey-Omura multipliers in the design, increases. This is because of the
relatively few routing resources an XC4085XLA-FPGA provides in comparison to its logic complexity.
It is possible to implement a 191-bit processor with radix 9 in an XC4085XLA-FPGA. This results in a
CLB utilization of 95%, but the achievable operating frequency is only 12 MHz. Therefore the overall
k·P performance is lower than in the case of the implementation with radix 5. This tradeoff, which has
to be made whenever the target FPGA changes, is uniquely supported by the proposed generator approach.
The values given above, which are summarized in Tab. 1, were measured in real-time within our test
environment consisting of a standard PC (Intel Pentium III, 550 MHz) running MS Windows NT 4.0.
The test application is based on the application programming interface provided with the microEnable
PCI card, written in C++ and compiled with MS Visual C++ 6.0. For the hardware synthesis we used
FPGA Compiler II V3.5 from Synopsys Inc. The FPGA mapping was performed using the Foundation
Series Software V2.1i from Xilinx Inc.
There are several hardware implementations for the k·P computation documented in the literature. The
latest and best-performing one, representing the current benchmark with respect to k·P performance, is
described in [16]. However, this implementation uses only a fixed key size of 167 bits and a polynomial
basis representation for the underlying field. It is highly optimized, exploiting pipelining and
concurrency. For the finite field multiplication, which is the performance-critical part, a digit-serial
multiplier is used. The latter is similar to our approach, but even though the architecture in [16] can
be applied to any field F(2^m), a method to utilize this feature has not been detailed.
A performance comparison of hardware implementations against each other is in general not
straightforward. This is because key sizes mostly differ and because different ASIC or FPGA
technologies are used for the implementations. In order to make a reasonably fair comparison of the
implementation in [16] with our approach, we applied our design flow to the same device as used in
[16], which is a Xilinx XCV400E-FPGA of speed grade -8. Our design flow for this target device leads
to a 173-bit CryptoProcessor design with radix 35. From the design tools we can expect an operating
frequency of approximately 120 MHz for this design. This would be a speedup of roughly 3 caused by
the frequency increase in comparison to the previously described 173-bit implementation with radix 6
running at 36 MHz. In our experience, a further speedup of at least 4 can be expected because of the
radix increase (35 compared to 6; see footnote 2). In total, this would result in an overall performance
of 3 · 4 · 568 = 6816 k·P operations per second for this 173-bit implementation of our CryptoProcessor.
In contrast, the implementation in [16] supports a key size of 167 bits and achieves a performance of
4762 k·P operations per second.
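The estimate can be retraced from the figures quoted above. The short calculation below only restates those numbers; the ratio in the last step is derived here and compares designs with different key sizes (173 versus 167 bits), so it should be read as a rough indication rather than a measured result.

```latex
\[
\frac{120\ \text{MHz}}{36\ \text{MHz}} \approx 3.3 \quad (\text{taken as roughly } 3), \qquad
3 \cdot 4 \cdot 568 = 6816, \qquad
\frac{6816}{4762} \approx 1.43 .
\]
```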
5 Conclusions
We implemented an elliptic-curve-based crypto provider within the Java Cryptography Architecture,
which provides cryptographic modules that can be plugged into every application that is built on top of
the JCA. For high performance requirements a CryptoProcessor has been developed, which implements
the most critical operation, k·P.
Our crypto provider solution is flexible and platform independent. Its performance can be
increased significantly by exploiting the acceleration provided by the proposed FPGA-based
CryptoProcessor. At the same time, we retain the flexibility with respect to the key size, enabled by
our custom VHDL model generator.
The result, namely high performance and flexibility, is of great interest for applications such as
online banking servers. We illustrated these results by means of ECDSA over F(2^n) in ONB representation.
References
[1] N. Koblitz, “Elliptic Curve Cryptosystems,” Mathematics of Computation, 48 (1987), pp. 203–
209.
[2] A. Lenstra and E. Verheul, “Selecting cryptographic key sizes”, August 1999,
http://www.cryptosavvy.com
[3] IEEE P1363, “Standard Specifications For Public Key Cryptography”
http://grouper.ieee.org/groups/1363/
[4] ANSI X9.62, “Public key cryptography for the financial services industry: The Elliptic Curve
Digital Signature Algorithm (ECDSA)”, 1999 (available from the ANSI X9 catalog)
[5] Institute of Cryptography and Computer Algebra, J. Buchmann, TU Darmstadt, “CDCProvider”,
http://www.informatik.tu-darmstadt.de/TI/Forschung/cdcProvider/overview.html, 2000
[6] Joseph H. Silverman, “The Arithmetic of Elliptic Curves”,1986, Springer, Graduate Texts in
Mathematics Vol.106
[7] Neal Koblitz, “Introduction to Elliptic Curves and Modular Forms”, 1993 Graduate Texts in
Mathematics, Springer
[8] SUN, “Java Native Interface”, http://java.sun.com/products/jdk/1.2/docs/guide/jni/index.html
Footnote 2: The k·P performance does not scale linearly when only the finite field multiplication is
sped up.
[9] SUN, “Java Cryptography Architecture API Specification & Reference”, 1997,
http://java.sun.com/products/jdk/1.1/docs/guide/security/CryptoSpec.html
[10] J. Massey and J. Omura, “Computational Method and Apparatus for Finite Field Arithmetic,”
U.S. Patent 4,587,627, 1986.
[11] O. Hauck, A. Katoch and S. A. Huss, “VLSI System Design Using Asynchronous Wave
Pipelines: A 0.35 µm CMOS 1.5 GHz Elliptic Curve Public Key Cryptosystem Chip,” Proc.
IEEE ASYNC 2000, Eilat, April 2000.
[12] M. Rosing, “Implementing Elliptic Curve Cryptography,” Manning Publications Co., Greenwich,
1999. ISBN 1-884777-69-4
[13] D. V. Bailey and C. Paar, “Efficient Arithmetic in Finite Field Extensions with Application in
Elliptic Curve Cryptography,” To appear in Journal of Cryptology.
[14] IEEE Standard 1076.6; Standard for VHDL Register Transfer Level Synthesis, IEEE Standards
Department, New York, 1999.
[15] E. De Win, S. Mister, B. Preneel and M. Wiener, “On the Performance of Signature Schemes
based on Elliptic Curves,” Proc. Algorithmic Number Theory Symposium III, LNCS 1423, J. P.
Buhler, Ed., Springer-Verlag, pp. 252-266, 1998.
[16] G. Orlando and C. Paar, “A High-Performance Reconfigurable Elliptic Curve Processor for
GF(2^m),” Proc. Workshop on Cryptographic Hardware and Embedded Systems (CHES 2000),
Worcester MA, USA, August 2000.
[17] Silicon Software, “microEnable Users Guide”, 1999.
[18] Xilinx, “Programmable Logic Data Book”, 1999.