This is a good background color and a good text color - PowerPoint PPT Presentation

Provided by: melis129

Transcript and Presenter's Notes

1
High Radix Scalable Montgomery Multipliers
David Money Harris, Nathaniel Pinckney, Philip Amberg
Harvey Mudd College, Claremont, CA
Supported by Intel Circuit Research Labs
2
Outline

RSA Encryption
Montgomery Multiplication
Radix 2 Implementation
Scalable Hardware
Left Shifting
Higher Radix
Parallel Algorithm
High Radix Implementations
Results

3
RSA Encryption

Most widely used public key system.
Good for encryption and signatures.
Invented by Rivest, Shamir, Adleman (1978)
Public key e and private key d are long numbers
n = 256-2048 bits
Satisfy $x^{de} \bmod M = x$ for all x
Finding d from e is as hard as factoring M
Encryption: $B = A^e \bmod M$
Decryption: $C = B^d \bmod M = A^{ed} \bmod M = A$
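The round trip on this slide can be tried with a toy key. This is a minimal sketch, assuming two tiny primes (61 and 53) and public exponent 17 chosen only for illustration; real keys use the 256-2048 bit sizes quoted above. The helper names `encrypt` and `decrypt` are hypothetical.

```python
# Toy RSA round trip with deliberately tiny, assumed primes (illustration only).
p, q = 61, 53
M = p * q                      # modulus
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent: e*d = 1 (mod phi), Python 3.8+


def encrypt(a):
    """B = A^e mod M."""
    return pow(a, e, M)


def decrypt(b):
    """C = B^d mod M, which recovers A."""
    return pow(b, d, M)


message = 65
assert decrypt(encrypt(message)) == message
```

The built-in three-argument `pow` hides the modular exponentiation that the remaining slides set out to accelerate in hardware.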

4
Cryptographic Algorithms

DES, AES
Symmetric key algorithms
Require exchange of secret key
Computationally efficient
RSA, ECC
Public key algorithms
No key exchange needed (e.g. ecommerce)
Computationally expensive
Use public key to exchange symmetric key

5
Modular Exponentiation

Critical operation in RSA and for
Digital signature algorithm
Diffie-Hellman key exchange
SSL, IPSec, IPv6
Elliptic curve cryptosystems
Done with modular multiplications
Ex: $A^{27} = ((((A^2 \cdot A)^2)^2 \cdot A)^2) \cdot A$
Division after each multiplication to compute the modulo
Maximum 2n, average 1.5n mults needed
But always use 2n to avoid timing attacks
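The square-and-multiply chain in the example can be sketched as follows. The function name `modexp` and the returned multiplication count are assumptions added for illustration. This version multiplies only on 1 bits, so it shows the 1.5n average case and is not the constant-time 2n variant recommended against timing attacks.

```python
def modexp(a, exponent, m):
    """Left-to-right square-and-multiply.

    Returns (a**exponent mod m, number of modular multiplications used).
    """
    result = 1
    mults = 0
    for bit in bin(exponent)[2:]:          # exponent bits, most significant first
        result = (result * result) % m     # square on every bit
        mults += 1
        if bit == "1":
            result = (result * a) % m      # multiply only on 1 bits
            mults += 1
    return result, mults


value, count = modexp(3, 27, 1000)
assert value == pow(3, 27, 1000)
```

For the exponent 27 (binary 11011) this loop performs 5 squarings and 4 multiplications, within the 2n bound for n = 5. The slide's hand-written chain is shorter because it skips the leading square of 1 and the first multiply.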

6
Montgomery Multiplication

Faster way to do modular exponentiation
Operate on Montgomery residues
Division becomes a simple shift
Requires conversion to and from residues only once per exponentiation

7
Montgomery Residues

Let the modulus M be an odd n-bit integer
$2^{n-1} < M < 2^n$
Define $R = 2^n$
Define the M-residue of an integer $A < M$ as
$\bar{A} = AR \bmod M$
There is a one-to-one correspondence between integers and M-residues for
$0 < A < M-1$

8
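Montgomery Multiplication

Define
$\bar{Z} = MM(\bar{X}, \bar{Y}) = \bar{X}\,\bar{Y}\,R^{-1} \bmod M$
Where $R^{-1}$ is the inverse of R mod M
$R^{-1}R = 1 \pmod{M}$
Montgomery Mult finds the residue of $Z = XY \bmod M$
$\bar{Z} = \bar{X}\,\bar{Y}\,R^{-1} \bmod M$
$= (XR)(YR)R^{-1} \bmod M$
$= XYR \bmod M$
$= ZR \bmod M$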
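A small numeric check of the definition may help. This sketch assumes an 8-bit odd modulus M = 239 picked only for illustration, and it evaluates MM directly from its definition using a modular inverse rather than the fast shift-based algorithm. The names `to_residue` and `mm` are hypothetical.

```python
n = 8
M = 239                        # assumed odd modulus with 2^(n-1) < M < 2^n
R = 1 << n                     # R = 2^n
R_inv = pow(R, -1, M)          # inverse of R mod M (Python 3.8+)


def to_residue(a):
    """M-residue of a: a*R mod M."""
    return (a * R) % M


def mm(x_bar, y_bar):
    """MM from its definition: x_bar * y_bar * R^-1 mod M."""
    return (x_bar * y_bar * R_inv) % M


X, Y = 100, 200
Z_bar = mm(to_residue(X), to_residue(Y))
# The product of two residues is the residue of the ordinary modular product.
assert Z_bar == to_residue((X * Y) % M)
```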

9
Montgomery Reduction

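Precompute $M'$ satisfying $RR^{-1} - MM' = 1$
Convert mult and mod to 3 mults and a shift
Multiply: $Z = X \times Y$
Reduce: $\mathit{reduce} = Z \times M' \bmod R$
$Z = (Z + \mathit{reduce} \times M) / R$
Normalize: if $Z \ge M$ then $Z = Z - M$
Why is $Z + \mathit{reduce} \times M$ divisible by R?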

Step callouts: Multiply is one mult; Reduce is one mult, then drop bits (mod R); the last step is one mult, then a shift for $R^{-1}$
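The three-multiply procedure above can be sketched in a few lines. The function names are assumptions, and Python's built-in modular inverse stands in for whatever method is used to precompute M'. The internal assertion checks the divisibility that the slide asks about.

```python
def montgomery_setup(M, n):
    """Return (R, M') with R = 2^n and M' = -M^-1 mod R, for odd M."""
    R = 1 << n
    M_prime = (-pow(M, -1, R)) % R
    return R, M_prime


def montgomery_multiply(X, Y, M, n):
    """Return X*Y*R^-1 mod M using three multiplies, a mask and a shift."""
    R, M_prime = montgomery_setup(M, n)
    Z = X * Y                              # Multiply
    reduce_ = (Z * M_prime) & (R - 1)      # Reduce: keep the low n bits (mod R)
    total = Z + reduce_ * M
    assert total & (R - 1) == 0            # the sum is divisible by R
    Z = total >> n                         # shift right by n = divide by R
    if Z >= M:                             # Normalize
        Z -= M
    return Z


M, n = 239, 8
R, M_prime = montgomery_setup(M, n)
assert R * pow(R, -1, M) - M * M_prime == 1
```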
10
Reduction Proof

$Z + \mathit{reduce} \times M \bmod R$
$= Z + (Z M' \bmod R)\,M \bmod R$
$= Z + Z M' M \bmod R$
$= Z + Z(RR^{-1} - 1) \bmod R$
$= Z R R^{-1} \bmod R$
$= 0 \bmod R$
So $Z + \mathit{reduce} \times M$ is divisible by R

11
More Comments on M'

$RR^{-1} - MM' = 1$
Implies $M' \equiv -M^{-1} \bmod R$
M' is odd
M' is precomputed from M using the extended Euclidean algorithm
M' is held constant over many mults
Only the least significant v bits of M' are needed when computing in radix $2^v$
Dussé and Kaliski, Eurocrypt '90
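The claim about the low v bits can be checked numerically. This sketch assumes M = 239, n = 8 and v = 2 as illustrative values, and it uses Python's `pow(M, -1, modulus)` in place of a hand-written extended Euclidean routine. The helper name `m_prime` is hypothetical.

```python
def m_prime(M, bits):
    """M' = -M^-1 mod 2^bits for an odd modulus M."""
    modulus = 1 << bits
    return (-pow(M, -1, modulus)) % modulus


M, n, v = 239, 8, 2
full = m_prime(M, n)           # M' for R = 2^n
digit = m_prime(M, v)          # M' computed directly for radix 2^v

assert (M * full + 1) % (1 << n) == 0    # M*M' = -1 (mod R)
assert full % 2 == 1                     # M' is odd
assert full % (1 << v) == digit          # low v bits of M' are all radix 2^v needs
```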

12
Radix 2 Algorithm

In radix 2, process one bit of X per step
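Reduction becomes trivial because $M' \bmod 2 = 1$
Two multiplies and one shift per step
$Z = 0$
for $i = 0$ to $n-1$
  $Z = Z + X_i \times Y$
  $\mathit{reduce} = Z_0$ (trivial)
  $Z = Z + \mathit{reduce} \times M$ (make Z divisible by 2)
  $Z = Z / 2$
if $Z \ge M$ then $Z = Z - M$ (final mod M)

General algorithm for comparison: $Z = X \times Y$; $\mathit{reduce} = Z \times M' \bmod R$; $Z = (Z + \mathit{reduce} \times M) / R$; if $Z \ge M$ then $Z = Z - M$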
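The radix-2 loop translates almost line for line into software. This is a behavioral sketch with an assumed function name, not a model of the hardware datapath; it processes one bit of X per iteration and assumes X, Y < M with M odd.

```python
def mm_radix2(X, Y, M, n):
    """Bit-serial Montgomery multiply: returns X*Y*2^-n mod M."""
    Z = 0
    for i in range(n):
        x_i = (X >> i) & 1         # bit i of X
        Z = Z + x_i * Y
        reduce_ = Z & 1            # Z_0: trivial because M' mod 2 = 1
        Z = Z + reduce_ * M        # Z is now even
        Z = Z >> 1                 # divide by 2
    if Z >= M:                     # final mod M
        Z -= M
    return Z


M, n = 239, 8
expected = (100 * 200 * pow(1 << n, -1, M)) % M
assert mm_radix2(100, 200, M, n) == expected
```

Inside the loop Z stays below 2M, since (Z + Y + M) / 2 < 2M whenever Z < 2M and Y < M, which is why a single conditional subtraction at the end is enough.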
13
Final Modulo

Result before last step in range
$0 \le Z < 2M$
Reducing Z - M at the end is a hassle
Allow $0 \le X, Y < 2M$ to avoid reduction
Then if $R > 4M$, $0 \le Z < 2M$
Hence add two bits to R to avoid subtraction at the end of each step
Walter, Electronics Letters '99
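The bound can be exercised with a short experiment. This sketch assumes the radix-2 loop is simply run for n + 2 iterations, so that R = 2^(n+2) > 4M, and it omits the final subtraction. The function name is hypothetical.

```python
def mm_no_final_sub(X, Y, M, n):
    """Radix-2 Montgomery loop with R = 2^(n+2) and no final subtraction.

    Accepts 0 <= X, Y < 2M and returns Z with 0 <= Z < 2M,
    congruent to X*Y*2^-(n+2) mod M.
    """
    Z = 0
    for i in range(n + 2):
        Z = Z + ((X >> i) & 1) * Y
        Z = (Z + (Z & 1) * M) >> 1
    return Z


M, n = 239, 8
Z = mm_no_final_sub(2 * M - 1, 2 * M - 1, M, n)
assert 0 <= Z < 2 * M
```

Because the output range matches the allowed input range, results can be fed straight into the next multiplication, and a true mod M is only needed once at the very end.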

14
Conversion

Conversion of integers to/from Montgomery residues takes one MM operation (if $R^2 \bmod M$ is precomputed and saved)
Modular exponentiation takes two conversion steps and 2n multiplication steps.
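Putting conversion and multiplication together gives a full exponentiation. This is a sketch under stated assumptions: `mm` is a plain software Montgomery multiply, `mont_exp` is a hypothetical name, and the residue of 1 is obtained with one extra MM call that a real design would precompute alongside R^2 mod M. The loop always performs both a square and a multiply per exponent bit, giving the 2n steps mentioned above, although Python itself offers no timing guarantees.

```python
def mm(X, Y, M, n):
    """Montgomery multiply: X*Y*2^-n mod M for odd M and X, Y < M."""
    R = 1 << n
    M_prime = (-pow(M, -1, R)) % R
    Z = X * Y
    Z = (Z + ((Z * M_prime) & (R - 1)) * M) >> n
    return Z - M if Z >= M else Z


def mont_exp(A, exponent, M, n):
    """A**exponent mod M for an exponent of at most n bits."""
    R2 = pow(1 << n, 2, M)                  # R^2 mod M, precomputed per modulus
    A_bar = mm(A, R2, M, n)                 # convert in: A*R^2*R^-1 = A*R mod M
    result = mm(1, R2, M, n)                # residue of 1 (could be precomputed)
    for i in reversed(range(n)):            # n squares + n multiplies
        result = mm(result, result, M, n)
        product = mm(result, A_bar, M, n)   # always computed, sometimes discarded
        if (exponent >> i) & 1:
            result = product
    return mm(result, 1, M, n)              # convert out: multiply by 1


assert mont_exp(5, 27, 239, 8) == pow(5, 27, 239)
```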

15
Reconfigurable Hardware

Building hardwired n-bit unit is limiting
Slow for large n
Not scalable to different n
Better to design for w-bit words
Break n-bit operand into e w-bit words
$e = \lceil n/w \rceil$
This is called scalable
Also handle both GF(p) and $GF(2^n)$
Requires conditionally killing carries
Called unified

16
Tenca-Koç Montgomery Multiplier

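$Z = 0$
for $i = 0$ to $n-1$
  $(C_a, Z^0) = Z^0 + X_i \times Y^0$
  $\mathit{reduce} = Z^0_0$
  $(C_b, Z^0) = Z^0 + \mathit{reduce} \times M^0$
  for $j = 1$ to $e$
    $(C_a, Z^j) = Z^j + C_a + X_i \times Y^j$
    $(C_b, Z^j) = Z^j + C_b + \mathit{reduce} \times M^j$
    $Z^{j-1} = (Z^j_0, Z^{j-1}_{w-1:1})$

$M = (M^{e-1}, \ldots, M^1, M^0)$, $Y = (Y^{e-1}, \ldots, Y^1, Y^0)$, $Z = (Z^{e-1}, \ldots, Z^1, Z^0)$, $X = (X_{n-1}, \ldots, X_1, X_0)$
Tenca, Koç, Trans. Computers, 2003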
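A word-by-word software model makes the carry and shift handling in this loop concrete. This is a behavioral sketch, not the authors' hardware: the function names are hypothetical, the word count is assumed to be ceil((n+1)/w) so that a value below 2M fits, and Y and M are zero-extended by one word so the inner loop can run up to j = e.

```python
def to_words(value, count, w):
    """Split value into count w-bit words, least significant first."""
    mask = (1 << w) - 1
    return [(value >> (w * j)) & mask for j in range(count)]


def mm_word_serial(X, Y, M, n, w):
    """Word-serial radix-2 Montgomery multiply; result < 2M, no final subtraction."""
    e = -(-(n + 1) // w)                   # assumed word count: ceil((n+1)/w)
    mask = (1 << w) - 1
    Yw = to_words(Y, e + 1, w)             # word e is a zero extension
    Mw = to_words(M, e + 1, w)
    Z = [0] * (e + 1)
    for i in range(n):
        x_i = (X >> i) & 1
        t = Z[0] + x_i * Yw[0]
        Ca, Z[0] = t >> w, t & mask
        reduce_ = Z[0] & 1                 # bit 0 of word 0
        t = Z[0] + reduce_ * Mw[0]
        Cb, Z[0] = t >> w, t & mask
        for j in range(1, e + 1):
            t = Z[j] + Ca + x_i * Yw[j]
            Ca, Z[j] = t >> w, t & mask
            t = Z[j] + Cb + reduce_ * Mw[j]
            Cb, Z[j] = t >> w, t & mask
            # Right shift by one: bit 0 of word j becomes the msb of word j-1.
            Z[j - 1] = ((Z[j] & 1) << (w - 1)) | (Z[j - 1] >> 1)
        Z[e] = 0                           # its only bit moved into word e-1
    return sum(word << (w * j) for j, word in enumerate(Z))


M, n = 239, 8
Z = mm_word_serial(100, 200, M, n, w=4)
assert Z % M == (100 * 200 * pow(1 << n, -1, M)) % M
```

The same function works unchanged for different word sizes w, which is the sense in which the design is called scalable.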
17
Processing Elements

Keep Z in carry-save redundant form
$T_c = 2t_{AND} + 2t_{CSA} + t_{MUX} + t_{BUF}(w) + t_{REG}$

18
Parallelism

Two dimensions of parallelism
Width of processing element w
Number of pipelined PEs p
Multiply takes $k = \lceil n/p \rceil$ kernel cycles

19
Pipeline Timing
20
Left shifting

Don't wait two cycles for msb
Kick off dependent operation right away on the available bits
Take extra cycle(s) at the end to handle the extra bits
For p processing elements, cycle count reduces from 2p to $p + (p/w)$
Harris, Krishnamurthy, Anders, Mathew, Hsu, Arith 2005.

21
Improved PE

Left-shift M and Y
Rather than right-shift Z
Same PE, saves pipeline registers

22
Pipeline Timing
23
High Radix Scalable Montgomery Algorithm
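$n_1 = n + 1$, $n_2 = n + 2$, $e = \lceil n_1/w \rceil$, $f = \lceil n_2/v \rceil$
$X = (X^{f-1}, \ldots, X^1, X^0)$, $M = (M^{e-1}, \ldots, M^1, M^0)$, $Y = (Y^{e-1}, \ldots, Y^1, Y^0)$, $Z = (Z^{e-1}, \ldots, Z^1, Z^0)$

$Z = 0$
for $i = 0$ to $f-1$
  $(C_a, Z^0) = Z^0 + X^i \times Y^0$
  $Q = ({M'}^0 \times Z^0) \bmod 2^v$
  $(C_b, Z^0) = Z^0 + Q \times M^0$
  for $j = 1$ to $e-1$
    $(C_a, Z^j) = Z^j + C_a + X^i \times Y^j$
    $(C_b, Z^j) = Z^j + C_b + Q \times M^j$
    $Z^{j-1} = (Z^j_{v-1:0}, Z^{j-1}_{w-1:v})$

Only two mults and a shift in the inner loop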
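The digit-level behavior of this algorithm can be modeled without the word loop. The sketch below is a simplification under stated assumptions: it keeps Z as one Python integer instead of e separate w-bit words, so it shows the per-digit quotient Q and the v-bit shift but not the carry chain. The function name is hypothetical.

```python
def mm_high_radix(X, Y, M, n, v):
    """Digit-serial radix-2^v Montgomery multiply.

    Returns (Z, f) with Z congruent to X*Y*2^(-v*f) mod M.
    For 0 <= X, Y < 2M the result satisfies 0 <= Z < 2M.
    """
    f = -(-(n + 2) // v)                     # f = ceil(n2 / v), n2 = n + 2
    r = 1 << v
    M_prime0 = (-pow(M, -1, r)) % r          # low v bits of M'
    Z = 0
    for i in range(f):
        X_i = (X >> (v * i)) & (r - 1)       # v-bit digit i of X
        Z = Z + X_i * Y
        Q = (M_prime0 * (Z & (r - 1))) & (r - 1)
        Z = Z + Q * M                        # low v bits of Z become zero
        Z = Z >> v
    return Z, f


M, n, v = 239, 8, 2
Z, f = mm_high_radix(300, 400, M, n, v)
assert Z % M == (300 * 400 * pow(2, -v * f, M)) % M
```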
24
Parallel Operation

Eliminate two of the steps
Multiplication to compute Q
By precomputing $\tilde{M} = (M' \bmod R)\,M$
Dependency of $Z^0$ on Q
By prescaling X by $2^v$ so that it adds 0 to $Z^0$
Math proposed by Orup, Arith '95
But no scalable hardware

25
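Improvement 1: Eliminate Multiply
Before: $Z = X \times Y$; $Q = Z \times M' \bmod R$; $Z = (Z + Q \times M) / R$

$Z = 0$
for $i = 0$ to $f-1$
  $Z = Z + X^i \times Y$
  $Z = Z + Z^0 \times \tilde{M}$, with $\tilde{M} = (M' \bmod 2^v)\,M$ (radix-$2^v$ counterpart of $(M' \bmod R)\,M$)
  $Z = Z / 2^v$

After: $\tilde{M} = (M' \bmod R)\,M$ (precompute); $Z = X \times Y$; $Z = (Z + Z \times \tilde{M}) / R$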

26
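Improvement 2: Prescale X by $2^v$
Before (with $\tilde{M} = (M' \bmod 2^v)\,M$ precomputed):
$Z = 0$
for $i = 0$ to $f-1$
  $Z = Z + X^i \times Y$
  $Z = Z + Z^0 \times \tilde{M}$
  $Z = Z / 2^v$
One more iteration

After prescaling:
$Z = 0$
for $i = 0$ to $f$
  $Z = Z + 2^v X^i \times Y + Z^0 \times \tilde{M}$
  $Z = Z / 2^v$
Equivalently:
$Z = 0$
for $i = 0$ to $f$
  $Z = (Z + Z^0 \times \tilde{M}) / 2^v + X^i \times Y$

Because $Z^0$ is independent of $2^v X^i$

Final result in range $0 \le Z < 2^{n+v+1}$; avoid the final small mod in successive mults by using larger $n_2 = n + 2 + v$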
27
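Improvement 3: Avoid LSW add

Before:
$Z = 0$
for $i = 0$ to $f$
  $Z = (Z + Z^0 \times \tilde{M}) / 2^v + X^i \times Y$
After:
$Z = 0$
for $i = 0$ to $f$
  $Q = Z^0$
  $Z = (Z \gg v) + Q \times \hat{M} + X^i \times Y$

$(Z + Z^0 \tilde{M}) / 2^v = (Z \gg v) + (Z^0 \tilde{M} + Z \bmod 2^v) / 2^v = (Z \gg v) + Z^0 (\tilde{M} + 1) / 2^v = (Z \gg v) + Z^0 \hat{M}$

$\hat{M} = (\tilde{M} + 1) / 2^v$

$\tilde{M} \equiv M'M \equiv -1 \bmod 2^v$, so $\tilde{M} + 1$ is divisible by $2^v$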

28
Scalable Parallel High Radix

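$Z = 0$
for $i = 0$ to $f-1$
  $C = 0$
  $Q = Z^0_{v-1:0}$
  for $j = 0$ to $e-1$
    $(C, Z^j) = Z^j + C + Q \times \hat{M}^j + X^i \times Y^j$
    $Z^{j-1} = (Z^j_{v-1:0}, Z^{j-1}_{w-1:v})$

Here $\hat{M}^j$ is word j of the precomputed constant $\hat{M} = ((M' \bmod 2^v)\,M + 1) / 2^v$
$n_1 = n + 1 + v$, $n_2 = n + 2 + v$, $e = \lceil n_1/w \rceil$, $f = \lceil n_2/v \rceil$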
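The combined effect of the three improvements can be checked at the digit level. This is a sketch under stated assumptions: Z is kept as a single integer rather than e words, the names `M_tilde` and `M_hat` are labels chosen here for the two precomputed constants, and the test inputs are limited to X, Y < 2M so that the top digit of X is zero. With f digits processed, the result is then congruent to X*Y*2^(-v(f-1)) mod M.

```python
def mm_parallel(X, Y, M, n, v):
    """Digit-level model of the parallel high-radix loop.

    Returns (Z, f); for X < 2^(v*(f-1)), Z is congruent to
    X*Y*2^(-v*(f-1)) mod M. No multiplication is needed to form Q.
    """
    f = -(-(n + 2 + v) // v)               # f = ceil(n2 / v), n2 = n + 2 + v
    r = 1 << v
    M_prime0 = (-pow(M, -1, r)) % r        # M' mod 2^v
    M_tilde = M_prime0 * M                 # congruent to -1 mod 2^v
    assert (M_tilde + 1) % r == 0
    M_hat = (M_tilde + 1) >> v             # exact division by 2^v
    Z = 0
    for i in range(f):
        Q = Z & (r - 1)                    # Q = low v bits of Z
        X_i = (X >> (v * i)) & (r - 1)
        Z = (Z >> v) + Q * M_hat + X_i * Y # reduce and accumulate in one step
    return Z, f


M, n, v = 239, 8, 2
Z, f = mm_parallel(300, 400, M, n, v)
assert Z % M == (300 * 400 * pow(2, -v * (f - 1), M)) % M
```

Since Q depends only on the previous Z, the two products Q × M_hat and X_i × Y have no dependency on each other and can be formed in parallel, which is the point of the rearrangement.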
29
Choice of Radix

Radix 2: 1 bit of X per step
Radix $2^v$: v bits of X per step
Requires 0, Y, 2Y, 3Y, ..., $(2^v - 1)Y$ multiples
Very High Radix ($v > 4$)
Our prior work
Requires w × v-bit multipliers to compute w bits of Y times v bits of X
Efficient for FPGAs with dedicated 16 × 16 multipliers, but perhaps not for custom silicon
High Radix: radix 4, 8, 16
This work

30
Design Space Exploration

Parallel hardware
Left vs. right shifting
Choice of radix
Booth encoding?
Design Points
Parallel Radix-4 left shift non-Booth
Parallel Radix-4 left shift Booth
Parallel Radix-4 right shift Booth
Parallel Radix-8 right shift Booth

31
Booth Encoding

Radix 4
Non-Booth: 0, Y, 2Y, 3Y
Booth: -2Y, -Y, 0, Y, 2Y
Radix 8
Non-Booth: 0, Y, 2Y, 3Y, 4Y, 5Y, 6Y, 7Y
Booth: -4Y, -3Y, -2Y, -Y, 0, Y, 2Y, 3Y, 4Y
Hard multiples must be precomputed
w-bit CPAs at output of operand memory
Registers to pipe along values
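The radix-4 Booth digit set can be illustrated with a standard recoding routine. This sketch covers only the recoding of X into digits from {-2, -1, 0, 1, 2}; it does not model how the multiplier hardware handles the sign bits or the recoding of Q. The function name is an assumption, and X must be small enough that its top recoded bit is zero.

```python
def booth_radix4(X, num_digits):
    """Recode X into radix-4 Booth digits, least significant first.

    Each digit is in {-2, -1, 0, 1, 2}. The digits sum back to X
    provided X < 2^(2*num_digits - 1).
    """
    digits = []
    prev = 0                               # implicit 0 to the right of bit 0
    for i in range(num_digits):
        b0 = (X >> (2 * i)) & 1
        b1 = (X >> (2 * i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1                          # overlaps into the next digit
    return digits


digits = booth_radix4(27, 3)
assert sum(d * 4**i for i, d in enumerate(digits)) == 27
```

With this digit set only Y and 2Y are needed, and 2Y is a shift, so the hard multiple 3Y required by the non-Booth radix-4 design disappears. In radix 8 the hard multiple 3Y remains even with Booth encoding, as the list above shows.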

32
Radix-4 Left-Shifting Non-Booth
33
Radix-4 Left-Shifting Booth
34
Extra Bits in Least Significant Word

Booth encoding: SX, SQ
Carry out of discarded v lsbs
Parallel design has nonzero lsbs
Only two CSAs available to add 3 bits
Use $H = Z_{s,1:0} + Z_{c,1:0} + SQ$
Inject SX and H into CSA vacancies

35
Radix-4 Right-Shifting Booth
36
Radix-8 Right-Shifting Booth
37
Synthesis Results

Target Xilinx XC2V2000-6
21504 LUTs

38
Exponentiation Times
39
Cautions

Choose n2 to be a multiple of vp so results can
be read from last PE
Reset registers properly between steps
Sequencer is big source of bugs
High radix starts to get messy to implement

40
Sponsored Research Outcomes

Kyle Kelley
Completing 3rd year of Stanford Ph.D.
Ted Jiang
Completing 1st year of Stanford Ph.D. program
with NSF Fellowship
John Parker
Completing 1st year of UCSB Ph.D. program in
Integrated Photonics
Nathaniel Pinckney
Accepted to U. Michigan
Philip Amberg
Accepted to Stanford University M.S. program
Publications
6 conference papers published
1 conference paper accepted (Asilomar 08)
1 conference paper under review (VLSI SOC 08)
1 journal paper under review (JICS)

41
Publications

Published (Conference)
Pinckney, Nathaniel, and Harris, David, "Parallelized Radix-4 Scalable Montgomery Multipliers," 20th Symp. on Integrated Circuits and Systems Design, Sept. 2007.
Jiang, Nan, and Harris, David, "Parallelized Radix-2 Scalable Montgomery Multiplier," VLSI SOC, 2007.
Jiang, Nan, and Harris, David, "Quotient Pipelined Very High Radix Scalable Montgomery Multipliers," Asilomar Conf. Signals, Systems, and Computers, Nov. 2006.
Kelley, Kyle, and Harris, David, "Parallelized Very High Radix Scalable Montgomery Multipliers," Asilomar Conf. Signals, Systems, and Computers, Nov. 2005, pp. 1196-1200.
Kelley, Kyle, and Harris, David, "Very High Radix Scalable Montgomery Multipliers," 5th Intl. Workshop on System-on-Chip, July 2005, pp. 400-404.
Harris, David, Krishnamurthy, Ram, Anders, Mark, Mathew, Sanu, and Hsu, Steven, "An Improved Unified Scalable Radix-2 Montgomery Multiplier," IEEE Symposium on Computer Arithmetic, June 2005, pp. 172-178.
In Progress (Conference)
Pinckney, N., Amberg, P., and Harris, D., "Parallelized Booth-Encoded Radix-4 Montgomery Multipliers," submitted to VLSI SOC 2008.
Amberg, P., Pinckney, N., and Harris, D., "Parallelized Booth-Encoded Radix-8 Montgomery Multipliers," invited paper for Asilomar Conf. Signals, Systems, and Computers 2008.
Submitted (Journal)
Pinckney, Nathaniel, and Harris, David, "Parallelized Radix-4 Scalable Montgomery Multipliers," submitted to JICS, 2007.
Harris, David, Jiang, Ted, Kelley, Kyle, Pinckney, Nathaniel, and Parker, Jon, "Very High Radix Montgomery Multipliers," submitted to IEEE Transactions on Computers.