1High Radix Scalable Montgomery Multipliers
David Money Harris, Nathaniel Pinckney, Philip Amberg
Harvey Mudd College, Claremont, CA
Supported by Intel Circuit Research Labs
2Outline
- RSA Encryption
- Montgomery Multiplication
- Radix 2 Implementation
- Scalable Hardware
- Left Shifting
- Higher Radix
- Parallel Algorithm
- High Radix Implementations
- Results
3RSA Encryption
- Most widely used public key system.
- Good for encryption and signatures.
- Invented by Rivest, Shamir, Adleman (1978)
- Public e and private d keys are long numbers
- n = 256-2048 bits
- Satisfy $x^{de} \bmod M = x$ for all $x$
- Finding d from e is as hard as factoring M
- Encryption: $B = A^e \bmod M$
- Decryption: $C = B^d \bmod M = A^{ed} \bmod M = A$
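The encryption and decryption relations above can be tried with a toy key. This is only a sketch: the primes 61 and 53, the exponent 17, and the helper names `encrypt` and `decrypt` are illustrative assumptions, and real keys are hundreds of digits long.

```python
# Toy RSA round trip. The primes and exponent are illustrative assumptions,
# far too small for real use.
p, q = 61, 53
M = p * q                      # public modulus
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent with e*d = 1 (mod phi)

def encrypt(a):
    """B = A^e mod M"""
    return pow(a, e, M)

def decrypt(b):
    """C = B^d mod M, which recovers A"""
    return pow(b, d, M)
```

Python's three-argument `pow` hides the modular exponentiation that the rest of the deck builds in hardware.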
4Cryptographic Algorithms
- DES, AES
- Symmetric key algorithms
- Require exchange of secret key
- Computationally efficient
- RSA, ECC
- Public key algorithms
- No key exchange needed (e.g. ecommerce)
- Computationally expensive
- Use public key to exchange symmetric key
5Modular Exponentiation
- Critical operation in RSA and for
- Digital signature algorithm
- Diffie-Hellman key exchange
- SSL, IPSec, IPv6
- Elliptic curve cryptosystems
- Done with modular multiplications
- Ex: $A^{27} = ((((A^2 \cdot A)^2)^2 \cdot A)^2) \cdot A$
- Division after each multiplication to compute modulo
- Maximum 2n, average 1.5n mults needed
- But always use 2n to avoid timing attacks
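The square-and-multiply pattern in the $A^{27}$ example can be sketched as follows. The function name `modexp` and the multiply counter are assumptions added for illustration, and this version does not pad to 2n multiplications, so it is not the timing-attack-resistant form mentioned above.

```python
def modexp(a, e, m):
    """Left-to-right square-and-multiply for e >= 1.

    Returns (a**e % m, number of modular multiplications used).
    """
    bits = bin(e)[2:]          # exponent bits, most significant first
    result = a % m             # consume the leading 1 bit
    mults = 0
    for bit in bits[1:]:
        result = (result * result) % m    # always square
        mults += 1
        if bit == "1":
            result = (result * a) % m     # multiply only on 1 bits
            mults += 1
    return result, mults
```

For e = 27 (binary 11011) this uses four squarings and three multiplications, matching the parenthesization on the slide.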
6Montgomery Multiplication
- Faster way to do modular exponentiation
- Operate on Montgomery residues
- Division becomes a simple shift
- Requires conversion to and from residues only
once per exponentiation
7Montgomery Residues
- Let the modulus M be an odd n-bit integer
- $2^{n-1} < M < 2^n$
- Define $R = 2^n$
- Define the M-residue of an integer $A < M$ as
- $\bar{A} = AR \bmod M$
- There is a one-to-one correspondence between integers and M-residues for $0 \le A \le M-1$
8Montgomery Multiplication
- Define
- $\bar{Z} = MM(\bar{X}, \bar{Y}) = \bar{X}\bar{Y}R^{-1} \bmod M$
- Where $R^{-1}$ is the inverse of $R \bmod M$
- $R^{-1}R \equiv 1 \pmod{M}$
- Montgomery Mult finds residue of $Z = XY \bmod M$
- $\bar{Z} = \bar{X}\bar{Y}R^{-1} \bmod M$
- $= (XR)(YR)R^{-1} \bmod M$
- $= XYR \bmod M$
- $= ZR \bmod M$
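The definition can be checked numerically before any hardware tricks are applied. The sketch below uses an explicit modular inverse for $R^{-1}$, which is exactly what Montgomery reduction later avoids. The helper names and the small modulus in the usage note are assumptions for illustration.

```python
def to_residue(a, M, R):
    """M-residue of a: a*R mod M."""
    return (a * R) % M

def from_residue(a_bar, M, R):
    """Inverse mapping: a_bar * R^-1 mod M."""
    return (a_bar * pow(R, -1, M)) % M

def mm_reference(x_bar, y_bar, M, R):
    """Reference Montgomery product x_bar * y_bar * R^-1 mod M (by definition)."""
    return (x_bar * y_bar * pow(R, -1, M)) % M
```

With M = 97 and R = 128, `mm_reference(to_residue(x, M, R), to_residue(y, M, R), M, R)` equals `to_residue(x * y % M, M, R)` for every x and y below M.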
9Montgomery Reduction
- Precompute $M'$ satisfying $RR^{-1} - MM' = 1$
- Convert mult and mod to 3 mult and shift
- Multiply: $Z = X \times Y$ (mult)
- Reduce: $\text{reduce} = Z \times M' \bmod R$ (mult, drop bits)
- $Z = (Z + \text{reduce} \times M) / R$ (mult, shift for $R^{-1}$)
- Normalize: if $Z \ge M$ then $Z = Z - M$
- Why is $Z + \text{reduce} \times M$ divisible by $R$?
10Reduction Proof
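- $Z + \text{reduce} \times M \bmod R$
- $= Z + (Z \times M' \bmod R) \times M \bmod R$
- $= Z + Z \times M'M \bmod R$
- $= Z + Z(RR^{-1} - 1) \bmod R$
- $= ZRR^{-1} \bmod R$
- $= 0 \bmod R$
- So $Z + \text{reduce} \times M$ is divisible by $R$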
11More Comments on M'
- $RR^{-1} - MM' = 1$
- Implies $M' \equiv -M^{-1} \bmod R$
- $M'$ is odd
- $M'$ is precomputed from $M$ using the extended Euclidean algorithm
- $M$ is held constant over many mults
- Only least significant $v$ bits of $M'$ are needed when computing in radix $2^v$
- Dussé and Kaliski, Eurocrypt '90
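The multiply, reduce, shift, and normalize steps can be sketched in software as below. The function names are assumptions, and Python's built-in modular inverse stands in for the extended Euclidean algorithm when computing M'.

```python
def montgomery_setup(M, n):
    """Return (R, M') with R = 2^n and M' = -M^-1 mod R, for odd M."""
    R = 1 << n
    M_prime = (-pow(M, -1, R)) % R
    return R, M_prime

def montgomery_multiply(X, Y, M, n):
    """Compute X*Y*R^-1 mod M using three multiplies, a mask, and a shift."""
    R, M_prime = montgomery_setup(M, n)
    Z = X * Y                          # multiply
    reduce = (Z * M_prime) & (R - 1)   # multiply, keep low n bits (mod R)
    Z = (Z + reduce * M) >> n          # multiply, exact division by R
    if Z >= M:                         # normalize
        Z -= M
    return Z
```

No division by M appears anywhere. The only reductions are a mask and a right shift, which is the point of the method.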
12Radix 2 Algorithm
- In radix 2, process one bit of X per step
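- Reduction becomes trivial because $M \bmod 2 = 1$
- Two multiplies and one shift per step
- $Z = 0$
- for $i = 0$ to $n-1$
  - $Z = Z + X_i \times Y$
  - $\text{reduce} = Z_0$ (trivial)
  - $Z = Z + \text{reduce} \times M$ (make $Z$ divisible by 2)
  - $Z = Z / 2$
- if $Z \ge M$ then $Z = Z - M$ (final mod $M$)
General algorithm for comparison: $Z = X \times Y$; $\text{reduce} = Z \times M' \bmod R$; $Z = (Z + \text{reduce} \times M) / R$; if $Z \ge M$ then $Z = Z - M$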
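A bit-serial software model of the radix-2 loop, assuming inputs X, Y < M, might look like this. The function name `mm_radix2` is an assumption, and the "multiplies" by a single bit are just conditional adds.

```python
def mm_radix2(X, Y, M, n):
    """Radix-2 Montgomery multiply: X*Y*2^-n mod M for odd n-bit M, X, Y < M."""
    Z = 0
    for i in range(n):
        Z += ((X >> i) & 1) * Y    # add Y if bit i of X is set
        reduce = Z & 1             # trivial because M mod 2 = 1
        Z += reduce * M            # make Z even
        Z >>= 1                    # exact division by 2
    if Z >= M:                     # final mod M
        Z -= M
    return Z
```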
13Final Modulo
- Result before last step in range
- $0 \le Z < 2M$
- Subtracting $M$ from $Z$ at the end is a hassle
- Allow $0 \le X, Y < 2M$ to avoid reduction
- Then if $R > 4M$, $0 \le Z < 2M$
- Hence add two bits to $R$ to avoid subtraction at end of each step
- Walter, Electronics Letters, 1999
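The bound can be checked in one line. This worked inequality assumes $0 \le X, Y < 2M$ and uses only the fact that the reduction value is less than $R$:

```latex
Z = \frac{XY + \mathrm{reduce}\cdot M}{R}
  < \frac{(2M)(2M) + R\,M}{R}
  = M\left(\frac{4M}{R} + 1\right)
  < 2M \qquad \text{when } R > 4M .
```

So the output of one multiplication is a legal input to the next, and the conditional subtraction is needed only once, after the last step.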
14Conversion
- Conversion of integers to/from Montgomery residues takes one MM operation (if $R^2 \bmod M$ is precomputed and saved)
- Modular exponentiation takes two conversion steps and 2n multiplication steps.
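The conversion trick can be sketched as follows: multiplying by $R^2 \bmod M$ in the Montgomery domain yields the residue, and multiplying by 1 converts back. The names `mont_mul` and `mont_exp` are assumptions. This version multiplies only on 1 bits, so it uses fewer than 2n multiplications and is not constant time.

```python
def mont_mul(a, b, M, n):
    """a*b*R^-1 mod M with R = 2^n, for odd M and a, b < M."""
    R = 1 << n
    M_prime = (-pow(M, -1, R)) % R
    Z = a * b
    q = (Z * M_prime) & (R - 1)
    Z = (Z + q * M) >> n
    return Z - M if Z >= M else Z

def mont_exp(A, e, M, n):
    """A^e mod M using Montgomery multiplications only."""
    R2 = pow(1 << n, 2, M)            # R^2 mod M, precomputed once per modulus
    A_bar = mont_mul(A, R2, M, n)     # convert in: A*R mod M
    Z_bar = mont_mul(1, R2, M, n)     # residue of 1
    for bit in bin(e)[2:]:
        Z_bar = mont_mul(Z_bar, Z_bar, M, n)
        if bit == "1":
            Z_bar = mont_mul(Z_bar, A_bar, M, n)
    return mont_mul(Z_bar, 1, M, n)   # convert out: multiply by 1
```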
15Reconfigurable Hardware
- Building hardwired n-bit unit is limiting
- Slow for large n
- Not scalable to different n
- Better to design for w-bit words
- Break n-bit operand into e w-bit words
- $e = \lceil n/w \rceil$
- This is called scalable
- Also handle both GF(p) and GF($2^n$)
- Requires conditionally killing carries
- Called unified
16Tenca-Koç Montgomery Multiplier
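- $Z = 0$
- for $i = 0$ to $n-1$
  - $(C_a, Z^0) = Z^0 + X_i \times Y^0$
  - $\text{reduce} = Z^0_0$
  - $(C_b, Z^0) = Z^0 + \text{reduce} \times M^0$
  - for $j = 1$ to $e$
    - $(C_a, Z^j) = Z^j + C_a + X_i \times Y^j$
    - $(C_b, Z^j) = Z^j + C_b + \text{reduce} \times M^j$
    - $Z^{j-1} = (Z^j_0, Z^{j-1}_{w-1:1})$
$M = (M^{e-1}, \ldots, M^1, M^0)$, $Y = (Y^{e-1}, \ldots, Y^1, Y^0)$, $Z = (Z^{e-1}, \ldots, Z^1, Z^0)$, $X = (X_{n-1}, \ldots, X_1, X_0)$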
Tenca, Koç, Trans. Computers, 2003
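The word-serial loop can be modelled in software as below. This is a sketch under stated assumptions, not the published design: it takes e as the number of w-bit words covering n+1 bits, pads Y and M with one zero word at index e, and adds an explicit shift of the top word after the inner loop. The name `mwr2mm` is also an assumption.

```python
def mwr2mm(X, Y, M, n, w):
    """Word-serial radix-2 Montgomery multiply: X*Y*2^-n mod M, for X, Y < M."""
    e = -(-(n + 1) // w)                 # words per operand (assumes n+1 bits)
    mask = (1 << w) - 1
    # Split Y and M into w-bit words, plus one zero word at index e.
    Yw = [(Y >> (w * j)) & mask for j in range(e + 1)]
    Mw = [(M >> (w * j)) & mask for j in range(e + 1)]
    Z = [0] * (e + 1)
    for i in range(n):
        xi = (X >> i) & 1
        t = Z[0] + xi * Yw[0]
        ca, Z[0] = t >> w, t & mask      # (Ca, Z^0)
        red = Z[0] & 1                   # reduce = lsb of Z^0
        t = Z[0] + red * Mw[0]
        cb, Z[0] = t >> w, t & mask      # (Cb, Z^0)
        for j in range(1, e + 1):
            t = Z[j] + ca + xi * Yw[j]
            ca, Z[j] = t >> w, t & mask
            t = Z[j] + cb + red * Mw[j]
            cb, Z[j] = t >> w, t & mask
            # Right shift by one bit across the word boundary.
            Z[j - 1] = ((Z[j] & 1) << (w - 1)) | (Z[j - 1] >> 1)
        Z[e] >>= 1                       # finish the shift in the top word
    result = sum(word << (w * j) for j, word in enumerate(Z))
    return result - M if result >= M else result
```

The same function works for any word width w, which is the sense in which the design is scalable.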
17Processing Elements
- Keep Z in carry-save redundant form
- $T_c = 2t_{AND} + 2t_{CSA} + t_{MUX} + t_{BUF}(w) + t_{REG}$
18Parallelism
- Two dimensions of parallelism
- Width of processing element w
- Number of pipelined PEs p
- Multiply takes $k = \lceil n/p \rceil$ kernel cycles
19Pipeline Timing
20Left shifting
- Don't wait two cycles for msb
- Kick off dependent operation right away on the available bits
- Take extra cycle(s) at the end to handle the extra bits
- For p processing elements, cycle count reduces from $2p$ to $p + (p/w)$
- Harris, Krishnamurthy, Anders, Mathew, Hsu, Arith 2005.
21Improved PE
- Left-shift M and Y
- Rather than right-shift Z
- Same PE, saves pipeline registers
22Pipeline Timing
23High Radix Scalable Montgomery Algorithm
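$n_1 = n + 1$, $n_2 = n + 2$, $e = \lceil n_1/w \rceil$, $f = \lceil n_2/v \rceil$
$X = (X^{f-1}, \ldots, X^1, X^0)$, $M = (M^{e-1}, \ldots, M^1, M^0)$, $Y = (Y^{e-1}, \ldots, Y^1, Y^0)$, $Z = (Z^{e-1}, \ldots, Z^1, Z^0)$
- $Z = 0$
- for $i = 0$ to $f-1$
  - $(C_a, Z^0) = Z^0 + X^i \times Y^0$
  - $Q = ({M'}^0 \times Z^0) \bmod 2^v$
  - $(C_b, Z^0) = Z^0 + Q \times M^0$
  - for $j = 1$ to $e-1$
    - $(C_a, Z^j) = Z^j + C_a + X^i \times Y^j$
    - $(C_b, Z^j) = Z^j + C_b + Q \times M^j$
    - $Z^{j-1} = (Z^j_{v-1:0}, Z^{j-1}_{w-1:v})$
Only two mults and a shift in the inner loop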
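Ignoring the word-by-word scan, the digit-level behaviour of the high-radix loop can be sketched as follows. The function name is an assumption, Z is held as one Python integer rather than as e words, and f is taken as the number of v-bit digits covering n + 2 bits.

```python
def mm_high_radix(X, Y, M, n, v):
    """Radix-2^v Montgomery multiply.

    Returns (Z, f) with Z congruent to X*Y*2^(-v*f) mod M.
    Inputs may lie in [0, 2M); M is an odd n-bit modulus.
    """
    mask = (1 << v) - 1
    M0_prime = (-pow(M, -1, 1 << v)) % (1 << v)   # low v bits of M'
    f = -(-(n + 2) // v)                           # digits of X to process
    Z = 0
    for i in range(f):
        Xi = (X >> (v * i)) & mask                 # v-bit digit of X
        Z += Xi * Y
        Q = (M0_prime * (Z & mask)) & mask         # quotient digit
        Z += Q * M                                 # low v bits become zero
        Z >>= v
    return Z, f
```

Because the effective R = 2^(v*f) exceeds 4M, the result stays below 2M and can feed the next multiplication without a final subtraction.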
24Parallel Operation
- Eliminate two of the steps
- Multiplication to compute Q
- By precomputing $\tilde{M} = (M' \bmod R) \times M$
- Dependency of $Z^0$ on $Q$
- By prescaling $X$ by $2^v$ so $Z^0 = 0$
- Math proposed by Orup, Arith 1995
- But no scalable hardware
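25Improvement 1: Eliminate Multiply
General form: $Z = X \times Y$; $Q = Z \times M' \bmod R$; $Z = (Z + Q \times M) / R$
- $Z = 0$
- for $i = 0$ to $f-1$
  - $Z = Z + X^i \times Y$
  - $Z = Z + Z^0 \times \tilde{M}$, where $\tilde{M} = (M' \bmod 2^v) \times M$ (per-digit $R = 2^v$)
  - $Z = Z / 2^v$
With precomputation: $\tilde{M} = (M' \bmod R) \times M$ (precompute); $Z = X \times Y$; $Z = (Z + Z \times \tilde{M}) / R$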
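26Improvement 2: Prescale X by $2^v$
Before: $Z = 0$; for $i = 0$ to $f-1$: $Z = Z + X^i \times Y$; $Z = Z + Z^0 \times \tilde{M}$; $Z = Z / 2^v$
One more iteration
- $Z = 0$
- for $i = 0$ to $f$
  - $Z = Z + 2^v X^i \times Y + Z^0 \times \tilde{M}$
  - $Z = Z / 2^v$
- $Z = 0$
- for $i = 0$ to $f$
  - $Z = (Z + Z^0 \times \tilde{M}) / 2^v + X^i \times Y$
Because $Z^0$ is independent of $2^v X^i$
Final result in range $0 \le Z < 2^{n+v+1}$; avoid final small mod in successive mults by using larger $n_2 = n + 2 + v$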
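27Improvement 3: Avoid LSW add
Before:
- $Z = 0$
- for $i = 0$ to $f$
  - $Z = (Z + Z^0 \times \tilde{M}) / 2^v + X^i \times Y$
After:
- $Z = 0$
- for $i = 0$ to $f$
  - $Q = Z^0$
  - $Z = (Z \gg v) + Q \times \hat{M} + X^i \times Y$
$(Z + Z^0 \times \tilde{M}) / 2^v = (Z \gg v) + (Z^0 \times \tilde{M} + Z \bmod 2^v) / 2^v = (Z \gg v) + (Z^0 \times (\tilde{M} + 1)) / 2^v = (Z \gg v) + Z^0 \times \hat{M}$
$\hat{M} = (\tilde{M} + 1) / 2^v$
$\tilde{M} \equiv M'M \equiv -1 \bmod 2^v$, so $\tilde{M} + 1$ is divisible by $2^v$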
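The three improvements combine into a loop whose quotient digit is simply the low v bits of Z. A sketch under stated assumptions: the names `M_tilde` and `M_hat` label the two precomputed constants, f is taken as the number of v-bit digits covering n + 2 + v bits, and the extra iteration runs with a zero digit of X.

```python
def mm_parallel(X, Y, M, n, v):
    """Parallel high-radix Montgomery multiply.

    Returns (Z, f) with Z congruent to X*Y*2^(-v*f) mod M.
    M is an odd n-bit modulus and X must be below 2^(v*f).
    """
    mask = (1 << v) - 1
    M_prime_low = (-pow(M, -1, 1 << v)) % (1 << v)   # M' mod 2^v
    M_tilde = M_prime_low * M                        # congruent to -1 mod 2^v
    M_hat = (M_tilde + 1) >> v                       # exact division by 2^v
    f = -(-(n + 2 + v) // v)
    Z = 0
    for i in range(f + 1):                           # one more iteration
        Q = Z & mask                                 # no multiply needed for Q
        Xi = (X >> (v * i)) & mask
        # The two products are independent, so hardware can form them in parallel.
        Z = (Z >> v) + Q * M_hat + Xi * Y
    return Z, f
```

The price is a larger effective R and one extra iteration. In exchange, Q no longer depends on a multiplication or on the X digit being added.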
28Scalable Parallel High Radix
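- $Z = 0$
- for $i = 0$ to $f-1$
  - $C = 0$
  - $Q = Z^0_{v-1:0}$
  - for $j = 0$ to $e-1$
    - $(C, Z^j) = Z^j + C + Q \times \hat{M}^j + X^i \times Y^j$
    - $Z^{j-1} = (Z^j_{v-1:0}, Z^{j-1}_{w-1:v})$
$n_1 = n + 1 + v$, $n_2 = n + 2 + v$, $e = \lceil n_1/w \rceil$, $f = \lceil n_2/v \rceil$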
29Choice of Radix
- Radix 2: 1 bit of X per step
- Radix $2^v$: $v$ bits of X per step
- Requires 0, Y, 2Y, 3Y, ..., $(2^v - 1)Y$ multiples
- Very High Radix ($v > 4$)
- Our prior work
- Requires $w \times v$-bit multipliers to compute w bits of Y times v bits of X
- Efficient for FPGAs with dedicated 16 x 16 multipliers, but perhaps not for custom silicon
- High Radix: radix 4, 8, 16
- This work
30Design Space Exploration
- Parallel hardware
- Left vs. right shifting
- Choice of radix
- Booth encoding?
- Design Points
- Parallel Radix-4 left shift non-Booth
- Parallel Radix-4 left shift Booth
- Parallel Radix-4 right shift Booth
- Parallel Radix-8 right shift Booth
31Booth Encoding
- Radix 4
- Non-Booth: 0, Y, 2Y, 3Y
- Booth: -2Y, -Y, 0, Y, 2Y
- Radix 8
- Non-Booth: 0, Y, 2Y, 3Y, 4Y, 5Y, 6Y, 7Y
- Booth: -4Y, -3Y, -2Y, -Y, 0, Y, 2Y, 3Y, 4Y
- Hard multiples must be precomputed
- w-bit CPAs at output of operand memory
- Registers to pipe along values
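The radix-4 Booth digit set can be illustrated with a small recoder. This is the textbook recoding rule for an unsigned operand, shown as a sketch rather than the encoder used in these designs. The function name is an assumption.

```python
def booth_radix4(x, nbits):
    """Recode unsigned x (< 2**nbits) into radix-4 digits in {-2, ..., 2}.

    Digits are least significant first; sum(d * 4**i) equals x.
    """
    digits = []
    prev = 0                               # bit just below the current pair
    for i in range(0, nbits, 2):
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # value in {-2, -1, 0, 1, 2}
        prev = b1
    digits.append(prev)                    # extra digit absorbs the top bit
    return digits
```

Every digit selects one of -2Y, -Y, 0, Y, 2Y, all of which are shifts and negations of Y, so the hard multiple 3Y never has to be precomputed.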
32Radix-4 Left-Shifting Non-Booth
33Radix-4 Left-Shifting Booth
34Extra Bits in Least Significant Word
- Booth encoding: $S_X$, $S_Q$
- Carry out of discarded v lsbs
- Parallel design has nonzero lsbs
- Only two CSAs available to add 3 bits
- Use $H = Z_{s,1:0} + Z_{c,1:0} + S_Q$
- Inject $S_X$ and $H$ into CSA vacancies
35Radix-4 Right-Shifting Booth
36Radix-8 Right-Shifting Booth
37Synthesis Results
- Target Xilinx XC2V2000-6
- 21504 LUTs
38Exponentiation Times
39Cautions
- Choose $n_2$ to be a multiple of $vp$ so results can be read from last PE
- Reset registers properly between steps
- Sequencer is big source of bugs
- High radix starts to get messy to implement
40Sponsored Research Outcomes
- Kyle Kelley
- Completing 3rd year of Stanford Ph.D.
- Ted Jiang
- Completing 1st year of Stanford Ph.D. program with NSF Fellowship
- John Parker
- Completing 1st year of UCSB Ph.D. program in Integrated Photonics
- Nathaniel Pinckney
- Accepted to U. Michigan
- Philip Amberg
- Accepted to Stanford University M.S. program
- Publications
- 6 conference papers published
- 1 conference paper accepted (Asilomar 08)
- 1 conference paper under review (VLSI SOC 08)
- 1 journal paper under review (JICS)
41Publications
- Published (Conference)
- Pinckney, Nathaniel, and Harris, David, "Parallelized Radix-4 Scalable Montgomery Multipliers," 20th Symp. on Integrated Circuits and Systems Design, Sept. 2007.
- Jiang, Nan, and Harris, David, "Parallelized Radix-2 Scalable Montgomery Multiplier," VLSI SOC, 2007.
- Jiang, Nan, and Harris, David, "Quotient Pipelined Very High Radix Scalable Montgomery Multipliers," Asilomar Conf. Signals, Systems, and Computers, Nov. 2006.
- Kelley, Kyle, and Harris, David, "Parallelized Very High Radix Scalable Montgomery Multipliers," Asilomar Conf. Signals, Systems, and Computers, Nov. 2005, pp. 1196-1200.
- Kelley, Kyle, and Harris, David, "Very High Radix Scalable Montgomery Multipliers," 5th Intl. Workshop on System-on-Chip, July 2005, pp. 400-404.
- Harris, David, Krishnamurthy, Ram, Anders, Mark, Mathew, Sanu, and Hsu, Steven, "An Improved Unified Scalable Radix-2 Montgomery Multiplier," IEEE Symposium on Computer Arithmetic, June 2005, pp. 172-178.
- In Progress (Conference)
- Pinckney, N., Amberg, P., and Harris, D., "Parallelized Booth-Encoded Radix-4 Montgomery Multipliers," submitted to VLSI SOC 2008.
- Amberg, P., Pinckney, N., and Harris, D., "Parallelized Booth-Encoded Radix-8 Montgomery Multipliers," invited paper for Asilomar Conf. Signals, Systems, and Computers 2008.
- Submitted (Journal)
- Pinckney, Nathaniel, and Harris, David, "Parallelized Radix-4 Scalable Montgomery Multipliers," submitted to JICS, 2007.
- Harris, David, Jiang, Ted, Kelley, Kyle, Pinckney, Nathaniel, and Parker, Jon, "Very High Radix Montgomery Multipliers," submitted to IEEE Transactions on Computers.
42Conclusions
- High Radix Design Space Exploration
- Suitable for custom circuit application
- Parallel always helpful for HW simplicity and cycle time
- Radix-4 better than Radix-2
- Left-shifting parallel non-Booth performs best
- If cycle is slower choose left-shifting Booth
- Radix-8 is less efficient than radix-4
- High Performance Architectures
- Choose w as big as cycle time permits
- Bigger w saves flops at small cost in clock rate
- Then use as many PEs p as budget permits