1
High Radix Scalable Montgomery Multipliers
David Money Harris, Nathaniel Pinckney, Philip Amberg
Harvey Mudd College, Claremont, CA
Supported by Intel Circuit Research Labs
2
Outline
  • RSA Encryption
  • Montgomery Multiplication
  • Radix 2 Implementation
  • Scalable Hardware
  • Left Shifting
  • Higher Radix
  • Parallel Algorithm
  • High Radix Implementations
  • Results

3
RSA Encryption
  • Most widely used public key system.
  • Good for encryption and signatures.
  • Invented by Rivest, Shamir, Adleman (1978)
  • Public e and private d keys are long numbers
  • n = 256-2048 bits
  • Satisfy $x^{de} \bmod M = x$ for all x
  • Finding d from e is as hard as factoring M
  • Encryption: $B = A^e \bmod M$
  • Decryption: $C = B^d \bmod M = A^{ed} \bmod M = A$

4
Cryptographic Algorithms
  • DES, AES
  • Symmetric key algorithms
  • Require exchange of secret key
  • Computationally efficient
  • RSA, ECC
  • Public key algorithms
  • No key exchange needed (e.g. ecommerce)
  • Computationally expensive
  • Use public key to exchange symmetric key

5
Modular Exponentiation
  • Critical operation in RSA and for
  • Digital signature algorithm
  • Diffie-Hellman key exchange
  • SSL, IPSec, IPv6
  • Elliptic curve cryptosystems
  • Done with modular multiplications
  • Ex: $A^{27} = ((((A^2 \cdot A)^2)^2 \cdot A)^2) \cdot A$
  • Division after each multiplication to compute
    modulo
  • Maximum 2n, average 1.5n mults needed
  • But always use 2n to avoid timing attacks
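
The square-and-multiply idea behind the $A^{27}$ example can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the function name `modexp` and the multiplication counter are assumptions added here, and the plain `%` operator stands in for the per-step division that Montgomery multiplication later removes.

```python
def modexp(a, e, m):
    """Left-to-right square-and-multiply.

    Returns (a**e mod m, number of modular multiplications used).
    """
    result = 1
    mults = 0
    for bit in bin(e)[2:]:               # exponent bits, most significant first
        result = (result * result) % m   # square once per exponent bit
        mults += 1
        if bit == "1":
            result = (result * a) % m    # extra multiply only for 1 bits
            mults += 1
    return result, mults


if __name__ == "__main__":
    value, count = modexp(7, 27, 1009)
    print(value == pow(7, 27, 1009), count)
```

For an n-bit exponent this uses at most 2n multiplications. A constant-time variant would perform the multiply for every bit and discard the result when the bit is 0, which is the "always use 2n" point on the slide.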

6
Montgomery Multiplication
  • Faster way to do modular exponentiation
  • Operate on Montgomery residues
  • Division becomes a simple shift
  • Requires conversion to and from residues only
    once per exponentiation

7
Montgomery Residues
  • Let the modulus M be an odd n-bit integer
  • $2^{n-1} < M < 2^n$
  • Define $R = 2^n$
  • Define the M-residue of an integer $A < M$ as
  • $\bar{A} = AR \bmod M$
  • There is a one-to-one correspondence between
    integers and M-residues for
  • $0 \le A \le M-1$

8
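Montgomery Multiplication
  • Define
  • $\bar{Z} = \mathrm{MM}(\bar{X}, \bar{Y}) = \bar{X}\,\bar{Y}\,R^{-1} \bmod M$
  • Where $R^{-1}$ is the inverse of R mod M
  • $R^{-1} R \equiv 1 \pmod{M}$
  • Montgomery Mult finds residue of $Z = XY \bmod M$
  • $\bar{Z} = \bar{X}\,\bar{Y}\,R^{-1} \bmod M$
  • $= (XR)(YR)R^{-1} \bmod M$
  • $= XYR \bmod M$
  • $= ZR \bmod M$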
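
The residue definition and the MM identity can be checked numerically. The sketch below evaluates MM straight from its definition with an explicit $R^{-1}$, which is convenient for checking but is not how hardware computes it. The modulus 239 and the helper names `to_residue` and `mm` are assumptions chosen for illustration.

```python
M = 239                   # odd modulus (example value), 2**7 < 239 < 2**8
n = M.bit_length()        # n = 8
R = 1 << n                # R = 2**n = 256
R_inv = pow(R, -1, M)     # inverse of R mod M (Python 3.8+)


def to_residue(a):
    """M-residue of a: a * R mod M."""
    return (a * R) % M


def mm(x_bar, y_bar):
    """Montgomery product by definition: x_bar * y_bar * R^-1 mod M."""
    return (x_bar * y_bar * R_inv) % M


if __name__ == "__main__":
    X, Y = 100, 57
    z_bar = mm(to_residue(X), to_residue(Y))
    # The Montgomery product of two residues is the residue of X*Y mod M.
    print(z_bar == to_residue((X * Y) % M))
```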

9
Montgomery Reduction
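  • Precompute $M'$ satisfying $RR^{-1} - MM' = 1$
  • Convert mult and mod to 3 mult and shift
  • Multiply: $Z = X \times Y$
  • Reduce: $\mathit{reduce} = Z \times M' \bmod R$
  • $Z = (Z + \mathit{reduce} \times M) / R$
  • Normalize: if $Z \ge M$ then $Z = Z - M$
  • Why is $Z + \mathit{reduce} \times M$ divisible by R?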

[Slide callouts on the steps: Mult; Drop bits; Mult; Mult; Shift for $R^{-1}$]
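
The multiply, reduce, shift, normalize sequence can be sketched as below. This is a hedged illustration: the names `montgomery_setup` and `mont_mul` are made up here, and $M'$ is derived directly from the identity $RR^{-1} - MM' = 1$ using Python's built-in modular inverse.

```python
def montgomery_setup(M):
    """Return (n, R, M_prime) for an odd modulus M, with R = 2**n."""
    n = M.bit_length()
    R = 1 << n
    R_inv = pow(R, -1, M)                # Python 3.8+
    M_prime = (R * R_inv - 1) // M       # from R*R^-1 - M*M' = 1
    return n, R, M_prime


def mont_mul(X, Y, M, n, M_prime):
    """Return X * Y * R^-1 mod M for 0 <= X, Y < M."""
    r_mask = (1 << n) - 1
    Z = X * Y                            # mult 1
    reduce = (Z * M_prime) & r_mask      # mult 2, keep low n bits (mod R)
    Z = (Z + reduce * M) >> n            # mult 3, then exact shift by n bits
    if Z >= M:                           # normalize into [0, M)
        Z -= M
    return Z


if __name__ == "__main__":
    M = 239
    n, R, M_prime = montgomery_setup(M)
    print(mont_mul(100, 57, M, n, M_prime) == (100 * 57 * pow(R, -1, M)) % M)
```

The shift is exact because `Z + reduce * M` has its low n bits equal to zero.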
10
Reduction Proof
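  • $(Z + \mathit{reduce} \times M) \bmod R$
  • $= (Z + (Z M' \bmod R)\,M) \bmod R$
  • $= (Z + Z M' M) \bmod R$
  • $= (Z + Z(RR^{-1} - 1)) \bmod R$
  • $= Z R R^{-1} \bmod R$
  • $= 0 \bmod R$
  • So $Z + \mathit{reduce} \times M$ is divisible by R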

11
More Comments on M'
  • $RR^{-1} - MM' = 1$
  • Implies $M' \equiv -M^{-1} \bmod R$
  • $M'$ is odd
  • $M'$ is precomputed from M using the extended
    Euclidean algorithm
  • $M'$ is held constant over many mults
  • Only least significant v bits of $M'$ are needed
    when computing in radix $2^v$
  • Dussé and Kaliski, Eurocrypt '90
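
A small check of these properties, assuming Python's built-in modular inverse as a stand-in for the extended Euclidean algorithm. The helper name `m_prime` and the example modulus are hypothetical.

```python
def m_prime(M, bits):
    """Return M' = -M^(-1) mod 2**bits for an odd modulus M."""
    modulus = 1 << bits
    return (-pow(M % modulus, -1, modulus)) % modulus


if __name__ == "__main__":
    M = 0xC5A7                       # odd 16-bit example modulus
    n = M.bit_length()
    full = m_prime(M, n)             # M' for R = 2**n
    print((M * full + 1) % (1 << n) == 0)   # M*M' = -1 (mod R)
    print(full % 2 == 1)                    # M' is odd
    v = 4
    # The low v bits of M' can be computed from the low v bits of M alone.
    print(m_prime(M, v) == full % (1 << v))
```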

12
Radix 2 Algorithm
  • In radix 2, process one bit of X per step
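  • Reduction becomes trivial because $M' \bmod 2 = 1$
  • Two multiplies and one shift per step
  • $Z = 0$
  • for $i = 0$ to $n-1$
      • $Z = Z + X_i \times Y$
      • $\mathit{reduce} = Z_0$ (trivial)
      • $Z = Z + \mathit{reduce} \times M$ (make Z divisible by 2)
      • $Z = Z / 2$
  • if $Z \ge M$ then $Z = Z - M$ (final mod M)

Full-width algorithm for comparison: $Z = X \times Y$; $\mathit{reduce} = Z \times M' \bmod R$; $Z = (Z + \mathit{reduce} \times M) / R$; if $Z \ge M$ then $Z = Z - M$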
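
The bit-serial loop translates almost line for line into Python. This is a sketch under the slide's assumptions (odd M, inputs below M); the function name `mont_mul_radix2` is illustrative.

```python
def mont_mul_radix2(X, Y, M, n):
    """Return X * Y * 2^-n mod M for odd M and 0 <= X, Y < M < 2**n."""
    Z = 0
    for i in range(n):
        x_i = (X >> i) & 1      # bit i of X
        Z += x_i * Y            # the "multiply" is just an AND per bit
        reduce = Z & 1          # trivial reduction: low bit of Z
        Z += reduce * M         # makes Z even, since M is odd
        Z >>= 1                 # exact division by 2
    if Z >= M:                  # final mod M
        Z -= M
    return Z


if __name__ == "__main__":
    print(mont_mul_radix2(5, 7, 13, 4))   # 5 * 7 * 16^-1 mod 13
```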
13
Final Modulo
  • Result before last step in range
  • $0 \le Z < 2M$
  • The final reduction $Z - M$ is a hassle
  • Allow $0 \le X, Y < 2M$ to avoid reduction
  • Then if $R > 4M$, $0 \le Z < 2M$
  • Hence add two bits to R to avoid subtraction at
    end of each step
  • Walter, Electronics Letters, 1999
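
The bound can be checked exhaustively for a small modulus. The sketch below runs the radix-2 loop for n + 2 steps, so that $R = 2^{n+2} > 4M$, and omits the final subtraction. The function name and the test modulus 13 are assumptions for illustration.

```python
def mont_mul_no_sub(X, Y, M, n):
    """Return Z = X * Y * 2^-(n+2) (mod M) with 0 <= Z < 2M.

    Assumes odd M < 2**n and 0 <= X, Y < 2M; no final subtraction.
    """
    Z = 0
    for i in range(n + 2):          # two extra steps: R = 2**(n+2) > 4M
        Z += ((X >> i) & 1) * Y
        Z += (Z & 1) * M            # make Z even
        Z >>= 1
    return Z


if __name__ == "__main__":
    M, n = 13, 4
    R_inv = pow(1 << (n + 2), -1, M)
    ok = all(
        mont_mul_no_sub(X, Y, M, n) < 2 * M
        and mont_mul_no_sub(X, Y, M, n) % M == (X * Y * R_inv) % M
        for X in range(2 * M)
        for Y in range(2 * M)
    )
    print(ok)
```

Because outputs stay below 2M, they can be fed directly into the next multiplication.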

14
Conversion
  • Conversion of integers to/from Montgomery
    residues takes one MM operation (if $R^2 \bmod M$ is
    precomputed and saved)
  • Modular exponentiation takes two conversion steps
    and 2n multiplication steps.
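
Putting conversion and exponentiation together, a minimal sketch might look like this. The function name `mont_exp` is hypothetical, and $R^2 \bmod M$ and $R \bmod M$ are computed inline here where a real design would precompute and store them.

```python
def mont_exp(A, E, M):
    """Return A**E mod M for odd M, using Montgomery multiplication."""
    n = M.bit_length()
    R = 1 << n
    M_prime = (-pow(M, -1, R)) % R        # M' = -M^-1 mod R (Python 3.8+)

    def mm(x, y):
        """x * y * R^-1 mod M for 0 <= x, y < M."""
        z = x * y
        q = (z * M_prime) & (R - 1)
        z = (z + q * M) >> n
        return z - M if z >= M else z

    r2 = (R * R) % M                      # assumed precomputed in practice
    a_bar = mm(A % M, r2)                 # convert in: A * R mod M
    z_bar = R % M                         # residue of 1
    for bit in bin(E)[2:]:                # square-and-multiply on residues
        z_bar = mm(z_bar, z_bar)
        if bit == "1":
            z_bar = mm(z_bar, a_bar)
    return mm(z_bar, 1)                   # convert out: multiply by R^-1


if __name__ == "__main__":
    print(mont_exp(7, 27, 1009) == pow(7, 27, 1009))
```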

15
Reconfigurable Hardware
  • Building hardwired n-bit unit is limiting
  • Slow for large n
  • Not scalable to different n
  • Better to design for w-bit words
  • Break n-bit operand into e w-bit words
  • $e = n/w$
  • This is called scalable
  • Also handle both GF(p) and $GF(2^n)$
  • Requires conditionally killing carries
  • Called unified

16
Tenca-Koç Montgomery Multiplier
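  • $Z = 0$
  • for $i = 0$ to $n-1$
      • $(C_a, Z^{(0)}) = Z^{(0)} + X_i \times Y^{(0)}$
      • $\mathit{reduce} = Z^{(0)}_0$
      • $(C_b, Z^{(0)}) = Z^{(0)} + \mathit{reduce} \times M^{(0)}$
      • for $j = 1$ to $e$
          • $(C_a, Z^{(j)}) = Z^{(j)} + C_a + X_i \times Y^{(j)}$
          • $(C_b, Z^{(j)}) = Z^{(j)} + C_b + \mathit{reduce} \times M^{(j)}$
          • $Z^{(j-1)} = (Z^{(j)}_0, Z^{(j-1)}_{w-1:1})$

$M = (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)})$, $Y = (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)})$, $Z = (Z^{(e-1)}, \ldots, Z^{(1)}, Z^{(0)})$, $X = (X_{n-1}, \ldots, X_1, X_0)$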
Tenca, Koç, Trans. Computers, 2003
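
A software model of the word-serial loop is sketched below, with Python integers standing in for w-bit words. Several details are assumptions made for this sketch rather than taken from the slide: the word count is taken as `e = ceil((n + 1) / w)`, an extra zero word is appended to Y and M so the inner loop can run to j = e, and the top word of Z is shifted after the inner loop to complete the right shift.

```python
def mwr2mm(X, Y, M, n, w):
    """Word-serial radix-2 Montgomery product X * Y * 2^-n mod M.

    Assumes odd M < 2**n and 0 <= X, Y < M; words are w bits wide.
    """
    e = -(-(n + 1) // w)                          # ceil((n + 1) / w), assumed
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e + 1)]   # word e is zero
    Mw = [(M >> (w * j)) & mask for j in range(e + 1)]
    Z = [0] * (e + 1)
    for i in range(n):
        x_i = (X >> i) & 1
        t = Z[0] + x_i * Yw[0]
        ca, Z[0] = t >> w, t & mask
        reduce = Z[0] & 1                         # low bit decides whether M is added
        t = Z[0] + reduce * Mw[0]
        cb, Z[0] = t >> w, t & mask
        for j in range(1, e + 1):
            t = Z[j] + ca + x_i * Yw[j]
            ca, Z[j] = t >> w, t & mask
            t = Z[j] + cb + reduce * Mw[j]
            cb, Z[j] = t >> w, t & mask
            # Shift right by one bit across the word boundary.
            Z[j - 1] = ((Z[j] & 1) << (w - 1)) | (Z[j - 1] >> 1)
        Z[e] >>= 1                                # finish the shift in the top word
    result = sum(word << (w * j) for j, word in enumerate(Z))
    return result - M if result >= M else result


if __name__ == "__main__":
    print(mwr2mm(5, 7, 13, 4, 2))   # 5 * 7 * 16^-1 mod 13
```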
17
Processing Elements
  • Keep Z in carry-save redundant form
  • $T_c = 2t_{AND} + 2t_{CSA} + t_{MUX} + t_{BUF}(w) + t_{REG}$

18
Parallelism
  • Two dimensions of parallelism
  • Width of processing element w
  • Number of pipelined PEs p
  • Multiply takes $k = n/p$ kernel cycles

19
Pipeline Timing
20
Left shifting
  • Don't wait two cycles for msb
  • Kick off dependent operation right away on the
    available bits
  • Take extra cycle(s) at the end to handle the
    extra bits
  • For p processing elements, cycle count reduces
    from $2p$ to $p + (p/w)$
  • Harris, Krishnamurthy, Anders, Mathew, Hsu, Arith
    2005.

21
Improved PE
  • Left-shift M and Y
  • Rather than right-shift Z
  • Same PE, saves pipeline registers

22
Pipeline Timing
23
High Radix Scalable Montgomery Algorithm
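$n_1 = n + 1$, $n_2 = n + 2$, $e = n_1/w$, $f = n_2/v$
$X = (X^{(f-1)}, \ldots, X^{(1)}, X^{(0)})$, $M = (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)})$, $Y = (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)})$, $Z = (Z^{(e-1)}, \ldots, Z^{(1)}, Z^{(0)})$
  • $Z = 0$
  • for $i = 0$ to $f-1$
      • $(C_a, Z^{(0)}) = Z^{(0)} + X^{(i)} \times Y^{(0)}$
      • $Q = (M'^{(0)} \times Z^{(0)}) \bmod 2^v$
      • $(C_b, Z^{(0)}) = Z^{(0)} + Q \times M^{(0)}$
      • for $j = 1$ to $e-1$
          • $(C_a, Z^{(j)}) = Z^{(j)} + C_a + X^{(i)} \times Y^{(j)}$
          • $(C_b, Z^{(j)}) = Z^{(j)} + C_b + Q \times M^{(j)}$
          • $Z^{(j-1)} = (Z^{(j)}_{v-1:0}, Z^{(j-1)}_{w-1:v})$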

Only two mults and a shift in the inner loop
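
To show the arithmetic without the word-level bookkeeping, the sketch below collapses the w-bit words into full-width Python integers and processes one v-bit digit of X per step. It is an illustration under stated assumptions, not the scalable hardware algorithm itself: the digit count is taken as `f = ceil((n + 2) / v)`, inputs are allowed up to 2M, and the function name is made up.

```python
def mont_mul_radix_2v(X, Y, M, n, v):
    """Return Z = X * Y * 2^-(v*f) (mod M) with 0 <= Z < 2M.

    Assumes odd M < 2**n and 0 <= X, Y < 2M; f = ceil((n + 2) / v).
    """
    f = -(-(n + 2) // v)
    mask = (1 << v) - 1
    m_prime0 = (-pow(M, -1, 1 << v)) % (1 << v)   # low v bits of M'
    Z = 0
    for i in range(f):
        x_i = (X >> (v * i)) & mask               # digit i of X
        Z += x_i * Y
        Q = (m_prime0 * (Z & mask)) & mask        # quotient digit
        Z += Q * M                                # clears the low v bits
        Z >>= v
    return Z


if __name__ == "__main__":
    print(mont_mul_radix_2v(5, 7, 13, 4, 2))
```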
24
Parallel Operation
  • Eliminate two of the steps
  • Multiplication to compute Q
  • By precomputing $\hat{M} = (M' \bmod R) \times M$
  • Dependency of $Z_0$ on Q
  • By prescaling X by $2^v$ so the $X_i Y$ term adds 0 to $Z_0$
  • Math proposed by Orup, Arith '95
  • But no scalable hardware


25
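Improvement 1: Eliminate Multiply
Before: $Z = X \times Y$; $Q = Z \times M' \bmod R$; $Z = (Z + Q \times M) / R$
  • $Z = 0$
  • for $i = 0$ to $f-1$
      • $Z = Z + X_i \times Y$
      • $Z = Z + Z_0 \times \hat{M}$, where $\hat{M} = (M' \bmod 2^v) \times M$
      • $Z = Z / 2^v$


After: $\hat{M} = (M' \bmod R) \times M$ (precompute); $Z = X \times Y$; $Z = (Z + Z \times \hat{M}) / R$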
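
A full-width sketch of this step, where the low digit of Z is used directly as the quotient digit and the constant $\hat{M} = (M' \bmod 2^v) \times M$ absorbs the multiplication by $M'$. The function name and the choice `f = ceil((n + 2) / v)` are assumptions. Note that the result is no longer bounded by 2M, because $\hat{M}$ is v bits longer than M.

```python
def mont_mul_mhat(X, Y, M, n, v):
    """Return Z = X * Y * 2^-(v*f) (mod M), using Q = Z_0 and M_hat.

    Assumes odd M < 2**n and 0 <= X, Y < M; f = ceil((n + 2) / v).
    """
    f = -(-(n + 2) // v)
    mask = (1 << v) - 1
    m_prime0 = (-pow(M, -1, 1 << v)) % (1 << v)   # M' mod 2**v
    M_hat = m_prime0 * M                          # M_hat = -1 (mod 2**v)
    Z = 0
    for i in range(f):
        x_i = (X >> (v * i)) & mask
        Z += x_i * Y
        Z += (Z & mask) * M_hat                   # no multiply by M' needed
        Z >>= v                                   # exact: low v bits are zero
    return Z


if __name__ == "__main__":
    print(mont_mul_mhat(5, 7, 13, 4, 2) % 13)
```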



26
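Improvement 2: Prescale X by $2^v$
Before: $Z = 0$; for $i = 0$ to $f-1$: $Z = Z + X_i \times Y$; $Z = Z + Z_0 \times \hat{M}$; $Z = Z / 2^v$
One more iteration
  • $Z = 0$
  • for $i = 0$ to $f$
      • $Z = Z + 2^v X_i \times Y + Z_0 \times \hat{M}$
      • $Z = Z / 2^v$
  • $Z = 0$
  • for $i = 0$ to $f$
      • $Z = (Z + Z_0 \times \hat{M}) / 2^v + X_i \times Y$



Because $Z_0$ is independent of $2^v X_i$

Final result in range $0 \le Z < 2^{n+v+1}$; avoid
final small mod in successive mults by using
larger $n_2 = n + 2 + v$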
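
The reordered loop can be modeled as follows. This is a hedged sketch: `f` is taken as `ceil((n + 2) / v)`, the loop runs f + 1 times with the extra digit of X equal to zero, and the function name is illustrative. The quotient digit now depends only on the previous Z, so the $X_i \times Y$ product can be formed in parallel with the reduction.

```python
def mont_mul_prescaled(X, Y, M, n, v):
    """Return Z = X * Y * 2^-(v*f) (mod M), with X prescaled by 2**v.

    Assumes odd M < 2**n and 0 <= X, Y < M; f = ceil((n + 2) / v).
    """
    f = -(-(n + 2) // v)
    mask = (1 << v) - 1
    M_hat = ((-pow(M, -1, 1 << v)) % (1 << v)) * M
    Z = 0
    for i in range(f + 1):                      # one more iteration
        x_i = (X >> (v * i)) & mask             # digit f of X is zero
        # The reduction uses only the old Z; x_i * Y is added after the shift.
        Z = ((Z + (Z & mask) * M_hat) >> v) + x_i * Y
    return Z


if __name__ == "__main__":
    print(mont_mul_prescaled(5, 7, 13, 4, 2) % 13)
```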
27
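Improvement 3: Avoid LSW add
  • $Z = 0$
  • for $i = 0$ to $f$
      • $Z = (Z + Z_0 \times \hat{M}) / 2^v + X_i \times Y$
  • $Z = 0$
  • for $i = 0$ to $f$
      • $Q = Z_0$
      • $Z = (Z \gg v) + Q \times \tilde{M} + X_i \times Y$



$(Z + Z_0 \hat{M}) / 2^v = (Z \gg v) + (Z_0 \hat{M} + Z \bmod 2^v) / 2^v = (Z \gg v) + Z_0 (\hat{M} + 1) / 2^v = (Z \gg v) + Z_0 \tilde{M}$



$\tilde{M} = (\hat{M} + 1) / 2^v$

$\hat{M} \equiv M'M \equiv -1 \bmod 2^v$, so $\hat{M} + 1$ is divisible by $2^v$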
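
The identity can be exercised with a short full-width model. The symbols `M_hat` for $(M' \bmod 2^v) \times M$ and `M_tilde` for $(\hat{M} + 1)/2^v$, the function name, and `f = ceil((n + 2) / v)` are all assumptions made for this sketch.

```python
def mont_mul_parallel(X, Y, M, n, v):
    """Return Z = X * Y * 2^-(v*f) (mod M) without adding into the low digit.

    Assumes odd M < 2**n and 0 <= X, Y < M; f = ceil((n + 2) / v).
    """
    f = -(-(n + 2) // v)
    mask = (1 << v) - 1
    M_hat = ((-pow(M, -1, 1 << v)) % (1 << v)) * M   # M_hat = -1 (mod 2**v)
    M_tilde = (M_hat + 1) >> v                       # exact division
    Z = 0
    for i in range(f + 1):
        x_i = (X >> (v * i)) & mask                  # digit f of X is zero
        Q = Z & mask                                 # quotient digit: low v bits
        Z = (Z >> v) + Q * M_tilde + x_i * Y         # shift, then add both products
    return Z


if __name__ == "__main__":
    print(mont_mul_parallel(5, 7, 13, 4, 2))
```

For M = 13 and v = 2 this gives `M_hat = 39` and `M_tilde = 10`, and the intermediate values of Z match the loop that divides $(Z + Z_0 \hat{M})$ by $2^v$.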

28
Scalable Parallel High Radix
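  • $Z = 0$
  • for $i = 0$ to $f-1$
      • $C = 0$
      • $Q = Z_0$ (low v bits of Z)
      • for $j = 0$ to $e-1$
          • $(C, Z^{(j)}) = Z^{(j)} + C + Q \times \tilde{M}^{(j)} + X^{(i)} \times Y^{(j)}$
          • $Z^{(j-1)} = (Z^{(j)}_{v-1:0}, Z^{(j-1)}_{w-1:v})$

$n_1 = n + 1 + v$, $n_2 = n + 2 + v$, $e = n_1/w$, $f = n_2/v$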
29
Choice of Radix
  • Radix 2: 1 bit of X per step
  • Radix $2^v$: v bits of X per step
  • Requires multiples 0, Y, 2Y, 3Y, ..., $(2^v - 1)Y$
  • Very High Radix ($v > 4$)
  • Our prior work
  • Requires w × v-bit multipliers to compute w bits
    of Y times v bits of X
  • Efficient for FPGAs with dedicated 16 × 16
    multipliers, but perhaps not for custom silicon
  • High Radix: radix 4, 8, 16
  • This work

30
Design Space Exploration
  • Parallel hardware
  • Left vs. right shifting
  • Choice of radix
  • Booth encoding?
  • Design Points
  • Parallel Radix-4 left shift non-Booth
  • Parallel Radix-4 left shift Booth
  • Parallel Radix-4 right shift Booth
  • Parallel Radix-8 right shift Booth

31
Booth Encoding
  • Radix 4
  • Non-Booth: 0, Y, 2Y, 3Y
  • Booth: -2Y, -Y, 0, Y, 2Y
  • Radix 8
  • Non-Booth: 0, Y, 2Y, 3Y, 4Y, 5Y, 6Y, 7Y
  • Booth: -4Y, -3Y, -2Y, -Y, 0, Y, 2Y, 3Y, 4Y
  • Hard multiples must be precomputed
  • w-bit CPAs at output of operand memory
  • Registers to pipe along values
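
As an illustration of why radix-4 Booth encoding avoids the hard multiple 3Y, the sketch below recodes an unsigned operand into digits from {-2, -1, 0, 1, 2} using the standard radix-4 Booth rule. The function name is hypothetical, and this models only the recoding, not the presenters' hardware.

```python
def booth_radix4_digits(x, nbits):
    """Recode unsigned x < 2**nbits into radix-4 Booth digits, LSB first.

    Each digit is in {-2, -1, 0, 1, 2} and sum(d * 4**i) == x.
    """
    digits = []
    prev = 0                          # implicit bit x[-1] = 0
    ndigits = nbits // 2 + 1          # extra digit so the top bit pair includes a 0
    for i in range(ndigits):
        b0 = (x >> (2 * i)) & 1
        b1 = (x >> (2 * i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1                     # overlap bit for the next digit
    return digits


if __name__ == "__main__":
    d = booth_radix4_digits(27, 5)
    print(d, sum(di * 4**i for i, di in enumerate(d)))
```

Every digit selects 0, ±Y, or ±2Y, which are all shifts and negations of Y, so no adder is needed to form the multiples.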

32
Radix-4 Left-Shifting Non-Booth
33
Radix-4 Left-Shifting Booth
34
Extra Bits in Least Significant Word
  • Booth encoding SX, SQ
  • Carry out of discarded v lsbs
  • Parallel design has nonzero lsbs
  • Only two CSAs available to add 3 bits
  • Use H = Zs[1:0] + Zc[1:0] + SQ
  • Inject SX and H into CSA vacancies

35
Radix-4 Right-Shifting Booth
36
Radix-8 Right-Shifting Booth
37
Synthesis Results
  • Target Xilinx XC2V2000-6
  • 21504 LUTs

38
Exponentiation Times
39
Cautions
  • Choose n2 to be a multiple of vp so results can
    be read from last PE
  • Reset registers properly between steps
  • Sequencer is big source of bugs
  • High radix starts to get messy to implement

40
Sponsored Research Outcomes
  • Kyle Kelley
  • Completing 3rd year of Stanford Ph.D.
  • Ted Jiang
  • Completing 1st year of Stanford Ph.D. program
    with NSF Fellowship
  • John Parker
  • Completing 1st year of UCSB Ph.D. program in
    Integrated Photonics
  • Nathaniel Pinckney
  • Accepted to U. Michigan
  • Philip Amberg
  • Accepted to Stanford University M.S. program
  • Publications
  • 6 conference papers published
  • 1 conference paper accepted (Asilomar 08)
  • 1 conference paper under review (VLSI SOC 08)
  • 1 journal paper under review (JICS)

41
Publications
  • Published (Conference)
  • Pinckney, Nathaniel, and Harris, David,
    Parallelized Radix-4 Scalable Montgomery
    Multipliers, 20th Symp. On Integrated Circuits
    and Systems Design, Sept. 2007.
  • Jiang, Nan and Harris, David, Parallelized
    Radix-2 Scalable Montgomery Multiplier, VLSI
    SOC, 2007.
  • Jiang, Nan, and Harris, David, Quotient
    Pipelined Very High Radix Scalable Montgomery
    Multipliers, Asilomar Conf. Signals, Systems, and
    Computers, Nov. 2006.
  • Kelley, Kyle, and Harris, David, Parallelized
    Very High Radix Scalable Montgomery Multipliers,
    Asilomar Conf. Signals, Systems, and Computers,
    Nov. 2005, pp. 1196-1200.
  • Kelley, Kyle, and Harris, David, Very High Radix
    Scalable Montgomery Multipliers, 5th Intl.
    Workshop on System-on-Chip, July 2005, pp.
    400-404.
  • Harris, David, Krishnamurthy, Ram, Anders, Mark,
    Mathew, Sanu, and Hsu, Steven, An Improved
    Unified Scalable Radix-2 Montgomery Multiplier,
    IEEE Symposium on Computer Arithmetic, June 2005,
    pp. 172-178.
  • In Progress (Conference)
  • Pinckney, N., Amberg, P., and Harris, D.,
    Parallelized Booth-Encoded Radix-4 Montgomery
    Multipliers, submitted to VLSI SOC 2008.
  • Amberg, P., Pinckney, N., and Harris, D.,
    Parallelized Booth-Encoded Radix-8 Montgomery
    Multipliers, invited paper for Asilomar Conf.
    Signals, Systems, and Computers 2008.
  • Submitted (Journal)
  • Pinckney, Nathaniel, and Harris, David,
    Parallelized Radix-4 Scalable Montgomery
    Multipliers, submitted to JICS, 2007.
  • Harris, David, Jiang, Ted, Kelley, Kyle,
    Pinckney, Nathaniel, and Parker, Jon, Very High
    Radix Montgomery Multipliers, submitted to IEEE
    Transactions on Computers.

42
Conclusions
  • High Radix Design Space Exploration
  • Suitable for custom circuit application
  • Parallel always helpful for HW simplicity and
    cycle time
  • Radix-4 better than Radix-2
  • Left-shifting parallel non-Booth performs best
  • If cycle is slower choose left-shifting Booth
  • Radix-8 is less efficient than radix-4
  • High Performance Architectures
  • Choose w as big as cycle time permits
  • Bigger w saves flops at small cost in clock rate
  • Then use as many PEs p as budget permits