1
Privacy-preserving Reinforcement Learning
Tokyo Inst. of Tech.: Jun Sakuma, Shigenobu Kobayashi
Rutgers Univ.: Rebecca N. Wright
2
Motivating application: Load balancing
Order
Shipment
Order
Shipment
Production
Production
Redirection when heavily loaded
  • Load balancing among competing factories
  • A factory obtains a reward by processing a job,
    but suffers a large penalty if an overflow happens
  • Factories may need to redirect jobs to the other
    factory when heavily loaded
  • When should factories redirect jobs to the other
    factory?

Jun Sakuma
3
Motivating application: Load balancing
Order
Shipment
Order
Shipment
Production
Production
Redirection when heavily loaded
  • If two factories are competing
  • The frequency of orders and the speed of
    production are private (private model)
  • The backlog is private (private state
    observation)
  • The profit is private (private reward)
  • Privacy-preserving Reinforcement Learning
  • States, actions, and rewards are not shared
  • But the learned policy is shared in the end

Jun Sakuma
4
Definition of Privacy
  • Partitioned-by-time model
  • Agents share the state space, the action space
    and the reward function
  • Agents cannot interact with the environment
    simultaneously

Environment
state st, reward rt
state st, reward rt
action at
action at
Alice
Bob
Alice's (st, at, rt)
Bob's (st, at, rt)
Alice's (st, at, rt)

t = 0
t = T1
t = T1 + T2
t = T1 + T2 + T3
Jun Sakuma
5
Definition of Privacy
  • Partitioned-by-observation model
  • State spaces and action spaces are mutually
    exclusive between agents
  • Agents interact with the environment
    simultaneously

Environment
state sAt, reward rAt
state sBt, reward rBt
action aAt
action aBt
Alice
Bob
Alice's perception


(sAt,aAt,rAt)
(sA1,aA1,rA1)
Bob's perception

(sBt,aBt,rBt)

(sB1,aB1,rB1)
t = 0
Jun Sakuma
6
Are existing RLs privacy-preserving?
Centralized RL (CRL)
Distributed RL (DRL) [Schneider99, Ng05]
environment
environment
Each distributed agent shares partial observations
and learns
Leader agent learns
Independent DRL (IDRL)
environment
Each agent learns independently
Target: achieve privacy preservation
without sacrificing optimality
Jun Sakuma
7
Privacy-preserving Reinforcement Learning
  • Algorithm
  • Tabular SARSA learning with epsilon-greedy action
    selection
  • Overview
  • (Step 1) Initialization of Q-values
  • Building block 1: Homomorphic cryptosystem
  • (Step 2) Observation from the environment
  • (Step 3) Private action selection
  • Building block 2: Random shares
  • Building block 3: Private comparison by Secure
    Function Evaluation
  • (Step 4) Private update of Q-values
  • Go to step 2

Jun Sakuma
8
Building block: Homomorphic public-key
cryptosystem
  • Public-key cryptosystem
  • A pair of public and secret keys (pk, sk)
  • Encryption: c = e_pk(m; r), where m is an integer
    plaintext and r is a random integer
  • Decryption: m = d_sk(c)
  • Homomorphic public-key cryptosystem
  • Addition of ciphertexts: e_pk(m1 + m2; r1 + r2) =
    e_pk(m1; r1) · e_pk(m2; r2)
  • Multiplication by a constant: e_pk(km; kr) =
    e_pk(m; r)^k
  • The Paillier cryptosystem [Pai99] is homomorphic
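These two identities are all the protocol needs from the cryptosystem. As a concrete illustration, here is a minimal toy Paillier implementation in Python (my own sketch, not the authors' code; the primes are far too small for real use, and a deployment would rely on a vetted library):

```python
# Toy Paillier cryptosystem illustrating the two homomorphic identities above.
# Sketch only: real keys require properly generated ~2048-bit primes.
import math
import secrets

class Paillier:
    def __init__(self, p, q):
        self.n = p * q
        self.n2 = self.n * self.n
        self.g = self.n + 1                       # standard choice g = n + 1
        self.lam = math.lcm(p - 1, q - 1)         # secret key
        self.mu = pow(self.lam, -1, self.n)       # since L(g^lam mod n^2) = lam

    def _L(self, u):
        return (u - 1) // self.n

    def encrypt(self, m):                         # c = g^m * r^n mod n^2
        while True:
            r = secrets.randbelow(self.n - 1) + 1
            if math.gcd(r, self.n) == 1:
                break
        return (pow(self.g, m % self.n, self.n2) * pow(r, self.n, self.n2)) % self.n2

    def decrypt(self, c):                         # m = L(c^lam mod n^2) * mu mod n
        return (self._L(pow(c, self.lam, self.n2)) * self.mu) % self.n

    def add(self, c1, c2):
        """e_pk(m1; r1) * e_pk(m2; r2) encrypts m1 + m2."""
        return (c1 * c2) % self.n2

    def mul_const(self, c, k):
        """e_pk(m; r) ** k encrypts k * m."""
        return pow(c, k, self.n2)

pk_sk = Paillier(10007, 10009)                    # toy primes
c1, c2 = pk_sk.encrypt(12), pk_sk.encrypt(30)
assert pk_sk.decrypt(pk_sk.add(c1, c2)) == 42     # additive homomorphism
assert pk_sk.decrypt(pk_sk.mul_const(c1, 3)) == 36
```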

8
Jun Sakuma
9
Building block: Random shares
public N
Secret x
Bob
Alice
Random share a
Random share b
  • (a, b) are random shares of x when a and b are
    distributed uniformly at random subject to a + b ≡ x (mod N)

Jun Sakuma
10
Building block: Random shares
Secret x = 6
Public N = 23
Bob
Alice
Random share a = 15
Random share b = 14
  • (a, b) are random shares of x when a and b are
    distributed uniformly at random subject to a + b ≡ x (mod N)
  • Example
  • a = 15 and b = 14
  • 6 ≡ 15 + 14 (= 29) mod 23

Jun Sakuma
11
Building block: Private comparison
  • Private comparison
  • Secure Function Evaluation [Yao86]
  • allows parties to evaluate a specified function f
    of their private inputs
  • after the SFE, nothing about the private inputs is
    revealed beyond the specified output

Private input x
Private input y
Private comparison
Output: 0 if x > y, else 1
Output: 0
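In the protocol the comparison runs inside Yao-style secure function evaluation (the experiments later use Fairplay), so neither party sees the other's input and only the comparison bit comes out. The plain-Python sketch below only spells out the functionality the circuit computes on additive shares; it is not itself a secure implementation, and the function name is mine:

```python
# Functionality evaluated inside SFE (illustration only, not a secure protocol):
# the shared values are reconstructed *inside* the circuit, and only the
# comparison bit is revealed to the parties.
def compare_shared(xA, xB, yA, yB, N):
    x = (xA + xB) % N               # Alice's and Bob's shares of x
    y = (yA + yB) % N               # Alice's and Bob's shares of y
    return 0 if x > y else 1        # output convention from the slide

assert compare_shared(15, 14, 3, 1, 23) == 0    # x = 6 > y = 4, so output 0
```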
Jun Sakuma
12
Privacy-preserving Reinforcement Learning
  • Protocol for partitioned-by-time model
  • (Step 1) Initialization of Q-values
  • Building block 1: Homomorphic cryptosystem
  • (Step 2) Observation from the environment
  • (Step 3) Private action selection
  • Building block 2: Random shares
  • Building block 3: Private comparison by Secure
    Function Evaluation
  • (Step 4) Private update of Q-values
  • Go to step 2

Jun Sakuma
13
Step 1: Initialization of Q-values
  • Alice: Learn Q-values Q(s, a) from t = 0 to T1
  • Alice: Generate a key pair (pk, sk)
  • Alice: Compute c(s, a) = enc_pk(Q(s, a)) and send
    them to Bob
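A short sketch of this step, reusing the toy Paillier class from the building-block slide. Since Paillier works over integers, the real-valued Q-values are scaled by a fixed-point factor first; SCALE and encrypt_q_table are illustrative names of mine, not the talk's notation:

```python
# Step 1 sketch: Alice encrypts her whole Q-table and sends it to Bob.
SCALE = 1000                                   # fixed-point factor (assumption)

def encrypt_q_table(paillier, Q):
    """Q maps (state, action) -> float; returns the ciphertext table c(s, a)."""
    return {sa: paillier.encrypt(int(round(q * SCALE))) for sa, q in Q.items()}

alice = Paillier(10007, 10009)                 # Alice holds both pk and sk
Q = {(0, 'redirect'): 1.5, (0, 'no redirect'): 2.0}
c = encrypt_q_table(alice, Q)                  # c(s, a) is what Bob receives
assert alice.decrypt(c[(0, 'redirect')]) == 1500
```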

Environment
state st, reward rt
action at
Alice
Bob
Alice's (st, at, rt)

t = 0
t = T1
t = T1 + T2
t = T1 + T2 + T3
Q-values
Jun Sakuma
14
Step 1: Initialization of Q-values
  • Alice: Learn Q-values Q(s, a) from t = 0 to T1
  • Alice: Generate a key pair (pk, sk)
  • Alice: Compute c(s, a) = enc_pk(Q(s, a)) and send
    them to Bob

Environment
state st, reward rt
action at
Alice
Bob
Alice's (st, at, rt)

t = 0
t = T1
t = T1 + T2
t = T1 + T2 + T3
Q-values
Encrypted Q-values
Jun Sakuma
16
Privacy-preserving Reinforcement Learning
  • The protocol overview
  • (Step 1) Initialization of Q-values
  • Building block 1: Homomorphic cryptosystem
  • (Step 2) Observation from the environment
  • (Step 3) Private action selection
  • Building block 2: Random shares
  • Building block 3: Private comparison by Secure
    Function Evaluation
  • (Step 4) Private update of Q-values
  • Go to step 2

Jun Sakuma
17
Steps 2-3: Private action selection (greedy)
  • Bob: Observe state st, reward rt
  • Bob: For all a, compute random shares of Q(st, a)
    and send them to Alice
  • Bob and Alice: Run private comparison of the random
    shares to learn the greedy action at

Environment
state st
Alice
Bob
Bob's (st, at, rt)
Alice's (st, at, rt)

t = 0
t = T1
t = T1 + T2
t = T1 + T2 + T3
Jun Sakuma
19
Steps 2-3: Private action selection (greedy)
  • Bob: Observe state st, reward rt
  • Bob: For all a, compute random shares of Q(st, a)
    and send them to Alice
  • Bob and Alice: Run private comparison of the random
    shares to learn the greedy action at

Environment
state st
Alice
Bob
Bob's (st, at, rt)
Alice's (st, at, rt)

t = 0
t = T1
t = T1 + T2
t = T1 + T2 + T3
Split the Q-values into random shares
Jun Sakuma
20
Steps 2-3: Private action selection (greedy)
  • Bob: Observe state st, reward rt
  • Bob: For all a, compute random shares of Q(st, a)
    and send them to Alice
  • Bob and Alice: Run private comparison of the random
    shares to learn the greedy action at

Environment
state st
Alice
Bob
Bob's (st, at, rt)
Alice's (st, at, rt)

t = 0
t = T1
t = T1 + T2
t = T1 + T2 + T3
Private comparison
Jun Sakuma
21
Steps 2-3: Private action selection (greedy)
  • Bob: Observe state st, reward rt
  • Bob: For all a, compute random shares of Q(st, a)
    and send them to Alice
  • Bob and Alice: Run private comparison of the random
    shares to learn the greedy action at

Environment
state st
action at
Alice
Bob
Bob's (st, at, rt)
Alice's (st, at, rt)

t = 0
t = T1
t = T1 + T2
t = T1 + T2 + T3
Private comparison
Jun Sakuma
22
Privacy-preserving Reinforcement Learning
  • The protocol overview
  • (Step 1) Initialization of Q-values
  • Building block 1: Homomorphic cryptosystem
  • (Step 2) Observation from the environment
  • (Step 3) Private action selection
  • Building block 2: Random shares
  • Building block 3: Private comparison by Secure
    Function Evaluation
  • (Step 4) Private update of Q-values
  • Go to step 2

Jun Sakuma
23
Step 4: Private update of Q-values
  • After the greedy action selection, Bob observes (rt,
    st+1)
  • How can Bob update the encrypted Q-values c(st, at)
    from (st, at, rt, st+1)?

Environment
reward rt, state st+1
action at
Alice
Bob
Taken by Bob (greedy)
Regular update by SARSA
Observed
Encrypted Q-values
Jun Sakuma
24
Step 4: Private update of Q-values
  • After the greedy action selection, Bob observes (rt,
    st+1)
  • How can Bob update the encrypted Q-values c(st, at)
    from (st, at, rt, st+1)?

Environment
reward rt, state st+1
action at
Alice
Bob
Taken by Bob (greedy)
Regular update by SARSA
Observed
Can Bob update encrypted Q-values?
Encrypted Q-values
Jun Sakuma
25-29
Step 4: Private update of Q-values
[The equations on these slides, shown as images, rewrite the SARSA update with
constants K and L so that Bob can apply it directly to the encrypted Q-values.]
Bob can update c(s, a) without any knowledge of Q(s, a)!
Jun Sakuma
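Concretely, the SARSA update Q(st, at) ← (1 - α) Q(st, at) + α (rt + γ Q(st+1, at+1)) is an affine combination of values Bob holds only as ciphertexts, so the two homomorphic operations are enough to apply it. The sketch below is my own illustration, reusing the toy Paillier class and SCALE factor from earlier; it assumes α = 0.1 and γ = 0.9 are scaled by a denominator D = 100 so that all coefficients are integers, and it ignores that each update then multiplies the table's implicit fixed-point scale by D, a precision issue the actual protocol has to handle:

```python
# Private SARSA update sketch: Bob updates c(st, at) without decrypting anything.
D        = 100    # common denominator for alpha = 0.1, gamma = 0.9 (assumption)
K_OLD    = 90     # (1 - alpha) * D
K_NEXT   = 9      # alpha * gamma * D
K_REWARD = 10     # alpha * D

def private_sarsa_update(paillier, c, s, a, r, s_next, a_next):
    """Homomorphically compute an encryption of
       D * SCALE * [ (1 - alpha) Q(s, a) + alpha * (r + gamma * Q(s', a')) ]."""
    term_old  = paillier.mul_const(c[(s, a)], K_OLD)              # (1-a) Q(s,a)
    term_next = paillier.mul_const(c[(s_next, a_next)], K_NEXT)   # a*g Q(s',a')
    term_r    = paillier.encrypt(K_REWARD * int(round(r * SCALE)))  # a * r
    c[(s, a)] = paillier.add(paillier.add(term_old, term_next), term_r)
```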
30
Privacy-preserving Reinforcement Learning
  • The protocol overview
  • (Step 1) Initialization of Q-values
  • (Step 2) Observation from the environment
  • (Step 3) Private action selection
  • (Step 4) Private update of Q-values
  • Go to step 2
  • Not mentioned in this talk, but treated in a similar
    manner:
  • Partitioned-by-observation model
  • Epsilon-greedy action selection
  • Q-learning

Jun Sakuma
31
Experiment: Load balancing among factories
Job is assigned w.p. pin
aB = no redirect
sA = 5
sB = 2
aA = redirect
Job is processed w.p. pout
  • Setting
  • State space: sA, sB ∈ {0, 1, ..., 5}
  • Action space: aA, aB ∈ {redirect, no redirect}
  • Reward
  • Cost for backlog: rA = 50 - (sA)^2
  • Cost for redirection: rA ← rA - 2
  • Cost for overflow: rA = 0
  • Reward rB is set similarly
  • System reward: rt = rAt + rBt
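Read literally, the per-factory reward above might be computed as in the short sketch below; this is my own reading of the slide (in particular of how redirection and overflow combine), not code from the experiments:

```python
# One plausible reading of the per-factory reward (illustration only).
def factory_reward(backlog, redirected, overflowed):
    if overflowed:
        return 0                    # overflow wipes out the reward
    reward = 50 - backlog ** 2      # processing reward shrinks with backlog
    if redirected:
        reward -= 2                 # redirection cost
    return reward

# The system reward is the sum over both factories, e.g. with backlogs 5 and 2:
r_t = factory_reward(5, True, False) + factory_reward(2, False, False)
```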

Regular RL/PPRL
DRL (rewards are shared)
IDRL (no sharing)
Jun Sakuma
32
Experiment: Load balancing among factories
Job is assigned w.p. pin
  • Comparison

aB = no redirect
sA = 5
sB = 2
aA = redirect
Job is processed w.p. pout
Java 1.5.0 + Fairplay, 1.2 GHz Core Solo
Jun Sakuma
33
Summary
  • Reinforcement learning from private observations
  • Achieves the same optimality as regular RL
  • Privacy preservation is guaranteed theoretically
  • Computational load is higher than for regular RL, but
    it works efficiently on a 36-state / 4-action problem
  • Future work
  • Scalability
  • Treatment of agents with competing reward
    functions
  • Game-theoretic analysis

Jun Sakuma
34
Thank you!
35
Steps 2-3: Private action selection (greedy)
  • Bob: Observe state st, reward rt
  • Bob: For all a, compute c'(st, a) = c(st, a) ·
    enc_pk(-rB(st, a)) and send them to Alice
  • Alice: For all a, compute rA(st, a) = dec_sk(c'(st, a))
  • Bob and Alice: Run private comparison of the random
    shares to learn the greedy action at
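This blinding step is what produces the random shares: Bob subtracts a random rB(st, a) from each encrypted Q-value homomorphically, Alice decrypts the result to get her share rA(st, a), and rA + rB ≡ Q(st, a) (mod n). A sketch of both sides, reusing the toy Paillier class and SCALE factor from earlier (function names are mine):

```python
# Share splitting behind private greedy selection (sketch).
import secrets

def bob_split_encrypted_q(paillier, c, s_t, actions):
    """Bob's side: pick random shares r_B and blind the ciphertexts for Alice."""
    r_B, blinded = {}, {}
    for a in actions:
        r_B[a] = secrets.randbelow(paillier.n)
        # c'(s_t, a) = c(s_t, a) * enc_pk(-r_B) encrypts Q*SCALE - r_B (mod n)
        blinded[a] = paillier.add(c[(s_t, a)], paillier.encrypt(-r_B[a]))
    return r_B, blinded

def alice_decrypt_shares(paillier, blinded):
    """Alice's side: r_A(s_t, a) = dec_sk(c'(s_t, a)), so r_A + r_B = Q*SCALE (mod n)."""
    return {a: paillier.decrypt(ct) for a, ct in blinded.items()}

# The greedy action is then the argmax over a, found by pairwise private
# comparisons of the shares (compare_shared above stands in for the SFE).
```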

Environment
state st, reward rt
action at
Alice
Bob
Bob's (st, at, rt)
Alice's (st, at, rt)

t = 0
t = T1
t = T1 + T2
t = T1 + T2 + T3
Private comparison
decrypt
c'(st, a) = c(st, a) · enc_pk(-rB(st, a))
36
Distributed Reinforcement Learning
Environment
state sA, reward rA
state sB, reward rB
action aA
action aB
Bob
Alice
(sA, rA , aA)
(sB, rB , aB)
  • Distributed value functions [Schneider99]
  • Manage huge state-action space
  • Suppress the memory consumption
  • Policy gradient approaches [Peshkin00, Moallemi03,
    Bagnell05]
  • Limit the communication
  • DRL learns good, but sub-optimal, policies with
    minimal or limited sharing of the agents' perceptions