Steps of The PARADISE Methodology
1
George Mason University Learning Agents Laboratory
Concepts, Issues, and Methodologies for
Evaluating Mixed-initiative Intelligent Systems
Research Presentation by Ping Shyr
Research Director: Dr. Gheorghe Tecuci
May 3, 2004
2
Overview
Common Evaluation Issues
Evaluating Spoken Dialogue Agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
3
Common Issues
  • Common Issues Related to Intelligent System
    Evaluations
  • 1. Costly (resource-intensive), lengthy, and
    labor-intensive
  • 2. Domain experts are scarce and expensive
  • 3. The quality of the acquired knowledge is hard
    to analyze and measure
  • 4. Efficiency and effectiveness are not easy to
    evaluate
  • 5. Component evaluation is easier than approach
    (full-system) evaluation
  • Issues of Evaluating Mixed-initiative Systems
  • 1. It's hard to analyze the contribution of the
    system to the overall performance
  • 2. Different knowledge sources (imported
    ontology, SMEs, and agent)
  • 3. Mixed-initiative methods are more complex and
    therefore more difficult to evaluate

4
General Evaluation Questions
1. When to conduct an evaluation?
2. What kind of evaluation should we use?
   Summative/outcome evaluation? (predict/measure the final level of performance)
   Diagnostic/process evaluation? (identify problems in the prototype/development system that may degrade the performance of the final system)
3. Where and when to collect evaluation data? Who should supply the data?
4. Who should be involved in an evaluation?
5
Usability Evaluation
  • 1. Usability is a critical factor in judging a system
  • 2. Usability evaluation provides important
    information to determine:
  •   a) How easy the system (especially the human-
      computer/agent interface/interaction) is
      to learn and to use
  •   b) The users' confidence in the system's results
  • 3. Usability evaluation can also be used to
    predict whether an organization will use the system
6
Issues in Usability Evaluation
1. To what extent does the interface meet acceptable Human Factors standards?
2. To what extent is the system easy to use?
3. To what extent is the system easy to learn how to use?
4. To what extent does the system decrease user workload?
5. To what extent does the explanation capability meet user needs?
6. Is the allocation of tasks to user and system appropriate?
7. Is the supporting documentation adequate?
7
Performance Evaluation
Performance evaluation helps us to answer the question: Does the system meet user and organizational objectives and needs? It measures the system's performance behavior. Experimentation is the best (and only) method that can appropriately evaluate the performance of the stable prototype and the final system.
8
Issues in Performance Evaluation
1. Is the system cost-effective?
2. To what extent does the system meet users' needs?
3. To what extent does the system meet the organization's needs?
4. How effective is the system in enhancing users' performance?
5. How effective is the system in enhancing organizational performance?
6. How effective is the system in specific tasks?
9
Mixed-initiative user interface evaluation (1)
(based on Eric Horvitz)
1. Does the automated service (automation) significantly add value to the system?
2. Does the system employ machinery for inferring/exploiting the uncertainty about a user's intentions and focus?
3. Does the system consider the status of a user's attention in the timing of services? - The nature and timing of automated services and alerts can be a critical factor in the costs and benefits of actions.
4. Can automated services be guided by the expected value (cost and benefit) of taking actions?
10
Mixed-initiative user interface evaluation (2)
5. Will the system employ dialog to resolve key uncertainties?
6. Can users directly invoke or terminate the automated services?
7. Does the system minimize the cost of poor guesses about action and timing?
8. Can the system adjust its level of services to match the current uncertainty?
11
Mixed-initiative user interface evaluation (3)
9. Does the system provide mechanisms for efficient agent-user collaboration to refine results?
10. Does the system employ socially appropriate behaviors for agent-user interaction?
11. Can the system maintain a memory of the recent interactions?
12. Will the system continue to learn about a user's goals and needs?
12
Metrics for evaluating models of mixed-initiative
dialog (1) (Guinn)
  • OBJECTIVE METRICS
  • Percentage of correct answers
  • Percentage of successful transactions
  • Number of dialog turns
  • Dialog time
  • User response time
  • System response time
  • Percentage of error messages
  • Percentage of non-trivial utterances
  • Mean length of utterances
  • Task completion

13
Metrics for evaluating models of mixed-initiative
dialog (2) (Guinn)
  • SUBJECTIVE METRICS
  • Percentage of implicit recovery utterances
  • Percentage of explicit recovery utterances
  • Percentage of appropriate system utterances
  • Cooperativity
  • Percentage of correct and partially correct
    utterances
  • User satisfaction
  • Number of initiative changes
  • Number of explicit initiative changing events
  • Number of implicit initiative changing events
  • Level of initiative (see examples later)
  • Number of discourse segments
  • Knowledge density of utterances
  • Co-reference patterns

14
Level (Strength) of initiative (1)
Level (Strength) of initiative - how strongly a participant is in control when he or she does have the initiative. Initiative can be present in many different degrees or strengths.
1) A: I wish I knew how to split up the work of peeling this banana.
2) B: Yeah.
3) A: What do you think we should do?
4) B: I don't know. It's a tough problem.
5) A: Sure is. I'm so confused.
6) B: Me too. Maybe the waiter has an idea.
7) A: I hope so, I'm getting hungry.
(From example 8 in "What is Initiative?", R. Cohen et al.)
15
Level (Strength) of initiative (2)
1) A: So, how should we split up the work of peeling the banana?
2) B: I don't know. What do you think?
3) A: We need a plan.
4) B: I know we need to split this up somehow.
5) A: Yes, you're right. We need something sharp.
6) B: A cleaver?
7) A: Good idea! That way we can split it up evenly.
8) B: Then we can each peel our own half.

1) A: Need to split up the work of peeling this banana. I have the plan. You grab the bottom of the banana and hold it steady. Then I grab the top of the banana and pull hard. That's how we'll do it.
2) B: No! I think I'll just peel the banana myself. That would be way more efficient.
(From examples 9 and 10 in "What is Initiative?", R. Cohen et al.)
16
Overview
Common Evaluation Issues
Evaluating spoken dialogue agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
17
Evaluating Spoken Dialogue Agents with PARADISE (M. Walker)
  • PARADISE (PARAdigm for DIalogue System
    Evaluation), a general framework for evaluating
    spoken dialogue agents.
  • Decouples task requirements from an agent's
    dialogue behaviors
  • Supports comparisons among dialogue strategies
  • Enables the calculation of performance over
    subdialogues and whole dialogues
  • Specifies the relative contribution of various
    factors to performance
  • Makes it possible to compare agents performing
    different tasks by normalizing for task complexity.

18
Attribute Value Matrix (AVM)
An attribute value matrix (AVM) can represent
many dialogue tasks. This consists of the
information that must be exchanged between the
agent and the user during the dialogue,
represented as a set of ordered pairs of
attributes and their possible values.
Attribute-value pairs are annotated with the
direction of information flow to represent who
acquires the information. Performance evaluation
for an agent requires a corpus of dialogues
between users and the agent, in which users
execute a set of scenarios.   Each scenario
execution has a corresponding AVM instantiation
indicating the task information that was actually
obtained via the dialogue. (from PARADISE, M.
Walker)
19
Kappa coefficient
Given a matrix M, success at achieving the information requirements of the task is measured with the Kappa coefficient:

K = (P(A) - P(E)) / (1 - P(E))

P(A) is the proportion of times that the AVMs for the actual set of dialogues agree with the AVMs for the scenario keys, and P(E) is the proportion of times that the AVMs for the dialogues and the keys are expected to agree by chance.
(from PARADISE, M. Walker)
20
Steps in PARADISE Methodology
  • Definition of a task and a set of scenarios
  • Specification of the AVM task representation
  • Experiments with alternate dialogue agents for
    the task
  • Calculation of user satisfaction using surveys
  • Calculation of task success using K (kappa)
  • Calculation of dialogue cost using efficiency and
    qualitative measures
  • Estimation of a performance function using linear
    regression and values for user satisfaction, K,
    and dialogue costs
  • Comparison with other agents/tasks to determine
    which factors generalize
  • Refinement of the performance model.
  • (from PARADISE, M. Walker)

21
Objective Metrics for evaluating a dialog (Walker)
  • Objective metrics can be calculated without
    recourse to human judgment, and in many cases
    can be logged by the spoken dialogue system so
    that they can be calculated automatically.
  • Percentage of correct answers with respect to a
    set of reference answers
  • Percentage of successful transactions or
    completed tasks
  • Number of turns or utterances
  • Dialogue time or task completion time
  • Mean user response time
  • Mean system response time
  • Percentage of diagnostic error messages
  • Percentage of non-trivial (more than one word)
    utterances
  • Mean length of non-trivial utterances
  • (from PARADISE, M. Walker)

22
Subjective Metrics for evaluating a dialog (1)
  • Percentage of implicit recovery utterances
    (where the system uses dialogue context to
    recover from errors of partial recognition or
    understanding)
  • Percentage of explicit recovery utterances
  • Percentage of contextually appropriate system
    utterances
  • Cooperativity (the adherence of the system's
    behavior to Grice's conversational maxims
    (Grice, 1967))
  • Percentage of correct and partially correct
    answers
  • Percentage of appropriate and inappropriate
    system directive and diagnostic utterances
  • User satisfaction (users' perceptions about the
    usability of a system, usually assessed with
    multiple-choice questionnaires that ask users to
    rank the system's performance on a range of
    usability features according to a scale of
    potential assessments)
  • (from PARADISE, M. Walker)

23
Subjective Metrics for evaluating a dialog (2)
Subjective metrics require subjects using the
system and/or human evaluators to categorize the
dialogue or utterances within the dialogue along
various qualitative dimensions. Because these
metrics are based on human judgments, such
judgments need to be reliable across judges in
order to compete with the reproducibility of
metrics based on objective criteria.
  Subjective metrics can still be quantitative,
as when a ratio between two subjective categories
is computed. (from PARADISE, M. Walker)
24
Limitations of Metrics and Current Methodologies for Evaluating a Dialog
  • The use of reference answers makes it impossible
    to compare systems that use different dialogue
    strategies for carrying out the same task. This
    is because the reference answer approach requires
    canonical responses (i.e., a single correct
    answer) to be defined for every user utterance,
    even though there are potentially many correct
    answers.
  • Interdependencies between metrics are not yet
    well understood.
  • The inability to trade off or combine various
    metrics and to make generalizations
  • (from PARADISE, M. Walker)

25
How Does PARADISE Address these Limitations for Evaluating a Dialog
  • PARADISE supports comparisons among dialogue
    strategies by providing a task representation
    that decouples what an agent needs to achieve in
    terms of the task requirements from how the
    agent carries out the task via dialogue.
  • PARADISE uses methods from decision theory
    (Keeney & Raiffa, 1976; Doyle, 1992) to combine a
    disparate set of performance measures (i.e., user
    satisfaction, task success, and dialogue cost)
    into a single performance evaluation function
    (a weighted function).
  • Once a performance function has been derived, it
    can be used both to make predictions about future
    versions of the agent and as the basis of
    feedback to the agent, so that the agent can
    learn to optimize its behavior based on its
    experiences with users over time.
  • (from PARADISE, M. Walker)

26
Example Dialogue 1 (D1, Agent A)
Figure 2. Dialogue 1 (D1)
Agent A dialogue interaction (from Figure 2 in
PARADISE, M. Walker)
27
Example Dialogue 2 (D2, Agent B)
Figure 3. Dialogue 2 (D2) Agent B dialogue
interaction
(from Figure 3 in PARADISE, M. Walker)
28
PARADISE's structure of objectives for spoken dialogue performance
Figure 1. PARADISE's structure of objectives for spoken dialogue performance.
(from Figure 1 in PARADISE, M. Walker)
29
Example in Simplified Train Timetable Domain
Our example scenario requires the user to find a train from Torino to Milano that leaves in the evening.   During the dialogue the agent must acquire from the user the values of depart-city, arrival-city, and depart-range, while the user must acquire depart-time.
TABLE 1. Attribute value matrix, simplified train timetable domain
(from Table 1 in PARADISE, M. Walker)
This AVM consists of four attributes, each containing a single value. (from PARADISE, M. Walker)
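As a minimal sketch of the AVM idea on this slide, the scenario key can be written as attribute-value pairs annotated with the direction of information flow. The Python structure itself is an illustration, not Walker's notation, and the depart-time value (supplied by the timetable) is left unspecified:

```python
# Hedged sketch: the scenario-key AVM for the simplified train-timetable
# task, with the direction of information flow from the slide.
avm_scenario_key = [
    # (attribute, scenario value, who acquires the information)
    ("depart-city",  "Torino",  "agent"),   # agent acquires from user
    ("arrival-city", "Milano",  "agent"),
    ("depart-range", "evening", "agent"),
    ("depart-time",  None,      "user"),    # user acquires; value comes from the timetable
]

# Attributes the agent must obtain during the dialogue:
agent_goals = [a for (a, _, who) in avm_scenario_key if who == "agent"]
print(agent_goals)  # ['depart-city', 'arrival-city', 'depart-range']
```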
30
Measuring task success (1)
Success at the task for a whole dialogue (or
sub-dialogue) is measured by how well the agent
and user achieve the information requirements of
the task by the end of the dialogue (or
sub-dialogue). PARADISE uses the Kappa
coefficient (Siegel Castellan, 1988 Carletta,
1996) to operationalize the task-based success
measure in Figure 1.
TABLE 2. Attribute value matrix instantiation,
scenario key for Dialogues 1 and 2
(from Table 2 in PARADISE, M. Walker)
31
Measuring task success (3)
TABLE 3. Confusion matrix for Agent A
(from Table 3 in PARADISE, M. Walker)
32
Measuring task success (4)
TABLE 4. Confusion matrix for Agent B
(from Table 4 in PARADISE, M. Walker)
33
Measuring task success (2)
The values in the matrix cells are based on
comparisons between the dialogue and scenario key
AVMs. Whenever an attribute value in a dialogue
(i.e. data) AVM matches the value in its scenario
key, the number in the appropriate diagonal cell
of the matrix (boldface for clarity) is
incremented by 1.   The off-diagonal cells
represent misunderstandings that are not
corrected in the dialogue.   The time course of
the dialogue and error handling for any
misunderstandings are assumed to be reflected in
the costs associated with the dialogue.   The matrices in Tables 3 and 4 summarize how the 100 AVMs representing each dialogue with Agents A and B compare with the AVMs representing the relevant scenario keys. (from PARADISE, M. Walker)
34
Measuring task success (5) - Kappa coefficient
(1)
Given a confusion matrix M, success at achieving the information requirements of the task is measured with the Kappa coefficient:

K = (P(A) - P(E)) / (1 - P(E))
P(A) is the proportion of times that the AVMs for
the actual set of dialogues agree with the AVMs
for the scenario keys, and P(E) is the proportion
of times that the AVMs for the dialogues and the
keys are expected to agree by chance.   P(A), the
actual agreement between the data and the key,
can be computed over all the scenarios from the
confusion matrix M
(from PARADISE, M. Walker)
35
Measuring task success (6) - Kappa coefficient (2)
When the prior distribution of the categories is unknown, P(E), the expected chance agreement between the data and the key, can be estimated from the distribution of the values in the keys. This can be calculated from the confusion matrix M, since the columns represent the values in the keys. In particular:

P(E) = sum over i of (ti / T)^2

where ti is the sum of the frequencies in column i of M, and T is the sum of all the frequencies in M (t1 + ... + tn).
(from PARADISE, M. Walker)
36
Measuring task success (7) - Kappa coefficient (3)
For Agent A:
P(A) = (22 + 29 + 16 + 11 + 20 + 22 + 20 + 15 + 45 + 40 + 20 + 19 + 18 + 21) / 400 = 0.795
P(E) = (30/400)^2 + (30/400)^2 + (25/400)^2 + (25/400)^2 + ... = 0.079375
K = (0.795 - 0.079375) / (1 - 0.079375) = 0.777

For Agent B:
P(A) = 0.59, K = 0.555

suggesting that Agent A is more successful than B in achieving the task goals.
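The kappa computation above can be sketched as a small function over a confusion matrix. The 2x2 demo matrix below is hypothetical, not Agent A's actual Table 3 (which is not reproduced in this transcript):

```python
def kappa(matrix):
    """Kappa coefficient for a square confusion matrix (rows = data, cols = key)."""
    n = len(matrix)
    T = sum(sum(row) for row in matrix)
    # P(A): observed agreement = proportion of counts on the diagonal
    p_a = sum(matrix[i][i] for i in range(n)) / T
    # P(E): chance agreement estimated from the column (key) totals
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    p_e = sum((t / T) ** 2 for t in col_sums)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical 2x2 example: 90 of 100 values match their scenario key.
m = [[45, 5],
     [5, 45]]
print(round(kappa(m), 3))  # P(A)=0.9, P(E)=0.5 -> kappa=0.8
```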
37
Measuring Dialogue Costs (1)
PARADISE represents each cost measure as a
function ci that can be applied to any
(sub)dialogue. First, consider the simplest case
of calculating efficiency measures over a whole
dialogue. For example, let c1 be the total number
of utterances. For the whole dialogue D1 in
Figure 2, c1 (D1) is 23 utterances. For the whole
dialogue D2 in Figure 3, c1 (D2) is 10
utterances. Tagging by AVM attributes is
required to calculate costs over sub-dialogues,
since for any sub-dialogue task attributes define
the sub-dialogue. For sub-dialogue S4 in Figure
4, which is about the attribute arrival-city and
consists of utterances A6 and U6, c1 (S4) is 2.
(from PARADISE, M. Walker)
38
Measuring Dialogue Costs (2)
Figure 4. Task-defined discourse structure of
Agent A dialogue interaction.
(from Figure 4 in PARADISE, M. Walker)
39
Measuring Dialogue Costs (3)
Tagging by AVM attributes is also required to
calculate the cost of some of the qualitative
measures, such as number of repair utterances.
  For example, let c2 be the number of repair utterances. The repair utterances for the whole dialogue D1 in Figure 2 are A3 through U6, thus c2 (D1) is 10 utterances and c2 (S4) is two utterances.   The repair utterance for the whole dialogue D2 in Figure 3 is U2, but note that according to the AVM task tagging, U2 simultaneously addresses the information goals of two attributes, including depart-range. In general, if an utterance U contributes to the information goals of N different attributes, each attribute accounts for 1/N of any costs derivable from U. Thus, c2 (D2) is 0.5. (from PARADISE, M. Walker)
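The 1/N cost-splitting rule can be sketched as follows. The tagging dictionary is illustrative: the slide only states that U2 is tagged with more than one attribute, so the second attribute name here is an assumption:

```python
# Sketch of the 1/N rule: an utterance tagged with N attributes
# contributes 1/N of its cost to each attribute.

def repair_cost(repair_utterances, tags):
    """Split each utterance's unit cost across the attributes it is tagged with."""
    total = 0.0
    per_attribute = {}
    for u in repair_utterances:
        n = len(tags[u])
        for attr in tags[u]:
            per_attribute[attr] = per_attribute.get(attr, 0.0) + 1.0 / n
        total += 1.0  # each utterance costs 1 before attribution

    return total, per_attribute

# Hypothetical tagging: U2 addresses two attributes at once
# (the pairing with depart-city is an assumption for illustration).
tags = {"U2": ["depart-city", "depart-range"]}
total, per_attr = repair_cost(["U2"], tags)
print(per_attr)  # each attribute accounts for 0.5 of U2's cost
```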
40
Measuring Dialogue Performance (1)
41
Measuring Dialogue Performance (2)
42
Measuring Dialogue Performance (3)
TABLE 5 Hypothetical performance data from users
of Agent A and B.
(from Table 5 in PARADISE, M. Walker)
43
Measuring Dialogue Performance (4)
In this illustrative example, the results of the regression with all factors included show that only K and rep are significant (p < .002). In order to develop a performance function estimate that includes only significant factors and eliminates redundancies, a second regression including only the significant factors must then be done. In this case, the second regression yields the predictive equation.
The mean performance of A is 0.44 and the mean performance of B is 0.44, suggesting that Agent B may perform better than Agent A overall.   The evaluator must then, however, test these performance differences for statistical significance. In this case, a t-test shows that the differences are only significant at the p < 0.07 level, indicating a trend only. (from PARADISE, M. Walker)
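As a hedged sketch of how a derived performance function is applied: PARADISE combines normalized task success and costs into a single score of the general form alpha * N(kappa) - sum of w_i * N(c_i), where N is z-score normalization over the corpus. The weights, corpus statistics, and numbers below are invented for illustration:

```python
from statistics import mean, pstdev

def zscore(x, sample):
    """Normalize x to standard deviations from the sample mean."""
    return (x - mean(sample)) / pstdev(sample)

def performance(kappa_val, costs, alpha, weights, kappa_corpus, cost_corpora):
    """PARADISE-style performance: alpha * N(kappa) - sum_i w_i * N(c_i)."""
    score = alpha * zscore(kappa_val, kappa_corpus)
    for c, w, corpus in zip(costs, weights, cost_corpora):
        score -= w * zscore(c, corpus)
    return score

# Made-up corpus statistics and regression weights, purely for illustration:
kappa_corpus = [0.5, 0.6, 0.7, 0.8]   # kappa values across dialogues
rep_corpus = [2, 4, 6, 8]             # repair-utterance counts across dialogues
p = performance(0.8, [2], alpha=0.5, weights=[0.3],
                kappa_corpus=kappa_corpus, cost_corpora=[rep_corpus])
print(round(p, 3))  # high kappa and low repair cost give a positive score
```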
44
Experimental Design (1)
Both experiments required every user to complete a set of application tasks (3 tasks with 2 subtasks; 4 tasks) in conversations with a particular version of the agent.   Instructions to the users were given on a set of web pages; there was one page for each experimental task.   Each web page consisted of a brief general description of the functionality of the agent, a list of hints for talking to the agent, a task description, and information on how to call the agent. (from PARADISE, M. Walker)
45
Experimental Design (2)
Each page also contained a form for specifying information acquired from the agent during the dialogue, and a survey, to be filled out after task completion, designed to probe the user's satisfaction with the system.   Users read the instructions in their offices before calling the agent from their office phone.   All of the dialogues were recorded. The agent's dialogue behavior was logged in terms of entering and exiting each state in the state transition table for the dialogue.   Users were required to fill out a web page form after each task. (from PARADISE, M. Walker)
46
Examples of survey questions (1)
Was SYSTEM_NAME easy to understand in this
conversation? (text-to-speech (TTS)
Performance)   In this conversation, did
SYSTEM_NAME understand what you said? (automatic
speech recognition (ASR) Performance)   In this
conversation, was it easy to find the message you
wanted? (Task Ease)   Was the pace of interaction
with SYSTEM_NAME appropriate in this
conversation? (Interaction Pace) How often was
SYSTEM_NAME sluggish and slow to reply to you in
this conversation? (System Response) (from
PARADISE, M. Walker)
47
Examples of survey questions (2)
Did SYSTEM_NAME work the way you expected him to in this conversation? (Expected Behavior)   In this conversation, how did SYSTEM_NAME's voice interface compare to the touch-tone interface to voice mail? (Comparable Interface)   From your current experience with using SYSTEM_NAME to get your e-mail, do you think you'd use SYSTEM_NAME regularly to access your mail when you are away from your desk? (Future Use) (from PARADISE, M. Walker)
48
Summary of Measurements
User Satisfaction score is used as a measure of User Satisfaction. Kappa measures actual task success.   The measures of System Turns, User Turns, and Elapsed Time (the total time of the interaction) are efficiency cost measures.   The qualitative measures are Completed, Barge Ins (how many times users barged in on agent utterances), Timeout Prompts (the number of timeout prompts that were played), ASR Rejections (the number of times that ASR rejected the user's utterance), Help Requests, and Mean Recognition Score. (from PARADISE, M. Walker)
49
Using the performance equation
One potentially broad use of the PARADISE performance function is as feedback to the agent, enabling the agent to learn how to optimize its behavior automatically.   The basic idea is to apply the performance function to any dialogue Di in which the agent conversed with a user. Then each dialogue has an associated real-numbered performance value Pi, which represents the performance of the agent for that dialogue.   If the agent can make different choices in the dialogue about what to do in various situations, this performance feedback can be used to help the agent learn automatically, over time, which choices are optimal.   Learning could be either on-line, so that the agent tracks its behavior on a dialogue-by-dialogue basis, or off-line, where the agent collects a lot of experience and then tries to learn from it. (from PARADISE, M. Walker)
50
Overview
Common Evaluation Issues
Evaluating Spoken Dialogue Agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
51
How to Evaluate a Mixed-initiative System?
Mike Pazzani's caution:
  • Don't lose sight of the goal.
  • The metrics are just approximations of the goal.
  • Optimizing the metric may not optimize the goal.

52
Usability Evaluation - Sweeny et al., 1993 (1)
Three dimensions of usability evaluation:
1. Evaluation approach: User-based, Expert-based, and Theory-based approaches
2. Type of evaluation: Diagnostic methods, Summative evaluation, and Certification approach
3. Time of evaluation: Specification, Rapid prototype, High-fidelity prototype, and Operational system
53
Usability Evaluation - Sweeny et al., 1993 (2)
Evaluation approach
1. A user-based approach can utilize:
   a) Performance indicators (e.g., task times or error rates)
   b) Nonverbal behaviors (e.g., eye movements)
   c) Attitudes (e.g., questionnaires)
   d) Cognitive indicators (e.g., verbal protocols)
   e) Stress indicators (e.g., heart rates or electro-encephalogram)
   f) Motivational indicators (e.g., effort)
2. An expert-based approach (heuristic evaluation - Nielsen & Phillips, 1993) can apply methods indicating conformance to guidelines or design criteria, and expert attitudes (e.g., comments or ratings).
3. A theory-based approach uses idealized performance measures to predict learning or performance times or ease of understanding.
54
Usability Evaluation - Sweeny et al., 1993 (3)
Evaluation by Type
1. Diagnostic methods seek to identify problems with the design and suggest solutions. All three approaches can be used as diagnostics, but expert-based and user-based methods work better.
2. Summative evaluation seeks to determine the extent to which the system helps the user complete the desired task.
3. The certification approach is used to determine if the system meets the required performance criteria for its operational environment.
55
Usability Evaluation - Sweeny et al., 1993 (4)
Time of Evaluation
1. Specification - theory-based and expert-based methods
2. Rapid prototype - expert-based and user-based methods conducted in the laboratory (e.g., usability testing and questionnaires) work best
3. High-fidelity prototype
4. Operational system
In phases 3 and 4, evaluation should be conducted in the field and utilize user-based techniques.
56
Usability Evaluation - Nielsen & Phillips (1993)
Ten rules for usability inspection:
1. Use simple and natural dialogue
2. Speak the user's language
3. Minimize the user's memory load
4. Be consistent
5. Provide feedback
6. Provide clearly marked exits
7. Provide shortcuts
8. Provide good error messages
9. Prevent errors
10. Provide good on- and off-line documentation
57
Subjective Usability Evaluation Methods
Nine subjective evaluation methods:
1. Thinking aloud
2. Observation
3. Questionnaires
4. Interviews
5. Focus groups (5-9 participants, led by a moderator - a member of the evaluation team)
6. User feedback
7. User diaries and log-books
8. Teaching back (having users try to teach others how to use a system)
9. Video taping
58
How to handle Questionnaires (1)
Adelman Riedel (1997) suggested to use
Multi-Attribute Utility Assessment (MAUA)
approach 1. the overall system utility is
decomposed into three categories
(dimensions) - Effect on task
performance - System usability -
System fit. 2. Each dimensions is decomposed
into different criteria (e.g. fit with user
and fit with organization in system fit
dimension), and each criterion may be
further decomposed into specific
attributes.
59
How to handle Questionnaires (2)
3. There are at least two questions for each bottom-level criterion and attribute in the hierarchy.
4. Each dimension is weighted equally, each criterion is weighted equally within its dimension, and likewise the attributes within each criterion.
5. A simple arithmetic operation can be used to score and weight the results.
6. Sensitivity analysis can be performed by determining how sensitive the overall utility score is to changes in the relative weights on the criteria and dimensions, or to the system scores on the criteria and attributes.
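The equal-weight MAUA scoring (steps 4-5 above) can be sketched as a recursive average over the hierarchy. The dimension names and the "fit with user/organization" criteria come from the slide; the remaining criteria names and the question scores are hypothetical:

```python
def maua_score(node):
    """Score a MAUA hierarchy node: leaves hold raw question scores,
    inner nodes average their equally weighted children."""
    if isinstance(node, list):  # leaf: scores from the questionnaire items
        return sum(node) / len(node)
    return sum(maua_score(child) for child in node.values()) / len(node)

# Hypothetical hierarchy with question scores on a 1-5 scale:
hierarchy = {
    "effect on task performance": {"accuracy": [4, 5], "speed": [3, 4]},
    "system usability": {"ease of learning": [4, 4], "ease of use": [5, 3]},
    "system fit": {"fit with user": [4, 4], "fit with organization": [2, 4]},
}
print(round(maua_score(hierarchy), 3))  # (4.0 + 4.0 + 3.5) / 3 = 3.833
```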
60
Multi-Attribute Utility Assessment (MAUA) Evaluation Hierarchy (figure; node labels include Overall System Utility, Process Quality, Fit With Organization, and Person-Machine Functional Allocation). (From Adelman, L. & Riedel, S.L. (1997). Handbook for Evaluating Knowledge-based Systems.)
61
Objective Usability Evaluation Methods
Objective data about how well users can actually use a system can be collected by empirical evaluation methods, and this kind of data is the best one can gather to evaluate system usability.
Four evaluation methods are proposed:
1. Usability testing
2. Logging activity use (includes system use)
3. Activity analysis
4. Profile examination
Usability testing and logging system use have been identified as effective methods for evaluating a stable prototype (Adelman and Riedel, 1997).
62
Usability Testing (1)
1. Usability testing is the most common empirical evaluation method; it assesses a system's usability on pre-defined objective performance measures.
2. It involves potential system users in a laboratory-like environment. Users are given either test cases or problems to solve after receiving proper training on the prototype.
3. The evaluation team collects objective data on the usability measures while users are performing the test cases or solving problems:
   - users' individual or group performance data
   - the relative position (e.g., how much time difference or how many times) of the current level of usability against (1) the best possible expected performance level and (2) the worst expected performance level. This is our upper/lower bound baseline.
63
Usability Testing (2)
4. The best possible expected performance level can be obtained by having development team members perform each of the tasks and recording their results (e.g., time).
5. The worst expected performance level is the lowest level of acceptable performance - the lowest level at which the system could reasonably be expected to be used. We plan to use 1/6 of the best possible expected performance level as our worst expected performance level in our initial study. This proportion is based on the study in the usability testing handbook (Uehling, 1994).
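A minimal sketch of the upper/lower-bound baseline described above, interpreting "performance level" as a score where higher is better, so the worst acceptable level is one sixth of the best. The numeric values are invented for illustration:

```python
def relative_position(actual, best, worst):
    """Where observed performance sits between the worst (0.0)
    and best (1.0) expected performance levels."""
    return (actual - worst) / (best - worst)

best = 60.0          # hypothetical best score, from development-team runs
worst = best / 6.0   # worst acceptable level = 1/6 of best (Uehling, 1994)
print(relative_position(35.0, best, worst))  # 0.5: halfway between bounds
```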
64
Logging Activity Use (1)
Some of the important objective measures (system use) provided by Nielsen (1993) are listed below:
- The time users take to complete a specific task.
- The number of tasks of various kinds that can be completed within a given time period.
- The ratio between successful interactions and errors.
- The time spent recovering from errors.
- The number of user errors.
- The number of system features utilized by users.
- The frequency of use of the manuals and/or the help system, and the time spent using them.
- The number of times the user expresses clear frustration (or clear joy).
- The proportion of users who say that they would prefer using the system over some specified competitor.
- The number of times the user had to work around an unsolvable problem.
65
Logging Activity Use (2)
The proposed logging activity use method not only collects system-feature-related information, but also records user behavior and relationships between tasks/movements (e.g., sequence, preference, pattern, trend, etc.). This provides better coverage at both the system and user levels.
66
Activity Analysis Profile Examination Methods
- The proposed activity analysis method analyzes user behavior (e.g., sequence or preference) while using the system.
- This method, together with profile examination, will provide extremely useful feedback to users, developers, and the agent.
- The profile examination method analyzes consolidated data from user logging files and provides summary-level results.
- It reviews single-user profiles, group profiles, and all-user profiles, and compares behaviors, trends, and progress for the same user; it also provides comparisons among users, groups, and the best possible expected performance.
67
Integration of System Prototyping and Evaluation
68
Architecture of Disciple Learning Agent Shell
69
Performance Evaluation
1. Experimentation is the most appropriate method for evaluating the performance of the stable prototype (Adelman and Riedel, 1997; Shadbolt et al., 1999).
2. There are two major kinds of experiments - laboratory and field experiments (the latter allow the evaluator to rigorously evaluate the system's effect in its operational environment).
3. Collect both objective data and subjective data for performance evaluation.
70
Compare to baseline behavior?
Measure and compare speed, memory, accuracy, competence, and creativity for solving a class of problems in different settings.
What are some of the settings to consider?
- Human alone
- Agent alone
- Mixed-initiative human-agent system (MI)
- Non-mixed-initiative human-agent system
- Ablated mixed-initiative human-agent system (MI-)
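A comparison across the settings listed above can be sketched as a small table of measurements normalized against the baseline. All numbers and setting names here are illustrative assumptions, not experimental results:

```python
# Hypothetical measurements for one problem class (times in minutes,
# accuracy in [0, 1]); the numbers are for illustration only.
settings = {
    "human_alone":      {"speed_min": 120, "accuracy": 0.70},
    "agent_alone":      {"speed_min": 15,  "accuracy": 0.55},
    "mixed_initiative": {"speed_min": 40,  "accuracy": 0.90},  # MI
    "ablated_MI":       {"speed_min": 60,  "accuracy": 0.80},  # MI-
}

def relative_to_baseline(settings, baseline="human_alone"):
    """Express each setting's metrics relative to the baseline:
    a speedup factor and an accuracy difference."""
    base = settings[baseline]
    return {
        name: {
            "speedup": base["speed_min"] / m["speed_min"],
            "accuracy_gain": round(m["accuracy"] - base["accuracy"], 2),
        }
        for name, m in settings.items() if name != baseline
    }
```

With these illustrative numbers, the MI setting is 3x faster than the human-alone baseline with a 0.2 accuracy gain, while the ablated system shows how much of that gain each removed capability accounted for.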
71
Other complex questions
Consider the setting:
- Human alone (baseline)
- Mixed-initiative human-agent system (MI)
How do we account for human learning during the baseline evaluation?
- Use other humans? Then how do we account for human variability?
- Use many humans? How do we pay for the associated cost?
- Replace a human with a simulation? How well does the simulation actually represent a human? Since the simulation is not perfect, how good is the result? How much does a good simulation cost?
72
Important Studies of Performance Evaluation (1)
Several important studies are needed for a
through performance evaluation 1).
Knowledge-level study - Analyzes agents
overall behavior and knowledge formation rate,
knowledge changes (adding, deletion, and
modification) and reasoning. -
Not only analyze the size changes among KBS
during the KB building process, but
also examine the real content changes among
KBS (e.g., same rule (name) in different
phase of KBS may cover different
examples or knowledge elements). - By just
comparing number of rules (or even name or rules)
in different KBS will not give us the
whole picture of knowledge changes.
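A content-level KB diff of the kind described above can be sketched as follows, assuming a simplified KB representation (rule name mapped to the set of examples it covers); the representation is an assumption for illustration:

```python
def diff_kbs(kb_old, kb_new):
    """Compare two KB snapshots.  Each KB maps rule name -> frozenset of
    covered examples, so a rule that keeps its name but changes its
    coverage is reported as modified -- counting rules alone (or
    comparing names alone) would miss this change."""
    added    = sorted(set(kb_new) - set(kb_old))
    deleted  = sorted(set(kb_old) - set(kb_new))
    modified = sorted(r for r in set(kb_old) & set(kb_new)
                      if kb_old[r] != kb_new[r])
    return {"added": added, "deleted": deleted, "modified": modified}

# Hypothetical snapshots from two phases of KB development.
phase1 = {"R1": frozenset({"ex1", "ex2"}), "R2": frozenset({"ex3"})}
phase2 = {"R1": frozenset({"ex1", "ex2", "ex4"}), "R3": frozenset({"ex5"})}
```

Both snapshots contain two rules, so a pure rule count reports no change; the content diff shows one rule added, one deleted, and R1 generalized to cover a new example.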
73
Important Studies of Performance Evaluation (2)
2). Ablation study tests a specific method, or a
specific aspect of the knowledge base
development methodology. - Tool
ablation study tests different versions of tools
that have different capabilities under
different conditions, with some of the
capabilities inhibited. - Knowledge
ablation study - incomplete and incorrect
knowledge - These experiments will
demonstrate the effectiveness of various
capabilities (a good example to explain why
design and development behavior and
philosophy will be changed when
incorporating evaluation utility and data
collecting function into the agent
we need to build the agent much more flexible.
74
Important Studies of Performance Evaluation (3)
3). Sensitivity (or degradation) study which by
degrades parameters or data structure of a
system and observing the effects on
performance. - In order to get meaningful
results, the change of parameters
should be plausible and not random. - For
example, presenting only some of the explanations
generated by agent, or when doing
generalization, using different
level/depth/length of Semantic Net link/path)
Clancey and Cooper (Buchanan and Shortliffe,
1984, chapter 10), who tested MYCINs
sensitivity to the accuracy of certain
factors (CFs) by introducing inaccuracy
and observing the effect on therapy
recommendations
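The CF sensitivity idea can be sketched in miniature. This is a toy stand-in, not MYCIN's actual CF calculus: a recommendation function compares two certainty factors, and plausible (non-random) perturbations probe when the recommendation flips:

```python
def recommend(cf_a, cf_b, threshold=0.2):
    """Toy diagnosis: recommend the therapy whose certainty factor
    wins by more than `threshold`, else report a tie."""
    if cf_a - cf_b > threshold:
        return "therapy_A"
    if cf_b - cf_a > threshold:
        return "therapy_B"
    return "either"

def sensitivity(cf_a, cf_b, deltas):
    """Perturb cf_a by each plausible delta and record whether the
    recommendation changes -- the degradation-study pattern."""
    base = recommend(cf_a, cf_b)
    return {d: recommend(cf_a + d, cf_b) != base for d in deltas}

# A large CF degradation (-0.3) flips the recommendation; a small
# inaccuracy (-0.05) does not, so the system is robust to it.
flips = sensitivity(0.8, 0.5, deltas=[-0.3, -0.05])
```

The same pattern applies to any system parameter: pick plausible perturbations, rerun, and record which ones change the output.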
75
Overview
Common Evaluation Issues
Evaluating Spoken Dialogue Agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
76
Experiments Design Concepts
77
Experiments Guidelines
  • I. Key Concept: thoroughly design our experiments to collect data and to generate timely evaluation results
  • II. General Guidelines
  • Involve multiple iterations of laboratory
    experiments and field experiments
  • Collect Subjective and Objective Data
  • Compare Subjective and Objective Data
  • Evaluate alternative KB development methods
    (based on mixed-initiative approach) and compare
    their results
  • Compare evaluation results with a baseline (or model answer); if a baseline does not exist, we'll create one from the development laboratory or in our first iteration experiment
  • Utilize scoring functions when applicable
  • Handle Sample Size Issue
  • Include multiple controlled studies

78
  • Participant(s): design SME (laboratory experiment), field SMEs (field experiment), and the development team
  • Domain(s): any (e.g., military operational domains)
  • Studies: knowledge-level, tool ablation, knowledge ablation, sensitivity/degradation, and simulated experts
  • Main tasks: import and enhance ontology, knowledge acquisition, knowledge base repairing/fixing, knowledge base extension and modification, and problem solving (e.g., the Course Of Action challenge problem)
  • Key methods: (1) use full system functions to perform tasks and record results and activities, (2) use the agent with and without one or some of the (previously identified) functions, (3) use the agent with and without some (pre-defined) knowledge, (4) use various system-level variables to perform tasks (laboratory experiment), and (5) use various usability evaluation methods to perform tasks

79
  • Data collection and measurements
  • Build both user and knowledge profiles
  • Performance measures - speed/efficiency, accuracy/quality, prior knowledge reuse, knowledge element changes, size and size change of the knowledge base, KA rate, mixed knowledge contribution, etc.
  • Usability measures - collect both subjective and objective data (including the best expected data, and derive the worst expected data)
  • Test against gold, silver, or model answers (e.g., Recall and Precision measures)
  • Create baselines
  • Results analysis
  • Automatic results analysis from the evaluation module
  • Further analyses by the evaluator

80
Generic Method for Tool Ablation Experiment
  • Identify critical components that need to be evaluated
  • Create various versions (N) of the tool
  • Base version (lower bound - minimum system requirements, B)
  • Complete version (upper bound - full system functionality, C)
  • Intermediate versions (base version plus one or more critical components, I1, I2, ...)
  • Organize experts in groups
  • At least two groups (X and Y)
  • A minimum of two experts per group is recommended
  • Prepare M sets (N/2 < M < N) of tasks/questions (M1, M2, ...)
  • Tasks/questions will be designed to be as disjoint as possible, but with a similar level of complexity
  • Perform tasks based on combinations of group, tool version, and task set:
  • X-B-M1, X-I1-M3, Y-I2-M2, Y-C-M4, ...
  • Minimize transfer effects
  • Monitor possible order effects

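The group x tool-version x task-set pairing above can be sketched programmatically. The round-robin pairing below is an illustrative simplification; a real design would also counterbalance presentation order to control transfer and order effects:

```python
import itertools

def plan_ablation(groups, versions, task_sets):
    """Assign (group, tool version, task set) trials round-robin so
    that the task sets never repeat (they must stay disjoint) and the
    versions are spread across groups -- a crude way to limit
    transfer effects."""
    plan, used_tasks = [], set()
    next_group = itertools.cycle(groups)
    for version, tasks in zip(versions, task_sets):
        assert tasks not in used_tasks  # task sets must be disjoint
        used_tasks.add(tasks)
        plan.append((next(next_group), version, tasks))
    return plan

# Two groups, four tool versions (base, two intermediates, complete),
# four disjoint task sets, as on the slide.
plan = plan_ablation(["X", "Y"], ["B", "I1", "I2", "C"],
                     ["M1", "M3", "M2", "M4"])
```

For this input the plan is X-B-M1, Y-I1-M3, X-I2-M2, Y-C-M4: every version gets exercised, no group repeats a task set, and each group sees both a weaker and a stronger version.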
81
Generic Method for Knowledge Ablation Experiment
  • Identify knowledge elements that need to be removed or modified
  • Create at least three versions of the knowledge base (ontology)
  • Complete knowledge version (W)
  • Version without the identified knowledge elements (O)
  • Version with modified knowledge (M)
  • Organize experts in groups
  • At least two groups (A and B)
  • A minimum of two experts per group is recommended
  • Prepare as many sets of tasks/questions as there are versions of KBs (e.g., X, Y, Z)
  • Tasks/questions will be designed to be as disjoint as possible, but with a similar level of complexity
  • Perform tasks based on combinations of group, task set, and KB version:
  • A-X-W, A-Y-O, B-X-O, B-Y-W, and A-Z-M or B-Z-M
  • Minimize transfer effects
  • Monitor possible order effects

82
Provides timely feedback on the research and
design
83
Provides timely feedback on the research and
design
84
Laboratory Experiments - Participants: design SME and the development team. Environment: design/development laboratory. Main tasks: import/modify ontology; KA; create/refine/fix KBs; KB extension and modification; problem solving.
Field Experiments - Participants: non-design SME and a developer. Environment: non-design laboratory (near-operational environment). Main tasks: modify ontology; KA; create/refine/fix KBs; problem solving.
  • Subjective Methods
  • Thinking aloud
  • Observation
  • Questionnaires
  • Interviews
  • Focus group
  • User feedback
  • User diaries
  • Teaching back
  • Video taping
  • Subjective Methods
  • Design SMEs' feedback
  • Peer review/feedback
  • Developer diaries/implementation notes
  • Objective Methods
  • Usability testing
  • Log Activities
  • Build profiles
  • Techniques
  • Activity analysis
  • Profile examination
  • Create baselines

Usability
  • Objective Methods
  • Collect Best expected
  • performance data
  • (upper bound)
  • Derive Worst expected
  • performance data (lower bound)
  • Log Activities
  • Build profiles

Evaluation Utilities
  • Techniques
  • Activity analysis
  • Profile examination
  • Create baselines
  • Automatic data collection functions
  • Automatic data analysis utilities
  • Generate/evaluate user knowledge level profiles
  • Provide feedback (dynamic, on request, or phase-end)
  • Utilize scoring functions
  • Automatic questionnaire answers collection and
    analysis
  • Integrate KB verification software
  • Consolidate evaluation results
  • Subjective Methods
  • Questionnaires
  • User feedback
  • Video taping
  • Subjective Methods
  • Questionnaires
  • User feedback
  • Objective Methods
  • Knowledge-level study
  • Tool Ablation Study
  • Knowledge Ablation Study
  • Build/apply profiles
  • Objective Methods
  • Knowledge-level study
  • Tool Ablation Study
  • Knowledge Ablation Study
  • Sensitivity/degradation study
  • Expert simulation
  • Build/apply profiles
  • Techniques
  • Measure aspects directly related to the system
  • (e.g., knowledge increase rate)
  • Multiple iterations and domains
  • Test against gold or model answers
  • Recall and Precision measures
  • Create baselines
  • Controlled studies
  • Compare results with similar approaches and
    systems
  • Techniques
  • Multiple iterations and domains
  • Measure aspects directly related to the system
  • Test against gold, silver, or model answers
  • Score results
  • Create baselines
  • Controlled studies
  • Apply a large effect size (e.g., increase the KA rate by one or two orders of magnitude) to handle the small-sample-size problem
  • Compare results with similar approaches and
    systems

Performance
Existing, enhanced, and new techniques and
methods
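The "large effect size vs. small sample size" trade-off mentioned for the performance experiments can be made concrete with Lehr's rule of thumb for a two-sample t-test (an addition for illustration; the rule is not stated in the slides and assumes alpha = 0.05 and power = 0.80):

```python
import math

def n_per_group(effect_size_d):
    """Lehr's rule of thumb for a two-sample t-test (alpha = 0.05,
    power = 0.80): roughly 16 / d^2 subjects per group, where d is
    Cohen's standardized effect size."""
    return math.ceil(16 / effect_size_d ** 2)

# A modest effect (d = 0.5) needs ~64 experts per group -- infeasible
# when domain experts are scarce -- while a very large effect (d = 4,
# e.g., an order-of-magnitude jump in KA rate) needs only one or two.
small_effect = n_per_group(0.5)
large_effect = n_per_group(4.0)
```

This is why the guidelines insist on demonstrating order-of-magnitude improvements: with only a handful of available SMEs, only very large effects are statistically detectable.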
85
  • Scoring methods for problem solving activities
  • Recall = (Score for all Correct or Partly Correct Answers) / (Number of Original Model Answers)
  • Precision = (Score for all Answers to Questions) / (Number of System Answers Provided)
  • TotalScore (of a question Q) = Wc x Correctness(Q) + Wj x Justification(Q) + Wi x Intelligibility(Q) + Ws x Sources(Q) + Wp x Proactivity(Q)
  • where Justification (of a question Q) = Wjp x Present(Q) + Wjs x Soundness(Q) + Wjd x Detail(Q)

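The scoring functions on this slide translate directly into code. A sketch, with the weight values and the question's component scores as illustrative assumptions (the slide does not fix them):

```python
def recall(correct_score, n_model_answers):
    """Recall = score for all correct or partly correct answers
    divided by the number of original model answers."""
    return correct_score / n_model_answers

def precision(answer_score, n_system_answers):
    """Precision = score for all answers to questions divided by the
    number of system answers provided."""
    return answer_score / n_system_answers

def total_score(q, w):
    """Weighted total score of a question q, following the slide:
    Justification is itself a weighted sum of its sub-scores."""
    justification = (w["jp"] * q["present"]
                     + w["js"] * q["soundness"]
                     + w["jd"] * q["detail"])
    return (w["c"] * q["correctness"]
            + w["j"] * justification
            + w["i"] * q["intelligibility"]
            + w["s"] * q["sources"]
            + w["p"] * q["proactivity"])

# Illustrative weights (assumed, not from the slide); each group of
# weights sums to 1 so scores stay in [0, 1].
weights = {"c": 0.4, "j": 0.3, "i": 0.1, "s": 0.1, "p": 0.1,
           "jp": 0.5, "js": 0.3, "jd": 0.2}

# Illustrative component scores for one question.
q = {"correctness": 1.0, "intelligibility": 1.0, "sources": 1.0,
     "proactivity": 0.0, "present": 1.0, "soundness": 0.5, "detail": 0.0}
```

With these assumed values, a question answered correctly and intelligibly but with a weak justification and no proactivity scores 0.795; recall and precision are then computed over all questions, e.g. `recall(8, 10)` when the partly-credited answers sum to 8 against 10 model answers.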
86
Overview
Common Evaluation Issues
Evaluating Spoken Dialogue Agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
87
  • Our approach provides some key concepts for effectively and thoroughly evaluating mixed-initiative intelligent systems
  • Fully utilizing integrated evaluation utilities for continuous and automatic evaluation
  • Thoroughly designing laboratory and field experiments (e.g., multiple iterations and handling the sample size issue)
  • Creating baselines and using them for results comparison
  • Applying various kinds of studies
  • Collecting and analyzing subjective and objective data via many methods (e.g., examining profiles, scoring functions, and model answers)
  • Comparing subjective data with objective data

88
  • Our approach provides some key concepts for useful and timely feedback to users and developers
  • Building data collection, data analysis, and evaluation utilities into the agent
Generate and evaluate user knowledge data files and profiles
Provide three kinds of feedback - dynamic, on request, and phase-end - to users, developers, and the agent