E-Science, the GRID and Statistical Modelling in Social Research Rob Crouchley Collaboratory for Quantitative e-Social Science University of Lancaster - PowerPoint PPT Presentation

1 / 45

About This Presentation

Title:

E-Science, the GRID and Statistical Modelling in Social Research Rob Crouchley Collaboratory for Quantitative e-Social Science University of Lancaster

Description:

E-Science, the GRID and Statistical Modelling in Social Research Rob Crouchley Collaboratory for Quantitative e-Social Science University of Lancaster – PowerPoint PPT presentation

Number of Views:265

Avg rating:3.0/5.0

Slides: 46

Provided by: RobertCr6

Category:

more less

Transcript and Presenter's Notes

Title: E-Science, the GRID and Statistical Modelling in Social Research Rob Crouchley Collaboratory for Quantitative e-Social Science University of Lancaster

1
E-Science, the GRID and Statistical Modelling in
Social Research Rob CrouchleyCollaboratory
for Quantitative e-Social Science University of
Lancaster
2
Contents

The Problem/Motivation Some Background on
Statistical Methods and Social Research
A Solution to part of the Problem? GRID Enabling
the Analysis of Multiprocess Random Effect
Response Data
Questions.

3
Part 1. Some Background onStatistical Methods
and Social Research

Some Features of Social Science Research
Complications
A computationally demanding example
Sabre and Stata/MP

4
Some Features of Quantitative Social Science
Research

We often want to develop evidence based
substantive theory. We want to know what
determines what, e.g. long term unemployment and
social exclusion
And we want to explore the consequences of policy
changes on individual behaviour, e.g.
encouragement to stay on at school on educational
attainment, truancy, and social exclusion
Our data sets are often very small (lt10GB)

5
Some of the Complexities of non experimental data

Cluster effects, random and fixed effects
Contextual effects
Measurement Error
Missing data, dropout and selection
Parametric Assumptions
Endogenous Effects

6
Some of the Consequent Issues

Disentangling the contributions created by the
different complexities for our results is
computationally intensive
Results really change as our model becomes more
comprehensive e.g. direct effects change sign,
other become NS
Problems of Large Scale Fixed Effects Analysis,
sparse matrices
To tackle these complexities we could use GRID
enabled tools, resources and services.

7
Social Science Research

Randomised experiments offer the most powerful
tool to understand social processes, but outside
of psychology, they are infeasible, unethical or
inappropriate (e.g. for instance we can not
allocate pupils to different levels of
education)
Social scientists must therefore rely on
observational data from longitudinal and other
surveys e.g. YCS, NCDS, BHPS, The analysis of
non experimental data involves complications..

8
Complication 1. Cluster Effects (CE)

Most large scale surveys use multi-stage sample
designs to obtain 'representative' samples this
procedure often creates cluster effects, e.g.
BHPS (households), YCS (schools)
Pupils in the same class are often more
behaviourally alike than pupils in different
classes (even in the same school)

9
Complication 1. Cluster Effects (CE)

Procedures have been developed to model cluster
effects by means of shared random effects -
MLwiN, Stata (Gllamm), SAS, AML
The estimation of non-identity link (and non
nested CE) models, e.g. probit, can be
computationally demanding

10
Complication 2. Measurement Errors (ME)

In observational studies, it is rarely possible
to measure all relevant covariates accurately,
e.g. age, educational attainment
Ignoring ME can seriously mislead the
quantification of the link between explanatory
and response variables
ME in one covariate can bias the association
between other covariates and the response
variable, even if those other covariates are
measured without error

11
Complication 2. Measurement Errors (ME)

Also, some important determinants of behaviour
are either not measured (i.e. omitted) or are
unmeasurable (e.g. motivation)
Repeated measures and longitudinal data provide
the opportunity to deal with ME in explanatory
variables, this adds to the computational demands
of the analysis.

12
Complication 3.Missing Data, Dropout and
Selection

All of the major longitudinal data sets available
to the British social science community, (e.g.
YCS, BHPS and NCDS), contain missing data and
dropout
Ignoring this could create bias in the model
estimated on the data
We need to model, as realistically as possible,
the process by which the observed subjects have
been retained in the sample, otherwise we will
not know how much bias is present in our results
Also, some sample designs create selection
effects of their own, e.g. by using a subset of
locations, or oversampling the poor
These add to the computational demands of the
analysis.

13
Complication 4.Parametric Assumptions

Our statistical tools are assumption rich
Parametric linear predictors,
Parametric link functions and error structures
What if the assumed parametric relationships do
not hold?
BUT - Nonparametric statistical models are
computationally intensive.

14
Complication 5.Endogenous effects

The curse of endogenous effects, everything seems
to depend on everything else
We need multiprocess models (simultaneous
equations) to disentangle this complexity, adds
to computation

15
Disentangling complexity with existing tools an
example

This is the kind of example that got me
interested in e-Science.

16
Disentangling complexity with existing tools an
exampleendogenous effects

The YCS is a multi-stage stratified clustered
random sample of individuals ages 16-17
I use YCS6 which covers young people eligible to
leave school in 1990-91, who are then observed
over the 1992-94 period.

17
Part-time work and truancy are potential
determinants of educational attainment

A comprehensive model will allow us to
disentangle the observable, direct, effects of
truancy on educational attainment from any
effects that arise from correlation in the errors
(unobserved effects).

18
Educational Attainment
19
Level of truancy
20
Part Time Work
21
Trivariate Ordered Probit Model(Path Diagram)
Independent Errors (ep, et, eq)
Part-time work
Educational Attainment
Truancy
22
Independent Errors (ep, et, eq)

This model is quick (1-2 seconds) to estimate, 3
linear predictors
- Probit for PT work,
- Ordered Probits for Truancy and
Qualifications
We can use standard software, e.g. Stata.

23
Correlated Errors
24
Correlation Structure
25
Problems and Model Extensions

Cant use standard software to fit the model via
MLE
I used NAG software library, it has special
routines to evaluate high dimensional
multivariate normal integrals
Even so, this Model can take 2-3 weeks to
estimate on a P4, 3 linear predictors, 169
parameters, 8,496 trivariate integrals for each
function evaluation
Results from this model are quite different to
those estimated under independence e.g. one
direct effect changes sign, another becomes NS

26
What is happening?

Evaluating lots of 3 dimensional integrals in
order to compute our likelihood functions is
computationally demanding
We could
Try other methods for evaluating integrals such
as Gibbs sampling and MCMC,
Use approximations
Laplace expansions with many terms
Pseudo and Quasi Likelihood Methods
Estimate fixed effects versions of the models
Use Instruments for the endogenous covariates
All can be computationally demanding, and each
approach has its own problems

27
If we want to go this way, what can we do?

Use parallel algorithms on the Grid
Use faster Hardware, e.g. HPCx, (also part of the
Grid)
Both

28
In the education example Ive assumed

Particular directions for the direct effects
No Non Ignorable dropout in the YCS
No School Cluster effects present
MVN Error structure
Linear predictor, additive function
No measurement error in observed covariates
We do not yet have the computational power (on
the GRID) to relax all the assumptions
simultaneously in this model.

29
The Grid some Definitions

"is the Web on steroids."
"is distributed computing across multiple
administrative domains"
Dave Snelling, senior architect of UNICORE
provides flexible, secure, coordinated
resource sharing among dynamic collections of
individuals, institutions, and resource
From The Anatomy of the Grid Enabling Scalable
Virtual Organizations
"enables communities (virtual organizations)
to share geographically distributed resources as
they pursue common goals.."

30
SABRE Software for the Analysis of Binary
Recurrent Events

What is it ?
Programme for analyising multivariate binary,
ordinal, count and recurrent events data. Employs
fast numerical algorithms. Uses Gaussian
Quadrature and NPMLE for the REs
Some typical application areas.
Infertility in humans, animal husbandry.
Voting, trade union membership, economic activity
and migration.
Absenteeism studies.

31
(No Transcript)
32
SABRE Why use it ?
gt6 months
gt1 week
Data is administrative records covering the
duration in employment in the workforce of a
major Australian state government to investigate
the determinants of quits and separations amongst
permanent and temporary workers. NP base line
hazard, quadrature for the REs
33
An Alternative Stata/MP
34
What about SABRE and Stata/MP

Stata/MP is 1.7 times faster on 2 processors
Stata/MP is 2.8 times faster on 4 processors
Stata/MP is 4 times faster on 8 processors
Sabre can have a bit faster speedup, but the big
difference is probably the base from which
Stata/MP starts.
Using the previous example on our HPC we could
have (in minutes)

35
An empirical analysis of vacancy duration using
micro data from Lancashire Careers Service over
the period 19851992, NP base line hazard,
quadrature for the REs
36
What have I said so far?

That the estimation (via maximum likelihood) of
some statistical models can be very
computationally demanding and beyond what you can
usefully do on your desktop.

37
Ways of running Sabre on the GRID

Directly via the operating system, e.g. Globus
Via a Portal, e.g. Science Gateway
Via a desktop application, like the tip of an
iceberg (Im going to concentrate on this for
the rest of the talk)

38
Using the Grid Via a Desktop Application

Separation of Client and Server Logic
Why ?
Implementation of Service Logic may change to
allow for improved algorithms, models or
scheduling policies and so on
However, user interface stays the same!!

39
Using the Grid Via a Desktop Application

Take as an example SABRE
Using GROWL Grid Resources on a Workstation
Library.
3 Integration of SABRE functionality into
Statistics Software (R and Stata)

40
Solution - How

Host Sabre as Secure Web Service
Service needs to be secure
Service needs to be persistent
Many services provided via a single host on a
single port
Multiple clients
Difficult to do !!
Above features easy to host by employing generic
GROWL server allows the developer to
concentrate just the service logic (algorithms,
scheduling etc)

41
Web services

A software system designed to support
interoperable machine-to-machine interaction over
a network.
It has an interface that is described in a
machine-processable format such as WSDL.
Other systems interact with the Web service in a
manner prescribed by its interface using
messages, which may be enclosed in a SOAP
envelope, or follow a RESTful approach.
These messages are typically conveyed using HTTP,
and normally comprise XML in conjunction with
other Web-related standards.
Software applications written in various
programming languages and running on various
platforms can use web services to exchange data
over computer networks like the Internet in a
manner similar to inter-process communication on
a single computer.
This interoperability (for example, between Java
and Python, or Microsoft Windows and Linux
applications) is due to the use of open
standards.
OASIS and the W3C are the primary committees
responsible for the architecture and
standardization of web services.

42
Client
Client
Client
Client
First Tier
Second Tier
Configuration
GROWL Server
Agent
Agent
Agent
Agent
Third Tier
Services
43
Example Using Sabre on a GRID from Stata

User gets a Stata plugin (unzip it in the users
ado directory)
This adds some items to the Stata menus
And provides a series of dialogue boxes

44
(No Transcript)
45
(No Transcript)
46
GROWL SERVICES

Could contain lots of other software, e.g. MCMC
software on the Grid
Could use lots of different systems, NGS, NWG, etc

47
(No Transcript)
48
(No Transcript)
49
Integration
50
Integration
51
Integration
52
Integration
53
Authentication required for a Fit
54
SABRE Availability and Support

Web Site http//sabre.lancs.ac.uk
Full Command Documentation
Tutorials
Example Data
Publications
Downloads
SabreR binary R packages including
documentation (end 06/2006)
SabreStata Stata plugin including
documentation (end 07/2006)
Sabre source code

55
What have I said in part 2

.
There are beginning to be some tools that can
make a lot more resources (Grid) available to you
from within desktop applications.

56
e-science. lancs.ac.uk/cqess/
Grid Resources on Work Stations
GROWL employs a client/server architecture that
hides the complexity of GRID middleware from the
user. Client access to GROWL employs a secure
(PKI/SSL) connection to a single port on the host
system and clients are authenticated using the
distinguished name extracted from their
certificate. The use of a persistent server to
access grid resources allows all of the service
logic to be hosted by the server, making the
client application, library or plugin extremely
lightweight.
Future developments

Course material for the use of Sabre is currently
being developed.
It is planned to launch a Sabre/GROWL service on
the North West Grid within the coming year. This
will provide a utility based grid resource.
Research into labour markets using Sabre/Growl.
SABRE will become available as a plug in for
STATA

Further information http//www. sabre.lancs.ac.uk
Middleware for e-Social Science
Development of a parallel, multilevel,
multi-process (OGSA) implementation of SABRE as
an R object to enable the Social Scientists to
disentangle the full stochastic complexity of
socio-economic processes.
SABRE development
SABRE and GROWL
GROWL provides a client-side lightweight library
as a plug in to R, providing easy user friendly
access to Grid resources and computational power,
providing
57
You can watch a more detailed presentation about
Growl by Dan Grose at the NCeSS conference on
line at http//redress.lancs.ac.uk/Workshops/Pre
sentations.html
58
Version on my PC