Title: Big Data in UK Biobank: Opportunities and Challenges Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish
1. Big Data in UK Biobank: Opportunities and Challenges
Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish and Welsh Governments, British Heart Foundation and Diabetes UK
- Rory Collins
- UK Biobank Principal Investigator
- BHF Professor of Medicine and Epidemiology
- Nuffield Department of Population Health
- University of Oxford, UK
2. UK Biobank: Prospective Cohort
- 500,000 UK men and women aged 40-69 years when recruited and assessed during 2006-2010
- Extensive baseline questions and measurements, with stored biological samples (and opportunities to add enhanced assessments in large subsets)
- Repeat assessments over time in subsets of the participants to allow for sources of variation
- General consent for follow-up through all health records and for all types of health research
- Sufficiently large numbers of people developing different conditions to assess causes reliably
3. Need for prospective studies to be LARGE: CHD versus SBP for 5K vs 50K vs 500K people in the Prospective Studies Collaboration (PSC)
[Figure: three panels (5000, 50,000 and 500,000 people), each plotting CHD (vertical axis, doubling scale from 1 to 256) against usual SBP (120 to 180 mmHg), with one line per age at risk (40-49, 50-59, 60-69, 70-79, 80-89)]
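The reason study size matters can be illustrated with a rough calculation. The sketch below is not taken from the PSC analysis: it assumes the standard large-sample approximation for the standard error of a log rate ratio, and the 2% event fraction and the even split of events between two groups are hypothetical values chosen only for illustration.

```python
import math


def log_rr_standard_error(events_exposed: int, events_unexposed: int) -> float:
    """Approximate standard error of a log rate ratio from two event counts."""
    return math.sqrt(1.0 / events_exposed + 1.0 / events_unexposed)


def ci_limit_ratio(se: float, z: float = 1.96) -> float:
    """Ratio of the upper to the lower 95% confidence limit of the rate ratio."""
    return math.exp(2.0 * z * se)


# Hypothetical assumption: 2% of each cohort has the event,
# and events are split evenly between the two groups compared.
EVENT_FRACTION = 0.02

for cohort_size in (5_000, 50_000, 500_000):
    events_per_group = round(cohort_size * EVENT_FRACTION / 2)
    se = log_rr_standard_error(events_per_group, events_per_group)
    print(f"{cohort_size:>7} people: SE(log RR) = {se:.3f}, "
          f"upper/lower CI limit ratio = {ci_limit_ratio(se):.2f}")
```

Under these assumptions the standard error falls by a factor of about 3.2 (the square root of 10) for each tenfold increase in cohort size, which is why the smaller cohorts give much noisier estimates within each age group.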
4. Locations of UK Biobank assessment centres around the UK (with people recruited from urban and rural areas)
5. UK Biobank: 500,000 participants aged 40-69 recruited in 2007-10
Age          40-49    119,000
Age          50-59    168,000
Age          60-69    213,000
Gender       Male     228,000
Gender       Female   270,000
Deprivation  More      92,000
Deprivation  Average  166,000
Deprivation  Less     241,000
Generalisability (not representativeness): heterogeneity of study population allows associations with disease to be studied reliably
6. Production line baseline assessment visit (improved throughput and efficient staffing)
7. Baseline assessment: Questionnaire content
- Self-completion topics: median time (minutes)
- Socio-demographics 1.7
- Ethnicity 0.1
- Work-employment 1.4
- Physical activity 4.4
- Smoking (non-smokers) 0.5
- (past/current smokers) 1.5
- Diet (food frequency) 4.5
- Alcohol 1.1
- Sleep 1.2
- Sun exposure 1.3
- Environmental exposures 1.0
- Early life factors 0.8
- Family history of common diseases 1.6
- Reproductive history screening (women) 2.4
- (men) 0.8
- Sexual history 0.4
- General health 2.1
Interview topics: median time (minutes)
- Medical history/medication 3.1
- Occupation 0.4
- Other 0.6
- Total time 4.1
Subset of 200,000 participants repeated daily diet diaries conducted via the internet
Touchscreen and interview questions (plus extra enhancement questions) available at www.ukbiobank.ac.uk
8. Baseline assessment: Physical measurements (with enhanced measures in large subsets)
- All 500,000 participants
  - Blood pressure and heart rate
  - Height (standing/seated)
  - Waist/hip circumference
  - Weight/impedance
  - Spirometry
  - Heel ultrasound
- Subset 175,000 participants
  - Hearing test
  - Vascular reactivity
- Subset 120,000 participants
  - Visual acuity, refractive index and intraocular pressure
- Subset 85,000 participants
  - Retinal images and optical coherence tomograms
  - Fitness test and ECG (limb leads)
9. UK Biobank: different types of biological sample allowing a wide range of different assays
Sample collection tube | Fractions collected | Potential assays
Na EDTA | Plasma; Buffy coat; Red cells | Plasma proteome and metabonome; Assays of genomic DNA; Membrane lipids and heavy metals
Lithium Heparin (PST) | Plasma | Plasma proteome and metabonome (without haemolysis)
Silica clot accelerator (SST) | Serum | Serum proteome and metabonome (without haemolysis)
Acid citrate dextrose | Whole blood | Assays of DNA extracted from EBV immortalised cell lines (B-cell transcriptome)
EDTA | Whole blood | Standard haematological parameters
Tempus RNA stabilisation | Whole blood with lysis reagent | Blood transcriptome; Representative transcriptomes of other tissues
Urine | Urine | Urine proteome and metabonome; Gut microbiome
Saliva | Mixed saliva sample | Salivary proteome and metabonome; Salivary microbiome (Mucosal proteome and metabonome)
10. Further enhancements of the phenotyping of UK Biobank participants currently being conducted
- Web-based assessments of diet completed
11. Web-based dietary assessment: 24-hr recall
- Design considerations
- Easy and quick: takes only 10-15 minutes
- Automated data collection and coding
- Repeatable (capturing seasonal variation)
- Detailed enough to estimate nutrient intake
- Over 200,000 participants completed the
questionnaire at least once, and about 90,000 did
so more than once
12. Future web-based assessments for exposures
- Cognitive function
  - Repeat assessment of baseline measures
  - Broaden cognitive phenotyping with new measures
  - Complements enhanced cognitive function assessment that is planned for the imaging assessment visit
- Occupational history
  - Information about all previous occupations (not just latest)
  - Greater detail on type of work and duration
- Physical activity questionnaire (RPAQ)
  - Complement data from activity monitor
13. Further enhancements of the phenotyping of UK Biobank participants currently being conducted
- Web-based assessments of diet completed and next to be cognition/mental health (2014)
- Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
14. UK Biobank wrist-worn accelerometer
- 45% of participants agree to wear one
- Willing participants sent device by mail
- It is to be worn continuously for 7 days
- Returned by mail and data downloaded
- Device cleaned and sent to next participant
- 100K participants from mid-2013 to mid-2015
(50,000 complete data-sets already obtained)
15. Further enhancements of the phenotyping of UK Biobank participants currently being conducted
- Web-based assessments of diet completed and next to be cognition/mental health (2014)
- Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
- Biobank chip to genotype (GWAS, candidate SNPs and exome) all participants (2013-15)
16. Genotyping of all UK Biobank participants
- 820K bespoke UK Biobank Affymetrix genotyping chip
  - 250,000 SNPs in a whole-genome array
  - 200,000 markers for known risk factor or disease associations, copy number variation, loss of function, and insertions/deletions
  - 150,000 exome markers for high proportion of non-synonymous coding variants with allele frequency over 0.02
- Estimate (impute) additional genotypes by combining measured genotypes with reference sequence data
- Researchers can study associations of genotype data with biochemical risk factors and detailed phenotyping from baseline assessment, along with health outcomes
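The allele-frequency threshold mentioned for the exome markers can be made concrete with a small sketch. It assumes genotypes coded as 0, 1 or 2 copies of the alternate allele (with None for a missing call) and interprets "allele frequency over 0.02" as a minor allele frequency threshold; both are assumptions for illustration, not a description of how the chip content was actually selected. The function names and example data are hypothetical.

```python
from typing import Optional, Sequence


def alt_allele_frequency(genotypes: Sequence[Optional[int]]) -> float:
    """Alternate-allele frequency from 0/1/2 genotype codes, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this marker")
    # Each person carries two alleles, hence the factor of 2.
    return sum(called) / (2 * len(called))


def minor_allele_frequency(genotypes: Sequence[Optional[int]]) -> float:
    """Frequency of whichever allele is rarer in the sample."""
    freq = alt_allele_frequency(genotypes)
    return min(freq, 1.0 - freq)


def is_common_enough(genotypes: Sequence[Optional[int]], threshold: float = 0.02) -> bool:
    """True if the minor allele frequency exceeds the (assumed) threshold."""
    return minor_allele_frequency(genotypes) > threshold


# Hypothetical genotype calls for two markers.
marker_a = [0, 0, 1, 0, 2, None, 0, 0, 1, 0]
marker_b = [0] * 99 + [1]

for name, calls in (("marker_a", marker_a), ("marker_b", marker_b)):
    print(name, round(minor_allele_frequency(calls), 4), is_common_enough(calls))
```

With very rare variants the estimate depends heavily on sample size, which is one reason a cohort of this scale is useful for genetic association work.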
17. Further enhancements of the phenotyping of UK Biobank participants currently being conducted
- Web-based assessments of diet completed and next to be cognition/mental health (2014)
- Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
- Biobank chip to genotype (GWAS, candidate SNPs and exome) all participants (2013-15)
- Standard panel of assays (e.g. lipids and clotting) on samples from all participants (2014-15)
18. Rationale for assaying many standard markers in baseline samples from all 500,000 participants
- Cost-effective way of increasing the usability of the resource for researchers, by providing data for
  - Cross-sectional analyses with prevalent disease
  - Identification of subsets based on assay values
- Conducting these assays in all of the participants at the same time should facilitate good quality control
- Lower cost for conducting all of these assays at one time rather than in multiple retrievals and assays
- Facilitates management of depletable samples
19. Consideration of a proposal to conduct assays of biomarkers of infectious disease in all participants
- Request from the international research community to facilitate studies of the associations of infectious agents with disease (in particular, different types of cancer)
- Plan would be to assay a panel of infectious agents (e.g. HPV, Hepatitis B and C, HBV, EBV, H. pylori) in the baseline sample collected from all 500,000 participants
- As with the biochemical and genetic assays that are being conducted, assays of a wide range of infectious agents would increase the efficient use of the resource
- Detailed proposal for funding is now being developed
20. Further enhancements of the phenotyping of UK Biobank participants currently being conducted
- Web-based assessments of diet completed and next to be cognition/mental health (2014)
- Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
- Biobank chip to genotype (GWAS, candidate SNPs and exome) all participants (2013-15)
- Standard panel of assays (e.g. lipids and clotting) on samples from all participants (2014-15)
- Information from multiple imaging modalities (e.g. brain/heart/body MRI and bone/joint DEXA)
21. Imaging of 100,000 UK Biobank participants
- MRI of brain, heart and abdomen
- DEXA of bones, joints and body
- Ultrasound of carotid arteries
- Shortened baseline assessment plus more detailed cognitive function tests and ECG to detect rhythm disturbances
Pilot phase: 4-6,000 people in 1 centre (2014-15)
Main phase: 95,000 people in 3 centres (2015-19)
Opportunities for repeat imaging in sub-sets (e.g. as part of MRC's focus on dementia)
22. Body Mass Index (BMI) vs Heart Disease and Stroke (PSC: 1M people followed for 12 years; Lancet 2009)
[Figure: annual deaths per 1000 (floated so that the mean matches PSC rates at age 65-69; doubling scale from 10 to 160) plotted against baseline BMI (15 to 50 kg/m2), with separate curves for heart disease (18 237 deaths) and stroke (6122 deaths)]
Adjusted for age, sex, smoking and study; first 5 years of follow-up excluded
23. Similar age, gender, BMI and body fat, but different amounts of INTERNAL FAT
[Figure: two participants compared, one with 5.86 litres of internal fat and one with 1.65 litres of internal fat]
24. Atrial fibrillation (AF) prevalence and mortality during the period between 1993 and 2007
- Prevalence: increasing
- Mortality: little change
Piccini et al. Circulation: Cardiovascular Quality and Outcomes. 2012
25. Consideration of prolonged cardiac monitoring
- Cardiac arrhythmias (especially AF)
  - can indicate significant underlying cardiac disease
  - can directly cause significant morbidity and mortality
  - important risk factors for cardio-embolic events (esp. stroke)
- Detection requires prolonged monitoring
  - many are intermittent (e.g. paroxysmal AF)
  - substantial under-detection with standard 12 lead ECG
- AF increases with age (<50 years: <1%; >80 years: 10%)
- No large-scale population-based prospective studies with prolonged monitoring, so the full extent/impact of AF on health outcomes is likely to have been underestimated
26. Example of device for prolonged arrhythmia detection
- iRhythm Zio Patch
- Has been used in 18,000 people
- Non-invasive stick-on patch
- Comfortable (median wear 12 days)
- Can be applied in clinic or at home
- Beat-to-beat ECG recording
- Validated against reference Holter
- Potentially recyclable device chip which stores data for downloading
- Planning to pilot feasibility and acceptability during imaging pilot
27. UK Biobank: Centralised follow-up of health
- Death and cancer registries
- In-patient and out-patient hospital episodes (including psychiatric) and related procedure registries
- Primary care records of health conditions, prescriptions, diagnostic tests and other investigations
- Other health-related disease registries, dispensing records, imaging, screening and dental records
- Direct to participants: self-reported medical conditions, treatments actually being taken, degree of functional impairment, cognitive and psychological scores
28. Health outcome data-linkage challenges
- Regulation, bureaucracy, and permissions (despite explicit consent from participants)
- Data transfer, matching and coding queries
- Understanding different data structures
  - Mapping between coding systems
  - Mapping between different countries
- Presenting outcome data to researchers
  - Original outcome codes
  - Post-adjudication outcomes
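Mapping between coding systems can be pictured as a lookup from each source system's codes to a shared outcome label, while keeping the original code alongside. The sketch below is illustrative only: the slides do not name specific coding systems, so the use of ICD-9 and ICD-10, the four code prefixes, the labels and the `harmonise` helper are all assumptions, and a real code list would need clinical review.

```python
from typing import Optional

# Illustrative, unvalidated lookup: (coding system, 3-character code prefix)
# -> harmonised outcome label. Not an official UK Biobank mapping.
OUTCOME_MAP = {
    ("ICD-10", "I21"): "myocardial infarction",
    ("ICD-9", "410"): "myocardial infarction",
    ("ICD-10", "I63"): "ischaemic stroke",
    ("ICD-9", "434"): "ischaemic stroke",
}


def harmonise(system: str, code: str) -> Optional[str]:
    """Return the harmonised label for a raw code, or None if it is unmapped."""
    prefix = code.replace(".", "")[:3].upper()
    return OUTCOME_MAP.get((system, prefix))


# Hypothetical linked records: the original code is kept next to the mapped label.
records = [("ICD-10", "I21.9"), ("ICD-9", "410.1"), ("ICD-10", "Z00.0")]
for system, code in records:
    print(f"{system} {code} -> {harmonise(system, code)}")
```

Returning None for unmapped codes, rather than guessing, keeps the original outcome codes available to researchers and leaves the harder cases for adjudication.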
29. Progress with UK-wide linkage to outcome data (both before and after baseline assessment)
30. Meaning of coded data from health records
- What do the coded data actually tell us?
- Characteristics of coded data
- How accurate?
- How detailed?
- How complete?
- Do we need to go beyond the coded data?
31. UK Biobank: Expected numbers of participants developing diseases during long-term follow-up
Condition            2012     2017     2022
Diabetes           10,000   25,000   40,000
MI/CHD death        7,000   17,000   28,000
Stroke              2,000    5,000    9,000
COPD                3,000    8,000   14,000
Breast cancer       2,500    6,000   10,000
Colorectal cancer   1,500    3,500    7,000
Prostate cancer     1,500    3,500    7,000
Lung cancer           800    2,000    4,000
Hip fracture          800    2,500    6,000
Rh. arthritis         800    2,000    3,000
Alzheimer's           800    3,000    9,000
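To put these expected case numbers in proportion, they can be expressed as a percentage of the 500,000 participants. The sketch below copies four rows from the table above; the helper name `percent_of_cohort` is hypothetical, and the calculation ignores deaths and withdrawals, so it is only a rough guide.

```python
COHORT_SIZE = 500_000
YEARS = (2012, 2017, 2022)

# Expected cumulative cases, copied from four rows of the table above.
EXPECTED_CASES = {
    "Diabetes": (10_000, 25_000, 40_000),
    "MI/CHD death": (7_000, 17_000, 28_000),
    "Stroke": (2_000, 5_000, 9_000),
    "Lung cancer": (800, 2_000, 4_000),
}


def percent_of_cohort(cases: int, cohort_size: int = COHORT_SIZE) -> float:
    """Cases as a percentage of the whole cohort (no allowance for attrition)."""
    return 100.0 * cases / cohort_size


for condition, counts in EXPECTED_CASES.items():
    cells = ", ".join(
        f"{year}: {percent_of_cohort(n):.2f}%" for year, n in zip(YEARS, counts)
    )
    print(f"{condition:<14}{cells}")
```

Even the less common conditions in the table reach thousands of cases by 2022, although each is well under 1% of the cohort at the earlier time points.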
32. General strategy for outcome adjudication
- Avoid false positive cases (but tolerate some false negatives)
- Geographical generalisability
- Cost-effectiveness
- Future-proofed
- Scalability
- Staged approach
  - Ascertain
  - Confirm
  - Classify
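The preference for avoiding false positives while tolerating some false negatives corresponds to favouring positive predictive value over sensitivity. The sketch below uses the standard definitions of both measures; the counts are entirely hypothetical and are chosen only to show how adding a confirmation stage might shift the balance.

```python
def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Share of identified cases that are genuine cases."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of genuine cases that are identified."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts for 1000 genuine cases, before and after a confirmation stage.
stages = {
    "ascertainment only": {"true_pos": 900, "false_pos": 300, "false_neg": 100},
    "after confirmation": {"true_pos": 800, "false_pos": 20, "false_neg": 200},
}

for stage, c in stages.items():
    ppv = positive_predictive_value(c["true_pos"], c["false_pos"])
    sens = sensitivity(c["true_pos"], c["false_neg"])
    print(f"{stage}: PPV = {ppv:.3f}, sensitivity = {sens:.3f}")
```

In this made-up example confirmation loses some genuine cases but removes most of the false ones, which is the trade-off the strategy accepts.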
33. Staged approach to outcome adjudication
APPROACH | CHARACTERISTICS | POSSIBLE DATA SOURCES
ASCERTAINMENT of suspected cases | Cost-effective; Feasible; Scalable | Death registers; Cancer registers; Hospital episodes; Primary care records; Web-based questionnaires
34. Staged approach to outcome adjudication
APPROACH | CHARACTERISTICS | POSSIBLE DATA SOURCES
ASCERTAINMENT of suspected cases | Cost-effective; Feasible; Scalable | Death registers; Cancer registers; Hospital episodes; Primary care records; Web-based questionnaires
CONFIRMATION of case-ness | As above, but greater cost/lower feasibility | Cross-referencing e-records; Disease registers
35. Staged approach to outcome adjudication
APPROACH | CHARACTERISTICS | POSSIBLE DATA SOURCES
ASCERTAINMENT of suspected cases | Cost-effective; Feasible; Scalable | Death registers; Cancer registers; Hospital episodes; Primary care records; Web-based questionnaires
CONFIRMATION of case-ness | As above, but greater cost/lower feasibility | Cross-referencing e-records; Disease registers
CLASSIFICATION of disease cases | More involved and costly per case | Review of clinical records; Tumour collections/assays; Specialised databases (e.g. imaging)
36. Expert Working Groups developing protocols for ascertainment, confirmation and classification
37. UK Biobank: Principles of Access
- UK Biobank is available to all bona fide researchers for all types of health-related research that is in the public interest
- No preferential or exclusive access (and, in particular, access does not involve collaboration with UK Biobank)
- Researchers have to pay for access to the Resource for their proposed research on a cost-recovery basis only
- Access to the biological samples that are limited and depletable will be carefully controlled and coordinated
- Researchers are required to publish their findings and return the data so that other researchers can use them
38. Showcase e-catalogue of data items currently in the UK Biobank Resource (www.ukbiobank.ac.uk)
39. Showcase supports search strategies for data items in the UK Biobank Resource
40. Body Composition: Body Fat
41. Preliminary applications subdivided by type of researcher, location and type of research
42. What makes UK Biobank special?
- PROSPECTIVE: It can assess the full effects of a particular exposure (such as smoking) on all types of health outcome (such as cancer, vascular disease, lung disease, dementia)
- DETAILED: The wide range of questions, measures and samples at baseline allows good assessment of exposures, and outcome adjudication allows good disease classification
- BIG: Inclusion of large number of participants allows reliable assessment of the causes of a wide range of diseases, and of the combined impact of many different exposures
Unique combination of BREADTH and DEPTH