ABSTRACT

Provided by: brendaj2

Transcript and Presenter's Notes

Title: ABSTRACT

Data coordinating infrastructure for the Autism
Genome Project
Olaf Stein for the AGP
Battelle Center for Mathematical Medicine, The
Research Institute at Nationwide Children's
Hospital, Columbus, Ohio
Data input (web server)
ABSTRACT
The Autism Genome Project is an international
collaboration dedicated to gene discovery in
Autism (AD). Due to the large quantities of
clinical and molecular data, and the distribution
of the project across many clinical groups and
laboratories, the AGP has established a Data
Coordinating Center (DCC) which houses a
state-of-the-art infrastructure for data input,
data cleaning and curation, data output, and
large-scale data archiving. The core DCC
infrastructure includes two 64-bit MySQL Linux
database servers and a webserver (LAMP) that
collaborators use for uploads and downloads. The
database servers are configured as master and
slave and sit in different locations as a
disaster recovery precaution. The master is the
sole point of data input and is automatically
replicated to the slave. Both master and slave
can be used to retrieve data, which balances the
load between them; administrative tasks such as
backups are also run on the slave. The system capacity
can be extended by additional cloning of slave
machines. The main interface for data submission
is the web application. Item-level clinical data
(ADI, ADOS, etc.) are uploaded over a secure
channel in simple CSV (comma-separated values)
format and vetted automatically on input for
illegal variable values and logical errors (e.g.
age of onset must be prior to current age). The
user receives immediate feedback if problems are
detected. Molecular data, ranging from 10k to 1M
SNP chips, are imported and cleaned through
a semi-automated process: files are output from
the database and run through error-detection
programs that write their output directly as SQL
commands for execution in the database, and the
cycle is repeated until no errors remain in the
output files. AGP
participants can download raw and cleaned data
via the same web server. All raw data, including
images from the large SNP experiments, are stored
either within the database or (in the case of
images) in an automated terabyte tape storage
facility. The current footprint is 400 GB in the
database plus 10 TB of raw images on tape. This
includes over 19,000 samples, over 1M phenotypic
data points, and 2.5B genotypes.
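
The division of labour between master and slave described in the abstract can be pictured as a small routing rule: statements that change data go to the master only, while reads rotate over every server. The following is a minimal sketch under stated assumptions, not the AGP implementation; the host names and the `route` helper are hypothetical.

```python
import itertools

# Hypothetical host names; the poster does not give the real ones.
MASTER = "db-master"
SLAVES = ["db-slave-1"]

# Statements that modify data must run on the master, because
# replication only flows from the master to the slaves.
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}

# Reads may be served by any server, so rotate over all of them.
_read_cycle = itertools.cycle([MASTER] + SLAVES)


def route(sql):
    """Return the host that should execute the given SQL statement."""
    words = sql.split()
    if not words:
        raise ValueError("empty SQL statement")
    if words[0].upper() in WRITE_VERBS:
        return MASTER
    return next(_read_cycle)
```

Under this sketch, extending capacity by cloning another slave amounts to adding one more entry to `SLAVES`.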
[Diagram: data input via the webserver. Labels: User, File upload, Webserver, Data validation, Immediate Feedback, Transfer to main database, Master]

Data validation
Web application acts as interface:
remote uploading of phenotypes (ADI, ADOS, IQ, other measures) to main database
files are uploaded in CSV (comma-separated values) format
order and content of columns are defined in codebooks
Before data are accepted they are checked for:
structural errors (wrong file format, wrong number of columns, etc.)
integrity errors (invalid IDs, invalid or out-of-range values)
logical errors (e.g. age at assessment in months must not exceed age in months)
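
The three classes of checks listed above (structural, integrity, logical) could be sketched roughly as follows. This is an illustration only: the column names, the set of known IDs, and the allowed score range are invented for the example and do not come from the real AGP codebooks. The logical check assumes that age at assessment cannot exceed current age.

```python
import csv
import io

# Hypothetical codebook: column order and allowed values are
# assumptions for illustration, not the real ADOS codebook.
COLUMNS = ["subject_id", "age_months", "age_at_assessment_months", "score"]
KNOWN_IDS = {"S001", "S002"}
SCORE_RANGE = range(0, 4)  # allowed scores 0..3


def validate(csv_text):
    """Return a list of (line_number, message) problems found in an upload."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != COLUMNS:
        # Structural error: header does not match the codebook.
        return [(1, "header does not match codebook")]
    problems = []
    for line_no, row in enumerate(rows[1:], start=2):
        # Structural error: wrong number of columns.
        if len(row) != len(COLUMNS):
            problems.append((line_no, "wrong number of columns"))
            continue
        subject_id, age, age_at_assessment, score = row
        # Integrity errors: unknown ID, invalid or out-of-range value.
        if subject_id not in KNOWN_IDS:
            problems.append((line_no, "unknown ID"))
        if not (age.isdigit() and age_at_assessment.isdigit() and score.isdigit()):
            problems.append((line_no, "non-numeric value"))
            continue
        if int(score) not in SCORE_RANGE:
            problems.append((line_no, "score out of range"))
        # Logical error: assessed at an age greater than the current age.
        if int(age_at_assessment) > int(age):
            problems.append((line_no, "age at assessment exceeds age"))
    return problems
```

An empty result means the file can be passed on to the main database; a non-empty list is what would be shown to the uploader as immediate feedback.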

SETUP
[Diagram: server setup. Labels: Master, replicate, Slaves, input, output]
[Figure: Sample file for ADOS upload]
[Diagram: 1M data curation. Labels: Webserver, Master, Output, DCC tools, Files (raw, clean)]
Hardware and Software

Hardware
Dell Poweredge 2900 III
Intel Xeon CPU 3.0GHz quad core
16 GB RAM
300 GB RAID 1 (OS)
1 TB RAID 5 (misc data)
3TB RAID 5 (database files)

Software
Red Hat Enterprise Linux 5.0
MySQL 5.1 Community Edition
Maintenance tools (RAID management, backup
scripts, etc.)
innotop, mytop, maatkit (database monitoring
tools)
Python 2.5
Apache, mod_python

Data curation
Data curation is an iterative process:
first the raw data set is duplicated to a "cleaning" data set
changes are made iteratively in the cleaning set using automated SQL protocols
to produce the final "clean" data set
Tools used include Merlin, Relcheck and various in-house Python and Perl scripts, checking for:
missingness by marker and individual (cutoff > 20%)
relationship problems, sample swaps
Mendelian inconsistencies
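
As a rough illustration of one curation pass, the sketch below computes missingness by marker and by individual and writes the result as SQL statements, in the spirit of error-detection programs whose output is executed against the cleaning data set. It is not the DCC's code: the cutoff is read here as "more than 20% missing", and the table and column names are made up.

```python
# Hypothetical input layout: genotypes[sample][marker], None = missing call.
CUTOFF = 0.20  # assumed to mean "more than 20% of calls missing"


def missing_rate(calls):
    """Fraction of missing (None) calls in a list."""
    return sum(c is None for c in calls) / len(calls)


def flag_missingness(genotypes, cutoff=CUTOFF):
    """Return (bad_markers, bad_samples) whose missing rate exceeds cutoff."""
    samples = sorted(genotypes)
    markers = sorted({m for calls in genotypes.values() for m in calls})
    bad_markers = [
        m for m in markers
        if missing_rate([genotypes[s].get(m) for s in samples]) > cutoff
    ]
    bad_samples = [
        s for s in samples
        if missing_rate([genotypes[s].get(m) for m in markers]) > cutoff
    ]
    return bad_markers, bad_samples


def to_sql(bad_markers, bad_samples, table="genotype_cleaning"):
    """Emit DELETE statements for the cleaning copy.

    Table and column names are invented; plain string formatting is
    acceptable only because the identifiers come from trusted internal data.
    """
    stmts = [f"DELETE FROM {table} WHERE marker = '{m}';" for m in bad_markers]
    stmts += [f"DELETE FROM {table} WHERE sample = '{s}';" for s in bad_samples]
    return stmts
```

Re-running `flag_missingness` on the cleaning set after the statements have been applied, and stopping once both lists come back empty, mirrors the iterative character of the process described above.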

Security
Deidentified data only
Database server behind hospital firewall
Strong passwords, issued to PIs only
Webserver in DMZ
HTTP (80) and SSH (22) connections only
network traffic is monitored

ACKNOWLEDGEMENTS

The Autism Genome Project gratefully acknowledges
the contributions of the
families who participated in this study. Current
support for the AGP includes grants from
Autism Speaks (USA)
Genome Canada (Canada)
The Health Research Board (Ireland)
The Hilibrand Foundation (USA)
The Medical Research Council (UK)
The National Institutes of Health

Current footprint

5,000 families, 20,000 individuals
Approx. 1 million phenotypic data points
Approx. 6B genotypic records (800 GB on disk)
20 TB raw image data
Other misc. info, including integrated marker
maps
