Data coordinating infrastructure for the Autism Genome Project
Olaf Stein for the AGP
Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, Ohio
Data input (web server)
ABSTRACT
The Autism Genome Project (AGP) is an international collaboration dedicated to gene discovery in Autism (AD). Due to the large quantities of clinical and molecular data, and the distribution of the project across many clinical groups and laboratories, the AGP has established a Data Coordinating Center (DCC), which houses a state-of-the-art infrastructure for data input, data cleaning and curation, data output, and large-scale data archiving.

The core DCC infrastructure includes two 64-bit MySQL database servers running Linux and a web server (LAMP) that collaborators use for uploads and downloads. The database servers are configured as master and slave and sit in different locations as a disaster recovery precaution. The master acts as the sole data input entity and is automatically replicated to the slave. Both master and slave can be used to retrieve data, balancing the load each server has to carry; administrative tasks such as backups are also done on the slave. The system capacity can be extended by cloning additional slave machines.

The main interface for data submission is the web application. Item-level clinical data (ADI, ADOS, etc.) are uploaded over a secure channel in simple csv (comma separated value) format and are automatically vetted upon input for illegal variable values and logical errors (e.g. age of onset later than current age), with immediate feedback given to the user if problems are detected. Molecular data, ranging from 10k to 1M SNP chip data, are imported and cleaned through a semi-automated process: files are output from the database and run through error-detection programs that write their output directly in SQL command format for execution in the database, with iterative processing until all errors are removed from the output files. AGP participants can download raw and cleaned data via the same web server.

All raw data, including images from the large SNP experiments, are stored either within the database or (in the case of images) in an automated terabyte tape storage facility. The current footprint is 400 GB for the database and 10 TB of raw images on tape. This includes over 19,000 samples, over 1M phenotypic datapoints, and 2.5B genotypes.
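
The master/slave arrangement described in the abstract (all input goes to the master, retrievals can go to either server) can be illustrated with a small routing sketch. This is not the DCC's code: the class name, the host names and the rule for recognizing read statements are assumptions made for illustration only.

```python
import itertools


class ReplicatedRouter:
    """Pick a server for each SQL statement (hypothetical helper, not DCC code)."""

    # Statements treated as reads; everything else is treated as a write.
    READ_PREFIXES = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN")

    def __init__(self, master, slaves):
        self.master = master
        # Reads rotate over the master and all slaves to spread the load.
        self._readers = itertools.cycle([master] + list(slaves))

    def route(self, sql):
        """Return the server that should run this statement."""
        if sql.lstrip().upper().startswith(self.READ_PREFIXES):
            return next(self._readers)
        # Writes always go to the master, which replicates them to the slaves.
        return self.master


# Host names below are placeholders.
router = ReplicatedRouter("master.example", ["slave1.example"])
```

Adding capacity then amounts to passing more slave hosts to the router; writes are unaffected because they never leave the master.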
[Figure: data input through the web server. Labels: User, File upload, Webserver, Data validation, Immediate feedback, Transfer to main database, Master]
Data validation
  • Web application acts as interface
  • Remote uploading of phenotypes (ADI, ADOS, IQ, other measures) to main database
  • Files are uploaded in csv (comma separated value) format
  • Order and content of columns are defined in codebooks
  • Before data are accepted they are checked for
      • structural errors (wrong file format, wrong number of columns, etc.)
      • integrity errors (invalid IDs, invalid or out-of-range values)
      • logical errors (e.g. age at assessment in months smaller than age in months)
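
The three kinds of checks listed above can be sketched as follows. This is a minimal illustration, not the AGP web application: the codebook, the column names, the value ranges and the set of known IDs are all invented, and the logical check uses the age-of-onset example as an assumed rule.

```python
import csv
import io

# Hypothetical codebook: column order plus the legal range of each numeric column.
CODEBOOK = [
    ("subject_id", None),
    ("age_months", (0, 1200)),
    ("onset_months", (0, 1200)),
    ("item_a1", (0, 3)),
]
KNOWN_IDS = {"AGP0001", "AGP0002"}  # stand-in for IDs registered in the database


def validate_upload(text):
    """Return a list of (row_number, message) problems; an empty list means accepted."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    expected = [name for name, _ in CODEBOOK]
    # Structural check: the header must match the codebook column order.
    if not rows or rows[0] != expected:
        return [(1, "header does not match codebook columns")]
    for number, row in enumerate(rows[1:], start=2):
        # Structural check: every row needs the full set of columns.
        if len(row) != len(expected):
            problems.append((number, "wrong number of columns"))
            continue
        record = dict(zip(expected, row))
        # Integrity check: the ID must already be known.
        if record["subject_id"] not in KNOWN_IDS:
            problems.append((number, "unknown subject_id"))
        # Integrity check: numeric columns must parse and fall within range.
        values = {}
        for name, bounds in CODEBOOK:
            if bounds is None:
                continue
            try:
                values[name] = int(record[name])
            except ValueError:
                problems.append((number, f"{name} is not an integer"))
                continue
            low, high = bounds
            if not low <= values[name] <= high:
                problems.append((number, f"{name} out of range"))
        # Logical check (assumed rule): onset cannot come after the current age.
        if {"age_months", "onset_months"} <= values.keys():
            if values["onset_months"] > values["age_months"]:
                problems.append((number, "onset_months exceeds age_months"))
    return problems
```

Returning every problem with its row number, rather than stopping at the first one, is what would let a web front end give the immediate feedback the poster describes.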

SETUP
[Figure: server setup. Labels: Master, Slaves, input, replicate]
[Figure: sample file for ADOS upload]
[Figure: 1M data curation. Labels: Webserver, Master, DCC tools, input, output, Files (raw, clean)]
Hardware and Software
Hardware
  • Dell PowerEdge 2900 III
  • Intel Xeon CPU 3.0GHz quad core
  • 16 GB RAM
  • 300 GB RAID 1 (OS)
  • 1 TB RAID 5 (misc data)
  • 3 TB RAID 5 (database files)

Software
  • Red Hat Enterprise Linux 5.0
  • MySQL 5.1 Community Edition
  • Maintenance tools (RAID management, backup scripts, etc.)
  • innotop, mytop, Maatkit (database monitoring tools)
  • Python 2.5
  • Apache, mod_python
Data curation
  • Data curation is an iterative process
      • first the raw dataset is duplicated to a "cleaning" data set
      • changes are made iteratively in the cleaning set, utilizing automated SQL protocols, to produce the final "clean" data set
  • Tools used include Merlin, Relcheck and various in-house Python and Perl scripts
  • Missingness by marker and individual (cutoff > 20)
  • Relationship problems, sample swaps
  • Mendelian inconsistencies
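
One pass of the curation loop above can be sketched for the missingness check, with the output written as SQL statements as the abstract describes. This is only an illustration under stated assumptions: the in-memory data layout and the table and column names are invented, the cutoff is read as 20 percent, and the sketch does not reproduce what Merlin or Relcheck do. It is written for current Python 3 rather than the Python 2.5 listed above.

```python
# Hypothetical layout: genotypes[sample][marker] is a call such as "AG",
# or None (or absent) when the call is missing.
CUTOFF = 0.20  # assumption: the poster's "cutoff > 20" read as 20 percent


def missing_rates(genotypes):
    """Return (per-sample, per-marker) fractions of missing calls."""
    samples = list(genotypes)
    markers = sorted({m for calls in genotypes.values() for m in calls})
    if not markers:
        return {}, {}
    by_sample = {
        s: sum(genotypes[s].get(m) is None for m in markers) / len(markers)
        for s in samples
    }
    by_marker = {
        m: sum(genotypes[s].get(m) is None for s in samples) / len(samples)
        for m in markers
    }
    return by_sample, by_marker


def cleaning_sql(genotypes, cutoff=CUTOFF):
    """Emit one DELETE per sample or marker whose missingness exceeds the cutoff.

    The table and column names (cleaning_genotypes, sample_id, marker_id) are
    made up. IDs are interpolated directly because the output is a script file;
    code that talks to a live database should use parameterized queries.
    """
    by_sample, by_marker = missing_rates(genotypes)
    statements = []
    for sample, rate in sorted(by_sample.items()):
        if rate > cutoff:
            statements.append(
                f"DELETE FROM cleaning_genotypes WHERE sample_id = '{sample}';"
            )
    for marker, rate in sorted(by_marker.items()):
        if rate > cutoff:
            statements.append(
                f"DELETE FROM cleaning_genotypes WHERE marker_id = '{marker}';"
            )
    return statements
```

Running the emitted statements against the cleaning set and then repeating the check until no statements come back mirrors the iterative processing described on the poster.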
Security
  • Deidentified data only
  • Database server behind hospital firewall
  • Strong passwords, issued to PIs only
  • Webserver in DMZ
  • HTTP (80) and SSH (22) connections only
  • Network traffic is monitored

ACKNOWLEDGEMENTS
The Autism Genome Project gratefully acknowledges the contributions of the families who participated in this study. Current support for the AGP includes grants from
  • Autism Speaks (USA)
  • Genome Canada (Canada)
  • The Health Research Board (Ireland)
  • The Hilibrand Foundation (USA)
  • The Medical Research Council (UK)
  • The National Institutes of Health

Current footprint
  • 5,000 families, 20,000 individuals
  • Approx. 1 million phenotypic data points
  • Approx. 6B genotypic records (800 GB on disk)
  • 20 TB raw image data
  • Other misc. info, including integrated marker maps