AnnBuilder: An automated system - PowerPoint PPT Presentation

Provided by: jzh15
Transcript and Presenter's Notes

1
AnnBuilder: An automated system for genomic data annotation
2
Ready to use annotation
  • dChip
  • (http://www.biostat.harvard.edu/complab/dchip/info_file.htm)
  • Gene Expression Omnibus (GEO)
  • (http://www.ncbi.nlm.nih.gov/geo)
  • Unigene annotation for Affy chips
  • (http://dot.ped.med.umich.edu:2000/ourimage/microarrays/Affy_annot/Unigene/index.html)

3
Public repositories
  • LocusLink
  • (http://www.ncbi.nlm.nih.gov/LocusLink)
  • Human Genome Project
  • (http://genome.ucsc.edu/downloads.html)
  • Gene Ontology Consortium
  • (http://www.geneontology.org)
  • KEGG
  • (http://www.genome.ad.jp/kegg)

4
Issues to be addressed
  • Large data source
      LocusLink (LL_tmpl.gz): 3,845,227 lines, 170,377 loci
      UniGene (Hs.data.gz): 4,607,437 lines, 104,210 clusters
  • Changeability
  • Multiple sources
  • Quality control

5
LocusLink (LL_tmpl.gz)
>>1
LOCUSID: 1
ORGANISM: Homo sapiens
ACCNUM: AA484435|2213248|na
OFFICIAL_SYMBOL: A1BG
OFFICIAL_GENE_NAME: alpha-1-B glycoprotein
CHR: 19
UNIGENE: Hs.41997
>>2
LOCUSID: 2
ORGANISM: Homo sapiens
ACCNUM: M11313|177869|na
OFFICIAL_SYMBOL: A2M
OFFICIAL_GENE_NAME: alpha-2-macroglobulin
CHR: 12
UNIGENE: Hs.74561
>>3
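
A record layout like the LocusLink excerpt above can be read with a small line-oriented parser. The following is a minimal sketch, not AnnBuilder's own parser: it assumes each record starts with a `>>` line and that fields are written as `KEY: value`, one per line. The function name `parse_ll_tmpl` and the inline sample are made up for illustration.

```python
from collections import defaultdict

# Sample in the ASSUMED layout: ">>N" opens a record, then "KEY: value" lines.
SAMPLE = """\
>>1
LOCUSID: 1
ORGANISM: Homo sapiens
OFFICIAL_SYMBOL: A1BG
CHR: 19
UNIGENE: Hs.41997
>>2
LOCUSID: 2
ORGANISM: Homo sapiens
OFFICIAL_SYMBOL: A2M
CHR: 12
UNIGENE: Hs.74561
>>3
"""


def parse_ll_tmpl(lines):
    """Yield one dict per record; every key maps to a list of values.

    Lists are used because a key may repeat within one record.
    Records with no fields (such as a trailing ">>3") are skipped.
    """
    record = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if line.startswith(">>"):
            if record:
                yield dict(record)
            record = defaultdict(list)
        elif ": " in line:
            key, value = line.split(": ", 1)
            record[key].append(value)
    if record:
        yield dict(record)


records = list(parse_ll_tmpl(SAMPLE.splitlines()))
# Build a LocusLink id -> official symbol lookup from the parsed records.
symbol_by_locus = {r["LOCUSID"][0]: r["OFFICIAL_SYMBOL"][0] for r in records}
print(symbol_by_locus)
```

Because the parser is a generator, it can also be fed an open file handle line by line, which matters for a source with several million lines.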
6
UniGene (Hs.data.gz)
ID          Hs.2
TITLE       N-acetyltransferase 2 (arylamine N-acetyltransferase)
GENE        NAT2
CYTOBAND    8p22
LOCUSLINK   10
SEQUENCE    ACC=D10871; NID=g219874; PID=g219875
SEQUENCE    ACC=X14672; NID=g28227; PID=g28228
......
ID          Hs.4
TITLE       alcohol dehydrogenase 1B (class I), beta polypeptide
GENE        ADH1B
CYTOBAND    4q21-q23
LOCUSLINK   125
SEQUENCE    ACC=AV652448; NID=g9873462; CLONE=GLCDAE01; END=3'; LID=3618
SEQUENCE    ACC=BF126114; NID=g10965154; CLONE=IMAGE:3934311; END=5'; LID=4872
......
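
The UniGene excerpt above differs from the LocusLink one: a cluster spans several tagged lines, and each SEQUENCE line carries its own set of sub-fields. A rough sketch of a reader follows. It assumes a cluster begins at each `ID` line, that tag and value are separated by whitespace, and that SEQUENCE sub-fields are written as `NAME=value` separated by `;`. The function names are hypothetical and this is not AnnBuilder's code.

```python
# Sample in the ASSUMED Hs.data layout (tag, whitespace, value).
SAMPLE = """\
ID          Hs.2
TITLE       N-acetyltransferase 2 (arylamine N-acetyltransferase)
GENE        NAT2
CYTOBAND    8p22
LOCUSLINK   10
SEQUENCE    ACC=D10871; NID=g219874; PID=g219875
SEQUENCE    ACC=X14672; NID=g28227; PID=g28228
......
ID          Hs.4
TITLE       alcohol dehydrogenase 1B (class I), beta polypeptide
GENE        ADH1B
CYTOBAND    4q21-q23
LOCUSLINK   125
SEQUENCE    ACC=AV652448; NID=g9873462; CLONE=GLCDAE01; END=3'; LID=3618
"""


def parse_sequence(value):
    """Turn 'ACC=D10871; NID=g219874' into {'ACC': 'D10871', 'NID': 'g219874'}."""
    pairs = (item.split("=", 1) for item in value.split(";") if "=" in item)
    return {name.strip(): val.strip() for name, val in pairs}


def parse_unigene(lines):
    """Return a list of cluster dicts; SEQUENCE lines are collected in a list."""
    clusters = []
    current = None
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank lines and filler such as "......"
        tag, value = parts[0], parts[1].strip()
        if tag == "ID":
            current = {"ID": value, "SEQUENCE": []}
            clusters.append(current)
        elif current is None:
            continue  # ignore anything before the first ID line
        elif tag == "SEQUENCE":
            current["SEQUENCE"].append(parse_sequence(value))
        else:
            current[tag] = value
    return clusters


clusters = parse_unigene(SAMPLE.splitlines())
# Map every GenBank accession to the UniGene cluster it belongs to.
cluster_by_accession = {
    seq["ACC"]: c["ID"] for c in clusters for seq in c["SEQUENCE"] if "ACC" in seq
}
print(cluster_by_accession)
```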
7
[Diagram: Source, Parser, Base]
8
[Diagram: Parser, Source, Base, fileParser, Output]
9
Multiple data sources
Source                              Number mapped (of 12625)
LocusLink                           8743
UniGene                             9934
Unigene annotation for Affy chips   11292
dChip                               11453
10
Agreement between sources
(UAAC = Unigene annotation for Affy chips)
Source      LocusLink   UniGene   UAAC    dChip
LocusLink   8743
UniGene     7490        9934
UAAC        8139        9287      11292
dChip       8603        9659      10606   11453
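
A table of this shape can be produced by comparing the sources pair by pair. The sketch below is only an illustration: the slide does not define "agreement", so the code assumes two sources agree on a probe when both map it and both give the same identifier, which makes each diagonal entry the number of probes that source maps. The probe ids and values are invented toy data.

```python
from itertools import combinations_with_replacement


def agreement_table(sources):
    """Count, for every pair of sources, the probes mapped identically by both.

    `sources` maps a source name to a dict of probe id -> identifier.
    The pair (a, a) gives the number of probes mapped by source a.
    """
    table = {}
    for a, b in combinations_with_replacement(list(sources), 2):
        map_a, map_b = sources[a], sources[b]
        shared = map_a.keys() & map_b.keys()
        table[(a, b)] = sum(1 for probe in shared if map_a[probe] == map_b[probe])
    return table


# Toy data (hypothetical probe ids and LocusLink ids).
toy_sources = {
    "LocusLink": {"p1": "10", "p2": "125"},
    "UniGene": {"p1": "10", "p2": "126", "p3": "2"},
    "dChip": {"p1": "10", "p2": "125", "p3": "2"},
}

table = agreement_table(toy_sources)
for pair, count in table.items():
    print(pair, count)
```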
11
Unified mapping
Src1 Src2 Src3 Src4 SrcX Unified
Element1 A A A A A A
Element2 B B
Element3 C D E C C C
Element4 F G H J X F
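
The slide does not spell out how the unified value is chosen. One rule that is consistent with the example rows is: take the value reported by the most sources, and when there is a tie, prefer the source listed first. The sketch below implements that assumed rule; it is a guess at the idea, not AnnBuilder's documented behavior, and `unify` is a hypothetical name.

```python
from collections import Counter


def unify(values):
    """Pick one value from per-source values given in priority order.

    `None` marks a source that has no mapping. The most frequent value wins;
    ties go to the value that appears first in source order.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    counts = Counter(present)
    best = max(counts.values())
    return next(v for v in present if counts[v] == best)


print(unify(["A", "A", "A", "A", "A"]))  # all sources agree
print(unify(["C", "D", "E", "C", "C"]))  # majority value
print(unify(["F", "G", "H", "J", "X"]))  # no agreement: first source wins
```

Ranking the sources by trust and passing their values in that order is what makes the tie-break meaningful.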
12
Unified mapping
Total   11887   (LL 8743, UG 9934, UAAC 11292, dChip 11453)
Four     7115
Three    1197
Two      3573
One         2
13
Quality control
  • Number of ids
  • Content of ids
  • Number of mapped ids
  • Built information

14
Quality control data
Date built: Fri Sep 27 16
Number of probes: 12625
Probe number missmatch: None
Probe missmatch: None
Mappings found for probe based rda files:
    hgu95aCHRLOC found 10516 of 12625
    hgu95aCHRORI found 10516 of 12625
    hgu95aCHR found 11755 of 12625
    hgu95aENZYME found 1511 of 12625
Mappings found for non-probe based rda files:
    hgu95aAFFYCOUNTS found 2349
    hgu95aENZYME2AFFY found 516
    hgu95aGO2AFFY found 2349
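
Counts in the "found N of M" style can be derived by checking, for each annotation mapping, how many probes have a non-missing value. The sketch below shows one way to do that and to flag probe ids that differ between the chip and a mapping. The function names, the `demo*` mapping names, and the toy data are all invented for illustration; this is not the routine AnnBuilder uses.

```python
def mapping_report(probe_ids, mappings):
    """Return one 'NAME found N of M' line per probe-based mapping.

    `mappings` maps a name to a dict of probe id -> value; a value of None
    means the probe is present but has no annotation.
    """
    total = len(probe_ids)
    lines = []
    for name, mapping in mappings.items():
        found = sum(1 for probe in probe_ids if mapping.get(probe) is not None)
        lines.append(f"{name} found {found} of {total}")
    return lines


def probe_mismatch(probe_ids, mapping):
    """Probe ids present on only one side (the chip list or the mapping keys)."""
    return sorted(set(probe_ids) ^ set(mapping))


# Toy data with hypothetical probe ids.
probes = ["p1", "p2", "p3", "p4"]
toy_mappings = {
    "demoCHR": {"p1": "19", "p2": "12", "p3": None, "p4": "X"},
    "demoENZYME": {"p1": None, "p2": "2.3.1.5", "p3": None, "p4": None},
}

report = mapping_report(probes, toy_mappings)
print("\n".join(report))
print("Probe mismatch:", probe_mismatch(probes, toy_mappings["demoCHR"]) or "None")
```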
15
Built information
hgu95a          package:hgu95a          R Documentation
Genomic Annotation data package built with AnnBuilder
Description:
The package is built using a downloadable R package, AnnBuilder (download and build your own), from www.bioconductor.org using the following public data sources:
  LocusLink: Date built: September 27, 2002. <URL: ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz>.
  Gene Ontology Consortium: Date built: 2002-08-01. <URL: http://www.godatabase.org/dev/database/archive/2002-08-01/go200208-termdb.xml.gz>.
  KEGG: Date built: Release 23.0 (July 2002). <URL: ftp://ftp.genome.ad.jp/pub/kegg/pathways>.
  Human Genome Project: Date built: 28jun2002. <URL: http://www.genome.ucsc.edu/goldenPath/28jun2002/database/>.
  UniGene: Date built: Build 155. <URL: ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz>.
The function hgu95a() provides information about the rda files.
16
[Diagram: Parser, Source URLs, Base, AnnBuilder, Data packages, XML documents]
17
[Diagram: LocusLink, GenBank, GO, HGP, Target, LocusLink, KEGG, UniGene]

18
  • Chromosome number
  • Enzyme
  • Chromosomal location
  • Chromosomal orientation
  • Gene name
  • GO id
  • Cytoband
  • Pathway
  • PubMed id
  • Summary of function
  • Symbol
  • UniGene id
  • Affymetrix id counts for GO
  • Enzyme/Affy id mapping
  • GO id/Affy id mapping
  • GO id/all Affy id mapping
  • Pathway/Affy id mapping
  • PubMed id/Affy id mapping
19
  • Use data packages
  • Download from Bioconductor
  • (www.bioconductor.org)
  • Install the package
  • Start R
  • library(hgu95a)
  • hgu95a()
  • ?hgu95a
  • get("1200_at", hgu95aSYMBOL)
  • [1] "GLI2"
  • multiget(c("1200_at", "1704_at"), hgu95aSYMBOL)
  • $"1200_at"
  • [1] "GLI2"
  • $"1704_at"
  • [1] "VAV2"

20
<!DOCTYPE AnnBuilder SYSTEM "http://www.bioconductor.org/datafiles/dtds/annotate.dtd">
<AnnBuilder:Annotate xmlns:AnnBuilder = 'http://www.bioconductor.org/AnnBuilder/'>
  <AnnBuilder:Attr>
    <AnnBuilder:Target value = "hu6800"/>
    <AnnBuilder:DateMade value = "Wed Oct 2 15:28:54 2002"/>
    <AnnBuilder:Version value = "1.1.0"/>
    <AnnBuilder:SourceFile url = "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz" built = "October 7, 2002"/>
    <AnnBuilder:SourceFile url = "ftp://ftp.genome.ad.jp/pub/kegg/pathways" built = "Release 23.0 (July 2002)"/>
    <AnnBuilder:Element value = "ACCNUM" describ = "GenBank accession number"/>
    <AnnBuilder:Element value = "LOCUSID" describ = "LocusLink identifier"/>
  </AnnBuilder:Attr>
  <AnnBuilder:Data>
    <AnnBuilder:Entry id="A28102_at" describ = "Affymetrix identifier">
      <AnnBuilder:Item name="UNIGENE" value="Hs.123024"/>
      <AnnBuilder:Item name="SYMBOL" value="GABRA3" type="Official"/>
      <AnnBuilder:Item name="CHR" value="X"/>
      <AnnBuilder:Item name="MAP" value="Xq28"/>
    </AnnBuilder:Entry>
  </AnnBuilder:Data>
</AnnBuilder:Annotate>
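
An XML document of this kind can be read back with any namespace-aware parser. The sketch below uses Python's standard `xml.etree.ElementTree` and assumes the elements carry the `AnnBuilder:` namespace prefix bound to `http://www.bioconductor.org/AnnBuilder/`, with annotations stored as `Item` elements (attributes `name` and `value`) inside `Entry` elements. The inline document is a trimmed copy of the example, and `read_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the example document (assumed binding).
NS = {"ab": "http://www.bioconductor.org/AnnBuilder/"}

DOC = """\
<AnnBuilder:Annotate xmlns:AnnBuilder="http://www.bioconductor.org/AnnBuilder/">
  <AnnBuilder:Data>
    <AnnBuilder:Entry id="A28102_at" describ="Affymetrix identifier">
      <AnnBuilder:Item name="UNIGENE" value="Hs.123024"/>
      <AnnBuilder:Item name="SYMBOL" value="GABRA3" type="Official"/>
      <AnnBuilder:Item name="CHR" value="X"/>
      <AnnBuilder:Item name="MAP" value="Xq28"/>
    </AnnBuilder:Entry>
  </AnnBuilder:Data>
</AnnBuilder:Annotate>
"""


def read_entries(xml_text):
    """Return {probe id: {annotation name: value}} for every Entry element."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.iterfind(".//ab:Entry", NS):
        entries[entry.get("id")] = {
            item.get("name"): item.get("value")
            for item in entry.iterfind("ab:Item", NS)
        }
    return entries


annotations = read_entries(DOC)
print(annotations["A28102_at"]["SYMBOL"])
```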
21
  • Contributors
  • Robert Gentleman
  • Vincent Carey
  • Cheng Li