Alan Tonge - PowerPoint PPT Presentation


Transcript and Presenter's Notes


1
SPECTRa-T Project
  • Alan Tonge

Semantic Web Data Repositories from Chemistry
e-Thesis Data Mining
Open Repositories 2008, Southampton University,
2 April 2008
2
Project Overview
Submission, Preservation and Exposure of
Chemistry Teaching and Research Data in Theses
  • 12-month project between University of
    Cambridge and Imperial College London to
    develop text- and data-mining tools to extract
    chemical data from e-theses
  • Part of the JISC Digital Repositories programme

3
Background
Chemistry is an experimental science.
Synthetic Organic Chemistry is the basis of the
Pharmaceutical and Agrochemical industries.
Where does the information to make this molecule
come from?
Systematic Name: Ethyl 4,5-epoxy-hex-2-enoate
Molecular Formula: C8H12O3
4
Search: chemical, patent and journal abstracting
services, e.g.
  • Chemical Abstracts (9000 journals - 12,000
    structures/day)
  • Beilstein (180 core journals)
  • Patents (CAS, Derwent, MDL) (400,000/annum)

Academic chemistry publications are largely derived
from PhD theses
  • Perhaps 10K published per year worldwide
  • A synthetic thesis contains 50-60 preparations,
    only 20% published in detail

5
  • List of: Starting Materials, Reagents
  • Recipe: Reactions, Conditions, Work-up
  • Product: Characterization (spectroscopic,
    physical properties)

6
Sample preparation from synthetic chemistry
thesis
7
The Problem
  • 80% of (academic) synthetic preparations remain
    locked in theses
  • Manual abstraction (cf. journals/patents) is not
    an option

The Solution
  • OSCAR3: Automatic high-throughput chemical name
    and chemical term recognition
  • Open Source Chemistry Analysis Routines is
    an extensible Open Source framework which can
    identify much of the chemical terminology in
    electronic articles
  • Semantic Web: Deposit extracted terms in a
    searchable RDF triplestore

8
OSCAR: Name recognition
1. Dictionary of chemical names/terms (ChEBI
   Ontology)
2. Rules: chemical suffix filters
3. Regular expressions to recognise data,
   formulae
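
The three recognition stages listed above can be illustrated with a toy sketch. This is not OSCAR3 code: the dictionary entries, the suffix list and both regular expressions are assumptions chosen only to show how a dictionary lookup, a suffix rule and a formula pattern might be combined.

```python
import re
from typing import List, Optional, Tuple

# Toy stand-in for a chemical dictionary such as ChEBI (assumed entries).
DICTIONARY = {"benzene", "ethanol", "chloroform"}

# Hypothetical suffix rule: tokens ending in a typical chemical suffix.
SUFFIX_RE = re.compile(r"^[a-z0-9,()\-]*(?:ane|ene|yne|ol|one|oate|amine)$")

# Simple molecular-formula pattern such as C4H8O (two or more element symbols).
FORMULA_RE = re.compile(r"^(?:[A-Z][a-z]?\d*){2,}$")


def classify(word: str) -> Optional[str]:
    """Return the stage that recognised the word, or None."""
    if word.lower() in DICTIONARY:
        return "dictionary"
    # Require a digit so that capitalised words like "OK" are not formulae.
    if FORMULA_RE.match(word) and any(ch.isdigit() for ch in word):
        return "formula"
    if len(word) > 4 and SUFFIX_RE.match(word.lower()):
        return "suffix"
    return None


def recognise(text: str) -> List[Tuple[str, str]]:
    """Tokenise on whitespace and keep the tokens that any stage recognises."""
    hits = []
    for token in text.split():
        word = token.strip(".,;()")
        kind = classify(word)
        if kind is not None:
            hits.append((word, kind))
    return hits


print(recognise("Benzene was added to 2-butanone (C4H8O) and stirred."))
```

A suffix rule this crude also accepts ordinary words such as "alone" or "scene", which is one reason a real system layers a dictionary and further filters on top of it.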
10
Input: PDF (Legacy Format)
PDF is the de facto format for electronic document
deposition in digital repositories
  • Problem

PDF is a Page Description Format optimized for
human, not machine, readability. Extracted text
suffers from
  • irregular word order
  • line-breaks: loss of continuous text, paragraphs
    difficult to identify
  • loss of subscripts and superscripts
  • non-printing characters
  • erroneous character assignment with OCR

12
Programmatic modifications to
  • Remove linebreaks from extended chemical names
  • Remove text fragments derived from Figures and
    Tables
  • Correct whitespace in chemical names
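
As an illustration of the first and third modifications, here is a minimal sketch of how linebreaks and stray spaces inside a chemical name might be repaired. The function name and the regular expressions are assumptions for this example, not the project's actual code, and the heuristics would also alter ordinary hyphenated prose.

```python
import re


def rejoin_chemical_names(text: str) -> str:
    """Heuristically repair names split by PDF text extraction."""
    # 1. A hyphen at a line end: the name continues on the next line.
    text = re.sub(r"-[ \t]*\n[ \t]*", "-", text)
    # 2. A locant list split after a comma: "4,\n7-diene" -> "4,7-diene".
    text = re.sub(r"(?<=\d),[ \t]*\n[ \t]*(?=\d)", ",", text)
    # 3. Stray spaces next to a hyphen that touches a locant digit.
    text = re.sub(r"-[ \t]+(?=\d)", "-", text)
    text = re.sub(r"(?<=\d)[ \t]+-", "-", text)
    return text


broken = ("(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-6-\n"
          "methyl-oct-3-en- 2-one was isolated")
print(rejoin_chemical_names(broken))
```

Applied to the sample string, the sketch restores the single unbroken name used elsewhere in these slides.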

Workflow (diagram labels): PDF, UTF-8 text, OSCAR3,
SAF XML, XSLT, RDF statements

OSCAR used as is on PDF e-theses
  • Gives 5000 terms / thesis (80% duplicates)
  • Cannot identify chemical objects (spectra,
    assignments, properties)
13
Input: MS Office Open XML (docx)
  • No information loss from the student's deposited
    thesis (written with MS software)
  • Identification of experimental sections no
    longer a problem -> Chemical Objects (COs)
  • Conversion of COs into Chemical Markup Language
    (CML)

Workflow (diagram labels): DocX; OSCAR3, extract
chemical terms, RDF statements; extract chemical
objects, CML data files, Data Repository; link
together by URI
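
A docx file is a zip archive whose main text lives in word/document.xml, which is why section structure survives. The sketch below, using only the Python standard library, shows one way paragraphs and their styles could be read and an experimental section located. The function names are hypothetical, and the assumption that headings carry style ids starting with "Heading" holds for default English Word templates but not necessarily for every thesis.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_paragraphs(docx_bytes: bytes) -> List[Tuple[Optional[str], str]]:
    """Return (style id, text) for every paragraph in the document body."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        style_el = p.find("w:pPr/w:pStyle", NS)
        style = style_el.get(f"{{{W}}}val") if style_el is not None else None
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        paragraphs.append((style, text))
    return paragraphs


def section_after_heading(paragraphs, title: str) -> List[str]:
    """Collect body paragraphs between the named heading and the next heading."""
    collected, inside = [], False
    for style, text in paragraphs:
        if style is not None and style.startswith("Heading"):  # assumed style ids
            inside = text.strip().lower() == title.lower()
        elif inside:
            collected.append(text)
    return collected
```

Calling section_after_heading(read_paragraphs(data), "Experimental") on the bytes of a thesis would return the paragraphs of that section, ready for chemical object extraction.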
14
Sample preparation from synthetic chemistry
thesis
15
CML Infra-Red ASSIGNMENTS
<cml:spectrum type="cml:ir">
  <cml:conditionList>
    <cml:condition title="the form of the IR spectrum" dictRef="cml:irform">film</cml:condition>
  </cml:conditionList>
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p3" xValue="3029" title="unassigned" />
    <cml:peak id="p4" xValue="2922" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="C=O" />
    <cml:peak id="p6" xValue="1604" title="C=C" />
    <cml:peak id="p7" xValue="1496" title="unassigned" />
    <cml:peak id="p8" xValue="1454" title="unassigned" />
    <cml:peak id="p9" xValue="1366" title="unassigned" />
    <cml:peak id="p10" xValue="1299" title="unassigned" />
    <cml:peak id="p11" xValue="1135" title="unassigned" />
    <cml:peak id="p12" xValue="1078" title="unassigned" />
    <cml:peak id="p13" xValue="974" title="unassigned" />
  </cml:peakList>
</cml:spectrum>
CML C-13 NMR ASSIGNMENTS
<cml:spectrum type="cml:cnmr">
  <cml:parameterList>
    <cml:parameter dictRef="cml:frequency" units="units:MHz">50</cml:parameter>
  </cml:parameterList>
  <cml:substanceList>
    <cml:substance ref="" />
  </cml:substanceList>
  <cml:peakList>
    <cml:peak xValue="198.6" integral="" peakMultiplicity="" title="C=O" />
    <cml:peak xValue="198.5" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="145.0" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="142.7" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="137.3" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="136.7" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="129.1" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="128.6" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="126.7" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="124.0" integral="" peakMultiplicity="" title="aryl-C" />
    <cml:peak xValue="62.5" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="59.0" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="55.2" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="54.9" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="38.5" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="32.8" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="26.1" integral="" peakMultiplicity="" title="CH3" />
    <cml:peak xValue="26.0" integral="" peakMultiplicity="" title="CH3" />
  </cml:peakList>
</cml:spectrum>
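
Once spectra are held as CML, the peak lists become machine-readable. The following is a minimal sketch of reading such a peak list with the Python standard library. The namespace URI bound to the cml prefix is an assumption (the slide shows only the prefix), and the sample is a three-peak excerpt of the IR spectrum above.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed namespace for the "cml" prefix; the slide does not show it.
CML_NS = "http://www.xml-cml.org/schema"

SAMPLE = """<cml:spectrum xmlns:cml="http://www.xml-cml.org/schema" type="cml:ir">
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p3" xValue="3029" title="unassigned" />
  </cml:peakList>
</cml:spectrum>"""


def read_peaks(cml_text: str) -> List[Tuple[float, str]]:
    """Return (position, assignment) for every peak in a CML spectrum."""
    root = ET.fromstring(cml_text)
    return [
        (float(peak.get("xValue")), peak.get("title", ""))
        for peak in root.iter(f"{{{CML_NS}}}peak")
    ]


def assigned_peaks(peaks: List[Tuple[float, str]]) -> List[Tuple[float, str]]:
    """Keep only peaks that carry an explicit assignment."""
    return [(x, title) for x, title in peaks if title and title != "unassigned"]


print(assigned_peaks(read_peaks(SAMPLE)))
```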
16
RDF - Resource Description Framework. A
component of the Semantic Web, it is based upon
the idea of making statements about
resources/data in the form of a
subject-predicate-object (or
resource-property-value) expression (called a
triple), e.g.
  My_thesis has_chemical_entity 2,4-dinitrobenzene
The value of one property can in turn be used as
the resource for another.
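
The triple model, and the way an object can become the subject of a further statement, can be sketched in a few lines of plain Python. The TripleStore class, the follow helper and the has_formula predicate are hypothetical names introduced for illustration; a real deployment would use an RDF library and a proper triplestore.

```python
from collections import defaultdict
from typing import List


class TripleStore:
    """Tiny in-memory store of (subject, predicate, object) statements."""

    def __init__(self) -> None:
        self._by_subject = defaultdict(list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._by_subject[subject].append((predicate, obj))

    def objects(self, subject: str, predicate: str) -> List[str]:
        return [o for p, o in self._by_subject.get(subject, []) if p == predicate]


def follow(store: TripleStore, start: str, predicates: List[str]) -> List[str]:
    """Walk a chain of predicates, using each object as the next subject."""
    frontier = [start]
    for predicate in predicates:
        frontier = [o for s in frontier for o in store.objects(s, predicate)]
    return frontier


store = TripleStore()
store.add("My_thesis", "has_chemical_entity", "2,4-dinitrobenzene")
# Hypothetical second statement: the object above is now a subject.
store.add("2,4-dinitrobenzene", "has_formula", "C6H4N2O4")

print(follow(store, "My_thesis", ["has_chemical_entity", "has_formula"]))
```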
17
SPARQL QUERY
PREFIX st: <http://wwmm.ch.cam.ac.uk/spectra-t#>
PREFIX dcrdf: <http://purl.org/metadata/dublin_core#>
CONSTRUCT {
  ?thesis st:hasBicycloMoleculeAndHNMR ?chemical .
  ?thesis dcrdf:author ?author
}
WHERE {
  ?thesis dcrdf:creator ?author .
  ?thesis st:hasChemicalName ?annot .
  ?annot st:chemicalName ?chemical .
  ?annot st:hasHNMRSpectrum ?hnmr .
  FILTER regex(?chemical, ".bicyclo.") .
}
RDF TRIPLESTORE ENTRY
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcrdf="http://purl.org/metadata/dublin_core#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:spectra-t="http://wwmm.ch.cam.ac.uk/spectra-t#">
  <rdf:Description rdf:about="file:/C:/spectra-t-theses/Juergen_Harter.docx">
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>CDCl3</spectra-t:chemicalName>
        <spectra-t:hasSMILES>ClC([2H])(Cl)Cl</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/CHCl3/c2-1(3)4/h1H/i1D</spectra-t:hasInChI>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>1-Benzyloxy-but-3-yne</spectra-t:chemicalName>
        <spectra-t:hasSMILES>C#CCCOCC1=CC=CC=C1</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/C11H12O/c1-2-3-9-12-10-11-7-5-4-6-8-11/h1,4-8H,3,9-10H2</spectra-t:hasInChI>
        <spectra-t:hasHNMRSpectrum>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasCMLMolecule>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasCMLMolecule>
        <spectra-t:hasPreparation>http://ch.cam.ac.uk:8182/1ea7f8cd07/preparation-0.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-6-methyl-oct-3-en-2-one</spectra-t:chemicalName>
        <spectra-t:hasHNMRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasIRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasIRSpectrum>
        <spectra-t:hasMassSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasMassSpectrum>
        <spectra-t:hasHRMSSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHRMSSpectrum>
        <spectra-t:hasPreparation>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/preparation-20.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
  </rdf:Description>
</rdf:RDF>
RESULT
<rdf:Description rdf:about="file:/C:/spectra-t-articles/B207708F.docx">
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
</rdf:Description>
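
The query on this slide can be read as a join over four triple patterns followed by a regular-expression filter. The sketch below emulates that matching in plain Python over a handful of toy triples; the subject identifiers, the author name and the data file names are invented for illustration, and a real system would run the SPARQL query against the triplestore instead.

```python
import re
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy triples shaped like the store entry (all identifiers are assumed).
TRIPLES: List[Triple] = [
    ("thesis1", "dcrdf:creator", "A. Author"),
    ("thesis1", "st:hasChemicalName", "_:a1"),
    ("_:a1", "st:chemicalName", "5-Phenyl-bicyclo[4.2.1]nona-3,7-diene"),
    ("_:a1", "st:hasHNMRSpectrum", "data-1.cml"),
    ("thesis1", "st:hasChemicalName", "_:a2"),
    ("_:a2", "st:chemicalName", "CDCl3"),
    ("_:a2", "st:hasHNMRSpectrum", "data-2.cml"),
    ("thesis1", "st:hasChemicalName", "_:a3"),
    ("_:a3", "st:chemicalName", "5-Acetyl-bicyclo[4.2.1]nona-4,7-diene"),
]  # _:a3 has no HNMR spectrum, so it should be filtered out


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples agreeing with every term that is not None."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def bicyclo_with_hnmr(triples: List[Triple]) -> List[Triple]:
    """Emulate the WHERE patterns, the FILTER and the CONSTRUCT template."""
    constructed = []
    for thesis, _, author in match(triples, p="dcrdf:creator"):
        for _, _, annot in match(triples, s=thesis, p="st:hasChemicalName"):
            has_hnmr = bool(match(triples, s=annot, p="st:hasHNMRSpectrum"))
            for _, _, chemical in match(triples, s=annot, p="st:chemicalName"):
                if has_hnmr and re.search(r".bicyclo.", chemical):
                    constructed.append((thesis, "st:hasBicycloMoleculeAndHNMR", chemical))
                    constructed.append((thesis, "dcrdf:author", author))
    return constructed


for triple in bicyclo_with_hnmr(TRIPLES):
    print(triple)
```

Only the annotation that has both a matching name and an HNMR spectrum contributes to the constructed statements, which mirrors the shape of the RESULT shown above.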
18
Message to repository managers
  • PDF is a limited format for data extraction
    from e-theses
  • Docx allows chemical data object extraction
    (80% precision / recall)

Solutions
  • Domain ontology development
  • Make your e-theses public!

Caveats (Proof-of-concept)
  • Single subject area (synthetic organic
    chemistry)
  • Single institution docx (limited variation in
    document structure)
  • Limited thesis availability

19
Acknowledgements
  • Project Director: Peter Morgan (UL, Cambridge)
  • Chemistry leads: Henry Rzepa, Peter Murray-Rust
  • Developers: Jim Downing, Diana Stewart,
    Joe Townsend, Matt Harvey
  • Project Manager: Alan Tonge

http://www.lib.cam.ac.uk/spectra-t/
20
SPECTRa Tools Workshop
Autumn 2008 Unilever Centre, Cambridge, UK
Contact: Peter Murray-Rust (pm286_at_cam.ac.uk),
Peter Morgan (pbm2_at_cam.ac.uk)