Project Prospect and the Semantic Web

Learn more at: https://www.w3.org
1
Project Prospect and the Semantic Web
  • Colin Batchelor
  • Royal Society of Chemistry, Cambridge, UK
  • batchelorc_at_rsc.org

2
Project Prospect and the Semantic Web
  • Who we are
  • What we've done
  • Motivation
  • Means
  • The InChI and the Semantic Web
  • Ontology development for chemistry
  • RXNO and MOP

3
Who we are
4
(No Transcript)
5
Royal Society of Chemistry: Advancing the Chemical Sciences
  • Learned and professional society
  • Scientific publisher
  • 25 journals, 8 databases and a growing book
    program
  • 8000 articles yearly
  • Covering a broad spectrum of chemical sciences
    from systems biology (Molecular BioSystems) to
    physical and theoretical chemistry (PCCP)

6
What we've done
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
The motivation
11
The motivation
  • Scientific papers are formulaic and consistently
    structured (but not necessarily IMRD; see later)
  • There may be infinitely many possible chemical
    compounds
  • BUT
  • Nomenclature is productive and susceptible to
    machine parsing

12
The means
13
The means: how publishing really works
14
Data capture
Editing and proof-reading
15
Enhanced HTML
Database
Text mining (Oscar)
Manual QA
Enhanced RSS
16
(No Transcript)
17
(No Transcript)
18
Regular polysemy
  • where words stand for multiple things in a
    consistent way.
  • Examples
  • Brand names
  • Grinding
  • Figure-ground
  • Exact-class-part polysemy in chemistry
  • Peter Corbett, Colin Batchelor and Ann Copestake
    (2008), Pyridines, pyridine and pyridine rings,
    Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.

19
Regular polysemy
  • Brand names
  • Learning to buy a Renault and talk to BMW
  • Grinding
  • The squirrel scampered down the path and kept
    stopping and looking at the officers to check
    they were behind
  • vs.
  • the trick was to serve squirrel fresh and
    not to leave it hanging like other game

20
Regular polysemy
  • Figure-ground
  • Audrey Hepburn painted the door (figure)
  • Audrey Hepburn walked through the door (ground)
  • The Incredible Hulk walked through the door
    (ambiguous)

21
Imidazole
22
An imidazole
23
The imidazole side-chain/group/ring/etc.
24
Can ChEBI handle this?
  • Imidazoles (!) (CHEBI:24780)
  • Imidazole (CHEBI:16069)
  • Imidazole ring: not yet
  • Imidazolyl group: not yet (but methyl, benzyl,
    etc.)
  • and there are no disambiguation cues

25
Disambiguation
  • One Sense per Discourse (Gale et al. 1992)
  • this doesn't hold at all
  • One Sense per Collocation (Yarowsky 1993)
  • matches our intuitions

26
Disambiguation: toy model
  • CLASS
  • w(-1): a, an, the, this
  • w(0): plural (bit of a cheat, as not a
    collocation)
  • PART
  • w(-1): bridging, terminal
  • w(+1): backbone, bridge, chain, core, dyad,
    fluorophore, fragment, framework (and many more)
  • w(+1) w(+2): building block, protecting
    group, side chain
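
The toy model above can be sketched as a handful of collocation rules. This is a minimal illustration only, not the production system: it reads the cue words as the words immediately before and after the chemical name, and the function name, the crude plural test, the rule order and the fallback label EXACT are all assumptions. The cue lists are just the ones shown on the slide.

```python
# Cue words copied from the slide; the lists are illustrative, not exhaustive.
CLASS_BEFORE = {"a", "an", "the", "this"}
PART_BEFORE = {"bridging", "terminal"}
PART_AFTER = {"backbone", "bridge", "chain", "core", "dyad",
              "fluorophore", "fragment", "framework"}
PART_AFTER_2 = {("building", "block"), ("protecting", "group"),
                ("side", "chain")}


def classify(tokens, i):
    """Guess the sense (PART, CLASS or EXACT) of the chemical name tokens[i].

    Hypothetical helper: PART cues are tried first so that
    "the imidazole side chain" is not read as CLASS because of "the".
    """
    word = tokens[i].lower()
    before = tokens[i - 1].lower() if i > 0 else ""
    after = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    after2 = tokens[i + 2].lower() if i + 2 < len(tokens) else ""

    if before in PART_BEFORE or after in PART_AFTER:
        return "PART"
    if (after, after2) in PART_AFTER_2:
        return "PART"
    # Assumption: a trailing "s" is a good enough plural test for a toy model.
    if before in CLASS_BEFORE or word.endswith("s"):
        return "CLASS"
    return "EXACT"
```

For example, `classify("the imidazole side chain".split(), 1)` gives PART, while `classify("an imidazole".split(), 1)` gives CLASS.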

27
Why is this hard?
Coordination resolution
Part-of-speech ambiguity: tosylates, noun or verb?
28
Why is this hard?
  • How many numbered compounds actually are named in
    a given paper?
  • iloprost (1)
  • tributyl-1-hexynylstannane (2)
  • the desired 2-heptyne (3)
  • methylPd(II) iodide 4 or 4'
  • alkynylstannane 5
  • the hypervalent stannate 6
  • (alkynyl)(methyl)Pd(II) complex 7
  • the desired methylalkyne 8
  • compounds 9-14
  • the stannyl precursors 15 and 16
  • methylated compounds 17 and 18
  • stannyl precursor 19
  • iloprost methyl ester 20
  • iloprost methyl ester is the real name, but you
    need to know that iloprost is a monocarboxylic
    acid!

29
Why is this hard?
  • For compound names:
  • 60% Oscar (Corbett and Murray-Rust 2006,
    Batchelor and Corbett 2007)
  • 20% PubChem
  • 20% ChemDraw
  • For compound numbers:
  • 70% author ChemDraw
  • 30% editors

30
What are we marking up?
  • Chemical compounds (InChI, ChEBI)
  • Chemical classes and parts (ChEBI)
  • Nanoparticles (in ChEBI from end of October)
  • Chemical terms from the IUPAC Gold Book
  • Name reactions (RXNO)
  • Gene products: function, process, location (GO)
  • Nucleotide and polypeptide sequence terms (SO)
  • Cell types (CL)

31
InChI and the Semantic Web
32
What InChI is for
  • Can represent complete molecules (may be ions or
    radicals) of less than 1024 heavy (non-H) atoms.
  • (however)
  • Cannot yet represent metal atom geometry.
  • Cannot yet represent polymers.
  • Cannot yet represent diradicals etc.
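
The size limit above can be checked roughly from the formula layer of an InChI string. The following is a sketch under stated assumptions: the function names are hypothetical, the formula layer is taken to be the second slash-separated field, and only simple formulas (optionally dot-separated components with a leading multiplier) are handled.

```python
import re

MAX_HEAVY_ATOMS = 1024  # limit quoted on the slide ("less than 1024")


def heavy_atom_count(inchi):
    """Count non-hydrogen atoms using the formula layer of an InChI string."""
    layers = inchi.split("/")
    if len(layers) < 2:
        raise ValueError("no formula layer found")
    total = 0
    for component in layers[1].split("."):
        # A leading integer is a multiplier, e.g. "2H2O" means two waters.
        match = re.match(r"(\d*)(.*)", component)
        multiplier = int(match.group(1)) if match.group(1) else 1
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", match.group(2)):
            if element != "H":
                total += multiplier * (int(count) if count else 1)
    return total


def within_size_limit(inchi):
    """True if the molecule has fewer heavy atoms than the quoted limit."""
    return heavy_atom_count(inchi) < MAX_HEAVY_ATOMS
```

For ethanol, `heavy_atom_count("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")` counts two carbons and one oxygen.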

33
What InChI is not for
  • Classes of molecule
  • Parts of molecule
  • (these have been done in ChemBlast)

34
InChI in RDF
  • (We don't like this.)
  • We use the RSS content module. (As if articles
    contained molecules.)
  • And we use info:inchi URIs.
  • Look:

35
Some RDF
    <content:items>
      <rdf:Bag>
        <rdf:li>
          <content:item rdf:about="info:inchi/InChI=1/C15H22O9/c1-8(16)19-6-15(7-20-9(2)17)12(21-10(3)18)11-13(24-15)23-14(4,5)22-11/h11-13H,6-7H2,1-5H3/t11?,12-,13/m1/s1"/>
        </rdf:li>
        <rdf:li>
          <content:item rdf:about="info:inchi/InChI=1/C21H34O9/c1-6-9-14(22)25-12-21(13-26-15(23)10-7-2)18(27-16(24)11-8-3)17-19(30-21)29-20(4,5)28-17/h17-19H,6-13H2,1-5H3/t17?,18-,19/m1/s1"/>
        </rdf:li>
      </rdf:Bag>
    </content:items>
    <content:items>
      <content:item><owl:Class rdf:ID="GO_0016298">
        <rdfs:label>lipase activity</rdfs:label>
      </owl:Class></content:item>
    </content:items>
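
Markup of the shape shown on this slide could be generated along the following lines. This is a sketch, not the RSC's actual feed code: the function names are hypothetical, the InChI is appended to "info:inchi/" without any escaping (an assumption), and methane stands in for a real compound from an article.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"  # RSS 1.0 content module
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("content", CONTENT_NS)


def inchi_to_info_uri(inchi):
    """Build an info:inchi URI by simple concatenation (no escaping applied)."""
    return "info:inchi/" + inchi


def content_items(inchis):
    """Wrap a list of InChI strings in content:items / rdf:Bag / rdf:li."""
    items = ET.Element(f"{{{CONTENT_NS}}}items")
    bag = ET.SubElement(items, f"{{{RDF_NS}}}Bag")
    for inchi in inchis:
        li = ET.SubElement(bag, f"{{{RDF_NS}}}li")
        ET.SubElement(li, f"{{{CONTENT_NS}}}item",
                      {f"{{{RDF_NS}}}about": inchi_to_info_uri(inchi)})
    return items


if __name__ == "__main__":
    element = content_items(["InChI=1/CH4/h1H4"])
    print(ET.tostring(element, encoding="unicode"))
```

Building the tree with ElementTree rather than by string concatenation keeps the output well-formed even if an identifier contains characters that need escaping in XML.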

36
(No Transcript)
37
RXNO
  • David Barden
  • Colin Batchelor
  • Celia Gitterman

38
RXNO: the name reaction ontology (1)
  • Every chemist knows about famous chemists like
    Wittig, Cannizzaro, Diels, Alder, benzoin
  • They're pretty unambiguous and well-suited to
    logical definitions
  • But what organizing principle do we use?

39
RXNO: the name reaction ontology (2)
  • Sort reactions by what they do to the skeleton
    of the molecule.
  • Skeleton-changing reactions
  • Joinings, cleavings, rearrangements, ring
    formation, ring expansion
  • Skeleton-preserving reactions
  • Additions, eliminations, substitutions,
    protections, deprotections
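
The two-way split above can be written down as a small lookup table. This is only an illustrative data structure, not the RXNO file format: the dictionary name, the helper function and the singular category labels are assumptions, and the categories are exactly those listed on the slide.

```python
# Top-level split by effect on the molecular skeleton, as listed on the slide.
SKELETON_CLASSES = {
    "skeleton-changing": ["joining", "cleaving", "rearrangement",
                          "ring formation", "ring expansion"],
    "skeleton-preserving": ["addition", "elimination", "substitution",
                            "protection", "deprotection"],
}


def top_level_class(category):
    """Return the top-level class that a reaction category belongs to."""
    for top_level, categories in SKELETON_CLASSES.items():
        if category in categories:
            return top_level
    raise KeyError(f"unknown reaction category: {category}")
```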

40
RXNO: the name reaction ontology (3)
  • Quality? Subjectivity?
  • Get our curators to assign reactions to
    categories without conferring, check percentage
    agreement, discuss disagreements, improve
    guidelines, iterate to convergence.
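
The agreement check in this loop can be sketched as follows. The slide does not say exactly how percentage agreement was calculated, so this assumes simple pairwise agreement over the reactions that both curators categorised; the function names and the data layout (curator, then reaction, then category) are hypothetical.

```python
from itertools import combinations


def percentage_agreement(assignments):
    """Pairwise percentage agreement between curators.

    assignments maps curator -> {reaction: category}.
    """
    agree = total = 0
    for a, b in combinations(sorted(assignments), 2):
        shared = assignments[a].keys() & assignments[b].keys()
        for reaction in shared:
            total += 1
            agree += assignments[a][reaction] == assignments[b][reaction]
    if total == 0:
        raise ValueError("no reaction was categorised by two curators")
    return 100.0 * agree / total


def disputed(assignments):
    """Reactions given more than one category: the ones to discuss."""
    labels = {}
    for per_curator in assignments.values():
        for reaction, category in per_curator.items():
            labels.setdefault(reaction, set()).add(category)
    return sorted(r for r, cats in labels.items() if len(cats) > 1)
```

Each round, the output of `disputed` is the agenda for the discussion, and `percentage_agreement` is the number to watch until it stops improving.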

41
(No Transcript)
42
(No Transcript)
43
RXNO: the name reaction ontology (4)
44
(No Transcript)
45
What do people say?
46
(No Transcript)
47
The spectroscopist's tale
  • The enriched html version came as something of a
    revelation and the current emphasis on links to,
    and through biomolecular terminology was very
    much a plus for us, since my colleagues and I are
    a mix of physical and biological chemists who are
    dabbling in inter-disciplinary waters. Given the
    steadily increasing burden of keeping up with the
    current literature and accessing earlier
    publications - a fortiori when conventional
    disciplinary boundaries are being crossed - the
    ability to 'grow a tree' from current articles
    (including one's own) is going to make 'targeted
    sleuthing' a great deal easier.
  • John Simons, Oxford

48
The high-throughput screener's tale
  • An interesting opportunity particularly for
    managers, students and beginners that are not
    that deeply immersed in the detail and the
    terminology. It further opens access to those who
    want to explore areas they are not specialists
    in. Great idea!
  • Eberhard Krausz, MPI-CBG Dresden

49
Lastly
  • My only criticism would be the need for a time
    warning: I spent 4 hours digging about, which
    generated at least six new research ideas, printed
    half a ream of paper and I missed my bus home. At
    least it was a new excuse my wife had not heard,
    so another first.
  • An analytical chemist, The North.

50
(No Transcript)
51
Acknowledgements
  • Royal Society of Chemistry
  • Richard Kidd, Jeff White, David Barden, Celia
    Gitterman, Hilary Burch, the Informatics team
  • University of Cambridge
  • Peter Corbett, Simone Teufel, Ann Copestake,
    Peter Murray-Rust
  • OBO
  • Karen Eilbeck, Midori Harris, Jen Deegan, Jane
    Lomax, Chris Mungall, Barry Smith, the ChEBI team

52
(No Transcript)