Writing Simple Perl Scripts to Create, Convert and Analyze XML Documents - PowerPoint PPT Presentation


1
Writing Simple Perl Scripts to Create, Convert and Analyze XML Documents
Presented for APIII - Advancing Practice, Instruction and Innovation through Informatics
Marriott City Center, Pittsburgh, PA
Friday, October 10, 2003
Session E2: Perl and Python Programming Workshop
Session Organizers: Jules Berman and Jim Harrison
Jules J. Berman, Ph.D., M.D.
Program Director for Pathology Informatics
Cancer Diagnosis Program
National Cancer Institute
National Institutes of Health
Rockville, MD
2
Virtually everything presented can be reviewed at
your leisure at http://65.222.228.150/jjb/tutor.htm
This site contains hundreds of Perl programming
tips and scripts.
3
(No Transcript)
4
What is the purpose of XML?
XML allows heterogeneous systems to communicate and
exchange their data. It achieves this through metadata
(data about data). It can produce an ideal document
that completely describes itself, including all data
and all metadata.
5
COMMON XML TASKS
1. Converting an HTML file to an XML file
2. Converting an XML file to an HTML file (e.g. making an XML file
   presentable while preserving its information content)
3. Converting an Excel file to an XML file
4. Converting an XML file to a different data structure (e.g. moving
   XML into a standard database)
5. Querying an XML file
6. Querying multiple XML files for related information
6
Let's do a simple conversion of an HTML file to an
XML file. Here's the HTML file (notice that the
top header information has been removed):

<body>
<h1>Simple HTML document</h1>
<br>List to follow
<ul>
<li>First
<li>Second
<li>Third
</ul>
</body>
</html>
7
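use strict;
use warnings;

# Tag dictionary: HTML element name => XML element name
my %dictionary = ( "body" => "document", "h1" => "title", "ul" => "list", "ol" => "list" );
my @keysarray = keys(%dictionary);

open (TEXT, "<", "html.htm") || die "Cannot open html.htm";     # substitute your html page
open (STDOUT, ">", "html.xml") || die "Cannot open html.xml";   # substitute your output file
print "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n";
while (defined(my $line = <TEXT>))
  {
  chomp $line;
  $line =~ s/<\/html>//;          # the closing html tag has no XML counterpart
  next if $line eq "";
  if ($line =~ /<br>/)            # text after <br> becomes a <line> element
    {
    print "<line>$'</line>\n";
    next;
    }
  if ($line =~ /<li>/)            # text after <li> becomes an <item> element
    {
    print "<item>$'</item>\n";
    next;
    }
  foreach my $key (@keysarray)    # rename the remaining tags via the dictionary
    {
    $line =~ s/(<\/?)$key/$1$dictionary{$key}/g;
    }
  print "$line\n";
  }
exit;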
8
Most important parts of the HTML->XML script:

%dictionary = ( "body" => "document", "h1" => "title", "ul" => "list", "ol" => "list" );
@keysarray = keys(%dictionary);
foreach $key (@keysarray)
  {
  $line =~ s/(<\/?)$key/$1$dictionary{$key}/g;
  }
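The dictionary substitution can be tried on single lines without any files. The following is a minimal sketch, not the script from the talk: the sub name rename_tags and the added \b word boundary are assumptions made for illustration.

```perl
use strict;
use warnings;

# Same tag dictionary as on the slide: HTML element name => XML element name.
my %dictionary = ( "body" => "document", "h1" => "title", "ul" => "list", "ol" => "list" );

# Rename every opening or closing tag listed in the dictionary.
# The \b keeps a key such as "ul" from matching a longer tag name.
sub rename_tags
  {
  my ($line) = @_;
  foreach my $key (keys %dictionary)
    {
    $line =~ s/(<\/?)$key\b/$1$dictionary{$key}/g;
    }
  return $line;
  }

print rename_tags("<h1>Simple HTML document</h1>"), "\n";
print rename_tags("<ul>"), "\n";
```

The captured group (<\/?) holds either "<" or "</", so one substitution handles both the opening and the closing form of each tag.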
9
(No Transcript)
10
Converting an XML file to an HTML file (many
many different ways to do this)
11

12
Converting an XML file to an HTML file

use XML::Parser;                      # calls an external module
open (STDOUT, ">output.htm") || die "Cannot open output.htm";
my $parser = XML::Parser->new( Handlers => {
    Init  => \&handle_doc_start,
    Final => \&handle_doc_end,
    Start => \&handle_elem_start,
    End   => \&handle_elem_end,
    Char  => \&handle_char_data,
  } );
my $file = "presum.xml";
$parser->parsefile($file);
13
sub handle_doc_start
  {
  my $header = <<HEADER;
<html>
<head>
<title> Precancer Classification </title>
</head>
<body>
<center><h1>Precancer Classification</h1></center>
<br>
<br>
HEADER
  print $header;
  }
14
sub handle_doc_end
  {
  my $header = <<HEADER;
<br>
</body>
</html>
HEADER
  print $header;
  }
15
sub handle_elem_start
  {
  my ($expat, $name, %atts) = @_;
  if ($name eq "concept")
    {
    $count++;                 # running number of <concept> elements
    print "<br><font color=\"#0000ff\">$name $count</font><ul>\n";
    return;
    }
  # Etc., etc., etc.
  }
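The slides stop before the End and Char handlers. The sketch below shows one way the remaining handlers could fit together. It is not the talk's script: the bodies of handle_elem_end and handle_char_data, the $html buffer, and the sample input string are all assumptions, since presum.xml is not shown. It requires the XML::Parser module.

```perl
use strict;
use warnings;
use XML::Parser;

my $count = 0;     # running number of <concept> elements seen
my $html  = "";    # collected HTML output (the talk prints to STDOUT instead)

sub handle_elem_start
  {
  my ($expat, $name, %atts) = @_;
  if ($name eq "concept")
    {
    $count++;
    $html .= "<br><font color=\"#0000ff\">$name $count</font><ul>\n";
    }
  }

sub handle_char_data
  {
  my ($expat, $text) = @_;
  $text =~ s/^\s+|\s+$//g;                 # trim surrounding whitespace
  $html .= "<li>$text\n" if $text ne "";   # skip whitespace-only chunks
  }

sub handle_elem_end
  {
  my ($expat, $name) = @_;
  $html .= "</ul>\n" if $name eq "concept";
  }

my $parser = XML::Parser->new( Handlers => {
    Start => \&handle_elem_start,
    End   => \&handle_elem_end,
    Char  => \&handle_char_data,
  } );

# Hypothetical sample input; the real presum.xml is not shown in the talk.
$parser->parse("<list><concept><name>sample one</name></concept>"
             . "<concept><name>sample two</name></concept></list>");
print $html;
```

Each handler receives the parser object first, then the element name (Start, End) or the text chunk (Char). For real data the text should also be HTML-escaped before it is written out.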
16
(No Transcript)
17
Remember: Perl XML-related modules can be
downloaded and installed at no cost through the
ppm service from www.activestate.com.
18
PPM> search xml
Packages available from http://www.ActiveState.com/PPMPackages/5.6:

CGI-Form2XML                   1.3    Render CGI form input as XML
CGI-ToXML                      0.02   Converts CGI to an XML structure
CGI-XML                        0.1    Perl extension for converting CGI.pm variables to/from XML
CGI-XMLForm                    0.10   Extension of CGI.pm which reads/generates formatted XML.
CGI-XMLPost                    1.3    receive XML file as an HTTP POST
DBIx-XML-DataLoader            1.1b
DBIx-XMLMessage                0.05   XML Message exchange between DBI data sources
DBIx-XML_RDB                   0.05   Perl extension for creating XML from existing DBI datasources
Data-DumpXML                   1.05   Dump arbitrary data structures as XML
GoXML-XQI                      1.1.4  Perl extension for the XML Query Interface at xqi.goxml.com.
HTTP-WebTest-Plugin-XMLReport  1.01   Report plugin for HTTP::WebTest generates output in XML format
19
Tk-XMLViewer                   0.15   Tk widget to display XML
XML-AutoWriter                 0.37   DOCTYPE based XML output
XML-Beautify                   0.05   Beautifies XML output from XML::Writer (soon to do any XML).
XML-DOM                        1.25   A perl module for building DOM Level 1 compliant document structures
XML-DOMHandler                 1      Implements a call-back interface to DOM.
XML-DTDParser                  1.7    quick and dirty DTD parser
XML-Excel                      0.02   Perl extension converting Excel files to XML
XML-Node                       0.11   Node-based XML parsing: a simplified interface to XML::Parser
XML-SAX                        0.12   Simple API for XML
XML-SAX-Base                   1.02   Base class SAX Drivers and Filters
XML-SAX-Builder                0.02   build XML documents using SAX
XML-SAX-Expat                  0.37   SAX Driver for Expat
XML-SAX-Machines               0.4    manage collections of SAX processors
20
XML-SAX-PurePerl               0.80   Pure Perl XML Parser with SAX2 interface
XML-SAX-RTF                    0.1    SAX Driver for Microsoft's Rich Text Format (RTF)
XML-SAX-Simple                 0.02   SAX version of XML::Simple
XML-SAX-Writer                 0.44   SAX2 XML Writer
XML-SAXDriver-CSV              0.07   SAXDriver for converting CSV files to XML
XML-Writer                     0.4    Perl extension for writing XML documents.
XML-Writer-String              0.1    Capture output from XML::Writer.
XML-XPath                      1.12   a set of modules for parsing and evaluating XPath statements
XML-XPath-Simple               0.05   Very simple interface for XPaths
XML-XPathScript                0.03   Stand alone XPathScript
XML-XQL                        0.68   A perl module for querying XML tree structures with XQL
XML-XSLT                       0.40   A perl module for processing XSLT
21
  • Creating an XML file from an Excel file
  • The example is done in Windows. Because it uses
    a Windows-based application and the Windows API,
    it won't work in Linux (not Perl's fault).
  • There are plenty of other approaches that will
    work in Linux.
  • It also requires Excel to be installed.
  • The complete Perl script is opener7.pl, found
    in the Perl tutorial at
    http://65.222.228.150/jjb/tutor.htm
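As one illustration of an approach that does not depend on Windows or Excel, the spreadsheet can first be exported as comma-separated text and converted with plain Perl. This is a sketch under stated assumptions, not a script from the talk: the sub names tag_name and csv_to_xml, the <row> wrapper element, and the sample data are hypothetical, and quoted fields or embedded commas are not handled.

```perl
use strict;
use warnings;

# Turn a header cell into a usable XML tag name, following the same
# clean-up steps the Excel script applies to its column headers.
sub tag_name
  {
  my ($header) = @_;
  $header =~ s/ /_/g;          # spaces become underscores
  $header =~ s/[^\w]//g;       # drop anything that is not a word character
  $header =~ s/2nd/Second/g;   # an XML name may not start with a digit
  return $header;
  }

# Convert comma-separated lines (first line = column headers) to XML.
# Assumes no quoted fields and no embedded commas.
sub csv_to_xml
  {
  my (@lines) = @_;
  chomp @lines;
  my @tags = map { tag_name($_) } split(/,/, shift @lines);
  my $xml = "";
  foreach my $line (@lines)
    {
    my @values = split(/,/, $line, -1);
    $xml .= "<row>\n";
    for my $i (0 .. $#tags)
      {
      my $thing = defined $values[$i] ? $values[$i] : "";
      $thing =~ tr/a-zA-Z0-9 //cd;    # keep only letters, digits and spaces
      $xml .= " <$tags[$i]>$thing</$tags[$i]>\n";
      }
    $xml .= "</row>\n";
    }
  return $xml;
  }

print csv_to_xml("Case ID,2nd Opinion\n", "A1,benign\n", "A2,atypical\n");
```

Because the values are reduced to letters, digits and spaces, no entity escaping is needed here. Data that must keep punctuation would need escaping instead of stripping.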

22
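# Creates a Windows OLE object for Excel - NON_PERLISH
my $app = CreateObject OLE "Excel.Application" or die "Can't open";
$app->Workbooks->Open($xlfile);

# Creates the XML tags by collecting a list of the column headers
foreach my $column_place (@column_array)
  {
  $thing = $app->Range("${column_place}1")->{'Value'};   # header cell, e.g. A1
  if ($thing ne "")
    {
    $thing =~ s/ /_/g;              # spaces become underscores
    $thing =~ s/[^\w0-9]//g;        # drop characters not allowed in a tag name
    $thing =~ s/2nd/Second/g;       # a tag name may not start with a digit
    $nextthing = "$column_place\\$thing";   # column letter, backslash, tag name
    print "$nextthing\n";
    push(@index, $nextthing);
    }
  else
    {
    last;                           # first empty header ends the column list
    }
  }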
23
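# Creates a Windows OLE object for Excel - NON_PERLISH
foreach my $arrayvalue (@index)
  {
  $arrayvalue =~ /\\/;              # split "column\tagname" at the backslash
  my $key = $`;                     # column letter
  my $value = $';                   # tag name
  $thing = $app->Range($key . $row)->{'Value'};
  # substitute &amp; for & (this must come first)
  $thing =~ s/\&/\&amp\;/g;
  # substitute &gt; for >
  $thing =~ s/\>/\&gt\;/g;
  # substitute &lt; for <
  $thing =~ s/\</\&lt\;/g;
  # substitute &apos; for '
  $thing =~ s/\'/\&apos\;/g;
  # substitute &quot; for "
  $thing =~ s/\"/\&quot\;/g;
  # keep letters, digits, spaces, and the & and ; used by the entities
  $thing =~ tr/a-zA-Z0-9 &;//cd;
  print " <$value>$thing</$value>\n";
  }
$row++;                             # move on to the next spreadsheet row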
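The five entity substitutions can be collected into one reusable sub. The following is a minimal sketch; the name xml_escape is an assumption and does not appear in opener7.pl as shown.

```perl
use strict;
use warnings;

# Replace the five characters that XML reserves with their predefined entities.
sub xml_escape
  {
  my ($text) = @_;
  $text =~ s/&/&amp;/g;     # must run first
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  $text =~ s/'/&apos;/g;
  $text =~ s/"/&quot;/g;
  return $text;
  }

print xml_escape(q{Smith & Jones <"5'-end">}), "\n";
# prints: Smith &amp; Jones &lt;&quot;5&apos;-end&quot;&gt;
```

The ampersand is handled first because every entity itself contains an ampersand. Escaping it last would turn an already produced &lt; into &amp;lt;.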
24
(No Transcript)
25
BUILDING THE COOPERATIVE PROSTATE CANCER TISSUE RESOURCE TISSUE MICROARRAY FILE
1. Get the xls file with core information
   TMACPCTR.XLS    98,816  7-24-03  11:17am A
2. Convert the xls file to an xml file using opener7.pl
   OPENER7 .PL      3,663  7-24-03  11:39am A
   This produces the file block2.xml
   BLOCK2.XML     328,263  7-24-03  11:39am A
3. Add header and trailer information to the xml file
26
Header information is basically:

<?xml version="1.0"?>
<histo>
<tma>
<header>
<title>CPCTR Microarray 1</title>
<creator>CPCTR</creator>
<subject>Tissue Microarrays</subject>
<description>CPCTR TMA XML</description>
<rights>public domain</rights>
<filename>tmacpctr.xml</filename>
</header>

Trailer information is basically:

</core>
</block>
</tma>
</histo>

This produces
TMACPCTR .XML  331,636  7-28-03  10:58am A
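Step 3 amounts to writing the header, then the converted body, then the trailer. The sketch below shows this with in-memory strings. The sub name add_header_and_trailer and the sample body are assumptions; the actual contents of block2.xml are not shown in the talk.

```perl
use strict;
use warnings;

# Header and trailer text as given on the slide.
my $header = <<'HEADER';
<?xml version="1.0"?>
<histo>
<tma>
<header>
<title>CPCTR Microarray 1</title>
<creator>CPCTR</creator>
<subject>Tissue Microarrays</subject>
<description>CPCTR TMA XML</description>
<rights>public domain</rights>
<filename>tmacpctr.xml</filename>
</header>
HEADER

my $trailer = <<'TRAILER';
</core>
</block>
</tma>
</histo>
TRAILER

# Return the complete document: header, body, trailer.
sub add_header_and_trailer
  {
  my ($body) = @_;
  $body .= "\n" unless $body =~ /\n\z/;   # make sure the body ends a line
  return $header . $body . $trailer;
  }

# Hypothetical stand-in for the contents of block2.xml.
my $body = "<block>\n<core>\n<core_id>1</core_id>\n";
print add_header_and_trailer($body);
```

For real files the body would be read from block2.xml and the result written to tmacpctr.xml. The single-quoted heredoc markers keep Perl from interpolating anything inside the XML text.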
27
4. Check the validity of the tmacpctr.xml file using validtma.pl
   VALIDTMA .PL  9,132  5-21-03  3:06pm A

The TMA validating Perl script can be obtained from
the TMA specification paper:

The tissue microarray data exchange specification:
A community-based, open source tool for sharing
tissue microarray data.
Jules J Berman, Mary E Edgerton and Bruce A Friedman.
BMC Medical Informatics and Decision Making 2003, 3:5
http://www.biomedcentral.com/1472-6947/3/5

The validating protocol produces a screen output
that includes:

c:\tmacpctr.xml
Beginning to parse c:\tmacpctr.xml now.
Finished.
c:\tmacpctr.xml is a valid Tissue Microarray File.
The one-way hash of your file is
e2ad62a75974628b7499bd7d771b82f0
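The hash in the screen output has 32 hexadecimal digits, which is the length of an MD5 digest. The slides do not say which algorithm validtma.pl uses, so the sketch below assumes MD5 and uses the core Digest::MD5 module. The sub name one_way_hash and the sample string are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Return the MD5 digest of a string as 32 hexadecimal characters.
# MD5 is an assumption; the talk does not name the algorithm.
sub one_way_hash
  {
  my ($content) = @_;
  return md5_hex($content);
  }

my $sample = "<tma></tma>\n";    # hypothetical stand-in for the file contents
print "The one-way hash of your file is ", one_way_hash($sample), "\n";
```

To hash a file directly, open it, call binmode on the handle, and use Digest::MD5->new->addfile($fh)->hexdigest. A changed file yields a different digest, which is what makes the value useful as a fingerprint of the validated document.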
28
  • Querying an XML file
  • 1. There are many, many ways. Most people use XSLT
    (Extensible Stylesheet Language Transformations).
  • 2. When you haven't converted your XML into
    another data structure (like a database
    structure) and you're using straight XML as the
    document that you're querying, then a query is
    the same as a transformation where you throw
    everything away except the stuff that matches
    your query.
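The idea of a query as a transformation that discards everything except the matches can be sketched in a few lines of Perl. This is an illustration only: the sub name query_xml and the sample document are hypothetical, and the regular expression assumes simple, non-nested elements without attributes. A real parser is the safer choice for arbitrary XML.

```perl
use strict;
use warnings;

# Return the text of every <$tag> element whose content matches $pattern.
# Assumes simple, non-nested elements without attributes.
sub query_xml
  {
  my ($xml, $tag, $pattern) = @_;
  my @hits;
  while ($xml =~ /<$tag>(.*?)<\/$tag>/gs)
    {
    my $content = $1;
    push(@hits, $content) if $content =~ /$pattern/i;
    }
  return @hits;
  }

# Hypothetical sample document.
my $xml = <<'XML';
<list>
<name>prostatic intraepithelial neoplasia</name>
<name>actinic keratosis</name>
<name>cervical intraepithelial neoplasia</name>
</list>
XML

print "$_\n" foreach query_xml($xml, "name", "intraepithelial");
```

Everything outside the matching elements, including the enclosing list tags and the non-matching name, is simply not copied to the output.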

29
  • HETEROGENEOUS XML MERGES/QUERIES
  • Can be thought of as a special form of XSLT
  • Or as a data structure conversion
  • Or as a straightforward Perl programming job

30
HETEROGENEOUS XML MERGES/QUERIES
31
HETEROGENEOUS XML MERGES/QUERIES
This is where namespaces become important.
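A small sketch of why namespaces matter in a merge: two sources may both use a title element with different meanings, and a prefix keeps them apart once they share one document. Everything here is hypothetical, including the sub name add_prefix, the sample fragments, the merged root element, and the example.org namespace URIs. The regular expression assumes plain tags with no comments or processing instructions.

```perl
use strict;
use warnings;

# Add a namespace prefix to every element name in a simple XML fragment.
# Assumes plain tags only: no comments, no processing instructions.
sub add_prefix
  {
  my ($fragment, $prefix) = @_;
  $fragment =~ s/<(\/?)(\w+)/<$1${prefix}:$2/g;
  return $fragment;
  }

# Two hypothetical sources that both use <title>, with different meanings.
my $pathology = "<report><title>Prostate biopsy</title></report>";
my $personnel = "<person><title>Dr.</title></person>";

# Hypothetical namespace URIs, for illustration only.
my $merged = "<merged xmlns:path=\"http://example.org/pathology\""
           . " xmlns:staff=\"http://example.org/staff\">\n"
           . add_prefix($pathology, "path") . "\n"
           . add_prefix($personnel, "staff") . "\n"
           . "</merged>\n";
print $merged;
```

After the merge, a query for path:title finds only report titles, and a query for staff:title finds only honorifics, even though both started out as title.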