1
Perl regular expressions
  • This PowerPoint file can be found at
  • http://www.ku.edu/pri/ksdata/sashttp/kcasug2004-10

Kansas City Area SAS User Group (KCASUG), October 5, 2004
Larry Hoyle, Policy Research Institute, The University of Kansas
2
Regular expressions
  • A regular expression is a pattern to be matched against some text (a string)
  • Originally from neurophysiology
  • Then adopted in QED and grep
  • See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/regexnet.asp

3
Perl regular expressions
  • Perl (Practical Extraction and Report Language) implements a version of regular expressions that is something of a standard
  • See http://www.perldoc.com/perl5.6.1/pod/perlre.html

4
SAS Documentation
5
Short syntax description
6
Some simple examples
  • /Baa/ matches the string "Baa"
  • /Baa\d/ matches "Baa" followed by any numeric digit
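The two patterns above can be tried outside SAS as well. The following is a minimal sketch in Python, whose `re` module accepts the same syntax for these simple patterns; the helper name `first_match_position` is made up for illustration, and it imitates a "position of first match, or 0" convention rather than calling any SAS function.

```python
import re

# Same pattern as /Baa\d/: the literal "Baa" followed by one digit.
pattern = re.compile(r"Baa\d")

def first_match_position(text):
    """Return the 1-based position of the first match, or 0 if there is none."""
    m = pattern.search(text)
    return m.start() + 1 if m else 0

samples = ["Baa", "Baa2", "baa3", "aaaaBaa3"]
positions = {s: first_match_position(s) for s in samples}
print(positions)
```

Matching is case sensitive, so "baa3" does not match, while "aaaaBaa3" matches in the middle of the string.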

7
Using Perl Regular Expressions in SAS 9.1 and above
  • data cc;
  •   input c $;
  •   prxNum = prxParse('/Baa\d/');
  •   start = prxMatch(prxNum, c);
  •   if start then put c 'is a match';
  •   else put c 'does not match';
  • datalines;
  • Baa
  • Baa2
  • baa3
  • aaaaBaa3
  • ;
  • run;
  • proc sql;
  •   select * from cc
  •   where prxmatch('/Baa\d/', c);
  • quit;

8
Documentation for PRX Functions and Call Routines in SAS HELP
  • CALL PRXCHANGE: Performs a pattern-matching replacement
  • CALL PRXDEBUG: Enables Perl regular expressions in a DATA step to send debug output to the SAS log
  • CALL PRXFREE: Frees unneeded memory that was allocated for a Perl regular expression
  • CALL PRXNEXT: Returns the position and length of a substring that matches a pattern and iterates over multiple matches within one string
  • CALL PRXPOSN: Returns the start position and length for a capture buffer
  • CALL PRXSUBSTR: Returns the position and length of a substring that matches a pattern
  • PRXCHANGE Function: Performs a pattern-matching replacement
  • PRXMATCH Function: Searches for a pattern match and returns the position at which the pattern is found
  • PRXPAREN Function: Returns the last bracket match for which there is a match in a pattern
  • PRXPARSE Function: Compiles a Perl regular expression (PRX) that can be used for pattern matching of a character value
  • PRXPOSN Function: Returns the value for a capture buffer
9
single character "wildcards"
  • . matches any character
  • \d matches a numeric character
  • \D matches a non-numeric
  • \w matches a "word character" (letter, digit, or underscore)
  • \W matches a non-word character
  • \s matches white space (spaces or tabs)
  • \S matches non-white space
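As a quick check of the single-character classes, here is a small sketch in Python (its `re` module uses the same escapes; it is a stand-in here, not the SAS implementation). The sample line is made up for illustration.

```python
import re

line = "all-good men 2"   # hypothetical sample text

# For each single-character class, record the first character it matches.
classes = [r".", r"\d", r"\D", r"\w", r"\W", r"\s", r"\S"]
first = {p: re.search(p, line).group() for p in classes}

for p in classes:
    print(p, repr(first[p]))
```

The hyphen is the first non-word character, and the first white-space character is the blank after "all-good".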

10
Try a different pattern for expr
  • data myturn;
  •   retain expr '/Whatever/'; /* put your own expression here */
  •   retain prxNum;
  •   length c $80;
  •   input c $80.;
  •   if _n_=1 then do;
  •     prxNum = prxParse(expr);
  •     if prxNum=0 then put 'bad expression ' expr;
  •   end;
  •   start = prxMatch(prxNum, c);
  •   put start c;
  • datalines;
  • Whatever floats your boat
  • Now is the time
  • for
  • all-good
  • men 2
  • come to the
  • aid of their country.
  • ;

find all the numbers
find the first space on each line
find any non word characters
11
sample expressions
find all the numbers /\d/
find the first space on each line /\s/
find any non word characters /\W/
12
Anchors
  • ^ beginning of the string
  • $ end of the string
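The two anchors are `^` (beginning of the string) and `$` (end of the string). A minimal sketch of their effect, written in Python as a stand-in for the PRX functions and using made-up sample strings:

```python
import re

words = ["Baa2", "aaaaBaa3", "Baa2 sheep"]   # hypothetical sample strings

# ^ pins the pattern to the start, $ pins it to the end.
starts_with = [w for w in words if re.search(r"^Baa\d", w)]
ends_with = [w for w in words if re.search(r"Baa\d$", w)]
whole_string = [w for w in words if re.search(r"^Baa\d$", w)]

print(starts_with, ends_with, whole_string)
```

Using both anchors together requires the whole string to match the pattern.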

13
Character Classes
  • [acB] matches "a", "c" or "B"
  • [D-G] matches "D", "E", "F", or "G"
  • [^aeiouyAEIOUY] matches any non vowel
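Character classes are written in square brackets, with a leading `^` inside the brackets negating the class. The sketch below, in Python as a stand-in for SAS, lists every character each class picks out of a made-up sample string.

```python
import re

# A listed set, a range, and a negated set (sample strings are illustrative).
listed = re.findall(r"[acB]", "aBc DEFGH")
ranged = re.findall(r"[D-G]", "aBc DEFGH")
non_vowels = re.findall(r"[^aeiouyAEIOUY]", "Dublin")

print(listed, ranged, non_vowels)
```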

14
Search for words
  • data mywords;
  •   /* words starting with a-d */
  •   retain expr '/^[a-dA-D]/';
  •   retain prxNum;
  •   length word $50;
  •   input word $50.;
  •   if _n_=1 then do;
  •     prxNum = prxParse(expr);
  •     if prxNum=0 then put 'bad expression ' expr;
  •   end;
  •   start = prxMatch(prxNum, word);
  •   put start word;
  •   if start gt 0;
  • datalines;
  • a
  • boo
  • cwm
  • Dublin
  • oocyte
  • ;

find all the proper names
find words with a "q" not followed by a "u"
15
How about?
find all the proper names
find words with a "q" not followed by a "u"
16
How about?
find all the proper names /^[A-Z]/
find words with a "q" not followed by a "u"
17
How about?
find all the proper names /^[A-Z]/
find words with a "q" not followed by a "u" /q[^u]/
18
Multipliers
  • {n} previous expression n times, e.g. {3}
  • {n,} previous expression n or more times
  • {n,m} previous expression from n to m times
  • {0,m} previous expression m or fewer times
  • * previous expression 0 or more times, same as {0,}
  • + previous expression 1 or more times, same as {1,}
  • ? previous expression 0 or 1 times, same as {0,1}
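The multipliers are `{n}`, `{n,}`, `{n,m}`, `*`, `+` and `?`. As a rough illustration, the sketch below (Python, standing in for SAS) tests each one against strings of zero to four "a" characters and requires the whole string to match; the helper `matching` is a made-up name.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

results = {p: matching(p) for p in
           [r"a{3}", r"a{2,}", r"a{1,2}", r"a{0,2}", r"a*", r"a+", r"a?"]}

for p, hits in results.items():
    print(p, hits)
```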

19
from the word list
find words without vowels
20
from the word list
find words without vowels /^[^aeiouyAEIOUY]+$/
21
"write only"? document your expressions
find words without vowels /^[^aeiouyAEIOUY]+$/
/*
  ^                  beginning of string
  [^aeiouyAEIOUY]+   one or more non-vowels
  $                  end of string
*/
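Another way to document an expression is to put the comments inside it. Perl offers the `/x` modifier for this; the sketch below uses Python's equivalent `re.VERBOSE` flag as a stand-in, applied to the vowel-free pattern `^[^aeiouyAEIOUY]+$` and the word list from the earlier slide.

```python
import re

# In verbose mode, white space and "#" comments inside the pattern are ignored.
no_vowels = re.compile(r"""
    ^                   # beginning of string
    [^aeiouyAEIOUY]+    # one or more non-vowels
    $                   # end of string
""", re.VERBOSE)

words = ["a", "boo", "cwm", "Dublin", "oocyte"]
vowelless = [w for w in words if no_vowels.search(w)]
print(vowelless)
```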
22
Hangman Example
  • Suppose we want to code the sequence of guesses
    in the game of hangman by the use of inferred
    strategies
  • e.g. did the person guess the most frequently
    used letters first?
  • did the person guess vowels first?

23
Coding the strategies
  • data HangmanGuesses;
  •   %let ns=4;
  •   drop i prxNum1-prxnum&ns;
  •   array expr[&ns] $80 ex1-ex&ns (
  •     '/^[aeiou]{3}/'
  •     '/^[etaoin]{6}/'
  •     '/^qwerty/'
  •     '/^[zqxjkv]{6}/'
  •   );
  •   array used[&ns] used1-used&ns;
  •   label used1 = '3 vowels first'
  •         used2 = 'letter frequency'
  •         used3 = 'qwerty'
  •         used4 = 'unusuals';
  •   array prx[&ns] prxNum1-prxnum&ns;
  •   retain used1-used&ns;      /* strategy name */
  •   retain ex1-ex&ns;          /* strategy expression */
  •   retain prxNum1-prxnum&ns;  /* prx number */
  •   length guess $13;
  •   input guess $13. success;
  •   guess = lowcase(guess);
  •   if _n_=1 then do i=1 to &ns;
  •     prx[i] = prxParse(expr[i]);
  •     if prx[i]=0 then put "expression " i "is bad " expr[i];
  •   end;
  •   do i=1 to &ns;
  •     used[i] = prxMatch(prx[i], guess);
  •   end;
  • datalines;
  • eaotwhnrbg    1
  • etaoinshrdlcu 0
  • etaoinshrdluc 0
  • qwertyuiopasd 0
  • vkjxqznmasdfg 0
  • asdfghjklzxcv 0
  • ;
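The strategy coding can be sketched outside SAS too. The version below is in Python and is only an approximation: the transcript lost the brackets, braces and anchors of the four expressions, so the patterns here (character classes, `{3}` and `{6}` counts, and a leading `^`) are a reconstruction and should be treated as assumptions.

```python
import re

# Assumed reconstruction of the four strategy expressions.
strategies = {
    "3 vowels first":   r"^[aeiou]{3}",
    "letter frequency": r"^[etaoin]{6}",
    "qwerty":           r"^qwerty",
    "unusuals":         r"^[zqxjkv]{6}",
}

guesses = ["eaotwhnrbg", "etaoinshrdlcu", "etaoinshrdluc",
           "qwertyuiopasd", "vkjxqznmasdfg", "asdfghjklzxcv"]

def code_guess(guess):
    """Return a 0/1 dummy per strategy for one sequence of guesses."""
    g = guess.lower()
    return {name: int(bool(re.search(p, g))) for name, p in strategies.items()}

dummies = {g: code_guess(g) for g in guesses}
for g, row in dummies.items():
    print(g, row)
```

Because each pattern is anchored at the start, a match can only occur at position 1, which is why the result behaves like a dummy variable.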

24
We get dummy variables
25
Looking at expression 2
26
Memory within match
  • (pattern) treat the pattern as a unit and remember the part of the string matched
  • \n inside the match: recall substring n
  • example /(\d{3})X\1/
  • matches 123X123
  • but not 123X456
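A back-reference such as `\1` must match exactly the text that the first parenthesized group captured. A minimal sketch in Python (a stand-in for SAS), assuming the example expression is `(\d{3})X\1`, that is, three digits, an "X", then the same three digits again:

```python
import re

# \1 refers back to whatever the first parenthesized group matched.
repeat = re.compile(r"(\d{3})X\1")

same = bool(repeat.search("123X123"))
different = bool(repeat.search("123X456"))
print(same, different)
```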

27
Memory outside match
  • (pattern) treat the pattern as a unit and remember the part of the string matched
  • $n outside the match: recall substring n
  • example s/(\w+),(\w+)/$2 $1/
  • substitutes Doe,John
  • with John Doe
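The name swap can be tried with any Perl-style engine. The sketch below uses Python as a stand-in; note that Python writes the recalled substrings in the replacement as `\2 \1` where Perl writes `$2 $1`, and the pattern `(\w+),(\w+)` is assumed here because the quantifiers were lost from the transcript.

```python
import re

# Capture the word before and after the comma, then emit them in reverse order.
swapped = re.sub(r"(\w+),(\w+)", r"\2 \1", "Doe,John")
print(swapped)
```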

28
Call log example
  • datalines;
  • I called Fred at 9:17 am at 785-555-1234
  • 10:12 Called George - (913)-555-3213
  • 816-555-9876 was Irving the time was 1:22 pm
  • 751 555 1212 8384 333 Bob

29
Get the time
  • retain expTime '/\d{1,2}:\d{2}\s?(pm|am)?/';
  • /*
  •   \d{1,2}:     one or two digits followed by a colon
  •   \d{2}\s?     two digits and optional space
  •   (pm|am)?     optional am or pm
  • */
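To see what the time expression picks up, here is a small sketch in Python (a stand-in for SAS). It assumes the expression is `\d{1,2}:\d{2}\s?(pm|am)?` and that the call-log lines originally contained colons in the times; both were lost in the transcript and are reconstructed here.

```python
import re

time_re = re.compile(r"\d{1,2}:\d{2}\s?(pm|am)?")

# Call-log lines with the colons restored (an assumption).
notes = [
    "I called Fred at 9:17 am at 785-555-1234",
    "10:12 Called George - (913)-555-3213",
    "816-555-9876 was Irving the time was 1:22 pm",
]

# \s? can swallow a blank even when no am/pm follows, so strip the result.
times = [time_re.search(n).group().strip() for n in notes]
print(times)
```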

30
Get the phone number: define 3 capture buffers
  • retain expPhone '/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/';
  • /*
  •   \(?          optional left paren
  •   ([2-9]\d\d)  3 digit area code (buffer 1)
  •   \)?          optional right paren
  •   [ -]         space or hyphen
  •   (\d\d\d)     3 digit exchange (buffer 2)
  •   [ -]         space or hyphen
  •   (\d{4})      last 4 digits (buffer 3)
  • */
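The three capture buffers correspond to the three parenthesized groups. A minimal sketch in Python (standing in for the PRXPOSN calls), assuming the expression is `\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})`; the helper name `split_phone` and the shortened sample notes are made up for illustration.

```python
import re

phone_re = re.compile(r"\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})")

def split_phone(note):
    """Return (area code, 'exchange-last4') for the first phone number, or None."""
    m = phone_re.search(note)
    if m is None:
        return None
    area_code, exchange, last4 = m.groups()   # buffers 1, 2 and 3
    return area_code, f"{exchange}-{last4}"

notes = ["Called George - (913)-555-3213", "751 555 1212 Bob", "no number here"]
parsed = [split_phone(n) for n in notes]
print(parsed)
```

Numbers written with parentheses, hyphens or blanks all come out in the same standardized form.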

31
Use the expressions
  • retain prxTime prxPhone;
  • if _n_=1 then do;
  •   prxTime = prxParse(expTime);
  •   if prxTime=0 then put 'bad expression ' expTime;
  •   prxPhone = prxParse(expPhone);
  •   if prxPhone=0 then put 'bad expression ' expPhone;
  • end;
  • sequence = _n_;
  • call prxsubstr(prxTime, note, position, length);
  • time = substr(note, position, length);
  • call prxsubstr(prxPhone, note, position, length);
  • CALL PRXPOSN(prxPhone, 1, position, length);
  • ac = substr(note, position, length);
  • CALL PRXPOSN(prxPhone, 2, position, length);
  • exchange = substr(note, position, length);
  • CALL PRXPOSN(prxPhone, 3, position, length);
  • last4 = substr(note, position, length);
  • local = exchange || '-' || last4;

32
Result
The time and phone number have been
extracted. The phone number is standardized.
33
Substitution expressions
  • s/match expression/replacement/
  • s/cat/hat/ changes cat to hat
  • s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/
  • changes Doe-Roe,John to John Doe-Roe

34
Call PRXCHANGE (Data Step only)
  • CALL PRXCHANGE (regular-expression-id, times, old-string
  •   <, new-string <, result-length <, truncation-value
  •   <, number-of-changes>>>>);

35
PRXCHANGE (Data Step, SQL, where clauses)
  • PRXCHANGE(perl-regular-expression | regular-expression-id, times, source)

36
PRXCHANGE example
  • data cc;
  •   length c $60 changedString $60;
  •   input c $60.;
  •   prxNum = prxParse('s/([a-zA-Z\-]+), *([a-zA-Z\-]+)/$2 $1/');
  •   CALL prxChange(prxNum, 1, c, changedString,
  •                  newLength, wasTruncated, numberChanges);
  • datalines;
  • Doe-Roe,John
  • BlackSheep, BaaBaa
  • Prince
  • ;

s/
  ([a-zA-Z\-]+)   first word
  ,               comma
   *              zero or more blanks
  ([a-zA-Z\-]+)   second word
/$2 $1/           switch words
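The effect of this substitution on the three data lines can be previewed with the sketch below. It is written in Python as a stand-in for CALL PRXCHANGE, assumes the expression is `([a-zA-Z\-]+), *([a-zA-Z\-]+)`, and uses `subn` so that a count of changes is returned alongside the new string, loosely mirroring the number-of-changes argument.

```python
import re

# First word, comma, zero or more blanks, second word.
swap = re.compile(r"([a-zA-Z\-]+), *([a-zA-Z\-]+)")

names = ["Doe-Roe,John", "BlackSheep, BaaBaa", "Prince"]

# count=1 limits the substitution to the first match, like times = 1.
results = [swap.subn(r"\2 \1", name, count=1) for name in names]
for changed, n_changes in results:
    print(changed, n_changes)
```

A line with no comma, such as "Prince", is returned unchanged with a change count of zero.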
37
PRXCHANGE example results