Title: Perl regular expressions
1Perl regular expressions
- This PowerPoint file can be found at
- http://www.ku.edu/pri/ksdata/sashttp/kcasug2004-10
Kansas City Area SAS User Group (KCASUG), October 5, 2004
Larry Hoyle, Policy Research Institute, The University of Kansas
2Regular expressions
- A regular expression is a pattern to be matched against some text (a string)
- originally from neurophysiology
- Then in QED and grep
- see
- http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/regexnet.asp
3Perl regular expressions
- Perl (Practical Extraction and Report Language) implements a version of regular expressions that is something of a standard
- see http://www.perldoc.com/perl5.6.1/pod/perlre.html
4SAS Documentation
5Short syntax description
6Some simple examples
- /Baa/ matches the string "Baa"
- /Baa\d/ matches "Baa" followed by any numeric digit
7Using Perl Regular Expressions in SAS 9.1 and above
    data cc;
      input c $;
      prxNum = prxParse('/Baa\d/');
      start = prxMatch(prxNum, c);
      if start then put c 'is a match';
      else put c 'does not match';
    datalines;
    Baa
    Baa2
    baa3
    aaaaBaa3
    ;
    run;

    proc sql;
      select * from cc
        where prxmatch('/Baa\d/', c);
    quit;
8Documentation for PRX Functions and Call Routines in SAS HELP
- CALL PRXCHANGE: Performs a pattern-matching replacement
- CALL PRXDEBUG: Enables Perl regular expressions in a DATA step to send debug output to the SAS log
- CALL PRXFREE: Frees unneeded memory that was allocated for a Perl regular expression
- CALL PRXNEXT: Returns the position and length of a substring that matches a pattern and iterates over multiple matches within one string
- CALL PRXPOSN: Returns the start position and length for a capture buffer
- CALL PRXSUBSTR: Returns the position and length of a substring that matches a pattern
- PRXCHANGE Function: Performs a pattern-matching replacement
- PRXMATCH Function: Searches for a pattern match and returns the position at which the pattern is found
- PRXPAREN Function: Returns the last bracket match for which there is a match in a pattern
- PRXPARSE Function: Compiles a Perl regular expression (PRX) that can be used for pattern matching of a character value
- PRXPOSN Function: Returns the value for a capture buffer
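Of the routines listed here, CALL PRXNEXT is the one that walks through every match in a string rather than only the first. The following is a minimal sketch of the usual start/stop loop; the sample text and the variable names (text, found) are illustrative assumptions and do not come from the slides.

```sas
data _null_;
  length text $ 40;
  text = 'Rooms 12, 7 and 305 are open';
  prxNum = prxParse('/\d+/');      /* one or more digits */
  start = 1;                       /* where the search begins */
  stop = length(text);             /* where the search ends */
  call prxNext(prxNum, start, stop, text, position, length);
  do while (position > 0);         /* position is 0 when no match is left */
    found = substr(text, position, length);
    put found= position= length=;
    /* prxNext has moved start past the previous match */
    call prxNext(prxNum, start, stop, text, position, length);
  end;
run;
```

Each pass through the loop writes one number to the log, so this text yields 12, 7 and 305 in turn.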
9single character "wildcards"
- . matches any character
- \d matches a numeric character
- \D matches a non-numeric
- \w matches a "word character" (letter, digit, or underscore)
- \W matches a non-word character
- \s matches white space (space, tab, or newline)
- \S matches non-white space
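A small sketch of how these single-character classes behave when passed to PRXMATCH, which returns the position of the first match or 0. The sample string 'Room 12-B' and the variable names are assumptions made for illustration.

```sas
data _null_;
  length text $ 20;
  text = 'Room 12-B';
  anyDigit   = prxMatch('/\d/', text);    /* first numeric character */
  nonWord    = prxMatch('/\W/', text);    /* first non-word character */
  whiteSpace = prxMatch('/\s/', text);    /* first white space character */
  anyTwo     = prxMatch('/R..m/', text);  /* R, any two characters, then m */
  put anyDigit= nonWord= whiteSpace= anyTwo=;
run;
```

The log should show anyDigit=6 nonWord=5 whiteSpace=5 anyTwo=1. The blank in position 5 is both the first white space character and the first non-word character.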
10Try a different pattern for expr
    data myturn;
      retain expr '/Whatever/';  /* put your own expression here */
      retain prxNum;
      length c $ 80;
      input c $80.;
      if _n_ = 1 then do;
        prxNum = prxParse(expr);
        /* prxParse returns a missing value if the expression does not compile */
        if missing(prxNum) then put 'bad expression ' expr;
      end;
      start = prxMatch(prxNum, c);
      put start c;
    datalines;
    Whatever floats your boat
    Now is the time
    for
    all-good
    men 2
    come to the
    aid of their country.
    ;
    run;
find all the numbers
find the first space on each line
find any non word characters
11sample expressions
find all the numbers: /\d/
find the first space on each line: /\s/
find any non word characters: /\W/
12Anchors
- ^ beginning of the string
- $ end of the string
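The two anchors, ^ for the beginning of the string and $ for the end, can be tried with a sketch like the one below. One assumption worth stating: SAS character variables are padded with trailing blanks, so an expression ending in $ usually needs TRIM (or an explicit \s* before the $). The variable names are illustrative.

```sas
data _null_;
  length text $ 10;
  text = 'Baa2';
  anywhere = prxMatch('/aa/', text);         /* unanchored: found at position 2 */
  atStart  = prxMatch('/^aa/', text);        /* 0: the string does not begin with aa */
  padded   = prxMatch('/\d$/', text);        /* 0: trailing blanks follow the digit */
  trimmed  = prxMatch('/\d$/', trim(text));  /* 4: the digit is now the last character */
  put anywhere= atStart= padded= trimmed=;
run;
```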
13Character Classes
- [acB] matches "a", "c" or "B"
- [D-G] matches "D", "E", "F", or "G"
- [^aeiouyAEIOUY] matches any non vowel
14Search for words
    data mywords;
      /* words starting with a-d */
      retain expr '/^[a-dA-D]/';
      retain prxNum;
      length word $ 50;
      input word $50.;
      if _n_ = 1 then do;
        prxNum = prxParse(expr);
        /* prxParse returns a missing value if the expression does not compile */
        if missing(prxNum) then put 'bad expression ' expr;
      end;
      start = prxMatch(prxNum, word);
      put start word;
      if start gt 0;
    datalines;
    a
    boo
    cwm
    Dublin
    oocyte
    ;
    run;
find all the proper names
find words with a "q" not followed by a "u"
15How about?
find all the proper names
find words with a "q" not followed by a "u"
16How about?
find all the proper names: /^[A-Z]/
find words with a "q" not followed by a "u"
17How about?
find all the proper names: /^[A-Z]/
find words with a "q" not followed by a "u": /q[^u]/
18Multipliers
- {n} previous expression n times e.g. {3}
- {n,} previous expression n or more times
- {n,m} previous expression from n to m times
- {0,m} previous expression m or fewer times
- * previous expression 0 or more times {0,}
- + previous expression 1 or more times {1,}
- ? previous expression 0 or 1 times {0,1}
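To see the multipliers side by side, here is a short sketch that tests three of them ({2,}, ? and +) against a handful of made-up words. The word list and variable names are assumptions for illustration only.

```sas
data _null_;
  length word $ 8;
  do word = 'b', 'ab', 'aab', 'aaab';
    twoOrMore = (prxMatch('/^a{2,}b/', word) > 0);  /* at least two a's, then b */
    zeroOrOne = (prxMatch('/^a?b/', word) > 0);     /* an optional a, then b */
    oneOrMore = (prxMatch('/^a+b/', word) > 0);     /* at least one a, then b */
    put word= twoOrMore= zeroOrOne= oneOrMore=;
  end;
run;
```

Only 'aab' and 'aaab' satisfy {2,}; only 'b' and 'ab' satisfy ?; every word except 'b' satisfies +.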
19from the word list
find words without vowels
20from the word list
find words without vowels: /^[^aeiouyAEIOUY]+$/
21"write only"? document your expressions
find words without vowels: /^[^aeiouyAEIOUY]+$/
    /*
      ^                   beginning of string
      [^aeiouyAEIOUY]+    one or more non-vowels
      $                   end of string
    */
22Hangman Example
- Suppose we want to code the sequence of guesses in the game of hangman by the use of inferred strategies
- e.g. did the person guess the most frequently used letters first?
- did the person guess vowels first?
23Coding the strategies
    data HangmanGuesses;
      %let ns=4;
      drop i prxNum1-prxnum&ns;
      array expr[&ns] $ 80 ex1-ex&ns (
        '/^[aeiou]{3}/'
        '/^[etaoin]{6}/'
        '/^qwerty/'
        '/^[zqxjkv]{6}/'
      );
      array used[&ns] used1-used&ns;
      label used1 = '3 vowels first'
            used2 = 'letter frequency'
            used3 = 'qwerty'
            used4 = 'unusuals'
      ;
      array prx[&ns] prxNum1-prxnum&ns;
      retain used1-used&ns;      /* strategy flags */
      retain ex1-ex&ns;          /* strategy expressions */
      retain prxNum1-prxnum&ns;  /* prx numbers */
      length guess $ 13;
      /* list input with the : modifier, so the data columns need not line up */
      input guess :$13. success;
      guess = lowcase(guess);
      if _n_ = 1 then do i = 1 to &ns;
        prx[i] = prxParse(expr[i]);
        /* report the index of the failing expression, not the array size */
        if missing(prx[i]) then put 'expression ' i 'is bad: ' expr[i];
      end;

      do i = 1 to &ns;
        used[i] = prxMatch(prx[i], guess);
      end;
    datalines;
    eaotwhnrbg 1
    etaoinshrdlcu 0
    etaoinshrdluc 0
    qwertyuiopasd 0
    vkjxqznmasdfg 0
    asdfghjklzxcv 0
    ;
    run;
24We get dummy variables
25Looking at expression 2
26Memory within match
- (pattern) treat the pattern as a unit and remember the part of the string matched
- \n inside the match recall substring n
- example /(\d{3})X\1/
- matches 123X123
- not 123X456
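A quick way to check a backreference is to run it against both the matching and the non-matching string. The sketch below uses /(\d{3})X\1/, with the quantifier inside the parentheses so that the whole three-digit run is remembered; the variable names are illustrative assumptions.

```sas
data _null_;
  length text $ 10;
  do text = '123X123', '123X456';
    /* \1 must repeat exactly what (\d{3}) captured */
    start = prxMatch('/(\d{3})X\1/', text);
    put text= start=;
  end;
run;
```

The first string matches at position 1 and the second returns 0. Writing the pattern as /(\d){3}X\1/ would remember only the last digit, so it would match 123X3 rather than 123X123.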
27Memory outside match
- (pattern) treat the pattern as a unit and remember the part of the string matched
- $n outside the match recall substring n
- example s/(\w+),(\w+)/$2 $1/
- substitutes Doe,John
- with John Doe
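In SAS the substitution form can be passed to the PRXCHANGE function, with $1 and $2 in the replacement referring to the remembered substrings. This is a minimal sketch; the variable names name and swapped are assumptions for illustration.

```sas
data _null_;
  length name swapped $ 20;
  name = 'Doe,John';
  /* second argument 1 = make at most one replacement */
  swapped = prxChange('s/(\w+),(\w+)/$2 $1/', 1, name);
  put swapped=;
run;
```

The log should show swapped=John Doe. The expression is in single quotes, so the macro processor leaves $1 and $2 alone.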
28Call log example
    datalines;
    I called Fred at 9:17 am at 785-555-1234
    10:12 Called George - (913)-555-3213
    816-555-9876 was Irving the time was 1:22 pm
    751 555 1212 8384 333 Bob
29Get the time
    retain expTime '/\d{1,2}:\d{2}\s?(pm|am)?/';
    /*
      \d{1,2}:    one or two digits followed by a colon
      \d{2}\s?    two digits and optional space
      (pm|am)?    optional am or pm
    */
30Get the phone number: define 3 capture buffers
    retain expPhone '/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/';
    /*
      \(?           optional left paren
      ([2-9]\d\d)   3 digit area code (buffer 1)
      \)?           optional right paren
      [ -]          space or hyphen
      (\d\d\d)      3 digit exchange (buffer 2)
      [ -]          space or hyphen
      (\d{4})       last 4 digits (buffer 3)
    */
31Use the expressions
    retain prxTime prxPhone;
    if _n_ = 1 then do;
      prxTime = prxParse(expTime);
      if missing(prxTime) then put 'bad expression ' expTime;
      prxPhone = prxParse(expPhone);
      if missing(prxPhone) then put 'bad expression ' expPhone;
    end;
    sequence = _n_;
    call prxsubstr(prxTime, note, position, length);
    /* position is 0 when no time is found; substr would fail on it */
    if position > 0 then time = substr(note, position, length);
    /* match the phone number, then read each capture buffer */
    call prxsubstr(prxPhone, note, position, length);
    call prxposn(prxPhone, 1, position, length);
    ac = substr(note, position, length);

    call prxposn(prxPhone, 2, position, length);
    exchange = substr(note, position, length);
    call prxposn(prxPhone, 3, position, length);
    last4 = substr(note, position, length);
    /* trim in case exchange was not declared with a length of 3 */
    local = trim(exchange) || '-' || last4;
32Result
The time and phone number have been
extracted. The phone number is standardized.
33Substitution expressions
- s/match expression/replacement/
- s/cat/hat/ changes cat to hat
- s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/
- changes Doe-Roe,John to John Doe-Roe
34Call PRXCHANGE(Data Step only)
    CALL PRXCHANGE (regular-expression-id, times, old-string
        <, new-string <, result-length
        <, truncation-value <, number-of-changes>>>>);
35PRXCHANGE(Data Step, SQL, where clauses)
    PRXCHANGE(perl-regular-expression | regular-expression-id, times, source)
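Because the function form takes the expression text directly, it can be used where there is no DATA step to hold a compiled expression id. The sketch below calls it from PROC SQL; the table name names and the column names are assumptions made for illustration.

```sas
data names;
  length name $ 30;
  input name $30.;
datalines;
Doe-Roe,John
BlackSheep, BaaBaa
Prince
;
run;

proc sql;
  /* swap "last, first" into "first last" for rows that contain a comma */
  select name,
         prxchange('s/([a-zA-Z\-]+), *([a-zA-Z\-]+)/$2 $1/', 1, name) as swapped
    from names
    where prxmatch('/,/', name);
quit;
```

The WHERE clause keeps only the two rows that contain a comma, so Prince is not listed.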
36PRXCHANGE example
    data cc;
      length c $ 60 changedString $ 60;
      input c $60.;
      prxNum = prxParse('s/([a-zA-Z\-]+), *([a-zA-Z\-]+)/$2 $1/');
      call prxChange(prxNum,
                     1,
                     c,
                     changedString,
                     newLength,
                     wasTruncated,
                     numberChanges);
    datalines;
    Doe-Roe,John
    BlackSheep, BaaBaa
    Prince
    ;
    /*
      s/
      ([a-zA-Z\-]+)   first word
      ,               comma
      (space)*        zero or more blanks
      ([a-zA-Z\-]+)   second word
      /$2 $1/         switch words
    */
37PRXCHANGE example results