Workbook 8 String Processing Tools - PowerPoint PPT Presentation

About This Presentation
Title:

Workbook 8 String Processing Tools

Description:

Workbook 8 String Processing Tools Pace Center for Business and Technology * fmt Command Syntax Like most of the text processing commands encountered in this Workbook ... – PowerPoint PPT presentation

Number of Views:263
Avg rating:3.0/5.0
Slides: 170
Provided by: Delga4
Learn more at: http://csis.pace.edu
Category:

less

Transcript and Presenter's Notes

Title: Workbook 8 String Processing Tools


1
Workbook 8String Processing Tools
Pace Center for Business and Technology
2
String Processing Tools
  • Key Concepts
  • When storing text, computers transform characters
    into a numeric representation. This process is
    referred to as encoding the text.
  • In order to accommodate the demands of a variety
    of languages, several different encoding
    techniques have been developed. These techniques
    are represented by a variety of character sets.
  • The oldest and most prevalent encoding technique
    is known as the ASCII character set, which still
    serves as a least common denominator among other
    techniques.
  • The wc command counts the number of characters,
    words, and lines in a file. When applied to
    structured data, the wc command can become a
    versatile counting tool.
  • The cat command has options that allow
    representation of nonprinting characters such as
    NEWLINE.
  • The head and tail commands have options that
    allow you to print only a certain number of lines
    or a certain number of bytes (one byte usually
    correlates to one character) from a file.

3
What are Files?
  • Linux, like most operating systems, stores
    information that needs to be preserved outside of
    the context of any individual process in files.
    (In this context, and for most of this Workbook,
    the term file is meant in the sense of regular
    file). Linux (and Unix) files store information
    using a simple model information is stored as a
    single, ordered array of bytes, starting from at
    first and ending at the last. The number of bytes
    in the array is the length of the file. 9
  • What type of information is stored in files? Here
    are but a few examples.
  • The characters that compose the book report you
    want to store until you can come back and finish
    it tomorrow are stored in a file called (say)
    /bookreport.txt.
  • The individual colors that make up the picture
    you took with your digital camera are stored in
    the file (say) /mnt/camera/dcim/100nikon/dscn1203.
    jpg.
  • The characters which define the usernames of
    users on a Linux system (and their home
    directories, etc.) are stored in the file
    /etc/passwd.
  • The specific instructions which tell an x86
    compatible CPU how to use the Linux kernel to
    list the files in a given directory are stored in
    the file /bin/ls.

4
What is a Byte?
  • At the lowest level, computers can only answer
    one type of question is it on or off? What is
    it? When dealing with disks, it is a magnetic
    domain which is oriented up or down. When dealing
    with memory chips, it is a transistor which
    either has current or doesn't. Both of these are
    too difficult to mentally picture, so we will
    speak in terms of light switches that can either
    be on or off. To your computer, the contents of
    your file is reduced to what can be thought of as
    an array of (perhaps millions of) light switches.
    Each light switch can be used to store one bit of
    information (is it on, or is it off).
  • Using a single light switch, you cannot store
    much information. To be more useful, an early
    convention was established group the light
    switches into bunches of 8. Each series of 8
    light switches (or magnetic domains, or
    transistors, ...) is a byte. More formally, a
    byte consists of 8 bits. Each permutation of ons
    and offs for a group of 8 switches can be
    assigned a number. All switches off, we'll assign
    0. Only the first switch on, we'll assign 1 only
    the second switch on, 2 the first and second
    switch on, 3 and so on. How many numbers will it
    take to label each possible permutation for 8
    light switches? A mathematician will quickly tell
    you the answer is 28, or 256. After grouping the
    light switches into groups of eight, your
    computer views the contents of your file as an
    array of bytes, each with a value ranging from 0
    to 255.

5
Data Encoding
  • In order to store information as a series of
    bytes, the information must be somehow converted
    into a series of values ranging from 0 to 255.
    Converting information into such a format is
    called data encoding. What's the best way to do
    it? There is no single best way that works for
    all situations. Developing the right technique to
    encode data, which balances the goals of
    simplicity, efficiency (in terms of CPU
    performance and on disk storage), resilience to
    corruption, etc., is much of the art of computer
    science.
  • As one example, consider the picture taken by a
    digital camera mentioned above. One encoding
    technique would divide the picture into pixels
    (dots), and for each pixel, record three bytes of
    information the pixel's "redness", "greenness",
    and "blueness", each on a scale of 0 to 255. The
    first three bytes of the file would record the
    information for the first pixel, the second three
    bytes the second pixel, and so on. A picture
    format known as "PNM" does just this (plus some
    header information, such as how many pixels are
    in a row). Many other encoding techniques for
    images exist, some just as simple, many much more
    complex.

6
Text Encoding
  • Perhaps the most common type of data which
    computers are asked to store is text. As
    computers have developed, a variety of techniques
    for encoding text have been developed, from the
    simple in concept (which could encode only the
    Latin alphabet used in Western languages) to
    complicated but powerful techniques that attempt
    to encode all forms of human written
    communication, even attempting to include
    historical languages such as Egyptian
    hieroglyphics. The following sections discuss
    many of the encoding techniques commonly used in
    Red Hat Enterprise Linux.

7
ASCII
  • One of the oldest, and still most commonly used
    techniques for encoding text is called ASCII
    encoding. ASCII encoding simply takes the 26
    lowercase and 26 uppercase letters which compose
    the Latin alphabet, 10 digits, and common English
    punctuation characters (those found on a
    keyboard), and maps them to an integer between 0
    and 255, as outlined in the following table.

8
ASCII
  • What about the integers 0 - 32? These integers
    are mapped to special keys on early teletypes,
    many of which have to do with manipulating the
    spacing on the page being typed on. The following
    characters are commonly called "whitespace"
    characters.

9
ASCII
  • Others of the first 32 integers are mapped to
    keys which did not directly influence the
    "printed page", but instead sent "out of band"
    control signals between two teletypes. Many of
    these control signals have special
    interpretations within Linux (and Unix).

10
Generating Control Characters from the Keyboard
  • Control and whitespace characters can be
    generated from the terminal keyboard directly
    using the CTRL key. For example, an audible bell
    can be generated using CTRLG, while a backspace
    can be sent using CTRLH, and we have already
    mentioned that CTRLD is used to generate an "End
    of File" (or "End of Transmission"). Can you
    determine how the whitespace and control
    characters are mapped to the various CTRL key
    combinations? For example, what CTRL key
    combination generates a tab? What does CTRLJ
    generate? As you explore various control
    sequences, remember that the reset command will
    restore your terminal to sane behavior, if
    necessary.
  • A tab can be generated with CTRLI, while CTRLJ
    will generate a line feed (akin to hitting the
    RETURN key). In general, CTRLA will generate
    ASCII character 1, CTRLB will generate ASCII
    character 2, and so on.
  • What about the values 128-255? ASCII encoding
    does not use them. The ASCII standard only
    defines the first 128 values of a byte, leaving
    the remaining 128 values to be defined by other
    schemes.

11
ISO 8859 and Other Character Sets
  • Other standard encoding schemes have been
    developed, which map various glyphs (such as the
    symbol for the Yen and Euro), diacritical marks
    found in many European languages, and non Latin
    alphabets to the latter 128 values of a byte
    which the ASCII standard leaves undefined. The
    following table lists a few of these standard
    encoding schemes, which are referred to as
    character sets. The following table lists some
    character sets which are supported in Linux,
    including their informal name, formal name, and a
    brief description.

12
ISO 8859 and Other Character Sets
  • Notice a couple of implications about ISO 8859
    encoding.
  • Each of the alternate encodings map a single
    glyph to a single byte, so that the number of
    letters encoded in a file equals the number of
    bytes which are required to encode them.
  • Choosing a particular character set extends the
    range of characters that can be encoded, but you
    cannot encode characters from different character
    sets simultaneously. For example, you could not
    encode both a Latin capital A with a grave and a
    Greek letter Delta simultaneously.

13
Unicode (UCS)
  • In order to overcome the limitations of ASCII and
    ISO 8859 based encoding techniques, a Universal
    Character Set has been developed, commonly
    referred to as UCS, or Unicode. The Unicode
    standard acknowledges the fact that one byte of
    information, with its ability to encode 256
    different values, is simply not enough to encode
    the variety of glyphs found in human
    communication. Instead, the Unicode standard uses
    4 bytes to encode each character. Think of 4
    bytes as 32 light switches. If we were to again
    label each permutation of on and off for 32
    switches with integers, the mathematician would
    tell you that you would need 4,294,967,296 (over
    4 billion) integers. Thus, Unicode can encode
    over 4 billion glyphs (nearly enough for every
    person on the earth to have their own unique
    glyph the user prince would approve).

14
Unicode (UCS)
  • What are some of the features and drawbacks of
    Unicode encoding?
  • Scale
  • The Unicode standard will easily be able to
    encode the variety of glyphs used in human
    communication for a long time to come.
  • Simplicity
  • The Unicode standard does have the simplicity of
    a sledgehammer. The number of bytes required to
    encode a set of characters is simply the number
    of characters multiplied by 4.
  • Waste While
  • The Unicode standard is simple in concept, it is
    also very wasteful. The ability to encode 4
    billion glyphs is nice, but in reality, much of
    the communication that occurs today uses less
    than a few hundred glyphs. Of the 32 bits (light
    switches) used to encode each character, the
    first 20 or so would always be "off".
  • ASCII Non-compatibility
  • For better or for worse, a huge amount of
    existing data is already ASCII encoded. In order
    to convert fully to Unicode, that data, and the
    programs that expect to read it, would have to be
    converted.

15
Unicode Transformation Format (UTF-8)
  • UTF-8 encoding attempts to balance the
    flexibility of Unicode, and the practicality and
    pervasiveness of ASCII, with a significant
    sacrifice variable length encoding. With
    variable length encoding, each character is no
    longer encoded using simply 1 byte, or simply 4
    bytes. Instead, the traditional 127 ASCII
    characters are encoded using 1 byte (and, in
    fact, are identical to the existing ASCII
    standard). The next most commonly used 2000 or so
    characters are encoded using two bytes. The next
    63000 or so characters are encoded using three
    bytes, and the more esoteric characters may be
    encoded using from four to six bytes. Details of
    the encoding technique can be found in the
    utf-8(7) man page. With full backwards
    compatibility to ASCII, and the same functional
    range of pure Unicode, what is there to lose? ISO
    8859 (and similar) character set compatibility.
  • UTF-8 attempts to bridge the gap between ASCII,
    which can be viewed as the primitive days of text
    encoding, and Unicode, which can be viewed as the
    utopia to aspire toward. Unfortunately, the
    "intermediate" methods, the ISO 8859 and other
    alternate character sets, are as incompatible
    with UTF-8 as they are with each other.
  • Additionally, the simple relationship between the
    number of characters that are being stored and
    the amount of space (measured in bytes) it takes
    to store them is lost. How much space will it
    take to store 879 printed characters? If they are
    pure ASCII, the answer is 879. If they are Greek
    or Cyrillic, the answer is closer to twice that
    much.

16
Text Encoding and the Open Source Community
  • In the traditional development of operating
    systems, decisions such as what type of character
    encoding to use can be made centrally, with the
    possible disadvantage that the decision is wrong
    for some community of the operating system's
    users. In contrast, in the open source
    development model, these types of decisions are
    generally made by individuals and small groups of
    contributors. The advantages of the open source
    model are a flexible system which can accommodate
    a wide variety of encoding formats. The
    disadvantage is that users must often be educated
    and made aware of the issues involved with
    character encoding, because some parts of the
    assembled system use one technique while others
    parts use another. The library of man pages is an
    excellent example

17
Text Encoding and the Open Source Community
  • When contributors to the open source community
    are faced with decisions involving potentially
    incompatible formats, they generally balance
    local needs with an appreciation for adhering to
    widely accepted standards where appropriate. The
    UTF-8 encoding format seems to be evolving as an
    accepted standard, and in recent releases has
    become the default for Red Hat Enterprise Linux.
  • The following paragraph, extracted from the
    utf-8(7) man page, says it well

18
Internationalization (i18n)
  • As this Workbook continues to discuss many tools
    and techniques for searching, sorting, and
    manipulating text, the topic of
    internationalization cannot be avoided. In the
    open source community, internationalization is
    often abbreviated as i18n, a shorthand for saying
    "i-n with 18 letters in between". Applications
    which have been internationalized take into
    account different languages. In the Linux (and
    Unix) community, most applications look for the
    LANG environment variable to determine which
    language to use.
  • At the simplest, this implies that programs will
    emit messages in the user's native language.
  • More subtly, the choice of a particular language
    has implications for sorting orders, numeric
    formats, text encoding, and other issues.

19
The LANG environment variable
  • The LANG environment variable is used to define a
    user's language, and possibly the default
    encoding technique as well. The variable is
    expected to be set to a string using the
    following syntax
  • The locale command can be used to examine your
    current configuration (as can echo LANG), while
    locale -a will list all settings currently
    supported by your system. The extent of the
    support for any given language will vary.

20
The LANG environment variable
  • The following tables list some selected language
    codes, country codes, and code set
    specifications.

21
Revisiting cat, head, and tail
  • Revisiting cat
  • We have been using the cat command to simply
    display the contents of files. Usually, the cat
    command generates a faithful copy of its input,
    without performing any edits or conversions. When
    called with one of the following command line
    switches, however, the cat command will indicate
    the presence tabs, line feeds, and other control
    sequences, using the following conventions.
  • Using the -A command line switch, the whitespace
    structure of the file becomes evident, as tabs
    are replaced with I, and line feeds are
    decorated with . E.g. cat -A /etc/hosts

22
Revisiting head and tail
  • The head and tail commands have been used to
    display the first or last few lines of a file,
    respectively. But what makes a line? Imagine
    yourself working at a typewriter click! clack!
    click! clack! clack! ziiing! Instead of the
    ziing! of the typewriter carriage at the end of
    each line, the line feed character (ASCII 10) is
    chosen to mark the end of lines.
  • Unfortunately, a common convention for how to
    mark the end of a line is not shared among the
    dominant operating systems in use today. Linux
    (and Unix) uses the line feed character (ASCII
    10, often represented \n), while Macintosh
    operating systems uses the carriage return
    character (ASCII 13, often represented \r or M),
    and Microsoft operating systems use a carriage
    return/line feed pair (ASCII 13, ASCII 10).

23
Revisiting head and tail
  • For example, the following file contains a list
    of four musicians.
  • Linux (and Unix) text files generally adhere to a
    convention that the last character of the file
    must be a line feed for the last line of text.
    Following the cat of the file musicians.mac,
    which does not contain any conventional Linux
    line feed characters, the bash prompt is not
    displayed in its usual location.

24
Revisiting head and tail
25
The wc (Word Count) Command
  • Counting Made Easy
  • Have you ever tried to answer a 25 words or
    less quiz? Did you ever have to write a
    1500-word essay?
  • With the wc you can easily verify that your
    contribution meets the criteria.
  • The wc command counts the number of characters,
    words, and lines. It will take its input either
    from files named on its command line or from its
    standard input. Below is the command line form
    for the wc program

26
The wc (Word Count) Command
  • When used without any command line switches, wc
    will report on the number of characters, lines,
    and words. Command line switches can be combined
    to return any combination of character count,
    line count or word count.

27
How To Recognize A Real Character
  • Text files are composed using an alphabet of
    characters. Some characters are visible, such as
    numbers and letters. Some characters are used for
    horizontal distance, such as spaces and TAB
    characters. Some characters are used for vertical
    movement, such as carriage returns and line
    feeds.
  • A line in a text file is a series of any
    character other than a NEWLINE (line feed)
    character and then a NEWLINE character.
    Additional lines in the file immediately follow
    the first line.
  • While a computer represents characters as
    numbers, the exact value used for each symbol
    varies depending on which alphabet has been
    chosen. The most common alphabet for English
    speakers is ASCII, also called Latin-1.
    Different human languages are represented by
    different computer encoding rules, so the exact
    numeric value for a given character depends on
    the human language being recorded.

28
So, What Is A Word?
  • A word is a group of printing characters, such as
    letters and digits, surrounded by white space,
    such as space characters or horizontal TAB
    characters.
  • Notice that our definition of a word does not
    include any notion of meaning. Only the form of
    the word is important, not its semantics. As far
    as Linux is concerned, a line such as

29
QuestionsChapter 1.  Text Encoding and Word
Counting
  • 1 and 2

30
Chapter 2.  Finding Text grep
  • Key Concepts
  • grep is a command that prints lines that match a
    specified text string or pattern.
  • grep is commonly used as a filter to reduce
    output to only desired items.
  • grep -r will recursively grep files underneath a
    given directory.
  • grep -v prints lines that do NOT match a
    specified text string or pattern.
  • Many other command line switches allow users to
    specify grep's output format.

31
Searching Text File Contents using grep
  • In an earlier Lesson, we saw how the wc program
    can be used to count the characters, words and
    lines in text files. In this Lesson we introduce
    the grep program, a handy tool for searching text
    file contents for specific words or character
    sequences.
  • The name grep stands for general regular
    expression parser. What, you may well ask, is a
    regular expression and why on earth should I want
    to parse one? We will provide a more formal
    definition of regular expressions in a later
    Lesson, but for now it is enough to know that a
    regular expression is simply a way of describing
    a pattern, or template, to match some sequence of
    characters. A simple regular expression would be
    Hello, which matches exactly five characters
    H, e, two consecutive l characters, and a
    final o. More powerful search patterns are
    possible and we shall examine them in the next
    section.
  • The figure below gives the general form of the
    grep command line

32
Searching Text File Contents using grep
  • There are actually three different names for the
    grep tool 10
  • fgrep Does a fast search for simple patterns. Use
    this command to quickly locate patterns without
    any wildcard characters, useful when searching
    for an ordinary word.
  • grep Pattern searches using ordinary regular
    expressions.
  • egrep Pattern searches using more powerful
    extended regular expressions.
  • The pattern argument supplies the template
    characters for which grep is to search. The
    pattern is expected to be a single argument, so
    if pattern contains any spaces, or other
    characters special to the shell, you must enclose
    the pattern in quotes to prevent the shell from
    expanding or word splitting it.

33
Searching Text File Contents using grep
  • The following table summarizes some of grep's
    more commonly used command line switches. Consult
    the grep(1) man page (or invoke grep --help) for
    more.

34
Show All Occurrences of a String in a File
  • Under Linux, there are often several ways of
    accomplishing the same task. For example, to see
    if a file contains the word even, you could
    just visually scan the file
  • Reading the file, we see that the file does
    indeed contain the letters even. Using this
    method on a large file suffers because we could
    easily miss one word in a file of several
    thousand, or even several hundred thousand,
    words. We can use the grep tool to search through
    the file for us in an automatic search
  • Here we searched for a word using its exact
    spelling. Instead of just a literal string, the
    pattern argument can also be a general template
    for matching more complicated character
    sequences we shall explore that in a later
    Lesson.

35
Searching in Several Files at Once
  • An easy way to search several files is just to
    name them on the grep command line
  • Perhaps we are more interested in just
    discovering which file mentions the word nine
    than actually seeing the line itself. Adding the
    -l switch to the grep line does just that

36
Searching Directories Recursively
  • Grep can also search all the files in a whole
    directory tree with a single command. This can be
    handy when working a large number of files.
  • The easiest way to understand this is to see it
    in action. In the directory /etc/sysconfig are
    text files that contain much of the configuration
    information about a Linux system. The Linux name
    for the first Ethernet network device on a system
    is eth0, so you can find which file contains
    the configuration for eth0 by letting the grep -r
    command do the searching for you 11

37
Searching Directories Recursively
  • Every file in /etc/sysconfig that mentions eth0
    is shown in the results.
  • We can further limit the files listed to only
    those referring to an actual device by filtering
    the grep -r output through a grep DEVICE
  • This shows a common use of grep as a filter to
    simplify the outputs of other commands.
  • If only the names of the files were of interest,
    the output can be simplified with the -l command
    line switch.

38
Inverting grep
  • By default, grep shows only the lines matching
    the search pattern. Usually, this is what you
    want, but sometimes you are interested in the
    lines that do not match the pattern. In these
    instances, the -v command line switch inverts
    grep's operation.

39
Getting Line Numbers
  • Often you may be searching a large file that has
    many occurrences of the pattern. Grep will list
    each line containing one or more matches, but how
    is one to locate those lines in the original
    file? Using the grep -n command will also list
    the line number of each matching line.
  • The file /usr/share/dict/words contains a list of
    common dictionary words. Identify which line
    contains the word dictionary
  • You might also want to combine the -n switch with
    the -r switch when searching all the files below
    a directory

40
Limiting Matching to Whole Words
  • Remember the file containing our nursery rhyme
    earlier?
  • Suppose we wanted to retrieve all lines
    containing the word at. If we try the command
  • Do you see what happened? We matched the at
    string, whether it was an isolated word or part
    of a larger word. The grep command provides the
    -w switch to imply that the specified pattern
    should only match entire words.
  • The -w switch considers a sequence of letters,
    numbers, and underscore characters, surrounded by
    anything else, to be a word.

41
Ignoring Case
  • The string Bob has quite a meaning quite
    different from the string bob. However,
    sometimes we want to find either one, regardless
    of whether the word is capitalized or not. The
    grep -i command solves just this problem.

42
ExamplesFinding Simple Character Strings
  • Verify that your computer has the system account
    lp, used for the line printer tools. Hint the
    file /etc/passwd contains one line for each user
    account on the system.

43
QuestionsChapter 2.  Finding Text grep
  • 1, 2 and 3

44
Chapter 3.  Introduction to Regular Expressions
  • Key Concepts
  • Regular expressions are a standard Unix syntax
    for specifying text patterns.
  • Regular expressions are understood by many
    commands, including grep, sed, vi, and many
    scripting languages.
  • Within regular expressions, . and are used to
    match characters.
  • Within regular expressions, , , and ?specify a
    number of consecutive occurrences.
  • Within regular expressions, and specify the
    beginning and end of a line.
  • Within regular expressions, (, ), and specify
    alternative groups.
  • The regex(7) man page provides complete details.

45
Introducing Regular Expressions
  • In the previous chapter you saw grep used to
    match either a whole word or part of a word. This
    by its self is very powerful, especially in
    conjunction with arguments like -i and -v, but it
    is not appropriate for all search scenarios. Here
    are some examples of searches that the grep usage
    you've learned so far would not be able to do
  • First, suppose you had a file that looked like
    this

46
Introducing Regular Expressions
  • What if you wanted to pull out just the names of
    the people in people_and_pets.txt? A command like
    grep -w Name would match the 'Name' line for
    each person, but also the 'Name' line for each
    person's pet. How could we match only the 'Name'
    lines for people? Well, notice that the lines for
    pets' names are all indented, meaning that those
    lines begin with whitespace characters instead of
    text. Thus, we could achieve our goal if we had a
    way to say "Show me all lines that begin with
    'Name'".
  • Another example Suppose you and a friend both
    witnessed a hit-and-run car accident. You both
    got a look at the fleeing car's license plate and
    yet each of you recalls a slightly different
    number. You read the license number as "4I35VBB"
    but your friend read it as "413SV88". It seems
    that what you read as an 'I' in the second
    character, your friend read as a '1'. Similar
    differences appear in your interpretations of
    other parts of the license like '5' vs 'S' and
    'BB' vs '88'. The police, having taken both of
    your statements, now need to narrow down the
    suspects by querying their database of license
    plates for plates that might match what you saw.

47
Introducing Regular Expressions
  • One solution might be to do separate queries for
    "4I35VBB" and "413SV88" but doing so assumes that
    one of you is exactly right. What if the
    perpetrator's license number was actually
    "4135VB8"? In other words, what if you were right
    about some of the characters in question but your
    friend was right about others? It would be more
    effective if the police could query for a pattern
    that effectively said "Show me all license
    numbers that begin with a '4', followed by an 'I'
    or a '1', followed by a '3', followed by a '5' or
    an 'S', followed by a 'V', followed by two
    characters that are each either a 'B' or an '8'".
  • Query scenarios like these can be solved using
    regular expressions. While computer scientists
    sometimes use the term "regular expression" (or
    "regex" for short) to describe any method of
    describing complex patterns, in Linux and many
    programming languages the term refers to a very
    specific set of special characters used for
    solving problems like the above. Regular
    expressions are supported by a large number of
    tools including grep, vi, find and sed.

48
Introducing Regular Expressions
  • To introduce the usage of regular expressions,
    lets look at some solutions to two problems
    introduced earlier. Don't worry if these seem a
    bit complicated, the remainder of the unit will
    start from scratch and cover regular expressions
    in great detail.
  • A regex that could solve the first problem, where
    we wanted to say "Show me all lines that begin
    with 'Name'" might look like this
  • ...that's it! Regular expressions are all about
    the use of special characters, called
    metacharacters to represent advanced query
    parameters. The carat (""), as shown here, means
    "Lines that begin with...". Note, by the way,
    that the regular expression was put in
    single-quotes. This is a good habit to get into
    early on as it prevents bash from interpreting
    special characters that were meant for grep.

49
Introducing Regular Expressions
  • Ok, so what about the second problem? That one
    involved a much more complicated query "Show me
    all license numbers that begin with a '4',
    followed by an 'I' or a '1', followed by a '3',
    followed by a '5' or an 'S', followed by a 'V',
    followed by two characters that are each either a
    'B' or an '8'". This could be represented by a
    regular expression that looks like this
  • Wow, that's pretty short considering how long it
    took to write out what we were looking for! There
    are only two types of regex metacharacters used
    here square braces ('') and curly braces
    (''). When two or more characters are shown
    within square braces it means "any one of these".
    So 'B8' near the end of the expression means
    "'B' or '8'". When a number is shown within curly
    braces it means "this many of the preceding
    character". Thus, 'B82' means "two characters
    that are each either a 'B' or an '8'". Pretty
    powerful stuff!
  • Now that you've gotten a taste of what regular
    expressions are and how they can be used, let's
    start from scratch and cover them in depth.

50
Regular Expressions, Extended Regular
Expressions, and the grep Command
  • As the Unix implementation of regular expression
    syntax has evolved, new metacharacters have been
    introduced. In order to preserve backward
    compatibility, commands usually choose to
    implement regular expressions, or extended
    regular expressions. In order to not become
    bogged down with the differences, this Lesson
    will introduce the extended syntax, summarizing
    differences at the end of the discussion.
  • One of the most common uses for regular
    expressions is specifying search patterns for the
    grep command. As was mentioned in the previous
    Lesson, there are three versions of the grep
    command. Reiterating, the three differ in how
    they interpret regular expressions.

51
Regular Expressions, Extended Regular
Expressions, and the grep Command
  • fgrep
  • The fgrep command is designed to be a "fast"
    grep. The fgrep command does not support regular
    expressions, but instead interprets every
    character in the specified search pattern
    literally.
  • grep
  • The grep command interprets each patterns using
    the original, basic regular expression syntax.
  • egrep
  • The egrep command interprets each patterns using
    extended regular expression syntax.
  • Because we are not yet making a distinction
    between the basic and extended regular expression
    syntax, the egrep command should be used whenever
    the search pattern contains regular expressions.

52
Anatomy of a Regular Expression
  • In our discussion of the grep program family, we
    were introduced to the idea of using a pattern to
    identify the file content of interest. Our
    examples were carefully constructed so that the
    pattern contained exactly the text for which we
    were searching. We were careful to use only
    literal characters in our regular expressions a
    literal character matches only itself. So when we
    used hello as the regular expression, we were
    using a five-character regular expression
    composed only of literal characters. While this
    let us concentrate on learning how to operate the
    grep program, it didn't allow us to get a full
    appreciation of the power of regular expressions.
    Before we see regular expressions in use, we
    shall first see how they are constructed.

53
Anatomy of a Regular Expression
  • A regular expression is a sequence of
  • Literal Characters Literal characters match only
    themselves. Examples of literals are letters,
    digits and most special characters (see below for
    the exceptions).
  • Wildcards Wildcard characters match any
    character. Within a regular expression, a period
    (.) matches any character, be it a space, a
    letter, a digit, punctuation, anything.
  • Modifiers A modifier alters the meaning of the
    immediately preceding pattern character. For
    example, the expression abc matches the
    strings ac, abc, abbc, abbbc, and so on,
    because the asterisk () is a modifier that
    means any number of (including zero). Thus, our
    pattern means to match any sequence of characters
    consisting of one a, a (possibly empty) series
    of b characters, and a final c character.
  • Anchors Anchors establish the context for the
    pattern, such as "the beginning of a line", or
    "the end of a word". For example, the expression
    cat would match any occurrence of the three
    letters, while cat would only match lines that
    begin cat.

54
Taking Literals Literally
  • Literals are straightforward because each literal
    character in a regular expressions matches one,
    and only one, copy of itself in the searched
    text. Uppercase characters are distinct from
    lowercase characters, so that A does not match
    a.
  • Wildcards
  • The "dot" wildcard
  • The character . is used as a placeholder, to
    match one of any character. In the following
    example, the pattern matches any occurrence of
    the literal characters x and s, separated by
    exactly two other characters.

55
Bracket Expressions Ranges of Literal
Characters
  • Normally a literal character in a regex pattern
    matches exactly one occurrence of itself in the
    searched text. Suppose we want to search for the
    string hello regardless of how it is
    capitalized we want to match Hello and HeLLo
    as well. How might we do that?
  • A regex feature called a bracket expression
    solves this problem neatly. A bracket expression
    is a range of literals enclosed in square
    brackets ( and ). For example, the regex
    pattern Hh is a character range that matches
    exactly one character either an uppercase H or
    a lowercase h letter. Notice that it doesn't
    matter how large the set of characters within the
    range is, the set matches exactly one character,
    if it matches any at all. A bracket expression
    that matches the set of lowercase vowels could be
    written aeiou and would match exactly one
    vowel.
  • In the following example, bracket expressions are
    used to find words from the file
    /usr/share/dict/words. In the first case, the
    first five words that contain three consecutive
    (lowercase) vowels are printed. In the second
    case, the first 5 words that contain lowercase
    letters in the pattern of vowel-consonant-vowel-co
    nsonant-vowel-consonant are printed.

56
Bracket Expressions Ranges of Literal
Characters
  • If the first character of a bracket expression is
    a , the interpretation is inverted, and the
    bracket expression will match any single
    occurrence of a character not included in the
    range. For example, the expression aeiou
    would match any character that is not a vowel.
    The following example first lists words which
    contain three consecutive vowels, and secondly
    lists words which contain three consecutive
    consonant-vowel pairs.

57
Range Expressions vs. Character Classes Old
School and New School
  • Another way to express a character range is by
    giving the start- and end-letters of the sequence
    this way a-d would match any character from
    the set a, b, c or d. A typical usage of this
    form would be 0-9 to represent any single
    digit, or A-Z to represent all capital
    letters.

58
Range Expressions vs. Character Classes Old
School and New School
  • As an alternative to such quandaries, modern
    regular expression make use character classes.
    Character classes match any single character,
    using language specific conventions to decide if
    a given character is uppercase or lowercase, or
    if it should be considered part of the alphabet
    or punctuation. The following table lists some
    supported character classes, and the ASCII
    equivalent range expression, where appropriate.

59
Range Expressions vs. Character Classes Old
School and New School
  • Character classes avoid problems you may run into
    when using regular expressions on systems that
    use different character encoding schemes where
    letters are ordered differently. For example,
    suppose you were to run the command
  • On a Red Hat Enterprise Linux system, this would
    match every word in the file, not just those that
    contain capital letters as one might assume. This
    is because in unicode (utf-8), the character
    encoding scheme that RHEL uses, characters are
    alphabetized case-insensitively, so that A-Z is
    equivalent to AaBbCc...etc.

60
Range Expressions vs. Character Classes Old
School and New School
  • On older systems, though, a different character
    encoding scheme is used where alphabetization is
    done case-sensitively. On such systems A-Z
    would be equivalent to ABC...etc. Character
    classes avoid this pitfall. You can run
  • on any system regardless of the encoding scheme
    being used and it will only match lines that
    contain capital letters.
  • For more details about the predefined range
    expressions, consult the grep manual page. For
    more information on character encoding schemes
    under Linux, refer back to chapter 8.3. To learn
    about how character encoding schemes are used to
    support other languages in Red Hat Enterprise
    Linux, begin with the locale manual page.

61
Common Modifier Characters
  • We saw a common usage of a regex modifier in our
    earlier example abc to match an a and c
    character with some number of b letters in
    between. The character changed the
    interpretation of the literal b character from
    matching exactly one letter to matching any
    number of b's.
  • Here are a list of some common modifier
    characters
  • b? The question mark (?) means either one or
    none the literal character is considered to be
    optional in the searched text. For example, the
    regex pattern ab?c matches the strings ac,
    and abc, but not abbc.
  • b The asterisk () modifier means any number
    of (including zero) of the preceding literal
    character. The regex pattern abc matches the
    strings ac, abc, abbc, and so on.

62
Common Modifier Characters
  • b The plus () modifier means one or more,
    so the regex pattern b matches a non-empty
    sequence of b's. The regex pattern abc matches
    the strings abc and abbc, but does not match
    ac
  • bm,n The brace modifier is used to specify a
    range of between m and n occurrences of the
    preceding character. The regex pattern b2,4
    would match abbc and abbbc, and abbbbc, but
    not abc or abbbbbc.
  • bn With only one integer, the brace modifier is
    used to specify exactly n occurrences for the
    preceding character.

63
Common Modifier Characters
  • In the following example, egrep prints lines from
    /usr/share/dict/words that contain patterns which
    start with a (capital or lowercase) a, might or
    might not next have a (lowercase) b, but then
    definitely follow with a (lowercase) a.
  • The following example prints lines which contain
    patterns which start al, then use the .
    wildcard to specify 0 or more occurrences of any
    character, followed by the pattern bra.

64
Common Modifier Characters
  • Notice we found variations on the words algebra
    and calibrate. For the former, the . expression
    matched ge, while for the latter, it matched
    the letter i.
  • The expression ., which is interpreted as "0
    or more of any character", shows up often in
    regex patterns, acting as the "stretchable glue"
    between two patterns of significance.
  • As a subtlety, we should note that the modifier
    characters are greedy they always match the
    longest possible input string. For example, given
    the regex pattern

65
Anchored Searches
  • Four additional search modifier characters are
    available
  • foo A caret () matches the beginning of a
    line. Our example foo matches the string foo
    only when it is at the beginning of a line
  • foo A dollar sign () matches the end of a
    line. Our example foo matches the string foo
    only at the end of a line, immediately before the
    newline character.
  • \ltfoo\gt By themselves, the less than sign (lt)
    and the greater than sign (gt) are literals.
    Using the backslash character to escape them
    transforms them into meaning first of a word
    and end of a word, respectively. Thus the
    pattern \gtcat\lt matches the word cat but not
    the word catalog.
  • You will frequently see both and used
    together. The regex pattern foo matches a
    whole line that contains only foo and would not
    match that line if it contained any spaces.
  • The \lt and \gt are also usually used as pairs.

66
Anchored Searches
  • In the following an example, the first search
    lists all lines that contain the letters ion
    anywhere on the line. The second search only
    lists lines which end in ion.

67
Coming to Terms with Regex Grouping
  • The same way that you can use parenthesis to
    group terms within a mathematical expression, you
    also use parenthesis to collect regular
    expression pattern specifiers into groups. This
    lets the modifier characters ?, and
    apply to groups of regex specifiers instead of
    only the immediately preceding specifier.
  • Suppose we need a regular expression to match
    either foo or foobar. We could write the
    regex as foo(bar)? and get the desired results.
    This lets the ? modifier apply to the whole
    string bar instead of only the preceding r
    character.
  • Grouping regex specifiers using parenthesis
    becomes even more flexible when the pipe symbol
    () is used to separate alternative patterns.
    Using alternatives, we could rewrite our previous
    example as (foofoobar). Writing this as
    foofoobar is simpler and works just as well,
    because just like mathematics, regex specifiers
    have precedence. While you are learning, always
    enclose your groups in parenthesis.

68
Coming to Terms with Regex Grouping
  • In the following example, the first search prints
    all lines from the file /usr/share/dict/words
    which contain four consecutive vowels (compare
    the syntax to that used when first introducing
    range expressions, above). The second search
    finds words that contain a double o or a double
    e, followed (somewhere) by a double e.

69
Escaping Meta-Characters
  • Sometimes you need to match a character that
    would ordinarily be interpreted as a regular
    expression wildcard or modifier character. To
    temporarily disable the special meaning of these
    characters, simply escape them using the
    backslash (\) character. For example, the regex
    pattern cat. would match the letters cat
    followed by any character cats or catchup.
    To match only the letters cat. at the end of a
    sentence, use the regex pattern cat\. to
    disable interpreting the period as a wildcard
    character.
  • Note one distracting exception to this rule. When
    the backslash character precedes a lt or gt
    character, it enables the special interpretation
    (anchoring the beginning or ending of a word)
    instead of disabling the special interpretation.
    Shudder. It even gets worse - see the footnote at
    the bottom of the following table.

70
Summary of Linux Regular Expression Syntax
  • The following table summarizes regular expression
    syntax, and identifies which components are found
    in basic regular expression syntax, and which are
    found only in the extended regular expression
    syntax.

71
Summary of Linux Regular Expression Syntax
  • The following table summarizes regular expression
    syntax, and identifies which components are found
    in basic regular expression syntax, and which are
    found only in the extended regular expression
    syntax.

72
Regular Expressions are NOT File Globbing
  • When first encountering regular expressions,
    students understandably confuse regular
    expressions with pathname expansion (file
    globbing). Both are used to match patterns in
    text. Both share similar metacharacters (,
    ?, ...), etc.). However, they are
    distinctly different. The following table
    compares and contrasts regular expressions and
    file globbing.

73
Regular Expressions are NOT File Globbing
  • In the following example, the first argument is a
    regular expression, specifying text which starts
    with an l and ends .conf, while the second
    argument is a file glob which specifies all files
    in the /etc directory whose filename starts with
    l and ends .conf.
  • Take a close look at the second line of output.
    Why was it matched by the specified regular
    expression?
  • Why does the line containing the text krb5.conf
    match the expression? The l is found way back
    in the word default!
  • In a similar vain, when specifying regular
    expressions on the bash command line, care must
    be taken to quote or escape the regex
    meta-characters, lest they be expanded away by
    the bash shell with unexpected results. In all of
    the examples found in this discussion, the first
    argument to the egrep command is protected with
    single quotes for just this reason.

74
Where to Find More Information About Regular
Expressions
  • We have barely scratched the surface of the
    usefulness of regular expressions. The
    explanation we have provided will be adequate for
    your daily needs, but even so, regular
    expressions offer much more power, making even
    complicated text searches simple to perform.
  • For more online information about regular
    expressions, you should check
  • The regex(7) manual page.
  • The grep(1) manual page.

75
Examples
  • Regular Expression Modifiers

76
Chapter 4.  Everything Sorting sort and uniq
  • Key Concepts
  • The sort command sorts data alphabetically.
  • sort -n sorts numerically.
  • sort -u sorts and removes duplicates.
  • sort -k and -t sorts on a specific field in
    patterned data.

77
The sort Command
  • Sorting is the process of arranging records into
    a specified sequence. Examples of sorting would
    be arranging a list of usernames into
    alphabetical order, or a set of file sizes into
    numeric order.
  • In its simplest form, the sort command will
    alphabetically sort lines (including any
    whitespace or control characters which are
    encountered). The sort command uses the local
    locale (language definition) to determine the
    order of the characters (referred to as the
    collating order). In the following example,
    madonna first displays the contents of the file
    /etc/sysconfig/mouse as is, and then sorts the
    contents of the file alphabetically.

78
Modifying the Sort Order
  • By default, the sort command sorts lines
    alphabetically. The following table lists command
    line switches which can be used to modify this
    default sort order.

79
Examples of sort
  • As an example, madonna is examining the file
    sizes of all files that start with an m in the
    /var/log directory.
  • She next sorts the output with the sort command.

80
Examples of sort
  • Without being told otherwise, the sort command
    sorted the lines alphabetically (with 1952 coming
    before 20). Realizing this is not what she
    intended, madonna adds the -n command line
    switch.

81
Examples of sort
  • Better, but madonna would prefer to reverse the
    sort order, so that the largest files come first.
    She adds the -r command line switch
  • Why ls -1?
  • Why was the -1 command line switch given to the
    ls command in the first example, but not the
    others? By default, when the ls command is using
    a terminal for standard out, it will group the
    filenames in multiple columns for easy
    readability. When the ls command is using a pipe
    or file for standard out, however, it will print
    the files one file per line. The -1 command line
    switch forces this behavior for for terminal
    output as well.

82
Specifying Sort Keys
  • In the previous examples, the sort command
    performed its sort based on the first characters
    found on a line. Often, formatted data is not
    arranged so conveniently. Fortunately, the sort
    command allows users to specify which column of
    tabular data to use for determining the sort
    order, or, in more formally, which column should
    be used as the sort key.
  • The following table of command line switches can
    be used to determine the sort key.

83
Sorting Output by a Particular Column
  • As an example, suppose madonna wanted to
    reexamine her log files, using the long format of
    the ls command. She tries simply sorting her
    output numerically.
  • Now that the sizes are no longer reported at the
    beginning of the line, she has difficulty.
    Instead, she repeats her sort using the -k
    command line switch to sort her output by the 5th
    column, producing the desired output.

84
Specifying Multiple Sort Keys
  • Next, madonna is examining the file /etc/fdprm,
    which tables low level formatting parameters for
    floppy drives. She uses the grep command to
    extract the data from the file, stripping away
    comments and blank lines.

85
Specifying Multiple Sort Keys
  • She next sorts the data numerically, using the
    5th column as her key.

86
Specifying Multiple Sort Keys
  • Her data is successfully sorted using the 5th
    column, with the formats specifying 40 tracks
    grouped at the top, and 80 tracks grouped at the
    bottom. Within these groups, however, she would
    like to sort the data by the 3rd column. She adds
    an additional -k command line switch to the sort
    command, specifying the third column as her
    secondary key.
  • Now the data has been sorted primarily by the
    fifth column. For rows with identical fifth
    columns, the third column has been used to
    determine the final order. An arbitrary number of
    keys can be specified by adding more -k command
    line switches.

87
Specifying the Field Separator
  • The above examples have demonstrated how to sort
    data using a specified field as the sort key. In
    all of the examples, fields were separated by
    whitespace (i.e., a series of spaces and/or
    tabs). Often in Linux (and Unix), some other
    method is used to separate fields. Consider, for
    example, the /etc/passwd file.

88
Specifying the Field Separator
  • The lines are structured into seven fields each,
    but the fields are separated using a instead
    of whitespace. With the -t command line switch,
    the sort command can be instructed to use some
    specified character (such as a ) to separate
    fields.
  • In the following, madonna uses the sort command
    with the -t command line switch to sort the first
    10 lines of the /etc/passwd file by home
    directory (the 6th field).
  • The user bin, with a home directory of /bin, is
    now at the top, and the user mail, with a home
    directory of /var/spool/mail, is at the bottom.

89
Summary
  • In summary, we have seen that the sort command
    can be used to sort structured data, using the -k
    command line switch to specify the sort field
    (perhaps more than once), and the -t command line
    switch to specify the field delimiter.
  • The -k command line switch can receive more
    sophisticated arguments, which serve to specify
    character positions within a field, or customize
    sort options for individual fields. See the
    sort(1) man page for details.

90
The uniq Command
  • The uniq program is used to identify, count, or
    remove duplicate records in sorted data. If given
    command line arguments, they are interpreted as
    filenames for files on which to operate. If no
    arguments are provided, the uniq command operates
    on standard in. Because the uniq command only
    works on already sorted data, it is almost always
    used in conjunction with the sort command.
  • The uniq command uses the following command line
    switches to qualify its behavior.

91
The uniq Command
  • In order to understand the uniq command's
    behavior, we need repetitive data on which to
    operate. The following python script simulates
    the rolling of three six sided dice, writing the
    sum of 100 roles once per line. The user madonna
    makes the script executable, and then records the
    output in a file called trial1.

92
Reducing Data to Unique Entries
  • Now, madonna would like to analyze the data. She
    begins by sorting the data and piping the output
    through the uniq command.
  • Without any command line switches, the uniq
    command has removed duplicate entries, reducing
    the data from 100 lines to only 15. Easily,
    madonna sees that the data looks reasonable the
    sum of every combination for three six sided die
    is represented, with the exception of 3. Because
    only one combination of the dice would yield a
    sum of 3 (all ones), she expects it to be a
    relatively rare occurrence.

93
Counting Instances of Data
  • A particularly convenient command line switch for
    the uniq command is -c, or --count. This causes
    the uniq command to count the number of
    occurrences of a particular record, prepending
    the result to the record on output.
  • In the following example, madonna uses the uniq
    command to reproduce its previous output, this
    time prepending the number of occurrences of each
    entry in the file.

94
Counting Instances of Data
  • As would be expected (by a statistician, at
    least), the largest and smallest numbers have
    relatively few occurrences, while the
    intermediate numbers occur more numerously. The
    first column can be summed to 100 to confirm that
    the uniq command identified every occurrence.

95
Identifying Unique or Repeated Data with uniq
  • Sometimes, people are just interested in
    identifying unique or repeated data. The -d and
    -u command line switches allow the uniq command
    to do just that. In the first case, madonna
    identifies the dice combinations that occur only
    once. In the second case, she identifies
    combinations that are repeated at least once.

96
QuestionsChapter 4.  Everything Sorting sort
and uniq
  • 1 and 2

97
Chapter 5  Extracting and Assembling Text cut
and paste
  • Key Concepts
  • The cut command extracts texts from text files,
    based on columns specified by bytes, characters,
    or fields.
  • The paste command merges two text files line by
    line.

98
The cut Command
  • Extracting Text with cut
  • The cut command extracts columns of text from a
    text file or stream. Imagine taking a sheet of
    paper that lists rows of names, email addresses,
    and phone numbers. Rip the page vertically twice
    so that each column is on a separate piece. Hold
    onto the middle piece which contains email
    addresses, and throw the other two away. This is
    the mentality behind the cut command.
  • The cut command interprets any command line
    arguments as filenames of files on which to
    operate, or operates on the standard in stream if
    none are provided. In order to specify which
    bytes, characters, or fields are to be cut, the
    cut command must be called with one of the
    following command line switches.

99
The cut Command
  • The list arguments are actually a comma-separated
    list of ranges. Each range can take one of the
    following forms.

100
Extracting text by Character Position with cut
-c
  • With the -c command line switch, the list
    specifies a character's position in a line of
    text, where the first character is character
    number 1. As an example, the file
    /proc/interrupts lists device drivers, the
    interrupt request (IRQ) line to which they
    attach, and the number of interrupts which have
    occurred on that IRQ line. (Do not be concerned
    if you are not yet familiar with the concepts of
    a device driver or IRQ line. Focus instead on how
    cut is used to manipulate the data).

101
Extracting text by Character Position with cut
-
Write a Comment
User Comments (0)
About PowerShow.com