1
ReMa Corpus Linguistics - LTR010M10
  • UNIX tools and more
  • Gertjan van Noord (Alfa-Informatica)
  • G.J.M.van.Noord@rug.nl
  • With thanks to Lonneke van der Plas

2
http://www.let.rug.nl/~vannoord/College/REMA/
  • That is the website where you can find all the latest
    info.
  • Schedule.
  • Assignments for the practicals will be put there.
  • Handouts of slides of each lecture can be found
    there.

3
What you have to do to pass
  • Test UNIX tools (practical session of week 3 or 4)
  • Corpus linguistics study (end of course)
  • Oral presentation of your study

4
Goal of the course
  • Main goal is to teach research master students
    what corpora are available, what their nature is
    and to provide you with tools that will help you
    to use these corpora for your (future) research.
  • You will get some theoretical background, but
    very limited.
  • I want to spend most of the time and energy on
    getting you ready to use UNIX tools and XML
    search tools.

5
UNIX
  • Unix is an operating system (AT&T, 1969); plenty
    of variations (Linux).
  • Shells
  • Commands: cd, ls, pwd, mkdir

6
Why UNIX?
  • As part of UNIX you get many tools that you need
    for corpus linguistics, such as counting
    frequencies of words etc. for free.
  • No need for interfaces (always different ones for
    each corpus) with limited functionalities.
  • We have a large collection of corpora on UNIX
    machines.

7
What is a corpus?
  • A collection of linguistic data, either written
    texts or a transcription of recorded speech,
    which can be used as a starting-point of
    linguistic description or as a means of verifying
    hypotheses about a language. (David Crystal, A
    Dictionary of Linguistics and Phonetics,
    Blackwell, 3rd Edition, 1991)
  • A collection of naturally occurring language
    text, chosen to characterize a state or variety
    of a language. (John Sinclair, Corpus,
    Concordance, Collocation, OUP, 1991)

8
What should it look like?
  • Sampling and representativeness. The texts in a
    corpus must be collected in a systematic way,
    under controlled conditions, and in such a way
    that the corpus reflects the true distribution of
    the language/dialect/variety under study.
  • Finite size. In order to allow scientific and
    reproducible study, a corpus should be of fixed
    and finite size, but often millions or 100s of
    millions of words.
  • Machine readable. To do anything interesting a
    corpus has to be machine readable and preferably
    annotated.

9
Some well-known corpora
  • the BROWN corpus
  • the LOB corpus
  • the BNC corpus
  • CHILDES

10
Brown Corpus
  • First major computational corpus project, begun by
    Francis and Kucera at Brown University
    (1961-1964).
  • Designed in co-operation with grammarian Quirk
    and others.
  • One million words drawn from randomly selected
    material written in American English in a variety
    of genres.

11
LOB corpus
  • Brown corpus is a snapshot of written American
    English in 1961.
  • Lancaster-Oslo/Bergen (LOB) corpus is a
    collection of British English sampled in the same
    way. Under the direction of Geoffrey Leech.
  • Allows direct comparison between American and
    British dialects.

12
BNC corpus
  • 100 million words
  • 90% written British English
  • Imaginative, natural science, social science,
    world affairs, commerce, arts, belief and thought,
    leisure, other.
  • 10% spoken British English
  • Individual interviews, educational, business,
    institutional, leisure, other.

13
CHILDES
  • The CHILDES database contains transcript and
    media data collected from conversations between
    young children and their playmates and
    caretakers. Conversations with older children and
    adults are available from TalkBank. All of the
    data is transcribed in CHAT and CA/CHAT formats.

14
Corpora for Dutch
  • Eindhoven corpus: written and spoken Dutch
    (periods 1964-1971 and 1960-1973); 600,000
    written words, 120,000 spoken words
  • Corpus Gesproken Nederlands (CGN): spontaneous
    conversation, interviews, discussions, etc.,
    Flemish and Netherlands Dutch, 10 million words
    http://lands.let.kun.nl/cgn/ehome.htm
  • Twente Nieuws Corpus (TwNC) (release 2002):
    newspaper, teletext, subtitles, broadcast news,
    etc., now almost 500 million words
    http://www.vf.utwente.nl/druid/TwNC/TwNC-main.html

15
Corpora for Dutch
  • SONAR
  • >500 million word corpus
  • Lassy
  • Lassy Small: one million word corpus, fully
    annotated syntactically, manually corrected
  • Lassy Large: one billion word corpus, fully
    annotated syntactically, but not manually
    corrected (includes SONAR)

16
What if you want to make your own corpus?
  • It's a lot of work!
  • Sampling and representativeness: Which texts will
    you include? Some parts of a text you might not
    want to include (headers, footers, pictures, etc.)
  • Machine readable: Documents can be in any format:
    pdf, Microsoft Word, html. Conversion is not
    trivial.
  • You might want to annotate it.

17
What can we do with corpora?
  • Studies on linguistic structure, such as grammar:
    particles and choice of word order in Dutch verb
    final clusters (dat Jan Marie (aan) zou (aan)
    hebben (aan)gesproken)
  • Lexicography: modern dictionaries are created by
    exploring large corpora to determine frequency of
    words. Which words should be included?
    Collocations and phraseology of individual words:
    what should the lexical entry for credit include?
    credit card, credit service

18
What can we do with corpora? (cont'd)
  • Cultural studies: frequency and connotation of a
    political ideology
  • Translation studies: parallel corpora are more and
    more often used to describe translation options
    and preferences
  • Forensic linguistics: determine the authorship of
    a document by comparing linguistic features in
    the disputed document(s), in undisputed documents
    and in a general corpus.

19
UNIX (recap)
  • Unix is an operating system (AT&T, 1969); plenty
    of variations (Linux).
  • Multi-user: one host, many terminals
  • Multiprocessors, timesharing
  • Shells
  • Standard processes with commands. Ex: cd, ls,
    pwd, mkdir
  • Standard functions: ^c stops the command,
    ^d end-of-file, ... (^ = CTRL).

20
Shell
  • The shell is a computer programme that allows the
    user to give commands to the computer
  • It takes in input (standard: from the keyboard)
  • And it gives back output (standard: to the screen)

21
(No Transcript)
22
UNIX file system
  • At the top: the root /
  • Path: /home/its/ug1/ee51vn/pics
  • ~ your home directory
  • . is the current working directory
  • .. parent directory of the working directory
  • ../.. grandparent directory of the working directory

23
About file names
  • UNIX is case sensitive
  • No extension (like .exe, .bat, .doc or .rtf). The
    period ('.') is just considered to be one
    character of the file name: e.g. my.first.file.name
  • Furthermore, you had better not use the following
    characters in your file names: space, comma, / ( )
    ' " ? < > \ and avoid using '-' as the first
    character, as these characters have special
    meanings.
  • So not: My first try at making a corpus study.doc
  • Use _ instead of spaces.

24
Unix commands
  • Commands are either functionalities of the
    running shell, or program files that you can
    simply launch from the shell.
  • The general syntax of them is
  • command [-options] [arguments]
  • [ ] means that they are optional.
  • You can get information on commands with
  • man <command-name>   Ex: man cd

25
Some handy commands
  • cat > file_name ... ^d : make a file; cat
    file_name : show the content of the file
  • passwd change your password
  • exit quit the shell
  • cd change (working) directory. Going home with
    'cd' or 'cd ~'
  • pwd print working directory
  • who who is logged in? 'whoami' or 'who am i'

26
Some handy commands
  • ls list
  • ls lists the current working directory; ls
    <directory> lists the given files or directories
  • ls -l gives a long list with more info
  • ls -R lists recursively the subdirectories
  • less view content of a file
  • tail view only last part
  • head view only first part

27
Some handy commands
  • mkdir make directory
  • rm, rmdir remove files and directories
  • (watch out!)
  • rm -r removes the subdirectories recursively;
    rm -i asks for confirmation
  • cp copy
  • mv move (copy + delete the original one)
  • renaming a file: mv oldname newname

28
Examples
  • cp ../John Jack : a new file is created in the
    current working directory, whose name is 'Jack',
    and whose content is identical to the content of
    the file 'John' which is located in the parent
    directory.
  • mv Mary .. : the file 'Mary', which is located
    in the current working directory, is moved one
    level up in the tree structure
  • mv ../../Willy . : the file 'Willy' has been
    moved from the "grandparent directory" to the
    current working directory (where I am now)

29
Unix commands
  • echo displays a line of text.
  • Example: > echo hello world
    hello world
  • expr evaluates a mathematical expression.
    Example:
  • > expr 3 + 5
  • 8

30
Wildcards
  • Possibilities to refer to more than one file
  • *  any sequence of zero or more characters
  • ?  denotes a single character
  • [cset]  any single character in the cset
  • [ - ]   range
  • [! ]    NOT

31
Examples
  • x*   any name beginning with 'x'
  • x, xold, xerxes
  • *x*  any name containing an 'x'
  • x, xold, fox, maxi, xx
  • x?   any two-character-long name beginning with
    'x'
  • xx, xy, x2

32
Wildcards (more)
  • ?  denotes a single character
  • [cset]  any single character in the cset
  • [ - ]   range
  • [! ]    NOT
  • *  any sequence of zero or more characters

33
Examples
  • x[aeiou]   any two-character-long name beginning
    with an 'x' followed by a vowel (e.g. xa, xu)
  • x[aeiou]*  any name beginning with an 'x'
    followed by a vowel (e.g. xa, xaver, xerxes)
  • x[aeiou]*[abc]*x  any name beginning with an 'x'
    followed by a vowel, then any characters (or
    none), then 'a' or 'b' or 'c', then any
    characters (none or one or more), and ending with
    an 'x' (e.g. xabx, xanax, xenmnaqwx).
  • *.*  any name containing a period

34
Wildcards (more)
  • *  any sequence of zero or more characters
  • ?  denotes a single character
  • [cset]  any single character in the cset
  • [ - ]   range
  • [! ]    NOT

35
Examples
  • [A-Z]*       any name beginning with a capital letter.
  • [1-9]*       any non-zero number
  • ????         any four-character-long name
  • ????*        any at least four-character-long name
  • ???*[0-9.x]  any at least four-character-long
    name ending with a numeral, a period or an 'x'
  • [!T]*        any name not beginning with a capital
    T.

36
When do you use them?
  • mv tekstmanip* alltekstmanip/
  • move everything that starts with tekstmanip to
    the tekstmanip directory
  • ls ../*.pdf  list all files in the parent directory
    that have the pdf extension

37
End of week 1. See you at the practicum!!
38
metacharacters
  • Let's try to calculate how much is (3 + 4) * 7?
  • expr ( 3 + 4 ) * 7   it does not work
  • The problem is that some characters have special
    meanings; those characters are called
    metacharacters.
  • We have seen so far the special meanings of the
    characters '*', '?', '[' and ']'.
  • And space also has a special meaning: it is the
    delimiter between two arguments of a command,
    therefore a file name containing a space will
    also cause problems under all versions of Unix.
    Use _ instead.

39
escapes quotes
  • There are two types of neutralizing characters.
  • The escape character \ (backslash) neutralizes the
    next character
  • The two types of quotes ('...' and "...")
    neutralize all the metacharacters within them
    (simplifying)
  • echo what does * mean ?
  • echo what does \* mean \?
  • echo what does '*' mean '?'
  • echo what does "*" mean "?"

40
Examples with expr
  • expr 3 \* 7
  • expr 3 '*' 7
  • expr "3 * 7"   error!
  • expr ( 3 + 4 ) * 7   error!
  • expr \( 3 + 4 \) "*" 7   (see the check below)

41
File filters
  • We've seen cat: it prints the content of a file
    to the screen
  • Instead of just forwarding, you can filter the
    content of a file.
  • head     outputs the first part of a file (first
    10 lines by default); head -c N prints the
    first N bytes; head -n N gives the first N
    lines
  • tail     outputs the last part of a file (last 10
    lines by default)
  • tail -c N     prints the last N bytes, etc.

42
More file filters
  • rev  reverses the characters on each line of a file
  • wc prints line, word, and character counts
  • sort sorts FILE to standard output.
    -r  reverse the result of comparisons
    -n  sort numerically
  • uniq removes the duplicate lines from a
    sorted file
    -c  puts a prefix before each line, giving the
    number of occurrences

43
Input/output
  • standard input is the keyboard (the things you
    type)
  • standard output is the screen
  • Unless you specify standard input/output to be,
    for instance, a file
  • You do that with > , >> and <
  • <    the input is taken from the specified
    file
  • >    the output goes to the specified
    file (if it exists then it will be overwritten)
  • >>   the output goes to the specified file
    (if it already exists then the output is appended
    to its previous content); see the small
    illustration below

44
Input/output
[Diagram: a single command with its options (e.g. -l or -q), arguments and a filename; it reads from stdin, writes to stdout, and also produces a return value and standard error]
45
Input/output
[Diagram: two commands connected by a pipe; the stdout of command 1 becomes the stdin of command 2; each command has its own options, arguments, return value and standard error]
46
Input/output
  • If you want to use the output of a command as the
    input of another, use a | (pipe)
  • cal 2005 | wc
  • If you want to use the output of a command as an
    argument for another command: `...` (back
    quotation marks)
  • example............................
  • If you want to combine commands use (......)
  • example...............   (possible fill-ins below)
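  • Two possible fill-ins for the examples above (sketches, not the original slide content; test is an assumed file name):
      echo There are `who | wc -l` users logged in       # back quotes: the output of who | wc -l becomes an argument of echo
      (head -1 test; tail -1 test) > first_and_last      # ( ... ): the combined output of both commands goes into one file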

47
tr (translate)
  • There is more you can do with files: making
    changes in the content with tr.
  • tr k Q < test   replaces each k in test with a Q
  • tr kz KZ < test  replaces k with K and z with Z
  • Options
  • -d deletes all the tokens of the characters in
    set, or of its complement if -c is added as well
  • -s squeezes all repetitions of characters in
    set into a single character.

48
Examples with tr
  • tr -s ' '    all sequences of spaces are
    condensed into a single space
  • tr -d -c '0-9'   remove all non-numerical
    characters (-c complement of set1)
  • A very handy property of tr is that you can refer
    to the ASCII code of characters through it. We
    shall use it very often. The way to do it is
    '\XXX', where the quotation marks help to escape
    the \ character.
  • tr '\012' '@' < test

49
First real application a word list
  • You can make an alphabetical list or a frequency
    list or just a word list.
  • Why?
  • A word list shows the word usage
  • type/token ratio measures the number of different
    words divided by the number of words in total =>
    shows diversity
  • It will be different from person to person and it
    also differs according to age

50
How to make an alphabetical word list?
  • change all capitals to lower-case
  • every word should be on one line
  • remove doubles
  • sort alphabetically

51
How to make an alphabetical word list?
  • change all capitals to lower-case
  • tr 'A-Z' 'a-z' < test
  • every word should be on one line and squeeze (-s)
    repeats of a space
  • tr -s ' \011\012' '\012' < test
  • sorting: sort test
  • remove doubles: uniq

52
How to make a frequency word list?
  • We can put all this in one command by using pipes
  • cat test | tr 'A-Z' 'a-z' |
    tr -s ' \011\012' '\012' | sort | uniq
  • if you want frequencies
  • uniq -c (count)
  • Then if you want to sort according to frequency
  • sort -n (numerical)
  • And if you want the highest frequency at the top
  • sort -nr (numerical and reverse); the full
    pipeline is spelled out below
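  • Putting the pieces together, a sketch of the complete frequency-list pipeline (assuming the text is in a file called test):
      cat test | tr 'A-Z' 'a-z' | tr -s ' \011\012' '\012' | sort | uniq -c | sort -nr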

53
Zipf's law

54
Zipf's Law
  • Zipf's law: the observation of Harvard linguist
    George Kingsley Zipf that the frequency of use of
    the nth-most-frequently-used word in any natural
    language is approximately inversely proportional
    to n.
  • So, the second most common word will occur
    1/2 as often as the first. The third most common
    word will occur 1/3 as often as the first.
  • There are only very few words that occur often.
    There are very many words that occur infrequently.

55
Regular expressions (grep)
  • We've seen wildcards
  • Regular expressions build on the same idea, but
    with important differences.
  • To show how regular expressions work:
  • grep <reg_exp> <file>   It will return the lines of
    the given file that match the regular expression.
  • grep a test   It will return the lines from
    file test that contain an 'a'

56
Regular expressions (grep)
  • Options for grep
  • -c    returns the number of lines matching the
    given regular expression
  • -i    ignore case distinction: does not
    differentiate between capital and lowercase
    letters
  • -v    inverse: returns those lines that don't
    match the condition

57
Regular expressions
  • the dot . matches any character (as ? in
    wildcards)
  • grep a..le   matches lines with apple etc.
  • [ ]   any character within the brackets
  • grep [bf]all
  • > will match fall and ball
  • [a-z] interval of characters: grep [a-z] matches
    any line that contains a lowercase alphabetical character.
  • [^ ]  complement of the listed characters
    (anything except those): grep [^ab] filename
  • matches any line that contains a character other
    than a or b

58
Regular expressions
  • Repetition of some sets
  • *   Kleene-star (Kleene closure): the
    repetition of the expression before it, any number
    of times (even 0 times): a* matches '' a aaaa
  • grep b*a   > matches lines with bbbbba, ba,
    bbbbbbbbbbba etc.
  • grep l.*b  > matches lines that have a sequence
    that starts with an 'l', then a sequence of any
    characters, and then a 'b'

59
Regular expressions
  • Position within the line
  • ^   beginning of the line (only at the
    beginning, otherwise it matches itself)
  • Example: grep ^[aeiou] filename
  • $   end of the line (at the end of the
    outermost expression, otherwise it matches
    itself)
  • Example: grep [aeiou]$ filename

60
More examples
  • [oai]n          either 'on' or 'an' or 'in'
    [0-9][0-9]      two consecutive digits
    ^[aeiou]        a vowel at the beginning of the line
    ^.[aeiou]       a vowel at the second position of a line
    ^[aeiou]$       a line consisting of exactly a vowel
    [^0-9]          anything but a digit
    [^0-9]$         a line not ending with a digit
    ^[d\-]          a line beginning with a 'd' or a '-'
    abb*            an 'a', followed by one or more b's
  • [0-9][0-9]*     one or more digits

61
End of week 2. See you at the practicum!
62
Frequencies
  • We have made word lists in the practicum.
  • What if we want a frequency list?
  • The same but give uniq the -c option (for
    counts).
  • What you get is a list of words with the
    frequency attached to it. That is the absolute
    frequency.
  • You can get the relative frequency of a word by
    dividing the absolute frequency by the total
    number of words in the corpus under
    consideration. The frequency is then relative to
    the corpus size.

63
Type/token ratio (TTR)
  • The type/token ratio gives an idea of the
    richness in vocabulary.
  • How many words does this text have?
  • Do you mean tokens or types?
  • Number of tokens = size of the corpus
  • Number of types = size of the vocabulary of the
    corpus
  • type/token ratio = number of types divided by
    number of tokens

64
Type/token ratio (TTR) (cont'd)
  • TTR is very different for texts of different
    length.
  • Longer texts tend to have lower TTR values. Why?
  • If you are working with texts of different
    length, you have to use some form of
    normalization.
  • For example split your texts up in texts of 1000
    words and calculate the average TTR.
  • Or compare TTR versus text length

65
How to compute type/token ratio?
  • Tokens
  • tr 'A-Z' 'a-z' |
  • tr -d '[:punct:]' |
  • tr -s ' \012' '\012' | wc -l
  • Types
  • tr 'A-Z' 'a-z' |
  • tr -d '[:punct:]' |
  • tr -s ' \012' '\012' |
  • sort | uniq | wc -l
  • NB to count tokens, what if you use wc -w?
    (a combined sketch follows below)
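  • One possible way to put the two counts together (a sketch; assumes the text is in a file called test and uses the back quotes from slide 46):
      tokens=`tr 'A-Z' 'a-z' < test | tr -d '[:punct:]' | tr -s ' \012' '\012' | wc -l`
      types=`tr 'A-Z' 'a-z' < test | tr -d '[:punct:]' | tr -s ' \012' '\012' | sort | uniq | wc -l`
      echo type/token ratio: $types / $tokens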

66
Sed
  • sed  is  a stream editor.  A stream editor is
    used to perform basic text transformations on an
    input stream.
  • You might think: hey, didn't we see that already
    with tr? Differences:
  • sed works on string level (regexps),
  • tr works on character level.

67
Sed
  • sed 's/regex/newstring/'
  • rewrite the first string on each line that matches
    the given regular expression.
  • sed 's/regex/newstring/g'
  • replace all instances of regex ('g' for global).

68
Sed
  • sed '/regex/d'
  • This rule will delete all lines that include
    regex
  • sed 's/regex//'
  • replacing with the empty string.
  • The & stands for the string that has been
    matched.
  • echo "ik ben jan" | sed 's/jan/&-willem/'
  • escape metacharacters when necessary
  • sed 's/\//a/'   will replace / with a

69
Sed
  • If you want to put more than one operation into
    the command line (or you have a script file with
    at least one operation in the command line), use
    the -e option before each operation of the
    command line
  • sed -e '/Henry/d' -e 's/Smith/White/g'

70
selecting columns (cut)
  • 0513678 John   8
    0612942 Kathy  7
    0418365 Pieter 6
    0539482 Judith 9
  • If you want to remove the names
  • cut -c1-8,16-18 grades

71
columns with delimiter
  • 0513678 John   8
    0612942 Kathy  7
    0418365 Pieter 6
    0539482 Judith 9
  • cut -d ' ' -f 2
  • The -d option of the command cut defines what the
    delimiter is, and you can refer to a field by
    giving its number after the option -f.

72
merging columns(paste)
[Diagram: cat puts files one below the other; paste puts them next to each other as columns]
  • paste -d <char> file file ...
  • Now how can you change the order of columns of a
    given file 'info'? Combining cut and paste
  • cut -c1-8 info > name
  • cut -c9-14 info > birthdate
  • cut -c15-40 info > address
  • paste address birthdate name > new_info



73
Second application N-grams
  • An N-gram is like a moving window over a text,
    where N is the number of elements in the window
    (bigrams, trigrams etc).
  • Why do we need N-grams?
  • There is more information in combinations of two
    or more things, than in single elements.
  • Language guesser: a lot of languages share the
    same alphabet, but certain combinations are typical
    for one language (aa for NL, sh for EN).
    > http://www.let.rug.nl/~vannoord/TextCat/Demo/
  • text classification

74
Making a bigram at word level
  • Make a list of words starting from position 1
  • Make a list of words starting from position 2
  • paste these two lists
  • Making a list with tr
  • tr 'A-Z' 'a-z' | tr -s ' ' '\012' > list
  • Making a list starting at position 2
  • tail -n +2 < list > listplus2
  • paste list listplus2 | sort | uniq -c | sort -nr

75
Punctuation and bigrams on word level
  • Remove punctuation?
  • John put on his hat and slept. Penguins were
    walking.
  • John put, put on, on his, his hat are all
    combinations that can be found in English.
  • slept penguins is not a common combination. It is
    better to keep the line breaks / punctuation...

76
Collocation
  • Within the area of corpus linguistics,
    collocation is defined as a pair of words (the
    'node' and the 'collocate') which co-occur more
    often than would be expected by chance. (from
    Wikipedia)
  • A collocation is any turn of phrase or accepted
    usage where somehow the whole is perceived to
    have an existence beyond the sum of the parts.
    (from Manning and Schütze, 1999)
  • Hard to determine the boundary between
    collocations and other frequent word combinations
    (co-occurrences).

77
Examples of collocations
  • Middle East
  • President Bush
  • real estate
  • brute force
  • take pictures (not make pictures)
  • do a favour (not make a favour)

78
Collocation of more than two words
  • the term 'collocations' is also used for > 2
    words
  • (he) kicked the bucket
  • (hij heeft) de pijp aan Maarten gegeven
  • (hij heeft zonder commentaar) het veld geruimd

79
Collocations (common features)
  • Non-compositionality: The meaning of a
    collocation is not a straightforward composition
    of the meaning of its parts. For example, the
    meaning of kick the bucket has nothing to do with
    kicking buckets. (it means 'to die')
  • Non-substitutability: We cannot substitute a word
    in a collocation with a related word. For
    example, we cannot say yellow wine instead of
    white wine although both yellow and white are the
    names of colors.
  • Limited modifiability: Adding modifiers or applying
    syntactic transformations is not always possible.
    John kicked the green bucket or the bucket was
    kicked has nothing to do with dying.

80
Collocation check
  • A trick that often works is to translate a
    combination of words literally (word by word) in
    another language and see if that works.

81
Collocations
  • words in a collocation do not have to be adjacent
  • She is under a lot of pressure
  • They made it up to him
  • Hij gaat problemen altijd uit de weg
  • Special type: institutionalised phrases (phrases
    that are syntactically/semantically compositional
    but whose co-occurrence is conventionalised)
  • strong tea vs. powerful tea
  • powerful computer vs. strong computer

82
How to find collocations automatically?
  • Using n-gram frequency counts?
  • We learned how to make n-grams
  • Perhaps the most frequent n-grams are
    collocations?

83
Problems with n-gram frequencies
  • We get many frequent, but uninteresting
    combinations (from Federalist papers)
  • 4021 of the
  • 1494 to the
  • 1440 in the
  • 1174 Z the
  • 872 to be
  • 676 that the
  • 612 by the
  • 608 it is

84
Remedy use linguistic info
  • If we have a corpus that has part-of-speech tags,
  • we can look for the combination of an adjective and
    a noun only, or noun + noun, for English
  • Justeson and Katz (1995) have applied a
    part-of-speech filter to identify likely
    collocations.
  • Results are surprisingly good for such a simple
    method.
  • First 100 bigrams of the Federalist papers, checked
    on NN or AN
  • 205 new york
  • 193 united states

85
How to do it using statistics (t test)
  • What we want to know is whether two words occur
    more often together than expected by chance.
  • 'on' occurs frequently, 'the' occurs frequently.
  • 'on the' occurs frequently.
  • The question is: does it occur more frequently
    than expected by chance?
  • We can use a t test to determine that

86
Other methods for hypothesis testing
  • Pearson's chi-square test
  • Does not assume a normal distribution, unlike the
    t test.
  • Can be used to determine corpus similarity.
  • Likelihood ratios
  • More appropriate for sparse data.
  • Easier to interpret: it gives a number that says
    how much more likely one hypothesis is than the other.
  • More info in Manning and Schütze (1999), chapter 5

87
Relative frequency ratios
  • We already learned what relative frequencies are.
  • We calculate the ratio of the relative frequency
    of some bigram/word in two different corpora.
  • That way we can see for example what celebrity
    was particularly hot in 2007 compared to 2006.

88
Another statistical test: point-wise mutual information
  • Compare the frequency of the bigram with the
    expected frequency, assuming the elements of the
    bigram are independent
  • The usual definition is in terms of
    probabilities; we will use relative frequencies
    instead (the formula is given below)
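  • For reference (not on the slide): in the usual probability form,
    PMI(w1,w2) = log2( P(w1,w2) / ( P(w1) * P(w2) ) );
    with counts f and corpus size N this becomes
    log2( ( f(w1,w2) / N ) / ( ( f(w1) / N ) * ( f(w2) / N ) ) ).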

89
Mutual information
  • Sometimes also used to find if words like to
    co-occur
  • f(w1): how many sentences contain w1
  • f(w1,w2): how many sentences contain both w1 and
    w2

90
Making a bigram at character level
  • Now you want a list of characters. What you do
    is:
  • lowercase all letters
  • tr 'A-Z' 'a-z' < filename
  • replace each character with itself + a newline
  • sed 's/./&\n/g' filename > list
  • The rest is the same as for the bigram at word
    level.

91
Making a bigram at character level (cont'd)
  • But there are many empty lines. So better:
  • tr 'A-Z' 'a-z' |
  • tr -d '\012' |
  • tr -s ' ' |
  • sed 's/./&\n/g'
  • First delete all newlines, then squeeze all
    repeats of a space. (A full pipeline sketch follows below.)
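  • An end-to-end version, analogous to the word-level bigrams (a sketch; assumes the text is in a file called test and a GNU-style sed that accepts \n in the replacement):
      tr 'A-Z' 'a-z' < test | tr -d '\012' | tr -s ' ' | sed 's/./&\n/g' > chars
      tail -n +2 < chars > charsplus1
      paste chars charsplus1 | sort | uniq -c | sort -nr | head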

92
Concordance
  • A concordance is an alphabetical list of the
    principal words used in a book or body of work,
    with their immediate contexts. (From Wikipedia,
    the free encyclopedia)
  • historical concordances (manually built, years of
    work!)
  • The Bible
  • The Quran
  • The works of Shakespeare

93
Cruden's Concordance (1736) of the Bible
  • dry ground
  • behold the face of the ground was d. Gen 8:13
  • Israel shall go on d. ground in the sea Ex 14:16,
    22
  • stood firm on d. ground in the sea Josh 3:17
  • Elijah and Elisha went over on d. ground 2 Ki 2:8
  • he turneth water-springs into d. ground Ps 107:33
  • he turneth d. ground into water-springs 35
  • I will pour floods upon the d. ground Isa 44:3
  • He shall grow as a root out of a d. ground 53:2
  • She is planted in a d. and thirsty ground Ezek
    19:13

94
Keyword in Context (KWIC)
  • Volkskrant 97
  • terwijl het nettoloon op peil blijft .
  • en eigenliefde scoren beneden peil .
  • club en het internationale peil .
  • aandelenkoersen op het laagste peil van de
    afgelopen drie
  • blijft op het huidige peil van
    zestigduizend gulden ,
  • om de accommodatie op peil te brengen .
  • Can be done automatically, with the UNIX
    utilities we have learned. We will do that in the
    practicum.

95
Why are concordances/KWIC so interesting?
  • language use in context
  • word senses defined by their surrounding context
  • qualitative analyses of concordance lines
  • frequencies can be calculated from concordances
  • intuitions/hypotheses can be validated
  • new hypotheses can be formulated from
    concordances
  • lexicographers love KWIC tools!

96
Third application: KWIC
  • KWIC (Keyword in context): a table with the left
    and right context of a word
  • KWICs are handy, for example, for translators.
  • Example
  • 'verband' would translate to bandage, but if we
    look at the context
  • 'in verband met' we don't want to translate it as
    in bandage with

97
How do we make a KWIC?
  • grep to select lines (this already is a KWIC in
    fact)
  • cut to select context
  • sed + cut to give the contexts all the same length
  • paste to combine contexts

98
KWIC -the commands
  • grep itself test | sed -e 's/Eeen//' > lines
  • cut -d -f 1 < lines > before
  • cut -d -f 2 < lines > itself
  • cut -d -f 3 < lines > after
  • sed -e 's// /' < before > before2
  • sed -e 's// /' < after > after2
  • cut -c 1-30 < after2 > after3
  • rev before2 | cut -c 1-30 | rev > before3
  • paste before3 itself after3   (see the filled-in
    sketch below)
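  • One possible way to fill in the missing pieces of the commands above (a sketch, not the original slide; it assumes the keyword is itself, the text is in the file test, and '#' is used as the field delimiter):
      grep itself test | sed -e 's/itself/#itself#/' > lines    # mark the keyword so cut can split each line into 3 fields
      cut -d '#' -f 1 < lines > before
      cut -d '#' -f 2 < lines > itself
      cut -d '#' -f 3 < lines > after
      sed -e 's/$/                              /' < after > after2    # pad the right context with spaces
      cut -c 1-30 < after2 > after3
      rev before | sed -e 's/$/                              /' | cut -c 1-30 | rev > before3    # same trick, reversed, for the left context
      paste before3 itself after3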

99
The end of week 3, see you at the test
100
Reading
  • Read the following 4 pages BEFORE tomorrow's
    practicum
  • p. 162 - p. 166, section 5.3 until 5.3.2
  • Manning, C. and Schütze, H. 1999. Foundations of
    Statistical Natural Language Processing. Ch 5
  • Available from
  • http://nlp.stanford.edu/fsnlp/promo/colloc.pdf

101
References
  • Justeson, J. S. and Katz, S. M. 1995. Technical
    terminology: some linguistic properties and an
    algorithm for identification in text. Natural
    Language Engineering 1: 9-27
  • Manning, C. and Schütze, H. 1999. Foundations of
    Statistical Natural Language Processing. Ch 5
  • Available from
  • http://nlp.stanford.edu/fsnlp/promo/colloc.pdf

102
variables
  • list of variables by typing set
  • better: set | grep PATH
  • examples of variables
  • PATH: a set of paths that are checked when you
    give a command
  • HOME: the path to the home directory of the user
  • PWD: the working directory
  • You can set variables by typing
    SOME_VARIABLE=some_value
  • to let the shell know you are talking about the
    variable: echo $SOME_VARIABLE

103
Annotations
  • To annotate = add extra data to the corpus
  • Extra-textual information / metadata (title,
    author, date of creation)
  • Linguistic information
  • word class (Part-of-speech) (noun/verb/adj/...)
  • lemma
  • syntactic info
  • semantic
  • phonetic
  • ...

104
SUSANNE corpus
Ref         Status  POS     wordForm  Lemma
N12:0510g   -       PPHS1m  He        he
N12:0510h   -       VVDv    studied   study
N12:0510i   -       AT      the       the
N12:0510j   -       NN1c    problem   problem
N12:0510k   -       IF      for       for
N12:0510m   -       DD221   a         a
N12:0510n   -       DD222   few       few
N12:0510p   -       NNT2    seconds   second
N12:0520a   -       CC      and       and
N12:0520b   -       VVDv    thought   think
....
105
Syntactic annotation
[S [NP Claudia NP] [VP sat [PP on [NP a stone NP] PP] VP] S]
106
Syntactic annotation in Alpino
<node rel="whd" index="1" frame="wh_tmp_adverb"
      pos="adv" begin="0" end="1" root="wanneer"
      word="Wanneer" wh="ywh" special="tmp"/>
<node rel="body" cat="sv1" begin="0" end="6">
  <node rel="mod" index="1"/>
  <node rel="hd"
        frame="verb(hebben,past(sg),part_intransitive(plaats))"
        pos="verb" begin="1" end="2"
        root="vind_plaats" word="vond"
        sc="part_intransitive(plaats)" infl="past(sg)"/>
  <node rel="su" cat="np" begin="2" end="5">
    <node rel="det" frame="determiner(de)"
          pos="det" begin="2" end="3" root="de"
          word="de" infl="de"/>
    <node rel="mod" frame="adjective(e,adv)" pos="adj"
          begin="3" end="4" root="Duits" word="Duitse"
          infl="e"/>
    <node rel="hd" frame="noun(de,count,sg)" pos="noun"
          begin="4" end="5" root="hereniging"
          word="hereniging" gen="de" num="sg"/>
  </node>
  <node rel="svp" frame="particle(plaats)"
        pos="part" begin="5" end="6"
        root="plaats" word="plaats"/>
</node>
</node>
107
XML
  • XML = EXtensible Markup Language
  • XML is a markup language much like HTML
  • XML tags are not predefined. You must define your
    own tags
  • XML was created to structure, store and send
    information

108
Shell scripts
  • Wouldn't it be nice to save our often long
    commands in files?
  • cat > a_simple_shell_script
  • echo Now I will list the subdirectories of the
    directories whose name contains exactly 4
    characters.
  • ls -l ???? | grep ^d
  • echo Thank you for your waiting.
  • echo What about an alphabetical order of these?
  • ls -l ???? | grep ^d | sort
  • echo Here you have it.
  • ctrl-d

109
Links
  • pointing to the same file from different places
    and/or with different names
  • ln <existing-file> <new_name>
  • In a long listing you can see the number of hard
    links
  • For the difference between hard and soft (-s)
    links, see Tamas' files

110
Protocols
  • Imagine you are at home and you want to log in to
    one of the computers in the UNIX room.
  • Protocols are standards of communication between
    different systems that may be far away from
    each other and operate in different ways
  • telnet or ssh (Secure Shell Protocol), but no
    graphical interface
  • ftp (File Transfer Protocol) to transfer files
    from one machine to another.

111
shell scripts
  • To let the shell know that this is a program that
    we want to execute, make the file executable:
  • chmod +x
  • Imagine you are going to make loads of them; then
    you want to store them in a dir
  • To let the system know the place to look for this
    command when you type it, add it to your PATH
    variable
  • PATH=$PATH:$HOME/shellscripts

112
shell scripts
  • Now what if you want to give arguments? You
    don't always want to look for dirs with 4
    characters; sometimes 3, and you don't want to
    change your programme all the time.
  • refer with $1, $2 to the first and second word
    after the script's name.
  • ls -l ???? | grep ^d   becomes   ls -l $1 | grep ^d
  • and you type in
  • a_simple_shell_script '???'   (a fuller sketch
    follows below)
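  • Put together, the script might look like this (a sketch; the echoed text is made up):
      echo Directories matching $1, and their subdirectories:
      ls -l $1 | grep ^d | sort
    and you would then call it as, for instance, a_simple_shell_script '???'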

113
shell scripts
  • control structures
  • case
  • if
  • for
  • while
  • until

114
shell scripts
  • case <selector> in
  • <value1> ) <commands1> ;;
  • <value2> ) <commands2> ;;
  • <value3> ) <commands3> ;;
  • ...
  • <valueN> ) <commandsN> ;;
  • esac   (a small concrete example follows below)
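  • A small concrete case (a sketch; assumes the script receives one argument):
      case $1 in
        -l ) ls -l ;;
        -a ) ls -a ;;
        *  ) ls ;;
      esac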

115
if
  • if <commands1> then <commands2> else
    <commands3> fi   (example below)
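  • For example (a sketch; the file name logfile is made up):
      if grep -q error logfile
      then echo errors found
      else echo no errors
      fi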

116
for, while, until
  • for <variable> in <list> do <command command
    ...> done
  • while <command command ... > do <command
    command ..> done
  • until <command command ...> do <command command
    ... > done   (a small loop example follows below)
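  • A small concrete loop (a sketch; assumes some .txt files exist in the working directory):
      for f in *.txt
      do
        echo $f
        wc -l $f
      done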

117
How to kill a process
  • Imagine you are running a shell-script but it
    goes into an infinite loop and you want to stop it.
  • type ps (it will show you all your processes)
  • check the process ID of the process to be killed
  • kill -9 <process_id>

118
N-gram-based text categorization
  • Based on the N-grams found in a text, a company
    can automatically categorize texts:
  • by subject (what is it about), comparing the most
    typical words,
  • or by language (where is it from), using N-grams
    at character level.
  • Article about this on the website under 'Literature':
  • William B. Cavnar and John M. Trenkle,
    N-gram-Based Text Categorization

119
Permissions (ls -l)
  • very first character: '-' for a simple file, 'd'
    for a directory, 'l' for a symbolic link, etc.
  • 3 times 3 characters: permissions for user, group
    and others; r permission to read, w permission
    to write, x permission to execute
  • number of links belonging to this file
  • user and the group owning the file
  • size of the file in characters ('total' at the
    top: total number of disk blocks occupied by the
    listed files)
  • Date and time when the file was last modified

120
Permissions (chmod)
  • chmod: changing the permissions of a file.
  • u user, g group, o others, a all (also
    ug, uo etc.)
  • + give permission, - remove permission; r
    to read, w to write, x to execute
  • chmod g+w filename    or    chmod o-x filename
  • 4 permission to read, 2 permission to write,
    1 permission to execute
  • chmod 753 filename
  • rwx to owner, r-x to group and -wx to others.

121
Coding different alphabets
  • Three levels
  • keyboard layout: which key gives which character
    (QWERTY, QWERTZ, AZERTY)
  • code page: a number (signal from the keyboard) is
    translated to a character (on your screen)
  • font: the exact graphical image (e.g. Times New
    Roman)
  • Be aware of behavior of intervening programs
    (shell, editor, xterm, less, cat)

122
Character encoding
  • How to store characters in bits and bytes.
  • Each character has a unique code.
  • 1 bit has two states (0/1).
  • 8 bits (1 byte) can store 256 values.
  • 16 bits can store 65,536 values.
  • Different languages require different encoding
    tables depending on the alphabet. English (a
    language without diacritics) can store all
    characters needed in just one byte.

123
Character encoding, historical perspective
  • At first only the English alphabet: 7 bits was
    enough (ASCII)
  • Most computers had 8 bits.
  • With the last bit, lots of people from different
    parts of the world did different things (codes
    128-255).
  • On some PCs the character code 130 would display
    as é, but on computers sold in Israel it was the
    Hebrew letter Gimel (ג).
  • The ANSI standard: they agreed on what to do with
    the codes below 128; the rest was up to where you
    lived.

124
Character encoding, historical perspective
  • Then came Unicode.
  • All languages in one system, so possible to have
    document with different languages.
  • UTF-8 is a system for storing your string of
    Unicode code points in memory. In UTF-8, every
    code point from 0-127 is stored in a single byte.
    Only code points 128 and above are stored using
    2, 3, in fact, up to 6 bytes.
  • English text looks the same in UTF-8 as it does
    in ASCII.
  • Latin-1 is useful for any Western European
    language, but not for Russian or Hebrew.

125
What to remember about encoding
  • Be aware of encoding issues.
  • UTF-8 is more and more common; if you use this,
    then no one will question it
  • For some languages, latin1 encoding is OK too