Title: Programming Interest Group http://www.comp.hkbu.edu.hk/~chxw/pig/index.htm
1Programming Interest Group
http://www.comp.hkbu.edu.hk/~chxw/pig/index.htm
- Tutorial Three
- Strings
- Sorting
2Character Codes
- Character codes are mappings between numbers and the symbols which make up a particular alphabet.
- ASCII: American Standard Code for Information Interchange
  - A single-byte character code
  - 128 characters are specified
  - The highest-order bit is left as zero
- Unicode
  - Two bytes per character in the original 16-bit design (a Java char is one 16-bit code unit)
  - Natively supported by Java
3ASCII Code
4Some Properties about ASCII
- Uppercase letters, lowercase letters, and numerical digits appear sequentially.
- To iterate through all the lowercase letters:
  - for (ch = 'a'; ch <= 'z'; ch++)
- Is character ch uppercase?
  - (ch >= 'A') && (ch <= 'Z')
- Convert uppercase character ch to lowercase:
  - ch - ('A' - 'a')
5Strings
- Strings are sequences of characters.
- Different programming languages may have different representations!
- C/C++
  - Null-terminated array: the string ends with the null character '\0'.
  - A large enough array must be allocated to hold the largest possible string (plus the null).
- Java
  - Array plus length
6Manipulating Strings
- The length of the string
- Copy a string
- Reverse a string
- Concatenate two strings
- Search a character in a string
- Search a string in a string
- String matching problem (or string searching
problem)
7String Matching Problem
- String matching algorithms are also used to search for particular patterns in DNA sequences.
- E.g., find the location of pattern P = abaa in the text T = abcabaabcabac
8String Matching Algorithms
- http://en.wikipedia.org/wiki/String_searching_algorithm
- Given text T with length n, pattern P with length m
- Naïve algorithm
  - Matching time O((n-m+1)m)
- Rabin-Karp algorithm
  - Preprocessing time Θ(m)
  - Matching time O((n-m+1)m) in the worst case
- Knuth-Morris-Pratt (or KMP) algorithm
  - Preprocessing time Θ(m)
  - Matching time Θ(n)
- Boyer-Moore (or BM) algorithm
  - Preprocessing time Θ(m), plus the alphabet size for the bad-character table
  - Matching time: worst case O(n) with the Galil rule (O(nm) without it), best case Ω(n/m)
9C String Library Functions
#include <ctype.h>
int isalpha(int c);
int isupper(int c);
int islower(int c);
int isdigit(int c);
int ispunct(int c);
int isxdigit(int c);
int isprint(int c);
int toupper(int c);
int tolower(int c);

#include <string.h>
char *strcat(char *dst, const char *src);
char *strncat(char *dst, const char *src, size_t n);
int strcmp(const char *s1, const char *s2);
int strncmp(const char *s1, const char *s2, size_t n);
char *strcpy(char *dst, const char *src);
char *strncpy(char *dst, const char *src, size_t n);
size_t strlen(const char *s);
char *strstr(const char *s1, const char *s2);
char *strtok(char *s1, const char *s2);
10C++ String Library functions
- C++ supports the C-style strings
- C++ also has a string class
string::size()
string::empty()
string::append(s)
string::erase(n, m)
string::insert(size_type n, const string& s)
string::find(s)
string::rfind(s)
11Java String Objects
- String class: java.lang.String
  - http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html
  - In Java, strings are constant; their values cannot be changed after they are created.
- StringBuffer class: java.lang.StringBuffer
  - http://java.sun.com/j2se/1.4.2/docs/api/java/lang/StringBuffer.html
  - A string buffer is like a String, but can be modified. At any point in time it contains some particular sequence of characters, but the length and content of the sequence can be changed through certain method calls.
12Example Corporate Renaming
- Corporate name changes are occurring with ever greater frequency, as companies merge, buy each other out, or try to hide from bad publicity. These changes make it difficult to figure out the current name of a company when reading old documents.
- Your company, Digiscam, has put you to work on a program which maintains a database of corporate name changes and does the appropriate substitutions to bring old documents up to date.
- Your program should take as input a file with a given number of corporate name changes, followed by a given number of lines of text for you to correct. Only exact matches of the string should be replaced.
- There will be at most 100 corporate changes, and each line of text is at most 1,000 characters long.
13Sample Input and Output
- 4
- "Anderson Consulting" to "Accenture"
- "Enron" to "Dynegy"
- "DEC" to "Compaq"
- "TWA" to "American"
- 5
- Anderson Accounting begat Anderson Consulting, which
- offered advice to Enron before it DECLARED bankruptcy,
- which made Anderson
- Consulting quite happy it changed its name
- in the first place!
- Output
- Anderson Accounting begat Accenture, which
- offered advice to Dynegy before it CompaqLARED bankruptcy,
- which made Anderson
- Consulting quite happy it changed its name
- in the first place!
14Required String Operations
- Read strings
- Store strings
- Search strings for patterns
- Modify strings
- Print strings
15Read and Store
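#include <stdio.h>
#include <string.h>

#define MAXLEN     1001   /* longest possible string */
#define MAXCHANGES 101    /* maximum number of name changes */

typedef char string[MAXLEN];

string mergers[MAXCHANGES][2];   /* store before/after corporate names */
int nmergers;                    /* number of different name changes */

void read_quoted_string(char *s);

void read_changes(void)
{
    int i;

    scanf("%d\n", &nmergers);
    for (i = 0; i < nmergers; i++) {
        read_quoted_string(mergers[i][0]);
        read_quoted_string(mergers[i][1]);
    }
}

/* Copy the text between the next pair of double quotes into s. */
void read_quoted_string(char *s)
{
    int i = 0;
    int c;      /* int rather than char, so that EOF can be detected */

    while ((c = getchar()) != '"' && c != EOF)
        ;       /* skip to the opening quote */
    while ((c = getchar()) != '"' && c != EOF) {
        s[i] = c;
        i = i + 1;
    }
    s[i] = '\0';
}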
16Searching for Patterns
Return the position of the first occurrence of the pattern p in the text t, and -1 if it does not occur.

int findmatch(char *p, char *t)
{
    int i, j;           /* counters */
    int plen, tlen;     /* string lengths */

    plen = strlen(p);
    tlen = strlen(t);

    for (i = 0; i <= (tlen - plen); i++) {
        j = 0;
        while ((j < plen) && (t[i + j] == p[j]))
            j = j + 1;
        if (j == plen)
            return i;
    }
    return -1;
}
17Manipulating Strings
Replace the substring of length xlen starting at position pos in string s with the contents of string y.

void replace_x_with_y(char *s, int pos, int xlen, char *y)
{
    int i;              /* counter */
    int slen, ylen;     /* lengths of relevant strings */

    slen = strlen(s);
    ylen = strlen(y);

    if (xlen >= ylen)   /* the tail (with its '\0') moves left */
        for (i = (pos + xlen); i <= slen; i++)
            s[i + (ylen - xlen)] = s[i];
    else                /* the tail moves right, copied back to front */
        for (i = slen; i >= (pos + xlen); i--)
            s[i + (ylen - xlen)] = s[i];

    for (i = 0; i < ylen; i++)
        s[pos + i] = y[i];
}
18Completing the Merger
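int main(void)
{
    string s;       /* input line */
    int c;          /* input character */
    int nlines;     /* number of lines in text */
    int i, j;       /* counters */
    int pos;        /* position of pattern in string */

    read_changes();
    scanf("%d\n", &nlines);
    for (i = 1; i <= nlines; i++) {
        /* read one line of text */
        j = 0;
        while ((c = getchar()) != '\n' && c != EOF) {
            s[j] = c;
            j = j + 1;
        }
        s[j] = '\0';

        for (j = 0; j < nmergers; j++)
            while ((pos = findmatch(mergers[j][0], s)) != -1)
                replace_x_with_y(s, pos, strlen(mergers[j][0]), mergers[j][1]);

        printf("%s\n", s);
    }
    return 0;
}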
19Practice
- http://acm.uva.es/p/v8/848.html
- http://acm.uva.es/p/v8/850.html
- http://acm.uva.es/p/v100/10010.html
- http://acm.uva.es/p/v100/10082.html
- http://acm.uva.es/p/v101/10132.html
- http://acm.uva.es/p/v101/10150.html
- http://acm.uva.es/p/v101/10188.html
- http://acm.uva.es/p/v102/10252.html
20Sorting
- Sorting is the most fundamental algorithmic problem in computer science.
- Internal sorting: the entire sort can be done in main memory (the input fits into main memory)
- External sorting: cannot be performed in main memory and must be done on disk or tape (the input is much too large to fit into memory)
- To see a list of sorting algorithms:
  - http://en.wikipedia.org/wiki/Sorting_algorithm
21Properties of sorting algorithms
- Computational complexity of element comparisons in terms of the size of the list
  - Worst case, best case, average case
  - Sort algorithms which only use an abstract key comparison operation always need Ω(n log n) comparisons on average.
- Memory usage
  - Some sorting algorithms are "in place", such that only O(1) or O(log n) memory is needed beyond the items being sorted, while others need to create auxiliary locations for data to be temporarily stored.
- Stability
  - Stable sorting algorithms maintain the relative order of records with equal keys (i.e. values). That is, a sorting algorithm is stable if whenever there are two records R and S with the same key and with R appearing before S in the original list, R will appear before S in the sorted list.
  - When equal elements are indistinguishable, such as with integers, stability is not an issue.
  - Unstable sorting algorithms may change the relative order of records with equal keys.
  - Unstable sorting algorithms can be specially implemented to be stable.
22Some simple algorithms
- Bubble sort (or sinking sort)
- Selection sort
- Insertion sort
  - Best case O(N)
  - Worst case O(N^2)
  - Average case Θ(N^2)
- All are O(N^2)
23Shellsort
- Proposed by Donald Shell in 1959.
- Increment sequence h1, h2, h3, ..., ht, used in reverse order, with h1 = 1.
- After a phase with increment hk, A[i] <= A[i + hk] for every i. All elements spaced hk apart are sorted.
- The action of an hk-sort is to perform an insertion sort on hk independent subarrays.
- The running time of Shellsort depends on the choice of increment sequence.
- The average-case running time of Shellsort, using Hibbard's increments, is thought to be O(N^(5/4)); the worst case is Θ(N^(3/2)).
- The average-case running time of Shellsort, using Sedgewick's increments, is conjectured to be O(N^(7/6)); the worst case is O(N^(4/3)).
- Shellsort is simple, and the performance is acceptable even for N in the tens of thousands.
24More complicated algorithms
- Mergesort
  - A good example of divide and conquer
  - Stable
- Heapsort
  - Makes use of the heap data structure
  - Unstable
- Running time O(N log N)
- Remark
  - Merge sort is the cornerstone of most external sorting algorithms
25Quicksort
- Quicksort is among the fastest known general-purpose sorting algorithms in practice.
- A divide-and-conquer recursive algorithm
- Average running time is O(N log N).
- Worst-case running time is O(N^2).
- Quicksort an array S:
  - If the number of elements in S is 0 or 1, return.
  - Pick any element v in S. This is called the pivot.
  - Partition S - {v} into two disjoint groups: S1 = {x ∈ S - {v} | x ≤ v}, and S2 = {x ∈ S - {v} | x ≥ v}.
  - Return quicksort(S1) followed by v followed by quicksort(S2).
- Efficient implementations of Quicksort are typically unstable.
- Details can be found in any data structures and algorithms textbook, or at http://en.wikipedia.org/wiki/Quicksort
26Non-comparison sorts
- Not limited by the Ω(n log n) lower bound
- Bucket sort
  - http://en.wikipedia.org/wiki/Bucket_sort
- Radix sort
  - http://en.wikipedia.org/wiki/Radix_sort
- Counting sort
  - http://en.wikipedia.org/wiki/Counting_sort
27Sorting Library Functions
Sort an array:
#include <stdlib.h>
void qsort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *));
This function sorts an array with nmemb elements pointed to by base, where each element is size bytes long.

Binary search:
#include <stdlib.h>
void *bsearch(const void *key, const void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *));
28qsort( ) example
#include <stdio.h>
#include <string.h>   /* strdup() is POSIX, not ISO C before C23 */
#include <stdlib.h>

void sortstrarr(void *array, unsigned n);
static int cmpr(const void *a, const void *b);

int main(void)
{
    char line[1024];
    char *line_array[1024];
    int i = 0;
    int j = 0;

    /* read up to 1024 lines, keeping a private copy of each */
    while ((fgets(line, 1024, stdin)) != NULL)
        if (i < 1024)
            line_array[i++] = strdup(line);
        else
            break;

    sortstrarr(line_array, i);

    while (j < i)
        printf("%s", line_array[j++]);

    return 0;
}

/* qsort passes pointers to the elements, i.e. pointers to char * */
static int cmpr(const void *a, const void *b)
{
    return strcmp(*(char **)a, *(char **)b);
}

void sortstrarr(void *array, unsigned n)
{
    qsort(array, n, sizeof(char *), cmpr);
}
29Sorting and Searching in C++
- The C++ STL includes methods for sorting, searching, and more.
void sort(RandomAccessIterator bg, RandomAccessIterator end);
void sort(RandomAccessIterator bg, RandomAccessIterator end, BinaryPredicate op);
void stable_sort(RandomAccessIterator bg, RandomAccessIterator end);
void stable_sort(RandomAccessIterator bg, RandomAccessIterator end, BinaryPredicate op);
30Sorting and Searching in Java
static void sort(Object[] a)
static void sort(Object[] a, Comparator c)
static int binarySearch(Object[] a, Object key)
static int binarySearch(Object[] a, Object key, Comparator c)
The sort() methods for object arrays in java.util.Arrays are all stable.
31Example 1
- http://acm.uva.es/p/v100/10041.html
Background: The world-known gangster Vito Deadstone is moving to New York. He has a very big family there, all of them living in Lamafia Avenue. Since he will visit all his relatives very often, he is trying to find a house close to them.
Problem: Vito wants to minimize the total distance to all of them and has blackmailed you to write a program that solves his problem.
Input: The input consists of several test cases. The first line contains the number of test cases. For each test case you will be given the integer number of relatives r (0 < r < 500) and the street numbers (also integers) s1, s2, ..., sr where they live (0 < si < 30000). Note that several relatives could live in the same street number.
Output: For each test case your program must write the minimal sum of distances from the optimal Vito's house to each one of his relatives. The distance between two street numbers si and sj is dij = |si - sj|.
32Example 1
- If there is 0 or 1 relative, just return 0
- If there are 2 relatives
- If there are 3 relatives
- If there are 4 relatives
- Can you see the solution now?
33Example 2
- The following is a list of some sorting algorithms.
  - Bubble sort, heap sort, insertion sort, merge sort, quick sort, selection sort, shell sort
- My business here is to give you some numbers, and to sort them is your business.
- Attention, I want the smallest number at the top of the sorted list.
- Input
  - The input file consists of a series of data sets. Each data set has two parts: the first part contains two non-negative integers, n (1 ≤ n ≤ 100,000) representing the total of numbers you will get, and m (1 ≤ m ≤ n) representing the interval of the output sorted list. The second part contains n positive integers which will be less than 2,000,000,000. The input is terminated by a line with two zeros.
- Output
  - For each data set, you should output several numbers in ONE line. After you get the sorted list, you should output the first number of each m numbers, and you should print exactly ONE space between two adjacent numbers. And please make sure that there should NOT be any blank line between outputs of two adjacent data sets.
34Example 2
- Sample Input
- 8 2
- 3
- 5
- 7
- 1
- 8
- 6
- 4
- 2
- 0 0
- Output for the Sample Input
- 1 3 5 7
35Example 3
- Dr. Lee cuts a string S into N pieces, s0, s1, ..., sN-1. Now, Dr. Lee gives you these N sub-strings. There might be several possibilities that the string S could be. For example, if Dr. Lee gives you three sub-strings "a", "ab", "ac", the string S could be "aabac", "aacab", "abaac", ... Your task is to output the lexicographically smallest S.
- Input
  - The first line of the input is a positive integer T. T is the number of the test cases. The first line of each test case is a positive integer N (1 ≤ N ≤ 8) which represents the number of sub-strings. After that, N lines follow. The i-th line is the i-th sub-string. Assume that the length of each sub-string is positive and less than 100.
- Output
  - The output of each test is the lexicographically smallest S. No redundant spaces are needed.
36Example 3
- Sample Input
- 1
- 3
- a
- ab
- ac
- Output for the Sample Input
- aabac
37Example 3
- Analysis
- Solution One: brute-force (N is small)
  - 8! = 40320
- A better solution
  - Define a new relation between two strings X and Y
  - If XY < YX, then X << Y. E.g. X = "b", Y = "ba". We have X < Y. But Y << X because "bab" < "bba".
  - Try to prove that if X << Y and Y << Z, then X << Z
  - Then we can sort the N strings based on the << operator
  - Combine the sorted strings
38Example 3
#include <iostream>
#include <string>
#include <algorithm>
using namespace std;

int T, n;
string s[10];

/* X << Y exactly when the concatenation XY sorts before YX */
bool cmp(const string &x, const string &y)
{
    return x + y < y + x;
}

int main()
{
    int i;
    cin >> T;
    while (T--) {
        cin >> n;
        for (i = 0; i < n; i++)
            cin >> s[i];
        sort(s, s + n, cmp);
        for (i = 0; i < n; i++)
            cout << s[i];
        cout << endl;
    }
    return 0;
}
39Practice
- http://acm.uva.es/p/v1/120.html
- http://acm.uva.es/p/v100/10026.html
- http://acm.uva.es/p/v100/10037.html
- http://acm.uva.es/p/v101/10138.html
- http://acm.uva.es/p/v101/10152.html
- http://acm.uva.es/p/v101/10191.html
- http://acm.uva.es/p/v101/10194.html