LIS618 lecture 0 - PowerPoint PPT Presentation

Provided by: kric
Learn more at: http://openlib.org
1
LIS618 lecture 0
  • Thomas Krichel
  • 2003-09-14

2
today's lecture
  • I will not talk about the strike.
  • A look at the course home page
  • http://wotan.liu.edu/home/krichel/lis618n03a
  • administrative stuff
  • historical matters about the course
  • about me
  • business of database searching
  • indexes
  • the Boolean information retrieval model
  • practice example on Dialog

3
Organization
  • homepage: http://wotan.liu.edu/home/krichel/lis618n03a
  • Contents to be discussed today.
  • Send mail to krichel_at_openlib.org
  • Your name
  • Your secret word for grades delivery
  • Interrupt me with as many questions as possible!
  • Ask for breaks!

4
Proposed Organization
  • Normal lecture
  • Quiz at the beginning of every lecture
  • Factually oriented, around 15 minutes
  • Remove worst performance
  • Average to form 50% of the grade
  • Search exercise: 50% of the grade
  • Formal syllabus to be made early next week!

5
Search exercise
  • find victim of an information need
  • best to take someone you know in a professional
    capacity
  • conduct interview about an information need
    experienced by the victim, write down
    expectations
  • search in formal database and on web
  • discuss results with the victim
  • write essay, no longer than 7 pages.

6
about the course
  • This course is new wine in an old bottle
  • Officially a merger of
      • lis566 information resources on the Internet
          • mailing lists
          • usenet news
          • web searching
      • lis618 database searching
          • access and use of commercial databases

7
mix of theory and practice
  • I am not a database search practitioner.
  • Each database is different; practical skills are
    not easily transferable.
  • Thus my emphasis in the course is more on theory.
  • In the past, I did theory first, then practice.
  • This year I will try to mix. Some theory and some
    practice in every session.

8
What databases?
  • Dialog has been the traditional database covered.
  • They were the market leaders in online databases
    in the past.
  • Nowadays the field is much more open
  • In addition, I have covered Nexis and FirstSearch
    (OCLC) in the past.
  • But I am open to suggestions.

9
About me
  • Born 1965, in Völklingen (Germany)
  • Studied economics and social sciences at the
    Universities of Toulouse, Paris, Exeter and
    Leicester.
  • PhD in theoretical macroeconomics
  • Lecturer in Economics at the University of Surrey
    between 1993 and 2001
  • Since 2001 assistant professor at the Palmer
    School

10
Why?
  • During my research assistantship period (1990 to
    1993), I was constantly frustrated with difficult
    access to scientific literature.
  • At the same time, I discovered easy access to
    freely downloadable software over the Internet.
  • I decided to work towards downloadable scientific
    documents. This led to my library career
    (eventually).

11
Steps taken I
  • 1993 founded the NetEc project at
    http://netec.mcc.ac.uk, later available at
    http://netec.ier.hit-u.ac.jp as well as at
    http://netec.wustl.edu.
  • These are networking projects targeted to the
    economics community. The bulk is
  • Information about working papers
  • Downloadable working papers
  • Journal articles were added later

12
Steps taken II
  • Set up RePEc, a digital library for economics
    research. It catalogs
  • Research documents
  • Collections of research documents
  • Researchers themselves
  • Organizations that are important to the research
    process
  • Decentralized collection, model for the open
    archives initiative

13
Steps taken III
  • Co-founder of Open Archives Initiative
  • Work on the Academic Metadata Format
  • Co-founded rclis (Research in Computing, Library
    and Information Science), a RePEc clone

14
Interest in databases
  • From my point of view I have two interests in
    database searching
  • As a provider, I must understand how people
    search in order to provide some data that they
    can use and will use.
  • As an economist, I have a strong interest in
    information as a commodity. The database market
    is an important market place.
  • The main emphasis of the course is still on databases.

15
Database searching (DS)
  • subset of the subject of information retrieval
    (IR)
  • DS is mainly thought of as applicable to large
    structured databases, as opposed to web
    searching
  • for those, a general knowledge of what databases
    are seems useful
  • Concentrate on textual databases

16
traditional social model
  • user goes to a library
  • describes problem to the librarian
  • librarian does the search
  • without the user present
  • with the user present
  • hands over the result to the user
  • user fetches full-text or asks a librarian to
    fetch the full text.

17
economic rationale for traditional model
  • In olden days the cost of telecommunication was
    high.
  • database use costs
  • cost of communication
  • cost of access time to the database
  • the traditional model controls an upper bound on
    costs

18
disintermediation
  • with the cost of access time gone, the traditional
    model is under threat
  • there is disintermediation, where the librarian
    loses her role
  • but that may not be good news for information
    retrieval results
  • user knows subject matter best
  • librarian knows searching best

19
Web searching
  • IR has received a lot of impetus through the web,
    which poses unprecedented search challenges.
  • with more and more data appearing on the web, DS
    may be a subject in decline
  • it is primarily concerned with non-web databases
  • There are more and more web-based methods of
    searching

20
Public access vs quality
  • Now the public at large is able to do online
    searching.
  • At the same time, the need for quality answers has
    grown.
  • Quality-filtered services will become more
    important.
  • In the current databases, there is a lot that
    would already be available for free, mixed with
    quality-controlled stuff.
  • Publishers have direct offerings and
    intermediated vending is in decline.

21
Main theory part
  • Literature: "Modern Information Retrieval" by
    Ricardo Baeza-Yates and Berthier Ribeiro-Neto
  • Don't buy it. It is not a good book.

22
before the IR process
  • provider
      • define data that is available
      • documents that can be used
      • document operations
      • document structure
      • index
  • user
      • user need
      • IR system familiarity

23
the IR process
  • query expresses user need in a query language
  • processing of query yields retrieved documents
  • calculation of relevance ranking
  • examination of retrieved documents
  • possible relevance cycle

24
main problem
  • user is not an expert at the formulation of a
    query
  • garbage in, garbage out: the retrieval yields poor
    results
  • ways out
  • design very intuitive interface for the query
  • give expert guidance

25
taxonomy of classic IR models
  • Boolean, or set-theoretic
      • fuzzy set models
      • extended Boolean
  • vector, or algebraic
      • generalized vector model
      • latent semantic indexing
      • neural network model
  • probabilistic
      • inference network
      • belief network

26
summary
  • There are three basic types of models in classic
    information retrieval.
  • Extensions of these types are a matter of
    research concern and require good mathematical
    skills.
  • All classic models treat documents as individual
    pieces.

27
key aid: index
  • an index is a list of terms, with a list of
    locations where the term is to be found.
  • The way to express locations usually depends on
    the form that the indexed data takes.
  • for a book, it is usually the page number, e.g.
  • "shmoo 34, 75"
  • for computer files it is usually the name of the
    file plus the number of the byte where the
    indexed term starts, e.g. "krichel index.html 34,
    cv.html 890 1209"
  • there is usually more than one location of the
    term.
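
The file-based index described above can be sketched in a few lines of Python. This is only an illustration: the file contents below are invented, and `build_index` is a hypothetical helper name, not part of any system discussed in the course. Each term maps to the files that contain it and to the byte offsets where it starts.

```python
import re
from collections import defaultdict


def build_index(files):
    """Map each term to {file name: [byte offsets where the term starts]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in files.items():
        data = text.encode("utf-8")
        # Treat every run of ASCII letters as one index term.
        for match in re.finditer(rb"[A-Za-z]+", data):
            term = match.group().decode("ascii").lower()
            index[term][name].append(match.start())
    return index


# Hypothetical file contents, invented for illustration.
files = {
    "index.html": "Home page of Thomas Krichel",
    "cv.html": "Krichel, Thomas. CV of Krichel",
}
index = build_index(files)
krichel_locations = {name: offsets for name, offsets in index["krichel"].items()}
print(krichel_locations)
```

As on the slide, one term usually ends up with more than one location, here spread over two files.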

28
key aid: index terms
  • an index term is a part of the document that has a
    meaning on its own.
  • it is usually a noun.
  • retrieval based on index term raises questions
  • semantics in query or document is lost
  • matching done in imprecise space of index terms
  • predicting relevance is a central problem
  • the IR model determines the process of relevance
    ranking

29
basic concept: weight of index term
  • among all the nouns in a text, not all appear to
    have the same relevance to the text
  • sometimes, we can have a simple measure of the
    importance of a term, example?
  • more generally, for each indexing term and each
    document we can associate a weight with the term
    and the document.
  • usually, if the document does not contain the
    term, its weight is zero
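
One possible simple measure, assumed here purely for illustration, is the number of times a term occurs in a document. The sketch below builds a weight for every (document, term) pair; the documents and the helper name `term_weights` are hypothetical.

```python
from collections import Counter


def term_weights(documents, terms):
    """Return {doc id: {term: weight}}, using raw term counts as weights."""
    weights = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        # A Counter returns 0 for a missing term, so absent terms get weight 0.
        weights[doc_id] = {term: counts[term] for term in terms}
    return weights


# Hypothetical documents, invented for illustration.
documents = {
    "d1": "digital library digital archive",
    "d2": "library catalog",
}
weights = term_weights(documents, ["digital", "library", "catalog"])
print(weights)
```

Other weighting schemes exist; the only property this sketch relies on is the one stated above, that a term absent from a document gets weight zero.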

30
Boolean model
  • in the Boolean model, the weight of any index
    term for any document is 1 if the term appears
    in the document. It is 0 otherwise.
  • This allows query terms to be combined with the
    Boolean operators AND, OR, and NOT
  • thus powerful queries can be written
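
A minimal sketch of the Boolean model, assuming a tiny invented collection: each term has weight 1 or 0 per document, so every term defines a set of matching documents, and AND, OR, and NOT become set intersection, union, and difference. The function names are hypothetical.

```python
def binary_weight(term, text):
    """Boolean-model weight: 1 if the term appears in the text, else 0."""
    return 1 if term in text.lower().split() else 0


def matching(term, documents):
    """Set of ids of the documents in which the term has weight 1."""
    return {doc_id for doc_id, text in documents.items()
            if binary_weight(term, text) == 1}


# Hypothetical documents, invented for illustration.
documents = {
    "d1": "current awareness service",
    "d2": "digital library",
    "d3": "current awareness in a digital library",
}
all_docs = set(documents)
current = matching("current", documents)
digital = matching("digital", documents)

both = current & digital          # current AND digital
either = current | digital        # current OR digital
not_digital = all_docs - digital  # NOT digital
print(both, either, not_digital)
```

Because the weights are binary, a document either matches or it does not; this model gives no ranking among the matching documents.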

31
Classic implementation: Dialog
  • http://training.dialog.com/sem_info/courses/pdf_sem/dlg1.pdf
  • http://training.dialog.com/sem_info/courses/pdf_sem/dlg2.pdf
  • http://training.dialog.com/sem_info/courses/pdf_sem/dlg3.pdf
  • http://training.dialog.com/sem_info/courses/pdf_sem/dlg4.pdf

32
Dialog is a databank
  • over 500 databases
  • these are also known as files and cover
  • references and abstracts for published
    literature,
  • business information and financial data
  • complete text of articles and news stories
  • statistical tables
  • Directories
  • DIALOG uses the Boolean model

33
DIALOG interface
  • is still rooted in "traditional" database systems
  • dismissed as "dial-a-dog"
  • it uses a command-driven interface
  • it is very complicated to learn fully
  • it is not suitable for the end-user
  • it therefore offers a valuable skill to the
    information professional
  • it is a challenge for a professor to teach

34
Accessing DIALOG
  • On the web, go to
  • http://www.dialogweb.com/
  • Enter username and password
  • Forget about subaccount
  • then click on logon
  • On the next screen go to command search
  • "continue" at the next screen

35
two steps in DIALOG
  • step one: select databases (aka files) to look at
  • step two: perform searches on the selected
    databases
  • You may wonder why one does not have one single
    step like in a search engine. Discuss.

36
sample search
  • We want to know something about "current
    awareness in digital libraries"
  • From dialogweb command search
  • databases
  • social sciences and humanities
  • library and information science
  • leads you to http://www.dialogweb.com/cgi/logoff?mode
    guidedurl/cgi/dwframe?hrefsearch.html

37
This is database selection
  • At that screen you see a number of "files" with
    their number.
  • You can select those that you want to search
  • then you click "begin databases"
  • and you get back to the command search
  • "b numbers" it will say. That is the command to
    begin working with files.

38
Boolean search
  • Do a number of searches
  • s current(N)awareness
  • s digital(N)library
  • s digital(N)libraries
  • Each search retrieves a set of documents
  • The sets can be combined
  • s s1 and (s2 or s3)
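
The same sequence can be imitated outside Dialog with Python sets. This is a rough sketch, not Dialog's implementation: the records are invented, `select` and `adjacent` are hypothetical helpers, and (N) is assumed to mean that the two words are next to each other in either order.

```python
def adjacent(text, first, second):
    """True if the two words occur next to each other, in either order."""
    words = text.lower().split()
    pairs = set(zip(words, words[1:]))
    return (first, second) in pairs or (second, first) in pairs


def select(records, first, second):
    """Imitate 's first(N)second': the set of matching record numbers."""
    return {rid for rid, text in records.items() if adjacent(text, first, second)}


# Hypothetical records, invented for illustration.
records = {
    1: "current awareness services for the digital library",
    2: "building digital libraries",
    3: "current awareness in medicine",
    4: "awareness of current digital libraries",
}
s1 = select(records, "current", "awareness")
s2 = select(records, "digital", "library")
s3 = select(records, "digital", "libraries")

result = s1 & (s2 | s3)  # s s1 and (s2 or s3)
print(s1, s2, s3, result)
```

Each `select` call plays the role of one search and returns a set; the last line combines the sets exactly as the final command on the slide does.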

39
What is the deal?
  • There are two stages.
  • At stage two we make Boolean queries.
  • Each query splits the records into matching
    and non-matching records.
  • The set of matching records is returned.
  • It can be further searched or combined with other
    sets using Boolean operators.
  • Try this at home.

40
http://openlib.org/home/krichel
  • Thank you for your attention!