LIS618 lecture 6 - PowerPoint PPT Presentation

Provided by: kric
Learn more at: https://openlib.org

1
LIS618 lecture 6
  • Thomas Krichel
  • 2004-03-14

2
Structure
  • Google
  • news
  • interfaces to non-web sources
  • Usenet
  • ODP
  • relational databases
  • OpenURL
  • file sharing

3
Google news
  • Is a gathering of top stories from news sites.
  • The entire page is built by computer. Which
    stories make it to the top depends on
  • how prominently the stories appear on news sites
  • which sites the stories appear on
  • when the articles were published
  • how many articles cover the same story
  • Note the sidebar with stories from different
    topic sections.

4
special syntax for news I
  • source: gives news from a source only
  • example source:cnn works
  • examples source:bbc, source:nytimes,
    source:"new york times" don't seem to get
    anywhere.
  • location: gives a location. Can be a two-letter
    state or a country
  • location:ny
  • location:russia

5
special syntax for news II
  • allintitle: searches for words in the title of
    the article (not of the page)
  • example allintitle: dead injured
  • allintext: searches for words in the text
  • example allintext: saarland government
  • allinurl: searches in article URLs
  • example allinurl:bbc Wales
  • Restrictions
  • Only one allin??? special syntax per query.
  • Must come first in the query.
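
The two restrictions above can be sketched as a small query builder. This is only an illustration: the function name and its parameters are invented here, and whether the news search accepts an allin operator combined with source: or location: is an assumption, not something the slides confirm.

```python
# Hypothetical helper: assembles a news query string as plain text.
ALLIN_OPERATORS = {"allintitle", "allintext", "allinurl"}


def build_news_query(words, allin=None, source=None, location=None):
    """Return a query string with at most one allin operator, placed first."""
    if allin is not None and allin not in ALLIN_OPERATORS:
        raise ValueError(f"unknown operator: {allin}")
    terms = " ".join(words)
    # The single allin operator, if any, opens the query.
    parts = [f"{allin}:{terms}" if allin else terms]
    if source:
        parts.append(f"source:{source}")
    if location:
        parts.append(f"location:{location}")
    return " ".join(part for part in parts if part)


print(build_news_query(["dead", "injured"], allin="allintitle"))
print(build_news_query(["election"], source="cnn", location="ny"))
```

Because the function takes a single allin argument, a query with two allin operators cannot be built by accident.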

6
Google interfaces to 3rd party data
  • Google has an interface to Usenet news, called
    Google Groups.
  • Google directory is an interface to the Open
    Directory Project.
  • In both cases Google is dependent on the quality
    of these underlying data sources.

7
Usenet news
  • Usenet is a collection of user-submitted notes on
    various subjects that are posted to servers on a
    worldwide network. Each subject collection of
    posted notes is known as a newsgroup.
  • A newsgroup is a discussion about a particular
    subject consisting of notes written to a
    networked site and distributed through Usenet.
  • Newsgroups are hierarchical. Hierarchical levels
    are separated by dots, for example comp.text.tex.
  • alt, news, info, biz, rec, comp, sci, humanities,
    soc, misc, talk are classic world-wide groups.
  • alt is jokingly said to stand for anarchists,
    lunatics and terrorists.

8
Usenet history
  • The idea of network news was born in 1979 when
    two graduate students, Tom Truscott and Jim
    Ellis, thought of using UUCP to connect machines
    for the purpose of information exchange among
    users. They set up a small network of three
    machines in North Carolina.
  • UUCP is "UNIX to UNIX copy", a protocol that is
    used to copy files between machines running some
    flavor of UNIX, without the need for the IP
    protocol. Usenet is older than the Internet.

9
decline of Usenet
  • essentially open to all (peer-to-peer system)
  • used by spammers for
  • posting
  • gathering addresses
  • steady decline of quality of contributions
  • steady decline of quantity of contributions

10
Usenet worth checking out
  • independent reviews of products, often written by
    experts.
  • Example: interpretation of Beethoven sonatas by
    Wilhelm Kempff.
  • Sorting by date reveals that the newsgroup
    rec.music.classical.recordings is still active.
    On a good day, you will find no finer guide to
    records.

11
special syntax for Google Groups
  • group: limits postings to a certain group
  • intitle: limits to titles of postings
  • author: searches for author name or email address
  • Mixing syntaxes works well.
  • Example: intitle:kempff
    group:rec.music.classical.recordings

12
the open directory project
  • The Open Directory Project is the largest, most
    comprehensive human-edited directory of the Web.
    It is constructed and maintained by a vast,
    global community of volunteer editors.
  • They claim a historic precedent in the Oxford
    English Dictionary.
  • Formerly known as "GnuHoo", then "NewHoo",
    then acquired by Netscape, and called "dmoz".

13
dmoz.org
  • dmoz is maintained by volunteer "net-citizens".
    No special qualifications are required, but they
    are claimed to be experts.
  • There are about 30,000 volunteers (they claim).
  • Powers the core directory services for the Web's
    largest and most popular search engines and
    portals:
  • Netscape Search, AOL Search
  • Google, Lycos
  • HotBot, DirectHit
  • Headquarters run by Netscape.

14
Appearance of ODP
  • If Google finds a relevant category it puts it
    into the result.
  • Remember a Google response is a list of results.
  • Each result has
  • title
  • snippet
  • URL
  • Some results optionally have a category attached.
    Following such categories is a winner if your
    information need is broad.

15
full-text databases
  • These databases have an emphasis on providing
    full-text information in a web environment.
  • Their particular strength is the aggregation of
    material from a range of publishers.
  • This especially concerns scholarly publishing,
    where the source material is distributed among a
    large number of sources.

16
Access
  • Some of the access is arranged via the Brooklyn
    LIU campus. We can use the on-campus access here.
  • The databases have some full-text, but not a lot.

17
Proquest
  • go into the database selection, delete everything
    and then use the research library.
  • we can search for Paul Levine. It appears that
  • not all articles have full-text
  • there is no distinction between different Paul
    Levines
  • Otherwise it appears straightforward to use

18
aggregators
  • Proquest and ebsco work as aggregators. They put
    different scholarly journals together in one
    database, so you don't have to deal with
    publishers' different interfaces.
  • Publishers are reluctant to join and impose
    moving-wall embargoes on full-text release.
  • So you cannot always access the full text via
    them. But your library may have the text
    somewhere.

19
the library as aggregator
  • typically, a library buys holdings from a
    publisher, as well as cross-publisher abstract
    and indexing data.
  • when a user finds a reference in an abstract and
    wants to access the full text, they are stuck
  • Herbert Van de Sompel has been working on this
    problem.

20
special effects (SFX)
  • Herbert's idea was to equip the interface with a
    special effects button.
  • When users press the button, the interface would
    transmit metadata such as
  • author name
  • journal name
  • title
  • date
  • to a special database, called a resolver.

21
resolver
  • The resolver examines the metadata and makes a
    decision on what to show to the user.
  • if the journal is subscribed to and the date is
    recent, it may formulate a query to the
    publisher's database and fetch the record and/or
    full text there.
  • if the journal is not held, suggest ILL
  • etc
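
The resolver's decision can be sketched as a short function. Everything concrete here is an assumption made for illustration: the Citation fields mirror the metadata listed on the previous slide, the holdings set and the ten-year meaning of "recent" are invented, and the "check print holdings" branch is only one guess at what the "etc" might cover.

```python
from dataclasses import dataclass

RECENT_YEARS = 10  # assumption: how old an article may be and still count as recent


@dataclass
class Citation:
    author: str
    journal: str
    title: str
    year: int


def resolve(citation, holdings, current_year):
    """Return the list of extended services to offer for one citation."""
    services = []
    if citation.journal not in holdings:
        # Journal not held locally: suggest interlibrary loan (ILL).
        services.append("suggest interlibrary loan")
    elif current_year - citation.year <= RECENT_YEARS:
        # Subscribed and recent: point the user at the publisher's copy.
        services.append("link to full text at publisher")
    else:
        # Assumption: older volumes of a held journal may exist in print.
        services.append("check print holdings")
    return services


holdings = {"Journal A"}  # hypothetical local subscriptions
print(resolve(Citation("Levine", "Journal A", "Some title", 2003), holdings, 2004))
print(resolve(Citation("Levine", "Journal B", "Some title", 2003), holdings, 2004))
```

The function returns a list rather than a single answer, because a resolver hands back a set of services, not one record.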

22
configuring the resolver
  • librarians, who know the local setting, will
    configure the server so that users are given the
    appropriate extended services given the local
    circumstance.
  • Note that what is returned is a set of extended
    services, not the response to a specific query.

23
Bison Futé model
  • This refers to further work by Herbert to
    generalize the idea.
  • On a web page, you find a link. It has been made
    by the provider of the web pages.
  • But this link may not be appropriate. There
    may be better technology that allows you to move
    in the same direction but with your own link.
  • In other words we talk about context-sensitive
    linking.

24
OpenURL
  • This is now a draft standard with NISO to
    standardize the special effects request.
  • The OpenURL is a transport architecture for
    context objects.
  • Context objects unite descriptions of
  • the reference found
  • the context in which it was found

25
implications for information retrieval
  • The implications for the library world are
    already important.
  • many library software systems already implement
    OpenURLs and provide resolvers
  • But impact could be wider and could cover a whole
    new structure for the web, replacing static links
    with on-the-fly dynamic ones.

26
Databases
  • Databases are collections of data with some
    organization to them.
  • The classic example is the relational database.
  • But not all databases need to be relational
    databases.

27
Relational databases
  • A relational database is a set of tables. There
    may be relations between the tables.
  • Each table has a number of records. Each record
    has a number of fields.
  • When the database is being set up, we fix
  • the size of each field
  • relationships between tables

28
Example Movie database
  • ID  title                director           date
  • M1  Gone with the wind   F. Ford Coppola    1963
  • M2  Room with a view     Coppola, F Ford    1985
  • M3  High Noon            Woody Allan        1974
  • M4  Star Wars            Steve Spielberg    1993
  • M5  Alien                Allen, Woody       1987
  • M6  Blowing in the Wind  Spielberg, Steven  1962
  • Single table
  • No relations between tables, of course

29
Problem with this database
  • I made up all the data. It is just for
    illustration.
  • Names are recorded inconsistently. There is no
    way to find films by Woody Allen without having
    to go through all spelling variations.
  • Mistakes are difficult to correct. We have to
    wade through all records, a masochist's pleasure.

30
Better movie database
  • ID  title                director  year
  • M1  Gone with the wind   D1        1963
  • M2  Room with a view     D1        1985
  • M3  High Noon            D2        1974
  • M4  Star Wars            D3        1993
  • M5  Alien                D2        1987
  • M6  Blowing in the Wind  D3        1962
  • ID  director name          birth year
  • D1  Ford Coppola, Francis  1942
  • D2  Allen, Woody           1957
  • D3  Spielberg, Steven      1942

31
Relational database
  • We have a one-to-many relationship between
    directors and films
  • Each film has one director
  • Each director may have directed many films
  • Here it becomes possible for the computer, and
    then the user
  • To know which films have been directed by Woody
    Allen
  • To find which films have been directed by a
    director born in 1942

32
Many-to-many relationships
  • Each film has one director, but many actors star
    in it. Relationship between actors and films is a
    many to many relationship.
  • Here are a few actors
  • ID  sex  actor name       birth year
  • A1  f    Brigitte Bardot  1972
  • A2  m    George Clooney   1927
  • A3  f    Marilyn Monroe   1934

33
Actor/Movie table
  • actor id  movie id
  • A1        M4
  • A2        M3
  • A3        M2
  • A1        M5
  • A1        M3
  • A2        M6
  • A3        M4
  • as many lines as required

34
SQL
  • Once we have the relational database, we can ask
    sophisticated questions
  • Which director has had the most female actors
    working for him?
  • In which years were films shot that starred
    actors born between 1926 and 1935?
  • Such questions can be encoded in a language known
    as Structured Query Language, or SQL. All
    relational database vendors implement a dialect
    of SQL.
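
Both questions can be written in SQL and run with Python's built-in sqlite3 module. The sketch below loads the made-up data from the previous slides; the table and column names (including the junction table name casting) are chosen here for illustration, and "most female actors" is read as the number of distinct actresses, which is one possible interpretation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movie (id TEXT PRIMARY KEY, title TEXT,
                    director_id TEXT REFERENCES director(id), year INTEGER);
CREATE TABLE actor (id TEXT PRIMARY KEY, sex TEXT, name TEXT, birth_year INTEGER);
-- Junction table for the many-to-many relationship: one row per actor/movie pair.
CREATE TABLE casting (actor_id TEXT REFERENCES actor(id),
                      movie_id TEXT REFERENCES movie(id));
""")
conn.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942), ("D2", "Allen, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942)])
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963), ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974), ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987), ("M6", "Blowing in the Wind", "D3", 1962)])
conn.executemany("INSERT INTO actor VALUES (?, ?, ?, ?)", [
    ("A1", "f", "Brigitte Bardot", 1972), ("A2", "m", "George Clooney", 1927),
    ("A3", "f", "Marilyn Monroe", 1934)])
conn.executemany("INSERT INTO casting VALUES (?, ?)", [
    ("A1", "M4"), ("A2", "M3"), ("A3", "M2"), ("A1", "M5"),
    ("A1", "M3"), ("A2", "M6"), ("A3", "M4")])

# Which director has worked with the most distinct female actors?
top_director = conn.execute("""
    SELECT d.name, COUNT(DISTINCT a.id) AS actresses
    FROM director d
    JOIN movie m   ON m.director_id = d.id
    JOIN casting c ON c.movie_id = m.id
    JOIN actor a   ON a.id = c.actor_id
    WHERE a.sex = 'f'
    GROUP BY d.id, d.name
    ORDER BY actresses DESC
    LIMIT 1
""").fetchone()

# In which years were films shot that starred actors born between 1926 and 1935?
years = [row[0] for row in conn.execute("""
    SELECT DISTINCT m.year
    FROM movie m
    JOIN casting c ON c.movie_id = m.id
    JOIN actor a   ON a.id = c.actor_id
    WHERE a.birth_year BETWEEN 1926 AND 1935
    ORDER BY m.year
""")]

print(top_director)
print(years)
```

Counting rows instead of distinct actresses would give a tie in this data set, so the wording of the question matters when it is translated into SQL.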

35
importance of relational databases
  • Relational databases dominate the world of
    structured information. Examples
  • employment and payroll in a company
  • stock management
  • e-commerce
  • There are quite easy ways to get relational
    databases to work with web interfaces. Some are
    freely available. The most common one is the LAMP
    (Linux Apache MySQL PHP) architecture.

36
relational databases in libraries
  • A 2004 LITA enquiry revealed that many
    respondents most regretted not having learned
    more about relational databases in library
    school.
  • But there are problems with relational databases
    in libraries
  • Slow on very large databases (such as catalogs)
  • Library data has nasty ad-hoc relationships, e.g.
  • Translation of the first edition of a book
  • CD supplement that comes with the print version
  • Difficult to deal with in a system where all
    relations and fields have to be set up at the
    start and cannot be changed easily later.

37
off-web Internet information retrieval
  • Under this heading, I principally think about
    activities known as file-sharing.
  • They concern the (mostly illegal) exchange of
    files between users. Such files may encode
  • music
  • films
  • There is a lot of it going on, but we are not
    sure how much.

38
Napster
  • Napster was the first prominent file-sharing
    service.
  • Napster ran a central server. You connected to
    that server and announced what files you had to
    share.
  • Every search was conducted on the dataset
    assembled at the central server.
  • Connections to download files were done between
    peer machines only.
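
The division of labour described above can be sketched with a toy central index. This illustrates the architecture only and is not Napster's actual protocol; the class name, method names and peer addresses are invented.

```python
from collections import defaultdict


class CentralIndex:
    """Toy central server: it stores who has which file, never the files."""

    def __init__(self):
        self._peers_by_file = defaultdict(set)  # file name -> peer addresses

    def announce(self, peer, filenames):
        """A peer connects and announces the files it is willing to share."""
        for name in filenames:
            self._peers_by_file[name].add(peer)

    def search(self, term):
        """Search the central dataset; return matching files and their peers."""
        term = term.lower()
        return {name: sorted(peers)
                for name, peers in self._peers_by_file.items()
                if term in name.lower()}


index = CentralIndex()
index.announce("10.0.0.1", ["kempff_sonata.mp3"])
index.announce("10.0.0.2", ["kempff_sonata.mp3", "alien.avi"])
# The answer only names peers; the download itself would go peer to peer.
print(index.search("kempff"))
```

Everything depends on the one index object, which is why the network could not survive losing its central machine.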

39
end of Napster
  • Napster argued that, since it was only involved
    in collecting the information about files
    available, it was legal.
  • Napster never shared any illegal file.
  • The courts thought otherwise.
  • It was shut down.
  • The Napster network died without its central
    machine.
  • To enable true piracy, we need a truly
    distributed system.

40
gnutella protocol
  • This protocol underlies much of the current
    file-sharing activity on the Internet.
  • It enables a peer-to-peer network between
    machines. There, every machine is both a client
    and a server, and is accordingly called a servent.
  • To connect to a gnutella network, you need the IP
    address of one single machine that is already
    part of the network.

41
connection to the gnutella network
  • Once you establish connection to the first
    servent, you announce your presence.
  • The first servent will pass on that message to
    all the servents that it is connected to, and so
    on.
  • This quickly adds up to a lot of traffic!

42
time to live
  • Every gnutella message has a time to live (TTL).
    It is decremented every time the message passes
    through a servent.
  • The TTL is usually quite small. It can be
    arbitrarily reduced by servents.
  • Therefore you only talk to servents that are
    close to you. But your software will determine
    which servents to try to contact first. That
    usually depends on previous query results.
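
How the TTL limits the reach of a message can be simulated on a small made-up network. This is a simplified model, not the gnutella wire protocol: the network is a plain dictionary of neighbours, the TTL is taken as the number of hops a message may still travel, and each servent is assumed to handle a given message only once.

```python
from collections import deque

# Hypothetical network: each servent knows only its direct neighbours.
NETWORK = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}


def flood(network, start, ttl):
    """Return the set of servents that see a message sent by `start`."""
    reached = {start}
    queue = deque([(start, ttl)])
    while queue:
        servent, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL used up: the message is not passed on
        for neighbour in network[servent]:
            if neighbour not in reached:  # each servent handles the message once
                reached.add(neighbour)
                # One hop costs one unit of TTL.
                queue.append((neighbour, remaining - 1))
    return reached


print(sorted(flood(NETWORK, "a", 1)))
print(sorted(flood(NETWORK, "a", 2)))
```

With a small TTL the message never reaches servent d, which is the sense in which you only talk to servents that are close to you.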

43
searches
  • When you do a search, it is passed on from
    servent to servent through the p2p network.
  • Servents have their own rules for how to respond
    to queries.
  • Most of the time search strings are matched
    against a file name.
  • Some may try to match against the directory name.
  • Some general queries may be rejected.
  • Some result sets may be truncated.
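
One possible set of response rules can be sketched as follows. The rules and limits here are invented for illustration, since each servent is free to choose its own: this one matches against file names only, rejects very short queries as too general, and truncates long result lists.

```python
def respond(shared_paths, query, max_results=2, min_query_length=3):
    """Answer a search query according to this servent's local rules."""
    if len(query) < min_query_length:
        return []  # too general: this servent refuses to answer
    words = query.lower().split()
    hits = []
    for path in shared_paths:
        filename = path.rsplit("/", 1)[-1].lower()  # ignore the directory name
        if all(word in filename for word in words):
            hits.append(path)
    return hits[:max_results]  # truncate the result set


shared = [
    "music/kempff_sonata_1.mp3",
    "music/kempff_sonata_2.mp3",
    "music/kempff_sonata_3.mp3",
    "films/alien.avi",
]
print(respond(shared, "kempff sonata"))
print(respond(shared, "music"))
```

A servent that also matched directory names would answer the second query differently, so the same search can return different results depending on who receives it.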

44
downloading
  • If you see a file that you would like to have,
    you can try to download it.
  • To implement downloads the servents use HTTP.
    Thus everyone who is connected to a file-sharing
    network runs a web server!
  • However, there usually is a tight limit on how
    many downloads a server will accept.
  • Modern servents have the ability to download from
    several servents.

45
ease of infringement
  • Clearly all the traffic on gnutella, with current
    technology, can be observed.
  • But the infringement is so massive that it
    appears difficult to clamp down on.
  • The ease of infringing is technological.
  • The RIAA has sued. They reach only the tip of
    the iceberg, in the hope of dissuading others.

46
http://openlib.org/home/krichel
  • Thank you for your attention!