Informetrics presentation

Title: Informetrics

1
Informetrics IR

Presentation
Readings, Discussion, Review
Projects, Papers

2
Why use metrics?

Apply theory from another field to solve IS
problems
We need new modeling techniques or metaphors to
examine these complex systems
An attempt to apply some new models and metaphors
to complex systems
Bibliometrics
Direct Citation Counting
Bibliographic Coupling
Co-Citation Analysis
Bibliometric Laws
Web Servers
Server Log
Log Analysis

3
How do Informetrics impact IR?

Measures of
Content (subject area)
Relationships
Use (popularity)
An information-based view of communications,
focused on documents
Instead of the text in a document, focus on the
document properties (metadata?)
Author(s)
Dates
Publication source(s)
Front matter: titles, contact info
Back matter: citations, support

4
What are these metrics?

Bibliometrics
"series of techniques that seek to quantify the
process of written communication" (Ikpaahindi)
counting and analyzing citations
consistently observable patterns
referenced in key places: Science Citation Index,
Social Sciences Citation Index, Arts and
Humanities Citation Index
Webometrics
Applying bibliometric methods to Web pages and
Web sites
Informetrics
Wider scale application of methods to networked
information sources

5
Citing and Linking

paying homage to pioneers
giving credit for related work (homage to peers)
identifying methodology, equipment, etc.
background reading
correcting one's own work
correcting the work of others
criticizing previous work
substantiating claims
alerting to forthcoming work
providing leads to poorly disseminated, poorly
indexed, or un-cited work
authenticating data and classes of fact -
physical constants, etc.
identifying original publications in which an
idea or concept was discussed
identifying the original publication or other
work describing an eponymic concept or term
(Hodgkin's disease)
disclaiming work or ideas of others (negative
claims)
disputing priority claims of others (negative
homage)

6
Direct Citation Counting

How many citations over a given period of time?
Impact factor
(number of citations to the journal) / (number of
citable articles published)
Immediacy index
(number of citations received by articles during
the year they are published) / (total number of
citable articles published that year)
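
The two ratios on this slide are simple divisions. The sketch below is a minimal illustration only: the function names and the journal figures are invented for the example, and the exact counting window (which years, which article types count as citable) is left to the caller.

```python
def impact_factor(citations_to_journal: int, citable_articles: int) -> float:
    """Citations to the journal divided by citable articles published."""
    if citable_articles <= 0:
        raise ValueError("citable_articles must be positive")
    return citations_to_journal / citable_articles


def immediacy_index(same_year_citations: int, citable_articles: int) -> float:
    """Citations received in the publication year divided by citable articles."""
    if citable_articles <= 0:
        raise ValueError("citable_articles must be positive")
    return same_year_citations / citable_articles


# Hypothetical journal: 120 citable articles, 300 citations overall,
# 30 of them received in the year of publication.
print(impact_factor(300, 120))    # 2.5
print(immediacy_index(30, 120))   # 0.25
```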

7
Bibliographic Coupling

"a number of papers bear a meaningful relation
to each other when they have one or more
references in common" (Kessler)
What's the Web equivalent?
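
Kessler's idea can be sketched as counting shared references between pairs of papers. This is an illustrative sketch, not a standard tool: the paper and reference identifiers are made up, and on the Web the "references" could just as well be sets of outgoing links.

```python
from itertools import combinations


def coupling_strength(refs_a, refs_b):
    """Number of references two papers have in common."""
    return len(set(refs_a) & set(refs_b))


def coupling_pairs(papers):
    """Map each coupled pair of papers to its coupling strength.

    papers: dict of paper id -> list of cited reference ids.
    Pairs with no shared reference are omitted.
    """
    pairs = {}
    for a, b in combinations(sorted(papers), 2):
        strength = coupling_strength(papers[a], papers[b])
        if strength:
            pairs[(a, b)] = strength
    return pairs


# Hypothetical data: P1 and P2 share R2 and R3; P3 shares nothing.
papers = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3", "R4"],
    "P3": ["R5"],
}
print(coupling_pairs(papers))  # {('P1', 'P2'): 2}
```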

8
Co-Citation Analysis

if two references are cited together in a later
literature, the two references are themselves
related. The greater the number of times they are
cited together, the greater their co-citation
strength. (Marshakova and Small, independently,
1973)
How about Web citations?
What's a set of Web pages? A site, a long page?
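
Co-citation strength is the mirror image of coupling: it counts how often two references appear together in the same citing document. A minimal sketch with invented identifiers follows; for Web pages, a "citing document" would have to be defined first (a page, a site), which is exactly the open question on this slide.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_docs):
    """Count how many citing documents cite each pair of references together.

    citing_docs: dict of citing document id -> list of cited reference ids.
    """
    counts = Counter()
    for refs in citing_docs.values():
        # Each unordered pair is counted once per citing document.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing documents C1..C3 and references A, B, C.
citing = {
    "C1": ["A", "B", "C"],
    "C2": ["A", "B"],
    "C3": ["B", "C"],
}
counts = cocitation_counts(citing)
print(counts[("A", "B")])  # 2
print(counts[("A", "C")])  # 1
```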

9
Finer Points

Classification of references
is the reference conceptual or operational
is the reference organic or perfunctory
is the reference evolutionary or juxtapositional
(built on a preceding or an alternative to it)
is the reference confirmative or negational
Citation reference errors
multiple authors (not primary or et al.): what
contribution/influence by order of names?
self-citations
like-names, initial/full names, different fields
field variation of citation amounts/purposes
fluctuation of influence/use
typos

10
Bibliometric Laws

Seek to describe the workings of science by
mathematical means. Generally, a few entities
account for most of the citations.
Bradford's Law of Scattering
Lotka's Law
Zipf's Law

11
Bradford's Law of Scattering

How literature in a subject is distributed in
journals.
If scientific journals are arranged in order of
decreasing productivity of articles on a given
subject, they may be divided into a nucleus of
periodicals more particularly devoted to the
subject and several other groups or zones
containing the same number of articles as the
nucleus.
9 journals had 429 articles, the next 59 had 499,
the last 258 had 404.
Bradford discovered this regularity by
calculating the number of titles in each of the
three groups: 9 titles, 9x5 titles, 9x5x5 titles.
Can be influenced by sample size, area of
specialization and journal policies.
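
The zoning procedure can be sketched as follows: rank the journals by productivity, then cut the ranked list wherever the running article total passes the next equal share. This is an illustrative sketch only; the function name and the journal counts are invented, and real data rarely divides this cleanly.

```python
def bradford_zones(article_counts, zones=3):
    """Split journals, ranked by productivity, into zones of about equal articles.

    article_counts: articles on the subject per journal (any order).
    Returns a list of zones; each zone is a list of per-journal counts.
    """
    ranked = sorted(article_counts, reverse=True)
    target = sum(ranked) / zones
    result, current, running = [], [], 0
    for count in ranked:
        current.append(count)
        running += count
        # Close a zone once the running total reaches the next equal share.
        if len(result) < zones - 1 and running >= target * (len(result) + 1):
            result.append(current)
            current = []
    result.append(current)
    return result


# Hypothetical subject: 12 journals, 120 articles in total.
counts = [40, 15, 15, 10, 5, 5, 5, 5, 5, 5, 5, 5]
zones = bradford_zones(counts)
print([len(z) for z in zones])  # [1, 3, 8]
print([sum(z) for z in zones])  # [40, 40, 40]
```

In this made-up case the zones hold 1, 3 and 8 journals, close to the 1 : n : n x n pattern with n = 3.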

12
Brookes on Bradford's Formula

The index terms assigned to documents also
follow a Bradford distribution because those
terms most frequently assigned become less and
less specific and therefore increasingly
ineffective in retrieval.

13
Bradford's Formula Itself

Bradford's Formula makes it possible to estimate
how many of the most productive sources would
yield any specified fraction p of the total
number of items. The formula is
$R(n) = N \log(n/s)$, $1 \le n \le N$
where
$R(n)$ = cumulative total of items contributed by
the sources of rank 1 to n
$N$ = total number of contributing sources
$s$ = a constant characteristic of the literature
then
$R(N) = N \log(N/s)$
is the total number of items contributed by N
sources.
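
To see how the formula answers the "fraction p" question, read it as R(n) = N log(n/s) and set R(n) = p R(N). Then log(n/s) = p log(N/s), so n = s (N/s)^p, whatever the base of the logarithm. The sketch below assumes that reading; the function names and the values N = 1000 and s = 1 are chosen only for illustration.

```python
import math


def cumulative_items(n, N, s):
    """R(n) = N * log(n / s): items contributed by the sources ranked 1..n."""
    return N * math.log(n / s)


def sources_for_fraction(p, N, s):
    """Rank n at which R(n) reaches fraction p of R(N), i.e. n = s * (N/s)**p."""
    return s * (N / s) ** p


# Hypothetical literature with N = 1000 sources and s = 1.
N, s = 1000, 1.0
n_half = sources_for_fraction(0.5, N, s)
print(round(n_half, 1))  # 31.6
print(cumulative_items(n_half, N, s) / cumulative_items(N, N, s))
```

Under these assumed values, roughly 32 of the 1000 sources would yield half of the items.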

14
More Bradford's Law

Citations originally counted year by year can be
expressed as the geometric sequence
$R, Ra, Ra^2, Ra^3, Ra^4, \ldots, Ra^{t-1}$
where R is the presumed number of citations during
the first year, some of which do not immediately
emerge in publication. But as $a < 1$, the sum of
the sequence converges to the finite limit
$R/(1-a)$.
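
A quick numeric check of the convergence claim: the yearly terms shrink by the factor a, and their running sum approaches R/(1-a). The values R = 100 and a = 0.5 are invented for the example.

```python
def citations_by_year(R, a, years):
    """First `years` terms of the sequence R, R*a, R*a**2, ..."""
    return [R * a ** t for t in range(years)]


def total_citations_limit(R, a):
    """Limit of the sum of the sequence, valid only for 0 <= a < 1."""
    if not 0 <= a < 1:
        raise ValueError("the sum converges only for 0 <= a < 1")
    return R / (1 - a)


# Hypothetical paper: 100 citations in year one, halving every year.
print(citations_by_year(100, 0.5, 4))   # [100.0, 50.0, 25.0, 12.5]
print(total_citations_limit(100, 0.5))  # 200.0
```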

15
Lotka's Law

An inverse square law: for every 100 authors
contributing one article, 25 will contribute 2, 11
will contribute 3 and 6 will contribute 4.
The formula is $1/n^2$.
Voos found $1/n^{3.5}$ for Info Science (1974).
What other similar analysis tasks could you
use Lotka's law for?
Are users, browsers, bloggers like authors?
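
The figures on this slide follow directly from dividing the number of one-article authors by n squared. A minimal sketch; the function name and the `exponent` parameter are illustrative, with the exponent exposed so that other values (such as the 3.5 reported by Voos) can be tried.

```python
def lotka_expected_authors(single_article_authors, n, exponent=2.0):
    """Expected number of authors contributing n articles under Lotka's law."""
    return single_article_authors / n ** exponent


for n in range(1, 5):
    print(n, round(lotka_expected_authors(100, n)))
# 1 100
# 2 25
# 3 11
# 4 6
```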

16
Zipf's Law

The distribution, which applies to word frequency
in a text, states that the nth ranking word will
appear k/n times, where k is a constant for that
text.
It is easier to choose and use familiar words,
therefore the probability of occurrence of
familiar words is higher. $r \times f = C$ (rank
times frequency is a constant).
This can be applied by counting all of the words
in a document (minus some words in a stop list -
common words (the, therefore...)) with the most
frequent occurrences representing the subject
matter of the document. Could also use relative
frequency (more often than expected) instead of
absolute frequency.
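
The counting procedure described above can be sketched in a few lines. The stop list, the tokenizing rule and the sample sentence are all assumptions made for the example; a real stop list would be much longer.

```python
import re
from collections import Counter

# Tiny illustrative stop list, not a standard one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "therefore"}


def ranked_terms(text, stop_words=STOP_WORDS):
    """Return (word, count) pairs, most frequent first, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common()


text = ("citation counts rank the citation index; "
        "the citation index ranks journals")
print(ranked_terms(text)[:2])  # [('citation', 3), ('index', 2)]
```

The top-ranked terms then stand in for the subject matter of the document.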

17
Wyllys on Zipf's Law

Surprisingly constrained relationship between
rank and frequency in natural language.
Zipf said the fundamental reason for human
behavior is the striving to minimize effort.
Mandelbrot - further refinement of Zipf's law:
$(r + m)^B f = c$, where r is the rank of a word,
f is its frequency, and m, B and c are constants
dependent on the corpus. m has the greatest
effect when r is small.
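
Reading Mandelbrot's refinement as (r + m)^B f = c and solving for f gives f = c / (r + m)^B. The sketch below compares it with plain Zipf (k/r) to show why m matters mostly at small ranks. The constants c = 1000, m = 2 and B = 1 are invented for illustration, not fitted to any corpus.

```python
def zipf_frequency(rank, k):
    """Plain Zipf: the word of a given rank appears k / rank times."""
    return k / rank


def mandelbrot_frequency(rank, c, m, B):
    """Mandelbrot's refinement: f = c / (rank + m) ** B."""
    return c / (rank + m) ** B


# Ratio of the refined to the plain estimate at a small and a large rank.
for rank in (1, 100):
    ratio = mandelbrot_frequency(rank, 1000, 2, 1) / zipf_frequency(rank, 1000)
    print(rank, round(ratio, 3))
# 1 0.333
# 100 0.98
```

With m = 0 and B = 1 the refined formula reduces to plain Zipf.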

18
Optimum utility of articles?

the most compact library is not the least costly
because you get rid of articles more quickly
therefore you buy more.
fewer articles are acquired and kept longer but
more shelf space and maintenance is needed.
the challenge is to keep the most frequently
accessed available.

19
Goffman's Theory

His General Theory of Information Systems
Ideas are endemic, with minor outbreaks
occurring from time to time. Cycles of use. Like
memes and paradigm shifts (Kuhn). Based on
epidemiology and Shannon's communication theory.

20
Online Article Life

Burton proposed a measure for the decay in
citations to older literature, a half-life
How is this different on the net?
a shorter life?
older sites referred to less, or more?
commercial sites vs. private sites.
advertised vs word of mouth?
linked from popular pages?
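
One common way to put a number on a citation half-life is the median age of the references a body of literature cites. The sketch below assumes that reading, which may differ in detail from Burton's own measure; the years are invented. The same calculation could be tried on Web links if page or link dates are available.

```python
from statistics import median


def citation_half_life(citing_year, cited_years):
    """Median age, in years, of the cited works at the time of citing."""
    ages = [citing_year - year for year in cited_years]
    return median(ages)


# Hypothetical reference list of a paper published in 2006.
print(citation_half_life(2006, [2005, 2004, 2004, 2001, 1996]))  # 2
```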

21
Price's Law

half of the scientific papers are contributed by
the square root of the total number of scientific
authors
Leads to
bibliographic coupling - the number of references
two papers have in common, as a measure of their
similarity; a clustering based on this measure
yields meaningful groupings of papers for
information retrieval.
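
The square-root statement at the top of this slide can be checked against any list of papers per author: take the top sqrt(N) authors and see what share of all papers they wrote. The function name and the data are invented, and the data were constructed so that the law holds exactly.

```python
import math


def top_sqrt_share(papers_per_author):
    """Share of all papers written by the top sqrt(N) most productive authors."""
    ranked = sorted(papers_per_author, reverse=True)
    top_k = round(math.sqrt(len(ranked)))
    return sum(ranked[:top_k]) / sum(ranked)


# Hypothetical field: 16 authors, so the top 4 are examined.
papers = [10, 6, 5, 3] + [2] * 12
print(top_sqrt_share(papers))  # 0.5
```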

22
Cumulative advantage model

Price noticed this advantage
Success breeds success. Also implies that an
obsolescence factor is at work. You get mentioned
a lot, you get mentioned in more and more cited
papers.
Pólya describes this as contagion
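
"Success breeds success" can be simulated with a simple urn scheme in the spirit of Pólya's contagion model: each new citation goes to a paper with probability proportional to the citations it already has. This is an illustrative sketch, not Price's own formulation; the function name, starting counts and seed are assumptions.

```python
import random


def cumulative_advantage(initial_counts, new_citations, seed=0):
    """Hand out citations one at a time, favoring already-cited papers.

    Each new citation picks a paper with probability proportional to its
    current count, then increases that count by one.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    counts = list(initial_counts)
    for _ in range(new_citations):
        chosen = rng.choices(range(len(counts)), weights=counts)[0]
        counts[chosen] += 1
    return counts


# Five hypothetical papers start level with one citation each.
result = cumulative_advantage([1, 1, 1, 1, 1], 200, seed=42)
print(sorted(result, reverse=True))
```

Early random leads tend to persist, so the final counts are usually far from equal even though every paper started level.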

23
Bibliometrics on the Web

We can use these techniques, rules and formulas
to analyze Web usage.
Like a bibliometric index for historical
analysis.
Key question: are citations like page
browsing/using?
Using Web Servers Effectively
Server Logs give us much data to mine
Studies on the Web

24
Understanding the Web

User-based data collection
Surveys
GVU, Nielsen and GNN
Qualitative questions
phone
web forms
Self-selected sample problems
random selection
oversample

25
Understanding the Web

Web Servers
Serve
text
graphics
CGI
XMLHttpRequest (REST, AJAX)
Web services (SOAP)
other MIME types
Server Logs represent this activity
A lot of empirical, quantitative data on use

26
Problems with Web Servers

Not as Foolproof as Print
No State Information
Interaction with Web pages or Web apps is
difficult to log and analyze
Server Hits not Representative
Counters inaccurate
Effects of different, non-HTTP requests
Floods/Bandwidth can stop intended usage
Robots, Spam, (D)DoS, Caching, etc.

27
Web Server Records

Server-based
Proxy-based
Client-based
Network-based

28
Clever Web Content Setup

unique file and directory names
clear, consistent structure
FTP server for file transfer
frees up logs and server!
Judicious use of links
Wise MIME types
some hard/impossible to log

29
Clever Web Server Setup

Redirect CGI to find referrer
Use a database
store web content
record usage data
create state information with programming
NSAPI
ActiveX
Have contact information
Have purpose statements
Bibliometric Servlets?

30
Managing Log Files

Backup
Store Results or Logs?
Beginning New Logs
Posting Results

31
Log File Format

see Appendix
key advantage
computer storage cost decreases while paper cost
rises
every server generates slightly different logs

32
Extended Log File Formats

WWW Consortium Standards
Will automatically record much of what is
programmatically done now.
faster
more accurate
standard baselines for comparison
graphics standards

33
Log Analysis Tools

Analog
WWWStat
GetStats
Perl Scripts
Commercial Tools

34
Log Analysis Cumulative Sample

Program started at Tue-03-Dec-2006 01:20 local
time.
Analysed requests from Thu-28-Jul-2003 20:31 to
Mon-02-Dec-2003 23:59 (858.1 days).
Total successful requests: 4 282 156 (88 952)
Average successful requests per day: 4 990
(12 707)
Total successful requests for pages: 1 058 526
(17 492)
Total failed requests: 88 633 (1 649)
Total redirected requests: 14 457 (197)
Number of distinct files requested: 9 638 (2 268)
Number of distinct hosts served: 311 878 (11 284)
Number of new hosts served in last 7 days: 7 020
Corrupt logfile lines: 262
Unwanted logfile entries: 976
Total data transferred: 23 953 Mbytes (510 619
kbytes)
Average data transferred per day: 28 582 kbytes
(72 946 kbytes)

35
Downie and Web Usage

User-based analyses
who
where
what
File-based analyses
amount
Request analyses
conform (loosely) to Zipf's Law
Byte-based analyses
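
To check whether requests follow Zipf's Law even loosely, rank the requested paths by hit count and look at rank times frequency, which should stay roughly constant. A minimal sketch with an invented request list; the paths would normally come from parsed server log entries.

```python
from collections import Counter


def request_rank_table(paths):
    """Return (rank, path, hits, rank * hits) rows, most requested first."""
    counts = Counter(paths)
    return [
        (rank, path, hits, rank * hits)
        for rank, (path, hits) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical requests, built so that rank * hits is exactly constant.
paths = ["/"] * 6 + ["/a"] * 3 + ["/b"] * 2
for row in request_rank_table(paths):
    print(row)
# (1, '/', 6, 6)
# (2, '/a', 3, 6)
# (3, '/b', 2, 6)
```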

36
Neat Bibliometric Web Tricks

use a search engine to find references
link:www.ischool.utexas/donturn
key to using unique names
use many engines
update times different
blocking mechanisms are different
use Google News (and the like)
look for references
look for IP addresses of users

37
Neat Tricks, cont.

Walking up the Links
follow URLs upward
Reverse Sort
look for relations
Use your own robot to index
test

38
Projects

capture current and previous user information
seeking behavior and modify interface and content
to meet needs
Dynamic Web Publishing System
anticipate information seeking behavior
based on recorded preferences and pre-supplied
rules, generate and guide users through a
document space.

39
Summary

Bibliometrics, now Informetrics
Bradford's - distribution of documents in a
specific discipline
Lotka's - number of authors of varying
productivity
Zipf's - word frequency rankings
The Web
out of control in growth opportunities
wise setup can help
use good analysis tools
