Title: Open and self-sustaining digital library services: the example of NEP.
1 Open and self-sustaining digital library services: the example of NEP
- Thomas Krichel
- 2005-06-29
2 introduction
- The title "Open and self-sustaining digital libraries" was chosen before I was really aware of the needs of the audience.
- I read in the announcement that I am supposed to talk about "?? ??????????????? ?????? ? ?????????????? ????????? ???????". This is an area I don't know that much about, but I hope to ask some interesting questions.
- I hope to find someone who is interested enough in some of them to work with me.
3 my background
- I am a trained economist. An economist knows the price of everything and the value of nothing.
- I am interested in free digital libraries.
- "Free" can mean "??????????" or "?????????". I am
interested more in the former than in the latter.
- My work has mainly been on building such digital libraries. I am less concerned with the usage of such libraries.
- Building and maintaining the library generates costs. How can it then be given away for 0?
4 automation
- Digital libraries could be entirely automated.
- This is true if the purpose of the digital library is mainly to retrieve information.
- Generally speaking, an automated system is quite sufficient for information retrieval. Examples are Google and CiteSeer.
5 limits to automation
- The limits appear when the library is used to assess underlying facts.
- If we say "Thomas Krichel wrote paper X", the computer will not understand who Thomas Krichel is. Only a human can know for sure.
- When the library is used for evaluative purposes, it needs some controlled human intervention.
- By evaluative purpose I mean the purpose of saying how well a person or institution has performed.
6 evaluative purpose
- This seems vague, but here are some evaluative issues in academic libraries:
- which journal is the most cited in field X?
- who has written the most papers in field Y?
- which institution has the most researchers in field Z?
- Human intervention is critical because of
- the identification problems that we have discussed
- the problem of abuse and fraud
7 why bother with evaluation?
- For a self-sustaining, freely available digital library, the problem of contribution is critical.
- Providers of data will have good incentives if the data that they contribute are used to evaluate performance.
- In academic digital libraries, a crucial ingredient that helps performance is visibility. Publish (in the sense of "make public") or perish, quite literally.
8 role of automated means
- Ideally a digital library will use a mixture of automated and human activity.
- We push automation as far as we can, and let humans do the rest.
- The design and successful implementation of such digital libraries is a complex long-run task.
- It can be helped if the digital library is also open.
9 example: RePEc
- This is what I am most famous for. I founded the RePEc digital library. In fact its creation in 1997 goes back to efforts that I made as early as 1993.
- RePEc is a digital library that aims to document key aspects of the discipline of economics.
- It is essentially a metadata collection. But it goes beyond document-collection metadata to collect data about academic authors and institutions.
- These data on authors and institutions stand in relation to the document metadata.
10 RePEc is based on 440 archives
- WoPEc
- EconWPA
- DEGREE
- S-WoPEc
- NBER
- CEPR
- US Fed in Print
- IMF
- OECD
- MIT
- University of Surrey
- SO RAN (Siberian Branch of the Russian Academy of Sciences)
11 to form a 300k-item dataset
- 146,000 working papers
- 154,000 journal articles
- 1,600 software components
- 900 book and chapter listings
- 6,400 author contact and publication listings
- 8,400 institutional contact listings
12 RePEc is used in many services
- EconPapers
- NEP New Economics Papers
- Inomics
- RePEc author service
- Z39.50 service by the DEGREE partners
- IDEAS
- RuPEc
- EDIRC
- LogEc
- CitEc
13 institutional registration
- This works through a system called EDIRC.
- Christian Zimmermann started it as a list of departments that have a web site.
- I persuaded him that his data would be more widely used if integrated into the RePEc database.
- Now he is a crucial RePEc leader.
14 LogEc
- It is a service by Sune Karlsson that tracks the usage of items in the RePEc database:
- abstract views
- downloads
- Christian Zimmermann sends mail containing a monthly usage summary to
- archive maintainers
- RAS registrants
15 authors' incentives
- Authors perceive registration as a way to achieve common advertising for their papers.
- Author records are used to aggregate usage logs across RePEc user services for all papers of an author.
- This stimulates an "I am bigger than you are" mentality. Size matters!
16 NEP: New Economics Papers
- NEP is a current awareness service for new working papers in RePEc.
- Working papers are accounts of recent research findings prior to formal publication.
- Formal publication takes about four years in economics, so no formally published paper is new.
17 NEP reports
- NEP is a collection of subject-specific reports.
- Each report is a serial. It has issues, usually every week.
- Each report has
- a code, e.g. nep-mic
- a subject, e.g. microeconomics
- an editor, i.e. the human who controls the contents.
- A special NEP report, nep-all, contains all new papers.
18 history
- I opened NEP in 1998. John S. Irons agreed to be the general editor.
- The general editor is the person who
- prepares nep-all
- oversees the lists
- In early 2005, the command structure was changed to
- a general editor who prepares nep-all
- a managing director who opens new reports and communicates with the editors
- a controller who watches what the editors are doing
19 edition control
- In the years 1999 to 2001 I took a rather peripheral interest in NEP. At that time many reports developed long editorial delays or were not issued at all.
- Despite that, the number of reports still grew.
- But there is no organization of reports along subject lines in economics.
- The report subject space is linear, with most subjects being covered.
20 coverage ratio analysis
- In a paper by Krichel and Bakkalbasi, there is an effort to analyze the coverage ratio of NEP issues. This is the ratio of papers in nep-all that make it into at least one subject report.
- Historical data show that the mean coverage ratio is not improving over time. Rather, it stays roughly constant at around 70%.
- There are two theories that can help to explain the static nature of the coverage ratio.
21 coverage ratio theory I: target size
- When editors compose a subject report, they have an implicit report size in mind. When nep-all is large, the editors will be more selective. That is, they will take a narrow view of the subject area.
- The chance of a paper being included in a subject report is therefore likely to be smaller when the nep-all issue is large.
22 coverage ratio theory II: quality
- Papers in RePEc vary in quality.
- Some papers have problems with "substantive quality". They
- come from authors that are unknown
- come from institutions that have an unenviable research reputation
- appear in collections that are unknown
- Some papers have problems with "descriptive quality". They are
- not in English
- missing an abstract
- missing keywords
- Editors also filter for quality.
23 empirical study
- Krichel and Bakkalbasi investigate this using binary logistic regression. For every paper that appeared in nep-all, the regression estimates the probability that it will get announced in some subject report.
- They find support for both the target size and the quality theories. There is strong empirical support that the series matters. There is also some empirical support that author prolificacy matters.
- These results have been greeted with protests by the editors, who claim that they only consider the subject when making decisions.
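The paper's exact specification is not reproduced in these slides, so as a hedged sketch only: a binary logistic regression fitted by stochastic gradient descent, with two illustrative (hypothetical) predictors per paper, say a series-reputation dummy and author prolificacy, could look like this:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit binary logistic regression by stochastic gradient descent.
    X: list of feature vectors, y: list of 0/1 announcement outcomes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that a nep-all paper gets announced."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

The feature names here are my assumptions for illustration; the study's actual covariates (series, author prolificacy, the quality indicators above) would take their place.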
24 pre-sorting reports
- As RePEc grows, the growing size of nep-all threatens the survival of NEP.
- Editors simply don't want to cope with it.
- In 2001 I developed an idea to pre-sort the reports for the editors. A computer program would look at past issues of a report, extract features, and forecast the papers most likely to be included.
- Editors would then only need to look at the top part of the pre-sorted nep-all issue, not at the bottom.
25 current state of play
- I extract the following features:
- author names
- title
- abstract
- keywords
- Journal of Economic Literature (JEL) classifications
- series
- I remove punctuation, lowercase, and normalize using the L2 norm.
- I submit the result to svm_light for classification.
- I test using 300 records, and use the rest for training.
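As a minimal sketch of that pipeline (the exact tokenization rules are my assumption), the normalization step and svm_light's sparse "label id:value" input line can be written as:

```python
import math
import re

def l2_features(text):
    """Lowercase, strip punctuation, count terms, and scale the
    term-frequency vector to unit L2 norm."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {t: v / norm for t, v in tf.items()}

def svmlight_line(label, features, vocabulary):
    """Render one example in svm_light's input format:
    '<label> <id>:<value> ...', with feature ids in ascending order."""
    ids = sorted((vocabulary[t], v) for t, v in features.items() if t in vocabulary)
    return str(label) + " " + " ".join(f"{i}:{v:.6f}" for i, v in ids)
```

The vocabulary mapping terms to integer ids is built once over the training set; svm_light requires ascending ids on each line.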
26 How well am I doing?
- This is not a trivial question. Precision and recall are useless: the system does not judge which documents are relevant, only the ordering it produces matters. We know the best and the worst outcomes.
- Some measures have been proposed that do take ordering into account. But they still need to be applied to our case.
- Ideally I want a measure that evaluates instant outcomes and that has some normalization properties:
- the value of the measure at the best outcome should be 1
- the expected value of the measure, under random ordering, should be 0
27 the hiking measure
- One measure that I have developed is what I call the hiking measure.
- I define a step as a permutation of two adjacent documents in the outcome vector.
- I write s(x) for the number of steps that it takes to get from an outcome x, to be evaluated, to the best outcome.
- Then the hiking measure is h(x) = 1 - 2 s(x) / (r (n - r)),
- where n is the total number of documents and r is the number of relevant documents.
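Under this definition, s(x) equals the number of irrelevant documents ranked above each relevant one, so the measure can be computed in a single pass; a sketch, assuming 0 < r < n:

```python
def hiking_measure(x):
    """h(x) = 1 - 2*s(x)/(r*(n - r)), where s(x) counts the adjacent
    swaps needed to move all relevant documents (1s) in the outcome
    vector x to the top.  Assumes 0 < r < n."""
    n, r = len(x), sum(x)
    s = 0
    irrelevant_above = 0
    for xi in x:
        if xi == 0:
            irrelevant_above += 1
        else:
            s += irrelevant_above  # swaps to lift this relevant doc past them
    return 1.0 - 2.0 * s / (r * (n - r))
```

The best outcome gives h = 1, and averaging over all orderings gives 0, matching the two normalization properties asked for above.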
28 example: r = 2, n = 5
- Here is the complete table of outcomes:
- x h(x) | x h(x)
- 1,1,0,0,0 1.0 | 1,0,0,0,1 0.0
- 1,0,1,0,0 2/3 | 0,0,1,1,0 -1/3
- 0,1,1,0,0 1/3 | 0,1,0,0,1 -1/3
- 1,0,0,1,0 1/3 | 0,0,1,0,1 -2/3
- 0,1,0,1,0 0.0 | 0,0,0,1,1 -1.0
- Problems:
- no strict ordering: different outcomes have the same hike value
- violation of a "natural order of outcomes"
29 natural order
- A conscientious editor will be concerned by how low the last relevant paper sinks. Thus, comparing two outcomes, the one that has its last relevant paper at a lower position should be preferred.
- If two outcomes have the last relevant paper at the same position, the second-to-last relevant paper should be compared.
- This leads to a complete ordering of outcomes.
30 conjecture
- A rational editor faces two penalties when composing the report:
- examining a new paper
- risking the loss of a relevant paper
- I claim that, under a large class of formulations of the editor's choice problem, ranking outcomes by the natural order is consistent with minimizing the loss experienced by the editor.
- But I cannot show this.
31 one way for the computational implementation of the natural order
- Derive an algorithm that associates consecutive natural numbers with each of the outcomes, ordered by the natural order.
- The expected value is then trivial to compute, and a measure can be defined.
- Does anyone know such an algorithm?
32 a more flexible way for the computational implementation of the natural order
- Pick y > 1.
- Then evaluate any outcome as
- sum over documents of y^p * i,
- where p is the position, counted from the right starting at 1
- i = 1 if the document is relevant
- i = 0 if not
- example for y = 2: interpret x as a binary number
- example for y = 3:
- 0,1,1,0,0 --> 3^1*0 + 3^2*0 + 3^3*1 + 3^4*1 + 3^5*0 = 108
- Does anybody know the expected value?
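A sketch of this evaluation (the worked example above, 0,1,1,0,0 at y = 3, gives 3^3 + 3^4 = 108). On the expected value: by linearity of expectation, each position holds a relevant document with probability r/n under a uniformly random ordering, which suggests E = (r/n) * sum of y^p over p = 1 ... n:

```python
def outcome_score(x, y=3):
    """Evaluate an outcome as the sum of y**p over relevant documents,
    where p is the position counted from the right, starting at 1."""
    n = len(x)
    return sum(y ** (n - i) for i, xi in enumerate(x) if xi == 1)

def expected_score(n, r, y=3):
    """Expected score under a uniformly random ordering: each position
    is relevant with probability r/n."""
    return r / n * sum(y ** p for p in range(1, n + 1))
```

For y = 2 the score is twice the value of x read as a binary number, so the induced ordering is the same.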
33 outcome: average hike, 30 trials
- exp 98.66  cis 98.35  spo 96.08  ets 95.75  tra 95.61
- hea 95.50  dcm 94.76  geo 94.56  int 94.43  ecm 94.27
- gth 94.09  dge 92.94  mon 92.54  eff 91.48  ene 91.46
- ifn 90.64  ino 90.31  cba 90.04  fmk 89.90  ure 89.86
- hpe 88.91  agr 88.89  evo 87.90  law 87.84  env 87.22
- cul 86.39  cbe 85.76  ent 85.07  com 84.52  net 84.20
- edu 83.80  lab 83.58  dev 83.55  cfn 82.84  res 82.62
- sea 82.25  ias 81.45  cmp 81.11  tur 80.50  fin 80.47
- tid 80.29  pbe 78.99  pol 78.75  mfd 78.07  eec 78.01
- mac 77.03  rmg 76.22  cdm 76.12  cwa 75.38  pub 74.60
- his 71.90  ltv 71.23  afr 69.72  acc 68.72  ind 67.56
- lam 66.20  mic 61.17  reg 59.12  pke 58.85  bec 57.76
34 some remarks
- There is great diversity in the results.
- Some topics are easier to classify automatically than others. The value of a report lies in what the human says that goes beyond what the machine can recognize.
- Unfortunately, manual inspection of poorly forecast results suggests that the reason for the poor results may lie more in the inconsistency of editors' decision making than in the forecasting technique.
- This suggests that forecasting could be used as an evaluation device for the editors. This was not intended when I started this work!
35 how to improve
- Clearly word ordering is important in this area, since different classes don't differ that much in word choice.
- I can use all the keyword data in the RePEc database to find phrases to add to my feature set.
- There may also be a way to automatically deduce significant word combinations from titles and abstracts.
- Finally, a combination with the quality criteria mentioned earlier may be good, but it is not obvious how to do it.
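One common way to find such combinations (my suggestion, not from the talk) is to score adjacent word pairs by pointwise mutual information; a minimal sketch:

```python
import math
from collections import Counter

def significant_bigrams(texts, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI);
    pairs that co-occur far more often than chance are phrase candidates."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        unigrams.update(tokens)
        total += len(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # rare pairs give unreliable PMI estimates
        pmi = math.log((c / n_bi) / ((unigrams[a] / total) * (unigrams[b] / total)))
        scores[(a, b)] = pmi
    return sorted(scores, key=scores.get, reverse=True)
```

Run over titles and abstracts, the top-scoring pairs would become candidate phrase features.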
36 conclusions
- To provide high-quality digital library services, human intervention still appears to be desirable.
- However, we need ways to monitor how well the humans are doing and to notice when they take bad decisions.
- Forecastability can be one criterion.
- Timeliness and usage can be others.
- I will have to work further to develop better monitoring systems for editor behavior.
37 http://openlib.org/home/krichel