Title: Open and self-sustaining digital library services: the example of NEP.
1 Open and self-sustaining digital library services: the example of NEP
- Thomas Krichel
- 2005-06-29
2 introduction
- The title "Open and self-sustaining digital libraries" was chosen before I was really aware of the needs of the audience.
- I read in the announcement that I am supposed to talk about "?? ??????????????? ?????? ? ?????????????? ????????? ???????". This is an area I don't know that much about, but I hope to ask some interesting questions.
- I hope to find someone who is interested enough in some of them to work with me.
3 my background
- I am a trained economist. An economist knows the price of everything and the value of nothing.
- I am interested in free digital libraries.
- "Free" can mean "??????????" or "?????????". I am
interested more in the former than in the latter.
- My work has mainly been on building such digital libraries. I am less concerned with the usage of such libraries.
- Building and maintaining the library generates costs. How can it then be given away for 0?
4 automation
- Digital libraries could be entirely automated.
- This is true if the purpose of the digital library is mainly to retrieve information.
- Generally speaking, an automated system is quite sufficient for information retrieval. Examples are Google and CiteSeer.
5 limits to automation
- The limits appear when the library is used to assess underlying facts.
- If we say "Thomas Krichel wrote paper X", the computer will not understand who Thomas Krichel is. Only a human can know for sure.
- When the library is used for evaluative purposes, it needs some controlled human intervention.
- By evaluative purpose I mean the purpose of saying how well a person or institution has performed.
6 evaluative purpose
- This seems vague, but here are some evaluative issues in academic libraries:
- which journal is the most cited in field X?
- who has written the most papers in field Y?
- which institution has the most researchers in field Z?
- Human intervention is critical because of
- the identification problems that we have discussed
- the problem of abuse and fraud
7 why bother with evaluation?
- For a self-sustaining, freely available digital library, the problem of contribution is critical.
- Providers of data will have good incentives if the data that they contribute are used to evaluate performance.
- In academic digital libraries, a crucial ingredient that helps performance is visibility. Publish (in the sense of "make public") or perish, quite literally.
8 role of automated means
- Ideally a digital library will use a mixture of automated and human activity.
- We push automation as far as we can, and let humans do the rest.
- The design and successful implementation of such digital libraries is a complex long-run task.
- It can be helped if the digital library is also open.
9 example: RePEc
- This is what I am most famous for. I founded the RePEc digital library. In fact its creation in 1997 goes back to efforts that I made as early as 1993.
- RePEc is a digital library that aims to document key aspects of the discipline of economics.
- It is essentially a metadata collection. But it goes beyond document-collection metadata to collect data about academic authors and institutions.
- These data on authors and institutions stand in relation to the document metadata.
10 RePEc is based on 440 archives
- WoPEc
- EconWPA
- DEGREE
- S-WoPEc
- NBER
- CEPR
- US Fed in Print
- IMF
- OECD
- MIT
- University of Surrey
- SO RAN (Siberian Branch of the Russian Academy of Sciences)
11 to form a 300k-item dataset
- 146,000 working papers
- 154,000 journal articles
- 1,600 software components
- 900 book and chapter listings
- 6,400 author contact and publication listings
- 8,400 institutional contact listings
12 RePEc is used in many services
- EconPapers
- NEP New Economics Papers
- Inomics
- RePEc author service
- Z39.50 service by the DEGREE partners
- IDEAS
- RuPEc
- EDIRC
- LogEc
- CitEc
13 institutional registration
- This works through a system called EDIRC.
- Christian Zimmermann started it as a list of departments that have a web site.
- I persuaded him that his data would be more widely used if integrated into the RePEc database.
- Now he is a crucial RePEc leader.
14 LogEc
- It is a service by Sune Karlsson that tracks the usage of items in the RePEc database:
- abstract views
- downloads
- Christian Zimmermann sends mail containing a monthly usage summary to
- archive maintainers
- RAS registrants
15 authors' incentives
- Authors perceive registration as a way to achieve common advertising for their papers.
- Author records are used to aggregate usage logs across RePEc user services for all papers of an author.
- This stimulates an "I am bigger than you are" mentality. Size matters!
16 NEP: New Economics Papers
- NEP is a current awareness service for new working papers in RePEc.
- Working papers are accounts of recent research findings prior to formal publication.
- Formal publication takes about four years in economics, so no formally published paper is new.
17 NEP reports
- NEP is a collection of subject-specific reports.
- Each report is a serial. It has issues, usually every week.
- Each report has
- a code, e.g. nep-mic
- a subject, e.g. microeconomics
- an editor, i.e. the human who controls the contents.
- A special NEP report, nep-all, contains all new papers.
18 history
- I opened NEP in 1998. John S. Irons agreed to be the general editor.
- The general editor is the person who
- prepares nep-all
- oversees the lists
- In early 2005, the command structure was changed to
- a general editor who prepares nep-all
- a managing director who opens new reports and communicates with the editors
- a controller who watches what the editors are doing
19 edition control
- In the years 1999 to 2001 I took a rather peripheral interest in NEP. At that time many reports developed long editorial delays or were not issued at all.
- Despite that, the number of reports still grew.
- But there is no organization of reports along subject lines in economics.
- The report subject space is linear, with most subjects being covered.
20 coverage ratio analysis
- In a paper by Krichel and Bakkalbasi, there is an effort to analyze the coverage ratio of NEP issues. This is the ratio of papers in nep-all that make it into at least one subject report.
- Historical data show that the mean coverage ratio is not improving over time. Rather, it stays roughly constant at around 70%.
- There are two theories that can help to explain the static nature of the coverage ratio.
21 coverage ratio theory I: target size
- When editors compose a subject report, they have an implicit report size in mind. When nep-all is large, the editors will be more selective. That is, they will take a narrow view of the subject area.
- The chance of a paper being included in a subject report is therefore likely to be smaller when the nep-all issue is large.
22 coverage ratio theory II: quality
- Papers in RePEc vary in quality.
- Some papers have problems with "substantive quality". They
- come from authors that are unknown
- come from institutions that have an unenviable research reputation
- appear in collections that are unknown
- Some papers have problems with "descriptive quality". They are
- not in English
- missing an abstract
- missing keywords
- Editors also filter for quality.
23 empirical study
- Krichel and Bakkalbasi investigate this using binary logistic regression. For every paper that appeared in nep-all, the regression estimates the probability that it will get announced in some subject report.
- They find support for both the target size and the quality theories. There is strong empirical support that the series matters. There is also some empirical support that author prolificacy matters.
- These results have been greeted with protests by the editors, who claim that they only consider the subject when making decisions.
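The paper's exact specification is not reproduced in these slides, so as a hedged sketch only: a binary logistic regression fitted by stochastic gradient descent, with two illustrative (hypothetical) predictors per paper, say a series-reputation dummy and author prolificacy, could look like this:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit binary logistic regression by stochastic gradient descent.
    X: list of feature vectors, y: list of 0/1 announcement outcomes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that a nep-all paper gets announced."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

The feature names here are my assumptions for illustration; the study's actual covariates (series, author prolificacy, the quality indicators above) would take their place.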
24 pre-sorting reports
- As RePEc grows, the growing size of nep-all threatens the survival of NEP.
- Editors simply don't want to cope with it.
- In 2001 I developed an idea to pre-sort the reports for the editors. A computer program would look at past issues of a report, extract features, and forecast the papers most likely to be included.
- Editors would then only need to look at the top part of the pre-sorted nep-all issue, not at the bottom.
25 current state of play
- I extract the following features:
- author names
- title
- abstract
- keywords
- Journal of Economic Literature (JEL) classifications
- series
- I remove punctuation, lowercase, and normalize using the L2 norm.
- I submit the result to svm_light for classification.
- I test using 300 records, and use the rest for training.
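As a minimal sketch of that pipeline (the exact tokenization rules are my assumption), the normalization step and svm_light's sparse "label id:value" input line can be written as:

```python
import math
import re

def l2_features(text):
    """Lowercase, strip punctuation, count terms, and scale the
    term-frequency vector to unit L2 norm."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {t: v / norm for t, v in tf.items()}

def svmlight_line(label, features, vocabulary):
    """Render one example in svm_light's input format:
    '<label> <id>:<value> ...', with feature ids in ascending order."""
    ids = sorted((vocabulary[t], v) for t, v in features.items() if t in vocabulary)
    return str(label) + " " + " ".join(f"{i}:{v:.6f}" for i, v in ids)
```

The vocabulary mapping terms to integer ids is built once over the training set; svm_light requires ascending ids on each line.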
26 How well am I doing?
- This is not a trivial question. Precision and recall are useless: the system does not judge which documents are relevant, only the ordering it produces matters. We know the best and the worst outcomes.
- Some measures have been proposed that do take ordering into account. But they still need to be applied to our case.
- Ideally I want a measure that evaluates instant outcomes and that has some normalization properties:
- the value of the measure at the best outcome should be 1
- the expected value of the measure, under random ordering, should be 0
27 the hiking measure
- One measure that I have developed is what I call the hiking measure.
- I define a step as a permutation of two adjacent documents in the outcome vector.
- I write s(x) for the number of steps that it takes to get from an outcome x, to be evaluated, to the best outcome.
- Then the hiking measure is h(x) = 1 - 2 s(x) / (r (n - r)),
- where n is the total number of documents and r is the number of relevant documents.
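Under this definition, s(x) equals the number of irrelevant documents ranked above each relevant one, so the measure can be computed in a single pass; a sketch, assuming 0 < r < n:

```python
def hiking_measure(x):
    """h(x) = 1 - 2*s(x)/(r*(n - r)), where s(x) counts the adjacent
    swaps needed to move all relevant documents (1s) in the outcome
    vector x to the top.  Assumes 0 < r < n."""
    n, r = len(x), sum(x)
    s = 0
    irrelevant_above = 0
    for xi in x:
        if xi == 0:
            irrelevant_above += 1
        else:
            s += irrelevant_above  # swaps to lift this relevant doc past them
    return 1.0 - 2.0 * s / (r * (n - r))
```

The best outcome gives h = 1, and averaging over all orderings gives 0, matching the two normalization properties asked for above.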
28 example: r = 2, n = 5
- Here is the complete table of outcomes:
- x h(x) | x h(x)
- 1,1,0,0,0 1.0 | 1,0,0,0,1 0.0
- 1,0,1,0,0 2/3 | 0,0,1,1,0 -1/3
- 0,1,1,0,0 1/3 | 0,1,0,0,1 -1/3
- 1,0,0,1,0 1/3 | 0,0,1,0,1 -2/3
- 0,1,0,1,0 0.0 | 0,0,0,1,1 -1.0
- Problems:
- no strict ordering: different outcomes have the same hike value
- violation of a "natural order of outcomes"
29 natural order
- A conscientious editor will be concerned by how low the last relevant paper sinks. Thus, comparing two outcomes, the one that has its last relevant paper at a lower position should be preferred.
- If two outcomes have the last relevant paper at the same position, the second-to-last relevant paper should be compared.
- This leads to a complete ordering of outcomes.
30 conjecture
- A rational editor faces two penalties when composing the report:
- examining a new paper
- risking the loss of a relevant paper
- I claim that, under a large class of formulations of the editor's choice problem, ranking outcomes by the natural order is consistent with minimizing the loss experienced by the editor.
- But I cannot show this.
31 one way for the computational implementation of the natural order
- Derive an algorithm that associates consecutive natural numbers with each of the outcomes, ordered by the natural order.
- The expected value is then trivial to compute, and a measure can be defined.
- Does anyone know such an algorithm?
32 a more flexible way for the computational implementation of the natural order
- Pick y > 1.
- Then evaluate any outcome as
- sum over documents of y^p * i,
- where p is the position, counted from the right starting at 1
- i = 1 if the document is relevant
- i = 0 if not
- example for y = 2: interpret x as a binary number
- example for y = 3:
- 0,1,1,0,0 --> 3^1*0 + 3^2*0 + 3^3*1 + 3^4*1 + 3^5*0 = 108
- Does anybody know the expected value?
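A sketch of this evaluation (the worked example above, 0,1,1,0,0 at y = 3, gives 3^3 + 3^4 = 108). On the expected value: by linearity of expectation, each position holds a relevant document with probability r/n under a uniformly random ordering, which suggests E = (r/n) * sum of y^p over p = 1 ... n:

```python
def outcome_score(x, y=3):
    """Evaluate an outcome as the sum of y**p over relevant documents,
    where p is the position counted from the right, starting at 1."""
    n = len(x)
    return sum(y ** (n - i) for i, xi in enumerate(x) if xi == 1)

def expected_score(n, r, y=3):
    """Expected score under a uniformly random ordering: each position
    is relevant with probability r/n."""
    return r / n * sum(y ** p for p in range(1, n + 1))
```

For y = 2 the score is twice the value of x read as a binary number, so the induced ordering is the same.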
33 outcome: average hike, 30 trials
- exp 98.66  cis 98.35  spo 96.08  ets 95.75  tra 95.61
- hea 95.50  dcm 94.76  geo 94.56  int 94.43  ecm 94.27
- gth 94.09  dge 92.94  mon 92.54  eff 91.48  ene 91.46
- ifn 90.64  ino 90.31  cba 90.04  fmk 89.90  ure 89.86
- hpe 88.91  agr 88.89  evo 87.90  law 87.84  env 87.22
- cul 86.39  cbe 85.76  ent 85.07  com 84.52  net 84.20
- edu 83.80  lab 83.58  dev 83.55  cfn 82.84  res 82.62
- sea 82.25  ias 81.45  cmp 81.11  tur 80.50  fin 80.47
- tid 80.29  pbe 78.99  pol 78.75  mfd 78.07  eec 78.01
- mac 77.03  rmg 76.22  cdm 76.12  cwa 75.38  pub 74.60
- his 71.90  ltv 71.23  afr 69.72  acc 68.72  ind 67.56
- lam 66.20  mic 61.17  reg 59.12  pke 58.85  bec 57.76
34 some remarks
- There is great diversity in the results.
- Some topics are easier to classify automatically than others. The value of a report lies in what the human says that goes beyond what the machine can recognize.
- Unfortunately, manual inspection of poorly forecast results suggests that the reason for the poor results may lie more in the inconsistency of editors' decision making than in the forecasting technique.
- This suggests that forecasting could be used as an evaluation device for the editors. This was not intended when I started this work!
35 how to improve
- Clearly word ordering is important in this area, since different classes don't differ that much in word choice.
- I can use all the keyword data in the RePEc database to find phrases to add to my feature set.
- There may also be a way to automatically deduce significant word combinations from titles and abstracts.
- Finally, a combination with the quality criteria mentioned earlier may be good, but it is not obvious how to do it.
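One common way to find such combinations (my suggestion, not from the talk) is to score adjacent word pairs by pointwise mutual information; a minimal sketch:

```python
import math
from collections import Counter

def significant_bigrams(texts, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI);
    pairs that co-occur far more often than chance are phrase candidates."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        unigrams.update(tokens)
        total += len(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # rare pairs give unreliable PMI estimates
        pmi = math.log((c / n_bi) / ((unigrams[a] / total) * (unigrams[b] / total)))
        scores[(a, b)] = pmi
    return sorted(scores, key=scores.get, reverse=True)
```

Run over titles and abstracts, the top-scoring pairs would become candidate phrase features.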
36 conclusions
- To provide high-quality digital library services, human intervention still appears to be desirable.
- However, we need ways to monitor how well the humans are doing and to notice when they take bad decisions.
- Forecastability can be one criterion.
- Timeliness and usage can be others.
- I will have to work further to develop better monitoring systems for editor behavior.
37 http://openlib.org/home/krichel