Open and self-sustaining digital library services: the example of NEP.

Provided by: open6
Learn more at: https://openlib.org

1
Open and self-sustaining digital library
services: the example of NEP.
  • Thomas Krichel
  • 2005-06-29

2
introduction
  • The title "Open and self-sustaining digital
    libraries" was chosen before I was really
    aware of the needs of the audience.
  • I read in the announcement that I am supposed to
    talk about "?? ??????????????? ?????? ?
    ?????????????? ????????? ???????". This is an area I
    don't know that much about, but I hope to be
    asking some interesting questions.
  • I hope to find someone who is interested enough
    in some of them to work with me.

3
my background
  • I am a trained economist. An economist knows the
    price of everything and the value of nothing.
  • I am interested in free digital libraries.
  • "Free" can mean "бесплатный" (free of charge) or
    "свободный" (free as in freedom). I am
    interested more in the former than in the latter.
  • My work has mainly been on building such digital
    libraries. I am less concerned with the usage of
    such libraries.
  • The building and maintenance of the library will
    generate costs. How can it be given away for 0?

4
automation
  • Digital libraries could be entirely automated.
  • This is true if the purpose of the digital
    library is mainly to retrieve information.
  • Generally speaking, for information retrieval an
    automated system is quite sufficient. Examples
    are Google and CiteSeer.

5
limit to automation
  • This comes in when the library is used to assess
    underlying facts.
  • If we say "Thomas Krichel wrote paper X" the
    computer will not understand who Thomas Krichel
    is. Only a human can know for sure.
  • When the library is used for evaluative purposes,
    it needs some controlled human intervention.
  • By evaluative purpose I mean the purpose of saying
    how well a person or institution has performed.

6
evaluative purpose
  • This seems vague, but here are some evaluative
    issues in academic libraries:
      • Which journal is the most cited in field X?
      • Who has written the most papers in field Y?
      • Which institution has the most researchers in
        field Z?
  • Human intervention is critical because of
      • the identification problems that we have discussed
      • the problem of abuse and fraud

7
why bother with evaluation?
  • For a self-sustaining freely available digital
    library, the problem of contribution is critical.
  • Providers of data will have good incentives, if
    the data that they contribute is used to evaluate
    performance.
  • In academic digital libraries a crucial
    ingredient that helps performance is visibility.
    Publish (in the sense of "make public") or perish,
    quite literally.

8
role of automated means
  • Ideally a digital library will use a mixture of
    automated and human activity.
  • We push automation as far as we can, and let
    humans do the rest.
  • The design and successful implementation of such
    digital libraries is a complex long-run task.
  • It can be helped if the digital library is also
    open.

9
Example RePEc
  • This is what I am most famous for. I founded the
    RePEc digital library. In fact its creation in
    1997 goes back to efforts that I made as early as
    1993.
  • RePEc is a digital library that aims to document
    key aspects of the discipline of Economics.
  • It is essentially a metadata collection. But it
    goes beyond document collections' metadata to
    collect data about academic authors and
    institutions.
  • These data on authors and institutions stand in
    relation to the document metadata.

10
RePEc is based on 440 archives
  • WoPEc
  • EconWPA
  • DEGREE
  • S-WoPEc
  • NBER
  • CEPR
  • US Fed in Print
  • IMF
  • OECD
  • MIT
  • University of Surrey
  • CO PAH

11
to form a 300k item dataset
  • 146,000 working papers
  • 154,000 journal articles
  • 1,600 software components
  • 900 book and chapter listings
  • 6,400 author contact and publication
    listings
  • 8,400 institutional contact listings

12
RePEc is used in many services
  • EconPapers
  • NEP: New Economics Papers
  • Inomics
  • RePEc author service
  • Z39.50 service by the DEGREE partners
  • IDEAS
  • RuPEc
  • EDIRC
  • LogEc
  • CitEc

13
institutional registration
  • This works through a system called EDIRC.
  • Christian Zimmermann started it as a list of
    departments that have a web site.
  • I persuaded him that his data would be more
    widely used if integrated into the RePEc
    database.
  • Now he is a crucial RePEc leader.

14
LogEc
  • It is a service by Sune Karlsson that tracks
    usage of items in the RePEc database:
      • abstract views
      • downloads
  • Christian Zimmermann sends a monthly usage
    summary by mail to
      • archive maintainers
      • RAS registrants

15
authors' incentives
  • Authors perceive the registration as a way to
    achieve common advertising for their papers.
  • Author records are used to aggregate usage logs
    across RePEc user services for all papers of an
    author.
  • This stimulates an "I am bigger than you are"
    mentality. Size matters!

16
NEP: New Economics Papers
  • NEP is a current awareness service for new
    working papers in RePEc.
  • Working papers are accounts of recent research
    findings prior to formal publication.
  • Formal publication takes about four years in
    Economics, so no formal paper is new.

17
NEP reports
  • NEP is a collection of subject-specific reports.
  • Each report is a serial. It has issues, usually
    every week.
  • Each report has
      • a code, e.g. nep-mic
      • a subject, e.g. microeconomics
      • an editor, i.e. a human who controls the contents.
  • A special NEP report, nep-all, contains all new
    papers.

18
history
  • Initially, I opened NEP in 1998. John S. Irons
    agreed to be the general editor.
  • The general editor is the person who
      • prepares nep-all
      • oversees the lists
  • In early 2005, the command structure was changed
    to
      • a general editor who prepares nep-all
      • a managing director who opens new reports and
        communicates with the editors
      • a controller who watches what the editors are doing

19
edition control
  • In the years 1999 to 2001 I took a rather
    peripheral interest in NEP. At this time many
    reports developed long editorial delays or were
    not issued at all.
  • Despite that, the number of reports still
    grew.
  • But there is no organization of reports into
    lines of subjects in economics.
  • The report subject space is linear, with most
    subjects being covered.

20
coverage ratio analysis
  • In a paper by Krichel and Bakkalbasi, there is an
    effort to analyze the coverage ratio of NEP
    issues. This is the share of papers in nep-all
    that make it to at least one subject report.
  • Historical data shows that the mean coverage ratio
    is not improving over time. Rather, it stays
    constant at around 70%.
  • There are two theories that can help to explain
    the static nature of the coverage ratio.
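
The coverage ratio defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: the function name, the paper handles and the three small reports are all hypothetical and chosen only to show the computation.

```python
def coverage_ratio(nep_all_issue, subject_reports):
    """Share of papers in one nep-all issue that appear in at least one subject report.

    nep_all_issue   -- list of paper handles in the nep-all issue
    subject_reports -- dict mapping a report code to the handles it announced
    """
    if not nep_all_issue:
        raise ValueError("the nep-all issue is empty")
    announced = set()
    for papers in subject_reports.values():
        announced.update(papers)
    covered = [paper for paper in nep_all_issue if paper in announced]
    return len(covered) / len(nep_all_issue)


# Hypothetical toy data: ten papers, three subject reports.
issue = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
reports = {
    "nep-mic": ["p1", "p2", "p3"],
    "nep-mac": ["p3", "p4", "p5", "p6"],  # p3 is announced twice but counted once
    "nep-fin": ["p7"],
}
ratio = coverage_ratio(issue, reports)
print(ratio)
```

A paper announced in several reports counts only once, since the ratio asks whether a paper made it into at least one report.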

21
coverage ratio theory I: target size
  • When editors compose the subject report, they
    have an implicit report size in mind. When
    nep-all is large, then the editors will be more
    selective. That is, they will take a narrow view
    of the subject area.
  • The chance that a paper is included in the
    subject report is likely to be smaller when the
    nep-all issue is large.

22
coverage ratio theory II: quality
  • Papers in RePEc differ in quality.
  • Some papers have problems with "substantive
    quality". They
      • come from authors that are unknown
      • come from institutions that have an unenviable
        research reputation
      • appear in collections that are unknown.
  • Some papers have problems with "descriptive
    quality":
      • not in English
      • no abstract
      • no keywords
  • Editors also filter for quality.

23
empirical study
  • Krichel and Bakkalbasi investigate this by using a
    binary logistic regression analysis. This
    estimates, for every paper that appeared in
    nep-all, the probability that it will get
    announced in any subject report.
  • They find support for both the target size and
    the quality theories. There is strong empirical
    support that the series matters. There is also
    some empirical support that author prolificacy
    matters.
  • These results have been greeted with protests by
    the editors, who claim that they only consider
    the subject when making decisions.

24
pre-sorting reports
  • As RePEc grows, the growing size of nep-all
    threatens the survival of NEP.
  • Editors simply don't want to cope with it.
  • In 2001 I developed an idea to pre-sort the
    report for the editors. A computer program would
    look at past issues of the report, extract
    features, and make forecasts about the most
    likely papers.
  • Editors would then only need to look at the top
    part of the pre-sorted nep-all issue, not at the
    bottom.

25
current state of play
  • I extract the following features:
      • author names
      • title
      • abstract
      • keywords
      • Journal of Economic Literature (JEL)
        classifications
      • series
  • I remove punctuation, lowercase, and normalize
    using the L2 norm.
  • I submit the result to svm_light for
    classification.
  • I test using 300 records, and use the rest for
    training.
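
The preprocessing steps above could look roughly like the following Python sketch. It is an illustration under assumptions, not the actual NEP code: the record, its field names and the helper names are hypothetical, plain term counts stand in for whatever weighting is really used, and the output follows the `<target> <feature>:<value>` line format that svm_light reads, with feature numbers in ascending order.

```python
import math
import string
from collections import Counter


def tokenize(text):
    """Lowercase the text, replace ASCII punctuation by spaces, split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


def l2_normalize(counts):
    """Scale a term-count mapping so that its Euclidean (L2) length is 1."""
    norm = math.sqrt(sum(value * value for value in counts.values()))
    return {term: value / norm for term, value in counts.items()} if norm else {}


def to_svmlight_line(label, weights, vocabulary):
    """Format one example as '<target> <feature>:<value> ...', feature ids ascending."""
    pairs = sorted((vocabulary[term], weight) for term, weight in weights.items())
    return f"{label:+d} " + " ".join(f"{fid}:{weight:.6f}" for fid, weight in pairs)


# Hypothetical record; real records also carry authors, abstract, JEL codes and series.
record = {
    "title": "Monetary Policy, Inflation and Growth",
    "keywords": "inflation; monetary policy",
}
tokens = tokenize(record["title"] + " " + record["keywords"])
vocabulary = {term: i for i, term in enumerate(sorted(set(tokens)), start=1)}
weights = l2_normalize(Counter(tokens))
line = to_svmlight_line(+1, weights, vocabulary)  # +1: paper was in the report
print(line)
```

In a real run the vocabulary would be built once over all training records, and papers that were not selected for the report would get the target -1.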

26
How well am I doing?
  • This is not a trivial question. Precision and
    recall are useless. It does not matter which
    documents are judged relevant by the system. Only
    the ordering matters. We know the best and worst
    outcomes.
  • Some measures have been proposed that do take
    ordering into account. But they still need to be
    applied to our case.
  • Ideally I would have a measure that evaluates
    instant outcomes and that has some normalization
    properties:
      • The value of the measure at the best outcome
        should be 1.
      • The expected value of the measure under random
        ordering should be 0.

27
the hiking measure
  • One measure that I have developed is what I call
    the hiking measure.
  • I define a step as a swap of two neighbouring
    documents in the outcome vector.
  • I denote the number of steps that it takes to get
    from an outcome x to be evaluated to the best
    outcome as s(x).
  • Then the hiking measure is
    $h(x) = 1 - \frac{2\,s(x)}{r\,(n - r)}$,
  • where n is the total number of documents and r
    is the number of relevant documents.
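
A minimal Python sketch of the hiking measure follows. It rests on two assumptions that are read off the r = 2, n = 5 example rather than stated outright: a step swaps two neighbouring documents, and the normalizing constant is r(n - r), the number of steps between the best and the worst outcome. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def steps_to_best(x):
    """Fewest swaps of neighbouring documents that move all relevant ones to the top.

    x is a sequence of 0/1 flags, top of the list first. Each relevant document
    has to pass every irrelevant document ranked above it.
    """
    irrelevant_seen = 0
    steps = 0
    for flag in x:
        if flag == 0:
            irrelevant_seen += 1
        else:
            steps += irrelevant_seen
    return steps


def hike(x):
    """Hiking measure: 1 for the best outcome, -1 for the worst."""
    n, r = len(x), sum(x)
    if r == 0 or r == n:
        raise ValueError("need at least one relevant and one irrelevant document")
    return 1 - Fraction(2 * steps_to_best(x), r * (n - r))


def all_outcomes(n, r):
    """Every arrangement of r relevant documents among n positions."""
    for chosen in combinations(range(n), r):
        yield tuple(1 if i in chosen else 0 for i in range(n))


outcomes = list(all_outcomes(5, 2))
mean_hike = sum(hike(x) for x in outcomes) / len(outcomes)
print(hike((1, 0, 1, 0, 0)), mean_hike)
```

Averaging over all arrangements gives a quick check of the second normalization property, namely that the expected value under random ordering is 0.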

28
example: r = 2, n = 5
  • Here is the complete table of outcomes:

      x           h(x)     x           h(x)
      1,1,0,0,0    1.0     1,0,0,0,1    0.0
      1,0,1,0,0    2/3     0,0,1,1,0   -1/3
      0,1,1,0,0    1/3     0,1,0,0,1   -1/3
      1,0,0,1,0    1/3     0,0,1,0,1   -2/3
      0,1,0,1,0    0.0     0,0,0,1,1   -1.0

  • Problems:
      • no strict ordering: different outcomes have the
        same hike
      • violation of a "natural order of outcomes"
29
natural order
  • A conscientious editor will be concerned by how
    low the last relevant paper sinks. Thus, comparing
    two outcomes, the one that has the last relevant
    paper closer to the top should be preferred.
  • If two outcomes have the last relevant paper at
    the same position, the second-to-last relevant
    paper should be compared.
  • This leads to a complete ordering of outcomes.
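
The comparison rule above can be written as a sort key. This is a sketch under one assumption: position 1 is the top of the list, and an outcome is better when its last relevant paper has a smaller position number. The function names are illustrative.

```python
from itertools import combinations


def natural_order_key(x):
    """Positions (1 = top) of the relevant papers, listed from the last one upwards.

    Tuples compare element by element, so the last relevant paper is compared
    first, then the second-to-last, and so on. A smaller key is a better outcome.
    """
    return tuple(pos for pos in range(len(x), 0, -1) if x[pos - 1] == 1)


def all_outcomes(n, r):
    """Every arrangement of r relevant documents among n positions."""
    for chosen in combinations(range(n), r):
        yield tuple(1 if i in chosen else 0 for i in range(n))


ranked = sorted(all_outcomes(5, 2), key=natural_order_key)
for outcome in ranked:
    print(outcome, natural_order_key(outcome))
```

For r = 2 and n = 5 all ten keys are distinct, so no two outcomes tie. In particular 0,1,1,0,0 comes before 1,0,0,1,0, although both have the same hike of 1/3.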

30
conjecture
  • A rational editor faces two penalties when
    composing the report:
      • examining a new paper
      • risking the loss of a relevant paper
  • I claim that under a large class of formulations
    of the editor's choice, ranking outcomes by the
    natural order is consistent with minimizing the
    loss experienced by the editor.
  • But I cannot show this.

31
one way for the computational implementation of
natural order
  • Derive an algorithm that will associate
    consecutive natural numbers with each of the
    outcomes, ordered by the natural order.
  • The expected value is then trivial to compute,
    and a measure can be defined.
  • Does anyone know such an algorithm?
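
One candidate, offered as a sketch rather than a settled answer: if the natural order compares the last relevant paper first, it coincides with the colexicographic order of the sets of relevant positions, and the combinatorial number system then gives the rank directly. With 0-based positions c_1 < c_2 < ... < c_r, the rank is C(c_1, 1) + C(c_2, 2) + ... + C(c_r, r). The function names and the rescaling to the interval from -1 to 1 are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def natural_rank(x):
    """Rank of outcome x in the natural order; 0 is the best outcome.

    Uses the combinatorial number system on the 0-based positions of the
    relevant documents, taken in ascending order.
    """
    positions = [i for i, flag in enumerate(x) if flag == 1]
    return sum(comb(c, k) for k, c in enumerate(positions, start=1))


def rank_measure(x):
    """Rescaled rank: 1 at the best outcome, -1 at the worst."""
    total = comb(len(x), sum(x))  # number of possible outcomes
    if total < 2:
        raise ValueError("need at least two distinct outcomes")
    return 1 - Fraction(2 * natural_rank(x), total - 1)


def all_outcomes(n, r):
    """Every arrangement of r relevant documents among n positions."""
    for chosen in combinations(range(n), r):
        yield tuple(1 if i in chosen else 0 for i in range(n))


outcomes = list(all_outcomes(5, 2))
ranks = sorted(natural_rank(x) for x in outcomes)
mean_measure = sum(rank_measure(x) for x in outcomes) / len(outcomes)
print(ranks, mean_measure)
```

Because the ranks are the consecutive numbers 0 to C(n, r) - 1, the mean rank under a uniformly random ordering is (C(n, r) - 1) / 2, so the rescaled measure has expected value 0.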

32
a more flexible way for the computational
implementation of natural order
  • Pick y > 1.
  • Then evaluate any outcome as
      $\sum_p y^p \, i_p$,
    where p is the position, starting from the right,
    and
      • $i_p = 1$ if the paper at position p is relevant
      • $i_p = 0$ if not
  • Example for y = 2: interpret x as a binary number.
  • Example for y = 3:
      0 1 1 0 0 → $3^1 \cdot 0 + 3^2 \cdot 0 + 3^3 \cdot 1 + 3^4 \cdot 1 + 3^5 \cdot 0$
  • Does anybody know the expected value?
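
A sketch of this score in Python, together with one possible answer to the question. It assumes that positions are counted from the right starting at 1, and that "random ordering" means every arrangement of the r relevant papers is equally likely. Each position is then relevant with probability r/n, so by linearity of expectation the expected score is (r/n) (y + y^2 + ... + y^n). The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def positional_score(x, y):
    """Sum of y**p over the relevant papers, p counted from the right, starting at 1."""
    n = len(x)
    return sum(y ** (n - index) for index, flag in enumerate(x) if flag == 1)


def expected_score(n, r, y):
    """Expected score when all arrangements of r relevant papers are equally likely."""
    return Fraction(r, n) * sum(y ** p for p in range(1, n + 1))


def all_outcomes(n, r):
    """Every arrangement of r relevant documents among n positions."""
    for chosen in combinations(range(n), r):
        yield tuple(1 if i in chosen else 0 for i in range(n))


# Brute-force check of the closed form for r = 2, n = 5, y = 3.
outcomes = list(all_outcomes(5, 2))
brute_force_mean = Fraction(sum(positional_score(x, 3) for x in outcomes), len(outcomes))
print(positional_score((0, 1, 1, 0, 0), 3), expected_score(5, 2, 3), brute_force_mean)
```

Subtracting this expected value and rescaling would give a measure with the normalization properties asked for earlier, although choosing y remains open.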

33
outcome: average hike, 30 trials

      exp 98.66   cis 98.35   spo 96.08   ets 95.75   tra 95.61
      hea 95.50   dcm 94.76   geo 94.56   int 94.43   ecm 94.27
      gth 94.09   dge 92.94   mon 92.54   eff 91.48   ene 91.46
      ifn 90.64   ino 90.31   cba 90.04   fmk 89.90   ure 89.86
      hpe 88.91   agr 88.89   evo 87.90   law 87.84   env 87.22
      cul 86.39   cbe 85.76   ent 85.07   com 84.52   net 84.20
      edu 83.80   lab 83.58   dev 83.55   cfn 82.84   res 82.62
      sea 82.25   ias 81.45   cmp 81.11   tur 80.50   fin 80.47
      tid 80.29   pbe 78.99   pol 78.75   mfd 78.07   eec 78.01
      mac 77.03   rmg 76.22   cdm 76.12   cwa 75.38   pub 74.60
      his 71.90   ltv 71.23   afr 69.72   acc 68.72   ind 67.56
      lam 66.20   mic 61.17   reg 59.12   pke 58.85   bec 57.76

34
some remarks
  • There is a great diversity in the results.
  • Some topics are easier to classify
    automatically than others. The value of the
    report lies in what the human says that goes
    beyond the recognition by the machine.
  • Unfortunately, manual inspection of poorly
    forecast results suggests that the reason for
    the poor result may lie more in the inconsistency
    of editor decision making than in the forecasting
    technique.
  • This suggests that this could be used as an
    evaluation device for the editors. This was not
    intended when I started this work!

35
how to improve
  • Clearly word ordering is important in this area,
    since different classes don't differ that much by
    word choice.
  • I can use all the keyword data in the RePEc
    database to find phrases to add to my feature
    set.
  • There may also be a way to automatically deduce
    significant word combinations from titles and
    abstracts.
  • Finally a combination with the quality criteria
    mentioned may be good but it does not appear
    obvious how to do it.

36
conclusions
  • To provide high quality digital library services,
    human intervention still appears to be desirable.
  • However, we need ways to monitor how well the
    humans are doing. If they take bad decisions
  • Forecastability can be one criterion.
  • Timeliness and usage can be others.
  • I will have to work further to develop better
    monitoring systems for editor behavior.

37
http://openlib.org/home/krichel
  • Thank you for your attention!