Bilingual Russian-English Thesaurus and Domain Ontologies. Thesaurus-Based Technologies and Value-Added Services at University Information System RUSSIA

1
  • Bilingual Russian-English Thesaurus and Domain
    Ontologies. Thesaurus-Based Technologies and
    Value-Added Services at University Information
    System RUSSIA

2
Moscow State University Research Computing Center
NCO Center for Information Research
University Information System RUSSIA (Russian inter-University Social Sciences Information and Analytical Consortium)
www.cir.ru
Prepared for the Seminar at the Finnish Social Science Data Archive, Helsinki, March 9-10, 2006
by Tatyana N. Yudina, Leading researcher, Ph.D. (history), Moscow State University Research Computing Center
Anna Bogomolova, Assistant professor, Ph.D. (economics), Moscow State University Economic faculty
yudina_at_mail.cir.ru
bogo_at_mail.cir.ru
6
University Information System RUSSIA Collections
2,000,000 / 20 Gb (www.cir.ru)
7
UIS RUSSIA
  • Collections of documents in English
  • - OECD Health Data,
  • - RePEc (Research Papers in Economics,
    www.repec.org) abstracts and full texts,
  • - Council of Europe documents,
  • - European Court of Human Rights archive,
  • - Publications of Kennan Institute, USA.

8
NLP technology in UIS RUSSIA
[Diagram labels: holdings; converters; Automatic Linguistic Text Processing / linguistic processors; file types .POD, .OUT, .LEM, .HDR, .HTM; ORACLE; WEB www.cir.ru (Apache OAS); administrator]
9
  • Automatic Linguistic Text Processing (ALTP) is the
    UIS RUSSIA team's own know-how.
  • ALTP is designed for content-based processing and
    integration of all main types of business prose text
    corpora (documents and statistics): government
    publications, daily records of the parliament chambers,
    think tank reports, scientific journals, mass
    media, public opinion polls. Content-based
    processing includes:
  • -- Conceptual Indexing,
  • -- Coherent Summarization,
  • -- Text Categorization.

10
Thesaurus
11
Sociopolitical Thesaurus
29,000 concepts, 75,000 terms, 110,000 conceptual relations
  • constructed specially as a tool for automatic
    text processing
  • contains terms from economic, financial,
    political, military, social, legislative and
    cultural domains
  • regularly tested during automatic text
    processing
  • set of relations is adjusted to serve
    content-based search, navigation and query
    refinement.
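As a rough illustration of the structure this slide describes (concepts, the terms that name them, and relations between concepts), a minimal Python sketch follows. The class names, the relation label "broader", and the sample entries are hypothetical; they are not taken from the UIS RUSSIA thesaurus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Concept:
    """One thesaurus concept together with the terms that express it."""
    name: str
    terms: Set[str] = field(default_factory=set)
    # relation label -> names of related concepts (labels are illustrative)
    relations: Dict[str, Set[str]] = field(default_factory=dict)


class Thesaurus:
    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}
        self.term_index: Dict[str, str] = {}  # lower-cased term -> concept name

    def add_concept(self, name: str, terms: List[str]) -> None:
        self.concepts[name] = Concept(name, set(terms))
        for term in terms:
            self.term_index[term.lower()] = name

    def relate(self, source: str, label: str, target: str) -> None:
        """Record a labelled relation from one concept to another."""
        self.concepts[source].relations.setdefault(label, set()).add(target)

    def lookup(self, term: str) -> Optional[str]:
        """Return the concept a term belongs to, or None if it is unknown."""
        return self.term_index.get(term.lower())


# Hypothetical sample entries
thesaurus = Thesaurus()
thesaurus.add_concept("TAX", ["tax", "taxation"])
thesaurus.add_concept("INCOME TAX", ["income tax"])
thesaurus.relate("INCOME TAX", "broader", "TAX")
```

Keeping terms separate from concepts is what lets several terms (and, later, terms in several languages) point at one node of the conceptual net.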

12
General Structure of Thesaurus
13
English-Russian Sociopolitical Thesaurus
  • Hierarchical conceptual net of 65 thousand
    English terms
  • Manual work
  • Use of general and special-domain
    English-Russian and Russian-English
    dictionaries,
  • Study of conventional American and British
    dictionaries and thesauri,
  • Cross-checking of translations. Internet
    search checking.

14
Thesaurus terminology in social and political
domain
15
Adding languages to Thesaurus
  • It is a challenge to develop a multilingual
    Sociopolitical thesaurus: to describe terms of
    social and political domains from different
    languages and arrange them in a multilingual
    hierarchical net.
  • A project to add the Tatar language to the
    bilingual thesaurus is under discussion. Tatars
    are the second-largest nation in Russia.

16
Term Extraction for Russian Official Documents
(RF Government Regulation N604 26.06.1995)
17
Thematic Lines of Thesaurus Terms (RF
Government Regulation N604 26.06.1995)
18
Network of Thematic Nodes (RF Government
Regulation N604 26.06.1995)
19
Network of Thematic Nodes in English (RF
Government Regulation N604 26.06.1995)
20
Structure of Thematic Representation
Main Thematic Nodes
Specific Thematic Nodes
21
Structural Thematic Summary (RF Government
Regulation N604 26.06.1995)
22
THESAURUS for Information Retrieval in
Sociopolitical Domain
  • Thesaurus provides for query refinement,
    reformulation and expansion
  • Terminology of Thesaurus covers 95-98% of
    business prose: terms of Russian government
    publications, academic papers and mass media
    texts from 1991
  • Thesaurus is a main element of ALTP (automatic
    linguistic text processing) technology at UIS
    RUSSIA.
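Query expansion of the kind mentioned on this slide can be sketched as follows. This is only an illustration under stated assumptions: the mini-thesaurus, the "narrower concept" relation, and the function name expand_query are hypothetical, and the actual UIS RUSSIA procedure is not described in the slides.

```python
# Hypothetical mini-thesaurus: concept -> its terms, and concept -> narrower concepts.
TERMS = {
    "CURRENCY": ["currency", "monetary unit"],
    "RUBLE": ["ruble", "rouble"],
    "US DOLLAR": ["US dollar", "dollar"],
}
NARROWER = {
    "CURRENCY": ["RUBLE", "US DOLLAR"],
}


def expand_query(concept, depth=1):
    """Collect the terms of a concept and of its narrower concepts, down to `depth` levels."""
    expanded = list(TERMS.get(concept, []))
    if depth > 0:
        for child in NARROWER.get(concept, []):
            for term in expand_query(child, depth - 1):
                if term not in expanded:
                    expanded.append(term)
    return expanded


print(expand_query("CURRENCY"))
```

With depth=0 the query is only reformulated with the concept's own terms; a larger depth widens it along the hierarchy.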

23
Query Refinement
24
Navigation in Thesaurus
29
Bilingual Information Retrieval
30
Document content representation in two
languages: scheme
Document in Russian
Content representation in Russian
Content representation of a document
Document in English
Content representation in English
31
Document content representation in two
languages: example
Document in Russian
Content representation in Russian
Content representation of a document
Document in English
Content representation in English
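One way to read the scheme above is that documents in either language are reduced to a shared set of concepts, so a query in one language can match documents in the other. The sketch below illustrates that reading only; the concept IDs, the sample terms and documents, and the function names are hypothetical, and it skips the morphological analysis a real system would need.

```python
# Hypothetical bilingual term index: term in either language -> concept ID.
TERM_TO_CONCEPT = {
    "tax": "C_TAX",
    "налог": "C_TAX",        # Russian for "tax"
    "budget": "C_BUDGET",
    "бюджет": "C_BUDGET",    # Russian for "budget"
}

# Hypothetical two-document collection, one document per language.
DOCUMENTS = {
    "doc_ru": "налог бюджет",
    "doc_en": "budget report",
}


def represent(text):
    """Map whitespace-separated tokens to the concept IDs they express."""
    return {TERM_TO_CONCEPT[t] for t in text.lower().split() if t in TERM_TO_CONCEPT}


def search(query):
    """Return IDs of documents that share at least one concept with the query."""
    wanted = represent(query)
    return sorted(doc_id for doc_id, text in DOCUMENTS.items() if wanted & represent(text))
```

For example, search("budget") matches both the Russian and the English document, because both are indexed under the same concept ID.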
32
Bilingual Search in UIS RUSSIA
33
www.cir.ru/is4/
34
Text Categorization
35
Expert-made classification
60% coincidence
High accuracy
Not high relevance
36
Classification in automatic mode
37
Text Categorization Using Thematic Representation
  • Systems of Subject Headings:
  • UIS RUSSIA system of subject headings,
  • RF Central Election Committee Legal Subject
    Headings (450 items, 4 levels),
  • 80 Top Terms of the Legislative Indexing
    Vocabulary (LIV), Congressional Research
    Service of the US Congress.

38
English-Russian Sociopolitical Thesaurus: new
applications
  • Automatic text categorization of research
    papers in economics using JEL subject
    headings (700 categories),
  • Automatic text processing of statistical
    tables,
  • Automatic text processing of documents of
    European organizations (European Court of Human
    Rights, Council of Europe, European Union).

39
System of Subject Headings for Budget Data
  • 87 hierarchical categories
  • First-level categories are:
  • Macroeconomic Indicators
  • Budget Revenues and Expenditures
  • Tax Concessions
  • Budget Deficit/Surplus
  • State and Municipal Debt
  • Budget Process
  • Budget Federalism
  • Extra-Budgetary Funds
  • State Authorities
  • Fiscal Misconduct

41
Foreign Exchange Rate
  • 1. ((US Dollar OR Euro Currency OR Ruble) AND
    Foreign Exchange Rate)
  • OR
  • 2. ((US Dollar OR Euro Currency) AND Ruble AND
    Economic Development (Economic Crisis, Economic
    Forecasting, Economic Indicator, Economic Growth,
    Economic Laws, Economic Situation))
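The Boolean expression above can be checked against the set of thesaurus concepts found in a document. The sketch below is one possible reading, not the authors' implementation: the function name is hypothetical, and it assumes that the parenthesized list after "Economic Development" names alternatives to that concept, any one of which satisfies the last condition.

```python
def matches_foreign_exchange_rate(concepts):
    """Evaluate the two-branch rule from the slide against a document's concept set."""
    currencies = {"US Dollar", "Euro Currency"}
    # Assumption: the parenthesized concepts are alternatives to "Economic Development".
    economic_context = {
        "Economic Development", "Economic Crisis", "Economic Forecasting",
        "Economic Indicator", "Economic Growth", "Economic Laws",
        "Economic Situation",
    }
    # Branch 1: any of the three currencies AND Foreign Exchange Rate
    branch_1 = bool(concepts & (currencies | {"Ruble"})) and "Foreign Exchange Rate" in concepts
    # Branch 2: a foreign currency AND Ruble AND some economic-context concept
    branch_2 = (
        bool(concepts & currencies)
        and "Ruble" in concepts
        and bool(concepts & economic_context)
    )
    return branch_1 or branch_2
```

Under this reading, a document indexed with {"US Dollar", "Ruble", "Economic Crisis"} falls under the heading even though the concept "Foreign Exchange Rate" never appears in it.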

44
  • Thank you!
  • Tatyana N. Yudina, Leading researcher, Ph.D.
    (history)
  • Moscow State University Research Computing Center
  • yudina_at_mail.cir.ru
  • Anna Bogomolova, Assistant professor, Ph.D.
    (economics)
  • Moscow State University Economic faculty
  • bogo_at_mail.cir.ru