Title: Bilingual Russian-English Thesaurus and Domain Ontologies. Thesaurus-Based Technologies and Value-Added Services
1- Bilingual Russian-English Thesaurus and Domain Ontologies. Thesaurus-Based Technologies and Value-Added Services at University Information System RUSSIA
2Moscow State University Research Computing Center, NCO Center for Information Research
University Information System RUSSIA (Russian inter-University Social Sciences Information and Analytical consortium), www.cir.ru
Prepared for the seminar at the Finnish Social Science Data Archive, Helsinki, March 9-10, 2006, by
Tatyana N. Yudina, Leading researcher, Ph.D. (history), Moscow State University Research Computing Center, yudina_at_mail.cir.ru
Anna Bogomolova, Assistant professor, Ph.D. (economics), Moscow State University Economic faculty, bogo_at_mail.cir.ru
6University Information System RUSSIA
Collections: 2,000,000 / 20 Gb (www.cir.ru)
7UIS RUSSIA
- Collections of documents in English
- - OECD Health Data,
- - RePEc (Research Papers in Economics, www.repec.org) abstracts and full texts,
- - Council of Europe documents,
- - European Court of Human Rights archive,
- - Publications of Kennan Institute, USA.
8NLP technology in UIS RUSSIA
(Diagram labels: holdings; convertors; Automatic Linguistic Text Processing / Linguistic Processors; file types .POD, .OUT, .LEM, .HDR, .HTM; ORACLE; WEB www.cir.ru (Apache OAS); Administrator.)
9- Automatic Linguistic Text Processing (ALTP) is the UIS RUSSIA team's know-how.
- ALTP is tuned to process by content and to integrate all main types of business prose text corpora (documents and statistics): government publications, parliament chambers' daily records, think tank reports, scientific journals, mass media, public opinion polls.
- Content-based processing includes
- -- Conceptual Indexing,
- -- Coherent Summarization,
- -- Text Categorisation.
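As an illustration of the first step, conceptual indexing can be thought of as mapping the terms found in a text to thesaurus concepts and counting the concepts rather than the words. The sketch below is a minimal toy version under stated assumptions: the term table, the concept labels and the function name `conceptual_index` are hypothetical and are not taken from ALTP, whose actual algorithms are not described in these slides.

```python
from collections import Counter

# Hypothetical toy thesaurus: lowercase term -> concept label (not real ALTP data).
TERM_TO_CONCEPT = {
    "exchange rate": "FOREIGN EXCHANGE RATE",
    "currency rate": "FOREIGN EXCHANGE RATE",
    "ruble": "RUBLE",
    "rouble": "RUBLE",
    "budget deficit": "BUDGET DEFICIT",
}


def conceptual_index(text, term_to_concept=TERM_TO_CONCEPT, max_len=3):
    """Count thesaurus concepts in text, preferring the longest matching term."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try multi-word terms first, then shorter ones.
        for n in range(max_len, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in term_to_concept:
                counts[term_to_concept[candidate]] += 1
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return counts


index = conceptual_index(
    "The rouble exchange rate fell. The currency rate and the ruble recovered."
)
print(index)
```

Different surface terms ("rouble", "ruble") collapse into one concept, which is what makes the index usable for content-based search and categorization.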
10Thesaurus
11Sociopolitical Thesaurus
29,000 concepts, 75,000 terms, 110,000 conceptual relations
- constructed specially as a tool for automatic text processing
- contains terms from economic, financial, political, military, social, legislative and cultural domains
- regularly tested during automatic text processing
- set of relations is adjusted to serve content-based search, navigation and query refinement.
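The three counts above (concepts, terms, conceptual relations) suggest a structure in which several terms point to one concept and concepts are linked to each other. The following is a minimal sketch of such a structure, assuming only broader/narrower relations; the class names `Concept` and `Thesaurus`, the method names and the sample entries are hypothetical and do not describe the actual UIS RUSSIA implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    terms: set = field(default_factory=set)      # text entries (synonyms)
    broader: set = field(default_factory=set)    # names of parent concepts
    narrower: set = field(default_factory=set)   # names of child concepts


class Thesaurus:
    def __init__(self):
        self.concepts = {}     # concept name -> Concept
        self.term_index = {}   # lowercase term -> concept name

    def add_concept(self, name, terms=()):
        concept = self.concepts.setdefault(name, Concept(name))
        for term in terms:
            concept.terms.add(term)
            self.term_index[term.lower()] = name
        return concept

    def add_hierarchy(self, broader, narrower):
        # Store the relation in both directions for navigation.
        self.add_concept(broader).narrower.add(narrower)
        self.add_concept(narrower).broader.add(broader)

    def lookup(self, term):
        return self.term_index.get(term.lower())

    def descendants(self, name):
        """All concepts reachable through narrower links."""
        seen, stack = set(), [name]
        while stack:
            for child in self.concepts[stack.pop()].narrower:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


th = Thesaurus()
th.add_concept("CURRENCY", ["currency", "money unit"])
th.add_concept("RUBLE", ["ruble", "rouble"])
th.add_hierarchy("CURRENCY", "RUBLE")
th.add_hierarchy("CURRENCY", "US DOLLAR")
print(th.lookup("Rouble"), th.descendants("CURRENCY"))
```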
12General Structure of Thesaurus
13English-Russian Sociopolitical Thesaurus
- Hierarchical conceptual net of 65 thousand English terms
- Manual work
- Use of general and special domain English-Russian and Russian-English dictionaries,
- Study of conventional American and British dictionaries and thesauri,
- Cross-checking of translations. Internet search checking.
14Thesaurus terminology in social and political
domain
15Adding languages to Thesaurus
- It is a challenge to develop a multilingual Sociopolitical thesaurus: to describe terms of social and political domains from different languages and arrange them in a multilingual hierarchical net.
- A project is under discussion to add the Tatar language to the bilingual thesaurus. Tatars are the second largest ethnic group in Russia.
16Term Extraction for Russian Official Documents
(RF Government Regulation N604 26.06.1995)
17Thematic Lines of Thesaurus Terms (RF
Government Regulation N604 26.06.1995)
18Network of Thematic Nodes (RF Government
Regulation N604 26.06.1995)
19Network of Thematic Nodes in English (RF
Government Regulation N604 26.06.1995)
20Structure of Thematic Representation
Main Thematic Nodes
Specific Thematic Nodes
21Structural Thematic Summary (RF Government Regulation N604 26.06.1995)
22THESAURUS for Information Retrieval in Sociopolitical Domain
- Thesaurus provides for query refinement, reformulation and expansion
- Terminology of Thesaurus covers 95-98% of business prose terms of Russian government publications, academic papers and mass media texts from 1991
- Thesaurus is a main element of ALTP (automatic linguistic text processing) technology at UIS RUSSIA.
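Query expansion with a thesaurus can be pictured as replacing a query term by all terms of the same concept, optionally adding the terms of narrower concepts. The sketch below is only an illustration under assumptions: the tables `SYNONYMS` and `NARROWER`, the function `expand_query` and all sample entries are hypothetical and are not the search logic actually used at www.cir.ru.

```python
# Hypothetical toy data: concept -> terms, and concept -> narrower concepts.
SYNONYMS = {
    "FOREIGN EXCHANGE RATE": ["exchange rate", "currency rate"],
    "RUBLE": ["ruble", "rouble"],
    "CURRENCY": ["currency"],
}
NARROWER = {"CURRENCY": ["RUBLE"]}

# Reverse lookup: term -> concept.
TERM_TO_CONCEPT = {
    term: concept for concept, terms in SYNONYMS.items() for term in terms
}


def expand_query(term, include_narrower=True):
    """Return the query term expanded with synonyms and narrower terms."""
    concept = TERM_TO_CONCEPT.get(term.lower())
    if concept is None:
        return [term]  # unknown term: leave the query unchanged
    concepts = [concept]
    if include_narrower:
        concepts += NARROWER.get(concept, [])
    expanded = []
    for c in concepts:
        for t in SYNONYMS[c]:
            if t not in expanded:
                expanded.append(t)
    return expanded


print(expand_query("currency"))
print(expand_query("Rouble"))
```

Switching `include_narrower` off gives a narrower, synonym-only reformulation, which is one simple way to model the difference between expansion and refinement.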
23Query Refinement
24Navigation in Thesaurus
29Bilingual Information Retrieval
30Document content representation in two languages: scheme
Document in Russian
Content representation in Russian
Content representation of a document
Document in English
Content representation in English
31Document content representation in two languages: example
Document in Russian
Content representation in Russian
Content representation of a document
Document in English
Content representation in English
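One way to read the scheme above is that a document in either language is reduced to language-independent concepts, which can then be displayed or searched in Russian or in English. The following sketch assumes exactly that reading; the concept identifiers, the term tables, the naive substring matching and the functions `represent`, `render` and `search` are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical bilingual toy thesaurus: concept id -> preferred label per language.
CONCEPTS = {
    "C1": {"ru": "валютный курс", "en": "foreign exchange rate"},
    "C2": {"ru": "рубль", "en": "ruble"},
}
# (language, lowercase term) -> concept id
TERMS = {
    ("ru", "валютный курс"): "C1",
    ("ru", "курс валют"): "C1",
    ("en", "exchange rate"): "C1",
    ("en", "foreign exchange rate"): "C1",
    ("ru", "рубль"): "C2",
    ("en", "ruble"): "C2",
}


def represent(text, lang):
    """Language-independent content representation: a set of concept ids."""
    lowered = text.lower()
    # Naive substring matching; a real system would use linguistic processing.
    return {cid for (l, term), cid in TERMS.items() if l == lang and term in lowered}


def render(concept_ids, lang):
    """Show a content representation in the requested language."""
    return sorted(CONCEPTS[cid][lang] for cid in concept_ids)


def search(query, query_lang, documents):
    """Return ids of documents (in any language) covering all query concepts."""
    wanted = represent(query, query_lang)
    return [
        doc_id
        for doc_id, (text, lang) in documents.items()
        if wanted and wanted <= represent(text, lang)
    ]


docs = {
    "d1": ("Курс валют и рубль", "ru"),
    "d2": ("The exchange rate of the ruble", "en"),
    "d3": ("Budget process", "en"),
}
print(render(represent(docs["d1"][0], "ru"), "en"))
print(search("ruble", "en", docs))
```

Because matching happens on concept ids, an English query retrieves the Russian document `d1` as well as the English document `d2`.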
32Bilingual Search in UIS RUSSIA
33www.cir.ru/is4/
34Text Categorization
35Expert-made classification
60% coincidence
High accuracy, not high relevance
36Classification in automatic mode
37Text Categorization Using Thematic Representation
- Systems of Subject Headings
- UIS RUSSIA system of subject headings,
- RF Central Election Committee Legal Subject Headings (450 items, 4 levels),
- 80 Top Terms of Legislative Indexing Vocabulary (LIV), Congressional Research Service of the US Congress.
38English-Russian Sociopolitical Thesaurus: new applications
- Automatic text categorization of research papers in economics exploiting JEL subject headings (700 categories),
- Automatic text processing of statistical tables,
- Automatic text processing of European organizations' documents (European Court of Human Rights, Council of Europe, European Union).
39System of Subject Headings for Budget Data
- 87 hierarchical categories
- First level categories are:
- - Macroeconomic Indicators
- - Budget Revenues and Expenditures
- - Tax Concessions
- - Budget Deficit/Surplus
- - State and Municipal Debt
- - Budget Process
- - Budget Federalism
- - Extra-Budgetary Funds
- - State Authorities
- - Fiscal Misconduct
41Foreign Exchange Rate
- 1. ((US Dollar OR Euro Currency OR Ruble) AND Foreign Exchange Rate)
- OR
- 2. ((US Dollar OR Euro Currency) AND Ruble AND Economic Development (Economic Crisis, Economic Forecasting, Economic Indicator, Economic Growth, Economic Laws, Economic Situation))
44- Thank you!
- Tatyana N. Yudina, Leading researcher, Ph.D. (history)
- Moscow State University Research Computing Center
- yudina_at_mail.cir.ru
- Anna Bogomolova, Assistant professor, Ph.D. (economics)
- Moscow State University Economic faculty
- bogo_at_mail.cir.ru