Title: A note on extracting sentiments in financial news in English, Arabic and Urdu
1A note on extracting sentiments in financial news in English, Arabic and Urdu
- Yousif Almas, Department of Computing, University of Surrey, Guildford, UK (y.almas_at_surrey.ac.uk)
- Khurshid Ahmad, Department of Computer Science, Trinity College, Dublin, Ireland (kahmad_at_cs.tcd.ie)
The Second Workshop on Computational Approaches to Arabic Script-based Languages, LSA 2007 Linguistic Institute, July 21, 2007, Stanford University
2Motivation
- Daniel Kahneman (Nobel Prize awarded 2002): Behavioural Finance and Human Psychology
- Herbert Simon (Nobel Prize awarded 1978): Bounded Rationality and Information Overload
- Robert Engle (Nobel Prize awarded 2003): News Impact Curves and the Asymmetrical Effect of News
3Outline
- Introduction
- Background
- Method
- Experiments
- Evaluation and Conclusion
4Introduction World Language Hierarchy
The world language hierarchy (1997)
The world language hierarchy (2050)
Graddol, David (1997). The Future of English? London: The British Council.
5Introduction Financial News
- Many Sources and Languages
- Information about local markets is not always available in English
Q: To what extent do you consider news when taking buy/sell decisions?
A: Regularly, and in two languages (English and Arabic). (Head of Treasury in an international bank based in the Middle East)
6Introduction The role of Financial Language
[Diagram: the role of financial language. Nodes: Financial News, Financial Language, Financial Professionals, Financial Reporters, Financial Markets. Edge labels: write, analyse, restrict, communicate, use, describe, survey, report, affect.]
7Introduction - Eyeballing the text!
- What is missing in the qualitative analysis packages?
- The texts have to be eyeballed: most phrases, clauses and paragraphs have to be coded/annotated by hand -> an impossible task when the volume of text around us is exploding (Surdeanu et al., 2003; Chan and Wai, 2005; Lan et al., 2005; Das and Chen, 2006)
- There is a need for a domain-specific thesaurus (conceptually organised terminology or ontology) for each new domain:
  - Identify ontological commitments
  - Find terms, and their broader/narrower equivalents, synonyms and antonyms
  - Maintain terminology databases
- Texts that are conceptually similar within a domain have to be clustered using unsupervised learning algorithms
- Almost all systems are Anglo-centric
8Introduction - Objective
- Propose a language-informed framework for financial news analysis using techniques of corpus linguistics and special language terminology
- Expect that the positive and negative sentiment expressed will be tied to aspects of language like metaphor
- Assume that frequency, collocation and local grammar analysis will lead to patterns useful for the automatic analysis of financial news
9Introduction - Objective
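[Diagram: the proposed System added to the financial language cycle. System links: (1) learn, (2) analyse, (3) assist. Nodes: System, Financial News, Financial Language, Financial Professionals, Financial Reporters, Financial Markets. Edge labels: write, analyse, restrict, communicate, use, describe, survey, report, affect.]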
10Background Behavioural Finance
Investors are not rational agents (machines) that
can process all available information rationally
Psychological factors and irrationality must be
considered when studying the movements of
financial markets
11Background The Role of Media
- Tetlock et al. (2005, 2006) examined whether a simple quantitative measure of language can be used to predict individual firms' accounting earnings and stock returns. Their findings:
  - 1) the fraction of negative words in firm-specific news stories forecasts low firm earnings
  - 2) firms' stock prices briefly under-react to the information embedded in negative words
  - 3) the earnings and return predictability from negative words is largest for the stories that focus on fundamentals
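The "fraction of negative words" measure can be illustrated with a minimal sketch. The word list below is a hypothetical placeholder and not the lexicon used by Tetlock et al.; the function name and the tokenisation rule are also assumptions made for illustration.

```python
import re

# Hypothetical negative-word list, for illustration only; the slides do not
# give the lexicon used in the cited studies.
NEGATIVE_WORDS = {"fall", "fell", "loss", "losses", "decline", "weak", "worries"}


def negative_fraction(text, negative_words=NEGATIVE_WORDS):
    """Return the share of word tokens in `text` found in `negative_words`."""
    # Assumed tokenisation: lower-cased runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in negative_words)
    return hits / len(tokens)


story = "The firm lowered its revenue outlook and now expects revenue to fall"
print(negative_fraction(story))
```

A per-story score of this kind could then be related to earnings or returns; that step is outside this sketch.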
12Background Special Languages
- A special language is a linguistic subsystem intended for unambiguous communication in a particular subject field using a terminology and other linguistic means (ISO 1087)
- Authors of special texts share a common vocabulary and common habits of word usage (Hirschman and Sager, 1982)
- The range of grammatical constructions of a natural language is significantly reduced in special languages (Kittredge, 1982)
- A word's frequency correlates with its acceptability in a language community (Quirk et al., 1985)
- There is a close relation between word distribution and information-bearing phrases (Hirschman, 1986)
13Background Financial Language
- A special language comprising a terminology and metaphorical mappings couched in a local grammar
- Oil[instrument] prices rose[movement] to 68.52[value] a barrel amid worries about an escalation in the standoff between Iran and the west[cause].
- The firm lowered its revenue outlook for the first quarter last night and now expects revenue to fall six percent from the fourth quarter.
- Ryanair's healthy margins give its earnings a strong defence.
Source: www.reuters.co.uk
14Background - Local Grammars
- Descriptors of particular parts of language use or sentences with specific functions (Gross, 1993)
- Capture the contextual properties of lexical items
- Consider the lexical, syntactic and semantic restrictions that words exhibit
- Would only accept sentences that are meaningful and related to the task
15Background - Local Grammars
- Words like "rose" or "fall" may be used as a name
- A local grammar rejects such spurious uses
- A local grammar, used almost exclusively in
financial reporting, can be used to extract
true sentiment from raw sentiment
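As a rough illustration of how a local grammar can reject spurious uses, the sketch below encodes one instrument-movement-value pattern as a regular expression. The instrument and movement word lists, the pattern itself and the function name are assumptions for illustration; they are not the patterns learned by the method in these slides.

```python
import re

# Hypothetical word lists, for illustration only.
INSTRUMENTS = ["oil", "gold", "shares", "profit", "revenue"]
MOVEMENTS = {"rose": "up", "up": "up", "fell": "down", "fall": "down"}

# One toy pattern: <instrument> [price(s)] <movement> [to|by] <value>
PATTERN = re.compile(
    r"\b(?P<instrument>" + "|".join(INSTRUMENTS) + r")\s+(?:prices?\s+)?"
    r"(?P<movement>" + "|".join(MOVEMENTS) + r")\s+"
    r"(?:to\s+|by\s+)?(?P<value>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract(sentence):
    """Return (instrument, direction, value) if the pattern matches, else None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (
        match.group("instrument").lower(),
        MOVEMENTS[match.group("movement").lower()],
        float(match.group("value")),
    )


print(extract("Oil prices rose to 68.52 a barrel amid worries"))
print(extract("Rose Smith joined the board"))  # "Rose" as a name: no match
```

The second call shows the point of the slide: "Rose" appears in the text, but because it does not sit in the movement slot of the pattern it is not counted as sentiment.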
16Background Arabic Financial News
- Increasing liquidity in some parts of the Arab world has poured large amounts of money into the local financial markets and attracted the attention of local and international media (e.g. Reuters, CNBC, CNN, BBC, etc.)
17Background Case Study
- Correlating manually selected positive and negative words in the Arabic newspaper Al-Wafd with the value of the Egyptian pound showed some anti-correlation between negative news and the value of the pound (Ahmad and Almas, IV05)
[Chart legend: Positives, Negatives, Financial Instrument]
18Background Metaphors in Finance
falling or rising ?????? ?? ??????
sick or healthy ???? ?? ?????
ascending or descending ???? ?? ????
Cartoons source: http://www.aleqt.com/
19Background - Metaphors
- Metaphors can be both culture- and language-bound, but in financial news they usually relate to physical/biological movements across languages, with some exceptions:
  - English: bullish and bearish
  - Arabic: cod (hamoor, ?????)
20Background Financial Language and Trading
Financial Services (English)
Agricultural Commodities (Urdu)
Oil (Arabic)
21Background Multilingual Analysis
- Some metaphors may not transfer across languages
- If the predominant trading changes from financial instruments (e.g. shares, currencies, bonds) to commodities, will the patterns then survive?
- What is seen as positive news in, say, the USA might be received as negative news in the Middle East (e.g. the direction of oil prices)
22Background Word Order
- Word order and collocation extraction in a multilingual environment
- English (SVO): Oracle profit rose 50 percent
- Arabic (VSO): ?????? ????? ?????? 50 ?? ?????
- Urdu (SOV): ?????? ?? ????? ??? 50 ???? ?????
- Annotation: 6 intervening tokens
- Arabic example: ?????? ????? ????? ?????? ????? 138 ?? ?????
- Arabic gloss: percent by-ratio Gulf Cement profit rise
- English translation: Gulf Cement profit up 138 percent
- Annotations: length is neutral; 1 intervening token
23Method
- Statistical, corpus-based term extraction (Ahmad et al., 2006)
- 1- Frequency Analysis and Terminology Extraction
- 2- Collocation Extraction
- 3- Significant N-gram Extraction
- There is more information in a sequence of words (collocations) than in words individually (Firth, Halliday and Sinclair)
- Focus on special language properties
- Raw corpora are useful
- Automatic identification of patterns
Pipeline: Term extraction -> Collocation Extraction -> N-gram Extraction -> N-gram Normalisation -> Pattern Generation -> Pattern Pruning
24Method
- Frequency Analysis and Terminology Extraction
- Identify terms in special language texts by comparing the relative frequency distribution of a special language corpus with one that is representative of a general language (Ahmad, 1995)
- Extract terms with frequency and weirdness z-scores above a positive threshold
Weirdness: $w = \frac{f_{\text{Special}}}{f_{\text{General}}} \times \frac{N_{\text{General}}}{N_{\text{Special}}}$, where $f$ = frequency and $N$ = corpus size (tokens)
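A minimal sketch of the weirdness ratio and its z-score, assuming both corpora are already tokenised into lists of words. The add-one adjustment for words absent from the general corpus is an assumption made here to avoid division by zero; the slides do not say how that case is handled. The function names are illustrative.

```python
from collections import Counter
from statistics import mean, pstdev


def weirdness(special_tokens, general_tokens):
    """Weirdness per word: (f_special / f_general) * (N_general / N_special)."""
    f_special = Counter(special_tokens)
    f_general = Counter(general_tokens)
    n_special = len(special_tokens)
    n_general = len(general_tokens)
    scores = {}
    for word, fs in f_special.items():
        # Assumption: add one to the general-corpus frequency so that words
        # missing from the general corpus do not cause a division by zero.
        fg = f_general.get(word, 0) + 1
        scores[word] = (fs / fg) * (n_general / n_special)
    return scores


def z_scores(values):
    """Standardise a {word: value} mapping (population standard deviation)."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)
    if sigma == 0:
        return {word: 0.0 for word in values}
    return {word: (v - mu) / sigma for word, v in values.items()}


special = ["percent", "shares", "the", "percent"]
general = ["the", "the", "the", "cat", "sat", "percent", "on", "the"]
w = weirdness(special, general)
z = z_scores(w)
print(w)
print(z)
```

Candidate terms would then be the words whose frequency and weirdness z-scores both exceed the chosen threshold.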
25Method
- Example: properties of the keyword "percent" in English, Arabic and Urdu financial corpora
- Using top news corpora for weirdness analysis
26Method
- Collocation Extraction (Smadja, 1993)
- For a given word, find all collocates at positions -5 to +5 (Is it applicable to Arabic? What about morphological/syntactic complexity?)
- Avoid semantic constraints (e.g. doctor and nurse)
- Three criteria:
  - strength (normalised frequency), 95% rejection (K-Score)
  - position histogram must not be flat (U-Score)
  - select peak from histogram (P-Score)
(adapted from a slide by V. Hatzivassiloglou)
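The three criteria can be sketched as follows, in the spirit of Smadja's (1993) first stage: a strength z-score over collocate frequencies, a spread (variance) over the ten window positions, and peak positions above the histogram mean plus a multiple of the square root of the spread. The function names and the default thresholds (k0 = 1, k1 = 1, U0 = 10) are assumptions chosen to mirror the (U, k, p) = (10, 1, 1) setting in these slides; this is a simplified sketch, not the authors' implementation.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev

WINDOW = 5  # collocates are counted at offsets -5..-1 and +1..+5


def position_histograms(sentences, keyword, window=WINDOW):
    """Count, for each collocate, how often it occurs at each offset from keyword."""
    hist = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != keyword:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    hist[tokens[j]][j - i] += 1
    return hist


def collocates(sentences, keyword, k0=1.0, k1=1.0, u0=10.0, window=WINDOW):
    """Return {collocate: peak offsets} for collocates passing all three tests."""
    hist = position_histograms(sentences, keyword, window)
    if not hist:
        return {}
    freqs = {w: sum(h.values()) for w, h in hist.items()}
    f_mean = mean(list(freqs.values()))
    f_sd = pstdev(list(freqs.values()))
    offsets = [d for d in range(-window, window + 1) if d != 0]
    result = {}
    for word, h in hist.items():
        counts = [h.get(d, 0) for d in offsets]
        p_mean = mean(counts)
        # Spread (U-score): variance of the position histogram.
        spread = sum((c - p_mean) ** 2 for c in counts) / len(counts)
        # Strength (K-score): z-score of the collocate's total frequency.
        strength = (freqs[word] - f_mean) / f_sd if f_sd else 0.0
        if strength >= k0 and spread >= u0:
            # Peaks (P-score): offsets well above the histogram mean.
            threshold = p_mean + k1 * sqrt(spread)
            result[word] = [d for d, c in zip(offsets, counts) if c >= threshold]
    return result
```

For example, in a toy corpus where "rose" always appears two positions before "percent" and every other neighbour is a one-off, only "rose" survives, with a single peak at offset -2.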
27Method
28Method
- Extract N-grams in the corpus that comprise highly collocating keywords, with (U, k, p) = (10, 1, 1) and weirdness z-score above 0 -> avoids closed-class collocates
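One possible reading of the N-gram extraction step, sketched under assumptions: given a keyword, one of its collocates and a peak offset found in the previous step, collect every token span that runs from one to the other. The function name and its arguments are hypothetical, and the slides do not specify how spans are delimited.

```python
from collections import Counter


def extract_ngrams(sentences, keyword, collocate, offset):
    """Count the token spans linking `keyword` to `collocate` at a fixed offset."""
    ngrams = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            j = i + offset
            if tok == keyword and 0 <= j < len(tokens) and tokens[j] == collocate:
                lo, hi = min(i, j), max(i, j)
                ngrams[tuple(tokens[lo:hi + 1])] += 1
    return ngrams


corpus = [
    ["profit", "rose", "50", "percent"],
    ["profit", "rose", "12", "percent"],
    ["sales", "rose", "50", "percent"],
]
found = extract_ngrams(corpus, "percent", "rose", -2)
print(found.most_common())
```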
29Method
- Replace words with a frequency z-score below a
threshold by a place marker, and merge contiguous
place markers
Input:
- Arabic: ?????? ????? ????? ?????? ????? 138 ?? ?????
- Gloss: percent by-ratio Gulf Cement profit rise
- Translation: Gulf Cement profit up 138 percent
Output:
- Arabic: ?????? ????? ????? 138 ?? ?????
- Gloss: percent by-ratio profit rise
- Translation: profit up 138 percent
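The normalisation step can be sketched as below. The marker symbol "*", the threshold and the z-score values are illustrative assumptions; the slides do not state which marker or threshold was used.

```python
def normalise(tokens, z_score, threshold=0.0, marker="*"):
    """Replace low z-score tokens with a marker and merge adjacent markers."""
    out = []
    for tok in tokens:
        # Tokens without a known z-score are treated as below the threshold.
        if z_score.get(tok, float("-inf")) < threshold:
            if not out or out[-1] != marker:
                out.append(marker)
        else:
            out.append(tok)
    return out


# Made-up z-scores, for illustration only.
example_z = {"Gulf": -0.8, "Cement": -0.6, "profit": 2.1, "up": 1.5,
             "138": 0.3, "percent": 3.0}
tokens = ["Gulf", "Cement", "profit", "up", "138", "percent"]
print(normalise(tokens, example_z))
```

With these made-up scores, "Gulf" and "Cement" collapse into a single marker, which mirrors the input/output example on the slide.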
30Method
31Method
- Discard specific and frequent Arabic proclitics (e.g. the conjunction "and" (w, و))
32Experiments LoLo (????) Local-Grammar for
Learning Terminology
- Designed and developed for managing corpora (mainly Arabic-script-based languages and English)
- Tools:
  - Corpus analyser
  - Rules editor
  - Information extractor
  - Information visualiser
- Each component is accessible via LoLo's GUI, and all the data generated can be exported.
33Experiments LoLo's Architecture
[Diagram: LoLo's architecture. Components: Analyser, Editor, Extractor, Visualiser. Data: General Language Corpus, Special Language Corpus, Texts, Candidate Knowledge, Knowledge Base. Actor: user.]
34Experiments - Corpora
Financial (divided into training and test)
Top News
35Experiments Top 10 Keywords
[Table: top 10 keywords per language; the Arabic and Urdu entries are given as English glosses]
- English: billion, bid, percent, pounds, market, shares, share, growth, company
- Arabic: dollar, the-oil, million, prices, barrel, percent, billion, price, company, the-dollar
- Urdu: bank, less, rupees, million, capital, percent, increase, increase, 10 million, 100,000
36Experiments Arabic Patterns
37Evaluation Corpus Regional Polarity
Sentences
38Evaluation
[Table: precision and recall for positives and negatives]
39Conclusions
- We have captured the essence of financial news in English and Arabic, including the similarities and the differences
- The method produces productive patterns that have high accuracy but low coverage at the sentence level; coverage is much higher at the document level (circa 50-60%)
- Lead sentences are very regular and give the most important information or a summary: the lead sentence is the news (an indication of a news item's predominant polarity?)
40Future Work
- Unsupervised classification (clustering) of patterns as positive and negative, and bootstrapping the pattern base utilising a cross-sentence local grammar
- Arabic lead sentences start with a verb, and many are metaphorical movement words -> seed lexicon
- Titles contain a paraphrased word or phrase of the polarity mentioned in the lead -> automatic bootstrapping and clustering
- Asymmetry (positives are more than negatives) -> automatic labelling of clusters
- Effect of pre-processing Arabic corpora
- Polarity across languages/regions/cultures
41Future Work
[Example: Arabic news excerpts dated 11-7-2007 and 15-7-2007. Source: www.alarabiya.net]
42Future Work
- Evaluation of the translation of the General
Inquirer positive and negative lexicons to
Arabic and Urdu
43Questions