Combining research and teaching in knowledge management and corpus linguistics



School of Computing, Faculty of Engineering

Combining research and teaching in knowledge
management and corpus linguistics
  • Based on Corpus Linguistics 2007 presentation
  • Which English dominates the World Wide Web,
    British or American?
  • by Eric Atwell, Junaid Arshad, Chien-Ming Lai,
    Lan Nim, Noushin Rezapour Asheghi, Josiah Wang,
    and Justin Washtell
  • School of Computing, Leeds University

  • Introduction
  • Experiments to combine research and teaching in
    Knowledge Management and Corpus Linguistics,
    using an AI-inspired intelligent agent
    architecture, but casting students as the
    intelligent agents. Each student applies
    KM/Data-mining to a corpus, then we combine
  • Methods
  • Students given detailed coursework spec, write up
    as a research paper
  • Results
  • Draft journal papers by Junaid Arshad, Chien-Ming
    Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah
    Wang, Justin Washtell (+ 66 more?)
  • Conclusions
  • Spamming journals? More research questions for
    next year's classes?

Background assumptions
  • The aim of research is to generate
    conference/journal papers (for RAE, for
    publicity, for promotion, ?)
  • Computing students should learn to apply ICT
    technology to practical, real / useful tasks
  • Research-led teaching and learning is a Leeds
    Univ strength, LT08 conference theme
  • SO students could learn by applying ICT to
    research questions, and writing
    conference/journal papers on results?
  • BUT research is hard: surely a student can't
    come up with the ideas and results for a
    publishable research paper?!
  • Maybe one student can't, but

Intelligent Agent Architecture
  • Wikipedia: "In computer science, an intelligent
    agent (IA) is a software agent that exhibits some
    form of artificial intelligence that assists the
    user and will act on their behalf, in performing
    repetitive computer-related tasks. While the
    working of software agents used for operator
    assistance or data mining (sometimes referred to
    as bots) is often based on fixed pre-programmed
    rules, 'intelligent' here implies the ability to
    adapt and learn."
  • "A multi-agent system (MAS) is a system composed
    of several agents, collectively capable of
    reaching goals that are difficult to achieve by
    an individual agent or monolithic system."
  • "A multiple agent system (MAS) is a distributed
    parallel computer system built of many very
    simple components, each using a simple algorithm,
    and each communicating with other components. A
    paradigm of an ant colony or bee swarm is used
    many times."

Students as intelligent agents
  • Bio-Inspired Computing researchers aim to develop
    software which behaves like ants, bees, etc to
    model complex systems
  • Why not use students as super-intelligent
  • Prof David Cliff: this is cheating; his goal
    is software agents
  • BUT my goals are CL research, student
    research-led learning
  • I am not trying to build bio-inspired computing
  • Let's see what happens if I apply agent-based
    architecture to student coursework exercises

2005-06 PASCAL MorphoChallenge2005
  • Cognitive Systems and Multidisciplinary
    Informatics MSc students in my Computational
    Modelling class.
  • Coursework: build an unsupervised machine
    learning program to learn morphological analysis
    of English, Finnish, Turkish.
  • seaside -> sea side
  • Systems developed ranged from minimalist to very
  • My hybrid voting system performed better than
    any individual student's system
  • Atwell, Eric; Roberts, Andrew. Combinatory Hybrid
    Elementary Analysis of Text (CHEAT). In Kurimo,
    M; Creutz, M; Lagus, K (editors) Proceedings of
    the PASCAL Challenge Workshop on Unsupervised
    Segmentation of Words into Morphemes. 2006.
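
The slide only says that a "hybrid voting system" beat every individual system; it does not give the voting rule. As a minimal sketch, assuming a plain majority vote over the segmentations proposed by each student system (the function name and the five example proposals are hypothetical, not taken from the CHEAT paper):

```python
from collections import Counter


def vote_segmentation(candidates):
    """Return the segmentation proposed by the most systems.

    candidates: one segmentation per system, each a tuple of morphs.
    Ties go to whichever tied segmentation was proposed first.
    """
    counts = Counter(candidates)
    best_count = max(counts.values())
    for segmentation in candidates:
        if counts[segmentation] == best_count:
            return segmentation


# Hypothetical outputs of five student systems for the word "seaside".
proposals = [
    ("sea", "side"),
    ("seaside",),
    ("sea", "side"),
    ("seas", "ide"),
    ("sea", "side"),
]

winner = vote_segmentation(proposals)
print(" ".join(winner))
```

A vote like this can outperform each voter when the individual systems make different mistakes, which is one plausible reading of the result reported above.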

2006-07 UK v US English on WWW
  • 93 Computing students studying Computational
    Modelling and Technologies for Knowledge
    Management were given the data-mining coursework
    task of harvesting and analysing a Data Warehouse
    from WWW, using WWW-BootCat web-as-corpus
    technology (Baroni et al 2006).
  • Each student/agent collected English-language
    web-pages from a specific national top-level
  • The analysis task involved comparing each
    national sample web-as-corpus with given gold
    standard samples from UK and US domains, to
    assess whether national WWW English terminology /
    ontology was closer to UK or US English.
  • Then, MSc students summarised groups of results
    for CL07.

  • WWW-BootCat and Google
  • Compare to .UK and .US
  • Follow-up regional overviews

  • The task was cast as an exercise in applying the
    CRISP-DM methodology for computational modelling:
    the Cross-Industry Standard Process for Data
    Mining projects. CRISP-DM specifies a series of
    phases or sub-tasks in a data-mining project; it
    is a recipe to follow, allowing novices and
    non-experts to carry out data mining experiments
  • Business Understanding: map UK v US English on
  • Data Understanding: English text from web-pages
  • Data Preparation: extract word-frequency list,
    key features?
  • Modelling: compare national wordlist against UK,
  • Evaluation: are results reasonable,
  • Deployment: write a report to submit for
WWW-BootCat and Google
  • WWW-BootCat: easy-to-use web front-end to
  • User supplies seed terms, typical English words
    (Sharoff 06).
  • Constrain search to Domain (e.g. .fr), Language
    (e.g. English).
  • WWW-BootCat uses Google to find and download
  • hey presto: 200,000-word national English
  • Problems:
  • Technical, e.g. user licences/keys required server
  • Small national domains, e.g. South Georgia Island
  • Legal restrictions, e.g. Algerian law promotes
    Arabic over French (et al)
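
The seed-term step can be sketched as follows. This is not WWW-BootCat's code or API: it assumes the usual BootCaT-style approach of combining seed words into random tuples, and it assumes a Google-style "site:" restriction for the national domain. The function name, seed list, and tuple size are all hypothetical.

```python
import itertools
import random


def build_queries(seeds, tuple_size=3, n_queries=5, domain=".fr", rng=None):
    """Turn seed words into search queries restricted to one domain.

    Picks n_queries distinct combinations of tuple_size seeds at random
    and appends a domain restriction to each.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    return [" ".join(combo) + " site:" + domain for combo in chosen]


# Hypothetical seed terms: common English words.
seeds = ["the", "of", "and", "picture", "work", "through"]
queries = build_queries(seeds, domain=".fr")

for query in queries:
    print(query)
```

Each query would then be sent to the search engine and the returned pages downloaded; that part depends on licence keys and is left out here.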

Compare to .UK and .US
  • Next, each agent/student had to decide if their
    national sample was closer to British or American
  • Computing students/agents could not use
    Linguistic expertise
  • Instead, compute similarity to .UK and .US gold
    standards (also collected via WWW-BootCat and
  • Word-frequency Log-Likelihood profiles and
    averages; occurrences of selected words
    (color/colour, tap/faucet). Lexical analysis only,
    not syntax or pronunciation
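
The log-likelihood comparison can be sketched as below, assuming the standard two-corpus word-frequency log-likelihood statistic and assuming that "closer" means a lower average score against that gold standard. The slides do not spell out the exact procedure each student followed, and the toy counts are invented for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for one word.

    a, b: frequency of the word in corpus 1 and corpus 2.
    c, d: total token counts of corpus 1 and corpus 2.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def mean_ll(sample, reference):
    """Average log-likelihood over the joint vocabulary (lower = closer)."""
    c, d = sum(sample.values()), sum(reference.values())
    vocab = set(sample) | set(reference)
    total = sum(log_likelihood(sample[w], reference[w], c, d) for w in vocab)
    return total / len(vocab)


def closer_to(sample, uk, us):
    """Label a national sample by whichever gold standard it differs from less."""
    return "UK" if mean_ll(sample, uk) < mean_ll(sample, us) else "US"


# Invented toy counts, far smaller than a 200,000-word sample.
national = Counter({"colour": 8, "color": 2, "the": 90})
uk_gold = Counter({"colour": 9, "color": 1, "the": 90})
us_gold = Counter({"colour": 1, "color": 9, "the": 90})

print(closer_to(national, uk_gold, us_gold))
```

Because only word frequencies are used, this matches the point above that the analysis is lexical and needs no linguistic expertise from the student/agent.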

Follow-up regional overviews
  • This yielded 93 reports on national web-as-corpus
  • but still difficult to collate results, see
  • Follow-up coursework for MSc students: collate
    and compare results across a group of countries
    in a single geographical or political region, to
    produce overviews of English in the region.
  • Students could base their regional overview on
    the results gathered in the first exercise,
    though some chose to collate and analyse their
    own web-as-corpus data afresh.
  • Each regional report was to be written as a
    research journal paper, targeted at a journal
    specific to the region.

  • Draft journal papers
  • (accepted for CL2007, BUT they couldn't afford
    time or fees?)
  • Junaid Arshad,
  • Chien-Ming Lai,
  • Lan Nim,
  • Noushin Rezapour Asheghi,
  • Josiah Wang,
  • Justin Washtell
  • More draft journal papers by
  • Precious CHIVESE, Binita DUTTA, Dureid
  • Sanaz GHODOUSI, Olatomiwale MALOMO, Anh NGUYEN

Junaid Arshad
  • Analysis of English used in a web corpus from
    the Middle East
  • Jordan and Egypt English corpora were closer
    to UK than US English; English websites in Saudi
    Arabia, Lebanon, Israel, Kuwait, and Bahrain were
    more similar to US English than UK English; and
    UAE and Iran English websites contained a mix of
    UK and US English, with neither dominant

Chien-Ming Lai
  • Studying Influences of British English and
    American English on World Wide Web in Southeast
    Asia by Applying Web as Corpus
  • The countries studied were Indonesia,
    Malaysia, Philippines, Singapore, Thailand and
    Vietnam. Among these countries, only Philippines
    and Singapore recognize English as official
    language, but English is widely used in the other
    countries. The English texts used in most of the
    chosen countries in Southeast Asia are closer
    to American English

Lan Nim
  • The Dominant English Type within the World Wide
    Web Domains of France and its Former Colonies
  • This paper investigates the English used in
    the WWW domains of France (.fr) and its former
    colonies of Vietnam (.vn), Laos (.la), Mauritius
    (.mu) and Senegal (.sn). British English is more
    dominant overall in Francophone domains compared
    to American English. However, some local
    variation was observed: American English is more
    widespread in Vietnam, probably due to American
    political influence after the end of French
    colonization; and, more surprisingly, American
    English seems more prevalent than British English
    in the .fr domain of France.

Noushin Rezapour Asheghi
  • Which English dominates the World Wide Web in
    countries where English is a native language
    British or American?
  • The results from the Log-Likelihood technique in
    the modelling phase indicate that English used in
    Australian, South African and Irish web sites is
    closer to British English, and text in New
    Zealand, Jamaican and Canadian web sites is more
    similar to American English. However, there is
    not a great difference between the results of
    comparing these corpora with British and American
    English, and British spelling is used
    predominantly in the New Zealand domain

Josiah Wang
  • Dominance of British and American English on the
    World Wide Web in Malaysia, Singapore and Brunei
  • Malaysia, Singapore and Brunei have a history
    as British post-colonial countries ... As a
    comparison, we have also included three
    neighbouring countries. Former British colonies
    like Malaysia, Singapore and Brunei still favour
    British English on the World Wide Web. In
    addition, Indonesia and Papua New Guinea, which
    are indirectly influenced by British English
    (i.e. through the Netherlands and Australia), also
    tend to lean towards British English. The
    Philippines, on the other hand, still continues to
    exhibit America's influence with its preference
    for American English on the Internet.

Justin Washtell
  • The Polynesian influence on English in the World
    Wide Web of Pacific island nations
  • This study analyses the effect of indigenous
    Polynesian languages upon the balance of a core
    of function (non-lexical) words in sample English
    web corpora taken from Polynesian island nation
    domains from a selection of New Zealand, Cook
    Islands and French Polynesian websites. These
    corpora are compared to those recovered from .uk
    and .us domains and significant grammatical
    differences are sought. Noted differences are
    compared with those found between a French corpus
    from France and one captured from French
    Polynesian websites using an identical technique

Conclusions: research
  • We expected US English to dominate the WWW
  • Computing generally has been American-led
  • US-owned companies might base national websites
    on US originals
  • Result: British English is holding its own; no
    clear winner?
  • It is hard to find major differences:
    International English?
  • Main differences are in pronunciation, not lexis?

Conclusions: student learning
  • Students spent a lot of time on collecting the
  • "painting by numbers" exercise: little
    intelligence needed?
  • OR practical experience of using web-as-corpus
  • Student feedback: many relished the challenge of
    a real exercise with large-scale data,
    contributing to a big result
  • MSc students with papers accepted for
    MorphoChallenge and Corpus Linguistics
    conferences were very pleased! (even though I
    couldn't pay for them to go!)
  • it would be even better if EVERY student had a
    chance to publish

2007-08: 66 journal papers?
  • Re-use the web-as-corpus samples from last year,
    more time for Data-Mining with WEKA
  • Select key features, e.g. colour v color
  • Train classifier with UK and US samples
  • Classify unseen national samples
  • Classifier is a novel empirical model of UK/US
    English differences
  • and more time to write Report: each student to
    choose a language-related Journal and draft a
    paper for this!
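
The select / train / classify steps above can be sketched in a few lines. The slides name WEKA, a Java toolkit; the code below is not WEKA and does not use its API. It is a small stand-in, assuming a multinomial Naive Bayes classifier with add-one smoothing over a hypothetical list of spelling-variant features, trained on invented toy samples.

```python
import math
from collections import Counter

# Hypothetical key features: UK/US spelling variants.
FEATURES = ["colour", "color", "centre", "center"]


def featurize(tokens):
    """Count how often each key feature occurs in a tokenised sample."""
    counts = Counter(tokens)
    return [counts[f] for f in FEATURES]


class MultinomialNB:
    """Tiny multinomial Naive Bayes with add-one smoothing."""

    def fit(self, vectors, labels):
        self.labels = sorted(set(labels))
        self.log_prior = {}
        self.log_prob = {}
        for label in self.labels:
            rows = [v for v, l in zip(vectors, labels) if l == label]
            self.log_prior[label] = math.log(len(rows) / len(vectors))
            totals = [sum(col) for col in zip(*rows)]
            denom = sum(totals) + len(FEATURES)
            self.log_prob[label] = [math.log((t + 1) / denom) for t in totals]
        return self

    def predict(self, vector):
        def score(label):
            weights = self.log_prob[label]
            return self.log_prior[label] + sum(
                n * w for n, w in zip(vector, weights)
            )
        return max(self.labels, key=score)


# Invented toy training samples standing in for the .uk and .us corpora.
train = [
    (["colour", "centre", "colour"], "UK"),
    (["centre", "colour", "color"], "UK"),
    (["color", "center", "color"], "US"),
    (["center", "color", "colour"], "US"),
]
model = MultinomialNB().fit(
    [featurize(tokens) for tokens, _ in train],
    [label for _, label in train],
)

# Classify an unseen (invented) national sample.
unseen = ["colour", "colour", "centre", "center"]
print(model.predict(featurize(unseen)))
```

The learned per-class feature weights are what makes the classifier readable as an empirical model of UK/US differences: each weight says how strongly a spelling points to one variety.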

And finally
  • I want to run a similar exercise next year,
    casting students as intelligent agents to combine
    teaching and research
  • I need other web-as-corpus research questions to
  • to be divided into 50 subtasks, one for each
  • with computable metrics, for Computing students