Developing the Korean Internet Network Miner (KINM): E-research Tool for Social Network Analysis of the Blogosphere in South Korea

1
Developing the Korean Internet Network
Miner (KINM): E-research Tool for Social Network
Analysis of the Blogosphere in South Korea
  • Anatoliy Gruzd1), Chung Joo Chung2),
    Jaeeun (Angela) Yoo3), Han Woo PARK4)
  • Work-in-progress
  • 4 September 2009

2
Contents
  • Introduction
  • Related Studies: Tools for Blog Network Analysis
  • Section 1. Development of the Korean Internet
    Network Miner
  • Section 2. Evaluation of the Name Network
    Discovery Algorithm
  • Conclusion

3
Introduction
  • The growing adoption of e-research tools has
    led to changes in social and communication
    research (Jankowski, 2009).
  • With the growing proliferation of personal
    blogs, organizational sites, citizen media, and
    social networking services, e-research tools for
    humanists and social scientists have gradually
    become available.
  • Typical technologies in this domain include
    LexiURL (Thelwall, 2009), the Virtual Observatory
    for the Study of Online Networks (Ackland, 2009),
    and the Internet Community Text Analyzer (ICTA;
    Gruzd, 2009a).

4
Introduction
  • In contrast to the e-research developments in
    North America and Europe, Soon and Park (2009)
    note that digital tools to support e-research
    are rare in Asia, even in South Korea with its
    high IT penetration, extensive network
    infrastructure, and widespread use of the Korean
    language on the Internet (Internet World Stats,
    2008).
  • Thus, we attempt to develop an e-research tool
    for automatic discovery of online communication
    networks on the Korean Web.
  • The first section deals with prior studies
    related to large-scale blog network analysis
    using automatic tools.
  • The second section illustrates the process of
    developing our analytic tool, the Korean
    Internet Network Miner (KINM).

5
Related Studies: Tools for Blog Network Analysis

  • The structure of networks can be measured
    mathematically and visualized graphically. The
    shape of a network emerging from online users'
    writing and linking choices reflects interest
    trends.
  • Research on blog networks has confirmed that
    large-scale online communities are structurally
    reflected in higher-density network
    neighborhoods through linking (Kelly and Etling,
    2008).
  • Traditional data mining research
  • Current research

6
Related Studies: Tools for Blog Network Analysis

  • Traditional data mining research
  • Traditional data mining research focuses largely
    on algorithms for inferring association rules
    and other statistical correlation measures in a
    given data set (Kumar et al., 1999; Jung, 2009).
  • For example:
  • Kelly and Etling (2008) used the research firm
    Morningside Analytics for blog selection and
    data mapping, along with the
    Fruchterman-Reingold "physics model" algorithm,
    to understand the blog networks of the Iranian
    blogosphere.
  • Gryc and his colleagues (2008) developed
    categories for the blog networks they studied by
    analyzing key words, post classification, and
    linking patterns of blogs.

7
Related Studies: Tools for Blog Network Analysis

  • Current research
  • The current research uses a web-based system for
    automated text analysis to discover and
    understand social networks from blog data.
  • It focuses not only on chain networks (social
    networks based on the number of messages
    exchanged between individuals) but also on name
    networks (social networks built from mining
    personal names and nicknames).

8
Section 1. Development of the Korean
Internet Network Miner

As an initial framework for the KINM, we used
some of the social network discovery and
visualization tools and techniques previously
developed by Gruzd (2009). These tools and
techniques were developed as part of a
general-purpose web system for content and network
analysis of computer-mediated communication in
English called ICTA (available at
http://textanalytics.net).
  • Barriers:
  • 1. Since ICTA only works with texts in English,
    we had to modify all text processing functions
    to support Korean texts.
  • 2. ICTA requires the data to be stored in a
    machine-readable format such as an RSS feed.
    However, after a manual examination of a number
    of Korean blogs, we noticed that the majority of
    them do not provide RSS feeds for their comments
    data.

9
Section 1. Development of the Korean
Internet Network Miner

We were especially interested in analyzing
comments data because comments contain most of
the social interactions on a blog, which makes
them a good source for mining social connections
among blog readers. To address the lack of
comment feeds, we created a script using the
Kapow Mashup Server (http://www.kapowtech.com)
to retrieve comments from a selected blog
automatically and output them as an RSS feed.
10
Section 1. Development of the Korean
Internet Network Miner
  • After the blog data were retrieved, they were
    processed to build two types of networks.
  • First, a chain network was extracted. In the
    chain network, one commentator is connected to
    another if the first commentator directly
    replied to the second commentator by clicking
    on the "reply-to" button.
  • However, after manually examining a number of
    comments on several blogs, we found that some
    comments are not "reply-to" comments but still
    address or reference a previous poster.
  • To capture these missing connections, we decided
    to rely on another network discovery method
    called the name network.

This observation is in line with a previous
empirical study on online learning communities
by Gruzd (2009a), which discovered that the chain
network misses on average 40% of possible
connections.
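
The chain network construction described above can be sketched as follows. This is a minimal illustration rather than KINM's implementation: the `Comment` record, its `reply_to` field, and the function name are hypothetical stand-ins for whatever the comment feed actually provides.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    comment_id: int
    author: str
    reply_to: Optional[int] = None  # id of the comment replied to, if any (assumed field)


def build_chain_network(comments):
    """Map each (replier, original poster) pair to the number of direct replies."""
    author_of = {c.comment_id: c.author for c in comments}
    edges = Counter()
    for c in comments:
        target = author_of.get(c.reply_to)
        # Skip top-level comments, unknown parents, and self-replies.
        if target is not None and target != c.author:
            edges[(c.author, target)] += 1
    return edges


sample = [
    Comment(1, "alice"),
    Comment(2, "bob", reply_to=1),
    Comment(3, "carol", reply_to=1),
    Comment(4, "alice", reply_to=2),
    Comment(5, "dave"),  # addresses nobody via "reply-to", so no edge is created
]
chain_edges = build_chain_network(sample)
```

A comment such as the last one, which may still mention an earlier poster in its text, contributes nothing to this network. That gap is what the name network is meant to fill.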
11
Section 1. Development of the Korean
Internet Network Miner
  • Name Network:
  • Instead of just relying on information about who
    replied to whom, the name network method starts
    by automatically finding all mentions of
    personal names or nicknames in comments and
    uses them as nodes in a social network.
  • Next, to discover ties between nodes, the method
    connects a sender of a comment to all names
    found in his/her comment. (A more detailed
    description of this method can be found in
    Gruzd (2009b).)
  • Although the name network approach provides
    additional information about connections among
    blog commentators, it has its own challenges.
    At the moment it does not differentiate between
    a word that is a personal name/nickname and a
    word that just appears to be one.
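
The two steps of the name network method can be sketched as below. This is a simplified illustration under stated assumptions, not the algorithm from Gruzd (2009b): candidate names are taken to be the nicknames of known commentators, and the text is split on word boundaries with a plain regular expression, which would not handle Korean particles attached to a name. The function and variable names are hypothetical.

```python
import re
from collections import Counter


def build_name_network(comments, known_names):
    """comments: iterable of (sender, text); known_names: set of nicknames.

    Returns a Counter mapping (sender, mentioned name) to a mention count.
    """
    edges = Counter()
    for sender, text in comments:
        tokens = set(re.findall(r"\w+", text))  # naive tokenization (assumption)
        for name in tokens & known_names:
            if name != sender:
                edges[(sender, name)] += 1
    return edges


comments = [
    ("alice", "I agree with bob on this"),
    ("bob", "thanks alice, and carol too"),
    ("carol", "people should read more"),
]
# "people" is included because, in this toy data, someone posts under that nickname.
known = {"alice", "bob", "carol", "people"}
name_edges = build_name_network(comments, known)
```

The last comment produces an edge from "carol" to "people" even though the word is used as an ordinary noun, which is the kind of false positive the method cannot yet rule out.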

12
Section 1. Development of the Korean
Internet Network Miner

Name Network:
Figure 1: Sample comment
For example, the algorithm marked the word "??"
(people) as a reference to another person on the
blog. This happened because there was at least
one comment in the dataset posted by a person
with the "??" nickname. However, in the
sample comment, this word does not refer to
another online participant; it is used as a noun
that means "people".
13
Section 1. Development of the Korean
Internet Network Miner
  • Name Network:
  • Another good example of the challenges
    associated with the name/nickname
    disambiguation problem in comments is the word
    "2mb", which has at least three different
    meanings.
  • First, this word can be used as a nickname for
    one of the blog commentators.
  • Second, it could refer to the capacity of a
    computer memory (2 megabytes).
  • Finally, it could be the alias of the current
    Korean president, Lee Myung-Bak.
  • To address these challenges and develop
    recommendations for the next generation of the
    name network discovery algorithm, we conducted
    a semi-automated analysis of all
    names/nicknames discovered from a sample
    dataset using our initial algorithm.

14
Section 2. Evaluation of the Name Network
Discovery Algorithm

To evaluate our automated approach for analyzing
communication networks from blog comments,
specifically the accuracy of the name network
discovery algorithm, we selected a single
blog authored by ?? (bangzza) from
http://blog.ohmynews.com/bangzza. This blog was
chosen because of its high number of readers and
comments. The blog is hosted on a popular
website called OhMyNews, which is ranked as the
sixth most influential media outlet among
newspapers and broadcasting companies and as the
seventh most powerful media outlet in Korea
(Chang, 2005).
15
Section 2. Evaluation of the Name Network
Discovery Algorithm

OhMyNews was ranked as one of the top three web
sites in terms of blog users in 2009 (Rankey.com,
2009); it is frequently ranked as the most
popular blog site in Korea, registering over 20
million page views per day during the
presidential election. Users of OhMyNews, known
as "news guerrillas", contribute news articles on
the Web site. OhMyNews allows individuals in
far-flung locations to come together, share, and
build strong ties and a sense of community
(united in ideology even if separated by
geographic distance) that fosters a true
grassroots movement (Streitmatter, 2001). The
comparative influence of progressive,
independent, and alternative online news media
has drawn global attention, and OhMyNews has
been at the center of the discussion (Schroeder,
2004).
16
Section 2. Evaluation of the Name Network
Discovery Algorithm

For our tests, we retrieved and analyzed a
sample set of 943 comments (posted between April
2008 and April 2009) from the selected blog.
In the study, we relied on an interactive tag
cloud feature available in KINM to explore and
evaluate all names and nicknames that were found
automatically (see Figure 2).
<Figure 2> An interactive tag cloud showing the
50 most frequently used name/nickname candidates
found in the sample dataset
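
The ranking behind such a tag cloud amounts to a frequency count over the discovered candidates. The snippet below is a small sketch of that idea only; the function name and the toy list of mentions are hypothetical and do not come from the study's data.

```python
from collections import Counter


def top_candidates(mentions, n=50):
    """Return the n most frequent name/nickname candidates as (word, count) pairs."""
    return Counter(mentions).most_common(n)


# Toy list of candidate mentions, one entry per occurrence (illustrative only).
mentions = ["2MB", "bangzza", "2MB", "raccoon", "bangzza", "2MB"]
top_two = top_candidates(mentions, n=2)
```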
17
Section 2. Evaluation of the Name Network
Discovery Algorithm
  • The evaluation procedure involved clicking on
    each word found by the name network algorithm
    and exploring the context where each instance
    of the word was used (see Figure 3). The purpose
    of this semi-automated analysis was to discover
    which name/nickname candidates were identified
    incorrectly and why.
  • <Figure 3> A list of messages containing "2MB"
  • This semi-automated analysis revealed a set of
    additional syntactic and semantic clues that
    can be used to improve the accuracy of the name
    network discovery algorithm.
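
Listing the messages that contain a given candidate, as in Figure 3, can be approximated with a simple filter. This is a hedged sketch: the function name and sample comments are hypothetical, and the case-insensitive match is an assumption made because the slides write the candidate both as "2mb" and "2MB".

```python
def messages_containing(comments, candidate):
    """Return the (sender, text) pairs whose text contains the candidate word."""
    needle = candidate.casefold()
    return [(sender, text) for sender, text in comments
            if needle in text.casefold()]


comments = [
    ("alice", "2MB said this again"),
    ("bob", "my old disk had only 2mb free"),
    ("carol", "nothing relevant here"),
]
hits = messages_containing(comments, "2MB")
```

Reading the two hits side by side shows why manual inspection was needed: the same string refers to a person in one message and to a storage size in the other.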

18
Section 2. Evaluation of the Name Network
Discovery Algorithm

The following set includes clues suggesting
that a word is likely to be a nickname:

  • a word candidate is followed by a context word
    such as "?" (an honorific) or "?" (Mr./Ms.);
    other possibilities, although rare, include "?"
    (Mr., for younger males) or "??" (Miss/Ms., for
    younger females) at the end of a word candidate,
    and "???" (Mr.) or "??" (Miss) at the beginning
  • a word candidate contains a combination of
    characters (Korean, English and/or Chinese),
    symbols (e.g., underscores, hyphens) and numbers
  • a word candidate appears to be a real name,
    which is almost always three characters: a
    single-character last name followed by the
    two-character first name, which may be found in
    a dictionary of first names and/or common
    characters used therein
  • a word candidate is a less common, non-topic
    word (e.g., "???" raccoon)
  • a word candidate is followed by punctuation
    indicative of someone being addressed (e.g.,
    "/" or "")
  • a word candidate contains patterns indicative
    of non-native words: phonetic Koreanization of
    English (e.g., "?????" = mediamogul = Media
    Mogul) or phonetic romanization of Korean (e.g.,
    "jihwaja" = "???")
19
Section 2. Evaluation of the Name Network
Discovery Algorithm

The second set includes clues suggesting that a
word is NOT likely to be used as a nickname:

  • a word candidate is a phrase; for example, the
    nickname input (the "FROM" field) is used more
    like a subject line (possible indicators include
    white spaces and length)
  • a word candidate consists of a single character
    (e.g., "a" or "?")
  • a word candidate consists of netspeak, including
    emoticons (e.g., "_"), slang and abbreviations
    (e.g., using "2MB" to refer to the current
    Korean president), and onomatopoeia (e.g., "??"
    tsk tsk, "??" heehee, "??" haha, "?" hmm)
  • a word candidate appears more than one time in
    the comment
  • a word candidate consists of random characters
    (e.g., "????" or "asdf")
  • a word candidate is a short, conversational word
    or phrase (e.g., "??" me, "???" oh no, "???"
    so/therefore)
  • a word candidate is a common word or idea in the
    given context/topic (e.g., "????" Republic of
    Korea, "????" a newly created word used to
    refer to political fanatics)
20
Conclusion
 
  • This research briefly reviewed some of the
    studies related to large-scale blog analysis
    using automatic tools.
  • We reviewed the process of developing our own
    analysis tool, KINM. The main goal of KINM is to
    automate the process of finding communication
    networks in the Korean blogosphere that
    accurately represent social interactions among
    blog readers.
  • To find these networks, the system relies on a
    set of text mining techniques to look for
    personal names and nicknames in users' comments.

21
Conclusion
 
  • To address some of the challenges associated
    with the automated discovery of names and
    nicknames in Korean texts, this paper also
    presented an exploratory study of a sample
    dataset.
  • The study suggests a set of additional rules
    to improve the accuracy of the current
    name/nickname discovery algorithm used in KINM.
  • These additional rules will be incorporated
    into KINM and evaluated in a subsequent study.

22

 
  • Thank you.

This research was supported by the WCU (World
Class University) program through the National
Research Foundation of Korea funded by the
Ministry of Education, Science and Technology
(No. 515-82-06574).
http://english-webometrics.yu.ac.kr