Title: Developing the Korean Internet Network Miner (KINM): E-research Tool for Social Network Analysis of Blogosphere in South Korea
Developing the Korean Internet Network Miner (KINM): E-research Tool for Social Network Analysis of Blogosphere in South Korea
- Anatoliy Gruzd1), Chung Joo Chung2), Jaeeun (Angela) Yoo3), Han Woo PARK4)
- Work-in-progress
- 4 September 2009
Contents
- Introduction
- Related Studies: Tools for Blog Network Analysis
- Section 1. Development of the Korean Internet Network Miner
- Section 2. Evaluation of the Name Network Discovery Algorithm
- Conclusion
Introduction
- The growing adoption of e-research tools has led to changes in social and communication research (Jankowski, 2009).
- With the growing proliferation of personal blogs, organizational sites, citizen media, and social networking services, e-research tools for humanists and social scientists have gradually become available.
- Typical technologies in this domain include LexiURL (Thelwall, 2009), the Virtual Observatory for the Study of Online Networks (Ackland, 2009), and the Internet Community Text Analyzer (ICTA; Gruzd, 2009a).
- In contrast to the e-research developments in North America and Europe, Soon and Park (2009) note that digital tools to support e-research are rare in Asia, even in South Korea with its high IT penetration, extensive network infrastructure, and widespread use of the Korean language on the Internet (Internet World Stats, 2008).
- Thus, we attempt to develop an e-research tool for the automatic discovery of online communication networks on the Korean Web.
- The first section deals with prior studies related to large-scale blog network analysis using automatic tools.
- The second section illustrates the process of developing our analytic tool, called the Korean Internet Network Miner (KINM).
Related Studies: Tools for Blog Network Analysis
- The structure of networks can be measured mathematically and visualized graphically. The shape of a network emerging from online users' writing and linking choices reflects interest trends.
- Research on blog networks has confirmed that large-scale online communities are structurally reflected in higher-density network neighborhoods through linking (Kelly and Etling, 2008).
- Traditional data mining research
- Current research
- Traditional data mining research
- Traditional data mining research focuses largely on algorithms for inferring association rules and other statistical correlation measures in a given data set (Kumar et al., 1999; Jung, 2009).
- For example:
- Kelly and Etling (2008) used the research firm Morningside Analytics for blog selection and data mapping, along with the Fruchterman-Reingold "physics model" algorithm, to understand the blog networks of the Iran blogosphere.
- Gryc and his colleagues (2008) developed categories for the blog networks they studied by analyzing key words, post classification, and linking patterns of blogs.
- Current research
- The current research uses a web-based system for automated text analysis to discover and understand social networks from blog data.
- It focuses not only on chain networks (social networks based on the number of messages exchanged between individuals) but also on name networks (social networks built from mining personal names and nicknames).
Section 1. Development of the Korean Internet Network Miner
As an initial framework for the KINM, we used some of the social network discovery and visualization tools and techniques previously developed by Gruzd (2009). These tools and techniques were developed as part of a general-purpose web system for content and network analysis of computer-mediated communication in English called ICTA (available at http://textanalytics.net).
- Barriers:
- 1. Since ICTA only works with texts in English, we had to modify all text processing functions to support Korean texts.
- 2. ICTA requires the data to be stored in a machine-readable format such as an RSS feed. However, after a manual examination of a number of Korean blogs, we noticed that the majority of them do not provide RSS feeds for their comments data.
We were especially interested in analyzing comments data because comments contain most of the social interactions on a blog, which makes them a good source for mining social connections among blog readers. To address the lack of comment feeds, we created a script using the Kapow Mashup Server (http://www.kapowtech.com) to retrieve comments from a selected blog automatically and output them as an RSS feed.
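As a minimal sketch of how comments delivered as an RSS feed might be read back in for analysis, the Python below parses a small RSS 2.0 document with the standard library. The sample feed, the function name `read_comments`, and the use of `author` for the commentator and `description` for the comment text are assumptions made for illustration; the layout of the feed actually produced by the Kapow script is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; the real feed layout is not specified in the slides.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Blog comments</title>
    <item>
      <author>alice</author>
      <description>Thanks for the post</description>
      <pubDate>Tue, 01 Apr 2008 10:00:00 +0900</pubDate>
    </item>
    <item>
      <author>bob</author>
      <description>alice makes a good point</description>
      <pubDate>Tue, 01 Apr 2008 11:30:00 +0900</pubDate>
    </item>
  </channel>
</rss>"""


def read_comments(feed_xml):
    """Return one dict per <item>, with the commentator, text, and date."""
    root = ET.fromstring(feed_xml)
    comments = []
    for item in root.iter("item"):
        comments.append({
            # findtext returns None for a missing element, hence the `or ""`.
            "author": (item.findtext("author") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
            "date": (item.findtext("pubDate") or "").strip(),
        })
    return comments


comments = read_comments(SAMPLE_FEED)
```

Each returned record carries the sender and the comment text, which are the two pieces of information the network-building steps need.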
- After the blog data was retrieved, it was processed to build two types of networks.
- First, a chain network was extracted. In the chain network, one commentator is connected to another if the first commentator directly replied to the second commentator by clicking on the "reply-to" button. However, after manually examining a number of comments on several blogs, we found that some comments are not "reply-to" comments but still address or reference a previous poster.
- This observation is in line with a previous empirical study on online learning communities by Gruzd (2009a), which discovered that the chain network misses on average 40% of possible connections.
- To capture the missing connections, we decided to rely on another network discovery method called the name network.
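The chain network described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the KINM implementation: the record fields `id`, `author`, and `reply_to`, the sample data, and the function name `build_chain_network` are all hypothetical.

```python
from collections import Counter

# Hypothetical comment records: 'reply_to' holds the id of the comment
# that was answered via the "reply-to" button, or None for a top-level comment.
comments = [
    {"id": 1, "author": "alice", "reply_to": None},
    {"id": 2, "author": "bob", "reply_to": 1},
    {"id": 3, "author": "carol", "reply_to": 1},
    {"id": 4, "author": "alice", "reply_to": 2},
    {"id": 5, "author": "bob", "reply_to": 1},
]


def build_chain_network(comments):
    """Count directed ties (replier -> replied-to commentator)."""
    author_of = {c["id"]: c["author"] for c in comments}
    ties = Counter()
    for c in comments:
        parent = c["reply_to"]
        if parent is None or parent not in author_of:
            continue  # top-level comment or dangling reference
        target = author_of[parent]
        if target != c["author"]:  # ignore replies to oneself
            ties[(c["author"], target)] += 1
    return ties


chain = build_chain_network(comments)
```

A comment that addresses an earlier poster without using the reply button has `reply_to` set to `None` here, so it contributes no tie; that is exactly the kind of connection the chain network misses.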
- Name Network:
- Instead of just relying on information about who replied to whom, the name network method starts by automatically finding all mentions of personal names or nicknames in comments and uses them as nodes in a social network.
- Next, to discover ties between nodes, the method connects the sender of a comment to all names found in his/her comment. (A more detailed description of this method can be found in Gruzd (2009b).)
- Although the name network approach provides additional information about connections among blog commentators, it has its own challenges. At the moment it does not differentiate between a word that is a personal name/nickname and a word that just appears to be one.
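A simplified sketch of the name network idea follows. It assumes that the list of known nicknames is simply the set of commentators in the dataset and that words can be found with a plain regular expression; both are assumptions for illustration (Korean text would need proper tokenization), and the function name `build_name_network` and the sample comments are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample data, with English stand-ins for nicknames.
comments = [
    {"author": "alice", "text": "Thanks for the post"},
    {"author": "bob", "text": "alice makes a good point"},
    {"author": "people", "text": "I agree with bob"},
    {"author": "carol", "text": "Most people would disagree"},
]


def build_name_network(comments):
    """Connect each sender to every known nickname mentioned in the comment."""
    nicknames = {c["author"] for c in comments}
    ties = Counter()
    for c in comments:
        # Each distinct word counts once per comment.
        words = set(re.findall(r"\w+", c["text"]))
        for name in words & nicknames:
            if name != c["author"]:
                ties[(c["author"], name)] += 1
    return ties


names = build_name_network(comments)
```

In this sample the tie from `carol` to `people` is spurious: the word is used as an ordinary noun, but it matches a commentator's nickname, which illustrates the limitation noted in the last bullet.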
Figure 1: Sample comment
For example, the algorithm marked the word ?? (people) as a reference to another person on the blog. This happened because there was at least one comment in the dataset posted by a person with the "??" nickname. However, in the sample comment, this word does not refer to another online participant; it is used as a noun that means "people".
- Another good example of the challenges associated with the name/nickname disambiguation problem in comments is the word "2mb", which has at least three different meanings. First, it can be used as a nickname by one of the blog commentators. Second, it could refer to a computer memory capacity (2 megabytes). Finally, it could be an alias of the current Korean president, Lee Myung-Bak.
- To address these challenges and develop recommendations for the next generation of the name network discovery algorithm, we conducted a semi-automated analysis of all names/nicknames discovered from a sample dataset using our initial algorithm.
Section 2. Evaluation of the Name Network Discovery Algorithm
To evaluate our automated approach for analyzing communication networks from blog comments, specifically the accuracy of the name network discovery algorithm, we selected a single blog authored by ?? (bangzza) at http://blog.ohmynews.com/bangzza. This blog was chosen because of its high number of readers and comments. The blog is hosted on a popular website called OhMyNews, which is ranked as the sixth most influential media outlet among newspapers and broadcasting companies and as the seventh most powerful media outlet in Korea (Chang, 2005).
OhMyNews was ranked as one of the top three web sites in terms of blog users in 2009 (Rankey.com, 2009). It is frequently ranked as the most popular blog site in Korea, registering over 20 million page views per day during the presidential election. Users of OhMyNews, known as "news guerrillas", contribute news articles on the Web site. OhMyNews allows individuals in far-flung locations to come together, share, and build strong ties and a sense of community (united in ideology even if separated by geographic distance) that fosters a true grassroots movement (Streitmatter, 2001). The comparative influence of progressive, independent, and alternative online news media has drawn global attention, and OhMyNews has been at the center of the discussion (Schroeder, 2004).
For our tests, we retrieved and analyzed a sample set of 943 comments (posted between April 2008 and April 2009) from the selected blog. In the study, we relied on an interactive tag cloud feature available in KINM to explore and evaluate all names and nicknames that were found automatically (see Figure 2).
Figure 2: An interactive tag cloud showing the 50 most frequently used name/nickname candidates found in the sample dataset
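For illustration only, the data behind such a tag cloud could be prepared as below: count every candidate mention, keep the most frequent ones, and map each count to a display size. The linear size scaling, the size range, and the function name `tag_cloud_weights` are assumptions; how KINM sizes its tags is not stated.

```python
from collections import Counter


def tag_cloud_weights(mentions, n=50, min_size=10, max_size=32):
    """Return (word, count, font_size) for the n most frequent candidates."""
    top = Counter(mentions).most_common(n)
    if not top:
        return []
    highest = top[0][1]
    lowest = top[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return [
        (word, count,
         round(min_size + (count - lowest) * (max_size - min_size) / span))
        for word, count in top
    ]


# Hypothetical list of candidate mentions, one entry per occurrence.
mentions = ["2mb", "2mb", "2mb", "alice", "alice", "bob"]
cloud = tag_cloud_weights(mentions)
```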
- The evaluation procedure involved clicking on each word found by the name network algorithm and exploring the context where each instance of the word was used (see Figure 3). The purpose of this semi-automated analysis was to discover which name/nickname candidates were identified incorrectly and why.
Figure 3: A list of messages containing "2MB"
- This semi-automated analysis revealed a set of additional syntactic and semantic clues that can be used to improve the accuracy of the name network discovery algorithm.
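The context lookup used in this evaluation resembles a keyword-in-context listing. The sketch below is a hedged approximation of that step: the function name `keyword_in_context`, the `width` parameter, and the sample comments are hypothetical, and the actual KINM interface may present the messages differently.

```python
import re


def keyword_in_context(comments, word, width=20):
    """List (author, snippet) for every case-insensitive occurrence of word."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    hits = []
    for c in comments:
        text = c["text"]
        for match in pattern.finditer(text):
            # Keep up to `width` characters on each side of the match.
            left = max(0, match.start() - width)
            right = min(len(text), match.end() + width)
            hits.append((c["author"], text[left:right]))
    return hits


# Hypothetical comments showing two different uses of the same candidate.
comments = [
    {"author": "alice", "text": "My old laptop only had 2MB of memory"},
    {"author": "bob", "text": "I disagree with 2mb on this"},
    {"author": "carol", "text": "Nice post"},
]
hits = keyword_in_context(comments, "2mb", width=10)
```

Reading the two snippets side by side is what lets a reviewer decide that the first occurrence is a memory size and the second a reference to a person.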
The following set includes clues suggesting that a word is likely to be a nickname:
- a word candidate is followed by a context word such as "?" (an honorific) or "?" (Mr./Ms.); other possibilities, although rare, include "?" (Mr., for younger males) or "??" (Miss/Ms., for younger females) at the end of a word candidate, and "???" (Mr.) or "??" (Miss) at the beginning
- a word candidate contains a combination of characters (Korean, English and/or Chinese), symbols (e.g., underscores, hyphens) and numbers
- a word candidate appears to be a real name, which is almost always three characters: a single-character last name followed by a two-character first name, which may be found in a dictionary of first names and/or common characters used therein
- a word candidate is a less common, non-topic word (e.g., "???", raccoon)
- a word candidate is followed by punctuation indicative of someone being addressed (e.g., "/" or "")
- a word candidate contains patterns indicative of non-native words: phonetic Koreanization of English (e.g., "?????" for "mediamogul", Media Mogul) or phonetic romanization of Korean (e.g., "jihwaja" for "???")
The second set includes clues suggesting that a word is NOT likely to be used as a nickname:
- a word candidate is a phrase, for example if the nickname input (the "FROM" field) is used more like a subject line (possible indicators include white spaces and length)
- a word candidate consists of a single character (e.g., "a" or "?")
- a word candidate consists of netspeak, including emoticons (e.g., "_"), slang and abbreviations (e.g., using "2MB" to refer to the current Korean president), and onomatopoeia (e.g., "??" tsk tsk, "??" heehee, "??" haha, "?" hmm)
- a word candidate appears more than one time in the comment
- a word candidate consists of random characters (e.g., "????" or "asdf")
- a word candidate is a short, conversational word or phrase (e.g., "??" me, "???" oh no, "???" so/therefore)
- a word candidate is a common word or idea in the given context/topic (e.g., "????" Republic of Korea, "????" a newly created word used to refer to political fanatics)
Conclusion
- This research briefly reviewed some of the studies related to large-scale blog analysis using automatic tools.
- We reviewed the process of developing our own analysis tool, KINM. The main goal of KINM is to automate the process of finding communication networks in the Korean blogosphere that accurately represent social interactions among blog readers.
- To find these networks, the system relies on a set of text mining techniques to look for personal names and nicknames in users' comments.
- To address some of the challenges associated with the automated discovery of names and nicknames in Korean texts, this paper also presented an exploratory study of a sample dataset.
- The study suggests a set of additional rules to improve the accuracy of the current name/nickname discovery algorithm used in KINM.
- These additional rules will be incorporated into KINM and evaluated in a subsequent study.
This research was supported by the WCU (World Class University) program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology (No. 515-82-06574). http://english-webometrics.yu.ac.kr