The Ethics of Large-Scale Web Data Analysis (Webmetrics)

Transcript and Presenter's Notes
1
The Ethics of Large-Scale Web Data Analysis
(Webmetrics)
  • Mike Thelwall, Statistical Cybermetrics Research
    Group, University of Wolverhampton, UK
  • Rob Ackland, Australian Demographic and Social
    Research Institute, Australian National University

Virtual Knowledge Studio (VKS)
Information Studies
2
Contents
  • What is webmetrics?
  • Context: Online access to personal information
  • Researchers' use of personal information
  • Confidentiality and anonymity
  • Resource issues
  • What ethical considerations apply to collecting
    and analysing web data on a large scale from
    unaware web publishers?

3
1. What is webmetrics?
  • Large-scale analysis of web-based data
  • Collecting and quantitatively analysing online
    information
  • Objective is not to find information about
    individuals but identify trends
  • Data gathered with tools such as VOSON,
    SocSciBot, Issue Crawler and LexiURL

4
Example
  • VOSON hyperlink network of political parties
    from 6 countries (Ackland and Gibson, 2006)
  • Node size proportional to outdegree
  • 76 nodes

5
Example: Links between EU universities
  • Hyperlink network of geopolitically connected
    countries (Austria, Belgium, Finland, France,
    Germany, Italy, NL, Norway, Poland, Spain,
    Sweden, Switzerland, UK)
  • AltaVista link searches
  • Normalised linking, smallest countries removed
6
Link associations between social network sites
7
Example: Blog searching
8
2. Context: Online access to personal information
  • Blogs, social network sites, personal web sites
    contain information that is
  • Private and protected (invisible to researchers)
  • Intentionally public
  • Publicly private¹ (intended for friends but
    allowed to be public)
  • Unintentionally public (public but believed by
    owner to be private)

1. Lange (2007)
9
Accessing public information
  • Commercial search engines
  • Web crawlers
  • Internet Archive (includes deleted info; see the
    sketch below)
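A minimal sketch of the last point, assuming the Internet Archive's public
Wayback Machine availability endpoint (archive.org/wayback/available); the
exact endpoint and response format should be checked against current
Internet Archive documentation:

  import json
  import urllib.parse
  import urllib.request

  page = "http://example.com/"  # hypothetical page of interest
  query = urllib.parse.urlencode({"url": page, "timestamp": "20080101"})
  with urllib.request.urlopen(
          "https://archive.org/wayback/available?" + query) as resp:
      data = json.load(resp)

  # The service reports the closest archived snapshot, if any exists
  closest = data.get("archived_snapshots", {}).get("closest")
  if closest:
      print("Archived copy:", closest["url"])
  else:
      print("No archived snapshot found for", page)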

10
Who is using Dataveillance?
  • Dataveillance¹: downloading or otherwise
    gathering data on internet users in order to
    influence their behaviour
  • Google can use email, searching, blogging and
    social network activities to target advertising
    (and may report to the US government)
  • Amazon can use past activities to target
    adverts or improve its web site

1. Zimmer (2008)
11
3. Researchers' use of personal information
  • Key issue for large-scale research: data
    from/about the unaware is used without their
    approval, and possibly for purposes that they
    might disagree with
  • Which ethical safeguards should be taken for this
    kind of research?

12
Issue 1: People vs. documents
  • Traditionally, documents can be researched
    without approval, but people can't
  • Even harsh criticism is fair practice (e.g., a
    book review/analysis)
  • Since web pages are documents, researching them
    without permission is normally OK

13
Issue 2: Invasion of privacy? Natural vs. normative
  • A situation is naturally private¹ if a reasonable
    person would expect privacy
  • A situation is normatively private¹ if a
    reasonable person would expect others to protect
    their privacy
  • Non-secure web pages/data are typically at most
    naturally private, not normatively private
  • Accessing them is not normally invading privacy,
    even if this is undesired by page owners and has
    negative consequences

1. Moor (2004)
14
4. Confidentiality and anonymity
  • When should anonymity be granted to research
    subjects (page owners)?
  • When a possibly undesired label is attached
    (e.g., hate group, terrorist)
  • When undesired groups might benefit (e.g., a
    league table of hate groups)
  • When publicly private individuals are singled out
    (e.g., a detailed analysis of an average blogger)
  • Should data be anonymised, as is done for Census
    data used for research?

15
5. Resource issues
  • Accessing a web page uses the owner's server
    time/bandwidth
  • Crawling a web site can use a lot of the owner's
    server time/bandwidth
  • May incur charges or loss of service quality

16
Robots.txt protocol
  • This file lists pages/folders in a web site that
    may not be crawled
  • It does not restrict crawling speed
  • It should be obeyed in research (see the sketch
    below)
  • Most individual users are probably unaware of
    this and so don't use its protection
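A sketch of how a research crawler can respect this protocol, using the
urllib.robotparser module in Python's standard library; the site address
and user-agent string are hypothetical:

  from urllib.robotparser import RobotFileParser

  robots = RobotFileParser()
  robots.set_url("https://example.ac.uk/robots.txt")  # hypothetical site
  robots.read()  # download and parse the site's robots.txt

  url = "https://example.ac.uk/staff/homepage.html"
  if robots.can_fetch("ResearchCrawler", url):
      print("robots.txt allows crawling:", url)
  else:
      print("robots.txt disallows crawling:", url)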

17
Crawling speed
  • Web crawlers should not run so fast that they
    cause service issues
  • Full speed is probably OK on a UK university web
    site but not on a Burkina Faso library web site
  • Use judgement to decide how quickly to crawl and
    the length of pauses between requests (see the
    sketch below)
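A minimal sketch of a polite fetch loop with a fixed pause between
requests; the URLs and pause length are illustrative assumptions, and
slower or smaller servers warrant longer pauses:

  import time
  import urllib.request

  PAUSE_SECONDS = 2.0  # judgement call: increase for low-resource servers

  urls = [
      "https://example.ac.uk/",
      "https://example.ac.uk/research/",
  ]

  for url in urls:
      with urllib.request.urlopen(url) as response:
          html = response.read()
      # ... analyse the page here ...
      time.sleep(PAUSE_SECONDS)  # pause so the crawl does not hog the server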

18
How many pages to crawl?
  • Crawling too many pages puts unnecessary strain
    on the server crawled
  • Use judgement to decide the minimum number of
    pages / crawl depth that is enough (see the
    sketch below)
  • Use search engine queries as a substitute, if
    possible
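A sketch of capping crawl size by page count and link depth; the limits are
placeholders to be set by judgement, and the fetching/link-extraction step
is omitted:

  MAX_PAGES = 500  # the minimum that answers the research question
  MAX_DEPTH = 2

  frontier = [("https://example.ac.uk/", 0)]  # (url, depth) pairs
  seen = set()

  while frontier and len(seen) < MAX_PAGES:
      url, depth = frontier.pop(0)
      if url in seen or depth > MAX_DEPTH:
          continue
      seen.add(url)
      # fetch(url), extract its links (omitted), then queue them one level deeper:
      # frontier.extend((link, depth + 1) for link in extract_links(url))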

19
Automatic search engine searches
  • Research can piggyback on the crawling done by
    commercial search engines
  • No resource implications for site owners
  • Uses search engine Application Programming
    Interfaces (APIs)
  • Search engines specify the maximum number of
    searches per day (see the sketch below)
  • Results are limited to the imperfect web
    crawling/coverage of search engine crawlers
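A sketch of issuing automatic queries within a daily quota. The endpoint,
key, parameters and response handling below are hypothetical placeholders,
not any particular provider's API; real search APIs require registration
and have their own formats and limits:

  import time
  import urllib.parse
  import urllib.request

  MAX_QUERIES_PER_DAY = 1000  # whatever limit the provider specifies
  API_ENDPOINT = "https://api.example-search.com/search"  # hypothetical
  API_KEY = "YOUR-KEY-HERE"  # hypothetical credential

  queries = ["linkdomain:example.ac.uk", "site:example.org blog"]

  for count, q in enumerate(queries):
      if count >= MAX_QUERIES_PER_DAY:
          break  # stop once the daily quota is reached
      params = urllib.parse.urlencode({"q": q, "key": API_KEY})
      with urllib.request.urlopen(API_ENDPOINT + "?" + params) as resp:
          results = resp.read()
      # ... parse and store the results here ...
      time.sleep(1)  # stay well under any per-second rate limit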

20
Summary
  • Researchers need to be aware of potential issues
    when doing large-scale data analysis research
  • Judgement is called for on all of these issues
  • Research does not normally need participant
    permission
  • Be sensitive to the impact of findings and to any
    need for anonymity

21
References
  • Lange, P. G. (2007). Publicly private and
    privately public: Social networking on YouTube.
    Journal of Computer-Mediated Communication,
    13(1). Retrieved May 8, 2008 from
    http://jcmc.indiana.edu/vol13/issue1/lange.html
  • Zimmer, M. (2008). The gaze of the perfect search
    engine: Google as an infrastructure of
    dataveillance. In A. Spink & M. Zimmer (Eds.),
    Web search: Multidisciplinary perspectives (pp.
    77-99). Berlin: Springer.
  • Moor, J. H. (2004). Towards a theory of privacy
    for the information age. In R. A. Spinello & H.
    T. Tavani (Eds.), Readings in CyberEthics (2nd
    ed., pp. 407-417). Sudbury, MA: Jones and
    Bartlett.