Schema Evolution in Wikipedia

1
Schema Evolution in Wikipedia
Toward a Web Information System Benchmark
Carlo A. Curino, Hyun J. Moon, Letizia Tanca, Carlo Zaniolo
2
Motivations
  • Understand the role of Schema Evolution (SE)
    in Web Information Systems (WIS). Our guess was:
  • on the web everything evolves faster -> SE
    should be relevant!
  • Compare the evolution in Traditional IS (known)
    and WIS
  • Obtain an in-depth understanding of the Wikipedia
    DB backend
  • Lay the foundations of a Benchmark for Schema
    Evolution
  • Why MediaWiki (the software platform behind
    Wikipedia)?
  • Popular (used by >30,000 websites, including
    Wikipedia)
  • Open-source and well-documented software
  • Wikipedia DATA and QUERIES are also under an
    open-source license


3
What we did
  • We developed a tool-suite to analyze Web
    Information System DB backends
  • We collected and dissected the MediaWiki schema
    history (170 schema versions in 4.5 years)
  • We released the tool-suite and data as a first
    step towards a Benchmark for Schema Evolution

4
MediaWiki Architecture
  • Classical Web architecture based on Linux,
    Apache, MySQL, PHP (LAMP)
  • Big scalability issues
  • Wikipedia is one of the 10 most popular websites
    on the WWW (about 29k requests/sec on average,
    with peaks up to 85k requests/sec)
  • Several layers of caching (both explicit and
    not, at the DBMS and WS level)
  • According to the developers, DBMS performance is
    the major bottleneck; DB size > 700 GB, not
    counting the multimedia content!
  • plus poor load partitioning (one language per
    server)!

5
The Schema
  • Tables can be grouped into:
  • article and content
  • links and structure
  • users and permissions
  • performance and caching
  • statistics and special features
  • history and archival (a big portion of the
    schema, 1/3: they don't know it, but they need
    a temporal DB!)

6
Basic Statistics 1
  • Schema Evolution
  • 170 versions in 4.5 years
  • almost 250 increase

7
Basic Statistics 2
  • More frequent schema changes far away from
    releases
  • Schema Elements Lifetime
  • a group of stable relations
  • young tables and columns
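
The lifetime of schema elements can be illustrated with a minimal sketch that counts in how many versions each table appears. The toy history below is hypothetical and is not the released dataset; the function name is also an assumption made for illustration.

```python
from typing import Dict, List, Set


def element_lifetimes(versions: List[Set[str]]) -> Dict[str, int]:
    """Count in how many schema versions each element (table or column) appears."""
    lifetimes: Dict[str, int] = {}
    for elements in versions:
        for name in elements:
            lifetimes[name] = lifetimes.get(name, 0) + 1
    return lifetimes


# Hypothetical toy history: each set lists the tables present in one version.
history = [
    {"cur", "old", "user"},
    {"cur", "old", "user", "watchlist"},
    {"page", "revision", "text", "user", "watchlist"},
]

lifetimes = element_lifetimes(history)

# "Stable" elements are those present in every version of the history.
stable = sorted(t for t, n in lifetimes.items() if n == len(history))
print(stable)
```

Elements with a low count that appear only in the latest versions would correspond to the "young" tables and columns.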

8
Type of Changes
  • NOTE: it doesn't add up to 100% since several
    changes might coexist in an evolution step
  • total lack of integrity constraints (apart from
    primary keys)!
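
Why the shares of change types need not add up to 100 can be seen in a small sketch. The evolution steps and change-type labels below are made up for illustration; each step may contain more than one type of change, so it is counted once per type.

```python
from collections import Counter

# Hypothetical evolution steps; each step lists the change types it contains.
steps = [
    {"add column"},
    {"add column", "drop column"},
    {"create table", "add column"},
    {"rename column"},
]

# Count, for each change type, the number of steps in which it occurs.
counts = Counter(change for step in steps for change in step)

# Share of steps containing each change type, as a percentage.
share = {change: 100 * n / len(steps) for change, n in counts.items()}

for change, pct in sorted(share.items()):
    print(f"{change}: {pct:.0f}% of steps")

# The total exceeds 100 because two of the four steps mix change types.
total = sum(share.values())
print(f"sum of shares: {total:.0f}%")
```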

9
Type of Changes
  • NOTE: simple schema modifications are the most
    common

10
Schema Changes per Version
  • NOTE: version 41-42 represents a MAJOR evolution
    step where article versioning management is
    heavily modified!
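
Counting schema changes per version can be sketched as a diff between two consecutive schema snapshots. This is a minimal illustration under stated assumptions, not the tool-suite's actual method: schemas are modeled as plain dictionaries, the snapshots are invented, and renames are not detected (they show up as a drop plus an add).

```python
from typing import Dict, List, Set

Schema = Dict[str, Set[str]]  # table name -> set of column names


def diff_schemas(old: Schema, new: Schema) -> Dict[str, List[str]]:
    """List tables and columns added or dropped between two schema versions."""
    changes: Dict[str, List[str]] = {
        "tables_added": sorted(new.keys() - old.keys()),
        "tables_dropped": sorted(old.keys() - new.keys()),
        "columns_added": [],
        "columns_dropped": [],
    }
    # Column-level changes are only meaningful for tables present in both versions.
    for table in sorted(old.keys() & new.keys()):
        changes["columns_added"] += [f"{table}.{c}" for c in sorted(new[table] - old[table])]
        changes["columns_dropped"] += [f"{table}.{c}" for c in sorted(old[table] - new[table])]
    return changes


# Hypothetical snapshots of two consecutive versions.
v1: Schema = {"user": {"user_id", "user_name"}, "cur": {"cur_id", "cur_text"}}
v2: Schema = {"user": {"user_id", "user_name", "user_email"}, "page": {"page_id"}}

changes = diff_schemas(v1, v2)
change_count = sum(len(items) for items in changes.values())
print(changes, change_count)
```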

11
Impact on the Applications
  • NOTE: over 4000 queries, from which we extract
    75 templates

12
Wikipedia Profiler Queries
  • NOTE: the 500 most common templates out of 2k
    extracted from over 780 million query instances
    from the On-Line Wikipedia Profiler:
    http://noc.wikimedia.org/cgi-bin/report.py
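
Grouping query instances into templates can be sketched by replacing literal values with placeholders and counting the resulting strings. This is one simple way to do it, assumed here for illustration; the slides do not say how the templates were actually extracted, and the sample queries are invented.

```python
import re
from collections import Counter

# Quoted string literals (with backslash escapes) and standalone numbers.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_SPACE = re.compile(r"\s+")


def to_template(query: str) -> str:
    """Replace literals with '?' and normalize whitespace."""
    query = _STRING.sub("?", query)  # strings first, so digits inside them are not touched twice
    query = _NUMBER.sub("?", query)
    return _SPACE.sub(" ", query).strip()


# Hypothetical query instances.
queries = [
    "SELECT * FROM page WHERE page_id = 42",
    "SELECT * FROM page WHERE page_id = 7",
    "SELECT  user_name FROM user WHERE user_name = 'Alice'",
]

templates = Counter(to_template(q) for q in queries)
for template, count in templates.most_common():
    print(count, template)
```

Digits that are part of an identifier (for example a column called `page_2`) are left alone, because the number pattern only matches at word boundaries.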

13
Traditional vs Web IS
  • Comparing our results with existing analysis for
    Traditional IS
  • WIS evolve faster: 38 (w.r.t. Sjoberg) and 539
    (w.r.t. Marche)
  • Collaborative WISs embrace information sharing,
    thus we got way better data: 170 versions vs. 2
    and 9, respectively. And we can share them
    (benchmark).
  • More in-depth analysis by means of SMOs

14
Towards a Unified Benchmark
  • We share the schema history we collected, the
    analysis data (raw and stats), the queries, and
    the tool-suite
  • http://yellowstone.cs.ucla.edu/schema-evolution/index.php
  • Goal: create a benchmark for schema evolution
    and, in general, a standard relational DB dataset

15
Conclusion
  • So far we:
  • Created a tool-suite for schema evolution
    analysis
  • Dissected the Wikipedia Schema Evolution history
  • Established the core of a DB Schema Evolution
    Benchmark (released)
  • We developed tools to support Graceful Schema
    Evolution (PRISM)
  • We plan to:
  • Extend the analysis to several other Open-Source
    WIS (Joomla!, TikiWiki, Slashcode, Zen-Cart)
  • Extend the analysis to public scientific DBs
    (Genome, HGVS)
  • Involve other research groups to define a
    commonly agreed Benchmark
  • Improve the tool-suite and integrate it into PRISM
