Min Song, Ph.D. - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Min Song, Ph.D.


1
IS698 Web Mining
  • Min Song, Ph.D.
  • Course Web Page
  • http://web.njit.edu/~song/courses/web_mining/is698_webmining_syllabus.html
  • and
  • Moodle

2
Course structure
  • The course has two parts:
  • Lectures: introduction to the main topics
  • One project (done either individually or in a group)
  • 1 research project.
  • Lecture slides will be made available on the
    course web page and on Moodle.

3
Grading
  • Class Participation: 10%
  • Assignments: 20%
  • Midterm: 25%
  • Projects: 45%

4
Prerequisites
  • Knowledge/Experience of
  • Java programming

5
Teaching materials
  • Required Text
  • Web Data Mining: Exploring Hyperlinks, Contents,
    and Usage Data, by Bing Liu, Springer, ISBN
    3-540-37881-2.
  • References
  • Data Mining: Concepts and Techniques, by Jiawei
    Han and Micheline Kamber, Morgan Kaufmann, ISBN
    1-55860-489-8.
  • Principles of Data Mining, by David Hand, Heikki
    Mannila, and Padhraic Smyth, The MIT Press, ISBN
    0-262-08290-X.
  • Introduction to Data Mining, by Pang-Ning Tan,
    Michael Steinbach, and Vipin Kumar,
    Pearson/Addison Wesley, ISBN 0-321-32136-7.
  • Machine Learning, by Tom M. Mitchell,
    McGraw-Hill, ISBN 0-07-042807-7.

6
Topics
  • Introduction
  • Data pre-processing
  • Association rules and sequential patterns
  • Classification (supervised learning)
  • Clustering (unsupervised learning)
  • Post-processing of data mining results
  • Question Answering
  • Full-Text mining
  • Partially (semi-) supervised learning
  • Opinion mining and summarization
  • Link analysis

7
Feedback and suggestions
  • Your feedback and suggestions are most welcome!
  • I need them to adapt the course to your needs.
  • Let me know if you find any errors in the
    textbook.
  • Share your questions and concerns with the class:
    very likely others have the same ones.
  • No pain, no gain
  • The more you put in, the more you get
  • Your grades are proportional to your efforts.

8
Rules and Policies
  • Statute of limitations: No grading questions or
    complaints, no matter how justified, will be
    listened to more than one week after the item in
    question has been returned.
  • Cheating: Cheating will not be tolerated. All
    work you submit must be entirely your own. Any
    suspicious similarities between students' work
    will be recorded and brought to the attention of
    the Dean. The MINIMUM penalty for any student
    found cheating will be a 0 for the item in
    question and a drop of one letter in the final
    course grade. The MAXIMUM penalty will be
    expulsion from the University.
  • Late assignments: Late assignments will not, in
    general, be accepted. They will never be accepted
    if the student has not made special arrangements
    with me at least one day before the assignment is
    due. If a late assignment is accepted, it is
    subject to a reduction in score as a late
    penalty.

9
Web mining Examples
  • Link analysis
  • How does Google work?
  • How to find communities on the Web?
  • Structured data extraction
  • Web information integration
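
As a rough illustration of the link analysis question above, here is a minimal PageRank-style sketch in Python. PageRank is the link-based ranking algorithm commonly associated with Google; the toy graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration, not values from the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages; `links` maps a page to the pages it links to."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A links to B and C, B to C, C to A, D to C.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Pages with many in-links from well-ranked pages score highest, while a page nobody links to keeps only the random-jump share.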

10
Example: Web data extraction
[Figure: a web page with two data regions (Data region1 and Data region2), each containing several data records]
11
Align and extract data items (e.g., region1)
image1 | EN7410 17-inch LCD Monitor Black/Dark charcoal | | 299.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image2 | 17-inch LCD Monitor | | 249.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image3 | AL1714 17-inch LCD Monitor, Black | | 269.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image4 | SyncMaster 712n 17-inch LCD Monitor, Black | Was 369.99 | 299.99 | Save 70 After 70 mail-in-rebate(s) | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
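
The idea of turning the data records above into aligned fields can be sketched with simple pattern matching. This is a simplified stand-in, not the alignment technique covered in the course: it assumes each record starts with an image token, that prices look like `299.99`, and that `Add to Cart` marks the end of the product-specific text. The function and field names are hypothetical.

```python
import re

# Assumption: a price is digits, a dot, and exactly two decimals.
PRICE = re.compile(r"\b\d+\.\d{2}\b")


def extract_record(text):
    """Split one flattened record into image, name, price, and list price."""
    image, rest = text.split(" ", 1)
    # Everything before "Add to Cart" is specific to the product.
    body = rest.split("Add to Cart", 1)[0].strip()
    prices = [float(p) for p in PRICE.findall(body)]
    first = PRICE.search(body)
    name = body[: first.start()] if first else body
    # Drop a trailing "Was" label that belongs to the old price, not the name.
    name = re.sub(r"\s*\bWas$", "", name.strip())
    return {
        "image": image,
        "name": name,
        # With two prices, the first is the old price and the last is current.
        "price": prices[-1] if prices else None,
        "list_price": prices[0] if len(prices) > 1 else None,
    }


raw_records = [
    "image1 EN7410 17-inch LCD Monitor Black/Dark charcoal 299.99 Add to Cart (Delivery / Pick-Up ) Penny Shopping Compare",
    "image2 17-inch LCD Monitor 249.99 Add to Cart (Delivery / Pick-Up ) Penny Shopping Compare",
    "image3 AL1714 17-inch LCD Monitor, Black 269.99 Add to Cart (Delivery / Pick-Up ) Penny Shopping Compare",
    "image4 SyncMaster 712n 17-inch LCD Monitor, Black Was 369.99 299.99 Save 70 After 70 mail-in-rebate(s) Add to Cart (Delivery / Pick-Up ) Penny Shopping Compare",
]
records = [extract_record(r) for r in raw_records]
for rec in records:
    print(rec)
```

Hand-written patterns like these break as soon as a site changes its layout, which is one reason automatic extraction of data records is studied as a web mining problem.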
12
Resources
  • ACM SIGKDD
  • Data mining related conferences
  • Data mining: KDD, ICDM, SDM, ...
  • Databases: SIGMOD, VLDB, ICDE, ...
  • AI: AAAI, IJCAI, ICML, ACL, ...
  • Web: WWW, ...
  • Information retrieval: SIGIR, CIKM, ...
  • KDnuggets: http://www.kdnuggets.com/
  • News and resources. You can sign up!
  • Our text and reference books

13
What is web mining?
  • The process of discovering knowledge from web
    page content, hyperlink structure, and usage data
  • Builds on existing data and text mining
    techniques, but adds many new tasks and
    algorithms
  • Three types, based on sources of data (often
    combined in practice)
  • Web structure mining
  • Web content mining
  • Web usage mining

14
Importance of web data mining
  • The web is unique!
  • Amount of information is huge and still growing,
    on almost any topic, and changes continuously
  • No single editorial control: significant
    variations in quality, much duplication, and
    widely varying data formats
  • Significant information is linked (within and
    between web sites)
  • The Web reflects a virtual society: interactions
    among people, organizations, and automated
    systems, no longer limited by geography
  • The Web presents challenges and opportunities for
    mining

15
How to make best use of data?
  • Knowledge discovered from web data can be used
    for competitive advantage.
  • Online retailers (e.g., amazon.com) are largely
    driven by data mining.
  • Web search engines are based on information
    retrieval (text mining), data mining, and NLP.
  • Web surfers/searchers need tools to find,
    recommend, organize, and extract useful
    information from the Web

16
Semester Research Project
  • Individual, or groups of two (will grade each
    other)
  • Plus formal and informal feedback from instructor
  • Should be the beginning of what could be a
    publishable project.
  • On some aspect of web mining
  • Topic will be given by instructor or proposed by
    student and approved by instructor
  • Students present
  • Ideas early in the semester for feedback
  • Completed project at the end of the semester
  • Write a scientific paper at the end.
  • Publish as a technical report, if not more (some
    have been published at AMIS and some are under
    review)

17
Project: Biomedical Fulltext Mining
  • Input data for Web Mining (particularly web
    content mining) consists of document surrogates,
    short web pages, email messages, etc.
  • Fulltext data (books and online articles) has
    become publicly available.
  • Currently, fulltext mining is not well studied.
  • Study fulltext mining in the context of
    Biomedical research problems.

18
BioFulltextMiner
19
Required Software
  • Java (jdk1.6.0 or above)
  • Tomcat 6
  • Apache-ant-1.7.1
  • Eclipse 3.4
  • BioFulltextMiner.zip
    (http://base.njit.edu/vline/BioFullTextMiner.zip)