Data and Text Mining for Computational Biology



Slides: 28
Provided by: VasileiosH9


Transcript and Presenter's Notes

Title: Data and Text Mining for Computational Biology

Data and Text Mining for Computational Biology
  • Introduction

Course information
  • CS 6365
  • Data and Text Mining for Computational Biology
  • Meets Tuesday and Thursday 7:00-8:15
    pm in ECSS 2.412

  • Vasileios Hatzivassiloglou
  • Associate Professor, Computer Science
  • Founding Professor, Bioengineering
  • Research focus: discover knowledge from massive
    amounts of raw data
  • Data is not the same as information
  • Information overload

Research Interests
  • Text analysis, machine learning, intelligent
    information retrieval, summarization, question
    answering, bioinformatics, medical informatics

Contact information
  • Office hours: Tuesday and Thursday 6:00-7:00 pm
    and by appointment
  • Office location: ECSS 3.406
  • Phone: (972) 883-4342
  • Teaching Assistant: TBA

Course goals
  • Introduce the field of bioinformatics
  • Discuss primary techniques used for data mining
  • Introduce text mining and additional issues it
    brings to data mining methods
  • Use examples from computational biology

Intended audience
  • For both computer scientists and biologists
  • Not an easy task to balance the two
  • Focus on data and text mining algorithms and
  • Coverage of machine learning background
  • No extensive algorithmic analysis / computational
  • Medium level of programming

Prerequisites
  • Officially CS 6325 Introduction to
  • Waived for this offering of the course
  • You should know
  • Basic data structures (multidimensional arrays,
    hash tables, binary trees)
  • One high-level programming language and be able
    to adapt to a new one as needed
  • Be able to install and use external software

You need not know
  • Molecular biology
  • Machine learning
  • Data mining (in general)
  • Text analysis / natural language processing
  • Information retrieval
  • Artificial intelligence

Course level
  • Introductory graduate course (MS or first-year
  • Maturity in programming and data structures at the
    level of a Computer Science senior
  • Ability to access (and interest in accessing) the
    primary literature in a guided fashion

Course structure
  • 6 lectures on biological background and
    bioinformatics in general
  • 6 lectures on data similarity
  • 8 lectures on data mining methods
  • 3 lectures on text mining and knowledge mining
  • student presentations of research papers (3-4

Expected work load
  • Two homework sets given in mid-to-late September
    and mid-to-late October
  • Two weeks to turn in each homework set
  • Mid-term exam in early October
  • Each student selects two or three research papers
    to review in late October
  • Student presentations of research papers in the
    last week of November / first week of December
  • Final exam

Course project
  • In lieu of the research papers and presentation,
    students may elect to work on a project in teams
    of two or three
  • Project is chosen by the students with the advice
    and consent of the instructor
  • Project investigation/implementation should be
    approximately 1.5-2 times the work required for a
    regular homework

  • Each student selects their own programming
    language (must be available at UTD and accessible
    to TA)
  • Examples: C, C++, Java, Perl, Python
  • Can also use a package/programming environment
    specifically tailored to bioinformatics

One likely package
  • R (http//
  • R is the free alternative to S-Plus developed at
    AT&T Research
  • S-Plus is the extensible, programmable
    alternative to statistical packages like SAS and
  • If you know C, you will be right at home with R

Another likely package
  • BioPerl (http//
  • A collection of library modules in Perl written
    by and for bioinformaticians
  • Perl supports high-level operations such as
    hashes as a basic data structure, string
    matching, and regular expressions
  • Perl is really bad at OOP and efficiency
  • Easy to learn
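
The high-level operations mentioned above (hashes, string matching, regular expressions) are also available in the other languages allowed for the course. As a rough illustration only, here is a minimal Python sketch. The function names `base_counts` and `motif_positions` and the sample sequence are hypothetical, chosen for this example and not taken from BioPerl or the course materials.

```python
import re


def base_counts(seq):
    """Count each character of a sequence using a dict (Python's hash table)."""
    counts = {}
    for base in seq:
        counts[base] = counts.get(base, 0) + 1
    return counts


def motif_positions(seq, motif):
    """Return the 0-based start positions of non-overlapping motif matches."""
    # re.escape treats the motif as a literal string, not a pattern.
    return [m.start() for m in re.finditer(re.escape(motif), seq)]


if __name__ == "__main__":
    dna = "ATGCGATATGAAATGC"  # made-up sequence for illustration
    print(base_counts(dna))
    print(motif_positions(dna, "ATG"))
```

The dict plays the role of a Perl hash here, and `re.finditer` stands in for Perl's built-in pattern matching.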

Grading
  • Class participation: 20%
  • Homework assignments: 30% (total)
  • Midterm: 10%
  • Research paper presentation or project: 20%
  • Final exam: 20%

  • No good integrated textbook on data mining from a
    computational biology perspective
  • We will use one textbook covering bioinformatics
    algorithms and another textbook on data mining
    in general, plus additional chapters from other
    books and research articles
  • Copies of chapters / research articles will be

Recommended textbook 1
  • An Introduction to Bioinformatics Algorithms
    (Computational Molecular Biology), by Neil C.
    Jones and Pavel A. Pevzner, MIT Press, 2004.
  • ISBN 0262101068
  • 448 pages
  • Available on for 41, Barnes and Noble
    for 60

Recommended textbook 2
  • Data Mining: Concepts and Techniques, by Jiawei
    Han and Micheline Kamber, Elsevier, second
    edition, 2006.
  • ISBN 1558609016
  • 800 pages
  • Available on for 52, Barnes and Noble
    for 65

Supplementary textbooks
  • Bioinformatics: The Machine Learning Approach,
    by Pierre Baldi and Soren Brunak, 2nd edition,
  • Data Mining: Multimedia, Soft Computing, and
    Bioinformatics, by Sushmita Mitra and Tinku
    Acharya, 2003.
  • Both of the above are available as full-text
    eBooks via http//

Background reading
  • Biology: Molecular Biology of the Cell, by Bruce
    Alberts et al., 4th edition, 2002.
  • Machine learning: Machine Learning, by Tom
    Mitchell, 1997.

Background reading (II)
  • Statistics: The Elements of Statistical
    Learning: Data Mining, Inference, and Prediction,
    by Trevor Hastie, Robert Tibshirani and Jerome
    Friedman, 2001.
  • Data structures and algorithms: Introduction to
    Algorithms, by Thomas H. Cormen, Charles E.
    Leiserson, Ronald L. Rivest, and Clifford Stein,
    2nd edition, 2001.

So what is it all about?
  • Three parts
  • Bioinformatics / computational biology
  • Data mining
  • Text mining

Bioinformatics
  • A fast-developing discipline
  • We will discuss
  • basic concepts of molecular biology
  • databases of biological data
  • structure and function of DNA, RNA, proteins
  • sequence searching (BLAST)
  • sequence similarity and comparison
  • protein structure (2D and 3D)
  • protein motifs and patterns
  • microarrays
  • phylogenetics
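
To give a flavor of the "sequence similarity and comparison" topic listed above, here is a minimal sketch of one common measure, the Levenshtein edit distance, computed by dynamic programming. It is an illustrative assumption about the kind of method involved, not the specific algorithm the course presents; the unit costs for insertion, deletion, and substitution are chosen for simplicity, and the name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b with unit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1      # drop ca from a
            insertion = curr[j - 1] + 1  # add cb to a
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))  # one deletion, so prints 1
```

A smaller distance means the two sequences are more similar. Biological alignment methods typically replace the unit costs with scoring matrices and gap penalties.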

Data mining
  • Given a large amount of data of known types,
    extract useful information
  • We will discuss
  • data cleanup and outliers
  • model construction
  • data and dimensionality reduction
  • classification
  • prediction / probability estimation
  • clustering
  • measuring performance
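
As one possible illustration of "measuring performance" for a classifier, the sketch below computes precision, recall, and F1 from true and predicted labels. The choice of these three metrics, the function name `precision_recall_f1`, and the toy labels are assumptions made for this example; the slides do not say which measures the course uses.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = [1, 1, 0, 0, 1]      # made-up gold labels
    predicted = [1, 0, 0, 1, 1]  # made-up classifier output
    print(precision_recall_f1(truth, predicted))
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.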

Text mining
  • Not only do we have a large amount of raw data, but
    we don't know what each item means
  • We will discuss
  • tokenization and basics of text processing
  • recognition of terms and entities
  • classification
  • dictionary creation
  • relationship learning and extraction
  • document level clustering and information