Title: Helsinki Institute for Information Technology Scientific Advisory Board Meeting November 15-17, 2004
1Helsinki Institute for Information Technology
Scientific Advisory Board Meeting
November 15-17, 2004
2Participants
- Prof. Alberto Apostolico
- Prof. Christos Faloutsos
- Prof. Bengt Jonsson
- Prof. Randy Katz
- Prof. Martin Kersten
- (Prof. Kari-Jouko Räihä)
- Prof. Mart Saarma
- Prof. John Shawe-Taylor
- Prof. Jukka Paakki
- Prof. Olli Simula
- Dr. Patrik Floréen
- Dr. Aapo Hyvärinen
- Prof. Heikki Mannila
- Prof. Petri Myllymäki
- Prof. Martti Mäntylä
- Prof. Kimmo Raatikainen
- Prof. Hannu Toivonen
- Prof. Esko Ukkonen
- Prof. Eero Hyvönen
- Dr. Marko Turpeinen
- Dr. Giulio Jacucci
- Dr. Greger Lindén
3Goals of the meeting
- Second meeting of the Scientific Advisory Board
- Obtain feedback from the Scientific Advisory
Board on the relevance, quality, and impact of
the current research
- Obtain feedback and suggestions on plans for the
research themes, applications, collaborations,
etc.
- A written evaluation of each group
- Scientific quality, innovativeness, productivity
and impact
- Quality and quantity of industrial and societal
impact
- Feasibility and innovativeness of future plans
- Competence and expertise of the team
- Main strengths and weaknesses
- Overall evaluation and suggestions from the SAB
4Agenda for Monday 15 Nov 04
- 14.00 Welcome, introductions, overview of HIIT
(Martti Mäntylä and Esko Ukkonen)
- Basic Research Unit activities
- 16.00 Data Mining General (Heikki Mannila)
- 16.30 Data Mining Applications (Hannu Toivonen)
- 17.00 Neuroinformatics (Aapo Hyvärinen)
- 17.30 Adaptive Computing Systems (Patrik Floréen)
- 18.00-18.30 SAB internal discussions
- 20.00 Dinner at Restaurant Kappeli,
Eteläesplanadi 1
5Agenda for Tuesday 16 Nov 04
- Basic Research Unit, at Kumpula campus
- 9.30 Demonstrations and discussion with
researchers
- 11.30 Lunch at Kumpula campus, Chemicum
- 12.30 Transportation to HTC
6Agenda for Tuesday 16 Nov 04
- Advanced Research Unit Activities, HTC
- 13.00 Mobile Computing (Kimmo Raatikainen)
- 13.30 Semantic Computing (Eero Hyvönen)
- 14.00 User Experience Research (Martti Mäntylä,
Giulio Jacucci)
- 14.30 Break and refreshments
- 14.45 Complex Systems Computation Group (Petri
Myllymäki)
- 15.15 Digital Contents Communities Group (Marko
Turpeinen)
- 15.45 Digital Economy (Jukka Kemppinen)
- 16.15 Break and refreshments
- 16.30 Demonstrations and discussion with
researchers
- 20.30 Dinner at Restaurant George, Kalevankatu
17, Helsinki
7Agenda for Wednesday 17 Nov 04
- At HTC
- 9.30 A la carte (SAB may request discussions,
interviews, further demonstrations)
- 10.30 SAB internal discussions
- 12.00 Lunch at Aqua restaurant in High Tech
Center
- 13.00 Feedback from SAB and discussion
8Helsinki Institute for Information Technology
- Joint research institute of University of
Helsinki and Helsinki University of Technology
- Goals: strategic research in information
technology and related topics, aiming at high
scientific, industrial, and societal impact
- Main themes: mobile computing, user experience,
intelligent systems, semantic Internet, societal
media, digital economy, adaptive computation,
data mining, bioinformatics, computational
neuroscience
9Organization: two parts
- Advanced Research Unit (ARU), since 1999
- 2-3 year industry co-funded strategic research
projects, CEC research, basic research
- Located primarily in HTC in Ruoholahti
- Martti Mäntylä, Research Director
- Basic Research Unit (BRU), since 2002
- Long-term research in areas relevant to other
sciences and to industry
- Located in the premises of the departments of
computer science of the University of Helsinki
and Helsinki University of Technology
- Esko Ukkonen, Research Director
10[Organization diagram with the labels UH Comp. Sci., TKK CS, BRU, ARU]
11Organization of HIIT
- Joint board, scientific advisory board, and
industrial advisory board
- The senior researchers typically have positions
also in one of the departments of computer
science
- No permanent positions
12[Figure: positioning of HIIT BRU and HIIT ARU between other sciences and companies. Axes: level of risk against time (0-2 years, 2-5 years, 5 years). Region labels: Basic Research, Core Projects; Industry co-funded research; Advanced Research; Advanced Development; Industrial R&D; R&D]
13Basic Research Unit (BRU)
- established 2002
- basic funding from UH
- main location in the premises of CS Dept of UH
(new Exactum building at Kumpula Campus)
- activities also in Otaniemi campus of TKK, CS
Dept
- infrastructure of CS Dept
- Directors: Heikki Mannila / Esko Ukkonen (from 9/2004)
14Mode of operation
- high-quality basic research in computer science
on areas that have application potential in other
sciences or in industry
- collaboration between universities
- close co-operation with CS departments
- participation in teaching
- international networking, international recruiting
15Personnel profile (2003)
- senior researchers: 5
- prof. H. Mannila, prof. H. Toivonen, prof. J.
Hollmén, doc. P. Floréen, doc. A. Hyvärinen
- PhDs: 8
- PhD students: 17
- students: 12
- adm: 1
- current total about 50 (TKK 10)
- from abroad: 6-8
16Funding profile (2004)
- basic funding from UH (33%): 660 kE
- Academy of Finland (32%): 645 kE; includes
research grants, academy professor position,
senior researcher position, 3 postdoc positions
- industry projects (incl. TEKES): 226 kE
- graduate schools (Ministry of Education): 259 kE
- European Union: 116 kE
- TKK: 70 kE
- TOTAL: 1976 kE
17Research programme
- Theory and applications of data mining (Heikki
Mannila, Hannu Toivonen)
- Neuroinformatics (Aapo Hyvärinen)
- Adaptive computing (Patrik Floréen)
18Goals for BRU for 2005-2006
- Expanding and strengthening of the network of
collaboration in Finland and internationally
- Strong recruiting from abroad
- More emphasis on software distribution
- Emphasis on high-quality international research
- Possibly opening a new research theme
19Advanced Research Unit (ARU)
- Established in 1999
- About 80 researchers and staff
- Main themes
- Future Internet
- Intelligent Systems
- Network Society
- Funding (2004): National Technology Agency (59%),
companies (15%), European Union (6%), Academy of
Finland (9%), universities (9%), total some 3.8
M
- Director: Prof. Martti Mäntylä
- Primarily located in High Tech Center, Ruoholahti
20Mode of Operation
- Focusing on a few industrially relevant,
strategic, long-term research areas with high
potential impact
- Core and basic research projects with long-term
vision (5 years)
- Medium term (3-5 years) projects focusing on new
products, services, and technologies
- Complementary impact-related projects (1-2 years)
- Multidisciplinary and cross-disciplinary research
founded on computer science competence
- Networking with complementary research groups
- Strong liaison with ICT and media companies
- International cooperation
21Personnel
- 5 principal scientists
- Prof. Martti Mäntylä, Prof. Kimmo Raatikainen,
Prof. Petri Myllymäki, Prof. Jukka Kemppinen,
Prof. Eero Hyvönen
- 11 senior researchers
- Dr. Pekka Nikander, Dr. Ken Rimey, Dr. Timo
Saari, Prof. Henry Tirri, Dr. Wray Buntine, Dr.
Jorma Rissanen, Dr. Patrik Floréen, Dr. Pekka
Himanen, Dr. Marko Turpeinen, Dr. Timo Saari, Dr.
Markku Stenborg
- 2 post docs
- Dr. Andrei Gurtov, Dr. Giulio Jacucci
- 45 Ph.D. students
- 10 M.Sc. students
- 7 staff
22Goals for 2005-2006
- Maintain, upgrade, and expand competences of
research groups (post-docs, senior researchers)
- Increase further co-operation between research
groups
- Build new alliances with Finnish research units
- Strengthen further existing international
partnerships (UCB, Tsinghua), launch new ones
(MLE, Waseda, KTH, ...)
- Expand CEC funded research
- Commence work on one or two new thematic areas
- Caring of researchers and their careers
- Operational excellence
- Approximately 85-90 researchers
23Challenges
- Bottlenecks limiting the impact
- Load of senior researchers
- Inadequate processes and instruments for
end-to-end research
- Weak link with interesting users and user
communities
- Slow reaction speed, low risk tolerance
- Red tape in recruiting foreign researchers
- Funding and funding instruments
- Less than 10% basic funding is not sustainable
- Inadequate instruments for research testbeds
24HIIT 2008 strategy draft
- the first HIIT contract between UH and TKK for
the period 1999-2004
- new contract planned from 1 Aug 2005
- strategy paper presented to the Board of HIIT in
Sept 2004
25Main principles
- highest international level in research
- one organization
- requires balanced basic funding from UH and TKK
- selection of research programmes (4-6) in
co-operation between the Board, SAB, Industrial
advisory board, and the research groups working
in HIIT; regular evaluation
- co-operation between UH and TKK
- collaboration with the CS Departments at UH and
TKK, with industry, and with research
institutions in Finland and abroad
26Main principles (cont.)
- internationalisation, international recruiting
- participation in teaching
- caring of researchers and their careers
- no permanent positions
- location in Kumpula (UH) and in Otaniemi (TKK)
- long-term funding, competition based
27Actions
- Only modest growth. But should be large enough
also for multicomponent projects.
- Participation in teaching and other collaboration
with mother departments will grow.
- Stronger role for the SAB and industrial advisory
board in choosing research themes
28Actions (cont.)
- Reforming the organization by combining ARU and
BRU
- ARU will relocate to Kumpula and Otaniemi,
schedule depending on the availability of
suitable locations. Kumpula V finished in autumn
2007.
- Financing relies too much on short-term external
funding. More long-term basic funding is needed,
for example for post-doc positions
29Research programmes
- HIIT starts and maintains certain research
programmes
- a programme is started on the Board's decision
- For each programme, HIIT will fund
- research director/principal scientist (can be
part-time)
- senior researcher/postdoc positions (0-2)
- some seed money for other positions
- the programme will seek additional funding from
other sources (Academy, TEKES, EU, industry, ...)
- a programme has several groups and links to partners
- selection of senior researchers/groups using
competition and evaluation
30Data mining general
- Heikki Mannila
- Academy professor
- HIIT Basic Research Unit
- Helsinki University of Technology and University
of Helsinki
31Data mining
- Data analysis is becoming more important in other
sciences and in industry
- New measurement methods
- Ability to store data
- High-dimensional large data sets
- Non-traditional forms (e.g., strings, trees,
graphs)
- Data analysis lags behind
32Data mining
- Has emerged as a major research area in the
interface of computer science and statistics
- Machine learning, databases, algorithms
- Data analysis questions are increasingly visible
in database and algorithms research
- Theory and practice interact
- Fits very well within the overall mission of HIIT
- Basic research in computer science
- Fast applicability, possibility of impact
33Goals
- Develop novel data analysis techniques for the
use of other sciences and industry
- How?
- Look at data analysis problems arising in
practice
- Abstract new computational concepts from them
- Analyse the concepts and develop new
computational methods
- Take the results into practice
- Theoretical work in algorithms and foundations of
data analysis can have fast impact in the
application areas
- The applications feed interesting novel questions
to theoretical research
34Data mining research in HIIT
- Three senior researchers (Mannila, Toivonen,
Hollmén)
- Operates on the two campuses (UH Kumpula, HUT
Otaniemi)
- Research groups with no strict borders, lots of
interaction
- Interaction with the adaptive computing systems,
neuroinformatics, and complex systems computation
groups
35Data mining groups in HIIT
- Mannila (UH & HUT) (10-13 persons)
- theory of data mining, discrete methods,
segmentation etc.
- genome structure, paleontology, linguistics
- Toivonen (UH) (6-8 persons)
- pattern discovery, algorithms
- gene mapping, haplotyping, context awareness,
paleoecology etc.
- Hollmén (HUT) (5-9 persons)
- mixture modeling, pattern discovery, bootstrap
methods, sparse regression
- gene expression, environmental modeling
36Events in 2004
- Very good success in obtaining funds from the
Academy of Finland
- Two postdoc positions
- Two projects in the SysBio program
- Academy professorship + largish grant
- Good success in international recruiting
- Panayiotis Tsaparas, Alexander Hinneburg,
(Aristides Gionis in 2003)
- EU funding (April II, MobiLife)
- Industrial projects
- Gene mapping (Tekes)
- Phenotype clustering (direct industrial funding)
- Visiting students
37Researchers
- Heikki Mannila, Hannu Toivonen, Jaakko Hollmén,
Aristides Gionis, Panayiotis Tsaparas, Ella
Bingham, Alexander Hinneburg, Marko Salmenkivi,
Mikko Koivisto, Saara Hyvönen, Petteri Sevon
- Postdoc education!
- 12 Ph.D. students
- Good international visibility
38Ph.D. theses since last SAB
- Ella Bingham
- Mikko Koivisto
- Petteri Sevon
- Kari Vasko
- Matti Kääriäinen
- Real soon now
- Taneli Mielikäinen
- Jouni Seppänen
39Publications in major conferences in 2004
- SIGMOD
- PODS
- VLDB
- ISC
- KDD
- PSB
- ICDE
- ICDM
- PKDD
- EDBT
- ICDE
- ...
40Choice of research areas
- Foundational interest + applicability
- Relevance of the methods
- Relevance of the application areas
- Impact for the cooperating groups in other
sciences
- Methods that will be used (by collaborators)
- Methods influencing the research agendas and data
gathering practices of the partners
- Novel questions
- Industrial impact: relevant problems, useful
solutions
41Major themes in methods
- Pattern discovery
- Pattern discovery and probabilistic modelling
- Methods for sequence decomposition
- Similarity of complex objects
- High-dimensional spatial data
- Decomposition of discrete data
42Application areas
- Genome structure
- Gene mapping
- Gene expression data analysis
- Ubiquitous computing (adaptive computing)
- Palaeontology, ecology, paleoecology
- Climate studies
- Linguistic applications
- Onomastics, study of variation in language,
dialect studies
43Examples of current work on theory of data mining
- Random walks on databases (Geerts, Terzi,
Mannila) (example below)
- Distance measures between data sets (Tatti) (example below)
- Clustering aggregation (Tsaparas, Gionis,
Mannila) (example below)
- Approximating a collection of frequent sets
(Afrati, Gionis, Mannila) (example below)
- (k,h)-segmentation (Gionis, Mannila, Haiminen,
Terzi)
- Vocabularies from sequences (Gionis, Tsaparas,
Wexler, Mannila)
- Subspace discovery (Seppänen, Gionis, Tsaparas,
Hinneburg)
- Segmentation distances (Terzi)
- Tiles from 0/1 data (Seppänen, Gionis, Mannila)
- Condensed representations (Mielikäinen, Toivonen)
- Metric labeling and spatial data (Salmenkivi,
Tsaparas, Papadimitriou, Leino, Gionis, Mannila,
Terzi)
44Example: random walks on databases
- How to generalize HITS etc. to work on databases
(and not just graphs)
- Given a class of queries
- State space: partial tuples from the database
- Transitions t → u, if there is a query Q such that
u belongs to Q(t)
- Use to rank query answers etc.
- Quite nice results
- Geerts, Mannila, Terzi, VLDB 2004
45Example: clustering aggregation
- Given k clusterings C1, C2, ..., Ck, find a
clustering C that minimizes the sum of the number
of disagreements between the clusterings Ci and C
- Robustness of clustering algorithms
- Clustering categorical data
- Detecting outliers
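The disagreement count in the problem statement can be sketched in a few lines of Python. This is an illustration only: it assumes each clustering is a list of cluster labels (one per object), and `best_input_clustering` is a simple hypothetical baseline that returns the input clustering with the fewest total disagreements. It is not the algorithm from the cited work.

```python
from itertools import combinations


def disagreements(a, b):
    """Count object pairs that one clustering joins and the other separates."""
    count = 0
    for u, v in combinations(range(len(a)), 2):
        if (a[u] == a[v]) != (b[u] == b[v]):
            count += 1
    return count


def total_disagreements(candidate, clusterings):
    """Sum of disagreements between a candidate and every input clustering."""
    return sum(disagreements(candidate, c) for c in clusterings)


def best_input_clustering(clusterings):
    """Baseline: pick the input clustering with the fewest total disagreements."""
    return min(clusterings, key=lambda c: total_disagreements(c, clusterings))


# Three clusterings of five objects; labels are arbitrary cluster ids.
C1 = [0, 0, 1, 1, 2]
C2 = [0, 0, 1, 1, 1]
C3 = [0, 1, 1, 1, 1]
best = best_input_clustering([C1, C2, C3])
print(best, total_disagreements(best, [C1, C2, C3]))
```

The baseline only ever returns one of the inputs; a real aggregation method may build a new clustering with a lower total.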
46Correlation clustering
- Given distances Xuv between objects in V
- Find a partition C minimizing
- Algorithms and approximation guarantees
- Gionis, Mannila, Tsaparas, ICDE 2005
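The objective itself did not survive in this transcript. One common way to write a correlation clustering cost, given here as an assumption and not as the formula on the slide, takes distances $X_{uv} \in [0,1]$ and charges $X_{uv}$ for every pair placed in the same cluster and $1 - X_{uv}$ for every pair that is separated:

```latex
d(C) \;=\; \sum_{\substack{u < v \\ C(u) = C(v)}} X_{uv}
      \;+\; \sum_{\substack{u < v \\ C(u) \neq C(v)}} \bigl(1 - X_{uv}\bigr),
\qquad X_{uv} \in [0, 1],\; u, v \in V .
```

Under this reading, clustering aggregation is the special case where $X_{uv}$ is the fraction of input clusterings that separate $u$ and $v$.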
47Example results
48Example: metric labeling
- High-dimensional observation vectors in 1 or 2
dimensions
- How to take into account the underlying topology
of the observational points?
- Minimize
49Example: Distances between data sets
- Nikolaj Tatti, HM
- Given two datasets D1 and D2 over the same set of
variables
- What is their distance?
- Given a collection of statistics f1 = s(D1) and
f2 = s(D2)
- E.g., certain marginal frequencies
- What is the distance between D1 and D2 from the
viewpoint of these statistics?
50Distances between data sets
- w1: the distribution having statistics f1 and
maximal entropy
- w2: the distribution having statistics f2 and
maximal entropy
- Distance I: K-L(w1, w2)
- Difficult to compute
- 2nd order approximations
- Distance II: (f1 - f2)^T cov^-1(s) (f1 - f2)
- Under certain assumptions the only choice
- Very promising initial results!
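Distance II is a quadratic form, so it is easy to sketch in plain Python. The example assumes the statistics are single-item frequencies of 0/1 data, and it assumes the covariance of those statistics is 0.25 times the identity (independent fair coin flips per column). The slide does not say under which distribution cov(s) is taken, so that choice and the helper names (`solve`, `item_frequencies`, `distance_ii`) are illustrative only.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def item_frequencies(rows):
    """Statistics s(D): fraction of rows in which each 0/1 column equals 1."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def distance_ii(f1, f2, cov):
    """Quadratic form (f1 - f2)^T cov^-1 (f1 - f2)."""
    diff = [x - y for x, y in zip(f1, f2)]
    z = solve(cov, diff)  # z = cov^-1 (f1 - f2)
    return sum(d * zi for d, zi in zip(diff, z))


D1 = [[1, 0], [1, 1], [0, 0], [1, 0]]
D2 = [[0, 1], [0, 1], [1, 1], [0, 0]]
cov = [[0.25, 0.0], [0.0, 0.25]]  # assumed covariance of the two statistics
d = distance_ii(item_frequencies(D1), item_frequencies(D2), cov)
print(d)
```

Solving the linear system avoids forming the inverse explicitly, which also works when the assumed covariance is not diagonal.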
51Example: Approximating a collection of frequent
sets
- Existing frequent set mining algorithms output
too many sets
- many of the sets look quite similar
- difficult to obtain a global understanding
- Goal: describe a transaction database using few
sets
- necessarily resort to approximations
- which is OK since support threshold is arbitrary
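To see why the full collection is so much larger than a compact description, the following sketch enumerates all frequent sets of a toy transaction database by brute force and then keeps only the maximal ones. It assumes "border" means the maximal frequent sets, and it does not implement the approximation scheme of the cited paper. The function names and the toy data are made up for illustration.

```python
from itertools import combinations


def frequent_sets(transactions, min_support):
    """Return {itemset: support} for itemsets in at least min_support transactions.

    Brute force over subsets, level by level; only suitable for tiny examples.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            s = frozenset(candidate)
            support = sum(1 for t in transactions if s <= t)
            if support >= min_support:
                result[s] = support
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so none larger either
    return result


def positive_border(frequent):
    """Maximal frequent sets: those with no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


transactions = [frozenset(t) for t in ("abc", "abc", "ab", "cd", "acd")]
frequent = frequent_sets(transactions, min_support=2)
border = positive_border(frequent)
print(len(frequent), sorted("".join(sorted(s)) for s in border))
```

Every frequent set is a subset of some border set, so the border determines which sets are frequent but not their exact supports; that gap is where an approximation has to come in.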
52The main idea
53Theoretical results
- Afrati, Gionis, Mannila, KDD 2004
- Formalize notion of approximation
- Distinguish concrete problem variants
- Establish NP-hardness and develop algorithms
54Experimental results
- Course data set
- Collection: 1637 sets; border: 268 sets; support
25
55Future work
- Theory and practice interact a lot!
- (Almost) all theoretical directions are motivated
by practical issues
- The combination of continuous and combinatorial
methods
- Application areas selected by theoretical
interest and potential impact (industrial,
scientific)
- Use by collaborators vs. general distribution
- Publications, collaborations, software releases
56Future work
- I Concepts and algorithms for describing
structure of sequences
- Segment structure, vocabularies, inference of
order
- II Methods for pattern discovery in and modelling
of spatiotemporal data
- Metric labeling, spatial rules
- Similarity of complex objects
- Foundational issues in pattern discovery (e.g.,
logical form of patterns and the difficulty in
discovering them)
- Mixture modelling and pattern discovery
57Multilevel description of discrete sequences
- Discrete sequences
- Haplotypes, genomes, telecommunication alarms,
words in documents, ...
- Such sequences typically have a block structure
- Underlying process has several different states,
each with different characteristics
- Structure can also be hierarchical
- Describe the sequence in a useful way
- For prediction, clustering, rule discovery,
description, ...
58Multilevel description of discrete sequences
- Three linked main parts of the research program
- Segment structure of sequences
- Vocabulary of a sequence
- Order from unordered sets
- Rule discovery in sequences
- Time-series similarity
- Bayesian methods for piecewise constant
approximation of event sequences
- (k,h)-segmentation
- Block and mosaic structure in haplotypes
- Clustering segmentations
- Vocabulary of a sequence
- Fragments of order
- Inference of partial orders
- Gene expression and chromosomal location
59Segment structure of sequences
- Segment structure of sequences
- (k,h)-segmentation
- Grammar inference
- Hierarchical analysis
- Applications: genome, several genomes,
haplotypes, telecom
- Methods: dynamic programming, approximations
- Aristides Gionis et al.
[Figure: example sequence ACTAACGACG ACAATCCGCT TATACCAGAT CCAAATCAAC annotated at three levels: (k,h)-segmentation (segment labels B, F, F, E, E), grammar inference (rules G → (E F), E → EBE), and hierarchical description]
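The dynamic-programming idea behind segmentation can be sketched as follows. This is plain k-segmentation of a numeric sequence, where each segment is described by its mean and the total squared error is minimized. The extra (k,h) constraint that the k segments share only h distinct descriptions is not handled here. The function name and the toy signal are illustrative.

```python
def k_segmentation(values, k):
    """Split values into k contiguous segments minimizing total squared error
    around each segment mean. Returns (cost, sorted segment start indices)."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(values)")
    # Prefix sums give the squared error of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(values):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def sse(i, j):
        """Squared error of values[i:j] around its mean (requires j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[s][j]: best error for the first j values using exactly s segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for segs in range(1, k + 1):
        for j in range(segs, n + 1):
            for i in range(segs - 1, j):
                c = cost[segs - 1][i] + sse(i, j)
                if c < cost[segs][j]:
                    cost[segs][j] = c
                    back[segs][j] = i
    # Recover segment start positions by walking the back pointers.
    bounds, j = [], n
    for segs in range(k, 0, -1):
        j = back[segs][j]
        bounds.append(j)
    return cost[k][n], sorted(bounds)


signal = [1, 1, 1, 5, 5, 5, 5, 2, 2]
cost, starts = k_segmentation(signal, 3)
print(cost, starts)
```

The triple loop runs in O(k n^2) time, which is the usual cost of the exact method and the reason approximations matter for long sequences.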
60Vocabulary of a sequence
- Find a good set of recurrent words
- Motif discovery
- Segment structure on the basis of vocabularies
- Methods: greedy algorithm on submodular functions,
string algorithms
- Panayiotis Tsaparas et al.
[Figure: example text "Call me Ishmael. Some years ago never mind how" with extracted vocabulary: inthesea, harpoon, ...]
61Order from unordered data
- Matrix reordering
- Fragments of order
- Inference of partial orders
- Methods: frequent patterns, spectral methods,
mixture modeling on combinatorial objects
- Heikki Mannila et al.
[Figure: observed orderings ABCD ABCD ACBD ACBD and a partial order over the elements A, B, C, D]
62Methods for pattern discovery in and modelling of
spatiotemporal data
- Data where observations have a location
- Biodiversity data: grid cells, 2-d
- Place names, dialect usage: 2-d
- Genome: location in one dimension, 1-d, pieces
- Telecom alarms: location in the network graph
- Lots of interest from the application areas
- How to take the underlying topology into account?
- 2003?
63Links on a fragmented 1-d space: comparative
genomics
- Orthologous genes in different species
64Methods for spatial data
- Rule discovery
- spatial association rules
- Mixture models, MCMC
- piecewise constant models
- Clustering, metric labeling
- algorithms, approximations
- Spatial statistics: interaction with the math
department
- Researchers named on the slide: Antti Leino et al., Marko Salmenkivi et al., Aristides Gionis et al.
65Summary of future plans
- Theory and practice!
- Structure of sequences
- Spatiotemporal data
- Foundations of pattern discovery, similarity, ...
- Applications
66Data Mining Applications
67Application areas
- Bioinformatics, especially medical genetics
- gene mapping, haplotyping
- discovery of genome structure
- gene expression data analysis
- Paleontology, paleoecology
- Linguistics
- Ubiquitous computing
68Collaborations in applications
- Computational methods for genome structure
- Leena Peltonen (KTL), Juha Kere (Karolinska),
Anu Jalanko (KTL)
- Gene mapping
- Leena Peltonen, Juha Kere
- Jurilab Ltd.
- Geneos Ltd.
- Phenotype clustering
- Orion Pharma
69Collaborations (cont.)
- Linguistics
- R.-L. Pitkänen (Research center for the Languages
of Finland)
- Terttu Nevalainen (Department of English)
- Paleontology
- Mikael Fortelius (Dept. of Geology), Jukka
Jernvall (Institute of Biotechnology)
- Paleoecology
- Atte Korhola (Dept. of Ecology)
- Environmental studies
- Markku Kulmala (Dept. of Physics)
70Researchers
- Profs. Hollmén, Mannila, Toivonen
- Postdocs
- Ella Bingham: genome structure, paleontology
- Aristides Gionis: genome structure, paleontology,
spatial data
- Alexander Hinneburg: spatial data (linguistics)
- Saara Hyvönen: environmental modeling
- Mikko Koivisto: phenotype clustering, genome
structure
- Päivi Onkamo: genetics
- Marko Salmenkivi: linguistics, spatial data
- Petteri Sevon: genetics, bioinformatics
- Panayiotis Tsaparas: web graphs, spatial data
- 10 PhD students
71Medical genetics
- Important applications
- locating disease predisposing genes is essential
for understanding the aetiology of complex common
diseases, such as heart disease or asthma
- Focus on selected topics where
- we can have a significant impact
- we can combine our own expertise with the unique
research on medical genetics in Finland
- Collaboration with leading groups in medical
genetics
- Prof. Leena Palotie (Public Health Institute)
- Prof. Juha Kere (Karolinska Institutet)
72Gene mapping
[Figure labels: marker locus = column, haplotype (chromosome) = row, allele = cell]
case     1 4 8 2 2 1 2 6 2
case     2 4 3 7 3 2 8 4 2
case     4 5 2 4 5 5 2 6 4
case     7 2 3 7 5 4 5 2 2
case     5 2 4 6 2 4 2 6 1
case     3 4 3 7 3 1 3 3 4
case     1 2 1 5 2 5 2 6 2
case     5 3 3 7 3 2 1 4 3
control  2 4 7 1 3 4 1 4 8
control  7 3 7 7 5 7 8 6 6
control  3 4 3 2 5 3 2 3 2
control  2 5 2 4 3 1 3 6 2
control  3 3 1 2 4 2 1 4 2
control  1 6 4 5 5 5 9 1 3
control  4 2 8 4 2 3 5 2 5
control  2 2 4 9 5 4 4 2 4
73Gene mapping
[Same case/control haplotype table as on the previous slide, with two haplotype patterns highlighted]
pattern 1: (3)(4) 3 7 (3)(2)
pattern 2: (5) 2 6 (2)
74Highlights: gene mapping
- Formulation of gene mapping as a mixture of
pattern discovery and classification
- Concepts and methods from computer science
- Haplotype Pattern Mining (HPM)
- haplotype patterns
- efficient algorithms for finding relevant
patterns
- a number of variants as follow-up
- successful in gene mapping
- patents, licensing
75Highlights: gene mapping
- Tree Disequilibrium Test (TreeDT)
- looks for tree structured haplotype patterns
- patterns reflect possible recombination histories
- gene localization based on the pattern that best
explains the disease status
- efficient algorithms
- new solutions to multiple, nested permutation
tests
[Figure: tree over haplotypes A-F, with alleles at markers M1-M9]
   M1 M2 M3 M4 M5 M6 M7 M8 M9
A   2  3  1  2  2  1  2  2  1
B   3  1  1  2  1  1  2  1  1
C   4  1  2  1  2  1  4  3  3
D   1  2  1  2  3  1  4  1  4
E   2  2  3  1  3  2  4  3  2
F   1  2  1  3  1  2  1  4  2
76Highlights: haplotyping
- Find the highest probability strings (haplotypes)
explaining sequences of pairs (genotypes)
- {1,2} {1,1} {1,2} → 111/212, 112/211, 211/112,
212/111
- Exponential for each genotype
- HaploRec: Markovian models, efficient algorithms
- variable length Markov chains:
$P(H) = P(H_1) \prod_{i>1} P(H_i \mid H_{s_i,\,i-1})$, where
$s_i = \min\{\, s : H_{s,i} \text{ is statistically relevant} \,\}$
- probabilities $P(H_{s,i})$ are estimated with EM
- unique scalability
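The "exponential for each genotype" remark can be made concrete with a short enumeration. The sketch assumes a genotype is a list of unordered allele pairs, one per marker, and lists every unordered pair of haplotypes consistent with it. It shows the search space only and has nothing of the HaploRec probability model in it. The function name is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per marker.
    """
    # Only heterozygous markers leave a choice of phase.
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    for choice in product((0, 1), repeat=len(het)):
        pick = dict(zip(het, choice))
        h1 = tuple(g[pick.get(i, 0)] for i, g in enumerate(genotype))
        h2 = tuple(g[1 - pick.get(i, 0)] for i, g in enumerate(genotype))
        pairs.add(frozenset((h1, h2)))
    return pairs


genotype = [(1, 2), (1, 1), (1, 2)]
pairs = compatible_haplotype_pairs(genotype)
for pair in pairs:
    print(sorted(pair))
```

With h heterozygous markers there are 2^h phase assignments and 2^(h-1) distinct unordered pairs (for h ≥ 1), which is why enumeration does not scale and a model-based search is needed.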
77Highlights: mosaic structure of haplotypes
- Defining and utilising haplotype block structure
of the human genome: describing and finding the
possible mosaic-like structure of haplotypes (and
genotypes)
78Highlights: simulation studies
- Population and marker simulation tools
- Large simulation studies to test mapping
methodologies
- an unexpected result: population-based haplotypes
are as powerful as true haplotypes
79Other applications
- Medical genetics
- analysis of phenotypic datasets: finding robust
clusters that have genetic explanations
- Paleontology and paleoecology
- finding good estimates of the ages of fossil
sites: finding matrix re-orderings that
approximate the consecutive ones property
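The consecutive ones property is simple to check for a given row order. The sketch below assumes rows are fossil sites and columns are taxa in a 0/1 presence matrix, so the property holds when every taxon occupies a contiguous run of sites. That orientation and the toy matrix are assumptions made for illustration, and the gap count is only one possible measure of how far an ordering is from the property.

```python
def has_consecutive_ones(matrix):
    """True if, in every column, the 1s occupy a contiguous block of rows."""
    for col in zip(*matrix):
        ones = [i for i, v in enumerate(col) if v]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def count_gaps(matrix):
    """Number of 0s between the first and last 1 of a column, summed over columns."""
    gaps = 0
    for col in zip(*matrix):
        ones = [i for i, v in enumerate(col) if v]
        if ones:
            gaps += (ones[-1] - ones[0] + 1) - len(ones)
    return gaps


def reorder(matrix, order):
    """Return the matrix with its rows permuted into the given order."""
    return [matrix[i] for i in order]


# Rows: three hypothetical sites; columns: three hypothetical taxa.
sites = [
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
]
print(has_consecutive_ones(sites), count_gaps(sites))
print(has_consecutive_ones(reorder(sites, [0, 2, 1])))
```

A row order with few gaps is a candidate temporal ordering of the sites, since a taxon is expected to be present over one continuous interval.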
80Other applications
- Linguistics
- finding spatial structure of the distribution of
place names and words: high-dimensional
clustering
- preliminary results on pattern discovery and
mixture modelling techniques for large onomastic
data sets
- Ubiquitous computing
- learning to recognize typical device contexts:
on-line clustering of stream data.
81Future directions
- Genome structure: Computational tools for
describing the variation between individuals and
between species
- haplotype blocks and mosaics
- identification of rearrangements, duplications,
and other large-scale variations
- comparative genomics: several species!
- segment structure, reversal distances etc.
- vocabulary of the genome
- function and structure
82Future directions (continued)
- Mining biological databases
- analysis of the rich, heterogeneous public
databases
- how to find patterns in complex irregular
structures
- discovery of similarities and analogies
- producing plausible biological relationships and
hypotheses
83Future directions (continued)
- Spatial and temporal variation in language
- Spatiotemporal issues in paleontology and ecology
- methods to detect patterns of variance in species
abundances
- methods to correlate paleoecological time series
data, to find features in such data
- Recognition of contexts in mobile applications
84Summary
- Application problem → new computational concepts
→ novel methods → practical applications
- Important data analysis problems
- Successfully fielded applications
- A wide network of excellent applied collaborators
- Development of novel techniques
- combinations of discrete and probabilistic
approaches
- HIIT mode of working: collaboration between
groups, universities, other disciplines, and
industry
85Posters and demos
- Genetic mapping studies: Asthma and allergy.
Päivi Onkamo
- Mining Atmospheric Data. Saara Hyvönen
- Geometric and combinatorial tiles in 0-1 data.
(DEMO) Aristides Gionis, Heikki Mannila, Jouni
Seppänen
- Dimension induced clustering. (DEMO) Aristides
Gionis, Alexander Hinneburg, Spiros
Papadimitriou, Panayiotis Tsaparas.
- Spatial Analysis of Area Data. Case: Finnish lake
names. Marko Salmenkivi, Saara Hyvönen, Antti
Leino.
- What was the Finnish hiisi? A case study on place
name data. Marko Salmenkivi, Antti Leino, Saara
Hyvönen.
- Clustering aggregation. Aristides Gionis,
Panayiotis Tsaparas, Heikki Mannila.
- Spatially coherent clustering. Aristides Gionis,
Heikki Mannila, Spiros Papadimitriou, Panayiotis
Tsaparas
- Spectral ordering. Aristides Gionis, Heikki
Mannila, Mikael Fortelius, Jukka Jernvall
- Genome puzzle. (DEMO) Mikko Koivisto, Teemu
Kivioja, Heikki Mannila, Pasi Rastas, Esko
Ukkonen.
- Segmentation-based analysis of genomic
sequences. (DEMO) Niina Haiminen, Evimaria Terzi,
Aristides Gionis, Heikki Mannila.
- Mining non-redundant association rules. (DEMO)
Juho Muhonen
- HaploRec: Population-based reconstruction of
haplotypes. (DEMO) Lauri Eronen
- TreeDT: Gene mapping by tree disequilibrium test.
Petteri Sevon, Hannu Toivonen
- An efficient method for association mapping in
phase-unknown genotype data. Petteri Sevon, Päivi
Onkamo, Hannu Toivonen
- Techniques for simulating populations and marker
data. Petteri Hintsanen, Petteri Sevon
- Efficient population-based reconstruction of
haplotypes. Lauri Eronen
- Integrating the tools: Power simulations for gene
mapping studies. Petteri Hintsanen, Petteri
Sevon, Päivi Onkamo.
- Mapping susceptibility genes for familial glioma.
Päivi Onkamo
86Neuroinformatics
87Scope of Neuroinformatics
- Interface of brain research and information
technology
- Functional models of brain
- Signal processing methods
- Databases
- Our specialization: Multivariate statistical
models
- Principal component analysis
- Independent component analysis
- Extensions of ICA
- New models (see later)
88Researchers in Neuroinformatics
- Aapo Hyvärinen, leader
- Two post-docs
- Patrik Hoyer
- Jarmo Hurri
- 5 PhD students (some partly)
- Funding from Academy of Finland, Univ of
Helsinki, foreign foundations
89Research goals
- Models of sensory processing in the brain, based
on statistical analysis of natural stimuli
- New biologically-inspired data analysis methods
- Advanced statistical analysis of neuroscientific
data
- Common theme is multivariate data analysis
90Reliability analysis of ICA
(NeuroImage, 2004)
- ICA can find underlying factors that are
independent and nongaussian
- Results contain statistical and computational
errors: which components are good?
- A software package on the Web
- Same approach works for comparison of individuals
(NeuroImage, in press)
In cooperation with Universities of Naples and
Maastricht
91Learning high-level features
- ICA gives linear features in images
- Extensions of ICA give features in 2nd layer
- We estimate third layer by ICA of outputs of 2nd
layer (submitted ms.)
92Learning segmentation (1)
- A new principle for multivariate data analysis
- Given very high-dimensional random vector
- Can we partition variables in each observation
- Based on the statistical structure
- With no prior knowledge of segments
- Visual system has learned this for its input
93Learning segmentation (2)
- Use correlations to find out which variables
belong together (submitted ms.)
- Basic idea: each segment should be such that
observed variables follow typical correlation
structure
- Then, we can segment even data that has a weird
correlation structure
94New kinds of feature extraction
- We can try to find features that characterize
whole images (Proc Int Conf Pattern Recogn 2004)
- Compute histograms of low-level features
- Analyze these histograms, e.g. by ICA
- Features from natural language (text) data
(Proc. Int. Joint Conf Neural Netw 2004)
- Compute the context histograms of words (which
words are typically together)
- Perform e.g. ICA on these histograms
In cooperation with HUT and FDK
95Exploration of causality
(Proc. Factor Analysis Cent. Symp. 2004)
- Classic methods can say x and y are correlated,
but which causes which?
- Using nongaussianity, we can find causal
ordering
- Closely related to ICA estimation: an example of
a post-processing method.
In cooperation with University of Osaka
96Blind source separation
- Separation of underlying sources, e.g. in brain
activity
- ICA
- Sources are independent
- No time structure
- We developed methods which
- are able to separate dependent sources (Signal
Processing, 2004)
- utilize time structure (Sig Proc, in press)
97Classification images
- Estimation of templates used in the human visual
system by linear regression
- We attempt to develop nonlinear versions
- Basic approach: changes in linear templates as a
function of context
In cooperation with Dept of Psychology, UH
98Non-negative sparse representations
- Non-negative matrix factorization (NMF) claimed
to give local, parts-based representations
- Locality much better achieved when combined with
sparseness (J. Mach. Learn. Res, in press)
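To make "sparseness" concrete, here is a small sketch of one widely used sparseness measure based on the ratio of the L1 and L2 norms. Whether the cited paper uses exactly this definition is an assumption here, and the function name `sparseness` is chosen for illustration only. The measure is 1 for a vector with a single non-zero entry and 0 for a vector whose entries all have the same magnitude.

```python
from math import sqrt


def sparseness(x):
    """L1/L2-based sparseness in [0, 1]; needs len(x) > 1 and a non-zero x."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = sqrt(sum(v * v for v in x))
    return (sqrt(n) - l1 / l2) / (sqrt(n) - 1)


one_hot = sparseness([0.0, 0.0, 3.0, 0.0])  # a single active component
uniform = sparseness([1.0, 1.0, 1.0, 1.0])  # all components equally active
```

A factorization constrained to produce basis vectors with a high value of such a measure is pushed towards local, parts-like components.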
99Future questions
- Connection between segmentation and independent
components
- Further post-processing methods for ICA
- Classification methods and ICA
- General multilayer models for natural images
- Nonlinear ICA
- Estimation of complex statistical models
100Adaptive Computing
101Premises of the research
- Adaptive computing refers to solutions that adapt
to their environment
- Linked to the ubiquitous / pervasive / proactive
computing vision
- We focus on some central topics to realise this
vision
- Context-awareness and adaptation are central to
user-friendly ubiquitous applications
- Ad hoc networking (incl. sensor networks) may in
the future provide infrastructure for many
ubiquitous applications
102Our Research Environment
- Draws on existing competence in data mining,
probabilistic reasoning, algorithmics and
language technology
- At the intersection of many of the research
groups of HIIT: many of our research groups deal
with context-awareness, personalisation and
adaptation
- This presentation is about the AC groups at BRU
- Group of Prof. Hannu Toivonen (Kari Laasonen,
Renaud Petit, Mika Raento)
- Group of Doc. Patrik Floréen (Greger Lindén,
Jukka Kohonen, Yevgeniya Kulikova, Petteri Nurmi,
Michael Przybilski, Jukka Suomela)
- Small groups, short history (2003-)
103Present Research Issues (1/2)
- Context analysis
- Analysis of context information and its use in
proactive adaptivity on mobile devices, in
particular recognising and predicting locations
under limited resources (CONTEXT: Toivonen,
Laasonen, Petit, Raento)
- Reasoning about (mobile) context using data
mining and machine learning techniques, e.g.
segmentation, time series analysis (MobiLife:
Floréen, Nurmi, Przybilski, Raento, Suomela)
104Present Research Issues (2/2)
- Architectural issues for context-aware systems
- Context-aware selection of software components on
mobile terminals with an architecture solution
based on a blackboard approach (Space4U:
Floréen, Przybilski)
- Context Management Framework: a software
architecture for future context-aware mobile
systems (MobiLife: Floréen, Nurmi, Przybilski)
- Ad hoc and sensor networks
- Topology control and routing problems under
energy constraints
- Self-organisation of ad hoc networks using a
game-theoretical approach
- NAPS: Floréen, Kohonen, Nurmi
105Summary of Ongoing Projects
- Group of Toivonen
- CONTEXT: Academy of Finland, 11/02-12/05, with
ARU
- Group of Floréen
- NAPS: Academy of Finland, 01/03-12/05, with HUT
- Space4U: EUREKA/ITEA, Nokia subcontract,
07/03-06/05, also HUT
- MobiLife: EU IST IP, Nokia coordinator,
09/04-12/06, also ARU
- In addition
- PROACT coordination: Academy of Finland,
01/02-05/06, Programme Director Heikki Mannila,
Programme Coordinator Greger Lindén
106Highlights of Recent Achievements CONTEXT
- The software developed for location recognition
and prediction is published as open source and is
used by other research groups and for presence
service and annotation of photographs
107Annotation of Photographs
108Highlights of Recent Achievements NAPS
- NP-hardness results and algorithms for maximizing
multicast lifetime under energy constraints, by
dynamically choosing transmission power levels
- Modelling of routing in ad hoc networks using
dynamic Bayesian games (Nurmi)
- Balanced data gathering in sensor networks,
including an approximation algorithm based on the
Garg-Könemann fractional packing approximation
algorithm (FOCS'98)
- Energy limited sensor nodes
- Utility function $F_\alpha = (1-\alpha)\,\mathrm{avg}_{i \in S}\, q_i + \alpha \min_{i \in S} q_i$
- 36 randomly placed sensors in figures that follow
109No balancing ($\alpha = 0$)
Maximizing $F_0 = \mathrm{avg}\, q_i$
110Strict balancing ($\alpha = 1$)
Maximizing $F_1 = \min q_i$
111Moderate balancing ($\alpha = 0.5$)
Maximizing $F_{0.5} = 0.5\,\mathrm{avg}\, q_i + 0.5 \min q_i$
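The three balancing cases on the preceding slides are instances of one utility that mixes the average and the minimum of the per-sensor quantities q_i. The sketch below evaluates that mixture for a given set of values; the function name `balanced_utility` and the parameter name `weight` are chosen for illustration, and the q values in the example are made up rather than taken from the figures.

```python
def balanced_utility(q, weight):
    """Return (1 - weight) * average(q) + weight * min(q).

    weight = 0 rewards only the average (no balancing);
    weight = 1 rewards only the worst-off sensor (strict balancing).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    avg = sum(q) / len(q)
    return (1.0 - weight) * avg + weight * min(q)


q = [1.0, 2.0, 6.0]  # illustrative per-sensor values only
no_balancing = balanced_utility(q, 0.0)
strict = balanced_utility(q, 1.0)
moderate = balanced_utility(q, 0.5)
```

Raising the weight shifts the optimum away from solutions that starve a few sensors for the sake of a high average.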
112Future Directions
- Emphasis more on continuing present topics than
on enlarging to new areas
- Successful present principles to be continued
- Diverse funding sources
- Theory and practical implementation together
- There is potential for developing the activities
through
- even more collaboration with other groups (inside
and outside of HIIT)
- TEKES projects
- attention to recruitment of postdocs
113Future Research Issues
- Context reasoning and the use of context
- Combining context-awareness and component
architectures
- Trust and privacy issues of the users in
context-aware applications
- Modelling and algorithms for topology control and
routing in ad hoc networks and data gathering in
sensor networks
- Application of game theory to problems in ad hoc
networking and context-aware computing