Helsinki Institute for Information Technology Scientific Advisory Board Meeting, November 15-17, 2004

1
Helsinki Institute for Information Technology
Scientific Advisory Board Meeting
November 15-17, 2004
2
Participants

Prof. Alberto Apostolico
Prof. Christos Faloutsos
Prof. Bengt Jonsson
Prof. Randy Katz
Prof. Martin Kersten
(Prof. Kari-Jouko Räihä)
Prof. Mart Saarma
Prof. John Shawe-Taylor
Prof. Jukka Paakki
Prof. Olli Simula

Dr. Patrik Floréen
Dr. Aapo Hyvärinen
Prof. Heikki Mannila
Prof. Petri Myllymäki
Prof. Martti Mäntylä
Prof. Kimmo Raatikainen
Prof. Hannu Toivonen
Prof. Esko Ukkonen
Prof. Eero Hyvönen
Dr. Marko Turpeinen
Dr. Giulio Jacucci
Dr. Greger Lindén

3
Goals of the meeting

Second meeting of the Scientific Advisory Board
Obtain feedback from the Scientific Advisory
Board on the relevance, quality, and impact of
the current research
Obtain feedback and suggestions on plans for the
research themes, applications, collaborations,
etc.
A written evaluation of each group
Scientific quality, innovativeness, productivity
and impact
Quality and quantity of industrial and societal
impact
Feasibility and innovativeness of future plans
Competence and expertise of the team
Main strengths and weaknesses
Overall evaluation and suggestions from the SAB

4
Agenda for Monday 15 Nov 04

14.00 Welcome, introductions, overview of HIIT
(Martti Mäntylä and Esko Ukkonen)
Basic Research Unit activities
16.00 Data Mining General (Heikki Mannila)
16.30 Data Mining Applications (Hannu Toivonen)
17.00 Neuroinformatics (Aapo Hyvärinen)
17.30 Adaptive Computing Systems (Patrik Floréen)
18.00-18.30 SAB internal discussions
20.00 Dinner at Restaurant Kappeli,
Eteläesplanadi 1

5
Agenda for Tuesday 16 Nov 04

Basic Research Unit, at Kumpula campus
9.30 Demonstrations and discussion with
researchers
11.30 Lunch at Kumpula campus, Chemicum
12.30 Transportation to HTC

6
Agenda for Tuesday 16 Nov 04

Advanced Research Unit activities, HTC
13.00 Mobile Computing (Kimmo Raatikainen)
13.30 Semantic Computing (Eero Hyvönen)
14.00 User Experience Research (Martti Mäntylä,
Giulio Jacucci)
14.30 Break and refreshments
14.45 Complex Systems Computation Group (Petri
Myllymäki)
15.15 Digital Contents Communities Group (Marko
Turpeinen)
15.45 Digital Economy (Jukka Kemppinen)
16.15 Break and refreshments
16.30 Demonstrations and discussion with
researchers
20.30 Dinner at Restaurant George, Kalevankatu
17, Helsinki

7
Agenda for Wednesday 17 Nov 04

At HTC
9.30 A la carte (SAB may request discussions,
interviews, further demonstrations)
10.30 SAB internal discussions
12.00 Lunch at Aqua restaurant in High Tech
Center
13.00 Feedback from SAB and discussion

8
Helsinki Institute for Information Technology

Joint research institute of University of
Helsinki and Helsinki University of Technology
Goals: strategic research in information
technology and related topics, aiming at high
scientific, industrial, and societal impact
Main themes: mobile computing, user experience,
intelligent systems, semantic Internet, societal
media, digital economy, adaptive computation,
data mining, bioinformatics, computational
neuroscience

9
Organization: two parts

Advanced Research Unit (ARU), since 1999
2-3 year industry co-funded strategic research
projects, CEC research, basic research
Located primarily in HTC in Ruoholahti
Martti Mäntylä, Research Director
Basic Research Unit (BRU), since 2002
Long-term research in areas relevant to other
sciences and to industry
Located in the premises of the departments of
computer science of the University of Helsinki
and Helsinki University of Technology
Esko Ukkonen, Research Director

10
[Diagram: HIIT's two units, BRU and ARU, shown together with UH Comp. Sci. and TKK CS]
11
Organization of HIIT

Joint board, scientific advisory board, and
industrial advisory board
The senior researchers typically have positions
also in one of the departments of computer
science
No permanent positions

12
[Diagram: positioning of HIIT research. Axes: Time (0-2 years, 2-5 years, 5 years) and Level of risk. Labels: other sciences, companies, HIIT BRU, HIIT ARU, Industry co-funded research, Basic Research, Core Projects, Industrial R&D, Advanced Research, Advanced Development, R&D]
13
Basic Research Unit (BRU)

established 2002
basic funding from UH
main location in the premises of CS Dept of UH
(new Exactum building at Kumpula Campus)
activities also in Otaniemi campus of TKK, CS
Dept
infrastructure of CS Dept
Directors: Heikki Mannila / Esko Ukkonen (9/2004 ->)

14
Mode of operation

high-quality basic research of computer science
on areas that have application potential in other
sciences or in industry
collaboration between universities
close co-operation with CS departments
participation in teaching
international networking, international recruiting

15
Personnel profile (2003)

senior researchers 5
prof. H. Mannila, prof. H. Toivonen, prof. J.
Hollmén, doc. P. Floréen, doc. A. Hyvärinen
PhDs 8
PhD students 17
students 12
adm 1
current total about 50 (TKK 10)
from abroad 6 - 8

16
Funding profile (2004)

basic funding from UH (33%): 660 k€
Academy of Finland (32%): 645 k€ (includes
research grants, academy professor position,
senior researcher position, 3 postdoc positions)
industry projects (incl. TEKES): 226 k€
graduate schools (Ministry of Education): 259 k€
European Union: 116 k€
TKK: 70 k€
TOTAL: 1976 k€

17
Research programme

Theory and applications of data mining (Heikki
Mannila, Hannu Toivonen)
Neuroinformatics (Aapo Hyvärinen)
Adaptive computing (Patrik Floréen)

18
Goals for BRU for 2005-2006

Expanding and strengthening of the network of
collaboration in Finland and internationally
Strong recruiting from abroad
More emphasis on software distribution
Emphasis on high-quality international research
Possibly opening a new research theme

19
Advanced Research Unit (ARU)

Established in 1999
About 80 researchers and staff
Main themes
Future Internet
Intelligent Systems
Network Society
Funding (2004): National Technology Agency (59%),
companies (15%), European Union (6%), Academy of
Finland (9%), universities (9%); total some 3.8 M€
Director Prof. Martti Mäntylä
Primarily located in High Tech Center, Ruoholahti

20
Mode of Operation

Focusing on a few industrially relevant,
strategic, long-term research areas with high
potential impact
Core and basic research projects with long-term
vision (5 years)
Medium term (3-5 years) projects focusing on new
products, services, and technologies
Complementary impact-related projects (1-2 years)
Multidisciplinary and cross-disciplinary research
founded on computer science competence
Networking with complementary research groups
Strong liaison with ICT and media companies
International cooperation

21
Personnel

5 principal scientists
Prof. Martti Mäntylä, Prof. Kimmo Raatikainen,
Prof. Petri Myllymäki, Prof. Jukka Kemppinen,
Prof. Eero Hyvönen
11 senior researchers
Dr. Pekka Nikander, Dr. Ken Rimey, Dr. Timo
Saari, Prof. Henry Tirri, Dr. Wray Buntine, Dr.
Jorma Rissanen, Dr. Patrik Floréen, Dr. Pekka
Himanen, Dr. Marko Turpeinen, Dr. Timo Saari, Dr.
Markku Stenborg
2 post docs
Dr. Andrei Gurtov, Dr. Giulio Jacucci
45 Ph.D. students
10 M.Sc. students
7 staff

22
Goals for 2005-2006

Maintain, upgrade, and expand competences of
research groups (post-docs, senior researchers)
Increase further co-operation between research
groups
Build new alliances with Finnish research units
Strengthen further existing international
partnerships (UCB, Tsinghua), launch new ones
(MLE, Waseda, KTH, ...)
Expand CEC funded research
Commence work on one or two new thematic areas
Caring for researchers and their careers
Operational excellence
Approximately 85-90 researchers

23
Challenges

Bottlenecks limiting the impact
Load of senior researchers
Inadequate processes and instruments for
end-to-end research
Weak link with interesting users and user
communities
Slow reaction speed, low risk tolerance
Red tape in recruiting foreign researchers
Funding and funding instruments
Less than 10% basic funding is not sustainable
Inadequate instruments for research testbeds

24
HIIT 2008 strategy draft

the first HIIT contract between UH and TKK for
the period 1999-2004
new contract planned from 1 Aug 2005
strategy paper presented to the Board of HIIT in
Sept 2004

25
Main principles

highest international level in research
one organization
requires balanced basic funding from UH and TKK
selection of research programmes (4-6) in
co-operation between the Board, SAB, Industrial
advisory board, and the research groups working
in HIIT
regular evaluation
co-operation between UH and TKK
collaboration with the CS Departments at UH and
TKK, with industry, and with research
institutions in Finland and abroad

26
Main principles (cont.)

internationalisation, international recruiting
participation in teaching
caring for researchers and their careers
no permanent positions
location in Kumpula (UH) and in Otaniemi (TKK)
long-term funding, competition based

27
Actions

Only modest growth. But should be large enough
also for multicomponent projects.
Participation in teaching and other collaboration
with mother departments will grow.
Stronger role for the SAB and industrial advisory
board in choosing research themes

28
Actions (cont.)

Reforming the organization by combining ARU and
BRU
ARU will relocate to Kumpula and Otaniemi,
schedule depending on the availability of
suitable locations. Kumpula V finished in autumn
2007.
Financing relies too much on short-term external
funding. More long-term basic funding is needed,
for example for post-doc positions

29
Research programmes

HIIT starts and maintains certain research
programmes
a programme is started on the Board's decision
For each programme, HIIT will fund
research director/principal scientist (can be
part-time)
senior researcher/postdoc positions (0-2)
some seed money for other positions
the programme will seek additional funding from
other sources (Academy, TEKES, EU, industry, ...)
a programme has several groups and links to partners
selection of senior researchers/groups using
competition and evaluation

30
Data mining general

Heikki Mannila
Academy professor
HIIT Basic Research Unit
Helsinki University of Technology and University
of Helsinki

31
Data mining

Data analysis is becoming more important in other
sciences and in industry
New measurement methods
Ability to store data
High-dimensional large data sets
Non-traditional forms (e.g., strings, trees,
graphs)
Data analysis lags behind

32
Data mining

Has emerged as a major research area in the
interface of computer science and statistics
Machine learning, databases, algorithms
Data analysis questions are increasingly visible
in database and algorithms research
Theory and practice interact
Fits very well within the overall mission of HIIT
Basic research in computer science
Fast applicability, possibility of impact

33
Goals

Develop novel data analysis techniques for the
use of other sciences and industry
How?
Look at data analysis problems arising in
practice
Abstract new computational concepts from them
Analyse the concepts and develop new
computational methods
Take the results into practice
Theoretical work in algorithms and foundations of
data analysis can have fast impact in the
application areas
The applications feed interesting novel questions
to theoretical research

34
Data mining research in HIIT

Three senior researchers (Mannila, Toivonen,
Hollmén)
Operates on the two campuses (UH Kumpula, HUT
Otaniemi)
Research groups with no strict borders, lots of
interaction
Interaction with the adaptive computing systems,
neuroinformatics, and complex systems computation
groups

35
Data mining groups in HIIT

Mannila (UH & HUT) (10-13 persons)
theory of data mining, discrete methods,
segmentation etc.
genome structure, paleontology, linguistics
Toivonen (UH) (6-8 persons)
pattern discovery, algorithms
gene mapping, haplotyping, context awareness,
paleoecology etc.
Hollmén (HUT) (5-9 persons)
mixture modeling, pattern discovery, bootstrap
methods, sparse regression
gene expression, environmental modeling

36
Events in 2004

Very good success in obtaining funds from the
Academy of Finland
Two postdoc positions
Two projects in the SysBio program
Academy professorship and a largish grant
Good success in international recruiting
Panayiotis Tsaparas, Alexander Hinneburg,
(Aristides Gionis in 2003)
EU funding (April II, MobiLife)
Industrial projects
Gene mapping (Tekes)
Phenotype clustering (direct industrial funding)
Visiting students

37
Researchers

Heikki Mannila, Hannu Toivonen, Jaakko Hollmén,
Aristides Gionis, Panayiotis Tsaparas, Ella
Bingham, Alexander Hinneburg, Marko Salmenkivi,
Mikko Koivisto, Saara Hyvönen, Petteri Sevon
Postdoc education!
12 Ph.D. students
Good international visibility

38
Ph.D. theses since last SAB

Ella Bingham
Mikko Koivisto
Petteri Sevon
Kari Vasko
Matti Kääriäinen
Real soon now
Taneli Mielikäinen
Jouni Seppänen

39
Publications in major conferences in 2004

SIGMOD
PODS
VLDB
ISC
KDD
PSB

ICDE
ICDM
PKDD
EDBT
ICDE
...

40
Choice of research areas

Foundational interest and applicability
Relevance of the methods
Relevance of the application areas
Impact for the cooperating groups in other
sciences
Methods that will be used (by collaborators)
Methods influencing the research agendas and data
gathering practices of the partners
Novel questions
Industrial impact: relevant problems, useful
solutions

41
Major themes in methods

Pattern discovery
Pattern discovery and probabilistic modelling
Methods for sequence decomposition
Similarity of complex objects
High-dimensional spatial data
Decomposition of discrete data

42
Application areas

Genome structure
Gene mapping
Gene expression data analysis
Ubiquitous computing (adaptive computing)
Palaeontology, ecology, paleoecology
Climate studies
Linguistic applications
Onomastics, study of variation in language,
dialect studies

43
Examples of current work on theory of data mining

Random walks on databases (Geerts, Terzi, Mannila)
Distance measures between data sets (Tatti)
Clustering aggregation (Tsaparas, Gionis, Mannila)
Approximating a collection of frequent sets (Afrati, Gionis, Mannila)
(k,h)-segmentation (Gionis, Mannila, Haiminen, Terzi)
Vocabularies from sequences (Gionis, Tsaparas, Wexler, Mannila)
Subspace discovery (Seppänen, Gionis, Tsaparas, Hinneburg)
Segmentation distances (Terzi)
Tiles from 0/1 data (Seppänen, Gionis, Mannila)
Condensed representations (Mielikäinen, Toivonen)
Metric labeling and spatial data (Salmenkivi, Tsaparas, Papadimitriou, Leino, Gionis, Mannila, Terzi)

44
Example: random walks on databases

How to generalize HITS etc. to work on databases
(and not just graphs)
Given a class of queries
State space: partial tuples from the database
Transitions: t -> u, if there is a query Q such that
u belongs to Q(t)
Use to rank query answers etc.
Quite nice results
Geerts, Mannila, Terzi, VLDB 2004
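
The ranking idea above can be sketched as a random walk over a small transition table. This is only an illustration under stated assumptions: the damped, PageRank-style walk, the function name `stationary_scores` and the toy states are not from the slides, and the published method may differ.

```python
def stationary_scores(transitions, damping=0.85, iterations=100):
    """Approximate the stationary distribution of a damped random walk.

    transitions maps each state t to the list of states u reachable from t
    (here: u belongs to Q(t) for some query Q). Every successor must also
    appear as a key. The damping factor is an assumption of this sketch.
    """
    states = sorted(transitions)
    n = len(states)
    score = {s: 1.0 / n for s in states}
    for _ in range(iterations):
        nxt = {s: (1.0 - damping) / n for s in states}
        for s in states:
            successors = transitions[s]
            if successors:
                share = damping * score[s] / len(successors)
                for u in successors:
                    nxt[u] += share
            else:
                # A state with no outgoing transition spreads its mass evenly.
                for u in states:
                    nxt[u] += damping * score[s] / n
        score = nxt
    return score


# Hypothetical toy state space: three tuples all lead to tuple "c".
toy = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = stationary_scores(toy)
ranking = sorted(scores, key=scores.get, reverse=True)
```

States that many query answers lead to accumulate probability mass, which is what makes the stationary scores usable as a ranking.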

45
Example: clustering aggregation

Given k clusterings C1, C2, ..., Ck, find a
clustering C that minimizes the sum of the number
of disagreements between the clusterings Ci and C

Robustness of clustering algorithms
Clustering categorical data
Detecting outliers
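
A minimal sketch of the objective stated above. A clustering is represented as a list of cluster labels, one per object, and a disagreement is a pair of objects that one clustering puts together and the other separates. Picking the best of the input clusterings is a simple baseline chosen here for illustration, not necessarily the algorithm presented at the meeting; all names are hypothetical.

```python
from itertools import combinations


def disagreements(c1, c2):
    """Count object pairs on which two clusterings disagree."""
    return sum(
        1
        for u, v in combinations(range(len(c1)), 2)
        if (c1[u] == c1[v]) != (c2[u] == c2[v])
    )


def best_input_clustering(clusterings):
    """Return the input clustering with the smallest total disagreement."""
    return min(
        clusterings,
        key=lambda c: sum(disagreements(c, other) for other in clusterings),
    )


# Hypothetical example: three clusterings of four objects.
c1 = [0, 0, 1, 1]
c2 = [0, 0, 0, 1]
c3 = [0, 1, 1, 1]
consensus = best_input_clustering([c1, c2, c3])
```

The search here is restricted to the k given clusterings; a full solution would search over all partitions of the objects.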

46
Correlation clustering

Given distances Xuv between objects in V
Find a partition C minimizing
Algorithms and approximation guarantees
Gionis, Mannila, Tsaparas, ICDE 2005
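
The objective on this slide did not survive in the transcript. As a hedged reconstruction, assuming the distances are normalized so that $X_{uv} \in [0,1]$ and writing $C(u)$ for the cluster of object $u$ (both assumptions, not stated on the slide), the usual correlation clustering cost has the form:

```latex
d(C) \;=\; \sum_{\substack{(u,v) \\ C(u) = C(v)}} X_{uv}
      \;+\; \sum_{\substack{(u,v) \\ C(u) \neq C(v)}} \bigl(1 - X_{uv}\bigr)
```

Pairs placed in the same cluster pay their distance, and separated pairs pay one minus it. Clustering aggregation fits this form when $X_{uv}$ is taken as the fraction of input clusterings that separate $u$ and $v$.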

47
Example results
48
Example: metric labeling

High-dimensional observation vectors in 1 or 2
dimensions
How to take into account the underlying topology
of the observational points?
Minimize

49
Example: distances between data sets

Nikolaj Tatti, Heikki Mannila
Given two datasets D1 and D2 over the same set of
variables
What is their distance?
Given a collection of statistics s, let f1 = s(D1) and
f2 = s(D2)
E.g., certain marginal frequencies
What is the distance between D1 and D2 from the
viewpoint of these statistics?

50
Distances between data sets

w1: the distribution having statistics f1 and
maximal entropy
w2: the distribution having statistics f2 and
maximal entropy
Distance I: $\mathrm{KL}(w_1, w_2)$
Difficult to compute
2nd order approximations
Distance II: $(f_1 - f_2)^T \, \mathrm{cov}^{-1}(s) \, (f_1 - f_2)$
Under certain assumptions the only choice
Very promising initial results!
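
Distance II is a quadratic form in the difference of the two statistic vectors. The sketch below evaluates it in plain Python by solving cov · x = (f1 - f2) instead of inverting the matrix. How cov(s) is estimated is not given in the slides, so it is treated here as an input; the function names and example numbers are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def quadratic_distance(f1, f2, cov):
    """Evaluate (f1 - f2)^T cov^{-1} (f1 - f2)."""
    d = [p - q for p, q in zip(f1, f2)]
    x = solve(cov, d)
    return sum(di * xi for di, xi in zip(d, x))


# Hypothetical marginal frequencies of two data sets over two statistics.
dist = quadratic_distance([0.5, 0.2], [0.3, 0.6], [[1.0, 0.0], [0.0, 1.0]])
```

With an identity covariance the value reduces to the squared Euclidean distance between the two frequency vectors.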

51
Example: approximating a collection of frequent
sets

Existing frequent set mining algorithms output
too many sets
many of the sets look quite similar
difficult to obtain a global understanding
Goal: describe a transaction database using few
sets
necessarily resort to approximations
which is OK since support threshold is arbitrary

52
The main idea
53
Theoretical results

Afrati, Gionis, Mannila, KDD 2004
Formalize notion of approximation
Distinguish concrete problem variants
Establish NP-hardness and develop algorithms

54
Experimental results

Course data set
Collection: 1637 sets; border: 268 sets; support
25

55
Future work

Theory and practice interact a lot!
(Almost) all theoretical directions are motivated
by practical issues
The combination of continuous and combinatorial
methods
Application areas selected by theoretical
interest and potential impact (industrial and
scientific)
Use by collaborators vs. general distribution
Publications, collaborations, software releases

56
Future work

I. Concepts and algorithms for describing
structure of sequences
Segment structure, vocabularies, inference of
order
II. Methods for pattern discovery in and modelling
of spatiotemporal data
Metric labeling, spatial rules
Similarity of complex objects
Foundational issues in pattern discovery (e.g.,
logical form of patterns and the difficulty in
discovering them)
Mixture modelling and pattern discovery

57
Multilevel description of discrete sequences

Discrete sequences
Haplotypes, genomes, telecommunication alarms,
words in documents, ...
Such sequences typically have a block structure
Underlying process has several different states,
each with different characteristics
Structure can also be hierarchical
Describe the sequence in a useful way
For prediction, clustering, rule discovery,
description, ...

58
Multilevel description of discrete sequences

Three linked main parts of the research program
Segment structure of sequences
Vocabulary of a sequence
Order from unordered sets

Rule discovery in sequences
Time-series similarity
Bayesian methods for piecewise constant approximation of event sequences
(k,h)-segmentation
Block and mosaic structure in haplotypes
Clustering segmentations
Vocabulary of a sequence
Fragments of order
Inference of partial orders
Gene expression and chromosomal location

59
Segment structure of sequences

Segment structure of sequences
(k,h)-segmentation
Grammar inference
Hierarchical analysis
Applications: genome, several genomes,
haplotypes, telecom

dynamic programming, approximations
Aristides Gionis et al.
[Figure: hierarchical description of the sequence ACTAACGACG ACAATCCGCT TATACCAGAT CCAAATCAAC. A (k,h)-segmentation labels the segments (B, F, F, E, E); grammar inference gives the rules G -> (E F), E -> EBE]
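
The slide names dynamic programming as a tool for segmentation. A minimal sketch of the classic dynamic program for plain k-segmentation of a numeric sequence follows. It is not the (k,h) variant from the slide, which additionally restricts the segments to h distinct descriptions. The squared-error cost and the function name are assumptions made for illustration.

```python
def k_segmentation(seq, k):
    """Split seq into k contiguous segments minimizing squared error.

    Each segment is described by its mean. Returns (total_cost, bounds),
    where bounds is a list of half-open index pairs (start, end).
    Requires 1 <= k <= len(seq).
    """
    n = len(seq)
    # Prefix sums give the cost of any segment in constant time.
    ps, ps2 = [0.0], [0.0]
    for x in seq:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)

    def cost(i, j):
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i
    # Walk the back pointers to recover the segment boundaries.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return best[k][n], bounds


total, segments = k_segmentation([1, 1, 1, 5, 5, 5, 9, 9], 3)
```

The table has k · n entries and each is filled in O(n) time, so the sketch runs in O(k n^2). Approximations are what make longer sequences practical.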
60
Vocabulary of a sequence

Find a good set of recurrent words
Motif discovery
Segment structure on the basis of vocabularies

greedy algorithm on submodular functions, string
algorithms
Panayiotis Tsaparas et al.
[Figure: example text "Call me Ishmael. Some years ago never mind how ..." with a discovered vocabulary: inthesea, harpoon, ...]
61
Order from unordered data

Matrix reordering
Fragments of order
Inference of partial orders

Frequent patterns, spectral methods, mixture
modeling on combinatorial objects
Heikki Mannila et al.
[Figure: the sequences ABCD, ABCD, ACBD, ACBD and an order over the elements A, B, C, D]
62
Methods for pattern discovery in and modelling of
spatiotemporal data

Data where observations have a location
Biodiversity data: grid cells (2-d)
Place names, dialect usage: 2-d
Genome: location in one dimension (1-d, pieces)
Telecom alarms: location in the network (graph)
Lots of interest from the application areas
How to take the underlying topology into account?
2003?

63
Links on a fragmented 1-d space: comparative
genomics

Orthologous genes in different species

64
Methods for spatial data
Antti Leino et al.

Rule discovery
spatial association rules
Mixture models, MCMC
piecewise constant models
Clustering, metric labeling
algorithms, approximations
Spatial statistics: interaction with the math
department

Marko Salmenkivi et al.
Aristides Gionis et al.
65
Summary of future plans

Theory and practice!
Structure of sequences
Spatiotemporal data
Foundations of pattern discovery, similarity, ...
Applications

66
Data Mining Applications

Hannu Toivonen
Professor

67
Application areas

Bioinformatics, especially medical genetics
gene mapping, haplotyping
discovery of genome structure
gene expression data analysis
Paleontology, paleoecology
Linguistics
Ubiquitous computing

68
Collaborations in applications

Computational methods for genome structure
Leena Peltonen (KTL), Juha Kere (Karolinska),
Anu Jalanko (KTL)
Gene mapping
Leena Peltonen, Juha Kere
Jurilab Ltd.
Geneos Ltd.
Phenotype clustering
Orion Pharma

69
Collaborations (cont.)

Linguistics
R.-L. Pitkänen (Research Institute for the Languages
of Finland)
Terttu Nevalainen (Department of English)
Paleontology
Mikael Fortelius (Dept. of Geology), Jukka
Jernvall (Institute of Biotechnology)
Paleoecology
Atte Korhola (Dept. of Ecology)
Environmental studies
Markku Kulmala (Dept. of Physics)

70
Researchers

Profs. Hollmén, Mannila, Toivonen
Postdocs
Ella Bingham: genome structure, paleontology
Aristides Gionis: genome structure, paleontology,
spatial data
Alexander Hinneburg: spatial data (linguistics)
Saara Hyvönen: environmental modeling
Mikko Koivisto: phenotype clustering, genome
structure
Päivi Onkamo: genetics
Marko Salmenkivi: linguistics, spatial data
Petteri Sevon: genetics, bioinformatics
Panayiotis Tsaparas: web graphs, spatial data
10 PhD students

71
Medical genetics

Important applications
locating disease predisposing genes is essential
for understanding the aetiology of complex common
diseases, such as heart disease or asthma
Focus on selected topics where
we can have a significant impact
we can combine our own expertise with the unique
research on medical genetics in Finland
Collaboration with leading groups in medical
genetics
Prof. Leena Palotie (Public Health Institute)
Prof. Juha Kere (Karolinska Institutet)

72
Gene mapping
Rows: haplotypes (chromosomes); columns: marker loci; entries: alleles

marker locus   1 2 3 4 5 6 7 8 9
case           1 4 8 2 2 1 2 6 2
case           2 4 3 7 3 2 8 4 2
case           4 5 2 4 5 5 2 6 4
case           7 2 3 7 5 4 5 2 2
case           5 2 4 6 2 4 2 6 1
case           3 4 3 7 3 1 3 3 4
case           1 2 1 5 2 5 2 6 2
case           5 3 3 7 3 2 1 4 3
control        2 4 7 1 3 4 1 4 8
control        7 3 7 7 5 7 8 6 6
control        3 4 3 2 5 3 2 3 2
control        2 5 2 4 3 1 3 6 2
control        3 3 1 2 4 2 1 4 2
control        1 6 4 5 5 5 9 1 3
control        4 2 8 4 2 3 5 2 5
control        2 2 4 9 5 4 4 2 4
73
Gene mapping
marker locus   1 2 3 4 5 6 7 8 9
case           1 4 8 2 2 1 2 6 2
case           2 4 3 7 3 2 8 4 2
case           4 5 2 4 5 5 2 6 4
case           7 2 3 7 5 4 5 2 2
case           5 2 4 6 2 4 2 6 1
case           3 4 3 7 3 1 3 3 4
case           1 2 1 5 2 5 2 6 2
case           5 3 3 7 3 2 1 4 3
control        2 4 7 1 3 4 1 4 8
control        7 3 7 7 5 7 8 6 6
control        3 4 3 2 5 3 2 3 2
control        2 5 2 4 3 1 3 6 2
control        3 3 1 2 4 2 1 4 2
control        1 6 4 5 5 5 9 1 3
control        4 2 8 4 2 3 5 2 5
control        2 2 4 9 5 4 4 2 4

pattern 1: (3)(4) 3 7 (3)(2)
pattern 2: (5) 2 6 (2)
74
Highlights: gene mapping

Formulation of gene mapping as a mixture of
pattern discovery and classification
Concepts and methods from computer science
Haplotype Pattern Mining (HPM)
haplotype patterns
efficient algorithms for finding relevant
patterns
a number of variants as follow-up
successful in gene mapping
patents and licensing
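
To make the notion of a haplotype pattern concrete, the sketch below counts how many case and control haplotypes from the example table on the preceding slides match a pattern. It is a toy illustration, not the HPM implementation: the dictionary representation, the function names and the marker positions assigned to the two patterns (read off the table) are assumptions.

```python
# Alleles at nine marker loci, copied from the example table.
CASES = [
    [1, 4, 8, 2, 2, 1, 2, 6, 2],
    [2, 4, 3, 7, 3, 2, 8, 4, 2],
    [4, 5, 2, 4, 5, 5, 2, 6, 4],
    [7, 2, 3, 7, 5, 4, 5, 2, 2],
    [5, 2, 4, 6, 2, 4, 2, 6, 1],
    [3, 4, 3, 7, 3, 1, 3, 3, 4],
    [1, 2, 1, 5, 2, 5, 2, 6, 2],
    [5, 3, 3, 7, 3, 2, 1, 4, 3],
]
CONTROLS = [
    [2, 4, 7, 1, 3, 4, 1, 4, 8],
    [7, 3, 7, 7, 5, 7, 8, 6, 6],
    [3, 4, 3, 2, 5, 3, 2, 3, 2],
    [2, 5, 2, 4, 3, 1, 3, 6, 2],
    [3, 3, 1, 2, 4, 2, 1, 4, 2],
    [1, 6, 4, 5, 5, 5, 9, 1, 3],
    [4, 2, 8, 4, 2, 3, 5, 2, 5],
    [2, 2, 4, 9, 5, 4, 4, 2, 4],
]


def matches(haplotype, pattern):
    """pattern maps a 0-based marker index to the required allele."""
    return all(haplotype[pos] == allele for pos, allele in pattern.items())


def support(pattern):
    """Return (matching cases, matching controls)."""
    in_cases = sum(matches(h, pattern) for h in CASES)
    in_controls = sum(matches(h, pattern) for h in CONTROLS)
    return in_cases, in_controls


core_1 = {2: 3, 3: 7}  # alleles "3 7" at markers 3 and 4 (assumed position)
core_2 = {6: 2, 7: 6}  # alleles "2 6" at markers 7 and 8 (assumed position)
```

A pattern that is frequent among cases and rare among controls points to a region likely to be near a disease-predisposing gene, which is the classification side of the formulation.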

75
Highlights: gene mapping

Tree Disequilibrium Test (TreeDT)
looks for tree structured haplotype patterns
patterns reflect possible recombination histories
gene localization based on the pattern that best
explains the disease status
efficient algorithms
new solutions to multiple, nested permutation
tests

[Figure: tree over the haplotypes A B C D E F]

    M1 M2 M3 M4 M5 M6 M7 M8 M9
A    2  3  1  2  2  1  2  2  1
B    3  1  1  2  1  1  2  1  1
C    4  1  2  1  2  1  4  3  3
D    1  2  1  2  3  1  4  1  4
E    2  2  3  1  3  2  4  3  2
F    1  2  1  3  1  2  1  4  2
76
Highlights: haplotyping

Find the highest probability strings (haplotypes)
explaining sequences of pairs (genotypes)
Example: genotype {1,2} {1,1} {1,2} -> haplotype pairs 111/212, 112/211, 211/112, 212/111
Exponentially many candidates for each genotype
HaploRec: Markovian models, efficient algorithms
variable length Markov chains: $P(H) = P(H_1) \prod_{i>1} P(H_i \mid H_{s_i,\,i-1})$, where $s_i = \min\{ s : H_{s,i} \text{ is statistically relevant} \}$
probabilities $P(H_{s,i})$ are estimated with EM
unique scalability
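
The source of the exponential blow-up can be seen by enumerating the haplotype pairs that are consistent with a genotype: every heterozygous marker doubles the number of possible phasings. The sketch below does only this enumeration. It is not HaploRec, which scores the candidates with a Markov model, and the function name is hypothetical.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """List the unordered haplotype pairs that explain a genotype.

    genotype is a sequence of allele pairs, one per marker, e.g.
    [(1, 2), (1, 1), (1, 2)]. Each result is a pair of allele tuples.
    """
    options = [
        ((a, b), (b, a)) if a != b else ((a, b),)
        for a, b in genotype
    ]
    pairs = set()
    for choice in product(*options):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


pairs = consistent_haplotype_pairs([(1, 2), (1, 1), (1, 2)])
```

For the three-marker genotype above this yields the two unordered pairs 111/212 and 112/211; a genotype with h heterozygous markers has 2^(h-1) such pairs.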
77
Highlights: mosaic structure of haplotypes

Defining and utilising haplotype block structure
of the human genome
Describing and finding the possible mosaic-like
structure of haplotypes (and genotypes)

78
Highlights: simulation studies

Population and marker simulation tools
Large simulation studies to test mapping
methodologies
an unexpected result: population-based haplotypes
are as powerful as true haplotypes

79
Other applications

Medical genetics
analysis of phenotypic datasets: finding robust
clusters that have genetic explanations
Paleontology and paleoecology
finding good estimates of the ages of fossil
sites: finding matrix re-orderings that
approximate the consecutive ones property

80
Other applications

Linguistics
finding spatial structure of the distribution of
place names and words: high-dimensional
clustering
preliminary results on pattern discovery and
mixture modelling techniques for large onomastic
data sets
Ubiquitous computing
learning to recognize typical device contexts:
on-line clustering of stream data

81
Future directions

Genome structure: computational tools for
describing the variation between individuals and
between species
haplotype blocks and mosaics
identification of rearrangements, duplications,
and other large-scale variations
comparative genomics: several species!
segment structure, reversal distances etc.
vocabulary of the genome
function and structure

82
Future directions (continued)

Mining biological databases
analysis of the rich, heterogeneous public
databases
how to find patterns in complex irregular
structures
discovery of similarities and analogies
producing plausible biological relationships and
hypotheses

83
Future directions (continued)

Spatial and temporal variation in language
Spatiotemporal issues in paleontology and ecology
methods to detect patterns of variance in species
abundances
methods to correlate paleoecological time series
data, to find features in such data
Recognition of contexts in mobile applications

84
Summary

Application problem -> new computational concepts
-> novel methods -> practical applications
Important data analysis problems
Successfully fielded applications
A wide network of excellent applied collaborators
Development of novel techniques
combinations of discrete and probabilistic
approaches
HIIT mode of working: collaboration between
groups, universities, other disciplines, and
industry

85
Posters and demos

Genetic mapping studies: Asthma and allergy.
Päivi Onkamo
Mining Atmospheric Data. Saara Hyvönen
Geometric and combinatorial tiles in 0-1 data.
(DEMO) Aristides Gionis, Heikki Mannila, Jouni
Seppänen
Dimension induced clustering. (DEMO) Aristides
Gionis, Alexander Hinneburg, Spiros
Papadimitriou, Panayiotis Tsaparas.
Spatial Analysis of Area Data. Case: Finnish lake
names. Marko Salmenkivi, Saara Hyvönen, Antti
Leino.
What was the Finnish hiisi? A case study on place
name data. Marko Salmenkivi, Antti Leino, Saara
Hyvönen.
Clustering aggregation. Aristides Gionis,
Panayiotis Tsaparas, Heikki Mannila.
Spatially coherent clustering. Aristides Gionis,
Heikki Mannila, Spiros Papadimitriou, Panayiotis
Tsaparas
Spectral ordering. Aristides Gionis, Heikki
Mannila, Mikael Fortelius, Jukka Jernvall

Genome puzzle. (DEMO) Mikko Koivisto, Teemu
Kivioja, Heikki Mannila, Pasi Rastas, Esko
Ukkonen.
Segmentation-based analysis of genomic
sequences. (DEMO) Niina Haiminen, Evimaria Terzi,
Aristides Gionis, Heikki Mannila.
Mining non-redundant association rules. (DEMO)
Juho Muhonen
HaploRec: Population-based reconstruction of
haplotypes. (DEMO) Lauri Eronen
TreeDT: Gene mapping by tree disequilibrium test.
Petteri Sevon, Hannu Toivonen
An efficient method for association mapping in
phase-unknown genotype data. Petteri Sevon, Päivi
Onkamo, Hannu Toivonen
Techniques for simulating populations and marker
data. Petteri Hintsanen, Petteri Sevon
Efficient population-based reconstruction of
haplotypes. Lauri Eronen
Integrating the tools: Power simulations for gene
mapping studies. Petteri Hintsanen, Petteri
Sevon, Päivi Onkamo.
Mapping susceptibility genes for familial glioma.
Päivi Onkamo

86
Neuroinformatics

Dr. Aapo Hyvärinen

87
Scope of Neuroinformatics

Interface of brain research and information
technology
Functional models of brain
Signal processing methods
Databases
Our specialization: Multivariate statistical
models
Principal component analysis
Independent component analysis
Extensions of ICA
New models (see later)

88
Researchers in Neuroinformatics

Aapo Hyvärinen, leader
Two post-docs
Patrik Hoyer
Jarmo Hurri
5 PhD students (some partly)
Funding from Academy of Finland, Univ of
Helsinki, foreign foundations

89
Research goals

Models of sensory processing in the brain, based
on statistical analysis of natural stimuli
New biologically-inspired data analysis methods
Advanced statistical analysis of neuroscientific
data
Common theme is multivariate data analysis

90
Reliability analysis of ICA
(NeuroImage, 2004)

ICA can find underlying factors that are
independent and nongaussian
Results contain statistical and computational
errors: which components are good?
A software package on the Web
Same approach works for comparison of individuals
(NeuroImage, in press)

In cooperation with Universities of Naples and
Maastricht
91
Learning high-level features

ICA gives linear features in images
Extensions of ICA give features in 2nd layer
We estimate the third layer by ICA of outputs of the
2nd layer (submitted ms.)

92
Learning segmentation (1)

A new principle for multivariate data analysis
Given very high-dimensional random vector
Can we partition variables in each observation
Based on the statistical structure
With no prior knowledge of segments
Visual system has learned this for its input

93
Learning segmentation (2)

Use correlations to find out which variables
belong together (submitted ms.)
Basic idea: each segment should be such that
observed variables follow typical correlation
structure
Then, we can segment even data that has a weird
correlation structure

94
New kinds of feature extraction

We can try to find features that characterize
whole images (Proc Int Conf Pattern Recogn 2004)
Compute histograms of low-level features
Analyze these histograms, e.g. by ICA
Features from natural language (text) data
(Proc. Int. Joint Conf Neural Netw 2004)
Compute the context histograms of words (which
words are typically together)
Perform e.g. ICA on these histograms
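
A minimal sketch of the first step above, computing context histograms from a token sequence. The function name, the symmetric window and its size are illustrative assumptions; the subsequent ICA step is not shown.

```python
from collections import Counter, defaultdict


def context_histograms(tokens, window=2):
    """For each word, count the words seen within `window` positions of it."""
    hist = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                hist[word][tokens[j]] += 1
    return hist


tokens = "call me ishmael call me again".split()
hist = context_histograms(tokens, window=1)
```

Each word's counter can be laid out as one row of a word-by-context matrix, which is the kind of input a method such as ICA would then analyze.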

In cooperation with HUT and FDK
95
Exploration of causality
(Proc. Factor Analysis Cent. Symp. 2004)

Classic methods can say "x and y are correlated",
but which causes which?
Using nongaussianity, we can find causal
ordering
Closely related to ICA estimation: an example of
a post-processing method.

In cooperation with University of Osaka
96
Blind source separation

Separation of underlying sources, e.g. in brain
activity
ICA
Sources are independent
No time structure
We developed methods which
are able to separate dependent sources (Signal
Processing, 2004)
Utilize time structure (Sig Proc, in press)

97
Classification images

Estimation of templates used in the human visual
system by linear regression
We attempt to develop nonlinear versions
Basic approach: changes in linear templates as a
function of context

In cooperation with Dept of Psychology, UH
98
Non-negative sparse representations

Non-negative matrix factorization (NMF) has been
claimed to give local, parts-based representations
Locality is much better achieved when NMF is combined
with sparseness (J. Mach. Learn. Res., in press)
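
One common way to quantify sparseness of a vector compares its L1 and L2 norms, giving 1 for a vector with a single nonzero entry and 0 for a vector with all entries equal. Whether the cited paper uses exactly this normalisation is an assumption here; the function name `sparseness` is chosen for the example.

```python
import numpy as np

def sparseness(x):
    """L1/L2-based sparseness in [0, 1]; needs a nonzero vector with more than one entry."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

most_sparse = sparseness([1, 0, 0, 0])    # single active entry
least_sparse = sparseness([1, 1, 1, 1])   # all entries equal
```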

99
Future questions

Connection between segmentation and independent
components
Further post-processing methods for ICA
Classification methods and ICA
General multilayer models for natural images
Nonlinear ICA
Estimation of complex statistical models

100
Adaptive Computing

Dr. Patrik Floréen

101
Premises of the research

Adaptive computing refers to solutions that adapt
to their environment
Linked to the ubiquitous / pervasive / proactive
computing vision
We focus on some central topics to realise this
vision
Context-awareness and adaptation are central to
user-friendly ubiquitous applications
Ad hoc networking (incl. sensor networks) may in the
future provide infrastructure for many ubiquitous
applications

102
Our Research Environment

Draws on existing competence in data mining,
probabilistic reasoning, algorithmics and
language technology
At the intersection of many of the research
groups of HIIT: many of our research groups deal
with context-awareness, personalisation and
adaptation
This presentation is about the AC groups at BRU
Group of Prof. Hannu Toivonen (Kari Laasonen,
Renaud Petit, Mika Raento)
Group of Doc. Patrik Floréen (Greger Lindén,
Jukka Kohonen, Yevgeniya Kulikova, Petteri Nurmi,
Michael Przybilski, Jukka Suomela)
Small groups, short history (2003-)

103
Present Research Issues (1/2)

Context analysis
Analysis of context information and its use in
proactive adaptivity on mobile devices, in
particular recognising and predicting locations
under limited resources (CONTEXT: Toivonen,
Laasonen, Petit, Raento)
Reasoning about (mobile) context using data
mining and machine learning techniques, e.g.
segmentation, time series analysis (MobiLife:
Floréen, Nurmi, Przybilski, Raento, Suomela)

104
Present Research Issues (2/2)

Architectural issues for context-aware systems
Context-aware selection of software components on
mobile terminals with an architecture solution
based on a blackboard approach (Space4U:
Floréen, Przybilski)
Context Management Framework: a software
architecture for future context-aware mobile
systems (MobiLife: Floréen, Nurmi, Przybilski)
Ad hoc and sensor networks
Topology control and routing problems under
energy constraints
Self-organisation of ad hoc networks using a
game-theoretical approach
(NAPS: Floréen, Kohonen, Nurmi)

105
Summary of Ongoing Projects

Group of Toivonen
CONTEXT: Academy of Finland, 11/02-12/05, with
ARU
Group of Floréen
NAPS: Academy of Finland, 01/03-12/05, with HUT
Space4U: EUREKA/ITEA, Nokia subcontract,
07/03-06/05, also HUT
MobiLife: EU IST IP, Nokia coordinator,
09/04-12/06, also ARU
In addition
PROACT coordination: Academy of Finland,
01/02-05/06; Programme Director Heikki Mannila,
Programme Coordinator Greger Lindén

106
Highlights of Recent Achievements: CONTEXT

The software developed for location recognition
and prediction has been published as open source;
it is used by other research groups, and for a
presence service and the annotation of photographs

107
Annotation of Photographs
108
Highlights of Recent Achievements: NAPS

NP-hardness results and algorithms for maximizing
multicast lifetime under energy constraints, by
dynamically choosing transmission power levels
Modelling of routing in ad hoc networks using
dynamic Bayesian games (Nurmi)
Balanced data gathering in sensor networks,
including an approximation algorithm based on the
Garg-Könemann fractional packing algorithm
(FOCS '98)
Energy-limited sensor nodes
Utility function $F_\lambda = (1-\lambda)\,\mathrm{avg}_{i \in S}\, q_i + \lambda \min_{i \in S} q_i$
36 randomly placed sensors in figures that follow
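
The utility function on this slide blends the average and the minimum of the per-sensor quantities q_i (presumably the amount of data gathered from each sensor), with a balance parameter between 0 and 1. A minimal sketch, where the function name `balanced_utility`, the parameter name `balance` and the example values are assumptions:

```python
def balanced_utility(q, balance):
    """Weighted combination of the average and the minimum of q.

    balance = 0 gives the plain average, balance = 1 gives the minimum.
    """
    avg = sum(q) / len(q)
    return (1 - balance) * avg + balance * min(q)

# Hypothetical per-sensor values
q = [1.0, 2.0, 3.0, 6.0]
no_balancing = balanced_utility(q, 0.0)        # average only
strict_balancing = balanced_utility(q, 1.0)    # minimum only
moderate_balancing = balanced_utility(q, 0.5)  # equal weight on both
```

Raising the balance parameter rewards solutions in which no single sensor is left far behind, at the cost of a lower average.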

109
No balancing ($\lambda = 0$)
Maximizing $F_0 = \mathrm{avg}_i\, q_i$
110
Strict balancing ($\lambda = 1$)
Maximizing $F_1 = \min_i q_i$
111
Moderate balancing ($\lambda = 0.5$)
Maximizing $F_{0.5} = 0.5\,\mathrm{avg}_i\, q_i + 0.5 \min_i q_i$
112
Future Directions

Emphasis more on continuing present topics than
on enlarging to new areas
Successful present principles to be continued
Diverse funding sources
Theory and practical implementation together
There is potential for developing the activities
through
even more collaboration with other groups (inside
and outside of HIIT)
TEKES projects
attention to recruitment of postdocs

113
Future Research Issues

Context reasoning and the use of context
Combining context-awareness and component
architectures
Trust and privacy issues of the users in
context-aware applications
Modelling and algorithms for topology control and
routing in ad hoc networks and data gathering in
sensor networks
Application of game theory to problems in ad hoc
networking and context-aware computing