Title: Exploratory Data Analysis and Essential Statistics using R
Exploratory Data Analysis and Essential
Statistics using R
- Aline Tabet
- University of British Columbia

About this workshop
- This workshop will not turn you into a statistician or an R expert!
- Instead, you will become statistics and R aware
- This might push you to learn more about R and statistics
Day 1
Goal
- How to display statistical information properly
- Understand basic concepts: What is a p-value? Two-sample t-test or paired t-test? Why do we need multiple testing?
- Get a first exposure to the R statistical language

Outline
- A bit of history
- Module 1: R Basics
- Module 2: Exploratory Data Analysis
- Module 3: Hypothesis testing
- Module 4: Data reduction (PCA)
- Module 5: Clustering and classification
- Module 6: Regression and correlation

Statistics in the news
"The prize winner was a team of statisticians,
machine-learning experts and computer engineers
from the United States, Austria, Canada and
Israel, calling itself BellKor's Pragmatic Chaos."
Sep 21, 2009

Statistics in the news
"I keep saying that the sexy job in the next 10
years will be statisticians," said Hal Varian,
chief economist at Google. "And I'm not kidding."
Aug 5, 2009

Statistics in the news
"R is really important to the point that it's
hard to overvalue it," said Daryl Pregibon, a
research scientist at Google, which uses the
software widely. "It allows statisticians to do
very intricate and complicated analyses without
knowing the blood and guts of computing systems."
Jan 6, 2009

Statistics in the news
Medical journal editors should require
independent analysis of industry-sponsored trial
data by an academic statistician before
publishing results, according to an editorial
published in the March 24/31 Journal of the
American Medical Association.
April 19, 2010.

History
- R is the son of S
- S is a statistical programming language developed by John Chambers from Bell Labs
- Goal of S was "to turn ideas into software, quickly and faithfully"
- S was created in 1976
- New S language arrived in 1988 (Blue Book) and introduced many changes (macros to functions)

History
- Version 4 of S was introduced in 1998 and brought a formal class-method model
- In 1993, StatSci (maker of S-Plus) acquired an exclusive license to S
- S-Plus integrates S with a nice GUI and full customer support
- R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand

History
- The R project started in 1991
- R first appeared in 1996 as open-source software!
- Highly customizable via packages
- R-based community, power of collaboration with thousands of packages freely available
- Many commercial variants of R (http://www.revolution-computing.com/)

Bioconductor
- Started by Robert Gentleman in 2001
- Based at the Fred Hutchinson Cancer Research Center
- Collection of packages for the analysis and comprehension of genomic data
- Uses R and is of course free, open source and open to outside contributors
- Contains hundreds of packages from microarray analysis to next generation sequencing

What is R?
- R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
  - an effective data handling and storage facility
  - a suite of operators for calculations on arrays, in particular matrices
  - a large, coherent, integrated collection of intermediate tools for data analysis
  - graphical facilities for data analysis and display either on-screen or on hardcopy, and
  - a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

References
- Introductory Statistics with R by Peter Dalgaard
- R reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf
- R tutorial: http://www.cyclismo.org/tutorial/R/
- r-project.org and bioconductor.org

Module 1: R basics
Aline Tabet
Exploratory Data Analysis and Essential Statistics using R, Sept 30 - Oct 1, 2010

An overgrown calculator
> 2 + 2
[1] 4
> exp(-2)
[1] 0.1353353
> pi
[1] 3.141593
> sin(2*pi)
[1] -2.449294e-16
> cos(2*pi)
[1] 1
Day 1 - Section 1
Getting help
> help(pi)    # equivalent to ?pi
> ?sqrt
> ?sin
> ?Special
What if we do not know the name of the function/object?
We can use help.search by specifying a keyword:
> help.search("trigonometry")    # equivalent to ??trigonometry
Even on a calculator we need some way to store intermediate results.

Assignment
> x <- 2
> y <- 2
> x + y
[1] 4
Tips: Avoid single-letter names, be explicit, and separate words with dots or capitals, e.g. MyFavoriteVariable

Vectorized arithmetic
We cannot do much statistics with a single number! We need a way to store a sequence/list of numbers.
One can simply concatenate elements with the c function.
> weight <- c(60,72,75,90,95,72)
> weight[1]
[1] 60
> weight[2]
[1] 72
> weight
[1] 60 72 75 90 95 72
> height <- c(1.75,1.80,1.65,1.90,1.74,1.91)
> bmi <- weight/height^2    # vector-based operation
Note: Vector-based operations are much faster!
Ex: Find at least one other way to create a vector.
Note: c can be used to concatenate strings and numbers.
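A small sketch to go with the two notes above (not part of the original slides; the name mixed is made up for the illustration). It suggests that c() does accept a mix of strings and numbers, but that a vector holds a single type, so the numbers end up stored as character strings. It then repeats the BMI computation to show arithmetic applied to whole vectors at once.

```r
# c() accepts mixed input, but a vector holds one type only:
# the numbers are converted to character strings.
mixed <- c("weight", 60, 72)
print(mixed)
print(class(mixed))

# Vectorized arithmetic works element by element on numeric vectors.
weight <- c(60, 72, 75, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
bmi <- weight / height^2   # one BMI value per subject
print(round(bmi, 1))
```

Because mixed is a character vector, an expression such as mixed[2] + 1 would fail; keeping numbers and labels in separate vectors avoids this.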

Vectors
We have three types of vectors: numeric, logical, character
> # Numeric vectors
> x <- c(1,5,8)
> x
[1] 1 5 8
> # Logical vectors
> x <- c(TRUE,TRUE,FALSE,TRUE)
> x
[1]  TRUE  TRUE FALSE  TRUE
> # Character vectors
> x <- c("Hello","my","name","is","Francis")
> x
[1] "Hello"   "my"      "name"    "is"      "Francis"
Ex: Create a vector with the following elements: 1, 3, 10, -1; call your vector x. Take the square root of x. Take the log of (1+x).

Missing and special values
We have already encountered the NaN symbol, meaning not-a-number, and Inf, -Inf. In practical data analysis a data point is frequently unavailable. In R, missing values are denoted by NA.
Depending on the context, R provides different ways to deal with missing values.
> weight <- c(60,72,75,90,NA,72)
> mean(weight)
[1] NA
> mean(weight, na.rm=TRUE)
[1] 73.8
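The following sketch (an illustration added here, not from the slides) shows one way the special values can arise and how is.na can be used to locate missing entries. The variable name missing is an assumption made for the example.

```r
# Special values produced by arithmetic
print(0 / 0)     # NaN: not a number
print(1 / 0)     # Inf
print(log(0))    # -Inf

# NA marks a missing observation
weight <- c(60, 72, 75, 90, NA, 72)
missing <- is.na(weight)         # TRUE where the value is missing
print(sum(missing))              # number of missing values
print(mean(weight[!missing]))    # same as mean(weight, na.rm = TRUE)
```

Dropping the missing entries by hand gives the same mean as na.rm=TRUE, which suggests what that argument does behind the scenes.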

Matrices and Arrays
A matrix is a two-dimensional array of numbers. Matrices can be used to perform statistical operations (linear algebra). However, they can also be used to hold tables.
> x <- 1:12
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> length(x)
[1] 12
> dim(x)
NULL
> dim(x) <- c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> x <- matrix(1:12, nrow=3, byrow=TRUE)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> x <- matrix(1:12, nrow=3, byrow=FALSE)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> rownames(x) <- c("A","B","C")
> x
  [,1] [,2] [,3] [,4]
A    1    4    7   10
B    2    5    8   11
C    3    6    9   12
> colnames(x) <- c("1","2","x","y")
> x
  1 2 x  y
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12
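To make the linear algebra remark concrete, here is a minimal sketch using the same 3 x 4 matrix. The slide does not show these operations; the names tx and xxt are made up for the example.

```r
x <- matrix(1:12, nrow = 3)   # 3 x 4, filled column by column
tx <- t(x)                    # transpose: 4 x 3
print(dim(tx))
xxt <- x %*% tx               # matrix product: 3 x 3
print(xxt)
print(x * 2)                  # element-wise arithmetic, as with vectors
print(rowSums(x))             # one total per row
print(colMeans(x))            # one mean per column
```

Note the difference between * (element by element) and %*% (matrix product).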

Matrices and Arrays
Matrices can also be formed by gluing rows and columns using cbind and rbind. This is the equivalent of c for vectors.
> x1 <- 1:4
> x2 <- 5:8
> y1 <- c(3,9)
> MyMatrix <- rbind(x1,x2)
> MyMatrix
   [,1] [,2] [,3] [,4]
x1    1    2    3    4
x2    5    6    7    8
> MyNewMatrix <- cbind(MyMatrix,y1)
> MyNewMatrix
           y1
x1 1 2 3 4  3
x2 5 6 7 8  9

Factors
It is common to have categorical data in statistical data analysis (e.g. Male/Female). In R such variables are referred to as factors. Factors make it possible to assign meaningful names to categories. A factor has a set of levels.
> pain <- c(0,3,2,2,1)
> fpain <- as.factor(c(0,3,2,2,1))
> levels(fpain) <- c("none","mild","medium","severe")
> is.factor(fpain)
[1] TRUE
> is.vector(fpain)
[1] FALSE
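As an alternative sketch (not the approach shown on the slide, which uses as.factor followed by levels), factor() can state the codes and their labels in one step. The example also prints the internal integer codes and a frequency table, to suggest how levels and values relate.

```r
pain <- c(0, 3, 2, 2, 1)
# State the codes (levels) and their names (labels) in one call
fpain <- factor(pain, levels = 0:3,
                labels = c("none", "mild", "medium", "severe"))
print(fpain)               # values are shown by label
print(levels(fpain))       # the set of categories
print(as.integer(fpain))   # internal codes run from 1, not from 0
print(table(fpain))        # number of observations per category
```

The internal codes differ from the original 0 to 3 scores, so converting a factor back to numbers needs some care.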

Lists
Lists can be used to combine objects (of possibly different kinds/sizes) into a larger composite object.
> x <- c(31,32,40)
> y <- as.factor(c("F","M","M","F"))
> z <- c("London","School")
> MyList <- list(age=x, sex=y, meta=z)
> MyList
$age
[1] 31 32 40

$sex
[1] F M M F
Levels: F M

$meta
[1] "London" "School"

> MyList$age
[1] 31 32 40
The components of the list are named according to the arguments used. Named components can be accessed with the $ operator.

Data Frames
A data frame is a data matrix or a data set. It is a list of vectors and/or factors of the same length, related such that data in the same position come from the same experimental unit (subject, animal, etc.).
> MyDataFrame <- data.frame(age=c(31,32,40,50), sex=y)
> MyDataFrame
  age sex
1  31   F
2  32   M
3  40   M
4  50   F
> MyDataFrame$age
[1] 31 32 40 50
> is.vector(MyDataFrame$age)
[1] TRUE
> is.vector(MyDataFrame$sex)
[1] FALSE
Why do we need data frames if it is simply a list?
More efficient storage, and indexing!
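A short sketch of the list/data frame relationship (an added illustration; the name result is made up). It suggests that a data frame really is a list of columns, with the extra constraint that all columns have the same length.

```r
MyDataFrame <- data.frame(age = c(31, 32, 40, 50),
                          sex = factor(c("F", "M", "M", "F")))
print(is.list(MyDataFrame))   # a data frame is a list of columns
print(dim(MyDataFrame))       # rows are subjects, columns are variables
str(MyDataFrame)              # compact description of each column

# Unlike a plain list, columns of different lengths are rejected
result <- try(data.frame(age = c(31, 32, 40),
                         sex = c("F", "M", "M", "F")),
              silent = TRUE)
print(inherits(result, "try-error"))
```

The same two vectors could be stored in a list without complaint; the data frame refuses them because row i must describe one subject across all columns.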

Names
Names of an R object can be accessed and/or modified with the names function (method).
> x <- rep(1:3)
> names(x)
NULL
> names(x) <- c("a","b","c")
> MyDataFrame <- data.frame(age=c(31,32,40,50), sex=y)
> MyDataFrame
  age sex
1  31   F
2  32   M
3  40   M
4  50   F
> names(MyDataFrame)
[1] "age" "sex"
> names(MyDataFrame) <- c("age","gender")
> names(MyDataFrame)[1] <- c("Age")
Remark: Give explicit names to variables.
Names can be used for indexing.

Indexing
Indexing is a great way to directly access elements of interest.
> # Indexing a vector
> pain <- c(0,3,2,2,1)
> pain[1]
[1] 0
> pain[2]
[1] 3
> pain[1:2]
[1] 0 3
> pain[c(1,3)]
[1] 0 2
> pain[-5]
[1] 0 3 2 2
> # Indexing a matrix
> MyMatrix[1,1]
x1
 1
> MyMatrix[1,]
[1] 1 2 3 4
> MyMatrix[,1]
x1 x2
 1  5
> MyMatrix[,-2]
   [,1] [,2] [,3]
x1    1    3    4
x2    5    7    8
> # Indexing a list is done in the same way
> MyList[3]
$meta
[1] "London" "School"

> MyList[[3]]
[1] "London" "School"
> MyList[[3]][1]
[1] "London"
> # Indexing a data frame
> MyDataFrame[1,]
  age sex
1  31   F
> MyDataFrame[2,]
  age sex
2  32   M
Note that with a data frame, the indexing of subjects is straightforward!

Indexing by name
Names can also be used to index an R object.
> MyList$age
[1] 31 32 40
> MyList["age"]
$age
[1] 31 32 40

> MyList[["age"]]
[1] 31 32 40
> MyDataFrame["Age"]
  Age
1  31
2  32
3  40
4  50
> MyDataFrame[1]
  Age
1  31
2  32
3  40
4  50
> MyDataFrame[[1]]
[1] 31 32 40 50
What is the main difference between [ ] and [[ ]]?
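One way to explore the question is to look at the class of what each form returns. The sketch below is an added illustration; the names sub and elt are made up.

```r
MyList <- list(age = c(31, 32, 40),
               sex = factor(c("F", "M", "M", "F")),
               meta = c("London", "School"))
sub <- MyList["age"]     # single brackets: a list of length 1
elt <- MyList[["age"]]   # double brackets: the component itself
print(class(sub))
print(class(elt))
print(mean(elt))         # arithmetic works on the extracted vector
```

In this sketch, single brackets return a smaller object of the same kind, whereas double brackets (like $) pull out one component.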

Conditional indexing
Indexing can be conditional on another variable!
> pain <- c(0,3,2,2,1)
> sex <- as.factor(c("M","M","F","F","M"))
> age <- c(45,51,45,32,90)
> pain[sex=="M"]
[1] 0 3 1
> pain[age>32]
[1] 0 3 2 1
Ex: Do the same by indexing with F. Do the same with age less than 80.
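The condition inside the brackets is itself a logical vector, which the following sketch makes visible. It is an added illustration; the name keep and the threshold 50 are arbitrary choices, not from the slides.

```r
pain <- c(0, 3, 2, 2, 1)
sex <- as.factor(c("M", "M", "F", "F", "M"))
age <- c(45, 51, 45, 32, 90)

keep <- sex == "M"    # one TRUE/FALSE per subject
print(keep)
print(pain[keep])     # same as pain[sex == "M"]

# Conditions can be combined with & (and) and | (or)
print(pain[sex == "M" & age > 50])
print(which(age > 50))   # positions of the matches, rather than values
```

Storing the condition in a variable first can make longer selections easier to read and reuse.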

Data Input
When using R, one normally starts by reading in data. This can be done by using the read.table function.
> gvhd <- read.table("GvHD.txt", header=TRUE)
> gvhd[1:10,]
   FSC.Height SSC.Height CD4.FITC CD8.B.PE CD3.PerCP CD8.APC
1         321        199      308      220       157     339
2         303        210      319      271       223     350
3         318        170      215      148       119     221
4         202         49      104       49       284     178
5         353        248      262      167       144     156
6         192         68      423       97       344     113
7         322        225      236      214       141     209
8         350        152      258       82       253     205
9         351        223      286      128       172     220
10        269         78      169      289       224     537
Some data sets are also part of R and can be loaded with the data function, e.g. data(iris).
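Since GvHD.txt may not be at hand, here is a self-contained sketch that writes a tiny file first and then reads it back. The file contents and the names path and small are invented for the illustration; they are not the GvHD data.

```r
# Write a small tab-delimited file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("id\tweight\theight",
             "1\t60\t1.75",
             "2\t72\t1.80",
             "3\t75\t1.65"), path)

# header = TRUE: the first line holds the column names
small <- read.table(path, header = TRUE)
print(small)
print(class(small))    # read.table returns a data frame

# A data set that ships with R
data(iris)
print(head(iris, 3))   # first three rows
print(dim(iris))
```

For comma-separated files, read.csv works in the same way.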

Functions and arguments
Many things in R are done using function calls, commands that look like an application of a mathematical function of one or several variables, e.g. log(x), plot(weight, height).
When you use plot(weight, height), R assumes that the first argument is the x variable and the second is the y. If you do not know how to specify the arguments, look at ?plot.
Most function arguments have sensible defaults and can thus be omitted, e.g. plot(weight, height, col=1).
If you do not specify the names of the arguments, the order is very important!
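A sketch of positional versus named arguments, using seq and round rather than plot so that the results can be printed. The variable names a, b, c1 and d are made up for the example.

```r
# seq() has arguments named from, to and by
a  <- seq(1, 10, 3)                    # positional: from = 1, to = 10, by = 3
b  <- seq(from = 1, to = 10, by = 3)   # named: same call, easier to read
c1 <- seq(by = 3, to = 10, from = 1)   # named arguments can come in any order
d  <- seq(10, 1, -3)                   # different positions, different meaning
print(a)
print(d)

# Default values: digits is 0 unless stated otherwise
print(round(3.14159))
print(round(3.14159, digits = 2))
```

Naming the arguments costs a little typing but protects against getting the order wrong.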

Libraries
Many contributed functionalities of R are available in R packages/libraries. Some of these are distributed with R while others need to be downloaded and installed separately.
> library(survival)           # distributed with R
> install.packages("samr")    # download and install (needed only once)
> library(samr)               # then load it

R programming
R is a true programming language.
for/while loops
if statement
> x <- -2
> if(x>0) {
+   print(x)
+ } else {
+   print(-x)
+ }
[1] 2
> if(x>0) {
+   print(x)
+ } else if(x==0) {
+   print(0)
+ } else {
+   print(-x)
+ }
[1] 2
> # For loops
> n <- 1000000
> x <- rnorm(n,10,1)
> y <- x^2
> y <- rep(0,n)
> for(i in 1:n) {
+   y[i] <- sqrt(x[i])
+ }
> # While loops
> counter <- 1
> while(counter<=n) {
+   y[counter] <- sqrt(x[counter])
+   counter <- counter+1
+ }
Apply sqrt to x as a vector and compare the execution speed.
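One possible way to carry out the comparison is with system.time, as sketched below. The names y.loop, y.vec, t.loop and t.vec are made up; actual timings depend on the machine, so none are quoted here.

```r
n <- 1000000
x <- rnorm(n, 10, 1)   # mean 10, sd 1: negative values are extremely unlikely

# Loop version: one element at a time
y.loop <- rep(0, n)
t.loop <- system.time(
  for (i in 1:n) {
    y.loop[i] <- sqrt(x[i])
  }
)

# Vectorized version: one call on the whole vector
t.vec <- system.time(y.vec <- sqrt(x))

print(t.loop["elapsed"])
print(t.vec["elapsed"])
print(all.equal(y.loop, y.vec))   # both approaches give the same values
```

The two results agree; only the time taken to obtain them differs.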

Creating your own function
As with other programming languages, you can create your own function.
> MyFirstFunction <- function(YourName, MyName="Raphael", number=0) {
+   if(number==0) {
+     return(YourName)
+   } else {
+     return(MyName)
+   }
+ }
> MyFirstFunction("Francis", number=1)
[1] "Raphael"
> MyFirstFunction("Francis", number=0)
[1] "Francis"
Ex: Try creating a function to compute the inverse of a number. Print a warning if x=0.

Creating your own function
A more advanced example beyond the scope of this workshop.
> MySqrt <- function(y) {
+   x <- y/2
+   while(abs(x*x-y)>1e-10) {
+     x <- (x+y/x)/2
+   }
+   x
+ }
> MySqrt(81)
[1] 9
> MySqrt(101)
[1] 10.04988
Based on Newton's method
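For readers curious about where the update comes from: applying Newton's step x - f(x)/f'(x) to f(x) = x^2 - y gives x - (x^2 - y)/(2x) = (x + y/x)/2, which is the line inside the loop. The sketch below repeats the same iteration but records each intermediate value; MySqrtTrace is a made-up name for this illustration, and it assumes y is positive.

```r
# Same iteration as MySqrt, keeping every intermediate approximation
MySqrtTrace <- function(y) {
  x <- y / 2
  steps <- x
  while (abs(x * x - y) > 1e-10) {
    x <- (x + y / x) / 2     # Newton step for f(x) = x^2 - y
    steps <- c(steps, x)
  }
  steps
}

steps <- MySqrtTrace(81)
print(steps)                                   # successive approximations
print(abs(steps[length(steps)] - sqrt(81)))    # distance from R's built-in sqrt
```

Printing the steps shows how quickly the approximations settle down once they get close to the answer.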