Exploratory Data Analysis and Essential Statistics using R - PowerPoint PPT Presentation

About This Presentation
Title:

Exploratory Data Analysis and Essential Statistics using R

Description:

Exploratory Data Analysis and Essential Statistics using R Aline Tabet University of British Columbia – PowerPoint PPT presentation

Slides: 40
Provided by: bioinforma55

Transcript and Presenter's Notes

Title: Exploratory Data Analysis and Essential Statistics using R


1
Exploratory Data Analysis and Essential
Statistics using R
  • Aline Tabet
  • University of British Columbia

3
About this workshop
  • This workshop will not turn you into a statistician
    or an R expert!
  • Instead, you will become statistics- and R-aware
  • This might push you to learn more about R and
    statistics

3
Day 1
4
Goal
  • How to display statistical information properly
  • Understand basic concepts: What is a p-value? Two
    sample t-test or paired t-test? Why do we need
    multiple testing?
  • Get a first exposure to the R statistical
    language

5
Outline
  • A bit of history
  • Module 1: R Basics
  • Module 2: Exploratory Data Analysis
  • Module 3: Hypothesis testing
  • Module 4: Data reduction (PCA)
  • Module 5: Clustering and classification
  • Module 6: Regression and correlation

6
Statistics in the news
The prize winner was a team of statisticians,
machine-learning experts and computer engineers
from the United States, Austria, Canada and
Israel, calling itself BellKor's Pragmatic Chaos.
Sep 21, 2009
8
Statistics in the news
"I keep saying that the sexy job in the next 10
years will be statisticians," said Hal Varian,
chief economist at Google. "And I'm not kidding."
Aug 5, 2009
9
Statistics in the news
"R is really important to the point that it's
hard to overvalue it," said Daryl Pregibon, a
research scientist at Google, which uses the
software widely. "It allows statisticians to do
very intricate and complicated analyses without
knowing the blood and guts of computing systems."
Jan 6, 2009
10
Statistics in the news
Medical journal editors should require
independent analysis of industry-sponsored trial
data by an academic statistician before
publishing results, according to an editorial
published in the March 24/31 Journal of the
American Medical Association.
April 19, 2010.
11
History
  • R is the son of S
  • S is a statistical programming language developed
    by John Chambers at Bell Labs
  • The goal of S was to turn ideas into software,
    quickly and faithfully
  • S was created in 1976
  • The New S language arrived in 1988 (the Blue Book) and
    introduced many changes (e.g. from macros to functions)

12
History
  • Version 4 of S was released in 1998 and introduced a
    formal class-method model
  • In 1993, StatSci (maker of S-Plus) acquired an
    exclusive license to S
  • S-Plus integrates S with a GUI and
    full customer support
  • R was created by Ross Ihaka and Robert Gentleman
    at the University of Auckland, New Zealand

13
History
  • The R project started in 1991
  • R first appeared in 1996 as open-source
    software!
  • Highly customizable via packages
  • An R-based community: the power of collaboration, with
    thousands of packages freely available
  • Many commercial variants of R exist
    (http://www.revolution-computing.com/)

14
Bioconductor
  • Started by Robert Gentleman in 2001
  • Based at the Fred Hutchinson Cancer Research
    Center
  • Collection of packages for the analysis and
    comprehension of genomic data
  • Uses R and is of course free, open source and
    open to outside contributors
  • Contains hundreds of packages, from microarray
    analysis to next-generation sequencing

15
What is R?
  • R is an integrated suite of software facilities
    for data manipulation, calculation and graphical
    display. It includes:
  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays,
    in particular matrices,
  • a large, coherent, integrated collection of
    intermediate tools for data analysis,
  • graphical facilities for data analysis and
    display either on-screen or on hardcopy, and
  • a well-developed, simple and effective programming
    language which includes conditionals, loops,
    user-defined recursive functions and input and
    output facilities.

16
References
  • Introductory Statistics with R by Peter Dalgaard
  • R reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf
  • R tutorial: http://www.cyclismo.org/tutorial/R/
  • r-project.org and bioconductor.org

17
Module 1: R basics
Aline Tabet
Exploratory Data Analysis and Essential Statistics using R
Sept 30 - Oct 1, 2010
19
An overgrown calculator
> 2+2
[1] 4
> exp(-2)
[1] 0.1353353
> pi
[1] 3.141593
> sin(2*pi)
[1] -2.449294e-16
> cos(2*pi)
[1] 1
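The value printed for sin(2*pi) is tiny but not exactly zero, because pi is stored with finite floating-point precision. A minimal sketch of comparing such results with a tolerance, using base R's all.equal (the variable names here are illustrative and not from the slides):

```r
# sin(2*pi) is mathematically 0, but rounding leaves a tiny residue
s <- sin(2 * pi)

# Exact comparison fails on the residue
exactly_zero <- (s == 0)

# all.equal() compares up to a small numerical tolerance
nearly_zero <- isTRUE(all.equal(s, 0))

print(exactly_zero)
print(nearly_zero)
```

Wrapping all.equal in isTRUE is the usual idiom, since all.equal returns a description of the difference rather than FALSE when the values differ.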
19
Day 1 - Section 1
20
Getting help
help(pi)   # equivalent to ?pi
?sqrt
?sin
?Special
What if we do not know the name of the
function/object?
We can use help.search by specifying a keyword:
help.search("trigonometry")   # equivalent to ??trigonometry
Even on a calculator we need some way to store
intermediate results.
21
Assignment
> x <- 2
> y <- 2
> x + y
[1] 4
Tips: Avoid single-letter names, be explicit, and
separate words with dots or capitals, e.g.
MyFavoriteVariable
22
Vectorized arithmetic
We cannot do much statistics with a single
number! We need a way to store a sequence/list of
numbers.
One can simply concatenate elements with the c
function.
> weight <- c(60,72,75,90,95,72)
> weight[1]
[1] 60
> weight[2]
[1] 72
> weight
[1] 60 72 75 90 95 72
> height <- c(1.75,1.80,1.65,1.90,1.74,1.91)
> bmi <- weight/height^2   # vector-based operation
Note: Vector-based operations are much faster than explicit loops!
Ex: Find at least one other way to create a
vector.
Note: c can be used to concatenate strings and
numbers; the numbers are then converted to strings.
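As a rough illustration of the points on this slide, the sketch below computes all BMI values in one vectorized expression and shows what happens when c mixes strings and numbers. The names ids, grid, zeros and mixed are made up for this example.

```r
# Data from the slide
weight <- c(60, 72, 75, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)

# One vectorized expression computes all six BMI values at once
bmi <- weight / height^2

# A few other base R ways to build a vector
ids   <- 1:6                    # integer sequence
grid  <- seq(0, 1, by = 0.25)   # regular grid from 0 to 1
zeros <- rep(0, 6)              # the same value repeated

# Mixing strings and numbers coerces every element to character
mixed <- c("weight", 60)
print(class(mixed))
```

Because a vector holds a single type, the number 60 in mixed is stored as the string "60" and can no longer be used in arithmetic without converting it back.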
23
Vectors
We have three types of vectors: numeric, logical,
and character.
> # Numeric vectors
> x <- c(1,5,8)
> x
[1] 1 5 8
> # Logical vectors
> x <- c(TRUE,TRUE,FALSE,TRUE)
> x
[1]  TRUE  TRUE FALSE  TRUE
> # Character vectors
> x <- c("Hello","my","name","is","Francis")
> x
[1] "Hello"   "my"      "name"    "is"      "Francis"
Ex: Create a vector with the following elements:
1, 3, 10, -1; call your vector x. Take the square
root of x. Take the log of (1+x).
24
Missing and special values
We have already encountered the NaN symbol,
meaning not-a-number, and Inf, -Inf. In practical
data analysis a data point is frequently
unavailable. In R, missing values are denoted by
NA.
Depending on the context, R provides different
ways to deal with missing values.
> weight <- c(60,72,75,90,NA,72)
> mean(weight)
[1] NA
> mean(weight, na.rm=TRUE)
[1] 73.8
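To make the special values concrete, here is a small sketch that produces NaN and -Inf and then locates and drops an NA by hand. The names root, logs, n_missing and complete are illustrative; suppressWarnings is used only to silence the warning that sqrt gives for a negative input.

```r
# Special values arise from ordinary arithmetic
x <- c(1, 3, 10, -1)
root <- suppressWarnings(sqrt(x))  # sqrt(-1) yields NaN
logs <- log(1 + x)                 # log(0) yields -Inf

# Missing values are detected with is.na(), never with ==
weight <- c(60, 72, 75, 90, NA, 72)
n_missing <- sum(is.na(weight))      # number of missing entries
complete  <- weight[!is.na(weight)]  # keep only observed values

print(root)
print(logs)
print(mean(complete))
```

Taking the mean of complete should agree with mean(weight, na.rm=TRUE), since na.rm simply removes the missing entries before computing.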
25
Matrices and Arrays
A matrix is a two-dimensional array of numbers.
Matrices can be used to perform statistical
operations (linear algebra). However, they can
also be used to hold tables.
> x <- 1:12
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> length(x)
[1] 12
> dim(x)
NULL
> dim(x) <- c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> x <- matrix(1:12, nrow=3, byrow=TRUE)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> x <- matrix(1:12, nrow=3, byrow=FALSE)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> rownames(x) <- c("A","B","C")
> x
  [,1] [,2] [,3] [,4]
A    1    4    7   10
B    2    5    8   11
C    3    6    9   12
> colnames(x) <- c("1","2","x","y")
> x
  1 2 x  y
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12
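The slide mentions that matrices support linear algebra. A minimal sketch of a few base R matrix operations follows; the names xt, g, col_means and doubled are chosen for this example only.

```r
# A 3 x 4 matrix filled column by column
x <- matrix(1:12, nrow = 3)

xt <- t(x)                 # transpose: 4 rows, 3 columns
g  <- x %*% xt             # matrix product: 3 x 3
col_means <- colMeans(x)   # one mean per column
doubled <- 2 * x           # arithmetic operators act element by element

print(dim(xt))
print(g)
print(col_means)
```

Note the distinction between %*%, which is the matrix product, and *, which multiplies element by element.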
26
Matrices and Arrays
Matrices can also be formed by gluing rows and
columns together using rbind and cbind. This is the
equivalent of c for vectors.
> x1 <- 1:4
> x2 <- 5:8
> y1 <- c(3,9)
> MyMatrix <- rbind(x1,x2)
> MyMatrix
   [,1] [,2] [,3] [,4]
x1    1    2    3    4
x2    5    6    7    8
> MyNewMatrix <- cbind(MyMatrix,y1)
> MyNewMatrix
           y1
x1 1 2 3 4  3
x2 5 6 7 8  9
27
Factors
It is common to have categorical data in
statistical data analysis (e.g. Male/Female). In
R such variables are referred to as factors.
Factors make it possible to assign meaningful names to
categories. A factor has a set of levels.
> pain <- c(0,3,2,2,1)
> fpain <- as.factor(c(0,3,2,2,1))
> levels(fpain) <- c("none","mild","medium","severe")
> is.factor(fpain)
[1] TRUE
> is.vector(fpain)
[1] FALSE
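The same factor can be built in one step by stating the levels and their labels explicitly, which should give the same result as the two-step version on the slide. This is a sketch using base R's factor and table; counts and codes are illustrative names.

```r
pain <- c(0, 3, 2, 2, 1)

# Map the codes 0..3 to named categories in a single call
fpain <- factor(pain, levels = 0:3,
                labels = c("none", "mild", "medium", "severe"))

counts <- table(fpain)       # number of observations per level
codes  <- as.integer(fpain)  # underlying integer codes (1 = first level)

print(fpain)
print(counts)
```

Stating the levels explicitly also keeps a category in the factor even when no observation falls into it, which matters when tabulating small samples.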
28
Lists
Lists can be used to combine objects (of
possibly different kinds/sizes) into a larger
composite object.
> x <- c(31,32,40)
> y <- as.factor(c("F","M","M","F"))
> z <- c("London","School")
> MyList <- list(age=x, sex=y, meta=z)
> MyList
$age
[1] 31 32 40

$sex
[1] F M M F
Levels: F M

$meta
[1] "London" "School"

> MyList$age
[1] 31 32 40
The components of the list are named according to
the arguments used. Named components can be
accessed with the $ operator.
29
Data Frames
A data frame is a data matrix or a data set.
It is a list of vectors and/or factors of the
same length that are related across, such that
data in the same position come from the same
experimental unit (subject, animal, etc.).
> MyDataFrame <- data.frame(age=c(31,32,40,50), sex=y)
> MyDataFrame
  age sex
1  31   F
2  32   M
3  40   M
4  50   F
> MyDataFrame$age
[1] 31 32 40 50
> is.vector(MyDataFrame$age)
[1] TRUE
> is.vector(MyDataFrame$sex)
[1] FALSE
Why do we need data frames if a data frame is simply a
list?
More efficient storage and indexing!
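A short sketch of what the data frame structure buys in practice: it is still a list, but it can also be indexed by row, and its columns can be summarised by group. The sex factor is redefined here so the example is self-contained, and the result names are illustrative.

```r
sex <- as.factor(c("F", "M", "M", "F"))
MyDataFrame <- data.frame(age = c(31, 32, 40, 50), sex = sex)

is_a_list   <- is.list(MyDataFrame)  # a data frame is a list of columns
n_subjects  <- nrow(MyDataFrame)     # one row per experimental unit
n_variables <- ncol(MyDataFrame)     # one column per variable

second_subject  <- MyDataFrame[2, ]  # all variables for subject 2
mean_age_by_sex <- tapply(MyDataFrame$age, MyDataFrame$sex, mean)

print(second_subject)
print(mean_age_by_sex)
```

Row indexing is the part a plain list cannot offer: one expression returns every variable recorded for a given subject.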
30
Names
Names of an R object can be accessed and/or
modified with the names function (method).
> x <- rep(1:3)
> names(x)
NULL
> names(x) <- c("a","b","c")
> MyDataFrame <- data.frame(age=c(31,32,40,50), sex=y)
> MyDataFrame
  age sex
1  31   F
2  32   M
3  40   M
4  50   F
> names(MyDataFrame)
[1] "age" "sex"
> names(MyDataFrame) <- c("age","gender")
> names(MyDataFrame)[1] <- c("Age")
Remark: Give explicit names to variables.
Names can be used for indexing.
31
Indexing
Indexing is a great way to directly access
elements of interest.
> # Indexing a vector
> pain <- c(0,3,2,2,1)
> pain[1]
[1] 0
> pain[2]
[1] 3
> pain[1:2]
[1] 0 3
> pain[c(1,3)]
[1] 0 2
> pain[-5]
[1] 0 3 2 2
> # Indexing a matrix
> MyMatrix[1,1]
x1
 1
> MyMatrix[1,]
[1] 1 2 3 4
> MyMatrix[,1]
x1 x2
 1  5
> MyMatrix[,-2]
   [,1] [,2] [,3]
x1    1    3    4
x2    5    7    8
> # Indexing a list is done in the same way
> MyList[3]
$meta
[1] "London" "School"

> MyList[[3]]
[1] "London" "School"
> MyList[[3]][1]
[1] "London"
> # Indexing a data frame
> MyDataFrame[1,]
  age sex
1  31   F
> MyDataFrame[2,]
  age sex
2  32   M
Note that with a data frame, indexing by
subject (row) is straightforward!
32
Indexing by name
Names can also be used to index an R object.
> MyList$age
[1] 31 32 40
> MyList["age"]
$age
[1] 31 32 40

> MyList[["age"]]
[1] 31 32 40
> MyDataFrame["Age"]
  Age
1  31
2  32
3  40
4  50
> MyDataFrame[1]
  Age
1  31
2  32
3  40
4  50
> MyDataFrame[[1]]
[1] 31 32 40 50
What is the main difference between [ ] and [[ ]]?
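One way to explore the question on this slide is to check the class of what each operator returns. A minimal sketch, with sub_list and element as illustrative names:

```r
MyList <- list(age = c(31, 32, 40), meta = c("London", "School"))

# Single brackets return a smaller list that still wraps the component
sub_list <- MyList["age"]

# Double brackets (and $) extract the component itself
element <- MyList[["age"]]

print(class(sub_list))
print(class(element))
print(identical(element, MyList$age))
```

In practice this means arithmetic such as mean works directly on the result of [[ ]] or $, while the result of [ ] first has to be unwrapped.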
33
Conditional indexing
Indexing can be conditional on another variable!
> pain <- c(0,3,2,2,1)
> sex <- as.factor(c("M","M","F","F","M"))
> age <- c(45,51,45,32,90)
> pain[sex=="M"]
[1] 0 3 1
> pain[age>32]
[1] 0 3 2 1
Ex: Do the same by indexing with F. Do the same
with age less than 80.
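Conditional indexing works because a comparison such as sex=="M" produces a logical vector, and only the positions that are TRUE are kept. The sketch below stores that intermediate vector and combines two conditions; is_male, older_men and positions are names invented for this example.

```r
pain <- c(0, 3, 2, 2, 1)
sex  <- as.factor(c("M", "M", "F", "F", "M"))
age  <- c(45, 51, 45, 32, 90)

# The condition itself is a logical vector, one entry per subject
is_male <- sex == "M"

# Conditions can be combined element-wise with & (and) or | (or)
older_men <- pain[is_male & age > 50]

# which() turns a logical vector into the matching positions
positions <- which(age > 32)

print(is_male)
print(older_men)
print(positions)
```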
34
Data Input
When using R, one normally starts by reading in
data. This can be done by using the read.table
function.
> gvhd <- read.table("GvHD.txt", header=TRUE)
> gvhd[1:10,]
   FSC.Height SSC.Height CD4.FITC CD8.B.PE CD3.PerCP CD8.APC
1         321        199      308      220       157     339
2         303        210      319      271       223     350
3         318        170      215      148       119     221
4         202         49      104       49       284     178
5         353        248      262      167       144     156
6         192         68      423       97       344     113
7         322        225      236      214       141     209
8         350        152      258       82       253     205
9         351        223      286      128       172     220
10        269         78      169      289       224     537
Some data sets are also part of R and can be
loaded with the data function, e.g. data(iris).
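Since the GvHD.txt file is not included here, the sketch below reads the same kind of whitespace-separated table from a character string through the text argument of read.table. The three data rows are copied from the output on the slide; the names raw and gvhd_head are assumptions for this example.

```r
# A small whitespace-separated table with a header line,
# standing in for the contents of a file such as GvHD.txt
raw <- "FSC.Height SSC.Height CD4.FITC CD8.B.PE CD3.PerCP CD8.APC
321 199 308 220 157 339
303 210 319 271 223 350
318 170 215 148 119 221"

# header = TRUE tells read.table that the first line holds column names
gvhd_head <- read.table(text = raw, header = TRUE)

print(dim(gvhd_head))
print(names(gvhd_head))
print(gvhd_head[1:2, ])
```

With a real file, the text argument would be replaced by the file name as the first argument, as on the slide.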
35
Functions and arguments
Many things in R are done using function calls,
commands that look like an application of a
mathematical function of one or several
variables, e.g. log(x), plot(weight, height).
When you use plot(weight, height), R assumes that
the first argument is the x variable and the
second is the y. If you do not know how to
specify the arguments, look at ?plot.
Most function arguments have sensible defaults and
can thus be omitted, e.g. plot(weight,
height, col=1).
If you do not specify the names of the arguments,
the order is very important!
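The effect of argument order can be seen without plotting, for instance with log, whose arguments are named x and base. This is only an illustrative sketch; the result names are made up.

```r
# Named arguments can be given in any order
by_name  <- log(x = 8, base = 2)
reversed <- log(base = 2, x = 8)

# Unnamed arguments are matched by position: x first, then base
by_position <- log(8, 2)
swapped     <- log(2, 8)   # now x is 2 and base is 8

# Omitted arguments fall back to their defaults (natural logarithm here)
natural <- log(8)

print(c(by_name, reversed, by_position, swapped, natural))
```

Swapping the two unnamed values silently computes a different quantity, which is why naming arguments is the safer habit for anything beyond the first one or two.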
36
Libraries
Many contributed functionalities of R are
available in R packages/libraries. Some of these
are distributed with R, while others need to be
downloaded and installed separately.
library(survival)          # load an installed package
install.packages("samr")   # download and install a contributed package
library(samr)              # load it once it is installed
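Calling library on a package that is not installed stops with an error. One common pattern, sketched here with base R's requireNamespace, is to check for the package first; has_survival is an illustrative name and nothing is installed by this code.

```r
# requireNamespace() returns TRUE or FALSE instead of raising an error
has_survival <- requireNamespace("survival", quietly = TRUE)

if (has_survival) {
  library(survival)   # attach the package so its functions can be used
} else {
  message("Package 'survival' is not installed; ",
          "install.packages(\"survival\") would fetch it.")
}

print(has_survival)
```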
37
R programming
R is a true programming language, with
for/while loops and
if statements.
> x <- -2
> if(x>0){
+   print(x)
+ } else {
+   print(-x)
+ }
[1] 2
> if(x>0){
+   print(x)
+ } else if(x==0){
+   print(0)
+ } else {
+   print(-x)
+ }
[1] 2
# For loops
n <- 1000000
x <- rnorm(n,10,1)
y <- x^2
y <- rep(0,n)
for(i in 1:n){
  y[i] <- sqrt(x[i])
}
# While loops
counter <- 1
while(counter<=n){
  y[counter] <- sqrt(x[counter])
  counter <- counter+1
}

Ex: Apply sqrt to x as a vector and compare the
execution speed.
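One way to compare the two approaches is base R's system.time, sketched below. A smaller n than on the slide is used so the example runs quickly, set.seed makes the random draw reproducible, and the timing variable names are illustrative. Actual timings depend on the machine, so none are quoted.

```r
set.seed(1)
n <- 100000
x <- rnorm(n, 10, 1)

# Element-by-element loop, as on the slide
loop_time <- system.time({
  y_loop <- rep(0, n)
  for (i in 1:n) {
    y_loop[i] <- sqrt(x[i])
  }
})

# The same computation as a single vectorized call
vector_time <- system.time(y_vec <- sqrt(x))

print(loop_time["elapsed"])
print(vector_time["elapsed"])
print(all.equal(y_loop, y_vec))
```

Both versions should produce the same values; only the time taken differs.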
38
Creating your own function
As with other programming languages, you can
create your own function.
> MyFirstFunction <- function(YourName, MyName="Raphael", number=0){
+   if(number==0){
+     return(YourName)
+   } else {
+     return(MyName)
+   }
+ }
> MyFirstFunction("Francis", number=1)
[1] "Raphael"
> MyFirstFunction("Francis", number=0)
[1] "Francis"
Ex: Try creating a function to compute the
inverse of a number. Print a warning if x = 0.
39
Creating your own function
A more advanced example, beyond the scope of this
workshop.
> MySqrt <- function(y){
+   x <- y/2
+   while(abs(x*x-y)>1e-10){
+     x <- (x+y/x)/2
+   }
+   x
+ }
> MySqrt(81)
[1] 9
> MySqrt(101)
[1] 10.04988
Based on Newton's method
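For readers curious where the update inside MySqrt comes from, the following is a sketch of the standard Newton iteration applied to the function whose positive root is the square root of y. The symbols f and x_n are notation introduced here, not taken from the slides.

```latex
f(x) = x^2 - y, \qquad f'(x) = 2x

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
        = x_n - \frac{x_n^2 - y}{2 x_n}
        = \frac{1}{2}\left( x_n + \frac{y}{x_n} \right)

x_0 = \frac{y}{2}, \qquad \text{stop when } \left| x_n^2 - y \right| \le 10^{-10}
```

The last form of the update is exactly the line x <- (x+y/x)/2 in the function, and the stopping rule mirrors the condition of its while loop.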