R basics

About This Presentation

Title:

R basics

Description:

Fishers Exact Test Tests proportions in a 2x2 table by ... while chi-squre is an approximation Chi-square has problems with 0 count bins Chi-square is ... – PowerPoint PPT presentation

Number of Views:337

Avg rating:3.0/5.0

Slides: 245

Provided by: AlbinSa4

Category:

more less

Transcript and Presenter's Notes

Title: R basics

1
R basics

Albin Sandelin

2
Ultra-short R introduction

Most life scientists use spreadsheet programs for
analysis - like excel or statistica. Why?
Ease of use
- click buttons, select data by hand
you see the data in front of you
you can do limited programming

Disadvantages
Hard to handle large dataset (gt100 points)
Inflexible, few analyses available
Dumb-down click button effect
Hard to repeat analyses systematically with new
data

4
R

R is a computational environment - somewhere
between a program and a programming language
No buttons, no wizards only a command line
interface
Is a professional statistics toolset - likely has
all the analyses you will ever need in your
career
Is also a programming language
Can handle large datasets
Very powerful graphics
The state-of-the-art statistics program for
bioinformatics, especially array analysis
Free, and open source!

5
R in this course

In this course, we will use R for
Exploring data
Select subsets of data
Plotting data
Statistical tests on data
etc
almost everything.
It is the most important tool in this course!

6
R is hard

R is challenging to learn
No visual cues
Very different from spreadsheets
Hard to find the method to use
Cryptic manual pages (at times)
It requires a lot of effort to get inside the
system, but bear in mind that it is worth it -
you gain an edge to all spread-sheet users

7
Reading material

Keep the reference sheet handy
Save code from lecture exercises in a text file
Powerpoint slides are available after lectures at
the web page
Can also recommend the tutorials on the CRAN
website this is how I learned R
Dahlgaard book is very good, also for future
reference

8
Expected

You will have R open at all times during this
course
We will make MANY smaller exercises during class,
most often in pairs
Understanding these exercises is the key to
succeeding in the exam
We will have at least two teachers present at al
lectures - use us to get help when needed

NB It is YOUR responsibility to that you
understand the examples/exercises
if not, you have to ask for clarification during
the walkthrough of the problems

10
First challenge of the day

Start R
Type
demo(graphics)
Hit enter a few times

11
Starting with R

We will now walk through the most basic R
concepts
We will add successive detail on this with many
exercises - and in a while get over to plotting,
basic statistics, and finally some programming

12
Challenging?

Those who know R or have a math background might
find most things for today no-so-challenging
We cover this to keep everyone on the same ground

13
Vectors

In statistics, we are almost always dealing with
several data points
A vector is an collection of numbers and/or
strings
(albin, sanne, jens)
(1, 5.2, 10, 7, 2, 21)
(3)
The last example is a vector of length 1

In R, we make a vector by the c() command (for
concatenate)
gt c(1,5,10, 7, 2, 1)
1 1 5 10 7 2 1
gt c('albin', 'sanne', 'jens')
1 "albin" "sanne" "jens"
When we input strings or characters, we have to
surround them with or
If making vectors of size 1, we can skip c()
gt 3
1 3

Method name
Method arguments
Method output is also called return value(s)
15

Challenge
1) Make the following vector
45,5,12,10
2) What happens with the following command?
c(1100)
c(502)
A vector is a data structure, and the most
fundamental in R almost everything in R is some
kind of vector, although sometimes in several
dimensions vectors within vectors

16
The reference sheet is your friend!

You will get over-whelmed by different command
names fast
Use the reference sheet to remind yourself in all
exercises

17
Assignment to memory

The c() command is almost useless in itself - we
want to keep the vector for other analyses
The assignment concept

gt 45 add 4 and 5 1 9 gt alt-4
store 4 as a gt blt-5 store 5 as
b gt a just checking 1 4 gt b 1 5 gt ab
add ab (45) 1 9 yes,
works
Anything after a will be disregarded by R. Used
for making comments
18
Expanding this to a whole vector my_vectorlt-
c(1,5,10, 7, 2) Note that there is no return
value now - this is caught by the my_vector.
What happens if you type my_vector after
this? my_vector is a variable, with the variable
name my_vector. Variable names are totally
arbitrary! The anatomy of the vector
Name my_vector
Values 1 5 10 7 2
Index 1 2 3 4 5
19
We can access part of the vector like
thismy_vector5 will give the 5th item in the
vectorWhat happens if you do this?
my_vectorlt- c(1,5,10, 7, 2) define the
vectormy_vector c(1,3,5)my_vector14my_v
ector41
20
gt my_vectorc(1,3,5) 1 1 10 2 gt
my_vector14 1 1 5 10 7
Making a series of integers AB will give a
vector of A,A1, A2B
gtmy_vector41 also works backwards 1 7 10
5 1
21
Challenge

Using the reference sheet, figure out at least
three ways of making R print your vector in the
other direction -

my_vectorlt- c(1,5,10, 7, 2) define the vector
22

gt my_vectorc(5,4,3,2,1)
1 2 7 10 5 1
gt my_vectorc(51)
1 2 7 10 5 1
gt rev(my_vector)
1 2 7 10 5 1
gt

23
Interlude naming rules and the danger of
over-writing

Naming We can name vectors to almost anything.
The most basic rule is Never start a vector
name with a number
gt alt- c(1,5,4,2) OK
gt 1alt- c(1,5,4,2) NOT OK
Error syntax error
gt a1lt- c(1,5,4,2) OK
Over-writing
my_vectorlt- c(1,5,10, 7, 2)
my_vectorlt- c(10,5,2, 3, 1)
what does my_vector contain now?

24
Analyzing vectors

Many functions work directly on vectors - most
have logical names
For instance, length(my_vector) gives the number
of items in the vector ( 5)
Challenge
Make a vector big_vector with values 1 to 10000
using the ab construct
What is the
Length of the vector
Sum of all items in vector the sum() function
Mean( average) of all items in the vector the
mean() function

25
gt big_vectorlt-(110000) gt length(big_vector) 1
10000 gt sum(big_vector) 1 50005000 gt
mean(big_vector) 1 5000.5
26
Interlude the R help system

R has hundreds of functions where most have a
non-trivial syntax The easiest way of retrieving
documentation is by saying
?function_name OR
help(function_name) OR
help.search(something) for fuzzy searches
Look at the help for sample() and sort() and then
try them out on big_vector
Also try out RSiteSearch(sample) what happens

27
Adding and multiplying a number to a vector

Sometimes we want to add a number, like 10, to
each element in the vector
gt big_vector 10
Test this
big_vector2lt-big_vector 10
Also test
min(big_vector)
max(big_vector)
min(big_vector2)
max(big_vector2)
What happens?

28
gt big_vector2lt-big_vector 10 gt
min(big_vector) 1 1 gt max(big_vector) 1
10000 gt min(big_vector2) 1 11 gt
max(big_vector2) 1 10010
29
Adding vectors

We can also add one vector to another vector
Say that we have the three vectors
Alt-c(10, 20, 30, 50)
Blt-c(1,4,2,3)
Clt-c(2.5, 3.5)
Test what happens and explain the outcome
AB
AC

30
gt AB1 11 24 32 53gt AC1 12.5 23.5 32.5
53.5
Alt-c(10, 20, 30, 50) Blt-c(1, 4, 2, 3
) Clt-c(2.5,3.5 )

AB is easy to understand A1B1 , etc
AC is trickier - the C vector is just of length
2. It is re-used! So
A1C112.5
A2C223.5
A3C132.5
A4C253.5
Actually, this is what is happening also with
A10 the 10 is used many times.

31
Plotting vectors

Lets make up some semi-random data
datlt-rnorm (100) draw 100 random, normal
distributed data points
Test the following
plot(dat)
plot(dat,typel)
barplot(dat)
hist(dat)
What is the difference?

32
Why are your three first plots different than
mine? Why is your last plot more similar to mine?
33
Graph options

Generally, you can give extra options to
graphical commands like this
plot (some_vector, colblue, typel)
In plot try to vary the following options - note
that you can use several at once (and figure out
what they do)
typel, b
col hotpink
mainimportant plot
typeh
typeS
These options are really arguments to the plot()
function

34
More about functions

In almost all cases, a function needs some input,
like plot(dat).
dat here is an unnamed argument, and this works
because plot() assumes we mean x values
dat.
We could also say plot(xdat) - a named
argument. If you have many arguments, most of
them are named - such as
plot (some_vector, colblue, types)
The help pages will tell you what type of
arguments you can use

35
The danger of unnamed arguments

is that the order of them will make a big
difference. Try this out - what is the difference
between the plot commands?
alt-rnorm(100)
blt-rnorm(100)2
plot(a,b)
plot(b,a)
plot(xb, ya)

36
b
plot(a,b) plot(b,a) plot(xb, ya)
a
a
b
a
b
37
Some generic R arguments to plots - the par()
function

The par() function is used to set general plot
properties. It has hundreds of possible arguments
- see ?par
Two very handy par() arguments is mfrow() and
mfcol() - these will allow many plots in one page
You give these functions a vector of length 2 -
this gives the number of cells in the page (see
example)

38
Example

gt par( mfrowc(3,1) )
gt plot (a,b)
gt plot (b,a)
gt plot(xb, ya)
gt par( mfrowc(2,2) )
gt plot (a,b)
gt plot (b,a)
gt plot(xb, ya)

39
Challenge - can you get the three plots in a row
using mfrow?
40
gt par( mfrowc(1,3) ) gt plot (a,b) gt plot (b,a) gt
plot(xb, ya)
41
Overlaying plots

Sometimes we want to put may data sets within one
graph, on top of each other
This is often made by the lines() or points()
command, like this

42
gt plot(b, type"l", col"blue") gt lines(a,
col"red")
Why did I start with plotting b? What would have
happed if using points() instead of lines()?
43
Sizing graphs

Simple concept, but awkward to write
Change X scale xlimc(start_value, end_value)
Change Y scale ylimc(start_value, end_value)

44
gt plot(a, type"l", col"blue") gt plot(a,
type"l", col"blue", ylimc(-5,5))
45
Saving graphs

Different on different systems!
All systems can use the original tricky way with
device() see introduction PDFs if you have an
interest. We wont use this.

46
OSX

Click on the graph, and just copy it. Will become
a pdf or a bitmap when pasting.
On my mac you have top open a new file in Preview
and paste there first. On others you can paste it
into other programs directly
Or, just have the graphics window active and do
file-gtsave

47
Windows

Right-click on graph, copy as metafile or bitmap,
paste

48
Linux

Use awkward device() system, like pdf() or
Use GIMP and just make a screen capture from
there by acquire

49
Statistics part 1

Describing the data descriptive statistics
What is the center point of the data?
What is the spread of the data?
How does the distribution look?

50
Summary statistics

hist() ( Histogram) is a graphical way of
summarizing distributions it creates a number
of bins and calculates how many of the data
points that fall into each bin.
We can also summarize by the center points in
the data
mean()

median()
Sort all the data, pick the number in the center.
If the number of data points is even, take the
mean of the two center points
Challenge
We make another vector
dat2lt-rnorm(10)
And add a few extra points to it
dat2lt-c(dat2, 10, 10.5, 30 )
Test mean() and median() on dat2. Are they the
same? Can you explain the differences by plotting
a histogram? What is the advantage/disadvantage
of each measure?

52
gt dat2lt-rnorm(10) gt dat2lt-c(dat2, 10, 10.5, 30
) gt median(dat2) 1 0.11125 gt mean(dat2) 1
3.636139 gt hist(dat2)
Means are sensitive to outliers! Very common
situation in genomics
53
Percentiles

An extension of the median concept
Best explained by example
the 20th percentile is the value (or score) below
which 20 percent of the observations may be
found.
The median is the same as the 50th percentile
The first quartile is the 25th percentile, the
third is the 75th
Try summary(dat) and summary(dat2)

54
summary(dat) Min. 1st Qu. Median Mean
3rd Qu. Max. -2.20600 -0.60480 -0.06930
-0.01333 0.68030 3.49600 gt summary(dat2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.4010 -0.5385 0.1113 3.6360 0.6771 30.0000
The command ecdf() (empirical cumulative
distribution) calculates all percentiles in
your data and also understands plot() Try plot
(ecdf(dat2))
55
What fraction of the data that has been covered
at point X
Sorted values of dat2
56
Boxplots

An easier representation of ECDFs. Is based on
both making boxes that tell us about both center
point and spread of the data
First, calculate the first quartile, the median
and the third quartile
Calculate the inter-quartile range (IQR) 3rd
quartile -1st quartile
These will be used to draw a box
gtboxplot(dat)
gt rug(dat, side2)

57
Interquartile range (IQR)
3rd quartile
Median
1st quartile
58
continued

Sounds more complicated than it is
Any data observation which lies more than
1.5IQR lower than the first quartile is
considered an outlier.
Indicate where the smallest value that is not an
outlier is by a vertical tic mark or "whisker",
and connect the whisker to the box via a
horizontal line.
Do the same for higher values

59
The data point that is highest but within 1.5 IQR
from the center
More extreme data - outliers
3rd quartile
Median
1st quartile
60
(No Transcript)
61
Variance, standard deviation and data spread

What is the difference between these
distributions?

Same mean and median, but different spread over
the x axis
This can be measured by the variance of the data
It is basically the difference between each point
and the mean, squared

Standard deviation is simply variance squared.
This gives nice statistical features that you
will get to know if you study statistics ?

64
Why is variance and standard deviation important?

Variance tells you something about the quality of
measurements
The higher the variance, the harder it is to with
certainty say that two measurements are different
(whiteboard demo)

What are the R functions for variance and
standard deviation?
If we make some random data
smallsetlt-rnorm(100)
largesetlt-rnorm(10000)
What is the variance and standard deviation for
these?
Is the standard deviation really the square root
of the variance (what is the function for square
root?)

gtsmallsetlt-rnorm(100)
gtlargesetlt-rnorm(10000)
gtvar(smallset)
1 0.9504884
gtvar(largeset)
1 0.98416
gtsd(largeset)
1 0.9920484
gtsd(smallset)
1 0.97493
gtsqrt(var(smallset))
1 0.97493

Why do we get about the same variance?
67
(No Transcript)
68
Dealing with real data

involves interacting with messy data from other
people
Almost all interesting data is real
We need methods to get the data into R to start
with.

69
Reading data from files

Typically, you have data in a .txt or .xls file -
how do you get it into R for analysis?
Four steps
1 Format it to tab-separated text file
2 Tell R what directory to work in
3 Read in the file
4 Attach data to memory (optional)

70
What R likes

A data frame(table) with columns. The first
row can be column names, like this

Lecturer height age Poker_games_won
Albin 181 34 14
Jens 189 31 10
Sanne 170 NA 21
Name row, header
First data row
String data
Numerical data
Code for missing data Do not ask ladies for age
71
Reading data(1)

General rules
Use tabs as separators
Each row has to have the same number of columns
Missing data is NA, not empty
As .txt The ideal format!
As .xls save as tabbed text

72
(No Transcript)
73
Get some data to play withUpdate old file is
wrong! Please download the data again

See course web page!
http//people.binf.ku.dk/albin/teaching/htbinf/R/
Download the zipped file save this at a good
place that you know where it is
I will refer to this place on your hard drive as
the data directory

74
Reading data(2)

Tell R where to work.
In R, you always work in a given directory,
called working directory.
You can set this graphically in Windows or MacOS
(next pages)
Or use setwd(some directory)
In Linux, it pays to start R in the same
directory
It pays to have one directory per larger
analysis, and call the data files informative
things (not Data.txt)

75
In OSX
Change working directory
Current working directory
76
In Windows
77
Reading in data (3)

my_datalt- read.table(filename, headerT)

Only if you have column names I always have
that
Data frame object holding all the data
The actual command
78
Lets try

In the data directory, there is a file
flowers.txt. Have a look at it (it is a normal
text file can be opened in almost anything).
Now
my_datalt-read.table(flowers.txt, headerT)
names(my_data)
What happens what does names do?

gt my_datalt-read.table("flowers.txt", hT)
note to self show name completion feature
gt names(my_data)
1 "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width" "Species"

Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa Etc
80
Accessing the columns

There are three ways of getting the data in a
data frame
Get the column by saying
dataframe_namecolumn_name
Get the columns as separate vectors, named as
their header - the attach command
More complex queries, using index operations
(more in a few slides)

81
Columns by names

The attach(some_data_frame) function will put all
the columns of some_data_frame as vectors in
working memory
So, in our case
attach(my_data)
Will give the 5 vectors
"Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width" and "Species"
(which we then can analyze)

82
Dangers of the attach() function

Attach will make a vector out each column, named
after the column name
What happens if we already have a vector with
that name?
What does ls() and detach() do? Why would we
want to do this?

83
Challenge

Using the flowers file,
Plot a histogram of the length of sepals
(Sepal.Length)
Plot the correlation between sepal length and
sepal width (an x-y plot)
What is the mean of Sepal.Length and Sepal.Width?
Why do you get strange results on Sepal.Length?
Can you solve the problem by using the
appropriate help file?

gt hist(Sepal.Length)
gt plot(Sepal.Length, Sepal.Width)

85
gt mean(Sepal.Width) 1 3.057333 gt
mean(Sepal.Length) 1 NA gt
strange!
If we look at the file Sepal.Length Sepal.Width P
etal.Length Petal.Width Species 5.1 3.5 1.4 0.2 se
tosa 4.9 3.0 1.4 0.2 setosa NA 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa 5.0 3.6 1.4 0.2 setosa 5.4
3.9 1.7 0.4 setosa 4.6 3.4 1.4 0.3 setosa
86
Missing values

In real data, we often have missing values. In R,
we label these NA
Any examples of missing data?
Some functions can by default deal with these -
for instance hist()
Some functions will need to be told what to do
with them, like mean()

87
gt mean(Sepal.Length) 1 NA na.rm is the option
for mean to remove all NAs first gt
mean(Sepal.Length, na.rmT) 1 5.851007 With
new files, you should check whether there are any
NA values first.
88
What if we wanted just the 5 first rows? Or just
all values from the species versicolor?

Sepal.Length Sepal.Width Petal.Length Petal.Width
Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica

89
The index concept

Recap
In a vector, we can select individual values by
saying
Sepal.Widthsome position
gt Sepal.Width5
1 3.6
Sepal.Widthsome_positions
gt Sepal.Width210
1 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
Sepal.Widthc(2,4,6,8)
1 3.0 3.1 3.9 3.4
The first position will always have the number 1,
the last will have the length of the vector.
Challenge what is the mean of the first 100
values of Sepal.Width?

90
Albins code

gt alt-Sepal.Width1100
gt mean(a)
1 3.099
or, in just one line
gt mean(Sepal.Width1100)
1 3.099

91
Selecting parts of vectors depending on values

Often, we want parts of vector due to the values
of the vector -for instance all values lt10, or
all strings equaling versicolor.
This is made by a slightly odd syntax(it will
make sense later on when we have many vectors)
To get all values over 10 in Sepal.Length
Sepal.Length Sepal.Length gt10
So, we are using a criterion instead of the
index

92
Try out

Make a boxplot of all values in Sepal.Length that
are higher than 5
How many vales are there that satisfy this
criterion?

93
Albins code

alt-Sepal.LengthSepal.Length gt5
boxplot (a)
length(a)
1 119

94
criteria in R

lt Smaller than
gt Bigger than
gt Bigger or equal to
lt smaller than or equal to
equal to (not )
! not equal to
Challenge
how many rows for the virginica species?

95
Albins try

gt length(SpeciesSpeciesvirginica )
Error in NextMethod("") object "virginica" not
found
what happens here is that R thinks virginica is
an object - a vector or something else
gt length(SpeciesSpecies"virginica" )
1 50

For string comparisons, we need to enclose the
string with or
96
Selection using two vectors

Often we have two vectors from a file, and we
want to use the value of one vector to select
values from the other.
For instance, sepal length broken up by species

Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa 6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor 5.7 2.8 4.1 1.3 versico
lor 6.3 3.3 6.0 2.5 virginica 5.8 2.7 5.1 1.9 virg
inica 7.1 3.0 5.9 2.1 virginica 6.3 2.9 5.6 1.8 vi
rginica
97

Use the same syntax as before - but change the
vetor names!
some_vector other_vector something
Challenge
select the Sepal.Width values that correspond to
the setosa species
What is the mean of these values?
What is the distribution of these values?

98
Albins code
gt alt-Sepal.WidthSpecies"setosa" gt mean(a) 1
3.428 gt hist(a)
99
Fancy way of breaking up data (only works with
some functions)

boxplot (Sepal.WidthSpecies)
Try it out
This is a formula construction - break up width
in the levels of species

100
Even more complex multiple criteria

and combine two criteria
or either this or that has to be true
So,
Species Sepal.Width gt4 Sepal.Length lt5
will give the species rows where width gt4 AND
Length is lt5.
and, Species Sepal.Width gt4 Sepal.Length lt5
will give?
Challenge Make a boxplot of Sepal.Width where
Sepal.Length gt3 AND the species is virginica

101

gtboxplot (Sepal.WidthSepal.Length gt3
Species"virginica" )

102
More complex data selections

The dataframe is really a two-dimensional vector
- a vector that contains vectors

103

We can use almost the same commands to select
sub-sets of the data frame, without using the
attach() function
This is very handy at times

104

The data frame is really a big table - a 2D
vector.
Just as some_vector5 gives use the fifth value
in a vector,
some_dataframe10, 2 gives us the cell in the
10th row and the 2nd column
gtmy_data10, 2
1 3.1

Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa 5.4 3.9 1.7 0.4 setosa 4.6
3.4 1.4 0.3 setosa 5.0 3.4 1.5 0.2 setosa 4.4 2.9
1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa 5.4 3.7 1.5
0.2 setosa 4.8 3.4 1.6 0.2 setosa 4.8 3.0 1.4 0.1
setosa
105
Selecting rows

If we instead say
my_data10,
We will get the WHOLE 10th row back
One can think of it as saying
my_data10, give me everything
So, what will happen if we say
my_data15,
Or
my_data51,
Try it out

106
Indexing exercise

In the data directory, there is a file called
Rindexing.pdf it is a set of small exercises
designed to help you understand indexing
It works with a very small data file, just so
that you can check your results by eye
Indexing is very important for the rest of the
course

107
Nota Bene 1)Students on the reserve list

Please talk to me to get signed on to the course
(I need your student number or similar)
Otherwise you cannot go to the exam gt no credit
2) A watch was found in the last lecture will
be given back if you can describe it well

108
Sorting one vector by another

Very common situation with experimental data
what are the genes with the highest expression in
my data set?
Recap what happened if you said
my_data15,
Or
my_data51,
?

109

get some data to play with
dlt-read.table("mtcar.txt", hT)
attach(d)

Just by looking at the data, d what car has
the highest and lowest horsepower (hp)?
110

car_name mpg cyldisp hp drat wt qsecvs am
gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90
2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85
2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08
3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360.0 175 3.15
3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225.0 105 2.76
3.460 20.22 1 0 3 1
7 Duster 360 14.3 8 360.0 245 3.21
3.570 15.84 0 0 3 4
8 Merc 240D 24.4 4 146.7 62 3.69
3.190 20.00 1 0 4 2
9 Merc 230 22.8 4 140.8 95 3.92
3.150 22.90 1 0 4 2
10 Merc 280 19.2 6 167.6 123 3.92
3.440 18.30 1 0 4 4
11 Merc 280C 17.8 6 167.6 123 3.92
3.440 18.90 1 0 4 4
12 Merc 450SE 16.4 8 275.8 180 3.07
4.070 17.40 0 0 3 3
13 Merc 450SL 17.3 8 275.8 180 3.07
3.730 17.60 0 0 3 3
14 Merc 450SLC 15.2 8 275.8 180 3.07
3.780 18.00 0 0 3 3
15 Cadillac Fleetwood 10.4 8 472.0 205 2.93
5.250 17.98 0 0 3 4
16 Lincoln Continental 10.4 8 460.0 215 3.00
5.424 17.82 0 0 3 4
17 Chrysler Imperial 14.7 8 440.0 230 3.23
5.345 17.42 0 0 3 4
18 Fiat 128 32.4 4 78.7 66 4.08
2.200 19.47 1 1 4 1

111
Permutations

sort.list(x) will give the permutation of x that
sorts x
Is much easier to explain with a small example
gt alt-c(3,1,10)
gt sort.list(a)
1 2 1 3

112

gt alt-c(3,1,10)
gt sort.list(a)
1 2 1 3
gt blt-sort.list(a)
gt ab
1 1 3 10
sort.list(hp)
1 19 8 20 18 26 27 3 9 21 6 32 1 2 4 28
10 11 22 23 5 25 30 12 13 14 15 16 17 7 24 29
31

113

car_name mpg cyl disp hp drat wt qsec vs
am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90
2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85
2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08
3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360.0 175 3.15
3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225.0 105 2.76
3.460 20.22 1 0 3 1
7 Duster 360 14.3 8 360.0 245 3.21
3.570 15.84 0 0 3 4
8 Merc 240D 24.4 4 146.7 62 3.69
3.190 20.00 1 0 4 2
9 Merc 230 22.8 4 140.8 95 3.92
3.150 22.90 1 0 4 2
10 Merc 280 19.2 6 167.6 123 3.92
3.440 18.30 1 0 4 4
11 Merc 280C 17.8 6 167.6 123 3.92
3.440 18.90 1 0 4 4
12 Merc 450SE 16.4 8 275.8 180 3.07
4.070 17.40 0 0 3 3
13 Merc 450SL 17.3 8 275.8 180 3.07
3.730 17.60 0 0 3 3
14 Merc 450SLC 15.2 8 275.8 180 3.07
3.780 18.00 0 0 3 3
15 Cadillac Fleetwood 10.4 8 472.0 205 2.93
5.250 17.98 0 0 3 4
16 Lincoln Continental 10.4 8 460.0 215 3.00
5.424 17.82 0 0 3 4
17 Chrysler Imperial 14.7 8 440.0 230 3.23
5.345 17.42 0 0 3 4
18 Fiat 128 32.4 4 78.7 66 4.08
2.200 19.47 1 1 4 1

19 8 20 18 26 27 3 9 21 6 32 1 2 4 28 10
11 22 23 5 25 30 12 13 14 15 16 17 7 24 29 31
114
Thought exercise

What happens if we now say
hpc(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32,
1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12,
13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31)
?

115
We get hp sorted

hpc(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32,
1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12,
13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31)
1 52 62 65 66 66 91 93 95 97 105 109
110 110 110 113 123 123 150 150 175 175 175 180
180 180 205 215 230 245 245 264 335

116
So, sort.list really gives an instruction how to
sort

So, we can also say
sort_orderlt-sort.list(hp)
hpsort_order
What will happen if we say
car_namesort_order
?

117

car_namesort_order
1 Honda Civic Merc 240D
Toyota Corolla Fiat 128 Fiat X1-9
Porsche 914-2 Datsun 710
8 Merc 230 Toyota Corona
Valiant Volvo 142E Mazda RX4
Mazda RX4 Wag Hornet 4 Drive
15 Lotus EuropaMerc 280 Merc 280C
Dodge Challenger AMC Javelin
Hornet Sportabout Pontiac Firebird
22 Ferrari Dino Merc 450SE Merc
450SL Merc 450SLC Cadillac
Fleetwood Lincoln Continental Chrysler Imperial
29 Duster 360 Camaro Z28 Ford
Pantera L Maserati Bora
32 Levels AMC Javelin Cadillac Fleetwood Camaro
Z28 Chrysler Imperial Datsun 710 Dodge Challenger
Duster 360 Ferrari Dino Fiat 128 ... Volvo 142E
So, we use the hp permutation to reorder the
cars!
We can also do this with the whole dataframe, in
the same way
gtdsort_order,

118
gt dsort_order, car_name mpg cyl
disp hp drat wt qsec vs am gear carb 19
Honda Civic 30.4 4 75.7 52 4.93 1.615
18.52 1 1 4 2 8 Merc 240D 24.4
4 146.7 62 3.69 3.190 20.00 1 0 4 2 20
Toyota Corolla 33.9 4 71.1 65 4.22 1.835
19.90 1 1 4 1 18 Fiat 128 32.4
4 78.7 66 4.08 2.200 19.47 1 1 4 1 26
Fiat X1-9 27.3 4 79.0 66 4.08 1.935
18.90 1 1 4 1 27 Porsche 914-2 26.0
4 120.3 91 4.43 2.140 16.70 0 1 5 2 3
Datsun 710 22.8 4 108.0 93 3.85 2.320
18.61 1 1 4 1 9 Merc 230 22.8
4 140.8 95 3.92 3.150 22.90 1 0 4 2 21
Toyota Corona 21.5 4 120.1 97 3.70 2.465
20.01 1 0 3 1 6 Valiant 18.1
6 225.0 105 2.76 3.460 20.22 1 0 3 1 32
Volvo 142E 21.4 4 121.0 109 4.11 2.780
18.60 1 1 4 2 1 Mazda RX4 21.0
6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875
17.02 0 1 4 4 4 Hornet 4 Drive 21.4
6 258.0 110 3.08 3.215 19.44 1 0 3 1 28
Lotus Europa 30.4 4 95.1 113 3.77 1.513
16.90 1 1 5 2 10 Merc 280 19.2
6 167.6 123 3.92 3.440 18.30 1 0 4 4 11
Merc 280C 17.8 6 167.6 123 3.92 3.440
18.90 1 0 4 4 22 Dodge Challenger 15.5
8 318.0 150 2.76 3.520 16.87 0 0 3 2 23
AMC Javelin 15.2 8 304.0 150 3.15 3.435
17.30 0 0 3 2 5 Hornet Sportabout 18.7
8 360.0 175 3.15 3.440 17.02 0 0 3 2 25
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845
17.05 0 0 3 2 30 Ferrari Dino 19.7
6 145.0 175 3.62 2.770 15.50 0 1 5 6 12
Merc 450SE 16.4 8 275.8 180 3.07 4.070
17.40 0 0 3 3 13 Merc 450SL 17.3
8 275.8 180 3.07 3.730 17.60 0 0 3 3 14
Merc 450SLC 15.2 8 275.8 180 3.07 3.780
18.00 0 0 3 3 15 Cadillac Fleetwood 10.4
8 472.0 205 2.93 5.250 17.98 0 0 3 4 16
Lincoln Continental 10.4 8 460.0 215 3.00 5.424
17.82 0 0 3 4 17 Chrysler Imperial 14.7
8 440.0 230 3.23 5.345 17.42 0 0 3 4 7
Duster 360 14.3 8 360.0 245 3.21 3.570
15.84 0 0 3 4 24 Camaro Z28 13.3
8 350.0 245 3.73 3.840 15.41 0 0 3 4 29
Ford Pantera L 15.8 8 351.0 264 4.22 3.170
14.50 0 1 5 4 31 Maserati Bora 15.0
8 301.0 335 3.54 3.570 14.60 0 1 5 8

gt dsort_order,
car_name mpg cyl disp hp drat
wt qsec vs am gear carb
19 Honda Civic 30.4 4 75.7 52 4.93
1.615 18.52 1 1 4 2
8 Merc 240D 24.4 4 146.7 62 3.69
3.190 20.00 1 0 4 2
20 Toyota Corolla 33.9 4 71.1 65 4.22
1.835 19.90 1 1 4 1
18 Fiat 128 32.4 4 78.7 66 4.08
2.200 19.47 1 1 4 1
26 Fiat X1-9 27.3 4 79.0 66 4.08
1.935 18.90 1 1 4 1
27 Porsche 914-2 26.0 4 120.3 91 4.43
2.140 16.70 0 1 5 2
3 Datsun 710 22.8 4 108.0 93 3.85
2.320 18.61 1 1 4 1
9 Merc 230 22.8 4 140.8 95 3.92
3.150 22.90 1 0 4 2
21 Toyota Corona 21.5 4 120.1 97 3.70
2.465 20.01 1 0 3 1
6 Valiant 18.1 6 225.0 105 2.76
3.460 20.22 1 0 3 1
32 Volvo 142E 21.4 4 121.0 109 4.11
2.780 18.60 1 1 4 2
1 Mazda RX4 21.0 6 160.0 110 3.90
2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
2.875 17.02 0 1 4 4
4 Hornet 4 Drive 21.4 6 258.0 110 3.08
3.215 19.44 1 0 3 1
28 Lotus Europa 30.4 4 95.1 113 3.77
1.513 16.90 1 1 5 2
10 Merc 280 19.2 6 167.6 123 3.92
3.440 18.30 1 0 4 4
11 Merc 280C 17.8 6 167.6 123 3.92
3.440 18.90 1 0 4 4

119
What type of data are we looking at?

Type here refers to what kind of data
representation - for example strings or
numbers
R usually does a good job at interpreting this
for us when we read data
This becomes important later on, as some
functions only take vectors of a certain data
type - for instance, you cannot take the mean()
of a vector with words!

120
Simple types

Numbers
num_vectorlt- c(1,5,2,10)
Strings(or characters)
String_vectorlt-c( a, trial1)
How can we know?
class(num_vector)
1 "numeric"

121
Factors/groups

Factors slightly more tricky. A factor is often
a grouping variable - such as the gender or
experiment group, like
genderlt-c(f, f, m, m, f)
experiment_numberlt-c(1,2,4,5,5,6)
Problem here is that R will think that the first
vector is a string vector, and the last a
numerical vector.
gt class(gender)
1 "character"
gt class(experiment_number)
1 "numeric"

122

This is resolved by
gt experiment_numberlt-as.factor( c(1,2,4,5,5,6))
gt class(experiment_number)
1 "factor"

123
Over to analysis of data!

boxplot (Sepal.WidthSpecies)

Is there a real difference?
124
We now move over from pure R to using R to
make statistical tests

Focus is what typical tests are there and how to
use them
We skip almost all statistical theory(!) -
covered in the course in fall

125
Population and sample

Population refers to a set of potential
measurements or values, including not only cases
actually observed but those that are potentially
observable. Really depends on the question at
hand!
Sample is the data we actually have - generally
this is very small compared to the whole
population, but we use it to draw inferences
about the population
So, the sample is a small part of a much larger
population, in most cases
Can samplepopulation? Examples?

126
Significance tests

In real life, we always get some difference
between datasets, because we sample limited data,
and there is always noise
So, is the difference in sepal width we see
between species something that is just due to
luck - or is it real?
Challenge What three features in the dataset
will be important for finding this out
(significant difference or just difference due to
random noise) ? Discuss with your sideman for a
minute

127
Albins take

The observed difference
Obviously, the bigger observed difference, the
less random it is
The data depth
The more data you have, the less likely it is
that strange things happens due to random effects
Data spread (variance)
Slightly less intuitive. The less spread, the
less it can be explained by random events.

128
The concept of null hypothesis and P-values

Always in statistics, we set a null-hypothesis.
This is basically what we want to disprove!
Usually, this is something like there is no
difference in mean in these two samples. The
alternative hypothesis is then, by default,
There is a difference
Then, we test how likely this is, with certain
assumptions (depending on what test)

129

The P-value of a test can be interpreted as the
chance that the observed difference occurs if the
null hypothesis is true
This only makes sense if your test is relevant!
So, a small P-value means that the null
hypothesis is not a good model - which means
that there is a difference
We often want P-values to be as small as possible
- the standard cutoff for significance is 5,
or 0.05
However, this is just tradition it is
completely arbitrary

130
Two-sample t-test

deliberate skip of math details Jens will
cover some later on
The t-test tests if two distributions come from
distributions with the same mean.
The assumption is that both distributions are
normal-shaped - we should always check this
first

131
Mean
Standard deviation a measure of spread
132
Normality - brief check

There are many tests for normality. Now we will
do the simplest by eye. Make a histogram and see
if the data looks reasonably bell-shaped
Challenge Test to make histograms for the sepal
length of each species

133
Albins take

warning wrong approach
hist(Sepal.Length)
not very normal-shaped!
why is this the wrong approach?

134
Albins take, again

now for each species
par (mfrowc(1,3))
hist(Sepal.LengthSpecies"setosa")
hist(Sepal.LengthSpecies"versicolor")
hist(Sepal.LengthSpecies"virginica")
IMHO, these are not very far away from
normal-shaped given the amount of data

135
Slightly more advanced way the
quartile-quartile plot

If two distributions have the same shape, their
cumulative distributions should match well

136
qqnorm(Sepal.Length)
137
qqnorm(Sepal.LengthSpecies"setosa")
138
So, are they different?
gt setosalt-Sepal.LengthSpecies"setosa" gt
versicolorlt-Sepal.LengthSpecies"versicolor" gt
t.test(setosa, versicolor) Welch Two Sample
t-test data setosa and versicolor t
-10.521, df 86.538, p-value lt 2.2e-16
Very! alternative hypothesis true difference in
means is not equal to 0 95 percent confidence
interval -1.1057074 -0.7542926 sample
estimates mean of x mean of y 5.006
5.936
139
Lets test it with some real data

In your data directory, there are two files with
binding sites lengths, from estrogen receptor
alfa and beta ERA_widths.txt, ERB_widths.txt
1) Read these into two different vectors, called
era, erb
2) Have they significantly different means?

140
Albins code

gt dlt-read.table("ERA_widths.txt", hT)
gt attach(d)
gt dlt-read.table("ERB_widths.txt", hT)
gt attach(d)
gt

gt t.test(ERA_width, ERB_width) Welch Two Sample
t-test data ERA_width and ERB_width t
2.7288, df 1066.975, p-value
0.00646 alternative hypothesis true difference
in means is not equal to 0 95 percent confidence
interval 11.18740 68.45565 sample
estimates mean of x mean of y 514.1370
474.3155
141
Hey, we forgot to check for normality!
gt par(mfrowc(1,2)) gt hist (ERA_width) gt hist
(ERB_width) NOT really normal shaped! t.test is
not valid!
142
The two-sample Wilcoxon test

The t-test is assuming that samples are
normal-distributed, which is often not true with
biological data
The wilcoxon test wilcox.test() is a less
powerful test which does not make such
assumptions.
Tests of this kind are called non-parametric
Use when the normality assumption is doubtful

143
Challenge

Repeat the analysis of the ERA/beta sites using
wilcox.test. Is the difference significant?

144
gt wilcox.test(ERA_width, ERB_width) Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.05551 alternative hypothesis true location
shift is not equal to 0
The AHH, so close to significance situation
145
Different types of alternative hypothesis

Null hypothesisthere is no difference
Default alternative hypothesis there is
difference. This is called two-sided, as the
differences can be in either direction
Optional We can make the alternative hypothesis
one-sided - like There is a difference -
distribution A has a greater mean than
distribution B. This increases the statistical
power 2

146
In R
gt wilcox.test(ERA_width, ERB_width) Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.05551 alternative hypothesis true location
shift is not equal to 0 gt wilcox.test(ERA_width,
ERB_width, alternative"greater") Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.02776 alternative hypothesis true location
shift is greater than 0 what happens if
alternativeless?
147
gt wilcox.test(ERA_width, ERB_width,
alternative"less") Wilcoxon rank sum test with
continuity correction data ERA_width and
ERB_width W 151243.5, p-value
0.9723 alternative hypothesis true location
shift is less than 0
148
Dangers with hypothesis fiddling

This should never be used to get twice as good
P-value without some justification
Homework for Tuesday read BMJ 1994309248
(linked at the web page) for discussion on when
this is appropriate

149
Testing many samples at once

Sometimes, we have a table of groups, and we want
to see if any of the groups are different in
means
treatment1 treatment2 treatment3
plant1
plant2
plant3
etc
This is the typical one-way Anova test situation

150
boxplot (Sepal.WidthSpecies)
Is there a real difference?
Last time we checked this, we just tested setosa
vs versicolor
151

In R, oneway anova tests can be made easy by the
oneway.test() command
The oneway.test() works by giving it a formula,
in our case
Sepal.WidthSpecies.
This will basically give the function a table
with three columns of values, one for each
species.
It will then test if the means of any of these
columns are different from the others

152
Test flowers set
gt dlt-read.table("flowers.txt", hT) gt
attach(d) oneway.test(Sepal.WidthSpecies) One-w
ay analysis of means (not assuming equal
variances) data Sepal.Width and Species F
45.012, num df 2.000, denom df 97.402,
p-value 1.433e-14
As expected, very, very significant. Is the
difference in length of sepals (not widths) even
more significant? Test it!
153
Non-parametric alternative to Anova

As with t-tests, Anova tests assume that each of
the columns has a normal-like distribution. This
is often not the case
The Kruskal-Wallis test is analogous to the
wilcoxon test - a non-paramatric variant.
Practically, it works exactly in the same way as
the oneway.test().

154
2 data sets gt2 data sets
Normal distribution t.test() oneway.test()
Not normal distribution (non-parametric tests) wilcox.test() kruskal.test()
155
Larger exercise

In the data folder, there is file
gene_lengths.txt, showing gene lengths from three
different chromosomes.
Is there a significant difference in the center
points (means or medians) of gene lengths for the
different chromosomes? What is the method of
choice? What two chromosomes have the most/least
significant difference?

156
gt dlt-read.table("gene_lengths.txt", hT) gt
attach(d) gt names(d) 1 "id" "chr"
"gene_length first - is the gene length
distribution normal-shaped? gt hist(gene_length)
nope! gt nonparametric tests needed
157
many more than two groups, and
non-parametric test needed gtkruskal wallis
test gt kruskal.test(gene_lengthchr) Kruskal-Wal
lis rank sum test data gene_length by chr
Kruskal-Wallis chi-squared 131.7681, df 2,
p-value lt 2.2e-16 very significant difference!
158
are all chromosome pairs significant? test
case use wilcox.test on two chromosomes gt
wilcox.test(gene_lengthchr"chr2" ,
gene_lengthchr"chr3") Wilcoxon rank sum
test with continuity correction data
gene_lengthchr "chr2" and gene_lengthchr
"chr3" W 1068584, p-value
0.1822 alternative hypothesis true location
shift is not equal to 0 so, all pairs are not
significantly different!
159
chr1 chr2 chr3
chr1 1 2.2E-16 2.2E-16
chr2 1 0.1822
chr3 1
160
for us who are lazy - we can do all vs all tests
like this gt pairwise.wilcox.test(gene_length,
gchr ) Pairwise comparisons using Wilcoxon
rank sum test data gene_length and chr
chr1 chr2 chr2 lt2e-16 - chr3 lt2e-16 0.18 P
value adjustment method holm
161
Small homework I Reading about one-and two-sided
tests What is horribly wrong in this (published)
study?
162
Small homework (II)

What is the statistical problem with making these
pairwise tests en masse, for instance making 1000
t-tests and reporting all cases where Plt0.05?
Hint Go back to the slides about what the 5 P
value cutoff really means

163
Homework and Examination
Big homework for examination goes online this
evening!
164
Examination

Three home works during the course
These will together give 50 of the final points
Main focus here is technical skill (how to do
analysis x)
Theoretically optional. Very strict deadlines.
One large home work at the end of the course
Will give 50 of final points
Overlaps substantially with the three first, but
is more high-level
Main focus here is to merge technical skill with
biologically motivated analysis - real science
Do the maththe three smaller home works pays off
Note High level of activity needed from the
start of the course. Will be demanding! The home
works are the examination!

165

The points with homework
Encourage study during the course
It is easy to think that you understood
something, but when meeting the real thing
Only realistic way of training
Evaluation

166
Rules for homework

Working together in 2-3 perons teams is allowed
and encouraged, BUT
Each person hands in a separate report
which must be unique. This is usually not a
problem if you write it yourself
If you work with others, write who you work with
on the report ( I worked with X and Y in this
homework!)
Cases of copying between students WILL be
reported. The BEST thing that can happen is that
you will get lower scores

167
(No Transcript)
168
(No Transcript)
169
Using material from others

Copying text/images from other sources
including fellow students is considered
cheating
You can quote sources, given that you cite them,
but this has a limit

170
Questions?
171
Correlations

Sometimes, we do not want to show that two things
are different, but that they are correlated.
For instance, weight and height for individuals
are clearly different, but also highly correlated
This is a positive correlation - but we could
also have a negative correlation
Correlations are dangerous things - it depends on
how we ask questions .Easily the most misused and
misunderstood part of statistics
Correlation ! Causation

172
Consider Why is it that

Truth Compared to the general population, there
is a strong correlation between being an American
football player and risk for testicle cancer
Interpretation American football leads to
cancer!

173
Consider Why is it that

Truth There is an inverse correlation between
yearly income and running speed
Interpretation carrying money makes you slow!

174
Time to explore

Open the height.txt file in R, and look at the
different data types. Try to plot various columns
against each other, like
plot (mean_height, Year)
What columns correlate? Can we break up the data
even more?
Try it out!

175
gtplot (mean_weight, year ) gt plot (mean_height,
year ) gt plot (mean_height, mean_weight )
Hmmmobviously there are TWO datasets here! How
can we break this up?
176
Visualize the groups with color
plot (mean_weight, year ) lines
(mean_weightgender"F", yeargender"F",
col"red" ) lines (mean_weightgender"M",
yeargender"M", col"blue" ) a typical case
of OVER-plotting
177
Assessing correlations

First plot a standard x-y plot
Second, make a correlation test
cor(x, y) gives the Pearson correlation
coefficient between z an y, which goes from
-1(perfect negative correlation) to 1(prefect
correlation). 0no correlation

178
Test it

gt cor(mean_weight, year)
1 0.1688499
BAD correlation - close to zero
What happens if we break it up by gender as in
the last plot? Test it!

179

gt cor (mean_weightgender"M",
yeargender"M" )
1 0.9527568
gt cor (mean_weightgender"F",
yeargender"F" )
1 0.931249
very, very correlated!

180
Important Pearson correlation gives a score,
not a significance value. Obviously, it is easier
to get a good correlation if you only have a few
values! Often, we want to also want to TEST if
the correlation is there. The null hypothesis is
then 0 correlation This is done with
the cor.test() command
181
gt cor.test (mean_weight, year ) Pearson's
product-moment correlation data mean_weight
and year t 0.5934, df 12, p-value
0.5639 alternative hypothesis true correlation
is not equal to 0 95 percent confidence
interval -0.3973253 0.6419208 sample
estimates cor 0.1688499
182
gt cor.test (mean_weightgender"F",
yeargender"F" ) Pearson's product-moment
correlation data mean_weightgender "F"
and yeargender "F" t 5.7147, df 5,
p

Write a Comment

User Comments (0)