Computational Personal Genomics - Week 3 Lab
Table of Contents
1 Introduction
In this lab we will look at HapMap data to look at similarity within and between ethnic groups. For pairs of individuals, we will calculate their global similarity simply by counting SNPs that are identical by state (IBS). We will then perform PCA to see how individuals cluster.
For simplicity, we will only download the first chromosome.
2 Source data
First, download phased chromosomes from HapMap Phase III: http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2009-02_phaseIII/HapMap3_r2/, for five populations:
- ASW
- http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2009-02_phaseIII/HapMap3_r2/ASW/UNRELATED/hapmap3_r2_b36_fwd.consensus.qc.poly.chr1_asw.unr.phased.gz
- CEU
- http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2009-02_phaseIII/HapMap3_r2/CEU/UNRELATED/hapmap3_r2_b36_fwd.consensus.qc.poly.chr1_ceu.unr.phased.gz
- JPT+CHB
- http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2009-02_phaseIII/HapMap3_r2/JPT+CHB/hapmap3_r2_b36_fwd.consensus.qc.poly.chr1_jpt+chb.unr.phased.gz
- MEX
- http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2009-02_phaseIII/HapMap3_r2/MEX/TRIOS/hapmap3_r2_b36_fwd.consensus.qc.poly.chr1_mex.phased.gz
- YRI
- http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2009-02_phaseIII/HapMap3_r2/YRI/UNRELATED/hapmap3_r2_b36_fwd.consensus.qc.poly.chr1_yri.unr.phased.gz
Note that for all populations except MEX, we have only unrelated individuals in the file; the MEX file contains trios. Family relationships are annotated here: http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2009-02_phaseIII/HapMap3_r2/hapmap3_r2_b36_fwd.consensus.qc.poly.info
3 Loading data in R
Loading these data is very straightforward:
pops.per.column=vector("numeric") all=vector("numeric") for (pop in c("asw", "ceu", "jpt+chb", "mex", "yri")) { print(pop) if (pop == "mex"){ fn = paste("hapmap3_r2_b36_fwd.consensus.qc.poly.chr1_", pop, ".phased.gz", sep="") a = read.table(fn, header=T)[,c(-1,-2)] a=a[,1:18] }else{ fn = paste("hapmap3_r2_b36_fwd.consensus.qc.poly.chr1_", pop, ".unr.phased.gz", sep="") a = read.table(fn, header=T)[,c(-1,-2)] a=a[,1:18] } colnames(a)=rep(toupper(pop), ncol(a)) if (is.null(dim(all))){ all = a } else{ all = cbind(all, a) } }
We now have a matrix with the nucleotides. We can arbitrarily pick one individual to be the "reference", to convert this into a matrix of 0 and 1:
reference=all[,1] all = apply(all,2, function(x) as.numeric(x == reference))
4 Calculating IBS for pairs of chromosomes
We can count how many SNPs are concordand or discordant between two chromosomes. Let's take the first two in our sample (which are both from an ASW individual):
sum(all[,1] == all[,2]) mean(all[,1] == all[,2])
Then do this over all pairs:
chromosomes = ncol(all) ibs=matrix(NA, nrow=chromosomes, ncol=chromosomes, dimnames=list(colnames(all), colnames(all))) for (i in (1:chromosomes-1)){ for (j in ((i+1):chromosomes)){ ibs[i,j] = mean(all[,i] == all[,j]) } }
Visualize this:
image(t(ibs), col=colorRampPalette(c("black", "red"))(256), zlim=c(.66,.77))
Look on the diagonal: Which population do you expect to have a higher or lower nucleotide diversity? What is the difference between this calculation and the calculation of nucleotide diversity that we made in Lab 3?
Look off the diagonal: What relationships between the populations can you infer?
5 Principal Components Analysis
R has built-in tools for PCA. We can use the prcomp() function to project the chromsomes' genotypes onto the first two principal components:
pca.result=prcomp(t(all))
We can see how much of the variance each of the principal components explains:
plot(pca.result$sdev)
Then project our chromosomes:
plot(pca.result$x[,1:2], col="white") text(pca.result$x[,1:2], labels=colnames(all), cex=0.75)
What relationships are visible here that are not visible from the IBS values?
6 Assignment
For this week's assignment, perform these same two analyses again, including at least three different HapMap populations than the ones found here. Describe whether what you see is surprising, or consistent with theories about the history of those populations. Can you reliably assign an individual from two closely-related populations (for example, CEU and TSI) using the first two principal components?
Date: 2012-02-27
HTML generated by org-mode 6.34c in emacs 23