Sunday, June 3, 2012

Drawing heatmaps in R with heatmap.2

A heatmap is a scale colour image for representing the observed values of two o more conditions, treatments, populations, etc. The observations can be raw values, norlamized values, fold changes or any others. Let's begin with an example. Say that I'm interesting in the differential expression of the proteome of three cell lines. I have my data in tab delimited text and will upload it into RStudio (a prettier interface for R: http://rstudio.org/).


setwd("C:/Users/Cell_Lines")
data = read.table("Intensities.txt", header=T, stringsAsFactors=F)    # Input the data. 
data = na.omit(data)     # Throw out rows with missing values.
View(data)



row.namesCell_A_rep1Cell_A_rep2Cell_A_rep3Cell_B_rep1Cell_B_rep2Cell_B_rep3
1sp|A6NHR9|SMHD1_HUMAN8.325863e+057.933804e+051.312156e+042.307032e+03235229.3551.351383e+05
2sp|A8CG34|P121C_HUMAN1.282906e+042.592576e+044.963257e+031.613087e+0362421.2571.519155e+04
3sp|O00116|ADAS_HUMAN7.554998e+035.372312e+036.686471e+041.309784e+0525299.4022.170408e+04
4sp|O00148|DX39A_HUMAN9.819398e+051.121632e+064.671197e+055.534510e+05899565.0324.044675e+05
5sp|O00257|CBX4_HUMAN4.139926e+041.995894e+049.100997e+031.623696e+04258025.5181.002840e+05



My data consist of three cell lines (only two are showed) with three replicates each, where each row represents the expression value of a protein (only five are showed). It is classical in R to use the heatmap function which outputs a yellow-orange heatmap with two dendrograms: one on the left side representing the rows (i.e., entities, genes, proteins, etc.), and a second one on the top representing the columns (i.e., conditions, treatments, etc.). Additionally the names of the rows and the columns are written in the right side and bottom, respectively:


heatmap(as.matrix(data))    # The data must be in matrix format and not in frame format!



Some things can be improve in the above heatmap. You can see that the column names didn't fit in the figure, and the rownames are too many that we better avoid them. The heatmap function has many paramters that allow the user to improve the figure. You can see them by typing help(heatmap). Right now, we are interested in two of them the cexCol to adjust the size of the column names and the labRow to choose the name of the rows. Try the following line and you will see how it changes. 

heatmap(as.matrix(data), cexCol=0.7, labRow=NA)



Although heatmap is a good function, a better one exists nowadays and is heatmap.2. Heatmap2 allows further formatting of our heatmap figures. For example, we can change the colours to the common red-green scale, represent the original values or replace them with the row-Z-score, add a colour key and many other options. Let's see! 

install.packages("gplots")
library(gplots)
help(heatmap.2)
heatmap.2(as.matrix(data), col=redgreen(75), scale="row", key=T, keysize=1.5,
          density.info="none", trace="none",cexCol=0.9, labRow=NA)




However this is not the way of dealing with the data. When we work with high throughput data, the first step is to log-transform the intensities and then apply a normalization method. We can do that with the following lines:

data = log2(data)
boxplot(data, cex.axis=0.8, las=2, main="Original distribution of data",
        ylab="Log2(Intensity)")  # Draw a boxplot.


# Normalization
library(preprocessCore)
data2 = normalize.quantiles(as.matrix(data))   # A quantile normalization. 


# Copy the row and column names from data to data2:
rownames(data2) = rownames(data)
colnames(data2) = colnames(data)


boxplot(data2, cex.axis=0.8, las=2, main="Distribution after normalization",
        ylab="Log2(Intensity)") 


# t-test using the limma package:

library(limma)
design = cbind(Cell_A = c(1,1,1,0,0,0,0,0,0), # First 3 columns->Cell_A
               Cell_B = c(0,0,0,1,1,1,0,0,0),
               Cell_C = c(0,0,0,0,0,0,1,1,1)) # Last 3 columns->Cell_C
fit = lmFit(data2, design=design) # Fit the original matrix to the above design.
# We want to compare A vs. B, A vs. C and B vs. C
contrastsMatrix = makeContrasts("Cell_A-Cell_B","Cell_A-Cell_C","Cell_B-Cell_C",
                                levels = design) 
fit2 = contrasts.fit(fit, contrasts = contrastsMatrix) # Making the comparisons.
fit2 = eBayes(fit2) # Moderating the t-tetst by eBayes method.

A heatmap can be seen as an array of figures. The first figure is the real heatmap itself, the second figure is the rows' dendrogram, the third is the columns' dendrogram, and the last figure is the color-key. In that sense, we can control the relative position of each figure using the layout parameter lmat and also introduce blank spaces to tight the figures by introducing 0 (zeros) in the lmat matrix. Additionally, we can customize the width and height of the  array by tuning the parameters lhei and lwid. Two more things! First, when working with microarray data, the common colours are red and green, but the current fashion in proteomics is to use red and blue. Secondly, lets suppose that the first 250 proteins belong to a same family, the second 250 to another and so on; and now we wanna know if they actually group together in the dendrogram. We assing a colour to each group, and then match each protein in the heatmap with its corresponding colour, using the parameter RowSideColors. When introducing a row side colour bar, this element becomes now the first element of the figure-matrix, so the real heatmap becomes the second element, the rows' dendrogram the third, the columns' dendrogram the fourth and the color-key the fith. You can see all this changes in the heatmap below.


data3 = as.matrix(fit)
Label = c(rep("purple",250),rep("orange",250),rep("darkgreen",250),
          rep("brown",323))
heatmap.2(data3, col=redblue(256), dendrogram="both",
          scale="row", key=T, keysize=0.5, density.info="none",
          trace="none",cexCol=1.2, labRow=NA, RowSideColors=Label,
          lmat=rbind(c(5,0,4,0),c(3,1,2,0)), lhei=c(2.0,5.0),
          lwid=c(1.5,0.2,2.5,2.5))



Finally, if we had a table with the fold change values per each comparison, and a second table containing the p-values of those fold changes, we can plot both in a single heatmap where the colours represent the fold change values and the numbers over each colour square its respective p-value. In my heatmap below the numbers over the squares represent the p-values in a discrete scale, where -1 means non-significant, 0 means significant and 1 highly significant. To write the p.values over the heatmap, we use the parameter cellnote, and if we want them in black, we need to set  notecol = "black". Additionally, since we want to plot the real fold change values and not their Z-score, we set scale = "none". Look at the following lines.

FoldCh = read.table("Fold_changes.txt",header=T,stringsAsFactors=F)
Pval = read.table("Pvalues.txt",header=T,stringsAsFactors=F)
heatmap.2(as.matrix(FoldCh),dendrogram="col", cellnote=as.matrix(PVal),
          notecol="black",col=redblue(256),scale="none",key=TRUE, keysize=1.5,
          density.info="none", trace="none", cexRow=0.7,cexCol=1.2)





And that's the end of this post. I hope you manage to do something similar with your own data. Till next time!



Sunday, May 20, 2012

Coroutines in Python: an example applied to Bioinformatics.

Hello again! The first post of this blog dealed with a UniProt keylist parser, where I told you that we could mine information like the one store in the annotation field of each record with some additional scripts. Well this current post deals with it. The code presented at the bottom can actually be simpler, but my aim  is to introduce a non very common Python object called coroutine.

Coroutines are similar to functions and generators. However, whereas these operate on a single set of input arguments, coroutines can take a sequence of inputs sent to them. Indeed, coroutines are functions created by using the yield statement (as generators) but written as an expression '(yield)'. Let's see an example of David Beazley showed in the fourth edition of his book Python Essential Reference:

def print_matches(matchtext):
      print "Looking for", matchtext
      while True:
           line = (yield)     # Get a line of a text
           if matchtext in line:
                print line

To use a coroutine, you first call it with the next method, and then start sending arguments into it using the send method.

>>> matcher = print_matches("python")
>>> matcher.next()
Looking for python
>>> matcher.send("Hello World")   # Generates no output because the text 'python' is not in the given string
>>> matcher.send("python is cool")
python is cool
>>> matcher.send("Yeah!")
>>> matcher.close()    # Done with the matcher function call

Coroutines are useful when writing programs that run in parallel, where one part of a program cosumes the information generated by the other. Like generators, coroutines can be very useful for processing information produced by pipelines, streams or data flow. However in this post I will use it in an introductory way for you to observe how it works. Let's suppose we have a UniProt keylist file from which we want to know how many proteins are annotated as residing in the nucleus, cytoplasm or any other subcellular location. Our parser generator defined in a previous post can extract the annotations stored in the "CC" field of any record. The "CC" field contains information like the function of a protein, tissue specificity and subcellular location.

>>> re["CC"]
'-!- FUNCTION: May act as a transcriptional repressor. Down-regulates CA9 expression. -!- SUBUNIT: Interacts with HDAC4. -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm, cytosol. Note=Mainly located in the nucleus. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q9Y6X9-1; Sequence=Displayed; Note=No experimental confirmation available; Name=2; IsoId=Q9Y6X9-2; Sequence=VSP_041759; Note=No experimental confirmation available; -!- TISSUE SPECIFICITY: Highly expressed in smooth muscle, pancreas and testis. -!- SIMILARITY: Contains 1 CW-type zinc finger. -!- SEQUENCE CAUTION: Sequence=AAC12954.1; Type=Erroneous gene model prediction; Sequence=BAA74875.2; Type=Miscellaneous discrepancy; Note=Intron retention; ----------------------------------------------------------------------- Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms Distributed under the Creative Commons Attribution-NoDerivs License -----------------------------------------------------------------------'
>>> 

In the topic of this post, we are interested in the subcellular location of our proteins. The python code to work on that piece of  the "CC" field is the following:

''' This program looks for a specific text in the subcellular location field
    of the proteins stored in a UniProtKB keylist file. If requested, those
    entries with no subcellular location annotation at all are written in a
    file.'''


subloc = 0
def subloc_counter(matchtext): # This a coroutine (similar to, but neither a
     global subloc             # function nor a generator).
     while True:
          string = (yield) # Incorporates the input to be evaluated.
          if matchtext in string:
               subloc += 1

# Now let's read the KeyList file using the generator defined in UniProt_parser.py
from UniProt_parser import *
handle = open(raw_input("Path and name of the keylist file: "))
records = parse(handle)

# Proteins with non subcellular location annotation might be interesting.
store = raw_input("Do you wanna store the IDs of those proteins without \
subcellular location annotation? Answer yes/no: ")
if store == "yes":
     output = open(raw_input("Where to store the entries without subcell location? "), 'w')

# Now, let's count the matches to an input text:
text = raw_input("Enter the text you want to look for in the subcellular location field: ")
counter = subloc_counter(text)
counter.next() # Preparing for counting

for re in records:
     SL = re["CC"].split("-!- ") # The type of output return by the split function is a list.
     for item in SL:
          if item.startswith("SUBCELLULAR LOCATION"): # If subloc exists, then keep only the
               SL = item.lower() # lines related to it. And put it in lowercase so to avoid
               break             # case-sensitivity.
          else:
               pass

     if store == "yes"  and type(SL) == list: # If SL is still a list, it means that non
          output.write(re["ID"].split(" ")[0]+'\n') # subloc annotation was found. Then store
                                                    # that protein in the output file.
     else:
          counter.send(SL) # Look for the text in the current SubLoc field.

# After searching in the annotation of all proteins:
print "The text '%s' was found in the subcellular location field of %d proteins." \
      % (text, subloc)

# And close the coroutine and some still open files:
counter.close()
handle.close()
if store == "yes": output.close()


When being run in the python shell, the user is requested for certain information and files, and after it the program outputs the number of proteins localized in a particular subcellular location. Let's look how it works.

>>> ====================== RESTART ==========================
>>>
Path and name of the keylist file: C:\Users\Eduardo\Downloads\NU_keylist.txt
Do you wanna store the IDs of those proteins without subcellular location annotation? Answer yes/no: no
Enter the text you want to look for in the subcellular location field: nucleus
The text 'nucleus' was found in the subcellular location field of 1159 proteins.
>>>

As you can see, first the program ask for the path and name of the keylist file. In my case my kelist file is 'NU_keylist.txt' and it's located in 'C:\Users\Eduardo\Downloads'. The answer of the second question depends on the interest of the user; some proteins are poorly annotated and they don't even have subcellular location information. However, if, for instance, the keylist file of the researcher contains proteins that were found in an experimental nuclear fraction, then those poorly annotated hits may be potential nuclear proteins. Let's continue with the code. If the given answer is 'yes', then a path is asked for storing an output file that will contain those potential new candidates. The last question is the most important, and for being completely helpful the user must know exactly the type of phrases that UniProtKB uses to annotated the subcellular location information of a protein. In other words, the parental and child terms (e.g., nucleus, nucleus inner membrane, nuclear pore complex, nucleolus, etc). Finally, the program prints the number of proteins that contain the given text in their subcellular location field. In my example, 1159 proteins of my keylist file reside in the nucleus.


That's all for this post. See you next time!

Saturday, May 5, 2012

Agilent microarray data analysis in R

A month ago I started a collaboration with some researchers of my home university in Peru. They are doing some experiments with Aspergillus niger to observe its gene expression under different substrates and types of cultures. They are evaluating two substrates: lactose and xylose; and two types of culture: biofilms and submerged culture. Thus their experimental flasks were labeled something like this: CBL, CSL, CBX and CSX from their Spanish abbreviation: C for culture, B for biofilm, S for submerged, L for lactose and X for xylose.

As you may know already, R is a free statistical software that, among many other packages, has microarray packages developed by the Bioconductor group and collaborators. For the analysis of Agilent data that has been obtained from the proprietary Agilent Feature Extraction (AFE) image analysis algorithm, readers can refer to http://www.biomedcentral.com/1471-2164/12/64 and make use of the AgiMicroRna library in R. However, in this post I deal with Agilent data in a more general manner. It means that the code that will be presented a few lines below can be used for data obtained from AFE, but also for data that is not. Let's see!

First of all, we type in the working directory and upload the limma package:


# Set the working directory or path where your data is localized.
> setwd("C:/...")
> library(limma)


At this point I have to make an interruption to say that the analysis of Agilent data requires a so called "target file", which is just a tab delimited text file created by the user, and containing the experimental desing. Have a look at the one for my A. niger experiment:

SampleNumber  FileName     Condition
1         cbl1.txt     CBL
2         cbl2.txt     CBL
3         cbl3.txt     CBL
4         cbx1.txt     CBX
4         cbx2.txt     CBX
6         cbx3.txt     CBX
7         csl1.txt     CSL
8         csl2.txt     CSL
9         csl3.txt     CSL
10        csx1.txt     CSX
11        csx2.txt     CSX
12        csx3.txt     CSX


As you can see, in my experiment, I have 12 samples representing 4 conditions or treatments (i.e., CBL, CBX, CSL and CSX), and each condition has 3 replicates whose intensity signals are store in different files output by the scanner or any other Agilent image analyser. I have simply called my target file "target.txt" and store it in the current working directory. So now we can continue with the R code:

> targets = readTargets("targets.txt")

> raw = read.maimages(targets, source="agilent",green.only=TRUE)


# Subtract the background signal.
> raw_BGcorrected = backgroundCorrect(raw, method="normexp", offset=16)
# Then normalize and log-transformed the data.
> raw_BGandNormalized = normalizeBetweenArrays(raw_BGcorrected,method="quantile")
# Finally calculate the average intensity values from the probes of each gene.   
> raw_aver = avereps(raw_BGandNormalized,ID=raw_BGandNormalized$genes$ProbeName)


We can also asses the quality of the data before and after normalization. For that, we can use boxplots or MA plots:


# Before normalization.
> boxplot(log(as.matrix(raw_BGcorrected)),las=2,ylab="Log2(Intensity)")
# After it. 
> boxplot(as.matrix(raw_BGandNormalized), las=2, ylab = "Log2(Intensity)")


# Now some MA plots of only one replicate per condition:
> library(affy)
> mva.pairs(as.matrix(raw)[,c(1,4,7,10)]) # Before BG correction
> mva.pairs(as.matrix(raw_BGcorrected)[,c(1,4,7,10)]) # Before normalization. 
> mva.pairs(as.matrix(raw_BGandNormalized)[,c(1,4,7,10)]) # After it.


After that we can proceed  with the differential expression analysis. For that we need to create a design matrix and a contrast matrix. For the former, the user can use the information already store in the target file:


> f = factor(targets$Condition)
> design = model.matrix(~f)


Or to make this post more didactic erase the above two lines and let's create the design matrix in a different way:


> design = cbind(CBL = c(1,1,1,0,0,0,0,0,0,0,0,0),   # First 3 replicates -> CBL
               CBX = c(0,0,0,1,1,1,0,0,0,0,0,0),    # The following 3 -> CBX
               CSL = c(0,0,0,0,0,0,1,1,1,0,0,0),    # The following 3 -> CSL
               CSX = c(0,0,0,0,0,0,0,0,0,1,1,1))   # Last 3 replicates -> CSX

# Now fit the average intensity data to the design matrix: 
> fit = lmFit(raw_aver, design)


Then we create the constrast matrix for each comparison we are interested in and asses for the differences in expression using a t-test. 

> contrastMatrix = makeContrasts("CBL-CSL","CBX-CSX","CBL-CBX","CSL-CSX", levels=design)


> fit2 = contrasts.fit(fit, contrastMatrix)
> fit2 = eBayes(fit2)


Finally we might want to see the top ten significant hits in each comparison, or observe the highly significant or all the significant hits, or we may want to store the upregulated and downregulated hits in different files. For that we just need to type in the final lines:

# Now let's look for the top ten signficants hits in each comparison:
> topTable(fit2, coef = "CBL-CSL")   # The ten top significants in CBL vs CSL.
> topTable(fit2, coef = "CBX-CSX")   # The ten top significants in CBX vs CSX.
> topTable(fit2, coef = "CBL-CBX")   # The ten top significants in CBL vs CBX.
> topTable(fit2, coef = "CSL-CSX")   # The ten top significants in CSL vs CSX.

# The significant hits based on adjusted p-value for the comparison CBL vs CSL:   
> sig = length(which(topTable(fit2, coef = "CBL-CSL",number=42370)[,15]<0.05)) 
> signif = topTable(fit2, coef = "CBL-CSL",number=sig)


> upregulated = signif[which(signif[,11]>0),]   # The upregulated hits in CBL.
> downregulated = signif[which(signif[,11]<0),]  # The downregulated ones.

# Save them in different files for future annotation or functional cluster
# analysis:
> write.table(upregulated, "CBLvsCSL_Upre.txt", sep="\t")
> write.table(downregulated, "CBLvsCSL_Downre.txt", sep="\t")


To end this post I just want to make some notes. Firstly, the parameters of the topTable function can be studied by just typing in > help(topTable). For my A. niger data, I have used number=42370 because the number of probes in my microarrays are less but close to that number. Secondly, the indexes [,15] and [,11] may change for the user's data depending on the column localization of the adjusted p-value and the fold change, respectively, in the user's own data frame of significant hits. Finally, with some more scripts the user can also make heatmaps and cluster analyses. That might be a topic for another post in this blog. Till next time! 

Monday, April 23, 2012

A UniProt KeyList file parser in Python

In this first post, I'd like to share a parser that I wrote some months ago when working with proteomics data.

Usually a researcher wants to know what a protein does and therefore goes to UniProtKB and searches for it using its AC number (e.g., Q8NFH3)  or its ID (e.g, NUP43_HUMAN) and reads the available information that goes from the protein name, its description, function, sequence and links to other databases. Others the researcher wants to do a batch search and so uses the "Retrieve" tool of UniProtKB that outputs different types of file (e.g., FASTA, GFF, XML, Flat Text, etc.) with different information each. The Flat Text file is the most informative one since it contains all the information displayed in the interface of a basic UniProtKB search. A single record of this file looks like this:


ID   RBM47_HUMAN             Reviewed;         593 AA.
AC   A0AV96; A0PJK2; B5MED4; Q8NI52; Q8NI53; Q9NXG3;
DT   23-OCT-2007, integrated into UniProtKB/Swiss-Prot.
DT   30-NOV-2010, sequence version 2.
DT   18-APR-2012, entry version 46.
DE   RecName: Full=RNA-binding protein 47;
DE   AltName: Full=RNA-binding motif protein 47;
GN   Name=RBM47; 
:
:
CC   -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=2;
CC       Name=1;
CC         IsoId=A0AV96-1; Sequence=Displayed;
CC       Name=2;
CC         IsoId=A0AV96-2; Sequence=VSP_028839;
CC   -!- SIMILARITY: Belongs to the RRM RBM47 family.
CC   -!- SIMILARITY: Contains 3 RRM (RNA recognition motif) domains. 
:
:
:
SQ   SEQUENCE   593 AA;  64099 MW;  AEA061F89A68010B CRC64;
     MTAEDSTAAM SSDSAAGSSA KVPEGVAGAP NEAALLALME RTGYSMVQEN GQRKYGGPPP
     GWEGPHPQRG CEVFVGKIPR DVYEDELVPV FEAVGRIYEL RLMMDFDGKN RGYAFVMYCH
     :
     :
     KVPEGVAGAP
//



As seen above, each record has a series of keywords (e.g., ID, AC, DE, CC, SQ, etc.) that store particular type of information of the protein. Because of these keywords, this type of file is commonly known as KeyList file, and thanks to them the file is easy to parse and so extract information record-wise and mine it. For example one can have a KeyList file with thousands of records and wants to extract their descriptions or all the accession numbers associated with each ID or, even more important, information like the functions or subcellular locations of the records. The code I wrote for parsing through this kind of file is the following:



class Record(dict): """ This record stores the information of one keyword or category in the keywlist.txt as a Python dictionary. The keys in this dictionary are the line codes that can appear in the keywlist.txt file: --------- --------------------------- ---------------------- Line code Content Occurrence in an entry --------- --------------------------- ---------------------- ID Identifier (keyword) Once; starts a keyword entry. AC Accession (KW-xxxx) Once. DE Definition Once or more. CC Subcellular Location Once or more; comments. SQ Sequence Once; contains only the heading information. """ def __init__(self): dict.__init__(self) for keyword in ("DE", "CC"): self[keyword] = [] def parse(handle): # The parameter handle is the UniProt KeyList file. record = Record() # Now parse the records for line in handle: key = line[:2] if key=="//": # The last line of the current record has been reached. record["DE"] = " ".join(record["DE"]) record["CC"] = " ".join(record["CC"]) yield record # So we output the record and pass to other one. record = Record() elif line[2:5]==" ": # If not, we continue recruiting the information. value = line[5:].strip() if key in ("ID", "AC", "SQ"): record[key] = value elif key in ("DE", "CC"): record[key].append(value) else: pass # Read the footer and throw it away for line in handle: pass




You can copy this script and save it to a python file called for instance UniProt_parser.py and then use it as a module in the current python shell or any other new script using the import tool. Something like this:



from UniProt_parser import * handle = open("Name of the UniProt keylist file") records = parse(handle) # Uses the function 'parse' from the module.
for record in records:
print record["ID"] print record["AC"] print record["CC"] print record["SQ"]




With a bit more of scripting lines, the parser can be use for mining the information, for example to know how many proteins have subcellular location in the membrane, nucleus, mitochondrion, etc. Or retrieve the molecular weigth and/or sequence length of the protein and store them in a file.

That's it for this post. I hope it was useful. Till next time!