Sunday, May 20, 2012

Coroutines in Python: an example applied to Bioinformatics.

Hello again! The first post of this blog dealt with a UniProt keylist parser, where I told you that we could mine information like that stored in the annotation field of each record with some additional scripts. Well, this post deals with exactly that. The code presented at the bottom could actually be simpler, but my aim is to introduce a not very common Python object called a coroutine.

Coroutines are similar to functions and generators. However, whereas those operate on a single set of input arguments, coroutines can take a sequence of inputs sent to them. Indeed, coroutines are functions created by using the yield statement (as generators are), but written as an expression '(yield)'. Let's see an example that David Beazley shows in the fourth edition of his book Python Essential Reference:

def print_matches(matchtext):
    print "Looking for", matchtext
    while True:
        line = (yield)     # Get a line of text
        if matchtext in line:
            print line

To use a coroutine, you first advance it by calling its next() method, which runs the function up to the first (yield) expression; only then can you start sending values into it with the send() method.

>>> matcher = print_matches("python")
>>> matcher.next()     # Advance the coroutine to the first (yield)
Looking for python
>>> matcher.send("Hello World")   # Generates no output because the text 'python' is not in the given string
>>> matcher.send("python is cool")
python is cool
>>> matcher.send("Yeah!")
>>> matcher.close()    # Done with the matcher function call
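Forgetting that initial next() call is a common mistake: send() to a just-started coroutine raises a TypeError. A trick popularized by Beazley is a decorator that primes the coroutine automatically. Here is a minimal sketch, written in Python 3 syntax (the snippets in this post use Python 2), with hypothetical names coroutine and match_collector:

```python
def coroutine(func):
    """Decorator that advances a coroutine to its first (yield) automatically."""
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)  # prime the coroutine so it is ready to receive send()
        return cr
    return start

@coroutine
def match_collector(matchtext):
    """Collect every line sent in that contains matchtext."""
    matches = []
    while True:
        line = (yield matches)  # hand back the matches found so far
        if matchtext in line:
            matches.append(line)

m = match_collector("python")   # no explicit next() needed thanks to the decorator
m.send("Hello World")
found = m.send("python is cool")
print(found)  # ['python is cool']
```

Because the decorated coroutine is already primed, callers can send() into it immediately after creating it.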

Coroutines are useful when writing cooperative programs, where one part of a program consumes the information generated by another. Like generators, coroutines can be very useful for processing information produced by pipelines, streams or data flows. However, in this post I will use one in an introductory way so you can observe how it works. Let's suppose we have a UniProt keylist file from which we want to know how many proteins are annotated as residing in the nucleus, the cytoplasm or any other subcellular location. The parser generator defined in a previous post can extract the annotations stored in the "CC" field of any record. The "CC" field contains information such as the function of a protein, its tissue specificity and its subcellular location.

>>> re["CC"]
'-!- FUNCTION: May act as a transcriptional repressor. Down-regulates CA9 expression. -!- SUBUNIT: Interacts with HDAC4. -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm, cytosol. Note=Mainly located in the nucleus. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q9Y6X9-1; Sequence=Displayed; Note=No experimental confirmation available; Name=2; IsoId=Q9Y6X9-2; Sequence=VSP_041759; Note=No experimental confirmation available; -!- TISSUE SPECIFICITY: Highly expressed in smooth muscle, pancreas and testis. -!- SIMILARITY: Contains 1 CW-type zinc finger. -!- SEQUENCE CAUTION: Sequence=AAC12954.1; Type=Erroneous gene model prediction; Sequence=BAA74875.2; Type=Miscellaneous discrepancy; Note=Intron retention; ----------------------------------------------------------------------- Copyrighted by the UniProt Consortium, see Distributed under the Creative Commons Attribution-NoDerivs License -----------------------------------------------------------------------'
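Before looking at the full program, here is how the subcellular-location block can be pulled out of such a "CC" string by splitting on the '-!- ' separator. A minimal sketch in Python 3 syntax, using a hypothetical abridged cc string modeled on the record above:

```python
# A hypothetical, abridged "CC" string in the style of the record above.
cc = ("-!- FUNCTION: May act as a transcriptional repressor. "
      "-!- SUBUNIT: Interacts with HDAC4. "
      "-!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm, cytosol. "
      "-!- TISSUE SPECIFICITY: Highly expressed in smooth muscle.")

subloc = None
for item in cc.split("-!- "):                   # one element per annotation topic
    if item.startswith("SUBCELLULAR LOCATION"):
        subloc = item.lower().strip()           # lowercase to ease later matching
        break

print(subloc)  # subcellular location: nucleus. cytoplasm, cytosol.
```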

For the topic of this post, we are interested in the subcellular location of our proteins. The Python code to work on that piece of the "CC" field is the following:

''' This program looks for a specific text in the subcellular location field
    of the proteins stored in a UniProtKB keylist file. If requested, those
    entries with no subcellular location annotation at all are written in a
    separate output file. '''

subloc = 0
def subloc_counter(matchtext): # This is a coroutine (similar to, but neither a
     global subloc             # function nor a generator).
     while True:
          string = (yield) # Incorporates the input to be evaluated.
          if matchtext in string:
               subloc += 1

# Now let's read the KeyList file using the generator defined in a previous post.
from UniProt_parser import *
handle = open(raw_input("Path and name of the keylist file: "))
records = parse(handle)

# Proteins with no subcellular location annotation might be interesting.
store = raw_input("Do you wanna store the IDs of those proteins without \
subcellular location annotation? Answer yes/no: ")
if store == "yes":
     output = open(raw_input("Where to store the entries without subcell location? "), 'w')

# Now, let's count the matches to an input text:
text = raw_input("Enter the text you want to look for in the subcellular location field: ")
counter = subloc_counter(text.lower()) # Preparing for counting.
counter.next() # Advance the coroutine to its first (yield) so it can receive send().

for re in records:
     SL = re["CC"].split("-!- ") # The split function returns a list.
     for item in SL:
          if item.startswith("SUBCELLULAR LOCATION"): # If a subloc annotation exists, keep
               SL = item.lower() # only the lines related to it, in lowercase to avoid
               break             # case-sensitivity.

     if type(SL) == list: # If SL is still a list, no subloc annotation was found.
          if store == "yes": # Then, if requested, store that protein's ID
               output.write(re["ID"].split(" ")[0]+'\n') # in the output file.
     else:
          counter.send(SL) # Look for the text in the current SubLoc field.

# After searching in the annotation of all proteins:
print "The text '%s' was found in the subcellular location field of %d proteins." \
      % (text, subloc)

# And close the coroutine and any still-open files:
counter.close()
handle.close()
if store == "yes": output.close()
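As I said at the beginning, this code could actually be simpler: without a coroutine, the counting step reduces to a plain function over the already-extracted subcellular-location strings. A minimal sketch of that alternative, in Python 3 syntax with hypothetical example data:

```python
def count_subloc(sublocs, matchtext):
    """Count how many subcellular-location strings contain matchtext."""
    return sum(1 for s in sublocs if matchtext in s)

# Hypothetical subcellular-location strings, as the main loop would extract them.
sublocs = ["subcellular location: nucleus.",
           "subcellular location: cytoplasm, cytosol.",
           "subcellular location: nucleus. cytoplasm."]
print(count_subloc(sublocs, "nucleus"))  # 2
```

The coroutine version, however, keeps a running count as records stream in, which is the behaviour this post sets out to demonstrate.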

When the program is run in the Python shell, the user is asked for certain information and files, after which the program outputs the number of proteins localized in a particular subcellular location. Let's see how it works.

>>> ====================== RESTART ==========================
Path and name of the keylist file: C:\Users\Eduardo\Downloads\NU_keylist.txt
Do you wanna store the IDs of those proteins without subcellular location annotation? Answer yes/no: no
Enter the text you want to look for in the subcellular location field: nucleus
The text 'nucleus' was found in the subcellular location field of 1159 proteins.

As you can see, the program first asks for the path and name of the keylist file. In my case, the keylist file is 'NU_keylist.txt' and it's located in 'C:\Users\Eduardo\Downloads'. The answer to the second question depends on the interest of the user; some proteins are poorly annotated and don't even have subcellular location information. However, if, for instance, the researcher's keylist file contains proteins that were found in an experimental nuclear fraction, then those poorly annotated hits may be potential nuclear proteins. Let's continue with the code. If the given answer is 'yes', then a path is requested for storing an output file that will contain those potential new candidates. The last question is the most important one, and for it to be completely helpful the user must know exactly the kind of phrases that UniProtKB uses to annotate the subcellular location of a protein; in other words, the parent and child terms (e.g., nucleus, nucleus inner membrane, nuclear pore complex, nucleolus, etc.). Finally, the program prints the number of proteins that contain the given text in their subcellular location field. In my example, 1159 proteins of my keylist file reside in the nucleus.

That's all for this post. See you next time!

Saturday, May 5, 2012

Agilent microarray data analysis in R

A month ago I started a collaboration with some researchers at my home university in Peru. They are doing experiments with Aspergillus niger to observe its gene expression under different substrates and types of culture. They are evaluating two substrates, lactose and xylose, and two types of culture, biofilms and submerged culture. Thus their experimental flasks were labeled something like CBL, CSL, CBX and CSX, from the Spanish abbreviations: C for culture, B for biofilm, S for submerged, L for lactose and X for xylose.

As you may already know, R is free statistical software that, among many other packages, has microarray packages developed by the Bioconductor group and collaborators. For the analysis of Agilent data obtained with the proprietary Agilent Feature Extraction (AFE) image analysis software, readers can make use of the AgiMicroRna package in R. However, in this post I deal with Agilent data in a more general manner, meaning that the code presented a few lines below can be used for data obtained from AFE, but also for data that was not. Let's see!

First of all, we set the working directory and load the limma package:

# Set the working directory or path where your data is localized.
> setwd("C:/...")
> library(limma)

At this point I have to make a brief interruption to say that the analysis of Agilent data requires a so-called "target file", which is just a tab-delimited text file created by the user that describes the experimental design. Have a look at the one for my A. niger experiment:

SampleNumber  FileName     Condition
1             cbl1.txt     CBL
2             cbl2.txt     CBL
3             cbl3.txt     CBL
4             cbx1.txt     CBX
5             cbx2.txt     CBX
6             cbx3.txt     CBX
7             csl1.txt     CSL
8             csl2.txt     CSL
9             csl3.txt     CSL
10            csx1.txt     CSX
11            csx2.txt     CSX
12            csx3.txt     CSX

As you can see, in my experiment I have 12 samples representing 4 conditions or treatments (i.e., CBL, CBX, CSL and CSX), and each condition has 3 replicates whose intensity signals are stored in different files output by the scanner or any other Agilent image analyser. I have simply called my target file "targets.txt" and stored it in the current working directory. So now we can continue with the R code:

> targets = readTargets("targets.txt")

> raw = read.maimages(targets, source="agilent",green.only=TRUE)

# Subtract the background signal.
> raw_BGcorrected = backgroundCorrect(raw, method="normexp", offset=16)
# Then normalize and log2-transform the data.
> raw_BGandNormalized = normalizeBetweenArrays(raw_BGcorrected,method="quantile")
# Finally calculate the average intensity values from the probes of each gene.   
> raw_aver = avereps(raw_BGandNormalized,ID=raw_BGandNormalized$genes$ProbeName)

We can also assess the quality of the data before and after normalization. For that, we can use boxplots or MA plots:

# Before normalization.
> boxplot(log2(as.matrix(raw_BGcorrected)), las=2, ylab="Log2(Intensity)")
# After it. 
> boxplot(as.matrix(raw_BGandNormalized), las=2, ylab = "Log2(Intensity)")

# Now some MA plots of only one replicate per condition:
> library(affy)
> mva.pairs(as.matrix(raw)[,c(1,4,7,10)]) # Before BG correction
> mva.pairs(as.matrix(raw_BGcorrected)[,c(1,4,7,10)]) # Before normalization. 
> mva.pairs(as.matrix(raw_BGandNormalized)[,c(1,4,7,10)]) # After it.

After that we can proceed with the differential expression analysis. For that we need to create a design matrix and a contrast matrix. For the former, the user can make use of the information already stored in the target file:

> f = factor(targets$Condition)
> design = model.matrix(~0+f)   # No intercept: one column per condition.
> colnames(design) = levels(f)  # Name the columns CBL, CBX, CSL and CSX.

Or, to make this post more didactic, erase the lines above and let's create the design matrix in a different way:

> design = cbind(CBL = c(1,1,1,0,0,0,0,0,0,0,0,0),   # First 3 replicates -> CBL
               CBX = c(0,0,0,1,1,1,0,0,0,0,0,0),    # The following 3 -> CBX
               CSL = c(0,0,0,0,0,0,1,1,1,0,0,0),    # The following 3 -> CSL
               CSX = c(0,0,0,0,0,0,0,0,0,1,1,1))   # Last 3 replicates -> CSX

# Now fit the average intensity data to the design matrix: 
> fit = lmFit(raw_aver, design)

Then we create the contrast matrix for each comparison we are interested in and assess the differences in expression using a moderated t-test.

> contrastMatrix = makeContrasts("CBL-CSL","CBX-CSX","CBL-CBX","CSL-CSX", levels=design)

> fit2 = contrasts.fit(fit, contrastMatrix)
> fit2 = eBayes(fit2)

Finally, we might want to see the top ten significant hits in each comparison, observe the highly significant or all of the significant hits, or store the upregulated and downregulated hits in different files. For that we just need to type in the final lines:

# Now let's look at the top ten significant hits in each comparison:
> topTable(fit2, coef = "CBL-CSL")   # The top ten significant hits in CBL vs CSL.
> topTable(fit2, coef = "CBX-CSX")   # The top ten significant hits in CBX vs CSX.
> topTable(fit2, coef = "CBL-CBX")   # The top ten significant hits in CBL vs CBX.
> topTable(fit2, coef = "CSL-CSX")   # The top ten significant hits in CSL vs CSX.

# The significant hits based on adjusted p-value for the comparison CBL vs CSL:   
> sig = length(which(topTable(fit2, coef = "CBL-CSL",number=42370)[,15]<0.05)) 
> signif = topTable(fit2, coef = "CBL-CSL",number=sig)

> upregulated = signif[which(signif[,11]>0),]   # The upregulated hits in CBL.
> downregulated = signif[which(signif[,11]<0),]  # The downregulated ones.

# Save them in different files for future annotation or functional cluster
# analysis:
> write.table(upregulated, "CBLvsCSL_Upre.txt", sep="\t")
> write.table(downregulated, "CBLvsCSL_Downre.txt", sep="\t")

To end this post I just want to make some notes. Firstly, the parameters of the topTable function can be studied by typing > help(topTable). For my A. niger data I used number=42370 because the number of probes on my microarrays is slightly below that number. Secondly, the indexes [,15] and [,11] may change for the user's data, depending on the column positions of the adjusted p-value and the log fold change, respectively, in the user's own data frame of significant hits. Finally, with a few more scripts the user can also make heatmaps and cluster analyses. That might be a topic for another post in this blog. Till next time!