In this first post, I'd like to share a parser that I wrote some months ago when working with proteomics data.
Usually a researcher who wants to know what a protein does goes to UniProtKB, searches for it using its AC number (e.g., Q8NFH3) or its ID (e.g., NUP43_HUMAN), and reads the available information, which ranges from the protein name, description, function and sequence to links to other databases. Other times the researcher wants to do a batch search and so uses the "Retrieve" tool of UniProtKB, which outputs different types of file (e.g., FASTA, GFF, XML, Flat Text, etc.), each with different information. The Flat Text file is the most informative one since it contains all the information displayed in the interface of a basic UniProtKB search. A single record of this file looks like this:
ID   RBM47_HUMAN             Reviewed;         593 AA.
AC   A0AV96; A0PJK2; B5MED4; Q8NI52; Q8NI53; Q9NXG3;
DT   23-OCT-2007, integrated into UniProtKB/Swiss-Prot.
DT   30-NOV-2010, sequence version 2.
DT   18-APR-2012, entry version 46.
DE   RecName: Full=RNA-binding protein 47;
DE   AltName: Full=RNA-binding motif protein 47;
GN   Name=RBM47;
:
CC   -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=2;
CC       Name=1;
CC         IsoId=A0AV96-1; Sequence=Displayed;
CC       Name=2;
CC         IsoId=A0AV96-2; Sequence=VSP_028839;
CC   -!- SIMILARITY: Belongs to the RRM RBM47 family.
CC   -!- SIMILARITY: Contains 3 RRM (RNA recognition motif) domains.
:
SQ   SEQUENCE   593 AA;  64099 MW;  AEA061F89A68010B CRC64;
     MTAEDSTAAM SSDSAAGSSA KVPEGVAGAP NEAALLALME RTGYSMVQEN GQRKYGGPPP
     GWEGPHPQRG CEVFVGKIPR DVYEDELVPV FEAVGRIYEL RLMMDFDGKN RGYAFVMYCH
:
As seen above, each record has a series of keywords (e.g., ID, AC, DE, CC, SQ, etc.), each storing a particular type of information about the protein. Because of these keywords, this type of file is commonly known as a KeyList file, and thanks to them the file is easy to parse, so information can be extracted record by record and mined. For example, one may have a KeyList file with thousands of records and want to extract their descriptions, or all the accession numbers associated with each ID, or, even more important, information like the functions or subcellular locations of the records. The code I wrote for parsing this kind of file is the following:
class Record(dict):
    """
    This record stores the information of one entry of the UniProt
    KeyList file as a Python dictionary. The keys in this dictionary
    are the line codes handled by the parser:

    ---------  ---------------------  ------------------------------
    Line code  Content                Occurrence in an entry
    ---------  ---------------------  ------------------------------
    ID         Identifier             Once; starts an entry.
    AC         Accession number(s)    Once or more.
    DE         Description            Once or more.
    CC         Comments (e.g.         Once or more.
               subcellular location)
    SQ         Sequence               Once; contains only the
                                      heading information.
    """
    def __init__(self):
        dict.__init__(self)
        # Line codes that can span several lines are collected in lists.
        for keyword in ("AC", "DE", "CC"):
            self[keyword] = []


def parse(handle):
    # The parameter handle is the open UniProt KeyList file.
    record = Record()
    # Now parse the records
    for line in handle:
        key = line[:2]
        if key == "//":
            # The last line of the current record has been reached.
            for keyword in ("AC", "DE", "CC"):
                record[keyword] = " ".join(record[keyword])
            yield record  # So we output the record and move on to the next one.
            record = Record()
        elif line[2:5] == "   ":
            # If not, we keep collecting the information. The line code is
            # followed by three spaces; sequence lines start with spaces
            # only, so they match no line code below and are skipped.
            value = line[5:].strip()
            if key in ("ID", "SQ"):
                record[key] = value
            elif key in ("AC", "DE", "CC"):
                record[key].append(value)
You can copy this script and save it to a Python file called, for instance, UniProt_parser.py, and then use it as a module in the current Python shell or in any other script with an import statement. Something like this:
from UniProt_parser import parse

with open("Name of the UniProt keylist file") as handle:
    records = parse(handle)  # Uses the function 'parse' from the module.
    for record in records:
        print(record["ID"])
        print(record["AC"])
        print(record["CC"])
        print(record["SQ"])
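The values come back as raw strings, so getting "all the accession numbers associated with each ID" still needs a little splitting. Here is a minimal sketch of how that could look. The helper names `entry_name` and `accessions` are my own for illustration and are not part of the parser, and the record is written by hand to mimic what `parse` yields.

```python
def entry_name(id_value):
    """Return the entry name, i.e. the first token of an ID value."""
    return id_value.split()[0]


def accessions(ac_value):
    """Split an AC value such as 'A0AV96; A0PJK2;' into a list."""
    return [ac.strip() for ac in ac_value.split(";") if ac.strip()]


# Hypothetical record, shaped like the dictionaries that parse() yields.
record = {
    "ID": "RBM47_HUMAN             Reviewed;         593 AA.",
    "AC": "A0AV96; A0PJK2; B5MED4; Q8NI52; Q8NI53; Q9NXG3;",
}

# Map each entry name to its list of accession numbers.
accessions_by_entry = {entry_name(record["ID"]): accessions(record["AC"])}
print(accessions_by_entry)
```

With real data you would build the dictionary inside the loop over `parse(handle)`. The first accession in the list is the primary one.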
With a few more lines of scripting, the parser can be used for mining the information, for example to find out how many proteins have a subcellular location in the membrane, nucleus, mitochondrion, etc., or to retrieve the molecular weight and/or sequence length of each protein and store them in a file.
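As a rough sketch of that kind of mining, the snippet below counts compartments with a simple keyword match on the SUBCELLULAR LOCATION comment and pulls the length and molecular weight out of the SQ header. All function names and the `COMPARTMENTS` tuple are my own choices for illustration. The record is hand-written to mimic the parser's output, and an in-memory buffer stands in for the output file.

```python
import io
import re
from collections import Counter

# Illustrative choice of compartments to look for.
COMPARTMENTS = ("membrane", "nucleus", "mitochondrion")


def comment_topics(cc_text):
    """Group the joined CC text by topic, e.g. 'SUBCELLULAR LOCATION'."""
    topics = {}
    for block in cc_text.split("-!-"):
        topic, sep, text = block.partition(":")
        if sep:
            topics.setdefault(topic.strip(), []).append(text.strip())
    return topics


def count_compartments(records):
    """Count records whose subcellular location mentions each compartment."""
    counts = Counter()
    for record in records:
        topics = comment_topics(record.get("CC", ""))
        text = " ".join(topics.get("SUBCELLULAR LOCATION", [])).lower()
        for compartment in COMPARTMENTS:
            if compartment in text:
                counts[compartment] += 1
    return counts


def sequence_stats(sq_text):
    """Return (length in AA, molecular weight) from an SQ header, or None."""
    match = re.search(r"(\d+)\s+AA;\s+(\d+)\s+MW;", sq_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def write_stats(records, out):
    """Write one 'entry name<TAB>length<TAB>weight' line per record."""
    for record in records:
        stats = sequence_stats(record.get("SQ", ""))
        if stats:
            out.write("{}\t{}\t{}\n".format(record["ID"].split()[0], *stats))


# Hypothetical record, shaped like the dictionaries that parse() yields.
records = [{
    "ID": "RBM47_HUMAN             Reviewed;         593 AA.",
    "CC": "-!- SUBCELLULAR LOCATION: Nucleus (By similarity). "
          "-!- SIMILARITY: Belongs to the RRM RBM47 family.",
    "SQ": "SEQUENCE   593 AA;  64099 MW;  AEA061F89A68010B CRC64;",
}]

print(count_compartments(records))
table = io.StringIO()  # stands in for a file opened with open(..., "w")
write_stats(records, table)
print(table.getvalue(), end="")
```

The keyword match is deliberately crude: a location such as "Nucleus membrane" would count for both nucleus and membrane, so a real analysis would probably want stricter matching.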
That's it for this post. I hope it was useful. Till next time!