Lab 8. File IO

Lab 8. File IO. Reading and processing biological sequences.

How to submit your code

Each program (the source code .cpp file) should be submitted through Blackboard (Course Materials > Lab).

You can submit all your programs at the end of the lab session in one submission. This way, we can hopefully avoid the situation when you are quickly writing your program, immediately uploading it to Blackboard, but then, say 10 minutes later, realizing that there is a bug in it.

Basically, submit when you are sure that it will be your final version.

Each program should start with a comment that contains your name and a short program description, for example:

/*
  Author: Your Name 
  Description: Lab 1. Task 1. Hello world
*/

You can submit incomplete programs, but their “incomplete” status must be clearly mentioned in the comment to the program. In this case, also briefly describe what is implemented, and what is not.

Reading protein sequences from a file

Protein

A protein molecule is a folded chain of amino acids. There are 24 possible types of amino acids, so the sequence can be easily encoded as a string of letters, for example:

LVHQRTELVNALRAVLYEFGLVVPQGIAHIRHIEAMLDEAVLPEAVKQECLDLLRQISEQSVRIDVRTKK

Each letter represents a certain type of amino acid:
L = Leucine, V = Valine, H = Histidine, Q = Glutamine, R = Arginine, and so on.

Protein structure

In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or amino acid sequences, in which nucleotides or amino acids are represented using single-letter codes.

>gi|384117544|ref|YP_005479417.1| transposase [Acetobacter pasteurianus]
MVVGRNDCAKGRQMKDTVIGVDLAKNIFQVHGASRAGEVMFRKKLRRQQFMQFMATQPPALVVLEACGSA
HYWARELAGAGHEVRLIAPQYVKPFVKRQKNDAADAEAIVIAARQPEMRFVEPRTEAQQARGVLFRARQR
LVHQRTELVNALRAVLYEFGLVVPQGIAHIRHIEAMLDEAVLPEAVKQECLDLLRQISEQSVRIDVRTKK
IRMLAQESENTCRLQSMPGVGPLTALAIEAFAPDLQSFRRGRDFAAWLGLVPRQFSSGGKERLGKISKAG
QADIRRLLIMGAMTQVNWASRKAPAPGSWLARMLARKPRMLVAIALANRMARAIWAMATKQEDYRDPALS
VAA

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (>) symbol in the first column. The word following the > symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the > and the first letter of the identifier. It is recommended (but not required) that all lines of text be shorter than 80 characters.

The sequence ends if another line starting with a > appears; this indicates the start of another sequence. So multiple sequences can be stored in one file:

>first
MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDTSLNARSLTTT
PSLTTCTPPSLAACTPPTSLGMVDSPPHINPPRRIGTLCFDFGSAKSPQRCECVASDRPS
TTSNTAPDTYRLLITNSKTRKNNYGTCRLEPLTYGI
>second
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
>third
MRYTVLIALQGALLLLLLIDDGQGQSPYPYPGMPCNSSRQCGLGTCVHSRCAHCSSDGTL
CSPEDPTMVWPCCPESSCQLVVGLPSLVNHYNCLPNQCTDSSQCPGGFGCMTRRSKCELC
KADGEACNSPYLDWRKDKECCSGYCHTEARGLEGVCIDPKKIFCTPKNPWQLAPYPPSYH
QPTTLRPPTSLYDSWLMSGFLVKSTTAPSTQEEEDDY

Download data:

Download a larger dataset: main.fasta

Download a smaller dataset for testing: testing.fasta

More FASTA files can be downloaded, for example, from National Center for Biotechnology Information. Look for files with an extension “.faa” (FASTA amino acid).

Assignments

Preliminary task. Reading a file (don’t submit this program).

Read the file “testing.fasta” using a file stream object ifstream and print its contents on the screen.

ifstream usage example:

ifstream fin;
fin.open("myfile.txt");
if (fin.fail()) {
    cout << "File does not exist" << endl;
    exit(1);
}

char c;
while(fin.get(c)) {
    cout << c;
}

Task 1. Printing certain chosen sequences from the file.

Read the file “testing.fasta” using a file stream object ifstream.

Print out the 3rd and the 7th sequences from the file including both the description lines and the sequence data.

(If you want, you may assume that the greater-than > symbol appears only at the beginning of every description line, and not anywhere else).

Task 2. Finding the longest stretch composed of a single amino acid type.

The program should read the input file searching for the longest contiguous stretch of a single amino acid type (one letter repeated many times).

Example: when reading the file

>some_protein
LVHAHILPEAVKQECLDLLRQISEQSVRAAAAAAA
AAAAAALDFAAWLGLLLLQFSSGGKERLGKISKAG
>another_one
CSPEDPTMVWPCCPESSCQLVVGLPSLVNHYNCLP
NQCTDSSQCPGGFGCMTRRSKCELC

the longest stretch would be AAAAAAAAAAAAA, that is A appears 13 times in a row. Observe that it can span multiple lines. The program should print out:

A x 13: AAAAAAAAAAAAA

What to do with the program:

Briefly report your findings in the comment to the program.

Task 3. Searching for a specific substring in the file.

The task is to read the file “main.fasta”, searching for a sequence that contains the substring “CSCI” in it (the amino acids: C, S, C, I).

Since we did not learn about strings yet, you may use an array of four characters:

char query[4] = {'C', 'S', 'C', 'I'};
int length = 4;

The task is to find an amino acid sequence that contains the substring “CSCI” and print it out. Since you have to print the entire sequence, including its description, you may want to use the member functions seekg and tellg of ifstream. The first function lets you set the position of where you are reading from the file. And the second lets you find what your current position in the file is (it is simply the number of bytes from the beginning of the file).

So, when used together, the functions let you remember your current position in the file, and return to it later, if it is necessary.

It works as follows:

ifstream fin;
fin.open("file");

// remember your current position in the file
int pos = fin.tellg();

// read something from the file 
fin >> x;
fin >> y; 
...

// return to the remembered position:
fin.seekg(pos);

Hint: So, you may remember the position of the beginning of the sequence (>), and return to it if you see that the sequence contains “CSCI”.

The output should look like:

>sp|A1C874|3HAO2_ASPCL 3-hydroxyanthranilate 3,4-dioxygenase 2 OS=Aspergillus clavatus (strain ATCC 1007 / CBS 513.65 / DSM 816 / NCTC 3887 / NRRL 1) GN=bna1-2 PE=3 SV=2
MNPMPLSPLFFATWLAENEDQLRPPVNNYCLYQGNDFILMAVGGPNERNDYHVNETEVCL
QPSWCSREANEQEWFYQVKGDMLLRVVENNAFRDIPIKEGEMFLLPGNTPHNPVRFKDTI
GLVMERQRPAGSRDRLRWYCTKGDHASPTIIREEVFHCSDLGTQLKPIIEQWQQDEDGRR
CAECSCIADPK