Each program (the source code .cpp file) should be submitted through Blackboard (Course Materials > Lab).
You can submit all your programs at the end of the lab session in one submission. This way, we can hopefully avoid the situation when you are quickly writing your program, immediately uploading it to Blackboard, but then, say 10 minutes later, realizing that there is a bug in it.
Basically, submit when you are sure that it will be your final version.
Each program should start with a comment that contains your name and a short program description, for example:
/*
Author: Your Name
Description: Lab 1. Task 1. Hello world
*/
You can submit incomplete programs, but their “incomplete” status must be clearly mentioned in the comment to the program. In this case, also briefly describe what is implemented, and what is not.
A protein molecule is a folded chain of amino acids. There are 24 possible types of amino acids, so the sequence can be easily encoded as a string of letters, for example:
LVHQRTELVNALRAVLYEFGLVVPQGIAHIRHIEAMLDEAVLPEAVKQECLDLLRQISEQSVRIDVRTKK
Each letter represents a certain type of amino acid:
L = Leucine, V = Valine, H = Histidine, Q = Glutamine, R = Arginine, and so on.
In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or amino acid sequences, in which nucleotides or amino acids are represented using single-letter codes.
>gi|384117544|ref|YP_005479417.1| transposase [Acetobacter pasteurianus]
MVVGRNDCAKGRQMKDTVIGVDLAKNIFQVHGASRAGEVMFRKKLRRQQFMQFMATQPPALVVLEACGSA
HYWARELAGAGHEVRLIAPQYVKPFVKRQKNDAADAEAIVIAARQPEMRFVEPRTEAQQARGVLFRARQR
LVHQRTELVNALRAVLYEFGLVVPQGIAHIRHIEAMLDEAVLPEAVKQECLDLLRQISEQSVRIDVRTKK
IRMLAQESENTCRLQSMPGVGPLTALAIEAFAPDLQSFRRGRDFAAWLGLVPRQFSSGGKERLGKISKAG
QADIRRLLIMGAMTQVNWASRKAPAPGSWLARMLARKPRMLVAIALANRMARAIWAMATKQEDYRDPALS
VAA
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
The description line is distinguished from the sequence data by a greater-than (>
) symbol in the first column.
The word following the >
symbol is the identifier of the sequence, and the rest of the line is the description (both are optional).
There should be no space between the >
and the first letter of the identifier.
It is recommended (but not required) that all lines of text be shorter than 80 characters.
The sequence ends if another line starting with a >
appears; this indicates the start of another sequence.
So multiple sequences can be stored in one file:
>first
MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDTSLNARSLTTT
PSLTTCTPPSLAACTPPTSLGMVDSPPHINPPRRIGTLCFDFGSAKSPQRCECVASDRPS
TTSNTAPDTYRLLITNSKTRKNNYGTCRLEPLTYGI
>second
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
>third
MRYTVLIALQGALLLLLLIDDGQGQSPYPYPGMPCNSSRQCGLGTCVHSRCAHCSSDGTL
CSPEDPTMVWPCCPESSCQLVVGLPSLVNHYNCLPNQCTDSSQCPGGFGCMTRRSKCELC
KADGEACNSPYLDWRKDKECCSGYCHTEARGLEGVCIDPKKIFCTPKNPWQLAPYPPSYH
QPTTLRPPTSLYDSWLMSGFLVKSTTAPSTQEEEDDY
Download a larger dataset: main.fasta
Download a smaller dataset for testing: testing.fasta
More FASTA files can be downloaded, for example, from National Center for Biotechnology Information. Look for files with an extension “.faa” (FASTA amino acid).
Read the file “testing.fasta” using a file stream object ifstream
and print its contents on the screen.
ifstream
usage example:
ifstream fin;
fin.open("myfile.txt");
if (fin.fail()) {
cout << "File does not exist" << endl;
exit(1);
}
char c;
while(fin.get(c)) {
cout << c;
}
Read the file “testing.fasta” using a file stream object ifstream
.
Print out the 3rd and the 7th sequences from the file including both the description lines and the sequence data.
(If you want, you may assume that the greater-than >
symbol appears only at the beginning
of every description line, and not anywhere else).
The program should read the input file searching for the longest contiguous stretch of a single amino acid type (one letter repeated many times).
Example: when reading the file
>some_protein
LVHAHILPEAVKQECLDLLRQISEQSVRAAAAAAA
AAAAAALDFAAWLGLLLLQFSSGGKERLGKISKAG
>another_one
CSPEDPTMVWPCCPESSCQLVVGLPSLVNHYNCLP
NQCTDSSQCPGGFGCMTRRSKCELC
the longest stretch would be AAAAAAAAAAAAA
, that is A
appears 13 times in a row.
Observe that it can span multiple lines.
The program should print out:
A x 13: AAAAAAAAAAAAA
What to do with the program:
L x 6: LLLLLL
.
You can make your own input files for additional testing.Briefly report your findings in the comment to the program.
The task is to read the file “main.fasta”, searching for a sequence that contains the substring “CSCI” in it (the amino acids: C, S, C, I).
Since we did not learn about strings yet, you may use an array of four characters:
char query[4] = {'C', 'S', 'C', 'I'};
int length = 4;
The task is to find an amino acid sequence that contains the substring “CSCI”
and print it out. Since you have to print the entire sequence, including its
description, you may want to use the member functions
seekg
and tellg of ifstream
.
The first function lets you set the position of where you are reading from the file.
And the second lets you find what your current position in the file is (it is simply the number of bytes from the beginning of the file).
So, when used together, the functions let you remember your current position in the file, and return to it later, if it is necessary.
It works as follows:
ifstream fin;
fin.open("file");
// remember your current position in the file
int pos = fin.tellg();
// read something from the file
fin >> x;
fin >> y;
...
// return to the remembered position:
fin.seekg(pos);
Hint: So, you may remember the position of the beginning of the sequence (>
), and return to it
if you see that the sequence contains “CSCI”.
The output should look like:
>sp|A1C874|3HAO2_ASPCL 3-hydroxyanthranilate 3,4-dioxygenase 2 OS=Aspergillus clavatus (strain ATCC 1007 / CBS 513.65 / DSM 816 / NCTC 3887 / NRRL 1) GN=bna1-2 PE=3 SV=2
MNPMPLSPLFFATWLAENEDQLRPPVNNYCLYQGNDFILMAVGGPNERNDYHVNETEVCL
QPSWCSREANEQEWFYQVKGDMLLRVVENNAFRDIPIKEGEMFLLPGNTPHNPVRFKDTI
GLVMERQRPAGSRDRLRWYCTKGDHASPTIIREEVFHCSDLGTQLKPIIEQWQQDEDGRR
CAECSCIADPK