ViroBLAST Searches

Searching for specific sequences within the Tallapoosa darter genome assembly.

The following two examples show how to identify scaffolds that contain specific protein-coding sequences.

Example 1: Gene for protein contained within one scaffold.

In this example, the scaffold that contains the gene for urate oxidase is identified. To start, the amino acid sequence of the protein of interest is obtained.


The Tallapoosa darter genome assembly is searched using a ViroBLAST server.


The protein sequence is pasted into the text box and the scaffold database is searched with the tblastn program. Two settings must be chosen before searching: tblastn has to be selected from the BLAST program drop-down menu, and the scaffold73-2 assembly has to be selected as the database. The Basic search is usually sufficient.


The database search can take a few minutes, and a visual progress bar is displayed while it runs. The blue progress squares appear slowly, so be patient.


In this example, only one scaffold showed significant homology, as indicated in the table at the bottom of the page shown below. To see the actual alignment, the "Inspect BLAST output" link is clicked.


The resulting page shows the actual alignment between the urate oxidase protein sequence (Query) and the protein sequence potentially encoded within the scaffold. In this case, the protein appears to be encoded within scaffold 530, which is 18,542 nucleotides long. The portions of the scaffold that encode the protein sequence are labeled Sbjct, and the circled numbers give the nucleotide positions of the corresponding scaffold sequence.
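The Query and Sbjct coordinates can also be collected programmatically. The following is a minimal sketch, assuming the hits are available in BLAST's default tabular format (for example from a local BLAST+ run with -outfmt 6; whether the ViroBLAST server exports this format is not covered here). The sample rows, the query name urate_oxidase, and all coordinates are hypothetical and only illustrate the layout.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    scaffold: str
    q_start: int  # first aligned amino acid in the query protein
    q_end: int
    s_start: int  # nucleotide positions on the scaffold (Sbjct)
    s_end: int


def parse_tabular(text):
    """Parse BLAST tabular lines (default -outfmt 6 columns) into Hsp records.

    Default column order: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    hsps = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hsps.append(Hsp(f[1], int(f[6]), int(f[7]), int(f[8]), int(f[9])))
    return hsps


def scaffold_span(hsps, scaffold):
    """Return the lowest and highest scaffold position covered by any hit."""
    coords = [c for h in hsps if h.scaffold == scaffold
              for c in (h.s_start, h.s_end)]
    return min(coords), max(coords)


# Hypothetical rows for illustration only; not taken from a real search.
sample = (
    "urate_oxidase\tscaffold_530\t78.5\t65\t14\t0\t1\t65\t10120\t10314\t2e-30\t110\n"
    "urate_oxidase\tscaffold_530\t81.0\t58\t11\t0\t66\t123\t11002\t11175\t5e-28\t101\n"
    "urate_oxidase\tscaffold_530\t75.0\t60\t15\t0\t124\t183\t12711\t12890\t1e-25\t95\n"
)
hsps = parse_tabular(sample)
span = scaffold_span(hsps, "scaffold_530")
```

Taking the minimum and maximum over both s_start and s_end keeps the span correct for hits on the reverse strand, where tblastn reports the start position as larger than the end position.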


With this information, the gene for urate oxidase can be annotated within WebApollo. Note that scaffold 530 can be located by typing the name of the scaffold in the search box labeled "Filter:".


The fgenesh program has predicted three potential genes in scaffold 530. Since the portion of the BLAST output shown above shows the urate oxidase protein aligning with the scaffold DNA sequence in the 10,000 to 13,000 nucleotide range, the urate oxidase gene is likely represented by the middle gene model.
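The reasoning above amounts to an interval overlap test: the gene model that shares the most nucleotides with the BLAST hit range is the likely candidate. A minimal sketch follows; the three gene model names and their coordinates are hypothetical placeholders, since the actual fgenesh coordinates are only visible in WebApollo.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive, 1-based intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def best_model(models, hit_start, hit_end):
    """Return the (name, start, end) model with the largest overlap."""
    return max(models, key=lambda m: overlap(m[1], m[2], hit_start, hit_end))


# Hypothetical gene models on an 18,542-nucleotide scaffold.
models = [
    ("gene_1", 1500, 6200),
    ("gene_2", 9800, 13400),
    ("gene_3", 14900, 18100),
]

# Approximate hit range taken from the text: 10,000 to 13,000.
chosen = best_model(models, 10000, 13000)
```

With these placeholder coordinates only the middle model overlaps the hit range, which mirrors the conclusion drawn from the alignment.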


If desired, the scaffold DNA sequence can be retrieved from ViroBLAST by checking the box to the right of the scaffold ID and clicking the "Submit" button. Note that if only one scaffold is displayed in the table, it may be necessary to click the box indicated by the arrow.


The scaffold DNA sequence is either displayed as text on the next web page or can be downloaded as a FASTA-formatted file by clicking the "Download" button.
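Once downloaded, the FASTA file can be read with a few lines of code. The sketch below uses only the Python standard library and accepts any iterable of lines, such as an open file handle. The file name scaffold_530.fasta in the usage note and the toy sequence are assumptions for illustration.

```python
def parse_fasta(lines):
    """Return a dict mapping each record ID to its sequence.

    The ID is the first whitespace-delimited word after '>'.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Toy record for illustration; a real scaffold is far longer.
example = parse_fasta([">scaffold_530 example header", "ACGTAC", "GGTTNN", ""])
```

For a downloaded file, the same function can be called as `with open("scaffold_530.fasta") as handle: records = parse_fasta(handle)`, where the file name is whatever was chosen when saving the download.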



Example 2: Gene for protein spans multiple scaffolds.

In this example, the scaffolds that contain the gene for growth hormone are identified. To start, the amino acid sequence of the protein of interest is obtained.


The scaffold database is searched with tblastn as above.


In this case, a large number of short scaffolds show significant homology to different parts of the protein query sequence. This indicates that the gene for this protein spans multiple scaffolds. Alignments to segments of three scaffolds are shown below. Note that the Query amino acid numbers in the alignments give the order of the scaffolds along the protein, so the scaffolds can be concatenated in that order to make a longer scaffold that may contain the entire gene or a larger part of it. (This is only possible for single-copy genes, where the genome does not contain any closely related paralogs.) For the alignments shown below, the following concatenation is made: scaffold_401014-scaffold_383171-scaffold_187258.

Note that such concatenation may contain all adjacent exons but may be missing portions of introns.
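The ordering step can be sketched in code: sort the scaffolds by the lowest Query position they align to, then join their sequences. The sketch below is not the procedure used for the figures; the query positions and toy sequences are hypothetical. It also assumes that a scaffold whose hit has a Sbjct start larger than its end lies on the reverse strand and should be reverse-complemented before joining, a detail the text above does not address. The optional spacer argument is likewise an assumption.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def concatenate_by_query_order(hits, sequences, spacer=""):
    """Join scaffolds in the order their hits appear along the query.

    hits: iterable of (scaffold_id, q_start, s_start, s_end)
    sequences: dict of scaffold_id -> DNA string
    Returns (ordered scaffold IDs, concatenated sequence).
    """
    first = {}
    for scaffold, q_start, s_start, s_end in hits:
        # Keep the hit with the lowest query start for each scaffold.
        if scaffold not in first or q_start < first[scaffold][0]:
            first[scaffold] = (q_start, s_start > s_end)
    order = sorted(first, key=lambda s: first[s][0])
    parts = [
        reverse_complement(sequences[s]) if first[s][1] else sequences[s]
        for s in order
    ]
    return order, spacer.join(parts)


# Scaffold IDs from the text; positions and sequences are made up.
hits = [
    ("scaffold_383171", 60, 10, 200),
    ("scaffold_187258", 130, 300, 50),   # start > end: reverse strand
    ("scaffold_401014", 5, 20, 180),
]
sequences = {
    "scaffold_401014": "AAAC",
    "scaffold_383171": "GGGT",
    "scaffold_187258": "ACCT",
}
order, merged = concatenate_by_query_order(hits, sequences)
```

Passing a spacer such as a run of N characters would mark each join explicitly, which may make the junctions easier to find later in WebApollo.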


The concatenated scaffold was run through the fgenesh program and then imported into WebApollo. The final gene model in the yellow User-created Annotations area is the product of an fgenesh+ alignment of the concatenated scaffold and the Perca flavescens growth hormone protein. In this example, the concatenation did not reassemble the entire gene. A short segment of the 5' end is missing, as is a portion of the 3' end.


This concatenated scaffold also illustrates a problem with the scaffolding of adjacent contigs. Note that the 5' end of the second exon is incomplete. As shown in the red oval below, the NNNN sequence represents a scaffolding junction where nucleotides are potentially missing. The actual splice junction lies within this region, and the codons for the first two amino acids of this exon are missing from the assembly.
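Scaffolding junctions of this kind can be located by scanning the sequence for runs of N. The following is a minimal sketch using Python's re module; the function name find_gaps and the toy sequence are illustrative only.

```python
import re


def find_gaps(seq, min_len=1):
    """Return 1-based inclusive (start, end) positions of runs of N."""
    gaps = []
    for match in re.finditer(r"[Nn]+", seq):
        if match.end() - match.start() >= min_len:
            gaps.append((match.start() + 1, match.end()))
    return gaps


# Toy sequence for illustration.
gaps = find_gaps("ACGTNNNNACGTNAC")
long_gaps = find_gaps("ACGTNNNNACGTNAC", min_len=4)
```

Comparing the reported positions with the exon boundaries of a gene model shows whether a junction falls inside or next to an exon, as it does for the second exon here.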


While the above graphic illustrates a case where an exon/intron junction is affected by scaffolding, in many other concatenations of this kind that have been carried out, all of the intron/exon junctions are intact.