Tallapoosa Darter Genome

This page describes the details of Tallapoosa darter genome sequencing and assembly.

Sequencing and Coverage

DNA was extracted from a single Tallapoosa darter utilizing the Qiagen DNeasy Blood and Tissue kit. This DNA was then prepared for sequencing and sequenced with an Illumina MiSeq by the Georgia Genomics Facility at the University of Georgia. The DNA was sheared to an average size of about 700 bp and libraries were prepared with the TruSeq sample preparation kit. Two 250PE runs were performed on a MiSeq instrument. Two barcoded libraries were sequenced in the first run (this was necessary because this run also contained a small percentage of barcoded amplicons for a different study). The second run just sequenced a single genomic DNA library.

From these runs the following sets of sequence data were obtained:

Library 1: 14,119,092 individual paired-end reads
Library 2:   8, 897,578 individual paired-end reads
Library 3: 29,312,578 individual paired-end reads

Total: 52,929,248 individual paired-end reads of up to 250 nucleotides each.

These reads add up to about 13 billion nucleotides of DNA sequence. Since percid fishes have a genome size of about 1 billion nucleotides (animal genome size database), this would indicate an expected coverage of at least 12 fold.

To check the actual coverage, the paired end sequences were aligned onto three previously characterized Tallapoosa darter genomic regions (description) (related publication - PDF). These assemblies are shown in Figures 1, 2 and 3 below. Note that, on average, at least a 12 fold coverage is observed although coverage varies from as low as 3 fold to 28 fold. It is likely, of course, that some small portions of the genome may lack coverage.


Figure 1. The black line in the yellow background is the 2,038 nucleotide long Lysophospholipase II (LPLII) genomic fragment that acts as the reference for the assembly. The blue and orange bars respectively represent the forward and reverse 250 nucleotide paired-end sequences that align to the reference sequence. The thin blue lines indicate the portion of the DNA fragments not sequenced in the reactions. The green graph shows the coverage obtained for each nucleotide along the reference sequence. Gaps within black, blue and orange bars indicate the presence of indel variation.


Figure 2. The black line in the yellow background is the 2,853 nucleotide long Phospholipase C-gamma-1 (PLC) genomic fragment that acts as the reference for the assembly. All other items as described in Figure 1.


Figure 3. The black line in the yellow background is the 3,182 nucleotide long Ribosomal Protein S6 (RPS6) genomic fragment that acts as the reference for the assemblyAll other items as described in Figure 1.


Assembly

For an initial assembly of the sequences into longer contigs, the three libraries were combined into one fasta file. This file was used as the input for the Minia short read assembler software. The only parameters that can be varied are the k-mer length and minimum abundance. The sequences were assembled for most k-mer lengths between 31 and 75 at minimum abundance settings of 2 and 3. The longest contigs and the most nucleotides assembled into contigs were obtained at k-mer 61 with minimum abundance setting of 3 and at k-mer 73 with minimum overlap abundance of 2. The following are the details of these assemblies:

K-mer =  61 and minimum abundance = 3 assembly data (contigs 61-3):
Total number of contigs = 595905
Sum (bp) = 649989172
Total number of N's = 0
Sum (bp) no N's = 649989172
Max contig size = 33537
Min contig size = 186
Average contig size = 1090
N50 = 2008

K-mer =  73 and minimum abundance = 2 assembly data (contigs 73-2):
Total number of contigs = 539616
Sum (bp) = 660984269
Total number of N's = 0
Sum (bp) no N's = 660984269
Max contig size = 35431
Min contig size = 222
Average contig size = 1224
N50 = 2197

The contigs from each of these assemblies were further assembled into scaffolds with SSPACE software using the alignment of paired reads from the three sequence libraries to the ends of the various contigs and joining of the contigs aligned to matched paired reads. The following are the details of the scaffolding results:

Scaffolding of contigs 61-3:
Total number of scaffolds = 515914
Sum (bp) = 648873426
Total number of N's = 165069
Sum (bp) no N's = 648708357
Max scaffold size = 57700
Min scaffold size = 186
Average scaffold size = 1257
N50 = 2735

Scaffolding of contigs 73-2:
Total number of scaffolds = 470492
Sum (bp) = 660664090
Total number of N's = 130317
Sum (bp) no N's = 660533773
Max scaffold size = 57949
Min scaffold size = 222
Average scaffold size = 1404
N50 = 2913

While the scaffolding did increase the lengths of the initial contigs, the increases were not dramatic. Indeed, the average scaffold sizes are not that much greater than the average contig sizes in the various assemblies. However, this is partly due to the presence of a very large number of very small contigs. As shown in the tables below, comparison of the distribution of larger contigs and scaffolds does show that scaffolding did significantly increase the number of longer scaffolds relative to the number of longer contigs.

Size range of contigs/scaffolds number of 61-3 contigs number of 61-3 scaffolds
 >20,000  23  211
 >10,000<20,000  1,224  3,544
 >5,000<10,000  13,108  19,089
 >2,000<5,000 75,185  70,438


Size range of contigs/scaffolds number of 73-2 contigs number of 73-2 scaffolds
 >20,000  53  339
 >10,000<20,000  1,747 4,216
 >5,000<10,000  15,189  20,067
 >2,000<5,000 76,620  70,557



Assessment of Assembly Quality

An initial assessment to determine if the sequence reads were correctly assembled was performed by finding scaffolds which shared sequence similarity with the LPLII, PLC and RPS6 Tallapoosa darter genomic sequences described above. A ViroBLAST server was set up and databases were created from each of the scaffold assemblies. Scaffolds that contain sequences homologous to each of the three Tallapoosa darter reference sequences were found with blastn searches. These scaffolds were then paired with the reference sequences. As shown in the Figures 4, 5 and 6 below. The scaffolds align perfectly with the reference sequences indicating that correct assembly was achieved. Examples shown are from scaffolds based on contig 73-2 assemblies only. Similar results (not shown) are obtained from alignments of these reference sequences to scaffolds based on the other contig assemblies, though the scaffold length, number of homologous scaffolds and relative position of homology to the reference sequences varies.


Figure 4. Alignment of LPLII reference sequence to scaffold showing maximum homology. In this case the reference sequence is within a single 7,121 nucleotide long scaffold. While sequences are nearly identical along the length of homology there are a few variations since each sequence is obtained from a different individual. Gaps within the black bars indicate the presence of SNP and indel variation.
 

Figure 5.  Alignment of PLC reference sequence to scaffolds showing maximum homology. In this case the reference sequence spans two overlapping scaffolds. The two ~3,000 nucleotide long scaffolds define a genomic DNA region 7,121 nucleotide long. While sequences are nearly identical  along the length of homology there are a few variations since each sequence is obtained from a different individual. Gaps within the black bars indicate the presence of SNP and indel variation.


Figure 6. Alignment of RPS6 reference sequence to scaffold showing maximum homology. In this case the reference sequence is within a single 11,690 nucleotide long scaffold. While sequences are nearly identical along the length of homology there are a few variations since each sequence is obtained from a different individual. Gaps within the black bars indicate the presence of SNP and indel variation.


Utility of Scaffolds for Genome Annotation

In general, the lengths of the scaffolds are relatively short. While a subsequent phase of this genomic sequencing effort will address the issue of scaffold length, can this current version of the Tallapoosa darter genome assembly be utilized to begin an annotation of the genome? Obviously, those scaffolds that are above 5,000 nucleotides in length likely contain a gene or a significant part of a gene. These scaffolds can certainly be annotated by blastn and tblastn searches against various databases. 

Alternately, mRNA or protein sequences of interest can be used in blastn and tblastn searches of the Tallapoosa darter scaffold databases and those scaffolds showing areas of homology can then be annotated. Examples of this approach are shown in Figures 7 and 8 below. In the first example, 
Perca flavescens urate oxidase mRNA was used to search the Tallapoosa darter scaffold database with blastn (also the urate oxidase protein sequence was used with tblastn) to identify scaffolds that contain areas of homology. Since Perca flavescens is in the same family as the darters, the sequences are highly homologous. These searches identified a single 18,542 nucleotide long scaffold that contains the entire urate oxidase gene. The alignment of the urate oxidase mRNA to the scaffold indicates the relative location of this gene as well as it's exon/intron structure (Figure 7).


Figure 7. Alignment of the Perca flavescens urate oxidase mRNA to a Tallapoosa darter genomic DNA scaffold that contains the urate oxidase gene. Vertical lines indicate exons and thin horizontal connecting lines indicate introns.

Not all genes are short and relatively compact. Some genes can span long stretches of genomic DNA and thus would not be found within a single scaffold. This is illustrated by the neprilysin gene (NEP1). Perca flavescens NEP1 mRNA and protein sequences were used to search the Tallapoosa darter scaffold database with blastn and tbalstn, respectively. A number of scaffolds were identified that shared a high degree of homology to different portions of the NEP1 sequences. These scaffolds were then concatenated in the order corresponding to NEP1 homology. The final assembly of the Tallapoosa darter NEP1 gene region and the positions of the relevant scaffolds is shown in Figure 8 below. An alignment of the final 23,781 nucleotide scaffold assembly and the Perca flavescens NEP1 mRNA is shown in Figure 9 below. This example demonstrates that it is possible to utilize mRNA or protein sequences from closely related species to further assemble even the shorter scaffolds and produce annotation of entire genes.  Note that while the scaffolds shown in this example contain all of the mRNA corresponding sequences, the scaffolds do not overlap and thus is is possible that additional intronic sequences are missing from the final assembly at the scaffold junctions.


Figure 8. Assembly of scaffolds into a longer contiguous sequence based on the order of homology of each scaffold to a portion of the Perca flavescens NEP1 mRNA. 


Figure 9. Alignment of Perca flavescens NEP1 mRNA to the Tallapoosa darter NEP1 gene as assembled from scaffolds as shown in Figure 8. Vertical lines indicate exons and thin horizontal connecting lines indicate introns.


Conclusions

While the current assembly of the Tallapoosa darter genome based on a 12 fold coverage of PE250 reads produced scaffolds that are relatively short, it appears that the assembly is of sufficiently high quality to facilitate the start of darter genome annotation. As improvements to the assembly are made over time, the annotations can then be incorporated into the more complete darter genome.