----------------------------------------------------------------------
Release Notes
STACKdb v3.1.1
----------------------------------------------------------------------
Release date: January 2003
Based on: GenBank 125.0
INTRODUCTION
STACKdb v3.1.1 contains the same data as STACKdb v3.1 but is provided
with the much improved stackPACK v2.2 viewing software. The updated
software adds viewing and extraction functions that make it quicker
and simpler to analyze and manipulate alignments and alignment
analyses. Data exchange with third-party programs is also simplified,
making it easier to assess highlighted areas of potential interest.
The current release of STACKdb is based on all human EST and mRNA
sequences from GenBank 125.0 (24 August 2001), downloaded from NCBI on
25 August 2001. A total of 1,761,079 new EST and 87,085 new mRNA
sequences have been added to the STACKdb v3.0 data to form 270,515
clusters and 5,711 clonelinks in total. The database is organized into
15 tissue-based categories and a disease category.
The mRNA sequences within STACKdb v3.1.1 were assigned to one or more
tissue categories using BLAST comparisons instead of relying on mRNA
annotation. This more comprehensive mRNA assignment ensured superior
supervised clustering and consensus sequence accuracy.
STACKdb v3.1.1 includes:
- Relational database tables for use with the stackPACK v2.2 viewing
software, included with the release.
- Non-redundant sets of linked clusters, clusters and singletons in
FastA format.
- Alternate consensus sequences in FastA format that represent
potential alternate expression forms.
- A comprehensive full-length mRNA index consisting of all mRNA
sequences within HTDb, MGC and RefSeq, as a preview of the next
release of STACKdb. STACKdb v4.0 will consist of a whole body
index with the mRNA index acting as a scaffold for the EST sequences.
mRNA Index Input Data
| Source | Location | Downloaded | Input mRNA |
| BCM Human Transcript Database | ftp://ftp.hgsc.bcm.tmc.edu/pub/data/HTDB/ | 6 June 2002 | 15,305 |
| Mammalian Gene Collection | http://mgc.nci.nih.gov | 6 June 2002 | 11,754 |
| RefSeq (hs.fna.gz) | ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/hs.fna.gz | 6 June 2002 | 15,740 |
| Total | | | 42,799 |
mRNA Index Cluster Results
| mRNA Index | Total Input Seqs | Total MS Clusters | Seqs in MS Clusters | Total Singletons | Total Clonelinks | MS Clusters in Clonelinks | Unlinked MS Clusters |
| mRNA Index | 42,799 | 10,703 | 36,365 | 6,434 | 0 | 0 | 0 |
STACKdb v3.1.1 Input Data
| Tissue Category | Total Input Sequences | Total ESTs | Total mRNAs |
| Adipose | 15,665 | 2,376 | 13,289 |
| Brain | 545,635 | 469,126 | 76,509 |
| Cochlea | 28,857 | 8,423 | 20,434 |
| Connective | 221,663 | 148,334 | 73,329 |
| Digestive | 511,804 | 427,421 | 84,383 |
| Disease | 775,397 | 685,663 | 89,734 |
| Eye | 119,521 | 67,782 | 51,739 |
| Genomic | 364,277 | 279,673 | 84,604 |
| Gland | 387,421 | 306,850 | 80,571 |
| Heart | 131,555 | 74,375 | 57,180 |
| Hemato-lymph | 660,211 | 570,381 | 89,830 |
| Lung | 342,567 | 263,908 | 78,659 |
| Muscle | 120,568 | 72,022 | 48,546 |
| Olfactory | 17,291 | 2,600 | 14,691 |
| Other | 362,387 | 285,019 | 77,368 |
| Reproductive | 710,185 | 621,335 | 88,850 |
| Total excluding Disease Duplicates | 4,539,607 | 3,599,625 | 939,982 |
| Total including Disease Duplicates | 5,315,004 | 4,285,288 | 1,029,716 |
STACKdb v3.1.1 Cluster Results
| Tissue Category | Total Input Sequences | Total MS Clusters | Seqs in MS Clusters | Total Singletons | Total Clonelinks | MS Clusters in Clonelinks | Unlinked MS Clusters |
| Adipose | 15,665 | 2,313 | 13,261 | 2,404 | 0 | 0 | 2,313 |
| Brain | 545,635 | 26,567 | 454,217 | 91,418 | 2,172 | 4,944 | 21,623 |
| Cochlea | 28,857 | 4,157 | 25,176 | 3,681 | 1 | 2 | 4,155 |
| Connective | 221,663 | 11,789 | 198,838 | 22,825 | 25 | 81 | 11,708 |
| Digestive | 511,804 | 21,539 | 423,341 | 88,463 | 13 | 30 | 21,509 |
| Disease | 775,397 | 30,305 | 632,500 | 142,897 | 175 | 482 | 29,823 |
| Eye | 119,521 | 9,196 | 102,718 | 16,803 | 336 | 707 | 8,489 |
| Genomic | 364,277 | 24,901 | 321,914 | 42,363 | 210 | 500 | 24,401 |
| Gland | 387,421 | 19,591 | 314,217 | 73,204 | 114 | 280 | 19,311 |
| Heart | 131,555 | 10,794 | 114,414 | 17,141 | 76 | 169 | 10,625 |
| Hemato-lymph | 660,211 | 32,147 | 557,793 | 102,418 | 1,753 | 3,918 | 28,229 |
| Lung | 342,567 | 18,239 | 291,479 | 51,088 | 122 | 340 | 17,899 |
| Muscle | 120,568 | 8,468 | 105,688 | 14,880 | 21 | 84 | 8,384 |
| Olfactory | 17,291 | 2,609 | 14,422 | 2,869 | 11 | 23 | 2,586 |
| Other | 362,387 | 19,723 | 297,330 | 65,057 | 201 | 492 | 19,231 |
| Reproductive | 710,185 | 28,147 | 596,861 | 113,324 | 481 | 1,301 | 26,846 |
| Total excluding Disease Category | 4,539,607 | 240,210 | 3,831,669 | 707,938 | 5,536 | 13,090 | 227,309 |
| Total including Disease Category | 5,315,004 | 270,515 | 4,464,169 | 850,835 | 5,711 | 13,576 | 257,132 |
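The per-tissue rows of the cluster results table appear to follow two
simple relationships: sequences in MS clusters plus singletons equals
the total input, and MS clusters minus MS clusters in clonelinks
equals the unlinked MS clusters. These relationships are inferred
from the figures, not stated by the release. The Python sketch below
checks them for three rows copied from the table; check_row is a
hypothetical helper, not part of stackPACK.

```python
# Rows copied from the cluster results table above:
# (category, total input, MS clusters, seqs in MS clusters,
#  singletons, clonelinks, MS clusters in clonelinks, unlinked MS clusters)
ROWS = [
    ("Adipose", 15665, 2313, 13261, 2404, 0, 0, 2313),
    ("Brain", 545635, 26567, 454217, 91418, 2172, 4944, 21623),
    ("Cochlea", 28857, 4157, 25176, 3681, 1, 2, 4155),
]


def check_row(row):
    """Return a list of problems found in one table row (empty if none)."""
    name, total, ms, seqs_in_ms, singletons, _links, ms_linked, unlinked = row
    problems = []
    # Assumed relationship 1: every input sequence is clustered or a singleton.
    if seqs_in_ms + singletons != total:
        problems.append(f"{name}: seqs in MS clusters + singletons != total input")
    # Assumed relationship 2: an MS cluster is either in a clonelink or unlinked.
    if ms - ms_linked != unlinked:
        problems.append(f"{name}: MS clusters - linked != unlinked")
    return problems


all_problems = [p for row in ROWS for p in check_row(row)]
print(all_problems or "all rows consistent")
```

The same check can be applied to the remaining tissue rows by adding
them to ROWS.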
WHAT'S NEW IN THIS RELEASE:
STACKdb v3.1.1 has been produced with the stackPACK v2.1, v2.1.1 and
v2.2 Transcript Reconstruction and Variation Analysis Management System,
and includes several improvements and new features.
1. A comprehensive full-length mRNA index consisting of all mRNA
sequences within MGC, RefSeq and HTDb is provided as a preview to
the next release of STACKdb. STACKdb v4.0 will consist of a whole
body index with the mRNA index acting as a scaffold for the EST
sequences.
2. Several enhancements have been made towards improving the quality
and accuracy of STACKdb v3.1.1 results:
- BLAST comparison, rather than mRNA annotation, was used to assign
each mRNA sequence to one or more tissue categories. This more
comprehensive mRNA assignment ensured superior supervised
clustering and consensus sequence accuracy. The same mRNA
sequence may thus appear in more than one tissue, and this
release of STACKdb includes far more mRNAs in clusters than the
previous release.
- Considerable updates to the tissue tree used to sort sequences
into the various STACKdb categories have significantly improved
the accuracy of EST tissue classification and greatly increased
the number of disease-annotated sequences. The tissue tree is
provided with this release.
- Improved parsing of sequence annotation has been implemented.
Sequence direction, for example, is now included in the sequence
definition line and can be observed both in the web interface and
in the FastA files.
3. The analysis and manipulation of results have been simplified and
improved:
- All alternate consensus sequences for each STACKdb category are
provided in FastA format. These alternate consensus sequences
represent potential alternate expression forms.
- Improved reporting functions have been implemented both via the
web interface and via the command line.
4. Duplicate mRNA sequences, meaning those with duplicate accession
numbers or 100% sequence identity, were removed before STACKdb
processing. In cases of 100% sequence duplication, the sequence
with the most comprehensive annotation was retained.
5. The following problems, observed in the previous release of STACKdb,
have been resolved:
- Presence of empty clonelink consensus sequences.
- Erroneous non-sequence text within the sequence data.
- Absence of large alignment analysis data that was not stored in
the database due to size constraints.
- Ambiguity between consensus sequences from different tissue
categories. STACKdb v3.1.1 clonelink and final consensus sequence
accession numbers are unique within a tissue but not across
tissues, so the STACKdb tissue category name has been added to
the consensus header information. The header information can be
seen in the viewing software as well as the FastA files.
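The duplicate removal described in item 4 can be sketched in Python
as follows. This is an illustration only, not the stackPACK
implementation. It assumes that 100% sequence identity means an exact
string match after case normalization, and that annotation length is
a usable stand-in for "most comprehensive annotation". The Record
type and the accession values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    accession: str
    annotation: str
    sequence: str


def deduplicate(records):
    """Drop repeated accessions, then collapse identical sequences."""
    # Pass 1: keep the first record seen for each accession number.
    by_accession = {}
    for rec in records:
        by_accession.setdefault(rec.accession, rec)

    # Pass 2: among identical sequences, keep the record with the longest
    # annotation (an assumed proxy for "most comprehensive annotation").
    by_sequence = {}
    for rec in by_accession.values():
        key = rec.sequence.upper()
        best = by_sequence.get(key)
        if best is None or len(rec.annotation) > len(best.annotation):
            by_sequence[key] = rec
    return list(by_sequence.values())


records = [
    Record("A1", "short", "acgt"),
    Record("A1", "repeated accession", "acgt"),
    Record("B2", "much longer annotation", "ACGT"),
    Record("C3", "unrelated", "GGCC"),
]
kept = deduplicate(records)
print([rec.accession for rec in kept])
```

Here the second A1 record is dropped as a repeated accession, and B2
replaces A1 because the two sequences are identical and B2 carries
the longer annotation.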
KNOWN PROBLEMS
Several known problems for STACKdb v3.1.1 exist:
- STACKdb v3.1.1 was created with stackPACK v2.1.1 and was converted
to stackPACK v2.2 format for use with the much improved stackPACK
v2.2 viewing and output software. The following limitations apply
to converted projects:
o Data cannot be output in their original unmasked format.
o Alignments cannot be output in ACE format.
o PHRED format input files may not be added to converted projects.
- STACKdb v3.1.1 clonelink and final consensus sequence accession numbers
are unique within a tissue but not across tissues. The STACKdb tissue
category name has, however, been added to the consensus header
information to help distinguish between the various tissue
categories. The header information can be seen in the viewing
software as well as the FastA files.
- The existence of fragments of real genes within the vector file
distributed by NCBI has been reported by some users. This vector
file is a component of the STACKdb masking database and may lead to
the masking of some real gene fragments during STACKdb processing.
- STACKdb v3.1.1 contains some superclusters primarily due to the
absence of low complexity regions from the STACKdb masking database.
This has been addressed in the next release of STACKdb.
- To create linked clusters, the linking algorithm uses all contig
consensus sequences rather than only those contig consensus sequences
with the shared clone ID, which may produce 'super linked clusters'.
The algorithm maximizes final consensus sequence length by connecting
clusters through shared clone IDs.
- Clones from the IMAGE consortium are represented in the form
'CLONE:512142' within STACKdb v3.1.1 and do not have the word 'IMAGE'
in the sequence annotation.
- STACKdb processing relies on the annotations in GenBank. Annotation
errors are known to occur within GenBank, and we cannot account for
errors within STACKdb that result from them.
- Some clusters do not have contigs due to database constraints. This
has been rectified in the next release of STACKdb.
- Browser limitations may restrict the size or length of the clusters
or clonelinks that can be viewed in WebProbe.
- The hierarchical navigational icons representing the various cluster
consensus and alignment views within WebProbe may become misaligned
when using certain font settings on Netscape under Linux. This can
be rectified by setting the Netscape variable width font to 14 in
Edit: Preferences: Font.
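Because IMAGE clones carry no 'IMAGE' keyword in STACKdb v3.1.1,
scripts that search annotations for that word will miss them. The
Python sketch below matches the 'CLONE:512142' form quoted above
instead. It assumes clone IDs are purely numeric and appear as a
'CLONE:<digits>' token; other formats may exist in the data. The
sequence names and annotation strings are hypothetical.

```python
import re

# Matches the 'CLONE:<digits>' token in the form quoted in these notes.
CLONE_RE = re.compile(r"\bCLONE:(\d+)\b")


def clone_ids(annotation):
    """Return all clone IDs found in one annotation string."""
    return CLONE_RE.findall(annotation)


def group_by_clone(annotations):
    """Map each clone ID to the names of the sequences that carry it."""
    groups = {}
    for name, text in annotations.items():
        for cid in clone_ids(text):
            groups.setdefault(cid, []).append(name)
    return groups


# Hypothetical annotation strings for illustration only.
annotations = {
    "est1": "CLONE:512142 5' read",
    "est2": "3' read CLONE:512142",
    "est3": "no clone token here",
}
groups = group_by_clone(annotations)
print(groups)
```

Grouping by the numeric ID alone treats reads from the same clone as
related, whatever else the annotation contains.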
CONTACT INFORMATION
We value your comments and feedback. Please get in touch with Electric
Genetics if you have any queries:
phone +27 21 9593964
fax +27 21 9592512
e-mail support@egenetics.com