----------------------------------------------------------------------

                       Release Notes
                       STACKdb v3.1.1

----------------------------------------------------------------------

Release date: January 2003                      Based on: GenBank 125.0


INTRODUCTION
STACKdb v3.1.1 contains the same data as STACKdb v3.1 but is provided
with the much-improved stackPACK v2.2 viewing software. The updated
viewing software includes new viewing and extraction functions that
enable rapid, simplified analysis and manipulation of alignments and
alignment analyses. Data exchange with third-party programs is also
simplified, making it easier to assess highlighted areas of
potential interest.

The current release of STACKdb is based on all human EST and mRNA
sequences from GenBank release 125.0 (24 August 2001), downloaded from
NCBI on 25 August 2001. A total of 1,761,079 new EST and 87,085 new
mRNA sequences have been added to the STACKdb v3.0 data to form
270,515 clusters and 5,711 clonelinks in total. The database is
organized into 15 tissue-based categories and a disease category.

The mRNA sequences within STACKdb v3.1.1 were assigned to one or more 
tissue categories using BLAST comparisons instead of relying on mRNA 
annotation. This more comprehensive mRNA assignment ensured superior 
supervised clustering and consensus sequence accuracy. 

STACKdb v3.1.1 includes:
- Relational database tables for use with the stackPACK v2.2 viewing 
  software, included with the release.
- Non-redundant sets of linked clusters, clusters and singletons in 
  FastA format.
- Alternate consensus sequences in FastA format that represent 
  potential alternate expression forms.
- A comprehensive full-length mRNA index consisting of all mRNA 
  sequences within HTD, MGC and RefSeq as a preview to the next 
  release of STACKdb. STACKdb v4.0 will consist of a whole body 
  index with the mRNA index acting as a scaffold for the EST sequences.

mRNA Index Input Data

Source                             Downloaded   Input mRNA  Location
BCM Human Transcript Database [1]  6 June 2002      15,305  ftp://ftp.hgsc.bcm.tmc.edu/pub/data/HTDB/
Mammalian Gene Collection [2]      6 June 2002      11,754  http://mgc.nci.nih.gov
RefSeq [3] (hs.fna.gz)             6 June 2002      15,740  ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/hs.fna.gz
Total                                               42,799

[1] HTD_unique records were downloaded.
[2] Records for 3 June 2002.
[3] Records for Homo sapiens, 5 June 2002.


mRNA Index Cluster Results

                                    Total Input  Total MS  Seqs in MS       Total       Total    MS Clusters     Unlinked
                                           Seqs  Clusters    Clusters  Singletons  Clonelinks  in Clonelinks  MS Clusters
mRNA Index                               42,799    10,703      36,365       6,434       0 [1]              0            0

[1] Clonelinking was not performed on the mRNA index data as the data
    does not have 3' and 5' reads. The data was pre-assembled from
    numerous sequencing reactions, and there is only one sequence
    entry per clone.


STACKdb v3.1.1 Input Data

Tissue Category                       Total Input  Total ESTs  Total mRNAs [1]
                                        Sequences
Adipose                                    15,665       2,376           13,289
Brain                                     545,635     469,126           76,509
Cochlea                                    28,857       8,423           20,434
Connective                                221,663     148,334           73,329
Digestive                                 511,804     427,421           84,383
Disease [2]                               775,397     685,663           89,734
Eye                                       119,521      67,782           51,739
Genomic                                   364,277     279,673           84,604
Gland                                     387,421     306,850           80,571
Heart                                     131,555      74,375           57,180
Hemato-lymph                              660,211     570,381           89,830
Lung                                      342,567     263,908           78,659
Muscle                                    120,568      72,022           48,546
Olfactory                                  17,291       2,600           14,691
Other                                     362,387     285,019           77,368
Reproductive                              710,185     621,335           88,850
Total excluding Disease Duplicates      4,539,607   3,599,625          939,982
Total including Disease Duplicates      5,315,004   4,285,288        1,029,716

[1] Total mRNAs:
    - Includes both full- and partial-length mRNA sequences extracted
      from NCBI on 25 August 2001. There may thus be overlapping mRNA
      sequences.
    - BLAST comparisons assigned each mRNA sequence to one or more
      tissue categories. The same mRNA sequence may thus appear in
      more than one tissue.
[2] The disease category is a duplication of sequences that were
    annotated as disease-related.


STACKdb v3.1.1 Cluster Results

Tissue Category                     Total Input  Total MS  Seqs in MS       Total       Total    MS Clusters     Unlinked
                                      Sequences  Clusters    Clusters  Singletons  Clonelinks  in Clonelinks  MS Clusters
Adipose                                  15,665     2,313      13,261       2,404           0              0        2,313
Brain                                   545,635    26,567     454,217      91,418       2,172          4,944       21,623
Cochlea                                  28,857     4,157      25,176       3,681           1              2        4,155
Connective                              221,663    11,789     198,838      22,825          25             81       11,708
Digestive                               511,804    21,539     423,341      88,463          13             30       21,509
Disease                                 775,397    30,305     632,500     142,897         175            482       29,823
Eye                                     119,521     9,196     102,718      16,803         336            707        8,489
Genomic                                 364,277    24,901     321,914      42,363         210            500       24,401
Gland                                   387,421    19,591     314,217      73,204         114            280       19,311
Heart                                   131,555    10,794     114,414      17,141          76            169       10,625
Hemato-lymph                            660,211    32,147     557,793     102,418       1,753          3,918       28,229
Lung                                    342,567    18,239     291,479      51,088         122            340       17,899
Muscle                                  120,568     8,468     105,688      14,880          21             84        8,384
Olfactory                                17,291     2,609      14,422       2,869          11             23        2,586
Other                                   362,387    19,723     297,330      65,057         201            492       19,231
Reproductive                            710,185    28,147     596,861     113,324         481          1,301       26,846
Total excluding Disease Category      4,539,607   240,210   3,831,669     707,938       5,536         13,090      227,309
Total including Disease Category      5,315,004   270,515   4,464,169     850,835       5,711         13,576      257,132
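
In every row of the cluster results table, "Seqs in MS Clusters" plus
"Total Singletons" adds up to "Total Input Sequences". The short Python
sketch below checks this for a few rows copied from the table. The
function name unbalanced_rows is illustrative only and is not part of
stackPACK.

```python
# Rows copied from the cluster results table:
# (category, total input sequences, seqs in MS clusters, total singletons)
rows = [
    ("Adipose", 15_665, 13_261, 2_404),
    ("Brain", 545_635, 454_217, 91_418),
    ("Disease", 775_397, 632_500, 142_897),
    ("Total excluding Disease Category", 4_539_607, 3_831_669, 707_938),
    ("Total including Disease Category", 5_315_004, 4_464_169, 850_835),
]


def unbalanced_rows(table_rows):
    """Return the categories where clustered + singletons != total input."""
    return [
        name
        for name, total, clustered, singletons in table_rows
        if clustered + singletons != total
    ]


# An empty list means every listed row is internally consistent.
print(unbalanced_rows(rows))  # prints []
```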


WHAT'S NEW IN THIS RELEASE
STACKdb v3.1.1 has been produced with the stackPACK v2.1, v2.1.1 and 
v2.2 Transcript Reconstruction and Variation Analysis Management System, 
and includes several improvements and new features.

1. A comprehensive full-length mRNA index consisting of all mRNA
   sequences within HTD, MGC and RefSeq is provided as a preview to
   the next release of STACKdb. STACKdb v4.0 will consist of a whole
   body index with the mRNA index acting as a scaffold for the EST
   sequences.

2. Several enhancements have been made towards improving the quality 
   and accuracy of STACKdb v3.1.1 results:
   - BLAST comparison, rather than mRNA annotation, was used to
     assign each mRNA sequence to one or more tissue categories.
     This more comprehensive mRNA assignment ensured superior
     supervised clustering and consensus sequence accuracy. The same
     mRNA sequence may thus appear in more than one tissue, and far
     more mRNAs are included in clusters in this release of STACKdb
     than in the previous release.
   - Considerable updates to the tissue tree used to sort the
     sequences into the various STACKdb categories have significantly
     improved the accuracy of EST tissue classification and greatly
     increased the number of disease-annotated sequences. The tissue
     tree is provided with this release.
   - Improved parsing of sequence annotation has been implemented. 
     Sequence direction, for example, is now included in the sequence 
     definition line and can be observed both in the web interface and 
     in the FastA files. 

3. Simplifications and improvements have been implemented in terms of 
   analysis and manipulation of results:
   - All alternate consensus sequences for each STACKdb category are 
     provided in FastA format. These alternate consensus sequences 
     represent potential alternate expression forms.
   - Improved reporting functions have been implemented both via the 
     web interface and via the command line.

4. Duplicate mRNA sequences, both in terms of duplicate accession 
   numbers and 100% sequence identity, have been removed prior to 
   STACKdb processing. In cases of 100% sequence duplication, the 
   sequence with the most comprehensive annotation was retained. 

5. The following problems, observed in the previous release of STACKdb, 
   have been resolved:
   - Presence of empty clonelink consensus sequences.
   - Erroneous non-sequence text within the sequence data. 
   - Absence of large alignment analysis data that was not stored in 
     the database due to size constraints.
   - Ambiguity between consensus sequences from different tissues.
     STACKdb v3.1.1 clonelink and final consensus sequence accession
     numbers are unique within a tissue but not across tissues, so
     the STACKdb tissue category name has been added to the consensus
     header information to distinguish between the various tissue
     categories. The header information can be seen in the viewing
     software as well as the FastA files.
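
Item 4 above describes removing duplicate mRNA sequences by accession
number and by 100% sequence identity. A minimal Python sketch of that
idea follows. It is not the stackPACK implementation: the Record type
and the deduplicate function are hypothetical, annotation length is
assumed as a stand-in for "most comprehensive annotation", sequences
are assumed to be compared case-insensitively, and the first record
seen is assumed to win when accession numbers collide.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical container for one mRNA entry."""
    accession: str
    annotation: str
    sequence: str


def deduplicate(records):
    """Drop duplicate accessions, then collapse identical sequences."""
    # Pass 1: keep only the first record seen for each accession number
    # (assumption: later records with the same accession are redundant).
    by_accession = {}
    for rec in records:
        by_accession.setdefault(rec.accession, rec)

    # Pass 2: among records with 100% identical sequence, keep the one
    # with the longest annotation (assumed proxy for "most comprehensive").
    by_sequence = {}
    for rec in by_accession.values():
        key = rec.sequence.upper()
        kept = by_sequence.get(key)
        if kept is None or len(rec.annotation) > len(kept.annotation):
            by_sequence[key] = rec
    return list(by_sequence.values())


example = [
    Record("A1", "short", "ACGT"),
    Record("A1", "duplicate accession", "ACGT"),
    Record("B2", "a much longer annotation", "acgt"),
    Record("C3", "other", "GGCC"),
]
print([rec.accession for rec in deduplicate(example)])  # ['B2', 'C3']
```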


KNOWN PROBLEMS
STACKdb v3.1.1 has several known problems:

- STACKdb v3.1.1 was created with stackPACK v2.1.1 and was converted
  to stackPACK v2.2 format for use with the much-improved stackPACK
  v2.2 viewing and output software. The following limitations apply
  to converted projects:
   o Data cannot be output in their original unmasked format.
   o Alignments cannot be output in ACE format.
   o PHRED format input files may not be added to converted projects.

- STACKdb v3.1.1 clonelink and final consensus sequence accession
  numbers are unique within a tissue but not across tissues. The
  STACKdb tissue category name has, however, been added to the
  consensus header information to help distinguish between the
  various tissue categories. The header information can be seen in
  the viewing software as well as the FastA files.

- The existence of fragments of real genes within the vector file 
  distributed by NCBI has been reported by some users. This vector 
  file is a component of the STACKdb masking database and may lead to 
  the masking of some real gene fragments during STACKdb processing.

- STACKdb v3.1.1 contains some superclusters primarily due to the 
  absence of low complexity regions from the STACKdb masking database. 
  This has been addressed in the next release of STACKdb.

- The linking algorithm uses all contig consensus sequences, rather
  than only those contig consensus sequences that share the clone ID,
  to create linked clusters. This may result in the production of
  'super linked clusters'. The algorithm maximizes final consensus
  sequence length by connecting clusters together by virtue of clone
  ID.

- Clones from the IMAGE consortium are represented in the form
  'CLONE:512142' within STACKdb v3.1.1 and do not have the word
  'IMAGE' in the sequence annotation.

- STACKdb processing is performed using the annotations in GenBank.
  GenBank is known to contain annotation errors, and we cannot
  account for annotation errors within STACKdb that result from
  them.

- Some clusters do not have contigs due to database constraints. This
  has been rectified in the next release of STACKdb.

- Browser limitations may limit the size or length of cluster or
  clonelink that can be viewed in WebProbe.

- The hierarchical navigational icons representing the various cluster
  consensus and alignment views within WebProbe may become misaligned
  when using certain font settings on Netscape under Linux. This can 
  be rectified by setting the Netscape variable width font to 14 in  
  Edit: Preferences: Font.   
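
Because clonelink and final consensus accession numbers are unique only
within a tissue category (see the known problem listed above), scripts
that combine several categories may want to key each consensus sequence
on the tissue name together with the accession number. The Python
sketch below illustrates this under stated assumptions: the function
name and the accession value 'cl0001' are hypothetical, and the input
is assumed to be already parsed into (tissue, accession, sequence)
tuples, since the exact header layout is not described here.

```python
def index_by_tissue_and_accession(entries):
    """Build a lookup keyed on (tissue, accession).

    entries: iterable of (tissue, accession, sequence) tuples.
    """
    index = {}
    for tissue, accession, sequence in entries:
        key = (tissue, accession)
        if key in index:
            # Within one tissue the accession is expected to be unique.
            raise ValueError(f"duplicate consensus key: {key}")
        index[key] = sequence
    return index


# The same hypothetical accession appears in two tissue categories.
entries = [
    ("Brain", "cl0001", "ACGTACGT"),
    ("Heart", "cl0001", "TTGGCCAA"),
]
index = index_by_tissue_and_accession(entries)
print(len(index))  # 2: the two records do not overwrite each other
```

Keying on the accession alone would silently keep only one of the two
records in this example.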



CONTACT INFORMATION
We value your comments and feedback. Please get in touch with Electric 
Genetics if you have any queries:

phone	+27 21 9593964
fax	+27 21 9592512
e-mail	support@egenetics.com