----------------------------------------------------------------------

                        Generation Protocol 
                           STACKdb v3.1.1

----------------------------------------------------------------------

Release Date: January 2003


STACKdb v3.1.1 was created using the stackPACK v2.1, v2.1.1 and v2.2 
Transcript Reconstruction and Variation Analysis Management System. 
Processing was primarily carried out on a Silicon Graphics Origin 2000 
platform with 12 processors. Compaq and Linux platforms were also 
utilized for processing.


1. DATA ACQUISITION
   All gbest files and human mRNA sequences from the GenBank release 
   125.0, 24 August 2001, were downloaded on 25 August 2001 for STACKdb 
   production:

   - The gbest files were downloaded from: 
     ftp://ftp.ncbi.nlm.nih.gov/genbank/
   - The mRNA sequences were downloaded from: 
     http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
     Based on recommendations from NCBI, a list of GI accession numbers 
     of all human mRNA sequences was downloaded by submitting an Entrez 
     batch search query as follows: 

     ((human[Organism] AND biomol mrna[Properties]) NOT gbdiv est[Properties])
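
     The query combines an organism term, an mRNA property term and an 
     EST exclusion term. As a minimal sketch, it could be composed in 
     Python as shown here. The function name is hypothetical and the 
     sketch does not cover how the term is submitted to Batch Entrez.

```python
from urllib.parse import quote_plus


def build_mrna_query(organism: str = "human") -> str:
    """Compose the Entrez term: mRNA records for one organism, minus ESTs."""
    include = f"({organism}[Organism] AND biomol mrna[Properties])"
    exclude = "gbdiv est[Properties]"
    return f"({include} NOT {exclude})"


query = build_mrna_query()
print(query)
# URL-encoded form, in case the term is passed as a query-string value.
print(quote_plus(query))
```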


2. EST SEQUENCE TISSUE SEPARATION
   Partitioning at a tissue level enhances searching efforts by 
   presenting the opportunity to rapidly explore transcript expression 
   in specific tissues or subsets such as the disease-related sequences.

   - All EST sequences of human origin were extracted from the gbest 
     files.
   - The raw human EST sequences were converted from GenBank format 
     into STACK FastA format.
   - These sequences were separated into 15 different tissue-based 
     categories and one disease category according to the STACK tissue 
     tree file, provided with this release.
   - Sequences that did not partition into any of the designated 
     categories were assigned manually. 
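
   The partitioning steps above can be sketched as follows. This is an 
   illustration only: the keyword map, the function name and the 
   "unassigned" bin are hypothetical, and the real assignments come 
   from the STACK tissue tree file, whose format is not reproduced 
   here. Whether one EST may fall into several categories is also an 
   assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical keyword-to-category map (assumption, not the tissue tree file).
TISSUE_KEYWORDS = {
    "adipose": "adipose",
    "fat": "adipose",
    "brain": "brain",
    "cortex": "brain",
    "tumor": "disease",
    "carcinoma": "disease",
}


def partition_by_tissue(records):
    """Group (accession, tissue_annotation) pairs by category.

    Records that match no keyword are collected under "unassigned" so
    that they can be reviewed manually.
    """
    bins = defaultdict(list)
    for accession, annotation in records:
        text = annotation.lower()
        hits = {cat for key, cat in TISSUE_KEYWORDS.items() if key in text}
        for category in sorted(hits) or ["unassigned"]:
            bins[category].append(accession)
    return dict(bins)
```

   The "unassigned" bin corresponds to the sequences that would be 
   assigned manually in the last step.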


3. mRNA SEQUENCE TISSUE SEPARATION
   The mRNA sequences were assigned to one or more tissue categories 
   using BLAST comparisons instead of relying on mRNA annotation. This 
   more comprehensive mRNA assignment ensured superior supervised 
   clustering and consensus sequence accuracy. 

   - The raw human mRNA sequences were extracted from NCBI in FastA 
     format using the list of mRNA GI accession numbers obtained from 
     the Entrez batch search query in step 1. 
   - Duplicate mRNA sequences, both in terms of duplicate accession 
     numbers and 100% sequence identity, were removed. In cases of 
     100% sequence duplication, the sequence with the most 
     comprehensive annotation was retained. 
   - Each of the FastA formatted mRNA sequences was assigned to one or 
     more of the 16 tissue categories using HTBLAST. The same mRNA 
     sequence may thus be in more than one tissue category.
   - The resulting 16 STACKdb files with multiple EST and mRNA sequence 
     entries in FastA format were then submitted to the stackPACK pipeline. 
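
   The duplicate removal step above can be sketched as follows. The 
   function name and record layout are hypothetical, and "most 
   comprehensive annotation" is approximated here by the longest 
   description line, which is an assumption rather than the criterion 
   actually used.

```python
def remove_duplicates(records):
    """Drop duplicate accessions and duplicate sequences.

    records: iterable of (accession, description, sequence) tuples.
    Assumption: the longest description stands in for the "most
    comprehensive annotation".
    """
    by_accession = {}
    for accession, description, sequence in records:
        # Keep only the first record seen for each accession number.
        by_accession.setdefault(accession, (accession, description, sequence))

    by_sequence = {}
    for record in by_accession.values():
        key = record[2].upper()
        kept = by_sequence.get(key)
        # Among identical sequences, keep the longer description.
        if kept is None or len(record[1]) > len(kept[1]):
            by_sequence[key] = record
    return list(by_sequence.values())
```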


4. StackPACK PIPELINE
   The stackPACK Transcript Reconstruction and Variation Analysis 
   Management System allows rapid clustering, alignment, analysis and 
   linking of ESTs as well as full-length sequences. StackPACK 
   thoroughly and effectively clusters error-laden and redundant data 
   into loose clusters and then refines these groupings in a step-wise 
   manner to elucidate the contributions of sequence polymorphism, 
   alternate expression forms, error and artifact. The data was 
   processed and managed via command line data submission in order to 
   allow greater manipulation and flexibility. Web-based submission is 
   also possible. StackPACK simplifies storage and retrieval of cluster 
   data by utilizing a relational database to coordinate, store and 
   manage data throughout the clustering pipeline. This pipeline is 
   described below.

   stack_Import
   ------------
   Each tissue category was processed separately as an individual 
   project. The sequences were imported into the stackPACK MySQL 
   database in GUESS FastA format for further processing by the 
   clustering engine.  

   stack_Mask
   -----------
   Sequences were masked in order to eliminate contaminants and 
   artifacts that can erroneously cluster sequences together.


   Input Data         Imported EST and mRNA sequences
   Algorithm          CrossMatch v.990319
   Parameter Settings minmatch=12
                      minscore=20
   Compared Against   RepBase v6.7, released October 2001: 
                      http://www.girinst.org/Repbase_Update-Login_Form.html
                      Vector database distributed by NCBI. 
                      Other potential contaminants such as rodent, 
                      mitochondrial and ribosomal DNA. The file 
                      containing the potential contaminants was 
                      created from GenBank entries for STACKdb 
                      processing. This file and the vector database 
                      are included with the stackPACK distribution. 
   Output             Masked EST and mRNA sequences. Any contaminated 
                      base pairs in sequences were replaced with "x"
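
   The replacement of contaminated bases with "x" can be sketched as 
   follows. The function name is hypothetical, and the sketch assumes 
   that matches are available as 1-based, inclusive (start, end) 
   ranges; it does not parse CrossMatch output.

```python
def mask_ranges(sequence, ranges, mask_char="x"):
    """Replace each 1-based, inclusive (start, end) range with mask_char."""
    bases = list(sequence)
    for start, end in ranges:
        # Clamp the range to the sequence so out-of-bounds hits are safe.
        for i in range(max(start, 1) - 1, min(end, len(bases))):
            bases[i] = mask_char
    return "".join(bases)


print(mask_ranges("ACGTACGTAC", [(3, 5)]))  # prints ACxxxCGTAC
```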


   stack_Cluster
   -------------
   Clustering employs d2_cluster, a high-performance comparison 
   algorithm that determines the relative similarity of large 
   datasets of nucleotide sequences. d2_cluster implements a loose 
   approach to sequence clustering by identifying and counting 
   matching n-length words. This loose approach presents the 
   opportunity to identify splice variants and alternate expression 
   forms. 

   Input Data         All masked EST and mRNA sequences
   Algorithm          d2_cluster
   Parameter Settings word_size=6 
                      similarity_cutoff=0.96 
                      minimum_sequence_size=50 
                      window_size=100
   Output             Multi-sequence clusters  
                      Singletons which did not cluster
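
   The idea of counting matching n-length words can be illustrated 
   with a short sketch. This is not the d2_cluster statistic: it does 
   not apply the similarity cutoff or the window size listed above, 
   and both function names are hypothetical.

```python
from collections import Counter


def word_counts(sequence, word_size=6):
    """Count every overlapping word of length word_size in a sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + word_size] for i in range(len(seq) - word_size + 1))


def shared_word_count(seq_a, seq_b, word_size=6):
    """Number of word occurrences common to both sequences."""
    common = word_counts(seq_a, word_size) & word_counts(seq_b, word_size)
    return sum(common.values())
```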

   The d2_cluster algorithm ignores sequences that contain fewer than 
   50 valid base pairs. Only A, T, C and G are considered valid bases. 
   Masked bases, represented by "x", are not valid and therefore do 
   not count toward the minimum number of base pairs required for 
   processing by d2_cluster. Sequences below this minimum are excluded 
   from the clustering step and are treated as singletons.
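
   The minimum-length rule can be sketched as follows. The function 
   names are hypothetical, and treating lower-case a, c, g and t as 
   valid bases is an assumption of this sketch.

```python
VALID_BASES = frozenset("ACGT")


def valid_length(sequence):
    """Count bases that are A, C, G or T; masked "x" and others are skipped.

    Assumption: matching is case-insensitive.
    """
    return sum(1 for base in sequence.upper() if base in VALID_BASES)


def split_by_minimum(sequences, minimum=50):
    """Separate sequences that meet the minimum from those set aside."""
    clusterable, set_aside = [], []
    for name, sequence in sequences:
        target = clusterable if valid_length(sequence) >= minimum else set_aside
        target.append(name)
    return clusterable, set_aside
```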

   d2_cluster is characterized in the following publication:
   "d2_cluster: A Validated Method for Clustering EST and Full-Length 
   cDNA Sequences." John Burke, Dan Davison, and Winston Hide, Genome 
   Research 9:1135-1142 (1999).


   stack_Assemble
   ---------------
   The related but loose clusters are subsequently aligned and 
   assembled in order to identify, characterize and isolate any 
   sequence divergence. 

   Input Data         All multi-sequence clusters
   Algorithm          PHRAP
   Parameter Settings vector_bound=0 
                      trim_score=20
                      forcelevel=0 
                      penalty=-2 
                      gap_init=-4 
                      gap_ext=-3 
                      ins_gap_ext=-3 
                      del_gap_ext=-3 
                      maxgap=30
                      flags=-retain_duplicates
   Output             Aligned and assembled clusters. Multiple 
                      contigs within clusters, generated when 
                      divergent groups are found within a cluster.


   stack_Analysis
   ---------------
   Consensus sequences are generated, consensus lengths are maximized, 
   sub-assemblies are partitioned and polymorphic regions and 
   alternative splicing forms within the clusters are annotated.

   Input Data         All aligned and assembled clusters
   Algorithm          CRAW
   Parameter Settings sig=0.5 
                      window_size=100
                      ignore_first=50
   Output             CRAW alignments. Analyzed CRAW alignments. Final 
                      contig consensus sequences. 


   stack_Link
   ----------- 
   Since all sequences generated from the same cDNA clone correspond 
   to a single gene, each sequence was searched for its clone 
   identifier so that transcripts corresponding to the same gene could 
   be identified. To avoid erroneous linking, two independent clones 
   are required to form a link between two clusters. This requirement 
   is set by the "red" (redundancy) parameter. 

   Input Data         Cluster membership. All EST and mRNA sequence 
                      clone names
   Algorithm          Internal linking algorithm
   Parameter Settings red=2
                      max_seq_per_clone=2
   Output             Clonelink consensus sequences consisting of 
                      linked multi-sequence clusters. 
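
   The redundancy requirement can be sketched as follows. The internal 
   linking algorithm is not described in this document, so this is 
   only an illustration of the "red" rule: the function name, cluster 
   IDs and clone names are hypothetical, and max_seq_per_clone is not 
   modelled.

```python
from itertools import combinations


def link_clusters(cluster_clones, red=2):
    """Return pairs of clusters that share at least `red` distinct clones.

    cluster_clones: dict mapping a cluster ID to the set of clone names
    found among its member sequences.
    """
    links = []
    for id_a, id_b in combinations(sorted(cluster_clones), 2):
        shared = cluster_clones[id_a] & cluster_clones[id_b]
        if len(shared) >= red:
            links.append((id_a, id_b, sorted(shared)))
    return links
```

   With red=2, a single shared clone is not enough to join two 
   clusters.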


Accession conventions used in the stackPACK interface:
The program assigns internal accession numbers to the clonelinks, 
cluster and consensus sequences produced in the clustering pipeline 
and these, as well as the original sequence accession numbers, may be 
used to query the viewer.

- ln#   Clonelink accession number.
- cl#   Cluster accession number.
- ct#   Contig accession number.
- cn#   Consensus accession number.
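
The prefixes above can be told apart with a short sketch. It assumes 
that "#" stands for a decimal number that directly follows the prefix, 
which this document does not state explicitly, and the function name is 
hypothetical.

```python
import re

ACCESSION_KINDS = {
    "ln": "clonelink",
    "cl": "cluster",
    "ct": "contig",
    "cn": "consensus",
}

# Assumption: an internal accession is a prefix followed only by digits.
_PATTERN = re.compile(r"^(ln|cl|ct|cn)(\d+)$")


def classify_accession(accession):
    """Return (kind, number) for an internal accession, else ("original", None)."""
    match = _PATTERN.match(accession.strip())
    if match is None:
        return "original", None
    prefix, number = match.groups()
    return ACCESSION_KINDS[prefix], int(number)
```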


5. CONVERSION OF DATA TO stackPACK V2.2 FORMAT
   Data within STACKdb v3.1.1 was converted to stackPACK v2.2 format 
   in order to use with the much improved stackPACK v2.2 viewing 
   software. STACKdb v3.1.1 thus contains the same data as STACKdb v3.1 
   but is provided in a format compatible with stackPACK v2.2. The 
   updated viewing software includes new viewing and extraction 
   functions that enable rapid simplified analysis and manipulation of 
   alignments and alignment analyses. Data exchange with third-party 
   programs is also simplified resulting in easier assessment of 
   highlighted areas of potential interest.

   The data was converted as follows:

   Command: stack_ProjectManager -convert <project> <old_dsn> <old_dsn_login> <old_dsn_password> <1|0>

   where:
   project =          project to be converted
   old_dsn =          old data source name
   old_dsn_login =    old data source name login
   old_dsn_password = old data source name password
   1|0 =              This argument specifies whether sequences in the 
                      stackPACK 2.1 project have been clustered or not. 
                      If this argument is set to 1, sequences will be 
                      assumed to be clustered. If this argument is set 
                      to 0 (or if this argument is not specified) 
                      sequences will be assumed to be unclustered.

   Example: stack_ProjectManager -convert adipose3_1 stacksys stackpack stackpack 1 


6. DATABASE FILES
   The project data for each of the STACKdb categories was dumped 
   from the stackPACK MySQL relational database into 16 different 
   database files as follows:
 
   Command: <mysql location>/mysqldump --opt -u<username> -p<password> -h<stackPACK host name> stackPACK<project ID> > output.dump

   where: 
   u = username
   p = password
   h = host

   Example: mysqldump --opt -ustackpack -pstackpack -hlocalhost stackPACK5 > adipose3_1_1.dump

   These 16 database files can then be loaded into the STACKdb MySQL 
   database for viewing, output report generation and data 
   manipulation. See the STACKdb v3.1 installation instructions for 
   details on importing STACKdb dump files.
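
   When many projects are dumped, the command line shown in this 
   section can be generated rather than typed. The helper below is a 
   hypothetical sketch that only builds the command string; it does 
   not run mysqldump, and the values used are those of the example in 
   this section.

```python
import shlex


def mysqldump_command(project_id, output_file, user, password, host="localhost"):
    """Build the mysqldump command line for one stackPACK project database."""
    args = [
        "mysqldump",
        "--opt",
        f"-u{user}",
        f"-p{password}",
        f"-h{host}",
        f"stackPACK{project_id}",
    ]
    # shlex quotes any argument that contains shell metacharacters.
    return f"{shlex.join(args)} > {shlex.quote(output_file)}"


print(mysqldump_command(5, "adipose3_1_1.dump", "stackpack", "stackpack"))
```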


7. OUTPUT REPORT PRODUCTION
   Two reports for each of the STACKdb categories, including the mRNA 
   index, have been generated to enable immediate access to STACKdb 
   data for searching and analysis.

   1. Non-redundant output reports:
      One comprehensive non-redundant report containing all STACKdb 
      entries is provided per STACKdb category in FastA format. These 
      reports were generated using the stackPACK v2.2 command line 
      stack_ReportNonRedundant.py script as follows:  

      Command:  stack_ReportNonRedundant.py --Owner=<ProjectOwner> --Project=<ProjectName> --Format=FASTA [SequenceOptions] --Output=<OutputFilename>

      Example: stack_ReportNonRedundant.py --Owner=support@egenetics.com --Project=adipose3_1_1 --Format=FASTA --Clonelink --Primary --Singletons --Output=adipose_nonredundant.fasta

   Each report consists of the following concatenated STACKdb entries:
   - All clonelinked consensus sequences. 
   - All primary consensus sequences from multi-sequence clusters that 
     are NOT found in clonelinked sequences. 
   - All singleton sequences that are NOT found in clonelinked 
     sequences. 


   2. Alternate output reports:
      One alternate output report containing all alternate consensus 
      sequences from all multi-sequence clusters is provided per 
      STACKdb category in FastA format. These alternate consensus 
      sequences represent potential alternate expression forms and the 
      reports were generated using the stackPACK v2.2 command line 
      stack_ReportConsensus.py script as follows:  

      Command: stack_ReportConsensus.py --Owner=<ProjectOwner> --Project=<ProjectName> <ConsensusOptions> --Output=<OutputFilename>

      Example: stack_ReportConsensus.py --Owner=support@egenetics.com --Project=adipose3_1_1 --Alternate --Output=adipose_alternate.fasta


   These reports can be used as input to formatdb in order to create a 
   BLASTable database.



CONTACT INFORMATION
We value your comments and feedback. Please get in touch with Electric 
Genetics if you have any queries:

phone   +27 21 9593964
fax     +27 21 9592512
e-mail  support@egenetics.com