----------------------------------------------------------------------
Generation Protocol
STACKdb v3.1.1
----------------------------------------------------------------------
Release Date: January 2003
STACKdb v3.1.1 was created using the stackPACK v2.1, v2.1.1 and v2.2
Transcript Reconstruction and Variation Analysis Management System.
Processing was primarily carried out on a Silicon Graphics Origin 2000
platform with 12 processors. Compaq and Linux platforms were also
utilized for processing.
1. DATA ACQUISITION
All gbest files and human mRNA sequences from the GenBank release
125.0, 24 August 2001, were downloaded on 25 August 2001 for STACKdb
production:
- The gbest files were downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/genbank/
- The mRNA sequences were downloaded from:
http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide. Based on
recommendations from NCBI, a list of GI accession numbers of all
human mRNA sequences was downloaded by submitting an Entrez batch
search query as follows:
((human[Organism] AND biomol_mrna[Properties]) NOT gbdiv_est[Properties])
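
The query above can be composed programmatically. The following is a
minimal sketch, assuming the property tags are written with underscores
(biomol_mrna, gbdiv_est). The helper names and the "db"/"term" parameter
names are illustrative only and are not part of stackPACK or of a
documented NCBI interface.

```python
from urllib.parse import urlencode


def build_mrna_query(organism="human"):
    """Compose a search term selecting mRNA records that are not ESTs."""
    include = f"({organism}[Organism] AND biomol_mrna[Properties])"
    exclude = "gbdiv_est[Properties]"
    return f"({include} NOT {exclude})"


def as_query_string(term, db="Nucleotide"):
    """URL-encode the term; 'db' and 'term' are hypothetical parameter names."""
    return urlencode({"db": db, "term": term})


if __name__ == "__main__":
    term = build_mrna_query()
    print(term)
    print(as_query_string(term))
```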
2. EST SEQUENCE TISSUE SEPARATION
Partitioning at the tissue level makes it possible to rapidly explore
transcript expression in specific tissues or in subsets such as the
disease-related sequences.
- All EST sequences of human origin were extracted from the gbest
files.
- The raw human EST sequences were converted from GenBank format
into STACK FastA format.
- These sequences were separated into 15 different tissue-based
categories and one disease category according to the STACK tissue
tree file, provided with this release.
- Sequences that did not partition into any of the designated
categories were assigned manually.
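
The partitioning step can be pictured roughly as follows. This is a
sketch only: the keyword table is a hypothetical two-category excerpt,
not the STACK tissue tree file, and assigning each EST to the first
matching category is an assumption. Records that match nothing are
returned separately, mirroring the manual assignment described above.

```python
from collections import defaultdict

# Hypothetical excerpt of a tissue tree: category -> annotation keywords.
TISSUE_KEYWORDS = {
    "adipose": ["adipose"],
    "brain": ["brain", "cerebellum", "hippocampus"],
}


def partition_by_tissue(records, keywords=TISSUE_KEYWORDS):
    """Split (accession, tissue_annotation) pairs into tissue bins.

    Returns (bins, unassigned); unassigned accessions need manual review.
    """
    bins = defaultdict(list)
    unassigned = []
    for accession, annotation in records:
        text = annotation.lower()
        for category, words in keywords.items():
            if any(word in text for word in words):
                bins[category].append(accession)
                break  # assumption: first matching category wins
        else:
            unassigned.append(accession)
    return dict(bins), unassigned
```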
3. mRNA SEQUENCE TISSUE SEPARATION
The mRNA sequences were assigned to one or more tissue categories
using BLAST comparisons instead of relying on mRNA annotation. This
more comprehensive mRNA assignment improved supervised clustering
and the accuracy of the consensus sequences.
- The raw human mRNA sequences were extracted from NCBI in FastA
format using the list of mRNA GI accession numbers obtained from
the Entrez batch search query in step 1.
- Duplicate mRNA sequences, both in terms of duplicate accession
numbers and 100% sequence identity, were removed. In cases of
100% sequence duplication, the sequence with the most
comprehensive annotation was retained.
- Each of the FastA formatted mRNA sequences was assigned to one or
more of the 16 tissue categories using HTBLAST. The same mRNA
sequence may thus be in more than one tissue category.
- The resulting 16 STACKdb files with multiple EST and mRNA sequence
entries in FastA format were then submitted to the stackPACK pipeline.
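
The duplicate-removal rule in the list above can be sketched as below.
The record layout (accession, description, sequence) is assumed for
illustration, and description length is used as a stand-in for "most
comprehensive annotation"; the actual criterion used for STACKdb is not
specified in this document.

```python
def deduplicate(records):
    """Drop repeated accessions, then collapse 100% identical sequences.

    records: iterable of (accession, description, sequence) tuples.
    For identical sequences, the record with the longest description
    is kept (a proxy for annotation completeness, assumed here).
    """
    seen_accessions = set()
    best_by_sequence = {}
    for accession, description, sequence in records:
        if accession in seen_accessions:
            continue  # duplicate accession number
        seen_accessions.add(accession)
        key = sequence.upper()
        current = best_by_sequence.get(key)
        if current is None or len(description) > len(current[1]):
            best_by_sequence[key] = (accession, description, sequence)
    return list(best_by_sequence.values())
```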
4. StackPACK PIPELINE
The stackPACK Transcript Reconstruction and Variation Analysis
Management System allows rapid clustering, alignment, analysis and
linking of ESTs as well as full-length sequences. StackPACK
thoroughly and effectively clusters error-laden and redundant data
into loose clusters and then refines these groupings in a step-wise
manner to elucidate the contributions of sequence polymorphism,
alternate expression forms, error and artifact. The data was
processed and managed via command line data submission in order to
allow greater manipulation and flexibility. Web-based submission is
also possible. StackPACK simplifies storage and retrieval of cluster
data by utilizing a relational database to coordinate, store and
manage data throughout the clustering pipeline. This pipeline is
described below.
stack_Import
------------
Each tissue category was processed separately as an individual
project. The sequences were imported into the stackPACK MySQL
database in GUESS FastA format for further processing by the
clustering engine.
stack_Mask
-----------
Sequences were masked in order to eliminate contaminants and
artifacts that can erroneously cluster sequences together.
Input Data           Imported EST and mRNA sequences
Algorithm            CrossMatch v.990319
Parameter Settings   minmatch=12
                     minscore=20
Compared Against     RepBase v6.7, released October 2001:
                     http://www.girinst.org/Repbase_Update-Login_Form.html
                     Vector database distributed by NCBI.
                     Other potential contaminants such as rodent,
                     mitochondrial and ribosomal DNA. The file
                     containing the potential contaminants was
                     created from GenBank entries for STACKdb
                     processing. This file and the vector database
                     are included with the stackPACK distribution.
Output               Masked EST and mRNA sequences. Any contaminated
                     base pairs in sequences were replaced with "x".
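
The effect of masking on a single sequence can be illustrated as
follows. This sketch does not call CrossMatch; it assumes the matched
regions are already available as 0-based, half-open (start, end)
intervals, which is a convention chosen here for illustration.

```python
def mask_regions(sequence, regions, mask_char="x"):
    """Replace every base inside the given intervals with mask_char."""
    bases = list(sequence)
    for start, end in regions:
        start = max(start, 0)          # clamp to the sequence bounds
        end = min(end, len(bases))
        for i in range(start, end):
            bases[i] = mask_char
    return "".join(bases)


if __name__ == "__main__":
    print(mask_regions("ACGTACGTAC", [(2, 5)]))
```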
stack_Cluster
-------------
Clustering employs d2_cluster, a high-performance comparison
algorithm that determines the relative similarity of large
datasets of nucleotide sequences. d2_cluster implements a loose
approach to sequence clustering by identifying and counting
matching n-length words. This loose approach presents the
opportunity to identify splice variants and alternate expression
forms.
Input Data           All masked EST and mRNA sequences
Algorithm            d2_cluster
Parameter Settings   word_size=6
                     similarity_cutoff=0.96
                     minimum_sequence_size=50
                     window_size=100
Output               Multi-sequence clusters
                     Singletons which did not cluster
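
The idea of counting matching n-length words can be sketched as below.
This is not the d2_cluster implementation: it counts words of size 6
over whole sequences and sums the squared differences in word counts,
and it does not reproduce the windowing (window_size) or the scoring
behind similarity_cutoff. The function names are illustrative.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def word_counts(sequence, word_size=6):
    """Count overlapping words, skipping any word with a non-ACGT base."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - word_size + 1):
        word = sequence[i:i + word_size]
        if set(word) <= VALID_BASES:
            counts[word] += 1
    return counts


def squared_count_distance(seq_a, seq_b, word_size=6):
    """Sum of squared word-count differences; 0 means identical word content."""
    counts_a = word_counts(seq_a, word_size)
    counts_b = word_counts(seq_b, word_size)
    words = set(counts_a) | set(counts_b)
    return sum((counts_a[w] - counts_b[w]) ** 2 for w in words)
```

A smaller distance indicates more shared word content between the two
sequences.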
The d2_cluster algorithm ignores sequences that contain fewer than 50
valid base pairs after masking. Only A, T, C and G are considered
valid bases. Masked sequence bases, represented by "x", are not
considered valid bases and thus are not counted toward the minimum
number of base pairs required for processing by d2_cluster. Sequences
with fewer than 50 valid bp are not included in the clustering step
and are treated as singletons.
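
The minimum-size rule can be expressed as a short filter. This is a
sketch under the assumptions stated in the text (only A, T, C and G
count; the threshold is 50); the function names are illustrative and
are not part of d2_cluster.

```python
def valid_base_count(sequence):
    """Number of A, T, C and G bases; masked 'x' and other symbols are skipped."""
    return sum(1 for base in sequence.upper() if base in "ACGT")


def split_by_min_size(sequences, minimum=50):
    """Separate sequences eligible for clustering from forced singletons.

    sequences: dict mapping sequence name -> sequence string.
    """
    clusterable, singletons = {}, {}
    for name, sequence in sequences.items():
        if valid_base_count(sequence) >= minimum:
            clusterable[name] = sequence
        else:
            singletons[name] = sequence
    return clusterable, singletons
```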
d2_cluster is characterized in the following publication:
"d2_cluster: A Validated Method for Clustering EST and Full-Length
cDNA Sequences." John Burke, Dan Davison, and Winston Hide, Genome
Research 9:1135-1142 (1999).
stack_Assemble
---------------
The related but loose clusters are subsequently aligned and
assembled in order to identify, characterize and isolate any
sequence divergence.
Input Data           All multi-sequence clusters
Algorithm            PHRAP
Parameter Settings   vector_bound=0
                     trim_score=20
                     forcelevel=0
                     penalty=-2
                     gap_init=-4
                     gap_ext=-3
                     ins_gap_ext=-3
                     del_gap_ext=-3
                     maxgap=30
                     flags=-retain_duplicates
Output               Aligned and assembled clusters. Multiple
                     contigs within clusters, generated when
                     divergent groups are found within a cluster.
stack_Analysis
---------------
Consensus sequences are generated, consensus lengths are maximized,
sub-assemblies are partitioned and polymorphic regions and
alternative splicing forms within the clusters are annotated.
Input Data           All aligned and assembled clusters
Algorithm            CRAW
Parameter Settings   sig=0.5
                     window_size=100
                     ignore_first=50
Output               CRAW alignments. Analyzed CRAW alignments. Final
                     contig consensus sequences.
stack_Link
-----------
Since all sequences generated from the same cDNA clone correspond
to a single gene, each sequence was searched for clone identification
so that transcripts corresponding to the same gene could be
identified. In order to avoid erroneous linking, we require two
independent clones to form a link between two clusters. This is
represented by the "red" (redundancy) parameter.
Input Data           Cluster membership. All EST and mRNA sequence
                     clone names
Algorithm            Internal linking algorithm
Parameter Settings   red=2
                     max_seq_per_clone=2
Output               Clonelink consensus sequences consisting of
                     linked multi-sequence clusters.
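
The redundancy requirement can be sketched as follows. This is not the
internal linking algorithm: it assumes each cluster is described by the
set of clone names of its members, links two clusters when they share
at least "red" clones, and groups linked clusters transitively. The
max_seq_per_clone parameter is not modeled.

```python
from itertools import combinations


def link_clusters(cluster_clones, red=2):
    """Group clusters that share at least `red` independent clones.

    cluster_clones: dict mapping cluster id -> set of clone names.
    Returns a list of sets of cluster ids (only groups of two or more).
    """
    parent = {cluster: cluster for cluster in cluster_clones}

    def find(cluster):
        # Union-find lookup with path halving.
        while parent[cluster] != cluster:
            parent[cluster] = parent[parent[cluster]]
            cluster = parent[cluster]
        return cluster

    for a, b in combinations(cluster_clones, 2):
        if len(cluster_clones[a] & cluster_clones[b]) >= red:
            parent[find(a)] = find(b)

    groups = {}
    for cluster in cluster_clones:
        groups.setdefault(find(cluster), set()).add(cluster)
    return [members for members in groups.values() if len(members) > 1]
```

With red=2, a single shared clone is not enough to join two clusters,
which is the safeguard against erroneous linking described above.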
Accession conventions used in the stackPACK interface:
The program assigns internal accession numbers to the clonelinks,
cluster and consensus sequences produced in the clustering pipeline
and these, as well as the original sequence accession numbers, may be
used to query the viewer.
- ln# Clonelink accession number.
- cl# Cluster accession number.
- ct# Contig accession number.
- cn# Consensus accession number.
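
The prefixes above can be recognized with a small helper. This is a
sketch that assumes "#" stands for a decimal number written directly
after the two-letter prefix; anything else is treated as an original
sequence accession. The helper is illustrative and not part of the
stackPACK interface.

```python
import re

PREFIXES = {
    "ln": "clonelink",
    "cl": "cluster",
    "ct": "contig",
    "cn": "consensus",
}
_ACCESSION = re.compile(r"^(ln|cl|ct|cn)(\d+)$")


def classify_accession(accession):
    """Return (kind, number) for internal accessions, else ('original', text)."""
    match = _ACCESSION.match(accession)
    if match is None:
        return ("original", accession)
    prefix, number = match.groups()
    return (PREFIXES[prefix], int(number))
```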
5. CONVERSION OF DATA TO stackPACK V2.2 FORMAT
Data within STACKdb v3.1.1 was converted to stackPACK v2.2 format
for use with the much-improved stackPACK v2.2 viewing software.
STACKdb v3.1.1 thus contains the same data as STACKdb v3.1
but is provided in a format compatible with stackPACK v2.2. The
updated viewing software includes new viewing and extraction
functions that enable rapid simplified analysis and manipulation of
alignments and alignment analyses. Data exchange with third-party
programs is also simplified, making it easier to assess highlighted
areas of potential interest.
The data was converted as follows:
Command: stack_ProjectManager -convert <project> <old_dsn> <old_dsn_login> <old_dsn_password> <1|0>
where:
project          = project to be converted
old_dsn          = old data source name
old_dsn_login    = old data source name login
old_dsn_password = old data source name password
1|0              = specifies whether sequences in the stackPACK 2.1
                   project have been clustered. If this argument is
                   set to 1, sequences are assumed to be clustered.
                   If it is set to 0 (or is not specified), sequences
                   are assumed to be unclustered.
Example: stack_ProjectManager -convert adipose3_1 stacksys stackpack stackpack 1
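
When several projects need converting, the command line above can be
assembled in a script. The sketch below only builds the argument list
in the order given above and does not run anything; the function name
is hypothetical and requires Python 3.8 or later for shlex.join.

```python
import shlex


def convert_command(project, old_dsn, login, password, clustered=False):
    """Build the stack_ProjectManager -convert argument list."""
    return [
        "stack_ProjectManager",
        "-convert",
        project,
        old_dsn,
        login,
        password,
        "1" if clustered else "0",  # 1 = sequences already clustered
    ]


if __name__ == "__main__":
    args = convert_command("adipose3_1", "stacksys", "stackpack",
                           "stackpack", clustered=True)
    print(shlex.join(args))
```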
6. DATABASE FILES
The project data for each of the STACKdb categories was dumped
from the stackPACK MySQL relational database into 16 different
database files as follows:
Command: <mysql location>/mysqldump --opt -u<username> -p<password> -h<stackPACK host name> stackPACK<project ID> > output.dump
where:
u = username
p = password
h = host
Example: mysqldump --opt -ustackpack -pstackpack -hlocalhost stackPACK5 > adipose3_1_1.dump
These 16 database files can then be loaded into the STACKdb MySQL
database for viewing, output report generation and data
manipulation. See the STACKdb v3.1 installation instructions for
details on importing STACKdb dump files.
7. OUTPUT REPORT PRODUCTION
Two reports for each of the STACKdb categories, including the mRNA
index, have been generated to enable immediate access to STACKdb
data for searching and analysis.
1. Non-redundant output reports:
One comprehensive non-redundant report containing all STACKdb
entries is provided per STACKdb category in FastA format. These
reports were generated using the stackPACK v2.2 command line
stack_ReportNonRedundant.py script as follows:
Command: stack_ReportNonRedundant.py --Owner=<ProjectOwner> --Project=<ProjectName> --Format=FASTA [SequenceOptions] --Output=<OutputFilename>
Example: stack_ReportNonRedundant.py --Owner=support@egenetics.com --Project=adipose3_1_1 --Format=FASTA --Clonelink --Primary --Singletons --Output=adipose_nonredundant.fasta
Each report consists of the following concatenated STACKdb entries:
- All clonelinked consensus sequences.
- All primary consensus sequences from multi-sequence clusters that
are NOT found in clonelinked sequences.
- All singleton sequences that are NOT found in clonelinked
sequences.
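
The composition of a non-redundant report can be sketched as below.
This is not the report script itself: the input dictionaries and the
way clonelink membership is recorded are assumptions made for
illustration, and the FastA writer emits unwrapped sequence lines.

```python
def nonredundant_entries(clonelinks, primary_consensus, singletons):
    """Concatenate entries in the order listed above.

    clonelinks: dict clonelink id -> (consensus sequence, set of member ids)
    primary_consensus: dict cluster id -> primary consensus sequence
    singletons: dict sequence id -> sequence
    """
    linked = set()
    for _, members in clonelinks.values():
        linked |= members

    entries = [(link_id, seq) for link_id, (seq, _) in clonelinks.items()]
    entries += [(cid, seq) for cid, seq in primary_consensus.items()
                if cid not in linked]
    entries += [(sid, seq) for sid, seq in singletons.items()
                if sid not in linked]
    return entries


def to_fasta(entries):
    """Render (name, sequence) pairs as FastA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in entries)
```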
2. Alternate output reports:
One alternate output report containing all alternate consensus
sequences from all multi-sequence clusters is provided per
STACKdb category in FastA format. These alternate consensus
sequences represent potential alternate expression forms and the
reports were generated using the stackPACK v2.2 command line
stack_ReportConsensus.py script as follows:
Command: stack_ReportConsensus.py --Owner=<ProjectOwner> --Project=<ProjectName> <ConsensusOptions> --Output=<OutputFilename>
Example: stack_ReportConsensus.py --Owner=support@egenetics.com --Project=adipose3_1_1 --Alternate --Output=adipose_alternate.fasta
These reports can be input into formatdb in order to create a BLASTable database.
CONTACT INFORMATION
We value your comments and feedback. Please get in touch with Electric
Genetics if you have any queries:
phone +27 21 9593964
fax +27 21 9592512
e-mail support@egenetics.com