dbSTSeq (database of Standard Transcript Sequences) is a collection of revised transcript sequences built on the officially released genomic DNA sequences of human and other model species. The transcript sequences are collected from cDNA, mRNA and EST sequences in the public domain. Polymorphisms, sequencing errors and vector contamination in the transcript sequences are "masked" after the transcripts are mapped to a unique set of genomic DNA sequences. A program named EIparser was developed specifically for this purpose; it determines the gene structure of each record in detail in order to improve gene annotation. On this basis, dbSTSeq is intended to provide a high-quality "standard" reference database for human and other model species, for use in large-scale analysis of gene transcription and in the detection of regulatory elements.
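The masking step described above can be illustrated with a minimal sketch. This is not EIparser and does not reproduce its algorithm: the function name `mask_transcript`, the use of `N` as the mask character, and the requirement that the two sequences are already aligned without gaps are all assumptions made for illustration only.

```python
def mask_transcript(transcript: str, genomic: str, mask_char: str = "N") -> str:
    """Mask every transcript base that disagrees with the aligned genomic base.

    Hypothetical helper: assumes both strings are already aligned,
    gap-free, and of equal length.
    """
    if len(transcript) != len(genomic):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        t if t.upper() == g.upper() else mask_char
        for t, g in zip(transcript, genomic)
    )


# Toy example: one mismatch at the fourth base (C in the transcript, A in the genome).
masked = mask_transcript("ATGCCGTA", "ATGACGTA")
print(masked)  # ATGNCGTA
```

A real pipeline would first need a spliced alignment of the transcript to the genome (the job done by tools such as EIparser, Blat or Sim4) before any base-by-base comparison is meaningful.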
1. Source of data for analysis
(1) Human genome database (release date: 20050420)
URL: ftp://ftp.ncbi.nih.gov/blast/db/FASTA/human_genomic.gz
(2) Human RefSeq database (release date: 20050418; 29176 records)
URL: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/

2. dbSTSeq for human genes based on RefSeq
(1) Of the 29176 records in the RefSeq database, 29156 could be mapped to the genome sequence; for the remaining 20 records no chromosomal location could be found.
(2) Three programs, EIparser, Blat and Sim4, were used to evaluate the quality of each RefSeq record against its corresponding genomic DNA region. The cross-comparison is summarized in the following tables.
| Program | Perfect match (Type I) | Revisable match (Type II) | Unrevisable match (Type III) |
|---|---|---|---|
| EIparser | 15663 | 11758 | 1735 |
| Blat | 15500 | 13270 | 386 |
| Sim4 | 14292 | 13795 | 1069 |
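As a quick consistency check on the per-program counts above, the following sketch confirms that each program's three categories add up to the 29156 mapped records (29176 RefSeq records minus the 20 without a chromosomal location) and prints the share of each category. The variable and function names are illustrative and not part of dbSTSeq.

```python
TOTAL_REFSEQ = 29176   # records in the RefSeq release
UNMAPPED = 20          # records with no chromosomal location
MAPPED = TOTAL_REFSEQ - UNMAPPED

# (Type I, Type II, Type III) counts per program, copied from the table.
counts = {
    "EIparser": (15663, 11758, 1735),
    "Blat": (15500, 13270, 386),
    "Sim4": (14292, 13795, 1069),
}


def type_fractions(row):
    """Return each category's share of the row total."""
    total = sum(row)
    return tuple(n / total for n in row)


for program, row in counts.items():
    # Every mapped record falls into exactly one category per program.
    assert sum(row) == MAPPED
    shares = ", ".join(f"{f:.1%}" for f in type_fractions(row))
    print(f"{program}: {shares}")
```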
Cross comparison based on three programs

| Category | Program | Type I: EIparser | Type I: Blat | Type I: Sim4 | Type II: EIparser | Type II: Blat | Type II: Sim4 | Type III: EIparser | Type III: Blat | Type III: Sim4 |
|---|---|---|---|---|---|---|---|---|---|---|
| Perfect match (Type I) | EIparser | 15663 | 15434 | 14250 | | 224 | 1174 | | 5 | 239 |
| Perfect match (Type I) | Blat | 15434 | 15500 | 14153 | 45 | | 1114 | 21 | | 233 |
| Perfect match (Type I) | Sim4 | 14250 | 14153 | 14292 | 33 | 136 | | 9 | 3 | |
| Revisable match (Type II) | EIparser | | 45 | 33 | 11758 | 11669 | 11329 | | 44 | 396 |
| Revisable match (Type II) | Blat | 224 | | 136 | 11669 | 13270 | 12550 | 1377 | | 584 |
| Revisable match (Type II) | Sim4 | 1174 | 1114 | | 11329 | 12550 | 13795 | 1292 | 131 | |
| Unrevisable match (Type III) | EIparser | | 21 | 9 | | 1377 | 1292 | 1735 | 337 | 434 |
| Unrevisable match (Type III) | Blat | 5 | | 3 | 44 | | 131 | 337 | 386 | 252 |
| Unrevisable match (Type III) | Sim4 | 239 | 233 | | 396 | 584 | | 434 | 252 | 1069 |
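One way to read the cross-comparison is as three pairwise 3x3 tables, where each cell counts the records given the row type by one program and the column type by the other. The sketch below, under that reading, checks that each pairwise table covers all 29156 mapped records and computes how often two programs assign the same type. The `joint` layout and the `agreement` function are illustrative names, not part of dbSTSeq.

```python
MAPPED = 29156

# joint[(a, b)][i][j]: records that program a puts in type i and
# program b puts in type j (order: Type I, Type II, Type III).
joint = {
    ("EIparser", "Blat"): [[15434, 224, 5], [45, 11669, 44], [21, 1377, 337]],
    ("EIparser", "Sim4"): [[14250, 1174, 239], [33, 11329, 396], [9, 1292, 434]],
    ("Blat", "Sim4"): [[14153, 1114, 233], [136, 12550, 584], [3, 131, 252]],
}


def agreement(matrix):
    """Fraction of records to which both programs assign the same type."""
    total = sum(sum(row) for row in matrix)
    same = sum(matrix[i][i] for i in range(len(matrix)))
    return same / total


for (prog_a, prog_b), matrix in joint.items():
    # Each pairwise table must account for every mapped record exactly once.
    assert sum(sum(row) for row in matrix) == MAPPED
    print(f"{prog_a} vs {prog_b}: {agreement(matrix):.4f}")
```

The row sums of each pairwise table reproduce the first program's per-type totals (for example 15434 + 224 + 5 = 15663 for EIparser Type I), which is a useful check when transcribing the numbers.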
Reference
LI Yu-Jian, LI Zhi-Feng, ZHANG Cheng-Gang. EIparser: An Efficient Tool for Parsing Exon/Intron Structure of Spliced Genes. (submitted)