Introduction

Small RNAs of less than 40 nucleotides (nt) in length, such as microRNAs (miRNAs), small interfering RNAs (siRNAs), scanRNAs (scnRNAs) and piwi-interacting RNAs (piRNAs), constitute a large family of tiny regulatory molecules. Of these, miRNAs, piRNAs and siRNAs have been found in mammals, where they have diverse and important functions.

Although our knowledge of mammalian small RNAs has advanced rapidly, the primary transcripts of most mammalian small RNAs remain to be determined. Uncovering these primary transcripts is important for understanding small RNA biogenesis. It helps to (a) identify regulatory regions such as transcription factor binding sites (TFBS), and hence discover upstream regulators in the network, (b) detect other sequence and structural motifs required for small RNA processing and (c) provide essential information for small RNA knockout (Saini et al., 2007). The most direct and reliable way to determine primary transcripts is wet-lab experiments such as rapid amplification of cDNA ends (RACE). However, such experiments are expensive and not suitable for large-scale analysis. Genomic analysis is an efficient substitute. By combining transcription information available from public databases such as GenBank, genomic analysis can provide preliminary evidence for primary transcripts. Several studies (Saini et al., 2007; Jin et al., 2006; Rodriguez et al., 2004) have attempted to delineate the genomic boundaries of miRNAs by large-scale genomic analysis.

BatchGenAna was developed specifically for large-scale genomic analysis of small RNAs. It has the following seven distinct features:

1. Provide batch mapping and annotation for as many as 1,000 nucleotide sequences or 10,000 genomic loci of small RNAs at a time.
2. Utilize two alignment tools, miBLAST (Kim et al., 2005) for sequences shorter than 40 nt and blat (Kent, 2002) for longer sequences, to improve both computational efficiency and accuracy. In our case, miBLAST is ~30 times faster than BLAST and ~10 times faster than WU-BLAST for sequences shorter than 40 nt, with the same sensitivity.
3. Provide genomic features including RefSeq genes, mRNAs, ESTs and repeat elements.
4. Produce two types of results. The first is tabular results, i.e. files in tab-delimited format, in which the genomic loci and names of the genomic features overlapping the submitted queries are listed separately. The second is graphical views, in which users' queries are shown as green boxes (plus strand) and red boxes (minus strand), and ESTs, mRNAs, RefSeq genes and repeat elements are displayed in separate tracks with different colors. Users may select either or both result types.
5. Find clusters of submitted queries automatically according to their genomic loci, and display the whole cluster in one genome view.
6. Provide options for fetching flanking sequences of submitted queries and the overlapping transcripts. The flanking sequences of the overlapping transcripts can be extracted according to their GenBank accession numbers in a second run.
7. Notify users via email when their jobs are completed. The results are kept on the server for 120 hours.
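Features 2, 5 and 6 above can be illustrated with a minimal Python sketch. This is not the BatchGenAna implementation; it only shows the kind of logic the feature list describes. The function names, the clustering distance (max_gap) and the treatment of sequences of exactly 40 nt are assumptions made for illustration, since the text above does not specify them.

```python
from typing import List, Tuple

# (chromosome, start, end, strand), 1-based inclusive coordinates
Locus = Tuple[str, int, int, str]

# Threshold taken from feature 2; sequences of exactly 40 nt are sent
# to blat here, which is an assumption.
SHORT_QUERY_NT = 40


def choose_aligner(sequence: str) -> str:
    """Pick the aligner name for a query based on its length."""
    return "miBLAST" if len(sequence) < SHORT_QUERY_NT else "blat"


def cluster_loci(loci: List[Locus], max_gap: int = 10_000) -> List[List[Locus]]:
    """Group loci on the same chromosome that lie within max_gap of each other.

    max_gap is a hypothetical parameter; strand is ignored when clustering.
    """
    clusters: List[List[Locus]] = []
    for locus in sorted(loci, key=lambda item: (item[0], item[1], item[2])):
        if clusters:
            current = clusters[-1]
            same_chrom = current[-1][0] == locus[0]
            current_end = max(item[2] for item in current)
            if same_chrom and locus[1] - current_end <= max_gap:
                current.append(locus)
                continue
        clusters.append([locus])
    return clusters


def flanking_region(locus: Locus, flank: int, chrom_length: int) -> Locus:
    """Extend a locus by `flank` bases on both sides, clipped to the chromosome."""
    chrom, start, end, strand = locus
    return (chrom, max(1, start - flank), min(chrom_length, end + flank), strand)
```

For example, two queries a few kilobases apart on the same chromosome fall into one cluster and could be drawn in a single genome view, while a query on another chromosome starts a new cluster.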

 

Data Source
The human genome is the reference assembly in NCBI Build 36.2 (released on September 14, 2006), downloaded from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens. The annotations of RefSeq genes, mRNAs and ESTs along the sequences are from NCBI Map Viewer, downloaded from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/mapview.

The mouse genome is the reference assembly in NCBI Build 37.1 (released on July 5, 2007), downloaded from ftp://ftp.ncbi.nih.gov/genomes/M_musculus. The reference assembly represents the C57BL/6J genome and includes contigs assembled from finished (phase 3) high-throughput genomic sequences (HTGS), single-fragment HTGS phase 2 and WGS contigs (for details, please see http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=10090). The annotations of RefSeq genes, mRNAs and ESTs along the sequences are from NCBI Map Viewer, downloaded from ftp://ftp.ncbi.nih.gov/genomes/M_musculus/mapview.

The rat genome is the RGSC_v3.4 assembly in NCBI Build 4.1 (released on July 5, 2006), downloaded from ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus. The annotations of RefSeq genes, mRNAs and ESTs along the sequences are from NCBI Map Viewer, downloaded from ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus/mapview.

The repeat annotations for the human, mouse and rat genomes were downloaded from UCSC goldenPath for the same reference assemblies.

All the related annotation data are stored in a MySQL database.
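As an illustration of how stored annotations can be matched against a query locus, the following Python sketch runs an interval-overlap query and prints tab-delimited rows. It is not the actual BatchGenAna schema or code: the table name, column names and example rows are hypothetical, and an in-memory SQLite database stands in for MySQL so that the example is self-contained.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of annotated features (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature ("
        " track TEXT, name TEXT, chrom TEXT,"
        " chrom_start INTEGER, chrom_end INTEGER, strand TEXT)"
    )
    # Made-up example rows, not real annotations.
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("RefSeq", "GENE_A", "chr1", 1000, 9000, "+"),
            ("EST", "EST_1", "chr1", 4000, 4500, "+"),
            ("Repeat", "REP_1", "chr1", 20000, 21000, "-"),
            ("mRNA", "MRNA_1", "chr2", 1000, 9000, "+"),
        ],
    )
    return conn


def overlapping_features(conn, chrom, start, end):
    """Return (track, name) for features overlapping chrom:start-end.

    Two closed intervals overlap when each one starts before the other ends.
    """
    return conn.execute(
        "SELECT track, name FROM feature"
        " WHERE chrom = ? AND chrom_start <= ? AND chrom_end >= ?"
        " ORDER BY track, name",
        (chrom, end, start),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for track, name in overlapping_features(db, "chr1", 4100, 4121):
        # One tab-delimited row per overlapping feature.
        print("\t".join(["chr1", "4100", "4121", track, name]))
```

In a production setting an index on (chrom, chrom_start, chrom_end) would normally be added so that batch queries over thousands of loci stay fast.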

 

Computer Configuration

The current web service runs on a quad Intel Xeon 3.2 GHz machine with 4 GB RAM. The operating system is Red Hat Enterprise Linux AS 3.

 

Updates

December 6, 2007: The mouse genome and annotations have been updated to NCBI Build 37.1, the same as UCSC mm9.

January 24, 2007: The human genome and annotations have been updated to NCBI Build 36.2, the same as UCSC hg18.

January 22, 2007: The rat genome and annotations have been updated to NCBI Build 4.1, the same as UCSC rn4.

 

Stats

Jobs processed: 518

Jobs processing: 0

Jobs in queue: 0

To help users estimate the execution time of their jobs, we provide separate execution times for alignment, tabular results generation, graphical results generation and flanking sequence extraction on the human genome. If users select mouse or rat, the execution time will be much shorter because those genomes are smaller. Click to download the table of execution times.