P. sativum CSFL RefTrans V1

Overview
Analysis Name: P. sativum CSFL RefTrans V1
Method: reftrans (1.0)
Source: p.sativum_csfl_reftransV1
Date performed: 2016-04-13

Materials & Methods

CSFL Pea RefTrans combines published RNA-Seq and EST data sets into a reference transcriptome (RefTrans) for pea and assigns putative gene functions by homology to known proteins.

In P.sativum_RefTrans_V1, 2.9 billion RNA-Seq reads from publicly available, peer-reviewed pea RNA-Seq data sets (Franssen et al. 2011, Duarte et al. 2014, Sindhu et al. 2014, Trujillo et al. 2014, Sudheesh et al. 2015, Yendrek et al. 2015) and 18,576 ESTs were downloaded from the NCBI Sequence Read Archive (SRP006313, SRP027017, SRP042372, SRP045233, SRP056009, SRP056105, SRP009826) and the NCBI dbEST database, respectively. The RNA-Seq data sets include 258 million single-end reads and 2.6 billion paired-end reads generated on 454 and Illumina platforms.

The RNA-Seq reads were subjected to quality control using the NGS QC Toolkit (v2.3.3, default parameters; Patel and Jain, 2012), Trimmomatic (v0.32, default parameters; Bolger et al., 2014) and custom Perl scripts. The 157 million RNA-Seq reads that remained after quality control were assembled de novo with Trinity (v2.0.6; Grabherr et al., 2011) using default assembly parameters and a minimum coding length of 200 bases.

Quality control of the ESTs included vector sequence screening against UniVec_Core (ftp://ftp.ncbi.nih.gov/pub/UniVec/) using cross_match (Gordon et al., 1998), removal of tRNA/rRNA/snRNA sequences identified using tblastx (Altschul et al., 1990), and poly-A tail trimming. The filtered ESTs were assembled using the CAP3 program (-p 90; Huang and Madan, 1999).

Bowtie (v2-2.2.3; Langmead et al., 2009) was used to multi-map the RNA-Seq reads and ESTs back to the assembled contigs and singlets. The contigs and singlets were clustered into genes using CD-HIT (v4.6.4; Fu et al., 2012) and Corset (v1.0.4; Davidson and Oshlack, 2014) with default parameters. The longest isoform in each Corset cluster was selected as the cluster representative if it was longer than 500 nt, creating a RefTrans V1 for pea of 45,727 sequences.

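
The final selection step can be illustrated with a short Python sketch. It picks the longest isoform in each cluster and keeps it only if it is longer than 500 nt. The function name `pick_representatives` and the two input dictionaries (transcript ID to sequence, transcript ID to cluster ID) are assumptions made for this example and are not taken from the original pipeline.

```python
def pick_representatives(seqs, clusters, min_len=500):
    """Return {cluster_id: transcript_id} for the longest isoform per cluster.

    seqs     -- hypothetical dict mapping transcript ID to nucleotide sequence
    clusters -- hypothetical dict mapping transcript ID to cluster ID
    min_len  -- a representative is kept only if it is longer than this (nt)
    """
    best = {}
    for tid, cid in clusters.items():
        seq = seqs.get(tid)
        if seq is None:
            continue  # transcript has a cluster assignment but no sequence
        # Keep the first transcript seen when two isoforms tie on length.
        if cid not in best or len(seq) > len(seqs[best[cid]]):
            best[cid] = tid
    # Drop clusters whose longest isoform does not exceed the threshold.
    return {cid: tid for cid, tid in best.items() if len(seqs[tid]) > min_len}
```

Called with the transcript sequences and a transcript-to-cluster mapping, the function returns one representative transcript ID per retained cluster.
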
The RefTrans sequences were functionally characterized by pairwise comparison using the BLASTX algorithm against the Swiss-Prot (UniProtKB/Swiss-Prot Release 2015_10) and TrEMBL (UniProtKB/TrEMBL Release 2015_10) (Boeckmann et al., 2003) protein databases. Information on the top 25 matches with an expect (E) value of ≤ 1E-06 was recorded and stored in the database. The transcriptome and annotation (GO terms, match descriptions, InterPro domains) are available for searching and downloading.
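
The hit-recording rule (top 25 matches with E ≤ 1E-06) can be sketched in Python as follows. The sketch assumes the BLASTX results are in the standard 12-column tab-delimited BLAST+ format (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12); the output format actually used is not stated here, and the function name `top_hits` is hypothetical.

```python
from collections import defaultdict


def top_hits(lines, max_evalue=1e-6, max_hits=25):
    """Return {query_id: [subject_id, ...]} with the best hits per query.

    lines -- iterable of tab-delimited BLAST result lines (assumed 12 columns)
    Hits are ranked by ascending E-value, then by descending bit score.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue <= max_evalue:
            hits[fields[0]].append((evalue, -bitscore, fields[1]))
    return {
        query: [subject for _, _, subject in sorted(found)[:max_hits]]
        for query, found in hits.items()
    }
```

Queries with no match at or below the E-value cutoff are absent from the returned dictionary.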

 

References:

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol. 215(3):403-10.

  2. Bolger, A. M., Lohse, M., & Usadel, B. (2014) Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170

  3. Boeckmann B., Bairoch A., Apweiler R., Blatter M.-C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., Pilbout S., and Schneider M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31:365-370.

  4. Davidson, N. M. and Oshlack, A. (2014). Corset: enabling differential gene expression analysis for de novo assembled transcriptomes. Genome Biol. 15(7):410.

  5. Gordon D, Abajian C, Green P. (1998) Consed: a graphical tool for sequence finishing. Genome Res. 8(3):195-202.

  6. Grabherr MG, Haas BJ, Yassour M, et al. (2011) Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29(7):644-652.

  7. Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Res. 9:868-877.

  8. Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10:R25. doi:10.1186/gb-2009-10-3-r25

  9. Fu, L., Niu, B., Zhu, Z., Wu, S. and Li, W. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23):3150-3152. doi:10.1093/bioinformatics/bts565

  10. Patel RK, Jain M (2012). NGS QC Toolkit: A toolkit for quality control of next generation sequencing data. PLoS ONE, 7(2): e30619
Properties
Additional information about this analysis:
Property Name    Value
Analysis Type    reftrans