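Second-generation sequencing, also known as deep sequencing, is called RNA-seq (RNA sequencing) when applied to RNA, and it has become an important tool for gene expression and transcriptome analysis. Second-generation transcriptome sequencing data contain large numbers of non-coding RNA (ncRNA) sequences, which do not encode proteins. Because they are hard to identify yet functionally important, much like dark matter in the universe, they are also called "genomic dark matter." The data volumes are huge, sequence conservation is poor, and noise interferes, so identifying this "dark matter" has become a bottleneck in epigenetics and regulatory network research. piRNAs are the most numerous class of ncRNA. They control transposon expression mainly through sequence complementarity with transposons, and thereby regulate reproduction and development. Because piRNAs from different species share very little homology, no effective identification method has so far been available internationally.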
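Yi Zhang and colleagues in Le Kang's research group at the Institute of Zoology, Chinese Academy of Sciences, recently published a research paper titled "A k-mer scheme to predict piRNAs and characterize locust piRNAs." The paper solves the difficult problem of predicting, with high precision, piRNAs, the most numerous class of non-coding RNA in organisms. It appeared in the authoritative bioinformatics journal Bioinformatics (IF = 4.926).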
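The paper proposes an algorithm that predicts piRNAs with a Fisher discriminant based on the frequencies of k-mer strings. Its precision exceeds 90%, higher than the 61% precision obtained by B. Doron of Harvard University. Using this method, the authors identified more than 80,000 piRNAs in the migratory locust and predict that the locust may have about 130,000 piRNAs in total. Further analysis showed that these piRNAs differ greatly between gregarious and solitary locusts, which may provide an important clue for explaining the difference in fecundity between the two phases.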
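The idea of scoring sequences with a Fisher discriminant over k-mer frequencies can be illustrated with a minimal sketch. It is not the authors' implementation: the article does not give the k range, the training data, or the decision threshold, so the choice of k = 1 and 2, the ridge term added for numerical stability, the midpoint threshold, and the synthetic toy sequences (not real piRNAs) are all assumptions made here for illustration.

```python
from itertools import product

BASES = "ACGU"

def kmer_features(seq, ks=(1, 2)):
    """Relative frequency of every possible k-mer, for each k in ks."""
    seq = seq.upper().replace("T", "U")
    feats = []
    for k in ks:
        kmers = ["".join(p) for p in product(BASES, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        windows = max(len(seq) - k + 1, 1)
        for i in range(len(seq) - k + 1):
            sub = seq[i:i + k]
            if sub in counts:  # skip windows containing N or other symbols
                counts[sub] += 1
        feats.extend(counts[m] / windows for m in kmers)
    return feats

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, mu):
    """Within-class scatter matrix: sum of (x - mu)(x - mu)^T."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_fisher(pos_seqs, neg_seqs, ks=(1, 2), ridge=1e-3):
    """Return (w, threshold) with w = Sw^-1 (mu_pos - mu_neg)."""
    pos = [kmer_features(s, ks) for s in pos_seqs]
    neg = [kmer_features(s, ks) for s in neg_seqs]
    mu_p, mu_n = mean_vector(pos), mean_vector(neg)
    sp, sn = scatter(pos, mu_p), scatter(neg, mu_n)
    d = len(mu_p)
    # Assumption: a small ridge keeps the scatter matrix invertible.
    sw = [[sp[i][j] + sn[i][j] + (ridge if i == j else 0.0)
           for j in range(d)] for i in range(d)]
    w = solve(sw, [p - n for p, n in zip(mu_p, mu_n)])
    # Assumption: threshold halfway between the projected class means.
    threshold = (dot(w, mu_p) + dot(w, mu_n)) / 2
    return w, threshold

def predict(seq, w, threshold, ks=(1, 2)):
    """True if the sequence falls on the positive side of the discriminant."""
    return dot(w, kmer_features(seq, ks)) > threshold

# Synthetic toy data, chosen only to be trivially separable.
POS = ["UUA" * 7, "UAU" * 7, "AUU" * 7]
NEG = ["GGC" * 7, "GCG" * 7, "CGG" * 7]
W, THRESHOLD = fit_fisher(POS, NEG)
```

With these toy sets, `predict("AUUAUAUUUAUAUUAUUUAUA", W, THRESHOLD)` is true and `predict("GCGGCCGGGCGCGGCGGCCG", W, THRESHOLD)` is false. Real training sets would be far larger and much harder to separate, and a larger k makes the feature vector grow as 4^k.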
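This new method, which identifies piRNAs of non-model organisms without relying on genome data, has important theoretical significance and broad application value. The online software piRNApredictor (http://59.79.168.90/piRNA/index.php) has already been used by research institutions abroad to study pig piRNAs.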
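The breakthrough in piRNA prediction offers an important lesson for predicting other ncRNAs: non-conserved ncRNAs can be predicted. Because the underlying theory of the algorithm is general, the method can predict piRNAs of other species and, with a different training set, other classes of ncRNA as well. In addition, the high-precision piRNA predictions produced by the online software have theoretical significance and practical value for further research on epigenetics, regulatory networks, and piRNA function. (Bioon.com)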
Original paper recommended by Bioon:
Bioinformatics (2011) 27 (6): 771-776. doi: 10.1093/bioinformatics/btr016
A k-mer scheme to predict piRNAs and characterize locust piRNAs
Yi Zhang, Xianhui Wang and Le Kang
Motivation: Identifying piwi-interacting RNAs (piRNAs) of non-model organisms is a difficult and unsolved problem because piRNAs lack conservative secondary structure motifs and sequence homology in different species.
Results: In this article, a k-mer scheme is proposed to identify piRNA sequences, relying on the training sets from non-piRNA and piRNA sequences of five model species sequenced: rat, mouse, human, fruit fly and nematode. Compared with the existing ‘static’ scheme based on the position-specific base usage, our novel ‘dynamic’ algorithm performs much better with a precision of over 90% and a sensitivity of over 60%, and the precision is verified by 5-fold cross-validation in these species. To test its validity, we use the algorithm to identify piRNAs of the migratory locust based on 603 607 deep-sequenced small RNA sequences. Totally, 87 536 piRNAs of the locust are predicted, and 4426 of them matched with existing locust transposons. The transcriptional difference between solitary and gregarious locusts was described. We also revisit the position-specific base usage of piRNAs and find the conservation in the end of piRNAs. Therefore, the method we developed can be used to identify piRNAs of non-model organisms without complete genome sequences.
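The precision and sensitivity figures quoted in the abstract, and the 5-fold cross-validation used to check them, follow standard definitions. As a minimal sketch, with hypothetical function names and made-up labels rather than data from the paper, they might be computed like this:

```python
def precision_sensitivity(y_true, y_pred):
    """Precision = TP / (TP + FP); sensitivity (recall) = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity

def k_fold_indices(n, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n items."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Made-up labels: 1 = piRNA, 0 = non-piRNA.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
precision, sensitivity = precision_sensitivity(y_true, y_pred)
folds = list(k_fold_indices(len(y_true), k=5))
```

In this example three of the four predicted piRNAs are correct (precision 0.75) and three of the five true piRNAs are recovered (sensitivity 0.6). The folds here are contiguous for simplicity; in practice the data would normally be shuffled before splitting.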