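September 29, 2012 /Bioon/ -- In contrast to protein-coding genes, the biological functions of long non-coding RNAs (lncRNAs) are a hot topic in the post-genomic era. Recently, the Institute of Molecular Medicine at Peking University, working with the College of Life Sciences at Peking University, the Institute of Zoology of the Chinese Academy of Sciences and other institutions, used next-generation sequencing to build RhesusBase, a "one-stop" genome knowledgebase for the rhesus macaque. The team uncovered a new mechanism by which long non-coding RNAs take part in the origin of genes, and proposed for the first time that long non-coding RNAs may be "semi-products" in the process that gives rise to protein-coding genes. The related papers were recently published in Nucleic Acids Research and PLOS Genetics.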
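The Human Genome Project revealed that more than 95% of the genome does not encode proteins; these regions were long regarded as functionless "junk DNA". Recent studies, however, show that some non-coding regions can be transcribed into long non-coding RNAs, and deciphering their biological functions has quickly become a frontier question in the field. Systematically tracing the origin of genes and long non-coding RNAs from the perspective of comparative genomics can offer clues to the puzzle of long non-coding RNAs.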
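The rhesus macaque diverged from humans about 25 million years ago, an evolutionary distance that makes it the best model for studying this question. The research team performed transcriptome sequencing on tissues from across the macaque body, generating a total of 1.2 billion sequencing reads that cover 97% of the whole transcriptome, and refined the fine structure of more than 20,000 rhesus macaque genes at genome-wide scale. Dr. Chuan-Yun Li, the corresponding author of the papers, said: "As we had suspected, by assembling hundreds of millions of rhesus macaque expression fragments and following up with experimental validation, we found that nearly one-third of the gene structure annotations in existing databases contain errors." Using the corrected, fine-scale genome framework, the team integrated gene function information from nearly a hundred data sources to build RhesusBase (http://www.rhesusbase.org), a "one-stop" rhesus macaque genome knowledgebase that brings together gene structure, expression, regulation, genetic variation, disease, function and drug development information and holds 5.6 billion independent annotation entries. The aim is a "one-stop shop" that integrates rhesus macaque research (Nucleic Acids Research, 2012).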
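The improved rhesus macaque genome information offers a unique perspective on the origin and regulation of human genes. Further work captured the transition from long non-coding RNA to protein-coding gene: the study identified, for the first time, 24 protein-coding genes specific to hominoids (including human and chimpanzee), whereas in the genome of the closely related rhesus macaque the vast majority of these genes (83%) exist as long non-coding RNAs. More interestingly, they already have transcript structures and gene expression patterns similar to those of their human homologs. The researchers propose that some long non-coding RNAs are precursors of protein-coding genes, in a transitional stage on the way to becoming protein-coding. In short, non-coding RNAs are a breeding ground for the birth of new genes (PLOS Genetics, 2012).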
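These findings are important for refining the theory of gene origination and for understanding the biological functions of long non-coding RNAs as a whole. (Bioon.com)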
doi: 10.1093/nar/gks835
RhesusBase: a knowledgebase for the monkey research community
Shi-Jian Zhang1, Chu-Jun Liu1, Mingming Shi1, Lei Kong2, Jia-Yu Chen1, Wei-Zhen Zhou2, Xiaotong Zhu1, Peng Yu1, Jue Wang1, Xinzhuang Yang1, Ning Hou1, Zhiqiang Ye3, Rongli Zhang1, Ruiping Xiao1, Xiuqin Zhang1,* and Chuan-Yun Li1,*
Although the rhesus macaque is a unique model for the translational study of human diseases, currently its use in biomedical research is still in its infant stage due to error-prone gene structures and limited annotations. Here, we present RhesusBase for the monkey research community (http://www.rhesusbase.org). We performed strand-specific RNA-Seq studies in 10 macaque tissues and generated 1.2 billion 90-bp paired-end reads, covering >97.4% of the putative exon in macaque transcripts annotated by Ensembl. We found that at least 28.7% of the macaque transcripts were previously mis-annotated, mainly due to incorrect exon–intron boundaries, incomplete untranslated regions (UTRs) and missed exons. Compared with the previous gene models, the revised transcripts show clearer sequence motifs near splicing junctions and the end of UTRs, as well as cleaner patterns of exon–intron distribution for expression tags and cross-species conservation scores. Strikingly, 1292 exon–intron boundary revisions between coding exons corrected the previously mis-annotated open reading frames. The revised gene models were experimentally verified in randomly selected cases. We further integrated functional genomics annotations from >60 categories of public and in-house resources and developed an online accessible database. User-friendly interfaces were developed to update, retrieve, visualize and download the RhesusBase meta-data, providing a ‘one-stop’ resource for the monkey research community.
doi:10.1371/journal.pgen.1002942
Hominoid-Specific De Novo Protein-Coding Genes Originating from Long Non-Coding RNAs
Chen Xie1#, Yong E. Zhang2#, Jia-Yu Chen3#, Chu-Jun Liu3, Wei-Zhen Zhou1, Ying Li3, Mao Zhang3, Rongli Zhang3, Liping Wei1*, Chuan-Yun Li3¤*
Tinkering with pre-existing genes has long been known as a major way to create new genes. Recently, however, motherless protein-coding genes have been found to have emerged de novo from ancestral non-coding DNAs. How these genes originated is not well addressed to date. Here we identified 24 hominoid-specific de novo protein-coding genes with precise origination timing in vertebrate phylogeny. Strand-specific RNA–Seq analyses were performed in five rhesus macaque tissues (liver, prefrontal cortex, skeletal muscle, adipose, and testis), which were then integrated with public transcriptome data from human, chimpanzee, and rhesus macaque. On the basis of comparing the RNA expression profiles in the three species, we found that most of the hominoid-specific de novo protein-coding genes encoded polyadenylated non-coding RNAs in rhesus macaque or chimpanzee with a similar transcript structure and correlated tissue expression profile. According to the rule of parsimony, the majority of these hominoid-specific de novo protein-coding genes appear to have acquired a regulated transcript structure and expression profile before acquiring coding potential. Interestingly, although the expression profile was largely correlated, the coding genes in human often showed higher transcriptional abundance than their non-coding counterparts in rhesus macaque. The major findings we report in this manuscript are robust and insensitive to the parameters used in the identification and analysis of de novo genes. Our results suggest that at least a portion of long non-coding RNAs, especially those with active and regulated transcription, may serve as a birth pool for protein-coding genes, which are then further optimized at the transcriptional level.