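The Protein Information Resource (PIR) is an integrated public resource of protein function and annotation data that supports genomic and proteomic research. PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), a major annotated protein database containing about 250 000 proteins. To improve the agreement between protein annotation and experimental data, PIR has developed a system that allows researchers to submit, categorize and retrieve literature information. PIR provides classification of proteins at the superfamily, domain and motif levels, together with structural and functional information on proteins and cross-references to more than 40 other databases. PIR also provides a non-redundant protein database of about 800 000 sequences collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, giving each sequence composite names and associated literature. To improve database interoperability, PIR adopts an open database framework and distributes data in XML. The PIR web site (http://pir.georgetown.edu/) also offers standard bioinformatics tools for data mining.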
INTRODUCTION
The Protein Information Resource (PIR) has been providing the scientific community with annotated protein databases and analysis tools for over three decades. To better support research in functional genomics and proteomics and facilitate knowledge discovery, we have made several new advances in the last year, in addition to further enhancing the PIR-International Protein Sequence Database. Some key developments include: launch of a new submission mechanism for literature data, distribution of a new non-redundant reference protein database, enhancement of the integrated classification database, and redesign of the web site for easy navigation, information retrieval and sequence analysis.
PIR-INTERNATIONAL PROTEIN SEQUENCE DATABASE
The PIR, along with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), continues to enhance and distribute the PIR-International Protein Sequence Database (PSD), a non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain. It contains about 250 000 protein sequences with comprehensive coverage across the entire taxonomic range, including sequences from all the publicly available complete genomes.
Superfamily classification
A unique characteristic of the PIR-PSD is the superfamily/family classification (1) that provides complete and non-overlapping clustering of proteins based on global (end-to-end) sequence similarity. Sequences in the same superfamily share a common domain architecture (i.e. have the same number, order and types of domains) and do not differ excessively in overall length unless they are fragments or result from alternative splicing or alternative initiators. The automated classification system places new members into existing superfamilies and defines new superfamily clusters using parameters including the percentage of sequence identity, overlap length ratio, distance to neighboring superfamily clusters, and overall domain arrangement. Currently, >99% of sequences are classified into families of closely related sequences (at least 45% identical), and more than two-thirds of sequences are classified into more than 33 000 superfamilies. The automated classification is being augmented by manual curation of superfamilies, starting with those containing at least one definable domain, to provide superfamily names, brief descriptions, bibliography, lists of representative and seed members, as well as the domain and motif architecture characteristic of the superfamily.
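To make the role of these parameters concrete, the following is a minimal sketch of how percentage identity and overlap length ratio might be combined when placing a new sequence into an existing family. It is not the PIR classification system: only the 45% identity threshold comes from the text, while the `Hit` record, the `assign_family` function and the 0.80 overlap cutoff are assumptions made for illustration.

```python
from dataclasses import dataclass

FAMILY_MIN_IDENTITY = 0.45   # from the text: family members are at least 45% identical
MIN_OVERLAP_RATIO = 0.80     # hypothetical cutoff, not stated in the text


@dataclass
class Hit:
    """A pairwise alignment between a query and an already classified sequence."""
    family_id: str
    identical_residues: int   # identical positions in the alignment
    aligned_length: int       # number of aligned positions
    query_length: int
    subject_length: int

    @property
    def identity(self) -> float:
        return self.identical_residues / self.aligned_length

    @property
    def overlap_ratio(self) -> float:
        # Fraction of the longer sequence covered by the alignment, used here
        # as a simple proxy for global (end-to-end) similarity.
        return self.aligned_length / max(self.query_length, self.subject_length)


def assign_family(hits):
    """Return the family of the best qualifying hit, or None to start a new cluster."""
    qualifying = [
        h for h in hits
        if h.identity >= FAMILY_MIN_IDENTITY and h.overlap_ratio >= MIN_OVERLAP_RATIO
    ]
    if not qualifying:
        return None
    return max(qualifying, key=lambda h: h.identity).family_id
```

A hit that is highly identical over only a short region fails the overlap test, which is how a domain-level match is kept from being mistaken for whole-protein family membership in this sketch.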
Bibliography submission and literature mapping
Linking protein data to the literature that describes or characterizes the proteins is crucial for increasing the amount of experimentally verified data and improving the quality of protein annotation. Attribution of protein annotations to validated experimental sources provides an effective means of avoiding the propagation of errors that may have resulted from large-scale genome annotation. We have developed a bibliography submission system for the scientific community to submit, categorize and retrieve literature information for PSD protein entries. The submission interface guides users through the steps of mapping the paper citation to given protein entries, entering the literature data, and summarizing the literature data using categories such as genetics, tissue/cellular localization, molecular complex or interaction, function, regulation and disease. Also included is a literature information page that provides literature data mining and displays both references cited in PIR and references submitted by users.
INTEGRATED PROTEIN CLASSIFICATION DATABASE
The iProClass (integrated Protein Classification) database (2) is designed to provide comprehensive descriptions of all proteins and to serve as a framework for data integration in a distributed networking environment. The database describes family relationships at both global (whole protein) and local (domain, motif, site) levels, as well as structural and functional classifications and features of proteins. The current version (Release 1.0, August 2001) consists of more than 270 000 non-redundant PIR-PSD and SWISS-PROT proteins organized with more than 33 000 PIR superfamilies, 100 000 families, 3400 PIR homology and Pfam domains (3), 1300 ProClass/ProSite motifs (4,5), 280 PIR post-translational modification sites, and links to over 40 databases of protein families, structures, functions, genes, genomes, literature and taxonomy. Protein sequence and superfamily summary reports provide rich annotations such as membership information with length, taxonomy and keyword statistics, extensive cross-references and graphical display of domain and motif regions. Directly linked to the iProClass sequence report are two additional PIR databases, ASDB and RESID (6). PIR-Annotation and Similarity Database (ASDB) lists pre-computed, biweekly updated FASTA neighbors of all PSD sequences with annotation information and graphical displays of sequence similarity matches. PIR-RESID documents over 280 post-translational modifications and links to PSD entries containing either experimentally determined or computationally predicted modifications with evidence tags. Future versions of iProClass and ASDB will be based on the new PIR Non-redundant Reference Protein database (NREF).
PIR-NREF
As a major resource of protein information, one of our primary aims is to provide a timely and comprehensive collection of all protein sequence data that keeps pace with the genome sequencing projects and contains source attribution and minimal redundancy. The PIR-NREF protein database includes sequences from PIR, SWISS-PROT (7), TrEMBL (7), RefSeq (8), GenPept, PDB (9) and other protein databases. The NREF entries, each representing an identical amino acid sequence from the same source organism redundantly presented in one or more underlying protein databases, can serve as the basic unit for protein annotation. The NCBI taxonomy (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/) is used as the ontology for matching source organism names at the species or strain (if known) levels. The NREF report provides source attribution (containing protein IDs, accession numbers and protein names from underlying databases), in addition to taxonomy, amino acid sequence and composite literature data. The composite protein names, including synonyms, alternate names and even misspellings, can be used to assist the ontology development on protein names and the identification of mis-annotated proteins. Related sequences, including identical sequences from different organisms and closely related sequences within the same organism, are also listed. The database presently consists of about 800 000 entries and is updated biweekly.
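The definition of an NREF entry (an identical amino acid sequence from the same source organism, possibly present in several underlying databases) can be sketched as a grouping step. This is an illustrative outline only, not the PIR production pipeline; the record keys (`db`, `accession`, `name`, `taxon_id`, `sequence`) and the function name `build_nref` are hypothetical.

```python
from collections import defaultdict


def build_nref(records):
    """Collapse records that share the same sequence and the same source organism.

    Each record is a dict with the hypothetical keys
    'db', 'accession', 'name', 'taxon_id' and 'sequence'.
    """
    groups = defaultdict(list)
    for rec in records:
        # Identity is judged on the exact residue string within one taxon.
        key = (rec["taxon_id"], rec["sequence"].upper())
        groups[key].append(rec)

    entries = []
    for (taxon_id, sequence), members in groups.items():
        entries.append({
            "taxon_id": taxon_id,
            "sequence": sequence,
            # Source attribution: database and accession of every copy.
            "sources": sorted((m["db"], m["accession"]) for m in members),
            # Composite names: every distinct name seen, variants included.
            "names": sorted({m["name"] for m in members}),
        })
    return entries
```

Keeping every name variant, rather than choosing one, mirrors the point made above that composite names (including synonyms and misspellings) are themselves useful for spotting mis-annotated proteins.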
AVAILABILITY
PIR web site
The PIR web site (http://pir.georgetown.edu) (10) connects data mining and sequence analysis tools to underlying databases for exploration of protein information and discovery of new knowledge. The site has been redesigned to include a user-friendly navigation system and more graphical interfaces and analysis tools. The PIR-PSD and iProClass pages represent primary entry points in the PIR web site. A list of the major PIR pages is shown in Table 1.
The PIR-PSD interface provides entry retrieval, batch retrieval, basic or advanced text searches, and various sequence searches. The iProClass interface also includes both sequence and text searches. The BLAST search (11) returns best-matched proteins and superfamilies, while peptide match allows protein identification based on peptide sequences. Text search involves direct search of the underlying Oracle tables using unique identifiers or combinations of text strings. The NREF database is searchable by BLAST search, peptide match and direct report retrieval based on the NREF ID or the entry identifiers of the source databases. Other sequence searches supported on the PIR web site include FASTA (12), pattern matching, hidden Markov model (HMM) (13) domain and motif search, Smith–Waterman (14) pair-wise alignment, CLUSTALW (15) multiple alignment and GeneFIND (16) family identification.
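The idea behind peptide match, identifying proteins from a short peptide sequence, can be illustrated with an exact substring search. This is a minimal sketch under the assumption that matching is exact and case-insensitive; the function name `peptide_match` and the in-memory dictionary are hypothetical and say nothing about how the PIR server implements or indexes the search.

```python
def peptide_match(peptide, proteins):
    """Return (protein_id, start_positions) for every protein containing the peptide.

    `proteins` maps a protein identifier to its amino acid sequence.
    Positions are 1-based, as is conventional for protein coordinates.
    """
    peptide = peptide.upper()
    if not peptide:
        return []
    matches = []
    for protein_id, sequence in proteins.items():
        sequence = sequence.upper()
        positions = []
        start = sequence.find(peptide)
        while start != -1:
            positions.append(start + 1)
            # Advance by one residue so overlapping occurrences are reported too.
            start = sequence.find(peptide, start + 1)
        if positions:
            matches.append((protein_id, positions))
    return matches
```

A linear scan like this is adequate for a few thousand sequences; a database of the size described here would more plausibly rely on a precomputed index, which is outside the scope of this sketch.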
PIR FTP site
The PIR anonymous FTP site (ftp://nbrfa.georgetown.edu/pir_databases) provides direct file transfer. Files distributed include the PIR-PSD (quarterly release and interim updates), PIR-NREF, other auxiliary databases, other documents, files and software programs. The PIR-PSD is distributed as flat files in NBRF and CODATA formats, with corresponding sequences in FASTA format. Both PIR-PSD and PIR-NREF are also distributed in XML format with the associated document type definition (DTD) file.
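For readers working with the flat files, the sketch below converts NBRF-style sequence entries to FASTA. It assumes the commonly described NBRF layout (a header line such as `>P1;CODE`, a title line, then sequence lines terminated by `*`) and does not handle the fuller annotation records of the distributed files; the function names `parse_nbrf` and `to_fasta` are illustrative.

```python
def parse_nbrf(text):
    """Yield (entry_code, title, sequence) tuples from NBRF-style text."""
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        # Header looks like ">P1;CODE": a type code, a semicolon, the entry code.
        _, _, code = line[1:].partition(";")
        title = next(lines, "").strip()
        residues = []
        for seq_line in lines:
            chunk = "".join(seq_line.split())   # drop spaces inside sequence lines
            if chunk.endswith("*"):             # "*" marks the end of the sequence
                residues.append(chunk[:-1])
                break
            residues.append(chunk)
        yield code.strip(), title, "".join(residues)


def to_fasta(entries, width=60):
    """Format parsed entries as FASTA text with lines of at most `width` residues."""
    out = []
    for code, title, seq in entries:
        out.append(f">{code} {title}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

For example, `to_fasta(parse_nbrf(open(path).read()))` would produce FASTA text for a local file, where `path` is a placeholder for whichever downloaded file is being converted.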
The PIR-PSD, iProClass and PIR-NREF databases have been implemented in the Oracle 8i object-relational database system on our Unix server. To enable open source distribution, the databases are being mapped to MySQL and ported to Linux. To establish reciprocal links to PIR databases, to host a PIR mirror web site or to request the PIR database schema, please contact pirmail@nbrf.georgetown.edu.