"Protein functional classification using text data-mining" is a BBSRC project (2002-2005) carried out jointly by the School of Informatics and Faculty of Life Sciences, University of Manchester, UK.
[1] Overview
Automatic information extraction and analysis of biological text is becoming increasingly feasible, because of the growth of available information from on-line journals, bibliographic databases, web pages and other sources. The volume of written material pertaining to the function of proteins/genes is increasing rapidly, and - some might argue - the quality of the information is decreasing.
Historically, classification has been at the heart of biology, and with the increasing volume of data, automatic methods for classifying biological entities are highly desirable. Placing entities into a classification can be a labour-intensive activity. Invariably, computers assist manual classification; examples include using sequence homology to infer the functional class of a protein, or retrieving relevant documents to aid a human classifier.
ProFClass-TM aims to use automatic text-classification to assist in the assignment of proteins to functional categories. Classifying bodies of text (documents) is an active area of research and has applications in information extraction, information retrieval and information filtering. This project involves the application of techniques from text classification - notably Support Vector Machines (SVMs) - to classify proteins into functional classes based on retrieved text documents in combination with experimental and other data. The aim is to develop tools that can accurately predict/extract information on protein function such as sub-cellular location, enzymatic mechanism, and physiological role from combinations of relevant text, sequence, and experimental data.
Textual information on protein function is assembled from a variety of sources and placed in a database. Using the vector space model of information retrieval, we apply support vector machines and other methods to classify the proteins into functional categories, training on the MIPS classification, Gene Ontology, and Enzyme Registry. The aim is to generate a tool that allows a user to submit a body of text relevant to a protein and retrieve probable functional classes for that protein.
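The approach described above can be illustrated with a minimal sketch. It is not the project's implementation: the training texts and the class labels ("transport", "metabolism", "transcription") are invented for illustration and are not taken from MIPS, Gene Ontology, or any enzyme classification. The function names are hypothetical, and the trainer is a simple Pegasos-style subgradient method standing in for a full SVM package. Documents are turned into TF-IDF vectors (the vector space model), one linear SVM is trained per class (one-vs-rest), and a submitted text is ranked against all classes.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lower-case the text and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def fit_idf(docs):
    # Smoothed inverse document frequency for every term in the training set.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(text, idf):
    # Sparse, length-normalised TF-IDF vector; unseen terms are ignored.
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def dot(w, x):
    return sum(w.get(t, 0.0) * v for t, v in x.items())

def train_linear_svm(vectors, labels, epochs=50, lam=0.01):
    # Pegasos-style subgradient descent on the regularised hinge loss.
    # Labels must be +1 or -1; no bias term is used in this sketch.
    w = {}
    step = 0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            step += 1
            eta = 1.0 / (lam * step)
            margin = y * dot(w, x)
            for t in w:
                w[t] *= 1.0 - eta * lam      # shrink (regularisation)
            if margin < 1.0:                 # hinge loss is active
                for t, v in x.items():
                    w[t] = w.get(t, 0.0) + eta * y * v
    return w

def train_one_vs_rest(vectors, classes):
    # One binary classifier per functional class.
    return {
        c: train_linear_svm(vectors, [1 if k == c else -1 for k in classes])
        for c in sorted(set(classes))
    }

def rank_classes(text, idf, models):
    # Returns (class, margin) pairs, best first. Margins are not probabilities.
    x = vectorize(text, idf)
    scores = {c: dot(w, x) for c, w in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy training data (hypothetical texts and labels).
train_docs = [
    ("membrane transporter moves ions across the membrane", "transport"),
    ("channel protein mediates ion transport across membrane", "transport"),
    ("enzyme catalyses glucose oxidation in glycolysis", "metabolism"),
    ("dehydrogenase enzyme involved in glucose metabolism", "metabolism"),
    ("transcription factor binds promoter dna", "transcription"),
    ("dna binding protein regulates transcription", "transcription"),
]
texts = [t for t, _ in train_docs]
classes = [c for _, c in train_docs]

idf = fit_idf(texts)
models = train_one_vs_rest([vectorize(t, idf) for t in texts], classes)

ranked = rank_classes("putative ion channel of the plasma membrane", idf, models)
for label, score in ranked:
    print(label, round(score, 3))
```

The ranked list plays the role of the "probable functional classes" returned to the user. In a realistic setting the scores would need calibration before being read as probabilities, and text features would be combined with sequence and experimental data as the project describes.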
[2] Objectives
[3] People
[4] Related publications
[5] BioCreative 2004
Using the technologies devised within the project, we have taken part in the BioCreative challenge. More information at:
Last modified: June 1, 2005