Bio-text mining project: mining term associations from literature

Bio-MITA: Mining Term Associations from Literature to Support Knowledge Discovery in Biology

Project description

Bio-MITA was a BBSRC-funded project (2006-2009) whose purpose was to support biological knowledge discovery through text-based term association mining, in order to facilitate access to information and increase research productivity. We investigated combining various text mining approaches to establish associations among biological terms. More specifically, the objectives of the Bio-MITA project were:

To implement text-based methods for determining term similarity from large document collections, thus permitting the mining of associations between various biological entities such as proteins, genes, species, cells, and experiments.
To investigate, implement and evaluate a novel "term kernel" method for biological text mining; the mining technologies will include, but not be limited to, term clustering, classification, principal component analysis, and regression.

The aim was to design text mining services that can accelerate knowledge acquisition, offer plausible hypotheses for testing, prevent unnecessary repetition of previous work, and help in experimental design. Our further aims were therefore:

To identify, implement and evaluate suitable kernel-based technologies for solving user-elicited biological text mining scenarios.
To make the tools available to the wider research community via national text mining services and through an efficient and robust implementation (including GRID technologies).

The project was carried out in the School of Computer Science (SCS). The School has developed wide-ranging expertise in natural language processing (NLP) and text mining, including collection and standardisation of language resources, automatic terminology extraction, distributed information retrieval and document classification, ontology-driven information extraction, knowledge representation techniques, machine translation, and dialogue-based systems.

Duration and funding of the project

January 2006 – June 2009, £193,000

Staff involved

Principal Investigator: Dr Goran Nenadic
Co-Investigators: Prof John Keane (2006-2009), Dr Benjamin Stapley (2006)
Research Associates: Dr Hui Yang (2006-2008), Dr Ann Gledson (2008), James Eales (2008-2009)
PhD student: Hammad Afzal

Repository of Text Mining Tools

Publications

Yang, H., Keane, J., Bergman, C., Nenadic, G.: Assigning Roles to Protein Mentions: the Case of Transcription Factors, Journal of Biomedical Informatics, Vol. 42(5), pp. 887-894
Yang, H., Nenadic, G., Keane, J.: Identification of transcription factor contexts in literature using machine learning approaches, BMC Bioinformatics, 9(Suppl 3):S11
Yang, H., Nenadic, G., Keane, J.: A cascaded approach to normalising gene mentions in biomedical literature, Bioinformation 2(5), 197-206
Nenadic, G., Ananiadou, S.: Mining Semantically Related Terms from Biomedical Literature, ACM Transactions on ALIP (Special Issue on “Text Mining and Management in Biomedicine”), Vol. 5(1), pp. 22-43
Yang, H., Spasic, I., Keane, J., Nenadic, G.: A Text Mining Approach to the Prediction of a Disease Status from Clinical Discharge Summaries, J. of American Medical Informatics Association, 16(4):596-600
Sarafraz, F., Eales, J., Mohammadi, R., Dickerson, J., Robertson, D., Nenadic, G.: Biomedical Event Detection using Rules, Conditional Random Fields and Parse Tree Distances, Proceedings of the BioNLP shared task 2009 (in press)
Rebholz-Schuhmann, D., Nenadic, G.: Towards Standardisation of Named Entity Annotations in the Life Science Literature, in Proc. of Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), pp. 167-169
Rebholz-Schuhmann, D., Kirsch, H., Nenadic, G.: IeXML: towards an annotation framework for biomedical semantic types enabling interoperability of text processing modules, Proceedings of Joint BioLINK and Bio-Ontologies SIG Meeting, ISMB 2006
Rebholz-Schuhmann, D., Kirsch, H., Gaudan, S., Arregui, M., Nenadic, G.: Annotation and Disambiguation of Semantic Types in Biomedical Text: a Cascaded Approach to Named Entity Recognition, Proceedings of NLPXML 2006, EACL 2006
Nenadic, G., Okazaki, N., Ananiadou, S.: Towards a terminological resource for biomedical text mining, Proceedings of LREC 2006, ELRA
Afzal, H., Stevens, R., Nenadic, G.: Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature, Proc. of European Semantic Web Conference (ESWC) 2009, LNCS 5554, Springer-Verlag, pp. 535-549, 2009.
Greenwood, M., Nenadic, G.: Lexical Profiling of Existing Web Directories to Support Fine-grained Topic-Focused Web Crawling, Proc. of Corpus Profiling for IR and NLP Workshop, London, pp. 42-49
Afzal, H., Stevens, R., Nenadic, G.: Towards Semantic Network of Bioinformatics Resources, ISMB/ECCB 2009 (poster)
Ananiadou, S., Nenadic, G.: Automatic Terminology Management in Biomedicine, Chapter in "Text Mining for Biology and Biomedicine", S. Ananiadou and J. McNaught (Eds.), Artech House Books, London, pp. 67-98

Contact

Dr Goran Nenadic (School of Computer Science)
e-mail:
phone: +44-(0)161-30-65936