Links - Software
- Genia Part-of-Speech Tagger for biomedical text mining
- Enju, a probabilistic HPSG parser
- TIMS, Tag Information Management System [pdf]
Text mining tools used in NaCTeM
TerMine automatically extracts technical terms. It is based on C-value, a hybrid, domain-independent method for automatic term recognition. C-value combines linguistic and statistical information, with the emphasis placed on the statistical part. The linguistic analysis enumerates all candidate terms in a given text using linguistic filters. C-value takes as input text annotated with part-of-speech tags; for biomedical text processing we use the Genia part-of-speech tagger. The statistical analysis assigns a termhood score to each candidate term using the following four characteristics:
- the occurrence frequency of the candidate term
- the frequency of the candidate term as part of other longer candidate terms
- the number of these longer candidate terms
- the length of the candidate term
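The four characteristics above can be combined roughly as in the following minimal sketch. It assumes the commonly published C-value formulation (log2 of the term length times the frequency, reduced by the mean frequency of the longer candidates that contain the term). It is a simplification for illustration, not TerMine's actual implementation; the names `contains`, `c_value` and the sample frequencies are hypothetical.

```python
import math
from collections import Counter


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Map each candidate (a tuple of words) to a simplified C-value score.

    `freq` maps candidate terms to their total occurrence frequency.
    Assumption: raw frequencies of the longer candidates are used as-is.
    """
    scores = {}
    for cand, f in freq.items():
        # Longer candidate terms that contain this candidate as a sub-term.
        longer = [other for other in freq if contains(other, cand)]
        if longer:
            # Subtract the mean frequency of the longer candidates:
            # their summed frequency divided by their number.
            adjusted = f - sum(freq[o] for o in longer) / len(longer)
        else:
            adjusted = f
        # Weight by term length; single-word candidates score 0 here.
        scores[cand] = math.log2(len(cand)) * adjusted
    return scores


# Made-up frequencies, for illustration only.
freq = Counter({
    ("basal", "cell", "carcinoma"): 5,
    ("cell", "carcinoma"): 8,
    ("adenoid", "cystic", "basal", "cell", "carcinoma"): 2,
})
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

In this sketch a candidate that mostly appears inside longer candidates is penalised, while a long candidate that is never nested keeps its full frequency, scaled by its length.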
The current implementation is optimized for scalability and processing speed: given a set of 1.3 million MEDLINE abstracts (2 GB of text), the standalone version of TerMine extracts 9.8 million term candidates and their termhood scores in about ten minutes.
Medie is an intelligent search engine that retrieves biomedical events from MEDLINE. It builds on the analysis produced by Enju, which performs deep parsing of biomedical text.
InfoPubMed is based on Medie. It helps users find relevant information about genes, proteins and their interactions.