Evolve - Disease Ontology Prediction

From Humanitarian FOSS Summer Institute 2009



Introduction: The current project is part of InSTEDD.org's Evolve effort to predict and prevent the spread of diseases. More specifically, the project concentrates on augmenting the knowledge of an article by searching for specific terms in an ontology and adding the related concepts, hyponyms and hypernyms as metadata of the article. That would allow correlating two different articles that refer to two different hyponyms of the same concept. Machine learning will play an important role in the implementation and in the feasibility of the project.

Developers:

  • Nicolae Dragu, Trinity College
  • Fouad El Khoury, University of Hartford

Mentors:

  • Nicolas di Tada (InSTEDD)
  • Prof. Takunari Miyazaki
  • Prof. Ralph Morelli
  • Trishan de Lanerolle


Project Overview

The current project is part of InSTEDD.org's Evolve effort to predict and prevent the spread of diseases. More specifically, the project concentrates on augmenting the knowledge of an article by searching for specific terms in an ontology and adding the related concepts, hyponyms and hypernyms as metadata of the article. That would allow correlating two different articles that refer to two different hyponyms of the same concept. Machine learning will play an important role in the implementation and in the feasibility of the project.
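To make the idea of metadata augmentation concrete, here is a minimal Java sketch. It is not the project's code: the class `OntologyAugmenter`, its methods, and the in-memory maps are hypothetical, and terms are found by plain lowercase substring search, which is a simplification.

```java
import java.util.*;

/** Hypothetical in-memory ontology of "is-a" links between lowercase terms. */
final class OntologyAugmenter {
    private final Map<String, Set<String>> parents = new HashMap<>();
    private final Map<String, Set<String>> children = new HashMap<>();

    /** Records that hyponym is a more specific kind of hypernym. */
    void addIsA(String hyponym, String hypernym) {
        parents.computeIfAbsent(hyponym, k -> new TreeSet<>()).add(hypernym);
        children.computeIfAbsent(hypernym, k -> new TreeSet<>()).add(hyponym);
    }

    /** All ancestors of a term, following hypernym links transitively. */
    Set<String> hypernyms(String term) {
        Set<String> seen = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>(parents.getOrDefault(term, Collections.emptySet()));
        while (!todo.isEmpty()) {
            String t = todo.pop();
            if (seen.add(t)) {
                todo.addAll(parents.getOrDefault(t, Collections.emptySet()));
            }
        }
        return seen;
    }

    /** Direct hyponyms (children) of a term. */
    Set<String> hyponyms(String term) {
        return children.getOrDefault(term, Collections.emptySet());
    }

    /** Metadata for an article: every known term found in the text plus its hypernyms. */
    Set<String> conceptsFor(String articleText) {
        String text = articleText.toLowerCase(Locale.ROOT);
        Set<String> known = new HashSet<>(parents.keySet());
        known.addAll(children.keySet());
        Set<String> concepts = new TreeSet<>();
        for (String term : known) {
            if (text.contains(term)) { // naive substring search (assumption)
                concepts.add(term);
                concepts.addAll(hypernyms(term));
            }
        }
        return concepts;
    }

    /** Concepts that two articles have in common after augmentation. */
    Set<String> sharedConcepts(String articleA, String articleB) {
        Set<String> shared = new TreeSet<>(conceptsFor(articleA));
        shared.retainAll(conceptsFor(articleB));
        return shared;
    }
}
```

With "influenza a" and "influenza b" both registered as hyponyms of "influenza", an article mentioning only the first and an article mentioning only the second end up sharing the concept "influenza", which is the correlation described above.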

About Evolve

Evolve [project code named Riff] enables detection, prediction and response to health-related events (such as disease outbreaks or pandemics) through a collaborative environment that combines data exploration, integration, search and inferencing – providing more complex analysis and deeper insight.

Evolve brings forth the following benefits:

  • Detect emerging critical events sooner and enable your team to take the right action earlier.
  • Allow human experts and autonomous agent-based analytic services to augment one another’s efforts.
  • Pattern detection algorithms learn from past events – and your team’s characterization of them – to improve performance the next time around.
  • Fully extensible open source solution allows you to incorporate your own data sources, services, and embedded modules.

Project Details

The purpose of the project is to create a functional application that predicts disease outbreaks from given articles, so the application relies heavily on a Support Vector Machine (SVM) and on the Disease Ontology.

The application receives articles with known tags and content. An SVM uses these tags to generate a corresponding probability (or point score) for each tag, provided an SVM is indeed capable of producing such scores. At the same time, the tags are used to search the Disease Ontology for possible disease matches. By combining the SVM results with the list of diseases from the ontology database, the application predicts the disease that the articles most likely refer to.

The following image represents the basic workflow of the proposed application:
Image:ProjectFlowchart.jpg

Revised Project Details

Proposed Workflow of the application:

  • The application receives input documents.
  • The documents are parsed and a frequency is assigned to each word. For this step we could use a stemmer to reduce the word count by collapsing words that are derived from the same root.
  • The words from the previous step are matched against diseases, symptoms and syndromes from the ontology. The matches are then stored in a list. The matching algorithm has to be improved because at the moment only direct matches are detected. Using the stemmer in step 2 could improve this.
  • The list of matches is used to create the final output of the application. Points are assigned to each disease according to how many matched symptoms indicate it. At the moment the program does not take into account that some of the words in the input documents are themselves names of diseases or syndromes, and it does not include them when assigning points. This last step also does not use the frequency of the matches to influence the number of points assigned. This should probably be implemented. An example of this step is illustrated below:

Image:project-example.png

Diagram for the workflow:
Image:project-outline.png
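The point-assignment step of the workflow above could be sketched as follows. This is an illustration under assumptions, not the project's implementation: the class `DiseaseScorer`, the symptom-to-disease lookup map, and the `useFrequency` flag are all hypothetical. With the flag off, each matched symptom gives one point to every disease it indicates (the current behaviour described above). With the flag on, the points are weighted by how often the symptom appears, which is the improvement the text suggests.

```java
import java.util.*;

final class DiseaseScorer {
    /**
     * symptomToDiseases: hypothetical lookup from a symptom name to the diseases it indicates.
     * matchedSymptoms: matched symptom -> frequency in the input documents.
     * useFrequency: false gives one point per matched symptom, true weights by frequency.
     */
    static Map<String, Integer> score(Map<String, List<String>> symptomToDiseases,
                                      Map<String, Integer> matchedSymptoms,
                                      boolean useFrequency) {
        Map<String, Integer> points = new TreeMap<>();
        for (Map.Entry<String, Integer> match : matchedSymptoms.entrySet()) {
            int weight = useFrequency ? match.getValue() : 1;
            List<String> diseases =
                    symptomToDiseases.getOrDefault(match.getKey(), Collections.emptyList());
            for (String disease : diseases) {
                points.merge(disease, weight, Integer::sum); // add to the running total
            }
        }
        return points;
    }
}
```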

List of programs that compose the basic functionality of the application:

  • Parsing an input document, extracting the words and associating a frequency with each word. Words with fewer than 3 letters are omitted.
    • Input: .txt files
    • Output: a text file containing all the words in the document(s) and their frequencies
  • Going through the disease ontology and collecting the English names of all diseases, symptoms and syndromes. These are used for matching against the words that were extracted from the input documents.
    • Input: not required
    • Output: text file containing all the disease, symptom and syndrome names.
  • Matching the list of words from the input documents against the list of terms in the ontology from the previous step.
    • Input: the list of words from input documents and the list of terms from the ontology
    • Output: a list of matches between the two lists of terms
  • For the terms that matched we go from the symptom level to the disease level and assign points to diseases based on how many symptoms indicate a certain disease.
    • Input: the names of the symptoms from the documents that matched against the ontology
    • Output: the number of points assigned to each disease
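As a rough illustration of the first and third programs in this list, the sketch below counts word frequencies and keeps only the words that directly match an ontology term. The class and method names are hypothetical, the text is passed as a string rather than read from .txt files, and only single-word, lowercase ontology terms are handled.

```java
import java.util.*;

final class TermMatcher {
    /** Counts lowercase words; words with fewer than 3 letters are omitted. */
    static Map<String, Integer> wordFrequencies(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.length() >= 3) {
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    /** Keeps only the document words that appear verbatim among the ontology terms. */
    static Map<String, Integer> directMatches(Map<String, Integer> freq,
                                              Set<String> ontologyTerms) {
        Map<String, Integer> matches = new TreeMap<>(freq);
        matches.keySet().retainAll(ontologyTerms); // direct (exact) matches only
        return matches;
    }
}
```

Because the match is exact, "fevers" in a document would not match the term "fever", which is the limitation the stemmer is meant to address.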

Things that we are currently working on

  • Putting together all the programs described above.
  • Testing the workflow of the application from beginning to end after the previous step is done.
  • Adding more functionality to the application:
    • Improving the algorithm that matches the input documents against the ontology (using the Porter stemmer for this step).
    • Possibly adding SVM functionality later on, to test the accuracy of the application's output against the disease predicted by the SVM.
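To show why stemming helps the matching step, here is a deliberately small sketch. It implements only Step 1a of the Porter algorithm (plural removal: SSES to SS, IES to I, SS unchanged, trailing S dropped), not the full stemmer, and the class and method names are hypothetical.

```java
import java.util.*;

final class PluralStemmer {
    /** Porter Step 1a only: SSES -> SS, IES -> I, SS -> SS, S -> (removed). */
    static String stem(String word) {
        if (word.endsWith("sses")) return word.substring(0, word.length() - 2);
        if (word.endsWith("ies")) return word.substring(0, word.length() - 2);
        if (word.endsWith("ss")) return word;
        if (word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    /** Document words whose stem equals the stem of some ontology term. */
    static Set<String> stemMatches(Collection<String> documentWords,
                                   Collection<String> ontologyTerms) {
        Set<String> termStems = new HashSet<>();
        for (String term : ontologyTerms) {
            termStems.add(stem(term));
        }
        Set<String> matches = new TreeSet<>();
        for (String word : documentWords) {
            if (termStems.contains(stem(word))) {
                matches.add(word);
            }
        }
        return matches;
    }
}
```

With this step alone, "fevers" and "coughs" match the ontology terms "fever" and "cough", which a direct comparison would miss. A complete Porter implementation applies several further steps and would catch more variants.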

Responsibilities

The image below shows how responsibilities are shared among the members of the group:
Image:project-responsibilities.png

Supervised Machine Learning

Supervised learning is a machine learning technique for learning a function from labeled training data. Typically, the input is represented by vectors. The output can take the form of classification, which predicts a class label for the input object, or regression, in which the output value of the function is continuous.
One of the most common approaches to supervised machine learning is the Support Vector Machine (SVM), a tool that uses supervised learning to classify data into two or more classes.

Basic Support Vector Machine concepts

There are three basic concepts that stand at the core of SVM:

  • The kernel function:
The picture below illustrates how the original objects (left side of the picture) are mapped, or rearranged, by a set of mathematical functions known as kernels. This process of rearranging the objects is known as mapping (transformation). After it, the mapped objects (right side of the picture) are linearly separable, so instead of constructing a complex curve (left side of the picture), all we have to do is find an optimal line that separates the GREEN and the RED objects.

Image:SVMexample.jpg

  • The separating hyperplane:
Given any set of data, the SVM classifier has to find the hyperplane that maximizes the separation between the two classes of objects. With such a hyperplane, the SVM is more likely to make correct predictions when given new data to classify.
  • The soft margin:
A modification of the maximum-margin idea that allows for mislabeled examples.
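The three concepts can be written down in a few lines of Java. This is a sketch for illustration only: it shows the standard formulas, not a trained classifier, and the class `SvmConcepts` and its method names are hypothetical. The Gaussian (RBF) kernel stands in as one common choice of kernel function.

```java
final class SvmConcepts {
    /** Gaussian (RBF) kernel: similarity of x and y in an implicit feature space. */
    static double rbfKernel(double[] x, double[] y, double gamma) {
        double squaredDistance = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            squaredDistance += d * d;
        }
        return Math.exp(-gamma * squaredDistance);
    }

    /** Signed score w.x + b; its sign tells which side of the hyperplane x lies on. */
    static double decision(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];
        }
        return score;
    }

    /** Class label +1 or -1 from the side of the separating hyperplane. */
    static int classify(double[] w, double b, double[] x) {
        return decision(w, b, x) >= 0 ? 1 : -1;
    }

    /**
     * Hinge loss max(0, 1 - label * score): zero for points beyond the margin,
     * positive for margin violations. The soft margin tolerates such points
     * at a cost instead of forbidding them.
     */
    static double hingeLoss(int label, double score) {
        return Math.max(0.0, 1.0 - label * score);
    }
}
```

For example, with w = (1, -1) and b = 0.5, the point (2, 1) has score 1.5 and is classified as +1. Its hinge loss is 0 if its true label is +1 and 2.5 if its true label is -1.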

Disease Ontology

Disease Ontology is a controlled medical vocabulary developed at the Bioinformatics Core Facility in collaboration with the NuGene Project at the Center for Genetic Medicine. It was designed to facilitate the mapping of diseases and associated conditions to particular medical codes such as ICD9CM, SNOMED and others. Disease Ontology is implemented as a directed acyclic graph (DAG) and uses the Unified Medical Language System (UMLS) as its immediate source vocabulary to access medical ontologies such as ICD9CM.

Deliverables

The main deliverable of the project is the Java application itself, which has to provide the following functionality:

  • loading training results from a database
  • loading the ontology from a file or database
  • receiving an article on standard input and a document ID as a parameter
  • measuring the term frequencies and using an SVM classifier for each term in the ontology found in the document text (creating the tags)
  • traversing the ontology using the top n SVM result terms and accumulating points along hypernymic relationships (or using a Bayesian network)
  • returning the node with the most points and predicting the disease the document most likely refers to
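The last two items could be sketched as below. Everything here is an assumption made for illustration: the class `HypernymScorer`, the parent map, and in particular the `decay` factor, which the project description does not mention. Without some attenuation, the root of the ontology would always collect the most points, so each step up the hypernym chain passes on only a fraction of the score.

```java
import java.util.*;

final class HypernymScorer {
    /**
     * parents: hypothetical map from a term to its direct hypernyms (a DAG).
     * termScores: top-n terms with their classifier scores.
     * decay: assumed fraction of the score passed on per level, between 0 and 1.
     */
    static Map<String, Double> accumulate(Map<String, List<String>> parents,
                                          Map<String, Double> termScores,
                                          double decay) {
        Map<String, Double> points = new TreeMap<>();
        for (Map.Entry<String, Double> entry : termScores.entrySet()) {
            Set<String> visited = new HashSet<>(); // count each node once per source term
            List<String> level = Collections.singletonList(entry.getKey());
            double share = entry.getValue();
            while (!level.isEmpty()) {
                List<String> next = new ArrayList<>();
                for (String node : level) {
                    if (!visited.add(node)) continue;
                    points.merge(node, share, Double::sum);
                    next.addAll(parents.getOrDefault(node, Collections.emptyList()));
                }
                level = next;
                share *= decay; // weaker contribution one level further up
            }
        }
        return points;
    }

    /** Node with the most points; ties go to the alphabetically first name. */
    static String best(Map<String, Double> points) {
        String best = null;
        for (Map.Entry<String, Double> e : points.entrySet()) {
            if (best == null || e.getValue() > points.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With scores of 0.5 for both "influenza a" and "influenza b" and a decay of 0.75, their common hypernym "influenza" collects 0.75 points and wins over either hyponym, while the more distant "viral infection" collects 0.5625.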

Documentation

Weekly Check Point Presentations

Timeline

Image:Milestones.png

Revised Timeline
