Diego Klabjan

A Semi-Supervised Learning Approach to Enhance Health Care Community-Based Question Answering: A Case Study in Alcoholism

9/26/2017


 

Author

Papis Wongchaisuwat, Ph.D. candidate in Industrial Engineering and Management Sciences, Northwestern University
​[email protected]

While a large number of internet users seek health information online, it is not trivial for them to quickly find accurate answers to specific questions. Community-based Question Answering (CQA) sites such as Yahoo! Answers play an important role in addressing health information needs. On CQA sites, users post a question and expect the online health community to promptly provide desirable answers. Despite a high volume of user participation, a considerable number of questions are left unanswered, while other questions that address the same information need are answered elsewhere. Automatically answering posted questions can provide a useful source of information for online health communities.

We developed a two-phase algorithm to automatically answer health-related questions based on past questions and answers (QA). Our proposed algorithm uses information retrieval techniques to identify candidate answers from resolved QA pairs and then re-ranks these candidates with a semi-supervised learning algorithm that extracts the best answer to a prospective question. We first convert the raw data into a desirable structure, collecting it into a corpus of existing QA pairs. The first phase is implemented as a rule-based system that employs similarity measures, i.e., Dynamic Time Warping (DTW) and a vector-space (VS) approach, to find candidate answers from the corpus of existing QA pairs for any prospective question. In the second phase, we implemented supervised and Expectation Maximization (EM) based semi-supervised learning models that refine the output of the first phase by screening out invalid answers and ranking the remaining valid ones.
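The first phase can be sketched as a simple retrieval step: score a new question against each resolved question and keep the answers of those scoring above a cutoff. The snippet below is a minimal illustration using bag-of-words cosine similarity as a stand-in for the DTW and VS measures in the actual system; the corpus, threshold, and function names are hypothetical.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def candidate_answers(question, qa_corpus, threshold=0.2):
    """Phase 1 sketch: rank resolved QA pairs by question similarity
    and return answers whose question scores above the threshold."""
    q_vec = Counter(question.lower().split())
    scored = []
    for past_q, answer in qa_corpus:
        s = cosine_sim(q_vec, Counter(past_q.lower().split()))
        if s >= threshold:
            scored.append((s, answer))
    return [ans for _, ans in sorted(scored, reverse=True)]

# Hypothetical mini-corpus of resolved QA pairs.
corpus = [
    ("what are symptoms of alcohol withdrawal", "Tremors, anxiety, ..."),
    ("how long does a hangover last", "Usually up to 24 hours."),
]
print(candidate_answers("what are the symptoms of withdrawal from alcohol", corpus))
```

In the full system these candidates would then be re-ranked by the semi-supervised model rather than returned directly.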

We obtained a total of 4,216 alcoholism-related QA threads from Yahoo! Answers as a case study. On our dataset, the semi-supervised learning algorithm has an accuracy of 86.2%. UMLS-based (health-related) features used in the model enhance the algorithm's performance by approximately 8%. An example result returned from the algorithm to determine candidate answers is shown in the figure below.
[Figure: example candidate answers returned by the algorithm]

Predicting ICU Readmission using Grouped Physiological and Medication Trends

9/14/2017


 

Author

Ye Xue, PhD candidate in Electrical Engineering and Computer Science, Northwestern University
​[email protected]

Background
Patients who are readmitted to an intensive care unit (ICU) usually have a high risk of mortality and an increased length of stay. ICU readmission risk prediction may help physicians re-evaluate a patient's physical condition before discharge and avoid preventable readmissions. ICU readmission prediction models are often built on physiological variables. Intuitively, snapshot measurements, especially the last measurements, are effective predictors that are widely used by researchers. However, methods that use only snapshot measurements neglect the predictive information contained in the trends of physiological and medication variables. Mean, maximum, or minimum values take multiple time points into account and capture their summary statistics; however, these statistics cannot capture the detailed picture of temporal trends.
 
In this work, we find strong predictors that capture detailed temporal trends of variables for 30-day readmission risk and build prediction models with high accuracy.
 
Workflow
We convert patients’ time series into graphs, where each node represents a discretized measurement at a single point in time. Among these graphs, we discover the most important subgraphs and identify them as common temporal trends. We study the correlation between the important subgraphs, group them and use the groupings as an augmentation to snapshot features in building predictive models. A workflow is shown below.
[Figure: workflow overview]
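As a rough illustration of turning a time series into trend features, the sketch below discretizes a series into levels and counts transitions between consecutive levels; this is a simplified stand-in for the subgraph mining described above, and the bin boundaries, variable names, and example values are hypothetical.

```python
from collections import Counter

def discretize(series, bins=(60.0, 100.0)):
    """Map raw measurements to discrete levels (low/normal/high)
    using hypothetical cut points."""
    levels = []
    for v in series:
        if v < bins[0]:
            levels.append("low")
        elif v <= bins[1]:
            levels.append("normal")
        else:
            levels.append("high")
    return levels

def trend_counts(series, k=2):
    """Count length-k transitions between discretized states; each
    transition plays the role of an edge in a patient's trend graph."""
    levels = discretize(series)
    return Counter(tuple(levels[i:i + k]) for i in range(len(levels) - k + 1))

hr = [72, 75, 110, 118, 90]  # e.g., a heart-rate series
print(trend_counts(hr))
```

These transition counts could then be grouped and appended to the snapshot features when training a predictive model.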
Study of Imputation on Temporal Data
Along the way, we study the impact of different imputation techniques and develop a tailored methodology, called customized linear interpolation, that outperforms all other state-of-the-art approaches. Multivariate Imputation by Chained Equations (MICE) is a popular imputation method. However, its performance on temporal data is not as strong as that on snapshot data. A comparison between imputed values from MICE and customized linear interpolation is shown below.
[Figure: imputed values from MICE vs. customized linear interpolation]
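The sketch below shows plain linear interpolation over a temporal series with missing values, with boundary gaps filled by the nearest observation. It is a minimal baseline, not the customized variant developed in this work; the function name and example series are hypothetical.

```python
def linear_interpolate(values):
    """Fill missing entries (None) in a temporal series by linear
    interpolation between the nearest observed neighbors; boundary
    gaps carry the nearest observation."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return values[:]
    out = values[:]
    for i in range(len(values)):
        if out[i] is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            out[i] = values[nxt]          # leading gap: carry backward
        elif nxt is None:
            out[i] = values[prev]         # trailing gap: carry forward
        else:
            w = (i - prev) / (nxt - prev)
            out[i] = values[prev] * (1 - w) + values[nxt] * w
    return out

print(linear_interpolate([1.0, None, None, 4.0, None]))
```

A customized variant would adjust this scheme to the sampling patterns of each physiological variable.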
Conclusions
Our model outperforms the baseline model that uses only the snapshot features, suggesting that temporal trends carry predictive information for ICU readmission risk. Additionally, our experiments show that some imputation methods work well for replacing missing values in snapshot measurements but not in temporal data, suggesting that temporal patterns need to be taken into consideration in imputation.

Semantic Document Distance Measures: Word Vector-Based Dynamic Time Warping and Word Vector-Based Tree Edit Distance

9/14/2017


 

Author

Xiaofeng Zhu, PhD candidate in Electrical Engineering and Computer Science, Northwestern University
[email protected]

“I went to the bank” is semantically more similar to “I withdrew some money” than to “I went to the park.” However, most widely used distance measures consider the latter sentence more similar because it shares more common words. In this post, we explain two new algorithms - word vector-based dynamic time warping (wDTW) and word vector-based tree edit distance (wTED) - for measuring semantic document distances based on distributed representations of words. Both algorithms hinge on the distance between two paragraphs and are implemented in Apache Spark. We train word2vec vectors beforehand and represent a document as a list of paragraphs, where a paragraph is a list of word2vec vectors.
Word Vector-based Dynamic Time Warping (wDTW): DTW measures the distance between two sequences. The wDTW algorithm calculates document distances based on DTW by sequentially comparing any two paragraphs of the two documents. The figure below demonstrates how wDTW finds the optimal alignments between two documents.
[Figure: optimal wDTW alignments between two documents]
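The dynamic program underlying wDTW can be sketched as follows, assuming each paragraph has already been summarized by a single vector (e.g., an average of its word2vec vectors). The toy documents and distance choice below are illustrative assumptions, not the exact setup of the Spark implementation.

```python
import math

def vec_dist(p, q):
    """Euclidean distance between two paragraph vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def wdtw(doc1, doc2):
    """DTW over the paragraph sequences of two documents: the classic
    O(n*m) dynamic program with paragraph distance as the local cost."""
    n, m = len(doc1), len(doc2)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = vec_dist(doc1[i - 1], doc2[j - 1])
            # A paragraph may align with a match, an insertion, or a deletion.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Two toy "documents", each a sequence of 2-d paragraph vectors.
a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
b = [(0.0, 0.0), (2.0, 2.0)]
print(wdtw(a, b))
```

The middle paragraph of `a` has no exact counterpart in `b`, so DTW aligns it with its nearest neighbor and charges only that paragraph-level distance.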

Word Vector-based Tree Edit Distance (wTED): TED calculates the minimal cost of node edit operations for transforming one labeled tree into another. A document can be viewed at multiple abstraction levels, including the document title, its sections, subsections, etc. The next figure illustrates the wTED distance between two document trees.
[Figure: wTED distance between two document trees]
Document Revision Detection: We used wDTW and wTED to solve a document revision detection problem by taking each document as a node in a revision network and the distance score of two documents as the arc length. We conclude that the revision of a document is the document with the smallest distance score, based on the minimal branching algorithm. We next report the performance of our semantic measures on two types of data sets.
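The revision-detection step can be illustrated with a simplified stand-in for minimal branching: for each document, pick the candidate parent with the smallest distance score below a threshold. The dictionary shape, names, and distance values below are hypothetical.

```python
def detect_revisions(distances, threshold=1.0):
    """Given a dict {(child, parent): distance}, pick for each child
    the parent with the smallest distance, provided it is under the
    threshold -- a simplified stand-in for the minimal-branching step."""
    best = {}
    for (child, parent), d in distances.items():
        if d < threshold and (child not in best or d < best[child][1]):
            best[child] = (parent, d)
    return {c: p for c, (p, _) in best.items()}

# Hypothetical pairwise distance scores between document versions.
dists = {
    ("v2", "v1"): 0.2,
    ("v3", "v1"): 0.9,
    ("v3", "v2"): 0.3,
}
print(detect_revisions(dists))
```

The full minimal-branching algorithm additionally enforces a globally optimal tree over the whole revision network rather than choosing each parent independently.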
Distance/Similarity Measures: We compared our measures wDTW and wTED with four baselines: Vector Space Model (VSM), Word Mover's Distance (WMD), PV-DTW (paragraph vector + DTW), and PV-TED (paragraph vector + TED).
Data Sets
  • Wikipedia Revision Dumps (long, linear revisions)
  • Simulated Data Sets (short, tree revisions): insertion, deletion, and substitution of words, paragraphs, and titles
Performance on the Wikipedia Revision Dumps
[Figure: performance on the Wikipedia revision dumps]
Performance on the Simulated Data Sets
[Figure: performance on the simulated data sets]
Findings
wDTW consistently performs the best. WMD is better than wTED.
  • The performances of wDTW and wTED drop more slowly than those of the other measures.
  • wDTW and wTED use dynamic programming to find the global optimal alignment.
  • WMD relies on a greedy algorithm that sums up the minimal cost for every word.
wDTW and wTED outperform PV-DTW and PV-TED.
  • Paragraph distance using word2vec is more accurate than using paragraph vectors.
VSM is the fastest, wTED is faster than wDTW, and WMD is the slowest (40 times slower than wDTW).
