Diego Klabjan

A Semi-supervised learning approach to enhance health care Community-based Question Answering: A case study in alcoholism

9/26/2017


 

Author

Papis Wongchaisuwat, Ph.D. candidate in Industrial Engineering and Management Sciences, Northwestern University
[email protected]

While a large number of internet users seek health information online, it is not trivial for them to quickly find an accurate answer to a specific question. Community-based Question Answering (CQA) sites such as Yahoo! Answers play an important role in addressing health information needs. In CQA sites, users post a question and expect the online health community to promptly provide desirable answers. Despite a high volume of user participation, a considerable number of questions are left unanswered, while other questions that address the same information need are answered elsewhere. Automatically answering posted questions can therefore provide a useful source of information for online health communities.

We developed a 2-phase algorithm to automatically answer health-related questions based on past questions and answers (QA). Our proposed algorithm uses information retrieval techniques to identify candidate answers from resolved QA and further re-ranks these candidates with a semi-supervised learning algorithm that extracts the best answer to a prospective question. We first convert the raw data into a desirable structure, which is collected as a corpus of existing QA pairs. The first phase is implemented as a rule-based system that employs similarity measures, i.e., Dynamic Time Warping (DTW) and a vector-space based approach (VS), to find candidate answers from the corpus of existing QA pairs for any prospective question. In the second phase, we implemented supervised and Expectation Maximization (EM) based semi-supervised learning models that refine the output of the first phase by screening out invalid answers and ranking the remaining valid answers.
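As a rough illustration of the first phase, the sketch below retrieves candidate answers for a prospective question by vector-space similarity against the corpus of resolved questions. It uses TF-IDF cosine similarity as a stand-in for the VS measure; the DTW-based similarity and the rule-based screening described above are not shown, and the function and variable names are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def candidate_answers(new_question, past_questions, past_answers, top_k=5):
    """Phase 1 (sketch): rank resolved QA pairs by question-question similarity."""
    vec = TfidfVectorizer(stop_words="english")
    Q = vec.fit_transform(past_questions)      # resolved questions from the corpus
    q = vec.transform([new_question])          # the prospective question
    sims = cosine_similarity(q, Q).ravel()
    best = sims.argsort()[::-1][:top_k]
    # Return the answers of the most similar past questions as candidates
    return [(past_answers[i], float(sims[i])) for i in best]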

We obtained a total of 4,216 alcoholism-related QA threads from Yahoo! Answers as a case study. On this dataset, the semi-supervised learning algorithm has an accuracy of 86.2%. UMLS-based (health-related) features used in the model improve the algorithm's performance by approximately 8%. An example result returned from the algorithm to determine candidate answers is shown in the figure below.
[Figure: example candidate answers returned by the algorithm]
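For the second phase, a minimal EM-style semi-supervised sketch is shown below: a classifier is fit on the labeled candidate answers, the unlabeled candidates receive soft labels from the current model (E-step), and the model is refit on the weighted data (M-step). This is only a schematic of the approach described above; the actual features (including the UMLS-based ones) and model are not reproduced here.

import numpy as np
from sklearn.linear_model import LogisticRegression

def em_semi_supervised(X_lab, y_lab, X_unlab, n_iter=10):
    """EM-style loop (sketch): alternate soft labeling and refitting."""
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    classes = clf.classes_
    for _ in range(n_iter):
        # E-step: posterior class probabilities for the unlabeled candidates
        P = clf.predict_proba(X_unlab)
        # M-step: refit on labeled data plus unlabeled data duplicated per class,
        # each copy weighted by its posterior probability
        X_aug = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_aug = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_aug = np.concatenate([np.ones(len(X_lab))] + [P[:, j] for j in range(len(classes))])
        clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug, sample_weight=w_aug)
    return clf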

Semantic Document Distance Measures: Word-vector Based Dynamic Time Warping And Word-vector Based Tree Edit Distance

9/14/2017


 

Author

Xiaofeng Zhu, PhD candidate in Electrical Engineering and Computer Science, Northwestern University
[email protected]

“I went to the bank.” is semantically more similar to “I withdrew some money.” than to “I went to the park.” However, most widely used distance measures consider the last sentence more similar, simply because it shares more words with the first. In this post, we explain two new algorithms - word vector-based dynamic time warping (wDTW) and word vector-based tree edit distance (wTED) - for measuring semantic document distances based on distributed representations of words. Both algorithms hinge on the distance between two paragraphs and are implemented in Apache Spark. We train word2vec vectors beforehand and represent a document as a list of paragraphs, where a paragraph is a list of word2vec vectors.
Word Vector-based Dynamic Time Warping (wDTW): DTW measures the distance between two sequences. The wDTW algorithm calculates document distances based on DTW by sequentially comparing any two paragraphs of the two documents. The figure below demonstrates how wDTW finds the optimal alignments between two documents.
[Figure: optimal alignments between the paragraphs of two documents found by wDTW]
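A minimal sketch of this idea follows, assuming each document is a list of paragraphs and each paragraph is a list of word2vec vectors (as described above). For simplicity, the paragraph-to-paragraph distance here is taken as the Euclidean distance between mean word vectors; the paper's exact paragraph distance may differ.

import numpy as np

def paragraph_distance(p, q):
    """Distance between two paragraphs, each a list of word vectors (an assumption:
    mean word vector per paragraph, Euclidean distance between the means)."""
    return float(np.linalg.norm(np.mean(p, axis=0) - np.mean(q, axis=0)))

def wdtw(doc_a, doc_b):
    """Dynamic time warping over the paragraph sequences of two documents."""
    n, m = len(doc_a), len(doc_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = paragraph_distance(doc_a[i - 1], doc_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy usage with random 50-dimensional "word vectors"
rng = np.random.default_rng(0)
doc1 = [[rng.normal(size=50) for _ in range(5)] for _ in range(3)]   # 3 paragraphs
doc2 = [[rng.normal(size=50) for _ in range(4)] for _ in range(4)]   # 4 paragraphs
print(wdtw(doc1, doc2))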

Word Vector-based Tree Edit Distance (wTED): TED calculates the minimal cost of node edit operations for transforming one labeled tree into another. A document can be viewed at multiple abstraction levels that include the document title, its sections, subsections, etc., so a document naturally maps to a tree. The next figure illustrates the wTED distance between two document trees.
[Figure: the wTED distance between two document trees]
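As an illustration (not the paper's implementation), the sketch below uses the third-party zss package, which implements the Zhang-Shasha tree edit distance and, per its documentation, accepts a custom label-distance function. Node labels are section titles, a tiny hand-made embedding table stands in for trained word2vec vectors, and the node edit cost is the distance between the mean word vectors of the labels.

import numpy as np
from zss import simple_distance, Node  # third-party Zhang-Shasha tree edit distance

# Toy word "embeddings"; in practice these would come from a trained word2vec model
emb = {"introduction": np.array([1.0, 0.0]),
       "results":      np.array([0.0, 1.0]),
       "experiments":  np.array([0.1, 0.9]),
       "paper":        np.array([0.5, 0.5])}

def label_vector(label):
    """Mean word vector of a node label; empty/unknown labels map to the zero vector."""
    words = [w for w in label.split() if w in emb]
    return np.mean([emb[w] for w in words], axis=0) if words else np.zeros(2)

def label_dist(a, b):
    return float(np.linalg.norm(label_vector(a) - label_vector(b)))

# Document trees: root = title, children = sections
doc_a = Node("paper").addkid(Node("introduction")).addkid(Node("results"))
doc_b = Node("paper").addkid(Node("introduction")).addkid(Node("experiments"))

print(simple_distance(doc_a, doc_b, label_dist=label_dist))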
Document Revision Detection: We used wDTW and wTED to solve a document revision detection problem by treating each document as a node in a revision network and the distance score between two documents as the arc length. We conclude that the revision of a document is the document with the smallest distance score to it, as determined by the minimal branching algorithm. We next report the performance of our semantic measures on two types of data sets.
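A minimal sketch of this step is shown below, using networkx's minimum_spanning_arborescence as the minimal branching; doc_distance stands in for wDTW or wTED, and the document identifiers are hypothetical.

import itertools
import networkx as nx

def revision_arcs(docs, doc_distance):
    """Documents are nodes; arc u -> v with length doc_distance(u, v) means
    'v may be a revision of u'. The minimal branching keeps, for each document,
    the cheapest incoming arc, i.e., its most likely predecessor."""
    G = nx.DiGraph()
    G.add_nodes_from(docs)
    for u, v in itertools.permutations(docs, 2):
        G.add_edge(u, v, weight=doc_distance(u, v))
    return list(nx.minimum_spanning_arborescence(G).edges())

# Toy usage with precomputed pairwise distances
dist = {("v1", "v2"): 0.2, ("v2", "v1"): 0.2,
        ("v1", "v3"): 0.5, ("v3", "v1"): 0.5,
        ("v2", "v3"): 0.1, ("v3", "v2"): 0.1}
print(revision_arcs(["v1", "v2", "v3"], lambda u, v: dist[(u, v)]))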
Distance/Similarity Measures: We compared our measures, wDTW and wTED, with four baselines: Vector Space Model (VSM), Word Mover's Distance (WMD), PV-DTW (paragraph vector + DTW), and PV-TED (paragraph vector + TED).
Data Sets
  • Wikipedia Revision Dumps (long, linear revisions)
  • Simulated Data Sets (short, tree revisions): insertion, deletion, and substitution of words, paragraphs, and titles
[Figure: performance on the Wikipedia Revision Dumps]
[Figure: performance on the Simulated Data Sets]
Findings
wDTW consistently performs the best, and WMD is better than wTED.
  • The performance of wDTW and wTED degrades more slowly than that of the other measures.
  • wDTW and wTED use dynamic programming to find the global optimal alignment.
  • WMD relies on a greedy algorithm that sums up the minimal cost for every word.
wDTW and wTED outperform PV-DTW and PV-TED.
  • Paragraph distance using word2vec is more accurate than using paragraph vectors.
VSM is the fastest, wTED is faster than wDTW, and WMD is the slowest (40 times slower than wDTW).

Hierarchical Relation-based Latent Dirichlet Allocation (hrLDA)

7/12/2017


 

Author

Xiaofeng Zhu, PhD candidate in Electrical Engineering and Computer Science, Northwestern University
[email protected]

Most of the existing ontologies, such as DBpedia, rely on supervised ontology learning via manual parsing or transferring from existing knowledge bases. However, knowledge bases do not exist in some specific domains, such as semiconductor packaging, and supervised ontology learning is not appropriate for learning ontologies in new domains. In contrast, unsupervised ontology learning, which is generally based on topic modeling, can learn new entities and their relations from plain text and is likely to perform better as more data become available. In this post, we show how to learn a terminological ontology in a specific domain via hrLDA.
 
Among all types of ontologies, our work focuses on terminological ontologies. In short, a terminological ontology is a hierarchical structure of subject-verb-object triplets (illustrated in the figure below). Before we explain the concept of a terminological ontology, we need to define what a topic is. A topic is a list of words, each with a probability of belonging to that topic; this is the fundamental assumption of Latent Dirichlet Allocation (LDA). A topic label is a noun phrase and also a node in a topic tree, for example, city or London. A topic path is a list of topic labels from the root to one leaf. A terminological ontology has two components: topic hierarchies in a topic tree (shown on the left side) and topic relations between any two topic labels (shown on the right side). Capital is a sub-concept of city; “be on the north of” is the relation between Berlin and London. We use hierarchical topic modeling to extract topic hierarchies, and we use relation extraction to extract topic relations.
[Figure: a terminological ontology with topic hierarchies (left) and topic relations between topic labels (right)]
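To make the two components concrete, here is a tiny terminological ontology written out as plain Python data structures. The entries are taken from the example above; the representation itself is illustrative, not the one used inside hrLDA.

# Topic hierarchies: a nested dict from a topic label to its sub-topics
topic_tree = {
    "city": {
        "capital": {"Berlin": {}, "London": {}},
    },
}

# Topic relations: subject-verb-object triplets between topic labels
topic_relations = [
    ("Berlin", "be on the north of", "London"),
]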
hrLDA builds on hierarchical latent Dirichlet allocation (hLDA) and overcomes its limitations. The four components of hrLDA are relation-based latent Dirichlet allocation (rLDA), relation triplet extraction, acquaintance Chinese restaurant process (ACRP), and nested acquaintance Chinese restaurant process.
Relation-based Latent Dirichlet Allocation (rLDA): In contrast to LDA, rLDA takes a document as a bag of subject-verb-object triplets with the subject nouns as keys. In this way, we give salient nouns (nouns having more relation triplets) higher weights. The other difference is that the number of topics in rLDA is computed by ACRP instead of being a hyper-parameter. The figure below illustrates the plate notation of rLDA for extracting $K$ topics from corpus $D$ having $N$ documents. $T$ is the list of subject-verb-object (S-V-O) relation triplets in a document. $Z$ represents the topic assignments for the relation triplets. $Multi(\beta)$ and $Multi(\theta)$ are the multinomial topic-word and document-topic distributions. $Dir(\alpha)$ and $Dir(\eta)$ are hyper-parameters.
[Figure: plate notation of rLDA]
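The sketch below approximates this setup with gensim's standard LDA: each document is reduced to the subject nouns of its S-V-O triplets, so subjects with more triplets carry proportionally more weight. Unlike rLDA, the number of topics is fixed here rather than estimated by ACRP, and the triplet structure beyond the subject key is ignored.

from gensim import corpora, models

def lda_over_triplets(doc_triplets, num_topics):
    """doc_triplets: one list of (subject, verb, object) triplets per document."""
    # Represent each document by the subject nouns of its triplets, so a noun
    # that heads many triplets appears many times (higher weight).
    docs = [[s for (s, v, o) in triplets] for triplets in doc_triplets]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    return models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)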
Relation Triplet Extraction: The extracted relation triplets can be classified into three types. The first type, subject-predicate-object-based relations, are relations that can be extracted using the Stanford NLP parser and the Ollie relation extraction library. The second type, noun-based/hidden relations, are relations that reside in compound nouns and acronyms. The third type is relations from document structures; for instance, the indentation and bullet styles of a document indicate relations. hrLDA finds the number of topics via a partition method, ACRP.
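A rough sketch of the first relation type is below, using spaCy's dependency parse as a stand-in for the Stanford parser / Ollie pipeline mentioned above (noun-based and document-structure relations are not covered here).

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def svo_triplets(text):
    """Extract simple subject-verb-object triplets from dependency parses."""
    triplets = []
    for tok in nlp(text):
        if tok.dep_ == "nsubj" and tok.head.pos_ in ("VERB", "AUX"):
            verb = tok.head
            for obj in verb.children:
                if obj.dep_ in ("dobj", "obj", "attr"):
                    triplets.append((tok.text, verb.lemma_, obj.text))
    return triplets

print(svo_triplets("Berlin is the capital of Germany."))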

Acquaintance Chinese Restaurant Process (ACRP): A noun phrase has four properties: content, (paragraph, sentence) coordinates, one-to-many relation triplets, and document ID. People tend to put phrases that describe the same topic together; visually, phrases that are close to each other in terms of (paragraph, sentence) coordinates are acquaintances. The equations below give the probability of noun phrase $n+1$ choosing its topic id from $[1, \dots, k+1]$, which models how people order their wording when they write documents. $\gamma$ is a small hyper-parameter, $C_i$ is the number of noun phrases joining topic $i$, $Q_{1:i}$ represents paragraph location differences, and $S_{1:i}$ represents sentence location differences. When people write a document, the probability of forming a new topic when non-empty topics already exist is small; the probability of joining a topic that contains the same content words is close to 1; and a phrase is unlikely to join a topic in which it has no acquaintances. Words that appear in the same paragraph are close acquaintances, and they are even closer if they appear in the same sentence. This criterion is not optimal yet, but it is a big improvement over the Chinese restaurant process (CRP) used in hLDA.
The probability of choosing the $(k + 1)^{th}$ topic reads \[P(z_{n+1} = (k+1) | Z_{1:n}) = \frac{\gamma}{n+ \gamma}.\]

The probability of selecting any of the $k$ topics is

  • if the content of $t^{n+1}$ is synonymous with or an acronym of a previously analyzed noun phrase $t^m$ $(m < n +1)$ in the $i^{th}$ topic, \[P(z_{n+1} = i | Z_{1:n}) = 1 - \gamma;\]
  • else if the document ID of $t^{n+1}$ is different from all document IDs belonging to the $i^{th}$ topic, \[P(z_{n+1} = i | Z_{1:n}) = \gamma;\]
  • otherwise, \[P(z_{n+1} = i | Z_{1:n}) = \frac{C_i - \left(1 - \frac{1}{\min(Q_{1:i})}\right)}{\left(1 + \min(S_{1:i})\right) n + \gamma}.\]
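Putting the cases together, here is a direct transcription of these choice probabilities as a small helper. This is a sketch: synonym/acronym detection, coordinate bookkeeping, and any normalization are left out, and the argument names are made up.

def acrp_probability(i, k, n, gamma, C, min_Q, min_S, same_content, same_doc):
    """Probability that noun phrase n+1 picks topic i, for i in 1..k+1.

    C[i]:               number of noun phrases already in topic i
    min_Q[i], min_S[i]: smallest paragraph / sentence coordinate differences
                        between the new phrase and the phrases in topic i
    same_content[i]:    topic i already contains a synonym or acronym of the phrase
    same_doc[i]:        topic i contains a phrase from the same document
    """
    if i == k + 1:                       # open a new topic
        return gamma / (n + gamma)
    if same_content[i]:                  # synonym/acronym already in topic i
        return 1 - gamma
    if not same_doc[i]:                  # no phrase from this document in topic i
        return gamma
    # otherwise: favor large topics with close acquaintances
    q = max(min_Q[i], 1)                 # guard against a zero paragraph difference
    return (C[i] - (1 - 1.0 / q)) / ((1 + min_S[i]) * n + gamma)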
Nested Acquaintance Chinese Restaurant Process: A topic tree is generated by recursively applying rLDA and ACRP in a top-down fashion. For instance, we start with all the noun phrases at level 0, Node 1. We use ACRP to calculate the topic number $K^0_1$ at level 0 for Node 1 and the initial state of rLDA. We then apply rLDA to get the actual topic distribution and keep the top phrases as topic labels. Next, we remove the topic labels and feed the unpartitioned noun phrases to ACRP and rLDA again, until all the phrases are assigned as topic labels. Connecting all the topic labels colored in red, we get a topic tree. After we link the relation triplets back to the noun phrases, we get a terminological ontology.
[Figure: a topic tree generated by nested ACRP and rLDA]
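The recursion itself can be summarized in a few lines. In the sketch below, num_topics_fn and partition_fn are hypothetical stand-ins for ACRP and rLDA; the toy helpers in the usage example only illustrate the control flow.

def build_topic_tree(phrases, num_topics_fn, partition_fn):
    """Top-down construction: choose a topic number, partition the phrases into
    labeled topics, then recurse on each topic's remaining phrases."""
    if not phrases:
        return []
    k = num_topics_fn(phrases)                         # ACRP: number of child topics
    children = []
    for label, members in partition_fn(phrases, k):    # rLDA: topic label + member phrases
        rest = [p for p in members if p != label]      # remove the label, recurse on the rest
        children.append({"label": label,
                         "children": build_topic_tree(rest, num_topics_fn, partition_fn)})
    return children

# Toy usage: at most two topics per node, round-robin groups, label = longest phrase
toy_num = lambda ps: min(2, len(ps))
def toy_partition(ps, k):
    groups = [ps[i::k] for i in range(k)]
    return [(max(g, key=len), g) for g in groups if g]

print(build_topic_tree(["berlin", "london", "capital", "city"], toy_num, toy_partition))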
Empirical Results: A unique advantage of hrLDA is that it is not sensitive to messy/noisy text that spans multiple domains. For instance, when the input text is about four domains (Semiconductor, Integrated circuit, Berlin, and London), hrLDA creates four big branches for the four domains. However, hLDA mixes words from different domains into the same topic, because LDA is applied vertically and each document is only allowed to have one topic path. More empirical results and analysis can be found in our paper. The code and data are available on GitHub.
