<![CDATA[Diego Klabjan - Research Blog]]>Tue, 09 Jul 2024 23:00:18 -0700Weebly<![CDATA[L1-norm Kernel PCA]]>Thu, 16 Nov 2017 03:33:50 GMThttp://dynresmanagement.com/research-blog/l1-norm-kernel-pcaAuthor

Cheolmin Kim, Ph.D. candidate in Industrial Engineering and Management Sciences, Northwestern University
CheolminKim2019@u.northwestern.edu

Introduction
Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques. Given a large set of possibly correlated features, it attempts to find a small set of new features, called principal components, that retain as much information as possible. To generate these new dimensions, it linearly transforms the original features by multiplying them by loading vectors, in such a way that the newly generated features are orthogonal and have the largest variances with respect to the L2-norm.
Although PCA has enjoyed great popularity, it still has some limitations. First, since it generates new dimensions through a linear combination of the original features, it is not able to capture non-linear relationships between features. Second, as it uses the L2-norm for measuring variance, its solutions tend to be substantially affected by influential outliers. To overcome these limitations, we present an L1-norm kernel PCA model that is robust to outliers and captures non-linear relationships between features.

Model and Algorithm
Let us denote the data vector by $\text{a}_i$, and the loading vector by $\text{x}$. Letting $\Phi$ be a possibly nonlinear mapping, L1-norm Kernel PCA is formulated as follows:
\begin{align}
& \text{maximize} && \sum_{i=1}^n |\Phi(\text{a}_i)^T\text{x}| \\
& \text{subject to} && \|\text{x}\|_2=1.
\end{align}
Finding an optimal solution to the above problem is not straightforward since the problem is not only non-convex but also non-smooth. To develop an efficient algorithm, we first reformulate it as follows:
\begin{align}
& \text{minimize} && \|\text{x}\|_2 \\
& \text{subject to}  && \sum_{i=1}^n |\Phi(\text{a}_i)^T\text{x}|=1.
\end{align}
The reformulated problem has a geometric interpretation: the goal is to minimize the distance from the origin to the constraint set defined by the L1-norm terms. The main idea of the algorithm is to move along the boundary of the constraint set so that the distance of the iterate to the origin decreases.
The above figure graphically shows one step of the algorithm. Starting from the iterate $\text{x}^k$, we first identify the hyperplane $h^k$ on which the current iterate $\text{x}^k$ lies. After identifying the equation of $h^k$, we find the point of $h^k$ closest to the origin, which we denote by $\text{z}^k$. We then obtain the next iterate $\text{x}^{k+1}$ by projecting $\text{z}^k$ onto the constraint set, i.e., by multiplying it by an appropriate scalar. We repeat this process until the iterate $\text{x}^k$ converges.
The above approach is also applicable to L2-norm PCA, and interestingly, the corresponding algorithm for L2-norm PCA is the well-known power iteration. We can therefore view our algorithm as the counterpart of power iteration for L1-norm PCA. Moreover, as in L2-norm kernel PCA, the kernel trick applies to the above algorithm, making explicit computation of the mapping $\Phi$ unnecessary. While the algorithm finds the leading loading vector, the remaining ones can be found by deflating the kernel matrix and recursively applying the same algorithm.
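To make one iteration concrete, here is a minimal sketch of how the kernelized update might look, assuming a Gaussian kernel and representing the loading vector implicitly as $\text{x} = \sum_i c_i \Phi(\text{a}_i)$; the function names, stopping rule, and random initialization are illustrative and not the authors' implementation.
```python
import numpy as np

def gaussian_kernel(A, sigma=1.0):
    # A: (n, d) data matrix; returns the n-by-n Gaussian (RBF) kernel matrix.
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def l1_kernel_pca_loading(K, max_iter=100, tol=1e-8, seed=0):
    # The loading vector is kept implicitly as x = sum_i c[i] * Phi(a_i),
    # so Phi(a_j)^T x = (K c)_j and only the kernel matrix K is needed.
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    c = rng.standard_normal(n)
    c /= np.sqrt(c @ K @ c)                    # start with ||x||_2 = 1
    for _ in range(max_iter):
        s = np.sign(K @ c)                     # signs identify the hyperplane h^k
        s[s == 0] = 1.0
        c_new = s / np.abs(K @ s).sum()        # closest point, rescaled onto the constraint
        if np.abs(K @ (c_new - c)).max() < tol:
            c = c_new
            break
        c = c_new
    return c / np.sqrt(c @ K @ c)              # rescale so the loading has unit L2-norm

# Example: leading loading vector as kernel-space coefficients
A = np.random.default_rng(1).standard_normal((50, 3))
coeffs = l1_kernel_pca_loading(gaussian_kernel(A, sigma=2.0))
```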
Convergence Analysis
We have the following results regarding the convergence of the algorithm.
  • The algorithm terminates in a finite number of steps.
  • The gap $\|\text{x}^k\|-\|\text{x}^*\|$, where $\text{x}^*$ is the final iterate, decreases at a geometric rate.
  • By multiplying the final iterate $\text{x}^*$ by an appropriate scalar, we obtain a locally optimal solution to L1-norm kernel PCA.
Experimental Results
From the modeling perspective, L1-norm kernel PCA has an advantage over L2-norm kernel PCA in the presence of influential outliers. To show the robustness of L1-norm kernel PCA, we consider two data sets: the original data, whose covariance matrix has low rank, and noisy data obtained by corrupting $r\%$ of the observations in the original data. To measure robustness, we compute how much of the variation of the original data is explained by the loading vectors obtained by applying each kernel PCA algorithm to the noisy data. The following figure shows the experimental results while varying the width $\sigma$ of the Gaussian kernel and the noise level $r$. As shown in the figure, the loading vectors obtained by L1-norm kernel PCA from the noisy data better explain the variation of the original data.
We utilize this property for outlier detection, extending the success of L2-norm kernel PCA in anomaly detection. In the anomaly detection setting, where sample labels are available, L2-norm kernel PCA is applied to the normal samples, and the resulting loading vectors are used to characterize the boundary of the normal samples and to build a detection model. In the outlier detection setting, by contrast, sample labels are not available. Given this context, we apply L1-norm kernel PCA to obtain the loading vectors in a more robust manner and use them to build a detection model.
The above table displays the experimental results (AUC) on real-world datasets. As shown in the table, the L1-norm kernel PCA based models are not only better than the L2-norm kernel PCA based models but also competitive with popular outlier detection models such as LOF and iForest. Moreover, unlike density-based models, they tend to work well even when the dimension of a dataset is high.
]]>
<![CDATA[A Semi-supervised learning approach to enhance health care Community-based Question Answering: A case study in alcoholism]]>Tue, 26 Sep 2017 07:00:00 GMThttp://dynresmanagement.com/research-blog/a-semi-supervised-learning-approach-to-enhance-health-care-community-based-question-answering-a-case-study-in-alcoholismAuthor

Papis Wongchaisuwat, Ph.D. candidate in Industrial Engineering and Management Sciences, Northwestern University
PapisWongchaisuwat2013@u.northwestern.edu

While a large number of internet users seek health information online, it is not trivial for them to quickly find an accurate answer to a specific question. Community-based Question Answering (CQA) sites such as Yahoo! Answers play an important role in addressing health information needs. In CQA sites, users post a question and expect the online health community to promptly provide desirable answers. Despite a high volume of user participation, a considerable number of questions are left unanswered, while other questions that address the same information need are answered elsewhere. Automatically answering the posted questions can provide a useful source of information for online health communities.

We developed a 2-phase algorithm to automatically answer health-related questions based on past questions and answers (QA). Our proposed algorithm uses information retrieval techniques to identify candidate answers from resolved QA and then re-ranks these candidates with a semi-supervised learning algorithm that extracts the best answer to a prospective question. We first convert the raw data into a structured corpus of existing QA pairs. The first phase is implemented as a rule-based system that employs similarity measures, namely Dynamic Time Warping (DTW) and a vector-space based approach (VS), to find candidate answers from the corpus of existing QA pairs for any prospective question. In the second phase, we implemented supervised and Expectation Maximization (EM) based semi-supervised learning models that refine the output of the first phase by screening out invalid answers and ranking the remaining valid answers.
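As a rough illustration of the first phase only, the sketch below ranks resolved questions by TF-IDF cosine similarity to a new question and returns their answers as candidates; it stands in for the DTW and vector-space measures above, and the function name, parameters, and example strings are made up.
```python
# Retrieval-phase sketch: candidate answers come from the resolved questions
# most similar to the new question. DTW matching and the semi-supervised
# re-ranking phase are not shown here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_candidates(new_question, resolved_questions, resolved_answers, top_k=5):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(resolved_questions + [new_question])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    best = sims.argsort()[::-1][:top_k]
    return [(resolved_answers[i], float(sims[i])) for i in best]

candidates = retrieve_candidates(
    "How can I help a friend who drinks too much?",
    ["How do I help someone with an alcohol problem?", "What is a hangover?"],
    ["Encourage them to talk to a professional ...", "A hangover is ..."],
)
```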

We obtained a total of 4,216 alcoholism-related QA threads from Yahoo! Answers as a case study. On our dataset, the semi-supervised learning algorithm has an accuracy of 86.2%. UMLS-based (health-related) features used in the model enhance the algorithm’s performance by approximately 8%. An example result returned from the algorithm to determine candidate answers is shown in the figure below.
]]>
<![CDATA[Predicting ICU Readmission using Grouped Physiological and Medication Trends]]>Thu, 14 Sep 2017 13:40:19 GMThttp://dynresmanagement.com/research-blog/predicting-icu-readmission-using-grouped-physiological-and-medication-trendsAuthor

Ye Xue, PhD candidate in Electrical Engineering and Computer Science, Northwestern University
YeXue2015@u.northwestern.edu

Background
Patients who are readmitted to an intensive care unit (ICU) usually have a high risk of mortality and an increased length of stay. ICU readmission risk prediction may help physicians re-evaluate a patient’s physical condition before discharge and avoid preventable readmissions. ICU readmission prediction models are often built on physiological variables. Intuitively, snapshot measurements, especially the last measurements, are effective predictors and are widely used. However, methods that only use snapshot measurements neglect the predictive information contained in the trends of physiological and medication variables. Mean, maximum, or minimum values take multiple time points into account and capture summary statistics, but these statistics cannot capture the detailed shape of temporal trends.
 
In this work, we find strong predictors that capture detailed temporal trends of variables for 30-day readmission risk and build prediction models with high accuracy.
 
Workflow
We convert patients’ time series into graphs, where each node represents a discretized measurement at a single point in time. Among these graphs, we discover the most important subgraphs and identify them as common temporal trends. We study the correlation between the important subgraphs, group them, and use the groupings as an augmentation to the snapshot features in building predictive models. The workflow is shown below.
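As a toy illustration of the first step of this workflow, the sketch below discretizes one normalized physiological series into level bins and links consecutive time points in a small directed graph; the bin edges, node attributes, and use of networkx are assumptions rather than the paper's exact construction.
```python
# Turn one normalized time series into a directed graph whose nodes are
# discretized measurements and whose edges follow temporal order.
import numpy as np
import networkx as nx

def series_to_trend_graph(values, bins=(0.25, 0.5, 0.75)):
    levels = np.digitize(values, bins)        # 0 = low ... 3 = high
    g = nx.DiGraph()
    for t, level in enumerate(levels):
        g.add_node(t, level=int(level))
        if t > 0:
            g.add_edge(t - 1, t)              # link consecutive time points
    return g

heart_rate = [0.20, 0.42, 0.90, 0.85]         # normalized measurements over time
trend_graph = series_to_trend_graph(heart_rate)
```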
Study of Imputation on Temporal Data
Along the way, we study the impact of different imputation techniques and develop a tailored method, called customized linear interpolation, that outperforms the other state-of-the-art approaches. Multivariate Imputation by Chained Equations (MICE) is a popular imputation method; however, its performance on temporal data is not as strong as on snapshot data. A comparison between the values imputed by MICE and by customized linear interpolation is shown below.
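For intuition only, here is plain time-based linear interpolation on a made-up measurement series with missing values; the additional rules of the customized variant described above are not reproduced.
```python
# Plain linear interpolation along the time axis of a toy measurement series.
import numpy as np
import pandas as pd

series = pd.Series([72.0, np.nan, np.nan, 85.0, 88.0],
                   index=pd.date_range("2017-01-01", periods=5, freq="H"))
imputed = series.interpolate(method="time")   # fill gaps proportionally to elapsed time
print(imputed)
```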
Conclusions
Our model outperforms the baseline model that only uses the snapshot features, suggesting that temporal trends carry predictive information for ICU readmission risk. Additionally, our experiments show that some imputation methods work well for replacing missing values in snapshot measurements but not in temporal data, suggesting that temporal patterns need to be taken into account during imputation.
]]>
<![CDATA[Semantic Document Distance Measures: Word-vector Based Dynamic Time Warping And Word-vector Based Tree Edit Distance]]>Thu, 14 Sep 2017 12:45:41 GMThttp://dynresmanagement.com/research-blog/semantic-document-distance-measures-word-vector-based-dynamic-time-warping-and-word-vector-based-tree-edit-distanceAuthor

Xiaofeng Zhu, PhD candidate in Electrical Engineering and Computer Science, Northwestern University
xiaofengzhu2013@u.northwestern.edu

“I went to the bank” is semantically more similar to “I withdrew some money” than to “I went to the park.” However, most widely used distance measures consider the latter sentence more similar because it has more words in common. In this post, we explain two new algorithms - word vector-based dynamic time warping (wDTW) and word vector-based tree edit distance (wTED) - for measuring semantic document distances based on distributed representations of words. Both algorithms hinge on the distance between two paragraphs and are implemented in Apache Spark. We train word2vec vectors beforehand and represent a document as a list of paragraphs, where a paragraph is a list of word2vec vectors.
Word Vector-based Dynamic Time Warping (wDTW): DTW measures the distance between two sequences. The wDTW algorithm calculates document distances based on DTW by sequentially comparing the paragraphs of two documents. The figure below demonstrates how wDTW finds the optimal alignment between two documents.
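Below is a minimal, non-Spark sketch of the wDTW idea under a simplifying assumption: each paragraph is summarized by the average of its word2vec vectors, and the paragraph distance is the Euclidean distance between those averages; the paper's actual paragraph distance and distributed implementation differ.
```python
# Classic DTW over paragraph sequences with a word-vector-based paragraph cost.
import numpy as np

def paragraph_distance(p, q):
    # p, q: arrays of word vectors (one row per word) for a paragraph
    return float(np.linalg.norm(np.mean(p, axis=0) - np.mean(q, axis=0)))

def wdtw(doc_a, doc_b):
    # doc_a, doc_b: lists of paragraphs, each an array of word vectors
    n, m = len(doc_a), len(doc_b)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = paragraph_distance(doc_a[i - 1], doc_b[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return dtw[n, m]
```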

Word Vector-based Tree Edit Distance (wTED): TED calculates the minimal cost of node edit operations for transforming one labeled tree into another. A document can be viewed at multiple abstraction levels, including the document title, its sections, subsections, etc. The next figure illustrates the wTED distance between two document trees.
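For a feel of the tree side, the sketch below builds two tiny document outlines and computes a plain Zhang-Shasha tree edit distance with the zss package; wTED would replace the default unit edit costs with a word-vector-based paragraph distance, which is omitted here.
```python
# Toy tree edit distance between two document outlines (Zhang-Shasha via zss).
from zss import Node, simple_distance

doc_v1 = (Node("title")
          .addkid(Node("introduction"))
          .addkid(Node("methods").addkid(Node("data"))))

doc_v2 = (Node("title")
          .addkid(Node("introduction"))
          .addkid(Node("methods").addkid(Node("data")).addkid(Node("results"))))

print(simple_distance(doc_v1, doc_v2))   # number of node edits between the outlines
```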
Document Revision Detection: We used wDTW and wTED to solve a document revision detection problem by treating each document as a node in a revision network and the distance score between two documents as the length of the arc connecting them. We conclude that the revision of a document is the document with the smallest distance score, determined using a minimal branching algorithm. We next report the performance of our semantic measures on two types of data sets.
Distance/Similarity Measures: We compared our measures, wDTW and wTED, with four baselines: Vector Space Model (VSM), Word Mover's Distance (WMD), PV-DTW (paragraph vector + DTW), and PV-TED (paragraph vector + TED).
Data Sets
  • Wikipedia Revision Dumps (long, linear revisions)
  • Simulated data sets (short, tree revisions): insertion, deletion, and substitution of words, paragraphs, and titles
Performance on the Wikipedia Revision Dumps
Performance on the Simulated Data Sets
Findings
wDTW consistently performs the best, and WMD is better than wTED.
  • The performance of wDTW and wTED drops more slowly than that of the other measures.
  • wDTW and wTED use dynamic programming to find the global optimal alignment.
  • WMD relies on a greedy algorithm that sums up the minimal cost for every word.
wDTW and wTED outperform PV-DTW and PV-TED.
  • Paragraph distance using word2vec is more accurate than using paragraph vectors.
VSM is the fastest, wTED is faster than wDTW, and WMD is the slowest (40 times slower than wDTW).
]]>
<![CDATA[Semi-supervised Learning for Discrete Choice]]>Thu, 17 Aug 2017 07:00:00 GMThttp://dynresmanagement.com/research-blog/semi-supervised-learning-for-discrete-choiceAuthor

Jie Yang, Ph.D. candidate in Civil and Environmental Engineering, Northwestern University
jieyang2011@u.northwestern.edu

More and more airlines are putting emphasis on “merchandising.” But aren’t they doing this already? Unfortunately, most of them are not. Traditional carriers rely on complex distribution channels, and most of their focus is on managing those channels, such as OTAs and offline travel agencies. Selling through the direct channel is lucrative, yet most airlines are still at an early stage. As one part of merchandising capability, personalization heavily relies on airlines’ understanding of their travelers’ data and their ability to collect such data. That is also why more and more airlines are trying to bring travelers to their own websites to complete the booking.

This trend, however, may disrupt the current market ecosystem, in which global distribution systems (GDSs) sell the majority of fares. To counter these airline strategies, GDSs have to take action. Compared to an individual airline’s database, a GDS company’s competitive advantage is that it has data from all the different airlines that use the GDS. Most importantly, some GDS companies have the ability to “shop back” and match bookings with the itineraries displayed to travelers, but this process is computationally expensive. So an interesting problem came to mind: can we use unmatched itineraries (i.e., unlabeled data, or data without an observed label) to improve airlines’ understanding of a potential market?

Airlines can understand a market, such as Chicago to Shanghai, from different perspectives. To predict their market share, they use discrete choice models; the basic one, the multinomial logit model, assumes a traveler’s utility for (or impression of) an itinerary is a linear function of weighted features plus Gumbel noise. The goal of this model is to predict the probability of choosing each itinerary, and it can also be used to estimate an airline’s market share or a traveler’s preference ranking of the returned itineraries. Typically, to estimate such a model we need to know the choice sets as well as the booked itinerary within each set. But unlabeled choice sets may also be used to improve on the typical model estimation.
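As a quick refresher on the model just described, the sketch below computes multinomial logit choice probabilities for one hypothetical choice set; the features (fare, duration, stops) and weights are invented for illustration.
```python
# Multinomial logit: utility is a weighted sum of itinerary features, and the
# Gumbel-noise assumption yields softmax choice probabilities.
import numpy as np

def choice_probabilities(features, weights):
    # features: (n_itineraries, n_features); weights: (n_features,)
    utilities = features @ weights
    exp_u = np.exp(utilities - utilities.max())   # numerically stable softmax
    return exp_u / exp_u.sum()

itineraries = np.array([[450.0, 14.5, 1],   # fare ($), duration (h), stops
                        [520.0, 13.0, 0],
                        [430.0, 16.0, 2]])
beta = np.array([-0.004, -0.15, -0.30])
print(choice_probabilities(itineraries, beta))
```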
 
How can we utilize those unlabeled choice sets? There has been a lot of research on improving classification models with unlabeled data, so it is worthwhile to try similar algorithms on our problem. Inspired by this work, we adapted four different algorithms: three based on clustering methods and one based on the expectation-maximization (EM) algorithm. To evaluate the methods, we designed cross-validation experiments on a public hotel dataset and compared prediction accuracy. Results are presented below. We gradually increase the percentage of booked data with respect to unlabeled data. The metric we used is based on Kendall’s tau (lower is better), and the model is a ranked logit choice model. The zero line on the Y-axis is the baseline, namely the prediction of a model estimated with only booked data. It is clear that our algorithms perform better than the baseline model for booked-data percentages of up to 10%.
Specifically, we applied clustering-and-label (CL), expectation maximization (EM), x-clustering-and-label-1 (XCL1), and x-clustering-and-label-2 (XCL2). XCL1 and XCL2 are advanced clustering methods that explore the clustering structure automatically without a preset number of clusters. The results indicate that XCL1 and XCL2 are better than the other two algorithms.
 
In all, we believe this research may help GDS companies provide better solutions to airlines, especially non-legacy carriers that do not have the capability to build up their own channels or IT infrastructure.
]]>
<![CDATA[Hierarchical Relation-based Latent Dirichlet Allocation (hrLDA)]]>Wed, 12 Jul 2017 07:00:00 GMThttp://dynresmanagement.com/research-blog/hierarchical-relation-based-latent-dirichlet-allocation-hrldaAuthor

Xiaofeng Zhu, PhD candidate in Electrical Engineering and Computer Science, Northwestern University
xiaofengzhu2013@u.northwestern.edu

Most existing ontologies, such as DBpedia, rely on supervised ontology learning via manual parsing or transfer from existing knowledge bases. However, knowledge bases do not exist in some specific domains, such as semiconductor packaging, and supervised ontology learning is not appropriate for learning ontologies in new domains. In contrast, unsupervised ontology learning, which is generally based on topic modeling, can learn new entities and their relations from plain text and is likely to perform better as more data become available. In this post, we show how to learn a terminological ontology in a specific domain via hrLDA.
 
Among all types of ontologies, our work focuses on terminological ontologies. In short, a terminological ontology is a hierarchical structure of subject-verb-object triplets (illustrated in the figure below). Before explaining terminological ontologies, we need to say what a topic is. A topic is a list of words, each with a probability of belonging to that topic; this is the fundamental assumption of Latent Dirichlet Allocation (LDA). A topic label is a noun phrase and also a node in a topic tree, for example, city or London. A topic path is a list of topic labels from the root to one leaf. A terminological ontology has two components: topic hierarchies in a topic tree (shown on the left side) and topic relations between any two topic labels (shown on the right side). For example, capital is a sub-concept of city, and “be on the north of” is the relation between Berlin and London. We use hierarchical topic modeling to extract topic hierarchies, and we use relation extraction to extract topic relations.
hrLDA builds on hierarchical latent Dirichlet allocation (hLDA) and overcomes its limitations. The four components of hrLDA are relation-based latent Dirichlet allocation (rLDA), relation triplet extraction, acquaintance Chinese restaurant process (ACRP), and nested acquaintance Chinese restaurant process.

Relation-based Latent Dirichlet Allocation: In contrast to LDA, rLDA takes a document as a bag of subject-verb-object triplets, with the subject nouns as keys. In this way, we give salient nouns (nouns having more relation triplets) high weights. The other difference is that the number of topics in rLDA is computed by ACRP instead of being a hyper-parameter. The figure below illustrates the plate notation of rLDA for extracting $K$ topics from corpus $D$ having $N$ documents. $T$ is the list of subject-verb-object (S-V-O) relation triplets in a document, and $Z$ represents the topic assignments of the relation triplets. Multi($\beta$) and Multi($\theta$) are the multinomial topic-word and document-topic distributions, and $\alpha$ and $\eta$ are the hyper-parameters of the Dirichlet priors Dir($\alpha$) and Dir($\eta$).
Relation Triplet Extraction: The extracted relation triplets fall into three types. The first type, subject-predicate-object-based relations, are relations that can be extracted using the Stanford NLP parser and the Ollie relation extraction library. The second type, noun-based/hidden relations, are relations that reside in compound nouns and acronyms. The third type is relations from document structures; for instance, the indentation and bullet types of a document indicate relations. hrLDA finds the number of topics via a partition method, ACRP.
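To give a flavor of the first relation type only, here is a hedged sketch that extracts simple subject-verb-object triplets with spaCy as a stand-in for the Stanford/Ollie pipeline named above; the dependency labels checked and the example sentence are illustrative and will miss many constructions.
```python
# Simple S-V-O triplet extraction from dependency parses.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model has been downloaded

def svo_triplets(text):
    triplets = []
    for token in nlp(text):
        if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
            verb = token.head
            for child in verb.children:
                if child.dep_ in ("dobj", "obj", "attr"):
                    triplets.append((token.text, verb.lemma_, child.text))
    return triplets

print(svo_triplets("The package contains a semiconductor die."))
```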

Acquaintance Chinese Restaurant Process: A noun phrase has four properties: content, (paragraph, sentence) coordinates, one-to-many relation triplets, and document ID. People tend to put phrases that describe the same topic together, so phrases whose (paragraph, sentence) coordinates are close to each other are acquaintances. The equations below give the probability of noun phrase $t^{n+1}$ choosing its topic id from $[1, \dots, k+1]$, which models how people order their wording when they write documents. $\gamma$ is a small hyper-parameter, $C_i$ is the number of noun phrases assigned to topic $i$, and $Q_{1:i}$ and $S_{1:i}$ are the paragraph and sentence location differences, respectively. Intuitively, when people write a document, the probability of forming a new topic when non-empty topics exist is small; the probability of joining a topic that contains the same content words is close to 1; and a phrase is unlikely to join a topic in which it has no acquaintances. Words that appear in the same paragraph are close acquaintances, and even closer if they appear in the same sentence. This is not yet optimal, but it is a big improvement over the Chinese restaurant process (CRP) used in hLDA.
The probability of choosing the $(k + 1)^{th}$ topic reads \[P(z_{n+1} = (k+1) | Z_{1:n}) = \frac{\gamma}{n+ \gamma}.\]

The probability of selecting any of the $k$ topics is

  • if the content of $t^{n+1}$ is synonymous with or an acronym of a previously analyzed noun phrase $t^m$ $(m < n +1)$ in the $i^{th}$ topic, \[P(z_{n+1} = i | Z_{1:n}) = 1 - \gamma;\]
  • else if the document ID of $t^{n+1}$ is different from all document IDs belonging to the $i^{th}$ topic, \[P(z_{n+1} = i | Z_{1:n}) = \gamma;\]
  • otherwise, \[P(z_{n+1} = i | Z_{1:n}) = \frac{C_i - \left(1 - \frac{1}{\min(Q_{1:i})}\right)}{\left(1 + \min(S_{1:i})\right) n + \gamma}.\]
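The following numeric sketch mirrors these assignment probabilities, assuming each existing topic stores its members' contents, (paragraph, sentence) coordinates, and document IDs; synonym and acronym matching is reduced to exact string equality and $\min(Q_{1:i})$ is guarded against zero, both purely for illustration.
```python
# ACRP-style topic assignment probabilities for one new noun phrase.
import numpy as np

def acrp_probabilities(phrase, topics, n, gamma=0.1):
    probs = []
    for topic in topics:
        if phrase["content"] in topic["contents"]:
            p = 1.0 - gamma                           # same content: near-certain join
        elif phrase["doc_id"] not in topic["doc_ids"]:
            p = gamma                                 # no acquaintance from this document
        else:
            q = min(abs(phrase["paragraph"] - p0) for p0 in topic["paragraphs"])
            s = min(abs(phrase["sentence"] - s0) for s0 in topic["sentences"])
            c_i = len(topic["contents"])
            p = (c_i - (1.0 - 1.0 / max(q, 1))) / ((1.0 + s) * n + gamma)
        probs.append(p)
    probs.append(gamma / (n + gamma))                 # open the (k+1)-th topic
    probs = np.array(probs)
    return probs / probs.sum()                        # normalize before sampling
```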
Nested Acquaintance Chinese Restaurant Process: A topic tree is generated by recursively applying rLDA and ACRP in a top-down fashion. For instance, we start with all the noun phrases at level 0, node 1. We use ACRP to calculate the topic number $K^0_1$ at level 0 from node 1 and the initial state of rLDA. We then apply rLDA to get the actual topic distribution and keep the top phrases as topic labels. Next, we remove the topic labels and feed the unpartitioned noun phrases to ACRP and rLDA until all phrases are assigned as topic labels. Connecting all the topic labels colored in red, we get a topic tree. After linking the relation triplets back to the noun phrases, we get a terminological ontology.
Empirical Results: A unique advantage of hrLDA is that it is not sensitive to messy/noisy text spanning multiple domains. For instance, when the input text is about four domains - semiconductor, integrated circuit, Berlin, and London - hrLDA creates four big branches for the four domains. However, hLDA mixes words from different domains into the same topic because LDA is applied vertically and each document is only allowed to have one topic path. More empirical results and analysis can be found in our paper. The code and data are available on GitHub.
    ]]>