XIaofeng Zhu, PhD candidate in Electrical Engineering and Computer Science, Northwestern University
DTW measures the distance between two sequences: wDTW algorithm calculates document distances based on DTW by sequentially comparing any two paragraphs of two documents. The figure below demonstrates how wDTW finds the optimal alignments between two documents.
Distance/Similarity Measures: We compared our measures wDTW and wTED with four baselines: Vector Space Model (VSM), Word Mover Distance (WMD), PV-DTW(Paragraph vector + DTW), and PV-TED (Paragraph vector + TED).
- Wikipedia Revision Dumps (long, linear revisions)
- Simulated Data sets (short, tree revisions): Insertion, deletion, and substation of words, paragraphs, and titles
wDTW consistently performs the best. WMD is better than wTED.
- The performances of wDTW and wTED drop slower than others.
- wDTW and wTED use dynamic programming to find the global optimal alignment.
- WMD relies on a greedy algorithm that sums up the minimal cost for every word.
- Paragraph distance using word2vec is more accurate than using paragraph vectors.