<![CDATA[Diego Klabjan - Blog]]>Tue, 15 Jan 2019 18:00:43 -0800Weebly<![CDATA[Truthfulness of Information]]>Fri, 20 Jan 2017 04:14:06 GMThttp://dynresmanagement.com/blog/truthfulness-of-informationA lot has been written lately about fake news, especially in relation to recent elections. Blame has mostly been put on the tech giants Google and Facebook as they have become major sources of news. While traditional newspaper outlets such as the Wall Street Journal or the Washington Post curate information manually and perform due diligence, automated news feeds require either substantial labor or algorithms to detect fake news. The result is the well-known outrage at Google and Facebook for enabling the circulation of such news.

The problem of the truthfulness of information goes beyond everyday political, economic, and more general journalism-related misinformation. In healthcare, many sites provide misinformation, especially those originating and hosted in third-world countries. It can be devastating if a person acts on such sources. Another service area very prone to misinformation is finance. An erroneous signal or news item can quickly be picked up by traders or trading algorithms and drive a stock in the wrong direction. Even marketers need to know which web sites provide fake news, since they do not want to advertise on such sites.

Google and Facebook are well known for their prowess in machine learning and artificial intelligence, and they sit on top of troves of data. Yet they have not been able to design a robust truth-checking system. For quite some time there have been sites available – we will call them truthfulness brokers – that list other sites known to spread fake news. Recently Google announced that its search engine will start marking sites with questionable credibility. Based on the announcements, its strategy is to cross-check the site in question against those blacklisted by truthfulness brokers. A few browser plug-ins have also emerged with the same purpose, and the same algorithm, for tagging the credibility of a site.

For the time being this is probably an acceptable solution, but it clearly calls for more. First, these solutions rely on third-party sites. There is no guarantee that truthfulness brokers provide up-to-date listings. Indeed, they mostly rely on manual work and thus cannot be up-to-date. In addition, a new site can release fake news for quite some time before it catches the attention of truthfulness brokers. Another problem is that the brokers currently focus on journalism-related fake news. The other aforementioned areas do not have dedicated truthfulness brokers, leaving enormous untapped value.

It is clear that a robust, scalable solution to the problem must not rely exclusively on truthfulness brokers; it has to curate data internally and develop sophisticated machine learning and artificial intelligence algorithms. Focusing solely on web sites without considering the actual content they exhibit has limitations. Instead, algorithms and solutions based on knowledge graphs and fact-checking, combined with scores for source reliability, must be developed.

<![CDATA[Who is Afraid of the Big (Bad) Wolf?]]>Mon, 15 Aug 2016 12:44:00 GMThttp://dynresmanagement.com/blog/-who-is-afraid-of-big-bad-wolfIn May 2016 I participated in the IRI Annual Meeting in Orlando. I attended the round table “Implications of Big Data on R&D Management” organized by the Big Data Working Group, where I serve as a subject matter expert. The discussion was moderated by Mike Blackburn, R&D Portfolio/Program Leader at Cargill.

The participants, from companies such as IBM, John Deere, and Schneider Electric, discussed data-driven use cases in R&D, success stories, and challenges. We heard about data facilitating investment decisions, agricultural yield data driving acquisition recommendations, and insights from patents informing product design and development.

I was expecting to hear that the participants’ main concern is direct competitors outsmarting them in data acquisition and use. On the contrary, the vast majority of participants mentioned a company from a different industry. Google was the most frequently mentioned company, despite the discussion focusing on industries that even a few years ago were considered completely disconnected from the technology giants in the Bay Area.

It is well known how Google is disrupting the automotive industry with autonomous cars relying heavily on AI and mountains of data. Similarly, in logistics (Amazon with Amazon Robotics, formed through the acquisition of Kiva Systems) and manufacturing (Google with the acquisition of Boston Dynamics, though the two are parting ways after a few years), data together with the AI and data science embraced by Google and other tech companies is making inroads into traditional, well-entrenched industries. Google is also not shying away from data-driven opportunities in the life sciences, as witnessed by its Life Sciences Group, recently renamed Verily. Not to mention the tremendous opportunities that Google Glass, or its future incarnations, can potentially bring to R&D.

Given all these facts, it is no surprise that Google was mentioned much more often than direct competitors or the economy. The company is a wolf waiting to feed on pigs. But it is not a bad wolf; if your company gets eaten, you should blame yourself for not being visionary about what is possible with data, AI, and data science. The easiest path is to stay passive and pretend that Google disrupting your industry is a fantasy or too far off.

And Google is not only going to affect other organizational units in your industry; it could also influence R&D. Consider, for example, that Google has already assembled product images and manuals. With additive manufacturing it could easily produce products by reverse engineering these materials and then sell them online. From the vast amount of data it possesses, it can use AI to predict the need for products, willingness to pay, and which plug-ins are going to sell. The only way to save your career is to overtake Google. ]]>
<![CDATA[Artificial Intelligence vs CRISPR]]>Thu, 24 Mar 2016 19:52:51 GMThttp://dynresmanagement.com/blog/-artificial-intelligence-vs-crisprCRISPR/Cas9 is a brand new algorithm in machine learning that has the potential to replace humans in practicing law and negotiating contracts. It is based on a new deep learning model that is trained on only a few documents and is then capable of producing questions to be asked at a trial or negotiating a real estate deal with Mr. Trump. The model consists of 10 layers of …

Got you (at least some of you). CRISPR/Cas9 is actually a gene editing technology developed at the University of California, Berkeley that does not have anything to do with machine learning (or artificial intelligence). You can read more about the technology on Wikipedia. It is a relatively recent invention and has already been used to cure diseases in adult tissues and to change the color of skin in mice (Mizuno et al., 2014). No problem here, since humanity definitely wants to eradicate cancer and other genetic diseases. That was the case until a very controversial recent study came out by Junjiu Huang, a gene-function researcher at Sun Yat-sen University in Guangzhou. His team applied gene editing using CRISPR/Cas9 to embryos. Truth be told, these embryos could not result in a live birth. The intent was to cure a blood disorder in the embryos. Huang concludes that further advances need to be made, since several of the embryos were not successfully edited – but some were. It goes without saying that in the future this could lead to an ideal child with blue eyes, 6 feet in height, an IQ of 130, etc. (and graduating from Northwestern University or Georgia Tech – where I got my Ph.D). This nice article in Nature summarizes and discusses this controversial research direction.

On the other hand, deep learning in the context of artificial intelligence is all the rage these days. Everybody is warning about the danger of artificial intelligence (AI) to humanity, including extremely influential and prominent people such as Bill Gates, Elon Musk, and Stephen Hawking. Google returns at least 100 pages (and probably many more) mentioning “artificial intelligence Bill Gates Elon Musk Stephen Hawking”. As someone who knows deep learning (DL) and conducts research in this space, I believe DL and AI are very far from endangering humanity. Yes, there have been significant advances in supervised learning in specific areas (autonomous cars, scene recognition from images, answering simple factoid questions); however, these models still need a lot of training data and can solve only very narrow, specific problems. Train a scene recognition model on images of living animals and then show it a dinosaur. The answer: “Elephant.” A lot of news stories and reports today are written by computers powered by AI, but this is much more structured and easier to learn than negotiating a contract with Mr. Trump or preparing a lawyer for a trial. We are very far from computers displacing humans in such tasks.

I am not worried about AI – definitely not in my life span or that of my children – but CRISPR/Cas9 makes me much more nervous. It really means interfering with the natural process and, in the not-so-distant future, creating exceptional humans a la carte. I am convinced that without any regulations, i.e., unleashing the scientists, successful gene editing would be around the corner. I believe it is very important that experts around the globe step in and prevent further studies of gene editing on embryos. As for AI, using CRISPR/Cas9 or another yet-to-be-invented technology to artificially create a functional brain with all its neurons in a jar seems more viable and closer in time than mimicking the human brain well enough to endanger humanity with bits.

<![CDATA[Discovery vs Production]]>Mon, 14 Mar 2016 22:21:21 GMThttp://dynresmanagement.com/blog/discovery-vs-productionIt has been noted that big data technologies are predominantly used for ETL and data discovery. The former has been around for decades and is well understood, with a mature market. Data discovery is much newer and less understood. Wikipedia’s definition reads: “Data discovery is a business intelligence architecture aimed at interactive reports and explorable data from multiple sources.”

Data lakes based on Hadoop are bursting out at many companies, with the predominant purpose of data discovery from multiple (explorable) sources. It is easy to simply dump files from all over the place into a data lake, and thus the data source requirement in the definition is met. What about the part on “interactive reports”? The verb “to discover,” per dictionaries, means “to learn of, to gain sight or knowledge of,” which is quite disconnected from interactive reports. Indeed, in business, data discovery is much more aligned with the dictionary definition than with Wikipedia’s. Data discovery as used with big data and data lakes really means “to gain knowledge of data – in order to ultimately derive business value – by using explorable data from multiple sources.”

The vast majority of big data applications conduct data discovery in the sense of learning from the data. The knowledge gained does not per se provide business value, and thus such insights are operationalized separately in more established architectures (read EDW, RDBMS, BI). A good example is customer behavior derived from many data sources, e.g., transactional data, social media data, and credit performance. This clearly calls for data discovery in a data lake, with insights written into a relational database and productionalized by means of other systems used in marketing or pricing.

There are very few cases of big data solutions outside of ETL actually being used in production. Large companies directly connected with the web have successfully deployed big data technologies in production (Google for page ranking, Facebook for friend recommendations), but outside of this industry big data solutions in production are rarely observed.

It is evident that today big data is used predominantly for data discovery and not in production. I suspect that as the technologies mature further and become more self-service, the boundary will gradually shift toward production, assuming business value can be derived from such opportunities. Today big data is mostly about data discovery. The Wikipedia definition about interactive reports is for now mostly an illusion, and it is better to stick with the proper English definition of gaining knowledge.

<![CDATA[Beyond Basic Machine Learning: Modeling Loan Defaults with Advanced Techniques]]>Sat, 07 Nov 2015 04:32:24 GMThttp://dynresmanagement.com/blog/-beyond-basic-machine-learning-modeling-loan-defaults-with-advanced-techniquesDodd-Frank Act Stress Testing (DFAST) is now required also for smaller banks, those with assets of less than $50 billion. Simply stated, the test requires a bank to assess possible losses in future years from the loans on its books. A report exploring many different future economic scenarios, such as increases in unemployment rates or short-term interest rates, must be submitted to the Federal Reserve.

A loan’s status evolves in time. A loan that is current can become delinquent in the next period and then default in a future period. A different scenario would be a loan transitioning from current to delinquent and then back to current. The modeling part consists of three major components:

1. Derive the probabilities that a loan will transition from one state to another in the next period. Clearly, a loan can also stay in the same state. The probabilities depend on many attributes of the loan.

2. With these transition probabilities derived, a Markov chain model is set up. This model assesses the probability that a given loan will be in default after a certain number of periods in the future (e.g., eight quarters).

3. The expected loss of a loan is its probability of default multiplied by the approximate loss.
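The three steps above can be sketched numerically. The sketch below uses a hypothetical three-state chain (current, delinquent, default) with made-up one-quarter transition probabilities and treats default as absorbing; the probability of default after eight quarters is simply an entry of the eighth matrix power.

```python
import numpy as np

# Hypothetical one-quarter transition matrix over the states
# [current, delinquent, default]; default is absorbing.
P = np.array([
    [0.95, 0.04, 0.01],   # current    -> current / delinquent / default
    [0.30, 0.55, 0.15],   # delinquent -> ...
    [0.00, 0.00, 1.00],   # default stays default
])

def default_probability(P, start_state=0, periods=8):
    """Probability of being in default after `periods` quarters,
    starting from `start_state` (0 = current)."""
    Pn = np.linalg.matrix_power(P, periods)
    return Pn[start_state, -1]

p_default = default_probability(P, periods=8)
expected_loss = p_default * 10_000.0   # step 3: times an approximate loss amount
```

In practice the matrix P would differ per loan segment, since the transition probabilities depend on the loan’s attributes.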

The most challenging part is the derivation of the probabilities. On paper this is an easy machine learning classification problem. Consider the case of current loans transitioning to delinquent. First, for each historical loan a set of attributes (features in machine learning parlance) needs to be assembled. Candidate features are the loan duration, the location of the borrower, the loan type, etc. Next, a subset of historical loans that transitioned from current to delinquent and a subset of loans that remained current are selected. The selected loans form the training data set. The third step is to set up models, and lastly to evaluate them using techniques such as 10-fold cross-validation.

Typical challenges of data sourcing and cleansing consume a big chunk of the time. Several banks do not have enough historical data from their own operations and thus have to procure external data sets (and make sure that the loans in those external data sets resemble their own).

The classification task at hand is to classify whether a given loan will become delinquent in the next time period. As such, it is a textbook binary classification problem. There are two hidden challenges. The first is that the training set is heavily imbalanced, since many more loans stay current than transition to delinquent. It is known that in such cases special techniques must be employed. Equally important is the fact that there are many possible features, in excess of one hundred.
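One standard technique for the imbalance is to undersample the majority class before calibration. A minimal sketch (the ratio, data sizes, and function name are illustrative, not from the original projects):

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, ratio=1.0, rng=rng):
    """Keep all minority rows (y == 1, the delinquent transitions) and a
    random subset of majority rows (y == 0, loans staying current) so that
    majority:minority is at most `ratio`."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_keep = min(len(majority), int(ratio * len(minority)))
    keep = rng.choice(majority, size=n_keep, replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Toy imbalanced set: 990 loans stay current, 10 become delinquent.
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:10] = 1
Xb, yb = undersample(X, y)
```

Alternatives such as per-record weights in the loss function achieve a similar effect without discarding data.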

We have been engaged in a few DFAST-related projects and have faced the feature selection challenge. We started with the standard techniques of principal component analysis (PCA) and information gain (the so-called maximum relevance minimum redundancy algorithm). Either technique reduced the feature space, and then classification models (logistic regression, support vector machines, random forests) were evaluated.

We have tackled a few problems requiring deep learning, a technique well suited for complex problems. While the DFAST problem is not as complex as, for example, recognizing from an image whether a pedestrian is about to cross a road, we decided to try the Restricted Boltzmann Machine (RBM) as a technique to reduce the feature space. In an RBM, an input (visible) vector is fed to the model and mapped, through the notion of an energy function, into a lower-dimensional hidden vector, which is then lifted back to the original feature space. The goal is to tune the parameters (b, c, W) in the energy function so that the recovered vector comes close, in probability, to the original vector.

The entire classification is then performed by feeding an input vector to the RBM to compute the hidden vector, which is then classified as delinquent or not. [This flow actually follows the paradigm of deep belief networks, which typically include more than one hidden layer.]
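The two RBM mappings described above have a simple closed form once the parameters (b, c, W) are fixed. The numpy sketch below shows only those conditional probabilities (hypothetical dimensions, randomly initialized weights); training the parameters, e.g., by contrastive divergence, is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dimensions: 100 input features mapped to 20 hidden units.
rng = np.random.default_rng(42)
W = rng.normal(scale=0.1, size=(20, 100))  # weight matrix
b = np.zeros(100)                          # visible bias
c = np.zeros(20)                           # hidden bias

def hidden(v):
    """P(h = 1 | v): the lower-dimensional representation that is then
    fed to the downstream classifier."""
    return sigmoid(c + W @ v)

def reconstruct(h):
    """P(v = 1 | h): lifting the hidden vector back to the input space;
    training tunes (b, c, W) so this comes close to the original v."""
    return sigmoid(b + W.T @ h)

v = rng.integers(0, 2, size=100).astype(float)
h = hidden(v)
v_rec = reconstruct(h)
```

After training, `hidden(v)` replaces the raw feature vector as input to logistic regression or another classifier.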

To our surprise, the RBM-based classification model outperformed the wide variety of other traditional feature selection and classification models. The improvement was drastic: the F1 score jumped from 0.13 to 0.5. This was a very nice ‘exercise’ that is hard to push to end users who have heard of logistic regression and PCA, and might even know how they work, but would be very uncomfortable using something called RBM (though they would be much more receptive if the acronym stood for role-based management).

<![CDATA[Improvements to large-scale machine learning in Spark]]>Tue, 28 Apr 2015 13:10:36 GMThttp://dynresmanagement.com/blog/improvements-to-large-scale-machine-learning-in-spark One of the biggest hassles in MapReduce is model calibration for machine learning models such as logistic regression and SVM. These algorithms are based on gradient optimization and require iteratively computing the gradient and in turn updating the weights. MapReduce is ill suited for this, since in each iteration the data has to be read from HDFS and there is a significant cost to starting and winding down a MapReduce job.

On the other hand, Spark, with its capability to persist RDDs (resilient distributed datasets) in memory and its native dataflow capabilities, is a great candidate for efficient calibration on RDDs.

Gradient-based algorithms on distributed data sets rely on the paradigm of solving the optimization problem on each partition and then combining the solutions. We implemented three algorithms in Scala.

1. Iterative parameter averaging (IPA): On each partition, a single pass of the standard gradient algorithm is performed, which produces weights. The weights from each partition are then averaged and form the initial weights for the next pass. The pseudo code is provided below.

Initialize w
Repeat until converged:
    Broadcast w to each partition
    weightRDD = for each partition in RDD inputData:
        wp = w
        perform a single gradient descent pass over the records
            in the partition, iteratively updating wp
        return wp
    /* weightRDD is the RDD storing the new weights */
    w = average of all weights in weightRDD
Return w

The key is to keep the RDD inputData in memory (persist it before calling IPA).
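The pseudo code above can be simulated without Spark by treating a list of data chunks as the “partitions.” This is a plain-numpy sketch of IPA for logistic regression, not the Scala implementation from the repository; the toy data and learning rate are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_pass(X, y, w, lr=0.1):
    """One SGD pass over a single 'partition', starting from the broadcast w."""
    wp = w.copy()
    for xi, yi in zip(X, y):
        grad = (sigmoid(xi @ wp) - yi) * xi   # logistic loss gradient
        wp -= lr * grad
    return wp

def ipa(partitions, n_features, n_iter=20):
    w = np.zeros(n_features)
    for _ in range(n_iter):
        # "broadcast" w, run one local pass per partition, then average
        local = [local_pass(X, y, w) for X, y in partitions]
        w = np.mean(local, axis=0)
    return w

# Toy data generated from known weights, split into 4 "partitions".
rng = np.random.default_rng(1)
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(400, 3))
y = (sigmoid(X @ w_true) > rng.random(400)).astype(float)
parts = [(X[i::4], y[i::4]) for i in range(4)]
w_hat = ipa(parts, 3)
```

In Spark the inner list comprehension becomes a `mapPartitions` over the persisted input RDD, and the average is computed at the driver.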

2.      Alternating direction method of multipliers (ADMM): http://stanford.edu/~boyd/admm.html

This method is based on the concept of the augmented Lagrangian. In each iteration, for each partition, the calibration model is solved on the records pertaining to that partition. The objective function is altered: it consists of the standard loss plus a penalty term driving the weights to resemble the average weights. One also needs to solve an extra regularization problem with penalties. For the L2 and L1 norms this problem has a closed-form solution.
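The closed-form solutions mentioned for the extra regularization problem are the standard proximal operators. A sketch (in generic ADMM notation, with an illustrative input vector):

```python
import numpy as np

def prox_l1(v, t):
    """Closed-form solution of argmin_z  t*||z||_1 + 0.5*||z - v||^2,
    i.e., elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l2(v, t):
    """Closed-form solution of argmin_z  0.5*t*||z||^2 + 0.5*||z - v||^2,
    i.e., simple shrinkage toward zero."""
    return v / (1.0 + t)

v = np.array([3.0, -0.5, 1.0])
z1 = prox_l1(v, 1.0)
z2 = prox_l2(v, 1.0)
```

Within ADMM, `v` would be the averaged weights adjusted by the scaled dual variables, and `t` the regularization strength divided by the penalty parameter.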

After each partition computes its weights, they are averaged and the penalty terms are adjusted. Each partition has its own set of weights.

Since the algorithm is complex, we do not provide the pseudo code. The bulk of it is actually very similar to IPA; however, there is additional work performed by the driver.

One challenge – an inefficiency in Spark, or ‘we do not know how to do it in Spark’ – is the inability to send particular data (in our case the penalties) to a particular actor working on a partition. Instead, we had to broadcast to all actors, and then during the processing of a partition only the relevant penalties that had been broadcast are used. The main issue here is that the penalties for all partitions have to be held in memory at the driver. For very large-scale RDDs with many features this will be a bottleneck.

3. Progressive hedging (PH): This is very similar to ADMM. The regularization subproblem has a different form than in ADMM, but it still exhibits closed-form solutions for the L2 and L1 norms.

The implementations, together with test code, are available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-ml-optimization

Below is a comparison in Spark on 4 CPUs, each with 8 cores, for two large data sets. IPA is a clear winner, with the default Spark SGD being the worst algorithm.

<![CDATA[CSV to Spark SQL tables]]>Fri, 20 Mar 2015 19:46:38 GMThttp://dynresmanagement.com/blog/csv-to-spark-sql-tablesRecently we were involved in a project that required reading and importing more than 30 csv files into Spark SQL. We started writing Scala code to import the files one by one, but we soon realized that there is substantial repetition.

As a result we created a helper object that takes as input information about the csv files and automatically creates a schema per file. Each csv file must have a header, which dictates the names of the columns in the corresponding table. The user customizes an object where the details are listed per file. For example, for each file the user can specify:

  • The name of the schema
  • Delimiter
  • A possible formatter function that takes a string value from the input csv file and returns the corresponding Scala object used in the schemaRDD (by default, i.e., if the formatter is not specified, strings are assumed).
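The per-file configuration idea can be illustrated with a small Python sketch (the file name, dictionary layout, and column names are hypothetical; the actual helper is in Scala at the repository below):

```python
import csv
import io

# Hypothetical per-file configuration resembling the helper object:
# schema name, delimiter, and optional per-column formatter functions.
FILES = {
    "loans.csv": {
        "schema": "loans",
        "delimiter": ",",
        "formatters": {"amount": float, "term": int},  # default: str
    },
}

def load_csv(text, config):
    """Read a headered CSV and apply the configured formatter per column;
    unlisted columns stay strings."""
    reader = csv.DictReader(io.StringIO(text), delimiter=config["delimiter"])
    fmt = config.get("formatters", {})
    return [{k: fmt.get(k, str)(v) for k, v in row.items()} for row in reader]

rows = load_csv("id,amount,term\nA1,100.5,36\n", FILES["loans.csv"])
```

In the Scala version, each row is converted into a schemaRDD record and registered as a table under the configured schema name.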

The code is available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-csv2sql.git and as a package on www.spark-packages.org

<![CDATA[Information Gain based feature selection in Spark’s MLlib]]>Mon, 16 Feb 2015 19:49:07 GMThttp://dynresmanagement.com/blog/information-gain-based-feature-selection-in-sparks-mllibWe recently worked on a project that included web data and other customer-specific information. It was a propensity-analysis-type project where recommendations were required for each individual client based on his or her past actions on the web. Each recommended item has many features, and clients belong to organizations, which creates interactions among them.

Recommendations should be personalized and take into account linkage through organizations. To handle the latter, we chose to use organization-related features as features in the model (instead of a possible alternative approach of having organization-level and individual customer-level models that are then combined).

These characteristics led to a personalized model for each client, with each model having more than 2,000 features. To avoid overfitting, feature selection had to be performed at the customer level and thus, due to the large number of customers, in an automated fashion.

A possible line of attack is to apply PCA to each model. The problem with this approach is that the resulting features are linear combinations of the original features. The project managers were consistently asking which features are important and what the importance of each feature is (think weights in logistic regression). For this reason PCA was not a viable option.

Instead we opted for feature selection based on information gain. The concept is based on the information gain between a feature vector and the label vector. The goal is to select a subset of features that maximizes the information gain between them and the label (revealing as much information about the labels as possible) while minimizing the information gain among the features themselves (minimizing the redundancy of the features). Formally, the goal is to find a set S maximizing (1/|S|) * sum over f in S of ig(f, label) minus (1/|S|^2) * sum over pairs f, g in S of ig(f, g), where ig is the information gain between two vectors. It is common to solve this problem, called maximum relevance minimum redundancy, greedily.
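The greedy scheme can be sketched in a few lines for discrete features: at each step pick the feature whose relevance to the label, minus its average redundancy with the already selected features, is largest. This is a simplified single-machine sketch with a toy data set, not the Spark package itself.

```python
import numpy as np

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(x, y):
    """ig(X, Y) = H(Y) - H(Y | X) for discrete vectors."""
    total = entropy(y)
    for v in np.unique(x):
        mask = x == v
        total -= mask.mean() * entropy(y[mask])
    return total

def mrmr(X, y, k):
    """Greedy maximum relevance minimum redundancy selection of k columns."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        def score(j):
            relevance = info_gain(X[:, j], y)
            redundancy = (np.mean([info_gain(X[:, j], X[:, s]) for s in selected])
                          if selected else 0.0)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
# Column 0 equals the label, column 1 is constant, column 2 is redundant with column 0.
X = np.column_stack([y, np.zeros(8, dtype=int), 1 - y])
picked = mrmr(X, y, 2)
```

Note that after selecting column 0, the redundant column 2 scores no better than the useless column 1, which is exactly the redundancy penalty at work.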

Due to the large amount of data across all customers, we implemented the entire system in Spark. Spark’s MLlib unfortunately offers feature selection based only on PCA, not on information gain. For this reason we implemented our own package. The package is available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-ml.git for public perusal.

For our purposes we also had to customize MLlib with other extensions (which are rather short and thus not put in a public repository).

Our data was heavily imbalanced (one class represented less than 1% of the records). We undersampled the large class, but we also used different weights for different records in the model calibration phase. This additionally allowed us to put different weights on different customer events (for example, a misclassification of a purchase event has more weight than a misclassification of an item that the user recommended to someone else in the same organization). We achieved this by subclassing LabeledPoint with a weight for the record. We also had to customize the classes for gradient computation (in logistic regression and SVM, since these were the most competitive models) in order to account for the weights.

The second enhancement was the use of a quadratic kernel in SVM and logistic regression. We implemented the quadratic kernel since it yields a linear model in a finite higher-dimensional space, and thus we could reuse much of the linear machinery in MLlib. To this end, we created a class that extends the input RDD with the kernel function and then exposes the standard API of linear models. We also extended the model class to automatically apply the same kernel function when a test vector is supplied for prediction.

Overall, Spark definitely met the needs of this project. It is robust and definitely scalable. On the downside, our problem was not a textbook model and thus needed several enhancements to MLlib.

<![CDATA[HOW CAN AIRLINES USE BIG DATA TO BETTER MATCH SUPPLY AND DEMAND]]>Wed, 28 Jan 2015 18:43:11 GMThttp://dynresmanagement.com/blog/how-can-airlines-use-big-data-to-better-match-supply-and-demandAirlines were one of the first industries to use advanced analytics in areas such as revenue management. “Should a request for a seat be granted?” is a fundamental challenge in the industry. If it is granted, money can be left on the table, since the next day a highly valued business passenger might call, willing to pay much more for the seat. If it is declined, the business passenger or any other potential passenger might never ask for the seat, and it would remain unsold. The load factor (number of sold seats over all available seats) has always been a very important metric for profitability. The load factor can be increased by improved forecasting of future demand or by more appealing offerings (itineraries more closely matching the needs of passengers).

Traditionally, airlines have recorded bookings, i.e., passengers actually buying an itinerary. Booking data is the foundation of revenue management systems, which base their forecasting on it. In addition, airlines use booking data to estimate market sizes (a market is a city-to-city pair), set frequency (how often to fly in a market), and tailor the actual itineraries in a given market (flying non-stop or offering service through a connection).

With the advent of the internet as a distribution channel, airlines today can store not only the actual bookings but also the itineraries offered to the customer. In addition to recording the booked itinerary, all itineraries on the ‘screen,’ i.e., presented to the customer – also called the choice set – are stored in a database. The request (filters specified on the page), the actual booking, and the choice set are all stored and linked. The first challenge here is to store the data: its size increases twentyfold if the choice set has twenty itineraries. After storing the data in json files, the immediate next challenge is how to analyze it.

Bookings with choice sets can be used in discrete choice models. These models predict the likelihood of an itinerary being selected given a choice set. They are based on the notion of a utility function, which is a linear combination of features. The model assumes that customers always select the itinerary that maximizes their utility. Typical features are the elapsed time of an itinerary, how far the departure and arrival times are from the requested values, price, class and cabin (economy vs business), and the aircraft type (regional jet vs narrow- or wide-body aircraft). The utility coefficients are calibrated using the standard maximum likelihood objective from machine learning.
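Under the standard logit form of such a model, the choice probabilities and the likelihood being maximized look as follows. The feature values and coefficients in this sketch are made up for illustration; the features are, say, elapsed hours and price in hundreds of dollars.

```python
import numpy as np

def choice_probabilities(F, beta):
    """F: (n_itineraries, n_features) feature matrix of one choice set.
    Utility is linear in the features; probabilities follow the
    multinomial logit form."""
    u = F @ beta
    e = np.exp(u - u.max())          # numerically stabilized softmax
    return e / e.sum()

def log_likelihood(choice_sets, chosen, beta):
    """Sum of log probabilities of the itineraries actually booked;
    calibration maximizes this over beta."""
    return sum(np.log(choice_probabilities(F, beta)[i])
               for F, i in zip(choice_sets, chosen))

# Toy choice set of 3 itineraries, features = [elapsed hours, price/100].
F = np.array([[2.0, 3.0],
              [5.0, 1.5],
              [3.0, 2.0]])
beta = np.array([-0.5, -0.8])        # hypothetical disutility weights
p = choice_probabilities(F, beta)
ll = log_likelihood([F], [2], beta)
```

Here both longer itineraries and higher prices reduce utility, so the short, moderately priced third itinerary gets the highest predicted probability.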

Due to the sheer size of the data and its multi-structural nature – e.g., customer preferences are specified as ‘lists’ – the records are typically stored as json files and analyzed in Hadoop.

The features mentioned above are extracted from the fields using Pig. This scripting language is well suited to sifting through all records and fields in json in a very efficient manner by exploiting the concurrency of MapReduce. Hive can also be used for this data engineering step.

After extracting all features in Pig or Hive, the discrete choice model has to be calibrated by solving an optimization problem. This can be achieved either with Spark as part of Hadoop, or by loading the extracted feature matrix into R or Python. The former, due to its distributed nature, can handle more records than memory-bound R or Python; the implication is that Spark will be able to handle more booking requests and the underlying choice sets. All of these tools have built-in functions to solve the maximum likelihood optimization model.

Such large-scale discrete choice models should drastically improve airline planning, in particular market-driven decisions including market presence. The models take into account not only the final decision of a customer – the booking – but also the customer’s choice set.

Outside of the airline industry, discrete choice from itineraries has many use cases for online travel agencies (OTAs) such as Orbitz and Travelocity, and for providers of global distribution systems – Sabre Holdings and Amadeus.

<![CDATA[CUSTOMER BEHAVIOR FROM WEB AND TEXT DATA]]>Wed, 21 Jan 2015 15:09:09 GMThttp://dynresmanagement.com/blog/customer-behavior-from-web-and-text-dataMany sites and portals offer text content on web pages. For example, news aggregators such as The Huffington Post or Google News allow users to browse news stories; membership-based portals focusing on a specific industry, e.g., constructionsupport.terex.com for construction, offer members a one-stop page for the latest and greatest updates in a particular domain; in the service domain, DirectEmployers provides the site www.my.jobs with job listings for members to explore. A challenge faced by these site providers is to distinguish users that simply browse the site from those that are actively searching with an end goal; for DirectEmployers this means distinguishing users that actively seek a job from those only exploring the portal. The former can then be targeted with marketing campaigns to provide higher business value.

While traditionally this could be accomplished through web analytics by following page views, without considering the actual textual content on the pages, this is no longer satisfactory, because modern sites use HTML5, which enables the collection of users’ interactions with the textual content. By recording user clicks in JavaScript, new data streams collect and combine user ids with click streams and the text content viewed. For example, DirectEmployers records the user id and the job description viewed. This should conceptually enable the company to identify which users are merely browsing the portal and which are actively searching for a job.

In order to achieve this, relevant information needs to be extracted from each text description; next, a measure of proximity between two such extracted representations is needed; and in the end a single ‘dispersion’ metric is computed for each user. The higher the metric, the more exploratory the user’s behavior is. The workflow requires substantial data science and engineering using several tools.

Hadoop’s schema-on-read is a well-suited framework for the bulk of the analysis. Its easy-to-load concept makes it adequate to simply dump textual descriptions, click data, and user information to the filesystem.

To extract relevant information from each text description, Latent Dirichlet Allocation (LDA) can be performed. The process starts by removing stop words from the text, which is easily accomplished in Hadoop’s MapReduce paradigm. Instead of using raw Java, the scripting language Pig can be used in combination with user-defined functions (UDFs) to accomplish this task in a few lines of code. Next, the document-term matrix is constructed. This is again simple to perform in Pig with a single pass through the text descriptions, fully exploiting concurrency.

LDA, which takes the document-term matrix as input, is hard to execute in Hadoop’s MapReduce framework, so it is more common to export the matrix and perform LDA in either R or Python, since both offer excellent support for it. The resulting topics, mapped back to the original text content, can then be exported back to Hadoop for the subsequent steps.

The calculation of distances between text descriptions, based on the topics provided by LDA, can be efficiently executed in Hadoop using a self-join in Pig with the help of UDFs. Finally, the score for each user is computed by joining user data with clicks and the pair-wise distances of text descriptions.
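The final per-user score can be sketched as the mean pairwise distance between the topic distributions of the documents a user clicked on. The distance measure and the toy topic vectors below are illustrative choices, not the project’s exact metric.

```python
import numpy as np
from itertools import combinations

def topic_distance(t1, t2):
    """A simple proximity measure between two LDA topic distributions
    (total variation distance here; other choices are possible)."""
    return 0.5 * float(np.abs(t1 - t2).sum())

def dispersion(viewed_topics):
    """Mean pairwise distance over the documents a user clicked on.
    Low -> focused behavior (e.g., an active job search);
    high -> exploratory browsing."""
    pairs = list(combinations(viewed_topics, 2))
    if not pairs:
        return 0.0
    return float(np.mean([topic_distance(a, b) for a, b in pairs]))

# A focused user keeps viewing documents on one topic; a browsing user jumps around.
focused = [np.array([0.9, 0.1, 0.0])] * 3
browsing = [np.array([0.9, 0.1, 0.0]),
            np.array([0.1, 0.8, 0.1]),
            np.array([0.0, 0.1, 0.9])]
```

In the Hadoop workflow, the pairwise distances come from the Pig self-join, and this aggregation is the final join with the click and user data.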

All of these steps can be accomplished with Pig (select steps are more elegant in Hive), with only a limited amount of Java code hidden in UDFs and the assistance of R or Python.

Without Hadoop’s capability to handle the size and variety of the data, this analysis would be confined to user clicks only, and thus the value provided would be very limited.