Diego Klabjan

HOW CAN AIRLINES USE BIG DATA TO BETTER MATCH SUPPLY AND DEMAND

1/28/2015


 
Airlines were among the first industries to use advanced analytics, in areas such as revenue management. “Should a request for a seat be granted?” is a fundamental challenge in the industry. If the request is granted, money can be left on the table, since the next day a highly valued business passenger might call willing to pay much more for the seat. If it is declined, that business passenger and every other potential passenger might never ask for the seat, and it would remain unsold. The load factor (the number of sold seats over all available seats) has always been a very important metric for profitability. It can be increased by improved forecasting of future demand or by more appealing offerings (itineraries more closely matching the needs of passengers).

Traditionally, airlines have recorded bookings, i.e., passengers actually buying an itinerary. Booking data is the foundation of revenue management systems, which base their forecasts on it. In addition, airlines use booking data to estimate market sizes (a market is a city-to-city pair), to set frequency (how often to fly in a market), and to tailor the actual itineraries offered in a given market (flying non-stop or offering service through a connection).

With the advent of the internet as a distribution channel, airlines can today store not only the actual bookings but also the itineraries offered to the customer. In addition to recording the booked itinerary, all itineraries on the ‘screen,’ i.e., presented to the customer (also called the choice set), are stored in a database. The request (the filters specified on the page), the actual booking, and the choice set are all stored and linked. The first challenge is simply storing the data: its size increases twentyfold if each choice set has twenty itineraries. After storing the data in JSON files, the immediate next challenge is how to analyze it.
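
As a rough illustration of how one such record might be linked, here is a minimal sketch in Python; the field names and values are hypothetical, not an actual airline schema.

```python
import json

# One stored record: the request, the itineraries shown on the screen
# (the choice set), and the booking, kept together in a single document.
record = {
    "request": {"origin": "ORD", "destination": "LGA",
                "date": "2015-01-28", "cabin": "economy"},
    "choice_set": [
        {"itinerary_id": 1, "elapsed_minutes": 135, "stops": 0, "fare": 420.0},
        {"itinerary_id": 2, "elapsed_minutes": 310, "stops": 1, "fare": 255.0},
        # ... one entry per itinerary presented to the customer
    ],
    "booking": {"itinerary_id": 2},  # the itinerary actually purchased, if any
}

print(json.dumps(record))  # one JSON document per request
```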

Bookings with choice sets can be used in discrete choice models. These models predict the likelihood of an itinerary being selected given a choice set. They are based on the notion of a utility function, which is a linear combination of features; the model assumes that customers always select the itinerary that maximizes their utility. Typical features are the elapsed time of an itinerary, how far the departure and arrival times are from the requested values, price, class and cabin (economy vs. business), and the aircraft type (regional jet vs. narrow- or wide-body aircraft). The utility coefficients are calibrated using the standard maximum likelihood objective from machine learning.
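
A minimal sketch of the model in Python: X is assumed to hold one row of features per itinerary in a single choice set, and beta the utility coefficients.

```python
import numpy as np

def choice_probabilities(X, beta):
    """Multinomial (conditional) logit probabilities for one choice set."""
    utilities = X @ beta                        # linear utility of each itinerary
    expu = np.exp(utilities - utilities.max())  # subtract max for numerical stability
    return expu / expu.sum()                    # probability of each itinerary being chosen
```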

Due to the sheer size and multi-structured nature of the data (customer preferences, for example, are specified as ‘lists’), it is typically stored as JSON files and analyzed in Hadoop.

The features mentioned above are extracted from the raw fields by using Pig. This scripting language is well suited to sifting through all of the records and fields in JSON very efficiently by exploiting the concurrency of map reduce. Hive can also be used for this data engineering step.
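
For illustration, here is the same feature extraction expressed in Python (for example as it might run inside a UDF); all field names are hypothetical placeholders, not a real airline schema.

```python
from datetime import datetime

def itinerary_features(itinerary, request):
    """Turn one itinerary from a choice set into a numeric feature vector."""
    requested = datetime.fromisoformat(request["departure_requested"])
    actual = datetime.fromisoformat(itinerary["departure"])
    return [
        itinerary["elapsed_minutes"],                          # total travel time
        abs((actual - requested).total_seconds()) / 60.0,      # departure-time displacement
        itinerary["fare"],                                     # price
        1.0 if itinerary["cabin"] == "business" else 0.0,      # cabin indicator
        1.0 if itinerary["aircraft"] == "regional" else 0.0,   # regional-jet indicator
    ]
```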

After extracting all features in Pig or Hive, the discrete choice model has to be calibrated by solving an optimization problem. This can be achieved either with Spark as part of Hadoop, or by loading the extracted feature matrix into R or Python. Due to its distributed nature, the former can handle more records than memory-bound R or Python, which means Spark can accommodate more booking requests and their underlying choice sets. All of these tools have built-in functions to solve the maximum likelihood optimization model.
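
As a small-scale sketch of the calibration step, the utility coefficients can be fit by minimizing the negative log-likelihood with a general-purpose optimizer; at full scale this would instead run in Spark or a dedicated discrete choice package.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, choice_sets, chosen):
    """choice_sets: list of feature matrices, one per request;
    chosen: index of the booked itinerary within each choice set."""
    total = 0.0
    for X, k in zip(choice_sets, chosen):
        u = X @ beta
        total += u[k] - np.logaddexp.reduce(u)   # log P(booked itinerary)
    return -total

def fit_utility_coefficients(choice_sets, chosen, n_features):
    result = minimize(neg_log_likelihood, np.zeros(n_features),
                      args=(choice_sets, chosen), method="BFGS")
    return result.x   # calibrated utility coefficients
```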

Such large-scale discrete choice models should drastically improve airline planning, in particular market-driven decisions such as market presence. The models take into account not only the final decision of a customer – the booking – but also the customer’s choice set.

Outside of the airlines themselves, discrete choice from itineraries has many use cases for online travel agencies (OTAs) such as Orbitz and Travelocity, and for providers of global distribution systems – Sabre Holdings, Amadeus.


CUSTOMER BEHAVIOR FROM WEB AND TEXT DATA

1/21/2015


 
Many sites and portals offer text content on web pages. For example, news aggregators such as The Huffington Post or Google News allow users to browse news stories; membership-based portals focusing on a specific industry, e.g., constructionsupport.terex.com for construction, offer members a one-stop page for the latest and greatest updates in a particular domain; in the service domain, DirectEmployers provides the site www.my.jobs with job listings for members to explore. A challenge faced by these site providers is to distinguish users that simply browse the site from those that are actively searching with an end goal; for DirectEmployers this means distinguishing users that actively seek a job from those only exploring the portal. The active seekers can then be targeted with marketing campaigns to provide higher business value.

Traditionally this has been accomplished through web analytics that follow page views without considering the actual textual content on pages. That is no longer satisfactory, because modern sites use HTML5, which enables collection of data about users’ interactions with the textual content. By recording user clicks in JavaScript, new data streams collect and combine user ids with click streams and the text content viewed. For example, DirectEmployers records the user id and the job description viewed. This should conceptually enable the company to identify which users are merely browsing the portal and which are actively searching for a job.

To achieve this, relevant information first needs to be extracted from each text description; next, a measure of proximity between two extracted representations is needed; and in the end a single ‘dispersion’ metric is computed for each user. The higher the metric, the more exploratory the user’s behavior. The workflow requires substantial data science and engineering using several tools.

Hadoop’s schema-on-read approach is a well-suited framework for the bulk of the analysis. Its easy-to-load concept makes it adequate for simply dumping textual descriptions, click data, and user information to the filesystem.

To form relevant information from each text description, Latent Dirichlet Allocation (LDA) can be performed. The process starts by removing stop words from the text, which is easily accomplished in Hadoop’s map reduce paradigm. Instead of writing raw Java, the scripting language Pig can be used in combination with user defined functions (UDFs) to accomplish this task in a few lines of code. Next, the document-term matrix is constructed. This is again simple to perform in Pig with a single pass through the text descriptions, fully exploiting concurrency.
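
As a small, in-memory illustration of the same step (in the workflow above it runs as a Pig script over HDFS), stop-word removal and the document-term matrix can be sketched in Python; the descriptions below are made-up placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    "senior mechanical engineer wind turbine maintenance",
    "entry level retail sales associate",
    # ... one entry per job description
]
vectorizer = CountVectorizer(stop_words="english")   # built-in stop-word removal
doc_term = vectorizer.fit_transform(descriptions)    # sparse document-term matrix
print(doc_term.shape, len(vectorizer.vocabulary_))
```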

LDA, which takes the document-term matrix as input, is hard to execute in Hadoop’s map reduce framework, so it is more common to export the matrix and perform LDA in either R or Python, since both offer excellent support for it. The resulting topics, mapped back to the original text content, can then be exported back to Hadoop for the subsequent steps.
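
Continuing from the document-term matrix sketched above, a minimal Python version of this step could look as follows; the number of topics is an illustrative choice.

```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(doc_term)   # one topic-weight vector per description
```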

The calculation of distances between text descriptions based on the topics provided by LDA can be efficiently executed in Hadoop by using a self-join in Pig with the help of UDFs. Finally, the score for each user is computed by joining the user data with clicks and the pair-wise distances of text descriptions.
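
A sketch of the dispersion metric itself, under the assumption that it is the average pairwise distance between the topic vectors of the descriptions a user clicked on; doc_topics comes from the LDA step above, and user_clicks is a hypothetical mapping from user ids to viewed description indices assembled from the click stream.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def dispersion(viewed_ids, doc_topics):
    """Average pairwise distance between topic vectors of viewed descriptions."""
    pairs = [(i, j) for i in viewed_ids for j in viewed_ids if i < j]
    if not pairs:
        return 0.0   # a single click gives no evidence of exploration
    return float(np.mean([jensenshannon(doc_topics[i], doc_topics[j])
                          for i, j in pairs]))

user_clicks = {"user_a": [0, 1]}   # hypothetical: user_a viewed descriptions 0 and 1
user_scores = {user: dispersion(viewed, doc_topics)
               for user, viewed in user_clicks.items()}
```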

All of these steps can be accomplished in Pig (and select steps are more elegant in Hive) with only a limited amount of Java code hidden in UDFs and the assistance of R or Python.

Without Hadoop’s capability to handle the size and variety of the data, this analysis would be confined to user clicks alone, and the value provided would be very limited.


SPARK AND IN-MEMORY DATABASES: TACHYON LEADING THE PACK

1/7/2015


 
The biggest gripes about Hadoop are its batch-processing focus and the fact that iterative algorithms cannot be written efficiently. For this reason it is mostly used in data lakes for storing huge datasets, together with its ETL/ELT capabilities, and for running ad-hoc queries with map reduce.

In-memory databases, on the other hand, offer great response times but are limited in their capacity by physical memory. The market is embracing several solutions, from HANA by SAP to VoltDB, MemSQL, Redis, and others.

Then came Spark with its brilliant idea of resilient distributed datasets (RDDs), which mimic map reduce while holding the data in (persistent) cache. While a single map reduce pass is not much faster in Spark than in Hadoop’s map reduce, algorithms iterating on the same dataset are far more efficient, since the data stays in the memory cache for continuous access across iterations.
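
A minimal PySpark sketch of why this matters: the dataset is read from disk once, cached, and every later pass of the iteration is served from memory (the path and the toy gradient step are illustrative).

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

points = (sc.textFile("hdfs:///data/points.csv")                 # read once
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())                                             # keep the RDD in memory

n = points.count()            # first action materializes the cache
m = 0.0
for _ in range(10):           # every later pass reads from the cache, not from disk
    grad = points.map(lambda p: m - p[0]).sum() / n
    m -= 0.5 * grad           # crude gradient step toward the mean of the first column
```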

Spark, being a processing framework, is not a database or a filesystem, albeit offering drivers to many databases and filesystems. Its memory-oriented cache offers great computational speed but no storage capabilities. So combining its speed with the quick access of in-memory databases is the holy grail of computational efficiency and storage.

As an example, memSQL announced a driver for Spark, so the functionality of Spark is now readily accessible on top of data residing in the memSQL in-memory database. Real-time use cases such as fraud detection are sure to benefit from the marriage of the two.

A step further is Tachyon, developed at Berkeley. It offers in-memory storage with seamless integration with Spark. If several Spark jobs access the same dataset stored in Tachyon, the dataset is not replicated but loaded only once. This is the ultimate efficiency of storage and computation.
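
A rough sketch of the idea, under the assumption that Tachyon is mounted as a Hadoop-compatible filesystem; the tachyon:// host, port, and path are illustrative, deployment-specific values.

```python
from pyspark import SparkContext

sc = SparkContext(appName="tachyon-sketch")

# Several Spark jobs can point at the same path and be served from
# Tachyon's memory instead of re-loading the dataset from disk.
shared = sc.textFile("tachyon://master:19998/datasets/bookings.json")
print(shared.count())
```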

As Hadoop will never supplant the RDBMS (at least in the foreseeable future), Spark with Tachyon (or any other in-memory database) will not make the two extinct. Huge datasets are unlikely to fit economically in memory, and thus the three roommates will continue to dance together and occasionally bump into each other.


    Diego Klabjan

    Professor at Northwestern University, Department of Industrial Engineering and Management Sciences. Founding Director, Master of Science in Analytics.
