Traditionally, airlines have recorded bookings, i.e., passengers actually purchasing an itinerary. Booking data is the foundation of revenue management systems, whose forecasts are based on it. In addition, airlines use booking data to estimate market sizes (a market is a city-to-city pair), to set frequency (how often to fly in a market), and to tailor the actual itineraries in a given market (flying non-stop or offering service through a connection).
With the advent of the internet as a distribution channel, airlines can now store not only the actual bookings but also the itineraries offered to the customer. In addition to the booked itinerary, all itineraries on the ‘screen,’ i.e., those presented to the customer and collectively called the choice set, are stored in a database. The request (filters specified on the page), the actual booking, and the choice set are all stored and linked. The first challenge is storing the data: its size increases roughly twentyfold if the choice set has twenty itineraries. Once the data is stored in JSON files, the next challenge is how to analyze it.
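To make the linked record concrete, here is a minimal sketch of what one stored entry might look like. The field names (request, choice_set, booked_itinerary_id and so on) and the sample values are illustrative assumptions, not the schema of any actual airline system.

```python
import json

# Hypothetical record layout: every field name and value below is an
# illustrative assumption, not a real airline schema.
record = {
    "request": {
        "origin": "ORD",
        "destination": "LGA",
        "requested_departure": "08:00",
        "cabin": "economy",
    },
    "choice_set": [
        {"itinerary_id": "A", "departure": "07:30",
         "elapsed_minutes": 135, "stops": 0, "price": 240.0},
        {"itinerary_id": "B", "departure": "09:15",
         "elapsed_minutes": 250, "stops": 1, "price": 180.0},
    ],
    "booked_itinerary_id": "A",
}


def serialize(rec):
    """Serialize one linked request/choice-set/booking record as a JSON line."""
    return json.dumps(rec, sort_keys=True)


def booked_itinerary(rec):
    """Return the itinerary from the choice set that was actually booked."""
    for itinerary in rec["choice_set"]:
        if itinerary["itinerary_id"] == rec["booked_itinerary_id"]:
            return itinerary
    return None  # the request ended without a booking


line = serialize(record)       # one line of a JSON file
restored = json.loads(line)    # what an analysis job would read back
```

Keeping the request, the choice set, and the booking in one record avoids a join at analysis time, at the cost of storing every offered itinerary rather than only the booked one.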
Bookings with choice sets can be used in discrete choice models. These models predict the likelihood of an itinerary being selected given a choice set. They are based on the notion of a utility function, which is a linear combination of features. The model assumes that each customer selects the itinerary that maximizes their utility; because part of that utility is not observed, the model yields selection probabilities rather than a certain choice. Typical features are the elapsed time of an itinerary, how far the departure and arrival times are from the requested values, price, class and cabin (economy vs business), and the aircraft type (regional jet vs narrow- or wide-body aircraft). The utility coefficients are calibrated by maximizing the likelihood of the observed bookings, the standard maximum likelihood objective in machine learning.
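As an illustration, the sketch below computes selection probabilities from a linear utility. It assumes the multinomial logit form, a common choice for such models, although the text does not name a specific model. The feature names and coefficient values are made up for the example.

```python
import math


def utility(features, coefficients):
    """Linear utility: sum of coefficient * feature value."""
    return sum(coefficients[name] * value for name, value in features.items())


def choice_probabilities(choice_set, coefficients):
    """Multinomial logit probabilities for each itinerary in a choice set.

    Assumption: the logit form exp(u_i) / sum_j exp(u_j).
    """
    utilities = [utility(features, coefficients) for features in choice_set]
    shift = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(u - shift) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical coefficients: longer and more expensive itineraries are less
# attractive, non-stop service is more attractive.
coefficients = {"elapsed_hours": -0.5, "price_hundreds": -1.0, "nonstop": 0.8}

choice_set = [
    {"elapsed_hours": 2.25, "price_hundreds": 2.4, "nonstop": 1.0},
    {"elapsed_hours": 4.0, "price_hundreds": 1.8, "nonstop": 0.0},
]

probabilities = choice_probabilities(choice_set, coefficients)
```

With these made-up coefficients the non-stop itinerary has the higher utility and therefore the higher probability, but the cheaper connection still receives a non-zero share.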
Because of the sheer size of the data and its multi-structural nature (e.g., customer preferences are specified as ‘lists’), the records are typically stored as JSON files and should be analyzed in Hadoop.
The features mentioned above are extracted from the JSON fields by using Pig. This scripting language is well suited to sifting through all records and fields efficiently by exploiting the concurrency of MapReduce. Hive can also be used for this data engineering step.
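The extraction logic itself is simple. The sketch below shows it for a single record in plain Python, as a stand-in for the Pig or Hive job rather than a replacement for it. The record layout and field names are assumptions made for the example.

```python
import json


def minutes(hhmm):
    """Convert an 'HH:MM' string to minutes after midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def extract_features(json_line):
    """Turn one JSON record into one feature row per offered itinerary."""
    rec = json.loads(json_line)
    requested = minutes(rec["request"]["requested_departure"])
    rows = []
    for itinerary in rec["choice_set"]:
        rows.append({
            "itinerary_id": itinerary["itinerary_id"],
            "elapsed_minutes": itinerary["elapsed_minutes"],
            # distance of the offered departure from the requested one
            "departure_deviation_minutes":
                abs(minutes(itinerary["departure"]) - requested),
            "price": itinerary["price"],
            "nonstop": 1 if itinerary["stops"] == 0 else 0,
            # label: 1 for the booked itinerary, 0 otherwise
            "chosen": 1 if itinerary["itinerary_id"]
                      == rec["booked_itinerary_id"] else 0,
        })
    return rows


# Hypothetical input line; field names are illustrative assumptions.
sample_line = json.dumps({
    "request": {"requested_departure": "08:00"},
    "choice_set": [
        {"itinerary_id": "A", "departure": "07:30",
         "elapsed_minutes": 135, "stops": 0, "price": 240.0},
        {"itinerary_id": "B", "departure": "09:15",
         "elapsed_minutes": 250, "stops": 1, "price": 180.0},
    ],
    "booked_itinerary_id": "A",
})

rows = extract_features(sample_line)
```

Because each record is processed independently of all others, the same per-record function maps naturally onto a distributed job.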
After all features have been extracted in Pig or Hive, the discrete choice model has to be calibrated by solving an optimization problem. This can be done either with Spark as part of Hadoop, or by loading the extracted feature matrix into R or Python. Because of its distributed nature, Spark is capable of handling more records than memory-bound R or Python, and therefore more booking requests and their underlying choice sets. All these tools have built-in functions for solving the maximum likelihood optimization problem.
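To show what the calibration step computes, the following is a minimal single-machine sketch that maximizes the log-likelihood by plain gradient ascent. It assumes a multinomial logit model and uses tiny made-up data. In practice one would rely on the built-in optimizers of the tools mentioned above rather than a hand-written loop.

```python
import math


def choice_probabilities(choice_set, beta):
    """Logit probabilities for a choice set given coefficients beta."""
    utilities = [sum(b * x for b, x in zip(beta, row)) for row in choice_set]
    shift = max(utilities)  # numerical stability
    weights = [math.exp(u - shift) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(observations, beta):
    """Sum of log-probabilities of the itineraries that were booked."""
    return sum(math.log(choice_probabilities(cs, beta)[chosen])
               for cs, chosen in observations)


def gradient(observations, beta):
    """Gradient of the log-likelihood: chosen features minus expected features."""
    grad = [0.0] * len(beta)
    for cs, chosen in observations:
        probs = choice_probabilities(cs, beta)
        for k in range(len(beta)):
            expected = sum(p * row[k] for p, row in zip(probs, cs))
            grad[k] += cs[chosen][k] - expected
    return grad


def fit(observations, n_features, step=0.05, iterations=1000):
    """Fixed-step gradient ascent, starting from all-zero coefficients."""
    beta = [0.0] * n_features
    for _ in range(iterations):
        grad = gradient(observations, beta)
        beta = [b + step * g for b, g in zip(beta, grad)]
    return beta


# Made-up data: each observation is (choice set, index of booked itinerary);
# each itinerary is [elapsed_hours, price_hundreds].
observations = [
    ([[2.0, 3.0], [4.0, 2.0]], 0),
    ([[2.0, 3.0], [4.0, 2.0]], 1),
    ([[1.5, 2.5], [3.0, 2.0], [5.0, 1.5]], 0),
    ([[2.5, 2.0], [2.5, 3.0]], 0),
    ([[2.5, 2.0], [2.5, 3.0]], 1),
]

beta = fit(observations, n_features=2)
initial_ll = log_likelihood(observations, [0.0, 0.0])
fitted_ll = log_likelihood(observations, beta)
```

The gradient is a sum over independent observations, which is what makes the problem amenable to a distributed engine: each worker can compute the contribution of its own partition of booking requests.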
Such large-scale discrete choice models should substantially improve airline planning, in particular market-driven decisions such as market presence. The models take into account not only the customer’s final decision (the booking) but also the customer’s choice set.
Outside of the airline industry, discrete choice models over itineraries have many use cases at online travel agencies (OTAs) such as Orbitz and Travelocity, and at providers of global distribution systems such as Sabre Holdings and Amadeus.