All Categories

CSV to Spark SQL tables

3/20/2015

Recently we were involved in a project that required reading and importing more than 30 csv files into Spark SQL. We started writing scala code to ‘manually’ import file by file, but we soon realized that there is substantial repetition.

As a result we created a nice helper object that takes as input information about the csv files and automatically creates a schema per file. Each csv file must have a header, which dictates the name of the columns in the corresponding table. The user has to customize an object where the details are listed by file. For example, for each file the user can specify:

The name of the schema
Delimiter
A possible formatter function that takes a string value from the input csv file and returns the corresponding scala object that is used in schemaRDD (by default, i.e., if the formatter is not specified, strings are assumed).

The code is available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-csv2sql.git and as a package in www.spark-packages.org

18 Comments

Information Gain based feature selection in Spark’s MLlib

2/16/2015

70 Comments

We recently worked on a project that included web data and other customer specific information. It was a propensity analysis type project where recommendations were required for each individual client based on his or her past actions on the web. Each item recommended has many features and clients belong to organizations, which creates interactions among them.

Recommendations should be personalized and take into account linkage through organizations. To handle the latter, we selected to use organization related features as features in the model (instead of a possible alternative approach of having organization-level and individual customer-level models which are then combined).

These characteristics led to a personalized model for each client with each model having more than 2,000 features. To avoid over-fitting, feature selection had to be performed at customer level and thus due to a large number of customers in an automated fashion.

A possible line of attack is to apply PCA to each model. The problem with this approach is that the resulting features are linear combinations of original features. The project managers were consistently asking which features are important and what is the importance of each feature (think weights in logistic regression). For this reason PCA was not a viable option.

Instead we opted to go with feature selection based on information gain. The concept is based on the information gain between a feature vector and the label vector. The goal is to select a subset of features that maximize the information gain between them and the label (reveal as much information about the labels as possible) and minimize the information gain among the features themselves (minimize the redundancy of the features among themselves). The goal is to find set S that , where ig is the information gain between two vectors. It is common to solve this problem, called maximum relevance minimum redundancy, greedily.

Due to a large amount of data across all customers, we implemented the entire system in Spark. Spark’s MLlib unfortunately offers only feature selection based on PCA but not based on information gain. For this reason we implemented our own package. The package is available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-ml.git for public peruse.

For our purpose we had to also customize MLlib with other extensions (that are rather short and thus not put in a public repository).

Our model was heavily unbalanced (one class represented less than 1% of the records). We undersampled the large class but we had also used different weights for different records in the model calibration phase. This also allowed us to put different weights for different events of customers (for example, a misclassification of a purchase event has more weight than a misclassification of an item that the user recommended to someone else in the same organization). We achieved this by subclassing LabelPoint with a weight of the record. We also had to customize the classes for gradient computation (in logistic and SVM since these were the most competitive models) in order to account for weights.

The second enhancement was the use of a quadratic kernel in SVM and logistic regression. We implemented the quadratic kernel since it yields a linear model in a finite higher dimensional space and thus can reuse a lot of the linear machinery in MLlib. To this end, we created a class that extends the input rdd with the kernel function and then exposes the standard API of linear models. We have also extended the model class to automatically take into account the same kernel function when a test vector is applied for predictions.

Overall, Spark definitely met the needs of this project. It is robust and definitely scalable. On the downside, our problem was not a textbook model and thus it needed several enhancements to MLlib.

70 Comments

HOW CAN AIRLINES USE BIG DATA TO BETTER MATCH SUPPLY AND DEMAND

1/28/2015

39 Comments

Airlines have been one of the first industries to use advanced analytics in areas such as revenue management. “Should a request for a seat be granted?” is a fundamental challenge in the industry. If it is granted, the money can be left on the table since the next day a highly valued business passenger might call willing to pay much more for the seat. If it is declined, the business passengers and any other potential passenger might not call for the seat in the future and thus it would remain unsold. The load factor (number of sold seats over all available seats) has always been a very important metric for profitability. The load factor can be increased by improved forecasting of future demand or more appealing offerings (itineraries more closely matching the needs of passengers).

Traditionally the airlines have recorded bookings, i.e., passengers actually buying an itinerary. The booking data is the foundation of revenue management systems by basing forecasting on it. In addition, the airlines use booking data to estimate market sizes (a market is a city-to-city pair), frequency (how often to fly in a market), and to tailor the actual itineraries in a given market (flying non-stop in a market or offering service through a connection).

With the advent of the internet as a distribution channel, today the airlines can store not only the actual bookings, but also the available itineraries offered to the customer. In addition to recording the booked itinerary, all itineraries on the ‘screen,’ i.e., presented to the customer, also called the choice set, are stored in a database. The request (filters specified on the page), the actual booking, and the choice set are all stored and linked. The first challenge here is to store the data. The size of the data increases twenty fold if the choice set has twenty itineraries. After storing the data in json files, the immediate next challenge is how to analyze it.

Bookings with choice sets can be used in discrete choice models. These models predict the likelihood of an itinerary being selected given a choice set. They are based on the notion of the utility function, which is a linear combination of features. The model assumes that customers always select the itinerary that maximizes their utility. Typical features are the elapsed time of an itinerary, how far are the departure and arrival times from the requested values, price, class and cabin (economy vs business), and the aircraft type (regional jet vs narrow or wide body aircraft). The utility coefficients are calibrated by using the standard machine learning maximum likelihood objective.

Due to the sheer size of the data and multi-structural nature, e.g., customer preferences are specified as ‘lists,’ they are typically stored as json files and should be used in Hadoop for analyses.

The features mentioned above are extracted from fields by using Pig. This scripting language is well suited to sift through all records and fields in json in a very efficient manner by exploring concurrency of map reduce. For this data engineering step Hive can also be used.

After extracting all features in Pig or Hive, the discrete choice model has to be calibrated by means of solving an optimization problem. This can either be achieved with Spark as part of Hadoop or the extracted feature matrix can be loaded in R or Python. The former due to the distributed nature is capable to handle more records than the memory bound R or Python. The implication is that Spark will be able to handle more booking requests and the underlying choice sets. All these tools have build-in functions to solve the maximum likelihood optimization model.

Such large-scale discrete choice models should drastically improve airline planning, in particular their market-driven decisions including market presence. The models take into account not only the final decision of a customer – booking – but also the customer’s choice set.

Outside of the airline industry, discrete choice from itineraries has many use cases by online travel agencies (OTA) such as Orbitz and Travelocity, and providers of global distribution systems – Sabre Holdings, Amadeus.

39 Comments

CUSTOMER BEHAVIOR FROM WEB AND TEXT DATA

1/21/2015

11 Comments

Many sites and portals offer text content on web pages. For example, news aggregators such as The Huffington Post or Google News allow users to browse news stories; membership-based portals focusing on a specific industry, e.g., constructionsupport.terex.com for construction, offer members a one-stop page for the latest and greatest updates in a particular domain; in the service domain DirectEmployers provides site www.my.jobs with job listings for members to explore. A challenge faced by these site providers is to distinguish users that simply browse the site versus those that are actively searching with an end goal, e.g., for DirectEmployers it means distinguishing between the user that actively seek a job vs those only exploring the portal. The former can then be targeted with possible marketing campaigns to provide higher business value.

While traditionally this can be accomplished through web analytics by following page views and not considering the actual textual content on pages, this is no longer satisfactory because modern sites use the html5 technology which enables data collection of users’ interactions with the textual content. By recording user clicks in javascript, new data streams collect and combine user ids with click streams and text content viewed. For example, DirectEmployers records the user id and the job description viewed. This should conceptually enable the company to identify which user is merely browsing the portal vs users that actively search a job.

In order to achieve this, relevant information needs to be extracted from each text description, next a measure of proximity of two extracted information is needed, and in the end a single ‘dispersion’ metrics is computed for each users. The higher the metrics, the more exploratory behavior of the user is. The workflow requires substantial data science and engineering using several tools.

Hadoop’s schema on read is a well suited framework for the bulk of the analysis. Its easy to load concept makes it adequate to simply dump textual descriptions, click data, and user information to the filesystem.

To form relevant information from each text description, Latent Dirichlet Allocation (LDA) can be performed. The process starts by remove stop words from text which is easily accomplished in the Hadoop’s map reduce paradigm. Instead of using raw java, scripting language Pig can be used in combination with user defined functions (UDFs) to accomplish this task in a few lines of code. Next the document-term matrix is constructed. This is again simple to perform in Pig by a single pass through text descriptions and fully exploring concurrency.

LDA which takes the document-term matrix as input is hard to execute in Hadoop’s map reduce framework and thus it is more common to export the matrix and perform LDA in either R or Python since both offer excellent support for LDA. The resulting topics mapped back to the original text content can then be exported back to Hadoop for subsequent steps.

The calculation of distances between text descriptions based on the topics provided by LDA can be efficiently executed in Hadoop by using a self-join in Pig with help of UDFs. Finally, the score for each user is computed by joining user data with clicks and pair-wise distances of text descriptions.

All of these steps can be accomplished by Pig (and select steps are more elegant in Hive) with only a limited number of java code hidden in UDFs and assistance of R or Python.

Without the use of Hadoop’s capability of handling size and variety of data, this analysis would be confined to only user clicks and thus the value provided would be very limited.

11 Comments

SPARK AND IN-MEMORY DATABASES: TACHYON LEADING THE PACK

1/7/2015

23 Comments

The biggest grunts about Hadoop is its batch processing focus and the fact that iterative algorithms cannot be written efficiently. For this reason it is mostly used in data lakes for storing huge datasets together with its ETL/ELT capabilities and for running ad-hoc queries with map reduce.

In-memory database on the other hand offer great response time but are limited in their capacity by physical memory. The market is embracing several solutions from Hana by SAP, to VoltDB, memSql, Redis, and other.

Then came Spark with its brilliant idea of resilient distributed datasets (RDDs) which allow to mimic map reduce but holding the data in (persistent) cache. While a single map reduce process is not much faster in Spark over Hadoop’s map reduce, algorithms iterating on the same dataset are greatly more efficient since data is stored in memory cache for continuous access through iterations.

Spark being a processing framework is not a database or filesystem, albeit offering drivers to many databases and filesystems. Its memory oriented cache offers great computational speed but no storage capabilities. So combining its speed with quick access of in-memory databases is the holy grail of computational efficiency and storage.

As an example, memSQL announced a driver for Spark. Functionality of Spark is not readily accessible on top of data residing in the memSQL in-memory database. Real-time use cases such as fraud detection are sure to benefit from the marriage of the two.

A step further is Tachyon developed at Berkeley. It offers in-memory storage with a seamless integration with Spark. If several Spark jobs are accessing the same dataset stored in Tachyon, the dataset is not replicated but loaded only once. This is definitely ultimate efficiency of storage and computation.

As Hadoop will never supplant RDMS (at least in the foreseeable future), Spark with Tachyon (or any other in-memory database) will not make the two extinct. Huge data sets are unlikely to economically fit in memory and thus the three roommates will continue to dance together and occasionally bounce into each other.

23 Comments

MIXING REAL-TIME AND BATCH DATA PROCESSING

12/29/2014

26 Comments

Want to process real-time data? Web anyone? Or IoT or the industrial internet! You can fire spouts and bolts in Storm and get “whoops, one of our assembly machines is about to experience problems!”

For decades companies were able to use RDMS and data cubes to find out what revenue would be lost if an assembly machine goes down for two hours under the normal throughput. Lately Hadoop has become a de-facto standard for this step.

But what if the management wants to know what is the impact of a sputtering machine likely to break down under its current throughput? Or a plant manager wants an up-to-date health status of the machine, i.e., after the last batch data run, and the assessment requires the information from the last batch data run and all sensor readings since then? A few years ago this was utopia, but computer and data scientists knew what is brewing in some pots.

Lambda architecture enables exactly this and makes it possible (for now, with an army of data scientists) by using several technologies. This architecture requires views for batch data and a streaming process for the data arrived after the last batch run. Typical implementations use the open source stack Kafka, Hadoop, Druid, Storm, but to a certain extend Spark can also serve the purpose. An interesting perspective and experience is provided by Jay Kreps from LinkedIn (who clearly advocated using only Kafka pioneered by LinkedIn).

While successful deployments are encountered at big web giants to process web data, there are definitely use cases outside of Silicon Valley. Possible manufacturing cases are outlined above.

26 Comments

A PREREQUISITE FOR ON-TIME ANALYTICS PROJECTS

12/17/2014

17 Comments

Needless to say, there can be no analytics project without data. A project starts by identifying data, cleansing it, performing analytics, and then conveying the results or solutions. The rule of thumb is that 70 to 80 percent of the total timeframe for an analytics project is spent on data preparation and a much smaller portion to actually conducting the analytics.

There are many reasons why analytics software projects routinely miss deadlines and overrun the budget, in particular regarding the data preparation phase where leaders kick these projects off without nailing the prerequisites:

Many analytics projects begin as software engagements, and software projects are notoriously known to be delayed. Commonly the development processes lack specification, there is no source control in place, and no incremental delivery requirements.
Developers are humans with all of the usual proclivities to over-promise, overestimate skills, and give in to the pressure of management.
The inherent nature of development does not help either: bugs can be tough to fix, software libraries do not link, the implementation must work on a great variety of different computing architectures, etc.

Analytics projects are subject to all these quirks. They also rely heavily on data which make the delays even bigger and more frequent. And it is not all about data cleansing despite requiring the majority of the time. If the data is readily available, then cleansing is time consuming but doable.

Way too often in an analytics project, significant delays accumulate even before the first line of code or query has been written. It all starts with data availability and collection. It is not uncommon for companies to start analytics projects based on the belief in the potential business value of analytics, dependent on utopic data sets. The quality of data eats into the realized value, but it is even worse if the data is not even available or accessible to the team before the start of the project.

I have been involved in several projects where there is an initial assumption that the business users will provide some data and the miracles of analytics will be performed. This can happen, but only after a multi-year delay in the execution due to data not being available from the beginning. If the business users own the data, then the task of making the data accessible is not a top priority. After all, the business has been running for years without this new analytics nuisance.

After the data has been collected and passed to the analytics team, the data quality issues arise and missing data sets are identified. At this point, a back-and-forth is started with the data owners and the best outcome is that the data is ready after a significant delay in the project execution. While there is a move towards EDWs that ease data access, in the foreseeable future there will always be spreadsheets around corporations with valuable information not accessible to everyone.

Corporations that do this well can execute a project more quickly because of well-established analytics project leadership. These leaders understand that a project cannot start without the data being ready and of acceptable quality. No matter how high the perceived business value of a project is, they will not pull the trigger to start before all of the prerequisites are met. It only starts after the data is ready.

This contrast in leadership is especially pronounced when outsourcing the project development. The billable hours keep accumulating while the consultants wait for the data or work with old - or even worse, fictitious - data sets. On the other hand, the consultants also feel the pressure as they see project deadlines slipping away. Experienced leaders do not cave in to the pressure of upper management and the allure of high ROI. They do not officially start a consulting engagement before all of the data is readily available and of acceptable quality. Potential delays in the execution are now narrowed down to the general peculiarities of software projects.

17 Comments

THE CHALLENGES OF EDUCATING THE NEXT GENERATION OF DATA SCIENTISTS

12/10/2014

55 Comments

Analytics is booming, and there is a gap between the supply of skilled analysts and the high demand in the market. In the U.S., the first comprehensive analytics program was established in 2007. Prior to that, existing programs focused on the components of analytics - statistics, information technology, data mining and business intelligence. It took another four years before other U.S. schools picked up on the trend and started slowly supplying students with advanced analytics degrees.

Today, in 2012, there are still only a handful of graduate and undergraduate programs focused on analytics. Some companies, like EMC, have found the talent shortage so severe that they have created their own programs to train and certify analysts. Given the high demand for these skills, why are universities so far behind?

A decade ago, Bioengineering was a degree and skill in heavy demand, and almost immediately numerous schools started research and educational programs in that field. What is different about Analytics and Bioengineering when it comes to establishing an educational program within a university or college? The primary difference is that the area of analytics is extremely broad. It spans computer science, statistics, operations research, and business. As a result, there is a lack of focused research programs and departments in analytics, and it is challenging for a traditional academic department to offer a program in analytics.

While Bioengineering spans the areas of biology/medicine and engineering, it was quickly embraced by engineering schools. Bioengineering departments started forming quickly and together with them the underlying undergraduate and graduate majors and research components reflected through advanced graduate degrees and comprehensive research groups. Despite the fact that bioengineering also covers multiple areas, the vast majority of the departments are housed within engineering schools or colleges. Analytics is much broader than bioengineering and it encompasses several traditional research and education areas within a university. For this reason there is ambiguity about who should host an analytics program, or even an ‘analytics department.’

North Carolina State elegantly addressed this issue by forming a new standalone unit, the Institute of Advanced Analytics. The institute operates independently as a university-wide collaboration that can draw faculty from any of the university’s ten colleges, allowing them to participate in the analytics program on an equal footing. Northwestern University tackled this problem differently. Northwestern’s program is a professional program, which implies that it is very autonomous. While it is housed within the department of Industrial Engineering and Management Sciences, it draws instructors from all around the university.

Changes to the program, in particular curriculum, do not have to be approved by the university graduate school and this offers great flexibility when it comes to adjusting the curriculum based on rapidly changing business trends and needs. Northwestern also offers an online program, Master of Science in Predictive Analytics. This program is administered by the School of Continuing Education which can also be considered a standalone unit. Despite this school not being established for the purpose of offering a degree in analytics, it is not associated with any traditional college or department, so the organizational structure is very much aligned with the North Carolina State’s model. The big differentiating factor between these two degrees is the full-time vs. online delivery of education. Outside of North Carolina State and Northwestern, most other analytics-based programs have been developed within schools of business. These programs are more focused on the business aspect of analytics and thus schools of business are their natural fit.

Bioengineering prospered due to the rise of the entire discipline which has not yet been observed in analytics. As of today, no department including “analytics” in its name exists. There are no undergraduate or graduate majors exclusively in analytics. On the research end, analytics is based on traditional disciplines such as statistics, operations research, and machine learning. In recent years the research activities come from the increased computing capabilities for handling larger and larger data sets. On the other hand, the business world is more data-driven and it is hard to be competitive without data-driven decision making. To meet these real-world needs, researchers are adapting and expending known techniques to bigger data sets. The Hadoop ecosystem for handling big data is definitely an interesting research area.

Hopefully these research directions will eventually lead to a ‘Department of Analytics’ and thus a ‘Ph.D. in Analytics.’ This will also spawn new educational programs in analytics. Many U.S. institutions of higher education realize the need to prepare the future workforce for analytics, yet there are still debates about the best way to form analytics-based programs. The highly interdisciplinary nature of analytics poses challenges that are yet to be resolved. The North Carolina State model of a new standalone unit, the Northwestern principle of a professional program, and confining analytics to a business focus are three current ways of coping with this fact. Due to the increasing demand for skilled professionals in analytics, universities should move quickly in adopting one of these three business models or find a new one that best fits their needs.

55 Comments

Why Small and Medium Businesses (SMBs) Are a Big Opportunity for Business Analytics

12/3/2014

19 Comments

Fortune 500 companies are big enough, and have enough resources, to assemble and run their own internal analytics teams. In today’s environment, it is impossible for a large corporation to succeed without employing analytics. The situation is completely different if we make a step down to small and medium business (SMBs), which are typically corporations with less than 500 employees and revenue in hundreds of millions of dollars. Most SMBs do not have enough resources to deploy an internal analytics team. Lack of resources is definitely not the prohibitive argument for why they don’t use analytics. The conventional wisdom ‘we have been successful for many years so why do we need analytics now?’ is just now being challenged. It is the growth driver that should spawn the adoption of analytics. With analytics, SMBs can expand the market share, intelligently manage operations, drive down costs, and gain a new competitive advantage. In layman’s words, analytics can increase the bottom line for a few million dollars.

As mentioned, an SMB typically lacks the size to have an internal analytics team and thus they are ripe for using external software solutions. There is a big opportunity for independent software vendors (ISVs) offering business analytics solutions to target SMBs. If an SMB was established more than five years ago, they most likely only use business intelligence for basic reporting, or Google Analytics if they have an e-commerce site.

The situation is different for most recent start-ups and SMBs since many of them built their business models around business analytics and from the very beginning it became a key component of their business strategy. They clearly include all the start-ups in the software space and those using other technologies. In addition, this is also evident in many companies, for example, using social networking data or data from sensors such as telemetry and smart meter data.

ISVs offering analytics-based solutions have a tremendous opportunity in targeting SMBs in pretty much every vertical: from transportation to healthcare, retail, CPG, manufacturing, etc. SMBs are overshadowed by their big brothers since typical analytics projects cannot drive hundreds of millions of benefits as is the case for big corporations. However, despite a lower per project ROI, the total market opportunity is enormous due to the large number of SMBs in the U.S. (there are more than 25 million SMBs in the U.S.) While percentage-wise, it is not as high as in many European countries, they still represent a major chunk of the U.S. economy. Since every corporation has sales and marketing, low-hanging fruit is in the areas of marketing and customer intelligence analytics.

It is well known that in a successful data-driven corporation, everything starts at the management level. The management has to embrace analytics and then trickle it down throughout the entire organization. SMBs are no exception in this regard. The big advantage of SMBs is the fact that their organizational structure is more shallow and narrower in size. For this reason they are usually quicker to buy into analytics. Let us make no mistake; the buy-in from management in SMBs should not be taken for granted.

To summarize, analytics success stories in SMBs are not sexy – they will not appear in the Bloomberg Businessweek and will not lead to feature films like Moneyball, but nevertheless, they can make a dent in the economy. The opportunity for ISVs to target SMBs is definitely big. One does not have to look further than Intuit to get inspired by focusing on SMBs as a major market segment. Despite traditionally being focused on ‘accounting,’ Intuit now embeds analytics in their solutions such as the online personal financing software Mint and conducts analyses across their customer segments.