Diego Klabjan

Discovery vs Production

3/14/2016

It has been noted that big data technologies are predominantly used for ETL and data discovery. The former has been around for decades and is well understood with a mature market. Data discovery is much newer and less understood. Wikipedia’s definition reads “Data discovery is a business intelligence architecture aimed at interactive reports and explorable data from multiple sources.”

Data lakes based on Hadoop are springing up at many companies, with the predominant purpose of data discovery across multiple sources (that are explorable). It is easy to simply dump files from all over the place into a data lake, and thus the multiple-sources requirement in the definition is met. What about the part on “interactive reports?” The verb “to discover,” according to dictionaries, means “to learn of, to gain sight or knowledge of,” which has little in common with interactive reports. Indeed, in business, data discovery is much more aligned with the dictionary definition than with Wikipedia’s. Data discovery as used with big data and data lakes really means “to gain knowledge of data – in order to ultimately derive business value – by using explorable data from multiple sources.”

The vast majority of the applications of big data are to conduct data discovery in the sense of learning from the data. The knowledge gained per se does not provide business value, and thus such insights are operationalized separately in more established architectures (read EDW, RDBMS, BI). A good example is customer behavior derived from many data sources, e.g., transactional data, social media data, and credit performance. This clearly calls for data discovery in a data lake, with insights written into a relational database and put into production by means of other systems used in marketing or pricing.

There are very few cases of big data solutions, outside of ETL, actually being used in production. Large companies directly connected with the web have successfully deployed big data technologies in production (Google for page ranking, Facebook for friend recommendations), but outside of this industry big data solutions in production are rarely observed.

It is evident that today big data is used predominantly for data discovery and not in production. I suspect that as technologies mature further and become more self-service, the boundary will gradually shift toward production, assuming that business value can be derived from such opportunities. Today big data is mostly about data discovery. The Wikipedia definition centered on interactive reports is for now mostly an illusion, and it is better to stick with the proper English definition of gaining knowledge.


Beyond Basic Machine Learning: Modeling Loan Defaults with Advanced Techniques

11/6/2015

Dodd-Frank Act Stress Testing (DFAST) is now also required for smaller banks with assets of less than $50 billion. Simply stated, the test requires a bank to assess possible losses in future years from the loans on its books. A report exploring many different future economic scenarios, such as increases in the unemployment rate or in short-term interest rates, must be submitted to the Federal Reserve.

A loan’s status evolves over time. A loan that is current can become delinquent in the next period and then default in a future period. A different scenario would be a loan transitioning from current to delinquent and then back to current. The modeling part consists of three major components:

1. Derive the probabilities that a loan will transition from one state to another in the next period. Clearly, a loan can also stay in the same state. The probabilities depend on many attributes of the loan.

2. With these transition probabilities derived, a Markov chain model is set up. This model assesses the probability that a given loan will be in default after a certain number of periods in the future (e.g., eight quarters); a minimal sketch of this computation follows the list.

3. The expected loss of a loan is its probability of default multiplied by the approximate loss.
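For concreteness, here is a minimal Python sketch of steps 2 and 3. The state set, transition matrix, loss severity, and loan amount are all illustrative placeholders; in an actual DFAST model the transition probabilities would come from the loan-level models of step 1 and from the economic scenario being stressed.

```python
import numpy as np

# Three-state loan Markov chain: current, delinquent, default (absorbing).
# The transition probabilities below are made up for illustration; in practice
# they are produced by the classification models of step 1 and depend on the
# attributes of the individual loan and the economic scenario.
states = ["current", "delinquent", "default"]
P = np.array([
    [0.96, 0.03, 0.01],   # from current
    [0.30, 0.55, 0.15],   # from delinquent
    [0.00, 0.00, 1.00],   # default is absorbing
])

def default_probability(start_state: str, periods: int) -> float:
    """Probability that a loan starting in `start_state` is in default after `periods` transitions."""
    dist = np.zeros(len(states))
    dist[states.index(start_state)] = 1.0
    dist = dist @ np.linalg.matrix_power(P, periods)
    return dist[states.index("default")]

p_default = default_probability("current", periods=8)    # e.g., eight quarters
expected_loss = p_default * 0.45 * 100_000                # step 3: assumed 45% loss on a $100k loan
print(f"P(default within 8 quarters) = {p_default:.3f}, expected loss = ${expected_loss:,.0f}")
```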

The most challenging part is the derivation of the probabilities. On paper, this is an easy machine learning classification problem. Consider the case of current loans transitioning to delinquent. First, for each historical loan a set of attributes (features, in machine learning parlance) needs to be assembled. Candidates for features are the loan duration, the location of the borrower, the loan type, etc. Next, a subset of historical loans that transitioned from current to delinquent and a subset of loans that remained current are selected. The selected loans form the training data set. The third step is to set up models and, lastly, to evaluate them by using techniques such as 10-fold cross-validation and a held-out test data set.
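A minimal sketch of this workflow with scikit-learn might look as follows; the input file, feature columns, and label name are hypothetical stand-ins for the assembled loan-level training data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training table: one row per historical loan, a 0/1 label for the
# current -> delinquent transition, and a few candidate features.
loans = pd.read_csv("historical_loans.csv")               # assumed file name
feature_cols = ["loan_duration", "loan_type_code",        # assumed column names
                "borrower_state_code", "ltv", "fico"]
X = loans[feature_cols].to_numpy()
y = loans["became_delinquent"].to_numpy()

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 10-fold cross-validation; F1 is reported because of the class imbalance
# discussed below.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("mean F1 across folds:", scores.mean())
```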

Typical challenges of data sourcing and cleansing take up a big chunk of the time. Several banks do not have enough historical data from their own operations and thus have to procure external data sets (and make sure that the loans in such external data sets resemble their own).

The classification task at hand is to determine whether a given loan will become delinquent in the next time period. As such, it is a textbook binary classification problem. There are two hidden challenges. The first is that the training set is heavily imbalanced, since many more loans stay current than transition to delinquent. It is known that in such cases special techniques must be employed. Equally important is the fact that there are many possible features, in excess of one hundred.
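The post does not name the imbalance techniques that were used; two common options, sketched below under that caveat, are reweighting the classes and undersampling the majority class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Option 1: weight classes inversely to their frequency so the rare
# current -> delinquent transitions are not drowned out by current loans.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Option 2: undersample the majority class before fitting.
def undersample(X, y, random_state=0):
    """Keep all minority-class rows plus an equal-sized random subset of the majority class."""
    rng = np.random.default_rng(random_state)
    minority = 1 if (y == 1).sum() < (y == 0).sum() else 0
    min_idx = np.where(y == minority)[0]
    maj_idx = rng.choice(np.where(y != minority)[0], size=len(min_idx), replace=False)
    keep = np.concatenate([min_idx, maj_idx])
    return X[keep], y[keep]
```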

We have been engaged in a few DFAST-related projects and have faced the feature selection challenge. We started with the standard techniques of principal component analysis (PCA) and information gain (the so-called maximum relevance minimum redundancy algorithm). Either technique reduced the feature space, and then classification models (logistic regression, support vector machines, random forests) were evaluated.
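A sketch of such a baseline comparison is shown below, with a synthetic imbalanced data set standing in for the proprietary loan data; the number of PCA components and the other hyperparameters are illustrative, not the ones used in the projects.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan feature matrix: 100+ features, ~3% positives.
X, y = make_classification(n_samples=20000, n_features=120, n_informative=15,
                           weights=[0.97, 0.03], random_state=0)

candidates = {
    "logistic regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "SVM": SVC(class_weight="balanced"),
    "random forest": RandomForestClassifier(n_estimators=300, class_weight="balanced"),
}
for name, clf in candidates.items():
    # Standardize, reduce the feature space with PCA, then classify.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=20), clf)
    f1 = cross_val_score(pipe, X, y, cv=10, scoring="f1").mean()
    print(f"{name}: mean F1 = {f1:.3f}")
```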

We have tackled a few problems requiring deep learning, a technique well suited for complex problems. While the DFAST problem is not as complex as, for example, recognizing from an image whether a pedestrian is about to cross a road, we decided to try the Restricted Boltzmann Machine (RBM) as a technique to reduce the feature space. In an RBM, an input (visible) vector is fed to the model and mapped, through the notion of an energy function, into a lower-dimensional hidden vector, which is then lifted back to the original feature space. The goal is to tune the parameters (b’, c’, W) of the energy function so that the reconstructed vector is, in probability, close to the original vector.

The entire classification is then performed by taking an input vector, computing its hidden vector with the RBM, and classifying the hidden vector as delinquent or not. [This flow actually follows the paradigm of deep belief networks, which typically include more than one hidden layer.]
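The post does not give the actual implementation; the sketch below reproduces this flow with scikit-learn's BernoulliRBM (which expects inputs scaled to [0, 1]) feeding a logistic regression, again on a synthetic stand-in for the loan data and with illustrative hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan data: 120 features, ~3% delinquent transitions.
X, y = make_classification(n_samples=20000, n_features=120, n_informative=15,
                           weights=[0.97, 0.03], random_state=0)

# The RBM learns a lower-dimensional hidden representation of each input vector;
# a logistic regression then classifies the hidden vectors as
# "transitions to delinquent" or not.
rbm_pipeline = make_pipeline(
    MinMaxScaler(),                                              # RBM expects values in [0, 1]
    BernoulliRBM(n_components=30, learning_rate=0.05, n_iter=20, random_state=0),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
f1 = cross_val_score(rbm_pipeline, X, y, cv=10, scoring="f1").mean()
print(f"RBM features + logistic regression: mean F1 = {f1:.3f}")
```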

To our surprise, the RBM-based classification model outperformed the wide variety of other traditional feature selection and classification models. The improvement was drastic: the F1 score jumped from 0.13 to 0.5. This was a very nice ‘exercise’ that is hard to push to end users who have heard of logistic regression and PCA, and might even know how they work, but would be very uncomfortable using something called RBM (though they would be much more receptive if the acronym stood for role-based management).


Why Small and Medium Businesses (SMBs) Are a Big Opportunity for Business Analytics

12/3/2014

Fortune 500 companies are big enough, and have enough resources, to assemble and run their own internal analytics teams. In today’s environment, it is impossible for a large corporation to succeed without employing analytics. The situation is completely different if we step down to small and medium businesses (SMBs), typically companies with fewer than 500 employees and revenue in the hundreds of millions of dollars. Most SMBs do not have enough resources to deploy an internal analytics team, yet lack of resources is not the real barrier to their use of analytics. The conventional wisdom of ‘we have been successful for many years, so why do we need analytics now?’ is only now being challenged. It is the growth driver that should spawn the adoption of analytics. With analytics, SMBs can expand their market share, intelligently manage operations, drive down costs, and gain a new competitive advantage. In layman’s terms, analytics can increase the bottom line by a few million dollars.

As mentioned, an SMB typically lacks the size to maintain an internal analytics team, and thus SMBs are ripe for using external software solutions. There is a big opportunity for independent software vendors (ISVs) offering business analytics solutions to target SMBs. If an SMB was established more than five years ago, it most likely uses business intelligence only for basic reporting, or Google Analytics if it has an e-commerce site.

The situation is different for more recent start-ups and SMBs, since many of them built their business models around business analytics, and from the very beginning it has been a key component of their business strategy. This clearly includes start-ups in the software space as well as those built on other technologies. It is also evident in many companies using, for example, social networking data or data from sensors such as telemetry and smart meter data.

ISVs offering analytics-based solutions have a tremendous opportunity in targeting SMBs in pretty much every vertical: transportation, healthcare, retail, CPG, manufacturing, etc. SMBs are overshadowed by their big brothers, since a typical analytics project cannot drive hundreds of millions in benefits as it can for big corporations. However, despite a lower per-project ROI, the total market opportunity is enormous due to the large number of SMBs in the U.S. (there are more than 25 million of them). While their share of the economy is, percentage-wise, not as high as in many European countries, they still represent a major chunk of the U.S. economy. Since every corporation has sales and marketing, the low-hanging fruit is in the areas of marketing and customer intelligence analytics.

It is well known that in a successful data-driven corporation, everything starts at the management level. Management has to embrace analytics and then trickle it down throughout the entire organization. SMBs are no exception in this regard. The big advantage of SMBs is that their organizational structure is flatter and smaller, so they are usually quicker to buy into analytics. Let us make no mistake, though; the buy-in from management in SMBs should not be taken for granted.

To summarize, analytics success stories in SMBs are not sexy: they will not appear in Bloomberg Businessweek and will not lead to feature films like Moneyball, but they can nevertheless make a dent in the economy. The opportunity for ISVs to target SMBs is definitely big. One does not have to look further than Intuit to get inspired by a focus on SMBs as a major market segment. Despite traditionally being focused on ‘accounting,’ Intuit now embeds analytics in solutions such as the personal finance software Mint and conducts analyses across its customer segments.


    Diego Klabjan

    Professor at Northwestern University, Department of Industrial Engineering and Management Sciences. Founding Director, Master of Science in Analytics.
