Diego Klabjan

MIXING REAL-TIME AND BATCH DATA PROCESSING

12/29/2014

26 Comments

 
Want to process real-time data? The web, anyone? Or IoT and the industrial internet? You can fire up spouts and bolts in Storm and get a “whoops, one of our assembly machines is about to experience problems!” alert as the readings stream in.
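For the curious, here is a minimal sketch of what such an alerting bolt might look like against Storm’s Java API (recent releases use the org.apache.storm packages; older ones used backtype.storm). The field names, the vibration metric, and the threshold are my own illustrative choices, not part of any particular product; a real bolt would score readings against a model rather than a fixed cutoff.

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: flags sensor readings that drift outside an assumed normal band.
public class MachineAlertBolt extends BaseBasicBolt {

    private static final double MAX_NORMAL_VIBRATION = 7.5; // assumed threshold, mm/s

    @Override
    public void execute(Tuple reading, BasicOutputCollector collector) {
        String machineId = reading.getStringByField("machineId");
        double vibration = reading.getDoubleByField("vibration");
        if (vibration > MAX_NORMAL_VIBRATION) {
            // Downstream bolts could page an operator or push the alert to a dashboard.
            collector.emit(new Values(machineId, vibration, System.currentTimeMillis()));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("machineId", "vibration", "alertTimestamp"));
    }
}
```

A spout reading from the plant’s sensor bus (or from a Kafka topic) would feed this bolt through a TopologyBuilder; the wiring is only a few more lines.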

For decades companies were able to use RDBMSs and data cubes to find out what revenue would be lost if an assembly machine goes down for two hours under normal throughput. Lately Hadoop has become the de facto standard for this kind of batch analysis.

But what if management wants to know the likely impact of a sputtering machine that may break down under its current throughput? Or a plant manager wants an up-to-date health status of the machine, i.e., one computed after the last batch data run, where the assessment requires the information from that batch run plus all of the sensor readings since then? A few years ago this was utopia, but computer and data scientists knew what was brewing in some pots.

Lambda architecture enables exactly this and makes it possible (for now, with an army of data scientists) by combining several technologies. The architecture requires precomputed views over the batch data and a streaming process for the data that has arrived since the last batch run. Typical implementations use the open source stack of Kafka, Hadoop, Druid, and Storm, but to a certain extent Spark can also serve the purpose. An interesting perspective and experience is provided by Jay Kreps of LinkedIn (who, not surprisingly, advocates using only Kafka, which LinkedIn pioneered).
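To make the split concrete, here is a minimal sketch of the serving-side merge, under a few assumptions of my own: a nightly batch job precomputes a per-machine aggregate, the speed layer keeps a running total for readings that arrived since that run, and a query simply adds the two. Class, method, and metric names are illustrative, not taken from any of the products above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative serving layer for a lambda architecture: batch view plus real-time view.
public class LambdaServingLayer {

    // Per-machine aggregate (e.g., hours of abnormal vibration) from the last batch run.
    private final Map<String, Double> batchView = new HashMap<>();
    // The same aggregate, but only for readings that streamed in since that run.
    private final Map<String, Double> realtimeView = new HashMap<>();

    // Called when a fresh batch view lands; speed-layer state it now covers is dropped.
    public void loadBatchView(Map<String, Double> precomputed) {
        batchView.clear();
        batchView.putAll(precomputed);
        realtimeView.clear();
    }

    // Called by the streaming layer (e.g., a Storm bolt) for every new reading.
    public void onStreamEvent(String machineId, double increment) {
        realtimeView.merge(machineId, increment, Double::sum);
    }

    // An up-to-date answer is the batch result plus everything that arrived since.
    public double query(String machineId) {
        return batchView.getOrDefault(machineId, 0.0)
             + realtimeView.getOrDefault(machineId, 0.0);
    }

    public static void main(String[] args) {
        LambdaServingLayer serving = new LambdaServingLayer();
        serving.loadBatchView(Collections.singletonMap("press-07", 3.25)); // last night's run
        serving.onStreamEvent("press-07", 0.5);                            // streamed since then
        System.out.println(serving.query("press-07"));                     // prints 3.75
    }
}
```

In a real deployment the batch view would live in a store such as Druid or HBase and the real-time view in memory or in a streaming-friendly store; the point is only that every query has to stitch the two layers together.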

While successful deployments are mostly encountered at the web giants processing web data, there are definitely use cases outside of Silicon Valley; the manufacturing scenarios outlined above are one example.


A PREREQUISITE FOR ON-TIME ANALYTICS PROJECTS

12/17/2014

17 Comments

 
Needless to say, there can be no analytics project without data. A project starts by identifying data, cleansing it, performing analytics, and then conveying the results or solutions. The rule of thumb is that 70 to 80 percent of the total timeframe for an analytics project is spent on data preparation and a much smaller portion on actually conducting the analytics.

There are many reasons why analytics software projects routinely miss deadlines and overrun their budgets, particularly in the data preparation phase, where leaders kick these projects off without nailing down the prerequisites:

  • Many analytics projects begin as software engagements, and software projects are notorious for being delayed. Commonly the development process lacks a specification, there is no source control in place, and there are no incremental delivery requirements.
  • Developers are humans, with all of the usual proclivities to over-promise, overestimate their skills, and give in to pressure from management.
  • The inherent nature of development does not help either: bugs can be tough to fix, software libraries do not link, the implementation must work on a wide variety of computing architectures, etc.
Analytics projects are subject to all of these quirks. They also rely heavily on data, which makes the delays even bigger and more frequent. And it is not all about data cleansing, even though it requires the majority of the time: if the data is readily available, then cleansing is time consuming but doable.

Way too often in an analytics project, significant delays accumulate even before the first line of code or query has been written. It all starts with data availability and collection. It is not uncommon for companies to start analytics projects based on a belief in the potential business value of analytics that depends on utopian data sets. The quality of the data eats into the realized value, but it is even worse when the data is not available or accessible to the team before the start of the project.

I have been involved in several projects where the initial assumption was that the business users would provide some data and the miracles of analytics would follow. This can happen, but only after a multi-year delay in execution because the data was not available from the beginning. If the business users own the data, then making the data accessible is not their top priority. After all, the business has been running for years without this new analytics nuisance.

After the data has been collected and handed to the analytics team, data quality issues arise and missing data sets are identified. At this point a back-and-forth with the data owners begins, and the best outcome is that the data is ready after a significant delay in project execution. While there is a move towards enterprise data warehouses (EDWs) that ease data access, for the foreseeable future there will be spreadsheets scattered around corporations holding valuable information that is not accessible to everyone.

Corporations that do this well can execute a project more quickly because of well-established analytics project leadership. These leaders understand that a project cannot start without the data being ready and of acceptable quality. No matter how high the perceived business value of a project is, they will not pull the trigger before all of the prerequisites are met; the project starts only after the data is ready.

This contrast in leadership is especially pronounced when outsourcing the project development. The billable hours keep accumulating while the consultants wait for the data or work with old - or even worse, fictitious - data sets. On the other hand, the consultants also feel the pressure as they see project deadlines slipping away. Experienced leaders do not cave in to the pressure of upper management and the allure of high ROI. They do not officially start a consulting engagement before all of the data is readily available and of acceptable quality. Potential delays in the execution are now narrowed down to the general peculiarities of software projects.


THE CHALLENGES OF EDUCATING THE NEXT GENERATION OF DATA SCIENTISTS

12/10/2014

55 Comments

 
Analytics is booming, and there is a gap between the supply of skilled analysts and the high demand in the market.  In the U.S., the first comprehensive analytics program was established in 2007.  Prior to that, existing programs focused on the components of analytics - statistics, information technology, data mining and business intelligence.  It took another four years before other U.S. schools picked up on the trend and started slowly supplying students with advanced analytics degrees.

Today, in 2012, there are still only a handful of graduate and undergraduate programs focused on analytics.  Some companies, like EMC, have found the talent shortage so severe that they have created their own programs to train and certify analysts.  Given the high demand for these skills, why are universities so far behind?

A decade ago, Bioengineering was a degree and skill in heavy demand, and almost immediately numerous schools started research and educational programs in that field. What is different about Analytics and Bioengineering when it comes to establishing an educational program within a university or college? The primary difference is that the area of analytics is extremely broad.  It spans computer science, statistics, operations research, and business. As a result, there is a lack of focused research programs and departments in analytics, and it is challenging for a traditional academic department to offer a program in analytics.

While bioengineering spans the areas of biology/medicine and engineering, it was quickly embraced by engineering schools. Bioengineering departments formed quickly, and with them came the underlying undergraduate and graduate majors as well as research components in the form of advanced graduate degrees and comprehensive research groups. Even though bioengineering also covers multiple areas, the vast majority of these departments are housed within engineering schools or colleges. Analytics is much broader than bioengineering, encompassing several traditional research and education areas within a university. For this reason there is ambiguity about who should host an analytics program, or even an ‘analytics department.’

North Carolina State elegantly addressed this issue by forming a new standalone unit, the Institute for Advanced Analytics. The institute operates independently as a university-wide collaboration that can draw faculty from any of the university’s ten colleges, allowing them to participate in the analytics program on an equal footing. Northwestern University tackled the problem differently. Northwestern’s program is a professional program, which implies that it is very autonomous. While it is housed within the Department of Industrial Engineering and Management Sciences, it draws instructors from all around the university.

Changes to the program, in particular to the curriculum, do not have to be approved by the university’s graduate school, and this offers great flexibility when it comes to adjusting the curriculum to rapidly changing business trends and needs.

Northwestern also offers an online program, the Master of Science in Predictive Analytics. This program is administered by the School of Continuing Studies, which can also be considered a standalone unit. Even though the school was not established for the purpose of offering a degree in analytics, it is not associated with any traditional college or department, so the organizational structure is very much aligned with the North Carolina State model. The big differentiating factor between the two degrees is full-time versus online delivery of the education.

Outside of North Carolina State and Northwestern, most other analytics-based programs have been developed within schools of business. These programs are more focused on the business aspect of analytics, and thus schools of business are their natural fit.

Bioengineering prospered due to the rise of the entire discipline, something that has not yet happened in analytics. As of today, no department with “analytics” in its name exists, and there are no undergraduate or graduate majors exclusively in analytics. On the research end, analytics is based on traditional disciplines such as statistics, operations research, and machine learning. In recent years research activity has been driven by the increased computing capability for handling larger and larger data sets. At the same time, the business world is more data-driven, and it is hard to be competitive without data-driven decision making. To meet these real-world needs, researchers are adapting and extending known techniques to bigger data sets. The Hadoop ecosystem for handling big data is definitely an interesting research area.

Hopefully these research directions will eventually lead to a ‘Department of Analytics’ and thus a ‘Ph.D. in Analytics,’ which will in turn spawn new educational programs in analytics. Many U.S. institutions of higher education realize the need to prepare the future workforce for analytics, yet there are still debates about the best way to form analytics-based programs. The highly interdisciplinary nature of analytics poses challenges that are yet to be resolved. The North Carolina State model of a new standalone unit, the Northwestern principle of a professional program, and confining analytics to a business focus are three current ways of coping with this fact. Given the increasing demand for skilled professionals in analytics, universities should move quickly to adopt one of these three organizational models or find a new one that best fits their needs.


WHY SMALL AND MEDIUM BUSINESSES (SMBs) ARE A BIG OPPORTUNITY FOR BUSINESS ANALYTICS

12/3/2014

19 Comments

 
Fortune 500 companies are big enough, and have enough resources, to assemble and run their own internal analytics teams. In today’s environment, it is impossible for a large corporation to succeed without employing analytics. The situation is completely different one step down, at small and medium businesses (SMBs), which are typically corporations with fewer than 500 employees and revenue in the hundreds of millions of dollars. Most SMBs do not have enough resources to deploy an internal analytics team, yet lack of resources is not the real barrier to their use of analytics. The conventional wisdom of ‘we have been successful for many years, so why do we need analytics now?’ is only now being challenged. Growth is the driver that should spawn the adoption of analytics. With analytics, SMBs can expand their market share, intelligently manage operations, drive down costs, and gain a new competitive advantage. In layman’s terms, analytics can increase the bottom line by a few million dollars.

As mentioned, an SMB typically lacks the size to have an internal analytics team and is thus ripe for using external software solutions. There is a big opportunity for independent software vendors (ISVs) offering business analytics solutions to target SMBs. If an SMB was established more than five years ago, it most likely uses only business intelligence for basic reporting, or Google Analytics if it has an e-commerce site.

The situation is different for more recent start-ups and SMBs, since many of them built their business models around business analytics, and from the very beginning it has been a key component of their business strategy. This clearly includes start-ups in the software space, but also those built on other technologies. It is also evident in many companies that use, for example, social networking data or sensor data such as telemetry and smart meter readings.

ISVs offering analytics-based solutions have a tremendous opportunity in targeting SMBs in pretty much every vertical: transportation, healthcare, retail, CPG, manufacturing, and so on. SMBs are overshadowed by their big brothers since a typical analytics project cannot drive hundreds of millions of dollars in benefits, as it can for a big corporation. However, despite the lower per-project ROI, the total market opportunity is enormous due to the sheer number of SMBs in the U.S. (more than 25 million). While their share of the economy is, percentage-wise, not as high as in many European countries, they still represent a major chunk of the U.S. economy. Since every corporation has sales and marketing, the low-hanging fruit is in marketing and customer intelligence analytics.

It is well known that in a successful data-driven corporation, everything starts at the management level. The management has to embrace analytics and then trickle it down throughout the entire organization. SMBs are no exception in this regard. The big advantage of SMBs is that their organizational structure is flatter and smaller, so they are usually quicker to buy into analytics. Make no mistake, though: the buy-in from management in SMBs should not be taken for granted.

To summarize, analytics success stories in SMBs are not sexy – they will not appear in Bloomberg Businessweek and will not lead to feature films like Moneyball – but they can nevertheless make a dent in the economy. The opportunity for ISVs to target SMBs is definitely big. One does not have to look further than Intuit to get inspired by a focus on SMBs as a major market segment. Despite traditionally being focused on accounting, Intuit now embeds analytics in its solutions, such as the online personal finance software Mint, and conducts analyses across its customer segments.


    Diego Klabjan

    Professor at Northwestern University, Department of Industrial Engineering and Management Sciences. Founding Director, Master of Science in Analytics.
