Diego Klabjan - Blog

Thoughts on autoML
Mon, 29 Jul 2019 14:43:14 GMT | http://dynresmanagement.com/blog/thoughts-on-automl

There are many start-ups screaming Data Science as a Service (DSaaS). The shortage of data scientists is well documented, and thus DSaaS makes sense. However, data science is complex, and steps such as feature engineering and hyperparameter tuning are tough nuts to crack.
 
There are several feature selection algorithms with varying degrees of success and adoption, but feature selection is different from feature creation, which often requires domain expertise. Given a set of features, feature selection can be automated, but feature creation remains unsolvable without substantial human involvement and interaction.
 
On the surface, hyperparameter tuning and model construction are further down the automation and self-service path, thanks to abundant research and commercial offerings in autoML. Scholars and industry researchers have developed many autoML algorithms, with Bayesian optimization and reinforcement learning as the two prevailing approaches. Training of a new configuration can now be aborted early and perhaps restarted later if other attempts do not turn out to be as promising as their first indicators suggested. Training of a new configuration has also been sped up by exploiting previously obtained weights. In short, given a particular and unique image problem, autoML can, without human assistance, configure a CNN-like network together with all other hyperparameters (optimization algorithm, mini-batch size, etc.). Indeed, autoML is able to construct novel architectures that compete with and even outperform hand-crafted architectures, and to create brand new activation functions.
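To make the early-abort idea concrete, here is a minimal successive-halving sketch in Python; train_one_epoch and evaluate are hypothetical stand-ins for a real training loop and validation metric, not any particular autoML API.

import random

def successive_halving(configs, train_one_epoch, evaluate, max_rounds=4):
    # Start every configuration, then repeatedly drop the worse half and give
    # the survivors a doubled training budget, resuming from their weights.
    survivors = [(cfg, None) for cfg in configs]       # (hyperparameters, model state)
    epochs = 1
    for _ in range(max_rounds):
        if len(survivors) == 1:
            break
        scored = []
        for cfg, state in survivors:
            for _ in range(epochs):
                state = train_one_epoch(cfg, state)    # resume from previous weights
            scored.append((evaluate(state), cfg, state))
        scored.sort(key=lambda t: t[0], reverse=True)  # higher validation score first
        survivors = [(cfg, state) for _, cfg, state in scored[: max(1, len(scored) // 2)]]
        epochs *= 2
    return survivors[0][0]

# Usage with random draws of hyperparameters (illustrative names only):
# configs = [{"lr": 10 ** random.uniform(-4, -1), "batch_size": random.choice([32, 64, 128])}
#            for _ in range(16)]
# best_config = successive_halving(configs, train_one_epoch, evaluate)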
 
Google’s autoML, which is part of Google Cloud Platform, has competed in Kaggle competitions. On older competitions, autoML would have always been in the top ten. In a recent competition, autoML competed ‘live’ and finished second. This attests that automatic hyperparameter tuning can compete with the best humans.
 
There is one important aspect left out. autoML requires substantial computing resources. On deep learning problems it often requires weeks and even months of computing time on thousands of GPUs. There are not many companies that can afford such expenses on a single AI business problem. Even the Fortune 500 companies that we collaborate with are reluctant to go down this path. If organizations with billions in quarterly revenue cannot make a business case for autoML, then it is obvious that scholars in academia cannot conduct research in this area. We can always work on toy problems, but that would take us only so far: the impact would be limited because the scalability of the proposed solutions would be unknown, and publishing work based on limited computational experiments would be an uphill battle. A PhD student of mine recently stated, “I do not want to work on an autoML project since we cannot compete with Google due to computational resources.” This says a lot.
 
The implication is that autoML is going to be developed further only by experts at tech giants who already have the computational resources in place. Most, if not all, of the research will be left out of academia. This does not imply that autoML is doomed, since in AI it is easy to argue that research in industry is ahead of academic contributions. However, it does imply that progress is going to be slower, since only one party is going to drive the agenda. On the positive side, Amazon, Google, and Microsoft have a keen interest in improving the autoML solutions that are part of their cloud platforms. It can be an important differentiating factor for attracting customers.
 
Before autoML becomes more widely used in industry, the computational requirements must be lowered, and this is possible only with further research. I guess we are at the mercy of FAANG (as we are for many other aspects outside autoML) to make autoML more affordable.
Autonomous Cars: Sporadic Need of Drivers (without a Driver’s License)
Tue, 25 Jun 2019 21:38:24 GMT | http://dynresmanagement.com/blog/autonomous-cars-sporadic-need-of-drivers-without-a-drivers-license

Based on SAE International’s standard J3016, there are 6 levels of car automation, with level 0 being the old-fashioned car with no automation and the highest, level 5, offering full automation in any geographical region and (weather) condition, with a human only entering the destination. As a reference point, Tesla cars are at level 3 (short sellers claim it is level 2, while Elon believes it is level 5 or soon-to-be level 5).

Waymo is probably the farthest ahead with level 4 cars: full automation on select types of roads, in select geographical regions (“trained” in the Bay Area or Phoenix, but not in both), and in select weather conditions (do not let the car drive in Chicago in snow, or experience potholes when snow is not on the ground). Atmospheric rain storms in San Francisco have probably also wreaked havoc on Waymo’s cars.

In “Self-driving cars have a problem: safer human-driven ones,” The Wall Street Journal, June 15, 2019, Waymo’s CEO John Krafcik stated that level 5 is decades away. A human will be needed to occasionally intervene for the foreseeable future (a flooded street in San Francisco or a large pothole in Chicago).

Let us now switch to the other side of the equation: humans, a.k.a. drivers. Apparently, they will be needed for decades. The existence of humans in the future is not in question, except for believers in superintelligence and the singularity, but the survival of drivers is less clear.

I have three young-adult children, each with a driver’s license; however, they are likely the last generation with driving knowledge. Ten years from now, I actually doubt they will still know how to drive. While they currently have their own car at home, owned by the parents, they frequently use car-sharing. I doubt they will ever own a car, since they will not need one. The number of miles they drive per year is steadily going down, and I am confident that five years from now it will drop to zero. As with any other skill or knowledge, if you do not practice it, you forget how to do it.

Thirty years from now, I predict that the only people with driving knowledge will be those who are currently between 30 and 45 years old (I am assuming that those of us older than 45 will not be delighted to drive at the age of 75 or will RIP). Those who are now 30 years old or younger will forget how to drive in a decade or so. Those currently below the driving age, and those yet to be born, will probably never even learn how to drive.

People outside of this 30-to-45 age range will not be able to own a car unless we get level 5 cars, which seems unlikely. No matter how much they long to have a car, occasionally they will have to take control of it, and they will not know how to operate it. As a result, there will be no car ownership for those outside the 30-to-45 age range.

The logic does not quite add up, since car-sharing drivers will always be needed to transport my children and, occasionally, me. In short, thirty years from now, the only drivers will be those who are currently between 30 and 45 years old or those who are, and will be, car-sharing drivers. The latter will be in a similar situation as airplane pilots are now. Car-sharing drivers will mostly be trained in simulators, just enough to sporadically take control of a car. Since we consider today’s pilots to know how to fly, we should as well call future car-sharing operators drivers.

There is another problem with not having level 5 cars soon. I personally find Tesla’s autopilot useless (I do not own a Tesla, and thus this claim is based on reported facts), as well as level 4 cars. The main purpose of an autonomous car is to increase my productivity. The car should drive while I am working on more important things. If I have to pay attention to the road and traffic even without actively driving, it defeats the purpose; it is still a waste of time. The only useful autonomous cars are level 5 cars. There is clearly a safety benefit to levels 1-4, but, at least in my case, this argument goes only so far.

In summary, the future of car automation levels 0-4 is bleak. They do increase safety, but they do not increase productivity if a human driver needs to be in the car paying attention to weather and potholes. Furthermore, the lack of future drivers will make them even more problematic. In short, the case for a human in the loop is not a resounding one.

One solution is to have level 4 cars remotely assisted or teleoperated. This practice is already being advocated by some start-ups (Designed Driver) and is, in essence, a human on the loop. In such a scenario, I will be able to do productive work while being driven by an autonomous car with teleoperated assistance. This business model also aligns nicely with the aforementioned lack of drivers, since there will only be a need for ‘drivers’ capable of driving in a simulator. Is this a next-generation job that will not be wiped out by AI? You bet it is, and we will probably need many of them.

There is a challenge on the technical side. A car would have to identify an unusual situation or a low-confidence action in order to invoke a teleoperator. Whether this can be done with sufficient reliability is yet to be seen. Current deep learning models are capable of concluding “this is an animal that I have not yet seen” without specifying what kind of animal it is. There is hope to invent deep learning solutions that RELIABLY identify an unseen situation and pass control to a teleoperator.
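As a minimal illustration of the intended control flow (not a claim about any production system), a softmax-confidence threshold is one crude trigger for such a handoff; softmax confidence is known to be an imperfect signal for unseen situations, which is exactly the research gap mentioned above.

import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def next_action(logits, threshold=0.9):
    # If the perception model's top class probability is low, ask a teleoperator.
    probs = softmax(np.asarray(logits, dtype=float))
    if probs.max() < threshold:
        return "request_teleoperator"
    return f"act_on_class_{int(probs.argmax())}"

# next_action([4.1, 0.2, 0.3])  -> acts on the confident class
# next_action([1.1, 1.0, 0.9])  -> hands control to a teleoperator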
Federated Learning and Privacy
Wed, 06 Mar 2019 23:27:37 GMT | http://dynresmanagement.com/blog/federated-learning-and-privacy

Federated learning is a process of training in which data owners (clients) collaboratively train a shared model without sharing their own data. A general process works as follows.
  1. A curator (server) sends the current model to clients.
  2. Clients improve the model by training it on their own data.
  3. Clients send back the updates.
  4. The curator aggregates updates from clients and updates the current model.
  5. Repeat 1 - 4.
In general, clients improve the model by minimizing a shared loss function. Only the values of the model's parameters (or their gradients) are shared with the server. The private data stay on the clients' side. Much work has been done to ensure low leakage of clients' data through the shared parameters (or gradients). However, it is not clear how much privacy a federated learning algorithm should guarantee and what properties of the clients' data the server can learn.
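A minimal NumPy sketch of the five steps above, for a linear model with squared loss and three simulated clients; only weight vectors travel between the make-believe server and clients, never the raw data.

import numpy as np

def client_update(w, X, y, lr=0.1, epochs=5):
    # Step 2: a client improves the broadcast weights on its own data.
    w = w.copy()
    for _ in range(epochs):
        w -= lr * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):                                            # step 5: repeat
    updates = [client_update(w, X, y) for X, y in clients]     # steps 1-3
    w = np.mean(updates, axis=0)                               # step 4: the curator aggregates
print(w)   # close to [2, -1]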

Suppose a service is trained through federated learning on clients' data. After the clients agree to participate in the training process, the server starts it. The server may choose to optimize any function, including, but not limited to, the loss function of the service model, by collecting parameter estimates from the clients.

Consider the following scenario, in which the server has a chance to learn the distribution of client data by collecting maximum likelihood estimates. The server assumes a particular distribution from which client data may be drawn. Then, instead of letting clients minimize a loss function, the server lets clients maximize the likelihood function of the parameters of the chosen distribution on their own data. For example, the server may assume that client data are drawn from a normal distribution. The server sends the current mean estimate to a client. The client maximizes the likelihood function on its own data and sends the updated mean back to the server.
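A small sketch of that scenario with simulated data: each client's maximum likelihood update under an assumed normal model is just its sample mean, so the server's usual aggregation hands it a summary of the clients' data distribution.

import numpy as np

rng = np.random.default_rng(1)
client_data = [rng.normal(loc=5.0, scale=2.0, size=n) for n in (200, 400, 100)]

# Each client "trains" by maximizing the normal likelihood of the mean on its data;
# the MLE is the sample mean, and no raw data point ever leaves the client.
client_updates = [x.mean() for x in client_data]
sizes = [len(x) for x in client_data]

# The server aggregates the updates exactly as it would aggregate model weights.
server_mean_estimate = np.average(client_updates, weights=sizes)
print(server_mean_estimate)   # close to 5.0, a property of the clients' data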

In short, the server can collect parameters that reflect the distribution of client data through maximum likelihood estimation. This process does not break the federated learning rule that clients do not expose their private data to others. However, the server learns the distribution of the client data during the training process. To be clear, each client would need to agree to make updates with respect to both the original loss function and the predefined likelihood function.

We wonder whether there is a privacy standard in federated learning that specifies what kinds of summaries (or parameters) a server can collect from clients. In federated learning, is the server allowed to infer the distribution of the client data through distribution fitting, explicitly through maximum likelihood estimation or perhaps implicitly through the shared parameters (or gradients) of a regular loss function?
Deep Learning for Trading
Tue, 12 Feb 2019 23:34:19 GMT | http://dynresmanagement.com/blog/deep-learning-for-trading

Recurrent neural networks are well suited for temporal data. There is abundant work on sequence-to-sequence modeling when the training data form well-defined sequences on both the encoding and decoding side. Consider language-based tasks such as sentiment analysis: a sentence is well defined and maps into the encoder, and on the decoding side a single prediction is made (positive, neutral, negative). In financial data, a sequence can go back 10 order book updates, 131 of them, or even 1,000. The same issue is present on the decoding side; do we want to make a prediction for the next one second, one minute, or one hour in increments of 5 minutes?

The length of the input sequence remains a challenging problem and is subject to trial and error. We were able to make advances on how to output only confident predictions in a dynamic fashion. In a very volatile market, a model should reliably make only short-term recommendations, while in a stable one, the confidence should increase and more predictions should be made. This is the trait of our new model.

Standard models have a fixed number of layers (think of the number of neurons in each time step). In a challenging market, one should spend more time exploring the patterns and learning, while in easy times one can only skim and move on. There is no reason why a model should not follow the same strategy. Another relevant family of models is adaptive computation time, which leads naturally to some of the challenges related to time series data. These models dynamically allocate the number of layers at each time step, and thus the amount of computation at each time step is controlled. First, data scientists do not need to fine-tune the number of layers, and, second, the model allocates a lot of computation to hard portions of a sequence and just one layer/neuron to easy parts. A toy sketch of the halting mechanism is given below.
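This is a framework-free sketch of that halting mechanism, assuming hypothetical cell and halting_prob functions; it only illustrates how the number of "layers" per time step can vary with the difficulty of the input, not the actual trading model.

import numpy as np

def act_step(state, x, cell, halting_prob, max_layers=10, eps=0.01):
    # Apply the recurrent cell repeatedly at one time step until the accumulated
    # halting probability reaches 1 - eps; hard inputs get more layers than easy ones.
    cumulative, n_layers = 0.0, 0
    weighted_state = np.zeros_like(state)
    while cumulative < 1.0 - eps and n_layers < max_layers:
        state = cell(state, x)                  # one more layer of computation
        p = halting_prob(state)                 # probability of halting now
        p = min(p, 1.0 - cumulative)            # leftover mass goes to the last layer
        weighted_state += p * state             # output is a halting-weighted mixture
        cumulative += p
        n_layers += 1
    return weighted_state, n_layers             # n_layers varies across time steps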

All these novel aspects have been tested on a few financial data sets, predicting prices of ETFs and commodities. The prediction power is drastically improved by using these enhancements. One vexing challenge remains in evaluation: how does better prediction of prices translate into actual trading and P&L? Stay tuned; it remains to be seen.
Truthfulness of Information
Fri, 20 Jan 2017 04:14:06 GMT | http://dynresmanagement.com/blog/truthfulness-of-information

A lot has been written lately about fake news, especially related to the recent elections. Blame has mostly been put on the tech giants Google and Facebook, as they have become major sources of news. While traditional newspaper outlets such as The Wall Street Journal or The Washington Post curate information manually and perform due diligence, automatic news feeds require either substantial labor or algorithms for detecting fake news. The result is all of the well-known rage at Google and Facebook for enabling the circulation of such news.

The problem of truthfulness of information goes beyond everyday political, economic, and more general journalism-related misinformation. In healthcare, there are many sites providing misinformation, especially those originating and hosted in third-world countries. It can be devastating if a person acts on such sources. Another service area very prone to misinformation is finance. An erroneous signal or news item can quickly be picked up by traders or trading algorithms and drive a stock in the wrong direction. Even marketers need to know which web sites provide fake news, since they do not want to advertise on such sites.

Google and Facebook are well known for their prowess in machine learning and artificial intelligence, and they sit on top of troves of data. Yet they have not been able to design a robust truth-checking system. For quite some time there have been sites available, which we will call truthfulness brokers, that provide the names of other sites known to spread fake news. Recently Google announced that its search engine will start marking sites with questionable credibility. Based on the announcement, its strategy is to cross-check the site in question against those blacklisted by truthfulness brokers. A few browser plug-ins have also emerged with the same purpose, and with the same algorithm for tagging the credibility of a site.
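The cross-checking strategy amounts to little more than a domain lookup against the brokers' blacklists, roughly as in the sketch below (the broker names and listed domains are made up).

from urllib.parse import urlparse

BROKER_BLACKLISTS = {
    "broker_a": {"fake-news-example.com", "totally-made-up-site.net"},
    "broker_b": {"fake-news-example.com"},
}

def credibility_flags(url):
    # Return the brokers that blacklist the domain of the given link.
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    return [name for name, sites in BROKER_BLACKLISTS.items() if domain in sites]

# credibility_flags("http://www.fake-news-example.com/story") -> ['broker_a', 'broker_b']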

For the time being this is probably an admissible solution, but it definitely begs for more. First, these solutions rely on third-party sites. There is no guarantee that truthfulness brokers provide up-to-date listings. Indeed, they mostly rely on manual work and thus cannot be up to date. In addition, a new site needs to continuously release fake news before it catches the attention of the truthfulness brokers. Another problem is that the brokers' current focus is on journalism-related fake news. The other aforementioned areas do not have such dedicated truthfulness brokers, leaving enormous untapped value.

It is clear that a robust, scalable solution to the problem must not rely exclusively on truthfulness brokers but has to curate data internally and develop sophisticated machine learning and artificial intelligence algorithms. Solely focusing on web sites without taking into consideration the actual content they exhibit has limitations. Instead, algorithms and solutions based on knowledge graphs and fact-checking, combined with scores for source reliability, must be developed.

Who is Afraid of the Big (Bad) Wolf?
Mon, 15 Aug 2016 12:44:00 GMT | http://dynresmanagement.com/blog/-who-is-afraid-of-big-bad-wolf

In May 2016 I participated in the IRI Annual Meeting in Orlando. I attended the round table “Implications of Big Data on R&D Management” organized by the Big Data Working Group, where I serve as a subject matter expert. The discussion was moderated by Mike Blackburn, R&D Portfolio/Program Leader, Cargill.

The participants, from companies such as IBM, John Deere, and Schneider Electric, discussed data-driven use cases in R&D, success stories, and challenges. We heard about data facilitating investment decisions, about yield data in agriculture driving acquisition recommendations, and about insights from patents being taken into consideration in product design and development.

I was expecting to hear from the participants that their main concern is direct competitors outsmarting them in data acquisition and use. On the contrary, the vast majority of the participants mentioned in their remarks a company from a different industry. Google was the most frequently mentioned company, despite the discussion focusing on industries that even a few years ago were considered completely disconnected from the technology giants in the Bay Area.

It is well known how Google is disrupting the automotive industry with autonomous cars relying heavily on AI and mountains of data. Similarly, in logistics (Amazon with Amazon Robotics, acquired as Kiva Systems) and manufacturing (Google with the acquisition of Boston Dynamics, with which it is parting ways after a few years), data together with the AI and data science embraced by Google and other tech companies is making inroads into other traditional, well-entrenched industries. Google is also not shying away from life science data-driven opportunities, as witnessed by its Life Sciences group, recently renamed Verily. Not to mention the tremendous opportunities that Google Glass, or its future incarnations, can potentially bring to R&D.

Given all these facts, it is no surprise that Google was mentioned much more often than direct competitors or the economy. The company is a wolf waiting to feed on pigs. But it is not a bad wolf; if your company gets eaten, you should blame yourself for not being a visionary about what is possible with data, AI, and data science. The easiest way out is to be passive and pretend that Google disrupting your industry is a fantasy or too far off.

And Google is not only going to affect other organizational units in your industry; it could also influence R&D. Consider, for example, that Google has already assembled product images and manuals. With additive manufacturing, it could easily produce products by reverse engineering these materials and then sell the products online. From the vast amount of data it possesses, it can use AI to predict the need for products, willingness to pay, and which plug-ins are going to sell. The only way to save your career is to overtake Google.
Artificial Intelligence vs CRISPR
Thu, 24 Mar 2016 19:52:51 GMT | http://dynresmanagement.com/blog/-artificial-intelligence-vs-crispr

CRISPR/Cas9 is a brand new algorithm in machine learning that has the potential to replace humans in practicing law and contract negotiations. It is based on a new deep learning model that is trained on only a few documents and is then capable of producing questions to be asked at a trial or of negotiating a real estate deal with Mr. Trump. The model consists of 10 layers of …

Got you (at least some of you). CRISPR/Cas9 is actually a gene editing technology developed at the University of California, Berkeley that does not have anything to do with machine learning (or artificial intelligence). You can read more about the technology on Wikipedia. It is a relatively recent invention and has already been used to cure diseases in adult tissues and to change the color of skin in mice (Mizuno et al., 2014). No problem here, since humanity definitely wants to eradicate cancer and other genetic diseases. That is, until a very controversial recent study came out by Junjiu Huang, a gene-function researcher at Sun Yat-sen University in Guangzhou. His team applied gene editing using CRISPR/Cas9 to embryos. Truth be told, these embryos cannot result in a live birth. Their intent was to cure a blood disorder in the embryo. Huang concludes that further advances need to be made, since several of the embryos were not successfully edited; but some were. It goes without saying that in the future this could lead to an ideal child with blue eyes, 6 feet in height, an IQ of 130, etc. (and graduating from Northwestern University or Georgia Tech - where I got my Ph.D). This nice article in Nature summarizes and discusses this controversial research direction.

On the other hand, deep learning in the context of artificial intelligence is all the rage these days. Everybody is warning about the danger of artificial intelligence (AI) to humanity, including extremely influential and prominent people such as Bill Gates, Elon Musk, and Stephen Hawking. There are at least 100 returned pages (and probably many more) on Google mentioning “artificial intelligence Bill Gates Elon Musk Stephen Hawking”. As someone who knows deep learning (DL) and is conducting research in this space, I believe DL and AI are very far from endangering humanity. Yes, there have been significant advances in supervised learning in specific areas (autonomous cars, scene recognition from images, answering simple factoid questions); however, these models still need a lot of training data and can solve only very narrow, specific problems. Train a scene recognition model on images of living animals and then show it a dinosaur. The answer: “Elephant.” A lot of the news and reports today are written by computers powered by AI, but this is much more structured and easier to learn than negotiating a contract with Mr. Trump or preparing a lawyer for a trial. We are very far from computers displacing humans for such tasks.

I am not worried about AI, definitely not in my life span or that of my children, but CRISPR/Cas9 makes me much more nervous. It really means interfering with the natural process and, in the not-so-distant future, creating exceptional humans a la carte. I am convinced that without any regulations, i.e., with the scientists unleashed, successful gene editing would be around the corner. I believe it is very important that experts around the globe step in and prevent further studies of gene editing on embryos. In terms of AI, using CRISPR/Cas9 or another yet-to-be-invented technology to artificially create a functional brain with all its neurons in a jar seems more viable and closer in time than mimicking the human brain with bits well enough to endanger humanity.

Discovery vs Production
Mon, 14 Mar 2016 22:21:21 GMT | http://dynresmanagement.com/blog/discovery-vs-production

It has been noted that big data technologies are predominantly used for ETL and data discovery. The former has been around for decades and is well understood, with a mature market. Data discovery is much newer and less understood. Wikipedia’s definition reads: “Data discovery is a business intelligence architecture aimed at interactive reports and explorable data from multiple sources.”

Data lakes based on Hadoop are bursting out at many companies, with the predominant purpose of data discovery from multiple (explorable) sources. It is easy to simply dump files from all over the place into a data lake, and thus the multiple-data-sources requirement in the definition is met. What about the part on “interactive reports?” The verb “to discover,” according to dictionaries, means “to learn of, to gain sight or knowledge of,” which is quite disconnected from interactive reports; it actually does not have much in common with them. Indeed, in business, data discovery is much more aligned with the dictionary definition than with Wikipedia’s. Data discovery as used with big data and data lakes really means “to gain knowledge of data, in order to ultimately derive business value, by using explorable data from multiple sources.”

The vast majority of big data applications conduct data discovery in the sense of learning from the data. The knowledge gained does not per se provide business value, and thus such insights are operationalized separately in more established architectures (read EDW, RDBMS, BI). A good example is customer behavior derived from many data sources, e.g., transactional data, social media data, and credit performance. This clearly calls for data discovery in a data lake, with the insights written into a relational database and productionized by means of other systems used in marketing or pricing.

There are very few cases of big data solutions outside of ETL actually being used in production. Large companies directly connected with the web have successfully deployed big data technologies in production (Google for page ranking, Facebook for friend recommendations), but outside of this industry big data solutions in production are rarely observed.

It is evident that today big data is used predominantly for data discovery and not in production. I suspect that as the technologies mature further and become more self-service, the boundary will gradually shift toward production, assuming that business value can be derived from such opportunities. Today big data is mostly about data discovery. The Wikipedia definition about interactive reports is for now mostly an illusion, and it is better to stick with the proper English definition of gaining knowledge.

Beyond Basic Machine Learning: Modeling Loan Defaults with Advanced Techniques
Sat, 07 Nov 2015 04:32:24 GMT | http://dynresmanagement.com/blog/-beyond-basic-machine-learning-modeling-loan-defaults-with-advanced-techniques

Dodd-Frank Act Stress Testing (DFAST) is now required also for smaller banks, those with assets under $50 billion. Simply stated, the test requires a bank to assess possible losses in future years from the loans on its books. A report exploring many different future economic scenarios, such as increases in the unemployment rate or in short-term interest rates, must be submitted to the Federal Reserve.

A loan's status evolves in time. A loan that is current can become delinquent in the next period and then default in a future period. A different scenario would be a loan transitioning from current to delinquent and then back to current. The modeling consists of three major components:

1. Derive the probabilities that a loan will transition from one state to another in the next period. Clearly, a loan can also stay in the same state. The probabilities depend on many attributes of the loan.

2. With these transition probabilities derived, a Markov chain model is set up. This model assesses the probability that a given loan will be in default after a certain number of periods in the future (e.g., eight quarters).

3. The expected loss of a loan is its probability of default multiplied by the approximate loss (a small numerical sketch of these three steps follows).
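Here is that sketch, with made-up numbers: a three-state loan Markov chain (current, delinquent, default), the eight-quarter default probability from repeated application of the transition matrix, and the resulting expected loss.

import numpy as np

# Rows/columns: current, delinquent, default (default treated as absorbing).
# In practice these probabilities come from the classification models and
# depend on the attributes of each loan; the numbers here are illustrative only.
P = np.array([
    [0.95, 0.04, 0.01],
    [0.30, 0.55, 0.15],
    [0.00, 0.00, 1.00],
])

P8 = np.linalg.matrix_power(P, 8)      # transitions over eight quarters
p_default = P8[0, 2]                   # start current, end up in default

exposure, loss_given_default = 250_000, 0.40
expected_loss = p_default * exposure * loss_given_default
print(round(p_default, 4), round(expected_loss, 2))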

The most challenging part is the derivation of the probabilities. On paper this is an easy machine learning classification problem. Consider the case of current loans transitioning to delinquent. First, for each historical loan a set of attributes (features in machine learning parlance) needs to be assembled. Candidates for features are the loan duration, the location of the borrower, the loan type, etc. Next, a subset of historical loans that transitioned from current to delinquent and a subset of loans that remained current are selected. The selected loans form the training data set. The third step is to set up models, and lastly to evaluate them using techniques such as 10-fold cross-validation and a held-out test data set.

Typical challenges of data sourcing and cleansing take up a big chunk of the time. Several banks do not have enough historical data from their own operations and thus have to procure external data sets (and make sure that the loans in such external data sets resemble their own loans).

The classification task at hand is to classify whether a given loan will become delinquent or not in the next time period. As such, it is a textbook binary classification problem. There are two hidden challenges. The first is that the training set is heavily imbalanced, since many more loans stay current than transition to delinquent. It is known that in such cases special techniques must be employed. Equally important is the fact that there are many possible features, in excess of one hundred.

We have been engaged in a few DFAST-related projects and have faced the feature selection challenge. We started with the standard techniques of principal component analysis (PCA) and information gain (the so-called maximum relevance minimum redundancy algorithm). Either technique reduced the feature space, and then classification models (logistic regression, support vector machine, random forests) were evaluated.
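A sketch of such a baseline pipeline with scikit-learn on synthetic stand-in data (real loan features are confidential): PCA for feature reduction, logistic regression for classification, and class weighting as one simple way to address the imbalance.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Roughly 3% "delinquent" labels and over one hundred candidate features.
X, y = make_classification(n_samples=5000, n_features=120, n_informative=15,
                           weights=[0.97, 0.03], random_state=0)

baseline = make_pipeline(StandardScaler(),
                         PCA(n_components=20),
                         LogisticRegression(class_weight="balanced", max_iter=1000))

print(cross_val_score(baseline, X, y, cv=10, scoring="f1").mean())   # 10-fold F1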

We have tackled a few problems requiring deep learning, a technique well suited for complex problems. While the DFAST problem is not as complex as, for example, recognizing from an image whether a pedestrian is about to cross a road, we decided to try the Restricted Boltzmann Machine (RBM) as a technique to reduce the feature space. In an RBM, an input (visible) vector is fed to the model and then mapped, through the notion of an energy function, into a lower dimensional hidden vector, which is then lifted back to the original feature space. The goal is to tune the parameters (b, c, W) of the energy function so that, in probability, the reconstructed vector comes close to the original vector.

The entire classification is then performed by taking an input vector, computing the hidden vector with the RBM, and classifying the hidden vector as delinquent or not. [This flow actually follows the paradigm of deep belief networks, which typically include more than one hidden layer.]
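In the same scikit-learn setting, a hedged sketch of this RBM-based flow: features scaled to [0, 1] (which BernoulliRBM expects), an RBM learning the hidden representation, and logistic regression on top, in the spirit of a one-hidden-layer deep belief network. The hyperparameters are illustrative, not those of our project.

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rbm_model = make_pipeline(MinMaxScaler(),
                          BernoulliRBM(n_components=30, learning_rate=0.05,
                                       n_iter=20, random_state=0),
                          LogisticRegression(class_weight="balanced", max_iter=1000))

# Reusing X, y and cross_val_score from the previous sketch:
# print(cross_val_score(rbm_model, X, y, cv=10, scoring="f1").mean())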

To our surprise, the RBM-based classification model outperformed the wide variety of other traditional feature selection and classification models. The improvement was drastic; the F1 score jumped from 0.13 to 0.5. This was a very nice ‘exercise’ that is hard to push to end users, who have heard of logistic regression and PCA and might even know how they work, but would be very uncomfortable using something called an RBM (though they would be much more receptive if the acronym meant role-based management).

Improvements to large-scale machine learning in Spark
Tue, 28 Apr 2015 13:10:36 GMT | http://dynresmanagement.com/blog/improvements-to-large-scale-machine-learning-in-spark

One of the biggest hassles in MapReduce is model calibration for machine learning models such as logistic regression and SVM. These algorithms are based on gradient optimization and require iteratively computing the gradient and, in turn, updating the weights. MapReduce is ill suited for this, since in each iteration the data have to be read from HDFS and there is a significant cost to starting and winding down a MapReduce job.

On the other hand, Spark, with its capability to persist RDDs (resilient distributed datasets) in memory and its native dataflow capabilities, is a great candidate for efficient calibration on RDDs.

Gradient-based algorithms on distributed data sets rely on the paradigm of solving the optimization problem on each partition and then combining the solutions. We implemented three algorithms in Scala.

1. Iterative parameter averaging (IPA): On each partition, a single pass of the standard gradient algorithm is performed, which produces weights. The weights from all partitions are then averaged and form the initial weights for the next pass. The pseudocode is provided below.

Initialize w
Loop
    Broadcast w to each partition
    weightRDD = for each partition in RDD inputData:
        wp = w
        perform a single gradient descent pass over the records in the partition,
            iteratively updating wp
        return wp
    /* weightRDD is the RDD storing the new weights */
    w = average of all weights in weightRDD
Return w

The key is to keep the RDD inputData in memory (persist it before calling IPA). A PySpark rendering of the same loop is sketched below.
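This is only an illustration of the pseudocode for logistic regression; the actual implementation linked below is in Scala. Here data is assumed to be an RDD of (NumPy feature vector, 0/1 label) pairs that has been persisted in memory.

import numpy as np

def ipa(sc, data, dim, iterations=10, lr=0.01):
    w = np.zeros(dim)
    for _ in range(iterations):
        w_b = sc.broadcast(w)                       # broadcast w to each partition

        def one_pass(partition):
            wp = w_b.value.copy()
            seen = 0
            for x, y in partition:                  # single SGD pass over the partition
                p = 1.0 / (1.0 + np.exp(-wp.dot(x)))
                wp -= lr * (p - y) * x
                seen += 1
            if seen:
                yield (wp, 1)

        summed, parts = data.mapPartitions(one_pass).reduce(
            lambda a, b: (a[0] + b[0], a[1] + b[1]))
        w = summed / parts                          # average the per-partition weights
    return w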

2.      Alternating direction method of multipliers (ADMM): http://stanford.edu/~boyd/admm.html

This method is based on the concept of the augmented Lagrangian. In each iteration, for each partition, the calibration model is solved on the records pertaining to that partition. The objective function is altered: it consists of the standard loss plus a penalty term driving the weights to resemble the average weights. One also needs to solve an extra regularization problem with penalties. For L2 and L1 norms this problem has a closed-form solution (sketched below).
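For concreteness, this is one way those closed forms look under the consensus formulation in Boyd et al.'s notes (linked above); w_bar and u_bar denote the averages of the per-partition weights and scaled dual variables, lam the regularization weight, rho the ADMM penalty parameter, and n the number of partitions. The names are ours for illustration, not those of the Scala code.

import numpy as np

def z_update_l2(w_bar, u_bar, lam, rho, n):
    # argmin_z (lam/2)*||z||^2 + (n*rho/2)*||z - (w_bar + u_bar)||^2
    return n * rho * (w_bar + u_bar) / (lam + n * rho)

def z_update_l1(w_bar, u_bar, lam, rho, n):
    # argmin_z lam*||z||_1 + (n*rho/2)*||z - (w_bar + u_bar)||^2  (soft-thresholding)
    v = w_bar + u_bar
    return np.sign(v) * np.maximum(np.abs(v) - lam / (n * rho), 0.0)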

After each partition computes its weights, they are averaged and the penalty term is adjusted. Each partition has its own set of weights.

Since the algorithm is complex, we do not provide the pseudocode. The bulk of it is actually very similar to IPA; however, there is additional work performed by the driver.

One challenge, i.e., an inefficiency in Spark or ‘we do not know how to do it in Spark,’ is the inability to send particular data (in our case the penalties) to a particular actor working on a partition. Instead we had to broadcast to all actors, and then, during the processing of a partition, only the relevant penalties that have been broadcast are used. The main issue here is that the penalties for all partitions have to be held in memory at the driver. For very large-scale RDDs with many features this will be a bottleneck.

3. Progressive hedging (PH): This is very similar to ADMM. The regularization subproblem has a different form than in ADMM, but it still has closed-form solutions for the L2 and L1 norms.

The implementations together with test codes are available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-ml-optimization 

Below is a comparison in Spark on 4 CPUs, each with 8 cores, on two large data sets. IPA is a clear winner, with the default Spark SGD being the worst algorithm.
