<![CDATA[Diego Klabjan - Blog]]>Wed, 20 Mar 2019 12:59:09 -0700Weebly<![CDATA[Federated Learning and Privacy]]>Wed, 06 Mar 2019 23:27:37 GMThttp://dynresmanagement.com/blog/federated-learning-and-privacyFederated learning is a training process in which data owners (clients) collaboratively train a shared model without sharing their own data. The general process works as follows. 
  1. A curator (server) sends the current model to clients.
  2. Clients improve the model by training it on their own data.
  3. Clients send back the updates.
  4. The curator aggregates updates from clients and updates the current model.
  5. Repeat 1 - 4.
In general, clients improve the model by minimizing a shared loss function. Only the values of the model's parameters (or their gradients) are shared with the server; the private data stay on the clients' side. Much work has been done to limit the leakage of information about clients' data through the shared parameters (or gradients). However, it is not clear how much privacy a federated learning algorithm should guarantee and which properties of the clients' data the server can learn. 
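The steps above can be sketched in a few lines of plain Python. The scalar model, the quadratic per-client loss, and all constants below are illustrative assumptions, not the protocol of any particular system.

```python
import random

# Toy federated averaging run of steps 1-4. The shared model is a single
# parameter w, and each client's loss is the squared distance to its
# private points.

def client_update(w, data, lr=0.1, epochs=5):
    """Step 2: a client improves the model on its own data."""
    for _ in range(epochs):
        grad = sum(2 * (w - x) for x in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w, clients):
    """Steps 1, 3, 4: send w out, collect updates, aggregate by averaging."""
    updates = [client_update(w, data) for data in clients]
    return sum(updates) / len(updates)

random.seed(0)
clients = [[random.gauss(mu, 1.0) for _ in range(50)] for mu in (1.0, 2.0, 3.0)]

w = 0.0
for _ in range(30):                 # step 5: repeat
    w = federated_round(w, clients)
print(round(w, 2))                  # settles near the average of the client means
```

Only the scalar updates cross the boundary; the lists in `clients` never leave the clients.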

Suppose a service is trained through federated learning. After a client agrees to participate in the training process, the server starts the process. The server may choose to optimize any function, including, but not limited to, the loss function of the service model, by collecting parameter estimates from the clients. 

In the following scenario, the server can learn the distribution of client data by collecting maximum likelihood estimates. The server assumes a particular distribution from which client data may be drawn. Then, instead of letting clients minimize a loss function, the server lets clients maximize the likelihood function of the parameters of the chosen distribution on their own data. For example, the server may assume that client data are drawn from a normal distribution. The server sends the current mean estimate to a client; the client maximizes the likelihood function on its own data and sends the updated mean back to the server. 
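A sketch of this scenario, assuming a normal model with known variance (an illustrative choice): the maximum likelihood estimate of the mean is simply the sample mean, so each client's "update" reveals the center of its private data.

```python
import random

# Each client maximizes the normal log-likelihood in the mean parameter;
# the MLE has the closed form mean(data), so the update leaks the data center.

def client_mle_update(data):
    # argmax over mu of sum(log N(x | mu, sigma^2)) is mean(data)
    return sum(data) / len(data), len(data)

random.seed(1)
clients = [[random.gauss(5.0, 2.0) for _ in range(200)] for _ in range(4)]

# The server aggregates the per-client MLEs, weighted by sample size.
updates = [client_mle_update(data) for data in clients]
total = sum(n for _, n in updates)
mu_hat = sum(m * n for m, n in updates) / total
print(round(mu_hat, 1))  # close to the true mean 5.0, without seeing any data
```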

In short, the server can collect parameters that reflect the distribution of client data through maximum likelihood estimation. This process does not break the rule of federated learning that clients do not expose their private data to others. However, the server can learn the distribution of the client data in the training process. To be clear, each client needs to agree to make updates with respect to the original loss function and the predefined likelihood function.

We wonder whether there is a privacy standard in federated learning that specifies what kinds of summaries (or parameters) a server can collect from clients. In federated learning, is the server allowed to infer the distribution of the client data by distribution fitting, explicitly through maximum likelihood estimation or perhaps implicitly through the shared parameters (or gradients) of a regular loss function?
<![CDATA[Deep Learning for Trading]]>Tue, 12 Feb 2019 23:34:19 GMThttp://dynresmanagement.com/blog/deep-learning-for-tradingRecurrent neural networks are well suited for temporal data. There is abundant work on sequence-to-sequence modeling when the training data form well-defined sequences on both the encoding and decoding sides. Consider language-based tasks such as sentiment analysis: a sentence is well defined and maps into the encoder, and on the decoding side a single prediction is made (positive, neutral, negative). In financial data, a sequence can go back 10 order book updates, 131 of them, or even 1,000. The same issue is present on the decoding side: do we want to make the prediction for the next one second, one minute, or one hour in increments of 5 minutes? 

The length of the input sequence remains a challenging problem and is subject to trial-and-error. 
We were able to make advances on how to output only confident predictions in a dynamic fashion. In a very volatile market, a model should be able to reliably make only short term recommendations, while in a stable one, the confidence should increase and more predictions should be made. This is the trait of our new model. 

Standard models have a fixed number of layers (think of the number of neurons in each time step). In a challenging market, one should spend more time exploring patterns and learning, while in easy times one can skim and move on. There is no reason why a model should not follow the same strategy. Adaptive computation time models, another family discussed here, address exactly this and map naturally to the challenges of time series data. These models dynamically allocate the number of layers at each time step, so the amount of computation spent at each step is controlled. First, data scientists do not need to fine-tune the number of layers, and, second, the model allocates a lot of computation to hard portions of a sequence and just one layer/neuron to easy parts. 
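A toy halting loop in the spirit of adaptive computation time: the update and halting functions below are made-up stand-ins for learned layers; the point is only that easy inputs halt after few sub-steps while hard ones get many.

```python
# Apply an update function repeatedly, accumulating a halting probability,
# until the cumulative probability reaches 1 - eps or a step cap is hit.

def act_step(state, update, halt_prob, eps=0.01, max_steps=20):
    """Run update() until the cumulative halting probability reaches 1 - eps."""
    total, remainder, output, steps = 0.0, 1.0, 0.0, 0
    for _ in range(max_steps):
        state = update(state)
        p = halt_prob(state)
        steps += 1
        if total + p >= 1.0 - eps or steps == max_steps:
            output += remainder * state   # the last step absorbs the remainder
            break
        output += p * state               # mixture of intermediate states
        total += p
        remainder -= p
    return output, steps

# An "easy" input halts quickly; a "hard" one consumes many sub-steps.
easy = act_step(0.9, lambda s: 0.5 * s + 0.5, lambda s: s)
hard = act_step(0.0, lambda s: s + 0.05, lambda s: s)
print(easy[1], hard[1])
```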

All these novel aspects have been tested on a few financial data sets predicting prices of ETFs and commodities. The predictive power is drastically improved by these enhancements. One vexing challenge remains in evaluation: how does better prediction of prices translate into actual trading and P&L? Stay tuned. 
<![CDATA[Truthfulness of Information]]>Fri, 20 Jan 2017 04:14:06 GMThttp://dynresmanagement.com/blog/truthfulness-of-informationA lot has been written lately about fake news, especially related to recent elections. Blame has mostly been put on the tech giants Google and Facebook as they have become major sources of news. While traditional newspaper outlets such as the Wall Street Journal or the Washington Post curate information manually and perform due diligence, automatic news feeds require either substantial labor or algorithms for detecting fake news. The result is all of the well-known rage against Google and Facebook for enabling the circulation of such news.

The problem of truthfulness of information goes beyond misinformation in political, economics, and more general journalism. In healthcare, there are many sites providing misinformation, especially those originating and hosted in third world countries. It can be devastating if a person acts on such sources. Another service area very prone to misinformation is finance. An erroneous signal or news item can quickly be picked up by traders or trading algorithms and drive a stock in the wrong direction. Even marketers need to know which web sites provide fake news, since they do not want to advertise on such sites.

Google and Facebook are well known for their prowess in machine learning and artificial intelligence, and they sit on top of troves of data. Yet they have not been able to design a robust truth-checking system. For quite some time, sites have been available – we will call them truthfulness brokers – that list other sites known to spread fake news. Recently Google announced that its search engine will start marking sites with questionable credibility. Based on the announcement, its strategy is to cross-check the site in question against those blacklisted by truthfulness brokers. A few browser plug-ins have also emerged with the same purpose and the same algorithm for tagging the credibility of a site.

For the time being this is probably an admissible solution, but it definitely begs for more. First, the solutions rely on third-party sites. There is no guarantee that truthfulness brokers provide up-to-date listings. Indeed, they mostly rely on manual work and thus cannot be up-to-date. In addition, a new site needs to continuously release fake news before it catches the attention of truthfulness brokers. Another problem is that the brokers currently focus on journalism-related fake news. The other aforementioned areas do not have such dedicated truthfulness brokers, thus leaving enormous untapped value.

It is clear that a robust, scalable solution to the problem must not rely exclusively on truthfulness brokers but has to internally curate data and develop sophisticated machine learning and artificial intelligence algorithms. Solely focusing on web sites without taking into consideration the actual content they publish has limitations. Instead, algorithms and solutions based on knowledge graphs and fact-checking, combined with scores for source reliability, must be developed. 

<![CDATA[Who is Afraid of the Big (Bad) Wolf?]]>Mon, 15 Aug 2016 12:44:00 GMThttp://dynresmanagement.com/blog/-who-is-afraid-of-big-bad-wolfIn May 2016 I participated in the IRI Annual Meeting in Orlando. I attended the round table “Implications of Big Data on R&D Management” organized by the Big Data Working Group, where I serve as a subject matter expert. The discussion was moderated by Mike Blackburn, R&D Portfolio/Program Leader, Cargill.

The participants, from companies such as IBM, John Deere, and Schneider Electric, discussed data-driven use cases in R&D, success stories, and challenges. We heard about data facilitating investment decisions, agricultural yield data driving acquisition recommendations, and insights from patents informing product design and development.

I was expecting to hear from participants that their main concern is direct competitors outsmarting them in data acquisition and use. To the contrary, the vast majority of the participants mentioned a company from a different industry. Google was the most frequently mentioned company, despite the discussion focusing on industries that even a few years ago were considered completely disconnected from the technology giants in the Bay Area. 

It is well known how Google is disrupting the automotive industry with autonomous cars relying heavily on AI and mountains of data. Similarly, in logistics (Amazon with Amazon Robotics, created through the acquisition of Kiva Systems) and manufacturing (Google with the acquisition of Boston Dynamics, with which it is parting ways after a few years), data together with the AI and data science embraced by Google and other tech companies are making inroads into traditional, well-entrenched industries. Google is also not shying away from data-driven opportunities in the life sciences, as witnessed by its Life Sciences group, recently renamed Verily. Not to mention the tremendous opportunities that Google Glass or its future incarnations can potentially bring to R&D. 

Based on all these facts it is no surprise that Google has been mentioned much more often than direct competitors or the economy. The company is a wolf waiting to feed on pigs. But it is not a bad wolf; if your company gets eaten, you should blame yourself for not being a visionary about what is possible with data, AI, and data science. The easiest path is to be passive and pretend that Google disrupting your industry is a utopia or too far off. 

And Google is not only going to affect other organizational units in your industry; it could also influence R&D. Consider, for example, that Google has already assembled product images and manuals. With additive manufacturing it could easily produce products by reverse engineering these materials and then sell them online. From the vast amount of data it possesses, it can use AI to predict the need for products, willingness to pay, and which plug-ins are going to sell. The only way to save your career is to overtake Google.
<![CDATA[Artificial Intelligence vs CRISPR]]>Thu, 24 Mar 2016 19:52:51 GMThttp://dynresmanagement.com/blog/-artificial-intelligence-vs-crisprCRISPR/Cas9 is a brand new algorithm in machine learning that has the potential to replace humans in practicing law and contract negotiations. It is based on a new deep learning model that is trained on only a few documents and is then capable of producing questions to be asked at a trial or negotiating a real estate deal with Mr. Trump. The model consists of 10 layers of …

Got you (at least some of you). CRISPR/Cas9 is actually a gene editing technology developed at the University of California, Berkeley that has nothing to do with machine learning (or artificial intelligence). You can read more about the technology on Wikipedia. It is a relatively recent invention and has already been used to cure diseases in adult tissues and to change the skin color of mice (Mizuno et al., 2014). No problem here, since humanity definitely wants to eradicate cancer and other genetic diseases. That was until a very controversial study came out by Junjiu Huang, a gene-function researcher at Sun Yat-sen University in Guangzhou. His team applied CRISPR/Cas9 gene editing to embryos. Truth be told, these embryos could not result in a live birth. The intent was to cure a blood disorder in the embryo. Huang concludes that further advances need to be made, since several of the embryos were not successfully edited, but some were. It goes without saying that in the future this could lead to an ideal child with blue eyes, 6 feet in height, an IQ of 130, etc. (and graduating from Northwestern University or Georgia Tech - where I got my Ph.D.). This nice article in Nature summarizes and discusses this controversial research direction.

On the other hand, deep learning in the context of artificial intelligence is all the rage these days. Everybody is warning about the danger of artificial intelligence (AI) to humanity, including extremely influential and prominent people such as Bill Gates, Elon Musk, and Stephen Hawking. There are at least 100 returned pages (and probably many more) on Google mentioning “artificial intelligence Bill Gates Elon Musk Stephen Hawking”. As someone who knows deep learning (DL) and is conducting research in this space, I believe DL and AI are very far from endangering humanity. Yes, there have been significant advances in supervised learning in specific areas (autonomous cars, scene recognition from images, answering simple factoid questions); however, these models still need a lot of training data and can solve only very narrow, specific problems. Train a scene recognition model on images of living animals and then show it a dinosaur. The answer: “Elephant.” A lot of news and reports today are written by computers powered by AI, but this is much more structured and easier to learn than negotiating a contract with Mr. Trump or preparing a lawyer for a trial. We are very far from computers displacing humans for such tasks.

I am not worried about AI, definitely not in my lifespan or that of my children, but CRISPR/Cas9 makes me much more nervous. It really means interfering with the natural process and, in the not-so-distant future, creating exceptional humans a la carte. I am convinced that without any regulations, i.e., unleashing the scientists, successful gene editing would be around the corner. I believe it is very important that experts around the globe step in and prevent further studies of gene editing on embryos. As for AI, using CRISPR/Cas9 or another yet-to-be-invented technology to artificially create a functional brain with all its neurons in a jar seems more viable and closer in time than mimicking the human brain with bits well enough to endanger humanity.

<![CDATA[Discovery vs Production]]>Mon, 14 Mar 2016 22:21:21 GMThttp://dynresmanagement.com/blog/discovery-vs-productionIt has been noted that big data technologies are predominantly used for ETL and data discovery. The former has been around for decades and is well understood with a mature market. Data discovery is much newer and less understood. Wikipedia’s definition reads “Data discovery is a business intelligence architecture aimed at interactive reports and explorable data from multiple sources.”

Data lakes based on Hadoop are bursting out at many companies, with the predominant purpose of data discovery from multiple (explorable) sources. It is easy to simply dump files from all over the place into a data lake, and thus the multiple-sources requirement of the definition is met. What about the part on “interactive reports?” The verb “to discover,” according to dictionaries, means “to learn of, to gain sight or knowledge of,” which has little in common with interactive reports. Indeed, in business, data discovery is much more aligned with the dictionary definition than with Wikipedia's. Data discovery as used with big data and data lakes really means “to gain knowledge of data – in order to ultimately derive business value – by using explorable data from multiple sources.”

The vast majority of the applications of big data are to conduct data discovery in the sense of learning from the data. The knowledge gained per se does not provide business value and thus such insights are operationalized separately in more established architectures (read EDW, RDBMS, BI).  A good example is customer behavior derived from many data sources, e.g., transactional data, social media data, credit performance. This clearly calls for data discovery in a data lake and insights written into a ‘relational database’ and productionalized by means of other systems used in marketing or pricing.

There are very few cases of big data solutions outside of ETL actually being used in production. Large companies directly connected with the web have successfully deployed big data technologies in production (Google for page ranking, Facebook for friend recommendations), but outside of this industry big data solutions in production are rarely observed.

It is evident that today big data is used predominantly for data discovery and not in production. I suspect that as the technologies mature even more and become more self-serve, the boundary will gradually shift towards production, assuming that business value can be derived from such opportunities. Today big data is mostly about data discovery. The Wikipedia definition about interactive reports is for now mostly an illusion, and it is better to stick with the proper English definition of gaining knowledge. 

<![CDATA[Beyond Basic Machine Learning: Modeling Loan Defaults with Advanced Techniques]]>Sat, 07 Nov 2015 04:32:24 GMThttp://dynresmanagement.com/blog/-beyond-basic-machine-learning-modeling-loan-defaults-with-advanced-techniquesDodd-Frank Act Stress Testing (DFAST) is now required also for smaller banks, those with assets of less than $50 billion. Simply stated, the test requires a bank to assess possible losses in future years from the loans on its books. A report exploring many different future economic scenarios, such as increases in the unemployment rate or short-term interest rates, must be submitted to the Federal Reserve.

A loan status evolves in time. A loan that is current can in the next period become delinquent and then in a future period it can default. A different scenario would be a loan transitioning from current to delinquent and then back to current. The modeling part consists of three major components:

1.     Derive probabilities that a loan will transition from one state to another state in the next period. Clearly a loan can stay in the same state. The probabilities depend on many attributes of a loan.

2.     With these transition probabilities derived, a Markov chain model is set up. This model assesses the probability that a given loan will be in default after a certain number of periods in the future (e.g., eight quarters).

3.     The expected loss of a loan is its probability of default multiplied by the approximate loss.
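The three components can be sketched in plain Python. The transition probabilities, the 45% loss severity, and the exposure below are illustrative numbers, not calibrated values.

```python
# Component 1: per-period transition probabilities between loan states.
STATES = ["current", "delinquent", "default"]
P = {
    "current":    {"current": 0.95, "delinquent": 0.04, "default": 0.01},
    "delinquent": {"current": 0.30, "delinquent": 0.55, "default": 0.15},
    "default":    {"current": 0.00, "delinquent": 0.00, "default": 1.00},
}

def default_probability(start, periods):
    """Component 2: probability of sitting in the absorbing default state."""
    dist = {s: float(s == start) for s in STATES}
    for _ in range(periods):
        nxt = {s: 0.0 for s in STATES}
        for s, mass in dist.items():
            for t, p in P[s].items():
                nxt[t] += mass * p
        dist = nxt
    return dist["default"]

pd8 = default_probability("current", 8)    # eight quarters out
expected_loss = pd8 * 0.45 * 100_000       # component 3: PD x severity x exposure
print(round(pd8, 3), round(expected_loss))
```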

The most challenging part is the derivation of the probabilities. On paper this is an easy machine learning classification problem. Consider the case of current loans transitioning to delinquent. First, for each historical loan a set of attributes (features in machine learning parlance) needs to be assembled. Candidates for features are the loan duration, the location of the borrower, the loan type, etc. Next, a subset of historical loans that transitioned from current to delinquent and a subset of loans that remained current are selected. The selected loans form the training data set. The third step is to set up models and, lastly, to evaluate them by using techniques such as 10-fold cross-validation and a held-out test data set.

The typical challenges of data sourcing and cleansing consume a big chunk of the time. Several banks do not have enough historical data from their own operations and thus have to procure external data sets (and make sure that the loans in such external data sets resemble their own).

The classification task at hand is to determine whether a given loan will become delinquent in the next time period. As such, it is a textbook binary classification problem, with two hidden challenges. The first is that the training set is heavily imbalanced, since many more loans stay current than transition to delinquent; it is known that in such cases special techniques must be employed. Equally important is the fact that there are many possible features, in excess of one hundred.
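One standard remedy for the imbalance is to undersample the majority class down to the size of the minority class; a minimal sketch with made-up loan counts (weighting records during training is a common alternative):

```python
import random

# Randomly keep only as many majority-class records as there are
# minority-class records, producing a balanced training set.

def undersample(records, labels, minority=1, seed=42):
    minor = [r for r, y in zip(records, labels) if y == minority]
    major = [r for r, y in zip(records, labels) if y != minority]
    random.seed(seed)
    major = random.sample(major, len(minor))  # match the minority class size
    data = [(r, minority) for r in minor] + [(r, 1 - minority) for r in major]
    random.shuffle(data)
    return data

# 30 loans that transitioned to delinquent (label 1) among 1,000 that stayed current
records = list(range(1030))
labels = [1] * 30 + [0] * 1000
balanced = undersample(records, labels)
print(len(balanced))  # → 60, i.e., 30 per class
```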

We have been engaged in a few DFAST related projects and have faced the feature selection challenge. We started with standard techniques: principal component analysis (PCA) and information gain (the so-called maximum relevance minimum redundancy algorithm). Either technique reduced the feature space, and then classification models (logistic regression, support vector machines, random forests) were evaluated.

We have tackled a few problems requiring deep learning, which is a technique well suited for complex problems. While the DFAST problem is not as complex as, for example, recognizing from an image whether a pedestrian is about to cross a road, we decided to try the restricted Boltzmann machine (RBM) as a technique to reduce the feature space. In an RBM, an input (visible) vector is fed to the model and mapped, through an energy function, into a lower dimensional hidden vector, which is then lifted back to the original feature space. The goal is to tune the parameters (b’,c’,W) of the energy function so that, in probability, the recovered vector comes close to the original vector. 

The entire classification is then performed by taking an input vector, computing the hidden vector through the RBM, and classifying the hidden vector as delinquent or not. [This flow actually follows the paradigm of deep belief networks, which typically include more than one hidden layer.]
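The two RBM mappings just described can be sketched as follows. The weights here are made up for illustration; in practice (b, c, W) are learned, e.g., by contrastive divergence, before the hidden vector is fed to a classifier.

```python
import math

# Visible-to-hidden and hidden-to-visible mappings of an RBM.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden(v, W, c):
    """P(h_j = 1 | v): the lower-dimensional representation used as features."""
    return [sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(c))]

def reconstruct(h, W, b):
    """P(v_i = 1 | h): lift the hidden vector back to the visible space."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]   # 3 visible units, 2 hidden units
b, c = [0.0, 0.0, 0.0], [0.0, 0.0]

v = [1.0, 0.0, 1.0]
h = hidden(v, W, c)              # features fed to the downstream classifier
v_rec = reconstruct(h, W, b)     # approximate recovery of the visible vector
print([round(x, 2) for x in h])  # → [0.55, 0.5]
```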

To our surprise, the RBM based classification model outperformed the wide variety of traditional feature selection and classification models. The improvement was drastic: the F1 score jumped from 0.13 to 0.5. This was a very nice ‘exercise’ that is hard to push to end users who have heard of logistic regression and PCA and might even know how they work, but would be very uncomfortable using something called RBM (though they would be much more receptive if the acronym stood for role-based management). 

<![CDATA[Improvements to large-scale machine learning in Spark]]>Tue, 28 Apr 2015 13:10:36 GMThttp://dynresmanagement.com/blog/improvements-to-large-scale-machine-learning-in-spark One of the biggest hassles in MapReduce is model calibration for machine learning models such as logistic regression and SVM. These algorithms are based on gradient optimization and require iterative computation of the gradient and, in turn, updates of the weights. MapReduce is ill suited for this since in each iteration the data have to be read from HDFS, and there is a significant cost to starting and winding down a MapReduce job.

On the other hand, Spark, with its capability to persist RDDs (resilient distributed datasets) in memory and its native dataflow capabilities, is a great candidate for efficient calibration.

Gradient based algorithms on distributed data sets rely on the paradigm of solving the optimization problem on each partition and then combining the solutions. We implemented three algorithms in Scala.

1.      Iterative parameter averaging (IPA): On each partition a single pass of the standard gradient algorithm is performed, which produces weights. The weights from each partition are then averaged and form the initial weights for the next pass. The pseudo code is provided below.

initialize w
repeat:
    broadcast w to each partition
    weightRDD = for each partition of the RDD inputData (in parallel):
        wp = w
        perform a single gradient-descent pass over the records in
            the partition, iteratively updating wp
        return wp
    /* weightRDD is the RDD storing the new per-partition weights */
    w = average of all weights in weightRDD
return w

The key is to keep the RDD inputData in memory (persist it before calling IPA).
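The IPA loop can be simulated in plain Python, with partitions as lists, "broadcast" as simply passing w in, and a 1-D least-squares fit y ~ w*x standing in for logistic regression or SVM:

```python
# Pure-Python simulation of iterative parameter averaging.

def single_pass(w, partition, lr=0.01):
    """One gradient-descent pass over a partition's (x, y) records."""
    for x, y in partition:
        w -= lr * 2 * (w * x - y) * x
    return w

def ipa(partitions, iterations=50):
    w = 0.0
    for _ in range(iterations):
        local = [single_pass(w, p) for p in partitions]  # map over partitions
        w = sum(local) / len(local)                      # average the weights
    return w

# Noiseless data with true slope 3, split across four "partitions".
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]
partitions = [data[i::4] for i in range(4)]
print(round(ipa(partitions), 2))  # → 3.0
```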

2.      Alternating direction method of multipliers (ADMM): http://stanford.edu/~boyd/admm.html

This method is based on the concept of the augmented Lagrangian. In each iteration, for each partition, the calibration model is solved on the records pertaining to the partition. The objective function is altered: it consists of the standard loss and a penalty term driving the weights toward the average weights. One also needs to solve an extra regularization problem with penalties; for the L2 and L1 norms this problem has a closed-form solution.

After each partition computes its weights, they are averaged and the penalty term adjusted. Each partition has its own set of weights.

Since the algorithm is complex, we do not provide the pseudo code. The bulk of it is actually very similar to IPA; however, there is additional work performed by the driver.
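For the curious, the consensus updates can be sketched on a toy 1-D least-squares problem in plain Python. The scaled-dual form, the value of rho, and the data are illustrative choices, not our Spark implementation.

```python
# Toy consensus ADMM for a shared slope y ~ w*x. Each "partition" solves its
# local problem plus a quadratic penalty pulling its weight toward the
# consensus z; the driver averages the weights and adjusts the duals.

def admm(partitions, rho=1.0, iterations=50):
    z = 0.0
    u = [0.0] * len(partitions)              # one scaled dual per partition
    for _ in range(iterations):
        w = []
        for i, part in enumerate(partitions):
            # local solve: argmin_w sum((w*x - y)^2) + (rho/2)*(w - z + u[i])^2
            num = 2 * sum(x * y for x, y in part) + rho * (z - u[i])
            den = 2 * sum(x * x for x, _ in part) + rho
            w.append(num / den)
        z = sum(wi + ui for wi, ui in zip(w, u)) / len(w)  # consensus update
        u = [ui + wi - z for ui, wi in zip(u, w)]          # dual (penalty) update
    return z

# Noisy data with true slope 3, split across two "partitions".
data = [(x, 3.0 * x + (0.2 if x > 2 else -0.2))
        for x in [0.5, 1.0, 2.0, 2.5, 3.0, 4.0]]
partitions = [data[0::2], data[1::2]]
print(round(admm(partitions), 2))  # close to the global least-squares slope
```

The speed of convergence depends on the choice of rho; the fixed point, however, is the global least-squares solution over all partitions.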

One challenge – an inefficiency in Spark, or perhaps ‘we do not know how to do it in Spark’ – is the inability to send particular data (in our case, the penalties) to a particular worker processing a partition. Instead, we had to broadcast all penalties to all workers; during the processing of a partition, only the relevant broadcast penalties are used. The main issue is that the penalties for all partitions have to be held in memory at the driver. For very large-scale RDDs with many features this will be a bottleneck.

3.      Progressive hedging (PH): This is very similar to ADMM. The regularization subproblem has a different form than in ADMM, but it still exhibits closed-form solutions for the L2 and L1 norms. 

The implementations together with test codes are available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-ml-optimization 

Below is a comparison on Spark with 4 CPUs, each with 8 cores, on two large data sets. IPA is the clear winner, with the default Spark SGD being the worst algorithm. 

<![CDATA[CSV to Spark SQL tables]]>Fri, 20 Mar 2015 19:46:38 GMThttp://dynresmanagement.com/blog/csv-to-spark-sql-tablesRecently we were involved in a project that required reading and importing more than 30 CSV files into Spark SQL. We started writing Scala code to ‘manually’ import the files one by one, but we soon realized that there was substantial repetition.

As a result we created a helper object that takes as input information about the CSV files and automatically creates a schema per file. Each CSV file must have a header, which dictates the names of the columns in the corresponding table. The user customizes an object where the details are listed by file. For example, for each file the user can specify:

  • The name of the schema
  • Delimiter
  • A possible formatter function that takes a string value from the input CSV file and returns the corresponding Scala object to be used in the SchemaRDD (by default, i.e., if no formatter is specified, strings are assumed). 
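The idea can be illustrated with a stripped-down, hypothetical Python analogue: the header names the columns, and an optional per-column formatter maps each string cell to a typed value, with strings as the default. (The real helper is in Scala and produces a Spark SQL schema rather than dictionaries.)

```python
import csv
import io

# Read a delimited file, use the header row as column names, and apply
# per-column formatters to turn string cells into typed values.

def load_csv(text, delimiter=",", formatters=None):
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    formatters = formatters or {}
    return [{name: formatters.get(name, str)(value)
             for name, value in zip(header, row)}
            for row in body]

raw = "id;amount\n1;10.5\n2;7.25\n"
records = load_csv(raw, delimiter=";", formatters={"id": int, "amount": float})
print(records[0])  # → {'id': 1, 'amount': 10.5}
```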

The code is available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-csv2sql.git and as a package on www.spark-packages.org.

<![CDATA[Information Gain based feature selection in Spark’s MLlib]]>Mon, 16 Feb 2015 19:49:07 GMThttp://dynresmanagement.com/blog/information-gain-based-feature-selection-in-sparks-mllibWe recently worked on a project that included web data and other customer-specific information. It was a propensity analysis project where recommendations were required for each individual client based on his or her past actions on the web. Each recommended item has many features, and clients belong to organizations, which creates interactions among them.

Recommendations should be personalized and take into account linkage through organizations. To handle the latter, we chose to use organization-related features as features in the model (instead of the possible alternative of having organization-level and individual customer-level models that are then combined).

These characteristics led to a personalized model for each client, with each model having more than 2,000 features. To avoid over-fitting, feature selection had to be performed at the customer level and thus, due to the large number of customers, in an automated fashion.

A possible line of attack is to apply PCA to each model. The problem with this approach is that the resulting features are linear combinations of original features. The project managers were consistently asking which features are important and what is the importance of each feature (think weights in logistic regression). For this reason PCA was not a viable option.

Instead we opted for feature selection based on information gain. The concept rests on the information gain between a feature vector and the label vector. The goal is to select a subset of features that maximizes the information gain between them and the label (revealing as much information about the labels as possible) while minimizing the information gain among the features themselves (minimizing their mutual redundancy). Formally, the goal is to find the set S maximizing (1/|S|) sum over f in S of ig(f, label) minus (1/|S|^2) sum over pairs f, g in S of ig(f, g), where ig is the information gain between two vectors. It is common to solve this problem, called maximum relevance minimum redundancy, greedily.
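A greedy version of this selection can be sketched in plain Python, using mutual information as ig on small discrete vectors. The tiny data set is contrived so that a redundant copy of a good feature gets penalized.

```python
import math
from collections import Counter

# Greedy maximum relevance minimum redundancy over discrete feature columns.

def ig(a, b):
    """Mutual information (information gain) between two discrete vectors."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def mrmr(features, labels, k):
    selected = []
    while len(selected) < k:
        best = max(
            (f for f in features if f not in selected),
            key=lambda f: ig(features[f], labels)
            - sum(ig(features[f], features[s]) for s in selected)
            / max(len(selected), 1),
        )
        selected.append(best)
    return selected

labels = [0, 0, 0, 1, 1, 1]
features = {
    "informative": [0, 0, 0, 1, 1, 0],   # strongly related to the label
    "copy":        [0, 0, 0, 1, 1, 0],   # identical, hence fully redundant
    "noise":       [0, 1, 0, 1, 0, 1],   # weakly related to the label
}
print(mrmr(features, labels, 2))  # → ['informative', 'noise']
```

The redundancy penalty is what keeps "copy" out despite its high relevance.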

Due to the large amount of data across all customers, we implemented the entire system in Spark. Spark’s MLlib unfortunately offers only feature selection based on PCA, not on information gain. For this reason we implemented our own package, available at https://github.com/wxhC3SC6OPm8M1HXboMy/spark-ml.git for public perusal.

For our purposes we also had to customize MLlib with other extensions (which are rather short and thus not put in a public repository).

Our data were heavily imbalanced (one class represented less than 1% of the records). We undersampled the large class, but we also used different weights for different records in the model calibration phase. This allowed us to put different weights on different customer events (for example, a misclassification of a purchase event carries more weight than a misclassification of an item that the user recommended to someone else in the same organization). We achieved this by subclassing LabeledPoint with a record weight. We also had to customize the classes for gradient computation (in logistic regression and SVM, since these were the most competitive models) in order to account for the weights.

The second enhancement was the use of a quadratic kernel in SVM and logistic regression. We implemented the quadratic kernel since it yields a linear model in a finite higher-dimensional space, and thus we could reuse a lot of the linear machinery in MLlib. To this end, we created a class that extends the input RDD with the kernel function and then exposes the standard API of linear models. We also extended the model class to automatically apply the same kernel function when a test vector is supplied for prediction.
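The explicit feature map behind this trick can be sketched in a few lines: phi(x) lists all monomials of degree at most two, so a linear model trained on phi(x) is a quadratic model in the original features. (Scaling constants that make the map match the kernel inner product exactly are omitted here.)

```python
from itertools import combinations_with_replacement

# Map x to its linear terms plus all degree-2 monomials x_i * x_j.

def quadratic_map(x):
    linear = list(x)
    quad = [x[i] * x[j]
            for i, j in combinations_with_replacement(range(len(x)), 2)]
    return linear + quad

x = [2.0, 3.0]
print(quadratic_map(x))  # → [2.0, 3.0, 4.0, 6.0, 9.0]
```

The dimension grows quadratically in the number of features, which is why this only pays off when the original feature space is moderate in size.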

Overall, Spark definitely met the needs of this project. It is robust and definitely scalable. On the downside, our problem was not a textbook model and thus it needed several enhancements to MLlib.