Thursday, April 9, 2020

Learnings of a practicing data scientist on building AI products

What is an AI product?

An AI product is just like any other software product, with the notable exception that AI provides a critical part of the value creation logic. Therefore building an AI product requires all the same skills as other software products, combined with the relevant parts of AI know-how.

The value creation logic must be clear enough to be written in a few simple sentences. No mumbo jumbo about AI, no technology name soup, and no elaborate explanation of how AI enables this and that and undoubtedly cures cancer too.


Types of AI projects - should you tackle market or technology risk first?

In practice there are two types of AI products: an amazing value proposition with high technological risk (a.k.a. the moonshot project, e.g. the self-driving car) and a moderate value proposition risk with moderate technical risk (a.k.a. "do old stuff better"). An amazing value proposition with low technical risk is rarely if ever seen in practice, and products with high technical risk and only a moderate value proposition are not worth doing.

It is common that the same project is a combination of the two aspects. Often it is possible to build a severely reduced feature set with moderate technical risk to test the market risk and solve the hard technical problem later. Other times one has to solve the hard part first, e.g. a flying car actually has to fly -- sort of flying while being stuck on the ground doesn't cut it.

Combining AI competence with use case knowledge - avoiding the collection of smiling clueless smart people

Building AI products is really about solving the customer need and not about AI at all. Therefore it is absolutely mandatory to have use case knowledge in the team and flawless communication between the technical and use case people. In practice this can be achieved only if use case people are part of the implementation team. Practice has shown that it is easier to teach the use case to technical people than the technology to use case people, but everybody should strive for somewhat overlapping knowledge to allow for a shared vision of the mission.

The root cause of all failures is the illusion of communication. I am sure everybody has been in meetings where nobody has a real understanding of what is going on but, for various reasons, is unwilling to ask for clarifications. This is a collection of smiling happy clueless people and a sure sign that the project is heading towards failure. The only remedy is to bite the bullet and ask for clarifications (a.k.a. truly stupid questions) until it is crystal clear that everybody is on the same page. It might be painful and might not make friends, but it is absolutely mandatory to deliver an AI product that truly creates value.

On skill and competence


Nature of skill and competence

General skill is overall competence on the topic. In software engineering this means design patterns, algorithms, good development practices, testing and things of this sort.

Specific skill is know-how on a specific technology or topic. In software engineering this means using frameworks/libraries and in some cases the internals of frameworks or technology. Initial specific skill is highly valuable in a short project and loses importance in long projects, as people are able to pick up the technologies used at a relatively fast pace anyhow.


Individual skillset

In an AI project one needs to have some level of these skills:
  • Software engineering
  • Data engineering
  • Data science/AI
  • Use case know-how
Each team member should be approximately evaluated on these skills, and the team's skills should be compared to the product being built. In some cases an AI driven product can be built with relatively weak data science know-how using ready made libraries. Yet even in these cases it is absolutely mandatory to investigate model performance thoroughly.

The self-evident conclusion is that the more competitive advantage the AI features provide, the bigger the AI know-how in the team needs to be.

Organizing the team

The two stereotypical ways to organize a team are
  • Everybody does everything
  • Each member has specialty


Everybody does everything

In an ideal agile team any team member can take any task and push it to production. In practical teams this is rarely accomplished, and often not even desired, since there are bound to be big productivity differences within the team on various tasks.

Yet, if the team consists of senior people with overlapping competences, it is possible and desirable that for each task there are multiple people capable of completing it. This also requires that team members are not keen to cherry-pick the tasks that are most interesting and fill the technology hype bingo in the CV.

The upside of this is that the team doesn't get stuck if any team member loses productivity temporarily, is on vacation or just leaves the team. Based on my experience this is also a very fun and enjoyable way to work, since it is impossible not to learn new things constantly.


Each team member has specialty

This makes body shopping easy. Need a deep learning specialist in flower species classification? No worries, just call a suitable consulting company and rent an expert for a suitable number of euros per hour and you are all set.

However, the downsides are significant. The obvious problem is that the lack of significant overlap between the team members causes big issues if a team member leaves the team or is not performing well enough. Another big problem is constant handovers and changing team composition. Usually a given specialty is needed only in one part of the project, and therefore people are joining and leaving the team constantly. A third problem is that there are only a limited number of sweet roles in any project -- in AI projects too there is more perspiration than inspiration, and for team spirit it is best to share the fun and boring parts mostly evenly.


Practical project management  -- AI R&D and software engineering


Dailies and other agile rituals

The project progresses like a clock: the faster the clock speed, the sooner the project completes. In my experience having a short (max 15 min) daily each day keeps people focused on the task at hand and sort of forces the project to go forward each day. Not all days are easy, but moving forward each day leads to success sooner or later. The trick to making the dailies work is to limit the people who attend --- only people working together should be in the daily.

Retrospectives are an extremely valuable tool to improve team productivity. Demos not so much --- often demos eat up quite a lot of working time to prepare even if the team explicitly agrees that demos should be very lightweight. Demos are good, but if you have to choose, prioritize dailies and retrospectives over demos every time.


On communication

Nobody in the history of software has failed due to over-communication, but every minute another project grinds to a halt due to over-meeting. Know the difference between meetings and communication and spend your time wisely.

The team must have a shared understanding of the product vision and the next few steps of the roadmap to the vision. The team cannot be successful if team members are not working towards the same goal with a common understanding of hard requirements (e.g. data cannot leave on-premise servers) and priorities.

Good communication is clear, short and to the point. AI hype and jargon are best left to sales people -- they do no good when you actually have to deliver.


Does the agile method work in data science?

By and large it doesn't. The agile method works when the primary concern is executing tasks and iterating the product. The problem with data science is that you are not really working on tasks but solving a puzzle or a crossword. When solving a puzzle one needs to iterate between trying to figure out individual words and looking at the big picture.

The trick is to have two working modes. At the start, when it is not clear how to solve the AI puzzle, one should strive for maximal flexibility and freedom --- do what you did during your PhD years but add teamwork to the mix. Drink too much coffee at the cafeteria, try things out on the whiteboards, walk around the building or go skydiving if it helps you clear your head. When the path to the solution is more or less clear and the biggest technical risks are manageable, move towards agile project work with well defined tasks and clear goals.

Toxic behaviour that must not be tolerated


Prima donnas

Sometimes an individual in the team develops prima donna behaviour, that is, the individual picks only nice tasks and therefore forces other team members to shovel the manure. The upside is that prima donnas are often relatively competent in their area of interest, so at least the prima donna is productive on the tasks he is willing to work on. However, overall this is really bad for the team spirit.

There are two ways to handle prima donnas:

  • The preferred solution is not to hire prima donnas and to give each team member a boring and unpleasant task already in the trial period at the start of the employment. Most people will pick up the hint and choose the team player mode.
  • Organize the whole team around the prima donna. In the extreme case the prima donna is absolutely critical to the project and you just have to tolerate the behavior. However, this is a major motivation killer for the other team members. In practice the easiest way out is to hire experienced consultants with a mercenary attitude. They know how to manage the prima donna, are in the project for the short term anyhow, are primarily in it for the money and therefore don't really mind shoveling the manure. This is usually expensive and not optimal in any way, so avoid the prima donna syndrome at all costs.


Dumping manure downstream - Jira ticket syndrome

This is a really bad one. Even in the most agile teams there are bound to be handovers where one team member does the first part of a task and another team member does the latter part. If the first team member is only interested in his individual productivity and ignores the project perspective, then often the person doing the latter part is forced to redo major parts of the earlier work too.

This happens when closing Jira tickets is seemingly more important than actually creating real value. The only remedy is to constantly remind everyone that no real value is created until an end-to-end solution is provided -- when running a marathon one needs to run all the kilometers, not just the first one in record time.


Eternal lack of faith

Sometimes everybody loses faith that the project will work out. If you never lose faith you are not solving a hard problem, so occasional whining and complaining is absolutely acceptable and much preferred to cult-like positivity where everything is an 'amazing challenge' and problems are essentially denied. However, you need to make sure that nobody loses faith for long and that team members are upbeat the vast majority of the time.

Progress is the best motivator, so each team member should have a constant mixture of relatively sure-to-succeed tasks and hard ones. This way the victories carry the torch when one is getting stuck in the tar pit.


Not done my way

This is similar to the prima donna syndrome in that it usually happens with senior team members or startup founders, but there are none of the positive aspects of the prima donnas.

Sometimes some team members require that everything is done the way they would do it. The team absolutely should have standards on code quality, some ways of working and things like that. Keeping up these standards is positive behavior, but micromanaging implementation details is not. Usually there are several more or less equally good ways to solve a task, and it is absolutely unreasonable to require that everything is done just the way you would do it.

The only remedy is to wake up and smell the coffee. If you need to have a team, you also need to embrace different ways to solve the tasks. If you cannot get your head around this then you should just quit, because the project is doomed to failure anyway. Being a jerk is a very expensive hobby.

On technology

AI components are part of an AI driven software product like any other component. Note that in most cases the machine learning model update cycle is independent of the rest of the software's update cycle, but in some use cases the machine learning model cannot be decoupled from the rest of the software.


Keep collecting new data

The first version of the AI is practically always trained on a static data set. Yet as the machine learning model runs in the wild, it encounters new situations that should be added to the training set. The world also has a tendency to change, so the existing training data may no longer be representative and the model is no longer well fitted to the new circumstances. Periodic model retraining becomes necessary and data collection becomes mandatory.

Build the data collection into the product and do not try to add it as a feature later on.
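As a minimal sketch of building data collection into the product (all class and function names here are hypothetical, invented for illustration), the prediction path itself can log every input/output pair so new training examples accumulate as a side effect of normal operation:

```python
import json
import time

class PredictionLogger:
    """Append every model input/output pair to a log so new
    training examples accumulate while the product runs."""
    def __init__(self):
        self.records = []  # in real life: a file, a queue, or a database

    def log(self, features, prediction):
        self.records.append({
            "ts": time.time(),
            "features": features,
            "prediction": prediction,
        })

def predict_and_log(model, features, logger):
    """Data collection lives on the prediction path itself."""
    prediction = model(features)      # model is any callable here
    logger.log(features, prediction)  # logging is not an afterthought
    return prediction

# usage with a stand-in model
logger = PredictionLogger()
toy_model = lambda f: sum(f)          # placeholder for the real model
result = predict_and_log(toy_model, [1, 2, 3], logger)
```

The point of the sketch is only that logging happens inside the same call that serves the prediction, so it cannot be forgotten later.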

Isolate machine learning components

In all AI products the vast majority of the code base has nothing to do with machine learning as such. For sure there is a lot of data related code all over the code base, but usually the actual machine learning is relatively well contained functionality.

In practice the machine learning model changes independently of other software changes. In the model runner concept the machine learning model is packaged behind an interface, and the model runner serves the machine learning model to other parts of the software. This allows for easy A/B testing of models and separates machine learning from the rest of the code base.

Usually the software update cycle and the model update cycle diverge. Software is updated to add new features and to fix bugs, whereas the model is retrained to better fit the always changing world. In the model runner pattern the model runner component is updated as part of the software update cycle, whereas models are updated as new data becomes available.
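The model runner pattern described above can be sketched roughly like this (class and method names are illustrative, not from any specific library):

```python
import random

class ModelRunner:
    """Serves models behind a stable interface so models can be
    swapped or A/B tested without touching the rest of the code."""
    def __init__(self):
        self.models = {}          # name -> callable model
        self.traffic_split = {}   # name -> share of requests

    def register(self, name, model, share):
        """Registering a new model version does not require a
        software release -- only the runner's configuration changes."""
        self.models[name] = model
        self.traffic_split[name] = share

    def predict(self, features):
        # route the request according to the configured traffic split
        names = list(self.traffic_split)
        weights = [self.traffic_split[n] for n in names]
        chosen = random.choices(names, weights=weights)[0]
        return chosen, self.models[chosen](features)

# usage: 90/10 A/B split between two stand-in models
runner = ModelRunner()
runner.register("baseline", lambda f: sum(f), share=0.9)
runner.register("candidate", lambda f: max(f), share=0.1)
name, value = runner.predict([1, 2, 3])
```

The rest of the software only ever calls `runner.predict`, so retrained models can be rolled out on their own cycle.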

AI that actually works is very use case specific machine learning, so it is easy to put behind an API or inside a software component.

Technology selection

Each technology used should be evaluated on the following criteria:
  • Fit for purpose. If the technology can't do the job then it is useless.
  • Will the technology be supported and developed in the future? If it is a commercial product, is it likely that the company stays afloat, and if it is an open source product, will the developer community still exist in the future?
  • Is the technology ready for production? Being a pioneer is tough, so check if the technology is used in production elsewhere.
  • Ease of finding developers. If you can't hire developers today and tomorrow, then the technology should not be chosen.
  • Could the technology choice be a pull factor when looking for team members? This might work sometimes but is also a very risky tactic, since often today's hot technology is tomorrow's niche legacy with a small developer community.
Today most products are built on cloud services. Often there is a temptation to aim for cloud independence so that cloud providers can be played against each other. In practice cloud independence means dockerizing most functionality, and therefore few of the cloud computing goodies -- serverless and managed services -- can be used. In most cases cloud independence is a very expensive proposition and should not be considered unless the project has very deep pockets.

The dreaded deadlines

Developers are notoriously reluctant to give firm deadlines, whereas business people seem to get fixated on deadlines. Both sides of the argument are valid, because you can't have a business case without cost estimates (= deadlines), but deadlines in software projects are somewhat of a joke. It is just really hard to estimate the amount of work needed in software engineering.

In the worst case work estimates are just fairy tales people pretend to believe in, but in the best case work estimates become a self-fulfilling prophecy. Yet once more the way out is efficient and clear communication between the implementation team and the ones paying for the product development.



An AI driven product project is very similar to any other software product project. The biggest differences are
  • Stronger emphasis on data
  • Machine learning engine of some sort
  • Higher technical risk
In most software products (or products with a large software component) the biggest risk is always the market risk. Therefore most AI products are not inherently riskier than other products, if the AI specific matters are managed properly. The notable exception is moonshot projects, but in these the risk is balanced by vastly larger market potential.

Monday, April 16, 2018

Figuring out an obscure database


In many data driven products one must look into an arcane operational database to access the necessary training data, or an operational database must be turned into an analytical database. Implementation details vary from project to project, but practically always one is confronted with an arcane operational database with obsolete documentation at best and usually none at all.

At first this might seem like a daunting task. How is it possible to decrypt the meaning of tens and tens of tables? But fear you not, here is how the trick is done.

Learning cycle

Start by asking why the database exists, because the use case of the application using the database is bound to show up in the database structure somehow. The best source for this information is whoever works daily with the database.

Draw a physical Entity-Relationship (ER) diagram of the database using the database DDL scripts. If you have access to the database you are almost guaranteed to be able to do this. Sometimes you can only access a database dump without valid constraints, but even then you can make educated guesses using table names, field names, data types and table structure.
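If the database is reachable from code, the DDL-level facts needed for the ER diagram can often be pulled out programmatically. Here is a sketch using SQLite's introspection PRAGMAs on a toy stand-in schema; a real operational database would use its own catalog views (e.g. information_schema), and the toy tables are invented for illustration:

```python
import sqlite3

# A toy schema standing in for the arcane operational database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        total REAL
    );
""")

def sketch_er(conn):
    """Collect each table, its columns and its declared foreign keys --
    the raw material for drawing a physical ER diagram."""
    er = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        cols = [r[1] for r in conn.execute(f"PRAGMA table_info({t})")]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = [(r[3], r[2], r[4]) for r in conn.execute(
            f"PRAGMA foreign_key_list({t})")]
        er[t] = {"columns": cols, "foreign_keys": fks}
    return er

diagram = sketch_er(conn)
```

Even when the dump lacks declared constraints, the same table-and-column listing gives the skeleton on which to pencil in guessed relationships.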

Most operational databases use common database design patterns (star schema, bridge table, etc.), so learn how these work and mapping business concepts to database tables becomes much easier. The physical ER diagram will help tremendously in this part. Remember though that even if the database was born with a solid structure, by the time you come around there can be any number of creative and not so solid features, so a flexible mind is mandatory.

The preceding steps offer a high level understanding of the database but contribute very little to understanding how to compute anything from it.

In almost all cases the application has some kind of UI, e.g. distribution of ice cream in an ice cream manufacturing company has some business user UI. Based on the database design patterns, the ER diagram and a little bit of intuition gained from discussions with business users, you can guess how to map the primary key of some central table of the database to the UI. When you compare what you see in the database to what the application wants to show to the user, you can create the first valuable hypothesis on how the database tables should be interpreted.

Now you have gained a little bit of understanding of the data, so use your newfound knowledge to ask more relevant questions from business users, understand the ER diagram better and so forth. Essentially you start with a very vague understanding of the database and iteratively refine your understanding until you can compute the necessary values using the database.

Tricks of the trade

  • Use mainly production data, since TEST (and QA too..) can be just about anything. Without any access to PROD data the probability of success is greatly diminished.
  • Change TEST data to confirm hypotheses, e.g. change the database state by making a new ice cream delivery order and check if the database looks like it should after the new order.
  • Sometimes you want to access the end results of the application and compare them to the database content. Quite often companies store the most relevant parts of the database information also in a different format, e.g. an ice cream manufacturing company might store delivery records in PDF files or in paper records for accounting purposes, since they must be able to present that data to regulatory agencies on request.
  • Beware that TEST data and PROD data can have funny differences, e.g. the IDs used to map ice creams to ice cream descriptions can be different in TEST and in PROD.
  • If you can access the application source code you should take a look. Be warned though, it can take a long time to understand the relevant parts of a complex legacy application, but sometimes you have no choice.
  • Quite often some logic is coded into the first character of a string or there is some other black magic involved. This creates a sort of hidden logic in the database that is not visible in ER diagrams but can be spotted by inspecting the data.
  • Don't assume that database content should be interpreted in the same way across different time spans. If the company's business has changed then the application has changed too, and seemingly similar products may not behave the same. A similar wholesale ice cream package may not look the same each year in the database.
  • You are solving a puzzle so combine intuition with logic.
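For instance, the "hidden logic in the first character" trick above can often be spotted by simply tabulating first characters. A small sketch with made-up product codes:

```python
from collections import Counter

# Hypothetical product codes where the first character secretly
# encodes the product family -- the kind of hidden logic that never
# shows up in an ER diagram.
codes = ["I-1001", "I-1002", "S-2001", "I-1003", "S-2002", "X-9999"]

def first_char_profile(values):
    """Tabulate first characters to surface hidden encodings."""
    return Counter(v[0] for v in values if v)

profile = first_char_profile(codes)
```

A skewed or suspiciously clean distribution of first characters is a strong hint that somebody is hiding business logic in the string.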

Wednesday, April 5, 2017

Anatomy of Data Science Project

In data science all the buzz and fuzz is about cool algorithms, exciting results and profits. However, to really gain value from data science it is necessary to understand the mundane day to day work from the project work angle too. As the adage goes, all models are wrong but some are useful, and the same applies to mental frameworks for handling data science projects too.

This is my framework, what is yours?

Value Cycle

You cannot hope to get it right in one go. You can try, but you will run out of time, out of budget, out of sponsors or out of ideas. The key is to deliver value continuously, and the only way to do that is to make the modelling process incremental. Start with a really simple model and use exploratory data analysis as well to produce value to stakeholders.

Data will surprise you, and an iterative approach allows you to be surprised just a little bit each time instead of being knocked out in one blow.

Key takeaways: Iterate fast and make sure every iteration ends in a tangible result. Use iterations to learn the ins and outs of the data.

Problem definition

The project starts, the team and all the stakeholders are excited. We are going to figure it out, we are going to make things real. Let's rock and roll!

Hold your horses for a while though. No matter how clear the objective seems and no matter how straightforward the project sounds, this is the time to stop and ask the critical questions:
  • Who is the end customer?
  • What is the value creation logic for the end customer?
  • What value does data science add here -- can you solve the problem on a whiteboard instead?
  • Why now, and what are the hard constraints (e.g. GDPR, deadlines)?
  • Is there a way to deploy the model?
  • What data sources are available?

Key takeaways: Understand what the product owner needs (not only what the product owner means), make sure you understand who the end customer is and what the value creation logic is.

Data collection and wrangling  

The ditch digging phase: a lot of perspiration and little inspiration. Chances are that data from various sources is imported several times during the project, and you should also make the whole project reproducible. The best way to achieve this is to automate data loading and wrangling as much as possible and write short README files when applicable.

No need to do extensive documentation, but write something, as there is always a plenitude of hidden assumptions in data. When debugging the models you are forced to check the details of data loading and wrangling over and over again, so don't fall for the temptation to do data loading and cleaning manually.
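A sketch of what automated, reproducible data loading can look like; the data and the cleaning rule are invented for illustration:

```python
import csv
import io

# Stand-in for a raw export from some source system.
RAW = """date,amount
2020-01-01, 10
2020-01-02,
2020-01-03, 30
"""

def load_sales(raw_text):
    """Reproducible loader: every cleaning rule lives in code,
    not in somebody's head. Rerunning it always gives the same result."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        amount = row["amount"].strip()
        if not amount:            # documented rule: drop rows missing amount
            continue
        rows.append({"date": row["date"], "amount": float(amount)})
    return rows

sales = load_sales(RAW)
```

Because the hidden assumptions (here, dropping missing amounts) are encoded in the function, they can be reviewed and rerun at every iteration.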

Key takeaways: Automate, document (don't overdo it)

Exploratory data analysis

Plot this and check that. Model performance depends on the data, and therefore you must know the data to choose a feasible model. This is also a great phase to detect inconsistencies and missing values in the data. If the data is not representative, you have a good chance to detect it here too.

Often you can find useful insights in this phase. It is not guaranteed, though, and depends on how well the data has been analyzed already. The value of an insight doesn't depend on how much work it required, but creating an insight is hard if the data set has already been analyzed through and through.
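One cheap check from this phase -- the share of missing values per field -- can be sketched like this (the records and field names are made up):

```python
def missing_report(records, fields):
    """Share of missing values per field -- a first EDA sanity check."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) in (None, "")) / n
            for f in fields}

# toy records standing in for real rows
rows = [
    {"age": 34, "city": "Helsinki"},
    {"age": None, "city": "Tampere"},
    {"age": 41, "city": ""},
    {"age": 29, "city": "Turku"},
]
report = missing_report(rows, ["age", "city"])
```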

Key takeaways: Intuition of data, insights

Optimization proxy selection

The great irony of data science (especially in machine learning) is that it is all about optimization, but you almost never optimize the real thing.

Let's think about a recommendation engine for a moment. What you really care about is providing interesting and surprising recommendations to the customer. This is, sadly, an impossible objective for any machine learning algorithm, and the data scientist must find a credible and technologically feasible proxy (e.g. ratings of movies per customer) for the end customer's true need.
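For the movie ratings proxy mentioned above, the explicit optimization target could be something as plain as root mean squared error on predicted ratings. A minimal sketch with made-up numbers:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error on ratings -- a measurable proxy for the
    unmeasurable goal of 'interesting and surprising recommendations'."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

actual_ratings = [5, 3, 4, 1]
predicted_ratings = [4, 3, 5, 2]
proxy_score = rmse(predicted_ratings, actual_ratings)
```

Making the proxy this explicit is the whole point: everyone can see what is actually being optimized and question whether it tracks the true need.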

Key takeaways: Make optimization proxy explicit.

Model development

This is the part of data science everybody is talking about and all the books are about, so let's not go into technical details here.

Sometimes you use different tool chains in different iterations of the value cycle (e.g. R -> Python -> Scala -> ..) and therefore you should decouple data preparation from modelling if possible. Of course there will be quite a lot of coupling, but try to keep it to a minimum at least in the early iterations, and stabilize the architecture/technology stack later as the project matures. Remember that using tools that fit the problem is much easier than shoehorning the problem into your favourite tool chain.

In the first iteration you should build a baseline model that can be deployed to production. Make it simple and make it clear. In later iterations you refine the model or choose a more complex approach altogether. The only exception to this rule is when you are doing a moonshot project (not clear if even doable), where there is no point even talking about production before the biggest technical risks are solved.
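A baseline model really can be this simple -- for a regression task, for instance, always predicting the training mean. The class below is an illustrative sketch, not any library's API:

```python
from statistics import mean

class BaselineModel:
    """Simplest deployable model: always predict the training mean.
    Later iterations must beat this to justify their complexity."""
    def fit(self, y):
        self.mean_ = mean(y)
        return self

    def predict(self, n):
        return [self.mean_] * n

# usage on toy targets
model = BaselineModel().fit([10, 20, 30, 40])
predictions = model.predict(3)
```

Besides being deployable on day one, such a baseline gives every later model a number to beat.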

The faster you can train the model, the better. You will run into many gotchas during model development, and the faster you train the model, the faster you can sort them out. On the other hand, many techniques to parallelize model training also add complexity and make training less transparent (e.g. running a cluster vs. running on one huge AWS instance). Which way to go depends on the case, but make it a conscious decision.

Key takeaways: Baseline model first, be tools agnostic, think about your productivity

Out-of-sample test

Time to see how predictive your predictive model really is and decide if you can use it in real life. An out-of-sample test means that you run your model on data that has not been used to train the model. In data science you generalize using training data, and the out-of-sample test tells how well you succeeded.
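The mechanics are simple: hold out part of the data before training and evaluate only on the held-out part. A self-contained sketch (the split ratio and seed are arbitrary choices):

```python
import random

def train_test_split(data, test_share=0.25, seed=42):
    """Hold out part of the data before training; the held-out part
    is the only honest measure of generalization."""
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
```

Everything in `test` must stay untouched until the final evaluation; peeking at it during model development quietly turns the out-of-sample test into an in-sample one.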

Key takeaways: Use the out-of-sample test to make the go-live decision.

Deploying model to production

There are two ways to turn insight into tangible benefit. The indirect route is essentially consulting, where you advise someone who has the means to make a difference. A classical example of the indirect route is optimizing a marketing budget. The direct route means that you create (or work with someone who creates) an automated system that provides insights (say, recommendations) to the end customer.

Key takeaways: Figure out a way to turn the model into real life benefits.

Real life test

The moment of truth: everything before this is just a dress rehearsal. There is no guarantee whatsoever that the customer cares for your optimization proxy. You worked hard, your hair turned gray and you started to look like an old and feeble man while developing the model, but does the customer really care?

This is the first time your end customer voices his opinion, so try to get here as soon as possible. In fact, does the customer even care for the product the model is embedded in?

Measure improvement against the baseline instead of an arbitrary target. The point is to improve performance over the baseline or create new business opportunities instead of trying to hit an arbitrary target (which may be way too low or way too high).

Key takeaways: Get here fast, measure improvements relative to the baseline model.

Data scientist survival guide

It is foolhardy to forget software development best practices. Not having extensive automated tests is going to backfire in data science just as surely as in any other software project. Static code analysis and linters are still a great (and almost free) way to improve code quality, and communication with stakeholders is just as important as in any other project.
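A few automated sanity checks on the data go a long way; here is a sketch of the kind of cheap test meant above (field names and validity rules are invented):

```python
def validate_features(rows):
    """Cheap automated checks that catch silent data corruption
    before it reaches the model."""
    problems = []
    for i, row in enumerate(rows):
        if row["age"] is None or not (0 <= row["age"] <= 120):
            problems.append((i, "age out of range"))
        if row["amount"] < 0:
            problems.append((i, "negative amount"))
    return problems

# usage: clean rows pass, a corrupt row is flagged on both rules
good = [{"age": 30, "amount": 9.9}, {"age": 64, "amount": 0.0}]
bad = [{"age": 200, "amount": -5.0}]
issues = validate_features(bad)
```

Run as part of the test suite, checks like these fail loudly the day an upstream source changes under you.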

There is a plenitude of free packages available online, but here you should use the same caution as in software projects. Check the licence, check if the package is still updated (check the commit history) and use your judgement to determine if the package will be updated in the future.

However, there is more to the story. Data science projects are inherently risky. No matter how many algorithms you try (or invent), there is always a chance that the data just isn't representative (relevant to the problem) or that there is no signal in the data. Sooner or later you will run into this, but almost always there are more degrees of freedom than you initially see. Maybe you can buy some data, collect more data or try some totally orthogonal approach? These allow you to turn a looming failure into a great, if a bit postponed, success. There are always more solutions than you think, trust me on this one.

Copy all the good parts from software engineering and add a strong dose of Sherlock Holmes' wits. A friend of mine once described data science as a no-holds-barred approach to solving a crossword puzzle. Everything is allowed except using the test set in the training phase.

Key takeaways: Be resourceful and organized.

Product owner survival guide

Data science projects have a tendency to become quite complicated quite fast. The data sources come in all forms and sizes, and the algorithms are exotic and full of misleading acronyms. It is quite easy to get lost in the endless jargon and incomprehensible gibberish.

However, most of this is just implementation details and not actually that important to you. As a product owner you should really be concerned about only a few things:
  • Is the team solving the right question (problem definition phase)?
  • Is the out-of-sample test done and are the results encouraging (go-live decision)?
  • Are the results in the production system measured, and how big is the improvement compared to the baseline (real life test)?
  • Are there deliverables from each iteration of the value creation cycle (is there real progress)?

Key takeaways: Solve the right question, out-of-sample test, real-life test


Solve the right question, work iteratively, check your results against baseline and deploy to production as fast as possible.

Not all data science projects fit this framework, but all projects have at least some elements of it. Actually, most projects have all the elements in some sense.

Data science is fun!

Wednesday, June 15, 2016

Value creation in data science

Creating value in data science is hard work and requires a down to earth attitude. One must understand the business model thoroughly and provide actionable insights, or better yet provide an end-to-end working system that captures value and serves it to the end customer. Note that the end customer may be e.g. a consumer, an employee or a company.

What is real value?

A genuinely good thing. Not great because you say it is great, not credible because of your reputation, nor supported because it advances somebody's personal agenda, but because it helps a corporate customer to solve a problem that prevents business growth (in existing or new business) or allows a consumer to enjoy life more. Real value makes a measurable difference and improves lives.

Kinds of insights

A good insight is profitable and actionable. Let's break this down into components and examine them separately.

A profitable insight is a plan or a non-trivial fact that provides value to the customer if acted upon. You must also estimate the size of the opportunity to decide if the opportunity is worth pursuing.

An actionable insight is doable in practice. That is, you can reach the end customer in some way and provide to the end customer the value created by the insight.

The insight "People who buy carrots also buy jeans" is profitable because in principle it allows you to bundle these products non-trivially together, but actionable only if there is a way to serve this bundle to the end customer, e.g. through shelf placement in a retail store or via a mobile application.
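As an illustration, one quick way to check whether such a co-purchase insight is non-trivial is to compute the lift of the item pair: how much more often the items co-occur than independence would predict. A minimal sketch with toy transaction data (the numbers are invented for illustration):

```python
# Toy transactions (hypothetical data for illustration only).
transactions = [
    {"carrots", "jeans"},
    {"carrots", "jeans", "milk"},
    {"carrots"},
    {"jeans"},
    {"milk"},
]

def support(items):
    """Fraction of transactions containing all the given items."""
    return sum(items <= t for t in transactions) / len(transactions)

# Lift > 1 means the items co-occur more often than independence predicts.
lift = support({"carrots", "jeans"}) / (support({"carrots"}) * support({"jeans"}))
print(round(lift, 3))  # 1.111
```

A lift above one only tells you the bundle is non-trivial; whether it is profitable still depends on the size of the opportunity, and whether it is actionable depends on having a channel to serve the bundle.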

Ways to create value out of an insight

An insight can be valuable either directly or indirectly. Advising on trends or segmenting the market tends to fall into the indirect category, whereas a recommendation system or predictive maintenance are direct ways to create value out of an insight.

When creating value directly one implements a system that serves value directly to the end customer, e.g. by providing personalized, meaningful and non-obvious leisure time activity recommendations through Facebook.

When creating value indirectly one must provide advice to somebody else who has the means to make the end customer's life better, e.g. providing advice to a car dealer on car accessories that make the end customer's driving safer and more enjoyable.

The direct way to provide value has the benefit that it can usually be measured and iterated rapidly, whereas the indirect way to provide value tends to have a much longer feedback loop.

How to discover profitable and actionable insights?

Start from the basics. Plot this, double check that, check for data consistency, talk to all stakeholders and make sure you got the basics right. You cannot hope to gain valuable insights if you don't understand the business both quantitatively and qualitatively. 

Max out business intelligence tools and existing infrastructure. Many insights can be found using business intelligence tools, and they can also be used to double-check and clarify insights found with more advanced tools like Spark.

Be tool agnostic. If the job requires R then use R, and if the job requires Spark then use Spark. As a rule it is easier to learn multiple tools than to squeeze the job into a form your favourite tool chain supports. This is especially important if you change topics rapidly and build on top of existing libraries and packages. Remember that the best tool varies on a case-by-case basis.

Be practical and never forget that your job is to provide value. If the problem can be solved using a whiteboard and a spreadsheet, all the better, because you just saved everybody's time and money. Then again, if advanced tools are needed then they must be used, and it is up to you to overcome any technical hindrance that you might run into.

maanantai 28. joulukuuta 2015

Outlier detection using Spark


Detecting outliers is very important e.g. in fraud detection, in medicine and in intruder detection systems. One important application is to determine whether a particular test result should lead to medical care or extensive clinical testing. All of these applications require scalable outlier detection algorithms and a computational platform, as datasets can be very big and include both numerical and categorical values.

It is hard to define exactly what an outlier is, but it seems to be any observation or item that is somehow distinctly different from other observations or items. There are several algorithms to detect outliers; let's take a look at one of the more straightforward ones. The attribute value frequency (AVF) algorithm scores observations based on the frequencies of their attribute values. Observations that receive the lowest scores are assumed to be outliers, with hyperparameter $k$ determining the number of outliers.

Spark is an easy-to-use distributed computing platform built on Resilient Distributed Datasets (RDDs), allowing efficient iterative algorithms. Spark supports the Java, Python and Scala programming languages, and AWS supports creating Spark clusters with only a couple of mouse clicks. For more information on how Spark achieves its efficiency, see e.g. Spark: Cluster Computing with Working Sets or Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.

To develop the algorithm and test its scalability I wrote tools to generate arbitrarily sized test datasets. The tools and the Spark program for AVF can be found here. Using a generated dataset is a good way to make sure that your algorithm works as intended.

Attribute Value Frequency algorithm

The attribute value frequency algorithm scores each observation using

\begin{equation} AVFscore(x_{i}) = \frac{1}{m}\sum_{l=1}^{m}f(x_{il}), \end{equation} where $m$ is the number of attributes, $x_{il}$ is the l-th attribute value of $x_{i}$ and $f(x_{il})$ is the number of times the l-th attribute value of $x_{i}$ appears in the dataset.

AVF algorithm pseudocode:
Input: dataset D, number of outliers k
Output: k detected outliers

label all data points as non-outliers
calculate frequency of each attribute value
foreach point x(i)
  AVFscore(i) = compute AVF score for point x(i)
end foreach
return the k points with minimum AVFscore

Using Spark this can be expressed with a handful of map and reduce operations; the full listing is in the repository linked above.
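As a minimal sketch of the same pipeline in plain Python (mirroring the count-frequencies, score, take-k structure; the toy data below is invented for illustration):

```python
from collections import Counter

def avf_outliers(dataset, k):
    """Return the k lowest-scoring (most outlying) observations by AVF."""
    # Count how often each (attribute index, value) pair occurs in the dataset.
    freq = Counter((l, v) for row in dataset for l, v in enumerate(row))
    m = len(dataset[0])
    # AVF score: mean frequency of the observation's attribute values.
    scored = [(sum(freq[(l, v)] for l, v in enumerate(row)) / m, row)
              for row in dataset]
    scored.sort(key=lambda s: s[0])
    return [row for _, row in scored[:k]]

# Toy categorical dataset: the last row is the obvious outlier.
data = [("red", "small"), ("red", "small"), ("red", "small"), ("blue", "big")]
print(avf_outliers(data, 1))  # [('blue', 'big')]
```

In Spark the same structure maps naturally onto a frequency-counting pass over the attribute values followed by a scoring pass over the observations, with the counts shared across the cluster.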

Running spark on AWS

An excellent tutorial on submitting Spark jobs to AWS EMR using the CLI can be found here.

First one needs to create a Spark cluster and then add a step to run the code. The code to be run needs to be put to S3, and it is advisable to also write the results to S3.

In this case there is no need to install custom software to the cluster. However, using the public DNS name and the ssh key (selected when the cluster was created), you can log in directly to the master or slave nodes to install needed software like scikit-learn or OpenCV. Note that you must first allow ssh connections (port 22) from your IP address to the nodes. If you wish to allow connections from everywhere, set the IP address to ''.


Spark allows straightforward implementations of certain classes of outlier detection algorithms. AWS provides an easy-to-use environment to scale your computations to arbitrary size dynamically. There are some hiccups when setting up the cluster in AWS, but mostly it is an enjoyable experience.

lauantai 5. syyskuuta 2015

Inverted pendulum - simulation and control

Check out the SIMULATION! Swipe the screen horizontally to move the pendulum.


The inverted pendulum is a classic problem in control theory and therefore well suited for an exercise combining mathematical modelling, simulation, control theory and web technologies for presentation purposes.

Just about every suitable controller has been applied to the inverted pendulum problem, since it has become the challenge every new controller must be able to conquer. Model Predictive Control (MPC) uses a mathematical model of the phenomenon to predict how the system is going to evolve and what control input should be applied to achieve a desired end state. An MPC controller computes an optimal control sequence and applies the first control input, then computes the optimal control sequence again and applies the first control input of the new sequence, and so forth. You are encouraged to take a look at the freely available excellent lecture notes.

In all projects the key to success is rigorous scoping and prioritization. This project consists of making a simulation model and a linear model predictive controller in C++ and using a Node.js addon to glue those to a web frontend. The simulation is visualized by drawing the inverted pendulum in 3D using the three.js WebGL library.

Code is available at Bitbucket and a running example is available here. In the example you can whack the pendulum using a mouse (press the left button down, move the mouse horizontally and release the button) or a swipe on a mobile device, and see how the controller stabilizes the system.

Simulation model

The first step is to create a reasonably accurate simulation model of the inverted pendulum. In principle this could be done using the standard Newtonian physics you learned at school, F=ma and so forth, but handling constraints is a bit tricky in the Newtonian formalism. For example, the inverted pendulum should not fall through the ground nor rise above it. Ensuring this in the Newtonian formalism requires careful consideration of all relevant forces that cancel out in the actual simulation, keeping the inverted pendulum on the ground.

However, it is possible to handle constraints using Lagrangian mechanics, which uses generalized coordinates and forces. That sentence may be a bit confusing if your background is not in physics or control theory, but it will become very clear by the end of this article.

In Lagrangian mechanics you need to express kinetic and potential energy using generalized coordinates. In this case kinetic energy is
\begin{equation} K = \frac{1}{2}M\dot{x}^2 + \frac{1}{2}m\big[ (\dot{x} - L\dot{\varphi}\sin{\varphi})^2 + (\dot{y} + L\dot{\varphi}\cos{\varphi})^2 \big] \end{equation}
and potential energy is \begin{equation} V = mgL\sin{\varphi} \end{equation} The Lagrangian is \begin{equation} \begin{split} L = &K - V \\ = & \frac{1}{2}(M+m)\dot{x}^2 + \frac{1}{2}mL^{2}\dot{\varphi}^{2} - mL\sin{\varphi}\dot{x}\dot{\varphi} - mgL\sin{\varphi} \\ \end{split} \end{equation} since $\dot{y}=0$. The Lagrangian can be used to derive the differential equations of the system using the generalized equations of motion
\begin{equation} \label{equations_of_motion} \frac{d}{dt} \frac{\partial L}{\partial \dot{q_{i}}} - \frac{\partial L}{\partial q_{i}}= F_{i}, \end{equation} where $F_{i}$ is generalized force. Let's apply generalized equations of motion (\ref{equations_of_motion}) for $x$ coordinate
\begin{equation} \begin{split} \frac{\partial L}{\partial \dot{x}} = & \: (M+m)\dot{x} - mL\dot{\varphi}\sin{\varphi} \\ \frac{\partial L}{\partial x} = & \: 0\\ \frac{d}{dt}\bigg(\frac{\partial L}{\partial \dot{x}}\bigg) = & \: (M+m)\ddot{x} - mL\ddot{\varphi}\sin{\varphi} - mL\dot{\varphi}^{2}\cos{\varphi}\\ \end{split} \end{equation} and therefore \begin{equation} (m+M)\ddot{x} - mL\ddot{\varphi}\sin{\varphi} - mL\dot{\varphi}^{2}\cos{\varphi} = F_{lin}, \end{equation} where $F_{lin}$ is in this case just a combination of the control force ($F_{c}$) pushing the cart left and right and linear friction, defined as \begin{equation} F_{lin} = F_{c} - \mu_{x}'\dot{x}, \end{equation} where $\mu_{x}'$ is the linear friction coefficient including the physical properties of the ground and the cart. Let's next consider coordinate $\varphi$ \begin{equation} \begin{split} \frac{\partial L}{\partial \dot{\varphi}} = & \: mL^{2}\dot{\varphi} - mL\dot{x}\sin{\varphi} \\ \frac{\partial L}{\partial \varphi} = & \: -mL\dot{x}\dot{\varphi}\cos{\varphi} - mgL\cos{\varphi}\\ \frac{d}{dt}\bigg(\frac{\partial L}{\partial \dot{\varphi}}\bigg) = & \: mL^{2}\ddot{\varphi} - mL\ddot{x}\sin{\varphi} -mL\dot{x}\dot{\varphi}\cos{\varphi}\\ \end{split} \end{equation} and therefore \begin{equation} L\ddot{\varphi}-\ddot{x}\sin{\varphi} + g\cos{\varphi} = \frac{F_{\varphi}}{mL} = F_{rot}, \end{equation} where $F_{rot}$ is a scaled generalized force defined as \begin{equation} F_{rot} = F_{rotnoise} - \mu_{\varphi}'\dot{\varphi}, \end{equation} where $\mu_{\varphi}'$ is the rotational friction coefficient including the hinge's physical properties and $F_{rotnoise}$ is a rotational disturbance which can be intentional, say user input, or noise, e.g. due to wind.

Next we need to uncouple $\ddot{x}$ and $\ddot{\varphi}$ which is easily done by solving them from the systems of equations we just derived. After a little bit of arithmetic one gets \begin{equation} \frac{d}{dt} \left[ {\begin{array}{c} \dot{\varphi}\\ \varphi\\ \dot{x}\\ x \end{array} } \right] = \left[ {\begin{array}{c} \frac{F_{lin}\sin{\varphi}+mL\dot{\varphi}^{2}\cos{\varphi}\sin{\varphi} - (m+M)g\cos{\varphi} + (m+M)F_{rot}}{L(m+M-m\sin^{2}{\varphi})}\\ \dot{\varphi}\\ \frac{F_{lin} + F_{rot}m\sin{\varphi} -mg\sin{\varphi}\cos{\varphi}+mL\dot{\varphi}^{2}\cos{\varphi}}{m + M - m\sin^{2}{\varphi}}\\ \dot{x} \end{array} } \right] \label{statePropagationEq} \end{equation}
These equations can be solved numerically using a Runge-Kutta method. The friction terms serve a dual purpose. Firstly, they are physically justified. Secondly, they make the system bleed out energy. Due to numerical inaccuracies and approximations, Runge-Kutta solvers tend to gather some extra energy in a system like this, so bleeding out some energy helps to keep the system stable.
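As a sketch of the integration step (plain Python rather than the project's C++, and with illustrative parameter values rather than the ones used in the demo), the state propagation equation above can be advanced with a classic fourth-order Runge-Kutta step:

```python
import math

# Illustrative parameters (assumptions, not the values used in the demo).
m, M, L, g = 0.2, 1.0, 0.5, 9.81
mu_x, mu_phi = 0.1, 0.05

def deriv(state, F_c=0.0, F_rotnoise=0.0):
    """Right-hand side of the uncoupled equations of motion.

    state = [phi_dot, phi, x_dot, x]; phi is measured from the horizontal,
    so the upright position is phi = pi/2.
    """
    phi_dot, phi, x_dot, _x = state
    F_lin = F_c - mu_x * x_dot              # control force plus linear friction
    F_rot = F_rotnoise - mu_phi * phi_dot   # disturbance plus rotational friction
    s, c = math.sin(phi), math.cos(phi)
    denom = m + M - m * s * s
    phi_ddot = (F_lin * s + m * L * phi_dot**2 * c * s
                - (m + M) * g * c + (m + M) * F_rot) / (L * denom)
    x_ddot = (F_lin + F_rot * m * s - m * g * s * c
              + m * L * phi_dot**2 * c) / denom
    return [phi_ddot, phi_dot, x_ddot, x_dot]

def rk4_step(state, dt, F_c=0.0):
    """One fourth-order Runge-Kutta step of length dt."""
    k1 = deriv(state, F_c)
    k2 = deriv([s + 0.5 * dt * k for s, k in zip(state, k1)], F_c)
    k3 = deriv([s + 0.5 * dt * k for s, k in zip(state, k2)], F_c)
    k4 = deriv([s + dt * k for s, k in zip(state, k3)], F_c)
    return [s + dt / 6.0 * (a + 2 * b + 2 * c_ + d)
            for s, a, b, c_, d in zip(state, k1, k2, k3, k4)]

# The upright position [0, pi/2, 0, 0] is an (unstable) equilibrium,
# so with no forces a step should leave the state unchanged.
state = rk4_step([0.0, math.pi / 2, 0.0, 0.0], 0.01)
```

Checking that the upright equilibrium stays put is a cheap sanity test of both the derivative function and the integrator.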

Model predictive control

Model predictive control comes in two variants. Linear model predictive control considers only linear systems, whereas nonlinear model predictive control considers nonlinear systems. Linear model predictive control leads to a convex optimization problem that can be solved easily, whereas the optimization problem in nonlinear model predictive control may be riddled with troubles such as local optima and is generally hard to solve. One of the great benefits of model predictive control is that you can explicitly limit the maximum and minimum values of the control input. That is, the optimal sequence of control inputs is different based on the physical limits of the actuators. In addition, you can also set limits on the allowed states (say, the speed of the cart).

The linear model predictive control problem tries to find the sequence of control inputs (the elements of vector $\mathbf{u}$) such that \begin{equation} \begin{split} J(k) = & \sum_{i=1}^{N}\big[ x^{T}(k+i|k)Qx(k+i|k) + u^{T}(k+i|k)Ru(k+i|k) \big]\\ &+ x^{T}(k+N|k)\bar{Q}x(k+N|k) \end{split} \label{Jk} \end{equation} is minimized. $N$ is the length of the prediction horizon and the matrices $Q$, $R$ and $\bar{Q}$ can be used to give different terms and variables a proper weight in the optimization problem. These matrices offer some degrees of freedom to the control problem and can be used to tune the controller. In numerical computations the matrix form is preferred and an equivalent form of ($\ref{Jk}$) can be used \begin{equation} J(k) = \mathbf{u}^{T}(k)H\mathbf{u}(k) + 2x^{T}F^{T}\mathbf{u}(k) + x^{T}(k)Gx(k) \label{Jkmatrix} \end{equation} The first term is the control input cost punishing excessive control inputs. The second term is the cost due to system state changes caused by the control input, and the last term is the cost due to free propagation of the system, as it doesn't depend on the control input. The matrices are \begin{equation} \begin{split} H = & C^{T}\tilde{Q}C + \tilde{R} \\ G = & M^{T}\tilde{Q}M + Q \\ F = & C^{T}\tilde{Q}M, \\ \end{split} \end{equation} where \begin{equation} M = \left[ {\begin{array}{c} A\\ A^{2} \\ \vdots \\ A^{N} \end{array} } \right] \end{equation} \begin{equation} C = \left[ {\begin{array}{cccc} B & 0 & \cdots & 0\\ AB & B & \cdots & 0\\ \vdots& \vdots & \ddots & \\ A^{N-1}B & A^{N-2}B & \cdots & B \end{array} } \right] \end{equation} \begin{equation} \tilde{Q} = \left[ {\begin{array}{cccc} Q & 0 & \cdots & 0\\ 0 & \ddots & & \vdots\\ \vdots& & Q & 0\\ 0 & \cdots & 0 & \bar{Q} \end{array} } \right] \end{equation} and \begin{equation} \tilde{R} = \left[ {\begin{array}{cccc} R & 0 & \cdots & 0\\ 0 & \ddots & & \vdots\\ \vdots& & R & 0\\ 0 & \cdots & 0 & R \end{array} } \right] \end{equation} Many solvers require the derivative of the cost function with respect to the optimized parameter.
Taking the derivative of ($\ref{Jkmatrix}$) with respect to $\mathbf{u}$ leads to \begin{equation} \frac{dJ(k)}{du} = 2\mathbf{u}^{T}(k)H + 2x^{T}F^{T} \end{equation} For the details of the derivation, see the lecture notes referenced above.
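To make the block-matrix construction concrete, here is a small NumPy sketch (with an illustrative toy $A$, $B$, $Q$, $R$, not the pendulum's matrices) that builds $M$, $C$, $\tilde{Q}$, $\tilde{R}$, $H$, $F$ and $G$, and checks that the matrix form of $J(k)$ matches a brute-force evaluation of the stage costs (note the matrix form's convention: the terminal weight $\bar{Q}$ replaces $Q$ at the end of the horizon, and $G$ contributes a constant $x^{T}Qx$ term):

```python
import numpy as np

def mpc_cost_matrices(A, B, Q, R, Qbar, N):
    """Build H, F, G for J = u'Hu + 2x'F'u + x'Gx over horizon N."""
    n, p = B.shape
    M = np.vstack([np.linalg.matrix_power(A, i) for i in range(1, N + 1)])
    C = np.zeros((N * n, N * p))
    for i in range(N):          # block row of C
        for j in range(i + 1):  # block column of C
            C[i*n:(i+1)*n, j*p:(j+1)*p] = np.linalg.matrix_power(A, i - j) @ B
    Qt = np.kron(np.eye(N), Q)
    Qt[-n:, -n:] = Qbar         # terminal weight in the last diagonal block
    Rt = np.kron(np.eye(N), R)
    H = C.T @ Qt @ C + Rt
    G = M.T @ Qt @ M + Q
    F = C.T @ Qt @ M
    return H, F, G

# Toy double-integrator system (illustrative values).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q = np.eye(2); Qbar = 5 * np.eye(2); R = np.array([[0.1]])
N = 4
H, F, G = mpc_cost_matrices(A, B, Q, R, Qbar, N)

x0 = np.array([1.0, -0.5])
u = np.array([0.3, -0.2, 0.1, 0.0])
J_matrix = float(u @ H @ u + 2 * x0 @ F.T @ u + x0 @ G @ x0)

# Brute force: simulate the linear system and sum the stage costs.
x, J_sum = x0.copy(), float(x0 @ Q @ x0)
for i in range(N):
    x = A @ x + B.flatten() * u[i]
    W = Qbar if i == N - 1 else Q
    J_sum += float(x @ W @ x + u[i] * R[0, 0] * u[i])
```

The agreement of the two evaluations is a cheap way to catch off-by-one errors in the block structure of $C$ and $\tilde{Q}$.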

There are two glaring issues here. Firstly, the differential equations of the inverted pendulum are nonlinear, while ($\ref{Jkmatrix}$) applies only to linear systems. Secondly, the MPC control problem is presented in terms of difference equations instead of differential equations. Let's tackle these issues one by one. The inverted pendulum can be considered linear as long as it is almost in the upright position. Applying the linearization formula

\begin{equation} \bar{f}(\bar{x}) = \bar{f}(\bar{x_{0}}) + \nabla \bar{f}|_{x_{0}}(\bar{x}-\bar{x}_{0}) \end{equation}

to ($\ref{statePropagationEq}$) at $ x_{0}=[0, \pi/2, 0, 0]^{T}$ leads to a system of linear differential equations
\begin{equation} \dot{x} = Ax(t) + Bu(t), \end{equation} where \begin{equation} A = \left[ {\begin{array}{cccc} -\frac{\mu_{\varphi}(m+M)}{LM} & \frac{(m+M)g}{LM} & -\frac{\mu_{x}}{LM} & 0\\ 1 & 0 & 0 & 0\\ -\frac{\mu_{\varphi}m}{M} & \frac{mg}{M} & -\frac{\mu_{x}}{M} & 0\\ 0 & 0 & 1 & 0 \end{array} } \right] \end{equation} and \begin{equation} B = \left[ {\begin{array}{c} \frac{1}{LM}\\ 0 \\ \frac{1}{M}\\ 0 \end{array} } \right] \end{equation} $x(t)$ is the state vector (note that this is not the x coordinate) and $u(t)$ is the control input. In this case the only available control input is to push the cart left or right, so $u(t)$ is just $F_{c}$. Also, it's assumed that $F_{rotnoise}$ is zero centered, so it is ignored in the prediction model. Let's now transform these to difference equations using the formulas

\begin{equation} A_{discrete} = e^{AT} \end{equation} and \begin{equation} B_{discrete} = \bigg( \int_{0}^{T}e^{At}dt \bigg)B, \end{equation} where $T$ is the sampling interval.

The first matrix can be evaluated as a power series since
\begin{equation} e^{X} = \sum_{0}^{\infty}\frac{1}{k!}X^{k} \end{equation}
However, the second expression is trickier since matrix A is singular in this case and you cannot use the trivial integration formula familiar from the scalar case. Let's derive a power series form for it as well

\begin{equation} \begin{split} \int_{0}^{T}e^{At}dt = & \int_{0}^{T} \sum_{0}^{\infty}\frac{1}{k!}(At)^{k}dt \\ = & \sum_{0}^{\infty}\frac{1}{k!}A^{k}\int_{0}^{T}t^{k}dt \\ = & \sum_{0}^{\infty}\frac{1}{k!}A^{k}\bigg/_{0}^{T}\frac{1}{k+1}t^{k+1} \\ = & \sum_{0}^{\infty}\frac{1}{k!}A^{k}\frac{1}{k+1}T^{k+1} \\ = & T\sum_{0}^{\infty}\frac{1}{(k+1)!}\big(AT\big)^k \end{split} \end{equation}
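These two series can be sanity-checked numerically. The sketch below truncates both at a fixed order and tests against a double integrator, whose $A$ is singular (just like the pendulum's) but whose discretization is known in closed form:

```python
import numpy as np

def discretize(A, B, T, order=20):
    """Discretize x' = Ax + Bu by truncated power series.

    A_d = e^{AT} = sum_k (AT)^k / k!
    B_d = (integral_0^T e^{At} dt) B = T * (sum_k (AT)^k / (k+1)!) B
    """
    n = A.shape[0]
    Ad = np.eye(n)       # k = 0 term of e^{AT}
    S = np.eye(n)        # k = 0 term of sum (AT)^k / (k+1)!
    term = np.eye(n)     # running (AT)^k / k!
    for k in range(1, order):
        term = term @ (A * T) / k   # (AT)^k / k!
        Ad += term
        S += term / (k + 1)         # (AT)^k / (k+1)!
    return Ad, T * S @ B

# Double integrator: A is nilpotent (hence singular), closed form known.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Ad, Bd = discretize(A, B, T=0.1)
# Expected: Ad = [[1, T], [0, 1]], Bd = [[T^2/2], [T]]
```

For a nilpotent $A$ the series terminate after a few terms, so the truncated sums are exact; for a general $A$ the truncation order controls the accuracy.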
Solving optimization problems is a fascinating but quite involved topic, so let's scope implementation of a constrained optimization problem solver out by using an existing library. The dlib library offers a wide range of optimizers as well as classes for matrix arithmetic.

Software stack 

In one sense software engineering is all about stacking libraries on top of each other to create a fully functional system. The more one can use ready-made libraries and components, the more quickly one can finish the project. Hence the name software stack and the importance of getting it right.

The browser frontend uses three.js for 3D visualization of the inverted pendulum. Three.js is an easy-to-use abstraction of the WebGL interface. Visualization of the simulation needs to be near real time, so a websocket is used for communication between the browser frontend and the server backend, which is implemented using Node.js with Express.js. The actual simulation is implemented using standard C++ threads which don't share a lot of state. One thread runs the actual simulation and keeps up the simulation state while the other thread runs the MPC controller.

As shown above, an MPC controller is essentially just a real-time optimization problem solver whose input is the system state and whose output is the control value needed to solve the control problem. An actual controller usually also has an estimation block, but here it is assumed that the controller is able to measure the system state directly. Noise is added to the controller's state measurement to make the simulation more interesting.
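The receding-horizon loop itself, with a noisy measurement feeding the solver and only the first control input applied, can be sketched on a toy scalar system (a stand-in for the pendulum, with a one-step horizon so the unconstrained optimum is analytic; the real controller solves the N-step constrained problem with dlib):

```python
import random

# Toy unstable plant x+ = a*x + b*u with one-step quadratic cost
# q*(a*x + b*u)^2 + r*u^2 (illustrative values, not the pendulum).
a, b, q, r = 1.2, 1.0, 1.0, 0.1

def mpc_step(x_measured):
    # d/du [q*(a*x + b*u)^2 + r*u^2] = 0  =>  u = -q*a*b*x / (q*b^2 + r)
    return -q * a * b * x_measured / (q * b * b + r)

random.seed(0)
x = 5.0
for _ in range(30):
    y = x + random.gauss(0.0, 0.01)   # noisy state measurement
    u = mpc_step(y)                   # solve, apply only the first input
    x = a * x + b * u                 # the true plant propagates
```

Despite the measurement noise and the unstable open loop, the state is driven near zero, which is exactly the behaviour the simulation demonstrates for the pendulum.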

The actual optimization problem is solved using the dlib library, a templated multipurpose optimization and machine learning library implemented in C++. The simulation is glued to the frontend using a Node.js addon that exposes V8 objects to JavaScript.

Gluing dlib to Node.js was fairly easy. By default run-time type information (RTTI) and exceptions are turned off in Node.js addon builds. Both of these are needed by dlib, but you can easily turn them on in the binding.gyp file.


Model predictive control allows fine-tuned and efficient control of a complex system while also taking into account the physical limitations of actuators. Web technologies allow you to present your work in an interactive manner without forcing the reader to install a multitude of arcane libraries and navigate through the maze of issues caused by slightly varying system environments and setups.

Control theory is cool by itself, and in addition the mathematical foundations of many branches of data science have a lot in common with the tools used in model predictive control.