Wednesday, April 5, 2017

Anatomy of a Data Science Project

In data science, all the buzz is about cool algorithms, exciting results and profits. However, to really gain value from data science it is necessary to understand the mundane day-to-day work from a project angle too. As the adage goes, all models are wrong but some are useful, and the same applies to mental frameworks for handling data science projects.

This is my framework; what is yours?

Value Cycle


You cannot hope to get it right in one go. You can try, but you will run out of time, out of budget, out of sponsors or out of ideas. The key is to deliver value continuously, and the only way to do that is to make the modelling process incremental. Start with a really simple model and also use exploratory data analysis to produce value for stakeholders.

Data will surprise you, and an iterative approach allows you to be surprised just a little bit each time instead of being knocked out in one blow.

Key takeaways: Iterate fast and make sure every iteration ends in a tangible result. Use iterations to learn the ins and outs of the data.

Problem definition


The project starts, the team and all the stakeholders are excited. We are going to figure it out, we are going to make things real. Let's rock and roll!

Hold your horses for a while, though. No matter how clear the objective seems and no matter how straightforward the project sounds, this is the time to stop and ask the critical questions:
  • Who is the end customer?
  • What is the value creation logic for the end customer?
  • What value does data science add here; could you solve the problem on a whiteboard instead?
  • Why now, and what are the hard constraints (e.g. GDPR, deadlines)?
  • Is there a way to deploy the model?
  • What data sources are available?

Key takeaways: Understand what the product owner needs (not only what the product owner asks for), and make sure you understand who the end customer is and what the value creation logic is.

Data collection and wrangling  


The ditch-digging phase: a lot of perspiration and little inspiration. Chances are that data from various sources is imported several times during the project, and you should also make the whole project reproducible. The best way to achieve this is to automate data loading and wrangling as much as possible and write short README files when applicable.

No need for extensive documentation, but write something, as there is always a plentitude of hidden assumptions in the data. When debugging the models you are forced to check the details of data loading and wrangling over and over again, so don't fall for the temptation to do data loading and cleaning manually.
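
A minimal sketch of what this might look like in Python (the source URL, file names and cleaning rules below are made up for illustration):

```python
# load_data.py -- a hypothetical, reproducible data loading step
import pandas as pd

RAW_URL = "https://example.com/raw/sales.csv"  # assumed source, replace with yours

def load_and_clean(url: str = RAW_URL) -> pd.DataFrame:
    """Download the raw data and apply the same cleaning every time."""
    df = pd.read_csv(url, parse_dates=["order_date"])
    df = df.dropna(subset=["customer_id"])     # document assumptions like this one
    df["amount"] = df["amount"].clip(lower=0)  # negative amounts are data errors
    return df

if __name__ == "__main__":
    load_and_clean().to_parquet("data/clean_sales.parquet")
```

Running the same script in every iteration is exactly the point: the hidden assumptions live in one place instead of in somebody's head.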

Key takeaways: Automate, document (but don't overdo it).

Exploratory data analysis


Plot this and check that. Model performance depends on the data, and therefore you must know the data to choose a feasible model. This is also a great phase to detect inconsistencies and missing values in the data. If the data is not representative, you have a good chance of detecting it here too.
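
A few quick checks with pandas (column names are hypothetical, continuing the sketch above) already reveal a lot:

```python
import pandas as pd

df = pd.read_parquet("data/clean_sales.parquet")  # assumed output of the wrangling step

print(df.describe(include="all"))                      # ranges and obvious outliers
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
print(df["order_date"].min(), df["order_date"].max())  # does the time span look right?

# Distribution of the target -- a skewed or truncated target changes the model choice.
df["amount"].hist(bins=50)
```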

Often you can find useful insights in this phase. This is not guaranteed, though, and depends on how well the data has been analyzed already. The value of an insight doesn't depend on how much work it required, but creating an insight is hard if the data set has already been analyzed through and through.

Key takeaways: Build intuition about the data, look for insights.

Optimization proxy selection


The great irony of data science (especially in machine learning) is that it is all about optimization but you almost never optimize the real thing.

Let's think about a recommendation engine for a moment. What you really care about is providing interesting and surprising recommendations to the customer. This is, sadly, an impossible objective for any machine learning algorithm as such, and the data scientist must find a credible and technologically feasible proxy (e.g. predicting movie ratings per customer) for the end customer's true need.
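
One way to keep the proxy honest is to write it down right next to the metric you optimize. A minimal sketch, assuming predicted star ratings evaluated with RMSE as the proxy:

```python
import numpy as np

# True objective: customers discover movies they love   (not directly measurable)
# Proxy objective: predict held-out star ratings well   (measurable, optimizable)

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error on held-out ratings -- the proxy we actually optimize."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

held_out_ratings = np.array([4.0, 2.0, 5.0, 3.0])   # made-up numbers for illustration
predicted_ratings = np.array([3.5, 2.5, 4.5, 3.0])
print(rmse(held_out_ratings, predicted_ratings))    # lower is better -- for the proxy
```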

Key takeaways: Make the optimization proxy explicit.

Model development


This is the part of data science everybody is talking about, and all the books are about it, so let's not go into technical details here.

Sometimes you use different tool chains in different iterations of the value cycle (e.g. R -> Python -> Scala -> ...), and therefore you should decouple data preparation from modelling if possible. Of course there will be quite a lot of coupling, but try to keep it to a minimum at least in the early iterations and stabilize the architecture/technology stack later as the project matures. Remember that using tools that fit the problem is much easier than shoehorning the problem into your favourite tool chain.

In the first iteration you should build a baseline model that can be deployed to production. Make it simple and make it clear. In later iterations you refine the model or choose a more complex approach altogether. The only exception to this rule is a moonshot project (where it is not even clear whether the thing is doable), in which there is no point talking about production before the biggest technical risks are solved.
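
A minimal sketch of the baseline-first idea, assuming a tabular classification problem and scikit-learn (the dataset here is just a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; in a real project X and y come from the (decoupled) preparation step.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the majority class. Anything fancier has to beat this.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# First "real" model: keep it simple and interpretable.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("model accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
```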

The faster you can train the model, the better. You will run into many gotchas during model development, and the faster you train the model, the faster you can sort them out. On the other hand, many techniques to parallelize model training also add complexity and make training less transparent (e.g. running a cluster vs. running on one huge AWS instance). Which way to go depends on the case, but make it a conscious decision.

Key takeaways: Baseline model first, be tool-agnostic, think about your productivity.

Out-of-sample test


Time to see how predictive your predictive model really is and decide whether you can use it in real life. An out-of-sample test means that you run your model on data that has not been used to train it. In data science you generalize from training data, and the out-of-sample test tells you how well you succeeded.
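
A minimal sketch of why the distinction matters, again with scikit-learn and stand-in data: an unconstrained decision tree looks perfect in-sample, but the out-of-sample number tells the real story:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in data for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("in-sample accuracy:    ", model.score(X_train, y_train))  # close to 1.0, says little
print("out-of-sample accuracy:", model.score(X_test, y_test))    # the number that matters for go-live
```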

Key takeaways: Use the out-of-sample test to make the go-live decision.

Deploying model to production


There are two ways to turn insight into tangible benefit. The indirect route is essentially consulting, where you advise someone who has the means to make a difference; a classical example is optimizing a marketing budget. The direct route means that you create (or work with someone who creates) an automated system that provides insights (say, recommendations) to the end customer.
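
On the direct route the deployment can start very small. A hypothetical sketch of a tiny prediction service with Flask (the model artifact and the request format are assumptions, not a prescription):

```python
# serve_model.py -- one possible "direct route": a tiny prediction service
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # assumed artifact saved by the training step
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```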

Key takeaways: Figure out a way to turn the model into real-life benefits.

Real life test


The moment of truth; everything before this is just a dress rehearsal. There is no guarantee whatsoever that the customer cares for your optimization proxy. You worked hard, your hair turned gray and you started to look like an old and feeble man while developing the model, but does the customer really care?

This is the first time your end customer voices an opinion, so try to get here as soon as possible. In fact, does the customer even care for the product the model is embedded in?

Measure improvement against the baseline instead of an arbitrary target. The point is to improve performance over the baseline or create new business opportunities, not to hit an arbitrary target (which may be way too low or way too high).
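
The arithmetic of this is simple; the discipline is in actually doing it. A sketch with made-up numbers for an A/B-style comparison of the baseline model against the new one:

```python
# Hypothetical production readout: baseline model vs. new model
baseline_conversions, baseline_users = 480, 10_000
model_conversions, model_users = 540, 10_000

baseline_rate = baseline_conversions / baseline_users
model_rate = model_conversions / model_users

lift = (model_rate - baseline_rate) / baseline_rate
print(f"baseline {baseline_rate:.2%}, model {model_rate:.2%}, relative lift {lift:+.1%}")
```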

Key takeaways: Get here fast, measure improvements relative to the baseline model.

Data scientist survival guide


It is foolhardy to forget software development best practices. Not having extensive automated tests is going to backfire in data science just as surely as in any other software project. Static code analysis and linters are still a great (and almost free) way to improve code quality, and communication with stakeholders is just as important as in any other project.
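
For instance, even a couple of tiny pytest tests against the (hypothetical) cleaning function from the wrangling sketch above pay for themselves the first time the upstream data changes:

```python
# test_wrangling.py -- run with `pytest`
from load_data import load_and_clean  # hypothetical module from the wrangling sketch

def test_no_missing_customer_ids():
    df = load_and_clean()
    assert df["customer_id"].notna().all()

def test_amounts_are_non_negative():
    df = load_and_clean()
    assert (df["amount"] >= 0).all()
```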

There is a plentitude of free packages available online, but here you should use the same caution as in software projects. Check the licence, check whether the package is still updated (check the commit history) and use your judgement to determine whether the package will be maintained in the future.

However, there is more to the story. Data science projects are inherently risky. No matter how many algorithms you try (or invent), there is always a chance that the data just isn't representative (relevant to the problem), or maybe there is no signal in the data at all. Sooner or later you will run into this, but almost always there are more degrees of freedom than you initially see. Maybe you can buy some data, collect more data or try a totally orthogonal approach? These allow you to turn a looming failure into a great, if a bit postponed, success. There are always more solutions than you think, trust me on this one.

Copy all the good parts from software engineering and add a strong dose of Sherlock Holmes' wits. A friend of mine once described data science as a no-holds-barred approach to solving a crossword puzzle: everything is allowed except using the test set in the training phase.

Key takeaways: Be resourceful and organized.

Product owner survival guide


Data science projects have a tendency to become quite complicated quite fast. The data sources come in all forms and sizes, and the algorithms are exotic and full of misleading acronyms. It is quite easy to get lost in the endless jargon and incomprehensible gibberish.

However, most of this is just implementation detail and not actually that important to you. As a product owner, you should really be concerned about only a few things:
  • Is the team solving the right question (problem definition phase)?
  • Is the out-of-sample test done and are the results encouraging (go-live decision)?
  • Are the results in the production system measured, and how big is the improvement compared to the baseline (real-life test)?
  • Are there deliverables from each iteration of the value cycle (is there real progress)?

Key takeaways: Solve the right question, out-of-sample test, real-life test

Summary


Solve the right question, work iteratively, check your results against the baseline and deploy to production as fast as possible.

Not all data science projects fit this framework, but all projects have at least some elements of it. Actually, most projects have all the elements in some sense.

Data science is fun!