maanantai 16. huhtikuuta 2018

Figuring out an obscure database

Challenge

In many data driven products one must look into arcane operational database to access the necessary training data or operational database must be turned to analytical database. Implementation details vary from project to project but practically always one is confronted with an arcane operational database with obsolete documentation at best and usually none at all.

At first this might seem like a daunting task. How is it possible to decrypt meaning of tens and tens of tables? But fear you not, here is how the trick is done.

Learning cycle

Start with asking why database exists because use case of application using the database is bound to show up in the database structure somehow. The best source for this information is whoever is working daily with the database.

Draw physical Entity-Relationship (ER) diagram of the database using database DDL scripts. If you have access to database you are almost guaranteed to be able to do this. Sometimes you can only access database dump without valid constraints but even then you can make educated guesses using table names, field names, data types and table structure.

Most operational databases use common database design patterns (star schemabridge table etc.) so learn how these work and mapping business concepts to database tables becomes much easier. Physical ER diagram will help tremendously on this part. Remember though that even if database was born with solid structure by the time you come around there can be any number of creative and not so solid features so flexible mind is mandatory.

The preceding steps offer high level understanding of the database but very little to understanding on how to compute anything from database.

In almost all cases application has some kind of UI, e.g. distribution of ice cream in ice cream manufacturing company has some business user UI. Based on the database design patterns, ER diagrams and a little bit of intuition gained from discussions with business users you can guess how to map primary key of some center table of the database to UI.  When you compare what you see in database to  what the application wants so show to the user you can create first valuable hypothesis on how database tables should be interpret.

Now you have gained a little bit of understanding on the data so use your new found knowledge to ask more relevant questions from business users, understand ER diagram better an so forth.  Essentially you start with very vague understanding of the database and iteratively refine your understanding until you can compute the necessary values using the database.

Tricks of the trade

  • Use mainly production data since TEST (and QA too..) can be just about anything. Without any access to PROD data probability of success is greatly diminished.
  • Change TEST data to confirm hypotheses, e.g. to change database state by making a new ice cream delivery order to check if database looks what it should after the new order.
  • Sometimes you want access end results of the application and compare them to database content. Quite often companies store the most relevant parts of the database information also in a different format, e.g. ice cream manufacturing company might store delivery records in pdf files or in paper records for accounting purposes since they must be able to present that data to regulatory agencies on request. 
  • Beware that TEST data and PROD data can have funny differences, e.g. IDs used to map ice creams to ice cream descriptions can be different in TEST and in PROD.
  • If you can access the application source code you should take a look. Be warned though, it can take a long time to understand relevant parts of complex legacy application but sometimes you have no choice.
  • Quite often some logic is coded in first character of a string or there is some other black magic involved. This creates sort of hidden logic in database that is not available in ER diagrams but can be spotted by inspecting the data.
  • Don't assume that database content should be interpret in same way on different time spans. If company's business has changed then the application has changed too and seemingly similar products may not behave the same. Similar wholesale ice cream package may not look the same each year in the database.
  • You are solving a puzzle so combine intuition with logic.