Introduction
I am sure you have heard the slogans circulating around data and the related technologies: "Data is the new oil", "Data is the most important corporate asset", "Data is the new gold", and so on and so forth.
Following these data trends, companies are trying to modernize their data infrastructures, founding new data organizations, creating board-level awareness and making data a part of their digital journey. Moreover, regulators are pushing data management and data governance obligations onto enterprises harder every day.
All of this is true, and right.
However, I have doubts about whether we are investing in the future the right way. Do we know what we are doing? Are we capable enough to build the future?
I will lay out my concerns in a structured way across a number of dimensions, and while discussing the problems I will also propose possible solutions in abstract form.
Mother of all evils
Data warehouses. Have you heard the stories about them? Many projects failed while people were trying to build them. Many managers were fired. Many DBAs were cursed, and many programmers lost their professional bearings while feeding data into them. A dark shadow from the 1980s and 1990s, reaching into today and struggling to expand into tomorrow.
A typical data warehouse conventionally offers a number of fundamental values to a corporation's technology infrastructure: accumulating historical data, integrating high-quality data on a single platform, bridging the application silos, providing cohesive data organised by business subject, enabling data analytics and business intelligence, robust reporting and so on.
Very attractive value propositions indeed. Therefore, the whole universe invested in data warehouses. Data architects, DBAs, data engineers and others worked hard to find the best way of copying data from the source systems, transforming it efficiently and storing it in the right target data models: some built data marts, some went for grand foundational layers in third normal form. Some picked special-purpose database appliances to overcome underperforming queries and data integration software, some procured systems with fat caches fed by InfiniBand wires, some picked database systems specifically designed for data warehousing. Some failed; some ended up with systems that are merely able to run.
Today, vendors are still playing the same game and companies are still building data warehouses. Let's analyse the picture of a typical data warehouse habitat.
Figure 1: A typical data warehouse scheme. Source: TDWI.
Very familiar: a set of data sources comprising operational systems such as core banking, CRM, ERP, accounting and so on; ETL modules for extracting data from the source systems' data structures, transforming it and loading it into the data warehouse; and another set of ETL modules, or some sort of views, feeding the BI, reporting and analytics-oriented data marts.
Problem #1: You always end up with data redundancy. It is in the nature of data warehouses: they store copies of the real data, and no copy is as good as the original. For instance, if a court order requests your company's accounting data, you deliver it from your accounting system, or you validate the data you get from your data warehouse against the data coming from your accounting system. Fresh, hot, operational and reliable as your own blood. No cold fish needed!
Problem #2: ETL is guaranteed bad software engineering. Years ago, I highlighted the characteristics of a high-quality software system on my blog. I will not jump into the details again, but I want to recall two very core characteristics: (a) high cohesion and (b) low coupling.
For a software system to be highly cohesive, your modules must encapsulate functions and data that are strongly interrelated. If you design a module for sales management, all the routines, data structures, functions, assertions and so on in that module must be there for handling sales, nothing more, nothing less.
By contrast, data warehouse systems cannot claim high cohesion, because they are designed to cover all enterprise data coming from highly cohesive source systems. That is one of the reasons a very strong metadata management layer is essential for data warehouses; otherwise, you naturally get lost.
Low coupling refers to minimising the inter-module dependencies of a software system. To provide this attribute, you hide the implementation details of a module from the other modules and define interfaces for the communication that must happen between them.
By doing so, you ensure that changing the internal structure of your module does not affect other modules in a negative way, for example by invalidating them or breaking their functionality. Every module is responsible for the interfaces it exposes to other modules: keep the interface compatible with the original contract, then evolve your internals independently. Be message oriented, with no unknown or unwanted dependencies and no unnecessary details of other systems to manage.
Low coupling is good, tight coupling is bad.
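As a minimal sketch of these two attributes, consider the following Python fragment. The names here (SaleRecord, SalesModule, record_sale, sales_for_customer) are invented purely for illustration: everything inside the module serves sales, and outsiders only ever touch the two public methods.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


# Hypothetical sketch: a cohesive "sales" module that keeps its storage details
# private and exposes a narrow, well-defined interface to the outside world.

@dataclass(frozen=True)
class SaleRecord:
    sale_id: str
    customer_id: str
    amount: float
    booked_on: date


class SalesModule:
    """Everything in here exists to handle sales, nothing more, nothing less."""

    def __init__(self) -> None:
        # Internal representation; other modules never see or query this directly.
        self._sales: dict[str, SaleRecord] = {}

    def record_sale(self, sale: SaleRecord) -> None:
        """Part of the public interface: register a completed sale."""
        self._sales[sale.sale_id] = sale

    def sales_for_customer(self, customer_id: str) -> list[SaleRecord]:
        """Part of the public interface: the only supported way to read sales data."""
        return [s for s in self._sales.values() if s.customer_id == customer_id]
```

Consumers depend only on record_sale and sales_for_customer; the internal dictionary could be replaced with a proper database tomorrow without breaking a single caller, which is exactly what low coupling buys you.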
Again, ETL innately brings the attribute of tight coupling. All ETL tools and techniques force data integration engineers to know the very internals of the source systems: all tables, columns, the meanings of the flags stored in those columns, all keys, lookup structures, everything. ETL is like a dagger inserted into the heart of the source systems. It hurts. It does no good unless you are in the business of assassination. Once your ETL systems are plugged into the depths of the ERP, CRM, sales and core banking systems, your data flow towards the data warehouse is vulnerable to being invalidated every time a data structure changes in a source system. You must continuously try to synchronize yourself with the new data structures and the new syntax and semantics attached to them.
That is mission impossible. That is bad software engineering. That is prying into other systems' affairs. It is impossible to always be aware of every implementation detail of external systems just for the sake of copying data. It is simply bad practice.
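A tiny sketch of what this coupling looks like in practice is below, with completely invented table and column names: the extraction job hard-codes the CRM's internal schema and its flag semantics, so any internal change on the CRM side breaks the load or, worse, silently corrupts it.

```python
from __future__ import annotations

import sqlite3  # stand-in here for the source system's own database


# Hypothetical ETL extraction step: the job reaches straight into the CRM's
# internal tables and hard-codes column names and flag semantics.
FRAGILE_EXTRACT_SQL = """
    SELECT cust_id,
           cust_nm,
           stat_flg          -- '1' means active, '9' means churned: tribal knowledge
    FROM   crm_cstmr_mstr    -- internal table the CRM team may rename or split at will
    WHERE  del_ind = 'N'
"""


def extract_customers(conn: sqlite3.Connection) -> list[tuple]:
    # The moment the CRM team renames crm_cstmr_mstr or repurposes stat_flg,
    # this query either fails outright or keeps loading wrong data silently.
    return conn.execute(FRAGILE_EXTRACT_SQL).fetchall()
```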
Perhaps ETL developers do not call themselves software engineers because of this unhealthy heritage of the ETL approach. Most of the ETL professionals I have interviewed are confident with the ETL tools they are familiar with; they know how to use the tool, but they are not interested in the underlying infrastructure or the general idea. Some, amusingly, call themselves data integration specialists, even though ETL integrates "nothing"; it is just an ugly between-systems dependency technique.
Problem #3: The world is spending millions on enabling Cloud Foundry, open APIs, microservices, continuous delivery, continuous integration, hyper-connectivity and so on. Meanwhile, what is the data warehouse world doing? Copying terabytes of data between systems in batch fashion to deliver data that is already old, being affected by every change made in any source system, getting bigger and bigger, and so on.
What is it good for?
How will IT managers coordinate super-fast microservices that carry core operations, run in private, public or hybrid clouds, communicate through events and are deployed several times a day by automated pipelines, with "reactive, slow, cold and fat data warehouses"?
Imagine you devised the best impact analysis system for alerting the data warehouse team to any kind of source system change: what happens in a continuous delivery world? You get hundreds of alerts and end up thrashing as you try to cope with them in your slow and fragile data warehouse hinterland.
Architecture for the future
So far I have mainly laid out the negative aspects. A dark picture; it seems as if there is no way to escape. Any solutions? Yes.
First, stop investing in traditional data warehouses. Vendors are motivating you in the wrong direction: the more database instances you have, the more license fees they earn. This is just the legacy sales business. Paying for data warehouses is like sending your money back to 1980. Full stop.
Second, uninstall the ETL tools. Re-educate your ETL programmers to bring out the first-class software engineer in them, if any is left.
Third, focus on software module design. Classify your modules as data providers and data consumers. Make your modules provide and/or consume data in a responsible way through exposed, well-defined services. You can follow industry-specific guidelines to identify the key data services (e.g. BIAN for banking).
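As a hedged illustration of this provider/consumer split, here is a small Python sketch; the contract name CustomerDataProvider and the classes around it are invented for the example, not taken from BIAN or any other standard.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CustomerSnapshot:
    customer_id: str
    segment: str
    total_balance: float


class CustomerDataProvider(Protocol):
    """The exposed, well-defined data service: the only thing both sides share."""

    def get_customer(self, customer_id: str) -> CustomerSnapshot: ...


class CoreBankingCustomerService:
    """Provider side: owns the data and implements the contract."""

    def get_customer(self, customer_id: str) -> CustomerSnapshot:
        # A real implementation would read from the system of record.
        return CustomerSnapshot(customer_id, segment="retail", total_balance=1250.0)


class RiskScoringConsumer:
    """Consumer side: depends only on the contract, never on core banking internals."""

    def __init__(self, provider: CustomerDataProvider) -> None:
        self._provider = provider

    def score(self, customer_id: str) -> float:
        snapshot = self._provider.get_customer(customer_id)
        return 0.1 if snapshot.total_balance > 1000 else 0.5


print(RiskScoringConsumer(CoreBankingCustomerService()).score("C-1"))
```

The point of the design is that RiskScoringConsumer runs against the contract alone; the core banking module can reorganise its internals freely as long as it keeps honouring get_customer.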
Fourth, invest in an enterprise data bus that provides robust access to every identified data service of the modules you develop. There can be many modes of data access: ad hoc, continuous, record-based, bulk and so on. Harness the necessary technologies for the relevant data use cases: in-memory layers, self-service data analysis platforms, embedded data validation layers, volatile and non-volatile data presentations and so on. The rule of thumb is "produce data knowingly, consume data knowingly". Awareness is the key. A sample illustration is below.
Figure 2: Enterprise data bus in the middle of highly cohesive software modules.
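To make the idea a little more concrete, here is a deliberately minimal, in-memory Python sketch of the bus concept; the DataBus class, the service name sales.completed and the record fields are all invented, and a real deployment would of course sit on proper messaging or streaming infrastructure.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any, Callable


class DataBus:
    """Toy illustration of a bus: named data services, publishers and subscribers."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict[str, Any]], None]]] = defaultdict(list)

    def subscribe(self, service: str, handler: Callable[[dict[str, Any]], None]) -> None:
        # Consume data knowingly: register interest in a named data service.
        self._subscribers[service].append(handler)

    def publish(self, service: str, record: dict[str, Any]) -> None:
        # Produce data knowingly: hand the record to everyone subscribed to the service.
        for handler in self._subscribers[service]:
            handler(record)


# Usage sketch with invented service and field names.
bus = DataBus()
bus.subscribe("sales.completed", lambda rec: print("reporting mart got", rec))
bus.subscribe("sales.completed", lambda rec: print("risk engine got", rec))
bus.publish("sales.completed", {"sale_id": "S-42", "amount": 199.0})
```

Both sides only know the named data service and the published record shape, never each other's internals, which is the whole point of "produce data knowingly, consume data knowingly".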
The enterprise data bus with responsibly published data services is just an idea, not a perfect solution, but unlike the data warehouse and ETL it carries no intrinsic anomalies. I think we should, in any case, keep looking for better ways of representing data and information, for non-hierarchical data structures and for non-linear data processing and handling. This sort of thinking will yield better talent, a better academy and a better industry, not a dead sales business in the 21st century.
Know the basics, and derive something better from them.