Introduction
I am sure you have heard the slogans circulating around data and the related technologies: "Data is the new oil", "Data is the most important corporate asset", "Data is the new gold", and so on and so forth.
Following these data trends, companies are trying to modernize their data infrastructures, founding new data organizations, creating board-level awareness and making data a part of their digital journey. Moreover, regulators are pushing data management and data governance obligations onto enterprises harder every day.
All of this is true, and right.
However, I have doubts about whether we are investing in the future the right way. Do we know what we are doing? Are we capable enough to build the future?
I will lay out my concerns in a structured way across a number of dimensions, and while discussing the problems I will also propose possible solutions in abstract form.
Mother of all evils
Data warehouses. Have you heard the stories about them? Many projects failed while people were trying to build them. Many managers were fired. Many DBAs were cursed, and many programmers lost their professional bearings while feeding data into them. A dark shadow from the 1980s and 1990s, reaching into today and struggling to expand into tomorrow.
A typical data warehouse conventionally offers a number of fundamental values to a corporation's technology infrastructure: accumulating historical data, integrating high-quality data on a single platform, bridging the application silos, providing cohesive data organised by business subject, enabling data analytics and business intelligence, robust reporting and so on.
Very attractive value propositions indeed. Therefore, the whole universe invested in data warehouses. Data architects, DBAs, data engineers and others worked hard to find the best way of copying data from the source systems, transforming it efficiently and storing it in the right target data models: some built data marts, some went for grand foundational layers in third normal form. Some picked special-purpose database appliances to overcome underperforming queries and data integration software, some procured systems with fat caches fed by InfiniBand wires, some picked database systems specifically designed for data warehousing. Some failed; some ended up with systems that are merely able to run.
Today, vendors are still playing the same game and companies are still building data warehouses. Let's analyse the picture of a typical data warehouse habitat.
Figure 1: A typical data warehouse scheme. Source: TDWI.
Very familiar: a set of data sources comprising operational systems such as core banking, CRM, ERP, accounting and so on; ETL modules for extracting data from the source systems' data structures, transforming it and loading it into the data warehouse; and another set of ETL modules, or some sort of views, feeding the BI, reporting and analytics-oriented data marts.
Problem #1: You always end up with data redundancy. It is in the nature of data warehouses: they store copies of the real data, and no copy is as good as the original. For instance, if a court order requests your company's accounting data, you deliver it from your accounting system, or you validate the data you get from your data warehouse against the data coming from your accounting system. Fresh, hot, operational and reliable as your own blood. No cold fish needed!
Problem #2: ETL is guaranteed bad software engineering. Years ago, I highlighted the characteristics of a high-quality software system on my blog. I will not jump into the details again, but I want to recall two very core characteristics: (a) high cohesion and (b) low coupling.
For a software system to be highly cohesive, your modules must encapsulate functions and data that are strongly interrelated. If you design a module for sales management, all the routines, data structures, functions, assertions and so on in that module must be there for handling sales, nothing more, nothing less.
By contrast, data warehouse systems cannot claim high cohesion, because they are designed to cover all enterprise data coming from highly cohesive source systems. That is one of the reasons a very strong metadata management layer is essential for data warehouses; otherwise, you naturally get lost.
Low coupling refers to minimising the inter-module dependencies of a software system. To provide this attribute, you hide the implementation details of a module from the other modules and define interfaces for the communication that must happen between them.
By doing so, you ensure that changing the internal structure of your module does not affect other modules in a negative way, for example by invalidating them or breaking their functionality. Every module is responsible for the interfaces it exposes to other modules: keep the interface compatible with the original contract, then evolve your internals independently. Be message oriented, with no unknown or unwanted dependencies and no unnecessary details of other systems to manage.
Low coupling is good, tight coupling is bad.
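As a minimal sketch of these two attributes, consider the following Python fragment. The names here (SaleRecord, SalesModule, record_sale, sales_for_customer) are invented purely for illustration: everything inside the module serves sales, and outsiders only ever touch the two public methods.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


# Hypothetical sketch: a cohesive "sales" module that keeps its storage details
# private and exposes a narrow, well-defined interface to the outside world.

@dataclass(frozen=True)
class SaleRecord:
    sale_id: str
    customer_id: str
    amount: float
    booked_on: date


class SalesModule:
    """Everything in here exists to handle sales, nothing more, nothing less."""

    def __init__(self) -> None:
        # Internal representation; other modules never see or query this directly.
        self._sales: dict[str, SaleRecord] = {}

    def record_sale(self, sale: SaleRecord) -> None:
        """Part of the public interface: register a completed sale."""
        self._sales[sale.sale_id] = sale

    def sales_for_customer(self, customer_id: str) -> list[SaleRecord]:
        """Part of the public interface: the only supported way to read sales data."""
        return [s for s in self._sales.values() if s.customer_id == customer_id]
```

Consumers depend only on record_sale and sales_for_customer; the internal dictionary could be replaced with a proper database tomorrow without breaking a single caller, which is exactly what low coupling buys you.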
Again, ETL innately brings the attribute of tight coupling. All ETL tools and techniques force data integration engineers to know the very internals of the source systems: all tables, columns, the meanings of the flags stored in those columns, all keys, lookup structures, everything. ETL is like a dagger inserted into the heart of the source systems. It hurts. It does no good unless you are in the business of assassination. Once your ETL systems are plugged into the depths of the ERP, CRM, sales and core banking systems, your data flow towards the data warehouse is vulnerable to being invalidated every time a data structure changes in a source system. You must continuously try to synchronize yourself with the new data structures and the new syntax and semantics attached to them.
That is mission impossible. That is bad software engineering. That is prying into other systems' affairs. It is impossible to always be aware of every implementation detail of external systems just for the sake of copying data. It is simply bad practice.
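A tiny sketch of what this coupling looks like in practice is below, with completely invented table and column names: the extraction job hard-codes the CRM's internal schema and its flag semantics, so any internal change on the CRM side breaks the load or, worse, silently corrupts it.

```python
from __future__ import annotations

import sqlite3  # stand-in here for the source system's own database


# Hypothetical ETL extraction step: the job reaches straight into the CRM's
# internal tables and hard-codes column names and flag semantics.
FRAGILE_EXTRACT_SQL = """
    SELECT cust_id,
           cust_nm,
           stat_flg          -- '1' means active, '9' means churned: tribal knowledge
    FROM   crm_cstmr_mstr    -- internal table the CRM team may rename or split at will
    WHERE  del_ind = 'N'
"""


def extract_customers(conn: sqlite3.Connection) -> list[tuple]:
    # The moment the CRM team renames crm_cstmr_mstr or repurposes stat_flg,
    # this query either fails outright or keeps loading wrong data silently.
    return conn.execute(FRAGILE_EXTRACT_SQL).fetchall()
```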
Perhaps ETL developers do not call themselves software engineers because of this unhealthy heritage of the ETL approach. Most of the ETL professionals I have interviewed are confident with the ETL tools they are familiar with; they know how to use the tool, but they are not interested in the underlying infrastructure or the general idea. Some, amusingly, call themselves data integration specialists, even though ETL integrates "nothing"; it is just an ugly between-systems dependency technique.
Problem #3: The world is spending millions on enabling Cloud Foundry, open APIs, microservices, continuous delivery, continuous integration, hyper-connectivity and so on. Meanwhile, what is the data warehouse world doing? Copying terabytes of data between systems in batch fashion to deliver data that is already old, being affected by every change made in any source system, getting bigger and bigger, and so on.
What is it good for?
How will IT managers coordinate super-fast microservices that carry core operations, run in private, public or hybrid clouds, communicate through events and are deployed several times a day by automated pipelines, with "reactive, slow, cold and fat data warehouses"?
Imagine you devised the best impact analysis system for alerting the data warehouse team to any kind of source system change: what happens in a continuous delivery world? You get hundreds of alerts and end up thrashing as you try to cope with them in your slow and fragile data warehouse hinterland.
Architecture for the future
So far I have mainly laid out the negative aspects. A dark picture; it seems as if there is no way to escape. Any solutions? Yes.
First, stop investing in traditional data warehouses. Vendors are motivating you in the wrong direction: the more database instances you have, the more license fees they earn. This is just the legacy sales business. Paying for data warehouses is like sending your money back to 1980. Full stop.
Second, uninstall the ETL tools. Re-educate your ETL programmers to bring out the first-class software engineer in them, if any is left.
Third, focus on software module design. Classify your modules as data providers and data consumers. Make your modules provide and/or consume data in a responsible way through exposed, well-defined services. You can follow industry-specific guidelines to identify the key data services (e.g. BIAN for banking).
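As a hedged illustration of this provider/consumer split, here is a small Python sketch; the contract name CustomerDataProvider and the classes around it are invented for the example, not taken from BIAN or any other standard.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CustomerSnapshot:
    customer_id: str
    segment: str
    total_balance: float


class CustomerDataProvider(Protocol):
    """The exposed, well-defined data service: the only thing both sides share."""

    def get_customer(self, customer_id: str) -> CustomerSnapshot: ...


class CoreBankingCustomerService:
    """Provider side: owns the data and implements the contract."""

    def get_customer(self, customer_id: str) -> CustomerSnapshot:
        # A real implementation would read from the system of record.
        return CustomerSnapshot(customer_id, segment="retail", total_balance=1250.0)


class RiskScoringConsumer:
    """Consumer side: depends only on the contract, never on core banking internals."""

    def __init__(self, provider: CustomerDataProvider) -> None:
        self._provider = provider

    def score(self, customer_id: str) -> float:
        snapshot = self._provider.get_customer(customer_id)
        return 0.1 if snapshot.total_balance > 1000 else 0.5


print(RiskScoringConsumer(CoreBankingCustomerService()).score("C-1"))
```

The point of the design is that RiskScoringConsumer runs against the contract alone; the core banking module can reorganise its internals freely as long as it keeps honouring get_customer.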
Fourth, invest in an enterprise data bus that provides robust access to every identified data service of the modules you develop. There can be many modes of data access: ad hoc, continuous, record-based, bulk and so on. Harness the necessary technologies for the relevant data use cases: in-memory layers, self-service data analysis platforms, embedded data validation layers, volatile and non-volatile data presentations and so on. The rule of thumb is "produce data knowingly, consume data knowingly". Awareness is the key. A sample illustration is below.
Figure 2: Enterprise data bus in the middle of highly cohesive software modules.
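To make the idea a little more concrete, here is a deliberately minimal, in-memory Python sketch of the bus concept; the DataBus class, the service name sales.completed and the record fields are all invented, and a real deployment would of course sit on proper messaging or streaming infrastructure.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any, Callable


class DataBus:
    """Toy illustration of a bus: named data services, publishers and subscribers."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict[str, Any]], None]]] = defaultdict(list)

    def subscribe(self, service: str, handler: Callable[[dict[str, Any]], None]) -> None:
        # Consume data knowingly: register interest in a named data service.
        self._subscribers[service].append(handler)

    def publish(self, service: str, record: dict[str, Any]) -> None:
        # Produce data knowingly: hand the record to everyone subscribed to the service.
        for handler in self._subscribers[service]:
            handler(record)


# Usage sketch with invented service and field names.
bus = DataBus()
bus.subscribe("sales.completed", lambda rec: print("reporting mart got", rec))
bus.subscribe("sales.completed", lambda rec: print("risk engine got", rec))
bus.publish("sales.completed", {"sale_id": "S-42", "amount": 199.0})
```

Both sides only know the named data service and the published record shape, never each other's internals, which is the whole point of "produce data knowingly, consume data knowingly".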
The enterprise data bus with responsibly published data services is just an idea, not a perfect solution, but unlike the data warehouse and ETL it carries no intrinsic anomalies. I think we should, in any case, keep looking for better ways of representing data and information, for non-hierarchical data structures and for non-linear data processing and handling. This sort of thinking will yield better talent, a better academy and a better industry, not a dead sales business in the 21st century.
Know the basics, and derive something better from them.