"Clean Data" - Scrub and Shine


Continuing from my smart cities post, I thought I would move on to Data Migration, the other topic I intend to post on. The intention is to alternate between these two topics, but please bear with me if I do not keep my word on that.

The other day I happened to read: “Migration is not just about moving the data… It’s about making the data work.” It got me thinking about my own experiences with Data Migration, and it rang true enough that I could have said it myself.

What are the challenges around making the data work? My experience of moving large quantities of data from a source system to a new target system is what I intend to write about. I won’t catalogue every challenge we faced; they are there. My take in this thread is on what the important issues in a migration are and how we managed them. This post is about my experiences with data profiling and the gains from it.

Data Quality: The most abused term in a migration process, but having said that, it is indeed the most important issue. Whenever an issue about data quality was raised, the response we faced was: “the data you say is of poor quality is the data we have been running our system on for the past 15 years”. The answer to that is simply data profiling. Data profiling gives an objective measure of the “quality” of the source data and of its suitability for the target system.
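To make "an objective measure" concrete, here is a minimal sketch of a profiling pass in Python. It is not the tooling we used; the records, field names, and validity rules are all hypothetical, chosen only to show the kind of numbers a profile puts on the table: how many values are missing, how many fail the target system's rules, and which keys are duplicated.

```python
from collections import Counter

# Hypothetical source extract; fields and values are illustrative only.
records = [
    {"id": 1, "name": "Asha", "postcode": "560001"},
    {"id": 2, "name": "", "postcode": "56001"},
    {"id": 3, "name": "Ravi", "postcode": None},
    {"id": 3, "name": "Ravi", "postcode": "560034"},
]

# Hypothetical target-system rules: one validity check per field.
rules = {
    "name": lambda v: len(v.strip()) > 0,
    "postcode": lambda v: v.isdigit() and len(v) == 6,
}


def is_missing(value):
    """Treat None and the empty string as missing."""
    return value is None or value == ""


def profile(rows, key, field_rules):
    """Count missing and invalid values per field, plus duplicate keys."""
    total = len(rows)
    key_counts = Counter(row[key] for row in rows)
    report = {
        "total": total,
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        "fields": {},
    }
    for field, is_valid in field_rules.items():
        missing = sum(1 for row in rows if is_missing(row.get(field)))
        invalid = sum(
            1
            for row in rows
            if not is_missing(row.get(field)) and not is_valid(row[field])
        )
        fit = total - missing - invalid
        report["fields"][field] = {
            "missing": missing,
            "invalid": invalid,
            # Share of rows that already fit the target rule for this field.
            "fit_pct": round(100 * fit / total, 1) if total else 0.0,
        }
    return report


report = profile(records, "id", rules)
for field, stats in report["fields"].items():
    print(field, stats)
print("duplicate keys:", report["duplicate_keys"])
```

A report like this shifts the conversation from "your data is poor" to "this many rows fail this specific rule of the target system", which is much harder to argue with.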

Data Migration is always treated as a three-stage process: Extract, Transform, and Load. In my experience, extract and load have been the easiest tasks in the migration process, and their effort is a function of the data size. The transformation is the stage that consumes the most time, both in tool development and in the actual data migration. Quality hits the transform, and thus hits the data migration where it hurts the most. Trust me, it is painful when it hits.

Profiling brings out a comprehensive understanding of the “data quality”. My experience is that the profiling output defines both the transformation tools to be developed and the approach. If your transformation output at load shows bad records being dropped rather than cleansed, the diagnosis is that a profiling exercise has not been carried out. In relational systems, a small set of bad records can make much of the migrated data set disappear once the relationships are factored in, because every record that depends on a dropped record is dropped with it. I call this the “nasty multiplication factor”. The way to mitigate the “nasty multiplication factor” is to profile, and then either transform as many bad records as possible or put strategies in place for cleansing them. The inverse of the “nasty multiplication factor” is that a small set of records cleansed or transformed properly can mean a complete or near-complete migration. Fortunately, it works both ways.
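The “nasty multiplication factor” is easy to see with a toy example. The sketch below is an illustration only, assuming a hypothetical three-level hierarchy (customers, orders, order lines) and a load that refuses any child whose parent was not loaded. One bad customer takes its orders and their order lines down with it; cleansing that one customer brings everything back.

```python
# Hypothetical source data: one bad customer (id 2) out of three.
customers = {1: {"valid": True}, 2: {"valid": False}, 3: {"valid": True}}
# (order_id, customer_id)
orders = [(101, 1), (102, 2), (103, 2), (104, 2), (105, 3)]
# (line_id, order_id)
order_lines = [
    (1001, 101), (1002, 102), (1003, 102), (1004, 103),
    (1005, 104), (1006, 104), (1007, 105),
]


def migrate(customers, orders, order_lines):
    """Load only valid customers, then only children whose parent loaded."""
    loaded_customers = {cid for cid, c in customers.items() if c["valid"]}
    loaded_orders = {oid for oid, cid in orders if cid in loaded_customers}
    loaded_lines = {lid for lid, oid in order_lines if oid in loaded_orders}
    return loaded_customers, loaded_orders, loaded_lines


def dropped_count(customers, orders, order_lines):
    """Total records across all three levels that fail to load."""
    total = len(customers) + len(orders) + len(order_lines)
    loaded = sum(len(s) for s in migrate(customers, orders, order_lines))
    return total - loaded


# Without cleansing: the bad customer drags its whole subtree out.
dropped_before = dropped_count(customers, orders, order_lines)

# With cleansing: fix the single bad customer record, then migrate again.
cleansed = dict(customers)
cleansed[2] = {"valid": True}
dropped_after = dropped_count(cleansed, orders, order_lines)

print("dropped without cleansing:", dropped_before)
print("dropped after cleansing one record:", dropped_after)
```

In this made-up data set, one bad record out of fifteen costs nine records at load, and fixing that one record costs nothing. Real hierarchies are deeper, so the multiplication is usually worse.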

This being my experience, I look forward to comments about yours.

- Contributed by Vancheeswar B