Cleaning Dirty Data Daily – The Importance of Data Cleaning

Operating since 1997, Cedar Rose is known to have the largest single cleaned database of analytically linked companies’, shareholders’ and directors’ information across the Middle East, Africa and Asia, available to our thousands of clients around the world. Data cleaning has always been, and will always be, a priority for us.

No matter how data is gathered and collected, there will always be some level of error. Data in the real world, and certainly in the regions we gather it from, is mainly dirty: incomplete, disorganized, unstructured and inconsistent. Incomplete data stems from values that were not available at the time of collection, and from criteria that changed between the time the data was collected and the time it was analyzed. Examples of missing attribute values include an incomplete address or an incompletely translated company name. Original data also contains errors such as typing mistakes, misspellings and word transpositions, as well as impossible values (e.g. a number of premises or number of employees equal to -3, or even a shareholder or manager who is 230 years old).
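As a rough illustration of how impossible or missing values like these can be flagged automatically, here is a minimal Python sketch. The record layout, the field names and the 18 to 120 age range are assumptions made for this example only; they do not describe Cedar Rose's actual schema or rules.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical plausibility limits, chosen only for this illustration.
MIN_PLAUSIBLE_AGE = 18
MAX_PLAUSIBLE_AGE = 120


@dataclass
class CompanyRecord:
    """A simplified, hypothetical company record."""
    name: str
    address: Optional[str]
    employees: Optional[int]
    manager_age: Optional[int]


def find_issues(record: CompanyRecord) -> List[str]:
    """Return a list of human-readable problems found in one record."""
    issues = []
    # Incomplete data: the address is missing or blank.
    if record.address is None or not record.address.strip():
        issues.append("missing address")
    # Invalid data: a head count can never be negative.
    if record.employees is not None and record.employees < 0:
        issues.append("negative employee count")
    # Invalid data: the age falls outside the assumed plausible range.
    if record.manager_age is not None and not (
        MIN_PLAUSIBLE_AGE <= record.manager_age <= MAX_PLAUSIBLE_AGE
    ):
        issues.append("implausible manager age")
    return issues


dirty = CompanyRecord("XXX Company LTD", address="", employees=-3, manager_age=230)
clean = CompanyRecord("YYY Trading", address="1 Main Street", employees=12, manager_age=45)
print(find_issues(dirty))
print(find_issues(clean))
```

A check like this only flags records for attention; deciding whether to correct, replace or delete the value still depends on the source and on human review.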

Data can also be inconsistent and duplicated, with incompatible codes or names. For example, the company names “XXX Company LTD” and “XXX Company Limited” could be considered one registered entity, although in the latter case the legal form is Joint Stock Company, which is not reflected correctly in the name. The incompatibility is mainly between different data fields. Inconsistent and duplicate data, as in the example above, comes from merging different data sources or from non-uniform naming conventions.
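One common way to surface such naming variants is to normalize names before comparing them. The sketch below is a simplified illustration, not a description of our production process: the alias table and function names are hypothetical, and the output is a list of candidate duplicates for review rather than an automatic merge, since two similar names may still belong to entities with different legal forms.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical alias table; a real one would be far larger and region-specific.
LEGAL_FORM_ALIASES: Dict[str, str] = {
    "ltd": "limited",
    "co": "company",
}


def normalize_name(name: str) -> str:
    """Lower-case the name, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(LEGAL_FORM_ALIASES.get(token, token) for token in tokens)


def candidate_duplicates(names: List[str]) -> List[List[str]]:
    """Group names that normalize to the same string; keep groups of two or more."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["XXX Company LTD", "XXX Company Limited", "YYY Trading"]
# Each group is a set of candidates for review, not a confirmed duplicate.
print(candidate_duplicates(names))
```

Each group should then be checked against other fields, such as the registered legal form or registration number, before any records are linked.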

These types of mistakes can result from human error, poor recording software, or incomplete control over the type of data imported. Before processing the data for analysis or use, error-prevention strategies should be implemented to reduce common errors as much as possible, and to ensure that data is accurate, valid and consistent.

Maintaining a high-quality database is essential for our company to ensure accuracy in our credit and due diligence reports. In our data warehouse, which currently contains more than 12 million companies and more than 23 million individuals, data cleaning is a major part of the extract, transform and load (ETL) process. Data cleansing (also known as data cleaning or scrubbing) is the process of spotting and rectifying inaccurate or corrupt data. Incomplete, inaccurate or irrelevant data is identified and then replaced, modified or deleted.

It is also worth noting that all our data is date-stamped and graded. Our clients can now check the date of each data field in addition to its source grading. In recent years, Cedar Rose has implemented a system for grading and evaluating source reliability, as well as the credibility of the information and intelligence, for the majority of our data. This grading is invaluable to our subscribers and due diligence clients, who can then determine which data they can rely on 100% and which is less reliable (e.g. data from third parties, assumed to be correct but not verified).
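To show how a client might use date stamps and source grades together, here is a small Python sketch. The grade labels ("A" for verified, "C" for unverified third-party data), the field layout and the freshness threshold are all invented for illustration; they are not Cedar Rose's actual grading scheme or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

# Hypothetical set of grades treated as fully reliable in this example.
TRUSTED_GRADES = {"A"}


@dataclass(frozen=True)
class GradedField:
    """One data field with its date stamp and a hypothetical source grade."""
    name: str
    value: str
    as_of: date
    source_grade: str


def split_by_reliability(
    fields: List[GradedField], max_age_days: int, today: date
) -> Tuple[List[GradedField], List[GradedField]]:
    """Separate fields that are both fresh and trusted from those needing review."""
    trusted, needs_review = [], []
    for field in fields:
        is_fresh = (today - field.as_of).days <= max_age_days
        if is_fresh and field.source_grade in TRUSTED_GRADES:
            trusted.append(field)
        else:
            needs_review.append(field)
    return trusted, needs_review


fields = [
    GradedField("registered_address", "1 Main Street", date(2024, 5, 1), "A"),
    GradedField("employee_count", "12", date(2024, 4, 15), "C"),
    GradedField("legal_form", "Limited", date(2020, 1, 1), "A"),
]
trusted, needs_review = split_by_reliability(fields, max_age_days=365, today=date(2024, 6, 1))
print([f.name for f in trusted])
print([f.name for f in needs_review])
```

In this sketch a field is relied on only when it is both recent and from a verified source; anything else is set aside for further checking.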

No matter what sector you are working in, from public health to extractive industries to education, you can have access to our cleaned and linked database of companies, directors and shareholders via our website at www.cedar-rose.com, via API or by a CRiS subscription.

For further information, please contact Hannah King or Nicole Konstantinou to arrange a demonstration of CRiS, or go to our website to search and download or order a fresh investigation on your client today.

Written By Elissa Ghosn, Data Analyst

*** The sole purpose of the article above is to generate public discussion; it is not intended to constitute legal advice. ***