
Guide to Data Deduplication


At the most basic level, data deduplication refers to the removal of redundant or duplicate data. It is an ongoing process to ensure that no excess data sits in your database and that you’re using only a single source of truth, or golden record, for analytics and operations.

Redundant or duplicate data can harm your business and your strategy in many ways, in both operational and analytical use cases. From an operational perspective, you can’t answer questions like, “Which account is the right one to contact?”


From an analytics perspective, it’s hard to answer questions like, “Who are my top-paying customers by revenue?”


Data deduplication has a lot of overlap with data unification, where the task is to ingest data from multiple systems and clean it. It also overlaps with entity resolution, where the task is to identify the same entity across different data sources and data formats.

What are the benefits of deduplication?

Data deduplication can benefit your business in myriad ways. For example, improved data quality can lead to cost savings, more effective marketing campaigns, improved return on investment, a better customer experience, and more.

  • Improved cost savings. This is the most obvious and direct benefit. First, deduplication reduces data storage costs. It also saves on data preparation and correction costs: data analysts no longer need to spend 80% of their time on tasks such as data wrangling and transformation, and can instead focus on more valuable data analysis. That, in turn, can help reduce employee churn.
  • More accurate analytics. In the example above, we were unsure which customer was our highest-paying customer. In general, duplicate data distorts a company’s visibility into its customer base and can derail analytics efforts. Data deduplication provides the team with the most accurate data and ultimately improves analytics outcomes.
  • Better customer experience. Duplicate data can cause companies to focus on the wrong targets and, even worse, to contact the same person multiple times. Data deduplication gives the customer success team a holistic view of its customers so it can provide the best customer experience possible.

When is data deduplication used?

If you deal with real business data, then you have certainly faced the headaches of duplicate data. Whether it comes from customers filling out forms, your team entering data manually, or imports from third-party platforms, certain patterns create duplicate data, and as a result it can be quite difficult to get rid of. Data deduplication can help overcome duplicate data caused by the situations below.

  • Different expressions. One of the most common ways duplicate data is created in databases is through common terms expressed in different ways, for example, Tamr Inc. and Tamr Incorporated. A human can look at the two records and know instantly that they refer to the same company, but databases will treat them as two distinct records. The same problem happens with job titles: VP, V.P., and Vice President are good examples.
  • Nicknames (short names). People are often known by multiple names, such as a more casual version of their first name, a nickname, or simply initials. For example, someone named Andrew John Wyatt might be known as Andy Wyatt, A.J. Wyatt, or Andy J. Wyatt. These name variations can easily create duplicate records in a database such as your CRM system.
  • Typos (fat fingers). Whenever humans are responsible for inputting data, there are going to be data quality issues. The average human data entry error rate can be as high as 4%, which means one in 25 keystrokes could be wrong. You might run into issues like “Gooogle” or “Amason” in company names and misspelled names such as “Thomas” typed as “Tomas”. In either case, they will create duplicate records.
  • Titles & Suffixes. Contact data may include a title or a suffix, and those can cause duplicate data as well. A person called Dr. Andrew Wyatt and a person called Andrew Wyatt could be created as separate records and live in different data systems, although they could be the same person.
  • Websites / URLs. Organization website URLs may or may not contain “https://” or “www.” in them. Furthermore, different records might carry different top-level domains, such as amazon.com vs. amazon.co.uk. All of these differences can cause duplicate records.
  • Number formats. The most common ones are phone numbers and dates. There are many ways to format a phone number. For example, 1234567890, 123-456-7890, (123)-456-7890, and 1-123-456-7890. In the case of dates, there are also many ways to represent them. For example, 20220607, 06/07/2022, and 2022-6-7. Number fields are also prone to typos and other issues, causing different representations of the same value.
  • Partial matches. This is one of the more complex issues and something not easily resolved by traditional rules or simple match algorithms. In the case of partial matches, the records share similarities with each other but are not exactly the same entity. For example, Harvard University, Harvard Business School, and Harvard Business Review Publishing. From an affiliated organization perspective, they are all affiliated with Harvard University. But from a mail delivery perspective, they would be distinct entities.

In short, duplicate records enter data systems in many ways, and a real database will reflect a combination of these factors. In the process of deduplication, you need to account for many, if not all, of them.
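
To make the problem concrete, here is a minimal sketch in Python (the records are made up) of why naive exact-match deduplication misses all of these variants:

```python
# Made-up records that a human would recognize as one person at one company.
records = [
    {"name": "Andrew John Wyatt", "company": "Tamr Inc.",         "phone": "1234567890"},
    {"name": "Andy Wyatt",        "company": "Tamr Incorporated", "phone": "123-456-7890"},
    {"name": "Dr. Andrew Wyatt",  "company": "Tamr Inc",          "phone": "(123)-456-7890"},
]

seen, deduped = set(), []
for r in records:
    key = (r["name"], r["company"], r["phone"])  # naive exact-match key
    if key not in seen:
        seen.add(key)
        deduped.append(r)

print(len(deduped))  # 3: every variant survives as a "distinct" record
```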

How does data deduplication work?

As discussed before, data deduplication has a lot of overlap with data unification and entity resolution, and there is a class of tools dedicated to solving this problem: Master Data Management (MDM). But in its simplest form, data deduplication is simply the process of ensuring that only a single source of truth, or golden record, is used for analytics or operations.

There are traditional approaches to data deduplication, such as data standardization, relying on external IDs, and fuzzy matching with rules. But these approaches work only partially, because of all the variations of data problems described in the previous section and the ever-growing volume and variety of data.

Data standardization. For small data volumes, standardizing fields such as dates, phone numbers, and even addresses can solve the problem. But traditional methods such as ETL pipelines struggle to keep up with new data sources and new varieties of data.
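
As an illustration, a minimal standardization pass for the phone and date formats mentioned earlier might look like the sketch below (the function names and the list of accepted formats are assumptions, not a fixed recipe):

```python
import re
from datetime import datetime

def standardize_phone(raw: str) -> str:
    """Strip punctuation and a leading country code: '(123)-456-7890' -> '1234567890'."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else digits

def standardize_date(raw: str) -> str:
    """Try a few known formats and normalize to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y%m%d", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw  # leave unparseable values for a data steward to review

print(standardize_phone("1-123-456-7890"))  # 1234567890
print(standardize_date("06/07/2022"))       # 2022-06-07
```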

Fuzzy match with rules. This method uses a combination of fuzzy matching (approximate string matching) and complicated rules to match potential duplicates. But the number of rules quickly skyrockets when multiple data systems are in play with each other, and it soon becomes very difficult to maintain them.
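
Here is a minimal sketch of this approach, using the standard library’s difflib for the fuzzy part (the 0.85 threshold and the two rules are illustrative assumptions):

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Approximate string match: true when the similarity ratio clears the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def is_duplicate(rec_a: dict, rec_b: dict) -> bool:
    # Rule 1: identical standardized phone numbers are an automatic match.
    if rec_a.get("phone") and rec_a["phone"] == rec_b.get("phone"):
        return True
    # Rule 2: both the names and the companies must be fuzzy-similar.
    return (similar(rec_a["name"], rec_b["name"])
            and similar(rec_a["company"], rec_b["company"]))
```

Every new source or edge case tends to demand yet another rule, which is exactly how these systems become unmaintainable.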

Relying on external IDs. Sometimes the data already has a primary key that you can rely on to deduplicate records. In the case of people, it could be a social security number; in the case of companies, a DUNS number. But DUNS numbers don’t always exist, and they are expensive to acquire.
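
Where a reliable external key does exist, deduplication can reduce to a group-by, as in this sketch (the duns_number field name is an assumption):

```python
from collections import defaultdict

def dedupe_by_external_id(records: list[dict], id_field: str = "duns_number") -> list[dict]:
    """Keep one record per external ID; records missing the ID pass through untouched."""
    groups, missing_id = defaultdict(list), []
    for r in records:
        if r.get(id_field):
            groups[r[id_field]].append(r)
        else:
            missing_id.append(r)
    # Keep the first record per group; real systems apply survivorship rules here.
    return [grp[0] for grp in groups.values()] + missing_id
```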

Newer approaches address these shortcomings with several components.

Machine learning. Look for solutions that take a machine learning-first approach to entity resolution. Machine learning improves with more data; rules do not. Machine learning increases automation and frees up technical resources by up to 90%, while rules-only approaches have the opposite effect.
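
As a toy illustration of the idea, the sketch below trains a pairwise match classifier with scikit-learn, using difflib similarity scores as features (the labeled pairs are made up, and production systems use far richer features and training data):

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a: dict, b: dict) -> list[float]:
    # One similarity score per field; real systems add many more signals.
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in ("name", "company")]

# Made-up labeled pairs: 1 = same entity, 0 = different entities.
labeled_pairs = [
    (({"name": "Andy Wyatt", "company": "Tamr Inc."},
      {"name": "Andrew Wyatt", "company": "Tamr Incorporated"}), 1),
    (({"name": "Andy Wyatt", "company": "Tamr Inc."},
      {"name": "Jane Doe", "company": "Acme Corp"}), 0),
]
X = [features(a, b) for (a, b), _ in labeled_pairs]
y = [label for _, label in labeled_pairs]

model = LogisticRegression().fit(X, y)
# Unlike hand-written rules, the model keeps improving as labeled pairs accumulate.
```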

Persistent IDs. Data and attributes tend to change over time even as you deduplicate them. As you reconcile different records into the same entity, maintaining a persistent ID can help provide you with a longitudinal view of the entity.
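
A minimal sketch of what a persistent-ID registry might look like (the class and method names are hypothetical):

```python
import uuid

class PersistentIdRegistry:
    """Maps source-record keys to a persistent entity ID that survives merges."""

    def __init__(self) -> None:
        self.entity_of: dict[str, str] = {}  # source key -> persistent entity ID

    def assign(self, source_key: str) -> str:
        """Return the existing entity ID for this key, minting a new one if needed."""
        return self.entity_of.setdefault(source_key, str(uuid.uuid4()))

    def merge(self, key_a: str, key_b: str) -> str:
        """When two records are resolved to one entity, both keys share one surviving ID."""
        survivor = self.assign(key_a)
        self.entity_of[key_b] = survivor
        return survivor
```

Because downstream systems reference the surviving ID rather than any one source record, the longitudinal view of the entity is preserved even as its records are merged and re-merged.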

Enrichment data. Data enrichment integrates your internal data assets with external data to increase the value of these assets. It also automatically standardizes many of the fields discussed above, like addresses, phone numbers, and other fields that can be used to identify the records, thus making it easier to identify duplicate records as well.
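
For example, here is a sketch of enrichment against a hypothetical external reference table keyed on a company’s normalized web domain (the reference data and field names are invented):

```python
# Hypothetical external reference data keyed on a company's web domain.
REFERENCE = {"tamr.com": {"legal_name": "Tamr Inc.", "hq_country": "US"}}

def normalize_domain(url: str) -> str:
    """Strip the scheme and 'www.' so URL variants collapse to one key."""
    host = url.lower().removeprefix("https://").removeprefix("http://").removeprefix("www.")
    return host.split("/")[0]

def enrich(record: dict) -> dict:
    domain = normalize_domain(record["website"])
    # Merge in the external attributes; the normalized domain doubles as a match key.
    return {**record, **REFERENCE.get(domain, {}), "domain": domain}

print(enrich({"name": "Tamr Incorporated", "website": "https://www.Tamr.com/about"}))
```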

How do you prevent duplicates from being created?

Healthy, next-gen data ecosystems can process data in both batch and streaming modes simultaneously. And this needs to occur not only from source to consumption but also back to the source, which is often itself an operational system (or systems).

Having the ability to read and write in real time, or near real time, also prevents you from creating bad data in the first place. For example, real-time reading can enable autocomplete functions that block errors at the point of entry. And real-time writing through the MDM services and match index can help share good data back to the source systems. With real-time or near-real-time reading and writing, it’s easy to identify a duplicate and either block it from entering the database, automatically merge it with the existing record, or send it to a data steward for review.
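
A sketch of what that point-of-entry check might look like, keyed here on normalized phone digits (the match-index structure and the routing logic are deliberately simplified assumptions):

```python
def handle_incoming(record: dict, match_index: dict, review_queue: list) -> None:
    """Check a new record against a match index before it reaches the database."""
    key = "".join(filter(str.isdigit, record.get("phone", "")))  # normalized match key
    existing = match_index.get(key)
    if existing is None:
        match_index[key] = record                # no match: admit the new record
    elif existing == record:
        return                                   # exact duplicate: block it at the door
    elif set(record) - set(existing):
        existing.update(record)                  # has fields the index record lacks: auto-merge (simplified)
    else:
        review_queue.append((existing, record))  # conflicting values: send to a steward
```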

Next-generation data mastering tools combine the components above (machine learning, persistent IDs, and enrichment) with this kind of real-time processing to deal with duplicate issues and solve the problems that traditional tools can’t.

Duplicate data, along with many other pitfalls, is unavoidable in today’s environment, where data grows exponentially. But with the right next-gen MDM tools, deduping your data becomes much easier.
