We primarely work on data warehouses and it is very common to have data quality issues. In this step by step sql tutorial blog post I will show how you can deal with certain types of data quality issues and resolve them using SQL Soundex function. But before we start I will explain when you can use it and when you should NOT use it.
Microsoft Soundex page explains soundex function; as always in a very technical way without providing much guidance on when you should use it and when you shoudn't use it, so let's start from the very beginnng.
What is soundex function?
SQL Server Soundex is a function that converts string (words) into 4 characters code. The code represent "sound" of the word so you can use function on another word and see if they "sounds" the same. If 4 characters code is the same than it sounds the same.
Let's show you a simple SQL query example:
SELECT SOUNDEX('Katie') --returns K300
SELECT SOUNDEX('Emil') --returns E540
As you can see Katie & Emil don't sound the same. Let's give you another example.
SELECT SOUNDEX('Katy') --returns K300
SELECT SOUNDEX('Katie') --returns K300
Katie & Katy sound similar so they have the same code K300. Actually this is nice example because when Katie introduces herself and than people type the name on a computer they sometimes type Katy instead of Katie because both names sound similar.
What is the purpose of Soundex function?
You can use SQL Soundex function to compare two words and find out if they sound the same. Remember we only check sounds, that means we check sound similarity but that doesn't mean the words are similar or the same. Katie & Katy sound similar BUT there are two different names so they are distinct which is important when you want to clean data. You don't want to change Katie to Katy because Katie is Katie ;)
When to use SQL Server Soundex function?
I must admit I don't often use Soundex function for data cleansing purposes myself as sound similarity is usually insufficient for most tasks I do; but there are exceptions when I could use this function. This function can be used when you have a list of values which give you unique sound code, so when you compare another word against he list you can find one match only (or no match).
SQL Soundex Example
Recently I have come across this question on Microsoft forum SQL Question and soundex as proposed solution the problem was that month is mispelled and contains odd numbers. The task was to get proper month out of the fields. Here are some example of incorrect input.
Row 1: January300
Row 2: January398
Row 3: January
Row 4: February40
Row 5: February20694
Row 6: Januuary 5454
One of the solution is to build a list of incorrect values and map it to correct month name which can be time consuming, another one is too take out number and spaces but it doesn't deal with mispelled months (row 6). So can we use soundex? Let's try it.
First let's see if we have unique soundex code for all 12 months:
SELECT SOUNDEX('January') as [Soundex month code] union all
SELECT SOUNDEX('February') union all
SELECT SOUNDEX('March') union all
SELECT SOUNDEX('April') union all
SELECT SOUNDEX('May') union all
SELECT SOUNDEX('June') union all
SELECT SOUNDEX('July') union all
SELECT SOUNDEX('August') union all
SELECT SOUNDEX('September') union all
SELECT SOUNDEX('October') union all
SELECT SOUNDEX('November') union all
and the results are below. Left side is January to December, Right side is sorted by soundex code to see if we have the same soundex month codes.
This looks good. So let's try to tidy up our list of incorrect values.
SQL Soundex data cleansing
Here is the code that we use to perform data cleansing using t-sql soundex function.
select s.id, s.DirtyMonth,
when soundex('january') = soundex(s.DirtyMonth)
when soundex('feburary') = soundex(s.DirtyMonth)
-- and so on
end as CleanMonthName
Select 1 as ID, 'January300' as DirtyMonth Union All
Select 2, 'January398' Union All
Select 3, 'January' Union All
Select 4, 'February40' Union All
Select 5, 'February20694'Union All
Select 6, 'Januuary 5454') as s
Below are results of the above SQL query:
Success!! That worked fine.
IMPORTANT: Soundex worked fine in our case BUT be careful when you use it because it might not be so accurate with different examples of inaccurate data so ensure you test it on full set of invalid values. In our case anything that doesn't match month goes to 'unknown' so some kind of logic or process should be put in place to tidy up the exception.
SQL Soundex final words:
Soundex is not commonly used but sometimes it can help especially when we to do something quickly. Data cleansing in data warehouses project is generally done using fuzzy lookup and in SQL Server 2012 we have Master Data Services (master data management) and Data Quality Services which can help us with this kind of tasks.
If you want to use T-SQL for data cleansing check Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server
I hope you found SQL Soundex article helpful