Support our 100% FREE Projects: Donate Now OR Sponsor Now


  

SQL Soundex

We primarely work on data warehouses and it is very common to have data quality issues. In this step by step sql tutorial blog post I will show how you can deal with certain types of data quality issues and resolve them using SQL Soundex function. But before we start I will explain when you can use it and when you should NOT use it. 

Microsoft Soundex page explains soundex function; as always in a very technical way without providing much guidance on when you should use it and when you shoudn't use it, so let's start from the very beginnng.

What is soundex function?

SQL Server Soundex is a function that converts string (words) into 4 characters code. The code represent "sound" of the word so you can use function on another word and see if they "sounds" the same. If 4 characters code is the same than it sounds the same.

Let's show you a simple SQL query example:

SELECT SOUNDEX('Katie') --returns K300

SELECT SOUNDEX('Emil') --returns E540

As you can see Katie & Emil don't sound the same. Let's give you another example.

SELECT SOUNDEX('Katy') --returns K300

SELECT SOUNDEX('Katie') --returns K300

Katie & Katy sound similar so they have the same code K300. Actually this is nice example because when Katie introduces herself and than people type the name on a computer they sometimes type Katy instead of Katie because both names sound similar.

What is the purpose of Soundex function?

You can use SQL Soundex function to compare two words and find out if they sound the same. Remember we only check sounds, that means we check sound similarity but that doesn't mean the words are similar or the same. Katie & Katy sound similar BUT there are two different names so they are distinct which is important when you want to clean data. You don't want to change Katie to Katy because Katie is Katie ;)

When to use SQL Server Soundex function?

I must admit I don't often use Soundex function for data cleansing purposes myself as sound similarity is usually insufficient for most tasks I do; but there are exceptions when I could use this function. This function can be used when you have a list of values which give you unique sound code, so when you compare another word against he list you can find one match only (or no match).

SQL Soundex Example

Recently I have come across this question on Microsoft forum SQL Question and soundex as proposed solution the problem was that month is mispelled and contains odd numbers. The task was to get proper month out of the fields. Here are some example of incorrect input.

Row 1: January300
Row 2: January398
Row 3: January
Row 4: February40
Row 5: February20694

Row 6: Januuary 5454

One of the solution is to build a list of incorrect values and map it to correct month name which can be time consuming, another one is too take out number and spaces but it doesn't deal with mispelled months (row 6). So can we use soundex? Let's try it.

First let's see if we have unique soundex code for all 12 months:

SELECT SOUNDEX('January') as [Soundex month code] union all

SELECT SOUNDEX('February') union all

SELECT SOUNDEX('March') union all

SELECT SOUNDEX('April') union all

SELECT SOUNDEX('May') union all

SELECT SOUNDEX('June') union all

SELECT SOUNDEX('July') union all

SELECT SOUNDEX('August') union all

SELECT SOUNDEX('September') union all

SELECT SOUNDEX('October') union all

SELECT SOUNDEX('November') union all

SELECT SOUNDEX('December')

and the results are below. Left side is January to December, Right side is sorted by soundex code to see if we have the same soundex month codes.

SQL soundex

This looks good. So let's try to tidy up our list of incorrect values.

SQL Soundex data cleansing

Here is the code that we use to perform data cleansing using t-sql soundex function.

select s.id, s.DirtyMonth,

  case

    when soundex('january') = soundex(s.DirtyMonth)

      then 'January'

    when soundex('feburary') = soundex(s.DirtyMonth)

      then 'February'

     -- and so on

  else 'unknown'

  end as CleanMonthName

from (

  Select 1 as ID, 'January300' as DirtyMonth Union All

  Select 2, 'January398' Union All

  Select 3, 'January' Union All

  Select 4, 'February40' Union All

  Select 5, 'February20694'Union All

  Select 6, 'Januuary 5454') as s

 

Below are results of the above SQL query:

sql soundex data cleansing

Success!! That worked fine. 

IMPORTANT: Soundex worked fine in our case BUT be careful when you use it because it might not be so accurate with different examples of inaccurate data so ensure you test it on full set of invalid values. In our case anything that doesn't match month goes to 'unknown' so some kind of logic or process should be put in place to tidy up the exception.

 

SQL Soundex final words:

Soundex is not commonly used but sometimes it can help especially when we to do something quickly. Data cleansing in data warehouses project is generally done using fuzzy lookup and in SQL Server 2012 we have Master Data Services (master data management) and Data Quality Services which can help us with this kind of tasks. 

If you want to use T-SQL for data cleansing check Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server

I hope you found SQL Soundex article helpful Take care Emil
Share: Share on FacebookShare on Google PlusTweet it
Comments Add Comment
No data was returned. Share your thoughts, questions and suggest improvements:
Add Comment

You found us! Below are 50 most popular searched keywords

created at TagCrowd.com

Disclaimer: While every caution has been taken to provide our readers with most accurate information and honest analysis, please use your discretion before taking any decisions based on the information in this blog. Author will not compensate you in any way whatsoever if you ever happen to suffer a loss/inconvenience/damage because of/while making use of information in this blog.