
Fuzzy Matching in Microsoft Fabric: The elegant solution when your data "almost" matches

Updated: Apr 2

The disparate data dilemma: When "equal" doesn't mean "identical"


Have you ever found yourself faced with two datasets that should fit perfectly, but simply don't? Imagine this scenario: you have customer data from two different systems. One says "Ana Torres García, El Ejido" and the other "Ana Torres, Ejido." To you, it's obvious they're the same person, but to your system, they're completely different entities. This is the kind of problem that can turn a seemingly simple data integration project into a monumental headache... unless you understand the power of fuzzy matching.


What exactly is Fuzzy Matching?


Unlike exact comparisons (where "Almería" and "Almeria" are treated as different), fuzzy matching evaluates the similarity between texts, assigning a percentage that indicates how similar they are. It's as if you were giving your system the ability to say, "These texts are 95% the same; they probably refer to the same thing." In this article, I'll show you how to implement this powerful technique in Microsoft Fabric to unify data that represents the same information but has been recorded with slight variations.
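
To make the idea concrete, here is a minimal, standalone sketch that scores a couple of the mismatches mentioned above (it only assumes the fuzzywuzzy library, which we will install in the Fabric notebook later in this article):

from fuzzywuzzy import fuzz

# An exact comparison fails on a single missing accent...
print("Almería" == "Almeria")  # False

# ...while a fuzzy score tells us how close the two strings really are
print(fuzz.ratio("Almería", "Almeria"))  # high similarity despite the accent

# token_sort_ratio ignores word order and scores whole phrases
print(fuzz.token_sort_ratio("Ana Torres García, El Ejido", "Ana Torres, Ejido"))  # comfortably above 70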


Our case study: Two contact lists that do not match


For this tutorial, we'll be working with two CSV files containing information about the same contacts, but structured differently:



| Name | First Surname | Second Surname | Place of birth |
| --- | --- | --- | --- |
| Maria | Gonzalez | Perez | Almería |
| Juan | Martinez | Lopez | Roquetas de Mar |
| Ana | Torres | Garcia | El Ejido |
| Pedro | Hernandez | Sanchez | Adra |
| Lucia | Fernández | Diaz | Vicar |
| Carlos | Lopez | Ruiz | Níjar |
| Sofia | Perez | Torres | Huércal-Overa |
| Miguel | Alvarez | Gomez | Vera |
| Elena | Sanchez | Moreno | Berja |
| José | Ruiz | Castro | Carboneras |

Table 1. Data from the first CSV file.

The second CSV file, meanwhile, contains the following data:



| Contact name | City |
| --- | --- |
| María González | Almería |
| Juan Martínez | Roquetas |
| Ana Torres | Ejido |
| Pedro Hernández | Adra |
| Lucia Fernandez | Vicar |
| Carlos López | Níjar |
| Sofia Perez | Huércal |
| Miguel Álvarez | Vera |
| Elena Sánchez | Berja |
| José Ruiz | Carboneras |

Table 2. Data from the second CSV file.

As you can see, not only is the structure different, but the information itself varies:

- "Roquetas de Mar" vs "Roquetas"
- "El Ejido" vs "Ejido"
- "Almería" vs "Almeria" (without accent)

Furthermore, there is no unique identifier (such as a client ID) that allows both sources to be reliably linked. This is where fuzzy matching becomes our best ally.

Preparing the environment in Microsoft Fabric


Before we dive into the code, we need to set up our working environment in Microsoft Fabric:


Step 1: Create a Lakehouse


A Lakehouse will allow us to store and process our CSV files efficiently:

1. Access your Fabric workspace

2. Select "Create" → "Lakehouse" in the "Store Data" section

3. Assign a descriptive name and click "Create"


Step 2: Upload the CSV files


Once the Lakehouse is created:

1. Click "Upload Files"

2. Select your CSV files

3. Verify that the files appear in the "Files" folder of your Lakehouse


Step 3: Create a Notebook


To run our analysis:

1. Go back to “Create” and this time select “Notebook” in the “Data Analysis and Training” section

2. Make sure the selected language is “PySpark (Python)”

3. Connect the notebook to your Lakehouse from the “Add Data Sources” option

And we are ready to start writing code!


The Code: Making Magic with Fuzzy Matching


First, you must install any required libraries that are not already included in the kernel running the notebook. In this case, you will add the fuzzywuzzy library, which provides the fuzzy comparison algorithms:


!pip install fuzzywuzzy

Requirement already satisfied: fuzzywuzzy in /home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages (0.18.0)

Next, load the libraries to be used:

- pyspark.sql.functions, for functions that facilitate DataFrame processing (plus IntegerType from pyspark.sql.types, needed later to declare the UDF's return type).
- fuzzywuzzy, to perform the comparisons and obtain similarity percentages.


from pyspark.sql.functions import concat_ws, lower, trim, udf
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz

It's time to read the data, which requires knowing the path to each file. Since the Lakehouse is connected to the notebook and we are using PySpark, we can use the relative path that Fabric provides for each file: navigate to the Files section, click the three-dot icon (···) next to the file, and copy the relative path for Spark.

# Relative Spark paths to the two CSV files uploaded to the Lakehouse
ruta_jugadores1 = 'Files/datos_1.csv'
ruta_jugadores2 = 'Files/datos_2.csv'

# Read both files, using the first row as column headers
df1 = spark.read.option("header", "true").csv(ruta_jugadores1)
df2 = spark.read.option("header", "true").csv(ruta_jugadores2)

The data will then be preprocessed into a uniform format so that the comparison is as efficient as possible:


# Combine name and place into a single string
combined_expr1 = concat_ws(" ", df1["Nombre"], df1["Primer Apellido"], df1["Segundo Apellido"], df1["Lugar de Nacimiento"])
combined_expr2 = concat_ws(" ", df2["Nombre contacto"], df2["Ciudad"])

# Build a combined column in each source DataFrame, converting everything to lowercase and trimming whitespace
df1_combined = df1.withColumn("Combined", lower(trim(combined_expr1)))
df2_combined = df2.withColumn("Combined", lower(trim(combined_expr2)))

At this point, it's necessary to clarify what UDFs are within the PySpark ecosystem. UDFs (User-Defined Functions) are a feature that allows you to define custom functions in Python and apply them to columns in a Spark DataFrame. In this case, they are used to calculate the similarity between strings using the fuzzywuzzy library, since Spark doesn't have a native function for fuzzy matching. This is where our solution really shines. We'll use a UDF to calculate the similarity between the combined strings:

# Define a UDF for combined similarity
fuzzy_combined_udf = udf(lambda x, y: fuzz.token_sort_ratio(x, y), IntegerType())

# Cross join and compute the similarity of each pair
matched_df = df1_combined.crossJoin(df2_combined).withColumn(
    "Similarity", fuzzy_combined_udf(df1_combined["Combined"], df2_combined["Combined"])
)

# Filter by a 70% threshold
resultado = matched_df.filter(matched_df["Similarity"] > 70).orderBy("Similarity", ascending=False)

# Show the results
display(resultado)

What exactly is going on here?

1. We create a function (fuzzy_combined_udf) that uses the token_sort_ratio algorithm to compare two strings.

2. We perform a cross join to compare each row of the first set with each row of the second.

3. We calculate the similarity percentage for each pair.

4. We filter out those pairs that exceed 70% similarity.

5. We sort the results by descending similarity.

The token_sort_ratio algorithm is especially useful because it considers words as "tokens" regardless of their order, which allows it to handle cases like "Juan Martínez López" vs "Martínez Juan". Running the code gives the results in the following table:



| Name | First Surname | Second Surname | Place of birth | Similarity | Combined (source 1) | Contact name | City | Combined (source 2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lucia | Fernandez | Diaz | Vicar | 90 | Lucia Fernandez Diaz Vicar | Lucia Fernandez | Vicar | Lucia Fernández Vicar |
| Miguel | Alvarez | Gomez | Vera | 88 | Miguel Álvarez Gómez Vera | Miguel Álvarez | Vera | Miguel Álvarez Vera |
| Maria | Gonzalez | Perez | Almería | 86 | Maria Gonzalez Perez Almería | María González | Almeria | María González Almeria |
| Carlos | Lopez | Ruiz | Níjar | 86 | Carlos López Ruiz Níjar | Carlos López | Níjar | Carlos Lopez Nijar |
| Pedro | Hernandez | Sanchez | Adra | 84 | Pedro Hernández Sánchez Adra | Pedro Hernández | Adra | Pedro Hernández Adra |
| Elena | Sanchez | Moreno | Berja | 84 | Elena Sánchez Moreno Berja | Elena Sánchez | Berja | Elena Sánchez Berja |
| José | Ruiz | Castro | Carboneras | 84 | José Ruiz Castro Carboneras | José Ruiz | Carboneras | José Ruiz Carboneras |
| Juan | Martinez | Lopez | Roquetas de Mar | 78 | Juan Martinez Lopez, Roquetas de Mar | Juan Martínez | Roquetas | Juan Martínez Roquetas |
| Ana | Torres | Garcia | El Ejido | 78 | Ana Torres García, El Ejido | Ana Torres | Ejido | Ana Torres Ejido |
| Sofia | Perez | Torres | Huércal-Overa | 71 | Sofia Perez Torres, Huércal-Overa | Sofia Perez | Huércal | Sofia Perez Huércal |

Table 3. Comparison results.

Finally, to complete the process described so far, we need to unify the information in a single, actionable location. To do this, we save the result as a Delta table in the Lakehouse:


# Select and rename columns for the unified table
tabla_unificada = resultado.select(
    resultado["Nombre"].alias("nombre"),
    resultado["Primer Apellido"].alias("apellido_1"),
    resultado["Segundo Apellido"].alias("apellido_2"),
    resultado["Lugar de Nacimiento"].alias("lugar_nacimiento"),
    resultado["Similarity"].alias("similaridad")
)

# Save as a Delta table in the Lakehouse
tabla_unificada.write.mode("overwrite").format("delta").saveAsTable("Clientes_Unificados")
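
As an optional sanity check (a quick sketch, assuming the same notebook session as above), you can read the newly created table back and confirm the rows with the highest similarity:

# Read the unified Delta table back from the Lakehouse and show the best matches first
df_check = spark.read.table("Clientes_Unificados")
display(df_check.orderBy("similaridad", ascending=False))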

Beyond the basic example: Practical applications


This approach can be extended to much more complex scenarios:


Customer database deduplication


Do you have a database with thousands of customers that you suspect contains duplicates? Fuzzy matching can help you identify entries like "Javier Rodríguez" and "J. Rodriguez" as potentially the same customer.
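
As a rough illustration of the idea (not part of the tutorial dataset: the sample table and its id and full_name columns are hypothetical), a fuzzy self-comparison to flag candidate duplicates might look like this:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz

# Hypothetical customer table with a numeric id and a free-text name
clientes = spark.createDataFrame(
    [(1, "Javier Rodríguez"), (2, "J. Rodriguez"), (3, "Marta Gil")],
    ["id", "full_name"],
)

similitud_udf = F.udf(lambda a, b: fuzz.token_sort_ratio(a, b), IntegerType())

# Compare every customer against every other one (self cross join),
# keeping only one direction of each pair and skipping self-comparisons
a = clientes.alias("a")
b = clientes.alias("b")
candidatos = (
    a.crossJoin(b)
     .filter(F.col("a.id") < F.col("b.id"))
     .withColumn("similaridad", similitud_udf(F.col("a.full_name"), F.col("b.full_name")))
     .filter(F.col("similaridad") > 70)
)
display(candidatos)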


Integration of product catalogs


This approach also helps when you need to merge product catalogs from different vendors where the product names vary slightly ("iPhone 13 Pro Max 256GB" vs "Apple iPhone 13 Pro Max (256 GB)").


Geographic address normalization


To standardize geographic references such as "Calle Mayor, 5" and "C/ Mayor nº5" or "Barcelona, Spain" and "BCN, ES".


Improvements and advanced considerations


For more sophisticated implementations, consider:


1. Adjusting the similarity threshold: 70% works for our example, but depending on your data, you might need a different value.

2. Using different algorithms: In addition to token_sort_ratio, the fuzzywuzzy library offers other functions such as partial_ratio or token_set_ratio that may work better depending on the type of variations in your data (see the short sketch after this list).

3. Parallelizing the process: For large datasets, take advantage of Spark's ability to distribute computation across multiple nodes.

4. Pre-filtering: To optimize performance, you can implement pre-filtering to reduce the number of comparisons required.
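
To give a feel for how these scorers differ, here is a small standalone sketch (plain Python, no Spark required; the example strings come from our dataset, and the exact scores will depend on your environment):

from fuzzywuzzy import fuzz

a = "Juan Martínez López Roquetas de Mar"
b = "Martínez Juan Roquetas"

# token_sort_ratio: sorts the tokens first, so word order does not matter
print(fuzz.token_sort_ratio(a, b))

# token_set_ratio: also tolerates tokens that only appear on one side,
# so it tends to score higher when one string has extra words
print(fuzz.token_set_ratio(a, b))

# partial_ratio: looks for the best matching substring,
# useful when one string is contained within the other ("Ejido" in "El Ejido")
print(fuzz.partial_ratio("El Ejido", "Ejido"))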


Conclusion: The power of flexibility


In an ideal world, all systems would use universal identifiers and standardized data formats. But in the business world, we constantly face heterogeneity and inconsistencies.


Fuzzy matching in Microsoft Fabric provides us with an elegant tool for navigating this data chaos, allowing us to unify valuable insights that would otherwise remain fragmented. This technique not only saves time on manual data cleansing but also unlocks new possibilities for integrated analysis and decision-making.


The next time you're faced with data sets that should match but don't, remember: sometimes the solution isn't to force accuracy, but to embrace similarity.

