
Fuzzy Matching in Microsoft Fabric: The elegant solution when your data "almost" matches

Updated: Apr 2

The disparate data dilemma: When "equal" doesn't mean "identical"


Have you ever found yourself faced with two datasets that should fit perfectly, but simply don't? Imagine this scenario: you have customer data from two different systems. One says "Ana Torres García, El Ejido" and the other "Ana Torres, Ejido." To you, it's obvious they're the same person, but to your system, they're completely different entities. This is the kind of problem that can turn a seemingly simple data integration project into a monumental headache... unless you understand the power of fuzzy matching.


What exactly is Fuzzy Matching?


Unlike exact comparisons (where "Almería" and "Almeria" are treated as different), fuzzy matching evaluates the similarity between texts, assigning a percentage that indicates how similar they are. It's as if you were giving your system the ability to say, "These texts are 95% the same; they probably refer to the same thing." In this article, I'll show you how to implement this powerful technique in Microsoft Fabric to unify data that represents the same information but has been recorded with slight variations.
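
To make the idea concrete, here is a minimal, standalone sketch that scores a couple of the mismatches mentioned above (it only assumes the fuzzywuzzy library, which we will install in the Fabric notebook later in this article):

from fuzzywuzzy import fuzz

# An exact comparison fails on a single missing accent...
print("Almería" == "Almeria")  # False

# ...while a fuzzy score tells us how close the two strings really are
print(fuzz.ratio("Almería", "Almeria"))  # high similarity despite the accent

# token_sort_ratio ignores word order and scores whole phrases
print(fuzz.token_sort_ratio("Ana Torres García, El Ejido", "Ana Torres, Ejido"))  # comfortably above 70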


Our case study: Two contact lists that do not match


For this tutorial, we'll be working with two CSV files containing information about the same contacts, but structured differently:



| Name | First Surname | Second Surname | Place of birth |
| --- | --- | --- | --- |
| Maria | Gonzalez | Perez | Almería |
| Juan | Martinez | Lopez | Roquetas de Mar |
| Ana | Torres | Garcia | El Ejido |
| Pedro | Hernandez | Sanchez | Adra |
| Lucia | Fernández | Diaz | Vicar |
| Carlos | Lopez | Ruiz | Níjar |
| Sofia | Perez | Torres | Huércal-Overa |
| Miguel | Alvarez | Gomez | Vera |
| Elena | Sanchez | Moreno | Berja |
| José | Ruiz | Castro | Carboneras |

Table 1. Data from the first CSV file.

The second CSV file, meanwhile, contains the following data:



| Contact name | City |
| --- | --- |
| María González | Almería |
| Juan Martínez | Roquetas |
| Ana Torres | Ejido |
| Pedro Hernández | Adra |
| Lucia Fernandez | Vicar |
| Carlos López | Níjar |
| Sofia Perez | Huércal |
| Miguel Álvarez | Vera |
| Elena Sánchez | Berja |
| José Ruiz | Carboneras |

Table 2. Data from the second CSV file.

As you can see, not only is the structure different, but the information itself varies:

- "Roquetas de Mar" vs "Roquetas"
- "El Ejido" vs "Ejido"
- "Almería" vs "Almeria" (without accent)

Furthermore, there is no unique identifier (such as a client ID) that allows both sources to be reliably linked. This is where fuzzy matching becomes our best ally.

Preparing the environment in Microsoft Fabric


Before we dive into the code, we need to set up our working environment in Microsoft Fabric:


Step 1: Create a Lakehouse


A Lakehouse will allow us to store and process our CSV files efficiently:

1. Access your Fabric workspace

2. Select "Create" → "Lakehouse" in the "Store Data" section

3. Assign a descriptive name and click "Create"


Step 2: Upload the CSV files


Once the Lakehouse is created:

1. Click "Upload Files"

2. Select your CSV files

3. Verify that the files appear in the "Files" folder of your Lakehouse


Step 3: Create a Notebook


To run our analysis:

1. Go back to “Create” and this time select “Notebook” in the “Data Analysis and Training” section

2. Make sure the selected language is “PySpark (Python)”

3. Connect the notebook to your Lakehouse from the “Add Data Sources” option

And we are ready to start writing code!


The Code: Making Magic with Fuzzy Matching


First, you must install any required libraries that are not already included in the kernel running the notebook. In this case, you will add the fuzzywuzzy library, which provides the fuzzy comparison algorithms:


!pip install fuzzywuzzy

Requirement already satisfied: fuzzywuzzy in /home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages (0.18.0)

Next, load the libraries to be used:

- pyspark.sql.functions, for functions that facilitate DataFrame processing (plus IntegerType from pyspark.sql.types, needed later to declare the UDF's return type).
- fuzzywuzzy, to perform the comparisons and obtain similarity percentages.


from pyspark.sql.functions import concat_ws, lower, trim, udf
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz

It's time to read the data, which requires knowing the path to each file. Since the Lakehouse is connected to the notebook and we are using PySpark, we can use the relative path that Fabric provides for each file: navigate to the Files section, click the three-dot icon (···) next to the file, and copy the relative path for Spark.

# Relative Spark paths to the two CSV files uploaded to the Lakehouse
ruta_jugadores1 = 'Files/datos_1.csv'
ruta_jugadores2 = 'Files/datos_2.csv'

# Read both files, using the first row as column headers
df1 = spark.read.option("header", "true").csv(ruta_jugadores1)
df2 = spark.read.option("header", "true").csv(ruta_jugadores2)

The data will then be preprocessed into a uniform format so that the comparison is as efficient as possible:


# Combine name and place into a single string
combined_expr1 = concat_ws(" ", df1["Nombre"], df1["Primer Apellido"], df1["Segundo Apellido"], df1["Lugar de Nacimiento"])
combined_expr2 = concat_ws(" ", df2["Nombre contacto"], df2["Ciudad"])

# Build a combined column in each source DataFrame, converting everything to lowercase and trimming whitespace
df1_combined = df1.withColumn("Combined", lower(trim(combined_expr1)))
df2_combined = df2.withColumn("Combined", lower(trim(combined_expr2)))

At this point, it's necessary to clarify what UDFs are within the PySpark ecosystem. UDFs (User-Defined Functions) are a feature that allows you to define custom functions in Python and apply them to columns in a Spark DataFrame. In this case, they are used to calculate the similarity between strings using the fuzzywuzzy library, since Spark doesn't have a native function for fuzzy matching. This is where our solution really shines. We'll use a UDF to calculate the similarity between the combined strings:

# Define a UDF for combined similarity
fuzzy_combined_udf = udf(lambda x, y: fuzz.token_sort_ratio(x, y), IntegerType())

# Cross join and compute the similarity of each pair
matched_df = df1_combined.crossJoin(df2_combined).withColumn(
    "Similarity", fuzzy_combined_udf(df1_combined["Combined"], df2_combined["Combined"])
)

# Filter by a 70% threshold
resultado = matched_df.filter(matched_df["Similarity"] > 70).orderBy("Similarity", ascending=False)

# Show the results
display(resultado)

What exactly is going on here?

1. We create a function (fuzzy_combined_udf) that uses the token_sort_ratio algorithm to compare two strings.

2. We perform a cross join to compare each row of the first set with each row of the second.

3. We calculate the similarity percentage for each pair.

4. We filter out those pairs that exceed 70% similarity.

5. We sort the results by descending similarity.

The token_sort_ratio algorithm is especially useful because it considers words as "tokens" regardless of their order, which allows it to handle cases like "Juan Martínez López" vs "Martínez Juan". Running the code gives the results in the following table:



| Name | First Surname | Second Surname | Place of birth | Similarity | Combined (source 1) | Contact name | City | Combined (source 2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lucia | Fernandez | Diaz | Vicar | 90 | Lucia Fernandez Diaz Vicar | Lucia Fernandez | Vicar | Lucia Fernández Vicar |
| Miguel | Alvarez | Gomez | Vera | 88 | Miguel Álvarez Gómez Vera | Miguel Álvarez | Vera | Miguel Álvarez Vera |
| Maria | Gonzalez | Perez | Almería | 86 | Maria Gonzalez Perez Almería | María González | Almeria | María González Almeria |
| Carlos | Lopez | Ruiz | Níjar | 86 | Carlos López Ruiz Níjar | Carlos López | Níjar | Carlos Lopez Nijar |
| Pedro | Hernandez | Sanchez | Adra | 84 | Pedro Hernández Sánchez Adra | Pedro Hernández | Adra | Pedro Hernández Adra |
| Elena | Sanchez | Moreno | Berja | 84 | Elena Sánchez Moreno Berja | Elena Sánchez | Berja | Elena Sánchez Berja |
| José | Ruiz | Castro | Carboneras | 84 | José Ruiz Castro Carboneras | José Ruiz | Carboneras | José Ruiz Carboneras |
| Juan | Martinez | Lopez | Roquetas de Mar | 78 | Juan Martinez Lopez, Roquetas de Mar | Juan Martínez | Roquetas | Juan Martínez Roquetas |
| Ana | Torres | Garcia | El Ejido | 78 | Ana Torres García, El Ejido | Ana Torres | Ejido | Ana Torres Ejido |
| Sofia | Perez | Torres | Huércal-Overa | 71 | Sofia Perez Torres, Huércal-Overa | Sofia Perez | Huércal | Sofia Perez Huércal |

Table 3. Comparison results.

Finally, to complete the process described so far, we need to unify the information in a single, actionable location. To do this, we save the result as a Delta table in the Lakehouse:


# Select and rename columns for the unified table
tabla_unificada = resultado.select(
    resultado["Nombre"].alias("nombre"),
    resultado["Primer Apellido"].alias("apellido_1"),
    resultado["Segundo Apellido"].alias("apellido_2"),
    resultado["Lugar de Nacimiento"].alias("lugar_nacimiento"),
    resultado["Similarity"].alias("similaridad")
)

# Save as a Delta table in the Lakehouse
tabla_unificada.write.mode("overwrite").format("delta").saveAsTable("Clientes_Unificados")
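
As an optional sanity check (a quick sketch, assuming the same notebook session as above), you can read the newly created table back and confirm the rows with the highest similarity:

# Read the unified Delta table back from the Lakehouse and show the best matches first
df_check = spark.read.table("Clientes_Unificados")
display(df_check.orderBy("similaridad", ascending=False))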

Beyond the basic example: Practical applications


This approach can be extended to much more complex scenarios:


Customer database deduplication


Do you have a database with thousands of customers that you suspect contains duplicates? Fuzzy matching can help you identify entries like "Javier Rodríguez" and "J. Rodriguez" as potentially the same customer.
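
As a rough illustration of the idea (not part of the tutorial dataset: the sample table and its id and full_name columns are hypothetical), a fuzzy self-comparison to flag candidate duplicates might look like this:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz

# Hypothetical customer table with a numeric id and a free-text name
clientes = spark.createDataFrame(
    [(1, "Javier Rodríguez"), (2, "J. Rodriguez"), (3, "Marta Gil")],
    ["id", "full_name"],
)

similitud_udf = F.udf(lambda a, b: fuzz.token_sort_ratio(a, b), IntegerType())

# Compare every customer against every other one (self cross join),
# keeping only one direction of each pair and skipping self-comparisons
a = clientes.alias("a")
b = clientes.alias("b")
candidatos = (
    a.crossJoin(b)
     .filter(F.col("a.id") < F.col("b.id"))
     .withColumn("similaridad", similitud_udf(F.col("a.full_name"), F.col("b.full_name")))
     .filter(F.col("similaridad") > 70)
)
display(candidatos)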


Integration of product catalogs


This approach also helps when you need to merge product catalogs from different vendors where the product names vary slightly ("iPhone 13 Pro Max 256GB" vs "Apple iPhone 13 Pro Max (256 GB)").


Geographic address normalization


To standardize geographic references such as "Calle Mayor, 5" and "C/ Mayor nº5" or "Barcelona, Spain" and "BCN, ES".


Improvements and advanced considerations


For more sophisticated implementations, consider:


1. Adjusting the similarity threshold: 70% works for our example, but depending on your data, you might need a different value.

2. Using different algorithms: In addition to token_sort_ratio, the fuzzywuzzy library offers other functions such as partial_ratio or token_set_ratio that may work better depending on the type of variations in your data (see the short sketch after this list).

3. Parallelizing the process: For large datasets, take advantage of Spark's ability to distribute computation across multiple nodes.

4. Pre-filtering: To optimize performance, you can implement pre-filtering to reduce the number of comparisons required.
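
To give a feel for how these scorers differ, here is a small standalone sketch (plain Python, no Spark required; the example strings come from our dataset, and the exact scores will depend on your environment):

from fuzzywuzzy import fuzz

a = "Juan Martínez López Roquetas de Mar"
b = "Martínez Juan Roquetas"

# token_sort_ratio: sorts the tokens first, so word order does not matter
print(fuzz.token_sort_ratio(a, b))

# token_set_ratio: also tolerates tokens that only appear on one side,
# so it tends to score higher when one string has extra words
print(fuzz.token_set_ratio(a, b))

# partial_ratio: looks for the best matching substring,
# useful when one string is contained within the other ("Ejido" in "El Ejido")
print(fuzz.partial_ratio("El Ejido", "Ejido"))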


Conclusion: The power of flexibility


In an ideal world, all systems would use universal identifiers and standardized data formats. But in the business world, we constantly face heterogeneity and inconsistencies.


Fuzzy matching in Microsoft Fabric provides us with an elegant tool for navigating this data chaos, allowing us to unify valuable insights that would otherwise remain fragmented. This technique not only saves time on manual data cleansing but also unlocks new possibilities for integrated analysis and decision-making.


The next time you're faced with data sets that should match but don't, remember: sometimes the solution isn't to force accuracy, but to embrace similarity.

