Fuzzy Matching in Microsoft Fabric: The elegant solution when your data "almost" matches
- Gabriel Peralta
- Apr 1
- 6 min read
Updated: Apr 2
The disparate data dilemma: When "equal" doesn't mean "identical"
Have you ever found yourself faced with two datasets that should fit perfectly, but simply don't? Imagine this scenario: you have customer data from two different systems. One says "Ana Torres García, El Ejido" and the other "Ana Torres, Ejido." To you, it's obvious they're the same person, but to your system, they're completely different entities. This is the kind of problem that can turn a seemingly simple data integration project into a monumental headache... unless you understand the power of fuzzy matching.
What exactly is Fuzzy Matching?
Unlike exact comparisons (where "Almería" and "Almeria" are treated as different), fuzzy matching evaluates the similarity between texts, assigning a percentage that indicates how similar they are. It's as if you were giving your system the ability to say, "These texts are 95% the same; they probably refer to the same thing." In this article, I'll show you how to implement this powerful technique in Microsoft Fabric to unify data that represents the same information but has been recorded with slight variations.
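To make the idea concrete, here is a quick illustration using Python's standard-library difflib, whose SequenceMatcher is the same backend fuzzywuzzy falls back to when python-Levenshtein is not installed. The city names are just examples, and the exact scores may vary slightly between backends:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score between two strings."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

print(similarity("Almería", "Almeria"))   # high score: only the accent differs
print(similarity("Almería", "Granada"))   # low score: different city names
```

An exact comparison would simply return "not equal" in both cases; the similarity score is what lets us rank near-matches.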
Our case study: Two contact lists that do not match
For this tutorial, we'll be working with two CSV files containing information about the same contacts, but structured differently:
| Name | First Surname | Second Surname | Place of Birth |
|---|---|---|---|
| Maria | Gonzalez | Perez | Almería |
| Juan | Martinez | Lopez | Roquetas de Mar |
| Ana | Torres | Garcia | El Ejido |
| Pedro | Hernandez | Sanchez | Adra |
| Lucia | Fernández | Diaz | Vicar |
| Carlos | Lopez | Ruiz | Níjar |
| Sofia | Perez | Torres | Huércal-Overa |
| Miguel | Alvarez | Gomez | Vera |
| Elena | Sanchez | Moreno | Berja |
| José | Ruiz | Castro | Carboneras |

Table 1. Data from the first CSV file.
In the second source, the CSV data looks like this:
| Contact name | City |
|---|---|
| María González | Almería |
| Juan Martínez | Roquetas |
| Ana Torres | Ejido |
| Pedro Hernández | Adra |
| Lucia Fernandez | Vicar |
| Carlos López | Níjar |
| Sofia Perez | Huércal |
| Miguel Álvarez | Vera |
| Elena Sánchez | Berja |
| José Ruiz | Carboneras |

Table 2. Data from the second CSV file.
As you can see, not only is the structure different, but the information itself varies:
- "Roquetas de Mar" vs "Roquetas"
- "El Ejido" vs "Ejido"
- "Almería" vs "Almeria" (without the accent)

Furthermore, there is no unique identifier (such as a client ID) that allows both sources to be reliably linked. This is where fuzzy matching becomes our best ally.
Preparing the environment in Microsoft Fabric
Before we dive into the code, we need to set up our working environment in Microsoft Fabric:
Step 1: Create a Lakehouse
A Lakehouse will allow us to store and process our CSV files efficiently:
1. Access your Fabric workspace
2. Select "Create" → "Lakehouse" in the "Store Data" section
3. Assign a descriptive name and click "Create"
Step 2: Upload the CSV files
Once the Lakehouse is created:
1. Click "Upload Files"
2. Select your CSV files
3. Verify that the files appear in the "Files" folder of your Lakehouse
Step 3: Create a Notebook
To run our analysis:
1. Go back to “Create” and this time select “Notebook” in the “Data Analysis and Training” section
2. Make sure the selected language is “PySpark (Python)”
3. Connect the notebook to your Lakehouse from the “Add Data Sources” option

And we are ready to start writing code!
The Code: Making Magic with Fuzzy Matching
First, you must install any required libraries that are not already included in the kernel running the notebook. In this case, you will add the fuzzywuzzy library, which provides the fuzzy comparison algorithms:
!pip install fuzzywuzzy
Requirement already satisfied: fuzzywuzzy in /home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages (0.18.0)
Next, load the libraries to be used:
- pyspark.sql.functions, which provides helper functions for DataFrame processing.
- pyspark.sql.types, which provides the IntegerType needed to declare the UDF's return type later on.
- fuzzywuzzy, which performs the string comparisons and returns similarity percentages.
from pyspark.sql.functions import concat_ws, lower, trim, udf
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz
It's time to read the data, which requires knowing the path to each file. Since the Lakehouse is connected to the notebook and we are using PySpark, you can take advantage of the relative path that Fabric provides for each file. To obtain it, navigate to the Files section, click the three-dot icon (···) next to the file, and copy the relative path for Spark.
ruta_jugadores1 = 'Files/datos_1.csv'
ruta_jugadores2 = 'Files/datos_2.csv'
df1 = spark.read.option("header", "true").csv(ruta_jugadores1)
df2 = spark.read.option("header", "true").csv(ruta_jugadores2)
The data is then preprocessed into a uniform format (lowercase, with whitespace trimmed) so that the comparison is as reliable as possible:
# Combine name and place into a single string
combined_expr1 = concat_ws(" ", df1["Nombre"], df1["Primer Apellido"], df1["Segundo Apellido"], df1["Lugar de Nacimiento"])
combined_expr2 = concat_ws(" ", df2["Nombre contacto"], df2["Ciudad"])
# Build the combined DataFrame for each data source, converting everything to lowercase and trimming whitespace
df1_combined = df1.withColumn("Combined", lower(trim(combined_expr1)))
df2_combined = df2.withColumn("Combined", lower(trim(combined_expr2)))
At this point, it's necessary to clarify what UDFs are within the PySpark ecosystem. UDFs (User-Defined Functions) are a feature that allows you to define custom functions in Python and apply them to columns in a Spark DataFrame. In this case, they are used to calculate the similarity between strings using the fuzzywuzzy library, since Spark doesn't have a native function for fuzzy matching. This is where our solution really shines. We'll use a UDF to calculate the similarity between the combined strings:
# Define a UDF for the combined similarity score
fuzzy_combined_udf = udf(lambda x, y: fuzz.token_sort_ratio(x, y), IntegerType())
# Cross join the two sources and compute the similarity of each pair
matched_df = df1_combined.crossJoin(df2_combined).withColumn(
"Similarity", fuzzy_combined_udf(df1_combined["Combined"], df2_combined["Combined"])
)
# Filter by a 70% threshold
resultado = matched_df.filter(matched_df["Similarity"] > 70).orderBy("Similarity", ascending=False)
# Show the results
display(resultado)
What exactly is going on here?
1. We create a UDF (fuzzy_combined_udf) that uses the token_sort_ratio algorithm to compare two strings.
2. We perform a cross join to compare each row of the first set with each row of the second.
3. We calculate the similarity percentage for each pair.
4. We keep only those pairs that exceed 70% similarity.
5. We sort the results by descending similarity.

The token_sort_ratio algorithm is especially useful because it treats words as "tokens" regardless of their order, which allows it to handle cases like "Juan Martínez López" vs "Martínez Juan". Running the code gives the results in the following table:
| Name | First Surname | Second Surname | Place of Birth | Similarity | Combined (source 1) | Contact name | City | Combined (source 2) |
|---|---|---|---|---|---|---|---|---|
| Lucia | Fernandez | Diaz | Vicar | 90 | Lucia Fernandez Diaz Vicar | Lucia Fernandez | Vicar | Lucia Fernández Vicar |
| Miguel | Alvarez | Gomez | Vera | 88 | Miguel Álvarez Gómez Vera | Miguel Álvarez | Vera | Miguel Álvarez Vera |
| Maria | Gonzalez | Perez | Almería | 86 | Maria Gonzalez Perez Almería | María González | Almeria | María González Almeria |
| Carlos | Lopez | Ruiz | Níjar | 86 | Carlos López Ruiz Níjar | Carlos López | Níjar | Carlos Lopez Nijar |
| Pedro | Hernandez | Sanchez | Adra | 84 | Pedro Hernández Sánchez Adra | Pedro Hernández | Adra | Pedro Hernández Adra |
| Elena | Sanchez | Moreno | Berja | 84 | Elena Sánchez Moreno Berja | Elena Sánchez | Berja | Elena Sánchez Berja |
| José | Ruiz | Castro | Carboneras | 84 | José Ruiz Castro Carboneras | José Ruiz | Carboneras | José Ruiz Carboneras |
| Juan | Martinez | Lopez | Roquetas de Mar | 78 | Juan Martinez Lopez Roquetas de Mar | Juan Martínez | Roquetas | Juan Martínez Roquetas |
| Ana | Torres | Garcia | El Ejido | 78 | Ana Torres García El Ejido | Ana Torres | Ejido | Ana Torres Ejido |
| Sofia | Perez | Torres | Huércal-Overa | 71 | Sofia Perez Torres Huércal-Overa | Sofia Perez | Huércal | Sofia Perez Huércal |

Table 3. Comparison results.
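The order-insensitivity of token_sort_ratio described above can be sketched in a few lines. This simplified version uses the standard library's SequenceMatcher instead of fuzzywuzzy's optimized implementation, so the exact scores may differ slightly, but the effect is the same:

```python
from difflib import SequenceMatcher

def simple_ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simple_token_sort_ratio(a: str, b: str) -> int:
    # Sort the words alphabetically before comparing, so word order is ignored
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return simple_ratio(sa, sb)

plain = simple_ratio("juan martínez lópez", "martínez juan")
sorted_score = simple_token_sort_ratio("Juan Martínez López", "Martínez Juan")
# sorted_score is noticeably higher than plain: sorting the tokens
# rewards matching words regardless of their position in the string
```

This is why "Juan Martínez López / Roquetas de Mar" still lands at 78% against "Juan Martínez / Roquetas" even though the fields were combined in different orders and quantities.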
Finally, to complete the process, we need to unify the information in a single, actionable location. To do this, we create a table in the Lakehouse:
# Select and rename columns for the unified table
tabla_unificada = resultado.select(
resultado["Nombre"].alias("nombre"),
resultado["Primer Apellido"].alias("apellido_1"),
resultado["Segundo Apellido"].alias("apellido_2"),
resultado["Lugar de Nacimiento"].alias("lugar_nacimiento"),
resultado["Similarity"].alias("similaridad")
)
# Save as a Delta table in the Lakehouse
tabla_unificada.write.mode("overwrite").format("delta").saveAsTable("Clientes_Unificados")
Beyond the basic example: Practical applications
This approach can be extended to much more complex scenarios:
Customer database deduplication
Do you have a database with thousands of customers that you suspect contains duplicates? Fuzzy matching can help you identify entries like "Javier Rodríguez" and "J. Rodriguez" as potentially the same customer.
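As a minimal, pure-Python sketch of that idea (the customer names and the 60% threshold are illustrative, and a real pipeline would run this as a distributed Spark job, as shown earlier):

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> int:
    """0-100 similarity score; a real pipeline would also strip accents/punctuation."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

customers = ["Javier Rodríguez", "J. Rodriguez", "Marta Núñez"]

# Compare every pair once and keep candidate duplicates above the threshold
suspected_duplicates = [
    (a, b, similarity(a, b))
    for a, b in combinations(customers, 2)
    if similarity(a, b) > 60
]
# Only the Rodríguez pair survives the filter; the candidates would then
# go to a human review or an automated merge step
```

Note that the threshold matters: "J. Rodriguez" loses almost half of the full name, so an overly strict cutoff would miss it entirely.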
Integration of product catalogs
When you need to merge product catalogs from different vendors where the product names vary slightly ("iPhone 13 Pro Max 256GB" vs "Apple iPhone 13 Pro Max (256 GB)").
Geographic address normalization
To standardize geographic references such as "Calle Mayor, 5" and "C/ Mayor nº5" or "Barcelona, Spain" and "BCN, ES".
Improvements and advanced considerations
For more sophisticated implementations, consider:
1. Adjusting the similarity threshold: 70% works for our example, but depending on your data, you might need a different value.
2. Using different algorithms: In addition to token_sort_ratio, the fuzzywuzzy library offers other functions such as partial_ratio or token_set_ratio that may work better depending on the type of variations in your data.
3. Parallelizing the process: For large datasets, take advantage of Spark's ability to distribute computation across multiple nodes.
4. Pre-filtering: To optimize performance, you can implement pre-filtering to reduce the number of comparisons required.
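As an example of point 2, token_set_ratio first separates the tokens the two strings share from those unique to each side, then scores the best of those comparisons. A simplified pure-Python sketch (fuzzywuzzy's real implementation adds extra string preprocessing, so scores may differ slightly):

```python
from difflib import SequenceMatcher

def simple_ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simple_token_set_ratio(a: str, b: str) -> int:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = " ".join(sorted(ta & tb))
    s1 = (inter + " " + " ".join(sorted(ta - tb))).strip()
    s2 = (inter + " " + " ".join(sorted(tb - ta))).strip()
    # Best score among intersection-vs-full comparisons: very forgiving
    # when one string is essentially a subset of the other
    return max(simple_ratio(inter, s1), simple_ratio(inter, s2), simple_ratio(s1, s2))

score = simple_token_set_ratio("Juan Martínez López", "Juan Martínez")
# 100: the shorter string's tokens are all contained in the longer one
```

This subset-friendliness is exactly what you want for cases like a full name against an abbreviated one, but it can also over-match, which is why trying several scorers against a labeled sample of your data is worth the effort.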
Conclusion: The power of flexibility
In an ideal world, all systems would use universal identifiers and standardized data formats. But in the business world, we constantly face heterogeneity and inconsistencies.
Fuzzy matching in Microsoft Fabric provides us with an elegant tool for navigating this data chaos, allowing us to unify valuable insights that would otherwise remain fragmented. This technique not only saves time on manual data cleansing but also unlocks new possibilities for integrated analysis and decision-making.
The next time you're faced with data sets that should match but don't, remember: sometimes the solution isn't to force accuracy, but to embrace similarity.