Record Linking across millions of rows — Large Language Models to the rescue!

Sarang Shrivastava
5 min read · May 12, 2023


Image created by the author

In today’s data-driven world, businesses often work with multiple data vendors. Each vendor provides information about real-world entities, like products or people. While each vendor’s data is useful on its own, it becomes far more valuable when we can see it all in one place: it makes things like making deals, trading, or bringing on new clients smoother. However, merging data from different vendors can be tricky.

Challenges of Merging Vendor Data

Imagine you ordered several puzzles, but they all arrived in one box without any pictures to guide you. This is what data from different vendors can look like. Some pieces might be missing (missing values), some might be mislabeled (different naming conventions), and some might even be from a different puzzle altogether (misspelt entities). To add to the confusion, there’s no universal ID number (unique identifier) that can tell us which pieces belong to which puzzle. Ideally, these identifiers would serve as foreign keys, enabling seamless integration of data from different sources. So, how can we solve this puzzle?

Turning Data into a Language that Machines Understand

The first step is to turn the data, which lives in tables, into a form that our machine-learning algorithm can understand. This is a bit like translating from one language to another. For example, let’s take a row of data and serialize it:

S1(Row11) = | Col | Title | Val | State of Fire | Col | Category | Val | Science | Col | Price | Val | 500Rs |

In another form, it might look like this:

S2(Row11) = The title of the book is State of Fire. It belongs to the Science category. Its price is 500Rs.
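
To make the two schemes concrete, here is a minimal Python sketch of how a row could be serialized into the S1 and S2 forms above; the helper names are my own.

```python
# A minimal sketch of the two serialization schemes shown above.
# Field names mirror the example row; the function names are illustrative.

row = {"Title": "State of Fire", "Category": "Science", "Price": "500Rs"}

def serialize_s1(row: dict) -> str:
    """S1: tag each column/value pair explicitly."""
    return " ".join(f"| Col | {col} | Val | {val}" for col, val in row.items()) + " |"

def serialize_s2(row: dict) -> str:
    """S2: render the row as a short natural-language description."""
    return (
        f"The title of the book is {row['Title']}. "
        f"It belongs to the {row['Category']} category. "
        f"Its price is {row['Price']}."
    )

print(serialize_s1(row))
# | Col | Title | Val | State of Fire | Col | Category | Val | Science | Col | Price | Val | 500Rs |
print(serialize_s2(row))
# The title of the book is State of Fire. It belongs to the Science category. Its price is 500Rs.
```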

Matching Pieces from Different Puzzles

Image created by the author

Once we’ve translated our data, we need to find potential matches: pairs of records that might be talking about the same thing, like the same book or person. We then ask our machine-learning algorithm to decide whether they’re a match. But with millions of records, we can’t compare each one with every other. That’s where ‘Blocking’ comes in.

‘Blocking’ is a way of picking a few likely matches (a subset of records) for each piece of data (record) from another vendor. To do this, we turn our translated data into embeddings using a sentence-embedder model. Then, we use approximate nearest-neighbour (ANN) search libraries like ScaNN or Faiss to find the closest matches. You can check out a comparison of many ANN libraries here.
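
As a rough sketch, the blocking step could look like this with the sentence-transformers and Faiss libraries; the embedding model, the exact (non-approximate) Faiss index, and the value of k are illustrative choices, not fixed requirements.

```python
# A minimal sketch of blocking with sentence embeddings + Faiss.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

vendor_a = [
    "The title of the book is State of Fire. It belongs to the Science category. Its price is 500Rs.",
]
vendor_b = [
    "The title of the book is State Of Fire. Category: Science. Price: 500.",
    "The title of the book is A Brief History of Time. It belongs to the Science category.",
]

# Embed and L2-normalise so inner product equals cosine similarity.
emb_a = embedder.encode(vendor_a, normalize_embeddings=True)
emb_b = embedder.encode(vendor_b, normalize_embeddings=True)

index = faiss.IndexFlatIP(emb_b.shape[1])  # exact search; swap in an ANN index (IVF/HNSW) at scale
index.add(np.asarray(emb_b, dtype="float32"))

# For each vendor-A record, keep only the k closest vendor-B records as candidate pairs.
k = 2
scores, neighbours = index.search(np.asarray(emb_a, dtype="float32"), k)
candidate_pairs = [(i, j) for i, row in enumerate(neighbours) for j in row]
print(candidate_pairs)
```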

Using AI to Link Records

Image created by the author

Once we have our potential matches, we need to decide whether each pair is a true match. For this, we can use a special kind of decision-making algorithm (a binary classifier). But can we do this without training it specifically for this task? Yes, we can, thanks to Large Language Models (LLMs).

Prompt: “Given below are descriptions of two books. Do they describe the same book? Description 1: S2(Row11) Description 2: S2(Row23) Ans: “

LLMs, like GPT-3, T0pp, Flan, etc., are like super-smart assistants that have learned from billions of pieces of text. By asking them the right questions (prompts), we can get pretty good decisions from them.
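
Here is a hedged sketch of that zero-shot prompt in code, using Flan-T5 through the Hugging Face pipeline; any instruction-tuned LLM (or an API-hosted one like GPT-3) could stand in, and the model name and decoding settings are just examples.

```python
# A minimal sketch of zero-shot match classification with Flan-T5.
from transformers import pipeline

llm = pipeline("text2text-generation", model="google/flan-t5-base")

desc_a = "The title of the book is State of Fire. It belongs to the Science category. Its price is 500Rs."
desc_b = "The title of the book is State Of Fire. Category: Science. Price: 500."

prompt = (
    "Given below are descriptions of two books. Do they describe the same book? "
    f"Description 1: {desc_a} Description 2: {desc_b} Ans: "
)

answer = llm(prompt, max_new_tokens=5)[0]["generated_text"]
print(answer)  # expected to be something like "yes" or "no"
```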

When and How Much Data to Use for Training?

Image created by the author

When implementing machine learning solutions, the questions of how much data is necessary for training and when to stop accumulating additional data are crucial.

LLMs like GPT-3 and GPT-4 have been trained on vast amounts of text, which makes them incredibly powerful at understanding and generating language. Unlike many traditional machine learning models, these LLMs don’t need a significant volume of additional task-specific data for fine-tuning. Instead, they operate in a “few-shot” or “zero-shot” learning manner. This means that they can understand the task at hand and generate appropriate responses based on the context provided by carefully crafted prompts, even without seeing many (or any) similar examples before. The models learn to perform the task from these prompts, making them versatile tools that can adapt to a variety of tasks with minimal data requirements.
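
For illustration, a few-shot variant of the matching prompt might look like the following; the labelled examples and the yes/no format are assumptions about how one could phrase it, not a prescribed template.

```python
# An illustrative few-shot prompt: a couple of labelled pairs show the model
# the pattern before it sees the pair we actually want it to classify.
few_shot_prompt = """Do the two descriptions refer to the same book? Answer yes or no.

Description 1: The title of the book is State of Fire. Its price is 500Rs.
Description 2: State Of Fire, Science, 500.
Answer: yes

Description 1: The title of the book is State of Fire. Its price is 500Rs.
Description 2: A Brief History of Time, Science, 350.
Answer: no

Description 1: {desc_a}
Description 2: {desc_b}
Answer: """
```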

Image created by the author

Another option, which lets us skip prompt construction altogether, is SetFit, a framework introduced in late 2022 that operates differently. Unlike GPT-3 or GPT-4, SetFit does few-shot learning without prompts. Importantly, SetFit allows for targeted fine-tuning of different components of the model: for instance, we can freeze the weights (parameters) of the sentence embedder and train only the classifier, or vice versa. This level of control allows us to optimize SetFit’s performance even with a small amount of data.
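
A sketch of what this could look like with the setfit library is shown below; the training pairs are invented, and the freeze/unfreeze calls follow the library’s documented two-stage recipe at the time of writing, so they may differ slightly between versions.

```python
# A sketch of SetFit on a handful of labelled candidate pairs.
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Each example is a serialized candidate pair; label 1 = same entity, 0 = different.
train_ds = Dataset.from_dict({
    "text": [
        "Description 1: State of Fire, Science, 500Rs. Description 2: State Of Fire, Science, 500.",
        "Description 1: State of Fire, Science, 500Rs. Description 2: A Brief History of Time, Science, 350.",
    ],
    "label": [1, 0],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    use_differentiable_head=True,
    head_params={"out_features": 2},
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=8,
    num_iterations=10,  # contrastive pairs generated per example
)

# Stage 1: fine-tune the sentence embedder (body) while the head stays frozen.
trainer.freeze()
trainer.train()

# Stage 2: freeze the embedder and train only the classification head.
trainer.unfreeze(keep_body_frozen=True)
trainer.train(num_epochs=10, learning_rate=1e-2)

preds = model(["Description 1: State of Fire, Science, 500Rs. Description 2: State Of Fire."])
print(preds)
```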

In summary, the approach is to start with models that deliver good results even with limited data. As we collect more data and receive feedback from the initial rollout, we can adjust our approach to enhance the performance of these models. This strategy allows us to start small and scale up as more data and resources become available, thereby adapting to the unique requirements of our task.

Looking Ahead

To improve results further, we can train multiple models using different techniques and combine their results. This creates a more robust and resilient system, capable of handling the complex task of linking records across different databases.
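
As a toy illustration, combining the decisions of several matchers by majority vote might look like this; the three prediction lists are placeholders for, say, a zero-shot LLM, a SetFit classifier, and a similarity-threshold rule.

```python
# A tiny sketch of combining several matchers by majority vote.
from collections import Counter

llm_preds    = [1, 0, 1, 1]
setfit_preds = [1, 0, 0, 1]
cosine_preds = [1, 1, 0, 1]

def majority_vote(*prediction_lists):
    """For each candidate pair, keep the label most of the models agree on."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*prediction_lists)]

print(majority_vote(llm_preds, setfit_preds, cosine_preds))  # [1, 0, 0, 1]
```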

In conclusion, linking records using Large Language Models offers a promising way to handle the challenges of vendor data integration. With the right approach and tools, it’s possible to link records across databases, providing a unified view that can significantly enhance business operations.
