Using AI to Parse Large Scale Real Estate Data

A bleeding-edge LLM document parsing tool that extracts data in seconds.

When it comes to spreadsheets, this baby more than Excels

The Problem

Municipalities across the country maintain large-scale residential real estate record databases. These records are often aggregated daily into a single file containing dozens of columns and hundreds of thousands of rows of variable data. Many companies need to extract property-specific tax and title information from these ever-evolving government data sets to facilitate real estate transactions. Manually reading, formatting, and extracting the pertinent information can cost them thousands of dollars per file. The process is not only time-consuming and error-prone, but also vulnerable to any change in a municipality’s file format.

The Task

Crux Digital was engaged by one such company to develop an automated and scalable solution for ingesting, standardizing, and dynamically parsing specific fields of data from incoming documents, regardless of their file type, data structure and format, or municipal source of origin.

The Solution

Leveraging a cutting-edge large language model (LLM), we built a novel parsing tool.

Technically Speaking

The new LLM parsing tool can ingest documents from a myriad of sources.

The solution dynamically identifies and extracts a predefined list of fields from a structured, but variable and very messy data set.

It uses AI to construct a sustainable Python-based parser tailored to each unique document.
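As a rough illustration of what a per-document parser might look like, the sketch below maps one municipality's column names onto a fixed set of target fields and cleans each value. It is not the production code: the field names, the column names, the `COLUMN_MAP` and `CLEANERS` tables, and the sample row are all hypothetical, and the input is assumed to be CSV.

```python
import csv
import io

# Hypothetical mapping an LLM might propose after inspecting one
# municipality's header row: target field -> source column name.
COLUMN_MAP = {
    "parcel_id": "PARCEL NO",
    "owner_name": "Owner",
    "assessed_value": "Assd Val ($)",
}


def clean_money(raw: str) -> float:
    """Strip currency symbols and thousands separators, then convert."""
    return float(raw.replace("$", "").replace(",", "").strip())


# Hypothetical per-field cleanup rules for this document's quirks.
CLEANERS = {
    "parcel_id": lambda raw: raw.strip(),
    "owner_name": lambda raw: raw.strip().title(),
    "assessed_value": clean_money,
}


def parse_document(text: str) -> list[dict]:
    """Return one standardized record per data row of the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {field: CLEANERS[field](record[column])
         for field, column in COLUMN_MAP.items()}
        for record in reader
    ]


# Made-up sample data: a header row and one messy record.
SAMPLE = 'PARCEL NO,Owner,Assd Val ($)\n 12-345 ,JANE DOE,"$250,000"\n'
records = parse_document(SAMPLE)
```

In a design like this, only the mapping and the cleanup rules change from one municipality to the next, so a file format change means regenerating a small table rather than rewriting the pipeline.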

To ensure accuracy and handle anomalies when parsing variable data, we devised a hybrid AI/human approach. The AI solution autonomously processes the majority of incoming documents with 100% accuracy, and it prompts targeted human intervention when necessary. Because the LLM understands everyday English, “human intervention” means a simple text response sent via Slack rather than hands-on data manipulation or coding. This hybrid solution requires only a fraction of the original time and resources and reduces administrative costs by an order of magnitude.
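One way to picture the escalation step is a check that runs after parsing and, when too many rows look wrong, produces a plain-English question for a reviewer. The sketch below assumes a simple "missing required field" rule and a 1% threshold. Both are illustrative choices, and the function names are hypothetical. Delivering the message (for example to a Slack channel) is left out.

```python
def find_anomalies(rows, required):
    """Return (row_index, missing_fields) for every incomplete row."""
    problems = []
    for index, row in enumerate(rows):
        missing = [f for f in required if row.get(f) in (None, "")]
        if missing:
            problems.append((index, missing))
    return problems


def review_request(source, rows, required, max_bad_share=0.01):
    """Return a question for a human reviewer, or None if none is needed.

    The threshold is an assumed tuning knob, not a documented setting.
    """
    if not rows:
        return f"{source}: no rows were parsed. Has the file layout changed?"
    problems = find_anomalies(rows, required)
    if len(problems) / len(rows) <= max_bad_share:
        return None  # clean enough to process without a human
    index, missing = problems[0]
    return (
        f"{source}: {len(problems)} of {len(rows)} rows are missing "
        f"{', '.join(missing)} (first at row {index}). "
        "Which column holds this value?"
    )


# Made-up data: the second row has an empty owner name.
rows = [
    {"parcel_id": "1", "owner_name": "A"},
    {"parcel_id": "2", "owner_name": ""},
]
message = review_request("county_a.csv", rows, ["parcel_id", "owner_name"])
```

The reviewer's text reply would then be fed back to the LLM as additional context when it regenerates the parser for that document.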

The Outcome

Using modern LLM architectures, we’ve created a novel way to quickly comprehend and dissect complex documents with very large data sets.

Rather than invoking the LLM on each individual data row, our tool uses it to construct a Python-based parser tailored to each document as a whole. This avoids costly and lengthy inference for every row while still capturing the structure of intricate documents.
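The difference between the two strategies can be made concrete by counting model calls. The sketch below uses a stand-in `CountingLLM` class with a canned reply instead of a real model client. The class, the prompts, and the reply format are all hypothetical and exist only to show the call counts.

```python
class CountingLLM:
    """Stand-in for a real model client. It only counts its calls."""

    def __init__(self):
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        return "owner_name=1"  # canned reply: field name and column index


def extract_per_row(llm, rows):
    """One inference per data row: the cost grows with the file size."""
    owners = []
    for row in rows:
        llm.complete(f"Extract the owner from: {row}")
        owners.append(row[1])  # pretend the model returned this value
    return owners


def extract_per_document(llm, rows):
    """One inference per document, then ordinary Python for every row."""
    reply = llm.complete(f"Which column holds the owner? Sample: {rows[0]}")
    index = int(reply.split("=")[1])
    return [row[index] for row in rows]


# Made-up data: 1,000 rows of [parcel, owner].
rows = [[f"parcel{i}", f"owner{i}"] for i in range(1000)]
row_llm, doc_llm = CountingLLM(), CountingLLM()
same_result = extract_per_row(row_llm, rows) == extract_per_document(doc_llm, rows)
```

Here `row_llm.calls` ends at 1000 while `doc_llm.calls` ends at 1. For a file with hundreds of thousands of rows, the per-row count scales accordingly, while the per-document count stays fixed apart from any regeneration after reviewer feedback.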

The tool eliminates costly, error-prone, and time-consuming manual parsing while ensuring detailed, efficient comprehension of complex documents.

Our application of the LLM is part of a greater effort to advance AI’s depth of understanding of contextual text.