Leading large language models (LLMs) are trained on public data. However, the majority of the world’s data is dark data: not publicly accessible, existing mainly in the form of private organizational or enterprise data. We show that the performance of LLM-based methods seriously degrades when tested on real-world enterprise datasets. Current benchmarks, based on public data, overestimate the performance of LLMs. We release a benchmark of enterprise data, the Goby benchmark, to the scientific community to advance discovery in the area of enterprise data management. Based on our experience with this enterprise benchmark, we propose techniques to uplift the performance of LLMs on this more challenging data distribution: (1) hierarchical annotation, (2) runtime class-learning, and (3) ontology synthesis. We show that once these techniques are deployed, performance on enterprise data is on par with performance on public data.
The GOBY Benchmark Dataset is designed to aid in evaluating data integration methods on structured enterprise data. This dataset includes categories, entities, and results derived from various data sources, represented in a unified schema. Key components include:

The primary data archive, `goby.tar.gz`, contains the following key directories:

- `dump/`: PostgreSQL dump files that include:
  - `doit_categories`: Data categories with record counts.
  - `doit_data`: Triple-based data representing (category_id, source_id, entity_id, name, value) (see the query sketch below).
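As a rough illustration of the triple layout, the sketch below runs two ad-hoc queries with `psql` once the dump has been restored; the target database name `goby` is an assumption made for this example.

```sh
# Count triples per category in doit_data, whose rows are
# (category_id, source_id, entity_id, name, value) triples.
psql goby -c "SELECT category_id, COUNT(*) AS n_triples
              FROM doit_data
              GROUP BY category_id
              ORDER BY n_triples DESC;"

# Peek at a handful of raw triples to see how entity attributes are stored.
psql goby -c "SELECT entity_id, name, value FROM doit_data LIMIT 20;"
```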
To access the GOBY dataset:

1. Download the `goby.zip` file from the repository (link forthcoming).
2. Extract it with the archive password: `unzip -P your_password goby.zip -d /path/to/extract/` (see the sketch below).
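A minimal end-to-end access sketch, assuming the password-protected `goby.zip` wraps the `goby.tar.gz` archive described above (all paths and the password placeholder are illustrative):

```sh
# Unzip the password-protected download.
unzip -P your_password goby.zip -d /path/to/extract/

# Unpack the primary archive (assumed to sit at the top of the zip)
# to expose the dump/ directory.
tar -xzf /path/to/extract/goby.tar.gz -C /path/to/extract/

# Confirm the PostgreSQL dump files are present.
ls /path/to/extract/dump/
```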
Once downloaded, use the following steps to explore and integrate the dataset:
1. Restore the PostgreSQL dump files found in the `dump/` directory (a restore sketch follows).
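A minimal restore sketch, assuming a local PostgreSQL server, a target database named `goby`, and plain-SQL dump files; the file names below are placeholders, and custom-format dumps would use `pg_restore` instead:

```sh
# Create an empty database for the benchmark tables.
createdb goby

# Load a plain-SQL dump file (file name is a placeholder).
psql goby -f /path/to/extract/dump/goby_dump.sql

# For custom-format dumps, use pg_restore instead:
# pg_restore -d goby /path/to/extract/dump/goby.dump
```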
The benchmark mentioned in this abstract can be downloaded with the download button below using the password: