Populating CSV Files from Unstructured Text with LLMs for KG Generation with RML

We report on an exploratory study using Large Language Models (LLMs) to generate Comma-Separated Values (CSV) files, which are subsequently transformed into Resource Description Framework (RDF) using the RDF Mapping Language (RML). Prior studies have shown that LLMs sometimes fail to generate valid and well-formed RDF from unstructured text, i.e., the problems lie in the RDF serialization rather than in the extracted content. We wanted to test whether generating CSV instead leads to fewer issues, and whether this would be a viable way to let domain experts take an active part in the Knowledge Graph (KG) population process using tools they already know. We have built a prototype illustrating this idea, and the results seem promising for further study. The initial prototype uses zero-shot prompting and is built on GPT-4. It takes the unstructured text and the CSV file's structure as input and uses the latter to generate prompts that fill in the cells' values. Future work includes analyzing the effect of different prompting strategies. A limitation is that the approach only works for projects in which domain experts work with spreadsheets that are covered by pre-existing mappings.
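To make the idea of generating prompts from the CSV structure concrete, the following is a minimal sketch and not the authors' prototype. It assumes one prompt per column and a single output row; the function names (build_cell_prompt, fill_row, to_csv), the prompt wording, the column names, and the sample text are all hypothetical, and fake_llm is a stand-in for a real model call such as GPT-4.

```python
import csv
import io
from typing import Callable, Dict, List


def build_cell_prompt(text: str, column: str) -> str:
    """Build a zero-shot prompt asking for the value of one CSV column.

    The wording is an assumption; the abstract does not give the actual prompts.
    """
    return (
        "Read the text below and give the value for the column "
        f"'{column}'. Answer with the value only, or give an empty "
        "answer if the text does not state it.\n\n"
        f"Text:\n{text}"
    )


def fill_row(
    text: str, columns: List[str], ask_llm: Callable[[str], str]
) -> Dict[str, str]:
    """Ask the model once per column and collect the answers as one row."""
    return {
        column: ask_llm(build_cell_prompt(text, column)).strip()
        for column in columns
    }


def to_csv(columns: List[str], rows: List[Dict[str, str]]) -> str:
    """Serialize the rows with the csv module so quoting is handled for us."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call, with canned answers."""
    if "'compound'" in prompt:
        return "caffeine"
    if "'dose_mg'" in prompt:
        return " 200 "
    return ""


# Invented example data, for illustration only.
columns = ["compound", "dose_mg", "species"]
text = "Participants received 200 mg of caffeine."
row = fill_row(text, columns, fake_llm)
csv_output = to_csv(columns, [row])
print(csv_output)
```

The resulting CSV keeps the column layout that an existing RML mapping expects, so the mapping step itself is unchanged and domain experts can inspect or correct the cells in a spreadsheet before the RDF is generated.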
Publication Reference
Maushagen, J., Sepehri, S., Sanctorum, A., Vanhaecke, T., De Troyer, O., & Debruyne, C. (2024, September). Populating CSV Files from Unstructured Text with LLMs for KG Generation with RML. In 20th International Conference on Semantic Systems (SEMANTiCS 2024). CEUR-WS.
Publication Awards
Best Poster Award
