Large Language Models (LLMs) demonstrate remarkable capabilities in NLP tasks but face significant challenges when reasoning over structured factual knowledge. Structured data introduces unique characteristics that impact LLM performance:
- Heterogeneity - Mixed data types (text, numbers, dates)
- Topological Interdependencies - Complex structural relationships
- Order Invariance - Permutation-invariant semantics
- Sparsity - Handling missing values
- Lack of Prior Knowledge - Domain-specific context sensitivity
To address these challenges, we present StructFact - a comprehensive benchmark with:
- 13,407 factual queries across diverse structures (tables/lists/graphs)
- Multi-domain coverage with temporal/regional variations
- 5 reasoning tasks: Arithmetic Calculation, Geography-Time Reasoning, Multi-hop Reasoning, Composition Understanding, and Combining Structural and Unstructured Reasoning
- StructFact-Unseen subset for testing generalization on fresh knowledge
├── data/
│   └── dataset_demo.json    # Sample dataset entries
├── src/
│   ├── cal_option.py        # Metric calculation script
│   └── run_llm.py           # Model inference script
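Before running inference, you can inspect the bundled demo entries to get a feel for the data. The snippet below is a minimal sketch that assumes `data/dataset_demo.json` is a JSON list of objects; the field name `task` is an illustrative assumption, not a guaranteed key, so adjust it to match the actual schema you see in the file.

```python
import json
from collections import Counter

# Load the bundled sample entries (assumes a JSON list of objects).
with open("data/dataset_demo.json", "r", encoding="utf-8") as f:
    entries = json.load(f)

print(f"Loaded {len(entries)} demo entries")

# Peek at one entry to see the real field names.
print(json.dumps(entries[0], indent=2, ensure_ascii=False))

# NOTE: "task" is an assumed key for illustration; rename it to whatever
# the demo file actually uses for the reasoning-task label.
task_counts = Counter(e.get("task", "unknown") for e in entries)
print("Entries per reasoning task:", dict(task_counts))
```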
- Run Inference

  Configure your LLM in `run_llm.sh`, then execute:

      chmod +x run_llm.sh
      ./run_llm.sh

- Calculate Metrics

  Generate accuracy and task-specific metrics:

      python src/cal_option.py /path/to/your_llm_output
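For a quick sanity check without the full script, the sketch below shows the general idea behind option-based scoring: compare each predicted option against the gold answer and aggregate accuracy per task. The output layout and the field names (`task`, `answer`, `prediction`) are assumptions for illustration only; `src/cal_option.py` remains the authoritative implementation.

```python
import json
import sys
from collections import defaultdict

# Minimal accuracy sketch. Assumes the LLM output file is a JSON list of
# records, each with hypothetical fields "task", "answer" (gold option) and
# "prediction" (model's chosen option); adapt to the real output format.
def accuracy_by_task(output_path: str) -> dict:
    with open(output_path, "r", encoding="utf-8") as f:
        records = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        task = rec.get("task", "overall")
        total[task] += 1
        pred = str(rec.get("prediction", "")).strip().upper()
        gold = str(rec.get("answer", "")).strip().upper()
        if pred and pred == gold:
            correct[task] += 1

    return {task: correct[task] / total[task] for task in total}

if __name__ == "__main__":
    for task, acc in accuracy_by_task(sys.argv[1]).items():
        print(f"{task}: {acc:.3f}")
```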