Skip to content

VectorSpaceLab/InfoSeek

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Open Data Synthesis For Deep Research

Build Build Build Build

🔎 Roadmap

InfoSeek is currently under active development, with resources and prototypes continuously being published at this repository.

  • Dataset
  • Data Construction Codes
  • SFT Training Code
  • Technical Report
  • RL Training Code
  • InfoSeeker Model

🔆 Overview

We propose InfoSeek, a scalable data synthesis framework for constructing structurally complex Deep Research tasks. InfoSeek designs a dual-agent system to recursively build a Research Tree by mining entities and relations from large-scale text, and blurring itermediate vertices to ensure they form valid sub-problems. The agent then transform these trees into natural language questions whose solutions require traversing the entire hierarchy. Using InfoSeek pipeline, we construct a high-quality, complexity-controllable, and intrinsically verifiable dataset.

📋 InfoSeek Data

We released InfoSeek dataset on 🤗

Example 1:

Question: What is a species of bird that was named by a person employed under his father between 1818 and 1824, whose wife was a British artist, and which has three subspecies and body length is generally no more than 6 inches?

Answer: Russet sparrow

Tree Structure
{
  "root": {
    "id": "A",
    "entity": "Russet sparrow",
    "question": "What is a species of bird that was named by a person employed under his father between 1818 and 1824, whose wife was a British artist, and which has three subspecies and body length is generally no more than 6 inches?",
    "claims": [
      { "target_id": "B", "claim": "A was named by B" },
      { "target_id": "C", "claim": "A has three subspecies" },
      { "target_id": "D", "claim": "A's body length is generally no more than 6 inches" }
    ],
    "children": [
      {
        "id": "B",
        "entity": "John Gould",
        "claims": [
          { "target_id": "E", "claim": "B was employed by his father between 1818 and 1824" },
          { "target_id": "F", "claim": "B's wife was F" }
        ],
        "children": [
          { "id": "E", "entity": "None", "claims": [], "children": [] },
          { "id": "F", "entity": "Elizabeth Gould", "claims": [], "children": [] }
        ]
      },
      { "id": "C", "entity": "None", "claims": [], "children": [] },
      { "id": "D", "entity": "None", "claims": [], "children": [] }
    ]
  }
}
(A: Russet sparrow)
 │
 │
 │── [claim] "was named by" ──> (B: John Gould)
 │    │
 │    │
 │    │── [claim] "was employed by his father (1818-1824)"
 │    │
 │    │
 │    │── [claim] "wife was" ──> (F: Elizabeth Gould)
 │
 │
 │── [claim] "has three subspecies"
 │
 │
 │── [claim] "body length is generally no more than 6 inches"

Example 2:

Question: What is a women's football team whose first goals in the 2. Bundesliga were scored by a player born in Korogocho, who was discovered and developed by the Mathare Youth Sports Association?

Answer: SV Werder Bremen (women)

Tree Structure
{
  "root": {
    "id": "A",
    "entity": "SV Werder Bremen (women)",
    "question": "What is a women's football team whose first goals in the 2. Bundesliga were scored by a player born in Korogocho, who was discovered and developed by the Mathare Youth Sports Association?",
    "claims": [
      { "target_id": "B", "claim": "A's first goals in the 2. Bundesliga were scored by B" }
    ],
    "children": [
      {
        "id": "B",
        "entity": "Doreen Nabwire",
        "claims": [
          { "target_id": "C", "claim": "B was discovered and developed by C" },
          { "target_id": "D", "claim": "B was born in D" }
        ],
        "children": [
          { "id": "C", "entity": "Mathare Youth Sports Association", "claims": [], "children": [] },
          { "id": "D", "entity": "Korogocho", "claims": [], "children": [] }
        ]
      }
    ]
  }
}
(A: SV Werder Bremen (women))
 │
 │
 │── [claim] "first goals scored by" ──> (B: Doreen Nabwire)
      │
      │
      │── [claim] "discovered and developed by" ──> (C:Mathare Youth Sports Association)
      │
      │
      │── [claim] "was born in" ──> (D: Korogocho)

📊 Performance

Model trained on InfoSeek and our framework shows strong performances on traditional multi-hop benchmarks:

Our 3B model shows competitive results on BrowseComp-Plus:

📄 License

The code and data accompanying this work are released under the Apache License, Version 2.0. This permits use, modification, and distribution for research and commercial purposes, provided that proper attribution is given and the terms of the license are followed.

❤️ Citing Us

If you find this repository or our work useful, please consider giving a star ⭐ and or citing our work, which would be greatly appreciated:

@misc{xia2025opendatasynthesisdeep,
      title={Open Data Synthesis For Deep Research}, 
      author={Ziyi Xia and Kun Luo and Hongjin Qian and Zheng Liu},
      year={2025},
      eprint={2509.00375},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.00375}, 
}

About

Data Synthesis for Deep Research Based on Semi-Structured Data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published