The current inference engine relies on direct string comparisons or simple token matching. This approach is brittle and fails when developers use unconventional naming patterns.
Examples of failure cases:
- usr_mail instead of email
- ph_no instead of phone_number
- dt_created instead of created_at
Fields that go unrecognized trigger fallback generation, which reduces the realism and usefulness of the generated datasets.
Deeper Architectural Requirements
1. Regex-Driven Semantic Matching
Introduce a regex-based matching system that captures semantic intent rather than exact naming.
Examples:
/(phone|mobile|cell|contact_no|tel)/i
/(email|e_mail|mail_id)/i
/(created|updated|timestamp|date|dt)/i
This allows broader coverage across diverse schemas.
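A minimal sketch of how the patterns above might be applied, assuming Python's standard re module. The category names (phone, email, timestamp) and the match_categories helper are hypothetical and are not part of any existing engine API.

```python
import re
from typing import List

# The three patterns listed above, compiled case-insensitively.
# The category names used as keys are illustrative assumptions.
SEMANTIC_PATTERNS = {
    "phone": re.compile(r"(phone|mobile|cell|contact_no|tel)", re.IGNORECASE),
    "email": re.compile(r"(email|e_mail|mail_id)", re.IGNORECASE),
    "timestamp": re.compile(r"(created|updated|timestamp|date|dt)", re.IGNORECASE),
}


def match_categories(field_name: str) -> List[str]:
    """Return every category whose pattern occurs anywhere in the field name."""
    return [
        category
        for category, pattern in SEMANTIC_PATTERNS.items()
        if pattern.search(field_name)
    ]


print(match_categories("dt_created"))     # ['timestamp']
print(match_categories("Mobile_Number"))  # ['phone']
print(match_categories("ph_no"))          # []
print(match_categories("usr_mail"))       # []
```

As the last two calls show, the example patterns as written do not yet cover ph_no or usr_mail from the failure cases; covering them would need extra alternatives (for instance ph or mail), which in turn widens the risk of false positives.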
2. Weighted Scoring Model
Each regex match should contribute a confidence score.
- Exact matches yield high confidence
- Partial or ambiguous matches yield lower confidence
Example scoring:
- exact email match = 1.0
- partial mail match = 0.6
- generic text match = 0.2
The final semantic classification is determined by aggregating scores across all matching patterns.
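The scoring idea can be sketched as follows. The weights come from the example scoring above; the patterns attached to them, the RULES table, and the choice to aggregate by taking the highest weight per category are all assumptions, since the aggregation function is not specified here (a sum or a capped sum would be equally plausible).

```python
import re
from typing import Dict, Optional

# (category, pattern, weight). Weights mirror the example scoring;
# the patterns themselves are hypothetical.
RULES = [
    ("email", re.compile(r"^e_?mail$", re.IGNORECASE), 1.0),        # exact match
    ("email", re.compile(r"mail", re.IGNORECASE), 0.6),             # partial match
    ("text", re.compile(r"(text|note|desc)", re.IGNORECASE), 0.2),  # generic match
]


def score_field(field_name: str) -> Dict[str, float]:
    """Aggregate confidence per category (assumption: keep the highest weight)."""
    scores: Dict[str, float] = {}
    for category, pattern, weight in RULES:
        if pattern.search(field_name):
            scores[category] = max(scores.get(category, 0.0), weight)
    return scores


def classify(field_name: str) -> Optional[str]:
    """Pick the top-scoring category, or None when nothing matched."""
    scores = score_field(field_name)
    if not scores:
        return None
    # Break ties alphabetically so repeated runs give the same answer.
    return min(scores, key=lambda category: (-scores[category], category))


print(score_field("usr_mail"))   # {'email': 0.6}
print(score_field("mail_text"))  # {'email': 0.6, 'text': 0.2}
print(classify("mail_text"))     # email
```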
3. Multi-Label Classification
Some fields may belong to multiple semantic categories. For example:
- billing_email may match both financial and contact categories
The engine should support:
- Primary classification based on highest score
- Secondary tags for contextual enrichment
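One possible shape for a multi-label result, shown for the billing_email case. The Classification type, the CATEGORY_RULES table, its weights, the billing/invoice/payment pattern, and the 0.5 threshold are hypothetical choices made for illustration only.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Classification(NamedTuple):
    primary: Optional[str]  # highest-scoring category, if any
    secondary: List[str]    # remaining categories above the threshold


# Hypothetical rules: each category owns a list of (pattern, weight) pairs.
CATEGORY_RULES = {
    "contact": [
        (re.compile(r"(email|e_mail|mail_id)", re.IGNORECASE), 1.0),
        (re.compile(r"(phone|mobile|cell|contact_no|tel)", re.IGNORECASE), 1.0),
    ],
    "financial": [
        (re.compile(r"(billing|invoice|payment)", re.IGNORECASE), 0.6),
    ],
}


def classify_multi(field_name: str, threshold: float = 0.5) -> Classification:
    """Return a primary category plus secondary tags for one field name."""
    scores: Dict[str, float] = {}
    for category, rules in CATEGORY_RULES.items():
        best = max((w for p, w in rules if p.search(field_name)), default=0.0)
        if best >= threshold:
            scores[category] = best
    # Sort by descending score, then name, so the ranking is deterministic.
    ranked = sorted(scores, key=lambda category: (-scores[category], category))
    if not ranked:
        return Classification(None, [])
    return Classification(ranked[0], ranked[1:])


result = classify_multi("billing_email")
print(result.primary)    # contact
print(result.secondary)  # ['financial']
```

Keeping the secondary tags as an ordered list lets a generator use them for enrichment without changing how the primary category is chosen.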
4. Externalized Heuristic Dictionary
All regex patterns and scoring weights should be stored in a configuration layer.
Benefits:
- Enables community contributions without touching core logic
- Allows domain-specific extensions
- Simplifies testing and iteration
The dictionary structure may include:
{
  "email": {
    "patterns": [...],
    "weight": ...
  }
}
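A sketch of how such a dictionary could be loaded and validated, assuming it is stored as JSON. The concrete patterns and weights in EXAMPLE_CONFIG are made-up sample values (the structure above leaves them elided), and load_dictionary and the 0.0 to 1.0 weight range are likewise assumptions.

```python
import json
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical dictionary contents, used only to exercise the loader.
EXAMPLE_CONFIG = """
{
  "email": {"patterns": ["email", "e_mail", "mail_id"], "weight": 1.0},
  "phone": {"patterns": ["phone", "mobile", "cell", "contact_no", "tel"], "weight": 1.0}
}
"""

CompiledEntry = Tuple[List[Pattern[str]], float]


def load_dictionary(raw: str) -> Dict[str, CompiledEntry]:
    """Parse the JSON dictionary and compile every pattern once, up front."""
    compiled: Dict[str, CompiledEntry] = {}
    for category, entry in json.loads(raw).items():
        weight = float(entry["weight"])
        # Assumed convention: weights are confidences between 0.0 and 1.0.
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {category!r} must be within [0, 1]")
        patterns = [re.compile(p, re.IGNORECASE) for p in entry["patterns"]]
        compiled[category] = (patterns, weight)
    return compiled


dictionary = load_dictionary(EXAMPLE_CONFIG)
print(sorted(dictionary))  # ['email', 'phone']
```

Compiling at load time means an invalid community-contributed pattern fails fast with re.error instead of surfacing during inference.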
5. Continuous Learning Loop
Optionally, the system can log fallback cases and allow developers to:
- Add new regex patterns
- Adjust weights based on observed failures
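The logging half of this loop might look like the sketch below. FallbackLog and its methods are hypothetical names; the idea is only to count which field names fell through to fallback generation so that the most frequent ones can guide new patterns or weight changes.

```python
from collections import Counter
from typing import List, Tuple


class FallbackLog:
    """Counts field names that no pattern matched (hypothetical helper)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, field_name: str) -> None:
        # Normalize case so PH_NO and ph_no are counted together.
        self._counts[field_name.lower()] += 1

    def most_common(self, limit: int = 10) -> List[Tuple[str, int]]:
        # Sort by descending count, then name, for stable reports.
        ordered = sorted(self._counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ordered[:limit]


log = FallbackLog()
for name in ["ph_no", "PH_NO", "usr_mail"]:
    log.record(name)
print(log.most_common())  # [('ph_no', 2), ('usr_mail', 1)]
```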
Implementation Considerations
- Regex performance impact on large schemas
- Avoiding overfitting or overly broad matches
- Balancing precision versus recall
- Ensuring deterministic inference for repeatable runs
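To make the breadth and performance concerns concrete, the sketch below contrasts an unanchored pattern with a token-bounded one and caches results per field name. The names LOOSE_PHONE, TOKEN_PHONE, and is_phone_field are hypothetical, and treating any non-alphanumeric character as a token boundary is an assumption that suits snake_case but not camelCase.

```python
import re
from functools import lru_cache

# Unanchored: matches the alternatives anywhere inside the name.
LOOSE_PHONE = re.compile(r"(phone|mobile|cell|tel)", re.IGNORECASE)

# Token-bounded: the alternative must not touch a letter or digit on either side.
TOKEN_PHONE = re.compile(
    r"(?<![a-z0-9])(phone|mobile|cell|tel)(?![a-z0-9])", re.IGNORECASE
)


@lru_cache(maxsize=4096)
def is_phone_field(field_name: str) -> bool:
    """Cached, token-bounded check; repeated field names are matched only once."""
    return TOKEN_PHONE.search(field_name) is not None


print(bool(LOOSE_PHONE.search("hotel_name")))  # True  (false positive on "tel")
print(is_phone_field("hotel_name"))            # False
print(is_phone_field("tel_no"))                # True
print(is_phone_field("telephone"))             # False (recall lost to precision)
```

The last call illustrates the precision versus recall trade-off: tighter boundaries remove matches such as hotel_name but also miss telephone unless it is added as its own alternative. Because the function is pure and the patterns are fixed, the cache does not affect determinism.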