Upgrading the FieldInferenceEngine with semantic regex heuristics

The current inference engine relies on direct string comparison or simple token matching against field names. This approach is brittle: it fails whenever developers use abbreviated or unconventional names.

Examples of failure cases:

- `usr_mail` instead of `email`
- `ph_no` instead of `phone_number`
- `dt_created` instead of `created_at`

When no match is found, the engine falls back to generic value generation, which reduces the realism and usefulness of the generated datasets.

### Deeper Architectural Requirements

#### 1. Regex-Driven Semantic Matching

Introduce a regex-based matching system that captures semantic intent rather than exact naming.

Examples:

- Phone detection:

`/(phone|mobile|cell|contact_no|tel)/i`

- Email detection:

`/(email|e_mail|mail_id)/i`

- Temporal fields:

`/(created|updated|timestamp|date|dt)/i`

Matching on these substrings covers many naming variants across diverse schemas without listing each one explicitly.
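A minimal Python sketch of this matching step, using the three example patterns above. The `/.../i` flag is expressed as `re.IGNORECASE`; the names `SEMANTIC_PATTERNS` and `match_categories` are hypothetical and not part of the existing engine.

```python
import re

# Patterns copied from the examples above; category names are illustrative.
SEMANTIC_PATTERNS = {
    "phone": re.compile(r"(phone|mobile|cell|contact_no|tel)", re.IGNORECASE),
    "email": re.compile(r"(email|e_mail|mail_id)", re.IGNORECASE),
    "temporal": re.compile(r"(created|updated|timestamp|date|dt)", re.IGNORECASE),
}


def match_categories(field_name: str) -> list[str]:
    """Return every category whose pattern occurs anywhere in the field name."""
    return [
        name
        for name, pattern in SEMANTIC_PATTERNS.items()
        if pattern.search(field_name)
    ]


print(match_categories("dt_created"))     # ['temporal']
print(match_categories("Mobile_Number"))  # ['phone']
print(match_categories("usr_mail"))       # []
```

Note that `usr_mail` still returns an empty list with these patterns, so covering that failure case would need an additional alternative such as `mail` in the email pattern.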

#### 2. Weighted Scoring Model

Each regex match should contribute a confidence score.

- Exact matches yield high confidence
- Partial or ambiguous matches yield lower confidence

Example scoring:

- exact email match = 1.0
- partial mail match = 0.6
- generic text match = 0.2

The final semantic classification is determined by aggregating scores across all matching patterns.
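One way to sketch the aggregation, assuming scores are simply summed per category (the issue does not fix an aggregation function, so summation is an assumption). The `RULES` table, the `text` category, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# (category, pattern, weight) rules; weights mirror the example scoring above.
RULES = [
    ("email", re.compile(r"^(email|e_mail)$", re.IGNORECASE), 1.0),  # exact email match
    ("email", re.compile(r"mail", re.IGNORECASE), 0.6),              # partial mail match
    ("text", re.compile(r".*"), 0.2),                                # generic text match
]


def score_field(field_name: str) -> dict[str, float]:
    """Sum the weights of every matching rule, grouped by category."""
    scores = defaultdict(float)
    for category, pattern, weight in RULES:
        if pattern.search(field_name):
            scores[category] += weight
    return dict(scores)


def classify(field_name: str) -> str:
    """Return the category with the highest aggregated score."""
    scores = score_field(field_name)
    return max(scores, key=scores.get)


print(classify("usr_mail"))  # email
print(classify("notes"))     # text
```

With summation, a field named exactly `email` accumulates both the exact and the partial weight. Taking the maximum per category instead would be an equally reasonable choice if scores should stay within 0 to 1.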

#### 3. Multi-Label Classification

Some fields may belong to multiple semantic categories. For example:

- `billing_email` may match both the financial and the contact categories

The engine should support:

- Primary classification based on highest score
- Secondary tags for contextual enrichment
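The primary/secondary split could look roughly like the following. The `Classification` type, the `min_score` threshold, and the sample scores for `billing_email` are all assumptions made for illustration.

```python
from typing import NamedTuple


class Classification(NamedTuple):
    primary: str
    secondary: list[str]


def classify_multi(scores: dict[str, float], min_score: float = 0.3) -> Classification:
    """Pick the top-scoring category as primary; keep other strong ones as tags."""
    if not scores:
        raise ValueError("no categories matched")
    # Sort by descending score, then by name, so ties resolve the same way every run.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    primary = ranked[0][0]
    secondary = [name for name, score in ranked[1:] if score >= min_score]
    return Classification(primary, secondary)


# Hypothetical aggregated scores for a field such as billing_email.
result = classify_multi({"contact": 1.0, "financial": 0.6, "text": 0.2})
print(result.primary, result.secondary)  # contact ['financial']
```

The threshold keeps weak generic matches out of the secondary tags; its value here is arbitrary.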

#### 4. Externalized Heuristic Dictionary

All regex patterns and scoring weights should be stored in a configuration layer.

Benefits:

- Enables community contributions without touching core logic
- Allows domain-specific extensions
- Simplifies testing and iteration

The dictionary structure may include:

```
{
  "email": {
    "patterns": [...],
    "weight": ...
  }
}
```
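As a sketch of how such a dictionary might be consumed, the snippet below parses a JSON document in that shape and compiles each category into a single case-insensitive pattern. The concrete pattern lists and weights in `CONFIG_TEXT` are placeholder values, and `load_heuristics` is a hypothetical helper.

```python
import json
import re

# Hypothetical config contents; in practice this might be read from a file.
CONFIG_TEXT = """
{
  "email": {"patterns": ["email", "e_mail", "mail_id"], "weight": 1.0},
  "phone": {"patterns": ["phone", "mobile", "cell", "contact_no", "tel"], "weight": 1.0}
}
"""


def load_heuristics(text: str) -> dict[str, tuple[re.Pattern, float]]:
    """Compile each category's patterns into one case-insensitive alternation."""
    raw = json.loads(text)
    compiled = {}
    for category, entry in raw.items():
        # Wrap each pattern in a non-capturing group so alternations stay isolated.
        alternation = "|".join(f"(?:{p})" for p in entry["patterns"])
        compiled[category] = (
            re.compile(alternation, re.IGNORECASE),
            float(entry["weight"]),
        )
    return compiled


heuristics = load_heuristics(CONFIG_TEXT)
pattern, weight = heuristics["email"]
print(bool(pattern.search("billing_email")), weight)  # True 1.0
```

Compiling once at load time also means an invalid contributed pattern fails early with `re.error` rather than during inference.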

#### 5. Continuous Learning Loop

Optionally, the system can log fallback cases and allow developers to:

- Add new regex patterns
- Adjust weights based on observed failures
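A minimal sketch of the logging half of this loop, assuming fallbacks are counted in memory per normalized field name. `FallbackLog` is a hypothetical class; a real implementation might persist the counts to a file instead.

```python
from collections import Counter


class FallbackLog:
    """Counts field names that no heuristic matched, for later review."""

    def __init__(self) -> None:
        self._counts = Counter()

    def record(self, field_name: str) -> None:
        # Normalize case so USR_MAIL and usr_mail are counted together.
        self._counts[field_name.lower()] += 1

    def most_common(self, n: int = 10) -> list[tuple[str, int]]:
        """Return the n most frequent unmatched names with their counts."""
        return self._counts.most_common(n)


log = FallbackLog()
for name in ["usr_mail", "ph_no", "USR_MAIL"]:
    log.record(name)

print(log.most_common(1))  # [('usr_mail', 2)]
```

The most frequent entries are the natural candidates for new patterns or weight adjustments.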

### Implementation Considerations

- Regex performance impact on large schemas
- Avoiding overfitting or overly broad matches
- Balancing precision versus recall
- Ensuring deterministic inference for repeatable runs
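To illustrate the broad-match risk: an unanchored pattern such as `tel` also matches `hotel_name`. One possible mitigation, sketched below, is to split field names into tokens and compare whole tokens instead of substrings. This swaps substring regexes for token lookup, which is a different technique from the patterns shown earlier; `tokenize`, `PHONE_TOKENS`, and `looks_like_phone` are hypothetical names.

```python
import re


def tokenize(field_name: str) -> list[str]:
    """Split snake_case, kebab-case, and camelCase names into lowercase tokens."""
    # Insert a space at lower-to-upper boundaries so camelCase splits too.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", field_name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


# Whole-token vocabulary; illustrative subset of the phone pattern above.
PHONE_TOKENS = {"phone", "mobile", "cell", "tel"}


def looks_like_phone(field_name: str) -> bool:
    """True only if some complete token is a known phone keyword."""
    return any(token in PHONE_TOKENS for token in tokenize(field_name))


print(bool(re.search(r"tel", "hotel_name")))  # True  (substring false positive)
print(looks_like_phone("hotel_name"))         # False
print(looks_like_phone("homeTel"))            # True
```

Token matching raises precision at the cost of recall: a name like `telno` would no longer match, so the two approaches may need to be combined.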
