This project explores early machine learning approaches to Named Entity Recognition (NER), token co-occurrence modeling, and semantic association using SQL-based graph construction. Built in 2016, the system processes domain-specific text (e.g., Wikipedia articles on F. Scott Fitzgerald and The Great Gatsby) to identify, normalize, and associate named entities based on statistical tokenization and paragraph-level context.
- Goal: Extract and associate named entities through co-occurrence and weighting
- Data Source: Wikipedia articles and original texts by F. Scott Fitzgerald
- Tech Stack: SQL Server (T-SQL), custom stored procedures, early-stage neural embeddings (described, not included in this repo)

The pipeline runs in five stages:

- **Corpus Input**
  - Load paragraph-tagged Wikipedia articles into an input table.
- **Term Normalization** (`ProcessTermsList.sql`; sketch below)
  - Standardizes inflected forms ("writer", "writers", "writing")
  - Looks up or inserts `TermId`s
  - Assigns statistical weights based on historical word use
- **Proper Matching & Feature Engineering** (`ProcessPropers.sql`; sketch below)
  - Matches normalized terms to known "Propers" (entity names)
  - Computes metrics: word count, total and weighted overlap
  - Filters and flags strong associations for graph building
- **Named Entity Graph Construction** (`ProcessPropersAssociate.sql`, `AssociateTermsList.sql`; sketch below)
  - Each entity becomes a graph node (`AMM_Node`)
  - Nodes are linked via associations (`AMM_Association`) based on:
    - Term overlap
    - Co-occurrence in the same paragraph
- **Scoring** (`CalculateAssociationWeight.sql`; sketch below)
  - Computes association scores using weighted overlap:
    `Score = ((TotalWeight + 1) * CountedWeightSum / CountedWeightDistinct) / 10`
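
The sketches below are illustrative only: apart from the identifiers quoted in this README (`TermId`, `AMM_Node`, `AMM_Association`, `AMM_ProcessInputTerm`, `AMM_ProcessProper`), every table and column they use is an assumption, not the repository's actual schema (which lives in `AMM_Database_All_Objects_2016_04.sql`). The lookup-or-insert pattern behind `ProcessTermsList.sql` might look roughly like this:

```sql
-- Minimal sketch of term normalization. The Term table and its columns
-- (other than TermId) are assumptions for illustration.
CREATE TABLE Term (
    TermId   INT IDENTITY(1,1) PRIMARY KEY,
    TermText NVARCHAR(100) NOT NULL UNIQUE,
    Weight   FLOAT NOT NULL DEFAULT 1.0
);

DECLARE @Token NVARCHAR(100) = N'Writers';
DECLARE @Normalized NVARCHAR(100) = LOWER(@Token);

-- Crude stand-in for the inflection handling ("writers" -> "writer")
IF RIGHT(@Normalized, 1) = N's' AND LEN(@Normalized) > 3
    SET @Normalized = LEFT(@Normalized, LEN(@Normalized) - 1);

DECLARE @TermId INT = (SELECT TermId FROM Term WHERE TermText = @Normalized);

IF @TermId IS NULL
BEGIN
    -- First sighting: insert the term with the default starting weight
    INSERT INTO Term (TermText) VALUES (@Normalized);
    SET @TermId = SCOPE_IDENTITY();
END
ELSE
    -- Known term: adjust its weight to reflect accumulated usage
    UPDATE Term SET Weight = Weight + 1 WHERE TermId = @TermId;
```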
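
The overlap metrics from `ProcessPropers.sql` could be computed along these lines, assuming a hypothetical `ProperTerm` table mapping each entity to its constituent terms:

```sql
-- Minimal sketch of the entity-matching features: word count, total overlap,
-- and weighted overlap per candidate entity ("Proper").
SELECT
    pt.ProperId,
    COUNT(DISTINCT it.TermId) AS MatchedWordCount,   -- distinct entity terms found in the text
    COUNT(*)                  AS TotalOverlap,       -- total matching occurrences
    SUM(it.Weight)            AS WeightedOverlap     -- occurrences scaled by term weight
FROM ProperTerm AS pt                                -- assumed (ProperId, TermId) mapping
JOIN AMM_ProcessInputTerm AS it                      -- normalized tokens of the current document (columns assumed)
    ON it.TermId = pt.TermId
GROUP BY pt.ProperId
HAVING COUNT(DISTINCT it.TermId) >= 2;               -- keep only reasonably strong matches
```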
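
Paragraph-level graph construction, as performed by `ProcessPropersAssociate.sql`, might reduce to something like the following, with a hypothetical `ProperParagraph` staging table recording which entity was matched in which paragraph:

```sql
-- Minimal sketch of building the entity graph from paragraph co-occurrence.
-- AMM_Node and AMM_Association are named in this README; their columns are assumed.

-- One node per entity matched anywhere in the text
INSERT INTO AMM_Node (ProperId)
SELECT DISTINCT pp.ProperId
FROM ProperParagraph AS pp
WHERE NOT EXISTS (SELECT 1 FROM AMM_Node AS n WHERE n.ProperId = pp.ProperId);

-- One association per pair of entities mentioned in the same paragraph
INSERT INTO AMM_Association (NodeId1, NodeId2, CoOccurrenceCount)
SELECT n1.NodeId, n2.NodeId, COUNT(*)
FROM ProperParagraph AS a
JOIN ProperParagraph AS b
    ON  a.ParagraphId = b.ParagraphId
    AND a.ProperId    < b.ProperId            -- count each unordered pair once
JOIN AMM_Node AS n1 ON n1.ProperId = a.ProperId
JOIN AMM_Node AS n2 ON n2.ProperId = b.ProperId
GROUP BY n1.NodeId, n2.NodeId;
```

`AssociateTermsList.sql` applies the same paragraph-grouping idea to build the term-level graph, over terms rather than entities.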
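
The scoring formula above, applied as a set-based update (column names on `AMM_Association` assumed):

```sql
-- Minimal sketch of CalculateAssociationWeight.sql applying the quoted formula.
UPDATE AMM_Association
SET Score = ((TotalWeight + 1.0) * CountedWeightSum / CountedWeightDistinct) / 10.0
WHERE CountedWeightDistinct <> 0;

-- Worked example: TotalWeight = 4, CountedWeightSum = 30, CountedWeightDistinct = 6
--   ((4 + 1) * 30 / 6) / 10 = 25 / 10 = 2.5
```

The `1.0` and `10.0` literals force floating-point arithmetic in case the weight columns are integer-typed.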

This project is structured as a classic data pipeline:

- Modular Steps: Each SQL script performs one focused transformation, from raw text to tokens, from tokens to entities, from entities to graphs.
- Sequential Flow: Output from each step becomes input for the next (e.g., token normalization feeds entity matching); a driver sequence is sketched after this list.
- Intermediate Tables: Temporary and permanent tables store results between stages (`AMM_ProcessInputTerm`, `AMM_ProcessProper`, etc.).
- Automated Processing: Each stored procedure acts as a reusable module that can be applied to new input.
- End-to-End Transformation: Raw Wikipedia text is converted into structured, ranked graphs for downstream use.
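
A minimal driver for that sequential flow might look like this, assuming each script installs a stored procedure of the same name and that the procedures take no parameters:

```sql
EXEC ProcessTermsList;            -- raw paragraphs   -> normalized terms (AMM_ProcessInputTerm)
EXEC ProcessPropers;              -- normalized terms -> matched entities (AMM_ProcessProper)
EXEC ProcessPropersAssociate;     -- matched entities -> graph nodes and associations
EXEC AssociateTermsList;          -- term-level graph via paragraph grouping
EXEC CalculateAssociationWeight;  -- score the associations
```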

Repository layout:

```
SQL/
├── AMM/
│   ├── AMM_Database_All_Objects_2016_04.sql   # Full database schema and procedures
│   ├── LookupAssociationViewPart1.sql         # View for association lookups (part 1)
│   └── LookupAssociationViewPart2.sql         # View for association lookups (part 2)
│
└── WikipediaSql/
    ├── ProcessTermsList.sql                   # Normalize tokens and assign IDs/weights
    ├── ProcessPropers.sql                     # Match terms to entities and compute features
    ├── ProcessPropersAssociate.sql            # Build graph nodes and weighted associations
    ├── AssociateTermsList.sql                 # Build term-based graph using paragraph grouping
    ├── CalculateAssociationWeight.sql         # Score entity associations
    └── Delete Titles With Prefixes.sql        # Clean up noise from wiki namespaces
```
Using input from The Great Gatsby Wikipedia article, the model identifies and connects entities such as:
- F. Scott Fitzgerald
- Zelda Fitzgerald
- Jazz Age, Lost Generation, American Dream
- Nick Carraway, Jay Gatsby, Daisy Buchanan
These are clustered based on term overlap and paragraph proximity, yielding semantically relevant association graphs.
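
A query along these lines (column names assumed) would surface the strongest associations for a single entity:

```sql
-- Hypothetical inspection query over the finished graph.
SELECT TOP (10)
    n1.Name  AS EntityA,
    n2.Name  AS EntityB,
    a.Score
FROM AMM_Association AS a
JOIN AMM_Node AS n1 ON n1.NodeId = a.NodeId1
JOIN AMM_Node AS n2 ON n2.NodeId = a.NodeId2
WHERE n1.Name = N'Jay Gatsby'
ORDER BY a.Score DESC;
```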
- This project predates modern transformers and neural NLP frameworks.
- Most logic is embedded in T-SQL stored procedures.
- Neural embeddings were experimented with using exported node graphs (not included here).
Developed by Korova Mode at AmoryTech in 2016.
This repository is for archival and educational purposes.