Skip to content

Associative Named Entity Graph Model (2016) - Semantic graph model for named entity recognition using SQL-based tokenization, co-occurrence analysis, and entity weighting. Built in 2016 using Wikipedia and literary texts as input.

Notifications You must be signed in to change notification settings

markiskorova/associative-entity-graph

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Associative Named Entity Graph Model (2016)

This project explores early machine learning approaches to Named Entity Recognition (NER), token co-occurrence modeling, and semantic association using SQL-based graph construction. Built in 2016, the system processes domain-specific text (e.g., Wikipedia articles on F. Scott Fitzgerald and The Great Gatsby) to identify, normalize, and associate named entities based on statistical tokenization and paragraph-level context.

πŸ” Overview

  • Goal: Extract and associate named entities through co-occurrence and weighting
  • Data Source: Wikipedia articles and original texts by F. Scott Fitzgerald
  • Tech Stack: SQL Server (T-SQL), custom stored procedures, early-stage neural embeddings (described, not included in this repo)

🧠 Architecture

  1. Corpus Input

    • Load paragraph-tagged Wikipedia articles into an input table.
  2. Term Normalization (ProcessTermsList.sql)

    • Standardizes inflected forms ("writer", "writers", "writing")
    • Looks up or inserts TermIds
    • Assigns statistical weights based on historical word use
  3. Proper Matching & Feature Engineering (ProcessPropers.sql)

    • Matches normalized terms to known "Propers" (entity names)
    • Computes metrics: word count, total/weighted overlap
    • Filters and flags strong associations for graph building
  4. Named Entity Graph Construction (ProcessPropersAssociate.sql, AssociateTermsList.sql)

    • Each entity becomes a graph node (AMM_Node)
    • Nodes are linked via associations (AMM_Association) based on:
      • Term overlap
      • Co-occurrence in the same paragraph
  5. Scoring (CalculateAssociationWeight.sql)

    • Computes association scores using weighted overlap:
      Score = ((TotalWeight + 1) * CountedWeightSum / CountedWeightDistinct) / 10
      

🧬 Pipeline Design

This project is structured as a classic data pipeline:

  • Modular Steps: Each SQL script performs a focused transformation: from raw text to tokens, from tokens to entities, from entities to graphs.
  • Sequential Flow: Output from each step becomes input for the next (e.g., token normalization feeds entity matching).
  • Intermediate Tables: Temporary and permanent tables store results between stages (AMM_ProcessInputTerm, AMM_ProcessProper, etc).
  • Automated Processing: Each stored procedure acts as a reusable module that can be applied to new input.
  • End-to-End Transformation: Raw Wikipedia text is converted into structured, ranked graphs for downstream use.

πŸ“¦ Contents

SQL/
β”œβ”€β”€ AMM/
β”‚   └── AMM_Database_All_Objects_2016_04.sql   # Full database schema and procedures
β”‚   └── LookupAssociationViewPart1.sql         # View for association lookups (part 1)
β”‚   └── LookupAssociationViewPart2.sql         # View for association lookups (part 2)
β”‚
β”œβ”€β”€ WikipediaSql/
    β”œβ”€β”€ ProcessTermsList.sql                   # Normalize tokens and assign IDs/weights
    β”œβ”€β”€ ProcessPropers.sql                     # Match terms to entities and compute features
    β”œβ”€β”€ ProcessPropersAssociate.sql            # Build graph nodes and weighted associations
    β”œβ”€β”€ AssociateTermsList.sql                 # Build term-based graph using paragraph grouping
    β”œβ”€β”€ CalculateAssociationWeight.sql         # Score entity associations
    β”œβ”€β”€ Delete Titles With Prefixes.sql        # Clean up noise from wiki namespaces

πŸ§ͺ Example Use Case

Using input from The Great Gatsby Wikipedia article, the model identifies and connects entities such as:

  • F. Scott Fitzgerald
  • Zelda Fitzgerald
  • Jazz Age, Lost Generation, American Dream
  • Nick Carraway, Jay Gatsby, Daisy Buchanan

These are clustered based on term overlap and paragraph proximity, yielding semantically relevant association graphs.

πŸ“Œ Notes

  • This project predates modern transformers and neural NLP frameworks.
  • Most logic is embedded in T-SQL stored procedures.
  • Neural embeddings were experimented with using exported node graphs (not included here).

πŸ“š Credit

Developed by Korova Mode at AmoryTech in 2016.


This repository is for archival and educational purposes.

About

Associative Named Entity Graph Model (2016) - Semantic graph model for named entity recognition using SQL-based tokenization, co-occurrence analysis, and entity weighting. Built in 2016 using Wikipedia and literary texts as input.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages