NbAiLab/AltMorph

AltMorph: Context-Aware Norwegian Morphological Alternative Generator

AltMorph expands Norwegian text by finding morphological alternatives for each word. It combines the Ordbank API with BERT-based part-of-speech tagging and acceptability scoring, so the alternatives it proposes fit the surrounding context.

✨ Features

  • 🎯 Context-sensitive filtering: Uses BERT-based acceptability scoring for ambiguous cases
  • 📚 Lemma coverage: Finds morphological forms across multiple lemmas
  • 🔍 Position-specific analysis: Looks at each word in its syntactic context
  • ⚡ Caching: Persistent file-based caching to improve performance
  • 🗣️ Multiple verbosity levels: From silent operation to detailed pipeline insights
  • 🌐 Language support: Norwegian Bokmål (nob) and Nynorsk (nno)
  • 🧠 POS-aware: Uses NbAiLab BERT models for part-of-speech tagging
  • 🚀 Parallel processing: Runs concurrent API calls

🛠️ Installation

Prerequisites

  • Python 3.8+
  • Ordbank API key (free registration at Ordbank)

Install Dependencies

pip install -r requirements.txt

Get API Key

  1. Register at https://www.ordbank.no/
  2. Obtain your API key from your account dashboard
  3. Set the environment variable:
    export ORDBANK_API_KEY="your_api_key_here"
    Alternatively, pass the key directly with the --api_key flag.

🚀 Quick Start

Basic Usage

python altmorph.py --sentence "Katta ligger på matta." --lang nob

Output:

"{Katta, Katten} ligger på {matta, matten}."
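For downstream use, the brace notation in the output can be expanded into plain sentence variants. The snippet below is a minimal sketch, not part of AltMorph itself; the helper names parse_alternatives and expand are hypothetical, and it assumes groups are never nested and that alternatives contain no commas.

```python
import itertools
import re

# Matches one "{alt1, alt2, ...}" group in AltMorph-style output.
# Assumption: groups are not nested.
GROUP = re.compile(r"\{([^{}]*)\}")


def parse_alternatives(text):
    """Split output into literal strings and lists of alternatives."""
    parts = []
    pos = 0
    for match in GROUP.finditer(text):
        if match.start() > pos:
            parts.append(text[pos:match.start()])
        parts.append([alt.strip() for alt in match.group(1).split(",")])
        pos = match.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts


def expand(text):
    """Return every sentence variant encoded by the brace notation."""
    parts = parse_alternatives(text)
    # Literal text becomes a single-choice list so product() can combine it.
    choices = [part if isinstance(part, list) else [part] for part in parts]
    return ["".join(combo) for combo in itertools.product(*choices)]


variants = expand("{Katta, Katten} ligger på {matta, matten}.")
for variant in variants:
    print(variant)
```

Two groups with two alternatives each yield four variants, from "Katta ligger på matta." to "Katten ligger på matten.".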

With API Key

python altmorph.py \
  --sentence "Katta ligger på matta." \
  --lang nob \
  --api_key "your_api_key_here"

📖 Usage Examples

Context-Sensitive Behaviour

The tool takes sentence context into account:

Simple example:

python altmorph.py --sentence "Katta ligger på matta." --lang nob
# Output: "{Katta, Katten} ligger på {matta, matten}."
# Shows different morphological forms for the same words

Complex context:

python altmorph.py --sentence "Katta ligger på matta i stua." --lang nob
# Output: "{Katta, Katten} ligger på {matta, matten} i stua."
# BERT-based filtering keeps alternatives that work in the sentence

Position-Specific Analysis

python altmorph.py --sentence "Katta ligger på matta." --lang nob
# Each word occurrence is analyzed in its specific syntactic context

🎛️ Command Line Options

Option              Default            Description
--sentence          required           Input sentence to process
--lang              nob                Language code (nob or nno)
--api_key           $ORDBANK_API_KEY   Ordbank API key
--verbosity         0                  Verbosity level (0-3)
--logit-threshold   3.0                BERT acceptability threshold
--timeout           6.0                HTTP timeout per request
--max_workers       4                  Parallel API requests
--no-cache          False              Disable caching
--delete-cache      False              Clear cache and exit
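The options listed above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the actual altmorph.py source: the function names build_parser and parse_cli are hypothetical, and the rule that --sentence may be omitted together with --delete-cache is inferred from the `python altmorph.py --delete-cache` invocation in the Testing section.

```python
import argparse
import os


def build_parser():
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(description="AltMorph option sketch")
    parser.add_argument("--sentence", help="Input sentence to process")
    parser.add_argument("--lang", default="nob", choices=["nob", "nno"])
    # Falls back to the environment variable when the flag is absent.
    parser.add_argument("--api_key", default=os.environ.get("ORDBANK_API_KEY"))
    parser.add_argument("--verbosity", type=int, default=0, choices=range(4))
    parser.add_argument("--logit-threshold", type=float, default=3.0)
    parser.add_argument("--timeout", type=float, default=6.0)
    parser.add_argument("--max_workers", type=int, default=4)
    parser.add_argument("--no-cache", action="store_true")
    parser.add_argument("--delete-cache", action="store_true")
    return parser


def parse_cli(argv=None):
    """Parse arguments; --sentence is required unless only clearing the cache."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.sentence is None and not args.delete_cache:
        parser.error("--sentence is required unless --delete-cache is given")
    return args


args = parse_cli(["--sentence", "Katta ligger på matta.", "--lang", "nno"])
print(args.lang, args.logit_threshold, args.max_workers, args.no_cache)
```

Note that argparse exposes hyphenated flags with underscores, so --logit-threshold is read as args.logit_threshold.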

🔊 Verbosity Levels

Level 0: Quiet (Default)

python altmorph.py --sentence "Katta ligger på matta." --verbosity 0

Output: Just the final result

"{Katta, Katten} ligger på {matta, matten}."

Level 1: Normal

python altmorph.py --sentence "Katta ligger på matta." --verbosity 1

Output: Basic progress information

2025-XX-XX 12:00:00 INFO Loading POS tagger...
2025-XX-XX 12:00:02 INFO POS tagger loaded
"{Katta, Katten} ligger på {matta, matten}."

Level 2: Verbose

python altmorph.py --sentence "Katta ligger på matta." --verbosity 2

Output: Processing details (POS tags, API lookups, alternatives found)

🎯 PROCESSING: Katta ligger på matta.
📝 WORDS: ['katta', 'ligger', 'på', 'matta']
🏷️ POS TAGS:
   katta: NOUN
   ligger: VERB
   på: ADP
   matta: NOUN
📡 API LOOKUP: katta (POS: NOUN)
   ✅ katta: 2 alternatives: ['katta', 'katten']
...
✨ RESULT: "{Katta, Katten} ligger på {matta, matten}."

Level 3: Very Verbose

python altmorph.py --sentence "Katta ligger på matta." --verbosity 3

Output: Everything including cache operations, lemma analysis, BERT filtering

🎯 PROCESSING: Katta ligger på matta.
📝 FOUND 2 LEMMAS for katta
💾 CACHE HIT: lemmas for 'katta' (POS: NOUN)
🧠 ACCEPTABILITY FILTERING (threshold: 3.00)
🔍 ANALYZING: katta (position 0)
   Context: [Katta] ligger på matta.
   Alternatives: ['katta', 'katten']
📊 CACHE STATS: 8 hits, 0 misses (100.0% hit rate)
...

🗂️ Caching System

AltMorph includes caching to improve performance:

  • Cache location: ~/.ordbank_cache/
  • Cache types: Lemma searches and inflection data
  • Performance: ~95%+ hit rate for repeated usage
  • Management:
    • --no-cache: Disable caching
    • --delete-cache: Clear all cache files

Performance impact:

  • First run: ~3-4 seconds (API calls)
  • Cached runs: ~0.5 seconds
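To illustrate the idea of a persistent file-based cache, here is a minimal sketch. It is not AltMorph's actual cache code: the class name FileCache, the key format, and the one-JSON-file-per-key layout are all assumptions made for the example, and the demo writes to a temporary directory rather than ~/.ordbank_cache/.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class FileCache:
    """Tiny persistent cache: one JSON file per key (hypothetical layout)."""

    def __init__(self, directory):
        self.directory = Path(directory).expanduser()
        self.directory.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _path(self, key):
        # Hash the key so any string maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_compute(self, key, compute):
        """Return the cached value, or call compute() and store its result."""
        path = self._path(key)
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text(encoding="utf-8"))
        self.misses += 1
        value = compute()
        path.write_text(json.dumps(value, ensure_ascii=False), encoding="utf-8")
        return value

    def clear(self):
        """Delete all cached entries (the idea behind --delete-cache)."""
        for path in self.directory.glob("*.json"):
            path.unlink()


# Demo in a temporary directory; the lambda stands in for an API call.
cache = FileCache(tempfile.mkdtemp())
first = cache.get_or_compute("inflections:nob:katta", lambda: ["katta", "katten"])
second = cache.get_or_compute("inflections:nob:katta", lambda: ["never called"])
print(first == second, cache.hits, cache.misses)
```

The second lookup is served from disk, which is why cached runs avoid the API round trip.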

🧠 Technical Details

Code Architecture Deep-Dive

📖 Complete Code Walkthrough: a detailed technical explanation of how AltMorph works, for developers who need implementation details.

Architecture

  1. Input Processing: Tokenization preserving whitespace and punctuation
  2. POS Tagging: NbAiLab/nb-bert-base-pos for accurate grammatical analysis
  3. Lemma Discovery: Comprehensive search across all relevant Ordbank lemmas
  4. Inflection Analysis: Full morphological paradigm extraction
  5. Acceptability Scoring: NbAiLab/nb-bert-base for context-sensitive filtering
  6. Output Generation: Case-preserving alternative presentation
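Steps 1 and 6 of the pipeline above can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: the regular expression and the helper names tokenize and match_case are hypothetical.

```python
import re

# Words, whitespace runs, and single punctuation marks become separate
# tokens, so joining the tokens reproduces the input exactly.
TOKEN = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence without losing whitespace or punctuation."""
    return TOKEN.findall(sentence)


def match_case(source, candidate):
    """Give candidate the capitalisation pattern of the source word."""
    if len(source) > 1 and source.isupper():
        return candidate.upper()
    if source[:1].isupper():
        return candidate[:1].upper() + candidate[1:]
    return candidate


sentence = "Katta ligger på matta."
tokens = tokenize(sentence)
print(tokens)
print(match_case("Katta", "katten"))
```

Because nothing is discarded during tokenization, the output sentence can be rebuilt by replacing word tokens in place and joining everything back together.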

Models Used

  • POS Tagging: NbAiLab/nb-bert-base-pos
  • Acceptability: NbAiLab/nb-bert-base
  • API: Ordbank - Norwegian morphological database

Key Algorithms

  • Comprehensive lemma matching: Finds all lemmas containing target word
  • Position-specific analysis: Each word occurrence analyzed in context
  • Logit-based filtering: Acceptability thresholding (default: 3.0)
  • Prioritization: Balances morphological coverage with contextual fit
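One plausible reading of logit-based thresholding is sketched below. This is an assumption about how such a filter might work, not a description of AltMorph's scoring: the function name filter_by_logit is hypothetical, the scores are supplied by the caller (for example from a masked language model), and the numbers in the demo are invented purely for illustration.

```python
def filter_by_logit(original, scores, threshold=3.0):
    """Keep alternatives scoring within `threshold` of the best candidate.

    Assumption: a higher score means a more acceptable word in context.
    The original word is always kept, whatever its score.
    """
    best = max(scores.values())
    return [
        alt
        for alt, score in scores.items()
        if alt == original or best - score <= threshold
    ]


# Invented scores for illustration only; real values would come from a model.
scores = {"katta": 9.1, "katten": 8.4, "katter": 2.0}
kept = filter_by_logit("katta", scores)
print(kept)
```

Under this reading, raising --logit-threshold admits more alternatives and lowering it makes the filter stricter.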

📊 Performance

Typical Performance

  • Single sentence: 0.5-4 seconds (depending on cache state)
  • Cache hit rate: Typically 95%+ for repeated usage
  • API efficiency: Parallel requests with batching
  • Memory usage: ~500MB (loaded BERT models)

Scaling Considerations

  • Concurrent requests: Configurable via --max_workers
  • Timeout handling: Robust error recovery with retries
  • Rate limiting: Respectful API usage patterns
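The combination of a bounded worker pool and retries can be sketched with the standard library. This is a generic illustration, not AltMorph's request code: with_retries, lookup_all, and flaky_fetch are hypothetical names, and the fetch function simulates one timeout per word instead of calling the Ordbank API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(func, attempts=3, base_delay=0.5):
    """Wrap func so network-style errors are retried with exponential backoff."""
    def wrapper(arg):
        for attempt in range(attempts):
            try:
                return func(arg)
            except OSError:  # TimeoutError and ConnectionError are subclasses
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)
    return wrapper


def lookup_all(words, fetch, max_workers=4, attempts=3, base_delay=0.5):
    """Run fetch(word) for every word on a bounded thread pool."""
    safe_fetch = with_retries(fetch, attempts, base_delay)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(words, pool.map(safe_fetch, words)))


# Stand-in for an API call: fails once per word, then succeeds.
calls = {}


def flaky_fetch(word):
    calls[word] = calls.get(word, 0) + 1
    if calls[word] == 1:
        raise TimeoutError("simulated timeout")
    return word.upper()


results = lookup_all(["katta", "matta"], flaky_fetch, base_delay=0.0)
print(results)
```

Capping the pool size keeps the number of simultaneous requests predictable, which is the role --max_workers plays on the command line.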

🛠️ Tools

AltMorph includes additional tools for batch processing and testing. See tools/README.md for detailed documentation and usage examples.

🔧 Development

Project Structure

altmorph/
├── altmorph.py              # Main application
├── tools/
│   ├── README.md            # Tools documentation
│   ├── process_jsonl.py     # JSONL batch processor
│   └── pos_tester.py        # POS tagging comparison tool
├── data/
│   └── sample_input.jsonl   # Sample data for testing
├── README.md                # Main documentation
├── setup.py                 # Legacy packaging
├── pyproject.toml           # Modern packaging
├── requirements.txt         # Dependencies
└── ~/.ordbank_cache/        # Cache directory in the user's home (auto-created, outside the repository)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure code follows existing style
  5. Submit a pull request

Testing

# Test basic functionality
python altmorph.py --sentence "Katta ligger på matta." --lang nob

# Test cache functionality
python altmorph.py --delete-cache
python altmorph.py --sentence "Katta ligger på matta." --lang nob --verbosity 3

# Test without cache
python altmorph.py --sentence "Katta ligger på matta." --lang nob --no-cache

# Test POS comparison tool
python tools/pos_tester.py --text "Katta ligger på matta."

# Test batch processing with sample data
python tools/process_jsonl.py --input_file data/sample_input.jsonl --output_file test_output.jsonl --verbosity 2

🀝 Related Projects

  • AltWER: Depends on AltMorph's output format for Norwegian text evaluation

📄 License

Apache 2.0

🙏 Acknowledgments

  • Ordbank Team: For providing the comprehensive Norwegian morphological API
  • Clarino/UiB: For hosting the API infrastructure
  • NbAiLab: For the Norwegian BERT models
  • AltMorph: Idea and coding by Magnus Breder Birkenes and Per Egil Kummervold
