About

Hi there! 👋

My name is Martin and I am the author of NGRAMS.

I am a backend developer and like to build software that deals with large amounts of data. I do things in C++ because it's fast, at least at runtime, if not necessarily at development time. I also like web programming to some extent, thanks to TypeScript. If I had to name my most important software development principle, it would be Keep It Simple. I received a master's degree in computer science in 2012 from the Bauhaus-Universität Weimar (Germany).

Email me at [email protected]

NGRAMS is my third implementation of a search engine of this kind.

2019 - today — ngrams.dev
Dataset: Google Books Ngram Dataset v3
Size: 23 TB compressed, ~230 TB uncompressed
Backend: C++20 for core app, uWebSockets for REST API server
This thing was released in mid-April 2023.

2015 - today — phrasefinder.io
Dataset: Google Books Ngram Dataset v2
Size: 7 TB compressed, ~70 TB uncompressed
Backend: C++14 for core app, Boost Beast for REST API server
This thing will be discontinued by the end of 2023.

2007 - 2013 — netspeak.org
Dataset: Web 1T 5-gram Version 1
Size: 25 GB compressed, ~75 GB uncompressed
Backend: C++03 (later C++11) for core app, JNI, Java Servlet for REST API server
This thing started off as my bachelor's thesis.
