TextIndex: create back-of-book indexes from Markdown and other plain-text formats #11007

mattgemmell · 2025-07-31T09:36:51Z

Hello. I've made something that might be useful for those who want to add indexes to their publications, when using plain-text source formats like Markdown.

My pandoc usage is entirely based on Markdown to HTML-derived formats, including HTML itself, EPUB3, and PDF for print via WeasyPrint. I wanted a way to generate indexes (in the back-of-book sense) purely from Markdown without going via LaTeX, and after a bit of research I decided to create a micro-syntax (and a parser script, in Python). It's called TextIndex, and it lets you add "index marks" to plain-text documents; these are then compiled into an HTML index and inserted into the document wherever you like. This is conceptually similar to Microsoft Word's approach to indexes.

I use it as a pre-processor before pandoc and also as a standalone tool, and I'm finding the results to be more than satisfactory. Genuine page numbers are available for paginated formats (via the CSS Generated Content spec), and there's a lot of index-related functionality available, including cross-references, custom sorting of entries, hierarchical headings, locator emphasis and suffixing, running-in of deeply nested entries, and a number of conveniences to make indexing easier and quicker. I largely used the Chicago Manual of Style to guide the formatting, within the constraints of simplicity.

It's of course open source (GPLv3), and the repository is available on GitHub. You can read the full documentation here, with a sample index (for that page itself, generated with TextIndex) at the end.

A couple of screenshots of sample output are attached. I hope the project will be useful to others too.

jgm · 2025-08-01T05:35:15Z

Maintainer

This looks great! I recently had to generate an index for a book in LaTeX/PDF and EPUB versions, and I might have used this if it had been available. (In the end I used inline LaTeX indexing commands + a Lua filter that converted these to something appropriate for the EPUB.)

9 replies

mattgemmell Aug 1, 2025
Author

That's very useful indeed; thank you. I appreciate you taking the time to share those samples.

iandol Aug 1, 2025

If I were writing markup in my text, I would certainly prefer your {^} syntax over the LaTeX one, and your tool has some great features. If this were a Lua filter and supported more output formats, it would be amazing! Thanks!

jgm Aug 1, 2025
Maintainer

Here's my filter:

-- This filter replaces latex \index commands with anchors, and it
-- creates a data structure linking index terms to anchors (and e.g.
-- see or seealso entries). When the traversal is complete, it uses
-- this table to add an index section to the document, with a list of
-- links back into the body.

-- Limitations:
-- * Sorting will be broken for non-ASCII letters.
-- * Section labels are only computed to three levels.

local number = {}
local index = {}
local lastAnchor = 0

local function getSectionNumber()
  local result = "§"
  for i, n in ipairs(number) do
    result = result .. (i == 1 and "" or ".") .. tostring(n)
  end
  return result
end

local function getNewAnchor()
  lastAnchor = lastAnchor + 1
  return lastAnchor
end

local function getIndexEntries(el)
  local anchor
  for s in string.gmatch(el.text, "\\index(%b{})") do
      if not anchor then
          anchor = getNewAnchor()
      end
      local wentry, instruct = string.match(s, "^%{([^|]*)(.*)%}$")
      local subentrypos = string.find(wentry, "!")
      local subentry
      local entry
      if subentrypos then
        entry = string.sub(wentry, 1, subentrypos - 1)
        subentry = string.sub(wentry, subentrypos + 1, -1)
      else
        entry = wentry
      end
      if not index[entry] then
        index[entry] = {}
      end
      if subentry and not index[entry][subentry] then
        index[entry][subentry] = {}
      end
      if instruct == "" or instruct == "|(" then
          local secnum = getSectionNumber()
          if subentry then
            table.insert(index[entry][subentry], {secnum, anchor})
          else
            table.insert(index[entry], {secnum, anchor})
          end
      elseif instruct == "|)" then
          return nil
      else
        local cmd, ref = string.match(instruct, "%|(see%w*)(%b{})")
        if cmd then
          index[entry][cmd] = string.sub(ref, 2, -2)
        end
      end
  end
  return anchor
end

local function handleRaw(el)
  if el.format == "latex" or el.format == "tex" then
    local anchor = getIndexEntries(el)
    if anchor then
      if el.type == 'RawBlock' then
        return {el, pandoc.Div({},{"index:" .. anchor, {"indexref"}, {}})}
      else
        return {el, pandoc.Span({},{"index:" .. anchor, {"indexref"}, {}})}
      end
    end
  end
end

local function printIndex()
  for k, v in pairs(index) do
    print(k)
    for k,x in pairs(v) do
      print(k,x)
    end
  end
end

local function extend(t1,t2)
    for i=1,#t2 do
        t1[#t1+1] = t2[i]
    end
    return t1
end

local function readLaTeXInlines(s)
  local doc = pandoc.read(s, "latex")
  return pandoc.utils.blocks_to_inlines(doc.blocks)
end

local function formatEntry(key, entry)
  local contents = readLaTeXInlines(key)
  for k, v in pairs(entry) do
    if type(k) == "number" and type(v) == "table" then
      table.insert(contents, pandoc.Str(","))
      table.insert(contents, pandoc.Space())
      table.insert(contents, pandoc.Link(v[1], "#index:" .. tostring(v[2])))
    elseif k == "see" or k == "seealso" then
      local lab = {pandoc.Str(k == "seealso" and "see also" or "see")}
      table.insert(contents, pandoc.Str(","))
      table.insert(contents, pandoc.Space())
      table.insert(contents, pandoc.Emph(lab))
      table.insert(contents, pandoc.Space())
      contents:extend(readLaTeXInlines(v))
    elseif type(k) == "string" and type(v) == "table" then
      -- subentry
      table.insert(contents, pandoc.Str(";"))
      table.insert(contents, pandoc.Space())
      extend(contents, formatEntry(k, v))
    end
  end
  return contents
end

local function latexToPlainHtml (s)
  local doc = pandoc.read(s, "latex")
  local html = pandoc.write(pandoc.Pandoc(pandoc.Plain(pandoc.utils.blocks_to_inlines(doc.blocks))), "html")
  return html:gsub("%s*$", ""):gsub("%b<>", "") -- remove HTML tags
end

local function makeIndex()
  local result = { pandoc.Header(1, "Index", { "index", {"unnumbered"}, {} }) }
  -- get keys so we can sort them
  local keys = {}
  for k, _ in pairs(index) do
      table.insert(keys, k)
  end
  table.sort(keys, function(a,b) return (string.lower(a) < string.lower(b)) end)
  local firstletter
  for _, rawkey in ipairs(keys) do
    local entry = index[rawkey]
    local indexkey = string.gsub(rawkey, "@.*", "")
    local key = string.gsub(rawkey, "^.*@", "")
    local ekey = latexToPlainHtml(indexkey)
    local newfirstletter = string.lower(string.sub(ekey, 1, 1))
    if newfirstletter ~= firstletter and newfirstletter < '\127' and newfirstletter:match("%a") then
      table.insert(result,
          pandoc.Header(2, string.upper(newfirstletter), {"index-heading-" .. string.upper(newfirstletter), {"unnumbered", "unlisted", "index-heading"}, {}}))
      firstletter = newfirstletter
    end
    table.insert(result, pandoc.Para(formatEntry(key, entry)))
  end
  return result
end

local function addIndex(el)
  el.blocks = el.blocks .. makeIndex()
  return el
end

local function incrementNumber(lev)
  if lev == 1 then
    if number[1] then
      number = {number[1] + 1}
    else
      number = {1}
    end
  elseif lev == 2 then
    if number[2] then
      number = {number[1], number[2] + 1}
    else
      number = {number[1], 1}
    end
  elseif lev == 3 then
    if number[3] then
      number[3] = number[3] + 1
    else
      number[3] = 1
    end
  end
end

local function updateNum(el)
  if not el.classes:includes("unnumbered") then
    incrementNumber(el.level)
  end
end

return { { traverse = 'topdown',
           RawBlock = handleRaw,
           RawInline = handleRaw,
           Header = updateNum },
         { Pandoc = addIndex } }
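
The first limitation listed in the filter's header comment (sorting breaks for non-ASCII letters) follows from comparing `string.lower` results byte by byte. One common workaround is to sort on a key with accents stripped. The sketch below shows that idea in Python, using only the standard `unicodedata` module. It is an approximation rather than locale-aware collation, and the names `index_sort_key` and `group_letter` are hypothetical; they are not part of the filter above or of TextIndex.

```python
import unicodedata


def index_sort_key(term):
    """Return an accent- and case-insensitive key for sorting index terms.

    The original term is kept as a tie-breaker so that the order of
    "resume" and "résumé" is deterministic.
    """
    # NFKD splits "É" into "E" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (stripped.casefold(), term)


def group_letter(term):
    """Return the A-Z heading a term belongs under, or "Symbols"."""
    first = index_sort_key(term)[0][:1].upper()
    # Letters with no decomposition (e.g. "Ø") stay non-ASCII and land in
    # "Symbols"; a real collation library would handle them properly.
    return first if first.isascii() and first.isalpha() else "Symbols"


if __name__ == "__main__":
    terms = ["zebra", "Émile", "apple", "Ångström", "eel"]
    for term in sorted(terms, key=index_sort_key):
        print(group_letter(term), term)
```

The same key could be computed once per entry and used both for ordering and for choosing the letter heading, which is the role `string.lower` plays in `makeIndex`.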

mattgemmell Aug 1, 2025
Author

Thanks!

mattgemmell Aug 3, 2025
Author

Just a quick addendum: based on the above, TextIndex can now convert basic/common LaTeX \index{...} commands into its own syntax before processing the document. If someone wants to create an additional to-HTML/EPUB workflow from existing print-oriented source files, it should at least help. Docs are here, and a screenshot of an example conversion is below. Thanks for the pointers!
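
For anyone wondering what reading those commands involves, here is a minimal Python sketch that splits the argument of `\index{...}` into its parts using the usual makeindex conventions: `!` separates heading levels, `sortkey@display` gives a custom sort key, and `|` introduces a cross-reference (`see{...}`, `seealso{...}`), a range marker (`(` or `)`), or a formatting command. This is not TextIndex's actual converter. The names `IndexMark`, `find_index_args`, and `parse_index_arg` are hypothetical, and the sketch ignores makeindex's `"` quoting and escaped braces.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexMark:
    levels: list                    # display text for each heading level
    sort_keys: list                 # sort key per level (display text if no "@")
    encap: Optional[str] = None     # e.g. "see", "seealso", "(", ")", "textbf"
    target: Optional[str] = None    # cross-reference target for see/seealso


def find_index_args(text):
    """Yield the brace-balanced argument of each \\index{...} found in text."""
    marker = "\\index{"
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            return
        begin = i = start + len(marker)
        depth = 1
        # Count braces so nested groups such as see{...} stay inside the argument.
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:
            yield text[begin:i - 1]
        pos = i


def parse_index_arg(arg):
    """Split one \\index argument into levels, sort keys, and the "|" part."""
    entry, sep, rest = arg.partition("|")
    levels, sort_keys = [], []
    for part in entry.split("!"):
        key, at, display = part.partition("@")
        sort_keys.append(key.strip())
        levels.append((display if at else key).strip())
    encap, target = None, None
    if sep:
        encap = rest.strip()
        match = re.fullmatch(r"(see(?:also)?)\{(.*)\}", encap)
        if match:
            encap, target = match.group(1), match.group(2)
    return IndexMark(levels, sort_keys, encap, target)


if __name__ == "__main__":
    sample = r"Cats\index{cat!tabby} purr. \index{alpha@$\alpha$|see{Greek letters}}"
    for arg in find_index_args(sample):
        print(parse_index_arg(arg))
```

Each `IndexMark` could then be rewritten as a mark in whatever target syntax is wanted; that mapping is deliberately left out here because it depends on the tool.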
