Automating markup changes with node, gulp and config files.
This tool is part of the Blitz Framework.
All the Blitz repositories reached End Of Life on July 1, 2020. The entire project is no longer maintained and its repositories are read-only. You can still fork them if they can be useful to you.
Blitz Tasks is a set of gulp scripts to automate (X)HTML markup changes and apply some useful optimizations (image optim + CSS/JS minification). Files and folders you put in input will go through pipelines then show up in output.
It uses JSON config files with a “search & replace” affordance. You search in documents using CSS selectors, then replace in a variety of ways, e.g. changing the tag, removing the element, adding classes and ids, etc.
We don’t know of any alternative right now but will warmly welcome any project duplicating features in the same or another language, and promote it (we mean it).
The idea behind Blitz Tasks is to modify the markup of a large number of files with a “common pattern” – hence the config files. Here are a few use cases which drove its design:
- you must modify dozens of older HTML files (e.g. backlist EPUB);
- you have to systematically clean and improve the markup that is output by an authoring tool;
- you want to upgrade lots of files to HTML5 (or EPUB3);
- you want to add id to all your headings, figures, notes, etc.;
- you want to remove or rename a lot of classes;
- you want to add elements e.g. stylesheet, metas, etc. to a large amount of files;
- you want to use the heading of the document for its title;
- you want to add a lang to hundreds of files;
- etc.
On top of that, Blitz Tasks also provides image optimization and CSS/JS minification as process options.
First, make sure you have Node.js and npm installed. If you don’t, install them.
Then either clone this repository:
git clone https://github.com/FriendsOfEpub/blitz-tasks.git
Or “use this template” on github, which will create your own repository with the same files and folders.
Then install gulp-cli globally:
npm install -g gulp-cli
Finally, cd into the project and run:
npm install
This will install all required dependencies (gulp, gulp-cheerio, etc.).
Put your files in input (you can safely add entire folders e.g. unzipped epub).
Make your config file then:
gulp
Your modified folders/files are now available in output.
This will run tasks with the existing config file:
gulp
This will run tasks with a custom config file:
gulp -c "./another-config.json"
If you don’t need to use a script, don’t use it (remove it entirely) in your config.json. Blitz Tasks will know it doesn’t have to run it at all. See another-config.json as an example.
Blitz Tasks uses a JSON config file to modify your documents.
It is recommended to define a useful scope and version for each config file. Those two properties are informational today, but may be used in the future to scope changes to precise folders and to help you manage breaking changes in a new major version.
There are optional properties bound to the scripts Blitz Tasks currently offers:
- retag
- sanitize
- classify
- identify
- attributify
- append
Most of these scripts are conceptually “search & replace” for the Document Object Model (DOM), with CSS selectors as the syntax. However some may slightly differ so let’s see these scripts in detail.
Technical note: the value of those properties is an array of objects whose keys differ depending on the script.
You can take a look at the example config for more examples.
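As a rough illustration of that structure, here is a minimal sketch in plain JavaScript. The script names come from this ReadMe, but the config values and the scriptsToRun helper are hypothetical; this is not how gulpfile.js is actually written, only a way to picture "one array of objects per script, and scripts you leave out are skipped."

```javascript
// Hypothetical config mirroring the structure described above.
// Property names come from this ReadMe; the values are illustrative only.
const config = {
  scope: "backlist",
  version: "1.0.0",
  retag: [{ search: ".image", replace: "figure" }],
  sanitize: [{ search: ".remove", replace: "this" }],
  append: [{ where: "head", what: "<meta charset='utf-8'/>" }],
  options: { docLang: "fr" }
};

// Scripts documented in this ReadMe that read an array of objects.
const KNOWN_SCRIPTS = ["retag", "sanitize", "classify", "identify", "attributify", "append"];

// Hypothetical helper: keep only the scripts that have at least one entry.
function scriptsToRun(cfg) {
  return KNOWN_SCRIPTS.filter((name) => Array.isArray(cfg[name]) && cfg[name].length > 0);
}

// Logs the three scripts present in the config: retag, sanitize, append.
console.log(scriptsToRun(config));
```

Since classify, identify, and attributify are absent from this config, the sketch simply leaves them out of the list.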
Retag lets you change the markup of all elements found in the DOM. It’s “search & replace” for HTML markup.
{
"search": ".image",
"replace": "figure"
}
Here we are searching for all elements whose class is image, and replacing the current tag with figure.
Sanitize lets you remove or unwrap elements, and remove attributes.
It’s “search & replace” again, but what’s important here is the value of replace.
If you’re using value "this" then Blitz Tasks will know it has to remove the element entirely – removing its contents as well.
{
"search": ".remove",
"replace": "this"
}
If you’re using value "unwrap" then Blitz Tasks will know it has to unwrap the element – keeping its contents.
{
"search": "h1 > b, h2 > b, span:not([class]):not([id])",
"replace": "unwrap"
}
Any other value will be interpreted as the name of an attribute to remove, e.g. class, id, data-something, etc.
{
"search": ".title-1, .title-2, p.text, em.italic",
"replace": "class"
}
Note that thanks to the CSS selector syntax, you can search for more complex attributes, for instance an id whose value starts with x.
{
"search": "[id^='x']",
"replace": "id"
}
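The three behaviours above boil down to a small decision on the value of replace. The sketch below is only an illustration of that rule in plain JavaScript; sanitizeAction and the shape of the object it returns are hypothetical, and the real script applies the change to the DOM rather than returning a description.

```javascript
// Hypothetical helper: describe what sanitize would do for a given "replace".
function sanitizeAction(replace) {
  if (replace === "this") {
    // Remove the element and its contents.
    return { action: "remove" };
  }
  if (replace === "unwrap") {
    // Remove the element but keep its contents.
    return { action: "unwrap" };
  }
  // Anything else is treated as the name of an attribute to remove.
  return { action: "removeAttribute", attribute: replace };
}

console.log(sanitizeAction("this"));
console.log(sanitizeAction("unwrap"));
console.log(sanitizeAction("class"));
```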
Classify lets you add a class or overwrite an existing one. This means it will first remove the class entirely then add classes defined in replace.
{
"search": "figure + h1",
"replace": "no-margin-top"
}
In this example, we search for an h1 immediately following a figure and add a no-margin-top class. If there were existing classes, they would be overwritten.
Identify lets you iterate over elements and add an identifier. Here replace is the prefix that will be used for ids.
{
"search": "p",
"replace": "para"
}
In this example, each paragraph will get an id with prefix para e.g. the first paragraph in the doc will be para-1, the second will be para-2, and so on. If there were existing ids, they would be overwritten.
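The numbering scheme can be pictured with a few lines of plain JavaScript. This is a sketch of the naming pattern only; makeIds is a hypothetical helper, and the actual script assigns the ids to the elements matched by search.

```javascript
// Hypothetical helper: build the ids that "count" matched elements would get.
function makeIds(prefix, count) {
  // Numbering starts at 1, following the para-1, para-2 example above.
  return Array.from({ length: count }, (_, index) => `${prefix}-${index + 1}`);
}

console.log(makeIds("para", 3));
```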
Attributify lets you add an attribute (property + value) or overwrite an existing one.
{
"search": "h1",
"replace": "data-heading='1'"
}
In this example, we search for h1 and add a data-heading attribute whose value is 1. If there was an existing data-heading, it would be overwritten.
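The replace string packs a name and a value into one string. How Blitz Tasks splits it internally is not documented here, so the following is only one plausible way to do it in plain JavaScript; parseAttribute is a hypothetical helper.

```javascript
// Hypothetical helper: split "name='value'" into its two parts.
function parseAttribute(replace) {
  const separator = replace.indexOf("=");
  if (separator === -1) {
    // No "=" found: treat the whole string as a name with an empty value.
    return { name: replace.trim(), value: "" };
  }
  const name = replace.slice(0, separator).trim();
  // Strip one pair of matching single or double quotes around the value.
  const value = replace
    .slice(separator + 1)
    .trim()
    .replace(/^(['"])(.*)\1$/, "$2");
  return { name, value };
}

console.log(parseAttribute("data-heading='1'"));
```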
Append lets you add elements in each document. It is not using the “search & replace” concept but “where & what.”
Indeed, where is the element at the end of which you want to add something (what).
{
"where": "head",
"what": "<link type='text/css' rel='stylesheet' href='../css/my-styles.css'/>"
}
In this example, we are adding a stylesheet at the end of <head>.
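To make "at the end of the element" concrete, here is a deliberately simplified, string-based sketch. It is not how Blitz Tasks works (the tool operates on the DOM, and where can be any CSS selector); appendTo is a hypothetical helper that only understands a plain tag name.

```javascript
// Hypothetical helper: insert "what" just before the last closing tag of "where".
// Simplification: "where" must be a tag name, not an arbitrary CSS selector.
function appendTo(markup, where, what) {
  const closingTag = `</${where}>`;
  const position = markup.lastIndexOf(closingTag);
  if (position === -1) {
    // Element not found: leave the markup untouched.
    return markup;
  }
  return markup.slice(0, position) + what + markup.slice(position);
}

const doc = "<html><head><title>T</title></head><body></body></html>";
console.log(appendTo(doc, "head", "<link rel='stylesheet' href='a.css'/>"));
```

The new element lands after the existing title, i.e. as the last child of head.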
Finally, options allow you to define global changes and optimizations when processing.
With options you can:
- automate replacing the document title with headings found in a document;
- set a default language for each document;
- prettify the markup of each document;
- optimize images;
- minify or prettify stylesheets;
- minify or prettify scripts;
- delete files and their entries in the EPUB’s OPF, NCX, and Nav Doc.
Property docTitle expects a CSS selector and allows you to define what the <title> of the document should be.
"options": {
"docTitle": "h1, h2"
}
Here, we instruct Blitz Tasks that it should check for an h1, and if it doesn’t find any, check h2. It will use the text content of the first result found.
If no result is found in a document, it will ignore the option.
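The fallback order described above can be sketched as follows. This is an illustration under assumptions, not the actual implementation: pickTitle and findTexts are hypothetical, the lookup is faked with a plain object instead of a real DOM query, and splitting on commas is naive (it would break on selectors that contain commas themselves).

```javascript
// Hypothetical helper: try each selector in order, return the first text found.
// "findTexts" stands in for a DOM query returning the text of matched elements.
function pickTitle(docTitle, findTexts) {
  const selectors = docTitle
    .split(",")
    .map((selector) => selector.trim())
    .filter(Boolean);
  for (const selector of selectors) {
    const texts = findTexts(selector);
    if (texts.length > 0) {
      return texts[0].trim();
    }
  }
  // Nothing matched: the option is ignored for this document.
  return null;
}

// Fake document with no h1 and two h2 elements.
const fakeDocument = { h2: ["Chapter One", "A Section"] };
const lookup = (selector) => fakeDocument[selector] || [];

console.log(pickTitle("h1, h2", lookup));
```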
Property docLang expects a BCP-47 language tag and allows you to define a lang attribute for the root element (html).
"options": {
"docLang": "fr"
}
If the document is XHTML, it will also add an xml:lang attribute.
Property imageOptim expects a boolean.
"options": {
"imageOptim": true
}
When set to true, all GIF, JPEG, PNG, and SVG images will be optimized.
Property minifyCSS expects a boolean.
"options": {
"minifyCSS": true
}
When set to true, all stylesheets (.css) will be minified (i.e. removing comments, new lines, spaces, etc.).
Property minifyJS expects a boolean.
"options": {
"minifyJS": true
}
When set to true, all scripts (.js) will be uglified (i.e. removing comments, new lines, spaces, etc.).
Property prettyHTML expects a boolean.
"options": {
"prettyHTML": true
}
When set to true, all documents (.html + .xhtml) will be prettified (i.e. consistent indents, removing useless lines, etc.).
You can change prettyOpts in gulpfile.js if you want to customize how files are prettified.
Property prettyCSS expects a boolean.
"options": {
"prettyCSS": true
}
When set to true, all stylesheets (.css) will be prettified (i.e. consistent indents, removing useless lines, etc.).
Note: This option is incompatible with minifyCSS and will be overridden if minifyCSS is set to true.
You can change prettyOpts in gulpfile.js if you want to customize how files are prettified.
Property prettyJS expects a boolean.
"options": {
"prettyJS": true
}
When set to true, all scripts (.js) will be prettified (i.e. consistent indents, removing useless lines, etc.).
Note: This option is incompatible with minifyJS and will be overridden if minifyJS is set to true.
You can change prettyOpts in gulpfile.js if you want to customize how files are prettified.
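The precedence between the minify and pretty options for stylesheets and scripts can be summed up in a short sketch. The option names are the ones documented above; resolveMode and its return values are hypothetical and only illustrate the "minify wins" rule.

```javascript
// Hypothetical helper: decide what happens to files of a given type ("CSS" or "JS").
function resolveMode(options, type) {
  // Minification takes precedence over prettifying when both are set.
  if (options[`minify${type}`] === true) {
    return "minify";
  }
  if (options[`pretty${type}`] === true) {
    return "prettify";
  }
  return "none";
}

const options = { minifyCSS: true, prettyCSS: true, prettyJS: true };
console.log(resolveMode(options, "CSS"));
console.log(resolveMode(options, "JS"));
```

With these options, stylesheets are minified (prettyCSS is overridden) while scripts are prettified.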
Property deleteFiles expects an array of strings (filenames).
"options": {
"deleteFiles": ["blitz-kindle.css", "cover.xhtml", "cover.png"]
}
It makes sense to provide this option since you may remove links, scripts, etc. during the sanitize task. Note that default will run this script immediately after init in order to save some useless processing – especially image optim and minification.
Property epub expects a boolean.
"options": {
"epub": true
}
When set to true, Blitz Tasks will run some extra processes specific to EPUB files e.g. deleting file entries from the OPF, NCX and Nav Doc.
Note: This option is currently limited to deleting files but may be used for more in the future – rezipping in the correct order, adding metadata, etc.
We have a couple of guides demonstrating how to use these scripts in config files addressing more specific workflow issues. Do not hesitate to add yours!
Blitz Tasks makes each one of its scripts available if you don’t want to run the default. Note you must run gulp init before running those scripts.
Obviously, you must still have a config file for these scripts to run.
If you intend to run multiple scripts, don’t forget to use the --series flag e.g. gulp retag sanitize --series.
First, you must init a session:
gulp init
init will copy everything from input into output. Indeed, Blitz Tasks doesn’t modify your input, just in case, and will only alter files it finds in the output folder. This means you should think in terms of “sessions.” Each time you add to input, you should consider it a new session – the default script automatically creates a new session every time it is run, and effectively resets output.
This will only run the retag script with the existing config file:
gulp retag
This will run retag then sanitize. The --series flag must be used in order to run those scripts one after the other – otherwise scripts would be run asynchronously.
gulp retag sanitize --series
Finally, this will run identify and append with a custom config:
gulp identify append --series --config "./another-config.json"
Blitz Tasks offers a bunch of command-line options you can pass as arguments.
You can use --config or -c to use a custom config file.
gulp -c "./another-config.json"
In this example, Blitz Tasks will use the config file another-config.json to process documents, stylesheets, and scripts.
You can define what the --input (or -i) and --output (or -o) should be for a session. It is recommended to use both so that nothing in the default output folder ends up being lost/overwritten.
gulp -i "input/folder" -o "test"
In this example, Blitz Tasks will copy files from input/folder into test and process documents. This can be useful for scoping processes to publishers/collections/etc., especially when using a custom config file.
For the optimization and beautification scripts, it may be useful to use --force (or -F) as this will ignore the config file options.
gulp imageOptim minifyCSS minifyJS --force
In this example, we are running the imageOptim, minifyCSS, and minifyJS scripts while ignoring options in config.json. Files will be optimized even if those options are not set to true.
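To summarise the flags above, here is a small sketch of how such arguments could be read in plain JavaScript. It is not taken from gulpfile.js: parseArgs is a hypothetical helper, and the default values (./config.json, input, output) are assumptions based on the file and folder names mentioned in this ReadMe.

```javascript
// Flags that expect a value, mapped to the setting they fill in.
const VALUE_FLAGS = new Map([
  ["-c", "config"], ["--config", "config"],
  ["-i", "input"], ["--input", "input"],
  ["-o", "output"], ["--output", "output"]
]);

// Hypothetical helper: turn command-line arguments into session settings.
function parseArgs(argv) {
  // Assumed defaults, not confirmed by the actual gulpfile.
  const settings = { config: "./config.json", input: "input", output: "output", force: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "-F" || arg === "--force") {
      settings.force = true;
    } else if (VALUE_FLAGS.has(arg) && i + 1 < argv.length) {
      settings[VALUE_FLAGS.get(arg)] = argv[i + 1];
      i++; // Skip the value we just consumed.
    }
    // Anything else (task names, --series) is ignored by this sketch.
  }
  return settings;
}

console.log(parseArgs(["-i", "input/folder", "-o", "test", "--force"]));
```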
- default (this will run all scripts below)
- init
- deleteFiles
- retag
- sanitize
- classify
- identify
- append
- handleEPUB
- handleOptions
- identifyNCX (--force flag will bypass config.json)
- imageOptim (--force flag will bypass config.json)
- minifyCSS (--force flag will bypass config.json)
- minifyJS (--force flag will bypass config.json)
- prettyHTML (--force flag will bypass config.json)
- prettyCSS (--force flag will bypass config.json)
- prettyJS (--force flag will bypass config.json)
You can see a list of available scripts by running gulp --tasks. Note that scripts that can be forced can run asynchronously, i.e. you don’t need the --series flag if running several of these.
If you want examples of how you can use a subset of those scripts with npm, take a look at scripts in package.json (e.g. npm run optim, npm run clean, etc.).
Here are a couple of questions that might pop up at some point in time, and attempts at an honest answer.
Because that is the environment maintainers are comfortable with, hence the easiest way to create such a project. That’s it.
If we had to care about all the twitter fights on XML vs. JSON vs. YAML, or technology X vs. technology Y in general, a lot of tools wouldn’t even exist.
You are completely free to replicate this project and its goals into any other language/environment you prefer.
Do not hesitate to let us know so that we can advertise it in this ReadMe, as it would definitely benefit a larger amount of users, especially the ones who are not comfortable with node and JSON.
Because JSON is about the simplest thing to use in node. It is literally require("./config.json"); you don’t even need to parse it.
That said, Pull Requests adding support for XML (using xml2json for instance), YAML, or anything else, will be greatly appreciated.
If you have this need/requirement and can manage its addition to Blitz Tasks, do not hesitate if you have questions or need clarifications.
Blitz Tasks relies on CheerioJS, which is heavily inspired by jQuery. This means CheerioJS may support more than the browser you are using right now.
Therefore, Blitz Tasks is able to support pseudo-class :has() for instance, and search for elements containing other elements. As a practical example, you can filter figure with a figcaption like this: figure:has(figcaption). And you could modify those figures differently, e.g. add a specific class because the figure has a figcaption.
We welcome any idea, improvement, or fix that will benefit all users.
A good rule of thumb is to request global utilities that can be used for other various use cases (see recipes). Conversely, requesting something to fix one of your own workflow issues will be problematic, and likely not considered unless it is a very common issue for users.
Please note this repository is also a GitHub template so we’ve even made it easier for people to adjust it to their workflow issues.
The most obvious one would be adding a zip option, that could indeed help:
- Unzip in input
- Rezip folders or EPUB files (if epub set to true) in output
Blitz Tasks shouldn’t be considered the be-all and end-all of all ebook production issues. In particular, everything text is out of scope – there are better and more reliable tools for that.
In other words, we won’t implement any textual search and replace (e.g. regex), or automatic typography improvements (e.g. smart quotes, symbols, etc.).
It’s also first and foremost a template you can use to kickstart your own projects. You can think of it as a toolbox instead of a finished product.
So we won’t provide an option to upgrade ePub2 to EPUB3 for instance, because this is a set of smaller tasks (e.g. changing the doctype, updating metadata in the .opf, creating a nav out of .ncx, etc.). At most, Blitz Tasks should offer some scripts to handle these smaller issues, and not a complete solution to the upgrade.
Obviously, you can implement these features yourself, and we will gladly list your repo in this ReadMe if you do.
Of course a project is only as reliable as its dependencies… and sh*t obviously happens.
That being said, those scripts helped maintainers go through very tight deadlines, ePub2 to EPUB3 upgrades, Word to EPUB3 conversions, etc. In the end, they saved hundreds, if not thousands, of work hours over the span of 2–3 years.