This document describes yet another instantiation of my attempts to create a sane reference manager that could actually improve rather than degrade my bibliography/PDF management workflows. This is only a project for now, yet a Gitter chat has been created for people interested in providing suggestions or feedback.
- Introduction: doing two things and doing them well
- User interface
- Standard ontology
- Implementing the ontology
- Client/server protocol
- Unresolved issues
- Footnotes
Why yet another reference manager? The first goal of Tkacz is to be usable for very large bibliographies in the field of historical sciences, with a strong focus on contemporary history of science, although it shouldn’t be restricted to a specific field or research domain. Large bibliographies need good organization capabilities, history requires a powerful enough data model to cleanly represent complex or unusual documents and relationships between them, and all research require open, documented and easily usable formats that doesn’t tie their users to a particular piece of software.
Tkacz tries to satisfies these needs by addressing the shortcomings of existing document managers and reference managers:
Document managers usually have very powerful classification, ordering and annotation features, but a very poor data model, which makes it hard to actually meaningfully and formally describe documents by providing formal bibliographic references.
Most document management software also assume documents are somehow unique and sui generis, where historical sources tend to have significant histories in themselves: they may be reprinted, republished, collected and updated, and all of this matters, both for the history of the document and its availability: knowing where to find an obscure article is important. In the “hard” sciences, the most complicated (significant) thing that can happen to an article is to be withdrawn, and all you’re expected to do in this case is not to cite it anymore; but for historical research, withdrawals, reprints, modifications, reuse, etc., are often significant and need to be represented.
It is also assumed that documents are files, usually PDFs. But it’s often the case that documentation is scattered between different libraries, printed and electronic copies, that not everything has a PDF and not every PDF is complete.
Reference managers make it easy to reference documents you don’t actually own (and often make it the default), and usually have a powerful (or extensible) enough data model to accommodate a lot of information, but they usually fall very short when it comes to classification. Beyond that, the (potential) power of the data model is not necessarily reflected in all UIs, and some changes that are theoretically possible in the data model will make the data actually unusable. For instance, handing authors not as strings, but as objects is theoretically feasible in the BibTeX format; but such files would be unusable by the Bib(La)TeX family of programs.
The goal of Tkacz is to “do two things, and do them well”: to serve as a document manager with the full power of a bibliographical data manager. That means:
- Nicely deal with complex or unusual entities: manuscripts, unpublished works, posthumous publications, etc. This means a more complex data model that most reference managers provide, and if possible an extensible data model.
- Prefer objects over strings for qualified properties. Authors, for
example: attribution is important, and attribution can be tricky;
it’s thus better that the
author
field takes a series of person objects, optionally parameterized, instead of a list of strings. This helps solve three problems: a) it allows to keep notes on persons as well, which can be quickly retrieved and are connected to this person’s works; b) it nicely handles cases where a given person published under multiple names, or published anonymous works, making it easy to present the complete list of someone’s works, regardless of the name they were published under; c) it also, obviously, makes typos much easier to detect and correct. - Provide advanced classification mechanisms: hierarchical or flat collections, tags, etc. Because we use objects rather than strings, a good deal of classification can be automated: works by this author, works in this journal, etc. Classification also allows to consider objects as topics or more generally classes: it is not uncommon, eg. in philosophy or literature, that the topic of a book is the author of other books. Friedrich Nietzsche is the author of the Geneology of Morals, and the topic of Heidegger’s Nietzsche.
- Provide just as powerful annotation features.
- Express relationships between entities: an article and a reprint, or a withdrawal, a book and its editions, a multipart work and its parts, etc.
- Help represent the history of references: an article may have been published in a journal, then reprinted multiple times in different books. Where do I look for Canguilhem’s 1966 article “Mort de l’homme ou épuisement du cogito?”, if I don’t have access to the original journal? Has it been translated into English, if I need to cite it in an English article, and where to find the translation?
- Make it hard to introduce inconsistencies or typos in the database. Journals, like persons, are objects: if the object describes the publishing rate, then no reference could contain a volume/issue number inconsistent with publication date without at least generating a warning.
- Create usable bibliographic databases for use with (La)TeX (and
other reference managers). The data model outlined above is way
too expressive for what BiBTeX, BibLaTeX or virtually any
bibliography manager can handle, so Tkacz needs to be able to build
.bib
files on-demand with a given subset of its database. - Rather than supporting “file attachments”, provide a general mechanism to describe copies of a reference, some of which can be files, but would also include things like: it’s in this library at this call number; it’s in my personal library, I have a printout in this box, etc.
- Make the database history easy to browse, review, and understand. Nothing frightens me more than undocumented and non-versioned binary databases with quirky UIs that make it hard to get a grasp of what’s going on (yes, Papers, I’m looking at you). The database will be in a Lisp-like syntax (think XML with parentheses), with Git integration out of the box. A change = a commit, with a meaningful commit message. This leaves the user free to rebase, reorder or squash commits before pushing, and should make it trivial to keep a perfectly clean history.
This is a design document for Tkacz, which should work as a specification for both the user interface and the implementation.
When started with M-x tkacz RET
, Tkacz shows a list of all references
it has in store. It can also show a list of any other type of
entities: to do so, press e, then select the entity type you want.
There are three by default: references (r
), persons (p
), and journals
(j
).
By default, entities are displayed in so-called natural format, they can also be shown in tabulated format by pressing ===.
There are multiple ways to create new references:
- Press
n n
in the references view to display an input form where you can manually fill fields. This is the most tedious way, and should generally be avoided. n u
will prompt for a URL, then do its best to build a reference out of it. If possible, it will assimilate the associated PDF as a copy of the reference.n U
does the same in a loop, which is useful if you’re browsing the web in search for documentation (terminate with empty input). To create references from a web browser, simply configure it to calltkacz/create-reference-from-url
ortkacz/create-reference-from-html
on the Emacs daemon.- Similarly,
n f
will prompt for a file,n F
will do so in a loop. n d
will show a drop area on which you can drag and drop virtually everything, with a strong preference for URLs and PDFs.
From the list view, press <RET>
to open or focus editor view.
Tkacz classification system is made of two distinct mechanisms: taxonomies and contexts.
Taxonomies are hierarchical trees whose branches and leaves may contain entities of various types.
Contexts are branches and leaves of a taxonomy. Contexts are how Tkacz help manage huge collections of possibly unrelated entities. If you’re working on, say, your PhD in history of psychiatry, you don’t want all your computer science articles collection popping up in the list. Contexts are taxonomies, but the contract with the UI is different:
- Contexts are used as first-order filters. In the default UI,
C
is used to toggle between contexts. - When toggling back to a previous context, secondary filters are to be restored as they were.
What’s good is a personal library if you can’t find anything inside? Tkacz comes with two powerful query systems. The coolest one is a formal search syntax, the fastest one is full-text search.
Formal queries are especially useful for building collections and taxonomies. They take the following form:
((type book)
(by MichelFoucault)
(date (between 1960 1980)))
Multiple values can be searched on a single selector. Into French Theory?
((type book article)
(by GillesDeleuze JacquesDerrida JacquesLacan MichelFoucault)))
Need the complete works of someone, including books they edited?
(((author editor) PierreBourdieu))
Notice the car
of each s-expression is the field, the whole cdr
is
values.
Standard boolean operators are available, of course.
(not (and (author RobertStoller) (author RobertGreen)))
(or (date (between 1910 1930)) (date (between 1950 1965)))
Some basic capture and logic is available. You can search for a book by at least two of a group of authors by searching like this:
;; Set the original author list
(let ([authors '(AlonzoChurch KurtGödel AlanTuring)])
;; Do twice
(repeat 2
;; Capture the matched author as capt
(capture capt (by authors))
;; Remove the matched author from list before searching again
(set authors (remove capt authors))))
Just type ?
in the UI, and type some search terms. This is actually
just another formal search: Eg, searching for “popper logic” actually
generates:
((fulltext "popper" "logic"))
The ontology is the actual data model. The next section describes the type system used to implement this model.
I believe Tkacz requires a complex enough class system not to be able to rely on common OOP mechanisms — but who knows.
The Tkacz ontology is made of three big kinds of “things”:
- Entity type
- which in OOP are generally called “classes”. This
includes a root class,
Entity
, and an undefined number of children. Children may be instantiable or “abstract” (abstract is never used to describe entity types, Tkacz simply describes such types are “non instantiables”. That’s because some instantiable entity types are actually abstract (Tkacz has a type for “a book which has had multiple editions”, which is a pure abstraction which corresponds to no actual text — but is still useful) or “generic” (some higher-level entities may be instantiated instead of one if its children) - Relationships
- which are a kind of special fields which connect two
entities together. For example, being the author of a book is a
relationship, because Person and Book and are entity types.
Relationships may be unidirectional, reciprocal or imply other relationships.
From a OOP perspective, a relationship
- Fields
Tkacz is strongly typed.
- Tkacz types are constructed by composing a small set of primitive types.
- Composition is done in the form of classes. Classes have named properties and methods.
- There is an either type.
- There are references. A reference stores the identifier of another entity. References are typed.
- Methods can be overridden at instance level. This may be complicated, so could be implemented as fields with function types and a default value.
- Classes may be abstract. Abstract classes may require interfaces to become concrete.
- Properties have a visibility setting which determines if they’re exposed to the user or not. This is different from OOP’s concepts of private/public: non-exposed members are by default public.
- Classes and properties declaration include a user-readable,
localizable name and optional documentation. Eg:
(defclass BirthCertificate PublicRecord "Birth certificate A legal birth certificate, held by a Public Records office.")
This should automatically generate translation templates.
- For each Tkacz run, the type definitions are built once and for all during a bootstrapping phase. After this phase, they become read-only.
- Types are inspectable at runtime: the GUI system needs typing data to build UIs. Inspection doesn’t have to be dynamic, since at this point types are read-only. Names, documentation and types of properties, as well as hierarchy of types, have to be inspectable.
- Types are extensible after declaration, but before runtime. That is, fields may be added, or their types changed. Entity types may be created.
- String
- Integer
- Float
- Boolean
- File
- List
- Picture
- Date
Either is a rough equivalent of Haskell’s |
. It defines a sum type
which can be of any of a finite set of type. A simple example of
either
is:
(either string number)
A field of this type can be, guess what, either a string or a number.
Unlike structs, either isn’t enough to define a type, and can only be
assigned as the type of a struct’s field. Either are resolved at
struct constructor level, and don’t appear in the object itself but
are replaced by a value of the chosen type. For example, if the above
definition was the type of a field called a
, the struct object would
only contain:
(tzo/struct struct-name #:a (#:type integer #:value 1))
Could either
be used as an enum type?
(either "This" "That" 3)
With eventually a fallback/custom case?
(either "This" "That" string?)
and taking pairs to differentiate between values an UI representation?
(either ((tr "Yes") . #t) ((tr "No") . #f) ("Something else" . string?))
Entities are the essential Tkacz type. They’re defined from structs,
but unlike structs, entities are named root objects, not values.
Structs have discrete values, entities have identity. Entity names
start by an uppercase letter, and they’re defined with the
(tkacz/entity ENTITY-NAME STRUCT-NAME)
macro:
(tkacz/entity Person person)
Everything Tkacz is meant to keep information about is an entity. Informally, an entity is something with an actual existence (in a very loose sense of the word). A person is an entity, a publication’s title or date aren’t. Yet, this criterion should be understood in a quite relaxed fashion, and not as a strict requirement: it’s nice to be able to group an article, its extended reprint as a book chapter, and its translation to another language as instances of a single “thing” (the “abstract” article) to help keep track of various transformations of this document. Such a thing is an entity nonetheless, because it’s useful to consider it as one.
Taxonomies are trees. Taxonomy objects are structs with the following attributes:
Name | Default | Meaning |
---|---|---|
name | required | The name of this branch |
parent | nil | parent branch |
gender | true | whether this branch is a gender |
showempty | false | Whether to show this branch if empty |
parent
is null at the root branch of a tree.- A
gender
is a branch which contains the leaves of its children (the way, in biology, a gender is “made of” its species) showempty
hides a branch and all its subtrees if they contain no entities, and only in this case.
There are two kind of branches: standard and queries. Query branches can do two things: they can treat their result as a list of entities, or as a list of branches which each receive a result and use it on a second, standard query.
The behavior of query branches is defined by their gender
field. If
gender
is true, these branches contain their results as leaves, and
subbranches may contain other queries which refine the original query
(ie, they apply on the first result set, so subbranches are
necessarily strict subsets of their parents)
Query branches have an extra query
attribute, which holds the query.
Also, query branches:
- cannot have entities be manually added/removed.
- non-gender query branches cannot have subtrees added/removed.
- gender query branches may have a single “template” child branch,
expressing a query with a placeholder for result. Eg, a query
branch with
((type person))
could have a subbranch((type reference) (author person))
.
Could we have a query branch listing persons, then subbranches listing
their work ((author (parent-result)))
, then subsubbranches
distributing works by their types? We could call them “breakdown
branches”.
This could be done by allowing queries to act on the ontology and not only entities.
Sub-query branches shouldn’t need to access more than a single result of their parent branch.
Relationships connect entities together
Formal queries are actually small programs. They operate within a context and progressively reduce that context. Eg, this query:
(intersection
(type book)
(author RobertMusil))
Transforms to a program that restricts a global context (ie, a list of entities) to the subset of entities of type book, then reduces this subset to the entries with Robert Musil as an author.
The exact meaning of “transforms to a program” remains to be specified. It may be possible to use Racket to design a small query DSL, or we could just traverse the s-expression and convert it manually. Both approaches should be easy enough.
This is absolutely critical.
Should be easy: (in branch ...+)
We need “like”, “contains”, “starts with” and “regexp match”, etc.
The file format should be readable by a human, and git history should be easy to understand.
Git is an integral part of Tkacz’ storage subsystem, and is managed automatically. Tkacz stages changes, commits them and can optionally push them to a remote repository.
The Tkacz program is a simple server able to talk to various clients. The CLI program itself isn’t made to be used by humans, but only for programs to interact with it. The initial implementation uses s-expressions for requests and responses, becaus)e a they’re really easy to parse; and b) both the server and original client are written in Lisp.
To begin a session, the client sends:
(tkacz elisp)
Where tkacz
is the magic handshake command, and elisp
the dialect the
client wishes to be talked in. To which the server replies with the
very welcoming message:
(tkacz :protocol (0 1 0))
The client must send a db-load
or db-create
as its first request, or
the server will abort communication[fn:1].
Client writes to server’s stdin to send requests of the form:
(request ID BODY)
The client is responsible for giving each request a session-unique ID as a positive integer, as the server provides no guarantee on the order of responses[fn:2]. Replies are written to stdout, wrapped in:
(response ID BODY)
Between requests, server ignores all whitespace and triggers an error for any other input. Requests are read until a single s-expression is complete, then processed.
For debugging purposes, client/server sessions can include comments.
Lines matching ^;
are to be ignored by both parties. If non Lisp-like
syntaxes are implemented, a different comment syntax may be used. Eg,
in JSON:
{ "comment": "This is a comment and should be ignored by any party." }
In production, comments should be absolutely avoided.
(db-load PATH [READ-ONLY #f])
- load the Tkacz database at
PATH
. (db-create PATH)
- Create and load a new Tkacz database at
PATH
. (db-version)
- returns a monotonously increasing integer
describing the version of the database. It is
guaranteed that if
(db-version)
returns the same value, the database is unmodified. These values are session-specific, and aren’t usable to compare database states across sessions.
(context [context #<void>])
- activate
context
, if provided, or deactivate current context, if any.
(query query format [format-options #<void>])
- runs
query
and returns a list of results.format
is one of:'identifiers
: return entity identifiers.'description
: return formatted description of entities.'properties
: return alists of entity properties (the “raw” entity).
(list (id ...+) format format-options)
- similar to
(query)
, but takes a list of entity names instead of a query.
This should return two sets of results:
- The set of entities matching
query
. - When
(= format 'properties)
, the set of entities referenced from the first set. In other cases, the empty set.
In we apply this, we can drop the list (list)
option.
(get id)
- get contents of entity with identifier
id
. This returns an entity object. (type id)
- return the type of entity
id
. (is-a class id [strict #f])
- true if
id
is an instance ofclass
. If strict is false, true also ifid
is an instance of a subclass ofclass
.
(new type)
- prepare a new entity of type
type
, and return an editing identifier. (edit id)
- begins editing an entity for editing, and return an editing identifier.
(editor editing-id)
- return an editor description object for entity
id
. (validate editing-id field value [store #t])
- run validators to
determine if
value
is acceptable forfield
in the context ofediting-id
. Ifstore
, store the value on the editing context if it passes validation.
We probably need to clearly differentiate effectful queries.
(validate)
should be always pure, we should have a different (update)
function of same signature.
(classes)
- return the tree of entity classes, starting at Entity.
(class-inheritance class [depth 0])
- return the inheritance list
of
class
. The returned value is a list whosecar
is the immediate superclass ofclass
. Ifdepth>0
, list halts afterdepth
elements. (class-is-a a b)
- return true if b is a the same class as, or a superclass of, b.
(types)
- return the list of non-entity, non-primitive types.
(collections)
- list all available collections.
(collections-populate)
- populate collections with entities.
(collection-create name [parent #<void>] [gender #f] [query #<void>])
- create
a new collection called
name
, belowparent
,
How can a field provide qualified references to another entity? Eg, an author under a given name?
(natural-person Romain Gary
#:name "Romain Gary"
#:as ("Émile Ajar" EmileAjar))
(book
#:by ((RomainGary #:as EmileAjar)))
Eg. virtually every ancient work: (Plato, 2004) sounds weird, but we really don’t know the exact date The Sophist was written, and publication date is meaningless in the context.
- Some works don’t actually exist: Hume’s Treatise of Human Nature is
made of three different books, but some editions merges some, or
all, of these books: Create an
Entity/Document/Multipart
type.
Some works have not been originally published on papers:
- Conferences and lectures (Austin’s How to do things with words, Goodman’s Facts, Fictions, Predictions, Bourdieu’s lectures at the Collège de France…)
- Ancient works (Plato, Aristotle…)
This requires a bit of subtlety in date assignment, but could be reasonably easily solved:
- By using the
Entity/Document/Unpublished/Lecture
type to create an original instance. - By allowing fuzzy dates, or date intervals. This is better left for version 2 :)
Eg Critique of Pure Reason
To handle these cases, we may create a “virtual” entry, something like:
(virtual KRV
:title '(("Kritik der reinen Vernunft" :lang de :orig t)
("Critique de la raison pure" :lang fr))
("Critique of Pure Reason" :lang en))
Louis Marin’s Portrait of the King contains a long commentary of Jean
de la Fontaine’s The Crow and the Fox. If one wanted to take a quick
note on that (ie, express that Marin1981
, I, 2 is about the fable,
should one look up the original publication year of the whole
collection of fables, create records, etc, or could one just create a
quick draft entry on “Fables” with a single, unnumbered chapter “The
Fox and the Crown” to which Marin1981, I, 2 could point to as a topic?
The definition of an entity should include rules for sorting its instances, regardless of the way they’re rendered: Jacques Lacan should appear after Sigmund Freud.
Since entity types describe how they should be displayed, we need a rudimentary text formatting syntax, something that should be trivial to convert to any other syntax, and which could look like:
(tkacz/format-text
(sc "Bourdieu") comma "Pierre"
space "et" space
"Jean-Claude" space (sc "Passeron")
comma
(italic "La reproduction"
dot
"Éléments pour une théorie du système d'enseignement"
dot)
"Paris" colon "Les éditions de Minuit" colon "1970" fullstop)
There are two ways an entity may contain another: being contained may
be essential for the entity, or anecdotal. It is essential for a
chapter to be part of a book, it is anecdotal for a person to be part
of a collective. Maybe entities could have a standalone
property, or
something similar, which could determine whether or not they should be
displayed in the main listing.
The server could provide completion and syntax checking services for queries and similar
Queries could be able to return information about the ontology itself, and
We could consider a system of traits to further specify entities.
These traits would apply to objects, not classes. For instance,
academic position (Eg ”someone was the Roger Rabbit Professor of
Metaethics at Harvard from 1963 to 1985”) may be useful as an
attribute of a physical person; yet it isn’t meaningful for every
single person. But Academic
cannot be a subtype of NaturalPerson
,
since most such subtypes wouldn’t be mutually exclusive. We could
define traits as optional attributes of a given entity type. In this
case, the academic
trait would allow to assign a series of academic
positions to a given person.
[fn:1] Tkacz works on one, and only one, database at a time. This is not a limitation, but a design choice: instead of multiple databases, we have contexts.
[fn:2] Which doesn’t imply in any way that the server promises to work asynchronously. It would be unreasonable to write the first version as asynchronous, and it is likely that parallel computation, if ever implemented, will work on a stop-the-world-on-write mode.
[fn:3] It is hard to avoid that the format be in a specific Lisp dialect. But this dialect should not command the implementation language of the backend. This has some obvious consequences on extensibility.
[fn:4] This is an oversimplification, but it’s the BibTeX model.