Tkacz: Design

This document describes yet another instantiation of my attempts to create a sane reference manager that could actually improve rather than degrade my bibliography/PDF management workflows. This is only a project for now, yet a Gitter chat has been created for people interested in providing suggestions or feedback.

https://badges.gitter.im/thblt/tkacz.svg

Introduction: doing two things and doing them well

Why yet another reference manager? The first goal of Tkacz is to be usable for very large bibliographies in the field of historical sciences, with a strong focus on contemporary history of science, although it shouldn’t be restricted to a specific field or research domain. Large bibliographies need good organization capabilities, history requires a data model powerful enough to cleanly represent complex or unusual documents and the relationships between them, and all research requires open, documented and easily usable formats that don’t tie their users to a particular piece of software.

Tkacz tries to satisfy these needs by addressing the shortcomings of existing document managers and reference managers:

Document managers usually have very powerful classification, ordering and annotation features, but a very poor data model, which makes it hard to actually meaningfully and formally describe documents by providing formal bibliographic references.

Most document management software also assumes documents are somehow unique and sui generis, whereas historical sources tend to have significant histories in themselves: they may be reprinted, republished, collected and updated, and all of this matters, both for the history of the document and its availability: knowing where to find an obscure article is important. In the “hard” sciences, the most complicated (significant) thing that can happen to an article is to be withdrawn, and all you’re expected to do in this case is not to cite it anymore; but for historical research, withdrawals, reprints, modifications, reuse, etc., are often significant and need to be represented.

It is also assumed that documents are files, usually PDFs. But it’s often the case that documentation is scattered between different libraries, printed and electronic copies, that not everything has a PDF and not every PDF is complete.

Reference managers make it easy to reference documents you don’t actually own (and often make it the default), and usually have a powerful (or extensible) enough data model to accommodate a lot of information, but they usually fall very short when it comes to classification. Beyond that, the (potential) power of the data model is not necessarily reflected in all UIs, and some changes that are theoretically possible in the data model will make the data actually unusable. For instance, handling authors not as strings, but as objects is theoretically feasible in the BibTeX format; but such files would be unusable by the Bib(La)TeX family of programs.

The goal of Tkacz is to “do two things, and do them well”: to serve as a document manager with the full power of a bibliographical data manager. That means:

  • Nicely deal with complex or unusual entities: manuscripts, unpublished works, posthumous publications, etc. This means a more complex data model than most reference managers provide, and if possible an extensible data model.
  • Prefer objects over strings for qualified properties. Authors, for example: attribution is important, and attribution can be tricky; it’s thus better that the author field takes a series of person objects, optionally parameterized, instead of a list of strings. This helps solve three problems: a) it allows keeping notes on persons as well, which can be quickly retrieved and are connected to this person’s works; b) it nicely handles cases where a given person published under multiple names, or published anonymous works, making it easy to present the complete list of someone’s works, regardless of the name they were published under; c) it also, obviously, makes typos much easier to detect and correct.
  • Provide advanced classification mechanisms: hierarchical or flat collections, tags, etc. Because we use objects rather than strings, a good deal of classification can be automated: works by this author, works in this journal, etc. Classification also allows considering objects as topics or more generally classes: it is not uncommon, e.g. in philosophy or literature, that the topic of a book is the author of other books. Friedrich Nietzsche is the author of the Genealogy of Morals, and the topic of Heidegger’s Nietzsche.
  • Provide just as powerful annotation features.
  • Express relationships between entities: an article and a reprint, or a withdrawal, a book and its editions, a multipart work and its parts, etc.
  • Help represent the history of references: an article may have been published in a journal, then reprinted multiple times in different books. Where do I look for Canguilhem’s 1966 article “Mort de l’homme ou épuisement du cogito?”, if I don’t have access to the original journal? Has it been translated into English, if I need to cite it in an English article, and where to find the translation?
  • Make it hard to introduce inconsistencies or typos in the database. Journals, like persons, are objects: if the object describes the publishing rate, then no reference could contain a volume/issue number inconsistent with publication date without at least generating a warning.
  • Create usable bibliographic databases for use with (La)TeX (and other reference managers). The data model outlined above is way too expressive for what BibTeX, BibLaTeX or virtually any bibliography manager can handle, so Tkacz needs to be able to build .bib files on demand from a given subset of its database.
  • Rather than supporting “file attachments”, provide a general mechanism to describe copies of a reference, some of which can be files, but would also include things like: it’s in this library at this call number; it’s in my personal library, I have a printout in this box, etc.
  • Make the database history easy to browse, review, and understand. Nothing frightens me more than undocumented and non-versioned binary databases with quirky UIs that make it hard to get a grasp of what’s going on (yes, Papers, I’m looking at you). The database will be in a Lisp-like syntax (think XML with parentheses), with Git integration out of the box. A change = a commit, with a meaningful commit message. This leaves the user free to rebase, reorder or squash commits before pushing, and should make it trivial to keep a perfectly clean history.

This is a design document for Tkacz, which should work as a specification for both the user interface and the implementation.

User interface

When started with M-x tkacz RET, Tkacz shows a list of all references it has in store. It can also show a list of any other type of entities: to do so, press e, then select the entity type you want. There are three by default: references (r), persons (p), and journals (j).

By default, entities are displayed in so-called natural format; they can also be shown in tabulated format by pressing ===.

Working with references

Creating references

There are multiple ways to create new references:

  • Press n n in the references view to display an input form where you can manually fill fields. This is the most tedious way, and should generally be avoided.
  • n u will prompt for a URL, then do its best to build a reference out of it. If possible, it will assimilate the associated PDF as a copy of the reference. n U does the same in a loop, which is useful if you’re browsing the web in search of documentation (terminate with empty input). To create references from a web browser, simply configure it to call tkacz/create-reference-from-url or tkacz/create-reference-from-html on the Emacs daemon.
  • Similarly, n f will prompt for a file, n F will do so in a loop.
  • n d will show a drop area on which you can drag and drop virtually everything, with a strong preference for URLs and PDFs.

Viewing and editing references

From the list view, press <RET> to open or focus the editor view.

Organizing references

Tkacz’s classification system is made of two distinct mechanisms: taxonomies and contexts.

Taxonomies

Taxonomies are hierarchical trees whose branches and leaves may contain entities of various types.

Contexts

Contexts are branches and leaves of a taxonomy. Contexts are how Tkacz helps manage huge collections of possibly unrelated entities. If you’re working on, say, your PhD in history of psychiatry, you don’t want your whole collection of computer science articles popping up in the list. Contexts are taxonomies, but the contract with the UI is different:

  • Contexts are used as first-order filters. In the default UI, C is used to toggle between contexts.
  • When toggling back to a previous context, secondary filters are to be restored as they were.

Relationships

Querying the database

What good is a personal library if you can’t find anything inside? Tkacz comes with two powerful query systems. The coolest one is a formal search syntax, the fastest one is full-text search.

Formal queries

Formal queries are especially useful for building collections and taxonomies. They take the following form:

((type book)
 (by MichelFoucault)
 (date (between 1960 1980)))

Multiple values can be searched on a single selector. Into French Theory?

((type book article)
 (by GillesDeleuze JacquesDerrida JacquesLacan MichelFoucault))

Need the complete works of someone, including books they edited?

(((author editor) PierreBourdieu))

Notice the car of each s-expression is the field, the whole cdr is values.

Standard boolean operators are available, of course.

(not (and (author RobertStoller) (author RobertGreen)))
(or (date (between 1910 1930)) (date (between 1950 1965)))

Some basic capture and logic is available. You can search for a book by at least two of a group of authors by searching like this:

;; Set the original author list
(let ([authors '(AlonzoChurch KurtGödel AlanTuring)])
    ;; Do twice
  (repeat 2
          ;; Capture the matched author as capt
          (capture capt (by authors))
          ;; Remove the matched author from list before searching again
          (set authors (remove capt authors))))
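
The capture-and-remove logic above amounts to asking whether at least two distinct members of the author list appear among a reference's authors. Here is a minimal Python sketch of that reading, assuming references are plain dictionaries with an `authors` key. The representation and the name `by_at_least` are illustrative only, not part of Tkacz.

```python
def by_at_least(reference, candidates, count=2):
    """True if at least `count` distinct candidates are authors of `reference`."""
    remaining = list(candidates)              # the original author list
    for _ in range(count):                    # mirrors (repeat 2 ...)
        # "capture": pick one remaining candidate found among the authors
        captured = next(
            (a for a in remaining if a in reference["authors"]), None
        )
        if captured is None:
            return False
        remaining.remove(captured)            # drop it before searching again
    return True


# Hypothetical sample data, for illustration only.
refs = [
    {"title": "A", "authors": ["AlonzoChurch", "AlanTuring"]},
    {"title": "B", "authors": ["KurtGödel"]},
]
group = ["AlonzoChurch", "KurtGödel", "AlanTuring"]
matches = [r["title"] for r in refs if by_at_least(r, group)]
```

Only reference "A" has two authors from the group, so it is the only one kept.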

Full-text search

Just type ? in the UI, and type some search terms. This is just another formal search: searching for “popper logic”, for example, generates:

((fulltext "popper" "logic"))

Standard ontology

The ontology is the actual data model. The next section describes the type system used to implement this model.

images/EntityHierarchy.png

Implementing the ontology

I believe Tkacz requires a complex enough class system not to be able to rely on common OOP mechanisms — but who knows.

The Tkacz ontology is made of three big kinds of “things”:

Entity type
which in OOP are generally called “classes”. This includes a root class, Entity, and an undefined number of children. Children may be instantiable or “abstract” (abstract is never used to describe entity types; Tkacz simply describes such types as “non-instantiable”. That’s because some instantiable entity types are actually abstract (Tkacz has a type for “a book which has had multiple editions”, which is a pure abstraction which corresponds to no actual text, but is still useful) or “generic” (some higher-level entities may be instantiated instead of one of their children)).
Relationships
which are a special kind of field connecting two entities together. For example, being the author of a book is a relationship, because Person and Book are entity types.

Relationships may be unidirectional, reciprocal or imply other relationships.

From a OOP perspective, a relationship

Fields

The type system

Tkacz is strongly typed.

  • Tkacz types are constructed by composing a small set of primitive types.
  • Composition is done in the form of classes. Classes have named properties and methods.
  • There is an either type.
  • There are references. A reference stores the identifier of another entity. References are typed.
  • Methods can be overridden at instance level. This may be complicated, so could be implemented as fields with function types and a default value.
  • Classes may be abstract. Abstract classes may require interfaces to become concrete.
  • Properties have a visibility setting which determines if they’re exposed to the user or not. This is different from OOP’s concepts of private/public: non-exposed members are by default public.
  • Classes and properties declaration include a user-readable, localizable name and optional documentation. Eg:
    (defclass BirthCertificate PublicRecord
      "Birth certificate
    
    A legal birth certificate, held by a Public Records office.")
        

    This should automatically generate translation templates.

  • For each Tkacz run, the type definitions are built once and for all during a bootstrapping phase. After this phase, they become read-only.
  • Types are inspectable at runtime: the GUI system needs typing data to build UIs. Inspection doesn’t have to be dynamic, since at this point types are read-only. Names, documentation and types of properties, as well as hierarchy of types, have to be inspectable.
  • Types are extensible after declaration, but before runtime. That is, fields may be added, or their types changed. Entity types may be created.

Primitive types

  • String
  • Integer
  • Float
  • Boolean
  • File
  • List
  • Picture
  • Date

Alternatives (either)

Either is a rough equivalent of Haskell’s |. It defines a sum type which can be any of a finite set of types. A simple example of either is:

(either string number)

A field of this type can be, guess what, either a string or a number. Unlike structs, either isn’t enough to define a type, and can only be assigned as the type of a struct’s field. Either types are resolved at struct constructor level, and don’t appear in the object itself but are replaced by a value of the chosen type. For example, if the above definition was the type of a field called a, the struct object would only contain:

(tzo/struct struct-name #:a (#:type integer #:value 1))
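
Constructor-level resolution can be sketched as follows. This is only an illustration in Python, assuming a hypothetical mapping from Tkacz primitive names to Python types and a hypothetical `resolve_either` helper; the actual constructor is not specified in this document.

```python
# Hypothetical mapping from Tkacz primitive type names to Python types.
PRIMITIVES = {"string": str, "integer": int, "float": float, "boolean": bool}


def resolve_either(alternatives, value):
    """Return the tagged value for the first alternative that `value` matches."""
    for name in alternatives:
        # Exact type check: bool is a subclass of int in Python, and the two
        # must not be confused here.
        if type(value) is PRIMITIVES[name]:
            return {"type": name, "value": value}
    raise TypeError(f"{value!r} matches none of {alternatives}")


# The either itself disappears: only the chosen type and the value remain.
field_a = resolve_either(["string", "integer"], 1)
```

The stored field keeps the chosen type next to the value, as in the struct object shown above.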

Use-case for either is missing

Either is also an enum

Could either be used as an enum type?

(either "This" "That" 3)

With eventually a fallback/custom case?

(either "This" "That" string?)

and taking pairs to differentiate between values and their UI representation?

(either ((tr "Yes") . #t) ((tr "No") . #f) ("Something else" . string?))

Entities

Entities are the essential Tkacz type. They’re defined from structs, but unlike structs, entities are named root objects, not values. Structs have discrete values, entities have identity. Entity names start with an uppercase letter, and they’re defined with the (tkacz/entity ENTITY-NAME STRUCT-NAME) macro:

(tkacz/entity Person person)

Everything Tkacz is meant to keep information about is an entity. Informally, an entity is something with an actual existence (in a very loose sense of the word). A person is an entity, a publication’s title or date aren’t. Yet, this criterion should be understood in a quite relaxed fashion, and not as a strict requirement: it’s nice to be able to group an article, its extended reprint as a book chapter, and its translation to another language as instances of a single “thing” (the “abstract” article) to help keep track of various transformations of this document. Such a thing is an entity nonetheless, because it’s useful to consider it as one.

Taxonomies

Taxonomies are trees. Taxonomy objects are structs with the following attributes:

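| Name      | Default  | Meaning                             |
|-----------+----------+-------------------------------------|
| name      | required | The name of this branch             |
| parent    | nil      | Parent branch                       |
| gender    | true     | Whether this branch is a gender     |
| showempty | false    | Whether to show this branch if empty |
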
  • parent is null at the root branch of a tree.
  • A gender (in the sense of a biological genus) is a branch which contains the leaves of its children, the way a genus is “made of” its species.
  • When showempty is false, a branch and all its subtrees are hidden if they contain no entities, and only in this case.
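
The four attributes above can be sketched as a small tree structure. This Python version is illustrative only. The `leaves` and `children` fields and the method names are assumptions added to make the example self-contained, and it assumes a gender branch collects leaves recursively through any gender children.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parents
class Branch:
    name: str                              # required
    parent: Optional["Branch"] = None      # None at the root of a tree
    gender: bool = True                    # contains the leaves of its children
    showempty: bool = False                # show this branch even when empty
    leaves: list = field(default_factory=list)      # assumed: direct entities
    children: list = field(default_factory=list)    # assumed: sub-branches

    def __post_init__(self):
        if self.parent is not None:
            self.parent.children.append(self)

    def entities(self):
        """Direct leaves, plus the children's entities for a gender branch."""
        found = list(self.leaves)
        if self.gender:
            for child in self.children:
                found.extend(child.entities())
        return found

    def is_empty(self):
        return not self.leaves and all(c.is_empty() for c in self.children)

    def visible(self):
        """Hidden only when showempty is false and the whole subtree is empty."""
        return self.showempty or not self.is_empty()


root = Branch("History")
psychiatry = Branch("Psychiatry", parent=root)
psychiatry.leaves.append("entity-1")       # hypothetical entity identifier
unused = Branch("Unused", parent=root)
```

Here `root.entities()` includes `entity-1` because `root` is a gender, while `unused` stays hidden until it receives an entity or has `showempty` set.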

There are two kinds of branches: standard and queries. Query branches can do two things: they can treat their result as a list of entities, or as a list of branches which each receive a result and use it on a second, standard query.

Standard branches

Query branches

The behavior of query branches is defined by their gender field. If gender is true, these branches contain their results as leaves, and subbranches may contain other queries which refine the original query (ie, they apply on the first result set, so subbranches are necessarily strict subsets of their parents)

Query branches have an extra query attribute, which holds the query.

Also:

  • Query branches cannot have entities manually added or removed.
  • Non-gender query branches cannot have subtrees added or removed.
  • Gender query branches may have a single “template” child branch, expressing a query with a placeholder for the result. Eg, a query branch with ((type person)) could have a subbranch ((type reference) (author person)).

Breakdown branches?

Could we have a query branch listing persons, then subbranches listing their work ((author (parent-result))), then subsubbranches distributing works by their types? We could call them “breakdown branches”.

This could be done by allowing queries to act on the ontology and not only entities.

Note to self

Sub-query branches shouldn’t need to access more than a single result of their parent branch.

Relationships

Relationships connect entities together

Query language

Formal queries are actually small programs. They operate within a context and progressively reduce that context. Eg, this query:

(intersection
 (type book)
 (author RobertMusil))

Transforms to a program that restricts a global context (ie, a list of entities) to the subset of entities of type book, then reduces this subset to the entries with Robert Musil as an author.

The exact meaning of “transforms to a program” remains to be specified. It may be possible to use Racket to design a small query DSL, or we could just traverse the s-expression and convert it manually. Both approaches should be easy enough.
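
The manual-traversal option could look like the following sketch, written in Python for illustration rather than in Racket. It assumes queries are given as nested tuples and entities as dictionaries. The `run` and `matches` names, and the treatment of `or` and `not`, are assumptions made for this example.

```python
def matches(entity, field, values):
    """True if any of `values` appears in the entity's `field`."""
    actual = entity.get(field, [])
    if not isinstance(actual, list):
        actual = [actual]
    return any(v in actual for v in values)


def run(query, context):
    """Reduce `context` (a list of entity dicts) according to `query`."""
    head, *args = query
    if head == "intersection":
        for sub in args:                  # each sub-query narrows the context
            context = run(sub, context)
        return context
    if head == "or":
        kept = [run(sub, context) for sub in args]
        return [e for e in context if any(e in k for k in kept)]
    if head == "not":
        excluded = run(args[0], context)
        return [e for e in context if e not in excluded]
    # Otherwise the head is a field selector and the rest are accepted values.
    return [e for e in context if matches(e, head, args)]


# Hypothetical global context, for illustration only.
library = [
    {"id": "e1", "type": "book", "author": ["RobertMusil"]},
    {"id": "e2", "type": "article", "author": ["RobertMusil"]},
    {"id": "e3", "type": "book", "author": ["MichelFoucault"]},
]
query = ("intersection", ("type", "book"), ("author", "RobertMusil"))
result = [e["id"] for e in run(query, library)]
```

Each step receives the context left by the previous one, so the order of sub-queries changes the work done but not the result.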

[#A] Searching for relationships

This is absolutely critical.

How do we search for, eg, people who wrote books?

How do we restrict search to a given taxonomic branch?

Should be easy: (in branch ...+)

How do we negate search terms?

Need specification for searching text fields

We need “like”, “contains”, “starts with” and “regexp match”, etc.

File format

The file format should be readable by a human, and git history should be easy to understand.

Git support

Git is an integral part of Tkacz’ storage subsystem, and is managed automatically. Tkacz stages changes, commits them and can optionally push them to a remote repository.

Client/server protocol

The Tkacz program is a simple server able to talk to various clients. The CLI program itself isn’t made to be used by humans, but only for programs to interact with it. The initial implementation uses s-expressions for requests and responses, because a) they’re really easy to parse; and b) both the server and original client are written in Lisp.

Sessions

Starting a session

To begin a session, the client sends:

(tkacz elisp)

Where tkacz is the magic handshake command, and elisp the dialect in which the client wishes to be addressed. To which the server replies with the very welcoming message:

(tkacz :protocol (0 1 0))

The client must send a db-load or db-create as its first request, or the server will abort communication[fn:1].

Requests and responses

Client writes to server’s stdin to send requests of the form:

(request ID BODY)

The client is responsible for giving each request a session-unique ID as a positive integer, as the server provides no guarantee on the order of responses[fn:2]. Replies are written to stdout, wrapped in:

(response ID BODY)

Between requests, server ignores all whitespace and triggers an error for any other input. Requests are read until a single s-expression is complete, then processed.
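
The framing rules above can be sketched as follows. This Python sketch is illustrative only: it reads from a string rather than stdin, treats every atom as a bare string (no string literals or comments), and the `read_sexp` and `serve` names are assumptions.

```python
def read_sexp(text, pos=0):
    """Parse one s-expression starting at text[pos]; return (value, next_pos)."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        raise EOFError("incomplete s-expression")
    if text[pos] == "(":
        items, pos = [], pos + 1
        while True:
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos >= len(text):
                raise EOFError("incomplete s-expression")
            if text[pos] == ")":
                return items, pos + 1
            item, pos = read_sexp(text, pos)
            items.append(item)
    if text[pos] == ")":
        raise ValueError("unexpected ')'")
    start = pos
    while pos < len(text) and not text[pos].isspace() and text[pos] not in "()":
        pos += 1
    return text[start:pos], pos


def serve(stream_text, handler):
    """Yield one (response ID BODY) string per (request ID BODY) read."""
    pos = 0
    while stream_text[pos:].strip():      # whitespace between requests is ignored
        form, pos = read_sexp(stream_text, pos)
        if not (isinstance(form, list) and len(form) == 3 and form[0] == "request"):
            raise ValueError(f"not a request: {form!r}")
        _, request_id, body = form
        yield f"(response {request_id} {handler(body)})"


# Toy handler that echoes the command name back as the response body.
replies = list(serve("(request 1 (db-version))\n  (request 2 (classes))",
                     lambda body: body[0]))
```

Each reply carries the ID of the request it answers, which is what lets a client match them up if responses arrive out of order.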

Comments

For debugging purposes, client/server sessions can include comments. Lines matching ^; are to be ignored by both parties. If non Lisp-like syntaxes are implemented, a different comment syntax may be used. Eg, in JSON:

{ "comment": "This is a comment and should be ignored by any party." }

In production, comments should be absolutely avoided.

Commands

Database operations

(db-load PATH [READ-ONLY #f])
load the Tkacz database at PATH.
(db-create PATH)
Create and load a new Tkacz database at PATH.
(db-version)
returns a monotonically increasing integer describing the version of the database. It is guaranteed that if (db-version) returns the same value, the database is unmodified. These values are session-specific, and aren’t usable to compare database states across sessions.

Contexts

(context [context #<void>])
activate context, if provided, or deactivate current context, if any.

Querying

(query query format [format-options #<void>])
runs query and returns a list of results. format is one of:
  • 'identifiers: return entity identifiers.
  • 'description: return formatted description of entities.
  • 'properties: return alists of entity properties (the “raw” entity).
(list (id ...+) format format-options)
similar to (query), but takes a list of entity names instead of a query.

Returning two sets

This should return two sets of results:

  1. The set of entities matching query.
  2. When (= format 'properties), the set of entities referenced from the first set. In other cases, the empty set.

If we apply this, we can drop the (list) command.

Reading entities

(get id)
get contents of entity with identifier id. This returns an entity object.
(type id)
return the type of entity id.
(is-a class id [strict #f])
true if id is an instance of class. If strict is false, true also if id is an instance of a subclass of class.

Creating and modifying entities

Accessing entities for writing

(new type)
prepare a new entity of type type, and return an editing identifier.
(edit id)
begin editing an entity, and return an editing identifier.

Editing entities

(editor editing-id)
return an editor description object for the entity being edited under editing-id.
(validate editing-id field value [store #t])
run validators to determine if value is acceptable for field in the context of editing-id. If store, store the value on the editing context if it passes validation.
(validate) is half-pure

We probably need to clearly differentiate effectful queries. (validate) should be always pure, we should have a different (update) function of same signature.
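
The proposed split could look like the sketch below, in Python for illustration. It assumes an editing context is a dictionary holding per-field validator functions and stored values; that representation is hypothetical.

```python
def validate(editing_context, field, value):
    """Pure: report whether `value` is acceptable; never modifies the context."""
    validator = editing_context["validators"].get(field)
    return validator is not None and validator(value)


def update(editing_context, field, value):
    """Effectful: store `value` on the editing context only if it validates."""
    ok = validate(editing_context, field, value)
    if ok:
        editing_context["values"][field] = value
    return ok


# Hypothetical editing context with a single validated field.
ctx = {
    "validators": {"year": lambda v: isinstance(v, int) and v > 0},
    "values": {},
}
checked = validate(ctx, "year", 1966)     # leaves ctx["values"] untouched
stored = update(ctx, "year", 1966)        # writes the value
```

Keeping both functions on the same signature means a client can check input as the user types with `validate`, then commit it with `update`.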

Reading the ontology

(classes)
return the tree of entity classes, starting at Entity.
(class-inheritance class [depth 0])
return the inheritance list of class. The returned value is a list whose car is the immediate superclass of class. If depth>0, list halts after depth elements.
(class-is-a a b)
return true if b is the same class as, or a superclass of, a.
(types)
return the list of non-entity, non-primitive types.
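
The two inheritance commands can be sketched over a simple parent map. This Python example is illustrative: the class names appear elsewhere in this document, but the hierarchy shown is assumed for the example and is not the actual Tkacz ontology.

```python
# Hypothetical ontology fragment: each class maps to its immediate superclass.
PARENTS = {
    "Entity": None,
    "Person": "Entity",
    "NaturalPerson": "Person",
    "Document": "Entity",
    "Book": "Document",
}


def class_inheritance(cls, depth=0):
    """Superclasses of `cls`, nearest first; at most `depth` of them if depth > 0."""
    chain = []
    parent = PARENTS[cls]
    while parent is not None and (depth == 0 or len(chain) < depth):
        chain.append(parent)
        parent = PARENTS[parent]
    return chain


def class_is_a(a, b):
    """True if `b` is the same class as `a`, or one of its superclasses."""
    return a == b or b in class_inheritance(a)


chain = class_inheritance("NaturalPerson")
```

With this map, `class_is_a("Book", "Entity")` holds while `class_is_a("Entity", "Book")` does not.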

Working with collections

(collections)
list all available collections.
(collections-populate)
populate collections with entities.
(collection-create name [parent #<void>] [gender #f] [query #<void>])
create a new collection called name, below parent,

Unresolved issues

For first beta

[#A] Qualified references

How can a field provide qualified references to another entity? Eg, an author under a given name?

(natural-person RomainGary
                #:name "Romain Gary"
                #:as ("Émile Ajar" EmileAjar))

(book
      #:by ((RomainGary #:as EmileAjar)))

References without a known original publication date

Eg. virtually every ancient work: (Plato, 2004) sounds weird, but we really don’t know the exact date The Sophist was written, and publication date is meaningless in the context.

«Abstract» references and «virtual» works

Multipart works

  • Some works don’t actually exist: Hume’s Treatise of Human Nature is made of three different books, but some editions merge some, or all, of these books: Create an Entity/Document/Multipart type.

Non-published works

Some works were not originally published on paper:

  1. Conferences and lectures (Austin’s How to Do Things with Words, Goodman’s Fact, Fiction, and Forecast, Bourdieu’s lectures at the Collège de France…)
  2. Ancient works (Plato, Aristotle…)

This requires a bit of subtlety in date assignment, but could be reasonably easily solved:

  1. By using the Entity/Document/Unpublished/Lecture type to create an original instance.
  2. By allowing fuzzy dates, or date intervals. This is better left for version 2 :)

Works with multiple, different, editions

Eg Critique of Pure Reason

To handle these cases, we may create a “virtual” entry, something like:

(virtual KRV
         :title '(("Kritik der reinen Vernunft" :lang de :orig t)
                  ("Critique de la raison pure" :lang fr)
                  ("Critique of Pure Reason" :lang en)))

La Fontaine’s fables

Louis Marin’s Portrait of the King contains a long commentary on Jean de la Fontaine’s The Crow and the Fox. If one wanted to take a quick note on that (ie, express that Marin1981, I, 2 is about the fable), should one look up the original publication year of the whole collection of fables, create records, etc, or could one just create a quick draft entry on “Fables” with a single, unnumbered chapter “The Crow and the Fox” to which Marin1981, I, 2 could point as a topic?

Sorting

The definition of an entity should include rules for sorting its instances, regardless of the way they’re rendered: Jacques Lacan should appear after Sigmund Freud.
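
One way to read this is that each entity type carries a sort key separate from its display form. The Python sketch below assumes a hypothetical `Person` with `given` and `family` fields; neither the field names nor the methods come from the Tkacz ontology.

```python
from dataclasses import dataclass


@dataclass
class Person:
    given: str
    family: str

    def display(self):
        """How the person is rendered in listings."""
        return f"{self.given} {self.family}"

    def sort_key(self):
        """Sort on the family name first, whatever display() shows."""
        return (self.family.casefold(), self.given.casefold())


people = [Person("Jacques", "Lacan"), Person("Sigmund", "Freud")]
ordered = [p.display() for p in sorted(people, key=Person.sort_key)]
```

Sorting on the rendered string would put “Jacques Lacan” first; sorting on the key puts Freud before Lacan.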

Text formatting syntax

Since entity types describe how they should be displayed, we need a rudimentary text formatting syntax, something that should be trivial to convert to any other syntax, and which could look like:

(tkacz/format-text
 (sc "Bourdieu") comma "Pierre"
 space "et" space
 "Jean-Claude" space (sc "Passeron")
 comma
 (italic "La reproduction"
         dot
         "Éléments pour une théorie du système d'enseignement"
         dot)
 "Paris" colon "Les éditions de Minuit" colon "1970" fullstop)

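Converting such a tree to another syntax can be sketched as a short recursive walk. This Python example is illustrative only: it assumes the tree is given as nested tuples, marks bare symbols such as `comma` with a hypothetical `Sym` class, renders small caps as upper case and italics with asterisks, and the spacing chosen for each punctuation symbol is a guess.

```python
class Sym(str):
    """A bare symbol such as `comma`, as opposed to a string literal."""


# Assumed rendering of each punctuation symbol.
PUNCTUATION = {
    "comma": ", ", "space": " ", "dot": ". ", "colon": ": ", "fullstop": ".",
}


def render(node):
    """Render a formatting tree to lightly marked-up plain text."""
    if isinstance(node, Sym):
        return PUNCTUATION[node]
    if isinstance(node, str):
        return node
    head, *body = node
    inner = "".join(render(child) for child in body)
    if head == "sc":
        return inner.upper()          # small caps approximated by upper case
    if head == "italic":
        return f"*{inner}*"
    if head == "text":                # plain grouping node
        return inner
    raise ValueError(f"unknown formatter: {head!r}")


comma, space = Sym("comma"), Sym("space")
tree = ("text", ("sc", "Bourdieu"), comma, "Pierre", space, "et", space,
        "Jean-Claude", space, ("sc", "Passeron"))
line = render(tree)
```

Targeting another output syntax only requires swapping the `sc` and `italic` cases and the punctuation table.
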
Entities within entities

There are two ways an entity may contain another: being contained may be essential for the entity, or anecdotal. It is essential for a chapter to be part of a book, it is anecdotal for a person to be part of a collective. Maybe entities could have a standalone property, or something similar, which could determine whether or not they should be displayed in the main listing.

After first beta

Completion on queries

The server could provide completion and syntax checking services for queries and similar

Ontology introspection from the query system

Queries could be able to return information about the ontology itself, and

Traits on entity objects

We could consider a system of traits to further specify entities. These traits would apply to objects, not classes. For instance, academic position (eg “someone was the Roger Rabbit Professor of Metaethics at Harvard from 1963 to 1985”) may be useful as an attribute of a natural person; yet it isn’t meaningful for every single person. But Academic cannot be a subtype of NaturalPerson, since most such subtypes wouldn’t be mutually exclusive. We could define traits as optional attributes of a given entity type. In this case, the academic trait would allow assigning a series of academic positions to a given person.

Footnotes

[fn:1] Tkacz works on one, and only one, database at a time. This is not a limitation, but a design choice: instead of multiple databases, we have contexts.

[fn:2] Which doesn’t imply in any way that the server promises to work asynchronously. It would be unreasonable to write the first version as asynchronous, and it is likely that parallel computation, if ever implemented, will work in a stop-the-world-on-write mode.

[fn:3] It is hard to avoid that the format be in a specific Lisp dialect. But this dialect should not command the implementation language of the backend. This has some obvious consequences on extensibility.

[fn:4] This is an oversimplification, but it’s the BibTeX model.