
GSoC_2020_project_usability

Gil edited this page Feb 3, 2020 · 6 revisions

Improving the user experience

Let's make it easier for people to use and develop Shogun. For users, we would like to cover: user API & pipelining, parameter defaults and descriptions, exception handling, documentation & examples. In a second step, for scientists/developers, we would like to cover: plugin architecture, internal API, simplification.

Mentors

Difficulty & Requirements

Medium. The biggest challenge of this project is its vast scope: you will touch a lot of Shogun's internals, both framework and ML code. Good planning is required! Make sure that you pick a number of interesting topics in your application and show that you have a really good idea of what you want to achieve, ideally backed by proof-of-concepts.

You need to know

  • C++ software design (not just fast for loops!), Python
  • How Shogun's interfaces work, aka SWIG.
  • Shogun's new parameter framework, aka tags
  • Exception handling
  • Machine Learning basics
  • Other ML libraries (with good APIs)

First steps

Write user stories! Those are pseudo-code examples of how the user interacts with the library, and what that should look like. Your application needs to contain a couple of those (and ideas on how to make them happen inside Shogun). Cover all the topics you want to address (see below), and try to be as precise as possible. See also below for more details.

Details

Here are some sub-projects. We are open to more.

NOTE: A GSoC project will address multiple (or ideally all) of these topics.

Base API design

We would like to put effort into cleaning and re-designing the current user API. That is, the API that is accessed through SWIG and that is exposed via our examples, not the internal C++ API. As a motivation, have a look at e.g. our very basic learner class: CMachine. Observe how many methods there are, and how confusing this must seem to new users. Your task is to simplify this. This will include: renaming existing classes and methods, adding new methods, re-factoring existing classes, maybe even adding new classes. Most importantly, we want to make the new API very minimal.

Steps:

  • Have a look at these notes for some initial ideas and examples.
  • Write a few complete user stories (API usage example, see the notes for some examples) for common ML cases. This requires some research: what are fundamental ML tasks that should definitely be covered, how do other libraries do it?
  • Turn your insights on what the API should look like into a summary: a class diagram, for example.
  • Come up with a set of API changes in Shogun required to serve the user stories. This will include adding/renaming/removing.
  • Work incrementally, one "use-case" at a time.

Topics to cover:

  • Clean up our learner base class CMachine, and make it follow the de-facto standard of fit/predict, see here for some ongoing work.
  • Remove all "casting" methods from Shogun, they are not needed anymore since we have tags. I.e. remove all ::obtain_from_generic calls, remove apply_regression, apply_binary etc.
  • Give the new Converter/Transformer classes some love: write examples/notebooks, use them in pipelines, see what happens, fix the errors. Make using them easy!
  • Implement some of the operator chaining ideas from the notes. This is a bigger topic and will require some code logic that implements the ideas. Could take a few weeks but results in a really cool improvement.
  • Remove all copy methods (that create a copy of an instance), but rather implement copy constructors and rely on clone for deep copies.
  • Add methods to change the interface for algorithms that support multiple APIs (see below).
  • There is way more here to do, but you get the idea :)
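The fit/predict convention and the chaining idea from the topics above can be illustrated with a small, self-contained sketch. Every class and method name here (Transformer, Pipeline, then, MeanCenter, SignClassifier) is hypothetical and made up for illustration. None of this is Shogun's actual or planned API; it only shows the shape a user story could take.

```python
# Hypothetical toy classes for illustration only; not Shogun's API.


class Transformer:
    """Minimal transformer: learns from data in fit, maps data in transform."""

    def fit(self, X):
        return self

    def transform(self, X):
        raise NotImplementedError

    def then(self, next_stage):
        # Chaining: a transformer followed by another stage yields a pipeline.
        return Pipeline([self, next_stage])


class Pipeline:
    """Treats every stage but the last as a transformer, the last as a learner."""

    def __init__(self, stages):
        self.stages = list(stages)

    def then(self, next_stage):
        return Pipeline(self.stages + [next_stage])

    def fit(self, X, y):
        for stage in self.stages[:-1]:
            X = stage.fit(X).transform(X)
        self.stages[-1].fit(X, y)
        return self

    def predict(self, X):
        for stage in self.stages[:-1]:
            X = stage.transform(X)
        return self.stages[-1].predict(X)


class MeanCenter(Transformer):
    """Subtracts the training mean from 1-D data."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Predicts +1 for non-negative inputs and -1 otherwise; fit is a no-op."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if x >= 0 else -1 for x in X]


# User story: centre the data, then classify, all through fit/predict.
pipeline = MeanCenter().then(SignClassifier())
pipeline.fit([1.0, 2.0, 3.0, 6.0], [-1, -1, 1, 1])
predictions = pipeline.predict([0.0, 5.0])
```

The point of the sketch is that the user only ever sees fit, predict, and one chaining call; everything else stays internal.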

An example of a clean GMM API using as_*:

gmm = sg.GMM()
gmm.algorithm = "split_and_merge_em" # choose the training algorithm ...
gmm.algorithm = "em" # ... or, alternatively, plain EM
gmm.fit(features)
gmm.predict(features_test) # returns discrete labels, classification
gmm.as_classifier().predict(features_test) # same as above
gmm.as_distribution().predict(features_test) # returns the log-probability for each component for each data point
gmm.as_distribution().as_mixture().get_component(idx) # returns a Gaussian component
gmm.as_distribution().sample(100) # returns 100 samples from the mixture
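One possible way to realise such as_* calls is to return thin view objects that wrap the same underlying model. The following is a minimal sketch under that assumption, using a toy 1-D mixture with fixed parameters and no fitting. ToyMixture, ClassifierView, and DistributionView are hypothetical names; they are not part of Shogun.

```python
import math
import random


class ToyMixture:
    """Hypothetical 1-D Gaussian mixture with fixed parameters (no fitting)."""

    def __init__(self, weights, means, stds):
        self.weights, self.means, self.stds = weights, means, stds

    def log_density(self, x, k):
        # log of weight_k * N(x | mean_k, std_k^2)
        z = (x - self.means[k]) / self.stds[k]
        return (math.log(self.weights[k]) - math.log(self.stds[k])
                - 0.5 * math.log(2 * math.pi) - 0.5 * z * z)

    def as_classifier(self):
        return ClassifierView(self)

    def as_distribution(self):
        return DistributionView(self)


class ClassifierView:
    """predict returns the index of the most likely component per point."""

    def __init__(self, model):
        self._model = model

    def predict(self, X):
        components = range(len(self._model.weights))
        labels = []
        for x in X:
            scores = [self._model.log_density(x, k) for k in components]
            labels.append(scores.index(max(scores)))
        return labels


class DistributionView:
    """predict returns weighted log-densities; sample draws from the mixture."""

    def __init__(self, model):
        self._model = model

    def predict(self, X):
        components = range(len(self._model.weights))
        return [[self._model.log_density(x, k) for k in components] for x in X]

    def sample(self, n, seed=None):
        rng = random.Random(seed)
        model = self._model
        samples = []
        for _ in range(n):
            # Pick a component by weight, then draw from its Gaussian.
            k = rng.choices(range(len(model.weights)), weights=model.weights)[0]
            samples.append(rng.gauss(model.means[k], model.stds[k]))
        return samples


gmm = ToyMixture(weights=[0.5, 0.5], means=[-2.0, 2.0], stds=[1.0, 1.0])
labels = gmm.as_classifier().predict([-1.5, 3.0])
log_probs = gmm.as_distribution().predict([-1.5, 3.0])
samples = gmm.as_distribution().sample(100, seed=0)
```

Because each view holds a reference to the model rather than a copy, switching interfaces costs nothing and needs no casting.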

Improve internal API

Shogun's internal API is very messy, as it was written through several generations of C++ standards. It is time to start cleaning this up with modern C++ (currently we are aiming to bring everything to C++17).

  • Address compiler warnings
  • Improve internal API to access/mutate data
  • Replace raw pointers with containers and unique_ptr/shared_ptr
  • Replace verbose code with STL algorithms
  • Use move semantics
  • Remove unnecessary constructors and add useful ones, following the rule of zero, rule of three, and rule of five
  • Use range-based for loops and zip iterators to improve code readability
  • If there is time, start working towards C++20 in a separate branch, e.g. replace SFINAE with concepts, use coroutines, explore how modules could be used.
  • Start working towards C++20 stackless coroutines with C++17 compatible stackful coroutines (for example using tconcurrent)

Exception handling

Currently, Shogun's exception handling is not ideal. The same ShogunException is thrown for every error, and in some languages it causes the program to exit. Our error messages are sometimes good (if the developer was motivated), and sometimes quite bad: they don't tell the user what she did wrong. This part of the project is to introduce a small set of exceptions (e.g. NotConverged or InvalidState) and populate Shogun with them, so that the following would be possible in user code:

try:
   svm.train()
except shogun.NotConverged:
   ...
except shogun.InvalidState:
   ...

The next step is to connect them to the SWIG interfaces. Some initial work has been done as part of last year's GSoC project by Wuwei.
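To make the idea of a small exception set concrete, here is a pure-Python sketch of what the hierarchy and its use could look like from the user's side. The base class ShogunError, the constructor arguments, and ToySolver are assumptions made for illustration. Only the names NotConverged and InvalidState come from the text above, and nothing here reflects how Shogun or its SWIG bindings actually define them.

```python
class ShogunError(Exception):
    """Hypothetical common base so users can catch everything at once."""


class NotConverged(ShogunError):
    """Raised when an iterative solver hits its iteration limit."""

    def __init__(self, iterations, residual):
        super().__init__(
            f"solver stopped after {iterations} iterations with residual "
            f"{residual:.3g}; try raising max_iterations or the tolerance"
        )
        self.iterations = iterations
        self.residual = residual


class InvalidState(ShogunError):
    """Raised when a method is called before its prerequisites are met."""


class ToySolver:
    """Toy stand-in for a learner: computes sqrt(target) by Newton iteration."""

    def __init__(self, max_iterations=50, tolerance=1e-9):
        self.max_iterations = max_iterations
        self.tolerance = tolerance
        self.target = None

    def set_target(self, value):
        self.target = value  # assumed to be a positive number

    def train(self):
        if self.target is None:
            raise InvalidState(
                "train() was called before set_target(); set a positive target first"
            )
        x = float(self.target)
        residual = float("inf")
        for _ in range(self.max_iterations):
            x = 0.5 * (x + self.target / x)
            residual = abs(x * x - self.target)
            if residual < self.tolerance:
                return x
        raise NotConverged(self.max_iterations, residual)


def run(solver):
    """Shows how user code can react differently to each failure mode."""
    try:
        return "ok", solver.train()
    except NotConverged as error:
        return "not converged", str(error)
    except InvalidState as error:
        return "invalid state", str(error)
```

Note that each message says what went wrong and what the user can do about it, which is the quality bar the paragraph above asks for.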

API example coverage

We would like to see all of Shogun's API covered in the meta examples (which also means they are integration tested). We currently lack examples (and cookbooks) for

  • StringFeatures (see here for some initial work)
  • Fast SVMs in Shogun
  • Dimensionality reduction
  • many more ...

This project will involve writing at least 2-3 cookbooks per week (other projects need 2 examples without a cookbook), to increase coverage.

Parameters, defaults, and documentation

Machine Learning algorithms crucially depend on well-chosen parameters. While you can tune them automatically with Shogun (takes long though), a user sometimes simply wants to run an algorithm out of the box. Therefore, Shogun's default parameters should be sensibly chosen by the people who know what they are doing: the developers. Furthermore, users might be interested in what the parameters do, so we need to make it easy to read their descriptions without opening a web-browser.

In this part of the project, you will

  • Make sure (i.e. test with real-world examples) that the default parameters of Shogun are well-chosen. Compare the choices to other libraries. One that does a particularly good job is sklearn.
  • We recently added an automatic parameter deduction mechanism using runtime data, for example using the median heuristic for SVMs. We would like to extend this.
  • Implement a nice way to expose parameter documentation (currently done via doxygen, see the API) at runtime. This is likely to be done via the tags framework. We could for example see a Python script that reads parameter documents in tags and then makes sure they appear in the doxygen API. Example: help(svm) (we have that already, but it needs polish), help(svm.C). This should also work for all target interfaces. There is some work around parameter descriptions being done here.
  • Update @brief descriptions of Shogun's algorithms (some are good, some are completely missing).
  • Add nice eye candy for interface languages, for example code completion for IPython (example).
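As a rough illustration of keeping defaults and descriptions next to each other and readable at runtime, consider the sketch below. It is plain Python and does not use Shogun's tags framework; Parameter, DescribedModel, ToySVM, describe, and the default values and description strings are all hypothetical and chosen only for the example.

```python
class Parameter:
    """Pairs a default value with a human-readable description."""

    def __init__(self, default, description):
        self.default = default
        self.description = description


class DescribedModel:
    """Hypothetical base class: subclasses list their parameters once."""

    _parameters = {}  # name -> Parameter

    def __init__(self, **overrides):
        unknown = sorted(set(overrides) - set(self._parameters))
        if unknown:
            raise ValueError(
                f"unknown parameter(s) {unknown}; "
                f"valid names are {sorted(self._parameters)}"
            )
        # Start from the documented defaults, then apply user overrides.
        self._values = {name: p.default for name, p in self._parameters.items()}
        self._values.update(overrides)

    def get(self, name):
        return self._values[name]

    @classmethod
    def describe(cls, name=None):
        """Returns the description of one parameter, or of all of them."""
        names = [name] if name is not None else sorted(cls._parameters)
        lines = []
        for key in names:
            parameter = cls._parameters[key]
            lines.append(
                f"{key} (default {parameter.default!r}): {parameter.description}"
            )
        return "\n".join(lines)


class ToySVM(DescribedModel):
    # Illustrative values only; not Shogun's actual defaults.
    _parameters = {
        "C": Parameter(1.0, "Regularisation constant; larger values fit the training data more closely."),
        "epsilon": Parameter(1e-5, "Stopping tolerance of the solver."),
    }


svm = ToySVM(C=10.0)
summary = ToySVM.describe()
```

With this layout the same strings could feed both a runtime help call and generated API documentation, so the two cannot drift apart.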

Optional

  • We would like to integrate / merge all existing sources of documentation into a single one: cookbooks, API, parameter docs should all use the same content in order to improve maintainability (TODO: explain this better)

Anything else that sucks about using Shogun? Put it in here :)

Is this project for you?

You like thinking about API design? You like things to be neat? You enjoy exploring existing code-bases? You like to have an impact on Shogun?

Why this is cool

This project will massively improve Shogun's usability, and therefore has a potentially significant impact on the project's user-base. You will get exposed to a lot of Shogun's internals and have a say in design decisions that will shape the face of Shogun.

Useful resources
