
Migration to v2 streamer 🚧 #32

Open
mar-muel opened this issue Feb 26, 2021 · 1 comment

mar-muel commented Feb 26, 2021

This is more of a collection of issues rather than a single one ;)

As discussed, I think it would be best to first create two new Heroku environments. For this I would first rename the current Heroku project to something like crowdbreaks-v1 and then name the new one crowdbreaks. Then I would create a staging and a production environment for the new project, add the same Heroku add-ons as in crowdbreaks-v1, and deploy the current Crowdbreaks code. After the whole migration is finished, you can simply point the DNS (in Namecheap) to the new Heroku instances and delete crowdbreaks-v1.

The major changes between v1 and v2 are:

  • Removal of the Flask API. On the Rails side, the code for communication with Flask can be found in app/controllers/apis_controller.rb, so most changes will happen there.
  • Streaming is now fully containerized, and a restart of the stream is automatically triggered on config changes.
  • The streamer no longer contains any tweet ID queues, which means the selection of tweets for public annotation has to be done differently. A Redis-based priority queue used to hold a pool of 1000 recent tweets, with the priority equalling the number of times they had been annotated. One solution is to store public annotations on Elasticsearch by adding a field annotations to the ES mapping which contains an array of annotations. Then we can query Elasticsearch without the need for any priority queues.
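As a hedged sketch of that idea, the new mapping could look like this. The inner field names (`user_id`, `label`, `created_at`) and the use of a `nested` type are my assumptions, not the actual Crowdbreaks schema:

```ruby
# Hypothetical mapping change: an `annotations` field holding an array of
# annotation objects. All inner field names are placeholders.
ANNOTATIONS_MAPPING = {
  properties: {
    annotations: {
      type: 'nested', # lets us query individual annotation objects
      properties: {
        user_id:    { type: 'keyword' },
        label:      { type: 'keyword' },
        created_at: { type: 'date' }
      }
    }
  }
}.freeze

# With the elasticsearch-ruby client this could be applied roughly as:
#   client.indices.put_mapping(index: 'project-xy', body: ANNOTATIONS_MAPPING)
```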

The migration involves the following (listed by importance):

  • Starting/stopping the stream code. This involved an API call to Flask; it can probably now be done easily via an AWS API call.
  • Modifying project configs code. This involved an API call to Flask; now simply updating a JSON file on S3 will be sufficient. The streamer will then restart automatically.
  • Visualizations, i.e. query sentiment trend/predictions code. Here we make an API call to Flask, which then queries Elasticsearch (see here). The idea is now to do this directly from Rails, which will involve rewriting the Elasticsearch query in Ruby.
  • Get tweet in stream mode code. Again, a Flask API call is made to fetch a new tweet for annotation from the priority queue. Since the queues were removed, we should instead query Elasticsearch directly to retrieve a recent tweet which has not been annotated by the requesting user and has not yet been annotated more than 3 times. Any sort of uncertainty sampling criterion could be added here, but this would be experimental; see query strategies in active learning.
  • Communication with Sagemaker (e.g. list models, update models, or predict for the ML playground interface). The idea here was to keep a list of endpoints for each project. Each project would have a list of active endpoints (endpoints which are used for inference) as well as one endpoint which is marked as primary (considered the currently best model). For now these are FastText models. Because constantly querying the Sagemaker API is problematic, the endpoint info is cached as JSON in the projects table in Rails. As above, Rails communicates with Sagemaker through Flask (see Flask code here), which means this logic will have to be implemented on the Ruby side.
  • Daily/weekly status emails: The status reports are compiled on Rails and triggered by the Heroku scheduler. Here we make a call to Flask to get counts of tweets collected per day/week. We could probably get this information from AWS Kinesis via an API call. Otherwise, this feature could also be discontinued.
  • Stream monitoring (interface): Useful for getting an idea of the throughput. Again, this could be done via direct calls to Elasticsearch from Rails instead of through Flask, but it definitely isn't high priority.
  • Managing Elasticsearch indices (interface): Not really necessary anymore. When creating a new project, an ES index is now automatically created on the streamer side (currently one has to do this manually).
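For the config-update step, a minimal Ruby sketch, assuming the streamer watches a JSON file on S3 and restarts on change. The bucket name, key, and config fields below are placeholders, not the real layout:

```ruby
require 'json'

# Build the project stream config as JSON. The fields here (slug, keywords,
# lang) are invented placeholders for whatever the v2 streamer actually expects.
def build_stream_config(slug:, keywords:, lang: ['en'])
  JSON.pretty_generate({ slug: slug, keywords: keywords, lang: lang })
end

config = build_stream_config(slug: 'project-xy', keywords: %w[flu vaccine])

# With the aws-sdk-s3 gem, the upload that triggers the automatic restart
# could then look roughly like:
#   Aws::S3::Client.new.put_object(bucket: 'crowdbreaks-config',
#                                  key: 'stream-config.json', body: config)
```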
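For the visualization step, the sentiment-trend query could be expressed as a plain Ruby hash and sent with the elasticsearch-ruby client. This is a sketch under the assumption that tweet documents carry a `created_at` date field and a `meta.sentiment.label` keyword field; both names are guesses about the mapping:

```ruby
# Aggregate sentiment labels per day (or another calendar interval).
# Field names are assumptions about the ES mapping, not the real schema.
def sentiment_trend_query(interval: 'day')
  {
    size: 0, # we only need the aggregations, not the hits
    aggs: {
      trend: {
        date_histogram: { field: 'created_at', calendar_interval: interval },
        aggs: { sentiment: { terms: { field: 'meta.sentiment.label' } } }
      }
    }
  }
end

# Usage with elasticsearch-ruby (sketch):
#   client.search(index: 'project-xy', body: sentiment_trend_query)
```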
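For the "get tweet in stream mode" step, the queue-free selection could become a single Elasticsearch query. This sketch assumes the proposed annotations field plus a denormalized `annotations_count` integer kept on each tweet document (my addition, to avoid counting nested objects at query time):

```ruby
# Find one recent tweet that the requesting user has not annotated and that
# has fewer than `max_annotations` annotations overall. Field names are
# assumptions about the mapping.
def annotation_candidate_query(user_id, max_annotations: 3)
  {
    size: 1,
    sort: [{ created_at: { order: 'desc' } }],
    query: {
      bool: {
        filter: [
          # fewer than `max_annotations` annotations so far
          { range: { annotations_count: { lt: max_annotations } } }
        ],
        must_not: [
          # exclude tweets the requesting user has already annotated
          { nested: {
            path: 'annotations',
            query: { term: { 'annotations.user_id' => user_id } }
          } }
        ]
      }
    }
  }
end
```

An uncertainty-sampling variant could replace the `sort` clause with a score based on model confidence, but as noted above that would be experimental.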
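For the Sagemaker step, reading the cached endpoint JSON from the projects table and picking the primary endpoint might look like this on the Ruby side. The JSON shape (`endpoints`, `primary`, `active` keys) is invented for illustration:

```ruby
require 'json'

# Return the primary endpoint from the cached endpoint info, falling back to
# the first active one. Key names are guesses about the cached JSON.
def primary_endpoint(cached_json)
  endpoints = JSON.parse(cached_json)['endpoints'] || []
  endpoints.find { |e| e['primary'] } || endpoints.find { |e| e['active'] }
end

# Example with a made-up cache value:
cache = '{"endpoints":[{"name":"fasttext-v1","active":true},' \
        '{"name":"fasttext-v2","active":true,"primary":true}]}'
```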

Note that some code in ApisController is a remnant of a distant past ;) and can be removed. This includes update_sentiment_map and any visualizations for the COVID-19 project (StreamGraphKeywords, trending topics, trending tweets, and the controller actions get_stream_graph_data and get_stream_graph_keywords_data). Don't forget to also clean up the respective routes for these actions.

@utanashati
Contributor

Thanks Martin!

Sagemaker predictions are currently handled in a processing lambda in v2. Before sending the data to ES, if there are Sagemaker endpoints in the project config, it invokes the endpoints. There are rare internal server errors.
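For reference, the per-record invocation could be sketched in Ruby like this. The real lambda's code isn't shown here, so the request shape is an assumption; `Aws::SageMakerRuntime::Client#invoke_endpoint` is the relevant SDK call:

```ruby
require 'json'

# Build one invocation request per configured endpoint for a tweet. The
# payload shape ({ text: ... }) is a placeholder, not the real contract.
def build_invocations(tweet_text, endpoint_names)
  endpoint_names.map do |name|
    { endpoint_name: name,
      body: { text: tweet_text }.to_json,
      content_type: 'application/json' }
  end
end

# Each request could then be sent with aws-sdk-sagemakerruntime, wrapped in a
# small retry to absorb the rare internal server errors:
#   Aws::SageMakerRuntime::Client.new.invoke_endpoint(**params)
```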
