Datagrowth (module)
This release moves code shared between projects into a new module, which will be split off from the repo in a future release
- Adds logs under project directory
- Abandons Conda
- Lets growth commands log to "datagrowth.command" to separate their output from task processing output
- Introduces a data dir that can be set per machine, so that data can be shared between machines
- Adds a QuerysetProcessor as a performant alternative for output processors
- Moves Resource, HttpResource and ShellResource into datagrowth
- Adds file deletion handler that can be connected to Resources
- Adds Collective.to_disk to easily use data in notebooks today
- Refactors ExtractProcessor to share code between HTML and XML
- Adds ShellResourceProcessor to execute shell commands on task servers
- Adds Tika as the first ShellResource
- Adds ibatch and datetime formatting as tools to datagrowth
- Moves all configuration to datagrowth and improves the flow of registering configuration defaults
- Migrates ImageDownload into separate models per app
- Adds TopicDetector and EntityDetector
- Fixes some problems with Wikipedia, but disables the feeds on production for now due to performance issues
- Migrates files to a structure that scales well
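
As an illustration of the "datagrowth.command" logger mentioned above, here is a minimal sketch using only Python's standard logging module. The logger name comes from this changelog. The handler, the format string and the in-memory stream (a stand-in for a log file under the project directory) are assumptions made for the example, not the project's actual configuration.

```python
import io
import logging

# Logger name taken from the changelog; everything else is illustrative.
command_log = logging.getLogger("datagrowth.command")
command_log.setLevel(logging.INFO)
# Do not pass records on to parent loggers, so command output stays
# separate from whatever handlers the task processing output uses.
command_log.propagate = False

# Assumption: an in-memory stream stands in for a real log file.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
command_log.addHandler(handler)

command_log.info("growth command started")
```

In a real project the StreamHandler would typically be replaced by a file handler configured through the project's logging settings.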
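
To give an idea of what a batching tool such as ibatch is for, the sketch below splits any iterable into fixed-size batches without loading it fully into memory. It is not datagrowth's implementation: the name ibatch_sketch and its signature are hypothetical and chosen only for this example.

```python
from itertools import islice


def ibatch_sketch(iterable, batch_size):
    """Yield lists of at most batch_size items from any iterable.

    Hypothetical helper for illustration; not the datagrowth API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice consumes the next batch_size items lazily.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(ibatch_sketch(range(7), 3))
```

Processing a large queryset or file in batches like this keeps memory use bounded, since only one batch is materialized at a time.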