
Commit cf8f836

Merge pull request #5 from Stability-AI/docs_update
Docs update
2 parents 77536b9 + de48dbc commit cf8f836

File tree: 5 files changed (+239, −15 lines)


README.md

Lines changed: 2 additions & 3 deletions

```diff
@@ -118,11 +118,10 @@ The following properties are defined in the top level of the model configuration
 - The training configuration for the model, varies based on `model_type`. Provides parameters for training as well as demos.
 
 ## Dataset config
-`stable-audio-tools` currently supports two kinds of data sources: local directories of audio files, and WebDataset datasets stored in Amazon S3.
+`stable-audio-tools` currently supports two kinds of data sources: local directories of audio files, and WebDataset datasets stored in Amazon S3. More information can be found in [the dataset config documentation](docs/datasets.md)
 
 # Todo
-- [ ] Add documentation for dataset configs
 - [ ] Add documentation for different model types
-- [ ] Add documentation on pretransforms
 - [ ] Add documentation for Gradio interface
 - [ ] Add troubleshooting section
+- [ ] Add contribution guidelines
```

docs/conditioning.md

Lines changed: 151 additions & 1 deletion

# Conditioning

Conditioning, in the context of `stable-audio-tools`, is the use of additional signals that give an extra level of control over a model's behavior. For example, we can condition the outputs of a diffusion model on a text prompt, creating a text-to-audio model.

# Conditioning types

There are a few different kinds of conditioning, depending on the conditioning signal being used.

## Cross attention

Cross attention is a type of conditioning that allows the model to find correlations between two sequences of potentially different lengths, for example between a sequence of features from a text encoder and a sequence of high-level audio features.

Signals used for cross-attention conditioning should be of the shape `[batch, sequence, channels]`.

## Global conditioning

Global conditioning is the use of a single n-dimensional tensor to provide conditioning information that pertains to the whole sequence being conditioned. For example, this could be the single embedding output of a CLAP model, or a learned class embedding.

Signals used for global conditioning should be of the shape `[batch, channels]`.

## Input concatenation

Input concatenation applies a conditioning signal that aligns with the model's input along the sequence dimension and has the same length. The conditioning signal is concatenated with the model's input data along the channel dimension. This can be used for things like inpainting information, melody conditioning, or for creating a diffusion autoencoder.

Signals used for input concatenation conditioning should be of the shape `[batch, channels, sequence]` and must be the same length as the model's input.
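To make these shapes concrete, here is a minimal PyTorch sketch (tensor names and sizes are illustrative only, not taken from the library) of what each kind of conditioning signal looks like for a batch of two items:

```python
import torch

batch, channels, sequence = 2, 768, 1024

# Cross-attention conditioning: one feature vector per token of the
# conditioning sequence, e.g. text-encoder outputs.
cross_attn_cond = torch.randn(batch, 77, channels)      # [batch, sequence, channels]

# Global conditioning: a single vector per batch item, e.g. a CLAP
# embedding or a learned class embedding.
global_cond = torch.randn(batch, channels)               # [batch, channels]

# Input-concatenation conditioning: concatenated with the model input
# along the channel dimension, so it must match the input's length.
model_input = torch.randn(batch, 64, sequence)           # [batch, channels, sequence]
concat_cond = torch.randn(batch, 8, sequence)            # [batch, channels, sequence]
conditioned_input = torch.cat([model_input, concat_cond], dim=1)
print(conditioned_input.shape)  # torch.Size([2, 72, 1024])
```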
# Conditioners and conditioning configs

`stable-audio-tools` uses Conditioner modules to translate human-readable metadata, such as text prompts or a number of seconds, into tensors that the model can take as input.

Each conditioner has a corresponding `id` that it expects to find in the conditioning dictionary provided during training or inference. Each conditioner takes in the relevant conditioning data and returns a tuple containing the corresponding tensor and a mask.

The ConditionedDiffusionModelWrapper manages the translation between the user-provided metadata dictionary (e.g. `{"prompt": "a beautiful song", "seconds_start": 22, "seconds_total": 193}`) and the dictionary of different conditioning types that the model uses (e.g. `{"cross_attn_cond": ...}`).

To apply conditioning to a model, you must provide a `conditioning` configuration in the model's config. At the moment, we only support conditioning diffusion models through the `diffusion_cond` model type.

The `conditioning` configuration should contain a `configs` array, which allows you to define multiple conditioning signals.

Each item in the `configs` array should define the `id` of the corresponding metadata, the type of conditioner to be used, and the config for that conditioner.

The `cond_dim` property enforces the same dimension on all conditioning inputs; however, it can be overridden with an explicit `output_dim` property on any of the individual configs.
## Example config
```json
"conditioning": {
    "configs": [
        {
            "id": "prompt",
            "type": "t5",
            "config": {
                "t5_model_name": "t5-base",
                "max_length": 77,
                "project_out": true
            }
        }
    ],
    "cond_dim": 768
}
```
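For intuition, the flow described above can be sketched roughly as follows (hypothetical class and variable names; this is not the library's actual Conditioner or ConditionedDiffusionModelWrapper code): each conditioner pulls the value matching its `id` out of the metadata and returns a `(tensor, mask)` pair, and those pairs are collected into the conditioning dictionary the model consumes.

```python
import typing as tp
import torch

class ToyConditioner:
    """Hypothetical conditioner: raw metadata values in, (tensor, mask) out."""

    def __init__(self, cond_id: str, output_dim: int):
        self.id = cond_id
        self.output_dim = output_dim

    def __call__(self, values: tp.List[tp.Any], device: str = "cpu") -> tp.Tuple[torch.Tensor, torch.Tensor]:
        # A real conditioner would encode `values` (text, ints, floats, ...);
        # here we just return zeros of the right shape plus an all-ones mask.
        cond = torch.zeros(len(values), 1, self.output_dim, device=device)
        mask = torch.ones(len(values), 1, dtype=torch.bool, device=device)
        return cond, mask

# Wrapper-style translation: user metadata -> per-conditioner (tensor, mask) pairs.
conditioners = {
    "prompt": ToyConditioner("prompt", 768),
    "seconds_total": ToyConditioner("seconds_total", 768),
}
metadata = [{"prompt": "a beautiful song", "seconds_total": 193}]

conditioning = {
    cond_id: conditioner([item[cond_id] for item in metadata])
    for cond_id, conditioner in conditioners.items()
}
print({k: v[0].shape for k, v in conditioning.items()})
```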
# Conditioners
56+
57+
## Text encoders
58+
59+
### `t5`
60+
This uses a frozen [T5](https://huggingface.co/docs/transformers/model_doc/t5) text encoder from the `transformers` library to encode text prompts into a sequence of text features.
61+
62+
The `t5_model_name` property determines which T5 model is loaded from the `transformers` library.
63+
64+
The `max_length` property determines the maximum number of tokens that the text encoder will take in, as well as the sequence length of the output text features.
65+
66+
If you set `enable_grad` to `true`, the T5 model will be un-frozen and saved with the model checkpoint, allowing you to fine-tune the T5 model.
67+
68+
T5 encodings are only compatible with cross attention conditioning.
69+
70+
#### Example config
71+
```json
72+
{
73+
"id": "prompt",
74+
"type": "t5",
75+
"config": {
76+
"t5_model_name": "t5-base",
77+
"max_length": 77,
78+
"project_out": true
79+
}
80+
}
81+
```
82+
83+
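As a point of reference for the shapes involved, this is roughly what a frozen T5 encoding pass looks like with the `transformers` library (a standalone sketch, not the library's actual conditioner code):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5EncoderModel.from_pretrained("t5-base").eval().requires_grad_(False)

prompts = ["a beautiful song", "rainfall on a tin roof"]
tokens = tokenizer(prompts, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    features = model(**tokens).last_hidden_state  # [batch, 77, 768] for t5-base

mask = tokens["attention_mask"].bool()            # padding mask for cross attention
print(features.shape, mask.shape)
```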
### `clap_text`
This loads the text encoder from a [CLAP](https://github.com/LAION-AI/CLAP) model, which can provide either a sequence of text features or a single multimodal text/audio embedding.

The CLAP model must be provided as a local file path, set in the `clap_ckpt_path` property, along with the correct `audio_model_type` and `enable_fusion` properties for the provided model.

If the `use_text_features` property is set to `true`, the conditioner output will be a sequence of text features instead of a single multimodal embedding. This allows more fine-grained text information to be used by the model, at the cost of losing the ability to prompt with CLAP audio embeddings.

By default, if `use_text_features` is true, the features from the last layer of the CLAP text encoder are returned. You can return the text features of earlier layers by specifying the index of the layer to return in the `feature_layer_ix` property. For example, you can return the text features of the next-to-last layer of the CLAP model by setting `feature_layer_ix` to `-2`.

If you set `enable_grad` to `true`, the CLAP model will be unfrozen and saved with the model checkpoint, allowing you to fine-tune it.

CLAP text embeddings are compatible with global conditioning and cross attention conditioning. If `use_text_features` is set to `true`, the features are not compatible with global conditioning.

#### Example config
```json
{
    "id": "prompt",
    "type": "clap_text",
    "config": {
        "clap_ckpt_path": "/path/to/clap/model.ckpt",
        "audio_model_type": "HTSAT-base",
        "enable_fusion": true,
        "use_text_features": true,
        "feature_layer_ix": -2
    }
}
```
## Number encoders

### `int`
The IntConditioner takes in a list of integers in a given range and returns a discrete learned embedding for each of those integers.

The `min_val` and `max_val` properties set the range of the embedding values. Input integers are clamped to this range.

This can be used for things like discrete timing embeddings or learned class embeddings.

Int embeddings are compatible with global conditioning and cross attention conditioning.

#### Example config
```json
{
    "id": "seconds_start",
    "type": "int",
    "config": {
        "min_val": 0,
        "max_val": 512
    }
}
```
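The mechanism is essentially an embedding-table lookup over clamped integers; a minimal sketch (hypothetical, not the library's IntConditioner implementation) looks like this:

```python
import torch
import torch.nn as nn

min_val, max_val, embed_dim = 0, 512, 768
embedder = nn.Embedding(max_val - min_val + 1, embed_dim)  # one learned vector per integer

seconds_start = torch.tensor([22, 640])           # 640 is out of range
clamped = seconds_start.clamp(min_val, max_val)   # -> [22, 512]
embeddings = embedder(clamped - min_val)          # [batch, embed_dim]
print(embeddings.shape)  # torch.Size([2, 768])
```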
### `number`
The NumberConditioner takes in a list of floats in a given range and returns a continuous Fourier embedding of the provided floats.

The `min_val` and `max_val` properties set the range of the float values. This is the range used to normalize the input float values.

Number embeddings are compatible with global conditioning and cross attention conditioning.

#### Example config
```json
{
    "id": "seconds_total",
    "type": "number",
    "config": {
        "min_val": 0,
        "max_val": 512
    }
}
```
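For intuition, a continuous Fourier embedding of a normalized scalar can be sketched as follows (an illustrative sketch only; the exact frequencies and layout used by the library's NumberConditioner may differ):

```python
import math
import torch

def fourier_embed(x, min_val=0.0, max_val=512.0, num_freqs=64):
    # Normalize the raw value into [0, 1] using the configured range.
    x = (torch.as_tensor(x, dtype=torch.float32) - min_val) / (max_val - min_val)
    # Project onto sines and cosines at a bank of frequencies.
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)  # 1, 2, 4, ...
    angles = 2 * math.pi * x[..., None] * freqs                  # [batch, num_freqs]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)       # [batch, 2 * num_freqs]

seconds_total = torch.tensor([193.0, 47.5])
emb = fourier_embed(seconds_total)
print(emb.shape)  # torch.Size([2, 128])
```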

docs/datasets.md

Lines changed: 75 additions & 1 deletion

# Datasets

`stable-audio-tools` supports loading data from local file storage, as well as loading audio files and JSON files in the [WebDataset](https://github.com/webdataset/webdataset/tree/main/webdataset) format from Amazon S3 buckets.

# Dataset configs

To specify the dataset used for training, you must provide a dataset config JSON file to `train.py`.

The dataset config consists of a `dataset_type` property specifying the type of data loader to use, a `datasets` array for providing multiple data sources, and a `random_crop` property, which determines whether the audio cropped from each training sample comes from a random position in the file or always from the beginning.

## Local audio files

To use a local directory of audio samples, set the `dataset_type` property in your dataset config to `"audio_dir"`, and provide the `datasets` property with a list of objects, each containing a `path` property set to the path of your directory of audio samples.

This will load all of the compatible audio files from the provided directory and all of its subdirectories.

### Example config
```json
{
    "dataset_type": "audio_dir",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/audio/dataset/"
        }
    ],
    "random_crop": true
}
```

## S3 WebDataset

To load audio files and related metadata from WebDataset .tar files hosted in Amazon S3 buckets, set the `dataset_type` property to `s3` and provide the `datasets` property with a list of objects, each containing the S3 path to the shared bucket prefix under which the WebDataset .tar files are stored. The S3 bucket is searched recursively from that path, and any .tar files found are assumed to contain audio files and corresponding JSON metadata files whose names differ only in file extension (e.g. "000001.flac", "000001.json", "000002.flac", "000002.json", and so on).

### Example config
```json
{
    "dataset_type": "s3",
    "datasets": [
        {
            "id": "s3-test",
            "s3_path": "s3://my-bucket/datasets/webdataset/audio/"
        }
    ],
    "random_crop": true
}
```
# Custom metadata

To customize the metadata provided to the conditioners during model training, you can provide a separate custom metadata module in the dataset config. This module should be a Python file containing a function called `get_custom_metadata` that takes two parameters, `info` and `audio`, and returns a dictionary.

For local training, the `info` parameter contains a few pieces of information about the loaded audio file, such as its path and how the audio was cropped from the original training sample. For S3 WebDataset datasets, it also contains the metadata from the related JSON files.

The `audio` parameter contains the audio sample that will be passed to the model at training time. This lets you analyze the audio for extra properties that you can then pass in as extra conditioning signals.

The dictionary returned from the `get_custom_metadata` function will have its properties added to the `metadata` object used at training time. For more information on how conditioning works, please see the [Conditioning documentation](./conditioning.md).

## Example config and custom metadata module
```json
{
    "dataset_type": "audio_dir",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/audio/dataset/"
        }
    ],
    "custom_metadata_module": "/path/to/custom_metadata.py",
    "random_crop": true
}
```

`custom_metadata.py`:
```py
def get_custom_metadata(info, audio):

    # Pass in the relative path of the audio file as the prompt
    return {"prompt": info["relpath"]}
```
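As a further illustration of using the `audio` parameter, here is a hedged sketch of a metadata module that also derives a property from the audio itself (it assumes `audio` behaves like a torch tensor; the `peak_amplitude` key is hypothetical and only matters if a conditioner with that `id` exists in your model config):

```py
def get_custom_metadata(info, audio):

    # `info` carries file/crop details (e.g. "relpath"); `audio` is the cropped
    # training excerpt. Assuming `audio` is a torch tensor, derive a simple
    # property from it and pass it along as extra conditioning metadata.
    # Returned keys are only consumed if they match a conditioner `id`.
    return {
        "prompt": info["relpath"],
        "peak_amplitude": float(audio.abs().max()),
    }
```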

stable_audio_tools/configs/dataset_configs/s3_wds_example.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@
     "datasets": [
         {
             "id": "s3-test",
-            "s3_path": "s3://my-bucket/datasets/webdatset/audio/"
+            "s3_path": "s3://my-bucket/datasets/webdataset/audio/"
         }
     ],
     "random_crop": true
```

stable_audio_tools/models/conditioners.py

Lines changed: 10 additions & 9 deletions

```diff
@@ -45,8 +45,6 @@ def __init__(self,
 
     def forward(self, ints: tp.List[int], device=None) -> tp.Any:
 
-        #self.int_embedder.to(device)
-
         ints = torch.tensor(ints).to(device)
         ints = ints.clamp(self.min_val, self.max_val)
 
@@ -94,12 +92,12 @@ def __init__(self,
                  audio_model_type="HTSAT-base",
                  enable_fusion=True,
                  project_out: bool = False,
-                 finetune: bool = False):
+                 enable_grad: bool = False):
         super().__init__(768 if use_text_features else 512, output_dim, 1, project_out=project_out)
 
         self.use_text_features = use_text_features
         self.feature_layer_ix = feature_layer_ix
-        self.finetune = finetune
+        self.enable_grad = enable_grad
 
         # Suppress logging from transformers
         previous_level = logging.root.manager.disable
@@ -111,15 +109,15 @@ def __init__(self,
 
             model = laion_clap.CLAP_Module(enable_fusion=enable_fusion, amodel=audio_model_type, device='cpu')
 
-            if self.finetune:
+            if self.enable_grad:
                 self.model = model
             else:
                 self.__dict__["model"] = model
 
             state_dict = clap_load_state_dict(clap_ckpt_path)
             self.model.model.load_state_dict(state_dict, strict=False)
 
-            if self.finetune:
+            if self.enable_grad:
                 self.model.model.text_branch.requires_grad_(True)
                 self.model.model.text_branch.train()
             else:
@@ -173,9 +171,12 @@ def __init__(self,
                  clap_ckpt_path,
                  audio_model_type="HTSAT-base",
                  enable_fusion=True,
-                 project_out: bool = False):
+                 project_out: bool = False,
+                 enable_grad: bool = False):
         super().__init__(512, output_dim, 1, project_out=project_out)
 
+        self.enable_grad = enable_grad
+
         device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 
         # Suppress logging from transformers
@@ -188,15 +189,15 @@ def __init__(self,
 
             model = laion_clap.CLAP_Module(enable_fusion=enable_fusion, amodel=audio_model_type, device='cpu')
 
-            if self.finetune:
+            if self.enable_grad:
                 self.model = model
             else:
                 self.__dict__["model"] = model
 
             state_dict = clap_load_state_dict(clap_ckpt_path)
             self.model.model.load_state_dict(state_dict, strict=False)
 
-            if self.finetune:
+            if self.enable_grad:
                 self.model.model.audio_branch.requires_grad_(True)
                 self.model.model.audio_branch.train()
             else:
```
