
Images were wrongly identified as folders when using 2_Pixie_Cluster_Pixels.ipynb #1175

Closed
rach-crc opened this issue Dec 2, 2024 · 10 comments
Labels
bug Something isn't working

Comments

rach-crc commented Dec 2, 2024

Dear developers of Ark-analysis,

I hope this message finds you well. I have encountered an issue when running pixel data preprocessing (see attached screenshots); could you please help me work out how to fix it? The call was:

pixie_preprocessing.create_pixel_matrix(
    fovs, channels, base_dir, tiff_dir, pixie_seg_dir,
    img_sub_folder=None, seg_suffix=seg_suffix,
    pixel_output_dir=pixel_output_dir, data_dir=pixel_data_dir,
    subset_dir=pixel_subset_dir,
    norm_vals_name_post_rownorm=norm_vals_name,
    is_mibitiff=True, blur_factor=blur_factor,
    subset_proportion=subset_proportion,
    multiprocess=multiprocess, batch_size=batch_size
)

which raised:

NotADirectoryError: [Errno 20] Not a directory: '/Users/xxx/Downloads/test_FOVs/image_data/3775_ROI1_Liver.tiff/'
I don't know why the algorithm identified my FOVs as folder names, as I already set img_sub_folder=None and is_mibitiff=True. When I was using the segmentation notebook I encountered a similar problem, and setting is_mibitiff=True while deleting img_sub_folder=None was enough to pass my FOVs as images instead of folder names, but this time that won't work. I am using 3 images generated from MIBI and preprocessed via toffy. The directory tree structure looks like:
User/xxx/Downloads/test_FOVs
├── image_data
├── pixie
└── segmentation

image_data/
├── 3775_ROI1_Liver.tiff
├── R5C2_Liver.tiff
└── R5C3_Liver.tiff

segmentation/
├── cell_table/
├── deepcell_input/
├── deepcell_output/
└── deepcell_visualization/

My inputs for Step 1 ("Set file paths and parameters"):
base_dir = "/Users/xxx/Downloads/test_FOVs"
tiff_dir = os.path.join(base_dir, "image_data")
img_sub_folder = None
segmentation_dir = os.path.join("segmentation", "deepcell_output")
seg_suffix = '_whole_cell.tiff'
if segmentation_dir is not None:
    pixie_seg_dir = os.path.join(base_dir, segmentation_dir)
else:
    pixie_seg_dir = None

# optionally select a specific set of fovs manually

fovs = ["3775_ROI1_Liver.tiff", "R5C2_Liver.tiff", "R5C3_Liver.tiff"]

[Screenshots attached: 2024-12-02 17:50:38, 2024-12-02 17:50:51]
rach-crc added the bug label on Dec 2, 2024
alex-l-kong commented Dec 5, 2024

@rach-crc assuming your images are in MIBItiff format, this is a bit unfortunate, since we effectively deprecated MIBItiff format support quite a while ago. We still have some legacy code left over that won't work; we'll need to remove it entirely.

You're our first MIBItiff image user in several years, so I recommend converting the MIBItiff images into individual folders of TIFFs. Here's some starter code you can use for that:

import os
import tifffile as tiff # this will require you to run "pip install tifffile" on your Terminal/Command Prompt inside the ark-analysis environment first
import numpy as np

def extract_channel_names(tags):
    """Extract channel names from MIBItiff metadata."""
    channel_names = []
    for tag in tags.values():
        if "ChannelNames" in str(tag.name):
            channel_names = tag.value.decode().split(";")  # Modify decoding/splitting as per metadata format
            break
    return channel_names

def mibitiff_to_named_channel_tiffs(input_path, output_folder):
    # Load the MIBItiff file
    with tiff.TiffFile(input_path) as tif:
        image = tif.asarray()  # Get the image as a NumPy array
        metadata_tags = tif.pages[0].tags  # Extract metadata tags
    
    # Extract channel names from metadata
    channel_names = extract_channel_names(metadata_tags)
    if not channel_names:
        print("Channel names not found in metadata, using default names.")
        channel_names = [f"channel_{i + 1}" for i in range(image.shape[0])]
    
    # Create output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)
    
    # Loop through each channel
    num_channels = image.shape[0]
    for channel_idx in range(num_channels):
        channel_image = image[channel_idx]
        channel_name = channel_names[channel_idx].replace(" ", "_")  # Clean up channel name
        output_file = os.path.join(output_folder, f"{channel_name}.tiff")
        
        # Save the channel as a separate TIFF
        tiff.imwrite(output_file, channel_image.astype(np.float32))  # Ensure correct dtype
        
        print(f"Saved: {output_file}")

# Example usage
input_tiff = "path_to_your_mibitiff_file.tiff" # you'll need to run this for each MIBItiff file
output_directory = "path_to_output_folder" # set this to your FOV name as defined in the MIBItiff file
mibitiff_to_named_channel_tiffs(input_tiff, output_directory)

This may not 100% work out of the box, since MIBItiff metadata formats are not always consistent with each other. If this does not work, or does not properly name your channels, send me an example MIBItiff file and I'll see what I can do.
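If all three files sit in image_data/, a small driver can derive each FOV's output folder from the filename before calling mibitiff_to_named_channel_tiffs above. This is a sketch; the helper name fov_output_dirs is mine, and the path is the one from this thread:

```python
import glob
import os

def fov_output_dirs(tiff_dir):
    """Map each .tiff in tiff_dir to a sibling folder named after the file
    (e.g. 3775_ROI1_Liver.tiff -> 3775_ROI1_Liver/)."""
    mapping = {}
    for input_tiff in sorted(glob.glob(os.path.join(tiff_dir, "*.tiff"))):
        fov_name = os.path.splitext(os.path.basename(input_tiff))[0]
        mapping[input_tiff] = os.path.join(tiff_dir, fov_name)
    return mapping

# Example: convert every MIBItiff into its own folder of channel TIFFs.
# for input_tiff, out_dir in fov_output_dirs("/Users/xxx/Downloads/test_FOVs/image_data").items():
#     mibitiff_to_named_channel_tiffs(input_tiff, out_dir)
```

After converting, you would presumably list the FOV folder names (without the .tiff extension) in the fovs variable.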


rach-crc commented Dec 6, 2024

Thank you, Alex and your group, for providing this solution and for developing this useful package. The code worked for my images. We are a group from Oxford Uni running some metastatic tumour samples on MIBI and trying to process the images with toffy and ark-analysis (we have never done this before). We have now encountered some new problems when using Pixie, and I wonder if you know how to fix them. Currently, we have only 3 FOVs (each from a different sample) processed through toffy to test the ark-analysis notebooks, and we treat each FOV as an individual image to move along with ark. However, we got a really weird result after visualising the metaclusters (max_k=14) in the GUI heatmaps. For example, in the 1st FOV (see attached heatmap), T cells are not clustered as "cd45+ cd3+ cd4+"; instead the metaclusters are formed like "cd4+", and there is a metacluster with both CD56 and CD20. We have thought about possible causes, and maybe you could help us check whether these assumptions are right: 1. the sample/pixel size is quite small for SOM training; 2. the blur factor of 2 is not good enough; 3. max_k needs to be increased to see how things look and then adjusted manually. Moreover, for defining an optimal blur_factor and evaluating reproducibility with different seed numbers, the original paper https://www.nature.com/articles/s41467-023-40068-5 uses an index called the consistency score. We think this could be a useful approach, but we did not find this function in the Pixie notebook; do you know how to calculate such a score, either after or inside the Pixie notebook? Lastly, is there a more detailed tutorial/video on how to use the GUI in Pixie?

[Screenshot attached: 2024-12-06 14:19:29]

alex-l-kong commented Dec 10, 2024

@rach-crc I'll tag @cliu72 on this thread. We have chosen not to include the consistency score in the Pixie pipeline, but she'll be able to provide more concrete guidance on how to implement and use it.

For the remaining issues, I think the first thing to try is to increase the num_passes param for the SOM training. By default, it's set to 1, but because you don't have many FOVs, it's likely that your model is underfitting, which may explain why you're not seeing the metaclusters you want. The num_passes param is in the call to pixel_som_clustering.train_pixel_som. 10 passes tends to be a pretty standard setting from previous cohorts.

You could also try increasing the number of consensus clusters, generally most people have started off with 20 or 30.

Most of the work ends up getting done in the metacluster remapping interface that you posted in the screenshot, however. Even after you try everything, it's very unlikely that the initial meta cluster separation will be 100% the way you want. This interface allows you to combine and separate out metaclusters at both a SOM and meta cluster level. You can check out the documentation in the notebook for how to use it.

From prior experience, it's unlikely that the blur factor will play a significant role.

Let us know how this goes.

cliu72 commented Dec 10, 2024

@rach-crc I believe I responded to you over email, but reproduced here for anybody else who may have similar questions:

Depending on the resolution of your imaging, it is likely expected that clusters have single-expression of markers or co-expression of "weird" markers. Since we are clustering at the pixel level, and each pixel is much smaller than a cell (again, this may depend on your imaging resolution), we would expect some pixels to only express single markers. So for a T cell, one pixel may express CD3 and a neighboring pixel may express CD4 and another neighboring pixel may express CD3 and CD4 – but when you aggregate these pixels at the cell-level, it’s more “obvious” that it’s a T cell.

A cluster with both CD56+ and CD20+ could be the pixels at the edge of an NK cell and a B cell that are right next to each other. Or these could just be noisy pixels. Depending on how aggressively you used Rosetta to denoise your images in toffy, you may get more of these “noisy” pixels. In our experience, it’s totally fine to have a cluster of pixels you label as “noise.” The idea is that once you get to the cell level, these “noise” pixels will be drowned out by the real signal (so, for example, you’ll have 90% annotated pixels and 10% noise pixels in a cell). The best way to check whether these pixels are a true cluster or “noise” is to look at your MIBI images with the Pixie clusters overlaid (our lab likes to use Mantis Viewer for this – I included a video tutorial link below for Mantis, but napari is also a popular option).
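As a toy illustration of why noise pixels wash out at the cell level (this is not Pixie's actual cell-clustering algorithm, which runs FlowSOM on each cell's pixel-cluster composition; it just shows the majority effect):

```python
from collections import Counter

# Pixel-cluster labels for the pixels falling inside one segmented cell.
pixels_in_cell = ["CD3", "CD4", "CD3_CD4", "CD3", "noise", "CD3", "CD4"]

# Aggregated to the cell level, the composition is dominated by T-cell
# clusters; the lone "noise" pixel contributes little.
composition = Counter(pixels_in_cell)
noise_fraction = composition["noise"] / len(pixels_in_cell)
print(composition.most_common(1)[0][0])  # CD3 dominates
print(round(noise_fraction, 2))          # 0.14
```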

Answers to your specific questions:

  1. The sample/pixel size may currently be too small for effective FlowSOM training.

To adjust the number of pixels you use for training, you can change the “subset_proportion” variable in the notebook. If you change it to 1, you will use all the pixels in your dataset for training. For large datasets with thousands of FOVs, RAM is the limiting factor for using all the pixels for training, but at your dataset size, it should be fine. Even for a small dataset, Pixie will just use all the information provided to do the clustering and should be able to identify the phenotypes. So even if you have just 1 or 2 images, if the staining looks good and you can visually see the T cells, B cells, etc., Pixie should be able to pull them out. It is important to note, though, that Pixie expects the data to be cleaned up beforehand, so you should try to increase your signal-to-noise as much as possible before using Pixie.

  2. The blur factor (set to 2) might not be appropriate for our dataset.

This definitely depends on the resolution of your imaging. You can definitely change the blur factor and see how the results change. Again, I recommend using Mantis Viewer (or napari) in conjunction with Pixie to visually look at your results.

  3. The max_k parameter may need to be increased to refine clustering results and allow for manual adjustment.

The max_k parameter shouldn’t make a huge difference. It only controls the number of metaclusters you see on the right-hand side of those heatmaps, which will likely change once you do some manual adjustments. The parameters you may want to change are xdim and ydim in pixel_som_clustering.train_pixel_som – they are not listed in the notebook, but are parameters you can add (see source code here). The defaults are 10 and 10, which results in 100 clusters (10 x 10). If you feel like phenotypes are not being captured in the 100 clusters (e.g. very rare phenotypes), you can increase to 20 and 10, which would give you 200 clusters.
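Collecting the tuning knobs mentioned in this thread (num_passes from the earlier reply, plus xdim/ydim here) as a sketch; confirm these parameter names against the current ark source before use:

```python
# Extra keyword arguments for SOM training discussed in this thread,
# collected as a dict to pass into the notebook's training call.
som_kwargs = {
    "num_passes": 10,  # default is 1; more passes help small cohorts train fully
    "xdim": 20,        # SOM grid width  (default 10)
    "ydim": 10,        # SOM grid height (default 10)
}
# The grid size determines the number of SOM clusters:
print(som_kwargs["xdim"] * som_kwargs["ydim"])  # 200
```

Usage would look roughly like pixel_som_clustering.train_pixel_som(..., **som_kwargs) in the notebook cell.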

For the cluster consistency score, the reason we don’t include it in the ark package is that we don’t actually recommend that all users run it. Because it requires multiple rounds of clustering and pairwise comparisons for each run, especially for large datasets, the runtime can get extremely long (and it also takes significant compute power). The point of the cluster consistency score in the paper is to show that the method is robust, but we don’t think it’s necessary to run it for every dataset. If you are interested in running it for your own dataset, the code is available here (also linked in the paper): https://github.com/angelolab/publications/tree/main/2022-Liu_etal_Pixie (the scripts specifically for the cluster consistency score are 5_cluster_consistency_score_pixel.R and 6_cluster_consistency_score_pixel.py). You may need to tweak the code a bit because the version of Pixie used to generate the figures in the paper is different from what is currently offered in ark, but it should be relatively straightforward.

For tutorials, here are some videos from our MIBI Workshop that may help (all videos can be found here: https://www.angelolab.com/mibi-workshop-2022):
General Pixie talk: https://www.youtube.com/watch?v=e7C1NvaPLaY&t=132s
Applying to your own data: https://www.youtube.com/watch?v=e7C1NvaPLaY&t=2684s
Using Mantis: https://www.youtube.com/watch?v=4_AJxrxPYlk&t=4639s

@rach-crc

Thanks, adjusting the SOM training proportion really helped. It seems the poor results generated previously were largely caused by my small sample size.

@rach-crc

Thank you so much for the informative reply and for developing such a useful tool. It helped a lot with understanding how Pixie works. I adjusted the SOM training proportion and removed functional markers from my panel, and the metaclusters looked a lot better, needing only a few manual adjustments. But I have one more question: when opening the pixel cluster and raw image overlays in Mantis Viewer, the instructions in the README.md file suggest loading a file called "marker_counts.csv", but I did not find it in the Mantis folder created by the Pixie pipeline. Is there another step to generate it?
[Screenshot attached: 2024-12-11 19:48:14]


cliu72 commented Dec 12, 2024

Hi @rach-crc, the marker_counts.csv file is essentially a cell table, in which each row is a cell and each column is some feature such as the average expression of all the markers in your panel. This table is generated in Mesmer (but it's not named marker_counts.csv, default should be cell_table_size_normalized.csv). At the pixel-level, this kind of table does not exist and is not supported by Mantis (since there are waaaaaay more pixels than cells, this table would be much too large, and also not super informative). At the cell-level, Pixie does generate this table (all it does is append the cell labels onto the Mesmer-generated cell table) - should be named {cell_table_path}_cell_labels.csv (see Section 4.5 in 3_Pixie_Cluster_Cells.ipynb). Hope this helps.
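Conceptually, appending the Pixie cell labels onto a Mesmer-style cell table is a keyed join. Here is a toy pandas sketch; the column names are illustrative, not ark's exact schema:

```python
import pandas as pd

# Toy Mesmer-style cell table: one row per segmented cell.
cell_table = pd.DataFrame({
    "fov": ["fov1", "fov1"],
    "label": [1, 2],
    "CD3": [0.82, 0.05],
})

# Toy Pixie cell-cluster assignments for the same cells.
cell_clusters = pd.DataFrame({
    "fov": ["fov1", "fov1"],
    "label": [1, 2],
    "cell_meta_cluster": ["T cell", "B cell"],
})

# The *_cell_labels.csv table is conceptually this join.
labeled = cell_table.merge(cell_clusters, on=["fov", "label"])
print(labeled["cell_meta_cluster"].tolist())  # ['T cell', 'B cell']
```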

@rach-crc

Thank you for the explanation. One small thing I'd like to point out is that when running 3_Pixie_Cluster_Cells.ipynb there was a "length mismatch" error when generating cluster_counts_name and cluster_counts_size_norm_name. I read your previous answers to this sort of error, which is due to both "nuclear" and "whole_cell" masks being stored in the cell_table file, and I fixed the error by extracting the "whole_cell" masks only. However, I used nuclear_counts=False in the segmentation notebook and still got nuclear masks in the cell_table; I don't know if this is a bug. Meanwhile, I believe Pixie does not have a function to "drop" noise pixels, right? But perhaps we could separate any weird clusters manually with the post-clustering notebook? (For example, there were pixels expressing both CD3 and CD56, and in the cell clustering notebook they were clustered with other SOM clusters that mainly express CD56 only.)
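In case it helps others hitting the same length mismatch, the fix described above (keeping only the whole-cell rows) might look like this pandas sketch. How the two mask types are distinguished varies by ark version, so inspect your cell table first; the "mask_type" column here is hypothetical:

```python
import pandas as pd

# Toy cell table containing both mask types; "mask_type" stands in for
# however your table actually distinguishes whole-cell from nuclear rows.
cell_table = pd.DataFrame({
    "fov": ["fov1"] * 4,
    "label": [1, 1, 2, 2],
    "mask_type": ["whole_cell", "nuclear", "whole_cell", "nuclear"],
})

# Keep only the whole-cell rows before running the cell-clustering notebook.
whole_cell_only = cell_table[cell_table["mask_type"] == "whole_cell"].reset_index(drop=True)
print(len(whole_cell_only))  # 2
```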


cliu72 commented Dec 23, 2024

Hi @rach-crc, regarding the segmentation issue - can you double check if you have updated the ark package to the latest version? This was an issue we have encountered previously, but had fixed: #1097. Reminder that after you "git pull" to update the Github repo, you must install the latest version with "pip install ."

Regarding "dropping" noise pixels - for most use cases, we didn't feel the need to drop these clusters because as you said, they come out in the wash when you get to the cell clustering stage, so it doesn't cause an issue. In some cases, it was even helpful to see that those pixel clusters were at cell boundaries. In some cases, where we really did want to drop some clusters (because it was making cell clustering worse), we manually dropped the columns we didn't want to include, and that has been working well for us since it's not very complicated to do. But in our experience, in most cases, leaving in the "noise" pixels didn't cause any issues (and in fact, could help us troubleshoot any weird cell clustering).
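A minimal sketch of that manual column drop, on a toy per-cell cluster-counts table (the column names are illustrative, not ark's exact schema):

```python
import pandas as pd

# Toy per-cell pixel-cluster count table: one count column per pixel metacluster.
cluster_counts = pd.DataFrame({
    "fov": ["fov1", "fov1"],
    "label": [1, 2],
    "pixel_meta_cluster_Tcell": [40, 3],
    "pixel_meta_cluster_noise": [5, 4],  # the cluster you decided is noise
})

# Drop the noise cluster's column so it cannot influence cell clustering.
cleaned = cluster_counts.drop(columns=["pixel_meta_cluster_noise"])
print(list(cleaned.columns))
```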


rach-crc commented Jan 2, 2025


Thanks for the informative answers. You can close this issue now.

cliu72 closed this as completed on Jan 4, 2025