Docker wrapper #1

joshmoore · 2017-05-06T12:31:56Z

Coding began at the January meeting in Dundee. Now with features generated for idr0013 and idr0012, it's time to try generating the similarity matrix.

cc: @jkh1 @dominikl @manics

joshmoore · 2017-05-06T12:43:06Z

Building locally now. Once working, I'll get this on docker hub.

joshmoore · 2017-05-06T12:52:18Z

[jamoore@idr0-slot3 tables]$ bash similarity.sh
Loading required package: methods
'data.frame':   16 obs. of  5 variables:
 $ group : chr  "/" "/OME" "/OME" "/OME" ...
 $ name  : chr  "OME" "ColumnDescriptions" "ColumnTypes" "Measurements" ...
 $ otype : Factor w/ 15 levels "H5I_FILE","H5I_GROUP",..: 2 5 5 5 2 2 5 5 5 5 ...
 $ dclass: chr  "" "STRING" "STRING" "COMPOUND" ...
 $ dim   : chr  "" "134" "134" "133050" ...
...

Only one thread looks to be active:

 57140 root      20   0 15.102g 0.015t   7036 R 100.0  6.0   5:49.68 R

joshmoore · 2017-05-06T20:17:18Z

...
49995                               chebyshev_coefficients_49995  0.000000e+00
49996                               chebyshev_coefficients_49996  0.000000e+00
49997                               chebyshev_coefficients_49997  0.000000e+00
49998                               chebyshev_coefficients_49998  0.000000e+00
49999                               chebyshev_coefficients_49999  0.000000e+00
 [ reached getOption("max.print") -- omitted 388322951 rows ]
Warning message:
system call failed: Cannot allocate memory

real    175m47.692s
user    0m1.015s
sys     0m0.360s

cc: @jkh1

joshmoore · 2017-05-07T14:14:44Z

beatrizserrano

I think that we should add the following command at line 93 in compute_image_similarities.R in order to free up some memory:

rm(fields, group_name, h5data, measures, imageID, wellID, listOfFeatureMatrices)
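As a rough illustration of what this change does, here is a minimal sketch using small stand-in objects rather than the real variables from compute_image_similarities.R. The object sizes are made up. Note that rm() only removes the bindings; calling gc() afterwards triggers a collection and reports current usage. Whether this helps in practice depends on where the memory peak actually occurs, which is not known from the logs above.

```r
# Stand-in objects; the real script holds HDF5-derived intermediates instead.
h5data <- matrix(rnorm(1e6), ncol = 100)
listOfFeatureMatrices <- list(h5data, h5data)
featureMatrix <- h5data[, 1:10]

# Drop the intermediates that are no longer needed before the PCA step.
rm(h5data, listOfFeatureMatrices)

# rm() only removes the bindings; gc() runs a collection and returns
# a matrix summarising memory currently in use.
mem <- gc()
print(mem)
```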

joshmoore · 2017-06-21T09:42:43Z

Thanks, @beatrizserrano. Running again now.

joshmoore · 2017-06-21T13:10:35Z

@beatrizserrano unfortunately

[jamoore@idr0-slot3 serrano-remining]$ git diff
diff --git a/compute_image_similarities.R b/compute_image_similarities.R
index 89535b8..73859f1 100644
--- a/compute_image_similarities.R
+++ b/compute_image_similarities.R
@@ -88,8 +88,9 @@ featureMatrix <- aggregate(featureMatrix, by = list(row.names(featureMatrix)), m
 rownames(featureMatrix) <- featureMatrix[,1]
 featureMatrix <- featureMatrix[,-1]

 # Remove constant features
 featureMatrix <- featureMatrix[, which(!apply(featureMatrix, 2, FUN=function(x) {sd(x)==0}))]
+rm(fields, group_name, h5data, measures, imageID, wellID, listOfFeatureMatrices)

 # PCA
 pca <- prcomp(featureMatrix, scale.= TRUE, center = TRUE)

still failed with:

...
49998                               chebyshev_coefficients_49998  0.000000e+00
49999                               chebyshev_coefficients_49999  0.000000e+00
 [ reached getOption("max.print") -- omitted 388322951 rows ]
Warning message:
system call failed: Cannot allocate memory

real    158m6.828s
user    0m0.970s
sys     0m0.323s

jkh1 · 2017-06-21T13:50:54Z

Hi,

There are various ways to deal with this, assuming this is caused by having too many images:

- Apply PCA to a random subset of the images. As long as this subset is representative of the data, the covariance will be reasonably well approximated.
- Compute the covariance matrix incrementally, see http://rebcabin.github.io/blog/2013/01/22/covariance-matrices/
- Use random SVD/PCA, see this paper: https://arxiv.org/pdf/0909.4061.pdf and R package rsvd: https://CRAN.R-project.org/package=rsvd
- Use random projections, see e.g. http://users.ics.aalto.fi/ella/publications/randproj_kdd.pdf

Cheers
J-K


-- Dr Jean-Karim Hériché Cell Biology and Biophysics Unit European Molecular Biology Laboratory Meyerhofstrasse 1 69117 Heidelberg Germany tel: +49 (0) 6221 387 8188
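To make the incremental-covariance option more concrete, here is a minimal sketch of a chunk-wise update on synthetic data. It is not code from this PR, and the function names new_cov_state and update_cov_state are hypothetical. It keeps a running count, mean vector and scatter matrix, and merges each chunk using the standard pairwise update, so only one chunk has to be in memory at a time.

```r
# Running state for a chunk-wise covariance estimate over p features.
new_cov_state <- function(p) {
  list(n = 0, mean = numeric(p), M2 = matrix(0, p, p))
}

# Merge one chunk (rows = images, columns = features) into the running state.
update_cov_state <- function(state, chunk) {
  chunk <- as.matrix(chunk)
  m <- nrow(chunk)
  chunk_mean <- colMeans(chunk)
  centered <- sweep(chunk, 2, chunk_mean)
  chunk_M2 <- crossprod(centered)          # scatter matrix of this chunk
  delta <- chunk_mean - state$mean
  n_new <- state$n + m
  # Combine the two scatter matrices, correcting for the shift in means.
  state$M2 <- state$M2 + chunk_M2 + (state$n * m / n_new) * tcrossprod(delta)
  state$mean <- state$mean + delta * m / n_new
  state$n <- n_new
  state
}

set.seed(1)
x <- matrix(rnorm(1000 * 5), ncol = 5)     # synthetic stand-in data
chunks <- split(seq_len(nrow(x)), rep(1:10, each = 100))

state <- new_cov_state(ncol(x))
for (idx in chunks) {
  state <- update_cov_state(state, x[idx, , drop = FALSE])
}
cov_incremental <- state$M2 / (state$n - 1)
```

The principal components could then be taken from eigen(cov_incremental), or from eigen(cov2cor(cov_incremental)) to mimic scale. = TRUE. Whether the full feature table can be read in chunks from the HDF5 file is an assumption this sketch does not address.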

beatrizserrano · 2017-06-21T15:00:11Z

Thanks, @jkh1. Let's begin with the easiest one :)

To select a random subset of images, we need to expand line 90 to:

nImages <- nrow(featureMatrix) # 388372950 images
randomSetSize <- ceiling(0.01 * nImages) # round up to a whole number of rows
featureMatrix <- featureMatrix[sample(nImages, randomSetSize), ]

I've selected 1% of the images for testing purposes, but we could increase it once we see it's working.
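For reference, a self-contained sketch of the same idea on synthetic data. The matrix size, the seed and the names fraction, subsetRows and pcaInput are assumptions for illustration only. It keeps the full featureMatrix intact and fits the PCA on a separate subset, so the rows left out remain available.

```r
set.seed(42)  # fixed seed so the subset is reproducible

# Synthetic stand-in: 1000 "images" x 20 features.
featureMatrix <- matrix(rnorm(1000 * 20), ncol = 20,
                        dimnames = list(paste0("img", 1:1000),
                                        paste0("f", 1:20)))

fraction <- 0.01
nImages <- nrow(featureMatrix)
# Whole-number sample size, with at least two rows so the PCA is defined.
randomSetSize <- max(2, ceiling(fraction * nImages))

# Sample row indices without replacement and fit the PCA on that subset only.
subsetRows <- sample(nImages, randomSetSize)
pcaInput <- featureMatrix[subsetRows, , drop = FALSE]
pca <- prcomp(pcaInput, scale. = TRUE, center = TRUE)
```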

joshmoore · 2017-06-22T05:52:39Z

49998                               chebyshev_coefficients_49998  0.000000e+00
49999                               chebyshev_coefficients_49999  0.000000e+00
 [ reached getOption("max.print") -- omitted 388322951 rows ]

real    170m23.881s
user    0m1.073s
sys     0m0.313s

👍

joshmoore · 2017-06-22T16:00:44Z

@beatrizserrano / @jkh1 : any thoughts on what format to write the results out to?

jkh1 · 2017-06-23T09:52:28Z

Flat files in a hierarchical directory structure matching that of the images would work best for most purposes. If you want to keep everything together, you could use HDF5 but there won't be any other benefit given the expected access pattern, plus the HDF5 1.8 library doesn't support concurrent reads (although the upcoming HDF5 1.10 should enable it). Alternatively, if it's just for us, serialized R data structures (i.e. RDS files from the saveRDS() function) would be best. J-K

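As an illustration of the flat-file option, here is a minimal sketch that writes one CSV per image into a plate/well directory tree. The layout, the meta table and its plate, well and image columns are all hypothetical; the real hierarchy would need to follow however the images are actually organised.

```r
# Hypothetical per-image metadata and PCA-derived features (synthetic).
meta <- data.frame(plate = c("plate1", "plate1", "plate2"),
                   well  = c("A1", "A2", "A1"),
                   image = c("101", "102", "201"),
                   stringsAsFactors = FALSE)
features <- matrix(rnorm(3 * 4), nrow = 3,
                   dimnames = list(meta$image, paste0("PC", 1:4)))

# Write under a temporary directory: <root>/<plate>/<well>/<image>.csv
outRoot <- file.path(tempdir(), "similarity_output")
written <- character(nrow(meta))
for (i in seq_len(nrow(meta))) {
  outDir <- file.path(outRoot, meta$plate[i], meta$well[i])
  dir.create(outDir, recursive = TRUE, showWarnings = FALSE)
  written[i] <- file.path(outDir, paste0(meta$image[i], ".csv"))
  # One row per file, keeping the image ID as the row name.
  write.csv(features[i, , drop = FALSE], written[i])
}
```

A single image's features can then be read back with read.csv(path, row.names = 1) without touching any other file.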

joshmoore · 2017-06-23T12:52:04Z

> Flat files in a hierarchical directory structure matching that of the
> images would work best for most purposes.

Does this exist as a function?

> Alternatively, if it's just for us, serialized R data structures (i.e.
> RDS files from the saveRDS() function) would be best.

Of which variables?

joshmoore · 2017-07-05T13:13:47Z

Any suggestions here, @jkh1 & @beatrizserrano ?

jkh1 · 2017-07-05T14:55:49Z

Sorry for the delay in getting back to you. I don't think there's currently a function that creates the hierarchical file structure. I would save the PCA-derived features (variable features) and the similarity matrix (variable simMatrix). Note that the code has to be extended to project the data not used in the PCA onto the PCs with predict(pca, newdata = ...).

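A minimal sketch of the projection and saving steps described above, on synthetic data. The matrix size, the seed, the output paths and the correlation-based simMatrix are placeholders; the similarity measure used in compute_image_similarities.R may well differ.

```r
set.seed(7)
featureMatrix <- matrix(rnorm(200 * 6), ncol = 6,
                        dimnames = list(paste0("img", 1:200),
                                        paste0("f", 1:6)))
subsetRows <- sample(nrow(featureMatrix), 50)

# Fit the PCA on the random subset only.
pca <- prcomp(featureMatrix[subsetRows, ], scale. = TRUE, center = TRUE)

# Project the rows that were left out onto the same principal components.
heldOut <- predict(pca, newdata = featureMatrix[-subsetRows, ])

# Combine into one feature table, restoring the original row order.
features <- rbind(pca$x, heldOut)[rownames(featureMatrix), ]

# Placeholder similarity: correlation between images in PC space.
simMatrix <- cor(t(features))

# Serialize both objects; readRDS() restores them unchanged.
featuresFile <- file.path(tempdir(), "features.rds")
simFile <- file.path(tempdir(), "simMatrix.rds")
saveRDS(features, featuresFile)
saveRDS(simMatrix, simFile)
restored <- readRDS(featuresFile)
```

predict() applies the centering and scaling stored in the fitted pca object, so the held-out rows end up in the same coordinate system as pca$x.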

jkh1 and others added 5 commits March 28, 2017 15:57

script to compute image similarity matrix

950a6e6

Convert test_HDF5.Rmd to test_HDF5.R

f7d9891

Add Dockerfile for running scripts

7c5e361

test_HDF5.R now takes h5 file as argument

1b4b70e

Update docker to work with compute_image_similarities.R

7d12251

beatrizserrano suggested changes Jun 20, 2017


joshmoore added 2 commits June 22, 2017 06:49

Try to free memory

70e0f1b

User only 1% of features for PCA

d281c22
