Docker wrapper #1

Open
wants to merge 7 commits into
base: master

joshmoore
Member

@joshmoore joshmoore commented May 6, 2017

Coding began at the January meeting in Dundee. Now with features generated for idr0013 and idr0012, it's time to try generating the similarity matrix.

cc: @jkh1 @dominikl @manics

@joshmoore
Member Author

Building locally now. Once working, I'll get this on docker hub.

@joshmoore
Member Author

[jamoore@idr0-slot3 tables]$ bash similarity.sh
Loading required package: methods
'data.frame':   16 obs. of  5 variables:
 $ group : chr  "/" "/OME" "/OME" "/OME" ...
 $ name  : chr  "OME" "ColumnDescriptions" "ColumnTypes" "Measurements" ...
 $ otype : Factor w/ 15 levels "H5I_FILE","H5I_GROUP",..: 2 5 5 5 2 2 5 5 5 5 ...
 $ dclass: chr  "" "STRING" "STRING" "COMPOUND" ...
 $ dim   : chr  "" "134" "134" "133050" ...
...

Only one thread looks to be active:

 57140 root      20   0 15.102g 0.015t   7036 R 100.0  6.0   5:49.68 R

@joshmoore
Member Author

...
49995                               chebyshev_coefficients_49995  0.000000e+00
49996                               chebyshev_coefficients_49996  0.000000e+00
49997                               chebyshev_coefficients_49997  0.000000e+00
49998                               chebyshev_coefficients_49998  0.000000e+00
49999                               chebyshev_coefficients_49999  0.000000e+00
 [ reached getOption("max.print") -- omitted 388322951 rows ]
Warning message:
system call failed: Cannot allocate memory

real    175m47.692s
user    0m1.015s
sys     0m0.360s

cc: @jkh1
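
The `[ reached getOption("max.print") ... ]` line in the log indicates that a very large object is being printed to the console in full. A minimal sketch of summarising such an object instead of printing it is shown here; the `features` data frame and its columns are stand-ins invented for illustration, not names taken from the script.

```r
# Stand-in for a large feature table; names and sizes are illustrative only.
features <- data.frame(
  name  = paste0("chebyshev_coefficients_", 0:9999),
  value = 0,
  stringsAsFactors = FALSE
)

# Summaries whose output stays small regardless of the number of rows.
dims <- dim(features)                 # number of rows and columns
str(features)                         # compact structure overview
first_rows <- head(features, n = 5)   # only the first few rows
print(first_rows)

# Approximate in-memory size of the object, in bytes.
size_bytes <- as.numeric(object.size(features))
cat("rows:", dims[1], "size (bytes):", size_bytes, "\n")
```

Whether the console output contributes to the memory failure is not established by the log; the sketch only avoids the multi-hour print.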

@joshmoore
Member Author

screen shot 2017-05-07 at 16 14 00

Member

@beatrizserrano beatrizserrano left a comment

I think that we should add the following command at line 93 in compute_image_similarities.R in order to free up some memory:

rm(fields, group_name, h5data, measures, imageID, wellID, listOfFeatureMatrices)
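
To decide which objects are worth removing, it can help to rank them by size first. The following is a sketch under assumptions: the toy objects `big_a`, `big_b` and `small` are hypothetical and stand in for the script's intermediates.

```r
# Toy objects standing in for large intermediates; names are illustrative.
big_a <- numeric(1e6)
big_b <- numeric(1e5)
small <- 1:10

# Rank the objects in the current environment by approximate size.
env <- environment()
sizes <- sapply(ls(env), function(nm) {
  as.numeric(object.size(get(nm, envir = env)))
})
print(sort(sizes, decreasing = TRUE))

# Remove the large intermediates, then request a garbage collection.
# R also collects automatically; gc() mainly makes the effect visible sooner.
rm(big_a, big_b)
invisible(gc())

removed <- !exists("big_a", envir = env, inherits = FALSE)
```

Note that `rm()` only frees memory held by the removed objects; if `featureMatrix` itself is the dominant allocation, the peak during `prcomp()` may not change much.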

@joshmoore
Member Author

Thanks, @beatrizserrano. Running again now.

@joshmoore
Member Author

@beatrizserrano unfortunately

[jamoore@idr0-slot3 serrano-remining]$ git diff
diff --git a/compute_image_similarities.R b/compute_image_similarities.R
index 89535b8..73859f1 100644
--- a/compute_image_similarities.R
+++ b/compute_image_similarities.R
@@ -88,8 +88,9 @@ featureMatrix <- aggregate(featureMatrix, by = list(row.names(featureMatrix)), m
 rownames(featureMatrix) <- featureMatrix[,1]
 featureMatrix <- featureMatrix[,-1]

 # Remove constant features
 featureMatrix <- featureMatrix[, which(!apply(featureMatrix, 2, FUN=function(x) {sd(x)==0}))]
+rm(fields, group_name, h5data, measures, imageID, wellID, listOfFeatureMatrices)

 # PCA
 pca <- prcomp(featureMatrix, scale.= TRUE, center = TRUE)

still failed with:

...
49998                               chebyshev_coefficients_49998  0.000000e+00
49999                               chebyshev_coefficients_49999  0.000000e+00
 [ reached getOption("max.print") -- omitted 388322951 rows ]
Warning message:
system call failed: Cannot allocate memory

real    158m6.828s
user    0m0.970s
sys     0m0.323s

@jkh1
Member

jkh1 commented Jun 21, 2017 via email

@beatrizserrano
Member

Thanks, @jkh1. Let's begin with the easiest one :)

To select a random subset of images, we need to expand line 90 to:

nImages <- nrow(featureMatrix) # 388372950 images
randomSetSize <- floor(0.01 * nImages) # sample size must be a whole number
featureMatrix <- featureMatrix[sample(nImages, randomSetSize), ]

I've selected 1% of the images for testing purposes, but we can increase it once we see it's working.
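
A self-contained sketch of the sampling step on a toy matrix follows. The matrix contents, the 10% fraction and the seed are assumptions made for illustration. The sketch keeps the constant-feature filter after the sampling step, since a subset can contain columns that are constant even when the full matrix does not, and `prcomp(scale. = TRUE)` stops with an error on such columns.

```r
set.seed(42)  # fixed seed so the same subset is drawn on each run

# Toy feature matrix: 200 "images" x 5 features; the last column is constant.
featureMatrix <- cbind(matrix(rnorm(200 * 4), nrow = 200), rep(1, 200))
colnames(featureMatrix) <- paste0("feature_", 1:5)
rownames(featureMatrix) <- paste0("image_", 1:200)

# Draw a 10% subset of rows. floor() keeps the sample size a whole number,
# and max(1, ...) guards against an empty sample on tiny inputs.
nImages <- nrow(featureMatrix)
randomSetSize <- max(1, floor(0.10 * nImages))
sampled <- featureMatrix[sample(nImages, randomSetSize), , drop = FALSE]

# Drop zero-variance columns after sampling.
keep <- apply(sampled, 2, sd) > 0
sampled <- sampled[, keep, drop = FALSE]

# PCA on the reduced matrix, with the same arguments as in the script.
pca <- prcomp(sampled, scale. = TRUE, center = TRUE)
print(dim(pca$x))
```

Using `drop = FALSE` keeps the result a matrix even if only one row or column survives.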

@joshmoore
Member Author

49998                               chebyshev_coefficients_49998  0.000000e+00
49999                               chebyshev_coefficients_49999  0.000000e+00
 [ reached getOption("max.print") -- omitted 388322951 rows ]

real    170m23.881s
user    0m1.073s
sys     0m0.313s

👍

@joshmoore
Member Author

@beatrizserrano / @jkh1 : any thoughts on what format to write the results out to?

@jkh1
Member

jkh1 commented Jun 23, 2017 via email

@joshmoore
Member Author

> Flat files in a hierarchical directory structure matching that of the
> images would work best for most purposes.

Does this exist as a function?

> Alternatively, if it's just for us, serialized R data structures (i.e.
> RDS files from the saveRDS() function) would be best.

Of which variables?
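
Both options can be sketched with base R. As far as I know there is no single base function that writes a directory tree mirroring an image hierarchy, so the second option below combines `dir.create()` and `write.csv()`. The output directory, the plate layout and the choice to save the `prcomp` result are all assumptions for illustration, not decisions made in this thread.

```r
# Toy PCA result standing in for the real one.
set.seed(1)
m <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("image_", 1:10), paste0("f", 1:4)))
pca <- prcomp(m, scale. = TRUE, center = TRUE)

# Hypothetical output location under the session's temporary directory.
out_dir <- file.path(tempdir(), "similarity-output")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Option 1: serialize the R object as-is and read it back.
rds_path <- file.path(out_dir, "pca.rds")
saveRDS(pca, rds_path)
restored <- readRDS(rds_path)

# Option 2: one flat file per image in a directory tree.
# The plate assignment here is invented for illustration.
layout <- data.frame(image = rownames(pca$x),
                     plate = rep(c("plate1", "plate2"), each = 5),
                     stringsAsFactors = FALSE)
for (i in seq_len(nrow(layout))) {
  d <- file.path(out_dir, layout$plate[i])
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  write.csv(pca$x[layout$image[i], , drop = FALSE],
            file.path(d, paste0(layout$image[i], ".csv")))
}

csv_files <- list.files(out_dir, pattern = "\\.csv$", recursive = TRUE)
```

RDS preserves the full object, including rotation and scaling, while the flat files are readable from other tools but need a convention for the directory layout.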

@joshmoore
Member Author

Any suggestions here, @jkh1 & @beatrizserrano ?

@jkh1
Member

jkh1 commented Jul 5, 2017 via email

4 participants