
need a data model that contains essential information of a calculation #1171

Open · JenkeScheen opened this issue Mar 20, 2023 · 7 comments

@JenkeScheen

When running perses on a protein-ligand system, we've noticed that the serialized hybrid factory for each edge is huge:

166M	P0240-His41(0)-Cys145(0)-His163(0)-3-5/out-hybrid_factory.npy.npz

Instead of just serializing out this object, we need to come up with a data model that contains the essential information associated with a calculation.

Settings for the above calculation:

n_cycles: 5000
n_steps_per_move_application: 500
n_states: 18
timestep: 2 fs

This is with perses 0.10.1 (build pyha21a80b_1).
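As a quick sanity check on the settings above, the sampled time works out to 5 ns per replica (90 ns aggregate over the 18 states). A minimal sketch of that arithmetic, using the setting names from the issue (the variable layout is mine):

```python
# Back-of-envelope check: total sampled time implied by the settings above.
n_cycles = 5000
n_steps_per_move_application = 500
timestep_fs = 2.0
n_states = 18

# Each cycle applies n_steps_per_move_application MD steps of `timestep` each.
ns_per_replica = n_cycles * n_steps_per_move_application * timestep_fs / 1e6  # fs -> ns
aggregate_ns = ns_per_replica * n_states  # summed over all replicas

print(ns_per_replica, aggregate_ns)  # -> 5.0 90.0
```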

@ijpulidos
Contributor

@JenkeScheen The biggest chunk is probably the nc files; we would benefit from having the header information for these (the output of ncdump -h). Can you provide an example for a large nc file?

@JenkeScheen
Author

Here's the header for an out-complex.nc of 7.7G at /lila/data/chodera/asap-datasets/prospective/2023/03_mers_retro/3_afe_calcs/fauxalysis/perses/P0240-His41(0)-Cys145(0)-His163(0)-1-3:

netcdf out-complex {
dimensions:
	scalar = 1 ;
	iteration = UNLIMITED ; // (4807 currently)
	spatial = 3 ;
	analysis_particles = 4784 ;
	fixedL2590814 = 2590814 ;
	fixedL1052 = 1052 ;
	fixedL1047 = 1047 ;
	fixedL1040 = 1040 ;
	fixedL1045 = 1045 ;
	fixedL1042 = 1042 ;
	fixedL935 = 935 ;
	fixedL1311858 = 1311858 ;
	fixedL1311862 = 1311862 ;
	fixedL3 = 3 ;
	atom = 4784 ;
	replica = 18 ;
	state = 18 ;
	unsampled = 2 ;
variables:
	int64 last_iteration(scalar) ;
	int64 analysis_particle_indices(analysis_particles) ;
		analysis_particle_indices:long_name = "analysis_particle_indices[analysis_particles] is the indices of the particles with extra information stored about them in theanalysis file." ;
	string options(scalar) ;
	char metadata(fixedL3) ;
	float positions(iteration, replica, atom, spatial) ;
		positions:units = "nm" ;
		positions:long_name = "positions[iteration][replica][atom][spatial] is position of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
	float velocities(iteration, replica, atom, spatial) ;
		velocities:units = "nm / ps" ;
		velocities:long_name = "velocities[iteration][replica][atom][spatial] is velocity of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
	float box_vectors(iteration, replica, spatial, spatial) ;
		box_vectors:units = "nm" ;
		box_vectors:long_name = "box_vectors[iteration][replica][i][j] is dimension j of box vector i for replica \'replica\' from iteration \'iteration-1\'." ;
	double volumes(iteration, replica) ;
		volumes:units = "nm**3" ;
		volumes:long_name = "volume[iteration][replica] is the box volume for replica \'replica\' from iteration \'iteration-1\'." ;
	int states(iteration, replica) ;
		states:units = "none" ;
		states:long_name = "states[iteration][replica] is the thermodynamic state index (0..n_states-1) of replica \'replica\' of iteration \'iteration\'." ;
	double energies(iteration, replica, state) ;
		energies:units = "kT" ;
		energies:long_name = "energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at the thermodynamic state \'state\'." ;
	byte neighborhoods(iteration, replica, state) ;
		neighborhoods:_FillValue = 1b ;
		neighborhoods:long_name = "neighborhoods[iteration][replica][state] is 1 if this energy was computed during this iteration." ;
	double unsampled_energies(iteration, replica, unsampled) ;
		unsampled_energies:units = "kT" ;
		unsampled_energies:long_name = "unsampled_energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at unsampled thermodynamic state \'state\'." ;
	int accepted(iteration, state, state) ;
		accepted:units = "none" ;
		accepted:long_name = "accepted[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
	int proposed(iteration, state, state) ;
		proposed:units = "none" ;
		proposed:long_name = "proposed[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
	string timestamp(iteration) ;

// global attributes:
		:UUID = "148f4bdc-0dce-40ae-b09c-a491c679d4fc" ;
		:application = "YANK" ;
		:program = "yank.py" ;
		:programVersion = "0.21.5" ;
		:Conventions = "ReplicaExchange" ;
		:ConventionVersion = "0.2" ;
		:DataUsedFor = "analysis" ;
		:CheckpointInterval = 250LL ;
		:title = "Replica-exchange sampler simulation created using ReplicaExchangeSampler class of openmmtools.multistate on Sat Mar 18 01:38:28 2023" ;

group: thermodynamic_states {
  variables:
  	char state0(fixedL2590814) ;
  	char state1(fixedL1052) ;
  	char state2(fixedL1047) ;
  	char state3(fixedL1040) ;
  	char state4(fixedL1047) ;
  	char state5(fixedL1045) ;
  	char state6(fixedL1040) ;
  	char state7(fixedL1040) ;
  	char state8(fixedL1045) ;
  	char state9(fixedL1042) ;
  	char state10(fixedL1042) ;
  	char state11(fixedL1040) ;
  	char state12(fixedL1040) ;
  	char state13(fixedL1040) ;
  	char state14(fixedL1040) ;
  	char state15(fixedL1040) ;
  	char state16(fixedL1040) ;
  	char state17(fixedL935) ;
  } // group thermodynamic_states

group: unsampled_states {
  variables:
  	char state0(fixedL1311858) ;
  	char state1(fixedL1311862) ;
  } // group unsampled_states

group: mcmc_moves {
  variables:
  	string move0(scalar) ;
  	string move1(scalar) ;
  	string move2(scalar) ;
  	string move3(scalar) ;
  	string move4(scalar) ;
  	string move5(scalar) ;
  	string move6(scalar) ;
  	string move7(scalar) ;
  	string move8(scalar) ;
  	string move9(scalar) ;
  	string move10(scalar) ;
  	string move11(scalar) ;
  	string move12(scalar) ;
  	string move13(scalar) ;
  	string move14(scalar) ;
  	string move15(scalar) ;
  	string move16(scalar) ;
  	string move17(scalar) ;
  } // group mcmc_moves

group: online_analysis {
  dimensions:
  	dim_size18 = 18 ;
  	dim_size2 = 2 ;
  	dim_size20 = 20 ;
  variables:
  	double f_k(dim_size18) ;
  	double free_energy(dim_size2) ;
  	double f_k_history(iteration, dim_size18) ;
  	double free_energy_history(iteration, dim_size2) ;
  	double f_k_offline(dim_size20) ;
  	double f_k_offline_history(iteration, dim_size20) ;
  } // group online_analysis
}

@ijpulidos
Contributor

ijpulidos commented Mar 21, 2023

@JenkeScheen I don't see anything terribly wrong with it; the only thing is that you might want to review the checkpoint interval. I can see it is set to 50, and that might be a bit too frequent? We commonly use 250 for our benchmarks, which also run 5 ns/replica.

@jchodera maybe you can spot something else here?

EDIT: Check next comment.

@ijpulidos
Contributor

@JenkeScheen Oh actually, never mind that previous comment; I mixed up the files: I am actually using 50 and you are using 250. That should be okay. Sorry for the noise.

@ijpulidos
Contributor

@JenkeScheen I was thinking about this again, and I think it makes sense that you have such big nc files, at least compared to what we commonly get when running benchmarks.

I don't know exactly how the information is stored in the netCDF format, but I'm going to guess that the fundamental types are IEEE-754 standard C types (that is, float is a 32-bit data type, for practical purposes). Here is a quick comparison:

| System | atoms | iterations | replicas | spatial | GBytes of info | nc file size |
|--------|-------|------------|----------|---------|----------------|--------------|
| Jenke  | 4784  | 4807       | 18       | 3       | 9.93           | 7.7G         |
| tyk2   | 4783  | 2238       | 12       | 3       | 3.08           | 2.4G         |

If we compute the ratio of the numbers in the GBytes of info column, we get 9.93/3.08 = 3.22, which is very close to the ratio of the nc file sizes, 7.7/2.4 = 3.21. To me this means you are not storing any extra or undesired data compared to what we already store in the benchmarks. I hope this makes sense and helps.

NOTES:

  • GBytes of info is computed as 2*iteration*replica*atom*spatial*32/8/1e9: 2 for positions and velocities, 32 for 32 bits per float, 8 for 8 bits per byte, and 1e9 for 1e9 bytes per GB.
  • These computations were performed only for positions and velocities data, since I think those are the big chunk of the data stored. I know there are more variables but those are probably way smaller.
  • The nc file size being smaller is probably due to some compression algorithms in the netCDF format.
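The estimate in the first note can be reproduced with a short sketch; the function name and argument layout are mine, but the formula is the one given above (two float32 arrays, positions and velocities):

```python
def positions_velocities_gb(atoms: int, iterations: int, replicas: int, spatial: int = 3) -> float:
    """Raw payload of the positions and velocities arrays, in GB.

    2 arrays (positions, velocities), 4 bytes per float32, 1e9 bytes per GB.
    """
    return 2 * iterations * replicas * atoms * spatial * 4 / 1e9

# Numbers from the comparison table above.
jenke = positions_velocities_gb(atoms=4784, iterations=4807, replicas=18)
tyk2 = positions_velocities_gb(atoms=4783, iterations=2238, replicas=12)

print(round(jenke, 2), round(tyk2, 2))  # -> 9.93 3.08
print(round(jenke / tyk2, 2))           # -> 3.22, close to the 7.7/2.4 = 3.21 file-size ratio
```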

@JenkeScheen
Author

Thanks @ijpulidos. IIRC the .nc files are used for calculating energies: is the entire file needed for that, or would a truncated file be enough? I've dealt with similar file types in other FEC codes, but they never really exceeded more than a few MB.

@ijpulidos
Contributor

From discussions in our dev syncs, what we want to do here for now is change the default to NOT store any special atom indices in the analysis .nc files; this should lower the size of the output by a considerable amount. This is done in the changes in #1185.
