`out-solvent.nc` is larger than `out-complex.nc`

Question

`out-solvent.nc` is larger than `out-complex.nc`

JenkeScheen opened this issue 2 years ago · comments

When running perses on a protein-ligand system we've noticed that the out-solvent.nc is larger than out-complex.nc:

128M	P0240-His41(0)-Cys145(0)-His163(0)-3-5/out-solvent.nc
80M	P0240-His41(0)-Cys145(0)-His163(0)-3-5/out-complex.nc

This seems like a bug which when resolved could save a lot of storage

JenkeScheen · Answer 1 · Mon Mar 20 2023 23:12:55 GMT+0800 (China Standard Time)

(copying in some of @jchodera's notes):

It’s not obvious what object is the culprit:

$ ncdump -h out-solvent.nc
netcdf out-solvent {
dimensions:
	scalar = 1 ;
	iteration = UNLIMITED ; // (2536 currently)
	spatial = 3 ;
	analysis_particles = 31 ;
	fixedL55052 = 55052 ;
	fixedL1052 = 1052 ;
	fixedL1047 = 1047 ;
	fixedL1040 = 1040 ;
	fixedL1045 = 1045 ;
	fixedL1042 = 1042 ;
	fixedL935 = 935 ;
	fixedL24849 = 24849 ;
	fixedL25035 = 25035 ;
	fixedL3 = 3 ;
	atom = 31 ;
	replica = 18 ;
	state = 18 ;
	unsampled = 2 ;
variables:
	int64 last_iteration(scalar) ;
	int64 analysis_particle_indices(analysis_particles) ;
		analysis_particle_indices:long_name = "analysis_particle_indices[analysis_particles] is the indices of the particles with extra information stored about them in theanalysis file." ;
	string options(scalar) ;
	char metadata(fixedL3) ;
	float positions(iteration, replica, atom, spatial) ;
		positions:units = "nm" ;
		positions:long_name = "positions[iteration][replica][atom][spatial] is position of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
	float velocities(iteration, replica, atom, spatial) ;
		velocities:units = "nm / ps" ;
		velocities:long_name = "velocities[iteration][replica][atom][spatial] is velocity of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
	float box_vectors(iteration, replica, spatial, spatial) ;
		box_vectors:units = "nm" ;
		box_vectors:long_name = "box_vectors[iteration][replica][i][j] is dimension j of box vector i for replica \'replica\' from iteration \'iteration-1\'." ;
	double volumes(iteration, replica) ;
		volumes:units = "nm**3" ;
		volumes:long_name = "volume[iteration][replica] is the box volume for replica \'replica\' from iteration \'iteration-1\'." ;
	int states(iteration, replica) ;
		states:units = "none" ;
		states:long_name = "states[iteration][replica] is the thermodynamic state index (0..n_states-1) of replica \'replica\' of iteration \'iteration\'." ;
	double energies(iteration, replica, state) ;
		energies:units = "kT" ;
		energies:long_name = "energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at the thermodynamic state \'state\'." ;
	byte neighborhoods(iteration, replica, state) ;
		neighborhoods:_FillValue = 1b ;
		neighborhoods:long_name = "neighborhoods[iteration][replica][state] is 1 if this energy was computed during this iteration." ;
	double unsampled_energies(iteration, replica, unsampled) ;
		unsampled_energies:units = "kT" ;
		unsampled_energies:long_name = "unsampled_energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at unsampled thermodynamic state \'state\'." ;
	int accepted(iteration, state, state) ;
		accepted:units = "none" ;
		accepted:long_name = "accepted[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
	int proposed(iteration, state, state) ;
		proposed:units = "none" ;
		proposed:long_name = "proposed[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
	string timestamp(iteration) ;

// global attributes:
		:UUID = "9768ef57-c781-4aa3-bb6b-6c25f9de3a06" ;
		:application = "YANK" ;
		:program = "yank.py" ;
		:programVersion = "0.21.5" ;
		:Conventions = "ReplicaExchange" ;
		:ConventionVersion = "0.2" ;
		:DataUsedFor = "analysis" ;
		:CheckpointInterval = 250LL ;
		:title = "Replica-exchange sampler simulation created using ReplicaExchangeSampler class of openmmtools.multistate on Fri Mar 17 10:56:53 2023" ;

group: thermodynamic_states {
  variables:
  	char state0(fixedL55052) ;
  	char state1(fixedL1052) ;
  	char state2(fixedL1047) ;
  	char state3(fixedL1040) ;
  	char state4(fixedL1047) ;
  	char state5(fixedL1045) ;
  	char state6(fixedL1040) ;
  	char state7(fixedL1040) ;
  	char state8(fixedL1045) ;
  	char state9(fixedL1042) ;
  	char state10(fixedL1042) ;
  	char state11(fixedL1040) ;
  	char state12(fixedL1040) ;
  	char state13(fixedL1040) ;
  	char state14(fixedL1040) ;
  	char state15(fixedL1040) ;
  	char state16(fixedL1040) ;
  	char state17(fixedL935) ;
  } // group thermodynamic_states

group: unsampled_states {
  variables:
  	char state0(fixedL24849) ;
  	char state1(fixedL25035) ;
  } // group unsampled_states

group: mcmc_moves {
  variables:
  	string move0(scalar) ;
  	string move1(scalar) ;
  	string move2(scalar) ;
  	string move3(scalar) ;
  	string move4(scalar) ;
  	string move5(scalar) ;
  	string move6(scalar) ;
  	string move7(scalar) ;
  	string move8(scalar) ;
  	string move9(scalar) ;
  	string move10(scalar) ;
  	string move11(scalar) ;
  	string move12(scalar) ;
  	string move13(scalar) ;
  	string move14(scalar) ;
  	string move15(scalar) ;
  	string move16(scalar) ;
  	string move17(scalar) ;
  } // group mcmc_moves

group: online_analysis {
  dimensions:
  	dim_size18 = 18 ;
  	dim_size2 = 2 ;
  	dim_size20 = 20 ;
  variables:
  	double f_k(dim_size18) ;
  	double free_energy(dim_size2) ;
  	double f_k_history(iteration, dim_size18) ;
  	double free_energy_history(iteration, dim_size2) ;
  	double f_k_offline(dim_size20) ;
  	double f_k_offline_history(iteration, dim_size20) ;
  } // group online_analysis
}

We might need to figure out whether ncdump or another tool can better inspect the contents. ncdump can also serialize to other formats---might need to do that and inspect whether there is a big blob there.

It’s also possible it could be something in the MultiStateReporter that automatically sets the chunk size, and ends up making the file huge for no reason.
There’s stuff in there that sets chunksize to enlarge the NetCDF file to leave room whenever it grows a variable in the extensible dimension, e.g.
https://github.com/choderalab/openmmtools/blob/main/openmmtools/multistate/multistatereporter.py#L1358

It’s possible we are setting this way too high for something in the solvent file and it’s literally expanding the NetCDF file with blank data.

The chunksizes are supposed to be set just large enough to not incur a performance hit in reading and writing by continually having to resize the NetCDF file and add more “continued over here” sections

JenkeScheen · Answer 2 · Mon Mar 20 2023 23:21:18 GMT+0800 (China Standard Time)

looking into the edges for this run a bit more closely this seems to have been a one-off behaviour, see /ifs/rtsia01/chodera/asap/asap/project_support/2023/03_jguven/perses/perses.