-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change np.hstack to np.concatenate #207
base: master
Are you sure you want to change the base?
Change np.hstack to np.concatenate #207
Conversation
Hi! Thanks for taking the time to make a contribution! We use enspara on F@H datasets all the time, so I'm surprised you ran in to trouble with this function. I'm also surprised that changing it from Could you post a minimal example of the problem you were having? |
Actually, changing it from np.hstack to np.concatenate(..., axis=1) is not enough, I had to delete a part of the code used for checking whether the input data had the shape 1d or not (from line 149 to 154). And then it was successful to build MSM for ***@***.*** input data that I had.
[image: image.png]
So when you got the dataset that does not have the shape of (n_traj, n_frames) but only (n_traj,) due to different number frames in each trajectory. It would raise this error. Sorry that I have found that there is code for transforming from 2d to 1d in your code before using np.hstack.
However, it still raise an error for the case when we have trajectories of different number frames as input if codes from the line 149 to 154 is kept.
|
Sorry, I'm still having some trouble figuring out your situation. Because you responded with email, your image didn't make it into your post and GH censored the F@H dataset as "@.***" (thinking that it was an email address, I guess?).
This change is not in your PR. More importantly, I find it very strange that your data triggered the condition I strongly suspect there is some issue with how you've loaded the data. Perhaps you have loaded your data concatenated, so that each trajectory is appended end-to-end with the last? If so, this is unlikely to be what you want, since it will create artifactual transitions between the last frames of trajectory n and the first few frames of trajectory n+1. Cheers! |
Thanks again for your reply. Maybe posting the error will help clarify the issue:
I believe this happens due to the length of each trajectory in the array having a different number of frames. Perhaps you are right that I am loading the data in a way the program was not expecting, but, as far as I can tell, each trajectory has different lengths and that this should be the correct way to load it (for the reason you stated about having artifactual transitions). The numpy array, therefore, shows a shape of (n_frames, ) due to the second dimension having different shapes for each item. Thanks for looking into this. |
What code are you running that reaches this error? Can you post the contents of |
import mdtraj as md
import pyemma.coordinates as coor
import numpy as np
import pickle
import pyemma
import numbers
import os
import enspara
import h5py
import pandas as pd
import numpy as np
import logging
import scipy
import scipy.sparse
import scipy.sparse.linalg
from scipy.sparse.csgraph import connected_components
from enspara import exception
from enspara.msm import MSM, builders, transition_matrices
from enspara.msm.transition_matrices import _transitions_helper
import pyemma.plots as pyemma_plots
from pyemma.util.contexts import settings
def assigns_to_counts1(
assigns, lag_time, max_n_states=None, sliding_window=True):
"""Count transitions between states in a single trajectory.
Parameters
----------
assigns : array, shape=(traj_len, )
A 2-D array where each row is a trajectory consisting of a
sequence of state indices.
lag_time : int
The lag time (i.e. observation interval) for counting
transitions.
max_n_states : int, default=None
The number of states. This is useful for controlling the
dimensions of the transition count matrix in cases where the
input trajectory does not necessarily visit every state.
sliding_window : bool, default=True
Whether to use a sliding window for counting transitions or to
take every lag_time'th state.
Returns
-------
C : array, shape=(n_states, n_states)
A transition count matrix.
"""
if not isinstance(lag_time, numbers.Integral):
raise exception.DataInvalid(
"The lag time must be an integer. Got %s type %s." %
lag_time, type(lag_time))
if lag_time < 1:
raise exception.DataInvalid(
"Lag times must be be strictly greater than 0. Got '%s'." %
lag_time)
# if it's 1d, later stuff will fail
if len(assigns.shape) == 1:
raise exception.DataInvalid(
'The given assignments array has 1-dimensional shape %s. '
'Two dimensional shapes = (n_trj, n_frames) are expected. '
'If this is really what you want, try using '
'assignments.reshape(1, -1) to create a single-row 2d array.')
assigns = np.array([a[np.where(a != -1)] for a in assigns])
if max_n_states is None:
max_n_states = np.concatenate(assigns).max() + 1
transitions = [
_transitions_helper(
assign, lag_time=lag_time, sliding_window=sliding_window)
for assign in assigns]
# generate sparse matrix
mat_coords = np.hstack(transitions)
mat_data = np.ones(mat_coords.shape[1], dtype=int)
C = scipy.sparse.coo_matrix(
(mat_data, mat_coords), shape=(max_n_states, max_n_states))
return C
#Building MSM
msm_lag = 8
cluster_numbers = 5000
cluster_assig = coor.load('clustering/chi1_2_5000_trajs_n_stride_5.h5')
cluster_assig = np.array(cluster_assig)
assig = []
for frame in cluster_assig:
assig.append(frame.astype(int))
assig = np.array(assig)
print (assig[0].shape)
print ('MSM is being built ...')
tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
prior_counts = 1/tcounts.shape[0]
tcounts = builders._apply_prior_counts(tcounts, prior_counts)
probs = builders._row_normalize(tcounts)
eq_probs_ = builders.eq_probs(probs)
print ('transition maxtrix: ', tcounts)
print ('transition probabilities: ', probs)
print ('equilibrium probabilities: ', eq_probs_)
np.save (f'MSM/tcounts_{cluster_numbers}_{msm_lag}.npy', tcounts)
np.save (f'MSM/tprobs_{cluster_numbers}_{msm_lag}.npy', probs)
np.save (f'MSM/populations_{cluster_numbers}_{msm_lag}.npy', eq_probs_)
print('MSM was built successfully') Here you are! |
Without sitting down with this code and it's inputs and debugging more carefully, I'm not sure what's going on. Needless to say, this isn't how we intended the code to be used. It's just a convention, but mostly you shouldn't need to use the methods with names beginning with Building an MSM from your assignments should be as easy as shown here! |
Oh, also, if you are looking to do prior counts, check out the example here! It uses the built-in "partial" method to wrap |
This change allows loading trajectory assignment arrays of different shapes. Previously an error would be thrown by np.hstack if assignment arrays of different shapes were used, which is common when using folding@home data sets (for example).