This repository was archived by the owner on Sep 14, 2024. It is now read-only.

raise Exception("Model has never trained this n-gram: " + ngram) Exception: Model has never trained this n-gram: WNA #14

Open
devhimd19 opened this issue Aug 27, 2021 · 3 comments

Comments

@devhimd19

Screenshot from 2021-08-27 15-55-29

@kyu999
Owner

kyu999 commented Sep 3, 2021

Thank you for your report!
The error means that the n-gram "WNA" was never trained, because the corpus (the UniProt-trained one) does not contain that sequence.
You will need to build your own corpus and train the model on it yourself.
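
One way to check whether a corpus really covers a sequence is to list the n-grams that would be looked up and compare them with the corpus vocabulary. The following is a minimal sketch, not biovec's own code: `split_ngrams` here is a standalone helper that assumes n=3 with three shifted reading frames, `missing_ngrams` is a hypothetical name, and the vocabulary is assumed to be a plain set of n-gram strings (for example, read from a whitespace-separated corpus file).

```python
def split_ngrams(seq, n=3):
    """Split seq into non-overlapping n-grams, one list per reading frame."""
    frames = []
    for offset in range(n):
        shifted = seq[offset:]
        # Drop the trailing remainder that is shorter than n.
        usable = len(shifted) - len(shifted) % n
        frames.append([shifted[i:i + n] for i in range(0, usable, n)])
    return frames


def missing_ngrams(seq, vocab, n=3):
    """Return the sorted n-grams of seq that are absent from vocab."""
    seen = {ngram for frame in split_ngrams(seq, n) for ngram in frame}
    return sorted(seen - set(vocab))


# Toy vocabulary standing in for the n-grams found in a corpus file.
vocab = {"MKW", "KWN", "NAQ", "AQE"}
print(missing_ngrams("MKWNAQE", vocab))
```

Running a check like this over every input sequence shows which n-grams, if any, are absent before the model is asked to embed them.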

@devhimd19
Author

The corpus does contain WNA.
Could you please take a look at the attached code and the input file?
output1.txt
window_13re.txt
biovec5.txt
Screenshot from 2021-09-03 11-02-12

I am getting the output, but it is still showing the error.

@AliASafdari

AliASafdari commented Aug 8, 2022

Hi, @kyu999

I am facing the exact same error on my end too, but for the n-gram "KQE" instead.

Here's my code snippet -

pv = ProtVec('INPUT.FASTA', corpus_fname='OUTPUT.TXT', n=3)
pv["QAT"]
sequences = list(df[c])  # df[c] contains the AA sequences from which INPUT.FASTA was constructed
embeddings = []
for i in sequences:
    embed = pv.to_vecs(i)  # <- Error occurs here
    embeddings.append(embed)
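
To find out which sequences trigger the exception without stopping at the first one, the loop can record failures and continue. This is a rough sketch under stated assumptions: `embed_all` is a hypothetical helper, `to_vecs` is any callable that raises on an unknown n-gram (such as `pv.to_vecs`), and `fake_to_vecs` is a stand-in used only so the example runs without a trained model. Stripping whitespace and upper-casing each sequence is also an assumption about what the input should look like.

```python
def embed_all(sequences, to_vecs):
    """Embed each sequence; collect (index, sequence, message) for failures."""
    embeddings, failures = [], []
    for idx, seq in enumerate(sequences):
        seq = str(seq).strip().upper()  # normalise whitespace and case first
        try:
            embeddings.append(to_vecs(seq))
        except Exception as exc:  # the reported error is a plain Exception
            failures.append((idx, seq, str(exc)))
    return embeddings, failures


def fake_to_vecs(seq):
    # Stand-in for pv.to_vecs: rejects any sequence containing "KQE".
    if "KQE" in seq:
        raise Exception("Model has never trained this n-gram: KQE")
    return [len(seq)]


embeddings, failures = embed_all(["MKV", " akqe\n"], fake_to_vecs)
print(failures)
```

The collected `failures` list gives the row index and the cleaned sequence for each error, which makes it easier to compare the failing rows against the corpus file.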

Full code block, if it helps -

for d in data:
    df = pd.read_csv(d)
    dN = d[:-4]
    for c in cols:
        count = 1
        with open('sequences_{a}_{b}.fasta'.format(a = c, b = dN), 'w') as f:
            for i in range(len(df)):
                print('>' + str(count) + '\n', df[c][i], file = f)
                count = count + 1
        pv = ProtVec('sequences_{a}_{b}.fasta'.format(a = c, b = dN), corpus_fname='output_{a}_{b}.txt'.format(a = c, b = dN), n=3)
        pv["QAT"]
        sequences = list(df[c])
        embeddings = []
        for i in sequences:
            embed = pv.to_vecs(i)
            embeddings.append(embed)
        embedding = np.asarray(embeddings)
        all_embeddings = np.reshape(embedding, newshape=(embedding.shape[0], 300))
        dF = pd.DataFrame(all_embeddings, columns = colN, dtype = object)
        dF['modification'] = df['modifications']
        dF.to_csv('dataset-{a}_{b}.model'.format(a = c, b = dN))
        pv.save('sequences_{a}_{b}.model'.format(a = c, b = dN))

(Idk why, but I can't seem to get this code block to indent properly.)

Please help me get past this error.
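
One detail in the FASTA-writing step may be worth checking: `print` separates its positional arguments with a space by default, so `print('>' + str(count) + '\n', df[c][i], file = f)` writes each sequence line with a leading space. Whether this is related to the error is not established in this thread. The sketch below only demonstrates the behaviour with an in-memory buffer; `write_fasta` is a hypothetical helper, not part of biovec.

```python
import io


def write_fasta(sequences, handle):
    """Write sequences as numbered FASTA records with no stray whitespace."""
    for count, seq in enumerate(sequences, start=1):
        handle.write(f">{count}\n{str(seq).strip()}\n")


# Original pattern: the default separator puts a space before the sequence.
buf_original = io.StringIO()
print(">" + str(1) + "\n", "MKV", file=buf_original)

# Explicit formatting avoids the extra space.
buf_fixed = io.StringIO()
write_fasta(["MKV"], buf_fixed)

print(repr(buf_original.getvalue()))
print(repr(buf_fixed.getvalue()))
```

Comparing the two printed representations shows the extra space in the first record and its absence in the second.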
