showing error #11

Open
harshalpatilnmu opened this issue Nov 12, 2018 · 16 comments
Comments

@harshalpatilnmu

When I try to train the model, it shows an error:
(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\chatbot_dataset
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    dataset_dir, model_dir, hparams, resume_checkpoint = general_utils.initialize_session("train")
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\general_utils.py", line 45, in initialize_session
    copyfile("hparams.json", os.path.join(model_dir, "hparams.json"))
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'hparams.json'

@AbrahamSanders
Owner

Hey @harshalpatilnmu,

Try these:

  1. If you are on Windows, you can use the training batch files inside the dataset directory. For example, datasets/cornell_movie_dialog/train_with_nnlm_en_embeddings.bat. These should set the working directory automatically. There are multiple batch files - each one configures training with different pre-trained embeddings in different configurations. To train your own embeddings, use train_with_random_embeddings.bat

  2. If you are running the train.py file yourself, make sure your console working directory is set to the innermost seq2seq-chatbot path. Based on your log above, that would be D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot on your machine.

Also, there are pre-trained models you can download and try out - see here:
https://github.com/AbrahamSanders/seq2seq-chatbot/tree/master/seq2seq-chatbot/models/cornell_movie_dialog

@harshalpatilnmu
Author

My dataset is different, so I first need to train the model before I can run it. How can I train my model?
I followed your training command but it does not work for me; it shows an error.

@AbrahamSanders
Owner

Make sure your console working directory is D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot. You should be able to see the hparams.json file directly in this folder. If you are unsure of the working directory, run it from an ipython console and set it manually:

ipython

import os
os.chdir(r"D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot")

%run train.py --datasetdir=datasets\cornell_movie_dialog

@harshalpatilnmu
Author

I set the directory, but it still shows an error:

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\chatbot_dataset
Traceback (most recent call last):
File "train.py", line 14, in
dataset_dir, model_dir, hparams, resume_checkpoint = general_utils.initializ
e_session("train")
File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\general_utils.py", lin
e 60, in initialize_session
hparams = Hparams.load(hparams_filepath)
File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\hparams.py", line 33,
in load
hparams = jsonpickle.decode(json)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-p
ackages\jsonpickle\unpickler.py", line 39, in decode
data = backend.decode(string)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-p
ackages\jsonpickle\backend.py", line 194, in decode
raise e
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-p
ackages\jsonpickle\backend.py", line 191, in decode
return self.backend_decode(name, string)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-p
ackages\jsonpickle\backend.py", line 203, in backend_decode
return self.decoders[name](string, *optargs, **decoder_kwargs)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json_
init
.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json\d
ecoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json\d
ecoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 57 column 33 (char 2
005)

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>

@AbrahamSanders
Owner

The error message is saying: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 57 column 33

Check the hparams.json file to make sure no comma is missing on a line that should have one. If you are not sure, copy and paste the file into the left box here: https://jsoneditoronline.org/ and it will automatically detect formatting errors.

I tried this with the committed version in the repository and there are no errors detected.
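
If you prefer to check it locally, here is a minimal sketch using only Python's standard library (not a script from the repo) that reports the same line and column as the traceback above:

import json

# Load the raw text so the offending line can be printed alongside the error
with open("hparams.json", "r") as f:
    text = f.read()

try:
    json.loads(text)
    print("hparams.json is valid JSON")
except json.JSONDecodeError as e:
    # e.lineno / e.colno match the location reported in the jsonpickle traceback
    print("Syntax error at line {}, column {}: {}".format(e.lineno, e.colno, e.msg))
    print("Offending line: " + text.splitlines()[e.lineno - 1])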

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 13, 2018

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\cornell_movie_dialog

Reading dataset 'cornell_movie_dialog'...
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    decoder_embeddings_dir = decoder_embeddings_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\dataset_reader.py", line 106, in read_dataset
    question = id2line[conversation[i]]

@AbrahamSanders
Owner

Looks like your dataset is probably not formatted the same way as the cornell movie dialog dataset. You will need to implement a reader for your custom dataset:

See cornell_dataset_reader.py - this class implements the reader that converts the raw Cornell files "movie_lines.txt" and "movie_conversations.txt" into the dict id2line and the list conversations_ids.

Duplicate this class, rename it and tweak the implementation to work with your own dataset format - all that matters is that the output is the same - id2line is a dictionary of dialog lines with unique ids, and conversations_ids is a list of sequences of dialog line ids (each sequence of ids represents a dialog between two people for one or more turns).

Once the new reader is implemented, register an instance of it in the dataset_reader_factory:
readers = [CornellDatasetReader(), YourNewDatasetReader()]

Alternatively, if you don't want to do all of this, modify your dataset so that it follows the same format as the Cornell movie dialog dataset.
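
To make the expected output concrete, here is a minimal skeleton of such a reader (an illustrative sketch, not code from the repo; the file, class, and dataset folder names are placeholders):

# your_new_dataset_reader.py - skeleton for a custom reader (illustrative sketch)
from dataset_readers.dataset_reader import DatasetReader

class YourNewDatasetReader(DatasetReader):
    def __init__(self):
        # "your_new_dataset" is the folder name under datasets\ (placeholder)
        super(YourNewDatasetReader, self).__init__("your_new_dataset")

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        id2line = {}            # maps a unique line id -> dialog line text
        conversations_ids = []  # list of [line_id, line_id, ...] sequences, one per dialog

        # ... parse the raw files in dataset_dir and fill the two structures here ...

        return id2line, conversations_ids

Every id that appears in conversations_ids must exist as a key in id2line, since the base class looks each line up by id when it builds the question-answer pairs.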

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 13, 2018

I have a CSV file in which the data is formatted as questions and answers, so how can I read it in dataset_reader_factory? I used the pd.read_csv() function, but I got stuck in your code: how do I use id2line and conversations_ids? In my case the data is already prepared; I don't need to split or replace anything. Could you help me write the code?
Check the following code:

"""
Reader class for the Cornell movie dialog dataset
"""
from os import path

from dataset_readers.dataset_reader import DatasetReader
import pandas as pd

class CornellDatasetReader(DatasetReader):
"""Reader implementation for the Cornell movie dialog dataset
"""
def init(self):
super(CornellDatasetReader, self).init("cornell_movie_dialog")

def _get_dialog_lines_and_conversations(self, dataset_dir):
    """Get dialog lines and conversations. See base class for explanation.
    Args:
        See base class
    """
   # movie_lines_filepath = path.join(dataset_dir, "movie_lines.txt")
   # movie_conversations_filepath = path.join(dataset_dir, "movie_conversations.txt")
    
    # Importing the dataset
    #with open(movie_lines_filepath, encoding="utf-8", errors="ignore") as file:
     #   lines = file.read()
    
    #with open(movie_conversations_filepath, encoding="utf-8", errors="ignore") as file:
    #    conversations = file.read()


    
    # Creating a dictionary that maps each line and its id
    #id2line = {}
    #for line in lines:
     #   _line = line.split(" +++$+++ ")
      #  if len(_line) == 5:
       #     id2line[_line[0]] = _line[4]
    
    # Creating a list of all of the conversations
    #conversations_ids = []
    #for conversation in conversations[:-1]:
     #   _conversation = conversation.split(" +++$+++ ")[-1][1:-1].replace("'", "").replace(" ", "")
      #  conv_ids = _conversation.split(",")
       # conversations_ids.append(conv_ids)



**data = pd.read_csv('abc_data.csv', encoding ='ISO-8859-1', header=None)

## Creating a dictionary that maps each line and its id
id2line=data.to_dict()[1]

#Creating a list of all of the conversations
conversations_ids = data.values.tolist()**
    
    return id2line, conversations_ids

@AbrahamSanders
Owner

AbrahamSanders commented Nov 13, 2018

The base class is expecting the data in the format of a conversational log, such as:
Person 1: Hello!
Person 2: How are you?
Person 1: Good, you?
Person 2: Same here.

It infers question-answer pairs as follows:
Question: Hello! --> Answer: How are you?
Question: How are you? --> Answer: Good, you?
Question: Good, you? --> Answer: Same here.

If your data is already in question-answer form, unfortunately you will still need to present it as a conversational log and let the base class derive the pairs from it.
Further development could address this and enable a dataset like yours to be used directly - I will open a separate feature-request issue in the repo.

For now, you can take each question-answer pair from your CSV and do this (pseudo code):

for i, qa_pair in enumerate(csv):
  id2line.append("{}_q".format(i), qa_pair["question"])
  id2line.append("{}_a".format(i), qa_pair["answer"])
  conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])

return id2line, conversations_ids
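
As a concrete version of that pseudo code (a hedged sketch - it assumes pandas, and that the question is in the first CSV column and the answer in the second), note that a dict is filled by key assignment rather than .append():

import pandas as pd

def read_qa_csv(csv_path):
    # Illustrative sketch only - the column positions are assumptions about the CSV layout
    data = pd.read_csv(csv_path, encoding="ISO-8859-1", header=None)

    id2line = {}
    conversations_ids = []
    for i, row in data.iterrows():
        id2line["{}_q".format(i)] = str(row[0])  # question column (assumed)
        id2line["{}_a".format(i)] = str(row[1])  # answer column (assumed)
        conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])

    return id2line, conversations_ids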

One additional thing - you should set conv_history_length to 0 in hparams.json, both under training_hparams and inference_hparams. If you don't do this, the chatbot will prepend the last N conversation turns to the input as a sort of context, which is probably not what you want if you are trying to make a Q&A bot rather than a conversational bot.

Alternatively, if you are willing to share your CSV, I can implement the reader and train it on my Titan V GPU.

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 16, 2018

Hi AbrahamSanders,
The data is formatted as questions and answers. I am sharing the CSV file; this is dummy data, but the format is the same. Could you help me write the code? Thanks.
csv_data.xlsx
(this file is in CSV format)

@AbrahamSanders
Owner

@harshalpatilnmu, pull down csv_dataset_reader.py and dataset_reader_factory.py

Make sure to save your data as a CSV (I don't know if Pandas will accept .xlsx)

Finally, follow the instructions here.

Let me know how it goes!

Some additional notes on hparam configuration (hparams.json):

If you have a basic Q&A dataset, set the hparam inference_hparams/conv_history_length to 0 so that it will treat each question independently while chatting.

Also, you can reduce the size of your model if you have a smaller dataset. The default is pretty big - 4 layer encoder/decoder, 1024 cell units per layer. You can choose to train with the sgd or adam optimizers - the default learning rate is good for sgd, but if you use adam then lower it to 0.001.
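
For reference, here is a hedged sketch of applying the conv_history_length change with the standard json module. It assumes the jsonpickle-encoded hparams.json keeps plain "training_hparams" and "inference_hparams" objects with a "conv_history_length" field, and the model folder path is a placeholder - editing the file by hand works just as well:

import json

hparams_path = r"models\csv\my_model\hparams.json"  # placeholder path to your model's copy

with open(hparams_path, "r") as f:
    hparams = json.load(f)

# Treat each question independently while chatting (see the note above)
for section in ("training_hparams", "inference_hparams"):
    if section in hparams and "conv_history_length" in hparams[section]:
        hparams[section]["conv_history_length"] = 0

with open(hparams_path, "w") as f:
    json.dump(hparams, f, indent=4)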

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 19, 2018

Following is my code:

"""
Reader class for the Cornell movie dialog dataset
"""
from os import path
from dataset_readers.dataset_reader import DatasetReader
import pandas as pd

class CornellDatasetReader(DatasetReader):

    def __init__(self):
        super(CornellDatasetReader, self).__init__("cornell_movie_dialog")

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        data = pd.read_csv('full_data.csv', encoding='ISO-8859-1', header=None)
        print(data)
        id2line = {}
        print(id2line)
        conversations_ids = []
        for i, qa_pair in enumerate(data):
            id2line.append("{}_q".format(i), qa_pair["question"])
            id2line.append("{}_a".format(i), qa_pair["answer"])
            conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])
        return id2line, conversations_ids

error:
(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\cornell_movie_dialog

Reading dataset 'cornell_movie_dialog'...
{}
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    decoder_embeddings_dir = decoder_embeddings_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\dataset_reader.py", line 88, in read_dataset
    id2line, conversations_ids = self._get_dialog_lines_and_conversations(dataset_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\cornell_dataset_reader.py", line 62, in _get_dialog_lines_and_conversations
    id2line.append("{}_q".format(i), qa_pair["question"])
AttributeError: 'dict' object has no attribute 'append'

@AbrahamSanders
Owner

AbrahamSanders commented Nov 19, 2018

@harshalpatilnmu please follow the directions in my last post. Revert cornell_dataset_reader.py and pull down the new reader as per my post. This should be able to process your CSV - I tested it successfully on the dummy data you sent me.

Also, make sure your data is in the directory \datasets\csv and not \datasets\cornell_movie_dialog as per the CSV readme

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 20, 2018

Thanks a lot for the support - the model now trains on my dataset properly. I set the hparam inference_hparams/conv_history_length to 0, but it shows repeated answers: when I type a question the first time it gives the correct answer, but the second time, when I pass some other input, the chatbot returns the previous output. How can I avoid this?

@AbrahamSanders
Owner

@harshalpatilnmu you're welcome - I'm glad training is working for you now.

Here are a few considerations to help resolve your issue:

  1. Size of the dataset - How many training examples are in your dataset? If it is too small, the model will not be able to generalize linguistic rules and is likely to overfit. There is no exact number of examples that would be considered a large enough dataset, but the general rule is the bigger the better. If you have a small dataset you can try training with frozen pre-trained embeddings.

To use pre-trained embeddings, follow these suggestions:
a) If your dataset is mostly common English words:
change model_hparams/encoder_embedding_trainable and model_hparams/decoder_embedding_trainable to false, and change training_hparams/input_vocab_import_mode and training_hparams/output_vocab_import_mode to ExternalIntersectDataset

b) If your dataset is mostly technical, proprietary, or domain-specific words (or words in a language other than English):
no additional changes to the default hparams.json are needed.

To run it, use the training batch file with nnlm_en embeddings.

  2. Unbalanced dataset - If your dataset is unbalanced, you can run into this kind of issue. For example, if you have 10,000 questions where 5,000 of them have the same answer "I don't know" and the other 5,000 have unique answers, then your model will likely respond with "I don't know" all the time. A loose way of looking at this is that for any given question, there is at least a 50% chance that the answer is "I don't know". And as you probably already know, beam-search decoding takes the sequence with the highest cumulative probability given the encoded input. (A quick way to check this on your CSV is sketched at the end of this comment.)

  3. Underfitting - If you underfit (don't train enough), the model can spit out the same response again and again because beam search selects the cumulatively most likely sequence. In an underfit model, this sequence will be the one that appears most often in your answer set.

  4. Model size - If your model is too small, it can cause underfitting; if it is too big, it can cause overfitting. The default model size is 4 layers x 1024 units with a bi-directional encoder (2 forward, 2 backward). This is appropriate for the Cornell dataset with 300,000 training examples. If you have a smaller dataset, try a smaller model.

  5. hparams - If you change the inference hparams (like setting inference_hparams/conv_history_length to 0 in hparams.json), make sure you are:
    a) Changing the hparams.json in your model folder, not the one in the base seq2seq-chatbot folder.
    b) Restarting the chat script after changing and saving hparams.json. If you want to change hparams on the fly for the current session only, use the chat commands. For example, you can set conv_history_length to 0 for the current session at runtime by typing --convhistlength=0

  6. Beam search - beam search can be tweaked to optimize your output. The default model_hparams/beam_width is 20; try lowering or raising it. Setting it to 0 disables beam search and uses greedy decoding. This can also be done at runtime with --beamwidth=N. You can also influence the weights used in beam ranking by changing inference_hparams/beam_length_penalty_weight. The default is 1.25, but you can try raising or lowering it: higher weights prefer longer sequences, lower weights prefer shorter ones. You can do this at runtime with --beamlenpenalty=N

I hope I have given you enough info to optimize your model. Let me know how it goes, I am happy to answer any questions!
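
As a quick way to check point 2 (how unbalanced the answer set is) on a CSV like the one discussed above - again a hedged sketch that assumes the answers are in the second column:

import pandas as pd

# Illustrative balance check (the column layout is an assumption about your CSV)
data = pd.read_csv("full_data.csv", encoding="ISO-8859-1", header=None)
answer_counts = data[1].value_counts()

print(answer_counts.head(10))  # the ten most frequent answers
print("Most frequent answer covers {:.1%} of all pairs".format(answer_counts.iloc[0] / len(data)))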

@harshalpatilnmu
Author


File size is 157 KB.
