showing error #11

Open
harshalpatilnmu opened this issue Nov 12, 2018 · 16 comments
Comments

@harshalpatilnmu

When I try to train the model, it shows an error:
(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\chatbot_dataset
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    dataset_dir, model_dir, hparams, resume_checkpoint = general_utils.initialize_session("train")
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\general_utils.py", line 45, in initialize_session
    copyfile("hparams.json", os.path.join(model_dir, "hparams.json"))
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'hparams.json'

@AbrahamSanders
Owner

Hey @harshalpatilnmu,

Try these:

  1. If you are on Windows, you can use the training batch files inside the dataset directory. For example, datasets/cornell_movie_dialog/train_with_nnlm_en_embeddings.bat. These should set the working directory automatically. There are multiple batch files - each one configures training with different pre-trained embeddings in different configurations. To train your own embeddings, use train_with_random_embeddings.bat

  2. If you are running the train.py file yourself, make sure your console working directory is set to the innermost seq2seq-chatbot path. Based on your log above, that would be D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot on your machine.

Also, there are pre-trained models you can download and try out - see here:
https://github.com/AbrahamSanders/seq2seq-chatbot/tree/master/seq2seq-chatbot/models/cornell_movie_dialog

@harshalpatilnmu
Author

My dataset is different, so I first need to train the model before I can run it. How can I train my model?
I followed your training command but it does not work for me; it shows an error.

@AbrahamSanders
Owner

Make sure your console working directory is D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot. You should be able to see the hparams.json file directly in this folder. If you are unsure of the working directory, run it from an ipython console and set it manually:

ipython

import os
os.chdir(r"D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot")

%run train.py --datasetdir=datasets\cornell_movie_dialog

@harshalpatilnmu
Author

I set the directory, but it still shows an error:

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\chatbot_dataset
Traceback (most recent call last):
File "train.py", line 14, in
dataset_dir, model_dir, hparams, resume_checkpoint = general_utils.initializ
e_session("train")
File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\general_utils.py", lin
e 60, in initialize_session
hparams = Hparams.load(hparams_filepath)
File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\hparams.py", line 33,
in load
hparams = jsonpickle.decode(json)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-p
ackages\jsonpickle\unpickler.py", line 39, in decode
data = backend.decode(string)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-p
ackages\jsonpickle\backend.py", line 194, in decode
raise e
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-p
ackages\jsonpickle\backend.py", line 191, in decode
return self.backend_decode(name, string)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-p
ackages\jsonpickle\backend.py", line 203, in backend_decode
return self.decoders[name](string, *optargs, **decoder_kwargs)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json_
init
.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json\d
ecoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json\d
ecoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 57 column 33 (char 2
005)

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>

@AbrahamSanders
Owner

The error message is saying: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 57 column 33

Check the hparams.json file to make sure no comma is missing on a line that should have one. If you are not sure, copy and paste the file into the left box here: https://jsoneditoronline.org/ and it will automatically detect formatting errors.

I tried this with the committed version in the repository and there are no errors detected.
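
If you prefer to check it locally, here is a minimal sketch using only Python's standard library (not a script from the repo) that reports the same line and column as the traceback above:

import json

# Load the raw text so the offending line can be printed alongside the error
with open("hparams.json", "r") as f:
    text = f.read()

try:
    json.loads(text)
    print("hparams.json is valid JSON")
except json.JSONDecodeError as e:
    # e.lineno / e.colno match the location reported in the jsonpickle traceback
    print("Syntax error at line {}, column {}: {}".format(e.lineno, e.colno, e.msg))
    print("Offending line: " + text.splitlines()[e.lineno - 1])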

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 13, 2018

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\cornell_movie_dialog

Reading dataset 'cornell_movie_dialog'...
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    decoder_embeddings_dir = decoder_embeddings_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\dataset_reader.py", line 106, in read_dataset
    question = id2line[conversation[i]]

@AbrahamSanders
Owner

Looks like your dataset is probably not formatted the same way as the cornell movie dialog dataset. You will need to implement a reader for your custom dataset:

See cornell_dataset_reader.py - this class implements the reader that converts the raw Cornell files "movie_lines.txt" and "movie_conversations.txt" into the dict id2line and the list conversations_ids.

Duplicate this class, rename it and tweak the implementation to work with your own dataset format - all that matters is that the output is the same - id2line is a dictionary of dialog lines with unique ids, and conversations_ids is a list of sequences of dialog line ids (each sequence of ids represents a dialog between two people for one or more turns).

Once the new reader is implemented, register an instance of it in the dataset_reader_factory:
readers = [CornellDatasetReader(), YourNewDatasetReader()]

Alternatively, if you don't want to do all of this, modify your dataset so that it follows the same format as the Cornell movie dialog dataset.
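
To make the expected output concrete, here is a minimal skeleton of such a reader (an illustrative sketch, not code from the repo; the file, class, and dataset folder names are placeholders):

# your_new_dataset_reader.py - skeleton for a custom reader (illustrative sketch)
from dataset_readers.dataset_reader import DatasetReader

class YourNewDatasetReader(DatasetReader):
    def __init__(self):
        # "your_new_dataset" is the folder name under datasets\ (placeholder)
        super(YourNewDatasetReader, self).__init__("your_new_dataset")

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        id2line = {}            # maps a unique line id -> dialog line text
        conversations_ids = []  # list of [line_id, line_id, ...] sequences, one per dialog

        # ... parse the raw files in dataset_dir and fill the two structures here ...

        return id2line, conversations_ids

Every id that appears in conversations_ids must exist as a key in id2line, since the base class looks each line up by id when it builds the question-answer pairs.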

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 13, 2018

I have a CSV file in which the data is formatted as questions and answers, so how can I read it in dataset_reader_factory? I used the pd.read_csv() function, but I got stuck in your code: how do I use id2line and conversations_ids? In my case the data is already prepared; I don't need to split or replace anything. Could you help me write the code?
Check the following code:

"""
Reader class for the Cornell movie dialog dataset
"""
from os import path

from dataset_readers.dataset_reader import DatasetReader
import pandas as pd

class CornellDatasetReader(DatasetReader):
"""Reader implementation for the Cornell movie dialog dataset
"""
def init(self):
super(CornellDatasetReader, self).init("cornell_movie_dialog")

def _get_dialog_lines_and_conversations(self, dataset_dir):
    """Get dialog lines and conversations. See base class for explanation.
    Args:
        See base class
    """
   # movie_lines_filepath = path.join(dataset_dir, "movie_lines.txt")
   # movie_conversations_filepath = path.join(dataset_dir, "movie_conversations.txt")
    
    # Importing the dataset
    #with open(movie_lines_filepath, encoding="utf-8", errors="ignore") as file:
     #   lines = file.read()
    
    #with open(movie_conversations_filepath, encoding="utf-8", errors="ignore") as file:
    #    conversations = file.read()


    
    # Creating a dictionary that maps each line and its id
    #id2line = {}
    #for line in lines:
     #   _line = line.split(" +++$+++ ")
      #  if len(_line) == 5:
       #     id2line[_line[0]] = _line[4]
    
    # Creating a list of all of the conversations
    #conversations_ids = []
    #for conversation in conversations[:-1]:
     #   _conversation = conversation.split(" +++$+++ ")[-1][1:-1].replace("'", "").replace(" ", "")
      #  conv_ids = _conversation.split(",")
       # conversations_ids.append(conv_ids)



**data = pd.read_csv('abc_data.csv', encoding ='ISO-8859-1', header=None)

## Creating a dictionary that maps each line and its id
id2line=data.to_dict()[1]

#Creating a list of all of the conversations
conversations_ids = data.values.tolist()**
    
    return id2line, conversations_ids

@AbrahamSanders
Owner

AbrahamSanders commented Nov 13, 2018

The base class is expecting the data in the format of a conversational log, such as:
Person 1: Hello!
Person 2: How are you?
Person 1: Good, you?
Person 2: Same here.

It infers question-answer pairs as follows:
Question: Hello! --> Answer: How are you?
Question: How are you? --> Answer: Good, you?
Question: Good, you? --> Answer: Same here.

If your data is already in question-answer form, unfortunately you will still need to present it as a conversational log and let the base class derive the pairs from it.
Further development could address this and enable a dataset like yours to be used directly - I will open a separate feature-request issue in the repo.

For now, you can take each question-answer pair from your CSV and do this (pseudo code):

for i, qa_pair in enumerate(csv):
  id2line.append("{}_q".format(i), qa_pair["question"])
  id2line.append("{}_a".format(i), qa_pair["answer"])
  conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])

return id2line, conversations_ids
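
As a concrete version of that pseudo code (a hedged sketch - it assumes pandas, and that the question is in the first CSV column and the answer in the second), note that a dict is filled by key assignment rather than .append():

import pandas as pd

def read_qa_csv(csv_path):
    # Illustrative sketch only - the column positions are assumptions about the CSV layout
    data = pd.read_csv(csv_path, encoding="ISO-8859-1", header=None)

    id2line = {}
    conversations_ids = []
    for i, row in data.iterrows():
        id2line["{}_q".format(i)] = str(row[0])  # question column (assumed)
        id2line["{}_a".format(i)] = str(row[1])  # answer column (assumed)
        conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])

    return id2line, conversations_ids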

One additional thing - you should set conv_history_length to 0 in hparams.json, both under training_hparams and inference_hparams. If you don't do this, the chatbot will prepend the last N conversation turns to the input as a sort of context, which is probably not what you want if you are trying to make a Q&A bot rather than a conversational bot.

Alternatively, if you are willing to share your CSV, I can implement the reader and train it on my Titan V GPU.

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 16, 2018

Hi AbrahamSanders,
The data is formatted as questions and answers. I am sharing the CSV file; this is dummy data, but the format is the same. Could you help me write the code? Thanks.
csv_data.xlsx
(this file is in CSV format)

@AbrahamSanders
Owner

@harshalpatilnmu, pull down csv_dataset_reader.py and dataset_reader_factory.py

Make sure to save your data as a CSV (I don't know if Pandas will accept .xlsx)

Finally, follow the instructions here.

Let me know how it goes!

Some additional notes on hparam configuration (hparams.json):

If you have a basic Q&A dataset, set the hparam inference_hparams/conv_history_length to 0 so that it will treat each question independently while chatting.

Also, you can reduce the size of your model if you have a smaller dataset. The default is pretty big - 4 layer encoder/decoder, 1024 cell units per layer. You can choose to train with the sgd or adam optimizers - the default learning rate is good for sgd, but if you use adam then lower it to 0.001.
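
For reference, here is a hedged sketch of applying the conv_history_length change with the standard json module. It assumes the jsonpickle-encoded hparams.json keeps plain "training_hparams" and "inference_hparams" objects with a "conv_history_length" field, and the model folder path is a placeholder - editing the file by hand works just as well:

import json

hparams_path = r"models\csv\my_model\hparams.json"  # placeholder path to your model's copy

with open(hparams_path, "r") as f:
    hparams = json.load(f)

# Treat each question independently while chatting (see the note above)
for section in ("training_hparams", "inference_hparams"):
    if section in hparams and "conv_history_length" in hparams[section]:
        hparams[section]["conv_history_length"] = 0

with open(hparams_path, "w") as f:
    json.dump(hparams, f, indent=4)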

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 19, 2018

Following is my code:

"""
Reader class for the Cornell movie dialog dataset
"""
from os import path
from dataset_readers.dataset_reader import DatasetReader
import pandas as pd

class CornellDatasetReader(DatasetReader):

    def __init__(self):
        super(CornellDatasetReader, self).__init__("cornell_movie_dialog")

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        data = pd.read_csv('full_data.csv', encoding='ISO-8859-1', header=None)
        print(data)
        id2line = {}
        print(id2line)
        conversations_ids = []
        for i, qa_pair in enumerate(data):
            id2line.append("{}_q".format(i), qa_pair["question"])
            id2line.append("{}_a".format(i), qa_pair["answer"])
            conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])
        return id2line, conversations_ids

error:
(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\cornell_movie_dialog

Reading dataset 'cornell_movie_dialog'...
{}
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    decoder_embeddings_dir = decoder_embeddings_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\dataset_reader.py", line 88, in read_dataset
    id2line, conversations_ids = self._get_dialog_lines_and_conversations(dataset_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\cornell_dataset_reader.py", line 62, in _get_dialog_lines_and_conversations
    id2line.append("{}_q".format(i), qa_pair["question"])
AttributeError: 'dict' object has no attribute 'append'

@AbrahamSanders
Owner

AbrahamSanders commented Nov 19, 2018

@harshalpatilnmu please follow the directions in my last post. Revert cornell_dataset_reader.py and pull down the new reader as per my post. This should be able to process your CSV - I tested it successfully on the dummy data you sent me.

Also, make sure your data is in the directory \datasets\csv and not \datasets\cornell_movie_dialog as per the CSV readme

@harshalpatilnmu
Author

harshalpatilnmu commented Nov 20, 2018

Thanks a lot for the support - the model now trains on my dataset properly. I set the hparam inference_hparams/conv_history_length to 0, but it shows repeated answers: when I type a question the first time it gives the correct answer, but the second time, when I pass some other input, the chatbot returns the previous output. How can I avoid this?

@AbrahamSanders
Owner

@harshalpatilnmu you're welcome - I'm glad training is working for you now.

Here are a few considerations to help resolve your issue:

  1. Size of the dataset - How many training examples are in your dataset? If it is too small, the model will not be able to generalize linguistic rules and is likely to overfit. There is no exact number of examples that would be considered a large enough dataset, but the general rule is the bigger the better. If you have a small dataset you can try training with frozen pre-trained embeddings.

To use pre-trained embeddings, follow these suggestions:
a) If your dataset is mostly common English words:
change model_hparams/encoder_embedding_trainable and model_hparams/decoder_embedding_trainable to false, and change training_hparams/input_vocab_import_mode and training_hparams/output_vocab_import_mode to ExternalIntersectDataset

b) If your dataset is mostly technical, proprietary, or domain-specific words (or words in a language other than English):
no additional changes to the default hparams.json are needed.

To run it, use the training batch file with nnlm_en embeddings.

  2. Unbalanced dataset - If your dataset is unbalanced, you can run into this kind of issue. For example, if you have 10,000 questions where 5,000 of them have the same answer "I don't know" and the other 5,000 have unique answers, then your model will likely respond with "I don't know" all the time. A loose way of looking at this is that for any given question, there is at least a 50% chance that the answer is "I don't know". And as you probably already know, beam-search decoding takes the sequence with the highest cumulative probability given the encoded input. (A quick way to check this on your CSV is sketched at the end of this comment.)

  3. Underfitting - If you underfit (don't train enough), the model can spit out the same response again and again because beam search selects the cumulatively most likely sequence. In an underfit model, this sequence will be the one that appears most often in your answer set.

  4. Model size - If your model is too small, it can cause underfitting; if it is too big, it can cause overfitting. The default model size is 4 layers x 1024 units with a bi-directional encoder (2 forward, 2 backward). This is appropriate for the Cornell dataset with 300,000 training examples. If you have a smaller dataset, try a smaller model.

  5. hparams - If you change the inference hparams (like setting inference_hparams/conv_history_length to 0 in hparams.json), make sure you are:
    a) Changing the hparams.json in your model folder, not the one in the base seq2seq-chatbot folder.
    b) Restarting the chat script after changing and saving hparams.json. If you want to change hparams on the fly for the current session only, use the chat commands. For example, you can set conv_history_length to 0 for the current session at runtime by typing --convhistlength=0

  6. Beam search - beam search can be tweaked to optimize your output. The default model_hparams/beam_width is 20; try lowering or raising it. Setting it to 0 disables beam search and uses greedy decoding. This can also be done at runtime with --beamwidth=N. You can also influence the weights used in beam ranking by changing inference_hparams/beam_length_penalty_weight. The default is 1.25, but you can try raising or lowering it: higher weights prefer longer sequences, lower weights prefer shorter ones. You can do this at runtime with --beamlenpenalty=N

I hope I have given you enough info to optimize your model. Let me know how it goes, I am happy to answer any questions!
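
As a quick way to check point 2 (how unbalanced the answer set is) on a CSV like the one discussed above - again a hedged sketch that assumes the answers are in the second column:

import pandas as pd

# Illustrative balance check (the column layout is an assumption about your CSV)
data = pd.read_csv("full_data.csv", encoding="ISO-8859-1", header=None)
answer_counts = data[1].value_counts()

print(answer_counts.head(10))  # the ten most frequent answers
print("Most frequent answer covers {:.1%} of all pairs".format(answer_counts.iloc[0] / len(data)))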

@harshalpatilnmu
Author


File size is 157 KB.
