run a new dataset #15

Open
ghost opened this issue Feb 23, 2019 · 11 comments

@ghost

ghost commented Feb 23, 2019

I have a medical question-and-answer dataset and I want to train your model on it. Please guide me on how to do that.

@AbrahamSanders
Owner

Hey @Aprajita1, if your dataset is in a CSV format or you can get it into one, take a look here: https://github.com/AbrahamSanders/seq2seq-chatbot/tree/master/seq2seq-chatbot/datasets/csv

Also, I am currently writing documentation on how to train on a new dataset starting from a model pre-trained on a different dataset, rather than from randomized weights, in case that is what you are looking to do.

Let me know if the CSV works for you.
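
For readers following along, here is a minimal sketch of getting question/answer pairs into CSV form with Python's standard `csv` module. The two-column question, answer layout and the function name `pairs_to_csv` are assumptions made for illustration; check the linked `datasets/csv` folder for the column layout and file name the reader actually expects.

```python
import csv
import io


def pairs_to_csv(pairs):
    """Serialize (question, answer) pairs as CSV text, one pair per row.

    Assumed layout: column 1 = question, column 2 = answer, no header row.
    """
    buffer = io.StringIO(newline="")
    # QUOTE_ALL keeps commas and quotes inside the text from breaking columns.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    for question, answer in pairs:
        writer.writerow([question, answer])
    return buffer.getvalue()


# Hypothetical sample data, not taken from the dataset discussed here.
sample = [
    ("What is a fever?", "A raised body temperature."),
    ('Is "flu" contagious, or not?', "Yes, it spreads easily."),
]
csv_text = pairs_to_csv(sample)
```

Quoting every field is a conservative choice: medical answers often contain commas, and an unquoted comma would silently shift text into a third column.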

@ghost
Author

ghost commented Feb 25, 2019 via email

@AbrahamSanders
Owner

@Aprajita1 If you have copied the format of the Cornell Movie Dialog dataset, you can simply replace the movie_lines.txt and movie_conversations.txt files in the datasets\cornell_movie_dialog folder with your own files. Make sure to preserve the file names and everything should work properly.
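
As a rough sketch of what "copying the format" involves: in the published Cornell corpus, fields are separated by ` +++$+++ `, each row of movie_lines.txt holds a line ID, character ID, movie ID, character name and the utterance text, and each row of movie_conversations.txt lists two character IDs, a movie ID and the line IDs of one conversation. The helper `to_cornell`, the IDs `u0`/`u1`/`m0` and the speaker names below are placeholders invented for illustration, and which of these fields this project's reader actually uses is not confirmed here.

```python
SEP = " +++$+++ "


def to_cornell(pairs):
    """Turn (question, answer) pairs into Cornell-style rows.

    Returns (lines, conversations): rows for movie_lines.txt and
    movie_conversations.txt respectively. IDs and names are placeholders.
    """
    lines, conversations = [], []
    for i, (question, answer) in enumerate(pairs):
        q_id, a_id = f"L{2 * i}", f"L{2 * i + 1}"
        # Collapse internal whitespace so each utterance stays on one row.
        question = " ".join(question.split())
        answer = " ".join(answer.split())
        lines.append(SEP.join([q_id, "u0", "m0", "PATIENT", question]))
        lines.append(SEP.join([a_id, "u1", "m0", "DOCTOR", answer]))
        # One two-turn conversation per pair, written as a Python-style list.
        conversations.append(SEP.join(["u0", "u1", "m0", f"['{q_id}', '{a_id}']"]))
    return lines, conversations


lines, conversations = to_cornell([("What is a fever?", "A raised body temperature.")])
```

Each list can then be written out one row per line under the original file names.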

How big is your dataset? For smaller datasets it may be beneficial to pre-train on a larger dataset first.

@ghost
Author

ghost commented Feb 27, 2019 via email

@AbrahamSanders
Owner

Hey @Aprajita1, with only 1200 lines of training data you will probably need pre-training to get a model that works well. This is possible with the chatbot, but it is currently not straightforward. I am going to make some updates in the near future that make it easier. For now, if you share your dataset with me, I can train on it using trained_model_v2 as a pre-trained baseline.

Also, there is now support for a new dataset format, DailyDialog, as an alternative to Cornell. It is easier to format your own dataset this way. Take a look here
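
To illustrate why this layout is easier to produce by hand: in the published DailyDialog corpus, each line of the text file is one whole dialogue, and every utterance is followed by the marker `__eou__`. The sketch below assumes that convention; the helper name `to_dailydialog_line` and the sample text are made up for illustration, and the file name and folder this project's reader expects are not confirmed here.

```python
EOU = "__eou__"


def to_dailydialog_line(utterances):
    """Join the turns of one conversation into a single DailyDialog-style line."""
    # Collapse whitespace so a multi-line answer cannot split the dialogue.
    cleaned = (" ".join(u.split()) for u in utterances)
    return " ".join(f"{u} {EOU}" for u in cleaned)


# Hypothetical three-turn conversation with a follow-up question.
dialogue = to_dailydialog_line([
    "What is a fever?",
    "A raised body temperature.",
    "When should I see a doctor?",
])
```

Because a dialogue is just a list of turns, follow-up questions need no extra ID bookkeeping, unlike the Cornell layout.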

@ghost
Author

ghost commented Mar 8, 2019 via email

@AbrahamSanders
Owner

@Aprajita1 I see your conversation length is 2 at most. You can certainly use the CSV dataset for this since there are no follow-up questions. Would you be able to provide it as a CSV? Follow this format.

@ghost
Author

ghost commented Mar 10, 2019 via email

@ghost
Author

ghost commented Mar 11, 2019 via email

@AbrahamSanders
Owner

Hey @Aprajita1, I don't see the CSV dataset. Did you email it or attach it here?

We can definitely handle follow-up questions, but I need to know what that dataset looks like to determine how best to format it. Can you send that one too?

@ghost
Author

ghost commented Mar 14, 2019 via email
