Improvement in input/output formats when dealing with arrays of strings / tokens for RNN.
Basic example
Currently the LSTM model, for example, accepts several types of data as input and as training data, including arrays and strings. Out-of-the-box support for strings is great, but some of its defaults don't fit many use cases.
Current behavior
Given raw strings:
const data = [{ input: 'hello I am an input', output: 'hi I am the output' }];
This implicitly uses character tokenization, behaving like input.split('') on the way in and output.join('') on the way out. See #799, which I believe is a common LSTM use case: string input, label output. Because of the default behavior, the labels are treated as actual text.
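To make the default concrete, here is a minimal sketch of the character-level round trip described above. It uses plain JavaScript only and does not call the library; the variable names are made up for illustration, and the "prediction" is simply the expected output split into characters.

```javascript
// Sketch only: mimics the described default with plain JavaScript.
const sample = { input: 'hello I am an input', output: 'hi I am the output' };

// Input side: a raw string is split into single characters.
const inputChars = sample.input.split('');
console.log(inputChars.slice(0, 5)); // [ 'h', 'e', 'l', 'l', 'o' ]

// Output side: predicted tokens are concatenated with no separator.
// Here we pretend the model predicted the expected characters exactly.
const predictedChars = sample.output.split('');
const predictedText = predictedChars.join('');
console.log(predictedText); // hi I am the output
```

With single characters as tokens, join('') reconstructs the text correctly, which is why the default works for raw strings but not for word or label tokens.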
It is possible to preprocess the data by splitting everything into arrays, so that the model maps words to neurons instead of characters.
Here, we are using words instead of characters. That gives the model simpler data to learn from, and it also ensures the output matches the label names exactly. The problem right now is that my output would be soIexpecttokensasoutput because of the output.join('') behavior. Working with string arrays is not documented anywhere, so it surely causes trouble for some users of the library.
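As an illustration of the word-level case, the sketch below builds a hypothetical training pair that is already split into words, then applies the same join('') to a word-level prediction. The data and names are assumptions chosen to reproduce the output quoted above; nothing here calls the library.

```javascript
// Hypothetical training pair, pre-split into word tokens.
const wordData = [
  {
    input: 'hello I am an input'.split(' '),
    output: ['so', 'I', 'expect', 'tokens', 'as', 'output'],
  },
];
console.log(wordData[0].input); // [ 'hello', 'I', 'am', 'an', 'input' ]

// What output.join('') does to a word-level prediction:
// the token boundaries are lost.
const joinedWords = wordData[0].output.join('');
console.log(joinedWords); // soIexpecttokensasoutput
```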
Current workaround
For now the most basic workaround I can think of is to add a space to every word in an array so that the output is readable and processable.
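A rough sketch of that workaround, assuming the output is still concatenated with join(''): pad each token with a trailing space before training, then trim and split the result afterwards. The helper names padTokens and parseJoinedOutput are hypothetical.

```javascript
// Hypothetical helper: append a space to every token so that
// join('') produces readable, separable text.
function padTokens(tokens) {
  return tokens.map((token) => `${token} `);
}

// Hypothetical helper: recover the tokens from the joined string.
function parseJoinedOutput(text) {
  return text.trim().split(/\s+/);
}

const paddedTokens = padTokens(['so', 'I', 'expect', 'tokens', 'as', 'output']);
const joinedPadded = paddedTokens.join('');
console.log(JSON.stringify(joinedPadded)); // "so I expect tokens as output "
console.log(parseJoinedOutput(joinedPadded));
```

Note that this changes the vocabulary: the model now sees 'so ' rather than 'so', so the same padding has to be applied consistently wherever tokens are compared.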
Possible improvement
I think the best improvement would be to return the output in the same format as the input. In the case of #799, where strings are used as input and arrays as output, it should probably throw an explicit error saying that the input is a string while the output is an array. Preprocessing the data by splitting it into words with input.split(' ') would be a pretty easy step for the user, much easier than figuring out both how the input and output are mapped to neurons and how the output is formatted.
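The proposal could be sketched roughly as follows. Both functions are hypothetical and are not part of the library's API; they only show the two ideas, an explicit error on mixed formats and output formatting that mirrors the input.

```javascript
// Hypothetical check: reject training data whose input and output
// use different formats (string vs. array).
function assertSameFormat(datum) {
  const kind = (value) => (Array.isArray(value) ? 'an array' : `a ${typeof value}`);
  if (Array.isArray(datum.input) !== Array.isArray(datum.output)) {
    throw new TypeError(
      `input is ${kind(datum.input)} while output is ${kind(datum.output)}; ` +
        'use the same format for both'
    );
  }
}

// Hypothetical formatter: return predicted tokens in the input's format.
function formatLikeInput(input, predictedTokens) {
  return Array.isArray(input) ? predictedTokens.slice() : predictedTokens.join('');
}

// Array in, array out: token boundaries are preserved.
console.log(formatLikeInput(['hello', 'I'], ['so', 'I', 'expect']));
// String in, string out: same behavior as today.
console.log(formatLikeInput('hello', ['h', 'i'])); // hi

try {
  assertSameFormat({ input: 'hello I am an input', output: ['label'] });
} catch (err) {
  console.log(err.message);
}
```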
Motivation
Apart from the #799 issue, I have been dealing with this problem myself and spent some time on it before writing a lot of extra code as a workaround. Using string arrays with RNNs is not really documented and doesn't offer much customization.
A lot of RNN (specifically LSTM) usage seems to be for reading questions or requests and writing a response with words from a fairly limited, topic-specific vocabulary. Another common use is mapping labels to a given textual input, which is not well supported at the moment and relies on very implicit behavior. We could provide a more flexible RNN that is easier to customize in how it treats input and output, and that adapts to more use cases.
Working on this now. It's not as easy as it seems, but I have started abstracting the DataFormatter into a different type that can work with TypeScript generics and the original API. This is a priority for me because this is an existing feature that isn't alpha/beta.