Download Parsing #45
Thank you so much for providing this package! I am looking for a way to download the frequencies of many words at the same time, and to get information about their position within the sentences where they are used. For instance, I would like to be able to compare the frequency of "apple" being used as a subject versus as an object. Is there a possibility to adjust your code to do so and, further, to go on to more calculations than just plotting the frequencies?
All the best and thank you in advance!
Thanks for your interest in the package. The way it works is by scraping the Google Ngram Viewer page, so it is only possible to provide the data that you can obtain from that page. As far as I know there is no way to query for an arbitrary position in the sentence, but if you have a look at Google's information page you can ask for words appearing at the beginning (or end) of a sentence (see the link below). So you could run a function call like ngram(c("apple", "_START_ apple")). Note that this would give you data on all occurrences of "apple" and on those occurrences with "apple" at the start of a sentence; the first data set includes the second, so you may want to manipulate the data by subtracting the sentence-initial occurrences from the total. I hope that helps.
https://books.google.com/ngrams/info
Sean.
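For what it's worth, a minimal sketch of the subtraction described above might look like the following. It assumes the _START_ marker syntax from Google's info page and that ngram() returns its usual Year/Phrase/Frequency columns; treat it as an illustration rather than tested code.

```r
# Sketch only: fetch total and sentence-initial frequencies for "apple",
# then subtract to estimate non-sentence-initial use.
library(ngramr)
library(dplyr)
library(tidyr)

apple <- ngram(c("apple", "_START_ apple"), year_start = 1900)

apple_wide <- apple %>%
  pivot_wider(id_cols = Year, names_from = Phrase, values_from = Frequency) %>%
  mutate(not_sentence_initial = `apple` - `_START_ apple`)

head(apple_wide)
```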
Dear Sean, thank you so much for the quick response. I think I see the problem. However, it should be possible to download, for example, all the data about "apple" as part of a trigram or larger context, and then use a package like spaCy to identify the semantic labels, right? Can I also use the R package to download n-grams larger than one?
All the best
Nikola
Unfortunately what you've described is not feasible. Google does provide the raw n-gram data here: https://storage.googleapis.com/books/ngrams/books/datasetsv3.html. However, as you can see, this involves a large number of large files. It would not be practical for the package to download and process all of this data when making function calls. Even downloading just the 1-gram files would not be practical, and the package would also have to download every 2-gram file (http://storage.googleapis.com/books/ngrams/books/20200217/eng/eng-2-ngrams_exports.html), since "apple" could be in the second position, and every 3-gram file, and so on. If there were a site with a database that could be queried efficiently, the package could make calls to that, but I don't know of any such database with direct query access. Google clearly has a database powering the Ngram Viewer charts but does not provide direct access to it. That's why the package works the way it does: it makes calls to the chart viewer and then scrapes the chart data into an R data table. So the package can only match the sort of data the chart viewer can return, and it does not support queries over all n-grams containing a word: if you ask for "red apple" it will return the data for that specific 2-gram, but it will not return a count of all 2-grams that include "apple".
Sean Carmody
Dear Sean,
thank you very much for your reply. I think I will then concentrate on 1-grams. Since you say ngramr can use the data that the Ngram Viewer uses: is there a possibility to modify the code so that one can also get, for instance, French n-grams or other languages?
All the best and thank you again
Nikola
No need for modification: the package already allows you to download n-grams from other languages using the 'corpus' argument. For example, for French:
ngram("chat", corpus = "fr-2019")
The documentation provides a list of valid corpuses.
Sean.
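For instance (editor's sketch; "en-2019" is assumed to follow the same naming pattern as the "fr-2019" example above):

```r
# Sketch: the same kind of query against the French and English corpora.
library(ngramr)

chat_fr <- ngram("chat", corpus = "fr-2019", year_start = 1900)
cat_en  <- ngram("cat",  corpus = "en-2019", year_start = 1900)

head(chat_fr)
```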