Download Parsing #45
Thank you so much for providing this package! I am looking for a way to download the frequencies of many words at the same time, and to get information about their position within the sentences where they are used. For instance, I would like to be able to compare the frequency of "apple" being used as a subject versus as an object. Is there a possibility to adjust your code to do so and, further, to go on to more calculations than just plotting the frequencies?
All the best and thank you in advance!
Thanks for your interest in the package. The way it works is by scraping the Google Ngram Viewer page, so it is only possible to provide the data that you can obtain from that page. As far as I know there is no way to query for an arbitrary position in the sentence, but if you have a look at Google's information page you can ask for words appearing at the beginning (or end) of a sentence (see the link below). So you could run a function call like ngram(c("apple", "_START_ apple")). Note that this would give you data on all occurrences of "apple" and on those occurrences with "apple" at the start of a sentence; the first data set includes the second, so you may want to manipulate the data by subtracting the sentence-initial occurrences from the total. I hope that helps.
https://books.google.com/ngrams/info
Sean.
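For what it's worth, a minimal sketch of the subtraction described above might look like the following. It assumes the _START_ marker syntax from Google's info page and that ngram() returns its usual Year/Phrase/Frequency columns; treat it as an illustration rather than tested code.

```r
# Sketch only: fetch total and sentence-initial frequencies for "apple",
# then subtract to estimate non-sentence-initial use.
library(ngramr)
library(dplyr)
library(tidyr)

apple <- ngram(c("apple", "_START_ apple"), year_start = 1900)

apple_wide <- apple %>%
  pivot_wider(id_cols = Year, names_from = Phrase, values_from = Frequency) %>%
  mutate(not_sentence_initial = `apple` - `_START_ apple`)

head(apple_wide)
```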
Dear Sean, thank you so much for the quick response. I think I see the problem. However, it should be possible to download, for example, all the data about "apple" as part of a trigram or larger context, and then use a package like spaCy to identify the semantic labels, right? Can I also use the R package to download n-grams larger than one?
All the best
Nikola
Unfortunately what you've described is not feasible. Google does provide the raw n-gram data here: https://storage.googleapis.com/books/ngrams/books/datasetsv3.html. However, as you can see, this involves a large number of large files. It would not be practical for the package to download and process all of this data when making function calls. Even downloading just the 1-gram files would not be practical, and the package would also have to download every 2-gram file (http://storage.googleapis.com/books/ngrams/books/20200217/eng/eng-2-ngrams_exports.html), since "apple" could be in the second position, and every 3-gram file, and so on. If there were a site with a database that could be queried efficiently, the package could make calls to that, but I don't know of any such database with direct query access. Google clearly has a database powering the Ngram Viewer charts but does not provide direct access to it. That's why the package works the way it does: it makes calls to the chart viewer and then scrapes the chart data into an R data table. So the package can only match the sort of data the chart viewer can return, and it does not support queries over all n-grams containing a word: if you ask for "red apple" it will return the data for that specific 2-gram, but it will not return a count of all 2-grams that include "apple".
Sean Carmody
Dear Sean,
thank you very much for your reply. I think I will then concentrate on 1-grams. Since you say ngramr can use the data that the Ngram Viewer uses: is there a possibility to modify the code so that one can also get, for instance, French n-grams or other languages?
All the best and thank you again
Nikola
No need for modification: the package already allows you to download n-grams from other languages using the 'corpus' argument. For example, for French:
ngram("chat", corpus = "fr-2019")
The documentation provides a list of valid corpuses.
Sean.
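For instance (editor's sketch; "en-2019" is assumed to follow the same naming pattern as the "fr-2019" example above):

```r
# Sketch: the same kind of query against the French and English corpora.
library(ngramr)

chat_fr <- ngram("chat", corpus = "fr-2019", year_start = 1900)
cat_en  <- ngram("cat",  corpus = "en-2019", year_start = 1900)

head(chat_fr)
```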