Showing 187 changed files with 70,343 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,140 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Get corrected samples from the translation model\n", | ||
"- Examples of mistakes that the translation model makes when translating\n", | ||
"- Usage: OBJ includes all relevant data\n", | ||
"- .upgrade_example(Rule,n) gives you examples of a specific rule applied successfully\n", | ||
"- .copy_example(Rule,n) gives you examples of mistakes that were copied into the translation\n", | ||
"- .get_mistake_types() gives a complete list of all mistake types in the translations and in the originals\n", | ||
"\n", | ||
"\n", | ||
"Results: Everything works reasonably well, though with some small mistakes." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 15, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from transformers import GPT2Tokenizer\n", | ||
"from utility import *" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 16, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"23 were deleted since they had more than99 mistakes\n", | ||
"42004 sentences had no grammar mistakes.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"OBJ = filter_examples()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 17, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"TRANSLATIONS:dict_keys(['EN_UNPAIRED_BRACKETS', 'MORFOLOGIK_RULE_EN_US', 'CD_NN', 'HE_VERB_AGR', 'A_PLURAL', 'SENTENCE_FRAGMENT', 'ENGLISH_WORD_REPEAT_BEGINNING_RULE', 'MANY_NN', 'I_LOWERCASE', 'ITS_JJ_NNSNN', 'I_AM', 'ALLY_ALLAY', 'NO_SPACE_CLOSING_QUOTE', 'UPPERCASE_SENTENCE_START', 'COMMA_PARENTHESIS_WHITESPACE', 'ENGLISH_WORD_REPEAT_RULE', 'SENTENCE_WHITESPACE', 'A_INFINITVE', 'THIS_NNS', 'EN_A_VS_AN', 'A_LOT_OF_NN', 'NON3PRS_VERB', 'I_A', 'THE_SUPERLATIVE', 'IT_SELF', 'GENERAL_XX', 'PROGRESSIVE_VERBS', 'AS_ADJ_AS', 'POSSESSIVE_APOSTROPHE', 'IT_VBZ', 'FEWER_LESS'])\n", | ||
"Original:dict_keys(['UPPERCASE_SENTENCE_START', 'EN_UNPAIRED_BRACKETS', 'EN_QUOTES', 'MORFOLOGIK_RULE_EN_US', 'COMMA_PARENTHESIS_WHITESPACE', 'CD_NN', 'WHITESPACE_RULE', 'DOUBLE_PUNCTUATION', 'HE_VERB_AGR', 'A_PLURAL', 'SENTENCE_FRAGMENT', 'ENGLISH_WORD_REPEAT_BEGINNING_RULE', 'SENTENCE_WHITESPACE', 'CANT', 'EN_CONTRACTION_SPELLING', 'I_LOWERCASE', 'ENGLISH_WORD_REPEAT_RULE', 'ITS_JJ_NNSNN', 'PRP_PAST_PART', 'AM_I', 'I_AM', 'EN_A_VS_AN', 'SO_AS_TO', 'EN_COMPOUNDS', 'IT_IS', 'ADVISE_VBG', 'SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA', 'COMP_THAN', 'A_INFINITVE', 'NON3PRS_VERB', 'THIS_NNS', 'PHRASE_REPETITION', 'I_A', 'THE_SUPERLATIVE', 'MANY_NN', 'IT_SELF', 'ALL_OF_THE', 'GENERAL_XX', 'PROGRESSIVE_VERBS', 'AS_ADJ_AS', 'TRY_AND', 'DT_PRP', 'POSSESSIVE_APOSTROPHE', 'IT_VBZ', 'ONES', 'DT_DT', 'WHETHER', 'SAY_TELL', 'FEWER_LESS', 'ABOUT_ITS_NN', 'ONE_OF_THE_ONLY', 'MUCH_COUNTABLE', 'THESE_ONES'])\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"OBJ.get_mistake_types()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 18, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Mistake type: MORFOLOGIK_RULE_EN_US\n", | ||
"Original:think about one night being on a vacation, the readers might realize this is quite \"almost \"sonics\", and yet there's enough time!\n", | ||
"Translation: Think about one night being on a vacation, the readers might realize this is quite “almost “sonic”, and yet there's enough time!<|endoftext|>\n", | ||
"1\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"OBJ.upgrade_example('MORFOLOGIK_RULE_EN_US',1)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 20, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Mistake type: A_PLURAL\n", | ||
"Original:\"Others emerge in other ways as part of more elaborate hacking scrambles or investigation A Plays -- cyber information, the way a whole hacker crew or company sort of knows for certain, because, you know, they suggest they probably need something to provide it to the government.\"\n", | ||
"\n", | ||
"The botnet technology is also highly invasive and unpredictable, even by U.S.\n", | ||
"Translation: “Others emerge in other ways as part of more elaborate hacking scrambles or investigation A Plays -- cuber information, the way a whole hacker crew or company sort of knows for certain, because, you know, they suggest they probably need something to provide it to the government.”\n", | ||
"\n", | ||
"The bonnet technology is also highly invasive and unpredictable, even by U.S.<|endoftext|>\n", | ||
"Correct:“Others emerge in other ways as part of more elaborate hacking scrambles or investigation A play -- cuber information, the way a whole hacker crew or company sort of knows for certain, because, you know, they suggest they probably need something to provide it to the government.”\n", | ||
"\n", | ||
"The bonnet technology is also highly invasive and unpredictable, even by U.S.\n", | ||
"53\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"OBJ.copy_example('A_PLURAL',1)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.9" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
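The notebook imports everything from `utility`, whose source is not shown in this view. As a rough sketch of how a container like `OBJ` could group examples by rule, assuming each example records which rule IDs were flagged before and after translation (the names `Example` and `ExampleStore` and all field names are hypothetical, not taken from the commit):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Example:
    original: str
    translation: str
    original_rules: set     # rule IDs flagged in the original sentence
    translation_rules: set  # rule IDs still flagged in the translation


class ExampleStore:
    """Hypothetical stand-in for the notebook's OBJ; not the real utility code."""

    def __init__(self, examples):
        self.upgraded = defaultdict(list)  # rule -> examples where the mistake was fixed
        self.copied = defaultdict(list)    # rule -> examples where the mistake survived
        self.original_types = set()
        self.translation_types = set()
        for ex in examples:
            self.original_types |= ex.original_rules
            self.translation_types |= ex.translation_rules
            for rule in ex.original_rules:
                # A rule that is still flagged after translation counts as copied.
                bucket = self.copied if rule in ex.translation_rules else self.upgraded
                bucket[rule].append(ex)

    def get_mistake_types(self):
        # Returns (types seen in translations, types seen in originals).
        return sorted(self.translation_types), sorted(self.original_types)

    def upgrade_example(self, rule, n):
        return self.upgraded.get(rule, [])[:n]

    def copy_example(self, rule, n):
        return self.copied.get(rule, [])[:n]


store = ExampleStore([
    Example("is this ok?", "Is this ok?", {"UPPERCASE_SENTENCE_START"}, set()),
    Example("an strong one", "an strong one", {"EN_A_VS_AN"}, {"EN_A_VS_AN"}),
])
fixed = store.upgrade_example("UPPERCASE_SENTENCE_START", 1)
kept = store.copy_example("EN_A_VS_AN", 1)
```

Unlike the notebook's methods, which print their results, this sketch returns lists so the caller decides how to display them.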
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
Mistake type: ENGLISH_WORD_REPEAT_RULE | ||
Original:Videos show squads of senior SWAT team members often breaking at any moment to chase chase down the active shooter. | ||
|
||
Translation: Videos show squads of senior SWAT team members often breaking at any moment to chase down the active shooter.<|endoftext|> | ||
128 | ||
|
||
Mistake type: UPPERCASE_SENTENCE_START | ||
Original:is this what Revival has talked about first? | ||
|
||
Translation: Is this what Revival has talked about first?<|endoftext|> | ||
|
||
Original:citizen, has welcomed trial for Benghazi burn inmate Dacwan Heqyar, the aging executioner, worried the government-set model of direct punishment for terrorists could dissuade most of his followers from further involvement and make him reluctant to use force against others. | ||
Translation: Citizen, has welcomed trial for Benghazi burn inmate Nathan Hekmatyar, the aging executioner, worried the government-set model of direct punishment for terrorists could dissuade most of his followers from further involvement and make him reluctant to use force against others.<|endoftext|> | ||
|
||
Mistake type: EN_A_VS_AN | ||
Original: | ||
|
||
Gabhal, an strong contender. | ||
Translation: | ||
|
||
Gabriel, a strong contender.<|endoftext|> | ||
339 | ||
|
||
Original:Could it be that he was moving forward with a boycott of the clerical structures that are acting as the main instrument in an economic hashing out timeframe or 2022 seeing for Sringla temporary signedand a envoy within the next few days (all while calming down the imperialist Moynihanlander who is planning to draw the IBWI Chief Su, the Mother Church.) If Lee Sukhaile was issuing Sools Janata listsheis [2st Class Orders of the U.N., or Sinn Liturgy Tagn., which is the Tidanga Priesthood mentioned by ACLU. | ||
Translation: Could it be that he was moving forward with a boycott of the clerical structures that are acting as the main instrument in an economic hashing out time frame or 2022 seeing for Single temporary signed and an envoy within the next few days (all while calming down the imperialist Moynihanlander who is planning to draw the IBWI Chief So, the Mother Church.) If Lee Sukhaile was issuing Tools Jana ta list shears [2st Class Orders of the U.N., or Sign Liturgy Tag., which is the Tidal Priesthood mentioned by ACLU.<|endoftext|> | ||
214 | ||
|
||
Mistake type: SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA /ALL upgraded | ||
Original:Thus more growth is experienced in human females with a delivery 7–15 days later as compared to male sheep's acceptance and this is thought to be only seen in sheep four weeks post KDR since then. | ||
Translation: Thus, more growth is experienced in human females with a delivery 7–15 days later than compared to male sheep's acceptance and this is thought to be only seen in sheep four weeks post KDR since then.<|endoftext|> | ||
296 | ||
Original:Also because females are more likely to care for other members of their group, sterile (Percent Prem Length Head bearing) male sheep are at higher risk for poor health and impairment, and also likely to have better amenities [ 19 ]. | ||
|
||
Translation: Also, because females are more likely to care for other members of their group, sterile (Percent Poem Length Head bearing) male sheep are at higher risk for poor health and impairment, and also likely to have better amenities [19].<|endoftext|> | ||
|
||
Mistake type: CD_NN | ||
Original:We recommend 10 ring for long-term comfort and SAFE fit. | ||
Translation: We recommend 10 rings for long-term comfort and SAFE fit.<|endoftext|> | ||
|
||
BAD EXAMPLES: | ||
|
||
Mistake type: MUCH_COUNTABLE /ALL upgraded | ||
Original:How much does it cost for $20 or $35 for one year? | ||
Translation: ' How many does it cost for $20 or $35 for one year?<|endoftext|>' | ||
|
||
Mistake type: SAY_TELL /ALL upgraded | ||
Original:McGill, who says her family did not have a law enforcement incident report as of 5 p.m., is dubious Francis publicly expressed remorse. | ||
|
||
' McGill, who tells her family did not have a law enforcement incident report as of 5 p.m., is dubious Francis publicly expressed remorse.<|endoftext|>' |
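Several translations above change words besides the flagged mistake (for example "Prem" becomes "Poem"). One way to surface such side effects is a word-level diff of each pair. The following is a minimal sketch using Python's standard `difflib`; the helper name `word_changes` is an assumption for illustration and is not part of the commit:

```python
import difflib

EOS = "<|endoftext|>"  # end-of-text marker appended to the translations above


def word_changes(original, translation):
    """Return (old, new) word spans that differ between the two sentences."""
    src = original.split()
    tgt = translation.replace(EOS, "").split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return changes


changes = word_changes(
    "We recommend 10 ring for long-term comfort and SAFE fit.",
    " We recommend 10 rings for long-term comfort and SAFE fit.<|endoftext|>",
)
print(changes)  # [('ring', 'rings')]
```

A pair with more than one reported change is a candidate for manual review, since only one of the changes may correspond to the listed mistake type.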
Binary file not shown.