|
30 | 30 | "text": [
|
31 | 31 | "Setting default log level to \"WARN\".\n",
|
32 | 32 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
|
33 |
| - "23/10/16 21:04:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" |
| 33 | + "23/10/17 13:50:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" |
34 | 34 | ]
|
35 | 35 | }
|
36 | 36 | ],
|
|
53 | 53 | },
|
54 | 54 | "outputs": [],
|
55 | 55 | "source": [
|
56 |
| - "# create a dataframe\n", |
| 56 | + "# create a dataframe with inconsistent date formats\n", |
57 | 57 | "scores_df = spark_ai._spark.createDataFrame(\n",
|
58 | 58 | " [\n",
|
59 | 59 | " (\"09-10-2017\", \"24-20\", \"New England Patriots\"),\n",
|
60 |
| - " (\"09-17-2017\", \"27-23\", \"Green Bay Packers\"),\n", |
61 |
| - " (\"09-24-2017\", \"30-26\", \"Kansas City Chiefs\"),\n", |
| 60 | + " (\"2017-17-09\", \"27-23\", \"Green Bay Packers\"),\n", |
| 61 | + " (\"09/24/2017\", \"30-26\", \"Kansas City Chiefs\"),\n", |
62 | 62 | " (\"10-01-2017\", \"26-24\", \"Los Angeles Rams\"),\n",
|
63 |
| - " (\"10-08-2017\", \"23-10\", \"Baltimore Ravens\"),\n", |
64 |
| - " (\"10-15-2017\", \"20-17\", \"Miami Dolphins\"),\n", |
| 63 | + " (\"10/08/2017\", \"23-10\", \"Baltimore Ravens\"),\n", |
| 64 | + " (\"10/15/17\", \"20-17\", \"Miami Dolphins\"),\n", |
65 | 65 | " (\"10-22-2017\", \"33-31\", \"Chicago Bears\"),\n",
|
66 |
| - " (\"10-29-2017\", \"27-24\", \"Philadelphia Eagles\"),\n", |
| 66 | + " (\"2017-10-29\", \"27-24\", \"Philadelphia Eagles\"),\n", |
67 | 67 | " (\"11-05-2017\", \"28-23\", \"Carolina Panthers\"),\n",
|
68 | 68 | " (\"11-12-2017\", \"27-24\", \"New Orleans Saints\")\n",
|
69 | 69 | " ],\n",
|
|
86 | 86 | "output_type": "stream",
|
87 | 87 | "text": [
|
88 | 88 | "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
|
89 |
| - "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_170570205\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
| 89 | + "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1345058194\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
90 | 90 | "\n"
|
91 | 91 | ]
|
92 | 92 | },
|
|
104 | 104 | "\n",
|
105 | 105 | "\n",
|
106 | 106 | "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
|
107 |
| - "\u001b[32;1m\u001b[1;3mThought: The column names are non-descriptive, but from the sample values I see that column `date` contains dates\n", |
108 |
| - "and column `winner` contains team names. So, I will filter on column `date` for 'October 29th' and select column `winner` for the team.\n", |
109 |
| - "I will use = rather than \"like\" in my SQL query because I need an exact match. I'll use the similar_value tool to help\n", |
110 |
| - "me choose an exact filter value for `date`.\n", |
| 107 | + "\u001b[32;1m\u001b[1;3mThought: The keyword 'October 29th' is most similar to the sample values in the `date` column.\n", |
| 108 | + "I need to filter on an exact value from the `date` column, so I will use the tool similar_value to help me choose my filter value.\n", |
111 | 109 | "Action: similar_value\n",
|
112 |
| - "Action Input: October 29th|date|spark_ai_temp_view_170570205\u001b[0m\n", |
113 |
| - "Observation: \u001b[38;5;200m\u001b[1;3m10-29-2017\u001b[0m\n", |
114 |
| - "Thought:\u001b[32;1m\u001b[1;3mThe correct `date` filter should be '10-29-2017' because it is semantically closest to the keyword.\n", |
| 110 | + "Action Input: October 29th|date|spark_ai_temp_view_1345058194\u001b[0m\n", |
| 111 | + "Observation: \u001b[38;5;200m\u001b[1;3m2017-10-29\u001b[0m\n", |
| 112 | + "Thought:\u001b[32;1m\u001b[1;3mThe correct `date` filter should be '2017-10-29' because it is semantically closest to the keyword.\n", |
| 113 | + "I will use the column `date` to filter the rows where its value is '2017-10-29' and then select the `winner`.\n", |
115 | 114 | "Action: query_validation\n",
|
116 |
| - "Action Input: SELECT `winner` FROM `spark_ai_temp_view_170570205` WHERE `date` = '10-29-2017'\u001b[0m\n", |
| 115 | + "Action Input: SELECT `winner` FROM `spark_ai_temp_view_1345058194` WHERE `date` = '2017-10-29'\u001b[0m\n", |
117 | 116 | "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
|
118 | 117 | "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
|
119 |
| - "Final Answer: SELECT `winner` FROM `spark_ai_temp_view_170570205` WHERE `date` = '10-29-2017'\u001b[0m\n", |
| 118 | + "Final Answer: SELECT `winner` FROM `spark_ai_temp_view_1345058194` WHERE `date` = '2017-10-29'\u001b[0m\n", |
120 | 119 | "\n",
|
121 | 120 | "\u001b[1m> Finished chain.\u001b[0m\n",
|
122 | 121 | "+-------------------+\n",
|
|
142 | 141 | "output_type": "stream",
|
143 | 142 | "text": [
|
144 | 143 | "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
|
145 |
| - "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_170570205\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
| 144 | + "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1345058194\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
146 | 145 | "\n",
|
147 | 146 | "\n",
|
148 | 147 | "\n",
|
149 | 148 | "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
|
150 |
| - "\u001b[32;1m\u001b[1;3mThought: The keyword '28 v 23' is most similar to the sample values in the `score` column.\n", |
| 149 | + "\u001b[32;1m\u001b[1;3mThought: The keyword '23 v 28' is most similar to the sample values in the `score` column.\n", |
151 | 150 | "I need to filter on an exact value from the `score` column, so I will use the tool similar_value to help me choose my filter value.\n",
|
152 | 151 | "Action: similar_value\n",
|
153 |
| - "Action Input: 28 v 23|score|spark_ai_temp_view_170570205\u001b[0m\n", |
| 152 | + "Action Input: 23 v 28|score|spark_ai_temp_view_1345058194\u001b[0m\n", |
154 | 153 | "Observation: \u001b[38;5;200m\u001b[1;3m28-23\u001b[0m\n",
|
155 | 154 | "Thought:\u001b[32;1m\u001b[1;3mThe correct `score` filter should be '28-23' because it is semantically closest to the keyword.\n",
|
156 | 155 | "I will use the column `score` to filter the rows where its value is '28-23' and then select the `date` \n",
|
157 | 156 | "because the question asks for the date when the score was '28-23'.\n",
|
158 | 157 | "Action: query_validation\n",
|
159 |
| - "Action Input: SELECT `date` FROM `spark_ai_temp_view_170570205` WHERE `score` = '28-23'\u001b[0m\n", |
| 158 | + "Action Input: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `score` = '28-23'\u001b[0m\n", |
160 | 159 | "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
|
161 | 160 | "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
|
162 |
| - "Final Answer: SELECT `date` FROM `spark_ai_temp_view_170570205` WHERE `score` = '28-23'\u001b[0m\n", |
| 161 | + "Final Answer: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `score` = '28-23'\u001b[0m\n", |
163 | 162 | "\n",
|
164 | 163 | "\u001b[1m> Finished chain.\u001b[0m\n",
|
165 | 164 | "+----------+\n",
|
|
172 | 171 | }
|
173 | 172 | ],
|
174 | 173 | "source": [
|
175 |
| - "scores_df.ai.transform(desc=\"When was the score 28 v 23?\", cache=False).show()" |
| 174 | + "scores_df.ai.transform(desc=\"When was the score 23 v 28?\", cache=False).show()" |
176 | 175 | ]
|
177 | 176 | },
|
178 | 177 | {
|
|
185 | 184 | "output_type": "stream",
|
186 | 185 | "text": [
|
187 | 186 | "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
|
188 |
| - "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_170570205\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
| 187 | + "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1345058194\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
189 | 188 | "\n",
|
190 | 189 | "\n",
|
191 | 190 | "\n",
|
|
195 | 194 | "I will use = rather than \"like\" in my SQL query because I need an exact match. I'll use the similar_value tool to help\n",
|
196 | 195 | "me choose an exact filter value for `winner`.\n",
|
197 | 196 | "Action: similar_value\n",
|
198 |
| - "Action Input: Chiefs|winner|spark_ai_temp_view_170570205\u001b[0m\n", |
| 197 | + "Action Input: Chiefs|winner|spark_ai_temp_view_1345058194\u001b[0m\n", |
199 | 198 | "Observation: \u001b[38;5;200m\u001b[1;3mKansas City Chiefs\u001b[0m\n",
|
200 | 199 | "Thought:\u001b[32;1m\u001b[1;3mThe correct `winner` filter should be 'Kansas City Chiefs' because it is semantically closest to the keyword.\n",
|
201 | 200 | "Action: query_validation\n",
|
202 |
| - "Action Input: SELECT `date` FROM `spark_ai_temp_view_170570205` WHERE `winner` = 'Kansas City Chiefs'\u001b[0m\n", |
| 201 | + "Action Input: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `winner` = 'Kansas City Chiefs'\u001b[0m\n", |
203 | 202 | "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
|
204 | 203 | "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
|
205 |
| - "Final Answer: SELECT `date` FROM `spark_ai_temp_view_170570205` WHERE `winner` = 'Kansas City Chiefs'\u001b[0m\n", |
| 204 | + "Final Answer: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `winner` = 'Kansas City Chiefs'\u001b[0m\n", |
206 | 205 | "\n",
|
207 | 206 | "\u001b[1m> Finished chain.\u001b[0m\n",
|
208 | 207 | "+----------+\n",
|
209 | 208 | "| date|\n",
|
210 | 209 | "+----------+\n",
|
211 |
| - "|09-24-2017|\n", |
| 210 | + "|09/24/2017|\n", |
212 | 211 | "+----------+\n",
|
213 | 212 | "\n"
|
214 | 213 | ]
|
|
254 | 253 | "output_type": "stream",
|
255 | 254 | "text": [
|
256 | 255 | "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
|
257 |
| - "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1020457715\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
| 256 | + "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view__1575533453\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
258 | 257 | "\n",
|
259 | 258 | "\n",
|
260 | 259 | "\n",
|
261 | 260 | "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
|
262 | 261 | "\u001b[32;1m\u001b[1;3mThought: The keyword '38,000,000' is most similar to the sample values in the `population` column.\n",
|
263 | 262 | "I need to filter on an exact value from the `population` column, so I will use the tool similar_value to help me choose my filter value.\n",
|
264 | 263 | "Action: similar_value\n",
|
265 |
| - "Action Input: 38,000,000|population|spark_ai_temp_view_1020457715\u001b[0m\n", |
| 264 | + "Action Input: 38,000,000|population|spark_ai_temp_view__1575533453\u001b[0m\n", |
266 | 265 | "Observation: \u001b[38;5;200m\u001b[1;3m38 million\u001b[0m\n",
|
267 | 266 | "Thought:\u001b[32;1m\u001b[1;3mThe correct `population` filter should be '38 million' because it is semantically closest to the keyword.\n",
|
268 | 267 | "I will use the column `population` to filter the rows where its value is '38 million' and then select the `gdp` \n",
|
269 | 268 | "because the question asks for the GDP of the country.\n",
|
270 | 269 | "Action: query_validation\n",
|
271 |
| - "Action Input: SELECT `gdp` FROM `spark_ai_temp_view_1020457715` WHERE `population` = '38 million'\u001b[0m\n", |
| 270 | + "Action Input: SELECT `gdp` FROM `spark_ai_temp_view__1575533453` WHERE `population` = '38 million'\u001b[0m\n", |
272 | 271 | "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
|
273 | 272 | "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
|
274 |
| - "Final Answer: SELECT `gdp` FROM `spark_ai_temp_view_1020457715` WHERE `population` = '38 million'\u001b[0m\n", |
| 273 | + "Final Answer: SELECT `gdp` FROM `spark_ai_temp_view__1575533453` WHERE `population` = '38 million'\u001b[0m\n", |
275 | 274 | "\n",
|
276 | 275 | "\u001b[1m> Finished chain.\u001b[0m\n",
|
277 | 276 | "+-----------------+\n",
|
|
297 | 296 | "output_type": "stream",
|
298 | 297 | "text": [
|
299 | 298 | "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
|
300 |
| - "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1020457715\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
| 299 | + "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view__1575533453\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
301 | 300 | "\n",
|
302 | 301 | "\n",
|
303 | 302 | "\n",
|
304 | 303 | "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
|
305 | 304 | "\u001b[32;1m\u001b[1;3mThought: The keyword '.7 percent population growth' is most similar to the sample values in the `population growth (%)` column.\n",
|
306 | 305 | "I need to filter on an exact value from the `population growth (%)` column, so I will use the tool similar_value to help me choose my filter value.\n",
|
307 | 306 | "Action: similar_value\n",
|
308 |
| - "Action Input: .7 percent population growth|population growth (%)|spark_ai_temp_view_1020457715\u001b[0m\n", |
| 307 | + "Action Input: .7|population growth (%)|spark_ai_temp_view__1575533453\u001b[0m\n", |
309 | 308 | "Observation: \u001b[38;5;200m\u001b[1;3m0.7\u001b[0m\n",
|
310 | 309 | "Thought:\u001b[32;1m\u001b[1;3mThe correct `population growth (%)` filter should be '0.7' because it is semantically closest to the keyword.\n",
|
311 | 310 | "I will use the column `population growth (%)` to filter the rows where its value is '0.7' and then select the AVG(`gdp growth (%)`) \n",
|
312 | 311 | "because the question asks for the average GDP growth.\n",
|
313 | 312 | "Action: query_validation\n",
|
314 |
| - "Action Input: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view_1020457715` WHERE `population growth (%)` = 0.7\u001b[0m\n", |
| 313 | + "Action Input: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` = 0.7\u001b[0m\n", |
315 | 314 | "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
|
316 | 315 | "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
|
317 |
| - "Final Answer: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view_1020457715` WHERE `population growth (%)` = 0.7\u001b[0m\n", |
| 316 | + "Final Answer: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` = 0.7\u001b[0m\n", |
318 | 317 | "\n",
|
319 | 318 | "\u001b[1m> Finished chain.\u001b[0m\n",
|
320 | 319 | "+-------------------+\n",
|
|
340 | 339 | "output_type": "stream",
|
341 | 340 | "text": [
|
342 | 341 | "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
|
343 |
| - "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1020457715\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
| 342 | + "df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view__1575533453\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", |
344 | 343 | "\n",
|
345 | 344 | "\n",
|
346 | 345 | "\n",
|
347 | 346 | "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
|
348 | 347 | "\u001b[32;1m\u001b[1;3mThought: The keyword 'Japan' is most similar to the sample values in the `country` column.\n",
|
349 | 348 | "I need to filter on an exact value from the `country` column, so I will use the tool similar_value to help me choose my filter value.\n",
|
350 | 349 | "Action: similar_value\n",
|
351 |
| - "Action Input: Japan|country|spark_ai_temp_view_1020457715\u001b[0m\n", |
| 350 | + "Action Input: Japan|country|spark_ai_temp_view__1575533453\u001b[0m\n", |
352 | 351 | "Observation: \u001b[38;5;200m\u001b[1;3mJPN\u001b[0m\n",
|
353 | 352 | "Thought:\u001b[32;1m\u001b[1;3mThe correct `country` filter should be 'JPN' because it is semantically closest to the keyword.\n",
|
354 | 353 | "I will use the column `country` to filter the rows where its value is 'JPN' and then select the `population growth (%)` \n",
|
355 | 354 | "because the question asks for the countries with higher population growth than Japan.\n",
|
356 | 355 | "Action: query_sql_db\n",
|
357 |
| - "Action Input: SELECT `population growth (%)` FROM `spark_ai_temp_view_1020457715` WHERE `country` = 'JPN'\u001b[0m\n", |
| 356 | + "Action Input: \"SELECT `population growth (%)` FROM spark_ai_temp_view__1575533453 WHERE `country` = 'JPN'\"\u001b[0m\n", |
358 | 357 | "Observation: \u001b[36;1m\u001b[1;3m[('0.2',)]\u001b[0m\n",
|
359 | 358 | "Thought:\u001b[32;1m\u001b[1;3mThe population growth of Japan is 0.2%. Now I will write a query to find the countries with a higher population growth.\n",
|
360 | 359 | "Action: query_validation\n",
|
361 |
| - "Action Input: SELECT `country` FROM `spark_ai_temp_view_1020457715` WHERE `population growth (%)` > 0.2\u001b[0m\n", |
| 360 | + "Action Input: SELECT `country` FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` > 0.2\u001b[0m\n", |
362 | 361 | "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
|
363 | 362 | "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
|
364 |
| - "Final Answer: SELECT `country` FROM `spark_ai_temp_view_1020457715` WHERE `population growth (%)` > 0.2\u001b[0m\n", |
| 363 | + "Final Answer: SELECT `country` FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` > 0.2\u001b[0m\n", |
365 | 364 | "\n",
|
366 | 365 | "\u001b[1m> Finished chain.\u001b[0m\n",
|
367 | 366 | "+-------+\n",
|
|