Commit 46f9cec

Use inconsistent format for similarity search input (pyspark-ai#166)
Make the queries and the queried column values inconsistent in format in `vector_similarity_search.ipynb`, to demonstrate how the similarity search tool can help with unpredictable input formats.
1 parent 74dd8fc commit 46f9cec
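
To illustrate the idea behind this change, here is a minimal sketch of a nearest-value lookup over a column. It is not the pyspark-ai implementation: the real tool relies on vector similarity search, while this stand-in uses character bigram counts and cosine similarity so that it runs with the standard library only. The names `char_ngrams`, `cosine`, and `similar_value_sketch` are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def similar_value_sketch(keyword: str, column_values: Sequence[str]) -> str:
    """Return the column value whose n-gram vector is closest to the keyword."""
    key_vec = char_ngrams(keyword)
    return max(column_values, key=lambda v: cosine(key_vec, char_ngrams(v)))


# Values taken from the `score` and `winner` columns of the notebook's dataframe.
scores = ["24-20", "27-23", "30-26", "26-24", "23-10",
          "20-17", "33-31", "27-24", "28-23", "27-24"]
winners = ["New England Patriots", "Green Bay Packers", "Kansas City Chiefs"]

best_score = similar_value_sketch("23 v 28", scores)
best_winner = similar_value_sketch("Chiefs", winners)
print(best_score, best_winner)
```

Character n-grams only capture surface overlap, so this sketch would not reliably map a keyword such as 'October 29th' onto a stored date; that kind of match is where an embedding-based search is expected to help.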

1 file changed: +40 -41 lines changed

examples/vector_similarity_search.ipynb

Lines changed: 40 additions & 41 deletions
@@ -30,7 +30,7 @@
 "text": [
 "Setting default log level to \"WARN\".\n",
 "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
-"23/10/16 21:04:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
+"23/10/17 13:50:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
 ]
 }
 ],
@@ -53,17 +53,17 @@
 },
 "outputs": [],
 "source": [
-"# create a dataframe\n",
+"# create a dataframe with inconsistent date formats\n",
 "scores_df = spark_ai._spark.createDataFrame(\n",
 " [\n",
 " (\"09-10-2017\", \"24-20\", \"New England Patriots\"),\n",
-" (\"09-17-2017\", \"27-23\", \"Green Bay Packers\"),\n",
-" (\"09-24-2017\", \"30-26\", \"Kansas City Chiefs\"),\n",
+" (\"2017-17-09\", \"27-23\", \"Green Bay Packers\"),\n",
+" (\"09/24/2017\", \"30-26\", \"Kansas City Chiefs\"),\n",
 " (\"10-01-2017\", \"26-24\", \"Los Angeles Rams\"),\n",
-" (\"10-08-2017\", \"23-10\", \"Baltimore Ravens\"),\n",
-" (\"10-15-2017\", \"20-17\", \"Miami Dolphins\"),\n",
+" (\"10/08/2017\", \"23-10\", \"Baltimore Ravens\"),\n",
+" (\"10/15/17\", \"20-17\", \"Miami Dolphins\"),\n",
 " (\"10-22-2017\", \"33-31\", \"Chicago Bears\"),\n",
-" (\"10-29-2017\", \"27-24\", \"Philadelphia Eagles\"),\n",
+" (\"2017-10-29\", \"27-24\", \"Philadelphia Eagles\"),\n",
 " (\"11-05-2017\", \"28-23\", \"Carolina Panthers\"),\n",
 " (\"11-12-2017\", \"27-24\", \"New Orleans Saints\")\n",
 " ],\n",
@@ -86,7 +86,7 @@
 "output_type": "stream",
 "text": [
 "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
-"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_170570205\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
+"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1345058194\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
 "\n"
 ]
 },
@@ -104,19 +104,18 @@
 "\n",
 "\n",
 "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
-"\u001b[32;1m\u001b[1;3mThought: The column names are non-descriptive, but from the sample values I see that column `date` contains dates\n",
-"and column `winner` contains team names. So, I will filter on column `date` for 'October 29th' and select column `winner` for the team.\n",
-"I will use = rather than \"like\" in my SQL query because I need an exact match. I'll use the similar_value tool to help\n",
-"me choose an exact filter value for `date`.\n",
+"\u001b[32;1m\u001b[1;3mThought: The keyword 'October 29th' is most similar to the sample values in the `date` column.\n",
+"I need to filter on an exact value from the `date` column, so I will use the tool similar_value to help me choose my filter value.\n",
 "Action: similar_value\n",
-"Action Input: October 29th|date|spark_ai_temp_view_170570205\u001b[0m\n",
-"Observation: \u001b[38;5;200m\u001b[1;3m10-29-2017\u001b[0m\n",
-"Thought:\u001b[32;1m\u001b[1;3mThe correct `date` filter should be '10-29-2017' because it is semantically closest to the keyword.\n",
+"Action Input: October 29th|date|spark_ai_temp_view_1345058194\u001b[0m\n",
+"Observation: \u001b[38;5;200m\u001b[1;3m2017-10-29\u001b[0m\n",
+"Thought:\u001b[32;1m\u001b[1;3mThe correct `date` filter should be '2017-10-29' because it is semantically closest to the keyword.\n",
+"I will use the column `date` to filter the rows where its value is '2017-10-29' and then select the `winner`.\n",
 "Action: query_validation\n",
-"Action Input: SELECT `winner` FROM `spark_ai_temp_view_170570205` WHERE `date` = '10-29-2017'\u001b[0m\n",
+"Action Input: SELECT `winner` FROM `spark_ai_temp_view_1345058194` WHERE `date` = '2017-10-29'\u001b[0m\n",
 "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
-"Final Answer: SELECT `winner` FROM `spark_ai_temp_view_170570205` WHERE `date` = '10-29-2017'\u001b[0m\n",
+"Final Answer: SELECT `winner` FROM `spark_ai_temp_view_1345058194` WHERE `date` = '2017-10-29'\u001b[0m\n",
 "\n",
 "\u001b[1m> Finished chain.\u001b[0m\n",
 "+-------------------+\n",
@@ -142,24 +141,24 @@
 "output_type": "stream",
 "text": [
 "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
-"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_170570205\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
+"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1345058194\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
 "\n",
 "\n",
 "\n",
 "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
-"\u001b[32;1m\u001b[1;3mThought: The keyword '28 v 23' is most similar to the sample values in the `score` column.\n",
+"\u001b[32;1m\u001b[1;3mThought: The keyword '23 v 28' is most similar to the sample values in the `score` column.\n",
 "I need to filter on an exact value from the `score` column, so I will use the tool similar_value to help me choose my filter value.\n",
 "Action: similar_value\n",
-"Action Input: 28 v 23|score|spark_ai_temp_view_170570205\u001b[0m\n",
+"Action Input: 23 v 28|score|spark_ai_temp_view_1345058194\u001b[0m\n",
 "Observation: \u001b[38;5;200m\u001b[1;3m28-23\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mThe correct `score` filter should be '28-23' because it is semantically closest to the keyword.\n",
 "I will use the column `score` to filter the rows where its value is '28-23' and then select the `date` \n",
 "because the question asks for the date when the score was '28-23'.\n",
 "Action: query_validation\n",
-"Action Input: SELECT `date` FROM `spark_ai_temp_view_170570205` WHERE `score` = '28-23'\u001b[0m\n",
+"Action Input: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `score` = '28-23'\u001b[0m\n",
 "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
-"Final Answer: SELECT `date` FROM `spark_ai_temp_view_170570205` WHERE `score` = '28-23'\u001b[0m\n",
+"Final Answer: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `score` = '28-23'\u001b[0m\n",
 "\n",
 "\u001b[1m> Finished chain.\u001b[0m\n",
 "+----------+\n",
@@ -172,7 +171,7 @@
 }
 ],
 "source": [
-"scores_df.ai.transform(desc=\"When was the score 28 v 23?\", cache=False).show()"
+"scores_df.ai.transform(desc=\"When was the score 23 v 28?\", cache=False).show()"
 ]
 },
 {
@@ -185,7 +184,7 @@
 "output_type": "stream",
 "text": [
 "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
-"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_170570205\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
+"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1345058194\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
 "\n",
 "\n",
 "\n",
@@ -195,20 +194,20 @@
 "I will use = rather than \"like\" in my SQL query because I need an exact match. I'll use the similar_value tool to help\n",
 "me choose an exact filter value for `winner`.\n",
 "Action: similar_value\n",
-"Action Input: Chiefs|winner|spark_ai_temp_view_170570205\u001b[0m\n",
+"Action Input: Chiefs|winner|spark_ai_temp_view_1345058194\u001b[0m\n",
 "Observation: \u001b[38;5;200m\u001b[1;3mKansas City Chiefs\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mThe correct `winner` filter should be 'Kansas City Chiefs' because it is semantically closest to the keyword.\n",
 "Action: query_validation\n",
-"Action Input: SELECT `date` FROM `spark_ai_temp_view_170570205` WHERE `winner` = 'Kansas City Chiefs'\u001b[0m\n",
+"Action Input: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `winner` = 'Kansas City Chiefs'\u001b[0m\n",
 "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
-"Final Answer: SELECT `date` FROM `spark_ai_temp_view_170570205` WHERE `winner` = 'Kansas City Chiefs'\u001b[0m\n",
+"Final Answer: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `winner` = 'Kansas City Chiefs'\u001b[0m\n",
 "\n",
 "\u001b[1m> Finished chain.\u001b[0m\n",
 "+----------+\n",
 "| date|\n",
 "+----------+\n",
-"|09-24-2017|\n",
+"|09/24/2017|\n",
 "+----------+\n",
 "\n"
 ]
@@ -254,24 +253,24 @@
 "output_type": "stream",
 "text": [
 "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
-"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1020457715\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
+"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view__1575533453\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
 "\n",
 "\n",
 "\n",
 "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
 "\u001b[32;1m\u001b[1;3mThought: The keyword '38,000,000' is most similar to the sample values in the `population` column.\n",
 "I need to filter on an exact value from the `population` column, so I will use the tool similar_value to help me choose my filter value.\n",
 "Action: similar_value\n",
-"Action Input: 38,000,000|population|spark_ai_temp_view_1020457715\u001b[0m\n",
+"Action Input: 38,000,000|population|spark_ai_temp_view__1575533453\u001b[0m\n",
 "Observation: \u001b[38;5;200m\u001b[1;3m38 million\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mThe correct `population` filter should be '38 million' because it is semantically closest to the keyword.\n",
 "I will use the column `population` to filter the rows where its value is '38 million' and then select the `gdp` \n",
 "because the question asks for the GDP of the country.\n",
 "Action: query_validation\n",
-"Action Input: SELECT `gdp` FROM `spark_ai_temp_view_1020457715` WHERE `population` = '38 million'\u001b[0m\n",
+"Action Input: SELECT `gdp` FROM `spark_ai_temp_view__1575533453` WHERE `population` = '38 million'\u001b[0m\n",
 "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
-"Final Answer: SELECT `gdp` FROM `spark_ai_temp_view_1020457715` WHERE `population` = '38 million'\u001b[0m\n",
+"Final Answer: SELECT `gdp` FROM `spark_ai_temp_view__1575533453` WHERE `population` = '38 million'\u001b[0m\n",
 "\n",
 "\u001b[1m> Finished chain.\u001b[0m\n",
 "+-----------------+\n",
@@ -297,24 +296,24 @@
 "output_type": "stream",
 "text": [
 "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
-"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1020457715\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
+"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view__1575533453\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
 "\n",
 "\n",
 "\n",
 "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
 "\u001b[32;1m\u001b[1;3mThought: The keyword '.7 percent population growth' is most similar to the sample values in the `population growth (%)` column.\n",
 "I need to filter on an exact value from the `population growth (%)` column, so I will use the tool similar_value to help me choose my filter value.\n",
 "Action: similar_value\n",
-"Action Input: .7 percent population growth|population growth (%)|spark_ai_temp_view_1020457715\u001b[0m\n",
+"Action Input: .7|population growth (%)|spark_ai_temp_view__1575533453\u001b[0m\n",
 "Observation: \u001b[38;5;200m\u001b[1;3m0.7\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mThe correct `population growth (%)` filter should be '0.7' because it is semantically closest to the keyword.\n",
 "I will use the column `population growth (%)` to filter the rows where its value is '0.7' and then select the AVG(`gdp growth (%)`) \n",
 "because the question asks for the average GDP growth.\n",
 "Action: query_validation\n",
-"Action Input: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view_1020457715` WHERE `population growth (%)` = 0.7\u001b[0m\n",
+"Action Input: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` = 0.7\u001b[0m\n",
 "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
-"Final Answer: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view_1020457715` WHERE `population growth (%)` = 0.7\u001b[0m\n",
+"Final Answer: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` = 0.7\u001b[0m\n",
 "\n",
 "\u001b[1m> Finished chain.\u001b[0m\n",
 "+-------------------+\n",
@@ -340,28 +339,28 @@
 "output_type": "stream",
 "text": [
 "\u001b[92mINFO: \u001b[0mCreating temp view for the transform:\n",
-"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view_1020457715\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
+"df.createOrReplaceTempView(\u001b[33m\"\u001b[39;49;00m\u001b[33mspark_ai_temp_view__1575533453\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
 "\n",
 "\n",
 "\n",
 "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
 "\u001b[32;1m\u001b[1;3mThought: The keyword 'Japan' is most similar to the sample values in the `country` column.\n",
 "I need to filter on an exact value from the `country` column, so I will use the tool similar_value to help me choose my filter value.\n",
 "Action: similar_value\n",
-"Action Input: Japan|country|spark_ai_temp_view_1020457715\u001b[0m\n",
+"Action Input: Japan|country|spark_ai_temp_view__1575533453\u001b[0m\n",
 "Observation: \u001b[38;5;200m\u001b[1;3mJPN\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mThe correct `country` filter should be 'JPN' because it is semantically closest to the keyword.\n",
 "I will use the column `country` to filter the rows where its value is 'JPN' and then select the `population growth (%)` \n",
 "because the question asks for the countries with higher population growth than Japan.\n",
 "Action: query_sql_db\n",
-"Action Input: SELECT `population growth (%)` FROM `spark_ai_temp_view_1020457715` WHERE `country` = 'JPN'\u001b[0m\n",
+"Action Input: \"SELECT `population growth (%)` FROM spark_ai_temp_view__1575533453 WHERE `country` = 'JPN'\"\u001b[0m\n",
 "Observation: \u001b[36;1m\u001b[1;3m[('0.2',)]\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mThe population growth of Japan is 0.2%. Now I will write a query to find the countries with a higher population growth.\n",
 "Action: query_validation\n",
-"Action Input: SELECT `country` FROM `spark_ai_temp_view_1020457715` WHERE `population growth (%)` > 0.2\u001b[0m\n",
+"Action Input: SELECT `country` FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` > 0.2\u001b[0m\n",
 "Observation: \u001b[33;1m\u001b[1;3mOK\u001b[0m\n",
 "Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n",
-"Final Answer: SELECT `country` FROM `spark_ai_temp_view_1020457715` WHERE `population growth (%)` > 0.2\u001b[0m\n",
+"Final Answer: SELECT `country` FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` > 0.2\u001b[0m\n",
 "\n",
 "\u001b[1m> Finished chain.\u001b[0m\n",
 "+-------+\n",
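
In the agent traces in this diff, every `similar_value` call passes its input as `keyword|column|view`. The following is a minimal sketch of how such an input could be split. It is an illustration only: `SimilarValueInput` and `parse_similar_value_input` are hypothetical names, and the parsing that pyspark-ai actually performs is not part of this commit.

```python
from typing import NamedTuple


class SimilarValueInput(NamedTuple):
    """Hypothetical container for the three pipe-separated fields."""
    keyword: str
    column: str
    view: str


def parse_similar_value_input(action_input: str) -> SimilarValueInput:
    # Split from the right so that a keyword containing "|" stays intact
    # (an assumption; the traces above never show such a keyword).
    parts = action_input.rsplit("|", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 'keyword|column|view', got {action_input!r}")
    keyword, column, view = (p.strip() for p in parts)
    return SimilarValueInput(keyword, column, view)


parsed = parse_similar_value_input("October 29th|date|spark_ai_temp_view_1345058194")
print(parsed)
```

Column names with spaces and punctuation, such as `population growth (%)`, pass through unchanged because only the `|` separator is treated specially.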
