From 3e58209b2cbf4dffe361d9d75559538bca10905d Mon Sep 17 00:00:00 2001 From: WilliamEspegren Date: Sun, 19 May 2024 19:20:01 +0200 Subject: [PATCH 01/12] Web crawling with Spider --- .../agentchat_webcrawling_with_spider.ipynb | 296 ++++++++++++++++++ website/docs/Examples.md | 1 + 2 files changed, 297 insertions(+) create mode 100644 notebook/agentchat_webcrawling_with_spider.ipynb diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb new file mode 100644 index 000000000000..51ab674872a0 --- /dev/null +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -0,0 +1,296 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Web Scraping using Spider\n", + "\n", + "This notebook shows how to use the fastest open \n", + "source web crawler together with AutoGen agents." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First we need to install the Apify SDK and the AutoGen library." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "! pip install -qqq pyautogen spider-client" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Setting up the LLM configuration and the Apify API key is also required." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "config_list = [\n", + " {\"model\": \"gpt-4o\", \"api_key\": os.getenv(\"OPENAI_API_KEY\")},\n", + "]\n", + "\n", + "spider_api_key = os.getenv(\"SPIDER_API_KEY\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's define the tool for scraping data from the website using Apify actor.\n", + "Read more about tool use in this [tutorial chapter](/docs/tutorial/tool-use)." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[{'content': 'Spider - The Fastest Web Crawling Service[Spider v1 Logo Spider ](/)[Pricing](/credits/new)[GitHub](https://github.com/spider-rs/spider) [Twitter](https://twitter.com/spider_rust) Toggle ThemeSign InRegisterTo help you get started with Spider, we’ll give you $200 in credits when you spend $100. 
[Get Credits](/credits/new)LangChain integration [now available](https://python.langchain.com/docs/integrations/document_loaders/spider)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseUnmatched Speed----------### 2.5secs ###To crawl 200 pages### 100-500x ###Faster than alternatives### 500x ###Cheaper than traditional scraping services Benchmarks displaying performance between Spider Cloud, Firecrawl, and Apify.Example used tailwindcss.com - 04/16/2024[See framework benchmarks ](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)Foundations for Crawling Effectively----------### Leading in performance ###Spider is written in Rust and runs in full concurrency to achieve crawling dozens of pages in secs.### Optimal response format ###Get clean and formatted markdown, HTML, or text content for fine-tuning or training AI models.### Caching ###Further boost speed by caching repeated web page crawls.### Smart Mode ###Spider dynamically switches to Headless Chrome when it needs to.Beta### Scrape with AI ###Do custom browser scripting and data extraction using the latest AI models.### Best crawler for LLMs ###Don\\'t let crawling and scraping be the highest latency in your LLM & AI agent stack.### Scrape with no headaches ###* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM Responses### The Fastest Web Crawler ###* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* 5,000 requests per minute### Do more with AI ###* Custom browser scripting* Advanced data extraction* Data pipelines* Perfect for LLM and AI Agents* Accurate website labelingSee what\\'s being said----------[\"Merrick](https://twitter.com/iammerrick/status/1787873425446572462)[Merrick Christensen](https://twitter.com/iammerrick/status/1787873425446572462)[@iammerrick ](https://twitter.com/iammerrick/status/1787873425446572462)· [Follow](https://twitter.com/intent/follow?screen_name=iammerrick)[](https://twitter.com/iammerrick/status/1787873425446572462)Rust based crawler Spider is next level for crawling & scraping sites. So fast. Their cloud offering is also so easy to use. Good stuff. 
[ github.com/spider-rs/spid… ](https://github.com/spider-rs/spider)[ 3:53 PM · May 7, 2024 ](https://twitter.com/iammerrick/status/1787873425446572462) [](https://help.twitter.com/en/twitter-for-websites-ads-info-and-privacy)[12 ](https://twitter.com/intent/like?tweet_id=1787873425446572462) [Reply ](https://twitter.com/intent/tweet?in_reply_to=1787873425446572462)[ Read more on Twitter ](https://twitter.com/iammerrick/status/1787873425446572462)[\"William](https://twitter.com/WilliamEspegren/status/1789419820821184764)[William Espegren](https://twitter.com/WilliamEspegren/status/1789419820821184764)[@WilliamEspegren ](https://twitter.com/WilliamEspegren/status/1789419820821184764)· [Follow](https://twitter.com/intent/follow?screen_name=WilliamEspegren)[](https://twitter.com/WilliamEspegren/status/1789419820821184764)Web crawler built in rust, currently the nr1 performance in the world with crazy resource management Aaaaaaand they have a cloud offer, that’s wayyyy cheaper than any competitor Name a reason for me to use anything else? [ github.com/spider-rs/spid… ](https://github.com/spider-rs/spider)[ 10:18 PM · May 11, 2024 ](https://twitter.com/WilliamEspegren/status/1789419820821184764) [](https://help.twitter.com/en/twitter-for-websites-ads-info-and-privacy)[2 ](https://twitter.com/intent/like?tweet_id=1789419820821184764) [Reply ](https://twitter.com/intent/tweet?in_reply_to=1789419820821184764)[ Read 1 reply ](https://twitter.com/WilliamEspegren/status/1789419820821184764)[\"Troy](https://twitter.com/Troyusrex/status/1791497607925088307)[Troy Lowry](https://twitter.com/Troyusrex/status/1791497607925088307)[@Troyusrex ](https://twitter.com/Troyusrex/status/1791497607925088307)· [Follow](https://twitter.com/intent/follow?screen_name=Troyusrex)[](https://twitter.com/Troyusrex/status/1791497607925088307)[ @spider\\\\_rust ](https://twitter.com/spider_rust) First, the good: Spider has enabled me to speed up my scraping 20X and with a bit higher quality than I was getting before. I am having a few issues however. First, the documentation link doesn\\'t work ([ spider.cloud/guides/(/docs/… ](https://spider.cloud/guides/(/docs/api)))I\\'ve figured out how to get it to work…[ 3:54 PM · May 17, 2024 ](https://twitter.com/Troyusrex/status/1791497607925088307) [](https://help.twitter.com/en/twitter-for-websites-ads-info-and-privacy)[1 ](https://twitter.com/intent/like?tweet_id=1791497607925088307) [Reply ](https://twitter.com/intent/tweet?in_reply_to=1791497607925088307)[ Read 2 replies ](https://twitter.com/Troyusrex/status/1791497607925088307)FAQ----------Frequently asked questions about Spider
What is Spider?----------Spider is a leading web crawling tool designed for speed and cost-effectiveness, supporting various data formats including LLM-ready markdown.
Why is my website not crawling?----------Your crawl may fail if it requires JavaScript rendering. Try setting your request to \\'chrome\\' to solve this issue.
Can you crawl all pages?----------Yes, Spider accurately crawls all necessary content without needing a sitemap.
What formats can Spider convert web data into?----------Spider outputs HTML, raw, text, and various markdown formats. It supports JSON, JSONL, CSV, and XML for API responses.
Is Spider suitable for large scraping projects?----------Absolutely, Spider is ideal for large-scale data collection and offers a cost-effective dashboard for data management.
How can I try Spider?----------Purchase credits for our cloud system or test the Open Source Spider engine to explore its capabilities.
Does it respect robots.txt?----------Yes, compliance with robots.txt is default, but you can disable this if necessary.
[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula) [FAQ](/faq)© 2024 Spider from A11yWatch[GitHubGithub](https://github.com/spider-rs/spider) [X - Twitter ](https://twitter.com/spider_rust)', 'error': None, 'status': 200, 'url': 'https://spider.cloud'}]\n" + ] + } + ], + "source": [ + "from spider import Spider\n", + "from typing_extensions import Annotated\n", + "\n", + "\n", + "def scrape_page(url: Annotated[str, \"The URL of the web page to scrape\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[str, \"Scraped content\"]:\n", + " # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n", + " client = Spider(spider_api_key) \n", + "\n", + " if params == None: \n", + " params = {\n", + " \"return_format\": \"markdown\"\n", + " }\n", + "\n", + " scraped_data = client.scrape_url(url, params)\n", + " return scraped_data\n", + "\n", + "def crawl_domain(url: Annotated[str, \"The url of the domain to be crawled\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[str, \"Scraped content\"]:\n", + " # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n", + " client = Spider(spider_api_key) \n", + "\n", + " if params == None:\n", + " params = {\n", + " \"return_format\": \"markdown\"\n", + " }\n", + "\n", + " crawled_data = client.crawl_url(url, params)\n", + " return crawled_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create the agents and register the tool." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "from autogen import ConversableAgent, register_function\n", + "\n", + "# Create web scraper agent.\n", + "scraper_agent = ConversableAgent(\n", + " \"WebScraper\",\n", + " llm_config={\"config_list\": config_list},\n", + " system_message=\"You are a web scrapper and you can scrape any web page using the tools provided. \"\n", + " \"Returns 'TERMINATE' when the scraping is done.\",\n", + ")\n", + "\n", + "# Create user proxy agent.\n", + "user_proxy_agent = ConversableAgent(\n", + " \"UserProxy\",\n", + " llm_config=False, # No LLM for this agent.\n", + " human_input_mode=\"NEVER\",\n", + " code_execution_config=False, # No code execution for this agent.\n", + " is_termination_msg=lambda x: x.get(\"content\", \"\") is not None and \"terminate\" in x[\"content\"].lower(),\n", + " default_auto_reply=\"Please continue if not finished, otherwise return 'TERMINATE'.\",\n", + ")\n", + "\n", + "# Register the function with the agents.\n", + "register_function(\n", + " scrape_page,\n", + " caller=scraper_agent,\n", + " executor=user_proxy_agent,\n", + " name=\"scrape_page\",\n", + " description=\"Scrape a web page and return the content.\",\n", + ")\n", + "\n", + "register_function(\n", + " crawl_domain,\n", + " caller=scraper_agent,\n", + " executor=user_proxy_agent,\n", + " name=\"scrape_page\",\n", + " description=\"Crawl an entire domain, following subpages and return the content.\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Start the conversation for scraping web data. 
We used the\n", + "`reflection_with_llm` option for summary method\n", + "to perform the formatting of the output into a desired format.\n", + "The summary method is called after the conversation is completed\n", + "given the complete history of the conversation." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[33mUserProxy\u001b[0m (to WebScraper):\n", + "\n", + "Can you scrape william-espegren.com for me?\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[31m\n", + ">>>>>>>> USING AUTO REPLY...\u001b[0m\n", + "\u001b[33mWebScraper\u001b[0m (to UserProxy):\n", + "\n", + "\u001b[32m***** Suggested tool call (call_ZOzNWXJO8yeANFIJLJz7VJRo): scrape_page *****\u001b[0m\n", + "Arguments: \n", + "{\"url\":\"https://william-espegren.com\"}\n", + "\u001b[32m****************************************************************************\u001b[0m\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[35m\n", + ">>>>>>>> EXECUTING FUNCTION scrape_page...\u001b[0m\n", + "\u001b[33mUserProxy\u001b[0m (to WebScraper):\n", + "\n", + "\u001b[33mUserProxy\u001b[0m (to WebScraper):\n", + "\n", + "\u001b[32m***** Response from calling tool (call_ZOzNWXJO8yeANFIJLJz7VJRo) *****\u001b[0m\n", + "[{\"content\": \"William Espegren - Portfoliokeep scrollingMADE WITHCSS, JSMADE BYUppsalaWilliam EspegrenWith \\u00b7LoveOpen For Projects[CONTACT ME](https://www.linkedin.com/in/william-espegren/)[Instagram](https://www.instagram.com/williamespegren/)[LinkedIn](https://www.linkedin.com/in/william-espegren/)[Twitter](https://twitter.com/WilliamEspegren)[team-collaboration/version-control/github Created with Sketch.Github](https://github.com/WilliamEspegren)\", \"error\": null, \"status\": 200, \"url\": \"https://william-espegren.com\"}]\n", + "\u001b[32m**********************************************************************\u001b[0m\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[31m\n", + ">>>>>>>> USING AUTO REPLY...\u001b[0m\n", + "\u001b[33mWebScraper\u001b[0m (to UserProxy):\n", + "\n", + "I've scraped the content of [William Espegren's website](https://william-espegren.com). Here's the information available:\n", + "\n", + "- **Name:** William Espegren\n", + "- **Location:** Uppsala\n", + "- **Skills:** CSS, JS\n", + "- **Status:** Open for projects\n", + "\n", + "### Social and Contact Links:\n", + "- [Contact Me](https://www.linkedin.com/in/william-espegren/)\n", + "- [Instagram](https://www.instagram.com/williamespegren/)\n", + "- [LinkedIn](https://www.linkedin.com/in/william-espegren/)\n", + "- [Twitter](https://twitter.com/WilliamEspegren)\n", + "- [Github](https://github.com/WilliamEspegren)\n", + "\n", + "TERMINATE\n", + "\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "chat_result = user_proxy_agent.initiate_chat(\n", + " scraper_agent,\n", + " message=\"Can you scrape william-espegren.com for me?\",\n", + " summary_method=\"reflection_with_llm\",\n", + " summary_args={\n", + " \"summary_prompt\": \"\"\"Summarize the scraped content\"\"\"\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output is stored in the summary." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The website belongs to William Espegren, who is based in Uppsala and possesses skills in CSS and JavaScript. He is open to new projects. You can contact him through the following links:\n", + "\n", + "- [LinkedIn](https://www.linkedin.com/in/william-espegren/)\n", + "- [Instagram](https://www.instagram.com/williamespegren/)\n", + "- [Twitter](https://twitter.com/WilliamEspegren)\n", + "- [GitHub](https://github.com/WilliamEspegren)\n", + "\n", + "Feel free to reach out to him for project collaborations.\n" + ] + } + ], + "source": [ + "print(chat_result.summary)" + ] + } + ], + "metadata": { + "front_matter": { + "description": "Scrapping web pages and summarizing the content using agents with tools.", + "tags": [ + "web scraping", + "apify", + "tool use" + ], + "title": "Web Scraper Agent using Apify Tools" + }, + "kernelspec": { + "display_name": "autogen", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/website/docs/Examples.md b/website/docs/Examples.md index 45c16de45715..d06bdd08965d 100644 --- a/website/docs/Examples.md +++ b/website/docs/Examples.md @@ -55,6 +55,7 @@ Links to notebook examples: - Browse the Web with Agents - [View Notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_surfer.ipynb) - **SQL**: Natural Language Text to SQL Query using the [Spider](https://yale-lily.github.io/spider) Text-to-SQL Benchmark - [View Notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_sql_spider.ipynb) - **Web Scraping**: Web Scraping with Apify - [View Notebook](/docs/notebooks/agentchat_webscraping_with_apify) +- **Web Crawling**: Crawl entire domain with Spider - [View Notebook](/docs/notebooks/agentchat_webcrawling_with_spider) - **Write a software app, task by task, with specially designed functions.** - [View Notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_function_call_code_writing.ipynb). ### Human Involvement From e26bc7689b6bb4e8800ce66ac244d2daed98cb1c Mon Sep 17 00:00:00 2001 From: WilliamEspegren Date: Sun, 19 May 2024 21:54:25 +0200 Subject: [PATCH 02/12] reset run count --- .../agentchat_webcrawling_with_spider.ipynb | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index 51ab674872a0..fa35f1b60fbe 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -14,7 +14,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First we need to install the Apify SDK and the AutoGen library." + "First we need to install the Spider SDK and the AutoGen library." ] }, { @@ -30,12 +30,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Setting up the LLM configuration and the Apify API key is also required." + "Setting up the LLM configuration and the Spider API key is also required." 
] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -52,13 +52,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's define the tool for scraping data from the website using Apify actor.\n", + "Let's define the tool for scraping and crawling data from any website with Spider.\n", "Read more about tool use in this [tutorial chapter](/docs/tutorial/tool-use)." ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -108,7 +108,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -163,7 +163,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -240,7 +240,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -265,13 +265,13 @@ ], "metadata": { "front_matter": { - "description": "Scrapping web pages and summarizing the content using agents with tools.", + "description": "Scraping/Crawling web pages and summarizing the content using agents with tools.", "tags": [ "web scraping", - "apify", + "spider", "tool use" ], - "title": "Web Scraper Agent using Apify Tools" + "title": "Web Scraper & crawler Agent using Spider" }, "kernelspec": { "display_name": "autogen", From 5ac92c31cf521dbb175f5372c04689ec5aa09f03 Mon Sep 17 00:00:00 2001 From: WilliamEspegren Date: Sun, 19 May 2024 22:32:56 +0200 Subject: [PATCH 03/12] spell correction --- notebook/agentchat_webcrawling_with_spider.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index fa35f1b60fbe..831321ef9e09 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -118,7 +118,7 @@ "scraper_agent = ConversableAgent(\n", " \"WebScraper\",\n", " llm_config={\"config_list\": config_list},\n", - " system_message=\"You are a web scrapper and you can scrape any web page using the tools provided. \"\n", + " system_message=\"You are a web scraper and you can scrape any web page using the tools provided. 
\"\n", " \"Returns 'TERMINATE' when the scraping is done.\",\n", ")\n", "\n", From 4371527b41b9caec8b8b98502a3790b99b791a07 Mon Sep 17 00:00:00 2001 From: WilliamEspegren Date: Sun, 19 May 2024 22:51:38 +0200 Subject: [PATCH 04/12] crawl agent --- .../agentchat_webcrawling_with_spider.ipynb | 183 +++++++++++++++--- 1 file changed, 156 insertions(+), 27 deletions(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index 831321ef9e09..430efef791fd 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -35,7 +35,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -58,7 +58,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 26, "metadata": {}, "outputs": [ { @@ -86,7 +86,7 @@ " scraped_data = client.scrape_url(url, params)\n", " return scraped_data\n", "\n", - "def crawl_domain(url: Annotated[str, \"The url of the domain to be crawled\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[str, \"Scraped content\"]:\n", + "def crawl_page(url: Annotated[str, \"The url of the domain to be crawled\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[str, \"Scraped content\"]:\n", " # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n", " client = Spider(spider_api_key) \n", "\n", @@ -108,7 +108,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 27, "metadata": {}, "outputs": [], "source": [ @@ -118,10 +118,18 @@ "scraper_agent = ConversableAgent(\n", " \"WebScraper\",\n", " llm_config={\"config_list\": config_list},\n", - " system_message=\"You are a web scraper and you can scrape any web page using the tools provided. 
\"\n", + " system_message=\"You are a web scraper and you can scrape any web page to retrieve its contents.\"\n", " \"Returns 'TERMINATE' when the scraping is done.\",\n", ")\n", "\n", + "# Create web crawler agent.\n", + "crawler_agent = ConversableAgent(\n", + " \"WebCrawler\",\n", + " llm_config={\"config_list\": config_list},\n", + " system_message=\"You are a web crawler and you can crawl any page with deeper crawling following subpages.\"\n", + " \"Returns 'TERMINATE' when the scraping is done.\",\n", + ") \n", + "\n", "# Create user proxy agent.\n", "user_proxy_agent = ConversableAgent(\n", " \"UserProxy\",\n", @@ -132,7 +140,7 @@ " default_auto_reply=\"Please continue if not finished, otherwise return 'TERMINATE'.\",\n", ")\n", "\n", - "# Register the function with the agents.\n", + "# Register the functions with the agents.\n", "register_function(\n", " scrape_page,\n", " caller=scraper_agent,\n", @@ -142,10 +150,10 @@ ")\n", "\n", "register_function(\n", - " crawl_domain,\n", - " caller=scraper_agent,\n", + " crawl_page,\n", + " caller=crawler_agent,\n", " executor=user_proxy_agent,\n", - " name=\"scrape_page\",\n", + " name=\"crawl_page\",\n", " description=\"Crawl an entire domain, following subpages and return the content.\",\n", ")" ] @@ -163,7 +171,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 28, "metadata": {}, "outputs": [ { @@ -179,9 +187,9 @@ ">>>>>>>> USING AUTO REPLY...\u001b[0m\n", "\u001b[33mWebScraper\u001b[0m (to UserProxy):\n", "\n", - "\u001b[32m***** Suggested tool call (call_ZOzNWXJO8yeANFIJLJz7VJRo): scrape_page *****\u001b[0m\n", + "\u001b[32m***** Suggested tool call (call_qCNYeQCfIPZkUCKejQmm5EhC): scrape_page *****\u001b[0m\n", "Arguments: \n", - "{\"url\":\"https://william-espegren.com\"}\n", + "{\"url\":\"https://www.william-espegren.com\"}\n", "\u001b[32m****************************************************************************\u001b[0m\n", "\n", "--------------------------------------------------------------------------------\n", @@ -191,8 +199,8 @@ "\n", "\u001b[33mUserProxy\u001b[0m (to WebScraper):\n", "\n", - "\u001b[32m***** Response from calling tool (call_ZOzNWXJO8yeANFIJLJz7VJRo) *****\u001b[0m\n", - "[{\"content\": \"William Espegren - Portfoliokeep scrollingMADE WITHCSS, JSMADE BYUppsalaWilliam EspegrenWith \\u00b7LoveOpen For Projects[CONTACT ME](https://www.linkedin.com/in/william-espegren/)[Instagram](https://www.instagram.com/williamespegren/)[LinkedIn](https://www.linkedin.com/in/william-espegren/)[Twitter](https://twitter.com/WilliamEspegren)[team-collaboration/version-control/github Created with Sketch.Github](https://github.com/WilliamEspegren)\", \"error\": null, \"status\": 200, \"url\": \"https://william-espegren.com\"}]\n", + "\u001b[32m***** Response from calling tool (call_qCNYeQCfIPZkUCKejQmm5EhC) *****\u001b[0m\n", + "[{\"content\": \"William Espegren - Portfoliokeep scrollingMADE WITHCSS, JSMADE BYUppsalaWilliam EspegrenWith \\u00b7LoveOpen For Projects[CONTACT ME](https://www.linkedin.com/in/william-espegren/)[Instagram](https://www.instagram.com/williamespegren/)[LinkedIn](https://www.linkedin.com/in/william-espegren/)[Twitter](https://twitter.com/WilliamEspegren)[team-collaboration/version-control/github Created with Sketch.Github](https://github.com/WilliamEspegren)\", \"error\": null, \"status\": 200, \"url\": \"https://www.william-espegren.com\"}]\n", "\u001b[32m**********************************************************************\u001b[0m\n", "\n", 
"--------------------------------------------------------------------------------\n", @@ -200,20 +208,139 @@ ">>>>>>>> USING AUTO REPLY...\u001b[0m\n", "\u001b[33mWebScraper\u001b[0m (to UserProxy):\n", "\n", - "I've scraped the content of [William Espegren's website](https://william-espegren.com). Here's the information available:\n", + "I successfully scraped the website \"william-espegren.com\". Here is the content retrieved:\n", + "\n", + "```\n", + "William Espegren - Portfolio\n", + "\n", + "keep scrolling\n", + "\n", + "MADE WITH\n", + "CSS, JS\n", + "\n", + "MADE BY\n", + "Uppsala\n", + "\n", + "William Espegren\n", + "With Love\n", + "\n", + "Open For Projects\n", + "\n", + "[CONTACT ME](https://www.linkedin.com/in/william-espegren/)\n", + "[Instagram](https://www.instagram.com/williamespegren/)\n", + "[LinkedIn](https://www.linkedin.com/in/william-espegren/)\n", + "[Twitter](https://twitter.com/WilliamEspegren)\n", + "[Github](https://github.com/WilliamEspegren)\n", + "```\n", + "\n", + "Is there anything specific you would like to do with this information?\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[33mUserProxy\u001b[0m (to WebScraper):\n", + "\n", + "Please continue if not finished, otherwise return 'TERMINATE'.\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[31m\n", + ">>>>>>>> USING AUTO REPLY...\u001b[0m\n", + "\u001b[33mWebScraper\u001b[0m (to UserProxy):\n", "\n", - "- **Name:** William Espegren\n", - "- **Location:** Uppsala\n", - "- **Skills:** CSS, JS\n", - "- **Status:** Open for projects\n", + "TERMINATE\n", + "\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "# Scrape page\n", + "scraped_chat_result = user_proxy_agent.initiate_chat(\n", + " scraper_agent,\n", + " message=\"Can you scrape william-espegren.com for me?\",\n", + " summary_method=\"reflection_with_llm\",\n", + " summary_args={\n", + " \"summary_prompt\": \"\"\"Summarize the scraped content\"\"\"\n", + " },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[33mUserProxy\u001b[0m (to WebCrawler):\n", + "\n", + "Can you crawl william-espegren.com for me, I want the whole domains information?\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[31m\n", + ">>>>>>>> USING AUTO REPLY...\u001b[0m\n", + "\u001b[33mWebCrawler\u001b[0m (to UserProxy):\n", + "\n", + "\u001b[32m***** Suggested tool call (call_0FkTtsxBtA0SbChm1PX085Vk): crawl_page *****\u001b[0m\n", + "Arguments: \n", + "{\"url\":\"http://www.william-espegren.com\"}\n", + "\u001b[32m***************************************************************************\u001b[0m\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[35m\n", + ">>>>>>>> EXECUTING FUNCTION crawl_page...\u001b[0m\n", + "\u001b[33mUserProxy\u001b[0m (to WebCrawler):\n", + "\n", + "\u001b[33mUserProxy\u001b[0m (to WebCrawler):\n", + "\n", + "\u001b[32m***** Response from calling tool (call_0FkTtsxBtA0SbChm1PX085Vk) *****\u001b[0m\n", + "[{\"content\": \"William Espegren - Portfoliokeep scrollingMADE WITHCSS, JSMADE BYUppsalaWilliam EspegrenWith \\u00b7LoveOpen For Projects[CONTACT 
ME](https://www.linkedin.com/in/william-espegren/)[Instagram](https://www.instagram.com/williamespegren/)[LinkedIn](https://www.linkedin.com/in/william-espegren/)[Twitter](https://twitter.com/WilliamEspegren)[team-collaboration/version-control/github Created with Sketch.Github](https://github.com/WilliamEspegren)\", \"error\": null, \"status\": 200, \"url\": \"http://www.william-espegren.com\"}]\n", + "\u001b[32m**********************************************************************\u001b[0m\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[31m\n", + ">>>>>>>> USING AUTO REPLY...\u001b[0m\n", + "\u001b[33mWebCrawler\u001b[0m (to UserProxy):\n", + "\n", + "The crawl of [william-espegren.com](http://www.william-espegren.com) has been completed. Here is the gathered content:\n", + "\n", + "---\n", + "\n", + "**William Espegren - Portfolio**\n", + "\n", + "Keep scrolling\n", + "\n", + "**MADE WITH:** CSS, JS\n", + "\n", + "**MADE BY:** Uppsala\n", + "\n", + "**William Espegren**\n", + "\n", + "**With Love**\n", + "\n", + "**Open For Projects**\n", + "\n", + "**[CONTACT ME](https://www.linkedin.com/in/william-espegren/)**\n", "\n", - "### Social and Contact Links:\n", - "- [Contact Me](https://www.linkedin.com/in/william-espegren/)\n", "- [Instagram](https://www.instagram.com/williamespegren/)\n", "- [LinkedIn](https://www.linkedin.com/in/william-espegren/)\n", "- [Twitter](https://twitter.com/WilliamEspegren)\n", "- [Github](https://github.com/WilliamEspegren)\n", "\n", + "---\n", + "\n", + "If you need further information or details from any specific section, please let me know!\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[33mUserProxy\u001b[0m (to WebCrawler):\n", + "\n", + "Please continue if not finished, otherwise return 'TERMINATE'.\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[31m\n", + ">>>>>>>> USING AUTO REPLY...\u001b[0m\n", + "\u001b[33mWebCrawler\u001b[0m (to UserProxy):\n", + "\n", "TERMINATE\n", "\n", "--------------------------------------------------------------------------------\n" @@ -221,12 +348,13 @@ } ], "source": [ - "chat_result = user_proxy_agent.initiate_chat(\n", - " scraper_agent,\n", - " message=\"Can you scrape william-espegren.com for me?\",\n", + "# Crawl page\n", + "crawled_chat_result = user_proxy_agent.initiate_chat(\n", + " crawler_agent,\n", + " message=\"Can you crawl william-espegren.com for me, I want the whole domains information?\",\n", " summary_method=\"reflection_with_llm\",\n", " summary_args={\n", - " \"summary_prompt\": \"\"\"Summarize the scraped content\"\"\"\n", + " \"summary_prompt\": \"\"\"Summarize the crawled content\"\"\"\n", " },\n", ")" ] @@ -240,7 +368,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 29, "metadata": {}, "outputs": [ { @@ -259,7 +387,8 @@ } ], "source": [ - "print(chat_result.summary)" + "print(scraped_chat_result.summary) \n", + "# print(crawled_chat_result.summary) # We show one for cleaner output" ] } ], From 07e61844255e4d82e6dfbaeb279e8ebfdd746a91 Mon Sep 17 00:00:00 2001 From: WilliamEspegren Date: Mon, 20 May 2024 18:01:13 +0200 Subject: [PATCH 05/12] reset execution counters --- notebook/agentchat_webcrawling_with_spider.ipynb | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb 
b/notebook/agentchat_webcrawling_with_spider.ipynb index 430efef791fd..74e4b5dc27f2 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -35,7 +35,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -58,7 +58,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -108,7 +108,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -171,7 +171,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -265,7 +265,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -368,7 +368,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 7, "metadata": {}, "outputs": [ { From 7b14d07dc304e3584f1ba8a8ca262f7848915160 Mon Sep 17 00:00:00 2001 From: WilliamEspegren Date: Mon, 20 May 2024 18:08:23 +0200 Subject: [PATCH 06/12] correct return types --- notebook/agentchat_webcrawling_with_spider.ipynb | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index 74e4b5dc27f2..933228df93d0 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -72,9 +72,10 @@ "source": [ "from spider import Spider\n", "from typing_extensions import Annotated\n", + "from typing import List, Dict, Any\n", "\n", "\n", - "def scrape_page(url: Annotated[str, \"The URL of the web page to scrape\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[str, \"Scraped content\"]:\n", + "def scrape_page(url: Annotated[str, \"The URL of the web page to scrape\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[Dict[str, Any], \"Scraped content\"]:\n", " # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n", " client = Spider(spider_api_key) \n", "\n", @@ -84,9 +85,9 @@ " }\n", "\n", " scraped_data = client.scrape_url(url, params)\n", - " return scraped_data\n", + " return scraped_data[0]\n", "\n", - "def crawl_page(url: Annotated[str, \"The url of the domain to be crawled\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[str, \"Scraped content\"]:\n", + "def crawl_page(url: Annotated[str, \"The url of the domain to be crawled\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[List[Dict[str, Any]], \"Scraped content\"]:\n", " # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n", " client = Spider(spider_api_key) \n", "\n", From a6f4bb524f50396bcf6ca4261480402efa457d72 Mon Sep 17 00:00:00 2001 From: WilliamEspegren Date: Wed, 22 May 2024 09:10:23 +0200 Subject: [PATCH 07/12] metadat for website --- notebook/agentchat_webcrawling_with_spider.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index 933228df93d0..c04fcd793a1a 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ 
b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -395,13 +395,13 @@ ], "metadata": { "front_matter": { - "description": "Scraping/Crawling web pages and summarizing the content using agents with tools.", + "description": "Scraping/Crawling web pages and summarizing the content using agents.", "tags": [ "web scraping", "spider", "tool use" ], - "title": "Web Scraper & crawler Agent using Spider" + "title": "Web Scraper & Crawler Agent using Spider" }, "kernelspec": { "display_name": "autogen", From 9137f2db0156a328b0ab1ab51bedbaddb724a971 Mon Sep 17 00:00:00 2001 From: WilliamEspegren Date: Sun, 26 May 2024 18:05:44 +0200 Subject: [PATCH 08/12] format --- ...at_auto_feedback_from_code_execution.ipynb | 2 +- notebook/agentchat_logging.ipynb | 1 - .../agentchat_webcrawling_with_spider.ipynb | 42 +++++++++---------- website/docs/topics/llm_configuration.ipynb | 1 + 4 files changed, 23 insertions(+), 23 deletions(-) diff --git a/notebook/agentchat_auto_feedback_from_code_execution.ipynb b/notebook/agentchat_auto_feedback_from_code_execution.ipynb index bf784889d61d..6ea6f662b93b 100644 --- a/notebook/agentchat_auto_feedback_from_code_execution.ipynb +++ b/notebook/agentchat_auto_feedback_from_code_execution.ipynb @@ -40,7 +40,7 @@ " filter_dict={\"tags\": [\"gpt-4\"]}, # comment out to get all\n", ")\n", "# When using a single openai endpoint, you can use the following:\n", - "# config_list = [{\"model\": \"gpt-4\", \"api_key\": os.getenv(\"OPENAI_API_KEY\")}]\n" + "# config_list = [{\"model\": \"gpt-4\", \"api_key\": os.getenv(\"OPENAI_API_KEY\")}]" ] }, { diff --git a/notebook/agentchat_logging.ipynb b/notebook/agentchat_logging.ipynb index 7eb4138b4cc1..775cc87f9dd2 100644 --- a/notebook/agentchat_logging.ipynb +++ b/notebook/agentchat_logging.ipynb @@ -328,7 +328,6 @@ } ], "source": [ - "\n", "import pandas as pd\n", "\n", "import autogen\n", diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index c04fcd793a1a..ccd41b21c71d 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -70,31 +70,35 @@ } ], "source": [ + "from typing import Any, Dict, List\n", + "\n", "from spider import Spider\n", "from typing_extensions import Annotated\n", - "from typing import List, Dict, Any\n", "\n", "\n", - "def scrape_page(url: Annotated[str, \"The URL of the web page to scrape\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[Dict[str, Any], \"Scraped content\"]:\n", + "def scrape_page(\n", + " url: Annotated[str, \"The URL of the web page to scrape\"],\n", + " params: Annotated[dict, \"Dictionary of additional params.\"] = None,\n", + ") -> Annotated[Dict[str, Any], \"Scraped content\"]:\n", " # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n", - " client = Spider(spider_api_key) \n", + " client = Spider(spider_api_key)\n", "\n", - " if params == None: \n", - " params = {\n", - " \"return_format\": \"markdown\"\n", - " }\n", + " if params is None:\n", + " params = {\"return_format\": \"markdown\"}\n", "\n", " scraped_data = client.scrape_url(url, params)\n", " return scraped_data[0]\n", "\n", - "def crawl_page(url: Annotated[str, \"The url of the domain to be crawled\"], params: Annotated[dict, \"Dictionary of additional params.\"] = None) -> Annotated[List[Dict[str, Any]], \"Scraped content\"]:\n", + "\n", + "def crawl_page(\n", + " url: Annotated[str, 
\"The url of the domain to be crawled\"],\n", + " params: Annotated[dict, \"Dictionary of additional params.\"] = None,\n", + ") -> Annotated[List[Dict[str, Any]], \"Scraped content\"]:\n", " # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n", - " client = Spider(spider_api_key) \n", + " client = Spider(spider_api_key)\n", "\n", - " if params == None:\n", - " params = {\n", - " \"return_format\": \"markdown\"\n", - " }\n", + " if params is None:\n", + " params = {\"return_format\": \"markdown\"}\n", "\n", " crawled_data = client.crawl_url(url, params)\n", " return crawled_data" @@ -129,7 +133,7 @@ " llm_config={\"config_list\": config_list},\n", " system_message=\"You are a web crawler and you can crawl any page with deeper crawling following subpages.\"\n", " \"Returns 'TERMINATE' when the scraping is done.\",\n", - ") \n", + ")\n", "\n", "# Create user proxy agent.\n", "user_proxy_agent = ConversableAgent(\n", @@ -258,9 +262,7 @@ " scraper_agent,\n", " message=\"Can you scrape william-espegren.com for me?\",\n", " summary_method=\"reflection_with_llm\",\n", - " summary_args={\n", - " \"summary_prompt\": \"\"\"Summarize the scraped content\"\"\"\n", - " },\n", + " summary_args={\"summary_prompt\": \"\"\"Summarize the scraped content\"\"\"},\n", ")" ] }, @@ -354,9 +356,7 @@ " crawler_agent,\n", " message=\"Can you crawl william-espegren.com for me, I want the whole domains information?\",\n", " summary_method=\"reflection_with_llm\",\n", - " summary_args={\n", - " \"summary_prompt\": \"\"\"Summarize the crawled content\"\"\"\n", - " },\n", + " summary_args={\"summary_prompt\": \"\"\"Summarize the crawled content\"\"\"},\n", ")" ] }, @@ -388,7 +388,7 @@ } ], "source": [ - "print(scraped_chat_result.summary) \n", + "print(scraped_chat_result.summary)\n", "# print(crawled_chat_result.summary) # We show one for cleaner output" ] } diff --git a/website/docs/topics/llm_configuration.ipynb b/website/docs/topics/llm_configuration.ipynb index 51abf1f46225..c0a1b7e74a98 100644 --- a/website/docs/topics/llm_configuration.ipynb +++ b/website/docs/topics/llm_configuration.ipynb @@ -279,6 +279,7 @@ " def __deepcopy__(self, memo):\n", " return self\n", "\n", + "\n", "config_list = [\n", " {\n", " \"model\": \"my-gpt-4-deployment\",\n", From d2529212314b2b72ca38f514016e987d7fd097e6 Mon Sep 17 00:00:00 2001 From: William Espegren <131612909+WilliamEspegren@users.noreply.github.com> Date: Tue, 1 Oct 2024 12:51:15 +0200 Subject: [PATCH 09/12] Update notebook/agentchat_webcrawling_with_spider.ipynb Co-authored-by: Eric Zhu --- notebook/agentchat_webcrawling_with_spider.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index ccd41b21c71d..1e9ac4b7672f 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Web Scraping using Spider\n", + "# Web Scraping using Spider API\n", "\n", "This notebook shows how to use the fastest open \n", "source web crawler together with AutoGen agents." 
From d2ef79f451ccdd2074395aab45be237f6f2dca16 Mon Sep 17 00:00:00 2001 From: William Espegren <131612909+WilliamEspegren@users.noreply.github.com> Date: Tue, 1 Oct 2024 12:51:42 +0200 Subject: [PATCH 10/12] Update website/docs/Examples.md Co-authored-by: Eric Zhu --- website/docs/Examples.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/Examples.md b/website/docs/Examples.md index 1d40df83d027..6348621d822b 100644 --- a/website/docs/Examples.md +++ b/website/docs/Examples.md @@ -55,7 +55,7 @@ Links to notebook examples: - Browse the Web with Agents - [View Notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_surfer.ipynb) - **SQL**: Natural Language Text to SQL Query using the [Spider](https://yale-lily.github.io/spider) Text-to-SQL Benchmark - [View Notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_sql_spider.ipynb) - **Web Scraping**: Web Scraping with Apify - [View Notebook](/docs/notebooks/agentchat_webscraping_with_apify) -- **Web Crawling**: Crawl entire domain with Spider - [View Notebook](/docs/notebooks/agentchat_webcrawling_with_spider) +- **Web Crawling**: Crawl entire domain with Spider API - [View Notebook](/docs/notebooks/agentchat_webcrawling_with_spider) - **Write a software app, task by task, with specially designed functions.** - [View Notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_function_call_code_writing.ipynb). ### Human Involvement From 0b5933ac33892a595123d9c31f5e92bec5312667 Mon Sep 17 00:00:00 2001 From: William Espegren <131612909+WilliamEspegren@users.noreply.github.com> Date: Fri, 4 Oct 2024 13:23:05 +0200 Subject: [PATCH 11/12] Update agentchat_webcrawling_with_spider.ipynb --- notebook/agentchat_webcrawling_with_spider.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index 1e9ac4b7672f..431a01697bfc 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -7,7 +7,7 @@ "# Web Scraping using Spider API\n", "\n", "This notebook shows how to use the fastest open \n", - "source web crawler together with AutoGen agents." + "source [Spider](https://spider.cloud/) web crawler together with AutoGen agents." ] }, { From fb50ced7f0043a94f209febbd6397cba018dedd1 Mon Sep 17 00:00:00 2001 From: William Espegren <131612909+WilliamEspegren@users.noreply.github.com> Date: Sat, 12 Oct 2024 01:11:51 +0200 Subject: [PATCH 12/12] Update agentchat_webcrawling_with_spider.ipynb Co-authored-by: Eric Zhu --- notebook/agentchat_webcrawling_with_spider.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebook/agentchat_webcrawling_with_spider.ipynb b/notebook/agentchat_webcrawling_with_spider.ipynb index 431a01697bfc..45d270b37e5e 100644 --- a/notebook/agentchat_webcrawling_with_spider.ipynb +++ b/notebook/agentchat_webcrawling_with_spider.ipynb @@ -6,7 +6,7 @@ "source": [ "# Web Scraping using Spider API\n", "\n", - "This notebook shows how to use the fastest open \n", + "This notebook shows how to use the open \n", "source [Spider](https://spider.cloud/) web crawler together with AutoGen agents." ] },
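For readers who want to try the Spider client outside of AutoGen, below is a minimal standalone sketch of the raw `scrape_url`/`crawl_url` calls that the notebook's `scrape_page` and `crawl_page` tools wrap. It assumes the `spider-client` package is installed and `SPIDER_API_KEY` is set in the environment, as in the notebook; the target URL is only an illustration, and the printed fields follow the response shape shown in the notebook's output (a list of dicts with `content`, `error`, `status`, and `url` keys).

```python
import os

from spider import Spider

# The notebook reads the key from SPIDER_API_KEY; per its comments the client
# falls back to that variable when no key is passed explicitly.
client = Spider(os.getenv("SPIDER_API_KEY"))

params = {"return_format": "markdown"}

# Scrape a single page. The tool output in the notebook shows a list of dicts
# with 'content', 'error', 'status' and 'url' keys, so inspect the first entry.
scraped = client.scrape_url("https://spider.cloud", params)
print(scraped[0]["status"], scraped[0]["url"])

# Crawl a domain, following its subpages, and report how many pages came back.
crawled = client.crawl_url("https://spider.cloud", params)
print(f"{len(crawled)} pages crawled")
```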