modify readme #312

Merged: 2 commits, Jan 24, 2024
53 changes: 43 additions & 10 deletions erniebot-agent/applications/erniebot_researcher/README.md
@@ -66,29 +66,52 @@ wget https://paddlenlp.bj.bcebos.com/pipelines/fonts/SimSun.ttf

> Step 4: Create the index

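Download the example data
**Data preparation**

We support files in formats such as docx, pdf, and txt. Put these files in a single folder, then run the commands below to create the index; reports are later written based on these files.

For convenience in testing, we provide sample data.
Sample data: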

```
wget https://paddlenlp.bj.bcebos.com/pipelines/erniebot_researcher_example.tar.gz
tar xvf erniebot_researcher_example.tar.gz
```

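First, register and log in at [AI Studio (星河社区)](https://aistudio.baidu.com/index), then obtain an `Access Token` from the AI Studio [access token page](https://aistudio.baidu.com/index/accessToken), and finally set the environment variables:
URL data:

If you have URL links for your files, you can pass in a txt file that stores them. Each line of the txt file holds a URL link followed by the path of the corresponding file, for example: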
```
export EB_AGENT_ACCESS_TOKEN=<aistudio-access-token>
export AISTUDIO_ACCESS_TOKEN=<aistudio-access-token>
https://zhuanlan.zhihu.com/p/659457816 erniebot_researcher_example/Ai_Agent的起源.md
```
If no URL file is passed in, each file's path is used as its URL link by default.
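
The URL file format described above can be read with a few lines of Python. This is a minimal sketch, not the project's own loader: the helper names `parse_url_lines` and `url_for` are hypothetical, and the code assumes each line is a URL, then whitespace, then the file path, as in the example line.

```python
from typing import Dict, Iterable


def parse_url_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each file path to its URL (hypothetical helper, not part of the project).

    Assumes every non-empty line is "<url> <file path>", split on the
    first run of whitespace so that paths containing spaces survive.
    """
    mapping: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"expected '<url> <file path>', got: {line!r}")
        url, path = parts
        mapping[path] = url
    return mapping


def url_for(path: str, mapping: Dict[str, str]) -> str:
    """Return the URL for a file, falling back to the path itself when no URL is known."""
    return mapping.get(path, path)
```

A file opened with `open(path, encoding="utf-8")` can be passed directly as `lines`, since iterating over a text file yields one line at a time.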

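If you have URL links, you can pass in a txt file that stores them.
Each line of the txt file holds a URL link followed by the corresponding file path, for example:
'https://zhuanlan.zhihu.com/p/659457816 erniebot_researcher_example/Ai_Agent的起源.md'
Abstract data:

If no URL file is passed in, each file's path is used as its URL link by default.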
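You can use the `path_abstract` parameter to pass in the path where the abstracts for your files are stored.
The abstracts must be stored in a json file that contains a list of dictionaries, each with three key-value pairs:
- `page_content` : `str`, the abstract of the file.
- `url` : `str`, the URL link of the file.
- `name` : `str`, the name of the file.

For example: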

```
[{"page_content":"file abstract","url":"https://zhuanlan.zhihu.com/p/659457816","name":"Ai_Agent的起源"},
...]
```

If you have no abstract file, leave `path_abstract` at its default value; abstracts are then generated automatically with ernie-4.0 and stored at abstract.json.
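
An abstract file in the format described above could be checked and serialized as follows. This is a sketch under stated assumptions: `validate_abstracts` and `abstracts_to_json` are hypothetical helpers, not part of the project, and the only rule enforced is the one given in the text (three string fields per record).

```python
import json
from typing import Dict, List

# The three keys each record needs, as listed in the README text.
REQUIRED_KEYS = ("page_content", "url", "name")


def validate_abstracts(records: List[Dict[str, str]]) -> None:
    """Raise ValueError unless every record has the three string fields."""
    for i, record in enumerate(records):
        for key in REQUIRED_KEYS:
            if not isinstance(record.get(key), str):
                raise ValueError(f"record {i}: '{key}' must be a string")


def abstracts_to_json(records: List[Dict[str, str]]) -> str:
    """Validate the records and return them as a JSON array (hypothetical helper)."""
    validate_abstracts(records)
    # ensure_ascii=False keeps non-ASCII names such as Chinese titles readable.
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Writing the returned string to a file and passing that file's path through `path_abstract` would match the format above. Note that every value, including `name`, has to be a quoted JSON string.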

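**Create the index**

First, register and log in at [AI Studio (星河社区)](https://aistudio.baidu.com/index), then obtain an `Access Token` from the AI Studio [access token page](https://aistudio.baidu.com/index/accessToken), and finally set the environment variables:

**With abstracts and URL links**

You can pass in the path where your file abstracts are stored. The abstracts must be stored in a json file that contains a list of dictionaries, each with three key-value pairs: "page_content" stores the abstract of the file, "url" is the file's URL link, and "name" is the name of the article. For example:
[{"page_content":"article abstract","url":"https://zhuanlan.zhihu.com/p/659457816","name":"Ai_Agent的起源"},...]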
```
export EB_AGENT_ACCESS_TOKEN=<aistudio-access-token>
export AISTUDIO_ACCESS_TOKEN=<aistudio-access-token>
python ./tools/preprocessing.py \
--index_name_full_text <the index name of your full text> \
--index_name_abstract <the index name of your abstract text> \
@@ -97,6 +120,16 @@ python ./tools/preprocessing.py \
--path_abstract <the json path of your abstract text>
```

**Without abstracts or URL links**

```
export EB_AGENT_ACCESS_TOKEN=<aistudio-access-token>
export AISTUDIO_ACCESS_TOKEN=<aistudio-access-token>
python ./tools/preprocessing.py \
--index_name_full_text <the index name of your full text> \
--index_name_abstract <the index name of your abstract text> \
--path_full_text <the folder path of your full text>
```
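
When indexing is driven from a script, the command above can be assembled programmatically. The sketch below is illustrative only: `build_preprocessing_cmd` is a hypothetical helper, it uses just the flags visible in this diff, and it assumes `--path_abstract` can be combined with the flags of the no-abstract command. The flag for the URL file is not visible in the diff, so it is left out.

```python
import shlex
from typing import List, Optional


def build_preprocessing_cmd(
    index_name_full_text: str,
    index_name_abstract: str,
    path_full_text: str,
    path_abstract: Optional[str] = None,
) -> List[str]:
    """Build the argument list for tools/preprocessing.py (hypothetical helper)."""
    cmd = [
        "python",
        "./tools/preprocessing.py",
        "--index_name_full_text", index_name_full_text,
        "--index_name_abstract", index_name_abstract,
        "--path_full_text", path_full_text,
    ]
    # Assumption: the abstract path is optional and simply appended when given.
    if path_abstract is not None:
        cmd += ["--path_abstract", path_abstract]
    return cmd


if __name__ == "__main__":
    # Placeholder index names; print the shell-quoted command for inspection.
    print(shlex.join(build_preprocessing_cmd("full_text", "abstract", "erniebot_researcher_example")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `EB_AGENT_ACCESS_TOKEN` and `AISTUDIO_ACCESS_TOKEN` are set in the environment.
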
> Step 5: Run


@@ -155,7 +155,6 @@ async def run(self, query: str):
for sub_query in sub_queries:
    research_result = await self.run_search_summary(sub_query)
    paragraphs_item.extend(research_result)

paragraphs = []
for item in paragraphs_item:
    if item not in paragraphs: