`docs/mddocs/Quickstart/graphrag_quickstart.md` (80 additions & 19 deletions)
The [GraphRAG project](https://github.com/microsoft/graphrag) is designed to leverage …

- [Setup Python Environment for GraphRAG](#3-setup-python-environment-for-graphrag)
- [Index GraphRAG](#4-index-graphrag)
- [Query GraphRAG](#5-query-graphrag)
- [Troubleshooting](#troubleshooting)
## Quickstart
### 1. Install and Start `Ollama` Service on Intel GPU
Follow the steps in [Run Ollama with IPEX-LLM on Intel GPU Guide](./ollama_quickstart.md) to install `ipex-llm[cpp]==2.1.0` and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `https://127.0.0.1:11434`).
**Please note that for GraphRAG, we highly recommend using the stable version of ipex-llm through `pip install ipex-llm[cpp]==2.1.0`**.
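For example, a minimal sketch of that setup might look as follows (the conda environment name is illustrative, and the GPU-related environment variables plus OS-specific details are covered in the linked guide):

```bash
# Create and activate a dedicated environment for the Ollama server (name is illustrative)
conda create -n llm-cpp python=3.11
conda activate llm-cpp

# Install the stable ipex-llm C++ backend recommended for GraphRAG
pip install ipex-llm[cpp]==2.1.0

# Initialize the Ollama binary provided by ipex-llm, then start the service
# (set the environment variables from the Ollama quickstart before serving)
init-ollama
./ollama serve
```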
In the setup above, `pip install ollama` is for enabling RESTful APIs through Python, and `pip install plotly` is for visualizing the knowledge graph.
> [!NOTE]
> Please note that the Python environment for GraphRAG set up here is separate from the one for the Ollama server on Intel GPUs.
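A minimal sketch of that separate environment could look like the following (the environment name and Python version are assumptions; installing GraphRAG itself and its remaining dependencies follows the full quickstart):

```bash
# A separate conda environment for GraphRAG (name and Python version are illustrative)
conda create -n graphrag-local python=3.10
conda activate graphrag-local

# install GraphRAG and its other dependencies here, per the full quickstart

pip install ollama   # enables the RESTful APIs through Python
pip install plotly   # used to visualize the knowledge graph
```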
### 4. Index GraphRAG
The environment is now ready for GraphRAG with local LLMs and embedding models running on Intel GPUs. Before querying GraphRAG, it is necessary to first index GraphRAG, which could be a resource-intensive operation.
Prepare the input corpus, and then initialize the workspace:
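As an illustrative sketch (the file paths and the `graphrag.index` entry point here are assumptions; GraphRAG's own documentation is authoritative for the exact commands):

```bash
# Place a plain-text corpus in the workspace input folder (paths are illustrative)
mkdir -p ./ragtest/input
cp ./my_corpus.txt ./ragtest/input/

# Initialize the GraphRAG workspace; this generates settings.yml under ./ragtest
python -m graphrag.index --init --root ./ragtest
```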
#### Update `settings.yml`
In the `settings.yml` file inside the `ragtest` folder, add the configuration `request_timeout: 1800.0` for `llm`. In addition, if you would like to use LLMs or embedding models other than `mistral` or `nomic-embed-text`, you are required to update the `settings.yml` in the `ragtest` folder accordingly:
```yml
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: mistral # change it accordingly if using another LLM
  model_supports_json: true
  request_timeout: 1800.0 # add this configuration; you could also increase the request_timeout
  api_base: http://localhost:11434/v1

embeddings:
  async_mode: threaded
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: nomic_embed_text # change it accordingly if using another embedding model
    api_base: http://localhost:11434/api
```
#### Conduct GraphRAG indexing
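As a sketch, indexing is typically launched from the workspace root roughly like this (the `graphrag.index` entry point is an assumption carried over from the initialization step above):

```bash
# Run GraphRAG indexing against the ./ragtest workspace (can take a long time with local models)
python -m graphrag.index --root ./ragtest
```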
```
...
Since its initial introduction, the Transformer model has been further developed and improved upon. Variants of the Transformer architecture, such as BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimized BERT Pretraining Approach), have achieved state-of-the-art performance on a wide range of natural language processing tasks [Data: Reports (1, 2, 34, 46, 64, +more)].
```
### Troubleshooting
#### `failed to find free space in the KV cache, retrying with smaller n_batch` when conducting GraphRAG Indexing, and `JSONDecodeError` when querying GraphRAG
If you observe the Ollama server log showing `failed to find free space in the KV cache, retrying with smaller n_batch` while conducting GraphRAG indexing, and receive a `JSONDecodeError` when querying GraphRAG, try increasing the context length of the LLM model and then index/query GraphRAG again.
Here is how to make the LLM model support a larger context. To do this, first create a file named `Modelfile`:
```
FROM mistral:latest
PARAMETER num_ctx 4096
```
> [!TIP]
> Here we increase `num_ctx` to 4096 as an example. You could adjust it accordingly.
Then use the following commands to create a new model in Ollama named `mistral:latest-nctx4096`:
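A likely form of that command, assuming the `Modelfile` above sits in the current working directory, is:

```bash
# Register a new Ollama model that carries the larger context window
ollama create mistral:latest-nctx4096 -f ./Modelfile
```

After creating it, you would also point the `model` field under `llm` in `settings.yml` to `mistral:latest-nctx4096` so that GraphRAG uses the new model.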
---

In this directory, you will find examples of how you can apply IPEX-LLM INT4 optimizations on MiniCPM-V-2_6 models. For illustration purposes, we utilize the [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) as a reference MiniCPM-V-2_6 model.
## 0. Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `chat()` API
In the example [chat.py](./chat.py), we show a basic use case for a MiniCPM-V-2_6 model to predict the next N tokens using the `chat()` API, with IPEX-LLM INT4 optimizations.
```
python ./chat.py --prompt 'What is in the image?' --stream
```
> [!TIP]
> For chatting in streaming mode, it is recommended to set the environment variable `PYTHONUNBUFFERED=1`.
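> For example, you could run `export PYTHONUNBUFFERED=1` on Linux (or `set PYTHONUNBUFFERED=1` in a Windows command prompt) before launching `chat.py`.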
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id for the MiniCPM-V-2_6 model (e.g. `openbmb/MiniCPM-V-2_6`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'openbmb/MiniCPM-V-2_6'`.
- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be inferred. It defaults to `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is in the image?'`.
- `--stream`: flag to chat in streaming mode.
> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, a *X*B model saved in 16-bit will require approximately 2*X* GB of memory for loading, and ~0.5*X* GB memory for further inference.
>
> Please select the appropriate size of the MiniCPM model based on the capabilities of your machine.
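> For example, by this rule of thumb an 8B model would need roughly 16 GB of memory to load in 16-bit and around 4 GB for INT4 inference.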
#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```cmd
python ./chat.py
```
#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.
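For example, a sketch for a Linux server with 48 physical cores per socket (the core count and the helper-script invocation are assumptions; the linked configuration notes are authoritative):

```bash
# Set the recommended IPEX-LLM environment variables via the bundled helper script
source ipex-llm-init

# Use all physical cores of a single socket (48 cores assumed here)
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./chat.py
```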
The image features a young child holding a white teddy bear dressed in pink. The background includes some red flowers and what appears to be a stone wall.