doc(README): add Google Colab link
DoodleBears committed Jun 30, 2024
1 parent 92abd65 commit 1dbb526
Showing 3 changed files with 14 additions and 7 deletions.
17 changes: 11 additions & 6 deletions README.md
@@ -4,6 +4,9 @@
[![Downloads](https://static.pepy.tech/badge/split-lang)](https://pepy.tech/project/split-lang)
[![Downloads](https://static.pepy.tech/badge/split-lang/month)](https://pepy.tech/project/split-lang)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DoodleBears/split-lang/blob/main/split-lang-demo.ipynb)


[![Open Source Love](https://badges.frapsoft.com/os/mit/mit.svg?v=102)](https://github.com/ellerbrock/open-source-badge/)

[![wakatime](https://wakatime.com/badge/user/5728d95a-5cfb-4acb-b600-e34c2fc231b6/project/e06e0a00-9ba1-453d-8c62-a0b2604aaaad.svg)](https://wakatime.com/badge/user/5728d95a-5cfb-4acb-b600-e34c2fc231b6/project/e06e0a00-9ba1-453d-8c62-a0b2604aaaad)
@@ -12,7 +15,7 @@ Splitting sentences by concatenating over-split substrings based on their langua

powered by [`wtpsplit`](https://github.com/segment-any-text/wtpsplit) and [`fast-langdetect`](https://github.com/LlmKira/fast-langdetect) and [`langdetect`](https://github.com/Mimino666/langdetect)

## Idea
## 1.1. Idea

**Stage 1**: rule-based split using punctuation
- `hello, how are you` -> `hello` | `,` | `how are you`
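
The Stage 1 example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code `split-lang` actually uses: the function name `split_on_punctuation` and the punctuation set are assumptions made for this sketch.

```python
import re

# Hypothetical punctuation set chosen for illustration only;
# the real library may use a different (larger) set.
_PUNCT = re.compile(r"([,.;:!?，。；：！？])")


def split_on_punctuation(text: str) -> list[str]:
    """Split text at punctuation marks, keeping each mark as its own piece."""
    # A capturing group makes re.split return the delimiters as well.
    parts = _PUNCT.split(text)
    # Trim surrounding whitespace and drop empty pieces.
    return [part.strip() for part in parts if part.strip()]


print(" | ".join(split_on_punctuation("hello, how are you")))
```

Keeping the punctuation marks as separate pieces means the original text can still be reassembled after the later stages merge neighbouring substrings.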
@@ -39,15 +42,15 @@ Vielen Dank merci beaucoup for your help.
```

- [1. `split-lang`](#1-split-lang)
- [Idea](#idea)
- [1.1. Idea](#11-idea)
- [2. Motivation](#2-motivation)
- [3. Usage](#3-usage)
- [3.1. Installation](#31-installation)
- [3.2. Basic](#32-basic)
- [3.2.1. `split_by_lang`](#321-split_by_lang)
- [3.3. Advanced](#33-advanced)
- [`threshold`](#threshold)
- [3.3.1. usage of `lang_map` (for better result)](#331-usage-of-lang_map-for-better-result)
- [3.3.1. `threshold`](#331-threshold)
- [3.3.2. usage of `lang_map` (for better result)](#332-usage-of-lang_map-for-better-result)


# 3. Usage
@@ -65,6 +68,8 @@ pip install split-lang
## 3.2. Basic
### 3.2.1. `split_by_lang`

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DoodleBears/split-lang/blob/main/split-lang-demo.ipynb)

```python
from split_lang import split_by_lang

@@ -92,15 +97,15 @@ for text in texts:

## 3.3. Advanced

### `threshold`
### 3.3.1. `threshold`

The threshold used by `wtpsplit`; defaults to 1e-4. The smaller the threshold, the more substrings the `wtpsplit` stage produces.

> [!NOTE]
> See `tests/split_acc.py` in the GitHub repo to find the best threshold for your use case

### 3.3.1. usage of `lang_map` (for better result)
### 3.3.2. usage of `lang_map` (for better result)

> [!IMPORTANT]
> Add language codes for your use case if other languages are needed
1 change: 1 addition & 0 deletions setup.py
@@ -20,6 +20,7 @@ def read(*relpath):
url="https://github.com/DoodleBears/langsplit",
author="DoodleBear",
author_email="[email protected]",
license="MIT",
packages=find_packages(),
install_requires=[
"langdetect",
3 changes: 2 additions & 1 deletion split-lang-demo.ipynb
@@ -6,7 +6,8 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install split-lang --upgrade"
"%pip install split-lang --upgrade\n",
"%pip install numpy==1.26.0"
]
},
{