Commit bd3d6d3

Merge branch 'dev'
2 parents 91bba2f + 3cce6fb

22 files changed (+750 −339 lines)

docs/contributing.rst

Lines changed: 6 additions & 1 deletion
@@ -36,4 +36,9 @@ the ``tests`` directory. We use ``pytest`` to test code, and also use
 ``hypothesis`` when applicable. If you open a patch, make sure that
 all tests are passing. In particular, do not rely on the CI, as it
 does not run time costly tests! Check for yourself locally, using
-``RENARD_TEST_ALL=1 python -m pytest tests``
+``RENARD_TEST_ALL=1 python -m pytest tests``. Note that there are
+specific tests and environment variables for optional dependencies such
+as *stanza* (``RENARD_TEST_STANZA_OPTDEP``). These must be explicitly
+set to ``1`` if you want to test optional dependencies, as
+``RENARD_TEST_ALL=1`` does not enable tests for these optional
+dependencies.
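The note above distinguishes ``RENARD_TEST_ALL`` from per-dependency flags like ``RENARD_TEST_STANZA_OPTDEP``. A minimal sketch of how such a gate typically behaves (the helper below is hypothetical, not Renard's actual test code):

```python
import os


def optdep_tests_enabled(var: str, environ=None) -> bool:
    """Return True only when `var` is explicitly set to "1".

    Hypothetical helper: setting RENARD_TEST_ALL=1 alone does not
    imply the optional-dependency flag.
    """
    environ = os.environ if environ is None else environ
    return environ.get(var) == "1"


# RENARD_TEST_ALL is set, but the stanza flag is not:
enabled = optdep_tests_enabled(
    "RENARD_TEST_STANZA_OPTDEP", environ={"RENARD_TEST_ALL": "1"}
)
# → False
```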

docs/extending.rst

Lines changed: 4 additions & 2 deletions
@@ -8,8 +8,10 @@ Creating new steps

 Usually, steps must implement at least four functions:

-- :meth:`.PipelineStep.__init__`: is used to pass options at step init time
-- :meth:`.PipelineStep.__call__`: is called at pipeline run time
+- :meth:`.PipelineStep.__init__`: is used to pass options at step init
+  time. Options passed at step init time should be valid for a
+  collection of texts, and not be text specific.
+- :meth:`.PipelineStep.__call__`: is called at pipeline run time.
 - :meth:`.PipelineStep.needs`: declares the set of information needed
   from the pipeline state by this step. Each returned string should be
   an attribute of :class:`.PipelineState`.
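The step contract described above can be sketched as a plain class. This is a simplified stand-in: a real step would subclass :class:`.PipelineStep`, and the ``production`` method shown here is assumed by symmetry with ``needs``, since the diff only names two of the four functions:

```python
from typing import Any, Dict, Set


class SentenceCountStep:
    """Schematic custom step (hypothetical, for illustration only)."""

    def __init__(self, delimiter: str = "."):
        # an option valid for a whole collection of texts, not text-specific
        self.delimiter = delimiter

    def needs(self) -> Set[str]:
        # pipeline-state attributes this step reads
        return {"text"}

    def production(self) -> Set[str]:
        # pipeline-state attributes this step writes
        return {"sentences_count"}

    def __call__(self, text: str, **kwargs) -> Dict[str, Any]:
        count = len([s for s in text.split(self.delimiter) if s.strip()])
        return {"sentences_count": count}


step = SentenceCountStep()
out = step("Alice met Bob. Bob left.")
# → {'sentences_count': 2}
```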

docs/pipeline.rst

Lines changed: 46 additions & 1 deletion
@@ -68,7 +68,7 @@ In that case, the ``tokens`` requirement is fulfilled at run time. If
 you don't pass the parameter, Renard will throw the following
 exception:

->>> ValueError: ["step 1 (NLTKNamedEntityRecognizer) has unsatisfied needs (needs : {'tokens'}, available : {'text'})"]
+>>> ValueError: ["step 1 (NLTKNamedEntityRecognizer) has unsatisfied needs. needs: {'tokens'}. available: {'text'}. missing: {'tokens'}."]


 For simplicity, one can use one of the preconfigured pipelines:

@@ -252,6 +252,51 @@ graph to a directory. Meanwhile,
 dynamic graph to the Gephi format.


+Custom Segmentation
+-------------------
+
+The ``dynamic_window`` parameter of
+:class:`.CoOccurrencesGraphExtractor` determines the segmentation of
+the dynamic networks, in number of interactions. In the example above,
+a new graph is created for every 20 interactions.
+
+While one can rely on the arguments of the pipeline's graph extractor
+to determine the dynamic window, Renard also allows specifying a
+custom segmentation of the text with the ``dynamic_blocks``
+argument. When running a pipeline, you can cut your text however you
+want and pass this argument in addition to the usual text:
+
+.. code-block:: python
+
+   from renard.pipeline import Pipeline
+   from renard.pipeline.tokenization import NLTKTokenizer
+   from renard.pipeline.ner import NLTKNamedEntityRecognizer
+   from renard.pipeline.character_unification import GraphRulesCharacterUnifier
+   from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor
+   from renard.utils import block_bounds
+
+   with open("./my_doc.txt") as f:
+       text = f.read()
+
+   # let's suppose the 'cut_into_chapters' function cuts the text into chapters
+   chapters = cut_into_chapters(text)
+
+   pipeline = Pipeline(
+       [
+           NLTKTokenizer(),
+           NLTKNamedEntityRecognizer(),
+           GraphRulesCharacterUnifier(),
+           CoOccurrencesGraphExtractor(co_occurrences_dist=25, dynamic=True),
+       ]
+   )
+
+   # the 'block_bounds' function automatically extracts the boundaries
+   # of your blocks of text
+   out = pipeline(text, dynamic_blocks=block_bounds(chapters))
+
+
 Multilingual Support
 ====================
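The documentation example above assumes a ``cut_into_chapters`` helper. A naive sketch of one is below; it is hypothetical and not part of Renard, and it assumes chapter headings are lines starting with ``CHAPTER``:

```python
from typing import List


def cut_into_chapters(text: str) -> List[str]:
    """Split a text on lines starting with 'CHAPTER'.

    Hypothetical helper: real texts need a more robust chapter detector.
    """
    chapters: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        # a new heading closes the previous chapter (if any)
        if line.startswith("CHAPTER") and current:
            chapters.append(current)
            current = []
        current.append(line)
    if current:
        chapters.append(current)
    return ["\n".join(block) for block in chapters]


text = "CHAPTER 1\nAlice met Bob.\nCHAPTER 2\nBob left."
chapters = cut_into_chapters(text)
# → ['CHAPTER 1\nAlice met Bob.', 'CHAPTER 2\nBob left.']
```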

poetry.lock

Lines changed: 5 additions & 5 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ matplotlib = "^3.5.3"
 seqeval = "1.2.2"
 pandas = "^2.0.0"
 pytest = "^7.2.1"
-tibert = "^0.3.0"
+tibert = "^0.4.0"
 grimbert = "^0.1.0"
 datasets = "^2.16.1"

renard/graph_utils.py

Lines changed: 11 additions & 4 deletions
@@ -70,10 +70,17 @@ def graph_with_names(
     else:
         name_style_fn = name_style

-    return nx.relabel_nodes(
-        G,
-        {character: name_style_fn(character) for character in G.nodes()},  # type: ignore
-    )
+    mapping = {}
+    for character in G.nodes():
+        # NOTE: it is *possible* to have a graph where nodes are not
+        # characters (for example, simple strings). Therefore, we are
+        # lenient here
+        try:
+            mapping[character] = name_style_fn(character)
+        except AttributeError:
+            mapping[character] = character
+
+    return nx.relabel_nodes(G, mapping)


 def layout_with_names(
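The lenient mapping introduced in ``graph_with_names`` can be illustrated without networkx. The ``Character`` class and its ``longest_name`` method below are simplified stand-ins for Renard's actual types:

```python
class Character:
    """Simplified stand-in for Renard's Character type."""

    def __init__(self, names):
        self.names = names

    def longest_name(self) -> str:
        return max(self.names, key=len)


def lenient_mapping(nodes, name_style_fn):
    # mirror the leniency added in graph_with_names: nodes that are
    # not characters (e.g. plain strings) are kept as-is
    mapping = {}
    for node in nodes:
        try:
            mapping[node] = name_style_fn(node)
        except AttributeError:
            mapping[node] = node
    return mapping


liz = Character(["Liz", "Elizabeth Bennet"])
mapping = lenient_mapping([liz, "narrator"], lambda c: c.longest_name())
# → {liz: 'Elizabeth Bennet', 'narrator': 'narrator'}
```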

renard/ner_utils.py

Lines changed: 4 additions & 0 deletions
@@ -110,6 +110,10 @@ def __getitem__(self, index: Union[int, List[int]]) -> BatchEncoding:
         elt_context_mask = self._context_mask[index]
         for i in range(len(element)):
             w2t = batch.word_to_tokens(0, i)
+            # w2t can be None in case of truncation, which can happen
+            # if `element` is too long
+            if w2t is None:
+                continue
             mask_value = elt_context_mask[i]
             tokens_mask = [mask_value] * (w2t.end - w2t.start)
             batch["context_mask"][w2t.start : w2t.end] = tokens_mask
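The guard added above can be reproduced without ``transformers`` by faking the word-to-token lookup; in the real code, ``BatchEncoding.word_to_tokens`` returns ``None`` for words dropped by truncation:

```python
from typing import Optional, Tuple


def word_to_tokens(word_index: int, max_tokens: int = 4) -> Optional[Tuple[int, int]]:
    # fake lookup: each word maps to exactly one token, and everything
    # past max_tokens was dropped by truncation (None, mimicking
    # BatchEncoding.word_to_tokens)
    if word_index >= max_tokens:
        return None
    return (word_index, word_index + 1)


context_mask = [1, 1, 0, 0, 1, 1]  # one mask value per word
tokens_mask = [0] * 4

for i, mask_value in enumerate(context_mask):
    span = word_to_tokens(i)
    if span is None:  # word truncated away: skip it, as in the diff
        continue
    start, end = span
    tokens_mask[start:end] = [mask_value] * (end - start)
# → tokens_mask == [1, 1, 0, 0]
```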

renard/pipeline/character_unification.py

Lines changed: 14 additions & 4 deletions
@@ -61,6 +61,8 @@ def _assign_coreference_mentions(
     # we assign each chain to the character with the highest name
     # occurrence in it
     for chain in corefs:
+        if len(char_mentions) == 0:
+            break
         # determine the characters with the highest number of
         # occurrences
         occ_counter = {}

@@ -98,8 +100,13 @@ def __init__(self, min_appearances: int = 0) -> None:
            character for it to be valid
         """
         self.min_appearances = min_appearances
+        # a default value, will be set by _pipeline_init_
+        self.character_ner_tag = "PER"
         super().__init__()

+    def _pipeline_init_(self, lang: str, character_ner_tag: str, **kwargs):
+        self.character_ner_tag = character_ner_tag
+
     def __call__(
         self,
         text: str,

@@ -112,7 +119,7 @@ def __call__(
         :param tokens:
         :param entities:
         """
-        persons = [e for e in entities if e.tag == "PER"]
+        persons = [e for e in entities if e.tag == self.character_ner_tag]

         characters = defaultdict(list)
         for entity in persons:

@@ -182,16 +189,19 @@ def __init__(
         self.additional_hypocorisms = additional_hypocorisms
         self.link_corefs_mentions = link_corefs_mentions
         self.ignore_lone_titles = ignore_lone_titles or set()
+        self.character_ner_tag = "PER"  # a default value, will be set by _pipeline_init_

         super().__init__()

-    def _pipeline_init_(self, lang: str, progress_reporter: ProgressReporter):
+    def _pipeline_init_(self, lang: str, character_ner_tag: str, **kwargs):
         self.hypocorism_gazetteer = HypocorismGazetteer(lang=lang)
         if not self.additional_hypocorisms is None:
             for name, nicknames in self.additional_hypocorisms:
                 self.hypocorism_gazetteer._add_hypocorism_(name, nicknames)

-        return super()._pipeline_init_(lang, progress_reporter)
+        self.character_ner_tag = character_ner_tag
+
+        return super()._pipeline_init_(lang, **kwargs)

@@ -201,7 +211,7 @@ def __call__(
     ) -> Dict[str, Any]:
         import networkx as nx

-        mentions = [m for m in entities if m.tag == "PER"]
+        mentions = [m for m in entities if m.tag == self.character_ner_tag]
         mentions_str = set(
             filter(
                 lambda m: not m in self.ignore_lone_titles,
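The recurring change in this file, replacing the hardcoded ``"PER"`` tag with a pipeline-provided ``character_ner_tag``, follows this pattern. This is a minimal sketch; the ``Entity`` type and the step body are simplified stand-ins for Renard's actual classes:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    """Simplified stand-in for Renard's NER entity type."""

    text: str
    tag: str


class NaiveCharacterUnifier:
    def __init__(self, min_appearances: int = 0):
        self.min_appearances = min_appearances
        # a default value, overridden by _pipeline_init_
        self.character_ner_tag = "PER"

    def _pipeline_init_(self, lang: str, character_ner_tag: str, **kwargs):
        # the pipeline, not the step, decides which NER tag marks characters
        self.character_ner_tag = character_ner_tag

    def __call__(self, entities: List[Entity]) -> List[Entity]:
        return [e for e in entities if e.tag == self.character_ner_tag]


step = NaiveCharacterUnifier()
# e.g. a NER model that tags persons as "PERS" instead of "PER"
step._pipeline_init_("eng", character_ner_tag="PERS")
persons = step([Entity("Alice", "PERS"), Entity("Paris", "LOC")])
# → [Entity(text='Alice', tag='PERS')]
```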

renard/pipeline/characters_extraction.py

Lines changed: 3 additions & 1 deletion
@@ -1,7 +1,9 @@
+import sys
 import renard.pipeline.character_unification as cu

 print(
-    "[warning] the characters_extraction module is deprecated. Use character_unification instead."
+    "[warning] the characters_extraction module is deprecated. Use character_unification instead.",
+    file=sys.stderr,
 )

 Character = cu.Character
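The point of routing the deprecation notice to ``sys.stderr`` is that programs capturing stdout stay clean. A quick self-contained check of that behavior:

```python
import contextlib
import io
import sys


def deprecated_import_notice():
    # mirrors the shim above: the warning goes to stderr, not stdout
    print(
        "[warning] the characters_extraction module is deprecated. "
        "Use character_unification instead.",
        file=sys.stderr,
    )


buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    deprecated_import_notice()
# stdout captured nothing; the notice went to stderr
```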

0 commit comments
