Hello, I've implemented GLiNER as part of a preprocessing pipeline (first tests show very good results for my use case), but in order to run the pipeline on a large dataset I would like to use multiprocessing. Unfortunately, it seems that the model cannot be loaded simultaneously in multiple processes (the process hangs indefinitely after a few seconds). Here's some reproducible code:
```python
from multiprocessing import Pool

from gliner import GLiNER


class NER:
    def __init__(self, model_name="urchade/gliner_multi-v2.1"):
        print("Loading NER model...")
        self.model = GLiNER.from_pretrained(model_name)
        self.labels = ["name"]
        print("NER model loaded!")

    def get_entities(self, text):
        entities = [ent["text"] for ent in self.model.predict_entities(text, self.labels)]
        return entities


class MyPipeline:
    def __init__(self):
        self.ner = NER()

    def __getstate__(self):
        # Drop the model before pickling; it is re-created in __setstate__
        self.ner = None
        return self.__dict__

    def __setstate__(self, state):
        self.__dict__ = state
        self.ner = NER()

    def process_text(self, text):
        # Some dummy preprocessing
        entities = self.ner.get_entities(text)
        out = text.lower() + " ; " + ",".join(entities)
        return out

    def preprocess_lines(self, lines):
        with Pool(processes=2) as pool:
            for text_out in pool.imap(self.process_text, lines):
                print(text_out)


if __name__ == "__main__":
    corpus = ["My name is Franz Schubert.", "Ella Fitzgerald is my favorite singer."]
    pip = MyPipeline()
    pip.preprocess_lines(corpus)
```

The terminal shows:

Then it hangs indefinitely. Do you have any idea what's preventing the model from being loaded multiple times? And any idea how to enable multiprocessing in such a case?
Replies: 1 comment
I've found out that since PyTorch uses multithreading, forking processes is not possible; however, the "spawn" start method works:

```python
from multiprocessing import set_start_method

if __name__ == "__main__":
    set_start_method("spawn")
    corpus = ["My name is Franz Schubert.", "Ella Fitzgerald is my favorite singer."]
    pip = MyPipeline()
    pip.preprocess_lines(corpus)
```

In my case, I also changed my code to avoid unpickling the MyPipeline instance for each sample, but at least it is possible to do multiprocessing with GLiNER.
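That last point (loading the model once per worker instead of once per sample) can be sketched with `multiprocessing.Pool`'s `initializer` argument. This is a minimal, hypothetical sketch: `DummyNER`, `init_worker`, `_worker_ner`, and the entity heuristic are illustrative stand-ins, not GLiNER APIs; in a real pipeline `DummyNER` would be replaced by the `NER` class above.

```python
from multiprocessing import Pool, set_start_method


# Stand-in for the GLiNER-backed NER class above; a real worker would call
# GLiNER.from_pretrained(...) in its constructor instead.
class DummyNER:
    def get_entities(self, text):
        # Hypothetical extraction: capitalized words stand in for real entities.
        return [w.strip(".") for w in text.split() if w[0].isupper()]


_worker_ner = None  # one model instance per worker process


def init_worker():
    # Runs once in each spawned worker, so the model is loaded once per
    # process rather than re-created for every sample via __setstate__.
    global _worker_ner
    _worker_ner = DummyNER()


def process_text(text):
    entities = _worker_ner.get_entities(text)
    return text.lower() + " ; " + ",".join(entities)


def preprocess_lines(lines):
    with Pool(processes=2, initializer=init_worker) as pool:
        return list(pool.imap(process_text, lines))


if __name__ == "__main__":
    set_start_method("spawn")
    corpus = ["My name is Franz Schubert.", "Ella Fitzgerald is my favorite singer."]
    for out in preprocess_lines(corpus):
        print(out)
```

Because `process_text` is a module-level function that only touches the per-worker global, nothing heavy has to be pickled for each task.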