WIP: Reimplementing search_dates #945

Open
wants to merge 44 commits into
base: master
Changes from 38 commits
Commits
02220da
Implimenting new search_dates
gavishpoddar Jul 16, 2021
f933d3a
Fixing DATE_ORDER, implimenting deep_search, tests
gavishpoddar Jul 21, 2021
77727b5
Unproving _joint_parse with data_carry accurate_return_text, deep_se…
gavishpoddar Jul 21, 2021
e7f38e8
implementing _final_text_clean()
gavishpoddar Jul 22, 2021
962066c
Simplifying text_clean and modifying tests
gavishpoddar Jul 25, 2021
624ac8e
Implementing relative date
gavishpoddar Jul 28, 2021
42ca6f6
Fixing tests
gavishpoddar Jul 28, 2021
51749a2
secondary_split_implimentation
gavishpoddar Aug 3, 2021
f5e4635
positional args to keyword argument
gavishpoddar Aug 3, 2021
121b15f
Micro fixes
gavishpoddar Aug 3, 2021
2cd93f0
Removing codes now part of #953
gavishpoddar Aug 3, 2021
006d2a5
adding check_settings
gavishpoddar Aug 4, 2021
10404c9
implimenting double_punctuation_split
gavishpoddar Aug 4, 2021
22596e0
Updating docs and removing test (TMP)
gavishpoddar Aug 6, 2021
b799dfb
cleaning code, adding tests, improving coverage
gavishpoddar Aug 6, 2021
42c984a
Merge branch 'scrapinghub:master' into search_dates
gavishpoddar Aug 9, 2021
8fc5e0d
Improving codecov
gavishpoddar Aug 11, 2021
74b6ec4
temporary commit to get diff
gavishpoddar Aug 16, 2021
56e0505
Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa…
gavishpoddar Aug 16, 2021
5a1b1c5
temporary file change for review
gavishpoddar Aug 16, 2021
aa2aa8f
reverting the previous commit
gavishpoddar Aug 16, 2021
41eff6a
improvements
gavishpoddar Aug 16, 2021
f65531b
formatting code
gavishpoddar Aug 17, 2021
982fc08
formatting code
gavishpoddar Aug 17, 2021
3621b2d
improvements in text filter
gavishpoddar Aug 18, 2021
8a9496b
Merge branch 'scrapinghub:master' into search_dates
gavishpoddar Aug 19, 2021
45996b4
removing previous search_dates
gavishpoddar Aug 23, 2021
2ac88c6
Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa…
gavishpoddar Aug 23, 2021
5dabc62
adding test
gavishpoddar Aug 23, 2021
ab1778d
fixing doc string
gavishpoddar Aug 27, 2021
14adf89
fixing doc string
gavishpoddar Aug 27, 2021
d57223a
Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa…
gavishpoddar Aug 27, 2021
88afa30
updating xfail
gavishpoddar Aug 28, 2021
9209f3d
updating tests
gavishpoddar Aug 28, 2021
85254e0
Apply suggestions from code review
gavishpoddar Sep 1, 2021
e4604e6
Merge branch 'master' into search_dates
gavishpoddar Sep 7, 2021
4f119dd
Updates
gavishpoddar Sep 7, 2021
e6da4be
Fixing upstraem merges
gavishpoddar Sep 7, 2021
f6116bf
DateSearch -> DateSearchWithDetection
gavishpoddar Sep 9, 2021
0525cdc
Merge branch 'scrapinghub:master' into search_dates
gavishpoddar Oct 7, 2021
96b91c0
updating test with xfail
gavishpoddar Oct 7, 2021
b9d12f3
Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa…
gavishpoddar Oct 7, 2021
99e66c6
minor fixes
gavishpoddar Oct 7, 2021
2935aae
Merge branch 'master' into search_dates
serhii73 Dec 28, 2022
158 changes: 107 additions & 51 deletions dateparser/search/__init__.py
@@ -1,63 +1,119 @@
from dateparser.search.search import DateSearchWithDetection
from dateparser.search.search import DateSearch
from dateparser.conf import apply_settings


_search_with_detection = DateSearchWithDetection()
_search_dates = DateSearch()
Gallaecio marked this conversation as resolved.


@apply_settings
def search_dates(text, languages=None, settings=None, add_detected_language=False, detect_languages_function=None):
"""Find all substrings of the given string which represent date and/or time and parse them.

:param text:
A string in a natural language which may contain date and/or time expressions.
:type text: str

:param languages:
A list of two-letter language codes, e.g. ['en', 'es']. If languages are given, it will
not attempt to detect the language.
:type languages: list

:param settings:
Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
:type settings: dict

:param add_detected_language:
Indicates if we want the detected language returned in the tuple.
:type add_detected_language: bool

:param detect_languages_function:
A function for language detection that takes as input a `text` and a `confidence_threshold`,
and returns a list of detected language codes.
Note: detect_languages_function is only used if `languages` are not provided.
:type detect_languages_function: function

:return: Returns list of tuples containing:
substrings representing date and/or time, corresponding :mod:`datetime.datetime`
object and detected language if *add_detected_language* is True.
Returns None if no dates that can be parsed are found.
:rtype: list
:raises: ValueError - Unknown Language

>>> from dateparser.search import search_dates
>>> search_dates('The first artificial Earth satellite was launched on 4 October 1957.')
[('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0))]

>>> search_dates('The first artificial Earth satellite was launched on 4 October 1957.',
... add_detected_language=True)
[('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0), 'en')]

>>> search_dates("The client arrived to the office for the first time in March 3rd, 2004 "
... "and got serviced, after a couple of months, on May 6th 2004, the customer "
... "returned indicating a defect on the part")
[('in March 3rd, 2004 and', datetime.datetime(2004, 3, 3, 0, 0)),
('on May 6th 2004', datetime.datetime(2004, 5, 6, 0, 0))]

"""
result = _search_with_detection.search_dates(
:param text:
A string in a natural language which may contain the date and/or time expressions.
:type text: str

:param languages:
A list of two-letter language codes, e.g. ['en', 'es']. If languages are given, it will
not attempt to detect the language.
:type languages: list

:param settings:
Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
:type settings: dict

:param add_detected_language:
Indicates if we want the detected language returned in the tuple.
:type add_detected_language: bool

:return: Returns list of tuples containing:
substrings representing date and/or time, corresponding :mod:`datetime.datetime`
object and detected language if *add_detected_language* is True.
Returns None if no dates that can be parsed are found.
:rtype: list
:raises: ValueError - Unknown Language

>>> from dateparser.search import search_dates
>>> search_dates('The first artificial Earth satellite was launched on 4 October 1957.')
[('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0))]

>>> search_dates('The first artificial Earth satellite was launched on 4 October 1957.',
... add_detected_language=True)
[('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0), 'en')]

>>> search_dates("The client arrived to the office for the first time in March 3rd, 2004 "
... "and got serviced, after a couple of months, on May 6th 2004, the customer "
... "returned indicating a defect on the part")
[('in March 3rd, 2004 and', datetime.datetime(2004, 3, 3, 0, 0)),
('on May 6th 2004', datetime.datetime(2004, 5, 6, 0, 0))]

"""

result = _search_dates.search_dates(
text=text, languages=languages, settings=settings, detect_languages_function=detect_languages_function
)
dates = result.get('Dates')

dates = result.get("Dates")
if dates:
if add_detected_language:
language = result.get('Language')
dates = [date + (language, ) for date in dates]
language = result.get("Language")
dates = [date + (language,) for date in dates]
return dates


@apply_settings
def search_first_date(text, languages=None, settings=None, add_detected_language=False, detect_languages_function=None):
    """Find the first substring of the given string which represents date and/or time and parse it.

    :param text:
        A string in a natural language which may contain the date and/or time expression.
    :type text: str

    :param languages:
        A list of two-letter language codes, e.g. ['en', 'es']. If languages are given, it will
        not attempt to detect the language.
    :type languages: list

    :param settings:
        Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
    :type settings: dict

    :param add_detected_language:
        Indicates if we want the detected language returned in the tuple.
    :type add_detected_language: bool

    :return: Returns a tuple containing:
        substring representing date and/or time, corresponding :mod:`datetime.datetime`
        object and detected language if *add_detected_language* is True.
        Returns None if no dates that can be parsed are found.
    :rtype: tuple
    :raises: ValueError - Unknown Language

    >>> from dateparser.search import search_first_date
    >>> search_first_date('The first artificial Earth satellite was launched on 4 October 1957.')
    ('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0))

    >>> search_first_date('Caesar Augustus, also known as Octavian') is None
    True

    >>> search_first_date('The first artificial Earth satellite was launched on 4 October 1957.',
    ...                   add_detected_language=True)
    ('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0), 'en')

    >>> search_first_date("The client arrived to the office for the first time in March 3rd, 2004 "
    ...                   "and got serviced, after a couple of months, on May 6th 2004, the customer "
    ...                   "returned indicating a defect on the part")
    ('in March 3rd, 2004 and', datetime.datetime(2004, 3, 3, 0, 0))

    """

    result = _search_dates.search_dates(
        text=text, languages=languages, limit_date_search_results=1, settings=settings, detect_languages_function=detect_languages_function
    )
    dates = result.get("Dates")
    if dates:
        if add_detected_language:
            language = result.get("Language")
            dates = [date + (language,) for date in dates]
        return dates[0]
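
Note on the shared tail of `search_dates` and `search_first_date`: both read the `Dates` and `Language` keys from the result dict, optionally append the language to each tuple, and fall through to an implicit `None` when nothing was found. The sketch below isolates that step so it can be checked without dateparser installed. `attach_language` and the `sample` dict are hypothetical and not part of this PR; the dict shape is assumed from the `result.get(...)` calls in the diff.

```python
import datetime


def attach_language(result, add_detected_language=False, first_only=False):
    """Hypothetical helper mirroring the post-processing in the diff."""
    dates = result.get("Dates")
    if not dates:
        # Matches the diff: no dates means the function returns None.
        return None
    if add_detected_language:
        language = result.get("Language")
        # Each entry is a (substring, datetime) tuple; append the language code.
        dates = [date + (language,) for date in dates]
    # search_first_date returns only the first tuple; search_dates returns the list.
    return dates[0] if first_only else dates


# Assumed result shape, built by hand for illustration.
sample = {
    "Language": "en",
    "Dates": [
        ("in March 3rd, 2004 and", datetime.datetime(2004, 3, 3, 0, 0)),
        ("on May 6th 2004", datetime.datetime(2004, 5, 6, 0, 0)),
    ],
}

all_dates = attach_language(sample)
first_with_language = attach_language(sample, add_detected_language=True, first_only=True)
nothing = attach_language({"Language": "en", "Dates": []})
```

If the two public functions keep sharing this logic, a helper along these lines could remove the duplication, though that is a suggestion rather than something the PR does.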
48 changes: 48 additions & 0 deletions dateparser/search/languages.py
@@ -0,0 +1,48 @@
from collections.abc import Set

from dateparser.search.text_detection import FullTextLanguageDetector
from dateparser.languages.loader import LocaleDataLoader
from dateparser.custom_language_detection.language_mapping import map_languages


class SearchLanguages:
    def __init__(self):
        self.loader = LocaleDataLoader()
        self.available_language_map = self.loader.get_locale_map()
        self.language = None

    def get_current_language(self, language_shortname):
        if self.language is None or self.language.shortname != language_shortname:
            self.language = self.loader.get_locale(language_shortname)

    def translate_objects(self, language_shortname, text, settings):
        self.get_current_language(language_shortname)
        result = self.language.translate_search(text, settings=settings)
        return result

    def detect_language(self, text, languages, settings=None, detect_languages_function=None):
        if detect_languages_function and not languages:
            detected_languages = detect_languages_function(
                text, confidence_threshold=settings.LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD
            )
            detected_languages = map_languages(detected_languages) or settings.DEFAULT_LANGUAGES
            return detected_languages[0] if detected_languages else None

        if isinstance(languages, (list, tuple, Set)):
            if all([language in self.available_language_map for language in languages]):
                languages = [self.available_language_map[language] for language in languages]
            else:
                unsupported_languages = set(languages) - set(self.available_language_map.keys())
                raise ValueError("Unknown language(s): %s" % ', '.join(map(repr, unsupported_languages)))
        elif languages is not None:
            raise TypeError("languages argument must be a list (%r given)" % type(languages))

        if languages:
            self.language_detector = FullTextLanguageDetector(languages=languages)
        else:
            self.language_detector = FullTextLanguageDetector(list(self.available_language_map.values()))

        detected_language = self.language_detector._best_language(text) or (
            settings.DEFAULT_LANGUAGES[0] if settings.DEFAULT_LANGUAGES else None
        )
        return detected_language
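
Note on the validation branch of `detect_language`: `languages` must be a list, tuple or set of known codes. Unknown codes raise `ValueError`, any other non-None type raises `TypeError`, and `None` falls through to detection over all available languages. A minimal stand-alone sketch of that branch follows, assuming a plain dict in place of `LocaleDataLoader().get_locale_map()`. `validate_languages` and `available` are hypothetical names, and the unknown codes are sorted here only to make the message deterministic, which the PR code does not do.

```python
from collections.abc import Set


def validate_languages(languages, available_language_map):
    """Hypothetical stand-alone version of the validation branch."""
    if languages is None:
        # Caller did not restrict languages; detection would use all of them.
        return None
    if not isinstance(languages, (list, tuple, Set)):
        raise TypeError("languages argument must be a list (%r given)" % type(languages))
    unsupported = set(languages) - set(available_language_map)
    if unsupported:
        raise ValueError(
            "Unknown language(s): %s" % ", ".join(map(repr, sorted(unsupported)))
        )
    # Map each code to its locale object, as the diff does.
    return [available_language_map[language] for language in languages]


# Stand-in for the real locale map; the real values are locale objects.
available = {"en": "<en locale>", "es": "<es locale>"}

resolved = validate_languages(["en", "es"], available)

try:
    validate_languages(["en", "xx"], available)
    unknown_error = None
except ValueError as exc:
    unknown_error = str(exc)

try:
    validate_languages("en", available)
    type_error_raised = False
except TypeError:
    type_error_raised = True
```

A bare string such as `"en"` is rejected with `TypeError` instead of being iterated character by character, which is the reason for the explicit `isinstance` check.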