WIP: Reimplementing search_dates #945

Open
wants to merge 44 commits into
base: master
Changes from 11 commits
Commits
02220da
Implimenting new search_dates
gavishpoddar Jul 16, 2021
f933d3a
Fixing DATE_ORDER, implimenting deep_search, tests
gavishpoddar Jul 21, 2021
77727b5
Unproving _joint_parse with data_carry accurate_return_text, deep_se…
gavishpoddar Jul 21, 2021
e7f38e8
implementing _final_text_clean()
gavishpoddar Jul 22, 2021
962066c
Simplifying text_clean and modifying tests
gavishpoddar Jul 25, 2021
624ac8e
Implementing relative date
gavishpoddar Jul 28, 2021
42ca6f6
Fixing tests
gavishpoddar Jul 28, 2021
51749a2
secondary_split_implimentation
gavishpoddar Aug 3, 2021
f5e4635
positional args to keyword argument
gavishpoddar Aug 3, 2021
121b15f
Micro fixes
gavishpoddar Aug 3, 2021
2cd93f0
Removing codes now part of #953
gavishpoddar Aug 3, 2021
006d2a5
adding check_settings
gavishpoddar Aug 4, 2021
10404c9
implimenting double_punctuation_split
gavishpoddar Aug 4, 2021
22596e0
Updating docs and removing test (TMP)
gavishpoddar Aug 6, 2021
b799dfb
cleaning code, adding tests, improving coverage
gavishpoddar Aug 6, 2021
42c984a
Merge branch 'scrapinghub:master' into search_dates
gavishpoddar Aug 9, 2021
8fc5e0d
Improving codecov
gavishpoddar Aug 11, 2021
74b6ec4
temporary commit to get diff
gavishpoddar Aug 16, 2021
56e0505
Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa…
gavishpoddar Aug 16, 2021
5a1b1c5
temporary file change for review
gavishpoddar Aug 16, 2021
aa2aa8f
reverting the previous commit
gavishpoddar Aug 16, 2021
41eff6a
improvements
gavishpoddar Aug 16, 2021
f65531b
formatting code
gavishpoddar Aug 17, 2021
982fc08
formatting code
gavishpoddar Aug 17, 2021
3621b2d
improvements in text filter
gavishpoddar Aug 18, 2021
8a9496b
Merge branch 'scrapinghub:master' into search_dates
gavishpoddar Aug 19, 2021
45996b4
removing previous search_dates
gavishpoddar Aug 23, 2021
2ac88c6
Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa…
gavishpoddar Aug 23, 2021
5dabc62
adding test
gavishpoddar Aug 23, 2021
ab1778d
fixing doc string
gavishpoddar Aug 27, 2021
14adf89
fixing doc string
gavishpoddar Aug 27, 2021
d57223a
Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa…
gavishpoddar Aug 27, 2021
88afa30
updating xfail
gavishpoddar Aug 28, 2021
9209f3d
updating tests
gavishpoddar Aug 28, 2021
85254e0
Apply suggestions from code review
gavishpoddar Sep 1, 2021
e4604e6
Merge branch 'master' into search_dates
gavishpoddar Sep 7, 2021
4f119dd
Updates
gavishpoddar Sep 7, 2021
e6da4be
Fixing upstraem merges
gavishpoddar Sep 7, 2021
f6116bf
DateSearch -> DateSearchWithDetection
gavishpoddar Sep 9, 2021
0525cdc
Merge branch 'scrapinghub:master' into search_dates
gavishpoddar Oct 7, 2021
96b91c0
updating test with xfail
gavishpoddar Oct 7, 2021
b9d12f3
Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa…
gavishpoddar Oct 7, 2021
99e66c6
minor fixes
gavishpoddar Oct 7, 2021
2935aae
Merge branch 'master' into search_dates
serhii73 Dec 28, 2022
2 changes: 2 additions & 0 deletions dateparser/languages/locale.py
@@ -214,6 +214,7 @@ def translate_search(self, search_string, settings=None):
            if translated_chunk:
                translated.append(translated_chunk)
                original.append(original_chunk)

        for i in range(len(translated)):
            if "in" in translated[i]:
                translated[i] = self._clear_future_words(translated[i])
@@ -266,6 +267,7 @@ def _simplify_split_align(self, original, settings):
        original_tokens = self._word_split(original, settings=settings)
        simplified_tokens = self._word_split(self._simplify(normalize_unicode(original), settings=settings),
                                             settings=settings)

        if len(original_tokens) == len(simplified_tokens):
            return original_tokens, simplified_tokens
28 changes: 28 additions & 0 deletions dateparser/search_dates/__init__.py
@@ -0,0 +1,28 @@
from dateparser.search_dates.search import DateSearch
from dateparser.conf import apply_settings


_search_dates = DateSearch()


@apply_settings
def search_dates(text, languages=None, settings=None):
    result = _search_dates.search_dates(
        text=text, languages=languages, settings=settings
    )

    dates = result.get('Dates')
    if not dates:
        return None
    return dates


@apply_settings
def search_first_date(text, languages=None, settings=None):
    result = _search_dates.search_dates(
        text=text, languages=languages, limit_date_search_results=1, settings=settings
    )
    dates = result.get('Dates')
    if not dates:
        return None
    return dates
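
Note on the two wrappers above: both reduce the dict returned by `DateSearch.search_dates` to its `'Dates'` entry and turn an empty or missing list into `None`. A minimal sketch of that reduction, using hand-written result dicts in place of a real search (the helper name `unwrap_dates` and the sample dicts are hypothetical and not part of the PR):

```python
import datetime


def unwrap_dates(result):
    """Return the 'Dates' list from a result dict, or None if it is empty or missing."""
    dates = result.get("Dates")
    return dates or None


# Stand-in results shaped like the dict described in the search_dates docstring.
found = {"Language": "en", "Dates": [("4 October 1957", datetime.datetime(1957, 10, 4, 0, 0))]}
empty = {"Language": "en", "Dates": []}
missing = {"Language": None, "Dates": None}
```

With these inputs, `unwrap_dates(found)` gives back the list unchanged and the other two calls give `None`, so callers have to check for `None` rather than for an empty list.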
39 changes: 39 additions & 0 deletions dateparser/search_dates/languages.py
@@ -0,0 +1,39 @@
from collections.abc import Set

from dateparser.search.text_detection import FullTextLanguageDetector
from dateparser.languages.loader import LocaleDataLoader


class SearchLanguages:
    def __init__(self) -> None:
        self.loader = LocaleDataLoader()
        self.available_language_map = self.loader.get_locale_map()
        self.language = None

    def get_current_language(self, language_shortname):
        if self.language is None or self.language.shortname != language_shortname:
            self.language = self.loader.get_locale(language_shortname)

    def translate_objects(self, language_shortname, text, settings):
        self.get_current_language(language_shortname)
        result = self.language.translate_search(text, settings=settings)
        return result

    def detect_language(self, text, languages):
        if isinstance(languages, (list, tuple, Set)):

            if all([language in self.available_language_map for language in languages]):
                languages = [self.available_language_map[language] for language in languages]
            else:
                unsupported_languages = set(languages) - set(self.available_language_map.keys())
                raise ValueError(
                    "Unknown language(s): %s" % ', '.join(map(repr, unsupported_languages)))
        elif languages is not None:
            raise TypeError("languages argument must be a list (%r given)" % type(languages))

        if languages:
            self.language_detector = FullTextLanguageDetector(languages=languages)
        else:
            self.language_detector = FullTextLanguageDetector(list(self.available_language_map.values()))

        return self.language_detector._best_language(text)
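
Note on `detect_language` above: before any detection happens, the `languages` argument is validated against the locale map. The sketch below isolates that validation step so it can be exercised without dateparser installed. The function name `resolve_languages` and the stand-in map are assumptions made for illustration; the real map comes from `LocaleDataLoader().get_locale_map()`.

```python
from collections.abc import Set


def resolve_languages(languages, available_language_map):
    """Map language codes to locale objects, mirroring the checks in detect_language."""
    if languages is None:
        return None
    if not isinstance(languages, (list, tuple, Set)):
        raise TypeError("languages argument must be a list (%r given)" % type(languages))
    unsupported = set(languages) - set(available_language_map)
    if unsupported:
        # Sorted only to make the message deterministic in this sketch.
        raise ValueError(
            "Unknown language(s): %s" % ", ".join(map(repr, sorted(unsupported)))
        )
    return [available_language_map[code] for code in languages]


# Stand-in map with placeholder values instead of real locale objects.
demo_map = {"en": "<en locale>", "it": "<it locale>"}
```

Note that a bare string such as `"en"` is rejected with `TypeError`, because `str` is not a list, tuple, or set; callers have to pass `["en"]`.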
211 changes: 211 additions & 0 deletions dateparser/search_dates/search.py
@@ -0,0 +1,211 @@
import re
from typing import List, Dict

from dateparser.conf import apply_settings, Settings
from dateparser.date import DateDataParser
from dateparser.search_dates.languages import SearchLanguages

_drop_words = {'on', 'of'} # cause annoying false positives
_bad_date_re = re.compile(
    # whole dates we black-list (can still be parts of valid dates)
    "^("
    + "|".join(
        [
            r"\d{1,3}",  # less than 4 digits
            r"#\d+",  # this is a sequence number
            # some common false positives below
            r"[-/.]+",  # bare separators parsed as current date
            r"\w\.?",  # one letter (with optional dot)
            "an",
        ]
    )
    + ")$"
)

_secondary_splitters = [',', '،', '——', '—', '–', '.'] # are used if no date object is found


def _get_relative_base(already_parsed):
    if already_parsed:
        return already_parsed[-1][1]
    return None


def _create_splits(text):
    splited_objects = text.split()
    splited_objects = [p for p in splited_objects if p and p not in _drop_words]
    return splited_objects


def _create_joined_parse(text, max_join=7, sort_ascending=False):
    split_objects = _create_splits(text=text)
    joint_objects = []
    for i in range(len(split_objects)):
        for j in reversed(range(min(max_join, len(split_objects) - i))):
            x = " ".join(split_objects[i:i + j + 1])
            if _bad_date_re.match(x):
                continue
            if not len(x) > 2:
                continue
            joint_objects.append(x)

    if sort_ascending:
        joint_objects = sorted(joint_objects, key=len)

    return joint_objects


def _get_accurate_return_text(text, parser, datetime_object):
    text_candidates = _create_joined_parse(text=text, sort_ascending=True)
    for text_candidate in text_candidates:
        if parser.get_date_data(text_candidate).date_obj == datetime_object:
            return text_candidate


def _joint_parse(text, parser, translated=None, deep_search=True, accurate_return_text=False, data_carry=None):
    if not text:
        return data_carry

    elif not len(text) > 2:
        return data_carry

    elif translated and len(translated) <= 2:
        return data_carry

    reduced_text_candidate = None
    secondary_split_made = False
    returnable_objects = data_carry or []
    joint_based_search_dates = _create_joined_parse(text=text)
    for date_object_candidate in joint_based_search_dates:
        parsed_date_object = parser.get_date_data(date_object_candidate)
        if parsed_date_object.date_obj:
            if accurate_return_text:
                date_object_candidate = _get_accurate_return_text(
                    text=date_object_candidate, parser=parser, datetime_object=parsed_date_object.date_obj
                )

            returnable_objects.append(
                (date_object_candidate.strip(" .,:()[]-'"), parsed_date_object.date_obj)
            )

            if deep_search:
                start_index = text.find(date_object_candidate)
                end_index = start_index + len(date_object_candidate)
                if start_index < 0:
                    break
                reduced_text_candidate = text[:start_index] + text[end_index:]
            break
        else:
            for splitter in _secondary_splitters:
                secondary_split = re.split('(?<! )[' + splitter + ']+(?! )', date_object_candidate)
                if secondary_split and len(secondary_split) > 1:
                    reduced_text_candidate = " ".join(secondary_split)
                    secondary_split_made = True

    if (deep_search or secondary_split_made) and not text == reduced_text_candidate:
        if reduced_text_candidate and len(reduced_text_candidate) > 2:
            returnable_objects = _joint_parse(
                text=reduced_text_candidate,
                parser=parser,
                data_carry=returnable_objects
            )

    return returnable_objects


class DateSearch:
    def __init__(self, make_joints_parse=True, default_language="en"):
        self.make_joints_parse = make_joints_parse
        self.default_language = default_language

        self.search_languages = SearchLanguages()

    @apply_settings
    def search_parse(
        self, text, language_shortname, settings, limit_date_search_results=None
    ) -> List[tuple]:

        returnable_objects = []
        parser = DateDataParser(languages=[language_shortname], settings=settings)
        translated, original = self.search_languages.translate_objects(
            language_shortname, text, settings
        )

        for index, original_object in enumerate(original):
            if limit_date_search_results and returnable_objects:
                if len(returnable_objects) == limit_date_search_results:
                    break

            if not len(original_object) > 2:
                continue

            if not settings.RELATIVE_BASE:
                relative_base = _get_relative_base(already_parsed=returnable_objects)
                if relative_base:
                    parser._settings.RELATIVE_BASE = relative_base

            if self.make_joints_parse:
                joint_based_search_dates = _joint_parse(
                    text=original_object, parser=parser, translated=translated[index]
                )
                if joint_based_search_dates:
                    returnable_objects.extend(joint_based_search_dates)
            else:
                parsed_date_object = parser.get_date_data(original_object)
                if parsed_date_object.date_obj:
                    returnable_objects.append(
                        (original_object.strip(" .,:()[]-'"), parsed_date_object.date_obj)
                    )

        parser._settings = Settings()
        return returnable_objects

    @apply_settings
    def search_dates(
        self, text, languages=None, limit_date_search_results=None, settings=None
    ) -> Dict:
        """
        Find all substrings of the given string which represent date and/or time and parse them.

        :param text:
            A string in a natural language which may contain date and/or time expressions.
        :type text: str

        :param languages:
            A list of two-letter language codes, e.g. ['en', 'es']. If languages are given, it will not attempt
            to detect the language.
        :type languages: list

        :param limit_date_search_results:
            An int which sets the maximum number of results to be returned.
        :type limit_date_search_results: int

        :param settings:
            Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
        :type settings: dict

        :return: a dict with the two-letter language code under 'Language' and, under 'Dates', a list of
            tuples, each pairing a substring representing a date expression with the corresponding
            :mod:`datetime.datetime` object.
            For example:
            {'Language': 'en', 'Dates': [('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0))]}
            If the language of the string isn't recognised, returns:
            {'Language': None, 'Dates': None}
        :raises: ValueError - Unknown Language
        """

        language_shortname = (
            self.search_languages.detect_language(text=text, languages=languages)
            or self.default_language
        )

        if not language_shortname:
            return {"Language": None, "Dates": None}
        return {
            "Language": language_shortname,
            "Dates": self.search_parse(
                text=text,
                language_shortname=language_shortname,
                limit_date_search_results=limit_date_search_results,
                settings=settings,
            ),
        }
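
Note on the candidate generation in `search.py` above: `_create_joined_parse` builds every run of up to `max_join` consecutive tokens, longest first for each start position, and drops runs that match the blacklist regex or are two characters or shorter. The sketch below reproduces that step on its own so the ordering can be inspected without a parser. The names `joined_candidates`, `DROP_WORDS` and `BAD_DATE_RE` are renamed copies made for illustration, not identifiers from the PR.

```python
import re

# Renamed copies of _drop_words and _bad_date_re from the diff above.
DROP_WORDS = {"on", "of"}
BAD_DATE_RE = re.compile(
    "^(" + "|".join([r"\d{1,3}", r"#\d+", r"[-/.]+", r"\w\.?", "an"]) + ")$"
)


def joined_candidates(text, max_join=7, sort_ascending=False):
    """Return runs of up to max_join consecutive tokens that pass the filters."""
    tokens = [t for t in text.split() if t not in DROP_WORDS]
    candidates = []
    for i in range(len(tokens)):
        # Widest window first, so a caller that stops at the first hit
        # tries the longest span starting at token i before shorter ones.
        for j in reversed(range(min(max_join, len(tokens) - i))):
            candidate = " ".join(tokens[i:i + j + 1])
            if BAD_DATE_RE.match(candidate) or len(candidate) <= 2:
                continue
            candidates.append(candidate)
    if sort_ascending:
        # Shortest first, as used when looking for the tightest matching text.
        candidates = sorted(candidates, key=len)
    return candidates


print(joined_candidates("on 4 October 1957"))
# ['4 October 1957', '4 October', 'October 1957', 'October', '1957']
```

The bare token `4` never appears as a candidate because it matches the "less than 4 digits" pattern, while `1957` survives; `on` is removed before joining.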
12 changes: 12 additions & 0 deletions test.py
@@ -0,0 +1,12 @@
from dateparser.search_dates import search_dates

# THIS IS TEMPORARY FILE FOR TESTS

text = """10 Febbraio 2020 15:00 ciao moka"""

out1 = search_dates(text)
print(out1)



# tox -e py -- tests/test_search_dates.py