Fuzzy Word Matching #56

Open
ZaLiTHkA opened this issue Jul 8, 2014 · 46 comments
@ZaLiTHkA

ZaLiTHkA commented Jul 8, 2014

Could DYM be extended to use the idea of "approximate string matching" (Wikipedia reference)? Not everybody phrases their sentences the same way. I honestly believe this plugin would be a million times more useful if words like "faulty" also returned a match for "fault", "saving" matched "saved" or "save", and so on.

Unfortunately I'm not familiar enough with Ruby to do this myself, but a quick Google search shows many Ruby Gems that provide this type of functionality. So I'm guessing (read: hoping) this isn't an unreasonable request. :)
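
For readers unfamiliar with approximate string matching, here is a minimal Ruby sketch of one common measure, the Levenshtein edit distance. It is only an illustration of the idea and not how the plugin or any of the gems mentioned in this thread work; `fuzzy_match?` and its `max_distance` default are hypothetical names chosen for the example.

```ruby
# Classic dynamic-programming edit distance (Levenshtein):
# the number of single-character insertions, deletions and
# substitutions needed to turn string a into string b.
def levenshtein(a, b)
  # prev holds the distances from the empty prefix of a to each prefix of b
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: treat two words as a match when they are
# at most max_distance edits apart.
def fuzzy_match?(query, candidate, max_distance = 2)
  levenshtein(query.downcase, candidate.downcase) <= max_distance
end

levenshtein('faulty', 'fault')  # => 1
levenshtein('saving', 'saved')  # => 3
```

Note that "saving" and "saved" are three edits apart, so a small distance threshold alone would not cover every pair from the request; pairs like that are usually handled by stemming rather than by edit distance.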

@abahgat
Owner

abahgat commented Jul 8, 2014

Yes, that's something that we had considered in the past, but never implemented as we did not want to kill the underlying database with too many queries.

Can you please list a few of the Ruby Gems you had found?

@ZaLiTHkA
Author

ZaLiTHkA commented Jul 9, 2014

From my limited understanding, I can only imagine a DB hit when the front end gets an updated string to compare (I use the "as you type" method in my setup, which could actually skip key presses like shift, space etc., but that's not the point of my request here). Do you think this idea would cause more hits?

One that keeps popping up in SO questions is amatch, but after a bit of digging it looks like either fuzzy_match or fuzzy-string-match might be faster.

Another interesting looking one is fuzzily, however the dev recommends using blurrily instead for large datasets.

From the lib weight point of view, it looks like amatch and fuzzily have fewer dependencies, but I'm not too sure how much difference that would actually make. Keep in mind I don't speak Ruby, so I'm running blind here. :)

@rlisowski

What about full-text search with elasticsearch-rails?
It could work with a sidekiq worker as an async indexer.
The indexer could be triggered by hooks in Issue (added via a patch).
There is a redmine_sidekiq plugin, so you would only need to provide the worker.
What do you think?

@rlisowski

Consider also thinkingsphinx, which may fit the current feature list (filter by project_id, issue_id, status) better than elasticsearch.

@rlisowski

👍

@ZaLiTHkA
Author

I just pulled the changes referenced above into my fork of the repo, but I still don't get any fuzzy searching. When I migrate plugins, I get a warning saying "Sphinx cannot be found on your system". I tried installing the gem manually (gem install sphinx from Redmine's htdocs root folder), which didn't give any errors, but it also didn't change anything, so I'm not sure how to get around this.

@korin, was that thumbs-up meant to imply that @swiatkiewicz's changes work? Or have you not had a chance to test them yet?

@rlisowski

It works, but you need to install Sphinx. See the ThinkingSphinx quickstart guide; Sphinx is already available in most distros.

To be honest, @swiatkiewicz's changes allow replacing the SQL LIKE search with the Sphinx indexer, which is the faster option. It's only one step away from the fuzzy matching feature.

@ZaLiTHkA
Author

Thanks for the link. Unfortunately I need to run my system in a Windows environment, so it looks like I've got some reading to do before I get that part working.

The SQL LIKE search works perfectly and the additional configuration options are helpful, so this is already a nice improvement.

@abahgat abahgat added the feature label Oct 2, 2014
abahgat added a commit that referenced this issue Jan 2, 2015
Fuzzy Word Matching with Sphinx #56
@dominch

dominch commented Feb 4, 2015

I can't get this feature to work :(
I have already started Sphinx etc., then tried searching for "test"
and got these results:
Bug # 1 – Testowy błąd (Closed w projekcie Projekt testowy)
Bug # 11 – test (New w projekcie Projekt testowy)
Feature # 19 – Test (New w projekcie)
Feature # 21 – testowe zadanie (New w projekcie Zadania w realizacji KD)
Feature # 22 – Ticket testowo pokazowy (New w projekcie Zadania w realizacji KD)
So I assume it should show something for "testy" (it should strip the "y" and match like "test"?)

@rlisowski

Show us your plugin settings: /settings/plugin/redmine_didyoumean

@dominch

dominch commented Feb 4, 2015

It's in Polish, but you know all the settings.
http://i.imgur.com/gvLDYVy.png
Any ideas?

@rlisowski

It should work; I have similar settings. Are there any errors in the Redmine log file or in the browser's developer console?

@rlisowski

You can also try rebuilding the Thinking Sphinx index with rake ts:rebuild.

@dominch

dominch commented Feb 4, 2015

No errors noticed so far :( I'll try this out with new tickets.
BTW: is ts:rebuild better than ts:index?

@rlisowski

There is only a small difference, related to configuration; see http://pat.github.io/thinking-sphinx/rake_tasks.html

@dominch

dominch commented Feb 5, 2015

OK, so basically ts:rebuild is the same as stop + index + start, great :)
I still can't get this to work as expected. I tried MySQL first; is there a chance it is still using that? The rake tasks run fine without any errors and the settings should be OK, but I still can't get the right results.
Maybe something is wrong with Sphinx on my system? How can I test that?

@swiatkiewicz
Contributor

Have you tried testing it in the Rails console?
Run something like Issue.search 'somethig', and you should see
which search engine is being used (SQL or Sphinx).


@dominch

dominch commented Feb 5, 2015

Trying now:

2.0.0-p594 :001 > Issue.search 'somethig'
  CustomField Load (0.6ms)  SELECT `custom_fields`.* FROM `custom_fields` WHERE `custom_fields`.`type` = 'IssueCustomField' AND `custom_fields`.`searchable` = 1
  Role Load (0.3ms)  SELECT `roles`.* FROM `roles` WHERE `roles`.`builtin` = 2 LIMIT 1
  GroupAnonymous Load (0.6ms)  SELECT `users`.* FROM `users` WHERE `users`.`type` IN ('GroupAnonymous') ORDER BY id LIMIT 1
  Member Load (0.4ms)  SELECT `members`.* FROM `members` INNER JOIN `projects` ON `projects`.`id` = `members`.`project_id` WHERE (projects.status <> 9) AND (members.user_id = 2 OR (projects.is_public = 1 AND members.user_id = 49))
   (15.0ms)  SELECT COUNT(DISTINCT `issues`.`id`) FROM `issues` LEFT OUTER JOIN `projects` ON `projects`.`id` = `issues`.`project_id` LEFT OUTER JOIN `journals` ON `journals`.`journalized_id` = `issues`.`id` AND (journals.private_notes = 0 OR (1=0)) AND `journals`.`journalized_type` = 'Issue' WHERE (((projects.status <> 9 AND projects.id IN (SELECT em.project_id FROM enabled_modules em WHERE em.name='issue_tracking')) AND ((projects.is_public = 1 AND ((issues.is_private = 0)))))) AND (((LOWER(subject) LIKE '%somethig%') OR (LOWER(issues.description) LIKE '%somethig%') OR (LOWER(journals.notes) LIKE '%somethig%') OR issues.id IN (SELECT cfs.customized_id FROM custom_values cfs WHERE cfs.customized_type='Issue' AND cfs.customized_id=issues.id AND LOWER(cfs.value) LIKE '%somethig%' AND cfs.custom_field_id IN (2,4) AND ((1=1) AND (issues.tracker_id IN (SELECT tracker_id FROM custom_fields_trackers WHERE custom_field_id = cfs.custom_field_id)) AND (EXISTS (SELECT 1 FROM custom_fields ifa WHERE ifa.is_for_all = 1 AND ifa.id = cfs.custom_field_id) OR issues.project_id IN (SELECT project_id FROM custom_fields_projects WHERE custom_field_id = cfs.custom_field_id))))))
  SQL (23.8ms)  SELECT `issues`.`id` AS t0_r0, `issues`.`tracker_id` AS t0_r1, `issues`.`project_id` AS t0_r2, `issues`.`subject` AS t0_r3, `issues`.`description` AS t0_r4, `issues`.`due_date` AS t0_r5, `issues`.`category_id` AS t0_r6, `issues`.`status_id` AS t0_r7, `issues`.`assigned_to_id` AS t0_r8, `issues`.`priority_id` AS t0_r9, `issues`.`fixed_version_id` AS t0_r10, `issues`.`author_id` AS t0_r11, `issues`.`lock_version` AS t0_r12, `issues`.`created_on` AS t0_r13, `issues`.`updated_on` AS t0_r14, `issues`.`start_date` AS t0_r15, `issues`.`done_ratio` AS t0_r16, `issues`.`estimated_hours` AS t0_r17, `issues`.`parent_id` AS t0_r18, `issues`.`root_id` AS t0_r19, `issues`.`lft` AS t0_r20, `issues`.`rgt` AS t0_r21, `issues`.`is_private` AS t0_r22, `issues`.`ir_position` AS t0_r23, `issues`.`closed_on` AS t0_r24, `issues`.`sprint_id` AS t0_r25, `issues`.`position` AS t0_r26, `projects`.`id` AS t1_r0, `projects`.`name` AS t1_r1, `projects`.`description` AS t1_r2, `projects`.`homepage` AS t1_r3, `projects`.`is_public` AS t1_r4, `projects`.`parent_id` AS t1_r5, `projects`.`created_on` AS t1_r6, `projects`.`updated_on` AS t1_r7, `projects`.`identifier` AS t1_r8, `projects`.`status` AS t1_r9, `projects`.`lft` AS t1_r10, `projects`.`rgt` AS t1_r11, `projects`.`inherit_members` AS t1_r12, `projects`.`default_assignee_id` AS t1_r13, `projects`.`product_backlog_id` AS t1_r14, `journals`.`id` AS t2_r0, `journals`.`journalized_id` AS t2_r1, `journals`.`journalized_type` AS t2_r2, `journals`.`user_id` AS t2_r3, `journals`.`notes` AS t2_r4, `journals`.`created_on` AS t2_r5, `journals`.`private_notes` AS t2_r6 FROM `issues` LEFT OUTER JOIN `projects` ON `projects`.`id` = `issues`.`project_id` LEFT OUTER JOIN `journals` ON `journals`.`journalized_id` = `issues`.`id` AND (journals.private_notes = 0 OR (1=0)) AND `journals`.`journalized_type` = 'Issue' WHERE (((projects.status <> 9 AND projects.id IN (SELECT em.project_id FROM enabled_modules em WHERE em.name='issue_tracking')) AND 
((projects.is_public = 1 AND ((issues.is_private = 0)))))) AND (((LOWER(subject) LIKE '%somethig%') OR (LOWER(issues.description) LIKE '%somethig%') OR (LOWER(journals.notes) LIKE '%somethig%') OR issues.id IN (SELECT cfs.customized_id FROM custom_values cfs WHERE cfs.customized_type='Issue' AND cfs.customized_id=issues.id AND LOWER(cfs.value) LIKE '%somethig%' AND cfs.custom_field_id IN (2,4) AND ((1=1) AND (issues.tracker_id IN (SELECT tracker_id FROM custom_fields_trackers WHERE custom_field_id = cfs.custom_field_id)) AND (EXISTS (SELECT 1 FROM custom_fields ifa WHERE ifa.is_for_all = 1 AND ifa.id = cfs.custom_field_id) OR issues.project_id IN (SELECT project_id FROM custom_fields_projects WHERE custom_field_id = cfs.custom_field_id)))))) ORDER BY issues.id ASC
 => [[], 0]

It seems to be SQL, doesn't it?
I migrated my old Redmine to the new version and it is still searching without the fuzzy words feature. Any ideas what I can check? The plugin settings seem to be saved correctly. Is there any debug output available so I can see which engine it's using?

@rlisowski

Run Setting.plugin_redmine_didyoumean['search_method'] in the Rails console: 0 = SQL, 1 = Thinking Sphinx.

@dominch

dominch commented Feb 5, 2015

I'm trying in console:

2.0.0-p594 :006 > Issue.sphinx_search 'test'
  Issue Load (0.8ms)  SELECT `issues`.* FROM `issues` WHERE `issues`.`id` IN (231, 1, 51, 52, 53, 114, 150, 153, 167, 173, 232, 235, 244, 284, 381, 523, 687, 717, 747, 912)
(20 results)

And the same for the word 'testy' gives me only one result.
So Sphinx is working and gives me some results, but they are almost the same as for SQL.

@dominch

dominch commented Feb 5, 2015

That seems to be ok:

2.0.0-p594 :009 > Setting.plugin_redmine_didyoumean['search_method']
 => "1"

And that's:

  def search_class
    case Setting.plugin_redmine_didyoumean['search_method']
    when "0"
      SqlSearch
    when "1"
      ThinkingSphinxSearch
    else
      raise 'There is no search method selected!'
    end
  end

So it's Sphinx. I tried modifying searching_by_thinking_sphinx.rb and that had an effect, so it's definitely being used. The question is what is wrong with Sphinx that makes the results wrong.

@swiatkiewicz
Contributor

@dominch
Follow these steps to fix this:
Open the file issues_index.rb and replace line 7 with set_property :enable_star => true,
then in the main project ("Redmine") directory (not in the plugin) run 'rake ts:rebuild' and try searching for duplicates again.

My case was 'test', 'tester', 'testowy'. Before these steps I got only 1 result where there should have been 3; after these steps I get the correct result (3).

Check your application log for something like :
Sphinx Query (1.4ms) SELECT * FROM issue_core WHERE MATCH('test') AND project_id IN (450) AND sphinx_deleted = 0 AND sphinx_internal_id NOT IN (0) LIMIT 0, 10
Sphinx Found 3 results

@dominch

dominch commented Feb 6, 2015

How can I turn on debug mode?
Trying

script/rails server webrick -e production -d -p 3000

plus:

http://localhost:3000/searchissues?project_id=1&issue_id=&query=testowy

gives me:

Processing by SearchIssuesController#index as HTML
  Parameters: {"project_id"=>"1", "issue_id"=>"", "query"=>"testowy"}
  Current user: dominik.chmaj (id=3)
Completed 200 OK in 485.4ms (Views: 1.0ms | ActiveRecord: 15.7ms)

The previous setting for :enable_star was 1; I changed it to true, but still no effect :(

@swiatkiewicz
Contributor

@dominch
Open config/environments/production.rb, then find the line with config.logger.level or config.log_level; it should be set to :debug, e.g. config.log_level = :debug

Thinking Sphinx has another problem: if you add a new issue or edit an existing one, you have to run ts:index to update the indexes.
You can use a Unix cron job to run this every five minutes or so.

I'm trying to implement real-time indexing, but it doesn't work as I expected and it may take a while.

@dominch

dominch commented Feb 6, 2015

Ok, debug logs are working and I have:

  Sphinx Query (0.8ms)  SELECT * FROM `issue_core` WHERE MATCH('*testowej*') AND `project_id` IN (1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 17, 18, 19, 20, 23, 24, 25, 26) AND `sphinx_deleted` = 0 AND `sphinx_internal_id` NOT IN (0) LIMIT 0, 5
  Sphinx  Found 3 results
  Issue Load (0.6ms)  SELECT `issues`.* FROM `issues` WHERE `issues`.`id` IN (336, 717, 962)

So that's proof: Sphinx is working, but somehow it does not return many results. Only exact words.

My development.sphinx.conf looks like this:

indexer
{
}
searchd
{
  listen = 127.0.0.1:9306:mysql41
  log = /var/data/redmine/log/development.searchd.log
  query_log = /var/data/redmine/log/development.searchd.query.log
  pid_file = /var/data/redmine/log/development.sphinx.pid
  workers = threads
  binlog_path = /var/data/redmine/tmp/binlog/development
}
source issue_core_0
{
  type = mysql
  sql_host = localhost
  sql_user = redmine
  sql_pass = ***
  sql_db = redmine
  sql_query_pre = SET TIME_ZONE = '+0:00'
  sql_query_pre = SET NAMES utf8
  sql_query = SELECT SQL_NO_CACHE `issues`.`id` * 2 + 0 AS `id`, `issues`.`subject` AS `subject`, `issues`.`id` AS `sphinx_internal_id`, 'Issue' AS `sphinx_internal_class`, 0 AS `sphinx_deleted`, `issues`.`id` AS `id`, `issues`.`status_id` AS `status_id`, `issues`.`project_id` AS `project_id` FROM `issues`  WHERE (`issues`.`id` BETWEEN $start AND $end) GROUP BY `issues`.`id`, `issues`.`subject`, `issues`.`id`, `issues`.`id`, `issues`.`status_id`, `issues`.`project_id` ORDER BY NULL
  sql_query_range = SELECT IFNULL(MIN(`issues`.`id`), 1), IFNULL(MAX(`issues`.`id`), 1) FROM `issues`
  sql_attr_uint = sphinx_internal_id
  sql_attr_uint = sphinx_deleted
  sql_attr_uint = id
  sql_attr_uint = status_id
  sql_attr_uint = project_id
  sql_attr_string = sphinx_internal_class
  sql_field_string = subject
  sql_query_info = SELECT `issues`.* FROM `issues`  WHERE (`issues`.`id` = ($id - 0) / 2)
}
index issue_core
{
  type = plain
  path = /var/data/redmine/db/sphinx/development/issue_core
  docinfo = extern
  charset_type = utf-8
  min_infix_len = 2
  enable_star = 1
  source = issue_core_0
}
index issue
{
  type = distributed
  local = issue_core
}

Everything seems to work except for getting the right results :) This has to be something to do with tokenization etc. I assume the language is not that important, because it should look for any word in any language with the described rules. Is that correct?

@swiatkiewicz
Contributor

@dominch
In my case when I'm looking for word 'test':
Project:
dym2
Results:
dym1

This seems to be ok, right?
I think you'd like to get something like this, right?

@dominch

dominch commented Feb 6, 2015

The issue description says:

"faulty" also returned a match for "fault", "saving" matched "saved" or "save"

from plugin settings:

- Thinking Sphinx - firsly search words 1:1 after then substract last character and search again ('Running' will be looking for 'Runner' 'Running' etc.). Substract to min word length which is definded below

So in your case, for the word "tester" it should find everything containing the word "test" (2 letters subtracted).
Of course, the ideal is to match all forms as in the example: Running should find Runner and Running.

Right now, for your example, it's looking for test, and I see the same thing. Above you can find "FROM issue_core WHERE MATCH('testowej')",
which is the equivalent of SQL "where x like "%test%"". Sure, it will find "testowy" and even "wytestuj", but that's not very useful. Full-text search should tokenize words, reduce them to their basic forms, remove stopwords and try to match against the query (processed with the same rules).

Try "tester": it should find # 101497 and of course # 101512. I expect that this word should trigger a search for "tester" + "teste" + "test" + "tes", assign weights, etc. That should give many more results.
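
The "subtract the last character and search again" behaviour described above can be sketched in a few lines of Ruby. This is only an illustration of the expected term expansion, not the plugin's actual code; `truncation_candidates` and `min_length` are hypothetical names, with `min_length` standing in for the plugin's minimum word length setting.

```ruby
# Hypothetical helper: returns the word itself followed by progressively
# shorter prefixes, stopping once min_length characters remain.
def truncation_candidates(word, min_length = 3)
  return [word] if word.length <= min_length
  word.length.downto(min_length).map { |len| word[0, len] }
end

truncation_candidates('tester')  # => ["tester", "teste", "test", "tes"]
```

Each candidate could then be sent to the search backend in turn, with longer candidates given a higher weight.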

@rlisowski

It seems that sphinxsearch does not stem words by default. To make it work, install it with the libstemmer library.

links:
https://pat.github.io/thinking-sphinx/advanced_config.html
http://sphinxsearch.com/docs/current.html#conf-morphology
http://snowball.tartarus.org/download.php
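
To show why stemming makes "testy" and "test" match, here is a deliberately naive Ruby sketch. It is a toy suffix stripper, not the Snowball/libstemmer algorithm that Sphinx uses, and the suffix list and `toy_stem` name are assumptions made up for the example.

```ruby
# Toy suffix list, checked in order; real stemmers use far richer rules.
SUFFIXES = %w[ing ed y s].freeze

# Strip the first matching suffix, but only if at least 3 characters remain.
def toy_stem(word)
  w = word.downcase
  suffix = SUFFIXES.find { |s| w.end_with?(s) && w.length - s.length >= 3 }
  suffix ? w[0, w.length - suffix.length] : w
end

toy_stem('testy')   # => "test"
toy_stem('tests')   # => "test"
toy_stem('faulty')  # => "fault"
```

If both the indexed text and the query are passed through the same function, words that share a stem end up matching each other, which is the effect the morphology setting is meant to provide.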

@dominch

dominch commented Feb 18, 2015

I just changed my config and now it's working.
I think this should be mentioned in the README to make it clear :) I needed to add a config/thinking_sphinx.yml file with this content:

production:
  morphology: stem_en
  mem_limit: 128M
  wordforms: "/var/data/redmine/config/sphinx/wordforms.txt"
  stopwords: "/var/data/redmine/config/sphinx/stopwords.txt"

This added morphology and a stemmer to my generated files.
I also added wordforms and stopwords to my config; those are easy to find for any language with Google.

Now it's working great! :) Thank you for the help, I wasn't sure whether I needed anything more in my configuration.

@Androc

Androc commented Feb 18, 2015

@dominch Hello, I am highly interested in what you found. Could you please describe a little more how you achieved that ?

  • Did you find your wordforms/stopwords or did you generate them (and how)?
  • When you write config/thinking_sphinx.yml, do you mean redmine_root_path/config/thinking_sphinx.yml?
  • Did you do anything other than add the thinking_sphinx.yml file?

I would like to use a language other than English/Russian, and after reading the documentation it appears I have to take more steps to get morphology search working.

@Androc

Androc commented Feb 18, 2015

I found the beginning of an answer:

  • yes, it is redmine_root_path/config/thinking_sphinx.yml
  • you must create a sphinx directory inside config
  • you must install or download your_language.dict and your_language.aff (I found .dict here: http://icon.shef.ac.uk/Moby/mlang.html and .aff with ispell)
  • you must create a wordforms.txt file inside the config/sphinx directory by using spelldump your_language.dict your_language.aff
  • you must create a stopwords.txt file (even if empty) inside config/sphinx. I was not able to fill it, as indexer --buildstops wordforms.txt 1000 reported nothing to do

I ran rake ts:index (not sure if that is necessary or useful).

The morphology still does not work :(

@dominch

dominch commented Feb 18, 2015

Yes, it's inside the Redmine config directory. Edit this file (thinking_sphinx.yml), and after ts:rebuild you should notice a change in the production.sphinx.conf file (in the same dir), which is regenerated when the command is executed.
For other morphologies you must install dictionaries.
Stopwords and wordforms are simple txt files: the first contains one stopword per line, and the wordforms file contains "from > to" lines.

I found both files on the internet and placed them inside redmine_dir/config/sphinx (and reflected that path in the config above). That is enough for my needs: wordforms reduce my complex words to basic ones, like "thinking > think" in the example above.

Best luck! :)
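
As an illustration of the two file formats just described, the following Ruby sketch parses "from > to" wordform lines and a one-word-per-line stopword list, then applies both to a query. It only mimics the idea in plain Ruby; it is not how Sphinx processes these files, and all function names here are hypothetical.

```ruby
require 'set'

# Parse "from > to" lines into a Hash, skipping blank or malformed lines.
def parse_wordforms(text)
  text.each_line.with_object({}) do |line, map|
    from, to = line.strip.split(/\s*>\s*/, 2)
    next if from.nil? || from.empty? || to.nil? || to.empty?
    map[from] = to
  end
end

# One stopword per line.
def load_stopwords(text)
  text.each_line.map(&:strip).reject(&:empty?).to_set
end

# Drop stopwords, then replace each remaining word by its basic form.
def normalize_query(query, wordforms, stopwords)
  query.downcase.split
       .reject { |w| stopwords.include?(w) }
       .map { |w| wordforms.fetch(w, w) }
end

wordforms = parse_wordforms("thinking > think\nrunning > run\n")
stopwords = load_stopwords("the\nis\n")
normalize_query('The thinking is running', wordforms, stopwords)
# => ["think", "run"]
```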

@Androc

Androc commented Feb 18, 2015

Thanks for your answer. My production.sphinx.conf was updated:

  type = plain
  path = /usr/share/redmine-stable/db/sphinx/production/issue_core
  docinfo = extern
  morphology = stem_en, libstemmer_fr
  stopwords = /usr/share/redmine-stable/config/sphinx/stopwords.txt
  wordforms = /usr/share/redmine-stable/config/sphinx/wordforms.txt
  charset_type = utf-8
  min_infix_len = 2
  enable_star = 1
  source = issue_core_0

So I suppose Sphinx is aware that I want morphology, but what I want to achieve first is that a search for testy finds test.
Morphology will come later, I suppose.

@Androc

Androc commented Feb 18, 2015

What bothers me is that after using the spelldump command, my wordforms.txt is full of word > word.

After reading the documentation, I assumed I would find word > other_word.

Am I wrong?

@dominch

dominch commented Feb 18, 2015

No, it should be "something > somethingElse".
The idea is to change complex words to basic ones;
Sphinx should give you warnings on reindex that it's ignoring such lines.

@Androc

Androc commented Feb 18, 2015

OK. So my doubts were well founded :)

I just found that my dict file was corrupted, which certainly led to a bad wordforms.txt.

@Androc

Androc commented Feb 18, 2015

My dict file is now clean, and I regenerated my wordforms.txt. It is a clean word > word list with no more garbage characters, but still not a word > other_word list :(
As this is not directly the plugin's problem, I will investigate on my own and come back later.

Thanks for your help.

@Androc

Androc commented Feb 19, 2015

I found a solution.

My problem was:
My .dic was "clean" but did not have the directives needed to take the .aff into account.

Your .dic file must contain a list of word/X entries (where X can be one or several letters like S, A, etc.) and not only a list of words.

My solution:
I installed the .dic from myspell rather than ispell and obtained a valid .dic.

Now I have a valid wordforms.txt.

@Androc

Androc commented Feb 19, 2015

When I ran rake ts:index, I got a lot of warnings about duplicates.

My wordforms.txt still had lines like word > word.

I ran awk -F " > " '$1 != $2 { print $0 }' wordforms.txt > wordforms_clean.txt to extract only the word1 > word2 lines.
Then I overwrote my wordforms.txt with the clean one.

But when I run rake ts:index, I still get the duplicate warnings :(

I don't know how to clean the former indexes; I have tried rake ts:rebuild, rake ts:regenerate, etc.

@dominch

dominch commented Feb 19, 2015

duplicates are not only

word1 > word1

but

word1 > word2
word1 > word3

are duplicates too. Each word may have only one word to convert to. In other words, a unique index is built and the engine needs to know exactly what should be replaced with what. Try to grep your wordforms for any example from the warnings ("word1 > ") :)
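
As an alternative to grepping by hand, the kind of conflict described here (one source word mapped to several targets) can be found with a short Ruby sketch. `wordform_conflicts` is a hypothetical helper written for this example, not part of Sphinx or the plugin.

```ruby
# Returns a Hash of source words that map to more than one distinct target,
# e.g. {"word1" => ["word2", "word3"]}.
def wordform_conflicts(lines)
  targets = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    from, to = line.strip.split(/\s*>\s*/, 2)
    next if from.nil? || from.empty? || to.nil? || to.empty?
    targets[from] << to unless targets[from].include?(to)
  end
  targets.select { |_from, tos| tos.size > 1 }
end

sample = ['word1 > word2', 'word1 > word3', 'tests > test']
wordform_conflicts(sample)  # => {"word1"=>["word2", "word3"]}
```

For a real file, the lines could be read with `File.readlines('wordforms.txt')` and the reported sources cleaned up until the result is empty.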

@Androc

Androc commented Feb 19, 2015

Hmm, I think I found my real problem: the accents!

I don't have word > word1, word > word2; what I have is wôrd > word2 and fôrd > word2, which leads to a duplicate on rd > word2.

Probably a UTF-8 problem.
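
The suspected effect can be reproduced in plain Ruby under one assumption: that the indexer only recognises 0-9, a-z and _ as word characters and treats everything else, including accented letters, as a separator. This is a guess at the cause, not verified Sphinx behaviour, and `ascii_only_tokens` is a hypothetical helper.

```ruby
# Assumption: any character outside 0-9, a-z and _ acts as a word separator.
def ascii_only_tokens(text)
  text.downcase.split(/[^0-9a-z_]+/).reject(&:empty?)
end

ascii_only_tokens('wôrd')  # => ["w", "rd"]
ascii_only_tokens('fôrd')  # => ["f", "rd"]
```

Under that assumption both source words end in the same token "rd", which would explain a duplicate warning on rd > word2.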

@dominch

dominch commented Feb 19, 2015

Then check the database and its table encoding. MySQL, for example, uses latin1_swedish by default; I needed to change both the DB and all tables to utf8, but that happened some time ago. New tables used to be created without utf8, and now they are utf8. I'm not sure whether that information is stored in the DB at index time, but some documents are.

@Androc

Androc commented Feb 19, 2015

They are all UTF8.

I think it is more a Sphinx problem. I found several posts about UTF8 problems.

Still searching :)

@Androc

Androc commented Feb 19, 2015

OK... This is strange. I was totally focused on my warnings, but in fact the morphology seems to work.

When I search for traiter or traités, it matches traités. That is one good point.

Now, my problem is that in the example for the did_you_mean plugin, there is the tests example, which should match test without using morphology.

I have an issue named test, and when I search for tests the issue is not found.

Maybe I misunderstood the example.

@dominch

dominch commented Feb 19, 2015

That example relies on English morphology, which cuts words down to a minimal stem, so tes+t and tes+ts are treated as equal.
It was not working on my side until I added morphology.

@Androc

Androc commented Feb 19, 2015

Since I also added the following to thinking_sphinx.yml, I am not sure what is making the morphology work:

charset_table: "0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a, U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e, U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i, U+00CF->i, U+00D1->n, U+00D2->o, U+00D3->o, U+00D4->o, U+00D5->o, U+00D6->o, U+00D8->o, U+00D9->u, U+00DA->u, U+00DB->u, U+00DC->u, U+00DD->y, U+00E0->a, U+00E1->a, U+00E2->a, U+00E3->a, U+00E4->a, U+00E5->a, U+00E7->c, U+00E8->e, U+00E9->e, U+00EA->e, U+00EB->e, U+00EC->i, U+00ED->i, U+00EE->i, U+00EF->i, U+00F1->n, U+00F2->o, U+00F3->o, U+00F4->o, U+00F5->o, U+00F6->o, U+00F8->o, U+00F9->u, U+00FA->u, U+00FB->u, U+00FC->u, U+00FD->y, U+00FF->y, U+0100->a, U+0101->a, U+0102->a, U+0103->a, U+0104->a, U+0105->a, U+0106->c, U+0107->c, U+0108->c, U+0109->c, U+010A->c, U+010B->c, U+010C->c, U+010D->c, U+010E->d, U+010F->d, U+0112->e, U+0113->e, U+0114->e, U+0115->e, U+0116->e, U+0117->e, U+0118->e, U+0119->e, U+011A->e, U+011B->e, U+011C->g, U+011D->g, U+011E->g, U+011F->g, U+0120->g, U+0121->g, U+0122->g, U+0123->g, U+0124->h, U+0125->h, U+0128->i, U+0129->i, U+0131->i, U+012A->i, U+012B->i, U+012C->i, U+012D->i, U+012E->i, U+012F->i, U+0130->i, U+0134->j, U+0135->j, U+0136->k, U+0137->k, U+0139->l, U+013A->l, U+013B->l, U+013C->l, U+013D->l, U+013E->l, U+0141->l, U+0142->l, U+0143->n, U+0144->n, U+0145->n, U+0146->n, U+0147->n, U+0148->n, U+014C->o, U+014D->o, U+014E->o, U+014F->o, U+0150->o, U+0151->o, U+0154->r, U+0155->r, U+0156->r, U+0157->r, U+0158->r, U+0159->r, U+015A->s, U+015B->s, U+015C->s, U+015D->s, U+015E->s, U+015F->s, U+0160->s, U+0161->s, U+0162->t, U+0163->t, U+0164->t, U+0165->t, U+0168->u, U+0169->u, U+016A->u, U+016B->u, U+016C->u, U+016D->u, U+016E->u, U+016F->u, U+0170->u, U+0171->u, U+0172->u, U+0173->u, U+0174->w, U+0175->w, U+0176->y, U+0177->y, U+0178->y, U+0179->z, U+017A->z, U+017B->z, U+017C->z, U+017D->z, U+017E->z, U+01A0->o, U+01A1->o, U+01AF->u, U+01B0->u, U+01CD->a, U+01CE->a, U+01CF->i, U+01D0->i, U+01D1->o, U+01D2->o, U+01D3->u, 
U+01D4->u, U+01D5->u, U+01D6->u, U+01D7->u, U+01D8->u, U+01D9->u, U+01DA->u, U+01DB->u, U+01DC->u, U+01DE->a, U+01DF->a, U+01E0->a, U+01E1->a, U+01E6->g, U+01E7->g, U+01E8->k, U+01E9->k, U+01EA->o, U+01EB->o, U+01EC->o, U+01ED->o, U+01F0->j, U+01F4->g, U+01F5->g, U+01F8->n, U+01F9->n, U+01FA->a, U+01FB->a, U+0200->a, U+0201->a, U+0202->a, U+0203->a, U+0204->e, U+0205->e, U+0206->e, U+0207->e, U+0208->i, U+0209->i, U+020A->i, U+020B->i, U+020C->o, U+020D->o, U+020E->o, U+020F->o, U+0210->r, U+0211->r, U+0212->r, U+0213->r, U+0214->u, U+0215->u, U+0216->u, U+0217->u, U+0218->s, U+0219->s, U+021A->t, U+021B->t, U+021E->h, U+021F->h, U+0226->a, U+0227->a, U+0228->e, U+0229->e, U+022A->o, U+022B->o, U+022C->o, U+022D->o, U+022E->o, U+022F->o, U+0230->o, U+0231->o, U+0232->y, U+0233->y, U+1E00->a, U+1E01->a, U+1E02->b, U+1E03->b, U+1E04->b, U+1E05->b, U+1E06->b, U+1E07->b, U+1E08->c, U+1E09->c, U+1E0A->d, U+1E0B->d, U+1E0C->d, U+1E0D->d, U+1E0E->d, U+1E0F->d, U+1E10->d, U+1E11->d, U+1E12->d, U+1E13->d, U+1E14->e, U+1E15->e, U+1E16->e, U+1E17->e, U+1E18->e, U+1E19->e, U+1E1A->e, U+1E1B->e, U+1E1C->e, U+1E1D->e, U+1E1E->f, U+1E1F->f, U+1E20->g, U+1E21->g, U+1E22->h, U+1E23->h, U+1E24->h, U+1E25->h, U+1E26->h, U+1E27->h, U+1E28->h, U+1E29->h, U+1E2A->h, U+1E2B->h, U+1E2C->i, U+1E2D->i, U+1E2E->i, U+1E2F->i, U+1E30->k, U+1E31->k, U+1E32->k, U+1E33->k, U+1E34->k, U+1E35->k, U+1E36->l, U+1E37->l, U+1E38->l, U+1E39->l, U+1E3A->l, U+1E3B->l, U+1E3C->l, U+1E3D->l, U+1E3E->m, U+1E3F->m, U+1E40->m, U+1E41->m, U+1E42->m, U+1E43->m, U+1E44->n, U+1E45->n, U+1E46->n, U+1E47->n, U+1E48->n, U+1E49->n, U+1E4A->n, U+1E4B->n, U+1E4C->o, U+1E4D->o, U+1E4E->o, U+1E4F->o, U+1E50->o, U+1E51->o, U+1E52->o, U+1E53->o, U+1E54->p, U+1E55->p, U+1E56->p, U+1E57->p, U+1E58->r, U+1E59->r, U+1E5A->r, U+1E5B->r, U+1E5C->r, U+1E5D->r, U+1E5E->r, U+1E5F->r, U+1E60->s, U+1E61->s, U+1E62->s, U+1E63->s, U+1E64->s, U+1E65->s, U+1E66->s, U+1E67->s, U+1E68->s, U+1E69->s, U+1E6A->t, U+1E6B->t, U+1E6C->t, 
U+1E6D->t, U+1E6E->t, U+1E6F->t, U+1E70->t, U+1E71->t, U+1E72->u, U+1E73->u, U+1E74->u, U+1E75->u, U+1E76->u, U+1E77->u, U+1E78->u, U+1E79->u, U+1E7A->u, U+1E7B->u, U+1E7C->v, U+1E7D->v, U+1E7E->v, U+1E7F->v, U+1E80->w, U+1E81->w, U+1E82->w, U+1E83->w, U+1E84->w, U+1E85->w, U+1E86->w, U+1E87->w, U+1E88->w, U+1E89->w, U+1E8A->x, U+1E8B->x, U+1E8C->x, U+1E8D->x, U+1E8E->y, U+1E8F->y, U+1E96->h, U+1E97->t, U+1E98->w, U+1E99->y, U+1EA0->a, U+1EA1->a, U+1EA2->a, U+1EA3->a, U+1EA4->a, U+1EA5->a, U+1EA6->a, U+1EA7->a, U+1EA8->a, U+1EA9->a, U+1EAA->a, U+1EAB->a, U+1EAC->a, U+1EAD->a, U+1EAE->a, U+1EAF->a, U+1EB0->a, U+1EB1->a, U+1EB2->a, U+1EB3->a, U+1EB4->a, U+1EB5->a, U+1EB6->a, U+1EB7->a, U+1EB8->e, U+1EB9->e, U+1EBA->e, U+1EBB->e, U+1EBC->e, U+1EBD->e, U+1EBE->e, U+1EBF->e, U+1EC0->e, U+1EC1->e, U+1EC2->e, U+1EC3->e, U+1EC4->e, U+1EC5->e, U+1EC6->e, U+1EC7->e, U+1EC8->i, U+1EC9->i, U+1ECA->i, U+1ECB->i, U+1ECC->o, U+1ECD->o, U+1ECE->o, U+1ECF->o, U+1ED0->o, U+1ED1->o, U+1ED2->o, U+1ED3->o, U+1ED4->o, U+1ED5->o, U+1ED6->o, U+1ED7->o, U+1ED8->o, U+1ED9->o, U+1EDA->o, U+1EDB->o, U+1EDC->o, U+1EDD->o, U+1EDE->o, U+1EDF->o, U+1EE0->o, U+1EE1->o, U+1EE2->o, U+1EE3->o, U+1EE4->u, U+1EE5->u, U+1EE6->u, U+1EE7->u, U+1EE8->u, U+1EE9->u, U+1EEA->u, U+1EEB->u, U+1EEC->u, U+1EED->u, U+1EEE->u, U+1EEF->u, U+1EF0->u, U+1EF1->u, U+1EF2->y, U+1EF3->y, U+1EF4->y, U+1EF5->y, U+1EF6->y, U+1EF7->y, U+1EF8->y, U+1EF9->y"

I ask because I have a tests > test entry in my wordforms.txt, so tests should find test.

Edit: ah, OK. So I suppose my morphology is only half working :)
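
For readers wondering what the long charset_table entry above is for: each U+XXXX->c pair asks the indexer to treat an accented character as its plain counterpart. A minimal Ruby sketch of that folding idea follows; it only handles the single-character U+XXXX->c entries, uses a short made-up sample instead of the full table, and is not how Sphinx implements it. `parse_folding` and `fold` are hypothetical names.

```ruby
# Parse only the "U+XXXX->c" entries of a charset_table string into a Hash
# that maps each accented character to its replacement.
def parse_folding(charset_table)
  charset_table.scan(/U\+(\h{4})->(\w)/).each_with_object({}) do |(hex, target), map|
    map[[hex.to_i(16)].pack('U')] = target
  end
end

# Lowercase the text and replace every character found in the folding map.
def fold(text, folding)
  text.downcase.each_char.map { |ch| folding.fetch(ch, ch) }.join
end

# Short sample in the same notation as the configuration above.
sample = 'a..z, A..Z->a..z, U+00E9->e, U+00F4->o'
folding = parse_folding(sample)

fold('traités', folding)  # => "traites"
fold('wôrd', folding)     # => "word"
```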

6 participants