Skip to content

feat: search bar #190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from
Open

feat: search bar #190

wants to merge 2 commits into from

Conversation

kunfang98927
Copy link
Contributor

Implement the search bar using Django ORM search.

Resolves: #186

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

@dchiller
Copy link
Contributor

Do we not want to use solr for this?

@fujinaga
Copy link
Member

Yes, I would like to use Solr eventually.

@ahankinson
Copy link
Member

I hope it's ok if I comment here....

The way you're indexing data by concatenating it into a single field with a pipe delimiter, and then doing a regex search on that string, is really not the way you're supposed to do it with Solr.

  • Solr has multi-valued fields. So if you have a list of strings, you should pass it in as a Python list. I note that you're not using a Solr client -- pysolr is an easy-to-use one. See the "solr.add" example in the docs for Pysolr: https://pypi.org/project/pysolr/
  • For actually querying you should define a default text field, and, using the copyField directives in Solr, automatically copy the values from your string fields into a single text field. This text field (text fields are different from string fields) lets you definte multiple processing steps for the text, and lets you copy values from different fields into a single field, meaning your keyword queries are run against this default field.

You already seem to have this in place: https://github.com/DDMAL/VIM/blob/develop/solr/cores/conf/schema.xml#L58C5-L68C17

However, you're missing a text field, even though your query handlers are looking for one in your df (default field -- the field that is used by default to run a query against). So I'm not entirely sure how your Solr installation is working.

To fix this, you need to define a text field in your schema, make it a text_general field type, then use a copyField to copy any values from your string fields into this single field. Then your q queries will work.

  • There's no requirement for this, but the query parameter for a text query is, by convention, q... so where you have "example.org/search?search_query=blahblah", it could be shortened to "example.org/search?q=blahblah".

  • Your query process uses Solr to look up a list of Instrument IDs from Solr, and then does a lookup in the database. This is absolutely not scalable. What you should do is store enough in Solr to do the querying and to show a brief search result in a list of results. Then, when the user clicks the result you can render the page from a database lookup. You can also process and store any data as a solr document, so you don't need to do things like strip off the instrument- string to get the ID -- just store the ID directly.

  • Once you get that up and running, I strongly encourage you to figure out the ICU field types for tokenization, since you're going to be dealing with all sorts of non-ASCII text, and these processors make it possible to do queries against this text. For example, if your text contains a "U" with an umlaut ("Ü"), then without "character folding", users querying for words like "über" and "uber" won't match. The I18N field processors do things like character folding, which is that letters lik "ü" get automatically converted to "u" at index and query time, making matches possible. It also handles ligatures like œ and Æ.

@kunfang98927
Copy link
Contributor Author

I hope it's ok if I comment here....

The way you're indexing data by concatenating it into a single field with a pipe delimiter, and then doing a regex search on that string, is really not the way you're supposed to do it with Solr.

  • Solr has multi-valued fields. So if you have a list of strings, you should pass it in as a Python list. I note that you're not using a Solr client -- pysolr is an easy-to-use one. See the "solr.add" example in the docs for Pysolr: https://pypi.org/project/pysolr/
  • For actually querying you should define a default text field, and, using the copyField directives in Solr, automatically copy the values from your string fields into a single text field. This text field (text fields are different from string fields) lets you definte multiple processing steps for the text, and lets you copy values from different fields into a single field, meaning your keyword queries are run against this default field.

You already seem to have this in place: https://github.com/DDMAL/VIM/blob/develop/solr/cores/conf/schema.xml#L58C5-L68C17

However, you're missing a text field, even though your query handlers are looking for one in your df (default field -- the field that is used by default to run a query against). So I'm not entirely sure how your Solr installation is working.

To fix this, you need to define a text field in your schema, make it a text_general field type, then use a copyField to copy any values from your string fields into this single field. Then your q queries will work.

  • There's no requirement for this, but the query parameter for a text query is, by convention, q... so where you have "example.org/search?search_query=blahblah", it could be shortened to "example.org/search?q=blahblah".
  • Your query process uses Solr to look up a list of Instrument IDs from Solr, and then does a lookup in the database. This is absolutely not scalable. What you should do is store enough in Solr to do the querying and to show a brief search result in a list of results. Then, when the user clicks the result you can render the page from a database lookup. You can also process and store any data as a solr document, so you don't need to do things like strip off the instrument- string to get the ID -- just store the ID directly.
  • Once you get that up and running, I strongly encourage you to figure out the ICU field types for tokenization, since you're going to be dealing with all sorts of non-ASCII text, and these processors make it possible to do queries against this text. For example, if your text contains a "U" with an umlaut ("Ü"), then without "character folding", users querying for words like "über" and "uber" won't match. The I18N field processors do things like character folding, which is that letters lik "ü" get automatically converted to "u" at index and query time, making matches possible. It also handles ligatures like œ and Æ.

Thank you very much for the comments! I’ll make the changes based on your comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement the search functionality
4 participants