Solr highlighting doesn't work with certain search terms. #502
Comments
This seems to be caused by a mismatch of the filter configuration in Solr for queries and result sets. Since we are working on integrating Kitodo.Presentation with dbmdz/solr-ocrhighlighting I suggest we wait and see if this issue then still exists. |
Meanwhile, this defective behavior has been reported by a number of boersenblatt-digital users. Any chance for a due date for Kitodo.Presentation 3.2.0 (which I could report back to the aforementioned users)? |
We don't have a due date yet. But this bug affects the Zeitungsportal as well, so we need a fix by March 2021 at the latest. |
Almost there ... btw. I was wondering if the move to dbmdz/solr-ocrhighlighting will also help with the missing highlighting for complex search queries (e.g. those using wildcards)? |
I've added @beatrycze-volk. She'll be the one implementing dbmdz/solr-ocrhighlighting. I suggest refactoring the search plugin and thereby addressing this issue as well. But that's up to her. |
Hi @beatrycze-volk, any updates on this issue? @sebastian-meyer wrote:
|
So, we are waiting for dbmdz/solr-ocrhighlighting#49 to be fixed. Since the last change on that issue dates back to November 2020, do you think it would be helpful to approach @jbaiter and ask about the current state of affairs? I'd volunteer to do this... |
Sure, thanks, that would be very useful! It would tell us whether we should wait or try to find some other solution. |
I am not quite sure if we really need dbmdz/solr-ocrhighlighting#49. While loading OCR files via URL would be the easiest solution to implement, I have concerns regarding performance - as every single highlighted snippet would trigger requesting an OCR file via HTTP. |
This is currently the recommended way to go if you can't load the OCR from disk. You can reduce the size the OCR takes up in the index by switching to a less verbose format, e.g. from ALTO to hOCR, or even better, to the custom MiniOCR format which was designed to take up as little space as possible (i.e. to only contain as much information as we need for highlighting). |
As I have already mentioned in the PR, storing the file content is a potential solution. The question now is whether we do it and, if so, whether this change should be part of the existing PR or a new one. |
I would like to merge only working code into the master of Kitodo.Presentation. The current PR does not index the documents anymore, so I suggest enhancing it by storing the OCR in a suitable format. What I do not understand right now: do we have the OCR fulltext twice in the index with our current schema? |
Yes, we would have to store it twice, once with markup for highlighting and retrieval of word coordinates and once as plain text for searching. But only the latter has to be indexed (i. e. searchable), so it shouldn't be a performance issue. |
@jbaiter @beatrycze-volk @albig, sorry for the delay, work/code day (second coffee) is just starting here in NYC. Yes we use We totally separate our HOCR highlight queries from our other queries. mixing
Basically we do this:
// This part here is specific to us, we use Drupal so we have a Native Query that wraps the Solarium one.
$solr_field_names = $query->getIndex()->getServerInstance()->getBackend()->getSolrFieldNames($query->getIndex());
// If we have on our Fields this particular one: `ocr_text` which is defined as of type text_ocr_stored (see down the chain)
if (isset($solr_field_names['ocr_text'])) {
    /* @var \Solarium\Component\Highlighting\Highlighting $hl */
    $hl = $solarium_query->getHighlighting();
    // hl.fl has issues if ocr_text is in that list (Token Offset big error,
    // bad, bad)
    // By removing any highlight returns in this case we can focus on what we
    // need.
    $hl->clearFields();
    // We pass the arguments we need for JUST ocr highlights, only requesting it for the Solr field we know supports this.
    $solarium_query->addParam('hl.ocr.fl', $solr_field_names['ocr_text']);
    $solarium_query->addParam('hl.ocr.absoluteHighlights', 'on');
    $solarium_query->addParam('hl.method', 'UnifiedHighlighter');
}
This is our type (Solr 8.7, we also have a few 8.8.2 around)
Update: Finally finished HOCR (longest running time for a single page in my library experience). Highlighting works well for "März". And this is our query:
|
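For readers who do not use Solarium, the calls in the snippet above boil down to a handful of request parameters. The following is only a minimal sketch of that parameter set, not the commenter's actual query (which is not reproduced here): the function name `ocr_highlight_params`, the example query string, and the literal field name `ocr_text` are assumptions for illustration. The `hl.ocr.*` parameter names and the `hl.method` value are taken from the snippet as posted.

```python
from urllib.parse import urlencode


def ocr_highlight_params(query, ocr_field):
    """Return request parameters mirroring the Solarium calls quoted above.

    Hypothetical helper: it deliberately sends no `hl.fl`, which mirrors
    the `$hl->clearFields()` call, and only asks for OCR highlighting on
    the one field known to support it.
    """
    return [
        ("q", query),
        ("hl", "true"),                        # highlighting component enabled
        ("hl.ocr.fl", ocr_field),              # OCR field to highlight
        ("hl.ocr.absoluteHighlights", "on"),   # value as used in the snippet
        ("hl.method", "UnifiedHighlighter"),   # value as used in the snippet
    ]


# Assumed example values; the real Solr field name comes from the field lookup.
params = ocr_highlight_params("ocr_text:März", "ocr_text")
print(urlencode(params))
```

Keeping the regular `hl.fl` list empty is the design choice the comment describes: the OCR field is never handed to the standard highlighter, only to the plugin.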
Thank you for clarifying. I don't think that this will be a performance issue. But we have to keep an eye on the storage. |
I'm not sure if I'm understanding this correctly, but you do not need two separate fields with the Solr plugin. The markup is only used by the plugin to do the initial indexing (from which point on Solr/Lucene only sees the terms extracted from the markup) and to extract the bounding boxes during highlighting. |
@jbaiter Thanks for the explanation. I have switched to indexing the OCR file content and noticed one problem. When the OCR field is not set, I get this error:
Is it not possible to index a document without the OCR field included? If not, is there some kind of placeholder that can be used for documents which don't have OCR? |
@beatrycze-volk Did the document not include the OCR field at all or did you pass in an empty string as the field value? I tried it in a unit test just now, and if the field is simply missing from the document everything works. |
@jbaiter The document didn't include the OCR field at all. We are using Schema: schema.xml Document JSON taken by getFields() from the Solarium-created document:
Full error given by SOLR Server:
|
This issue is closed by #673 and will be part of Kitodo.Presentation 3.3. Thanks @beatrycze-volk for all the work on implementing usage of the OCR Highlighting Plugin. |
Many thanks from my side as well. Looking forward to Kitodo.Presentation 3.3.0. |
Description
On using the fulltext search in Kitodo.Presentation we experience the following behaviour:
Reproduction
Steps to reproduce the behaviour:
Expected Behavior
If a search result is found, the text snippet must be found, too.
Environment
Additional Context
The Solr-query which is performed by Kitodo.Presentation through Solarium is the following:
The search term appears twice in it: in q=fulltext:(böse)... and in hl.q=böse. I assume this is why the VectorHighlighter does not work as expected. If you look at the query with debugQuery=on, you see that the initial search term gets modified to perform the search:
If I remove hl.q=böse from the path, the result is as expected (even though "bos" is not a good search result anyway). In Solarium this hl.q= is hard-coded, or at least there is no option to avoid it. But maybe there is another place to configure Solr properly. I'm not enough of a Solr expert :-(
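The two request variants compared in this report can be sketched as follows. This is a minimal illustration under stated assumptions, not the query Kitodo.Presentation actually sends: the helper name `build_select_params` and the use of `fulltext` as the `hl.fl` value are assumptions. `q`, `hl`, `hl.fl`, `hl.q` and `debugQuery` are standard Solr request parameters.

```python
from urllib.parse import urlencode, parse_qs


def build_select_params(term, field="fulltext", separate_hl_query=True):
    """Build Solr select parameters for a highlighted full-text search.

    Hypothetical helper. With separate_hl_query=True the raw term is also
    sent as hl.q (the behaviour described in this report); with False,
    hl.q is omitted and the highlighter falls back to the main query.
    """
    params = {
        "q": f"{field}:({term})",
        "hl": "true",
        "hl.fl": field,        # assumed: highlight the same field that is searched
        "debugQuery": "on",    # shows how the term was rewritten by the analyzers
    }
    if separate_hl_query:
        params["hl.q"] = term
    return params


with_hl_q = build_select_params("böse")
without_hl_q = build_select_params("böse", separate_hl_query=False)

print(urlencode(with_hl_q))
print(urlencode(without_hl_q))
```

Sending both variants with `debugQuery=on` and comparing the parsed query in the debug section should make it easier to tell whether the separate `hl.q` is what prevents the snippet from being returned.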