
Solr highlighting doesn't work with certain search terms. #502

Closed
albig opened this issue Apr 8, 2020 · 22 comments
Labels
🐛 bug A non-security related bug.

@albig
Collaborator

albig commented Apr 8, 2020

Description

When using the fulltext search in Kitodo.Presentation, we experience the following behaviour:

  • Searching for "gut" or "Leipzig" returns results with highlighted text snippets
  • Searching for "böse" or "Dresden" returns results but no highlighted text snippets

Reproduction

Steps to reproduce the behaviour:

  1. Go to Börsenblatt digital: Recherche
  2. Search for "gut", "böse", "Leipzig", "Dresden"

Expected Behavior

If a search result is found, the corresponding text snippet must be returned as well.

Environment

  • OS version: [Debian Linux 10.3]
  • RDBMS version: [MariaDB 10.3]
  • Apache Solr version: [e.g. 7.7]
  • TYPO3 version: [e.g. 9.5]
  • PHP version: [e.g. 7.3]

Additional Context

The Solr query performed by Kitodo.Presentation through Solarium is the following:

webapp=/solr path=/select params={json.nl=flat&hl=true&fl=uid,id,toplevel,thumbnail,page,type,title_tsi,volume_usu,author_tsi,place_tsi,year_usi,localtermsofuse_usi,useandreproduction_usi&start=0&sort=score+desc&fq=uid:15044&rows=10000&q=fulltext:(böse)+OR+toplevel:true&hl.useFastVectorHighlighter=true&omitHeader=true&hl.q=böse&hl.fl=fulltext&wt=json} hits=3 status=0 QTime=4

The search term appears twice in the query:

q=fulltext:(böse)... and hl.q=böse.

I assume this is why the FastVectorHighlighter does not work as expected. If you look at the query with debugQuery=on, you can see that the initial search term gets modified when the search is performed:

{
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"15044PHYS_0047",
        "uid":15044,
        "page":47,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000047.tif.small.jpg",
        "toplevel":false,
        "type":"page"},
      {
        "id":"15044PHYS_0035",
        "uid":15044,
        "page":35,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000035.tif.small.jpg",
        "toplevel":false,
        "type":"page"},
      {
        "id":"15044LOG_0739",
        "uid":15044,
        "page":1,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000001.tif.small.jpg",
        "toplevel":true,
        "type":"issue",
        "title_tsi":[""],
        "year_usi":["1908-02-19"],
        "useandreproduction_usi":["Public Domain Mark 1.0"]}]
  },
  "highlighting":{
    "15044PHYS_0047":{},
    "15044PHYS_0035":{},
    "15044LOG_0739":{}},
  "debug":{
    "rawquerystring":"fulltext:(böse) OR toplevel:true",
    "querystring":"fulltext:(böse) OR toplevel:true",
    "parsedquery":"(+fulltext:bos) toplevel:true",
    "parsedquery_toString":"(+fulltext:bos) toplevel:T",
    "explain":{
      "15044PHYS_0047":"\n5.4130793 = sum of:\n  5.4130793 = weight(fulltext:bos in 36955) [SchemaSimilarity], result of:\n    5.4130793 = score(doc=36955,freq=1.0 = termFreq=1.0\n), product of:\n      4.627072 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        9047.0 = docFreq\n        924783.0 = docCount\n      1.1698714 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        1227.8043 = avgFieldLength\n        792.0 = fieldLength\n",
      "15044PHYS_0035":"\n4.708341 = sum of:\n  4.708341 = weight(fulltext:bos in 36943) [SchemaSimilarity], result of:\n    4.708341 = score(doc=36943,freq=1.0 = termFreq=1.0\n), product of:\n      4.627072 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        9047.0 = docFreq\n        924783.0 = docCount\n      1.0175638 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        1227.8043 = avgFieldLength\n        1176.0 = fieldLength\n",
      "15044LOG_0739":"\n3.4155326 = sum of:\n  3.4155326 = weight(toplevel:T in 36908) [SchemaSimilarity], result of:\n    3.4155326 = score(doc=36908,freq=1.0 = termFreq=1.0\n), product of:\n      3.4155326 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        31637.0 = docFreq\n        962828.0 = docCount\n      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.0 = parameter b (norms omitted for field)\n"},
    "QParser":"LuceneQParser",
    "filter_queries":["uid:15044"],
    "parsed_filter_queries":["+IndexOrDocValuesQuery(uid:[15044 TO 15044])"],
    "timing":{
      "time":1.0,
      "prepare":{
        "time":0.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":0.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":0.0}},
      "process":{
        "time":1.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":0.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":0.0}}}}}

If I remove the hl.q=böse from the path, the result is as expected (even though "bos" is not a good search result anyway).

{
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"15044PHYS_0047",
        "uid":15044,
        "page":47,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000047.tif.small.jpg",
        "toplevel":false,
        "type":"page"},
      {
        "id":"15044PHYS_0035",
        "uid":15044,
        "page":35,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000035.tif.small.jpg",
        "toplevel":false,
        "type":"page"},
      {
        "id":"15044LOG_0739",
        "uid":15044,
        "page":1,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000001.tif.small.jpg",
        "toplevel":true,
        "type":"issue",
        "title_tsi":[""],
        "year_usi":["1908-02-19"],
        "useandreproduction_usi":["Public Domain Mark 1.0"]}]
  },
  "highlighting":{
    "15044PHYS_0047":{
      "fulltext":["Rsvus äs xlrotOArspliis, ksris ki^äselirift voor <em>bos</em>^- bidliotlrslcwsseu, Vutvverpsu. Ois xrsplrisolis"]},
    "15044PHYS_0035":{
      "fulltext":["Bskts!) *?1inins Loonnä., natur. bist., sä. Lla^sr- <em>boS</em>. Vol. III—V. *WnstiNLnn, Lpraobäuininbsiten. 3. ^"]},
    "15044LOG_0739":{}},
  "debug":{
    "rawquerystring":"fulltext:(böse) OR toplevel:true",
    "querystring":"fulltext:(böse) OR toplevel:true",
    "parsedquery":"(+fulltext:bos) toplevel:true",
    "parsedquery_toString":"(+fulltext:bos) toplevel:T",
    "explain":{
      "15044PHYS_0047":"\n5.4130793 = sum of:\n  5.4130793 = weight(fulltext:bos in 36955) [SchemaSimilarity], result of:\n    5.4130793 = score(doc=36955,freq=1.0 = termFreq=1.0\n), product of:\n      4.627072 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        9047.0 = docFreq\n        924783.0 = docCount\n      1.1698714 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        1227.8043 = avgFieldLength\n        792.0 = fieldLength\n",
      "15044PHYS_0035":"\n4.708341 = sum of:\n  4.708341 = weight(fulltext:bos in 36943) [SchemaSimilarity], result of:\n    4.708341 = score(doc=36943,freq=1.0 = termFreq=1.0\n), product of:\n      4.627072 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        9047.0 = docFreq\n        924783.0 = docCount\n      1.0175638 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        1227.8043 = avgFieldLength\n        1176.0 = fieldLength\n",
      "15044LOG_0739":"\n3.4155326 = sum of:\n  3.4155326 = weight(toplevel:T in 36908) [SchemaSimilarity], result of:\n    3.4155326 = score(doc=36908,freq=1.0 = termFreq=1.0\n), product of:\n      3.4155326 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        31637.0 = docFreq\n        962828.0 = docCount\n      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.0 = parameter b (norms omitted for field)\n"},
    "QParser":"LuceneQParser",
    "filter_queries":["uid:15044"],
    "parsed_filter_queries":["+IndexOrDocValuesQuery(uid:[15044 TO 15044])"],
    "timing":{
      "time":4.0,
      "prepare":{
        "time":0.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":0.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":0.0}},
      "process":{
        "time":4.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":2.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":1.0}}}}}
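The debug output shows the query term being rewritten from "böse" to "bos" by the index analyzer, while hl.q still carries the raw term. The following is a rough Python sketch of such an analysis chain; the actual filters live in the Solr schema, and the folding and stemming steps here are assumptions for illustration only:

```python
import unicodedata

def fold_ascii(term: str) -> str:
    # Strip diacritics, roughly like Solr's ASCIIFoldingFilter would (assumed).
    return unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode()

def toy_stem(term: str) -> str:
    # Toy stand-in for a German stemmer that reduces "bose" to "bos".
    return term[:-1] if term.endswith("e") else term

raw = "böse"
analyzed = toy_stem(fold_ascii(raw.lower()))
print(f"{raw} -> {analyzed}")  # böse -> bos
```

If the highlighter ends up comparing against a form of the term that differs from what the index analyzer produced, no snippet can match, which would be consistent with the empty "highlighting" section above.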

In Solarium this hl.q= parameter is hard-coded, or at least there is no option to avoid it. But maybe there is another place where Solr could be configured properly. I'm not enough of a Solr expert :-(

@albig albig added the 🐛 bug A non-security related bug. label Apr 8, 2020
@sebastian-meyer
Member

This seems to be caused by a mismatch of the filter configuration in Solr for queries and result sets.

Since we are working on integrating Kitodo.Presentation with dbmdz/solr-ocrhighlighting, I suggest we wait and see whether this issue still exists afterwards.

@sebastian-meyer sebastian-meyer self-assigned this Apr 14, 2020
@wrznr

wrznr commented Sep 30, 2020

In the meantime, this defect has been reported by a number of boersenblatt-digital users. Any chance of a due date for Kitodo.Presentation 3.2.0 (which I could report back to the aforementioned users)?

@sebastian-meyer
Member

We don't have a due date yet. But this bug affects the Zeitungsportal as well, so we need a fix by March 2021 at the latest.

@wrznr

wrznr commented Jan 21, 2021

Almost there... By the way, I was wondering whether the move to dbmdz/solr-ocrhighlighting will also help with the missing highlighting for complex search queries (e.g. those using wildcards)?

@sebastian-meyer
Member

I've added @beatrycze-volk. She'll be the one implementing dbmdz/solr-ocrhighlighting. I suggest refactoring the search plugin and thereby addressing this issue as well. But that's up to her.

@wrznr

wrznr commented Apr 18, 2021

Hi @beatrycze-volk, any updates on this issue?

@sebastian-meyer wrote:

we need a fix latest in March 2021

@beatrycze-volk
Collaborator

Hi @wrznr,

the change is more or less ready, but we encountered an indexing problem. You can see all the details here: #587

@wrznr

wrznr commented Apr 19, 2021

So, we are waiting for dbmdz/solr-ocrhighlighting#49 to be fixed. Since the last change on that issue dates back to November 2020, do you think it would be helpful to approach @jbaiter and ask about the current state of affairs? I'd volunteer to do this...

@beatrycze-volk
Collaborator

Sure, thanks, that would be very useful! It would help us know whether we should wait or try to find another solution.

@sebastian-meyer
Member

sebastian-meyer commented Apr 19, 2021

I am not quite sure if we really need dbmdz/solr-ocrhighlighting#49. While loading OCR files via URL would be the easiest solution to implement, I have concerns regarding performance - as every single highlighted snippet would trigger requesting an OCR file via HTTP.
But @jbaiter mentions in dbmdz/solr-ocrhighlighting#49 (comment), that there already is the option to store the full OCR in a Solr field and use that for highlighting. Maybe that would be the better option in our case? It would result in a bigger index, but should overall perform better.

@jbaiter

jbaiter commented Apr 19, 2021

But @jbaiter mentions in dbmdz/solr-ocrhighlighting#49 (comment), that there already is the option to store the full OCR in a Solr field and use that for highlighting. Maybe that would be the better option in our case? It would result in a bigger index, but should overall perform better.

This is currently the recommended way to go if you can't load the OCR from disk. You can reduce the size the OCR takes up in the index by switching to a less verbose format, e.g. from ALTO to hOCR, or even better, to the custom MiniOCR format which was designed to take up as little space as possible (i.e. to only contain as much information as we need for highlighting).
Maybe @DiegoPino can chime in, his team used this approach for https://github.com/esmero/.
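For reference, a minimal MiniOCR page looks roughly like the following. The element names (`<p>` page, `<b>` block, `<l>` line, `<w>` word with an `x="x y w h"` coordinate attribute) follow the plugin's documentation; the page id, dimensions, coordinates, and words here are made up for illustration:

```xml
<p xml:id="page_0001" wh="2000 2800">
  <b>
    <l><w x="0.10 0.05 0.08 0.01">Dresden</w> <w x="0.20 0.05 0.05 0.01">1908</w></l>
  </b>
</p>
```

Compared to ALTO or hOCR, this keeps only what the highlighter needs (word text plus bounding box), which is why it takes up so much less space in the index.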

@beatrycze-volk
Collaborator

As I already mentioned in the PR, storing the file content is a potential solution. Now the question is whether we do it, and if so, should this change be part of the existing PR or a new one?

@albig
Collaborator Author

albig commented Apr 19, 2021

I would like to merge only working code into the master of Kitodo.Presentation. The current PR does not index the documents anymore, so I suggest enhancing it to store the OCR in a suitable format.

What I do not understand right now: Do we have the OCR fulltext twice in the index with our current schema?

@sebastian-meyer
Member

What I do not understand right now: Do we have the OCR fulltext twice in the index with our current schema?

Yes, we would have to store it twice, once with markup for highlighting and retrieval of word coordinates and once as plain text for searching. But only the latter has to be indexed (i. e. searchable), so it shouldn't be a performance issue.

@DiegoPino

@jbaiter @beatrycze-volk @albig, sorry for the delay, the work/code day (second coffee) is just starting here in NYC. Yes, we use MiniOCR generated from hOCR (Tesseract), stored directly in a Solr field instead of loading files (because of S3 etc., and we're really happy with how it is working), and we also use Solarium since we are PHP-based. Before I can understand what is wrong (or whether we suffer from the same issue), I may need to index one of your PDFs (which is happening now as we speak, using Tesseract with the deu language, but a bit slow, not sure why) and test. But something I saw here makes me think you may have a problem with your query/indexing/processing pipeline rather than with the plugin itself.

We totally separate our hOCR highlight queries from our other queries. Mixing highlighting is messy and documented as not working; from the plugin's documentation:

Highlighting Non-OCR Fields
One unfortunate side effect of the way the plugin works is that you need to pass non-OCR fields to be highlighted explicitly via the hl.fl parameter. By default, Solr falls back on highlighting all stored fields if the parameter is not present, which no longer works if this plugin is used.

Basically we do this:

  // This part is specific to us: we use Drupal, so we have a native query that wraps the Solarium one.
  $solr_field_names = $query->getIndex()->getServerInstance()->getBackend()->getSolrFieldNames($query->getIndex());
  // Check whether our fields include `ocr_text`, which is defined as type text_ocr_stored (see below).
  if (isset($solr_field_names['ocr_text'])) {
    /* @var \Solarium\Component\Highlighting\Highlighting $hl */
    $hl = $solarium_query->getHighlighting();
    // hl.fl has issues if ocr_text is in that list (large token offset errors),
    // so we clear all regular highlight fields and focus on what we need.
    $hl->clearFields();
    // Pass the arguments for OCR highlighting only, requesting it just for the
    // Solr field we know supports it.
    $solarium_query->addParam('hl.ocr.fl', $solr_field_names['ocr_text']);
    $solarium_query->addParam('hl.ocr.absoluteHighlights', 'on');
    $solarium_query->addParam('hl.method', 'UnifiedHighlighter');
  }

This is our field type (Solr 8.7; we also have a few 8.8.2 instances around):

Type: text_ocr_stored (solr.TextField, luceneMatchVersion: 8.7.0)
Index analyzer (TokenizerChain):
  Char filter: de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory
  Tokenizer:   solr.WhitespaceTokenizerFactory
  Token filters:
    solr.LowerCaseFilterFactory
    solr.StopFilterFactory (words: stopwords_und.txt, ignoreCase)
    solr.PorterStemFilterFactory

Update: finally finished the hOCR (the longest running time for a single page in my library experience).

Highlight works well for "März"

[screenshot]

And this is our query

webapp=/solr path=/select params={json.nl=flat&hl=true&TZ=America/New_York&fl=*,score&hl.requireFieldMatch=false&start=0&hl.fragsize=0&sort=score+desc&fq=ss_parent_id:"558"&fq=ss_search_api_datasource:"strawberryfield_flavor_datasource"&fq=ss_processor_id:"ocr"&fq=%2Bindex_id:default_solr_index+%2Bhash:5qc1nk&fq=ss_search_api_language:("en"+"und")&rows=20&hl.simple.pre=[HIGHLIGHT]&hl.snippets=3&q={!boost+b%3Dboost_document}++(tcocr_highlightm_X3b_en_ocr_text:(%2B"März")+tcocr_highlightm_X3b_und_ocr_text:(%2B"März"))&hl.mergeContiguous=false&hl.ocr.absoluteHighlights=on&hl.simple.post=[/HIGHLIGHT]&omitHeader=true&hl.method=UnifiedHighlighter&hl.ocr.fl=tcocr_highlightm_X3b_und_ocr_text&wt=json} hits=1 status=0 QTime=3

@albig
Collaborator Author

albig commented Apr 19, 2021

What I do not understand right now: Do we have the OCR fulltext twice in the index with our current schema?

Yes, we would have to store it twice, once with markup for highlighting and retrieval of word coordinates and once as plain text for searching. But only the latter has to be indexed (i. e. searchable), so it shouldn't be a performance issue.

Thank you for clarifying. I don't think that this will be a performance issue. But we have to keep an eye on the storage.

@jbaiter

jbaiter commented Apr 19, 2021

What I do not understand right now: Do we have the OCR fulltext twice in the index with our current schema?

Yes, we would have to store it twice, once with markup for highlighting and retrieval of word coordinates and once as plain text for searching. But only the latter has to be indexed (i. e. searchable), so it shouldn't be a performance issue.

I'm not sure if I'm understanding this correctly, but you do not need two separate fields with the Solr plugin.
You can search perfectly fine on the field with the OCR markup, this is what the plugin is all about: Doing the highlighting at the same time as the actual querying, to ensure that highlighting takes into account things like stemming, wildcards, multi-term queries, etc.

The markup is only used by the plugin to do the initial indexing (from which point on Solr/Lucene only sees the terms extracted from the markup) and to extract the bounding boxes during highlighting.

@beatrycze-volk
Collaborator

@jbaiter Thanks for the explanation. I have changed the code to index the OCR file content and observed one problem. When the OCR field is not set, I get this error:

Caused by: java.lang.RuntimeException: Could not determine OCR format from chunk: 
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.lambda$create$1(OcrCharFilterFactory.java:36)
	at java.util.Optional.orElseThrow(Optional.java:290)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.create(OcrCharFilterFactory.java:35)

Is it not possible to index a document without the OCR field included? If not, is there some kind of placeholder that can be used for documents which don't have OCR?

@jbaiter

jbaiter commented Apr 27, 2021

@beatrycze-volk Did the document not include the OCR field at all or did you pass in an empty string as the field value? I tried it in a unit test just now, and if the field is simply missing from the document everything works.
Independent of that, I'll see if I can catch empty documents before they cause a problem, thanks for reporting!

Fixed: dbmdz/solr-ocrhighlighting#156

@beatrycze-volk
Collaborator

beatrycze-volk commented Apr 27, 2021

@jbaiter The document didn't include the OCR field at all. We are using createDocument() from src/QueryType/Update/Query/Query.php and set the fields via the setField() method. We simply omit setting the OCR field. The update() method then throws the error given above.

Schema: schema.xml

Document JSON taken via getFields() from the Solarium-created document:

{
   "id":"2LOG_0000",
   "uid":2,
   "pid":5,
   "page":1,
   "thumbnail":"",
   "partof":0,
   "root":0,
   "sid":"LOG_0000",
   "toplevel":true,
   "type":"newspaper",
   "title":"The Daily record and the Dresden daily",
   "record_id":"oai:de:slub-dresden:db:id-416971482",
   "purl":"http://digital.slub-dresden.de/id416971482",
   "location":"https://digital.slub-dresden.de/data/kitodo/TheDarea_416971482-19100223/TheDarea_416971482-19100223_anchor.xml",
   "urn":[
      "urn:nbn:de:bsz:14-db-id4169714826"
   ],
   "license":[
      
   ],
   "terms":[
      
   ],
   "restrictions":[
      
   ],
   "collection":[
      
   ],
   "title_tsi":[
      "The Daily record and the Dresden daily"
   ],
   "title_sorting":"The Daily record and the Dresden daily",
   "place_tsi":[
      "Dresden"
   ],
   "place_sorting":"Dresden",
   "place_faceting":[
      "Dresden"
   ],
   "record_id_uui":[
      "oai:de:slub-dresden:db:id-416971482"
   ],
   "urn_uui":[
      "urn:nbn:de:bsz:14-db-id4169714826"
   ],
   "purl_uuu":[
      "http://digital.slub-dresden.de/id416971482"
   ],
   "type_usu":[
      "newspaper"
   ],
   "type_faceting":[
      "newspaper"
   ],
   "document_format_uuu":[
      "METS"
   ],
   "language_uui":[
      "eng"
   ],
   "language_faceting":[
      "eng"
   ],
   "autocomplete":[
      "The Daily record and the Dresden daily"
   ]
}

Full error given by the Solr server:

org.apache.solr.common.SolrException: Exception writing document id 2LOG_0000 to the index; possible analysis error.
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:254)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:291)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
	at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
	at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:802)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:579)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:420)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:352)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
	at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:500)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Could not determine OCR format from chunk: 
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.lambda$create$1(OcrCharFilterFactory.java:36)
	at java.util.Optional.orElseThrow(Optional.java:290)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.create(OcrCharFilterFactory.java:35)
	at org.apache.solr.analysis.TokenizerChain.initReader(TokenizerChain.java:97)
	at org.apache.lucene.analysis.AnalyzerWrapper.initReader(AnalyzerWrapper.java:156)
	at org.apache.lucene.analysis.AnalyzerWrapper.initReader(AnalyzerWrapper.java:156)
	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:197)
	at org.apache.lucene.document.Field.tokenStream(Field.java:513)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:806)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
	at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:979)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:345)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:292)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:239)

@albig
Collaborator Author

albig commented Oct 12, 2021

This issue is closed by #673; the fix will be part of Kitodo.Presentation 3.3. Thanks @beatrycze-volk for all the work on implementing the OCR highlighting plugin.

@wrznr

wrznr commented Oct 12, 2021

Many thanks from my side as well. Looking forward to Kitodo.Presentation 3.3.0.
