
Solr highlighting doesn't work with certain search terms. #502

Closed
albig opened this issue Apr 8, 2020 · 22 comments
Labels
🐛 bug A non-security related bug.

@albig
Collaborator

albig commented Apr 8, 2020

Description

When using the fulltext search in Kitodo.Presentation, we experience the following behaviour:

  • Searching for "gut" or "Leipzig" returns results with highlighted text snippets
  • Searching for "böse" or "Dresden" returns results but no highlighted text snippets

Reproduction

Steps to reproduce the behaviour:

  1. Go to Börsenblatt digital: Recherche
  2. Search for "gut", "böse", "Leipzig", "Dresden"

Expected Behavior

If a search result is found, the corresponding text snippet must be returned as well.

Environment

  • OS version: [Debian Linux 10.3]
  • RDBMS version: [MariaDB 10.3]
  • Apache Solr version: [e.g. 7.7]
  • TYPO3 version: [e.g. 9.5]
  • PHP version: [e.g. 7.3]

Additional Context

The Solr query performed by Kitodo.Presentation through Solarium is the following:

webapp=/solr path=/select params={json.nl=flat&hl=true&fl=uid,id,toplevel,thumbnail,page,type,title_tsi,volume_usu,author_tsi,place_tsi,year_usi,localtermsofuse_usi,useandreproduction_usi&start=0&sort=score+desc&fq=uid:15044&rows=10000&q=fulltext:(böse)+OR+toplevel:true&hl.useFastVectorHighlighter=true&omitHeader=true&hl.q=böse&hl.fl=fulltext&wt=json} hits=3 status=0 QTime=4

The search term appears twice in the query:

q=fulltext:(böse)... and hl.q=böse.

I assume this is why the FastVectorHighlighter does not work as expected. If you look at the query with debugQuery=on, you can see that the initial search term gets modified when the search is performed:

{
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"15044PHYS_0047",
        "uid":15044,
        "page":47,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000047.tif.small.jpg",
        "toplevel":false,
        "type":"page"},
      {
        "id":"15044PHYS_0035",
        "uid":15044,
        "page":35,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000035.tif.small.jpg",
        "toplevel":false,
        "type":"page"},
      {
        "id":"15044LOG_0739",
        "uid":15044,
        "page":1,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000001.tif.small.jpg",
        "toplevel":true,
        "type":"issue",
        "title_tsi":[""],
        "year_usi":["1908-02-19"],
        "useandreproduction_usi":["Public Domain Mark 1.0"]}]
  },
  "highlighting":{
    "15044PHYS_0047":{},
    "15044PHYS_0035":{},
    "15044LOG_0739":{}},
  "debug":{
    "rawquerystring":"fulltext:(böse) OR toplevel:true",
    "querystring":"fulltext:(böse) OR toplevel:true",
    "parsedquery":"(+fulltext:bos) toplevel:true",
    "parsedquery_toString":"(+fulltext:bos) toplevel:T",
    "explain":{
      "15044PHYS_0047":"\n5.4130793 = sum of:\n  5.4130793 = weight(fulltext:bos in 36955) [SchemaSimilarity], result of:\n    5.4130793 = score(doc=36955,freq=1.0 = termFreq=1.0\n), product of:\n      4.627072 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        9047.0 = docFreq\n        924783.0 = docCount\n      1.1698714 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        1227.8043 = avgFieldLength\n        792.0 = fieldLength\n",
      "15044PHYS_0035":"\n4.708341 = sum of:\n  4.708341 = weight(fulltext:bos in 36943) [SchemaSimilarity], result of:\n    4.708341 = score(doc=36943,freq=1.0 = termFreq=1.0\n), product of:\n      4.627072 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        9047.0 = docFreq\n        924783.0 = docCount\n      1.0175638 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        1227.8043 = avgFieldLength\n        1176.0 = fieldLength\n",
      "15044LOG_0739":"\n3.4155326 = sum of:\n  3.4155326 = weight(toplevel:T in 36908) [SchemaSimilarity], result of:\n    3.4155326 = score(doc=36908,freq=1.0 = termFreq=1.0\n), product of:\n      3.4155326 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        31637.0 = docFreq\n        962828.0 = docCount\n      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.0 = parameter b (norms omitted for field)\n"},
    "QParser":"LuceneQParser",
    "filter_queries":["uid:15044"],
    "parsed_filter_queries":["+IndexOrDocValuesQuery(uid:[15044 TO 15044])"],
    "timing":{
      "time":1.0,
      "prepare":{
        "time":0.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":0.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":0.0}},
      "process":{
        "time":1.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":0.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":0.0}}}}}

If I remove the hl.q=böse from the path, the result is as expected (even though "bos" is not a good search result anyway).

{
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"15044PHYS_0047",
        "uid":15044,
        "page":47,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000047.tif.small.jpg",
        "toplevel":false,
        "type":"page"},
      {
        "id":"15044PHYS_0035",
        "uid":15044,
        "page":35,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000035.tif.small.jpg",
        "toplevel":false,
        "type":"page"},
      {
        "id":"15044LOG_0739",
        "uid":15044,
        "page":1,
        "thumbnail":"https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-19080219/Brsfded_39946221X-19080219_tif/jpegs/00000001.tif.small.jpg",
        "toplevel":true,
        "type":"issue",
        "title_tsi":[""],
        "year_usi":["1908-02-19"],
        "useandreproduction_usi":["Public Domain Mark 1.0"]}]
  },
  "highlighting":{
    "15044PHYS_0047":{
      "fulltext":["Rsvus äs xlrotOArspliis, ksris ki^äselirift voor <em>bos</em>^- bidliotlrslcwsseu, Vutvverpsu. Ois xrsplrisolis"]},
    "15044PHYS_0035":{
      "fulltext":["Bskts!) *?1inins Loonnä., natur. bist., sä. Lla^sr- <em>boS</em>. Vol. III—V. *WnstiNLnn, Lpraobäuininbsiten. 3. ^"]},
    "15044LOG_0739":{}},
  "debug":{
    "rawquerystring":"fulltext:(böse) OR toplevel:true",
    "querystring":"fulltext:(böse) OR toplevel:true",
    "parsedquery":"(+fulltext:bos) toplevel:true",
    "parsedquery_toString":"(+fulltext:bos) toplevel:T",
    "explain":{
      "15044PHYS_0047":"\n5.4130793 = sum of:\n  5.4130793 = weight(fulltext:bos in 36955) [SchemaSimilarity], result of:\n    5.4130793 = score(doc=36955,freq=1.0 = termFreq=1.0\n), product of:\n      4.627072 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        9047.0 = docFreq\n        924783.0 = docCount\n      1.1698714 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        1227.8043 = avgFieldLength\n        792.0 = fieldLength\n",
      "15044PHYS_0035":"\n4.708341 = sum of:\n  4.708341 = weight(fulltext:bos in 36943) [SchemaSimilarity], result of:\n    4.708341 = score(doc=36943,freq=1.0 = termFreq=1.0\n), product of:\n      4.627072 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        9047.0 = docFreq\n        924783.0 = docCount\n      1.0175638 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        1227.8043 = avgFieldLength\n        1176.0 = fieldLength\n",
      "15044LOG_0739":"\n3.4155326 = sum of:\n  3.4155326 = weight(toplevel:T in 36908) [SchemaSimilarity], result of:\n    3.4155326 = score(doc=36908,freq=1.0 = termFreq=1.0\n), product of:\n      3.4155326 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        31637.0 = docFreq\n        962828.0 = docCount\n      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:\n        1.0 = termFreq=1.0\n        1.2 = parameter k1\n        0.0 = parameter b (norms omitted for field)\n"},
    "QParser":"LuceneQParser",
    "filter_queries":["uid:15044"],
    "parsed_filter_queries":["+IndexOrDocValuesQuery(uid:[15044 TO 15044])"],
    "timing":{
      "time":4.0,
      "prepare":{
        "time":0.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":0.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":0.0}},
      "process":{
        "time":4.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":2.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":1.0}}}}}
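The debug output shows the query term being rewritten from "böse" to "bos" by the index analyzer, while hl.q still carries the raw term. The following is a rough Python sketch of such an analysis chain; the actual filters live in the Solr schema, and the folding and stemming steps here are assumptions for illustration only:

```python
import unicodedata

def fold_ascii(term: str) -> str:
    # Strip diacritics, roughly like Solr's ASCIIFoldingFilter would (assumed).
    return unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode()

def toy_stem(term: str) -> str:
    # Toy stand-in for a German stemmer that reduces "bose" to "bos".
    return term[:-1] if term.endswith("e") else term

raw = "böse"
analyzed = toy_stem(fold_ascii(raw.lower()))
print(f"{raw} -> {analyzed}")  # böse -> bos
```

If the highlighter ends up comparing against a form of the term that differs from what the index analyzer produced, no snippet can match, which would be consistent with the empty "highlighting" section above.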

In Solarium this hl.q= parameter is hard-coded, or at least there is no option to avoid it. But maybe there is another place where Solr could be configured properly. I'm not enough of a Solr expert :-(

@albig albig added the 🐛 bug A non-security related bug. label Apr 8, 2020
@sebastian-meyer
Member

This seems to be caused by a mismatch of the filter configuration in Solr for queries and result sets.

Since we are working on integrating Kitodo.Presentation with dbmdz/solr-ocrhighlighting, I suggest we wait and see whether this issue still exists afterwards.

@sebastian-meyer sebastian-meyer self-assigned this Apr 14, 2020
@wrznr

wrznr commented Sep 30, 2020

In the meantime, this defect has been reported by a number of boersenblatt-digital users. Any chance of a due date for Kitodo.Presentation 3.2.0 (which I could report back to the aforementioned users)?

@sebastian-meyer
Member

We don't have a due date yet. But this bug affects the Zeitungsportal as well, so we need a fix by March 2021 at the latest.

@wrznr

wrznr commented Jan 21, 2021

Almost there... By the way, I was wondering whether the move to dbmdz/solr-ocrhighlighting will also help with the missing highlighting for complex search queries (e.g. those using wildcards)?

@sebastian-meyer
Member

I've added @beatrycze-volk. She'll be the one implementing dbmdz/solr-ocrhighlighting. I suggest refactoring the search plugin and thereby addressing this issue as well. But that's up to her.

@wrznr

wrznr commented Apr 18, 2021

Hi @beatrycze-volk, any updates on this issue?

@sebastian-meyer wrote:

we need a fix latest in March 2021

@beatrycze-volk
Collaborator

Hi @wrznr,

the change is more or less ready, but we encountered an indexing problem. You can see all the details here: #587

@wrznr

wrznr commented Apr 19, 2021

So, we are waiting for dbmdz/solr-ocrhighlighting#49 to be fixed. Since the last change on that issue dates back to November 2020, do you think it would be helpful to approach @jbaiter and ask about the current state of affairs? I'd volunteer to do this...

@beatrycze-volk
Collaborator

Sure, thanks, that would be very useful! It would help us know whether we should wait or try to find another solution.

@sebastian-meyer
Member

sebastian-meyer commented Apr 19, 2021

I am not quite sure if we really need dbmdz/solr-ocrhighlighting#49. While loading OCR files via URL would be the easiest solution to implement, I have concerns regarding performance - as every single highlighted snippet would trigger requesting an OCR file via HTTP.
But @jbaiter mentions in dbmdz/solr-ocrhighlighting#49 (comment), that there already is the option to store the full OCR in a Solr field and use that for highlighting. Maybe that would be the better option in our case? It would result in a bigger index, but should overall perform better.

@jbaiter

jbaiter commented Apr 19, 2021

But @jbaiter mentions in dbmdz/solr-ocrhighlighting#49 (comment), that there already is the option to store the full OCR in a Solr field and use that for highlighting. Maybe that would be the better option in our case? It would result in a bigger index, but should overall perform better.

This is currently the recommended way to go if you can't load the OCR from disk. You can reduce the size the OCR takes up in the index by switching to a less verbose format, e.g. from ALTO to hOCR, or even better, to the custom MiniOCR format which was designed to take up as little space as possible (i.e. to only contain as much information as we need for highlighting).
Maybe @DiegoPino can chime in, his team used this approach for https://github.com/esmero/.
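For reference, a minimal MiniOCR page looks roughly like the following. The element names (`<p>` page, `<b>` block, `<l>` line, `<w>` word with an `x="x y w h"` coordinate attribute) follow the plugin's documentation; the page id, dimensions, coordinates, and words here are made up for illustration:

```xml
<p xml:id="page_0001" wh="2000 2800">
  <b>
    <l><w x="0.10 0.05 0.08 0.01">Dresden</w> <w x="0.20 0.05 0.05 0.01">1908</w></l>
  </b>
</p>
```

Compared to ALTO or hOCR, this keeps only what the highlighter needs (word text plus bounding box), which is why it takes up so much less space in the index.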

@beatrycze-volk
Collaborator

As I already mentioned in the PR, storing the file content is a potential solution. Now the question is whether we do it, and if so, should this change be part of the existing PR or a new one?

@albig
Collaborator Author

albig commented Apr 19, 2021

I would like to merge only working code into the master of Kitodo.Presentation. The current PR does not index the documents anymore, so I suggest enhancing it to store the OCR in a suitable format.

What I do not understand right now: Do we have the OCR fulltext twice in the index with our current schema?

@sebastian-meyer
Member

What I do not understand right now: Do we have the OCR fulltext twice in the index with our current schema?

Yes, we would have to store it twice, once with markup for highlighting and retrieval of word coordinates and once as plain text for searching. But only the latter has to be indexed (i. e. searchable), so it shouldn't be a performance issue.

@DiegoPino

@jbaiter @beatrycze-volk @albig, sorry for the delay, the work/code day (second coffee) is just starting here in NYC. Yes, we use MiniOCR generated from hOCR (Tesseract), stored directly in a Solr field instead of loading files (because of S3 etc., and we're really happy with how it is working), and we also use Solarium since we are PHP-based. Before I can understand what is wrong (or whether we suffer from the same issue), I may need to index one of your PDFs (which is happening now as we speak, using Tesseract with the deu language, but a bit slow, not sure why) and test. But something I saw here makes me think you may have a problem with your query/indexing/processing pipeline rather than with the plugin itself.

We totally separate our hOCR highlight queries from our other queries. Mixing highlighting is messy and documented as not working; from the plugin's documentation:

Highlighting Non-OCR Fields
One unfortunate side effect of the way the plugin works is that you need to pass non-OCR fields to be highlighted explicitly via the hl.fl parameter. By default, Solr falls back on highlighting all stored fields if the parameter is not present, which no longer works if this plugin is used.

Basically we do this:

  // This part is specific to us: we use Drupal, so we have a native query that wraps the Solarium one.
  $solr_field_names = $query->getIndex()->getServerInstance()->getBackend()->getSolrFieldNames($query->getIndex());
  // Check whether our fields include `ocr_text`, which is defined as type text_ocr_stored (see below).
  if (isset($solr_field_names['ocr_text'])) {
    /* @var \Solarium\Component\Highlighting\Highlighting $hl */
    $hl = $solarium_query->getHighlighting();
    // hl.fl has issues if ocr_text is in that list (large token offset errors),
    // so we clear all regular highlight fields and focus on what we need.
    $hl->clearFields();
    // Pass the arguments for OCR highlighting only, requesting it just for the
    // Solr field we know supports it.
    $solarium_query->addParam('hl.ocr.fl', $solr_field_names['ocr_text']);
    $solarium_query->addParam('hl.ocr.absoluteHighlights', 'on');
    $solarium_query->addParam('hl.method', 'UnifiedHighlighter');
  }

This is our field type (Solr 8.7; we also have a few 8.8.2 instances around):

Type: text_ocr_stored (solr.TextField, luceneMatchVersion: 8.7.0)
Index analyzer (TokenizerChain):
  Char filter: de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory
  Tokenizer:   solr.WhitespaceTokenizerFactory
  Token filters:
    solr.LowerCaseFilterFactory
    solr.StopFilterFactory (words: stopwords_und.txt, ignoreCase)
    solr.PorterStemFilterFactory

Update: finally finished the hOCR (the longest running time for a single page in my library experience).

Highlight works well for "März"

[screenshot]

And this is our query

webapp=/solr path=/select params={json.nl=flat&hl=true&TZ=America/New_York&fl=*,score&hl.requireFieldMatch=false&start=0&hl.fragsize=0&sort=score+desc&fq=ss_parent_id:"558"&fq=ss_search_api_datasource:"strawberryfield_flavor_datasource"&fq=ss_processor_id:"ocr"&fq=%2Bindex_id:default_solr_index+%2Bhash:5qc1nk&fq=ss_search_api_language:("en"+"und")&rows=20&hl.simple.pre=[HIGHLIGHT]&hl.snippets=3&q={!boost+b%3Dboost_document}++(tcocr_highlightm_X3b_en_ocr_text:(%2B"März")+tcocr_highlightm_X3b_und_ocr_text:(%2B"März"))&hl.mergeContiguous=false&hl.ocr.absoluteHighlights=on&hl.simple.post=[/HIGHLIGHT]&omitHeader=true&hl.method=UnifiedHighlighter&hl.ocr.fl=tcocr_highlightm_X3b_und_ocr_text&wt=json} hits=1 status=0 QTime=3

@albig
Collaborator Author

albig commented Apr 19, 2021

What I do not understand right now: Do we have the OCR fulltext twice in the index with our current schema?

Yes, we would have to store it twice, once with markup for highlighting and retrieval of word coordinates and once as plain text for searching. But only the latter has to be indexed (i. e. searchable), so it shouldn't be a performance issue.

Thank you for clarifying. I don't think that this will be a performance issue. But we have to keep an eye on the storage.

@jbaiter

jbaiter commented Apr 19, 2021

What I do not understand right now: Do we have the OCR fulltext twice in the index with our current schema?

Yes, we would have to store it twice, once with markup for highlighting and retrieval of word coordinates and once as plain text for searching. But only the latter has to be indexed (i. e. searchable), so it shouldn't be a performance issue.

I'm not sure if I'm understanding this correctly, but you do not need two separate fields with the Solr plugin.
You can search perfectly fine on the field with the OCR markup, this is what the plugin is all about: Doing the highlighting at the same time as the actual querying, to ensure that highlighting takes into account things like stemming, wildcards, multi-term queries, etc.

The markup is only used by the plugin to do the initial indexing (from which point on Solr/Lucene only sees the terms extracted from the markup) and to extract the bounding boxes during highlighting.

@beatrycze-volk
Collaborator

@jbaiter Thanks for the explanation. I have changed the code to index the OCR file content and observed one problem. When the OCR field is not set, I get this error:

Caused by: java.lang.RuntimeException: Could not determine OCR format from chunk: 
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.lambda$create$1(OcrCharFilterFactory.java:36)
	at java.util.Optional.orElseThrow(Optional.java:290)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.create(OcrCharFilterFactory.java:35)

Is it not possible to index a document without the OCR field included? If not, is there some kind of placeholder that can be used for documents which don't have OCR?

@jbaiter

jbaiter commented Apr 27, 2021

@beatrycze-volk Did the document not include the OCR field at all or did you pass in an empty string as the field value? I tried it in a unit test just now, and if the field is simply missing from the document everything works.
Independent of that, I'll see if I can catch empty documents before they cause a problem, thanks for reporting!

Fixed: dbmdz/solr-ocrhighlighting#156

@beatrycze-volk
Collaborator

beatrycze-volk commented Apr 27, 2021

@jbaiter The document didn't include the OCR field at all. We are using createDocument() from src/QueryType/Update/Query/Query.php and set the fields via the setField() method. We simply omit setting the OCR field. The update() method then throws the error given above.

Schema: schema.xml

Document JSON taken via getFields() from the Solarium-created document:

{
   "id":"2LOG_0000",
   "uid":2,
   "pid":5,
   "page":1,
   "thumbnail":"",
   "partof":0,
   "root":0,
   "sid":"LOG_0000",
   "toplevel":true,
   "type":"newspaper",
   "title":"The Daily record and the Dresden daily",
   "record_id":"oai:de:slub-dresden:db:id-416971482",
   "purl":"http://digital.slub-dresden.de/id416971482",
   "location":"https://digital.slub-dresden.de/data/kitodo/TheDarea_416971482-19100223/TheDarea_416971482-19100223_anchor.xml",
   "urn":[
      "urn:nbn:de:bsz:14-db-id4169714826"
   ],
   "license":[
      
   ],
   "terms":[
      
   ],
   "restrictions":[
      
   ],
   "collection":[
      
   ],
   "title_tsi":[
      "The Daily record and the Dresden daily"
   ],
   "title_sorting":"The Daily record and the Dresden daily",
   "place_tsi":[
      "Dresden"
   ],
   "place_sorting":"Dresden",
   "place_faceting":[
      "Dresden"
   ],
   "record_id_uui":[
      "oai:de:slub-dresden:db:id-416971482"
   ],
   "urn_uui":[
      "urn:nbn:de:bsz:14-db-id4169714826"
   ],
   "purl_uuu":[
      "http://digital.slub-dresden.de/id416971482"
   ],
   "type_usu":[
      "newspaper"
   ],
   "type_faceting":[
      "newspaper"
   ],
   "document_format_uuu":[
      "METS"
   ],
   "language_uui":[
      "eng"
   ],
   "language_faceting":[
      "eng"
   ],
   "autocomplete":[
      "The Daily record and the Dresden daily"
   ]
}

Full error given by the Solr server:

org.apache.solr.common.SolrException: Exception writing document id 2LOG_0000 to the index; possible analysis error.
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:254)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:291)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
	at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
	at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:802)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:579)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:420)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:352)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
	at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:500)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Could not determine OCR format from chunk: 
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.lambda$create$1(OcrCharFilterFactory.java:36)
	at java.util.Optional.orElseThrow(Optional.java:290)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.create(OcrCharFilterFactory.java:35)
	at org.apache.solr.analysis.TokenizerChain.initReader(TokenizerChain.java:97)
	at org.apache.lucene.analysis.AnalyzerWrapper.initReader(AnalyzerWrapper.java:156)
	at org.apache.lucene.analysis.AnalyzerWrapper.initReader(AnalyzerWrapper.java:156)
	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:197)
	at org.apache.lucene.document.Field.tokenStream(Field.java:513)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:806)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
	at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:979)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:345)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:292)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:239)

@albig
Collaborator Author

albig commented Oct 12, 2021

This issue is closed by #673; the fix will be part of Kitodo.Presentation 3.3. Thanks @beatrycze-volk for all the work on implementing the OCR highlighting plugin.

@wrznr

wrznr commented Oct 12, 2021

Many thanks from my side as well. Looking forward to Kitodo.Presentation 3.3.0.
