Use Solr OCR Highlighting Plugin in Search in Document Plugin #587

Merged
Changes from 49 commits
56 commits
89ac389
Introduce solr highlighting
beatrycze-volk Dec 2, 2020
20ff3ad
Use config for index remapped in SearchInDocumentTool
beatrycze-volk Jan 11, 2021
4b52a83
Use highlighting feature
beatrycze-volk Jan 19, 2021
d831064
Update js file to use new OCR highlighting
beatrycze-volk Jan 25, 2021
731acf8
Change submit in search to click
beatrycze-volk Feb 2, 2021
ca977c0
Move search to own method
beatrycze-volk Feb 2, 2021
e9f0770
Update methods in JS of search in document plugin
beatrycze-volk Feb 2, 2021
4dc9b01
Change configuration option from isIndexRemapped to searchURL
beatrycze-volk Feb 9, 2021
0148e0c
Make search input names configurable
beatrycze-volk Feb 9, 2021
2654ea9
Use config for documentIdUrlSchema
beatrycze-volk Feb 10, 2021
56bfda7
Update JS ad PHP files to use configured input names and document id …
beatrycze-volk Feb 10, 2021
af2f241
Fix errors pointed by Codacy
beatrycze-volk Feb 11, 2021
7dbc609
Add max amount of snippets and restore query->getHighlighting
beatrycze-volk Feb 15, 2021
af562b7
Adjust URL generation for results' links
beatrycze-volk Feb 15, 2021
22d6477
Add event listener for Enter pressed in the search field
beatrycze-volk Feb 22, 2021
e0be8b2
Restore usage of submit button for submitting search
beatrycze-volk Mar 16, 2021
9a84da5
Restore CORS header in PageViewProxy
beatrycze-volk Mar 22, 2021
c34bbfa
Add missing search component to SOLR config
beatrycze-volk Mar 29, 2021
9497f6a
Check if passed id is number or numeric string
beatrycze-volk Apr 8, 2021
7a15695
Remove indexing of full text from file
beatrycze-volk Apr 28, 2021
d838dc0
Save to index full OCR
beatrycze-volk Apr 28, 2021
2af9f70
Remove getRawText methods
beatrycze-volk Apr 28, 2021
864e802
Return false from save method if documents were not indexed
beatrycze-volk Apr 29, 2021
f6c7ae5
Fix after rebase class for logger in Document instance
beatrycze-volk May 17, 2021
b3a35bf
Highlight more than one word in search
beatrycze-volk May 17, 2021
50e95c3
Preserve search phrase and get hit list after hit link was clicked
beatrycze-volk May 17, 2021
16e27eb
Index documents with empty fulltext field
beatrycze-volk May 18, 2021
431f980
Use coordinates inside the function for searching full text features
beatrycze-volk Jun 28, 2021
f65ed44
Fix Codacy errors
beatrycze-volk Jun 28, 2021
5af13b5
Fix request handler in solr config
beatrycze-volk Jun 30, 2021
a8edd04
Use new syntax for getting of configuration
beatrycze-volk Jul 1, 2021
13a2f1b
Fix documentations
beatrycze-volk Jul 1, 2021
3460992
Update Resources/Public/Javascript/PageView/PageView.js
beatrycze-volk Jul 1, 2021
becc697
Check for undefined baseUrl and remove of passing hl param for highli…
beatrycze-volk Jul 1, 2021
edeec94
Add checks for empty query parameters
beatrycze-volk Jul 1, 2021
0022831
Fix rebase errors
beatrycze-volk Jul 27, 2021
f0547d3
Merge branch 'master' into use-solr-highlighting
Jul 27, 2021
aa7086e
Update Classes/Plugin/Tools/SearchInDocumentTool.php
beatrycze-volk Jul 28, 2021
32c5eb7
Reverse not needed changes
beatrycze-volk Jul 28, 2021
35849de
Add comment for documentIdUrlSchema
beatrycze-volk Jul 28, 2021
b819ef3
Use SOLR plugin in the document list
beatrycze-volk Jul 30, 2021
5f943be
Merge branch 'master' into use-solr-highlighting
Aug 9, 2021
8949104
Fix errors and add missing docs
beatrycze-volk Aug 9, 2021
11bf791
Merge remote-tracking branch 'origin/use-solr-highlighting' into use-…
beatrycze-volk Aug 9, 2021
d5728bf
Fix Scrutinizer warnings
beatrycze-volk Aug 10, 2021
1582cf7
Apply review comments related to URL parameters
beatrycze-volk Aug 16, 2021
ee2b704
Remove overwrite of query
beatrycze-volk Aug 17, 2021
11b500e
Use create filter query
beatrycze-volk Aug 17, 2021
d115ef4
Fix the query limit
beatrycze-volk Aug 17, 2021
8448869
Update Classes/Common/Indexer.php
Aug 18, 2021
399dbef
Ignore false-positive warning from Scrutinizer
beatrycze-volk Aug 18, 2021
d983735
Add facet component to solrconfig.xml
beatrycze-volk Aug 18, 2021
11a796e
Use array for filter queyr
beatrycze-volk Aug 18, 2021
10fb2f6
Merge branch 'master' into use-solr-highlighting
Aug 19, 2021
66bbed3
Remove Illustration tag from full text XML
beatrycze-volk Sep 20, 2021
c2d6517
Merge branch 'master' into use-solr-highlighting
Sep 20, 2021
101 changes: 52 additions & 49 deletions Classes/Common/Document.php
@@ -570,8 +570,8 @@ public static function &getInstance($uid, $pid = 0, $forceReload = false)
if (!empty($extConf['caching'])) {
Helper::saveToSession(self::$registry, get_class($instance));
}
$instance->logger = GeneralUtility::makeInstance(LogManager::class)->getLogger(get_class($instance));
}
$instance->logger = GeneralUtility::makeInstance(LogManager::class)->getLogger(get_class($instance));
// Return new instance.
return $instance;
}
@@ -638,10 +638,8 @@ public function getPhysicalPage($logicalPage)
}

/**
* This extracts the raw text for a physical structure node / IIIF Manifest / Canvas. Text might be
* given as ALTO for METS or as annotations or ALTO for IIIF resources. If IIIF plain text annotations
* with the motivation "painting" should be treated as full text representations, the extension has to be
* configured accordingly.
* This extracts the OCR full text for a physical structure node / IIIF Manifest / Canvas. Text might be
* given as ALTO for METS or as annotations or ALTO for IIIF resources.
*
* @access public
*
@@ -650,23 +648,23 @@ public function getPhysicalPage($logicalPage)
* @param string $id: The @ID attribute of the physical structure node (METS) or the @id property
* of the Manifest / Range (IIIF)
*
* @return string The physical structure node's / IIIF resource's raw text
* @return string The OCR full text
*/
public abstract function getRawText($id);
public abstract function getFullText($id);

/**
* This extracts the raw text for a physical structure node / IIIF Manifest / Canvas from an
* XML fulltext representation (currently only ALTO). For IIIF manifests, ALTO documents have
* This extracts the OCR full text for a physical structure node / IIIF Manifest / Canvas from an
* XML full text representation (currently only ALTO). For IIIF manifests, ALTO documents have
* to be given in the Canvas' / Manifest's "seeAlso" property.
*
* @param string $id: The @ID attribute of the physical structure node (METS) or the @id property
* of the Manifest / Range (IIIF)
*
* @return string The physical structure node's / IIIF resource's raw text from XML
* @return string The OCR full text
*/
protected function getRawTextFromXml($id)
protected function getFullTextFromXml($id)
{
$rawText = '';
$fullText = '';
// Load available text formats, ...
$this->loadFormats();
// ... physical structure ...
@@ -677,54 +675,54 @@ protected function getRawTextFromXml($id)
if (!empty($this->physicalStructureInfo[$id])) {
while ($fileGrpFulltext = array_shift($fileGrpsFulltext)) {
if (!empty($this->physicalStructureInfo[$id]['files'][$fileGrpFulltext])) {
// Get fulltext file.
$file = GeneralUtility::getUrl($this->getFileLocation($this->physicalStructureInfo[$id]['files'][$fileGrpFulltext]));
if ($file !== false) {
// Turn off libxml's error logging.
$libxmlErrors = libxml_use_internal_errors(true);
// Disables the functionality to allow external entities to be loaded when parsing the XML, must be kept.
$previousValueOfEntityLoader = libxml_disable_entity_loader(true);
// Load XML from file.
$rawTextXml = simplexml_load_string($file);
// Reset entity loader setting.
libxml_disable_entity_loader($previousValueOfEntityLoader);
// Reset libxml's error logging.
libxml_use_internal_errors($libxmlErrors);
// Get the root element's name as text format.
$textFormat = strtoupper($rawTextXml->getName());
// Get full text file.
$fileContent = GeneralUtility::getUrl($this->getFileLocation($this->physicalStructureInfo[$id]['files'][$fileGrpFulltext]));
if ($fileContent !== false) {
$textFormat = $this->getTextFormat($fileContent);
} else {
$this->logger->warning('Couldn\'t load fulltext file for structure node @ID "' . $id . '"');
return $rawText;
$this->logger->warning('Couldn\'t load full text file for structure node @ID "' . $id . '"');
return $fullText;
}
break;
}
}
} else {
$this->logger->warning('Invalid structure node @ID "' . $id . '"');
return $rawText;
return $fullText;
}
// Is this text format supported?
if (
!empty($rawTextXml)
&& !empty($this->formats[$textFormat])
) {
if (!empty($this->formats[$textFormat]['class'])) {
$class = $this->formats[$textFormat]['class'];
// Get the raw text from class.
if (
class_exists($class)
&& ($obj = GeneralUtility::makeInstance($class)) instanceof FulltextInterface
) {
$rawText = $obj->getRawText($rawTextXml);
$this->rawTextArray[$id] = $rawText;
} else {
$this->logger->warning('Invalid class/method "' . $class . '->getRawText()" for text format "' . $textFormat . '"');
}
}
// This part actually differs from previous version of indexed OCR
if (!empty($fileContent) && !empty($this->formats[$textFormat])) {
$fullText = $fileContent;
} else {
$this->logger->warning('Unsupported text format "' . $textFormat . '" in physical node with @ID "' . $id . '"');
}
return $rawText;
return $fullText;
}

/**
* Get format of the OCR full text
*
* @access private
*
* @param string $fileContent: content of the XML file
*
* @return string The format of the OCR full text
*/
private function getTextFormat($fileContent)
{
// Turn off libxml's error logging.
$libxmlErrors = libxml_use_internal_errors(true);
// Disables the functionality to allow external entities to be loaded when parsing the XML, must be kept.
$previousValueOfEntityLoader = libxml_disable_entity_loader(true);
// Load XML from file.
$rawTextXml = simplexml_load_string($fileContent);
// Reset entity loader setting.
libxml_disable_entity_loader($previousValueOfEntityLoader);
// Reset libxml's error logging.
libxml_use_internal_errors($libxmlErrors);
// Get the root element's name as text format.
return strtoupper($rawTextXml->getName());
}

/**
@@ -1306,9 +1304,14 @@ public function save($pid = 0, $core = 0, $owner = null)
}
// Add document to index.
if ($core) {
Indexer::add($this, $core);
//TODO: change return of this method to true on success and false on failure
$hasErrors = Indexer::add($this, $core);
if ($hasErrors) {
return false;
}
} else {
$this->logger->notice('Invalid UID "' . $core . '" for Solr core');
return false;
}
return true;
}
182 changes: 114 additions & 68 deletions Classes/Common/DocumentList.php
@@ -12,8 +12,10 @@

namespace Kitodo\Dlf\Common;

use Kitodo\Dlf\Common\SolrSearchResult\ResultDocument;
use Psr\Log\LoggerAwareInterface;
use Psr\Log\LoggerAwareTrait;
use Solarium\QueryType\Select\Result\Result;
use TYPO3\CMS\Core\SingletonInterface;
use TYPO3\CMS\Core\Database\ConnectionPool;
use TYPO3\CMS\Core\Utility\GeneralUtility;
@@ -237,74 +239,8 @@ protected function getRecord($element)
&& $this->metadata['options']['source'] == 'search'
) {
if ($this->solrConnect()) {
$fields = Solr::getFields();
$params = [];
// Restrict the fields to the required ones
$params['fields'] = $fields['uid'] . ',' . $fields['id'] . ',' . $fields['toplevel'] . ',' . $fields['thumbnail'] . ',' . $fields['page'];
foreach ($this->solrConfig as $solr_name) {
$params['fields'] .= ',' . $solr_name;
}
// If it is a fulltext search, enable highlighting.
if ($this->metadata['fulltextSearch']) {
$params['component'] = [
'highlighting' => [
'query' => Solr::escapeQuery($this->metadata['searchString']),
'field' => $fields['fulltext'],
'usefastvectorhighlighter' => true
]
];
}
// Set additional query parameters.
$params['start'] = 0;
// Set reasonable limit for safety reasons.
// We don't expect to get more than 10.000 hits per UID.
$params['rows'] = 10000;
// Take over existing filter queries.
$params['filterquery'] = isset($this->metadata['options']['params']['filterquery']) ? $this->metadata['options']['params']['filterquery'] : [];
// Extend filter query to get all documents with the same UID.
foreach ($params['filterquery'] as $key => $value) {
if (isset($value['query'])) {
$params['filterquery'][$key]['query'] = $value['query'] . ' OR ' . $fields['toplevel'] . ':true';
}
}
// Add filter query to get all documents with the required uid.
$params['filterquery'][] = ['query' => $fields['uid'] . ':' . Solr::escapeQuery($record['uid'])];
// Add sorting.
$params['sort'] = $this->metadata['options']['params']['sort'];
// Set query.
$params['query'] = $this->metadata['options']['select'] . ' OR ' . $fields['toplevel'] . ':true';
// Perform search for all documents with the same uid that either fit to the search or marked as toplevel.
$selectQuery = $this->solr->service->createSelect($params);
$result = $this->solr->service->select($selectQuery);
// If it is a fulltext search, fetch the highlighting results.
if ($this->metadata['fulltextSearch']) {
$highlighting = $result->getHighlighting();
}
// Process results.
foreach ($result as $resArray) {
// Prepare document's metadata.
$metadata = [];
foreach ($this->solrConfig as $index_name => $solr_name) {
if (!empty($resArray->$solr_name)) {
$metadata[$index_name] = (is_array($resArray->$solr_name) ? $resArray->$solr_name : [$resArray->$solr_name]);
}
}
// Add metadata to list elements.
if ($resArray->toplevel) {
$record['thumbnail'] = $resArray->thumbnail;
$record['metadata'] = $metadata;
} else {
$highlightedDoc = !empty($highlighting) ? $highlighting->getResult($resArray->id) : null;
$highlight = !empty($highlightedDoc) ? $highlightedDoc->getField($fields['fulltext'])[0] : '';
$record['subparts'][$resArray->id] = [
'uid' => $resArray->uid,
'page' => $resArray->page,
'preview' => $highlight,
'thumbnail' => $resArray->thumbnail,
'metadata' => $metadata
];
}
}
$result = $this->getSolrResult($record);
$record = $this->getSolrRecord($record, $result);
}
}
// Save record for later usage.
@@ -316,6 +252,116 @@ protected function getRecord($element)
return $record;
}

/**
* It gets SOLR result
*
* @access private
*
* @param array $record: for searched document
*
* @return Result
*/
private function getSolrResult($record) {
$fields = Solr::getFields();

$query = $this->solr->service->createSelect();
// Restrict the fields to the required ones
$query->setFields($fields['uid'] .',' . $fields['id'] .',' . $fields['toplevel'] .',' . $fields['thumbnail'] .',' . $fields['page']);
foreach ($this->solrConfig as $solr_name) {
$query->addField($solr_name);
}
// Set additional query parameters.
// Set reasonable limit for safety reasons.
// We don't expect to get more than 10.000 hits per UID.
$query->setStart(0)->setRows(10000);
// Take over existing filter queries.
$filterQueries = isset($this->metadata['options']['params']['filterquery']) ? $this->metadata['options']['params']['filterquery'] : [];
// Extend filter query to get all documents with the same UID.
foreach ($filterQueries as $key => $value) {
if (isset($value['query'])) {
$filterQuery[$key] = $value['query'] . ' OR ' . $fields['toplevel'] . ':true';
$query->addFilterQuery($filterQuery);
Collaborator

This won't work either, but I found no way to test it. I think there is another issue in Search::makeFacetsMenuArray(), so I can't get the available facets from Solr and show the facets menu.

Collaborator Author

I have taken a look at this method. Do you think it needs to be adjusted now, or can it be done later?

Collaborator

@albig albig Aug 18, 2021

The addFilterQuery() call above has to be fixed. Before doing that, though, I'm trying to get facets working. Currently, Solr returns no facets and I still don't know why.

A query like

/solr/dlfCore0/select?facet=on&q=*%3A*&rows=0

should return something like

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"*:*",
      "rows":"0",
      "facet":"on"}},
  "response":{"numFound":4517,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

In my case - with this PR working - I get

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "rows":"0",
      "facet":"on"}},
  "response":{"numFound":4376,"start":0,"numFoundExact":true,"docs":[]
  }}

That's why no faceting is shown.
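
The difference between the two responses above comes down to whether the `facet_counts` section is present. A minimal sketch of that check, assuming the response body is already available as a JSON string; `hasFacetCounts()` and the shortened sample payloads are hypothetical and not part of this PR:

```php
<?php
// Hypothetical helper (not part of the PR): reports whether a Solr select
// response, given as a JSON string, contains the "facet_counts" section.
function hasFacetCounts(string $json): bool
{
    $data = json_decode($json, true);
    return is_array($data) && array_key_exists('facet_counts', $data);
}

// Query string for the check described above; http_build_query() takes care
// of percent-encoding the "*:*" query.
$queryString = http_build_query(['facet' => 'on', 'q' => '*:*', 'rows' => 0]);

// Shortened versions of the two responses quoted in the comment.
$withFacets = '{"responseHeader":{"status":0},"response":{"numFound":4517,"docs":[]},"facet_counts":{"facet_fields":{}}}';
$withoutFacets = '{"responseHeader":{"status":0},"response":{"numFound":4376,"docs":[]}}';

var_dump(hasFacetCounts($withFacets));    // bool(true)
var_dump(hasFacetCounts($withoutFacets)); // bool(false)
```

If the second case is what Solr returns, the likely place to look is the request handler's component list in solrconfig.xml rather than the PHP code.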

Collaborator

The faceting is solved (see the other commit suggestion).

When selecting a facet, we get the (expected) exception:

"A filterquery must have a key value" in Line 283.

The GET parameters used are:

tx_dlf[fq][0]=collection_faceting:("Projekt: Illustrierte Magazine der Klassischen Moderne")&tx_dlf[fq][1]=type_faceting:("volume")&tx_dlf[query]=*

So please rewrite Line 283.
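
One possible direction for the rewrite, as a sketch only: give every extended filter query its own key before handing it to Solarium. The helper name `buildKeyedFilterQueries()`, the `fq_` key prefix and the literal field name `toplevel` are assumptions for illustration; in the PR the field name would come from `Solr::getFields()`.

```php
<?php
// Hypothetical helper (not part of the PR): turns the existing filter queries
// into option arrays that each carry a unique 'key' next to the 'query'.
function buildKeyedFilterQueries(array $filterQueries, string $toplevelField): array
{
    $keyed = [];
    foreach ($filterQueries as $key => $value) {
        if (isset($value['query'])) {
            $keyed[] = [
                // Assumed key scheme: prefix plus the original array index.
                'key' => 'fq_' . $key,
                // Same extension as in the diff: also match toplevel documents.
                'query' => $value['query'] . ' OR ' . $toplevelField . ':true',
            ];
        }
    }
    return $keyed;
}

// Filter queries as they would arrive from the GET parameters quoted above.
$filterQueries = [
    ['query' => 'collection_faceting:("Projekt: Illustrierte Magazine der Klassischen Moderne")'],
    ['query' => 'type_faceting:("volume")'],
];

foreach (buildKeyedFilterQueries($filterQueries, 'toplevel') as $options) {
    // In the PR code this is where something like
    // $query->createFilterQuery($options['key'])->setQuery($options['query'])
    // would go, mirroring the createFilterQuery('uid') call further down.
    echo $options['key'], ' => ', $options['query'], PHP_EOL;
}
```

Keeping the key generation in one place makes it easy to guarantee uniqueness, which is what the exception message is asking for.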

}
}
// Add filter query to get all documents with the required uid.
$query->createFilterQuery('uid')->setQuery($fields['uid'] . ':' . Solr::escapeQuery($record['uid']));
// Add sorting.
$query->addSort('score', $this->metadata['options']['params']['sort']['score']);
// Set query.
$query->setQuery($this->metadata['options']['select'] . ' OR ' . $fields['toplevel'] . ':true');

// If it is a fulltext search, enable highlighting.
if ($this->metadata['fulltextSearch']) {
$query->getHighlighting();
};

$solrRequest = $this->solr->service->createRequest($query);

// If it is a fulltext search, enable highlighting.
if ($this->metadata['fulltextSearch']) {
// field for which highlighting is going to be performed,
// is required if you want to have OCR highlighting
$solrRequest->addParam('hl.ocr.fl', $fields['fulltext']);
// return the coordinates of highlighted search as absolute coordinates
$solrRequest->addParam('hl.ocr.absoluteHighlights', 'on');
// max amount of snippets for a single page
$solrRequest->addParam('hl.snippets', 20);
}
// Perform search for all documents with the same uid that either fit to the search or marked as toplevel.
$response = $this->solr->service->executeRequest($solrRequest);
return $this->solr->service->createResult($query, $response);
}

/**
* It processes SOLR result into record, which is
* going to be displayed in the frontend list.
*
* @access private
*
* @param array $record: for searched document
* @param Result $result: found in the SOLR index
*
* @return array
*/
private function getSolrRecord($record, $result) {
// If it is a fulltext search, fetch the highlighting results.
if ($this->metadata['fulltextSearch']) {
$data = $result->getData();
$highlighting = $data['ocrHighlighting'];
}

// Process results.
foreach ($result as $resArray) {
// Prepare document's metadata.
$metadata = [];
foreach ($this->solrConfig as $index_name => $solr_name) {
if (!empty($resArray->$solr_name)) {
$metadata[$index_name] = (is_array($resArray->$solr_name) ? $resArray->$solr_name : [$resArray->$solr_name]);
}
}
// Add metadata to list elements.
if ($resArray->toplevel) {
$record['thumbnail'] = $resArray->thumbnail;
$record['metadata'] = $metadata;
} else {
$highlight = '';
if (!empty($highlighting)) {
$resultDocument = new ResultDocument($resArray, $highlighting, Solr::getFields());
$highlight = $resultDocument->getSnippets();
}

$record['subparts'][$resArray->id] = [
'uid' => $resArray->uid,
'page' => $resArray->page,
'preview' => $highlight,
'thumbnail' => $resArray->thumbnail,
'metadata' => $metadata
];
}
}
return $record;
}

/**
* This returns the current position
* @see \Iterator::key()
1 change: 1 addition & 0 deletions Classes/Common/FulltextInterface.php
@@ -21,6 +21,7 @@
* @access public
* @abstract
*/
//TODO: check if this is still needed when actually full text xml is indexed
interface FulltextInterface
{
/**
7 changes: 4 additions & 3 deletions Classes/Common/IiifManifest.php
@@ -786,9 +786,10 @@ protected function getParentDocumentUidForSaving($pid, $core, $owner)

/**
* {@inheritDoc}
* @see Document::getRawText()
* @see Document::getFullText()
*/
public function getRawText($id)
//TODO: rewrite it to get full OCR
public function getFullText($id)
{
$rawText = '';
// Get text from raw text array if available.
@@ -805,7 +806,7 @@ public function getRawText($id)
if (!empty($this->physicalStructureInfo[$id])) {
while ($fileGrpFulltext = array_shift($fileGrpsFulltext)) {
if (!empty($this->physicalStructureInfo[$id]['files'][$fileGrpFulltext])) {
$rawText = parent::getRawTextFromXml($id);
$rawText = parent::getFullTextFromXml($id);
break;
}
}