
462 simplify oai pmh process #489

Open
TobiasNx wants to merge 6 commits into master from 462-simplifyOaiPmhProcess
Conversation

TobiasNx
Contributor

@TobiasNx TobiasNx commented Aug 22, 2023

Related to #462

This PR does four things:

  • gets rid of the XML splitting of the OAI updates
  • gets rid of the intervals for the OAI updates
  • splits the output of the OAI updates and the PICA binary bulk before writing the results together
  • updates the tests accordingly

Context:

The old process was overly complex: it set intervals for harvesting and split all files into single XML files.
The intervals were probably needed because there were many updates relative to the old PICA XML bulk from 2013, which we no longer use. They date from a time when all records were merged: https://github.com/search?q=repo%3Ahbz%2Flobid-organisations+interval&type=pullrequests
The splitting into single XML files is no longer needed since it was only used for the updates.

Three questions need to be answered:

  • How do we update the tests? They still depend on the old XML splitting process.
    I skip the OAI-PMH harvest with a fake date and a zero interval. Is there a way to skip the OAI-PMH process if the OAI-PMH endpoint returns an error? ae6fd45

  • Do we want the harvested OAI-PMH data saved as an XML file?

  • Do we need a fallback if we cannot reach the OAI-PMH endpoint?
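One possible fallback, sketched under assumptions: if the harvest fails, reuse the XML saved by the previous successful run. The `Harvester` interface and the file names below are hypothetical illustrations, not taken from this repository.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Sketch of a fallback for an unreachable OAI-PMH endpoint: if harvesting
// fails, fall back to the XML file saved by the previous successful run.
// Harvester and the file names are hypothetical placeholders.
public class HarvestWithFallback {

    interface Harvester {
        void harvestTo(Path target) throws IOException; // may throw if endpoint is down
    }

    static Path harvestOrFallback(Harvester harvester, Path target, Path lastGood)
            throws IOException {
        try {
            harvester.harvestTo(target);
            // remember this harvest as the new "last good" copy
            Files.copy(target, lastGood, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            if (Files.exists(lastGood)) {
                return lastGood; // endpoint unreachable: reuse the previous harvest
            }
            throw e; // no fallback available
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Paths.get("updates.xml");
        Path lastGood = Paths.get("updates-last-good.xml");
        Files.write(lastGood, "<records/>".getBytes());
        Harvester failing = t -> { throw new IOException("endpoint unreachable"); };
        System.out.println(harvestOrFallback(failing, target, lastGood));
    }
}
```

With a failing harvester and an existing previous dump, the sketch falls back to `updates-last-good.xml`.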

This would also fix #487

@TobiasNx TobiasNx requested a review from dr0i August 22, 2023 14:21
@TobiasNx
Contributor Author

It builds locally but the tests fail.

@TobiasNx
Contributor Author

Tests are now working.

@TobiasNx TobiasNx marked this pull request as draft August 23, 2023 07:01
@TobiasNx
Contributor Author

I will put this PR on hold until after the next deployment.

@TobiasNx TobiasNx force-pushed the 462-simplifyOaiPmhProcess branch from 65cfc83 to ae6fd45 Compare September 18, 2023 08:58
@TobiasNx TobiasNx changed the base branch from useFixInsteadOfMorph to master September 18, 2023 08:59
@TobiasNx TobiasNx mentioned this pull request Sep 21, 2023
2 tasks
@TobiasNx TobiasNx marked this pull request as ready for review September 21, 2023 15:10
Get rid of the XML splitting process and only use a single OAI-PMH interval step, due to newer PICA bulks.
@TobiasNx TobiasNx force-pushed the 462-simplifyOaiPmhProcess branch from 082ac9e to 780d575 Compare October 12, 2023 09:25
@TobiasNx TobiasNx requested review from dr0i and removed request for dr0i October 12, 2023 09:29
@TobiasNx
Contributor Author

I updated this branch and resolved all conflicts. @dr0i, could you have a look?
Could you also help answer the following questions:

  • Do we want the harvested OAI-PMH data saved as an XML file?
  • Do we need a fallback if we cannot reach the OAI-PMH endpoint?

@TobiasNx
Contributor Author

In #462 (comment), @fsteeg hinted that the OAI-PMH data was saved so that the transformation can be rerun without fetching the data from the OAI-PMH endpoint again.

With the new PICA binary bulk, fetching all new Sigel updates plus the transformation takes only 3 minutes. Would that be sufficient?

@dr0i
Member

dr0i commented Oct 19, 2023

With the new PICA binary bulk, fetching all new Sigel updates plus the transformation takes only 3 minutes.

This will change over time: after 11 months or so, if there are more changes. One cannot be sure. Also, I don't know what happens if the download of updates fails while we want to reindex everything. Would all the updates then be missing?

@dr0i dr0i assigned TobiasNx and unassigned dr0i Oct 19, 2023
@TobiasNx
Contributor Author

With the new PICA binary bulk, fetching all new Sigel updates plus the transformation takes only 3 minutes.

This will change over time: after 11 months or so, if there are more changes.

But then there will be a new bulk file, and we will need to exchange the old one and update the start date for the updates. We will get an email reminder for that, see #504.

One cannot be sure. Also, I don't know what happens if the download of updates fails while we want to reindex everything. Would all the updates then be missing?

At the moment the index is rebuilt from scratch every night: fetch ALL data from OAI-PMH -> split it -> transform it -> index it.
If I am not mistaken, all split files are overwritten, so we already lose the source data. Only the index data is not lost, provided the incoming index data is equal to or bigger than the old file.

The source data is only kept to rerun the transformation manually. Or am I missing something?

We could add a step that saves the OAI PMH Data in an XML file like here: https://gitlab.com/oersi/oersi-etl/-/blob/master/data/production/duepublico/duepublico-to-oersi.flux?ref_type=heads#L7-15

And test whether the new data is bigger than or equal to the old.

I am not sure how to do this in Java.
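A minimal sketch of that size check in Java: compare the fresh harvest's file size with the previously saved dump and only overwrite when it is not smaller. The file names and the helper below are illustrative assumptions, not taken from the lobid-organisations code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Sketch only: keep the freshly harvested OAI-PMH XML dump only if it is at
// least as big as the previously saved one, mirroring the size check that the
// index update already applies. File names are hypothetical.
public class KeepBiggerDump {

    static boolean replaceIfNotSmaller(Path fresh, Path kept) throws IOException {
        long freshSize = Files.size(fresh);
        long keptSize = Files.exists(kept) ? Files.size(kept) : 0L;
        if (freshSize >= keptSize) {
            // fresh harvest is complete enough: overwrite the saved dump
            Files.move(fresh, kept, StandardCopyOption.REPLACE_EXISTING);
            return true;
        }
        // fresh harvest is suspiciously small: keep the old dump
        Files.deleteIfExists(fresh);
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path fresh = Paths.get("oai-updates-new.xml");
        Path kept = Paths.get("oai-updates.xml");
        Files.write(kept, "<records><record/></records>".getBytes());
        Files.write(fresh, "<records><record/><record/></records>".getBytes());
        System.out.println(replaceIfNotSmaller(fresh, kept));
    }
}
```

In the demo, the fresh dump is bigger than the kept one, so it replaces the saved file and the helper returns true.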

@TobiasNx TobiasNx assigned dr0i and unassigned TobiasNx Oct 19, 2023
@dr0i
Member

dr0i commented Oct 19, 2023

Only the index data is not lost, provided the incoming index data is equal to or bigger than the old file.

Oh, right - so we at least have something. I think this is sufficient.


Thus, you will have to specify two parameters in @conf/application.conf@: (1) the date from which the updates start (usually the date of the base dump creation, e.g. 2013-06-01) and (2) the interval size in days (must not be too large).
Thus, you will have to specify one parameter in @conf/application.conf@: the date from which the updates start (usually the date of the base dump creation, e.g. 2013-06-01).

Member

Are you sure that, if I understand correctly, 365 days (the max?!) won't be "too large"?

TransformDbs.process(dbsOutput, geoServer, wikidataLookupFilename); // Start processing DBS data.

String sigelBulkOutput = outputPath + "-sigelBulk";
String sigelUpdatesOutput = outputPath + "-sigelUpdates";
Member

There are superfluous tabs at the end of the line.

writeAll(sigelBulkOutput, resultWriter);
if (!startOfUpdates.isEmpty()) { // exclude updates for the tests, which set startOfUpdates to ""
writeAll(sigelUpdatesOutput, resultWriter);
}
Member

There are superfluous tabs at the end of the line.

JsonEncoder encodeJson = new JsonEncoder();
encodeJson.setPrettyPrinting(true);
ObjectWriter objectWriter = new ObjectWriter<>(outputPath);
objectWriter.setAppendIfFileExists(true);
splitFileOpener//
sigelOaiPmhUpdates//
Member

There are superfluous tabs at the end of the line.

@dr0i dr0i assigned TobiasNx and unassigned dr0i Oct 19, 2023
@TobiasNx
Contributor Author

TobiasNx commented Jun 4, 2024

@dr0i Could you review it again?

@TobiasNx TobiasNx requested a review from dr0i June 4, 2024 15:49
@dr0i dr0i assigned dr0i and unassigned TobiasNx Jun 14, 2024
@@ -61,15 +59,15 @@ public static void tearDown() {

@Test
public void multiLangAlternateName() throws IOException {
assertThat(new String(
assertThat(new String(
Member

Superfluous tab.

Files.readAllBytes(Paths.get(TransformAll.DATA_OUTPUT_FILE))))
.as("transformation output with multiLangAlternateName")
.contains("Leibniz Institute").contains("Berlin SBB");
}

@Test
public void separateUrlAndProvidesFields() throws IOException {
assertThat(new String(
assertThat(new String(
Member

Superfluous tab.

@@ -78,7 +76,7 @@ public void separateUrlAndProvidesFields() throws IOException {

@Test
public void preferSigelData() throws IOException {
assertThat(new String(
assertThat(new String(
Files.readAllBytes(Paths.get(TransformAll.DATA_OUTPUT_FILE))))
Member

Superfluous tab.

Member

@dr0i dr0i left a comment

The README seems to be far outdated (mentioning Metafacture 4.0.0 etc.). I will provide an update of the README and, by the way, test whether the data is updated as expected.

@TobiasNx
Contributor Author

@dr0i I have adjusted the formatting.

Successfully merging this pull request may close these issues.

Test does not receive updates after 21-07-23 / Issue with the interval for the updates.