In this exercise, you will continue to interact with the enriched dataset you created in the previous part. You will first modify the recipe you used to create that dataset and schedule a publication so that potential metadata changes are taken into account in the future. Then you will create rules and a rulebook to visually track the data quality issues that were found. Finally, you will curate the data.
After completing these steps, you will have modified your data preparation recipe to further isolate the data quality issues on a specific column and scheduled a publication of this dataset to take into account potential metadata changes in the future.
-
This opens the Enrich window with the defined Enrich action.
-
The join definition window opens. Scroll down to the bottom of Output Columns.
-
The dataset is now enriched with the additional selected column.
-
This filters the data to show only claims containing an invalid drug.
-
The data is sorted by 'DRUG_NAME' in ascending order, but the first records contain null values.
-
This shows only the claims with unknown drug names that are not null.
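Conceptually, the two filters applied above can be sketched in a few lines of Python. The column names DRUG_NAME and VALID_DRUG come from the exercise dataset; the sample records are invented for illustration.

```python
# Sketch of the filtering logic applied in the data preparation room.
# Sample records are hypothetical; only the column names come from the exercise.
claims = [
    {"DRUG_NAME": None, "VALID_DRUG": False},
    {"DRUG_NAME": "Aspirin", "VALID_DRUG": True},
    {"DRUG_NAME": "Asprin", "VALID_DRUG": False},
]

# First filter: keep only claims containing an invalid drug.
invalid = [c for c in claims if not c["VALID_DRUG"]]

# Second filter: exclude the records whose drug name is null.
invalid_named = [c for c in invalid if c["DRUG_NAME"] is not None]
```

This leaves only the non-null, invalid drug names that the next aggregation step will count.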
-
The filters defined in the main data preparation (previous screen) are automatically pre-defined in the aggregation window.
-
This defines the aggregation with a preview of the result at the bottom half of the screen.
-
This generates an expected error message, because the system currently considers that we are trying to aggregate data twice on the same column.
-
Change the Aggregation function from 'No Aggregation' to 'Count'.
-
This shows the preview of the aggregation that contains the distinct values of drug names with their respective count.
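The 'Count' aggregation on DRUG_NAME behaves like a simple value-count. A minimal sketch, with invented sample values:

```python
from collections import Counter

# Sketch of the 'Count' aggregation: distinct drug names with their
# respective counts. The sample values are hypothetical.
drug_names = ["Asprin", "Aspirin", "Aspirin", "aspirin"]
counts = Counter(drug_names)
# Near-duplicate spellings like these are exactly the kind of simple
# data quality issue the aggregation preview makes visible.
```

The differing counts for near-identical spellings are what point you to the curation work later in the exercise.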
-
The main data preparation room now contains the aggregated data. The data could be exported as-is (as it is based only on a sample of the data) for further study and analysis. We can already see some simple data quality issues in the drug names, which we will curate later on.
-
For now, we can disable this aggregation (but still keep its definition). Click 'Recipe'.
-
Click 'More Actions' ('...') for the 'Aggregate' action in the recipe.
-
The aggregation action is now disabled. The data preparation room shows the data prior to executing the aggregation. The filters that were defined were automatically removed (but are still embedded in the definition of the aggregation). At any point in time, we can come back to this data preparation recipe and reactivate this action.
-
Select 'PHARMA-CLAIMS_ENRICHED_##' (where ## is your user number).
-
Note: 'Write Mode' set to 'Overwrite' allows the existing data to be overwritten with this new definition.
-
Wait until you see a notification that the task has completed. You can click the refresh button at the bottom of the screen.
-
Click 'View Fact Sheet' for the file 'PHARMA_CLAIMS_ENRICH_##' (where ## is your user number).
-
The application shows the Fact Sheet associated with the selected dataset.
-
The columns are no longer profiled. This is because the structure of the data changed when we modified the definition in the recipe associated with the preparation task (added the column POTENCY, changed one column name).
-
Wait until the 'Profiled:' time stamp changes, indicating that the profiling task finished successfully.
-
The data is now profiled again, and a new version of the profiling information is available. Click 'Columns'. Note: The previous profiling metadata remains available.
Click 'Data Intelligence Metadata Explorer'.
Because you changed the structure of the dataset, you will also need to publish the dataset to the catalog once again. However, instead of repeating this operation manually every time the structure of the data changes, you can schedule the publication of the dataset.
-
Click the 'Name' field to select the publication name to schedule.
-
Type 'pharma claims publication ##' in the 'Filter Publications' text field. (Where ## is your user number).
-
Select your publication (for example if you are user 99, select Pharma Claims Publication 99).
-
Open the 'Picker' icon for the Start Time to give yourself a few more minutes before the first schedule of the selected publication.
-
Click on the minutes and add 5 more minutes to the original value. For example, if the value was 06:45, set it to 06:50. Click 'OK'.
-
Click the Refresh icon or check notifications to see when the scheduled publishing task is complete.
You have now modified your data preparation recipe to further isolate the data quality issues on a specific column. You have also scheduled a publication of this dataset to take into account potential metadata changes in the future.
After completing these steps you will have created a glossary and business terms and associated them to your enriched dataset.
-
Type 'DRUGS_GLOSSARY_##' in the 'Name' text field (where ## is your user number).
-
Type the following description: 'Glossary for pharmaceutical drugs'.
-
You created a glossary, and now can start enriching it with new terms.
-
Type 'Pharmaceutical form of medicinal products' for Description.
-
Select 'List of Values' from the Validation Type drop down list.
-
Type 'Tablet', 'Capsules', and 'Inhalers', adding each to the list of Valid Values.
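A 'List of Values' validation simply checks membership against the configured list. A minimal sketch, using the three values entered above:

```python
# Sketch of a 'List of Values' validation: a value is valid only if it
# appears in the configured list. The values come from the exercise.
valid_values = {"Tablet", "Capsules", "Inhalers"}

def is_valid(value):
    """Return True if the value belongs to the configured list."""
    return value in valid_values
```

Any value outside the list (for example, a hypothetical 'Syrup') would be flagged as invalid.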
-
Click '>', to move 'Last Reviewed' to the right column and under 'Reviewer'.
-
Type the following definition for the term: 'Measure of drug activity expressed in terms of the amount required to produce an effect of given intensity'.
-
The term 'Potency' is created and part of the glossary definitions. Click 'Potency'.
-
Check your PHARMA_CLAIMS_ENRICHED_## dataset (where ## is your user number).
-
You can hover over 'PHARMA_CLAIMS_ENRICHED...' to see the full name.
-
The term is now associated with a dataset and a column in the dataset.
-
The Catalog shows the datasets that match the search pattern.
-
Change the Search string to 'claims*##' (where ## is your user number).
-
The glossary term 'Potency' is associated with the 'POTENCY' column of the dataset.
You have been informed that 'Gina Milo' has changed her last name, so you need to change all values of 'Gina Milo' to 'Gina Johnson'. You can do this as a mass update, so all the 'Gina Milo' Reviewer values are updated to 'Gina Johnson' at one time.
-
Notice the Reviewer's Name has been changed to 'Gina Johnson'.
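The mass update is conceptually a single pass that rewrites every matching attribute value. A minimal sketch, with invented surrounding values:

```python
# Sketch of the mass update on the Reviewer custom attribute: every
# 'Gina Milo' value is replaced with 'Gina Johnson' in one pass.
# The surrounding reviewer names are hypothetical.
reviewers = ["Gina Milo", "John Doe", "Gina Milo"]
reviewers = ["Gina Johnson" if r == "Gina Milo" else r for r in reviewers]
```

Doing this as one operation avoids editing each asset's attribute individually.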
-
You are back to the Data Intelligence Metadata Explorer Home page.
You have now:
- created a glossary
- added custom attributes, which extend the meaning of an asset beyond the definition
- performed a mass update on a custom attribute value; all 'Gina Milo' references were changed to 'Gina Johnson'
- defined a business term
- associated the business term with a dataset and a column, all to enrich your dataset.
After completing these steps you will have created rules and a rulebook that is evaluating the data quality of your enriched dataset.
-
Type 'DRUG_NAME_EXISTS_##' for 'Rule ID' (where ## is your user id).
-
Type 'DRUG_NAME_EXISTS_##' for 'Name' (where ## is your user id).
-
Type 'Drug Name must not be null' for 'Description'.
-
Type 'DRUG_NAME' for the 'Name' of the parameter for the rule.
-
Type 'DRUG_NAME_VALID_##' for 'Rule ID' (where ## is your user id).
-
Type 'DRUG_NAME_VALID_##' for 'Name' (where ## is your user id).
Select 'DRUG_NAME_VALID' for the 'Parameter Name'.
-
You have created two different rules; now you can create a rulebook and associate these rules with a dataset.
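The two rules can be sketched as simple predicates over a record; a rulebook then evaluates every rule against every record and tallies the results. The column names come from the exercise, while the records below are invented:

```python
# Sketch of the two rules as predicates. DRUG_NAME_EXISTS checks the
# value is not null; DRUG_NAME_VALID checks the validity flag.
def drug_name_exists(record):
    return record.get("DRUG_NAME") is not None

def drug_name_valid(record):
    return record.get("VALID_DRUG") is True

rules = {
    "DRUG_NAME_EXISTS": drug_name_exists,
    "DRUG_NAME_VALID": drug_name_valid,
}

# Hypothetical sample records standing in for the bound dataset.
records = [
    {"DRUG_NAME": "Aspirin", "VALID_DRUG": True},
    {"DRUG_NAME": None, "VALID_DRUG": False},
]

# For each rule, count how many records pass.
results = {name: sum(rule(r) for r in records) for name, rule in rules.items()}
```

This pass/fail tally is essentially what the rulebook's 'View Results' screen summarizes per rule.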
-
Type 'PHARMA_CLAIMS_##' for the rulebook 'Name' (where ## is your user number).
-
Type 'Set of rules to evaluate the quality of dataset used to track pharma claims' for the 'Description'.
-
Check your rule named 'DRUG_NAME_VALID_##' (where ## is your user number).
-
Check your rule named 'DRUG_NAME_EXISTS_##' (where ## is your user number).
-
The rules are now imported; next, you need to associate them with a dataset to run against.
-
Click 'Create Rule Binding' to bind the rule with a dataset.
-
Select 'DI_DATA_LAKE', from the 'Connections' drop down list.
Type '*##' in the 'Filter items' parameter (where ## is your user number).
-
Click your 'TechEd_DA264_##' (where ## is your user number).
-
Type '*##' in the 'Filter items' parameter (where ## is your user number).
-
Select 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).
-
Select 'VALID_DRUG' from the 'Mapped to Column' drop down list.
-
Notice that you have now bound the rule DRUG_NAME_## to PHARMA_CLAIMS_ENRICHED_##.csv (where ## is your user number).
-
Select 'DRUG_NAME' from the 'Mapped to Column' drop down list.
-
The rules are being evaluated. Wait until the disabled 'View Results' button becomes enabled.
-
The Rule Result window is displayed. Expand the PHARMA_CLAIMS_ENRICHED_## (where ## is your user number).
-
You can see the details about records that did not pass the rules.
-
You are back to the Data Intelligence Metadata Explorer home page.
You have now created rules and a rulebook that is evaluating the data quality of your enriched dataset.
After completing these steps you will have fixed data quality issues using the data remediation preparation.
-
Click 'Preparation for PHARMA_CLAIMS_ENRICHED_##.csv' (where ## is your user number). Note: You can always use 'Filter Preparations' to narrow the results.
-
Type 'REMEDIATED_DATA_##' for the 'Dataset Name' (where ## is your user number).
You fixed data quality issues using the data preparation recipe that was automatically generated by the execution of your rulebook. In a normal process, the output dataset resulting from that data remediation process would be used to update the original table that contains the data quality issues, using workflows and approvals. For this hands-on, you will use the data preparation recipe you created to simulate the update process of the data and see the results in the rulebook.
-
Select the connection 'DI_DATA_LAKE' from the drop down list.
-
Type '##' in the 'Filter items' search bar (where ## is your user number).
-
Click 'REMEDIATED_DATA_##' (where ## is your user number). Note: You can use the 'Filter items' search to narrow down your results.
-
After the dataset is acquired, the remediated dataset will be added to the list of sources.
-
Drag and drop the 'REMEDIATED_DATA_##' data source to the left side of 'P1' (where ## is your user number).
-
Scroll down and uncheck 'VALID_DRUG' and 'POTENCY' under the 'left source' list of columns.
-
Scroll down and uncheck 'ORIG_PRODUCT', 'DOSAGE', 'ROUTE_ADMINISTERED', and 'NOTES' from the right source list of columns.
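The join with per-side column selection can be sketched as merging two records while dropping the unchecked columns from each side. The join key 'CLAIM_ID' below is a hypothetical stand-in, and the values are invented; the dropped columns match the ones unchecked in the steps above:

```python
# Sketch of the join: left source is the remediated data, right source is
# the original enriched data. 'CLAIM_ID' is a hypothetical join key.
left = {"CLAIM_ID": 1, "DRUG_NAME": "Aspirin",   # corrected name
        "VALID_DRUG": True, "POTENCY": "LOW"}
right = {"CLAIM_ID": 1, "DRUG_NAME": "Asprin",   # original misspelling
         "VALID_DRUG": False, "ORIG_PRODUCT": "x", "DOSAGE": "x",
         "ROUTE_ADMINISTERED": "x", "NOTES": "x"}

# Columns unchecked on each side, as in the exercise.
drop_left = {"VALID_DRUG", "POTENCY"}
drop_right = {"ORIG_PRODUCT", "DOSAGE", "ROUTE_ADMINISTERED", "NOTES"}

# Take the selected right-side columns, then let the selected left-side
# (remediated) columns override shared ones such as DRUG_NAME.
merged = {k: v for k, v in right.items() if k not in drop_right}
merged.update({k: v for k, v in left.items() if k not in drop_left})
```

The remediated DRUG_NAME survives the merge while the unchecked columns are excluded from the output dataset.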
-
Type 'PHARMA_CLAIMS_CURATED_##' for the 'Dataset Name' (Where ## is your user number).
-
Type '##' in the 'Filter rulebook names' text field (where ## is your user number).
Edit the rule binding associated with the first rule, 'DRUG_NAME_VALID_##' (where ## is your user number).
-
Filter 'Tech*##' and select your folder (where ## is your user number).
-
Select your 'TechEd_DA264_##' folder (where ## is your user number).
-
Select 'PHARMA_CLAIMS_CURATED_##' (where ## is your user number). Then click 'Step 3'.
-
Select 'VALID_DRUG' for the mapping. Then click 'Create Binding'.
Edit the rule binding associated with 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).
-
Select the existing 'Connection ID' for 'PHARMA_CLAIMS_CURATED_##' (where ## is your user number).
-
You can check the execution completion status with the notification.
You have now fixed data quality issues using the data remediation preparation.
In this hands-on you:
- Accessed and discovered several data repositories (Database, Cloud Data Lake, Local File System)
- Profiled and published datasets to the catalog
- Found data quality issues
- Created your own enriched dataset
- Created a glossary and terms and associated them with your data
- Assessed the quality of a dataset using rules and a rulebook
- Improved the data with data remediation and witnessed the results
For more information about SAP Data Intelligence, you can have a look at the following resources:
- Product Home Page: https://www.sap.com/products/data-intelligence/features.html
- Help Portal: https://help.sap.com/viewer/product/SAP_DATA_INTELLIGENCE/Cloud/en-US
Thanks!