Hands-on - Part 2 (Govern and Monitor Data Quality)

In this exercise, you will continue to interact with the enriched dataset you created in the previous part. You will first modify the recipe you used to create that dataset and schedule a publication so that future metadata changes are taken into account. Then you will create rules and a rulebook to visually track the data quality issues that were found. Finally, you will curate the data.

Modify Preparation

After completing these steps you will have modified your data preparation recipe to further isolate the data quality issues in a specific column. You will then schedule a publication of this dataset so that future metadata changes are taken into account.
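As a point of reference only, the following Python/pandas sketch approximates what the filter and aggregation steps below accomplish on a local sample. The file name is a hypothetical placeholder and the code is purely illustrative; it is not part of SAP Data Intelligence or of this exercise.

```python
import pandas as pd

# Hypothetical local sample standing in for the enriched dataset;
# the column names mirror the ones used in this exercise.
claims = pd.read_csv("PHARMA_CLAIMS_ENRICHED_sample.csv")

# Same intent as the VALID_DRUG = 'NO' and DRUG_NAME 'Not Null' filters below.
invalid = claims[(claims["VALID_DRUG"] == "NO") & claims["DRUG_NAME"].notna()]

# Same intent as the 'Count' aggregation on DRUG_NAME.
by_drug = (
    invalid.groupby("DRUG_NAME", as_index=False)
    .size()
    .rename(columns={"size": "COUNT"})
    .sort_values("DRUG_NAME")
)
print(by_drug.head())
```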

  1. Click 'View Preparation'.

  2. Click 'DRUG_##' (where ## is your user number).

  3. This opens your preparation.

  4. Click 'Recipe'.

  5. Edit the 'Enrich' action.

  6. This opens the Enrich window with the defined Enrich action.

  7. Click the 'Join' icon.

  8. The join definition window opens. Scroll down to the bottom of Output Columns.

  9. Check 'POTENCY' under Right Source.

  10. Click 'Apply' (at the bottom right).

  11. 'POTENCY' is now added in the output columns.

  12. Click 'Apply Enrichment'.

  13. The dataset is now enriched with the additional selected column.

  14. Click 'Recipe'.

  15. Edit the 'Add Column ValidClaim' action.

  16. Change the 'Column Name' from ValidClaim to VALID_DRUG.

  17. Click 'Save'.

  18. The column name has changed.

  19. Click 'Add Filter'.

  20. Select 'VALID_DRUG'.

  21. Type 'NO' in the associated filter text field.

  22. Press the Enter key to apply the filter.

  23. This filters the data to show only claims containing an invalid drug.

  24. Click the menu icon next to the 'DRUG_NAME' column header.

  25. Click the sort ascending icon.

  26. The data is sorted by 'DRUG_NAME' in ascending order, but the first records contain null values.

  27. Click 'Add Filter'.

  28. Select 'DRUG_NAME'.

  29. Click the '=' icon associated with the 'DRUG_NAME' filter.

  30. Select 'Not Null'.

  31. This shows only the claims with unknown drug names that are not null.

  32. Click '<' to go back to the Actions menu.

  33. Click 'Aggregate Preparation'.

  34. The filters defined in the main data preparation (previous screen) are automatically pre-populated in the aggregation window.

  35. Drag and drop 'DRUG_NAME' to 'Output Columns'.

  36. This defines the aggregation with a preview of the result at the bottom half of the screen.

  37. Drag and drop 'DRUG_NAME' into the 'Output Columns' area a second time.

  38. This generates an expected error message, because the system now considers that you are trying to aggregate the same column twice.

  39. Change the Aggregation function from 'No Aggregation' to 'Count'.

  40. This shows a preview of the aggregation, containing the distinct drug names with their respective counts.

  41. Click 'Apply Aggregation'.

  42. The main data preparation room now contains the aggregated data. The data could be exported as-is (as it is based only on a sample of the data) for further study and analysis. We can already see some simple data quality issues in the drug names, which we will curate later on.

  43. For now, we can disable this aggregation (but still keep its definition). Click 'Recipe'.

  44. Click 'More Actions' ('...') for the 'Aggregate' action in the recipe.

  45. Click 'Disable'.

  46. The aggregation action is now disabled. The data preparation room shows the data prior to executing the aggregation. The filters that were defined were automatically removed (but are still embedded in the definition of the aggregation). At any point in time, we can come back to this data preparation recipe and reactivate this action.

  47. Click 'Actions'.

  48. Click 'Run Preparation'.

  49. Click the 'Container' icon.

  50. Select 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).

  51. Click 'Apply' (at the bottom right).

  52. Note: 'Write Mode' set to 'Overwrite' allows the new output to overwrite the existing data.

  53. Click 'Apply' (at the bottom right).

  54. Click 'Yes'.

  55. Wait until you see a notification that the task has completed. You can click the refresh button at the bottom of the screen.

  56. Click 'Data Intelligence Metadata Explorer'.

  57. Click 'Browse Connections'.

  58. Click 'DI_DATA_LAKE'.

  59. Click 'shared'.

  60. Filter the data with '*_##' (where ## is your user number).

  61. Click 'TechEd_DA264_##' (where ## is your user number).

  62. Click 'View Fact Sheet' for the file 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).

  63. The application shows the Fact Sheet associated with the selected dataset.

  64. Click 'Columns'.

  65. The columns are no longer profiled. This is because the structure of the data changed when we modified the recipe associated with the preparation task (we added the column POTENCY and changed one column name). A short sketch of the kind of statistics a profiling run computes follows this list.

  66. Click 'Profile Data'.

  67. Click 'Yes'.

  68. Click 'Refresh'.

  69. Wait until the 'Profiled:' time stamp changes, indicating that the profiling task finished successfully.

  70. The data is now profiled again and a new version of the profiling information is available. Click 'Columns'. Note: The previous profiling metadata remains available.

  71. All the new profiling information is available.

  72. Click 'Data Intelligence Metadata Explorer'.

  73. Click 'Home'.
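Step 65 above mentioned that profiling had to be re-run after the structure of the data changed. The exact statistics Metadata Explorer computes are not listed here; as a rough idea of the kind of per-column information a profiling run produces, here is a hedged pandas sketch (the file name is a placeholder).

```python
import pandas as pd

# Placeholder sample file; the real dataset lives in DI_DATA_LAKE.
df = pd.read_csv("PHARMA_CLAIMS_ENRICHED_sample.csv")

# A small subset of what column profiling typically computes:
# data type, number of null values, and number of distinct values.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "distinct": df.nunique(),
})
print(profile)
```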

Because you changed the structure of the dataset, you will also need to publish it to the catalog once again. Instead of repeating this operation manually every time the structure of the data changes, you can schedule the publication of the dataset.
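The scheduled task you configure below is handled entirely by Metadata Explorer. Purely to illustrate the idea of publishing on a schedule rather than on demand, here is a minimal stand-alone Python sketch; publish_to_catalog is a hypothetical placeholder, not an SAP Data Intelligence API.

```python
import time
from datetime import datetime, timedelta

def publish_to_catalog(dataset: str) -> None:
    # Hypothetical placeholder for "publish this dataset to the catalog";
    # in the exercise, Metadata Explorer performs the actual publication.
    print(f"{datetime.now():%H:%M} published {dataset}")

# First run a few minutes from now, then once a day, so that metadata
# changes are picked up without anyone triggering the publication manually.
next_run = datetime.now() + timedelta(minutes=5)
while True:
    if datetime.now() >= next_run:
        publish_to_catalog("PHARMA_CLAIMS_ENRICHED_##")
        next_run += timedelta(days=1)
    time.sleep(30)
```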

  1. Click 'scheduled tasks' to create a new scheduled task.

  2. Click 'Add' to add a new task.

  3. Click the 'Name' field to select the publication name to schedule.

  4. Type 'pharma claims publication ##' in the 'Filter Publications' text field. (Where ## is your user number).

  5. Select your publication (for example if you are user 99, select Pharma Claims Publication 99).

  6. Click 'OK'.

  7. Open the 'Picker' for the Start Time to give yourself a few more minutes before the first run of the scheduled publication.

  8. Click on the minutes and add 5 more minutes to the original value. For example, if the value was 06:45, set it to 06:50. Click 'OK'.

  9. Click 'OK'.

  10. Click 'Save'.

  11. Wait until the scheduled publication task is executed.

  12. Click the Refresh icon or check notifications to see when the scheduled publishing task is complete.

  13. Once done, the status will change to 'Completed'.

  14. Click on the scheduled task to see the details.

  15. Click 'Data Intelligence Metadata Explorer'.

  16. Click 'Home'.

  17. You are back to the Metadata Explorer home page.

You have now modified your data preparation recipe to further isolate the data quality issues in a specific column. You have also scheduled a publication of this dataset so that future metadata changes are taken into account.

Business Glossary

After completing these steps you will have created a glossary and business terms and associated them with your enriched dataset.
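Before clicking through the steps, it can help to picture the structure you are about to build: a glossary containing a term, custom attributes on that term, and relationships to a dataset and a column. The Python snippet below is only an illustrative mental model under those assumptions; it is not the format Metadata Explorer uses internally.

```python
from datetime import date

# Illustrative only: the glossary objects created in the steps below,
# sketched as plain Python data (not Metadata Explorer's real schema).
glossary = {
    "name": "DRUGS_GLOSSARY_##",
    "description": "Glossary for pharmaceutical drugs",
    "terms": [
        {
            "name": "Potency",
            "definition": ("Measure of drug activity expressed in terms of the "
                           "amount required to produce an effect of given intensity"),
            "custom_attributes": {     # group: Pharma Drug Attributes
                "Form": None,          # list of values: Liquid, Tablet, Capsules, Inhalers
                "Reviewer": "Gina Milo",
                "Last Reviewed": str(date.today()),
            },
            "relationships": {
                "dataset": "PHARMA_CLAIMS_ENRICHED_##",
                "column": "POTENCY",
            },
        }
    ],
}
print(glossary["terms"][0]["name"])
```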

  1. Click 'glossaries' to access or create a new glossary.

  2. You can access or create glossaries.

  3. Click '+'.

  4. Type 'DRUGS_GLOSSARY_##' in the 'Name' text field (where ## is your user number).

  5. Select 'Glossary' from the Template drop down menu.

  6. Type the following description: 'Glossary for pharmaceutical drugs'.

  7. Click 'Save' to create the glossary.

  8. You have created a glossary and can now start enriching it with new terms.

  9. Click 'Edit Term Template'.

  10. Click 'Custom Attribute'.

  11. Click 'Add Custom Attribute Group'.

  12. Type 'Pharma Drug Attributes' for 'Enter Group Name'.

  13. Click 'Add Attribute'.

  14. Type 'Form' for Name.

  15. Type 'Pharmaceutical form of medicinal products' for Description.

  16. Select 'String' from the Type drop down list.

  17. Select 'List of Values' from the Validation Type drop down list.

  18. Type 'Liquid' and press the Enter key for Valid Values.

  19. Type 'Tablet', 'Capsules', and 'Inhalers', adding each to the list of Valid Values.

  20. Click 'Ok'.

  21. Click 'Add Attribute'.

  22. Type 'Last Reviewed' for Name.

  23. Type 'Date last reviewed' for Description.

  24. Select 'Date' from the Type drop down list.

  25. Click 'Ok'.

  26. Click 'Add Attribute'.

  27. Select 'Right' from Display Column drop down menu.

  28. Type 'Reviewer' for Name.

  29. Type 'Reviewer's name' for Description.

  30. Select 'String' from the Type drop down list.

  31. Click 'Ok'.

  32. Click 'Last Reviewed'.

  33. Click '>' to move 'Last Reviewed' to the right column, under 'Reviewer'.

  34. Click 'Save'.

  35. Click '+' to create a new term.

  36. The term creation window appears.

  37. Type 'Potency' in the 'Term Name' text field.

  38. Type the following definition for the term: 'Measure of drug activity expressed in terms of the amount required to produce an effect of given intensity'.

  39. Click 'Save'.

  40. Click the 'Go back to term list' icon.

  41. The term 'Potency' is created and part of the glossary definitions. Click 'Potency'.

  42. The term window shows up.

  43. Click 'Edit'.

  44. Click 'Pharma Drug Attributes'.

  45. Type 'Gina Milo' for Reviewer.

  46. Click the Open Picker for the Last Reviewed Date.

  47. Click on today's date.

  48. Click 'Save'.

  49. Click 'Edit'.

  50. Click 'Relationships'.

  51. Click 'Manage Relationships'.

  52. Click 'Datasets or Columns'.

  53. Click 'DI_DATA_LAKE'.

  54. Click 'shared'.

  55. Search '*_##' (where ## is your user number) and press the Enter key.

  56. Check your PHARMA_CLAIMS_ENRICHED_## dataset (where ## is your user number).

  57. You can hover over 'PHARMA_CLAIMS_ENRICHED...' to see the full name.

  58. Click 'View Columns'.

  59. Check 'Potency'.

  60. Click 'Save Related Objects'.

  61. The term is now associated with a dataset and a column in that dataset.

  62. Click 'Data Intelligence Metadata Explorer'.

  63. Click 'Home'.

  64. Type 'claims' in the search bar and press the Enter key.

  65. The Catalog shows the datasets that match the search pattern.

  66. Click the 'Filter' icon.

  67. Change the Search string to 'claims*##' (where ## is your user number).

  68. Click 'Apply'.

  69. Click 'More Actions'.

  70. Click 'View Factsheet'.

  71. Click 'Overview'.

  72. Click 'Relationships'.

  73. Click 'Dataset Terms and Tags'.

  74. The glossary term 'Potency' is associated with the dataset.

  75. Click 'Columns'.

  76. Click 'POTENCY'.

  77. The glossary term 'Potency' is associated with the 'POTENCY' column of the dataset.

You have been informed that 'Gina Milo' has changed her last name, so you need to change all values of 'Gina Milo' to 'Gina Johnson'. You can do this as a mass update, so that every 'Gina Milo' Reviewer value is updated to 'Gina Johnson' at one time.
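As a hedged illustration of what such a mass update amounts to, the sketch below renames the reviewer on every term of the glossary structure sketched earlier; it is not how Metadata Explorer performs the operation internally.

```python
def rename_reviewer(glossary: dict, old: str, new: str) -> int:
    """Replace the Reviewer value on every term; returns how many terms changed."""
    changed = 0
    for term in glossary.get("terms", []):
        attrs = term.get("custom_attributes", {})
        if attrs.get("Reviewer") == old:
            attrs["Reviewer"] = new
            changed += 1
    return changed

# For example: rename_reviewer(glossary, "Gina Milo", "Gina Johnson")
```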

  1. Click 'Data Intelligence Metadata Explorer'.

  2. Click 'Glossary'.

  3. Click 'View Business Glossary'.

  4. Click 'DRUGS_GLOSSARY_##' (where ## is your user number).

  5. Click 'Edit Term Template'.

  6. Click 'Custom Attributes'.

  7. Click 'Edit' (pencil) icon to the right of Reviewer.

  8. Click 'View' to the right of 'Gina Milo'.

  9. Click 'Edit' (pencil) icon for 'Gina Milo'.

  10. For 'Replace With:' type 'Gina Johnson'.

  11. Click 'Replace'.

  12. Click 'Close'.

  13. Click '<' to go back to the term list.

  14. Click 'Potency'.

  15. Click 'Pharma Drug Attributes'.

  16. Notice the Reviewer's Name has been changed to 'Gina Johnson'.

  17. Click 'Data Intelligence Metadata Explorer'.

  18. Click 'Home'.

  19. You are back to the Data Intelligence Metadata Explorer Home page.

You have now:

  • created a glossary
  • added custom attributes, which extend the meaning of an asset beyond the definition
  • performed a mass update on a custom attribute value, where all 'Gina Milo' references were changed to 'Gina Johnson'
  • defined a business term
  • associated the business term with a dataset and a column, all to enrich your dataset.

Rules

After completing these steps you will have created rules and a rulebook that evaluate the data quality of your enriched dataset.
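The two rules created below boil down to two predicates: DRUG_NAME must not be null (completeness) and the VALID_DRUG flag must equal 'YES' (accuracy). For comparison only, here is a hedged pandas sketch that evaluates both predicates against a local sample and reports a pass rate, which is roughly the kind of result the rulebook run reports; the file name is a placeholder.

```python
import pandas as pd

df = pd.read_csv("PHARMA_CLAIMS_ENRICHED_sample.csv")  # placeholder sample

# Rule DRUG_NAME_EXISTS_## (Completeness): DRUG_NAME must not be null.
drug_name_exists = df["DRUG_NAME"].notna()

# Rule DRUG_NAME_VALID_## (Accuracy): VALID_DRUG must equal 'YES'.
drug_name_valid = df["VALID_DRUG"].eq("YES")

for rule, passed in [("DRUG_NAME_EXISTS", drug_name_exists),
                     ("DRUG_NAME_VALID", drug_name_valid)]:
    print(f"{rule}: {passed.mean():.1%} passed, {(~passed).sum()} failed rows")
```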

  1. Click 'rules' to access existing or create new rules.

  2. Click 'Create Rule' to create a new 'Completeness' rule.

  3. Select 'Completeness' from the Category drop down.

  4. Type 'DRUG_NAME_EXISTS_##' for 'Rule ID' (where ## is your user id).

  5. Type 'DRUG_NAME_EXISTS_##' for 'Name' (where ## is your user id).

  6. Type 'Drug Name must not be null' for 'Description'.

  7. Click 'Save'.

  8. The rule editor window appears.

  9. Click '+'.

  10. Type 'DRUG_NAME' for the 'Name' of the parameter for the rule.

  11. From the 'Type' drop down options, select 'String'.

  12. Click 'Save'.

  13. The parameter is created.

  14. Click '+' to define the condition.

  15. Type 'DRUG_NAME_NOT_NULL' for the 'Condition Name'.

  16. Select 'DRUG_NAME' for the 'Parameter Name' drop down list.

  17. Click 'Save'.

  18. Click '<' to go back to the list of categories.

  19. Click 'Create Rule'.

  20. Type 'DRUG_NAME_VALID_##' for 'Rule ID' (where ## is your user id).

  21. Type 'DRUG_NAME_VALID_##' for 'Name' (where ## is your user id).

  22. Type 'The drug name is valid' for 'Description'.

  23. Click 'Save'.

  24. Click '+'.

  25. Type 'DRUG_NAME_VALID' for the parameter 'Name'.

  26. Click the 'Save' icon.

  27. Click '+' on the 'Conditions'.

  28. Type 'DRUG_NAME_VALID' for the 'Condition Name'.

  29. Select 'DRUG_NAME_VALID' for the 'Parameter Name'.

  30. Select 'equals' from the 'Operator' drop down options.

  31. Type 'YES' for the 'Value or Format'.

  32. Click 'Save'.

  33. Click '<' to navigate back to the list of rule categories.

  34. You have created two different rules. Now you can create a rulebook and associate these rules with a dataset.

  35. Click 'Data Intelligence Metadata Explorer'.

  36. Click 'Rules'.

  37. Click 'View Rulebooks'.

  38. The Rulebooks Overview window is displayed.

  39. Click '+' to create a new rulebook.

  40. Type 'PHARMA_CLAIMS_##' for the rulebook 'Name' (where ## is your user number).

  41. Type 'Set of rules to evaluate the quality of dataset used to track pharma claims' for the 'Description'.

  42. Click 'Save'.

  43. The empty rulebook is created.

  44. Click 'Add Rule'.

  45. Expand 'Accuracy'.

  46. Check your rule named 'DRUG_NAME_VALID_##' (where ## is your user number).

  47. Collapse 'Accuracy'.

  48. Expand 'Completeness'.

  49. Check your rule named 'DRUG_NAME_EXISTS_##' (where ## is your user number).

  50. Click 'Save'.

  51. The rules are now imported; you need to associate them with a dataset to run them against.

  52. Click 'Create Rule Binding' to bind the rule with a dataset.

  53. Click 'Step 2'.

  54. Select 'DI_DATA_LAKE' from the 'Connections' drop down list.

  55. Click 'shared'.

  56. Type '*##' in the 'Filter items' parameter (where ## is your user number).

  57. Click your 'TechEd_DA264_##' (where ## is your user number).

  58. Type '*##' in the 'Filter items' parameter (where ## is your user number).

  59. Select 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).

  60. The dataset is selected. Click 'Step 3'.

  61. Select 'VALID_DRUG' from the 'Mapped to Column' drop down list.

  62. Click 'Create Binding'.

  63. Click 'Close'.

  64. Expand Accuracy.

  65. Notice that you have now bound the rule 'DRUG_NAME_VALID_##' to 'PHARMA_CLAIMS_ENRICHED_##.csv' (where ## is your user number).

  66. Expand Completeness.

  67. Click 'Create Rule Binding'.

  68. Click 'Step 2'.

  69. Select 'DI_DATA_LAKE'.

  70. Click 'Step 3'.

  71. Select 'DRUG_NAME' from the 'Mapped to Column' drop down list.

  72. Click 'Create Binding'.

  73. Click 'Close'.

  74. Both rules are now bound to a dataset. Click 'Run All'.

  75. Enable 'Save All'.

  76. Select 'Replace'.

  77. Enable 'Create Linked Preparation'.

  78. Click 'Run'.

  79. The rules are being evaluated. Wait until 'View Results' becomes enabled.

  80. Click 'View Results'.

  81. The Rule Result window is displayed. Expand the 'PHARMA_CLAIMS_ENRICHED_##' dataset (where ## is your user number).

  82. Click 'Details' for the rule 'DRUG_NAME_VALID_##' (where ## is your user number).

  83. Expand the 'Sample Rows' window.

  84. You can see a sample of your records.

  85. Click on 'Failed Rows'.

  86. You can see the details of the records that did not pass the rules.

  87. Click 'Data Intelligence Metadata Explorer'.

  88. Click 'Home'.

  89. You are back to the Data Intelligence Metadata Explorer home page.

You have now created rules and a rulebook that evaluate the data quality of your enriched dataset.

Data Remediation

After completing these steps you will have fixed data quality issues using the data remediation preparation.
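The remediation below fixes two misspelled drug names through the preparation UI. For comparison, a hedged pandas sketch of the same substitutions on a local sample would look like this (the file names are placeholders).

```python
import pandas as pd

df = pd.read_csv("PHARMA_CLAIMS_ENRICHED_sample.csv")  # placeholder sample

# Fix the two misspellings surfaced by the failed rows.
df["DRUG_NAME"] = df["DRUG_NAME"].replace(
    {"Nystatine": "Nystatin", "Acetaminofen": "Acetaminophen"}
)
df.to_csv("REMEDIATED_DATA_sample.csv", index=False)
```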

  1. Click 'View Preparations'.

  2. Click 'Preparation for PHARMA_CLAIMS_ENRICHED_##.csv' (where ## is your user number). Note: you can always use 'Filter Preparations' to narrow the results.

  3. Notice the values of 'Nystatine' in the 'DRUG_NAME' column.

  4. Click the header for the column named 'DRUG_NAME'.

  5. Click 'Replace'.

  6. Type 'Nystatine' for the search string.

  7. Type 'Nystatin' for the 'Replace by'.

  8. Click 'Apply'.

  9. The value 'Nystatine' was replaced by 'Nystatin'.

  10. Notice the values of 'Acetaminofen' in the 'DRUG_NAME' column.

  11. Click 'Replace'.

  12. Type 'Acetaminofen' for the search string.

  13. Type 'Acetaminophen' for the 'Replace by'.

  14. Click 'Apply'.

  15. The value 'Acetaminofen' was replaced by 'Acetaminophen'.

  16. Click '<' to go back to the 'Rulebook Dataset Results'.

  17. Click 'Run Preparation'.

  18. Type 'REMEDIATED_DATA_##' for the 'Dataset Name' (where ## is your user number).

  19. Click 'Apply'.

  20. Click 'Data Intelligence Metadata Explorer'.

  21. Click 'Home'.

You fixed data quality issues using the data preparation recipe that was automatically generated by the execution of your rulebook. In a normal process, the output dataset resulting from that data remediation process would be used, via workflows and approvals, to update the original table that contains the data quality issues. For this hands-on, you will use the data preparation recipe you created to simulate the update of the data and see the results in the rulebook.
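The recipe change below swaps the original join source for the remediated dataset and joins it back to the claims with a left join. As a rough illustration only, the pandas sketch below shows the same shape of operation; the file names and the 'CLAIM_ID' join key are hypothetical placeholders, since the exercise defines the join visually in the Enrich action.

```python
import pandas as pd

claims = pd.read_csv("PHARMA_CLAIMS_sample.csv")        # placeholder sample
remediated = pd.read_csv("REMEDIATED_DATA_sample.csv")  # output of the remediation step

# Drop the columns that will be replaced by their remediated versions,
# then left join the remediated data back onto the claims.
# 'CLAIM_ID' is a hypothetical key used only for illustration.
curated = (
    claims.drop(columns=["DRUG_NAME", "VALID_DRUG", "POTENCY"], errors="ignore")
    .merge(
        remediated[["CLAIM_ID", "DRUG_NAME", "VALID_DRUG", "POTENCY"]],
        on="CLAIM_ID",
        how="left",
    )
)
curated.to_csv("PHARMA_CLAIMS_CURATED_sample.csv", index=False)
```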

  22. Click 'View Preparations'.

  23. Click 'DRUG_##' (where ## is your user number).

  24. Click 'Recipe'.

  25. Edit the 'Enrich' action.

  26. Click '+' to add a new source.

  27. Click 'Browse'.

  28. Select the connection 'DI_DATA_LAKE' from the drop down list.

  29. Click 'shared'.

  30. Type '##' in the 'Filter items' search bar (where ## is your user number).

  31. Click 'TechEd_DA264_##' (where ## is your user number).

  32. Click 'REMEDIATED_DATA_##' (where ## is your user number). Note: You can use the 'Filter items' search to narrow down your results.

  33. Click 'OK'.

  34. After the dataset is acquired, the remediated dataset is added to the list of sources.

  35. Click 'X' to delete the join with the dataset 'D1'.

  36. Drag and drop the 'REMEDIATED_DATA_##' data source to the left side of 'P1' (where ## is your user number).

  37. Select 'Left Join'.

  38. Scroll down and uncheck 'VALID_DRUG' and 'POTENCY' under the 'left source' list of columns.

  39. Scroll down and uncheck 'ORIG_PRODUCT', 'DOSAGE', 'ROUTE_ADMINISTERED', and 'NOTES' from the right source list of columns.

  40. Click 'Apply'.

  41. Click 'Apply Enrichment'.

  42. Click 'POTENCY_0'.

  43. Click 'Rename'.

  44. Type 'POTENCY' for the 'New Column Name'.

  45. Click 'Apply'.

  46. Click '<' to go back to the Actions menu.

  47. Click 'Run Preparation'.

  48. Type 'PHARMA_CLAIMS_CURATED_##' for the 'Dataset Name' (where ## is your user number).

  49. Click 'Apply'.

  50. Click 'Data Intelligence Metadata Explorer'.

  51. Click 'Home'.

  52. Click 'rulebooks'.

  53. Type '##' in the 'Filter rulebook names' text field (where ## is your user number).

  54. Click 'PHARMA_CLAIMS_##' (where ## is your user number).

  55. Expand the 'Accuracy' set of rules.

  56. Edit the rule binding associated with the first rule, 'DRUG_NAME_VALID_##' (where ## is your user number).

  57. Delete the rule binding.

  58. Click 'Save'.

  59. Click 'Close'.

  60. Expand 'Accuracy'.

  61. Click 'Create Rule Binding'.

  62. Click 'Step 2'.

  63. Click 'Browse'.

  64. Select 'DI_DATA_LAKE' from the 'Connection' drop down list.

  65. Click 'shared'.

  66. Filter 'Tech*##' and select your folder (where ## is your user number).

  67. Select your 'TechEd_DA264_##' folder (where ## is your user number).

  68. Select 'PHARMA_CLAIMS_CURATED_##' (where ## is your user number). Then click 'Step 3'.

  69. Click 'Step 3'.

  70. Select 'VALID_DRUG' for the mapping. Then click 'Create Binding'.

  71. Click 'Create Binding'.

  72. Click 'Close'.

  73. Expand 'Completeness', then edit the rule binding.

  74. Edit the rule binding associated with the first rule, which is bound to 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).

  75. Delete the column mapping for 'DRUG_NAME'.

  76. Click 'Save'.

  77. Click 'Close'.

  78. Expand 'Completeness'.

  79. Click 'Create Rule Binding'.

  80. Click 'Step 2'.

  81. Select the existing 'Connection ID' for 'PHARMA_CLAIMS_CURATED_##' (where ## is your user number).

  82. Click 'Step 3'.

  83. Click 'Create Binding'.

  84. Click 'Close'.

  85. Click 'Run All'.

  86. Click 'Run'.

  87. Wait for the rulebook execution to finish.

  88. You can check the execution completion status with the notification.

  89. Click 'View Results'.

  90. Click 'Yes'.

  91. Expand the dataset.

  92. The quality of the data improved.

  93. Click 'Data Intelligence Metadata Explorer'.

  94. Click 'Home'.

  95. You are back to the home page of the Metadata Explorer.

You have now fixed data quality issues using the data remediation preparation.

Conclusion

In this hands-on you:

  • Accessed and discovered several data repositories (Database, Cloud Data Lake, Local File System)
  • Profiled and published datasets to the catalog
  • Found data quality issues
  • Created your own enriched dataset
  • Created a glossary and terms and associated them with your data
  • Assessed the quality of a dataset using rules and a rulebook
  • Improved the data with data remediation and witnessed the results

For more information about SAP Data Intelligence, you can have a look at the following resources:

Thanks!