Hands-on - Part 1 (Discover, Prepare, and Enrich)

As a Business Analyst, you want to discover, understand, and gain insight into your data. After accessing various data sources, profiling the data, and reviewing the profiling results, you will be able to add a glossary term with custom attributes to a column in your dataset, as well as rate and comment on the dataset.

In this exercise, you will discover and interact with various connected systems, upload a dataset, profile the data, rate the dataset, and create a relationship between the data and a glossary term.

Dataset Overview

The first dataset contains information about pharmaceutical claims for an insurance company and is already stored in a HANA system.

It contains 8 fields (a schema sketch follows the list):

  • RECORD_ID (Unique identifier associated with a claim)
  • INSURANCE (Name of the insurance company)
  • PLAN (Plan associated with the insurance company)
  • PATIENT_ID (Unique identifier of the patient this claim is for)
  • OUTSTANDING (Amount for the drug)
  • CO_PAY (Amount of co-pay, if any)
  • VISIT (Date of the visit associated with this claim)
  • DRUG_NAME (Drug name for the claim)
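
For orientation, below is a minimal SQL sketch of what the PHARMA_CLAIMS_## table could look like (shown for user '99'). The column names come from the list above; the data types and the use of the TECHED_DA264 schema name are assumptions for illustration, not the actual table definition used in the exercise.

    -- Hypothetical sketch only: column types are assumed, not taken from the real table.
    CREATE TABLE TECHED_DA264.PHARMA_CLAIMS_99 (
        RECORD_ID   NVARCHAR(36),   -- unique identifier associated with a claim
        INSURANCE   NVARCHAR(100),  -- name of the insurance company
        "PLAN"      NVARCHAR(100),  -- plan associated with the insurance company
        PATIENT_ID  NVARCHAR(36),   -- unique identifier of the patient
        OUTSTANDING DECIMAL(15,2),  -- amount for the drug
        CO_PAY      DECIMAL(15,2),  -- co-pay amount, if any
        VISIT       DATE,           -- date of the visit associated with the claim
        DRUG_NAME   NVARCHAR(200)   -- drug name for the claim
    );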

The second dataset, DRUG_##.csv, is the flat file you retrieved (and renamed) and saved to your desktop. This file contains a list of drugs that are supported by this insurance company (where ## is your user number).

It contains 6 fields (the expected header line is shown after the list):

  • ORIG_PRODUCT (Non-split entry)
  • DRUG_NAME (Drug name)
  • POTENCY (Potency associated with the drug)
  • DOSAGE (Dosage for the drug)
  • ROUTE_ADMINISTERED (How the drug is administered)
  • NOTES (Additional notes, if any)
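
Assuming a comma-separated file with the columns in the order listed above, the first line of DRUG_##.csv should be the header row below. This is the row that is treated as the column header when you later check 'Use first row as header' during data preparation.

    ORIG_PRODUCT,DRUG_NAME,POTENCY,DOSAGE,ROUTE_ADMINISTERED,NOTES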

This hands-on exercise focuses on discovering these datasets, finding patterns and data quality issues, and fixing them.

Log Into SAP Data Intelligence

After completing these steps, you will have logged into SAP Data Intelligence.

  1. Open Chrome and go to SAP Data Intelligence.

  2. Enter 'default' for Tenant Name.

  3. Click 'Proceed'.

  4. Enter the Username that was assigned to you (e.g. 'teched-da264-##') for Username.
    Note:

  • where ## is the number assigned to you.
  • If your user number is 99, then your login is 'teched-da264-99'.

  5. Enter the Password that was assigned to you for your Password.

  6. Click 'Sign In'.

  7. You are now signed in to the application.

You have successfully logged into SAP Data Intelligence.

Browse data in a Connected System (Database)

After completing these steps you will have discovered a dataset stored in a database.

  1. Click on 'Metadata Explorer'.

  2. Click on 'Browse Connections'.

  3. You are currently viewing the connections in a Grid View layout. Click 'List View'.

  4. This allows you to view a list of connections. Click '...' (More Actions) for 'HANA_DEMO'.

  5. Select 'View Capabilities' for the 'HANA_DEMO' connection.

  6. This lists all the features supported for a given connected system.

  7. Click 'Grid View' to go back to the tiles view.

  8. Click on the 'HANA_DEMO' tile.

  9. Select 'TECHED_DA264'.

  10. This displays the list of all available tables within the schema.

  11. Type 'PHARMA_CLAIMS_##' in the 'Filter items' text field (where ## is your user number; for example, if the last two digits of your username are '99', type 'PHARMA_CLAIMS_99').

  12. Click 'View Fact Sheet' on the 'PHARMA_CLAIMS_##' database table tile (where ## is your user number).

  13. This shows the 'Fact Sheet'.

The 'Fact Sheet' is the central place in the SAP Data Intelligence Metadata Explorer to find information about your data. From the Fact Sheet you can gain insight into the data, such as column types, lineage, glossary terms associated with the data, and more.

You can easily profile the data and access metadata information. The Fact Sheet also contains links and information about business terms and tags associated with the dataset or its columns. Users can describe, rate, and comment on the data collaboratively, and you can prepare the data for downstream usage.

  1. Click 'Start Profiling'.

  2. Click 'Yes'.

  3. Click 'Notification'.

  4. This shows the details of the notifications. Click anywhere outside the notification window to continue interacting with the application.

  5. Wait for the profiling task to finish; this can take some time. You can click 'Refresh' until you see a notification that profiling is complete or until the fact sheet states 'Profiled'.

  6. Click 'Refresh' after profiling is complete to view the profiling results.

  7. The fact sheet is updated with the profiling information once the task is done.

  8. Click 'Columns'.

  9. Select the line for 'DRUG_NAME'.

  10. Observe the 'Data Preview' and the 'Top 10 Distinct Values'.

  11. Scroll to the bottom of the 'Data Preview'. You can see there are data quality issues, such as spelling mistakes in the drug names.

  12. Looking at the 'Top 10 Distinct Values', you can see there is a large percentage of null values in the dataset (see the query sketch after this section).

  13. Click 'Data Intelligence Metadata Explorer'.

  14. Click 'Home'.

You have now discovered a table in a database, profiled the data, and found some data quality issues.
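
If you also happen to have SQL access to the HANA_DEMO system, the same observations can be reproduced manually with a query like the sketch below (shown for user '99'). This is purely illustrative; the exercise itself only uses the profiling results in the Metadata Explorer.

    -- Sketch: count how often each drug name occurs, including NULLs,
    -- to reproduce the 'Top 10 Distinct Values' and null-percentage observations.
    SELECT DRUG_NAME, COUNT(*) AS RECORD_COUNT
    FROM TECHED_DA264.PHARMA_CLAIMS_99
    GROUP BY DRUG_NAME
    ORDER BY RECORD_COUNT DESC;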

Upload your Dataset

After completing these steps you will have uploaded a dataset from a flat file to a cloud data lake repository using SAP Data Intelligence.

  1. Click on 'Browse Connections'.

  2. Click on the 'DI_DATA_LAKE' tile.

  3. Click on the 'shared' tile.

  4. Click on the New Folder icon (folder with a +).

  5. Enter 'TechEd_DA264_##' for folder name (where ## is the number assigned to you).

  6. Click 'OK'.

  7. Search for 'TechEd_DA264_##' to isolate your newly created folder, then click on your newly added 'TechEd_DA264_##' folder (where ## is the number assigned to you).

  8. To upload a file, click on the 'Upload Files' icon on the toolbar.

  9. Click on 'Upload' in the upper right-hand corner of the Upload Files pop-up window.

  10. Browse to your Desktop, where you downloaded and extracted 'DRUG_##.csv' (where ## is the number assigned to you) and select it.

  11. Click 'Open'.

  12. Click 'Upload'.

  13. The file is uploaded to the data lake.

  14. After Upload is Complete, click 'Close'.

  15. The file is now uploaded and available in the data lake.

You have now uploaded a dataset from a flat file on your local system to a cloud data lake data repository using SAP Data Intelligence Metadata Explorer.

Enrich Dataset and Isolate Data Quality Issues

After completing these steps you will have created a new dataset using self-service data preparation. This new dataset will help you easily isolate invalid claims. Additionally, you will profile this dataset, add a rating and a description, and publish it in the catalog so it can be easily retrieved.

  1. Click 'More Actions'.

  2. Select 'Prepare Data'.

  3. The self-service data preparation appears.

  4. The first record of the data is actually the column header.

  5. On the bottom right, check 'Use first row as header'.

  6. Click 'Continue'.

  7. The application will automatically recreate a new sample with the updated metadata structure.

  8. The dataset now has the proper header.

  9. Click 'Actions'.

  10. Click 'Enrich Preparation'.

  11. You are now able to enrich and prepare your data.

  12. Click '+' to add a new source of data to merge with your csv file.

  13. Click 'Browse'.

  14. Select 'HANA_DEMO' as the 'Connection'.

  15. Click 'TECHED_DA264'.

  16. Type 'CLAIMS_##' in the 'Filter items' text field (where ## is your user number).

  17. Select 'PHARMA_CLAIMS_##' (where ## is your user number).

  18. Click 'OK'.

  19. The application is acquiring a sample of the selected PHARMA_CLAIMS_## dataset (where ## is your user number).

  20. The selected PHARMA_CLAIMS_## dataset can now be used to merge data (where ## is your user number).

  21. Drag and drop 'PHARMA_CLAIMS_##' onto the cell on the left-hand side of the main dataset (where ## is your user number).

  22. Select 'Left Join'.

  23. Scroll down the list of output columns and uncheck 'ORIG_PRODUCT', 'POTENCY', 'DOSAGE', 'ROUTE_ADMINISTERED', and 'NOTES'.

  24. Click 'Apply'.

  25. You will now see a preview of the merged data.

  26. The merged data now shows a null value for the column 'DRUG_NAME_0' when a record from the claim data is for a drug that is not listed in the list of supported drugs.

  27. Click 'Apply Enrichment' to confirm the join/merge action.

  28. You are now back in the main self-service data preparation view, where you can see the enriched (merged) dataset.

The enriched dataset now contains null values in the field 'DRUG_NAME_0' for the records in the PHARMA_CLAIMS dataset whose drug name does not exist in our reference list.

There are multiple potential reasons for this: spelling mistakes in the drug names, drugs that are not covered by the insurance company, or a null drug name in the claim.

You can now use this enriched dataset to isolate the data quality issues to further understand the data.
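
Conceptually, the enrichment behaves like the SQL left join sketched below (shown for user '99'). This is only an illustration of the logic, not what the tool executes; in particular, the join key on DRUG_NAME and the table name DRUG_99 standing in for the uploaded DRUG_##.csv are assumptions.

    -- Sketch of the 'Left Join' enrichment logic (illustrative only).
    SELECT
        c.*,                          -- all columns from the claims data
        d.DRUG_NAME AS DRUG_NAME_0    -- NULL when the claimed drug is not in the reference list
    FROM TECHED_DA264.PHARMA_CLAIMS_99 AS c
    LEFT JOIN DRUG_99 AS d
        ON c.DRUG_NAME = d.DRUG_NAME;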

  1. Click 'Actions'.

  2. Click 'Add Columns'.

  3. Type 'ValidClaim' for the 'Column Name'.

  4. Click 'Expression'.

  5. Type the following expression: CASE WHEN "DRUG_NAME_0" IS NULL THEN 'NO' ELSE 'YES' END. This will add 'NO' in the new 'ValidClaim' column when the 'DRUG_NAME_0' column is NULL and 'YES' when the column contains data (see the sketch at the end of this section).

  6. Click 'OK'.

  7. Click 'Apply' (at the bottom right).

  8. A new column, 'ValidClaim', is now created.

  9. Select the column 'DRUG_NAME_0'.

  10. Click 'Remove'.

  11. The column 'DRUG_NAME_0' has been deleted.

  12. Click '<' to navigate back to the 'Actions' menu.

  13. Click 'Run Preparation'.

  14. Type 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number) for the 'Dataset Name'.

  15. Click 'Apply' (at the bottom right). This will create a new dataset called 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number) with the changes you have made by joining tables, creating a new column based on an expression, and removing a column.

  16. Click 'Data Intelligence Metadata Explorer'.

  17. Select 'Monitor'.

  18. Click 'Monitor Tasks'.

  19. 'Monitoring' shows the currently running tasks. Wait for your task to complete.

  20. The task is completed.

  21. Click 'Data Intelligence Metadata Explorer'.

  22. Click 'Home'.

  23. Click 'Browse Connections'.

  24. Click 'DI_DATA_LAKE'.

  25. Click 'shared'.

  26. Type '*_##' (where ## is your user number) in the Filter field.

  27. Click 'TechEd_DA264_##' (where ## is your user number).

  28. Click 'More Actions' ('...') for the newly created dataset named 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).

  29. Select 'View Fact Sheet'.

  30. Click 'Overview'.

  31. The Fact Sheet shows that the dataset has not yet been profiled or published.

  32. Click the 'Profiling' icon.

  33. Click 'Yes'.

  34. Wait for the profiling to finish (there will be two notifications, which you can check by clicking the notification icon). Then click 'Refresh'.

  35. The dataset is now profiled.

  36. Click '<' to go back to the connection browser.

  37. Click 'More Actions' ('...') for PHARMA_CLAIMS_ENRICHED_## (where ## is your user number).

  38. Click 'New Publication'.

  39. Type 'Pharma Claims Publication ##' (where ## is your user number) for the 'Name' text field.

  40. Type 'Publication for enriched claimed data' for the 'Description' text field.

  41. Click 'Publish' (at the bottom right).

  42. The application sends a notification when the publication task is triggered.

  43. The application sends another notification when the publication task is finished.

  44. Click 'Refresh'.

  45. You can now see that the dataset is both profiled and published in the application catalog.

  46. Click 'View Fact Sheet'.

  47. Click 'Reviews'.

  48. Click the stars icon to post a rating.

  49. Click to define a rating (for example, a 4-star rating is set by clicking the 4th star).

  50. Add a comment: 'This dataset helps to easily identify claims for drugs that are not compliant'.

  51. Click 'OK'.

  52. The dataset has been enriched with a rating and a comment.

  53. Click 'Data Intelligence Metadata Explorer' and Click 'Home'.

  54. You have returned to the Metadata Explorer home page.

You have now created a new dataset by merging two datasets using self-service data preparation. This new dataset helps you easily isolate invalid pharmaceutical claims. You also profiled this dataset, added a rating and a description, and published it in the catalog so it can be easily retrieved.
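
As a final illustration, the sketch below shows how the derived 'ValidClaim' column could be used to isolate and quantify the invalid claims if the enriched dataset were loaded into a queryable table (the table name PHARMA_CLAIMS_ENRICHED_99 is hypothetical here). In the exercise itself you would simply preview or filter the dataset in the Metadata Explorer.

    -- Sketch: quantify claims by validity using the derived 'ValidClaim' column.
    SELECT "ValidClaim", COUNT(*) AS CLAIM_COUNT
    FROM PHARMA_CLAIMS_ENRICHED_99
    GROUP BY "ValidClaim";

    -- Sketch: list only the claims flagged as invalid.
    SELECT *
    FROM PHARMA_CLAIMS_ENRICHED_99
    WHERE "ValidClaim" = 'NO';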

Summary

You've now used the Metadata Explorer to connect to and interact with different data repositories (databases, a cloud data lake, and your local file system). You profiled and discovered data to identify data quality issues, created a new enriched dataset to isolate these issues, and published this dataset to the catalog.

Continue to - Hands-on - Part 2: Govern and Monitor Data Quality