As a Business Analyst, you want to discover, understand, and gain insight into your data. After accessing various data, profiling the data, and reviewing the profiling results you will be able to add a glossary term, with custom attributes, to a column in your dataset as well as provide a rating and comment on the dataset.
In this exercise, you will discover and interact with various connected systems, upload a dataset, profile the data, rate dataset, create a relationship with data and a glossary term
The first dataset, contains information about pharmaceutic claims for an insurance company and is already stored in a HANA system.
It contains 8 fields:
- RECORD_ID (Unique Identifier associated to a claim)
- INSURANCE (Name of the insurance company)
- PLAN (Plan associated at the insurance company)
- PATIENT_ID (Unique Identifier of the patient this claim is for)
- OUTSTANDING (Amount for the drug)
- CO_PAY (Amount of co pay, if any)
- VISIT (Date of the visit associated to this claim)
- DRUG_NAME (Drug name for the claim)
The second dataset, DRUG__##.csv, is the flat file you retrieved (and renamed) and saved to your desktop. This file contains a list of drugs which are supported by this insurance company (where ## is your user number).
It contains 6 fields:
- ORIG_PRODUCT (Non split entry)
- DRUG_NAME (Drug name)
- POTENCY (Potency associated to the drug)
- DOSAGE (Dosage for the drug)
- ROUTE_ADMINISTERED (How the drug is administered)
- NOTES (Additional notes, if any)
This hands on will focus on discovering these datasets, find patterns and data quality issues, and fix them.
After completing these steps, you will have logged into SAP Data Intelligence.
-
Open Chrome and go to the SAP Data Intelligence.
-
Enter the Username that was assigned to you (e.g. 'teched-da264-##'), for Username.
Note:
- where ## is the number assigned to you.
- If your user number is 99 then your login is 'teched-da264-99'.
You have successfully logged into SAP Data Intelligence.
After completing these steps you will have discovered dataset stored in a database.
-
You are currently viewing a connections in a Grid View layout. Click 'List View'.
-
This allows you to view a list of connections. Click '...' (More Actions) for 'HANA_DEMO'.
-
This lists all the features supported for a given connected system.
-
Type 'PHARMA_CLAIMS_##' in the 'Filter items' text field (where ## is your user number, for example if your last 2-digits of your username is '99', then type 'PHARMA_CLAIMS_99').
-
Click 'View FactSheet' on the 'PHARMA_CLAIM_##' database table tile (where ## is your user number).
.
The 'Fact Sheet' is the central place in SAP Data Intelligence Metatadata Explorer to find information about your data. From the Fact Sheet you can get insight into the data, such as column types, lineage, glossary terms that are associated to the data, and more.
You can easily profile the data and get access to metadata information. It also contains links and information about business terms and tags associated to the dataset or the columns. Users can describe, rate, and comment on the data collaboratively. You can preparare the data for other downstream usage.
-
This shows the details of the notifications. Click anywhere outside the notification window to continue interacting with the application.
-
Wait for the profiling task to finish, this action can take some time. You can click 'Refresh' until you see a notification that the profiling is complete or if you see the factsheet now states 'Profiled'. Click 'Refresh' after profiling is complete to see the profiling results.
-
Click 'Refresh' after profiling is complete to view the profiling results.
-
The factsheet was updated with the profiling information once the task is done.
-
Scroll to the bottom of 'Data Preview'. We can see there are data quality issues such as spelling mistakes on the drug names.
-
Looking at the 'Top 10 Distinct Values', we can see there is a large percentage of null values in the dataset.
You have now discovered a table in a database, profiled the data, and found some data quality issues.
After completing these steps you will have uploaded a dataset from a flat flat to a cloud data lake data repository using SAP Data Intelligence.
-
Enter 'TechEd_DA264_##' for folder name (where ## is the number assigned to you).
-
Search for 'TechEd_DA264_##' to isolate your newly created folder, then click on your newly added 'TechEd_DA264_##' folder (where ## is the number assigned to you).
-
Upload a file, click on the 'Upload Files' icon on the toolbar.
-
Click on 'Upload' in the upper right hand corner of the Upload Files pop-up window.
-
Browse to your Desktop, where you downloaded and extracted 'DRUG_##.csv' (where ## is the number assigned to you) and select it.
You have now uploaded a dataset from a flat file on your local system to a cloud data lake data repository using SAP Data Intelligence Metadata Explorer.
After completing these steps you will have created a new dataset using self-service data preparation. This new dataset will help to easily isolate invalid claims. Additionally you will profile this dataset, add a rating and description and publish it in the catalog so it can be easily retrieved.
-
The application will automatically recreate a new sample with the updated metadata structure.
-
Click '+' to add a new source of data to merge with your csv file.
-
Type 'CLAIMS_##' in the 'Filter items' text field (where ## is your user number).
-
The application is acquiring a sample of the selected PHARMA_CLAIMS_## dataset (where ## is your user number).
-
The selected PHARMA_CLAIMS_## dataset can now be used to merge data (where ## is your user number).
-
Drag and drop 'PHARMA_CLAIMS_##' on the cell on the left hand-side of the main dataset (where ## is your user number).
-
Scroll down the list of output columns and uncheck 'ORIG_PRODUCT', 'POTENCY', 'DOSAGE', 'ROUTE_ADMINISTERED', and 'NOTES'.
-
The merged data now shows a null value for the column 'DRUG_NAME_0' when a record from the claim data is for a drug that is not listed in the list of supported drugs.
-
You are now back on the main self-service data preparation, where you can see enriched (merged) dataset.
The enriched dataset now contains null values for the field 'DRUG_NAME_0' for the records in the PHARMA_CLAIMS dataset which the drug name did not exists in our reference.
There are potential multiple reasons for that. Some reasons might be spelling mistakes of drug names or drugs that are not taken into account by the insurance company or the drug name in our claim was null.
You can now use this enriched dataset to isolate the data quality issues to further understand the data.
-
Type the following expression: 'CASE WHEN "DRUG_NAME_0" IS NULL THEN 'NO' ELSE 'YES' END'. This will add a 'NO' in the new 'ValidClaim' column when the 'DRUG_NAME_0' colum is NULL and a 'YES' value when the column contains data.
-
Type 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number) for the 'Dataset Name'.
-
Click 'Apply' (at the bottom right). This will create a new dataset called 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number) with the changes you have made by joining tables, creating a new column based on an expression, and removing a column.
-
'Monitoring' shows the current running tasks. Wait for your task to complete.
-
Type '*_##' (where ## is your user number) in the Filter field.
-
Click 'More Actions' ('...') newly created dataset named PHARMA_CLAIMS_ENRICH_## (where ## is your user number).
-
The Fact Sheet for the dataset is not profiled and has not been published.
-
Wait for the profiling to be executed (there will be two notifications which you can check by clicking on the notification icon). Then Click 'Refresh'.
-
Click 'More Actions' ('...') for PHARMA_CLAIMS_ENRICHED_## (where ## is your user number).
-
Type 'Pharma Claims Publication ##' (where ## is your user number) for the 'Name' text field.
-
Type 'Publication for enriched claimed data' for the 'Description' text field.
-
The application sends a notification for the publication task trigger.
-
The application sends another notification when the publication task is finished.
-
You can now see that the dataset is both profiled and published in the application catalog.
-
Click and define a rating (for example 4 stars rating is done by clicking the 4th star).
-
Add a comment: 'This dataset helps to easily identify claims for drugs that are not compliant'.
-
Click 'Data Intelligence Metadata Explorer' and Click 'Home'.
You have now created a new dataset by merging two datasets using self-service data preparation. This new dataset helps to easily isolate invalid pharmaceutical claims. You also profiled this dataset, added a rating and a description, and published the dataset in the catalog so it can be easily retrieved.
You've now used Metadata Explorer to connect and interact with different data repositories (Databases, Cloud Data Lake, Local File System). You profiled and discovered data to identify data quality issues. You created a new enriched dataset to isolate these data quality issues. You published this dataset to the catalog.
Continue to - Hands-on - Part 2: Govern and Monitor Data Quality