-
Notifications
You must be signed in to change notification settings - Fork 0
/
project_notes.txt
213 lines (141 loc) · 6.29 KB
/
project_notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
Mr K DE Project Azure Databricks etc.
€184.67
€182.99 / €2.22
€182.83 / €2.38
€182.22 / €2.99
181.39 / 2.99
181.39 / 3.00
181.37
180.83
To do:
what are common databricks transformations on a relational database?
do pyspark questions to find transformations
do an incremental data loading version of project next
---------------
OVERVIEW
-Common usecase: migrate on-prem MSSQL DB
-Use ADF (ETL tool mainly used for data ingestion) to copy all tables from DB and move to cloud
-Tables will be stored in cloud in Az Data Lake Gen2
-Will then use Az Databricks to transform the data
-Lakehouse architecture used (bronze, silver, gold layers)
-Bronze layer -> source of truth, exact copy of data at source
-Then connect AzDB to bronze layer and do tranformation and load transformed data into Silver layer
-Common bronze->silver transf -> ( change column names/datatypes)
-Then silver -> gold transf - final cleaned curated data
-Synapse, similar to MSSQL on cloud, load gold layer into these cloud db tables
-Synapse -> Power BI
-Security & Governance -> Az Active Directory (identity access management), Az Key Vault
-Whole pipeline is automated
---------------
PART 1: Environment setup
main resource group: rg-mrk-de-project
datasource: mssql
SSMS:
Sql Server Authentication:
mrk
Mind....41-
sa
Mind....44-
setup username/password in keyvault -> now encrypted
---------------
PART 2: Data Ingestion
MSSQL -> ADF: use Integration Runtime to establish connection
-> install SHIR on local database device
AutoResolveIntegrationRuntime: connects cloud to cloud services, e.g. ADF -> ADL
Copy tables from MSSQL (source) -> ADL (sink) with ADF:
Author > Pipelines
Access policies not available.
The access configuration for this key vault is set to role-based access control. To add or manage your access policies, go to the Access control (IAM) page.
FIX: https://stackoverflow.com/questions/76325987/access-policies-not-available
create source
create sink
move Address table to datalake
repeat for all 10 tables
create sql script to list all tables: getschema.sql
use that script in ADF to grab all tables
Azure Data Lake:
his details at 39:06 in vid
Synapse workspace:
DLSG2:
account name: mrkdatalakegen2gc
file system name: fs-mrk-gc
SQL Server admin login: sqladminuser, pw: Min...41-
--------------------------------------
PART 3: Data Transformation
-create spark cluster
-make sure azdb has access to adl -> Enable passthrough checked
(azure data lake gen 2 blob storage, need to create a container to create a blob)
give adf storage blob contributor access
https://learn.microsoft.com/en-us/azure/databricks/archive/credential-passthrough/adls-passthrough
-mount the data lake: storagemount notebook
-use bronze mount point, do first transf and load to silver container using silver mount point
-read data from silver container, do further transf and load to gold layer using gold mount point
data is structured so needs minimal transf
Transformation 1: convert data/time format to date
-any col with date in title then convert from date/time to date format
_delta_log does version tracking
Silver transf generally joining multiple tables together, doing some aggregation, creating facts/dimensions for warehousing, but we'll be doing a simple transf
Transformation 2 (Silver to Gold):
-separate pascal case column names with underscore
Now use ADF pipelines to run notebooks as jobs
update copy_all_tables pipeline to include the databricks notebook
To establish connection between azdb and adf create a linked service connection
%sql - magic command
databricks - adf token:
-----------------------------------------------
PART 4: Data Loading - Synapse
Synapse config on vid @ 1:52:09
Synapse is like ADF + Az DBricks
(a dedicated SQL pool represents a collection of analytic resources that are provisioned while using Synapse)
Serverless SQL DB:
- smaller workloads
- cannot pause decicated sql pool
- just using compute (uses data lake as storage)
Dedicated SQL DB: using compute + storage
- higher workloads
- can pause decicated sql pool
- using both compute and storage (data must be loaded into sql pool)
-create az sql db : gold_db
-will be using serverless sql pool to query data from data lake
-data in gold layer is in delta format
-create a view in the sql serverless db for address table
-create pipeline to create views for all tables using synapse
-to create pipeline we need to create stored procedure with parameters which can dynamically create the views with data from the gold container
-create pipelines to use the sp to make the views
-create linked service connection to connect to serverless sql db
-once views created any change in data lake will be mirrored in views, but if db schema changes the synapse pipeline will need to be re-run
-now use power bi to connect to gold_db and fetch all the views
PART 5: Data Reporting
get data > Azure > Azure Synapse Analytics (SQL DW)
server endpoint: synapse url (synapse workspace>Properties>Serverless SQL endpoint)
(synw-mrk-demo-01-gc-ondemand.sql.azuresynapse.net)
database: gold_db
PART 6: End to End Pipeline Testing
----------------------------------------------
MISC
Fix:
Azure Keyvault: The operation is not allowed by RBAC.
https://learn.microsoft.com/en-us/answers/questions/1370440/azure-keyvault-the-operation-is-not-allowed-by-rba
C:\Program Files\Microsoft SQL Server\MSSQL16.MSSQLSERVER\MSSQL\Backup
If the database being restored does not exist, the user must have CREATE DATABASE permissions to be able to successfully restore the database. If the database exists, RESTORE permissions default to members of the sysadmin and dbcreator fixed server roles and the owner (dbo) of the database.
https://learn.microsoft.com/en-us/sql/relational-databases/backup-restore/restore-database-general-page?view=sql-server-ver16
@{concat('SELECT * FROM ', item().SchemaName, '.', item().TableName)}
bronze/SalesLT/Address/Address.parquet
@{concat(dataset().schemaname, '/', dataset().tablename)}
@{concat(dataset().tablename,'.parquet')}
=========================
March 12 REDO
original resource list:
main resource group: rg-mrk-de-project
kv-mrk-demo-001-gc
adf-mrk-demo-001-gc
mrkdatalakegen2gc
dbw-mrk-demo-001
synw-mrk-demo-01-gc
original resource list:
rg-gavazade
kv-gavazade
adf-gavazade
datalakegen2-gavazade
dbw-gavazade
synw-gavazade