By introducing a specific format for defining the shape of your datalake, we can use simple templating mechanisms to generate the artifacts needed to publish the datalake to a cloud vendor. Unlike other templating options, this does not restrict the shape of the datalake: you may add as many datalake paths (buckets, containers, filesystems, etc.) and as many datalake roles to the definition as you like. The resulting artifacts can be further transformed into Terraform scripts, used with cloud vendor APIs or CLIs, or copied and pasted manually.
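As a rough illustration of the templating idea, the sketch below renders one storage path and a set of actions into an AWS-style policy document with Jinja2. The template text, bucket name, and action list are assumptions made for the sake of the example, not the templates this tool actually ships with.

```python
# Minimal sketch of the templating idea; NOT the templates used by ddt.
# Assumes the jinja2 package is installed.
from jinja2 import Template

policy_template = Template("""{
  "Version": "2012-10-17",
  "Statement": [
    {% for path in paths %}
    {"Effect": "Allow",
     "Action": {{ actions | tojson }},
     "Resource": "arn:aws:s3:::{{ bucket }}{{ path }}/*"}{{ "," if not loop.last }}
    {% endfor %}
  ]
}""")

print(policy_template.render(
    bucket="ljm",                              # hypothetical bucket name
    paths=["/ljm/logs"],                       # a path taken from a definition
    actions=["s3:GetObject", "s3:PutObject"],  # actions implied by read-write
))
```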
The same file can then be used by additional consumers: to map users to the most appropriate IAM roles based on identity and the paths being accessed, to provide meaningful visualizations of the datalake for use in UIs, and so on.
Your datalake definitions may also be committed to SCM systems to share them and to allow multiple admins to maintain them over the lifetime of your datalakes.
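To make the role-mapping idea concrete, here is a hedged sketch of one way a consumer could find the candidate IAM roles for a path a user wants to access, using the ddf.yaml shown later in this README. The matching rule (simple path-prefix checks) and the function name are assumptions, not behavior provided by ddt.

```python
# Hedged sketch of a DDF consumer; the matching logic is an assumption.
# Requires PyYAML and the datalakes/ljm/ddf.yaml produced in the example below.
import yaml

def roles_for_path(ddf, requested_path):
    """Return {datalake_role: iam_role} for every role whose granted storage
    location is a path prefix of the requested path."""
    matches = {}
    for role_name, role in ddf["datalake_roles"].items():
        for perm in role.get("permissions", []):
            parts = perm.split(":")  # e.g. storage:read-write:LOGS_LOCATION_BASE
            if len(parts) == 3 and parts[0] == "storage":
                granted_path = ddf["storage"][parts[2]]["path"]
                if requested_path.startswith(granted_path):
                    matches[role_name] = role["iam_role"]
    return matches

with open("datalakes/ljm/ddf.yaml") as f:
    ddf = yaml.safe_load(f)

# A request under /ljm/logs matches LOG_ROLE and the broader DATALAKE_ADMIN_ROLE.
print(roles_for_path(ddf, "/ljm/logs/app.log"))
```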
- Clone this project to your local machine
- cd datalake-def
- Install the Python modules needed by the app:

      python setup.py install

- Example invocation:

      ./ddt.py "set debug true" "new_datalake --name {name}" "build_datalake [--name {name}] --cloud AWS" quit
lmccay@strange:~/Projects/datalake-def$ ./ddt.py
ddt> help -v
Documented commands (use 'help -v' for verbose/'help <topic>' for details):
================================================================================
add_path Add a Storage Path to the Definition.
add_role Add a DataLake Role to the Definition.
alias Manage aliases
build_datalake Build IAM artifacts for given vendor from the DataLake DDF.
edit Run a text editor and optionally open a file with it
help List available commands or provide detailed help for a specific command
history View, run, edit, save, or clear previously entered commands
load_datalake Load a DataLake from persisted DDF.
macro Manage macros
new_datalake Create a new DataLake DDF.
paths List all Storage paths in the Definition.
push_datalake Publish IAM artifacts and buckets to given vendor.
py Invoke Python command or shell
quit Exit this application
recall_datalake Unpublish IAM artifacts and buckets from given vendor.
roles List all DataLake Roles in the Definition.
run_pyscript Run a Python script file inside the console
run_script Run commands in script file that is encoded as either ASCII or UTF-8 text
save_datalake Persist a DataLake DDF.
set Set a settable parameter or show current settings of parameters
shell Execute a command as if at the OS prompt
shortcuts List available shortcuts
lmccay@strange:~/Projects/datalake-def$ ./ddt.py "set debug true" "new_datalake -n ljm" "build_datalake -c AWS" quit
debug - was: False
now: True
datalake: ljm
datalake_roles:
  DATALAKE_ADMIN_ROLE:
    iam_role: cdp-ljm-admin-s3-role
    permissions: ['storage:full-access:STORAGE_LOCATION_BASE']
  IDBROKER_ROLE:
    iam_role: cdp-ljm-idbroker-assume-role
    instance_profile: true
    permissions: ['sts:assume-roles']
  LOG_ROLE:
    iam_role: cdp-ljm-log-role
    instance_profile: true
    permissions: ['storage:read-write:LOGS_LOCATION_BASE']
  RANGER_AUDIT_ROLE:
    iam_role: cdp-ljm-ranger-audit-s3-role
    permissions: ['storage:full-object-access:RANGER_AUDIT_LOCATION', 'storage:list-only:DATALAKE_BUCKET']
nosql: {TABLE_NAME: ljm}
permissions:
  storage:
    full-access: {description: the force, rank: 1}
    full-object-access: {description: jedi master, rank: 2}
    list-only: {description: youngling, rank: 5}
    read-only: {description: padawan, rank: 4}
    read-write: {description: jedi knight, rank: 3}
  sts:
    assume-roles: {description: shapeshifter, rank: 1}
storage:
  DATALAKE_BUCKET: {path: /ljm/data}
  LOGS_BUCKET: {path: /ljm}
  LOGS_LOCATION_BASE: {path: /ljm/logs}
  RANGER_AUDIT_LOCATION: {path: /ljm/ranger/audit}
  STORAGE_LOCATION_BASE: {path: /ljm}
Cloud Type: AWS
Vendor: Amazon
Building Amazon Cloud artifacts for datalake named: ljm...
The datalake role: IDBROKER_ROLE is assigned the iam role: cdp-ljm-idbroker-assume-role which has been granted: assumeRoles
The datalake role: LOG_ROLE is assigned the iam role: cdp-ljm-log-role which has been granted: read-write for path: /ljm/logs
The datalake role: RANGER_AUDIT_ROLE is assigned the iam role: cdp-ljm-ranger-audit-s3-role which has been granted: full-object-access for path: /ljm/ranger/audit
The datalake role: RANGER_AUDIT_ROLE is assigned the iam role: cdp-ljm-ranger-audit-s3-role which has been granted: list-only for path: /ljm/data
The datalake role: DATALAKE_ADMIN_ROLE is assigned the iam role: cdp-ljm-admin-s3-role which has been granted: full-access for path: /ljm
lmccay@strange:~/Projects/datalake-def$ ls datalakes/ljm/
AWS/ ddf.yaml
lmccay@strange:~/Projects/datalake-def$ ls datalakes/ljm/AWS/
cdp-ljm-admin-s3-role-policy.json cdp-ljm-idbroker-assume-role-policy.json cdp-ljm-log-role-policy.json cdp-ljm-ranger-audit-s3-role-policy1.json cdp-ljm-ranger-audit-s3-role-policy.json
lmccay@strange:~/Projects/datalake-def$ cat datalakes/ljm/ddf.yaml
---
datalake: ljm
datalake_roles:
  IDBROKER_ROLE:
    iam_role: cdp-ljm-idbroker-assume-role
    instance_profile: true
    permissions:
    - "sts:assume-roles"
  LOG_ROLE:
    iam_role: cdp-ljm-log-role
    instance_profile: true
    permissions:
    - "storage:read-write:LOGS_LOCATION_BASE"
  RANGER_AUDIT_ROLE:
    iam_role: cdp-ljm-ranger-audit-s3-role
    permissions:
    - "storage:full-object-access:RANGER_AUDIT_LOCATION"
    - "storage:list-only:DATALAKE_BUCKET"
  DATALAKE_ADMIN_ROLE:
    iam_role: cdp-ljm-admin-s3-role
    permissions:
    - "storage:full-access:STORAGE_LOCATION_BASE"
    - "db:full-table-access:ljm-table"
storage:
  STORAGE_LOCATION_BASE:
    # main data directory
    path: /ljm
  DATALAKE_BUCKET:
    # main data directory
    path: /ljm/data
  RANGER_AUDIT_LOCATION:
    # ranger audit logs
    path: /ljm/ranger/audit
  LOGS_LOCATION_BASE:
    # logs for fluentd usecases
    path: /ljm/logs
  LOGS_BUCKET:
    # logs for fluentd usecases
    path: /ljm
permissions:
  storage:
    full-access:
      rank: 1
      description: the force
    full-object-access:
      rank: 2
      description: jedi master
    read-write:
      rank: 3
      description: jedi knight
    read-only:
      rank: 4
      description: padawan
    list-only:
      rank: 5
      description: youngling
  sts:
    assume-roles:
      rank: 1
      description: shapeshifter
  db:
    full-table-access:
      rank: 1
      description: dba
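Because the DDF is plain YAML, consumers can also work with it directly. For instance, when the definition is committed to SCM as suggested above, a small sanity check like the sketch below could run in CI to catch a role permission that references a storage location that was never defined. This check is an assumption for illustration, not something ddt provides.

```python
# Illustrative CI-style sanity check for a DDF; not part of ddt itself.
# Requires PyYAML.
import sys
import yaml

with open("datalakes/ljm/ddf.yaml") as f:
    ddf = yaml.safe_load(f)

known_locations = set(ddf["storage"])
problems = []
for role_name, role in ddf["datalake_roles"].items():
    for perm in role.get("permissions", []):
        parts = perm.split(":")
        if parts[0] == "storage" and len(parts) == 3 and parts[2] not in known_locations:
            problems.append(f"{role_name}: unknown storage location {parts[2]}")

if problems:
    sys.exit("\n".join(problems))
print("ddf.yaml storage references look consistent")
```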
The following steps describe one way to set up and run the scripts for Azure.
- Sign in to your Azure account:

      az login

- Create a Service Principal (if not already created) and make sure the SP has the Owner role at the subscription level:

      az ad sp create-for-rbac --name KnoxSP --password knox-password > local-sp.json

  NOTE: If you want to assign permissions explicitly, use the --skip-assignment option and grant the Service Principal the Owner role at the subscription level later.

- Get the subscription id:

      az account show

- Set up the required environment variables (update the values as necessary); a sketch for verifying these credentials follows these steps:

      AZURE_SUBSCRIPTION_ID="aa11bb33-cc77-dd88-ee99-0918273645aa"
      AZURE_TENANT_ID="00112233-7777-8888-9999-aabbccddeeff"
      AZURE_CLIENT_ID="12345678-1111-2222-3333-1234567890ab"
      AZURE_CLIENT_SECRET="abcdef00-4444-5555-6666-1234567890ab"

- Optionally, set the resource group environment variable (update the value as necessary):

      AZURE_RESOURCE_GROUP="myResourceGroup"   # Resource group under which MSIs will be created; the default is <datalakename>RG

  For more information see the Azure Configure Authentication docs.

- Create a default DDF, build the Azure artifacts, and push them:

      ddt.py "set debug true" "new_datalake -n srm" "build_datalake -c Azure" "push_datalake -c Azure" quit
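The AZURE_* variables above follow the convention read by the azure-identity library, so one optional way to confirm they are set correctly before pushing (an extra check, not a step ddt requires) is:

```python
# Optional credential check; assumes the azure-identity package is installed.
import os
from azure.identity import EnvironmentCredential

for var in ("AZURE_SUBSCRIPTION_ID", "AZURE_TENANT_ID",
            "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"):
    if not os.environ.get(var):
        raise SystemExit(f"{var} is not set")

# EnvironmentCredential reads AZURE_TENANT_ID, AZURE_CLIENT_ID and AZURE_CLIENT_SECRET.
token = EnvironmentCredential().get_token("https://management.azure.com/.default")
print("Acquired an ARM token; expires at", token.expires_on)
```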
The following steps describe how to set up and run the scripts for Google Cloud Platform (GCP).
- A Google Cloud Platform account with permissions associated with the following GCP roles:
- StorageAdmin (bucket/path creation/removal)
- ServiceAccountAdmin (service account creation/removal)
- SecurityAdmin (IAM policy attachment)
- Download the key file for the Google Cloud Platform account to be used by the Datalake Definition tool.
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to that key file.
- Set the DDF_LOG_LEVEL environment variable to DEBUG to get debug-level log messages from the GCPFactory.
N.B. Datalake names are translated into Google Cloud Storage bucket names, which must be globally unique. Attempts to access buckets that don't belong to you will result in HTTP 403 errors, even if your permissions include the StorageAdmin role.
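As an optional pre-flight check (not something ddt performs for you), the sketch below confirms that GOOGLE_APPLICATION_CREDENTIALS is set and probes whether a candidate bucket name is already taken. The bucket name mydl is only a stand-in for whatever name gets derived from your datalake name.

```python
# Optional GCP pre-flight sketch; assumes the google-cloud-storage package.
import os
from google.api_core.exceptions import Forbidden
from google.cloud import storage

if not os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
    raise SystemExit("Point GOOGLE_APPLICATION_CREDENTIALS at the downloaded key file")

client = storage.Client()   # picks up the key file automatically
bucket_name = "mydl"        # hypothetical; stand-in for the name derived from the datalake
try:
    bucket = client.lookup_bucket(bucket_name)
except Forbidden:
    print(f"{bucket_name} exists but belongs to someone else (HTTP 403)")
else:
    if bucket is None:
        print(f"{bucket_name} appears to be available")
    else:
        print(f"{bucket_name} already exists in your project")
```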
- Create a Google Cloud Platform datalake definition from the templates:

      ddt.py "new_datalake -n mydl" "build_datalake -n mydl -c GCP" quit

- Push a Google Cloud Platform datalake definition:

      ddt.py "push_datalake -n mydl -c GCP" quit

- Recall a Google Cloud Platform datalake definition:

      ddt.py "recall_datalake -n mydl -c GCP" quit
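If you prefer to drive the whole GCP lifecycle from one place, a small wrapper can simply replay the commands above. This is purely illustrative; ddt itself is the supported entry point.

```python
# Illustrative wrapper that replays the documented GCP commands in order.
import subprocess

steps = [
    ["./ddt.py", "new_datalake -n mydl", "build_datalake -n mydl -c GCP", "quit"],
    ["./ddt.py", "push_datalake -n mydl -c GCP", "quit"],
    # ["./ddt.py", "recall_datalake -n mydl -c GCP", "quit"],  # tear everything down again
]
for step in steps:
    subprocess.run(step, check=True)
```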