Azure Data Factory Provisioning

Requirements

  • Create a storage account named blobstorageaccount
  • Copy the storage account keys from "Settings -> Access Keys"
  • Create a blob storage container named blobstoragecontainer
  • Upload the Pig script oilproc.pig to the blob container
  • Upload the Pig utility library piggybank.jar to the blob container

I also create an Azure ML trained model to make predictions on the data. The web service endpoint and API key come from the Azure ML portal.
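
Those two values plug into the GasPriceAzureMLLinkedService definition described under Linked Services below. A minimal sketch, assuming the classic (v1) ADF JSON schema, with placeholder endpoint and key:

```json
{
  "name": "GasPriceAzureMLLinkedService",
  "properties": {
    "type": "AzureML",
    "typeProperties": {
      "mlEndpoint": "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/jobs",
      "apiKey": "<API key from the Azure ML portal>"
    }
  }
}
```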

Provision Data Factory

Create the factory with the name azuredatafactory.

Under "Author and deploy", create the following items:

Linked Services

  • AzureStorageLinkedService.json: Connects to the Azure Blob storage created above (a sketch appears after this list)
  • EIATableLinkedService.json: Retrieves the daily Crude Oil Spot Price from www.eia.gov (one of these web-table services is also sketched after this list)
  • EIATableLinkedService0.json: Retrieves weekly Gas Prices from www.eia.gov
  • EIATableLinkedService2.json: Retrieves the weekly Stock of Finished Gas from www.eia.gov
  • EIATableLinkedService3.json: Retrieves the weekly Days of Supply of gas from www.eia.gov
  • EIATableLinkedService4.json: Retrieves weekly Gas Imports from www.eia.gov
  • EIATableLinkedService5.json: Retrieves weekly Refinery Utilization from www.eia.gov
  • EIATableLinkedService7.json: Retrieves weekly Crude Oil Exports from www.eia.gov
  • EIATableLinkedService8.json: Retrieves weekly Crude Oil Imports from www.eia.gov
  • GasPriceAzureMLLinkedService.json: Connects to the trained Azure ML model web service (sketched in the Requirements section above)
  • HDInsightOnDemandLinkedService.json: Creates an on-demand HDInsight cluster to run the Hadoop Pig script that processes and joins the data before predictions are made (a sketch appears after this list)
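
A minimal sketch of AzureStorageLinkedService.json, assuming the v1 schema; the account key placeholder takes the access key copied in the Requirements step:

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=blobstorageaccount;AccountKey=<access key>"
    }
  }
}
```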
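
The EIA services are web-table linked services. A sketch of the first one; the exact EIA page URL is a placeholder:

```json
{
  "name": "EIATableLinkedService",
  "properties": {
    "type": "Web",
    "typeProperties": {
      "url": "http://www.eia.gov/<path-to-crude-oil-spot-price-page>",
      "authenticationType": "Anonymous"
    }
  }
}
```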
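
And a sketch of the on-demand HDInsight service; the cluster size and time-to-live here are illustrative, not taken from the original configuration:

```json
{
  "name": "HDInsightOnDemandLinkedService",
  "properties": {
    "type": "HDInsightOnDemand",
    "typeProperties": {
      "clusterSize": 1,
      "timeToLive": "00:30:00",
      "linkedServiceName": "AzureStorageLinkedService"
    }
  }
}
```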

Datasets

  • EIATableDataset.json, EIATableDataset0.json, EIATableDataset1.json, EIATableDataset2.json, EIATableDataset3.json, EIATableDataset4.json, EIATableDataset5.json, EIATableDataset7.json, EIATableDataset8.json: These datasets provide the pipelines' inputs, the EIA web tables (one is sketched after this list).
  • EIABlobDataset.json, EIABlobDataset0.json, EIABlobDataset1.json, EIABlobDataset2.json, EIABlobDataset3.json, EIABlobDataset4.json, EIABlobDataset5.json, EIABlobDataset7.json, EIABlobDataset8.json: These datasets are the outputs from the pipeline to blob storage (also sketched after this list).
  • EIAPigInput.json: The input dataset for the Pig script.
  • EIAPigOutput.json: The output dataset holding the results of the Pig script.
  • EIAAzureMLResultBlob.json: Holds the predictions made by the Azure ML model on the processed data.
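
A sketch of one web-table input dataset, assuming the v1 WebTable schema; the table index depends on the layout of the EIA page and is a guess here:

```json
{
  "name": "EIATableDataset",
  "properties": {
    "type": "WebTable",
    "linkedServiceName": "EIATableLinkedService",
    "typeProperties": {
      "index": 1
    },
    "external": true,
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```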
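
And a sketch of the matching blob output dataset; the folder path and delimiter are assumptions:

```json
{
  "name": "EIABlobDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "blobstoragecontainer/eia/crudeoilspot",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```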

Pipelines

  • EIAtoBlobPipeline.json: Gets the data from the EIA web tables and saves it in the Azure Blob storage container (sketches of all three pipelines follow this list).
  • EIAPigPipeline.json: Runs the Pig script to process the data.
  • EIAPredictivePipeline.json: Runs the Azure ML model against the processed data to make predictions.
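
A sketch of the copy pipeline, reduced to a single web-table-to-blob activity (the real pipeline would carry one such activity per EIA series); the activity name and the start/end window are placeholders:

```json
{
  "name": "EIAtoBlobPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyCrudeOilSpotPrice",
        "type": "Copy",
        "inputs": [ { "name": "EIATableDataset" } ],
        "outputs": [ { "name": "EIABlobDataset" } ],
        "typeProperties": {
          "source": { "type": "WebSource" },
          "sink": { "type": "BlobSink" }
        }
      }
    ],
    "start": "2016-01-01T00:00:00Z",
    "end": "2017-01-01T00:00:00Z"
  }
}
```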
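
A sketch of the Pig pipeline, assuming oilproc.pig sits at the root of the container uploaded in the Requirements step:

```json
{
  "name": "EIAPigPipeline",
  "properties": {
    "activities": [
      {
        "name": "RunOilProcPig",
        "type": "HDInsightPig",
        "linkedServiceName": "HDInsightOnDemandLinkedService",
        "inputs": [ { "name": "EIAPigInput" } ],
        "outputs": [ { "name": "EIAPigOutput" } ],
        "typeProperties": {
          "scriptPath": "blobstoragecontainer/oilproc.pig",
          "scriptLinkedService": "AzureStorageLinkedService"
        }
      }
    ],
    "start": "2016-01-01T00:00:00Z",
    "end": "2017-01-01T00:00:00Z"
  }
}
```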
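
And a sketch of the predictive pipeline, assuming the batch-execution activity and a single default web service output port:

```json
{
  "name": "EIAPredictivePipeline",
  "properties": {
    "activities": [
      {
        "name": "ScoreGasPrices",
        "type": "AzureMLBatchExecution",
        "linkedServiceName": "GasPriceAzureMLLinkedService",
        "inputs": [ { "name": "EIAPigOutput" } ],
        "outputs": [ { "name": "EIAAzureMLResultBlob" } ],
        "typeProperties": {
          "webServiceInput": "EIAPigOutput",
          "webServiceOutputs": {
            "output1": "EIAAzureMLResultBlob"
          }
        }
      }
    ],
    "start": "2016-01-01T00:00:00Z",
    "end": "2017-01-01T00:00:00Z"
  }
}
```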

Pipeline Update

  • EIAPipeline.json: Changes the frequency to weekly and integrates the other two pipelines into a single pipeline; the relevant frequency setting is sketched below.
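
In the v1 model the run cadence comes from each output dataset's availability block and the matching activity scheduler block, so the weekly change amounts to setting both to something like:

```json
{
  "availability": {
    "frequency": "Week",
    "interval": 1
  }
}
```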

Execute Data Factory Jobs

From the "Diagram" view in ADF, open the output endpoint of one of the jobs by double-clicking it. Select a slice under "Recently updated slices" in the "Monitoring" tile, then run the job manually by clicking "Run" in the "Data Slice" blade that appears on the right.

Run the first two pipelines once to create the training dataset before creating the trained model.

After that initial run, use EIAPipeline to run the entire process weekly.

The final Data Factory pipeline should look like this:

(DataFactoryImage: diagram of the final Data Factory pipeline)