Learn how to leverage the newly released Databricks COPY command for bulk ingest into Delta Lake using the hosted StreamSets Cloud service.
StreamSets is proud to announce an expansion of its partnership with Databricks by participating in Databricks’ newly launched Data Ingestion Network. As part of the expanded partnership, StreamSets is offering additional functionality for StreamSets Cloud with a new connector for Delta Lake, an open source project that provides reliable data lakes at scale. The StreamSets Cloud service provides an integrated, cloud-based user experience for designing, deploying and monitoring your pipelines across your entire organization. A key component of this integration is leveraging the newly released Databricks COPY command for bulk ingest into Delta Lake using the StreamSets Cloud service.
Let’s consider a simple example of ingesting account information from Salesforce and storing it in queryable form in a Delta Lake table.
Pipeline Overview
If you’d like to follow along, here are the details to get you started, and here’s the GitHub link to the pipeline JSON that you can import into your environment.
Prerequisites
- StreamSets Cloud account
- Databricks account
- Access to a Spark Cluster with Databricks Runtime 6.3
- Access to an existing Delta Lake table (a sample DDL sketch follows this list)
- Salesforce account
- Access to an Amazon S3 bucket
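If you still need to create the target table, a minimal sketch of the DDL for the accounts_d Delta Lake table used in this example might look like the following. The column list mirrors the Salesforce fields selected later in the pipeline, and the types shown are assumptions, so adjust them to your data:
-- Target table for the COPY command; column order matches the SELECT list in the COPY query shown later
CREATE TABLE IF NOT EXISTS accounts_d (
  Id STRING,
  Name STRING,
  Type STRING,
  BillingStreet STRING,
  BillingCity STRING,
  BillingState STRING,
  BillingCountry STRING,
  BillingPostalCode INT,
  Website STRING,
  PhotoUrl STRING,
  AccountNumber STRING,
  Industry STRING,
  Rating STRING,
  NumberOfEmployees INT,
  AnnualRevenue DOUBLE
) USING DELTA;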
Here are the steps for designing our dataflow pipeline:
- Configure the Salesforce origin to read account information
- Configure the Expression Evaluator processor to transform an input data attribute
- Configure the Amazon S3 destination to load data into a staging location
- Configure the Databricks Delta Lake executor to run a Spark SQL query that copies the data from Amazon S3 into the Delta Lake table
Salesforce—Origin
The configuration attribute of interest is the SOQL Query, which retrieves account details from Salesforce.
SELECT Id, Name, Type, BillingStreet, BillingCity, BillingState, BillingPostalCode, BillingCountry, Website, PhotoUrl, AccountNumber, Industry, Rating, AnnualRevenue, NumberOfEmployees
FROM Account
WHERE Id > '${OFFSET}'
ORDER BY Id
Expression Evaluator—Processor
The configuration attribute of interest is the Field Expression, which uses a regular expression to remove the redundant prefix text “Customer -” from the account Type field.
${str:regExCapture(record:value('/Type'),'(.*) - (.*)',2)}
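For intuition, here is the same capture expressed in Spark SQL against a hypothetical input value; group 2 of the pattern is everything after the “Customer - ” prefix:
-- Returns 'Direct': group 1 captures 'Customer', group 2 captures 'Direct'
SELECT regexp_extract('Customer - Direct', '(.*) - (.*)', 2) AS account_type;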
Amazon S3—Destination
The configuration attribute of interest is the Data Format (set to JSON) in which the account detail objects will be stored on Amazon S3.
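Once objects start landing in the bucket, one quick way to sanity-check the staged JSON from Databricks is to query the files directly. The bucket and prefix below are placeholders, and the cluster needs read access to the bucket:
-- Peek at a few staged account records before copying them into Delta Lake
SELECT * FROM json.`s3a://<your-bucket>/<your-prefix>/` LIMIT 5;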
Databricks Delta Lake—Executor
As noted earlier, Databricks provides the COPY command to efficiently bulk load large amounts of data into Delta Lake. To use the COPY command, the Databricks Delta Lake executor has been added to the pipeline. The Databricks Delta Lake executor can run one or more Spark SQL queries on a Delta Lake table each time it receives an event.
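For reference, the general shape of the command, per the Databricks documentation, is roughly as follows; the table name, source location, file format and options are placeholders:
COPY INTO <target_table>
FROM '<source_location>'                  -- or FROM (SELECT ... FROM '<source_location>') to select or cast columns
FILEFORMAT = JSON                         -- other formats such as CSV, AVRO, ORC and PARQUET are also supported
FORMAT_OPTIONS ('<option>' = '<value>')
COPY_OPTIONS ('<option>' = '<value>')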
In our example, the Amazon S3 destination is configured to generate an event each time it completes writing an object. When it generates an object-written event, it also records the bucket where the object is located and the object key name that was written. These bucket and object key attributes are used by the COPY command to load the data from Amazon S3 into an existing Delta Lake table. (See the Spark SQL query below.)
Configuration attributes of interest:
- JDBC Connection String
jdbc:spark://dbc-5a9fba6c-c704.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/8214879852432560/1123-001005-tow71;AuthMech=3;
Note: The JDBC connection string is where we specify which Databricks cluster to connect to when executing the COPY command. In our case, that’s 1123-001005-tow71.
- Spark SQL Query
COPY INTO accounts_d FROM (Select Id, Name, Type, BillingStreet, BillingCity, BillingState, BillingCountry, CAST(BillingPostalCode AS INT), Website, PhotoUrl, AccountNumber, Industry, Rating, CAST(NumberOfEmployees AS INT), AnnualRevenue FROM "s3a://${record:value('/bucket')}/${record:value('/objectKey')}") FILEFORMAT = JSON FORMAT_OPTIONS ('header' = 'true')
Preview Pipeline
Previewing the pipeline is a great way to see the transformations applied to attribute values. Below, the Expression Evaluator processor is selected, showing the before and after values of the Type attribute.
Query Delta Lake
If everything looks good in preview mode, running the pipeline should bulk copy account information from Salesforce to Delta Lake.
Once the Delta Lake table is populated, you can start analyzing the data. For example, running the following query gives us insight into the total revenue generated by account rating and account type.
Total Revenue by Rating and Account Type
SELECT Rating, Type, format_number(SUM(AnnualRevenue),2) AS Total_Revenue FROM accounts_d WHERE Rating is not null and Type is not null GROUP BY Rating, Type
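To surface the highest-revenue segments first, you can also sort on the raw sum (ordering by the formatted Total_Revenue string would sort alphabetically):
SELECT Rating, Type, format_number(SUM(AnnualRevenue),2) AS Total_Revenue FROM accounts_d WHERE Rating IS NOT NULL AND Type IS NOT NULL GROUP BY Rating, Type ORDER BY SUM(AnnualRevenue) DESC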
Summary
In this blog post, you’ve learned how to leverage the Databricks COPY command for bulk ingest into a Delta Lake table using the hosted StreamSets Cloud service.
Learn more about how to build, run and monitor pipelines using the hosted StreamSets Cloud service.
You can also ask questions in the #streamsets-cloud channel on our community Slack team—sign up here for access.