Azure Synapse Analytics, the next evolution of Azure SQL Data Warehouse, combines enterprise data warehousing and big data analytics into a single analytics service. StreamSets Cloud’s new Azure SQL Data Warehouse destination, released today, loads data into Azure Synapse.
Loading data into Azure SQL Data Warehouse is a two-stage process: first, data is written to Azure Storage, then it is loaded into staging tables in Azure SQL Data Warehouse. The Azure SQL Data Warehouse destination automates this process – all you need to do is configure the data warehouse and ADLS locations and credentials. The destination can even automatically create a table for you based on the data you are loading.
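To give a sense of what the destination automates, here’s a minimal sketch of the manual equivalent: bulk loading staged CSV files from Azure Storage into a data warehouse table with a COPY statement. The storage path, table name, and credential are placeholders, and the destination may use a different loading mechanism internally – this just shows the shape of the two-stage load.

```sql
-- Illustrative only: a manual bulk load from staged CSV files in Azure Storage.
-- The storage account, container path, table, and key below are hypothetical.
COPY INTO dbo.taxi_transactions
FROM 'https://mystorageacct.blob.core.windows.net/staging/taxi/'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,   -- skip the CSV header row
    CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '<storage-account-key>')
);
```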
Ingesting Data into Azure SQL Data Warehouse
Let’s look at a simple use case: loading transactional data into Azure SQL Data Warehouse from Amazon S3. We’ll use New York City taxi data; in our input data set, each transaction contains an accounting of the fare and its components, pickup and dropoff timestamps and coordinates, payment type, and, optionally, a credit card number. Let’s assume we only want to load credit card transactions into Azure SQL Data Warehouse, and we don’t want the actual credit card numbers.
Here’s the StreamSets Cloud pipeline:
You can download the pipeline here and import it into StreamSets Cloud. Sign up for your free trial, if you haven’t already done so!
This short video walks through the pipeline and shows StreamSets Cloud writing data to Azure SQL Data Warehouse.
In this example we’re using the Amazon S3 origin, but we could just as easily read data from Google Cloud Storage, Oracle, Salesforce, or any other data source supported by StreamSets Cloud.
The origin is configured to read CSV data from an S3 bucket. As you saw in the video, for this use case I configured the origin with a single filename and set the pipeline to stop when it finishes reading data from that file. I could instead have used a wildcard, say *.csv, and let the pipeline run continuously, streaming data from S3 to Azure SQL Data Warehouse as it becomes available.
The Stream Selector processor filters records based on some set of conditions. The expression ${record:value('/payment_type')=='CRD'} matches records with payment_type set to CRD; all other records are discarded. StreamSets Cloud’s Expression Language includes a rich set of functions, allowing you to model a wide variety of business logic in your pipelines.
Next, a Field Remover processor removes the credit_card field – we don’t want this sensitive data in our data warehouse!
Fields read from CSV files are interpreted as strings; we need the Field Type Converter to convert them to float, integer, or datetime values as appropriate, so that the Azure SQL Data Warehouse destination uses the appropriate data types when it creates the table – we don’t want every column to be simply VARCHAR!
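For instance, with the converted types the auto-created table might look something like this (a hypothetical sketch – the actual column names and types depend on your data and on the converter configuration):

```sql
-- Hypothetical sketch of the kind of table the destination could auto-create
-- once fields have been converted from strings to typed values.
CREATE TABLE dbo.taxi_transactions (
    pickup_datetime    DATETIME2,
    dropoff_datetime   DATETIME2,
    pickup_longitude   FLOAT,
    pickup_latitude    FLOAT,
    fare_amount        FLOAT,
    total_amount       FLOAT,
    payment_type       VARCHAR(8)
);
```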
Finally, the destination writes the data to an Azure SQL Data Warehouse table. As mentioned above, you need to configure both the data warehouse and the staging location. Note that the data warehouse user will need INSERT and ADMINISTER DATABASE BULK OPERATIONS permissions, plus, optionally, CREATE TABLE if ‘auto create table’ is enabled.
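If you need to grant those permissions yourself, the T-SQL looks roughly like this (streamsets_user is a placeholder for your own database user; note that creating tables typically also requires ALTER on the target schema):

```sql
-- Grant the data warehouse user the permissions the destination needs.
-- 'streamsets_user' is a placeholder; substitute your own database user.
GRANT INSERT TO streamsets_user;
GRANT ADMINISTER DATABASE BULK OPERATIONS TO streamsets_user;

-- Only needed if 'auto create table' is enabled:
GRANT CREATE TABLE TO streamsets_user;
GRANT ALTER ON SCHEMA::dbo TO streamsets_user;  -- creating a table in a schema also needs ALTER on it
```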
You can configure the destination to leave the staging files in place, which can be useful for debugging purposes, or purge them automatically after they are loaded into the data warehouse.
In this simple example, we define the schema and table directly, but we could use the expression language to set the values dynamically. For example, if we were reading files from a hierarchy of paths in S3, we could use the path to set the table name with an expression such as ${file:pathElement(record:attribute('Name'), 0)}.
StreamSets Cloud’s Azure SQL Data Warehouse destination manages the process of writing data to staging and then loading it into data warehouse tables. The destination uses the ‘hash’ distribution strategy, which is recommended for taking advantage of Azure SQL Data Warehouse’s massively parallel architecture.
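To illustrate, a hash-distributed table is declared in T-SQL with a distribution clause along these lines (the table and the trip_id column are made up for this example; the destination picks these details for you):

```sql
-- Illustrative only: how a hash-distributed table is declared in T-SQL.
CREATE TABLE dbo.taxi_transactions_distributed (
    trip_id        BIGINT,
    fare_amount    FLOAT,
    payment_type   VARCHAR(8)
)
WITH (
    DISTRIBUTION = HASH(trip_id),   -- rows are assigned to distributions by hashing this column
    CLUSTERED COLUMNSTORE INDEX
);
```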
Conclusion
StreamSets Cloud’s new Azure SQL Data Warehouse destination makes it easy to load data into tables in Azure Synapse Analytics (formerly SQL Data Warehouse). Configure the data warehouse and staging location and credentials, and the destination does the rest. You can build pipelines for one-time batch operations, or continuously stream data as it arrives. Sign up for a free trial of StreamSets Cloud and try it for yourself!