
StreamSets Data Collector: Simple Network Management Protocol And Management Information Base


This is a guest post by Clark Bradley, Solutions Engineer, StreamSets

SNMP stands for Simple Network Management Protocol and allows network devices to share information. SNMP is supported across a wide range of hardware, from conventional network equipment (routers, switches and wireless access points) to network endpoints like applications and internet of things (IoT) devices.

An SNMP-managed network consists of managers, agents and MIBs (management information bases). Managers initiate communication with agents, exchanging PDUs (protocol data units) with them, while MIBs store information about the objects used by a device. A PDU carries user information and control data. By issuing queries from a manager to an agent, a user can trap agent alerts, send configuration commands and retrieve OID (object identifier) information from the MIB.

In this blog, we will be using StreamSets Data Collector to build a pipeline that retrieves data from a MIB and stores it in Amazon S3. The MIB contains multiple OIDs that can be used to report on access point activity, neighbor detection (to list known vs. rogue access points), uplink statistics (to monitor and reduce downtime), and interference or bad clients that could indicate a network slowdown. An OID is an address in the MIB hierarchy used to identify network devices and their status, such as hardware temperature, IP address and location, timeticks (time between events) or system uptime.

First, we build our pipeline using the Groovy Scripting origin, which allows a data engineer/developer to write code against a Java-based library called SNMP4J for SNMP polling. SNMP4J is an enterprise-class, free, open source SNMP implementation that supports both command generation (managers) and command responding (agents).

As shown above, we’ve created parameters for the Groovy Scripting origin to provide the IP address or hostname of the network controller, an OID (or list of OIDs) to traverse, the SNMP version (v1, v2 or v3), the community target (the SNMP target properties for community-based message processing used in v1 and v2) and lastly an MIB name to use for the target table. These runtime parameters are dynamic and can be changed at pipeline execution in order to reuse the pipeline for polling multiple MIBs.

Here’s the pipeline in preview mode.

Note that the pipeline can be set to refresh every second or every few minutes to look for patterns in the data. As the session begins, the pipeline polls the router with an SNMP bulk walk process. The pipeline uses SNMP GETBULK requests to query a network entity efficiently for a tree of information. We’ve also added a sleep cycle that can be customized to the network size and traffic so that updated, relevant data is available for visualization and security analysis.
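For readers curious what a bulk walk looks like outside of Data Collector, here is a minimal Python sketch using the pysnmp library; the host, community string and OID below are placeholder values, and the pipeline itself uses SNMP4J as shown next.

# Minimal pysnmp sketch of an SNMP GETBULK walk (illustration only; the host,
# community string and OID below are assumptions, not values from the pipeline).
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, bulkCmd)

iterator = bulkCmd(
    SnmpEngine(),
    CommunityData('public', mpModel=1),           # SNMP v2c community target
    UdpTransportTarget(('192.168.1.1', 161)),     # network controller address
    ContextData(),
    0, 25,                                        # non-repeaters, max-repetitions
    ObjectType(ObjectIdentity('1.3.6.1.2.1.1')),  # walk the system subtree
    lexicographicMode=False)                      # stop at the end of the subtree

for errorIndication, errorStatus, errorIndex, varBinds in iterator:
    if errorIndication or errorStatus:
        break
    for oid, value in varBinds:
        print("." + str(oid), value)              # OID and its value, as stored in the MIB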

Here’s the sample Groovy Scripting origin code.

import java.util.Map;
import java.util.TreeMap;
import java.io.IOException;
import java.util.List;

import org.snmp4j.CommunityTarget;
import org.snmp4j.PDU;
import org.snmp4j.Snmp;
import org.snmp4j.smi.*;
import org.snmp4j.mp.*;
import org.snmp4j.transport.DefaultUdpTransportMapping;
import org.snmp4j.util.*;

// single threaded - no entityName because we need only one offset
entityName = ''

// get the previously committed offset or start at 0
if (sdc.lastOffsets.containsKey(entityName)) {
  offset = sdc.lastOffsets.get(entityName) as int
} else {
  offset = 0
}

if (sdc.userParams.containsKey('recordPrefix')) {
  prefix = sdc.userParams.get('recordPrefix')
} else {
  prefix = ''
}

cur_batch = sdc.createBatch()
record = sdc.createRecord('generated data')

// defined as an instance method (not static) so the script binding (e.g. sdc) is accessible
Map doSNMPBulkWalk(String ipAddr, String commStr, String bulkOID, String port) throws IOException {
  Snmp snmp = new Snmp(new DefaultUdpTransportMapping());
  CommunityTarget targetV2 = null;
  PDU request = null;
  snmp.listen();
  Address add = new UdpAddress(ipAddr + "/" + port);
		
  targetV2 = new CommunityTarget();
  targetV2.setCommunity(new OctetString(commStr));
  targetV2.setAddress(add);
  targetV2.setTimeout(1500);
  targetV2.setRetries(2);
  targetV2.setVersion(${versionInput});
  targetV2.setMaxSizeRequestPDU(65535);
  request = new PDU();
  request.setMaxRepetitions(1);
  request.setNonRepeaters(0);
  request.setType(PDU.GETBULK);
		
  OID oID = new OID(bulkOID);
  request.add(new VariableBinding(oID));
		
  OID rootOID = request.get(0).getOid();
  List l = null;
  TreeUtils treeUtils = new TreeUtils(snmp, new DefaultPDUFactory());
  targetV2.setCommunity(new OctetString(commStr));
  OID[] rootOIDs = new OID[1];
  rootOIDs[0] = rootOID;
  l = treeUtils.walk(targetV2, rootOIDs);
  Map result = new TreeMap<>();

  for(TreeEvent t : l) {
    VariableBinding[] vbs = t.getVariableBindings();
			
    for (VariableBinding varBinding : vbs) {
      Variable val = varBinding.getVariable();
      if (val instanceof Integer32) {
        Integer valMod = new Integer(val.toInt());
        result.put("." + varBinding.getOid(), valMod);
      } else {
        result.put("." + varBinding.getOid(), val);
      }
      if (sdc.isStopped()) { break;}
    }
    if (sdc.isStopped()) { break;}
  }
  snmp.close();
  return result;
}

hasNext = true
while(hasNext) {
  record = sdc.createRecord('generated data')
  try {
    String[] oidList = ${oidInput}.split(";");
    for (String oidSingle : oidList) {
      Map result = doSNMPBulkWalk(${addrInput}, ${commInput}, oidSingle, ${portInput});
      result.each { key, val ->

        offset = offset + 1
        record = sdc.createRecord('generate records')
      
        //List the OID in the record
        record.value = [:]
        col = "OID"
        record.value[col] = key
      
        //List Object Description in the record
        col = "OID_VALUE"
        record.value[col] = val
      
        //Add the record to the current batch
        cur_batch.add(record)

        // if the batch is full, process it and start a new one
        if (cur_batch.size() >= sdc.batchSize) {
          // blocks until all records are written to all destinations
          // (or failure) and updates offset
          // in accordance with delivery guarantee
          cur_batch.process(entityName, offset.toString())
          cur_batch = sdc.createBatch()
          sleep(300000)  // pause 5 minutes between polling cycles; tune to your network
          if (sdc.isStopped()) {
            hasNext = false
          }
        }
      }
    } 
  } catch (Exception e) {
    sdc.error.write(record, e.toString())
    hasNext = false
  }
}

We’ve also added processors to the pipeline: a Stream Selector to remove records with unwanted or missing IP data, and a GeoIP processor to enrich the validated data with latitude, longitude, city and country.

Finally, the data is stored in Amazon S3 for further aggregation, visualization and/or security analysis.

Summary

In this post, we learned how data engineers can quickly and easily build pipelines to offload data from origins, such as MIB data on SNMP-managed networks, for integration with other sources or long-term storage for security and network health analysis.

If you are interested in learning more about StreamSets, visit our Resource Finder.

StreamSets Data Collector is open source, under the Apache 2.0 license. To download it for free and start developing your data pipelines, visit the Download page.



Ingest data into Azure Synapse Analytics (formerly SQL DW) with StreamSets Cloud


Azure Synapse Analytics, the next evolution of Azure SQL Data Warehouse, combines enterprise data warehousing and big data analytics into a single analytics service. StreamSets Cloud‘s new Azure SQL Data Warehouse destination, released today, loads data into Azure Synapse.

Loading data into Azure SQL Data Warehouse is a two-stage process. First, data must be written to Azure Storage, then loaded into staging tables in Azure SQL Data Warehouse. The Azure SQL Data Warehouse destination automates this process – all you need to do is configure the data warehouse and ADLS locations and credentials. The destination can even automatically create a table for you based on the data you are loading.

Ingesting Data into Azure SQL Data Warehouse

Let’s look at a simple use case: loading transactional data into Azure SQL Data Warehouse from Amazon S3. We’ll use New York City taxi data; in our input data set, each transaction contains an accounting of the fare and its components, pickup and dropoff timestamps and coordinates, payment type, and, optionally, a credit card number. Let’s assume we only want to load credit card transactions into Azure SQL Data Warehouse, and we don’t want the actual credit card numbers.

Here’s the StreamSets Cloud pipeline:

S3 to Azure Synapse

You can download the pipeline here and import it into StreamSets Cloud. Sign up for your free trial, if you haven’t already done so!

This short video walks through the pipeline and shows StreamSets Cloud writing data to Azure SQL Data Warehouse.

 
In this example we’re using the Amazon S3 origin, but we could just as easily read data from Google Cloud Storage, Oracle, Salesforce, or any other data source supported by StreamSets Cloud.

The origin is configured to read CSV data from an S3 bucket. As you saw in the video, for this use case I configured the origin with a single filename and set the pipeline to stop when it finishes reading data from that file. I could instead have used a wildcard, say *.csv, and let the pipeline run continuously, streaming data from S3 to Azure SQL Data Warehouse as it becomes available.

S3 origin

The Stream Selector processor filters records based on some set of conditions. The expression ${record:value('/payment_type')=='CRD'} matches records with payment_type set to CRD; all other records are discarded. StreamSets Cloud’s Expression Language includes a rich set of functions, allowing you to model a wide variety of business logic in your pipelines.

Stream Selector

Next, a Field Remover processor removes the credit_card field – we don’t want this sensitive data in our data warehouse!

Field Remover

Fields read from CSV files are interpreted as strings; we need the Field Type Converter to convert them to float, integer or datetime values as appropriate so that the Azure SQL Data Warehouse destination uses the appropriate data types when it creates the table in Azure SQL Data Warehouse – we don’t want all the columns to be simply VARCHAR!

Field Type Converter

Finally, the destination writes the data to an Azure SQL Data Warehouse table. As mentioned above, you need to configure both the data warehouse and the staging location. Note that the data warehouse user will need INSERT and ADMINISTER DATABASE BULK OPERATIONS permissions, plus, optionally, CREATE TABLE if ‘auto create table’ is enabled.

Azure SQL DW 1

You can configure the destination to leave the staging files in place, which can be useful for debugging purposes, or purge them automatically after they are loaded into the data warehouse.

Azure SQL DW 2

In this simple example, we define the schema and table directly, but we could use expression language to set the values dynamically. For example, if we were reading fields from a hierarchy of paths in S3, we could use the path to set the table name with an expression such as ${file:pathElement(record:attribute('Name'), 0)}.

Azure SQL DW 3

StreamSets Cloud’s Azure SQL Data Warehouse destination manages the process of writing data to staging and then loading data warehouse tables. The destination uses the ‘hash’ distribution strategy, recommended to take advantage of Azure SQL Data Warehouse’s massively parallel architecture.

Conclusion

StreamSets Cloud‘s new Azure SQL Data Warehouse destination makes it easy to load data into tables in Azure Synapse Analytics (formerly SQL Data Warehouse). Configure the data warehouse and staging location and credentials, and the destination does the rest. You can build pipelines for one-time batch operations, or continuously stream data as it arrives. Sign up for a free trial of StreamSets Cloud and try it for yourself!


StreamSets Transformer: Design Patterns For Slowly Changing Dimensions


In this blog, we will look at a few design patterns for Slowly Changing Dimensions (SCD) Type 2 and see how StreamSets Transformer, the newest addition to the StreamSets DataOps Platform, makes it easy to implement them.

Relatively static data, such as the locations and addresses of entities like customers, changes rarely (if at all) over time, yet in most cases it is critical to maintain a history of all changes. This is the concept behind dimensions and Slowly Changing Dimensions, which are important components of DataOps for managing and automating such datasets.

“Dimensions in data management and data warehousing contain relatively static data about such entities as geographical locations, customers, or products. Data captured by Slowly Changing Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule” — Wikipedia.

There are six types (Type 1 through Type 6) of SCD operations, and StreamSets Transformer enables you to handle and implement two common ones: Type 1 and Type 2.

Type 1 SCD — Doesn’t require history of dimension changes to be maintained and the old dimension value is simply overwritten with the new one. This type of operation is easy to implement (similar to a normal SQL update) and is often used for things like removing special characters, correcting typos and spelling mistakes in record field values.

Type 2 SCD — Requires maintaining history of all changes made to each key in a dimensional table. Here are some challenges involved when manually dealing with Type 2 SCD:

  • Every process that updates these tables has to honor the Type 2 SCD pattern of expiring old records and replacing them with new ones
  • There might not be a built-in constraint to prevent overlapping start and end dates for a given dimension key
  • When converting an existing table to a Type 2 SCD, it will most likely require you to update every single query that reads from or writes to that table
  • Every query against that table will need to account for the historical Type 2 SCD pattern by filtering only for current data or for a specific point in time

As you can imagine, Type 2 SCD operations can become complex and hand-written code, SQL queries, etc. may not scale and can be difficult to maintain.

Meet the Slowly Changing Dimension processor. This processor makes it easy to implement Type 2 SCD operations by enabling data engineers to centralize all the “logic” (via configuration, not SQL queries or code!) in one place.

Let’s take a look at a few common design patterns.

Pattern 1: One-time migration — File Based (Batch mode)

Let’s first take a very simple yet concrete example of managing customer records (with updates to addresses) for existing and new customers. In this case, the assumption is that the destination is empty so it’s more of a one-time migration scenario for ingesting “master” and “change” records from respective origins to a new file destination.

This scenario involves:

  • Creating one record for every row in “master” origin
  • Creating one record for every row in “change” origin
    • New customers: Version set to 1 where customer id doesn’t exist in “master” origin
    • Existing customers: Version set to current value in “master” origin + 1 where customer id exists in “master” origin

Sample Pipeline:

Note: For details on configuration attributes, click here.

Master Origin Input: Sample master records for existing customers

customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode,version
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521,1
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,1
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,3422 Blue Pioneer Bend,Caguas,PR,00725,1

Change Origin Input: Sample change records for existing and new customers

customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,4963 Ponderosa Ct,Park City,UT,80126
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,1991 Margo Pl,San Fran,CA,00725
11,Mark,Barrett,XXXXXXXXX,XXXXXXXXX,4963 Ponderosa Ct,Park City,UT,80126

Final Output: Given the two datasets above the resulting output will look like this

customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode,version
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521,1
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,1
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,4963 Ponderosa Ct,Park City,UT,80126,2
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,3422 Blue Pioneer Bend,Caguas,PR,00725,1
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,1991 Margo Pl,San Fran,CA,00725,2
11,Mark,Barrett,XXXXXXXXX,XXXXXXXXX,4963 Ponderosa Ct,Park City,UT,80126,1

Notice that the total number of output records is 6: 3 records from the master origin for existing customers and 3 records from the change origin, where two records are for existing customers Mary and Ann with their updated addresses and version incremented to 2, and one record is for new customer Mark with version set to 1.
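To make the versioning rule concrete, here is a minimal PySpark sketch of the same logic, assuming hypothetical file paths and the column layout shown above; in the pipeline itself, the Slowly Changing Dimension processor handles this via configuration rather than code.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; the pipeline reads these datasets from its two File origins.
master = spark.read.option("header", True).csv("/data/master_customers.csv")
change = spark.read.option("header", True).csv("/data/change_customers.csv")

# Current highest version per customer in the master dimension.
current = (master.groupBy("customer_id")
                 .agg(F.max(F.col("version").cast("int")).alias("cur_version")))

# Existing customers get current version + 1; new customers get version 1.
versioned_change = (change.join(current, "customer_id", "left")
                          .withColumn("version",
                                      F.coalesce(F.col("cur_version") + 1, F.lit(1)).cast("string"))
                          .drop("cur_version"))

# Final Type 2 output: all master rows plus the versioned change rows.
result = master.unionByName(versioned_change.select(master.columns))
result.orderBy("customer_id", "version").show(truncate=False)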

Pattern 2: Incremental updates — JDBC Based (Streaming mode)

Now let’s say there is a JDBC-enabled database (for example, MySQL) with a dimension table “customers” that has a composite primary key of customer_id and version. In this case, the goal is still the same as in pattern 1: capture and maintain the history of updates for new and existing customer records.

Sample Pipeline:

Note: For details on configuration attributes, click here.

The main differences between this and pattern 1 are as follows:

  • Pattern 1 is designed to run in batch mode and terminate automatically after all the data has been processed, whereas the pipeline in pattern 2 is configured to run in streaming mode, continuously until it is stopped manually, which means it will “listen” for customer updates dropped into the S3 bucket and process them as soon as they’re available, without user intervention.
  • Pattern 1 can only handle up to one additional update for any given customer record, because the master origin is not updated with a new version number for every corresponding change record; effectively, every update record coming in via the change origin gets assigned version 2.
  • Unlike pattern 1, the master gets updated with the latest version in pattern 2 (via the JDBC Producer destination), so every update record coming in via the change origin gets a new version assigned to it.

Query customers in MySQL:

SELECT * FROM customers where customer_id  = 1
Pattern 3: Incremental updates — Databricks Delta Lake DBFS (Streaming mode)

This is very similar to Pattern 2. The main differences are:

  • Single origin 
  • Delta Lake Lookup — For every update/change record coming in, a lookup against the current Delta Lake table is performed based on the dimension key customer_id. If there’s a match, the customer_id and version values are returned and passed on to the SCD processor. The SCD processor increments the version number based on the lookup value, and a new record with the updated version is inserted into the Delta Lake table.

Sample Pipeline:

Note: For details on configuration attributes, click here.

Query customers in Delta Lake DBFS:

SELECT * FROM delta.`/DeltaLake/customers` where customer_id in (1)

Pattern 4: Upserts — Databricks Delta Lake And Time Travel (Streaming mode)

If you’re using Delta Lake, another option is to leverage Delta Lake’s built-in upserts using the merge functionality. Here the underlying concept is the same as SCD, maintaining versions of dimensions, but the implementation is much simpler.

Sample Pipeline:

Note: For details on configuration attributes, click here.

In this pattern, for every record coming in via the (S3) origin, an insert or an update operation is performed in Delta Lake based on the conditions configured for new (“When Not Matched”) and existing products (“When Matched”) respectively. And since the Delta Lake storage layer supports ACID transactions, it is able to create new (Parquet) files for updates, while allowing you to query the most recent record with simple SQL, without explicitly requiring a tracking field (for example, “version”) in the table or the where clause.
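For reference, here is a minimal PySpark sketch of such a merge using the Delta Lake API; the table path, source file and join key are assumptions for illustration, not the destination’s actual implementation.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations; the pipeline configures these on the origin and destination.
products = DeltaTable.forPath(spark, "/DeltaLake/products")
changes = spark.read.option("header", True).csv("s3a://my-bucket/product_changes.csv")

# Upsert: update the product when it already exists, insert it otherwise.
(products.alias("t")
    .merge(changes.alias("s"), "t.product_id = s.product_id")
    .whenMatchedUpdateAll()       # corresponds to the "When Matched" condition
    .whenNotMatchedInsertAll()    # corresponds to the "When Not Matched" condition
    .execute())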

For instance, consider this original record:

product_id,product_category_id,product_name,product_description,product_price
1,2,"Quest Q64 10 FT. x 10 FT. Slant Leg Instant U","",59.98

And this change record with the price updated from 59.98 to 69.98:

product_id,product_category_id,product_name,product_description,product_price
1,2,"Quest Q64 10 FT. x 10 FT. Slant Leg Instant U","",69.98

Query products in Delta Lake Table:

SELECT * FROM products where product_id=1

Note that the products table doesn’t have a tracking field (for example, “version”), yet the query still retrieves the most “current” version of the record, with a product price of $69.98.

To query older versions of the data, Delta Lake provides a feature called “Time Travel”. So in our case, to retrieve the previous (0) version of the product’s price, the SQL query would look like:

SELECT * FROM products VERSION AS OF 0 where product_id=1

Notice the product price of $59.98. For more details and options on Delta Lake Time Travel, click here.

Conclusion

This blog post highlighted some common patterns of handling SCD Type 2 and also illustrated how easy it is to implement those patterns using Slowly Changing Dimension (SCD) processor in StreamSets Transformer.

If you’d like to learn more about StreamSets Transformer, here are some useful resources for you to get started: Product overview | Technical documentation | Overview video | Datasheet | Blogs.


Announcing StreamSets Data Collector 3.12.0 and StreamSets Data Collector Edge 3.12.0


StreamSets is excited to announce the immediate availability of StreamSets Data Collector 3.12.0 and StreamSets Data Collector Edge 3.12.0.

StreamSets Data Collector is a powerful design and execution engine, open source under the Apache License 2.0. It enables moving data between any source and destination, performing transformations, and pushing down analytics along the way. To download, click here.

StreamSets Data Collector Edge is a lightweight execution agent that runs on edge devices with limited memory, CPU, and/or connectivity resources. It enables reading data from an edge device or receiving data from another dataflow pipeline. It supports messaging protocols including HTTP, MQTT, CoAP, and WebSockets. To download, click here.

Highlights

There are some great new features and enhancements included in this release—let’s review some of the highlights. For a complete list of enhancements, new features, bug fixes, and upgrade instructions, please refer to the Release Notes.

StreamSets Data Collector 3.12.0
Origins
  • Groovy Scripting, Jython Scripting and JavaScript Scripting origins now include two new methods in the batch object to append an error record to the batch and append an event to the batch.
  • RabbitMQ Consumer now supports transport layer security (TLS) for connections with a RabbitMQ server. You can configure the required properties on the TLS tab.
  • Salesforce origin has a new property to configure the streaming buffer size when subscribed to notifications. Configure this property to eliminate buffering capacity errors.
  • SQL Server CDC Client and SQL Server Change Tracking Client origins can now be configured to convert unsupported data types into strings and continue processing data.
Destinations
  • Azure Event Hub Producer destination can now write records as XML data.
  • RabbitMQ Producer destination now supports transport layer security (TLS) for connections with a RabbitMQ server. You can configure the required properties on the TLS tab.
Data Formats
  • Avro – For schemas located in Confluent Schema Registry, Data Collector now includes a property to specify the user information needed to connect to Schema Registry through basic authentication.
Deprecation
  • Starting December 16th, 2019, the ability to upload support bundles directly from Data Collector will be deprecated, and will be removed in a future release. Going forward, please use Data Collector to generate and download a support bundle to your local machine. Then upload the file to the appropriate support ticket in the StreamSets Zendesk Support portal (no file size limit).
Feedback and Contributions

If you’d like to suggest a feature, enhancement, or if you see something that needs to be fixed or made better, feel free to open a ticket by visiting—https://issues.streamsets.com.

Also note that StreamSets welcomes contributions from the community. For guidelines on contributing code, visit—https://github.com/streamsets/datacollector/blob/master/CONTRIBUTING.md

For more information about StreamSets Data Collector, visit our documentation. For more information about StreamSets Data Collector Edge, visit our documentation.

For any other questions and inquiries, please contact us.


StreamSets Transformer: Natural Language Processing in PySpark


In two of my previous blogs I illustrated how easily you can extend StreamSets Transformer using Scala: 1) to train a Spark ML RandomForestRegressor model, and 2) to serialize the trained model and save it to Amazon S3.

In this blog, you will learn a way to train a Spark ML Logistic Regression model for Natural Language Processing (NLP) using PySpark in StreamSets Transformer. The model will be trained to classify a given tweet as positive or negative sentiment.

Prerequisites
StreamSets Transformer Pipeline Overview

Before we dive into the details, here is the pipeline overview.

Input

  • Two File origins are configured to load datasets that contain positive and negative tweets that will be used to train the model.

Transformations

  • Field Remover processor is configured to only keep tweet Id and tweet text fields because the other fields/values of a tweet aren’t used to train the model in this example.
  • Spark SQL Expression processor enables us to add true (sentiment) “label” column with values 1 and 0 to the two dataframes. This label column will be used for training the model.
  • Union processor is configured to combine the two dataframes into one that will be used for training the model.
  • PySpark processor is where we have the code to train and evaluate the model. (See below for details.) 

Output

  • File destination stores model accuracy–which is the output dataframe generated by PySpark processor.
PySpark Processor

Below is the PySpark code inserted into the PySpark processor >> PySpark tab >> PySpark Code section. It takes the input data(frame) and trains a Spark ML Logistic Regression model on it; the code covers the train-test split, tokenizing text, removing stop words, setting up the hyperparameter tuning grid, cross validation over n folds, training on the “train” split dataset, and evaluating the trained model on the “test” split dataset. (See the in-line comments for a walk-through.)

# Import required libraries
from pyspark.ml.feature import VectorAssembler, StopWordsRemover, Tokenizer, CountVectorizer, IDF
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.types import FloatType

# Setup variables for convenience and readability 
trainSplit = 0.8
testSplit = 0.2
maxIter = 10
regParam = 0.3
elasticNetParam = 0.8
numberOfCVFolds = 3

# The input dataframe is accessible via inputs[0]
df = inputs[0]

# Split dataset into "train" and "test" sets
(train, test) = df.randomSplit([trainSplit, testSplit], 42) 

tokenizer = Tokenizer(inputCol="text",outputCol="tokenized")
stopWordsRemover = StopWordsRemover(inputCol=tokenizer.getOutputCol(),outputCol="stopWordsRemoved")
countVectorizer = CountVectorizer(inputCol=stopWordsRemover.getOutputCol(),outputCol="countVectorized")
idf = IDF(inputCol=countVectorizer.getOutputCol(),outputCol="inverted")

# MUST for Spark features
assembler = VectorAssembler(inputCols=[idf.getOutputCol()], outputCol="features")

# LogisticRegression Model
lr = LogisticRegression(maxIter=maxIter, regParam=regParam, elasticNetParam=elasticNetParam)

# Setup pipeline -- pay attention to the order -- it matters!
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, countVectorizer, idf, assembler, lr])

# Setup evaluator -- default is F1 score
classEvaluator = MulticlassClassificationEvaluator(metricName="accuracy")

# Setup hyperparams grid
paramGrid = ParamGridBuilder().addGrid(lr.elasticNetParam,[0.0]).addGrid(countVectorizer.vocabSize,[5000]).build()

# Setup cross validator
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=classEvaluator, numFolds=numberOfCVFolds) 

# Fit model on "train" set
cvModel = cv.fit(train)

# Get the best model based on CrossValidator
model = cvModel.bestModel

# Run inference on "test" set
predictions = model.transform(test)

# Return accuracy as output dataframe
accuracy = classEvaluator.evaluate(predictions)
output = spark.createDataFrame([accuracy], FloatType()).withColumnRenamed("value","Accuracy")

Assuming all goes well, the “output” dataframe will contain “Accuracy” written out to a file in a location configured in the File destination labeled “Capture Accuracy” in the above pipeline. And, for example, if Data Format of the File destination is set to JSON, the contents of the JSON file might look similar to:

{"Accuracy":0.97682446}

To reiterate what I’ve mentioned in my other ML related blogs, the model accuracy will depend on the size and quality of the train and test datasets, as well as on feature engineering and hyperparameter tuning. That isn’t the point of this blog, though; the goal is to showcase how StreamSets Transformer can be extended for other use cases.

Pipeline And Datasets on GitHub

If you’d like to try this out for yourself, download the pipelines and datasets from GitHub. 

Summary

In this blog you learned how easily you can extend StreamSets Transformer’s functionality. In particular, you learned how to incorporate custom PySpark code to train a Spark ML Logistic Regression model for natural language processing.

While the platform is easily extensible, it’s important to note that the custom code still leverages underlying, built-in features and power of StreamSets Transformer. To name a few:

  • Executing on any Spark cluster, on-prem on Hadoop or on cloud hosted Spark services, for example, Databricks.
  • Progressive error handling to learn exactly where and why errors occur, without needing to decipher complex log files.
  • Highly instrumented pipelines that reveal exactly how every operation, and the application as a whole, is performing.

To learn more about StreamSets Transformer visit our website, refer to the documentation, and download the binaries.


Access Relational Databases Anywhere with StreamSets Cloud and SSH Tunnel


StreamSets Cloud makes it easy to build data pipelines integrating with cloud-based relational database services such as Amazon RDS, Google Cloud SQL and Azure’s databases. Now, with the January 13, 2020 release, StreamSets Cloud also supports on-premise databases accessed securely via an SSH tunnel. In this blog post I’ll explain how to configure data pipelines with SSH tunnel, and how to work with your IT team to provide secure access to on-premise databases. If you’re looking for a deeper dive into how SSH tunnel works, watch this space for another, more detailed, blog post on the topic!

SSH Tunnel in StreamSets Cloud

SSH tunneling is supported by all of the relational database stages currently available in StreamSets Cloud, allowing you to integrate securely with on-premise instances of MySQL, Oracle, PostgreSQL and SQL Server. Briefly, an SSH tunnel is a mechanism to expose internal services, such as databases, to external networks such as the Internet, in a secure, controlled manner.

SSH Tunnel Configuration

When you configure SSH tunnel in a data pipeline, you will need to ask your network administrator for the SSH tunnel’s host, port, username and, optionally, host fingerprint. The host fingerprint allows StreamSets Cloud to authenticate the SSH server. You may leave the fingerprint blank, in which case StreamSets Cloud will establish the connection without verifying the SSH server.

With this configuration data in hand, you can configure your data pipeline. In the relevant pipeline stage, select the ‘SSH Tunnel’ tab and enable ‘Use SSH Tunneling’. Enter the above parameters, clicking ‘Show Advanced Options’ if necessary to set the port (if it differs from the default of 22) and/or host fingerprint. Finally, click to download the SSH public key.

StreamSets Cloud SSH Tunnel Config

You will need to give the SSH public key file to your network administrator, so that they can allow StreamSets Cloud to access the SSH tunnel. They will also need to allow inbound connections from the StreamSets Cloud IP addresses, as detailed in the documentation. StreamSets Cloud provides a single public SSH key that is unique per account. All pipeline stages in that account use the same public key to connect to the SSH server.

With that done, you can proceed to configure the pipeline stage with the connection string, query, credentials etc. Here I configured the PostgreSQL Query Consumer origin to read data via an SSH Tunnel. Note the hostname in the connection string – I used SSH tunnel to allow StreamSets Cloud to access my laptop in my home office. I’ll explain the details in a future blog post.

StreamSets Cloud JDBC Config

Let’s See Some Data!

Previewing the pipeline verifies the connection and retrieves the first few rows from the database, just as it does for any data pipeline:

Preview PostgreSQL Pipeline

Once you’ve verified that the database is accessible, you can move on and build the remainder of your pipeline. In this simple example, I used StreamSets Cloud’s Kinesis Producer to write messages to an Amazon Kinesis stream:

Run PostgreSQL to Kinesis pipeline

Watch this short video to see the pipeline in action:

Conclusion

StreamSets Cloud can securely access cloud-based relational database services such as Amazon RDS, Google Cloud SQL and Azure’s databases directly, or on-premise databases such as PostgreSQL, MySQL, Oracle and SQL Server, via an SSH tunnel. Start your free trial of StreamSets Cloud today!


Sentiment Analysis: Microsoft SQL Server 2019 Big Data Cluster And StreamSets DataOps Platform


SQL Server 2019 Big Data Clusters enable greater flexibility in interacting with big data and managing it in a way that makes it easy to use for machine learning and analytical tasks. They allow data team members to deploy scalable clusters of SQL Server, Apache Spark, and HDFS containers running on Kubernetes.

In this blog post, I’ve highlighted the integration between Microsoft SQL Server 2019 Big Data Cluster and  StreamSets DataOps Platform to perform sentiment analysis on streaming data.

SQL Server 2019 Big Data Clusters enable creating a virtual data hub where users can query data from many sources, structured and unstructured, through a single, unified interface via PolyBase. StreamSets enhances the data hub by providing a data integration platform for physically moving data from disparate sources, locations, and formats in a continuous and reliable way, allowing you to build a modern data hub driving real-time analytics.

Given that SQL Server Big Data Cluster is deployed as a set of containers on a Kubernetes cluster, it’s easy to deploy StreamSets in the same environment using provisioning agents. A provisioning agent is a containerized application that runs in Kubernetes. The agent communicates with StreamSets Control Hub to automatically provision containers for StreamSets data plane components for moving data across data stacks. Provisioning includes automatically deploying, registering, starting, scaling and stopping data plane containers. 

As noted, in this blog post, we’ll look at the integration between the two technologies to perform sentiment analysis on streaming data from Twitter.

Watch Demo Video

This demo is broken up into two flows as described below.

Ingestion: Twitter To Apache Kafka

  • Ingest
  • Transform 
  • Store
    • The transformed tweet records are sent to Apache Kafka destination.

Sentiment Analysis: Apache Kafka To SQL Server 2019 Big Data Cluster

  • Ingest
  • Transform 
  • Machine Learning
    • HTTP Client processor initiates request to Azure Sentiment Analysis API to analyze and score tweet text. Scores close to 1 indicate positive sentiment and scores close to 0 indicate negative sentiment.
  • Store 

Sample Azure Sentiment Analysis API input:

{
  "documents": [
    {
      "language": "en",
      "id": "1",
      "text": "RT @UnoPlatform: Hey tweeps – will any of you be at Ignite conference in Orlando? We will be there and would love to connect, get a coffee."
    },
    {
      "language": "es",
      "id": "2",
      "text": "MS-500 está reservado mientras estoy en #MSIgnite2019. Ahora a golpear los libros!"
    },
    {
      "language": "en",
      "id": "3",
      "text": "Take a look at @VirtDesktopTT's top 10 Microsoft Ignite 2019 sessions for VDI admins. You won't want to miss."
    }
  ]
}

Sample Azure Sentiment Analysis API output:

{
  "documents": [
    {
      "id": "1",
      "score": 0.92
    },
    {
      "id": "2",
      "score": 0.85
    },
    {
      "id": "3",
      "score": 0.64
    }
  ],
  "errors": []
}
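For reference, a request like the sample above can be reproduced in a few lines of Python; the endpoint region, API version and subscription key below are assumptions you would replace with your own Text Analytics resource values.

import requests

# Hypothetical endpoint and key; substitute your Text Analytics region and subscription key.
endpoint = "https://westus.api.cognitive.microsoft.com/text/analytics/v2.1/sentiment"
headers = {"Ocp-Apim-Subscription-Key": "<your-subscription-key>"}

payload = {"documents": [
    {"language": "en", "id": "1", "text": "Loving the sessions at Ignite this year!"}
]}

response = requests.post(endpoint, headers=headers, json=payload)
response.raise_for_status()
for doc in response.json()["documents"]:
    # Scores close to 1 indicate positive sentiment; scores close to 0 indicate negative sentiment.
    print(doc["id"], doc["score"])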

Query Sentiment Analysis on SQL Server 2019 Big Data Cluster

Once the tweet data, along with sentiment analysis scores, is stored in SQL Server Big Data Cluster, it’s ready for querying in Azure Data Studio.

Retrieve the tweet records.

SELECT * FROM [dashdb].[dbo].[TwitterStream];

Create “bins” based on scoring range.

select 
(select count(*) 
from dbo.TwitterStream
where score between 0 and 0.24) as "Tweets with Score between 0 - 0.24",
(select count(*) 
from dbo.TwitterStream
where score between 0.25 and 0.49) as "Tweets with Score between 0.25 - 0.49",
(select count(*) 
from dbo.TwitterStream
where score between 0.50 and 0.74) as "Tweets with Score between 0.50 - 0.74",
(select count(*) 
from dbo.TwitterStream
where score between 0.75 and 1) as "Tweets with Score between 0.75 - 1";

 

Summary

In this blog, I’ve illustrated how easily you can get started with Microsoft SQL Server 2019 Big Data Cluster and StreamSets DataOps Platform. In the highlighted use case, you learned how to quickly gain insights using Machine Learning on streaming data using the two technologies.

Learn more about StreamSets DataOps Platform and the StreamSets and Microsoft partnership. 


Digging Deeper into StreamSets Cloud and SSH Tunnel


The recent blog post, Access Relational Databases Anywhere with StreamSets Cloud and SSH Tunnel, presented an overview of StreamSets Cloud’s SSH tunnel feature. In this article, aimed at IT professionals who might have to configure an SSH tunnel, I’ll look a little more deeply into SSH tunneling, and explain how I configured my home router and SSH server to allow StreamSets Cloud access to PostgreSQL running on my laptop on my home network.

Scope of SSH Tunnel in StreamSets Cloud

From its initial release, StreamSets Cloud has been able to directly access cloud-based relational database services such as Amazon RDS, Google Cloud SQL and Azure’s databases. From the January 13, 2020 release, StreamSets Cloud has also been able to securely access on-premise databases via an SSH tunnel.

SSH tunneling is supported by all of the relational database stages currently available in StreamSets Cloud: MySQL, Oracle, PostgreSQL and SQL Server.

We’ll start with a quick look at how an SSH tunnel actually works, move on to cover SSH tunnel configuration in StreamSets Cloud, and finish with a quick demo of SSH tunneling in action.

SSH Tunnel Basics

An SSH tunnel is a mechanism for securely exposing a service, such as a database, located on a private network, to another network, such as the Internet. A bastion host runs the Secure Shell (SSH) server; the bastion host is typically accessible from the outside for SSH only, running no other services. The bastion host in turn is allowed to access a limited set of services on the private network.

Here’s a simple example – the bastion host is located in a DMZ, between two firewalls. Clients on the Internet can access the bastion host via SSH on port 22; all other ports are blocked by the outer firewall. The inner firewall allows traffic to one or more internal services, such as a database, via a limited set of ports. There is no direct access from the Internet to the private network.

SSH Tunnel Diagram

The SSH server on the bastion host must be configured with each client’s SSH public key. The public key is used to authenticate access to the SSH service; only clients with the corresponding private key can set up an SSH Tunnel.

To access the database, then, the client first sets up an SSH tunnel to the bastion host, using its private key to authenticate to the SSH server. The SSH tunnel listens on a local port at the client machine, and a client app can then connect to that local port as if it were accessing the database directly. The SSH tunnel encrypts the traffic as it crosses the Internet, and the SSH server acts as a proxy, forwarding traffic to the database.
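To make this concrete, here is a minimal Python sketch of a client setting up such a tunnel with the sshtunnel and psycopg2 libraries; the hostnames, key path and credentials are placeholders, and StreamSets Cloud handles all of this for you when SSH tunneling is enabled on a pipeline stage.

import psycopg2
from sshtunnel import SSHTunnelForwarder

# Placeholder hosts and credentials; the bastion forwards traffic to the private database.
with SSHTunnelForwarder(
        ("bastion.example.com", 22),                 # SSH server on the bastion host
        ssh_username="tunneluser",
        ssh_pkey="/path/to/private_key",             # private key matching the installed public key
        remote_bind_address=("db.internal", 5432)    # database host as seen from the bastion
) as tunnel:
    # The client connects to a local port; the tunnel encrypts and forwards the traffic.
    conn = psycopg2.connect(host="127.0.0.1", port=tunnel.local_bind_port,
                            dbname="mydb", user="dbuser", password="dbpassword")
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
    conn.close()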

SSH Tunnel and StreamSets Cloud

Users wishing to configure SSH tunnel in a data pipeline will need the SSH tunnel’s host, port (if it differs from the default of 22), username and, optionally, host fingerprint. The host fingerprint allows StreamSets Cloud to authenticate the SSH server; it may be left blank, in which case StreamSets Cloud will establish the connection without verifying the SSH server. Note that the fingerprint must be generated according to a particular format – see the StreamSets Cloud documentation for details.

With this in hand, a user can configure their data pipeline. In the relevant pipeline stage, they will select the ‘SSH Tunnel’ tab and enable ‘Use SSH Tunneling’. The user will enter the above parameters, clicking ‘Show Advanced Options’ if necessary to set the port and/or host fingerprint, and download the SSH public key.

You, as the network administrator, will need to install the SSH public key onto the SSH server so that StreamSets Cloud is allowed access. You will also need to allow inbound connections from the StreamSets Cloud IP addresses, as detailed in the documentation. StreamSets Cloud provides a single public SSH key that is unique per account. All pipeline stages in that account use the same public key to connect to the SSH server.

With that done, the user can proceed to configure the pipeline stage with the connection string, query, credentials etc, use preview to verify the connection and see the first few rows of data, and run the pipeline to read data from, or write data to, the database.

SSH Tunneling in Action

I was keen to try out SSH tunneling with the minimum number of steps, and I realized that I could allow StreamSets Cloud to query the PostgreSQL database running on my laptop, at home. I’ll walk through the steps I took to achieve this, so you can play along at home. Note that a real enterprise data center will have very different practices and procedures!

My network architecture is a little simpler than the example above – I don’t have a DMZ, and I simply run the SSH server and database on my laptop:

Diagram of home SSH network architecture

My first step was to configure my router to forward traffic on port 22 to my laptop:

Port Forwarding Config

I then created a simple pipeline with a PostgreSQL Query Consumer:

Simple StreamSets Cloud pipeline with PostgreSQL origin

Following the StreamSets Cloud documentation for SSH tunnel, I generated the host fingerprint on my laptop:

pat@pat-macbookpro ~ % echo "$(ssh-keyscan localhost 2>/dev/null | ssh-keygen -l -f - | cut -d ' ' -f 2)" | paste -sd "," -

SHA256:20Be0xJMrtkbuWIonS4Ct17Clvrkkefd6PMAbfaXOyc,SHA256:KGVrkRiM9IT0I6fB7JnAcHSUryXeDcCKC5Q0xxEKmUo,SHA256:3DA7J2SIZ96DsxaE0fzuktFfR5FA0uCSrBu+7oWWbyY

Now I could fill out the SSH Tunnel tab on the PostgreSQL origin. Note that I enabled ‘Show Advanced Options’ to configure the host fingerprint.

SSH Tunnel Options

Looking at each configuration property in turn:

  • Use SSH Tunneling is checked, revealing the remainder of the properties
  • SSH Tunnel Host is set to my router’s external IP address. You can check your router configuration or simply google what’s my ip to discover this.
  • SSH Tunnel Port is left with the default SSH port number, 22.
  • SSH Tunnel Host Fingerprint has the fingerprint value I generated on my laptop.
  • SSH Tunnel Username has my username on my laptop

Next I downloaded the SSH public key file to my laptop and installed it, again following the StreamSets Cloud documentation:

pat@pat-macbookpro ~ % ssh-copy-id -f -i ~/Downloads/streamsetscloud_tunnelkey.pub pat@localhost
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/Users/pat/Downloads/streamsetscloud_tunnelkey.pub"

Number of key(s) added:        1

Now try logging into the machine, with:   "ssh 'pat@localhost'"
and check to make sure that only the key(s) you wanted were added.

Before I could test it out, I needed to configure the PostgreSQL Query Consumer with a JDBC connection string, query etc:

StreamSets Cloud JDBC Config

I configured a simple incremental query to retrieve rows based on a primary key, so I could add data as the pipeline was running and see it being processed in near real-time.

Note that the hostname in the connection string is internal to my home network. Remember, my laptop is not directly exposed to the Internet; the SSH server, running inside the local network, makes the connection, so the hostname need only be locally resolvable.

Without a doubt, my favorite StreamSets Cloud feature is preview. I can preview even a partial pipeline, comprising just an origin, to check its configuration and see the first few records from the data source. Previewing my embryonic pipeline showed that StreamSets Cloud could indeed read data from my laptop, located on my home network, via SSH tunnel:

Preview PostgreSQL Pipeline

Now that I can read data successfully, I need to send it somewhere! For the purposes of this demo, I selected Amazon Kinesis.

StreamSets Cloud pipeline - PostgreSQL to Kinesis

StreamSets Cloud’s Kinesis Producer sends messages to a Kinesis stream, and I can easily run a command-line tool to consume and display messages to verify that the pipeline is operating correctly. The blog entry Creating Dataflow Pipelines with Amazon Kinesis explains how to build data pipelines to send data to and receive data from Kinesis Data Streams in the context of StreamSets Data Collector; StreamSets Cloud’s Kinesis stages work in exactly the same way.

I could now run the pipeline to see data being streamed from PostgreSQL to Kinesis:

Run PostgreSQL to Kinesis pipeline

Using the aws command line tool and a bit of shell magic to read the Kinesis stream:

pat@pat-macbookpro ~ % shard_iterator=$(aws kinesis get-shard-iterator --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON --stream-name pat-test-stream | jq -r .ShardIterator)
pat@pat-macbookpro ~ % while true; do
    response=$(aws kinesis get-records --shard-iterator ${shard_iterator})
    echo ${response} | \
    jq -r '.Records[]|[.Data] | @tsv' |
    while IFS=$'\t' read -r data; do
        echo $data | \
        openssl base64 -d -A | \
        jq .
    done
    shard_iterator=$(echo ${response} | jq -r .NextShardIterator)
done

{
  "id": 758,
  "name": "Bob"
}
{
  "id": 759,
  "name": "Jim"
}
{
  "id": 760,
  "name": "William"
}
{
  "id": 761,
  "name": "Jane"
}
{
  "id": 762,
  "name": "Sarah"
}
{
  "id": 763,
  "name": "Kirti"
}

Success!

This short video shows the pipeline in action:

Conclusion

StreamSets Cloud can securely access cloud-based relational database services such as Amazon RDS, Google Cloud SQL and Azure’s databases directly, or on-premise databases such as PostgreSQL, MySQL, Oracle and SQL Server via an SSH tunnel. Start your free trial of StreamSets Cloud today!



Databricks Bulk Ingest of Salesforce Data Into Delta Lake


Learn how to leverage the newly released Databricks COPY command for bulk ingest into Delta Lake using the hosted StreamSets Cloud service.

StreamSets is proud to announce an expansion of its partnership with Databricks by participating in Databricks’ newly launched Data Ingestion Network. As part of the expanded partnership, StreamSets is offering additional functionality for StreamSets Cloud with a new connector for Delta Lake, an open source project that provides reliable data lakes at scale. The StreamSets Cloud service provides an integrated, cloud-based user experience for designing, deploying and monitoring your pipelines across your entire organization. A key component of this integration is leveraging the newly released Databricks COPY command for bulk ingest into Delta Lake using the StreamSets Cloud service.

Watch Demo Video

Let’s consider a simple example of ingesting accounts information from Salesforce and storing it in queryable format in a Delta Lake table.

Pipeline Overview

If you’d like to follow along, here are the details to get you started and here’s the GitHub link to the pipeline JSON that you can import into your environment.

Prerequisites

Here are the steps for designing our dataflow pipeline:

  • Configure Salesforce origin to read (accounts) information
  • Configure Expression Evaluator processor to transform input data attribute
  • Configure Amazon S3 destination to load data to a staging location
  • Configure Databricks Delta Lake executor to run a Spark SQL query to copy the data from Amazon S3 into the Delta Lake table

Salesforce—Origin

The configuration attribute of interest is the SOQL Query, which retrieves account details from Salesforce.

SELECT Id, Name, Type, BillingStreet, BillingCity, BillingState, BillingPostalCode, BillingCountry, Website, PhotoUrl, AccountNumber, Industry, Rating, AnnualRevenue, NumberOfEmployees 
FROM Account 
WHERE Id > '${OFFSET}' 
Order By Id

Expressions Evaluator—Processor

The configuration attribute of interest is the Field Expression, which uses a regular expression to remove the redundant prefix text “Customer -” from the account Type field.

${str:regExCapture(record:value('/Type'),'(.*) - (.*)',2)}
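For readers who prefer to see it outside the Expression Language, here is the same capture sketched in plain Python; the sample value is hypothetical.

import re

# Hypothetical sample value; the pipeline applies the expression to each record's /Type field.
type_value = "Customer - Direct"

# Same idea as str:regExCapture(..., '(.*) - (.*)', 2): keep the text after " - ".
match = re.match(r"(.*) - (.*)", type_value)
print(match.group(2) if match else type_value)   # prints "Direct"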

Amazon S3—Destination

The configuration attribute of interest is the Data Format (set to JSON), which controls the format in which the account detail objects are stored on Amazon S3.

Databricks Delta Lake—Executor

As noted earlier, Databricks provides COPY command to efficiently bulk load large amounts of data into Delta Lake. To use the COPY command, Databricks Delta Lake executor has been added to the pipeline. The Databricks Delta Lake executor is capable of running one or more Spark SQL queries on a Delta Lake table each time it receives an event.

In our example, the Amazon S3 destination is configured to generate events each time it completes writing an object. When it generates an object written event, it also records the bucket where the object is located and the object key name that was written. These bucket and object key attributes are used by the COPY command to load the data from Amazon S3 into an existing Delta Lake table. (See below Spark SQL query.)

Configuration attributes of interest:

  • JDBC Connection String
    jdbc:spark://dbc-5a9fba6c-c704.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/8214879852432560/1123-001005-tow71;AuthMech=3;

Note: The JDBC connection string is where we can specify which Databricks cluster to connect to execute the COPY command. In our case, that’s 1123-001005-tow71.

  • Spark SQL Query
    COPY INTO accounts_d 
    FROM (Select Id, Name, Type, BillingStreet, BillingCity, BillingState, BillingCountry, CAST(BillingPostalCode AS INT), Website, PhotoUrl, AccountNumber, Industry, Rating, CAST(NumberOfEmployees AS INT), AnnualRevenue 
    FROM "s3a://${record:value('/bucket')}/${record:value('/objectKey')}") 
    FILEFORMAT = JSON 
    FORMAT_OPTIONS ('header' = 'true')

Preview Pipeline

Previewing the pipeline is a great way to see the transformations occurring on attribute values. See below the selected Expression Evaluator processor and the before and after values of the Type attribute.

Query Delta Lake

If everything looks good in preview mode, running the pipeline should bulk copy account information from Salesforce to Delta Lake.

Once the Delta Lake table is populated, you can start analyzing the data. For example, running the following query will give us insights into total revenue generated based on account ratings and types of accounts.

Total Revenue by Rating and Account Type

SELECT Rating, Type, format_number(SUM(AnnualRevenue),2) AS Total_Revenue 
FROM accounts_d
WHERE Rating is not null and Type is not null
GROUP BY Rating, Type

Summary

In this blog post, you’ve learned how to leverage Databricks COPY command for bulk ingest into Delta Lake table using the hosted StreamSets Cloud service. 

Learn more about how to build, run and monitor pipelines using the hosted StreamSets Cloud service. 

You can also ask questions in the #streamsets-cloud channel on our community Slack team—sign up here for access.


Fun with Spark and Kafka: Slot Car Performance Tracking


Antonin Bruneau is a Solution Engineer at StreamSets, based in Paris, France. He recently created a fun demo system showing how the StreamSets DataOps Platform can collect, ingest, and transform IoT data. Over to you, Antonin…

Traditional product blog posts explain how to solve complex problems you might find in your daily job, but today I want to show you how to have a fun time with StreamSets.

Who doesn’t remember having great fun racing slot cars, competing with your best friend, and epic car crashes?

In this blog post, I’ll show you how to add sensors to your race track with a Raspberry Pi, and then collect metrics in real-time and compute statistics about your races, using StreamSets Data Collector and StreamSets Transformer.

The Electronic Part

To measure your race performance, we are going to add sensors to the start/finish line of the race track, and some buttons to control the game. We will use a Raspberry Pi to collect the measurements and output the events to a file.

Details about the components, schematics, and code for the Raspberry Pi can be found here: https://github.com/abruneau/hacking_electric_race_track.

Slot car racing data collection

Event Collection

Reading Data from the Raspberry Pi

On the Raspberry Pi, a Python application reads the sensor and button events and writes them to a text file. Each line in the file is composed of a timestamp and an event number. To have a minimum impact on the device, we will use StreamSets Data Collector Edge to read the file and send the events to a Kafka topic.

StreamSets Data Collector Edge pipeline

Converting Raw Data to Game Information

In this second part of event collection, we will parse the raw data and store the results in MySQL tables. Here is the schema I used:

Slot car database schema

And here is the pipeline to read Kafka and write data to MySQL:

StreamSets Data Collector slot car pipeline

From Kafka, we receive a series of timestamps and event numbers:

  • 0: race starts
  • 1: sensor 1
  • 2: sensor 2
  • 3: race stops
  • 4: reset race

A Stream Selector (‘dispatch events’ in the pipeline shown above) dispatches the events to different routes.

The start route (event 0) retrieves the newly created game (see below for how we create it with the API) and sets its start time.

The sensor route (event 1 and 2) uses Redis to cache each sensor event timestamp and retrieve the previous one if it exists. If the sensor has been triggered before, we can calculate the duration of the lap and update the Laps table.
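
The pipeline implements this with a Redis lookup and Expression Evaluator stages; expressed as a hedged Python sketch with redis-py, the lap-timing logic looks roughly like this (the key naming and timestamp units are assumptions):

# Sketch of the lap-timing logic; the pipeline does this with Redis lookup and
# Expression Evaluator stages, so key names and units here are assumptions.
import redis

r = redis.Redis(host="localhost", port=6379)

def handle_sensor_event(sensor_id, timestamp):
    key = "last_ts:sensor:{}".format(sensor_id)
    previous = r.getset(key, timestamp)  # cache the new timestamp, return the old one
    if previous is not None:
        lap_duration = float(timestamp) - float(previous)
        return lap_duration  # would be written to the Laps table
    return None  # first pass over the sensor: no lap to record yet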

The stop route (event 3) retrieves the current game and updates the game table with the finish time. It also triggers two other pipelines: a Data Collector pipeline that updates the winner table, and a Transformer pipeline I will explain later.

The last one, the reset route (event 4), gets the current game and resets it by deleting all associated laps and resetting the winner table, Redis, and the game's start/stop times.

StreamSets Data Collector slot car winner pipeline

The pipeline responsible for setting the winner is very simple. It uses a SQL query to find the user based on the game Id we pass in as a parameter and stores the result in the winners table. A Pipeline Finisher stops the pipeline once the winner has been set and resets the origin.

Game Statistics

To add some more competition, a Transformer pipeline is used to compute statistics about the games.

StreamSets Transformer slot car pipeline

This pipeline joins the players table with the laps table to rank players on their fastest lap and their fastest overall race. You could easily compute more statistics like average lap/race time, or player stats to measure progress. It is up to you to get creative.
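
As a rough illustration, the equivalent ranking logic expressed directly in PySpark might look like the sketch below; the table and column names (players, laps, player_id, duration) are assumptions based on the schema shown earlier, not the exact pipeline configuration.

# Hedged PySpark sketch of the fastest-lap ranking; table and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("slot-car-stats").getOrCreate()

players = spark.table("players")  # assumed table names
laps = spark.table("laps")

fastest_laps = (laps.groupBy("player_id")
                    .agg(F.min("duration").alias("fastest_lap")))

ranking = (players.join(fastest_laps, "player_id")
                  .orderBy(F.col("fastest_lap").asc()))
ranking.show()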

This pipeline is triggered by the stop route of the Data Collector pipeline.

Exposing the Data

Now that we can collect the race track data, the question becomes how to expose it. I could have chosen a traditional BI tool and plugged it into the database directly, but I wanted something light that I could customize easily. So I created a small web app in Angular that queries an API to collect data.

StreamSets slot car dashboards

I built the API using a Data Collector microservice pipeline.

StreamSets Data Collector microservice pipeline

This is a special type of pipeline that has a REST Service origin as input and sends back an HTTP response. I used an HTTP Router to route the event depending on the URL path. To retrieve the data, I used a series of JDBC Lookup processors, and Expression Evaluators to format the response. This blog post explains more about microservice pipelines, and contains a link to a tutorial.
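
As a usage example, the web app's calls to this microservice boil down to plain HTTP requests; the host, port, and path below are hypothetical, since they depend on how the REST Service origin is configured.

# Hypothetical client call to the microservice pipeline's REST Service origin.
import requests

response = requests.get("http://sdc-host:8000/rest/v1/stats/fastest-laps")
print(response.json())  # JSON response shaped by the JDBC Lookups and Expression Evaluators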

Wrap Up

By using StreamSets, we were able to collect data on an IoT device, stream the events from Kafka, populate a database, compute statistics leveraging Spark, and expose data through an API.

Now it is your turn! Play it, hack it, share it!!

This GitHub repository contains a docker-compose file with StreamSets Data Collector and StreamSets Transformer, local Spark, Kafka (Lenses), and the Web UI: https://github.com/streamsets/slot_car_demo

 

The post Fun with Spark and Kafka: Slot Car Performance Tracking appeared first on StreamSets.

The Hidden Complexity of Data Operations


Akin to art and music, the most appealing part of the design process is always the ideation, the canvases where you get to explore new ideas that usher in new value and opportunity. Consider The Space Travelers Lullaby from jazz artist Kamasi Washington, made almost completely as an improvisational jam. Kamasi is often quoted as saying “This process can be gradual, even frustrating”, but in the world of music these heroics are rewarded with fanfare and critical acclaim.

Growing Complexity

As companies target and acquire new data sources and aim to deliver new form factors of data analytics, a fair amount of creative capital is spent designing these new patterns and solutions. Not surprisingly, rich tools have arisen that give visual control over analytics and more recently data movement. Data engineers now have highly intuitive tools to control the integration of data across their business. A savvy engineer can boost their internal brand immensely by bringing a forward-thinking, new capability into reality. However, how can they find the time when they are so often mired in the task of keeping existing ideas healthy, modern, and in production?

The truth is that they often spend a great deal of time handling the operations needed to keep data pipelines running and meeting the requirements of downstream projects. This work is widely understood among data engineers, but it is rarely acknowledged as an activity that earns praise and accolades; at best, it is considered part and parcel of the ideation. But when things go wrong, the pain that data engineers feel is real. Whether it's a data science model that an analyst convinced an engineer to support or the daily dashboard supporting the sales team, when these capabilities break, friendship and an impeccable record of new ideas will only win you so many graces.

However, data operations doesn’t have to be a zero-sum game. You can build a system that provides resilience and flexibility even at scale. Thinking smart about how you scale and react to the operational needs of your data pipelines and data processing can be cumbersome at first but will pay dividends as workloads and data projects amass.

In a static data world, upfront developer productivity matters more than operations. In a continuous data world, operations is everything.

So how do we build operations for a continuous data world? The answer is DataOps. DataOps is the mechanism for running a data-driven business in a world of constant change. The following components are key to delivering DataOps functionality.

Visibility

Think of a single data pipeline as a tab on your internet browser. At a small scale, toggling between tabs on your browser is relatively manageable (though not ideal). Depending on the size of your organization you likely have multiple tools (legacy and modern) to build data pipelines. These tools will all have a different degree of control and granularity to understand both the progress and health of your data pipelines. Data teams often spend a good deal of their time managing to the limitations of the tools.

But what about when you have 30, 100, or 1,000 pipelines? The idea of diving through 100 internet browser tabs with differing degrees of helpfulness will likely only produce delays and cause an increasing amount of pain as these connections grow. With DataOps, the goal is to have a comprehensive and living map of all of your data movement and data processing jobs. When pipelines or stages of a pipeline break down, the errors aggregate into a single point of visual remediation. That way data engineers understand the issue and react in a manner that doesn't harm downstream projects or deal a blow to the engineer's brand. This map should be as useful and as reactive for today's data workloads as it is for handling tomorrow's workloads.

Monitoring

A big component of visibility is active monitoring. Many tools aim to monitor the operations inside their product portfolio, but DataOps demands a level of monitoring that persists outside of a single system or workload. Continuous monitoring requires that systems share metadata and operational information so users can see a broader scope of challenges. These monitoring capabilities also help companies define, refine, and deliver on downstream data SLA’s. This ensures that analytics can be delivered with confidence and the company can evolve the art of self-service, which further removes the data engineer from risk. Monitoring should not only tackle alerting a team when something breaks, but in a DataOps scenario, it should actively monitor for the precursors to potential problems. Monitoring in DataOps should also be comprehensive, not allowing for data systems and silos to become operational black holes.

Automation

As companies scale their data practice, automation and integration with automation tooling becomes paramount. No aspect of self-service can be delivered without some level of automation. When engineers are able to work on automation tasks the impact can be amplified across multiple workloads. In the DataOps ecosystem, users will want to automate anything they can, while remaining reliable. Proprietary systems often offer poor extensibility making them complicated to automate and integrate with automation tools. This is why DataOps is often focused on open solutions or platforms that provide API extensibility and programmable integration with infrastructure and computing platforms.

Managing an Evolving Landscape

What is the cost of not modernizing? Will it cost you your competitive stance? Today’s solution landscape is evolving at a feverish pace and the toll this takes on data professionals is almost criminal. Your data strategy must not only be competitive with today’s requirements, but also be future leaning to consider your company’s transformation over the next five to ten years.

Managing and evaluating this fast-moving landscape requires tools and platforms that can abstract away from reliance on a single data platform or analytics solution. Logic for designing pipelines should be transferable, no matter the source and destination, allowing for change based on the business requirements vs managing to the limitations of the system. In DataOps, change is a given. DataOps systems embrace change by allowing their users to easily adopt and understand complex new platforms in order to deliver the business functionality they need to remain competitive. DataOps embraces the cloud and helps companies build hybrid cloud solutions that may someday live natively in the cloud.

DataOps Is Real

I leave you by assuring you today that DataOps is real and is providing measurable business value for modern organizations. Consider this example from my own company (StreamSets). We have a customer that utilizes our software and is able to manage 194 pipelines with real-time visibility and proactive monitoring. This customer also executes 58 Apache Spark-based ETL jobs. They execute this all via one engineer. The value they receive from the work of a single individual generates numerous operational efficiencies while decreasing data duplication and redundant processes. While I don’t particularly advocate a single source inside any company to dictate its data success, the ability to streamline data operations allows more of those talented individuals to do the sexy ideation, much like the melodic sonnets of artist Kamasi Washington.

The post The Hidden Complexity of Data Operations appeared first on StreamSets.

10 (More) Tips for Working from Home


The StreamSets team will be working hard and working remotely for the next few weeks. As I headed back across the San Francisco Bay to my home in Marin County, I posed the question on our LifeatStreamSets Slack Channel: what helps you stay productive and connected?

Whether you have a dedicated home office or sit at the kitchen table, we hope you find these tips helpful and worth sharing.

1. Stick to a Routine

Wake up at the same time, follow your regular routine and hit the desk at your usual start time. If you have children, a routine will be important for providing structure, but don’t feel like you have to create a school day at home. That could overwhelm everyone! At the end of the day, close your laptop, turn off your notifications and spend time with friends or family. Make a nice meal. Call your Mom or a friend. When was the last time you had a conversation over the phone? 

2. Be Technically Prepared

Make sure your laptop or home desktop is set up to work with all the software you need, and you have reliable internet connectivity. Put in a ticket to IT as soon as you can if you need VPN set up. A quality headset with a microphone is a great addition. At StreamSets, we use the Google Suite for productivity tools and file sharing. GitHub is an essential channel as well.

3. Keep up the Collaboration

Quick resolutions are easier when your colleagues sit next to you. Set the tone for virtual collaboration by posting questions in appropriate channels, answering questions in a timely manner, and communicating in a polite, friendly way. If you need input on something a little more complex, summarize it in an email, send it, then follow up with the person later (to give them a chance to digest it). Be patient. Everyone’s schedule might be a little off as they juggle new responsibilities at home. 

4. Dedicate a Time and Place to Work

Whether you share your space with roommates, a spouse or children, you need a defined workspace. When I'm in my home office, I know I'm working and so does everyone else in my household. If you don't have a dedicated room, headphones and facing a wall or a window help tune out the household activity. Most messaging apps have a status to indicate when you are available, and when you are focused and do not want to be disturbed. Use it and be respectful about disrupting others. You might want to turn off those news alerts…

5. Build Your Skills

With tradeshows and other face-to-face events cancelled, you’ll be doing your discovery and research online. We put together a virtual DataOps Summit with keynotes from thought leaders and technical sessions for data engineers and data architects. Feel free to watch one or all of the sessions. Lots of companies now have online office hours like our Demos with Dash sessions. 

6. Get Physical

At StreamSets, we have a Slack channel dedicated to fitness and post our workouts to keep each other motivated. Here's a sampling of apps my colleagues and I use to guide our routines: TRX Suspension Training, Gaia, RunKeeper, Strava, Garmin Connect, Fitbit, Strong. Too much commitment? A 7-minute workout gets your blood flowing and generates endorphins to counter negative thoughts.

7. Invite a Colleague to Lunch

If lunchtime is your favorite time of day, invite a colleague to join you. Pick a time and a virtual hangout space, then make your lunch and dial in. It’s a great way to stay connected in an informal way, avoid the news feed, and stave off the boredom.

8. Go for a Walk

When you feel blocked by a problem, it’s easy to get down on yourself. Without a colleague to shift your perspective, the temptation is to go to the kitchen and start munching. Instead, go outside and walk around the block. You might find that the answer comes to you the moment you stop thinking about it.

9. Manage by Objective

If you are a manager, it can be disconcerting to be disconnected. At StreamSets, we're using 15Five to set weekly objectives that roll up to our larger goals. My manager and I can see exactly what I've committed to doing this week and I stay focused on what matters.

10. Kids at Home? This One’s for You

The iPad, game console, or TV is only going to buy you so much time. We're getting back to basics with cooking, gardening, writing stories, and interviewing family members by video chat. We're all a little anxious right now. Check in with your children regularly to reassure them that, even though you are working, you're there for them.

Thanks to my colleagues for their input! I miss our team lunches, the buzz of people hard at work, and the beautiful commute on the ferry across San Francisco Bay. We are finding new ways to connect and replace that camaraderie. What’s on your list? Let me know!

The post 10 (More) Tips for Working from Home appeared first on StreamSets.

Coping in the Time of COVID19 – A CPO’s (and Mom’s) Perspective


This is my first blog for StreamSets and I'm going to do it all wrong. I should be writing a thought leadership piece on the DataOps category, drop hints about how uniquely differentiated our product is, and send you to various calls-to-action to "learn more" or try our product.

I should have hyperlinked several words in that last sentence to send you deeper into our website to further explore our platform and value proposition, and to boost search engine rankings. But that’s not the blog I feel like writing right now, because frankly I think we all have other things on our minds.

What’s Really on My Mind

The up-ending of our way of life and work, our societal norms and our sense of well-being in the world has been unlike any other in our lifetimes. The disruption to the day-to-day is simply unprecedented.

I have two children, ages 10 and 8, and live in San Francisco. We were amongst the first in the nation to be ordered to shelter-in-place. I've been really struggling to balance my children's needs – to help establish new routines; to encourage them to do their online schooling; to learn how not to kill each other when cooped up in our house most of the day; to work through their deep-seated anxieties which are manifested very differently between the two of them – with my own ability to stay productive with my work. And I, like many, haven't been sleeping well, which also has impacted my productivity.

The first week of shelter-in-place, last week, my productivity was down by about half. It’s improving day-by-day as we get into new routines, but I still have 30% of my work calendar blocked off for dedicated home schooling for the foreseeable future. 

You know what? My diminished productivity is okay. It takes a lot to adjust to a new reality. We are all human and while some people are able to keep chugging along, perhaps even enjoying fewer work distractions than normal or finding relief in their work, others cannot. I’m somewhere in the middle. I do look forward to my “day job” work as it keeps me looking forward and gives purpose. But I have to balance it against family life in a very different way than before, and I also have to give myself a break if I spend a night tossing and turning.

Flexing, Adapting, Coping, Growing

I suspect that most of you who are reading this blog (data engineers, developers, data platform operators, etc.) are in a situation where you are logistically able to work from home given the nature of the job. If you have been asked to work from home and that is new for you, perhaps you’re doing relatively okay and you’ve been able to double down on your work without the distractions of office chatter. But maybe it’s been a difficult shift in work style which you’re still trying to adjust to. Maybe you’re really struggling to process all the changes and news on a daily basis. Maybe you have extra time on your hands because you no longer have a long commute. Or maybe you’re finding it nearly impossible to juggle kids and work at home. Wherever you’re at, be kind to yourself.

At StreamSets we've often talked about our primary customers, data engineers, being squeezed in the middle between multiple demands: the business clamoring for more data; data producers wanting more control; IT leaders wanting to keep up with the latest and greatest. Now, you may be feeling a whole different squeeze as the realities of life during COVID19 create a new set of pressures and demands upon you, all of which were unanticipated.

We Are All “Here” Together

We will do whatever we can to support you, whether that is helping you squeeze out more time by increasing your productivity so you can find the hour you need to review your kids’ schoolwork, or filling up your time by bootstrapping your skills in new areas or adding new capabilities to your toolkit.

I hope that you are getting the support and flexibility you need from your managers and your peers to adjust your life and your work for the new reality. Hang in there, and if we can help in any way, just let us know.

 

 

Judy Ko is the Chief Product Officer at StreamSets, responsible for integrating product management and marketing, managing the full product life cycle and product portfolio, improving user experience, and driving alignment across the business.

The post Coping in the Time of COVID19 – A CPO’s (and Mom’s) Perspective appeared first on StreamSets.

COVID-19 Open Research Dataset Analysis on StreamSets DataOps Platform


Recently I attended an inspirational tech talk hosted by Databricks where the presenters shared some great tips and techniques around analyzing the COVID-19 Open Research Dataset (CORD-19), freely available here. As stated in its description:

“In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.”

In this blog post, I'm sharing some of the analysis presented during the tech talk which I have now replicated using StreamSets DataOps Platform. In particular, I've created the following dataflow pipelines in StreamSets Transformer running on a Databricks cluster.

While this blog addresses a sensitive subject, the goal is to show how to use some of the tools you might have at your disposal in order to better understand and analyze the problem at hand.

Ok, let’s get started.

JSON to Parquet

The first pipeline reads a subset of CORD-19 data available in JSON format, repartitions the data (where the number of partitions is set to the number of nodes in the cluster), and converts the data into Parquet for efficient downstream processing in Apache Spark. (Note: the data in my case just happens to be stored on Amazon S3, but in your case it may be stored on other sources as well.)
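
For reference, the same transformation expressed directly in PySpark would look roughly like the sketch below; the S3 paths and partition count are placeholders, and the actual pipeline does this through Transformer stages rather than hand-written code.

# Rough PySpark equivalent of the JSON-to-Parquet pipeline; paths and partition
# count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cord19-json-to-parquet").getOrCreate()

df = spark.read.json("s3a://my-bucket/cord19/json/")      # hypothetical input location
df = df.repartition(8)                                    # set to the number of cluster nodes
df.write.mode("overwrite").parquet("s3a://my-bucket/cord19/parquet/")  # hypothetical output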

Delta Lake

The second pipeline reads the CORD-19 data stored in Parquet format, performs some transformations, and stores the curated data on Databricks File System (DBFS) for further analysis.

One of the key transformations that occur in this pipeline is “pivot” – the CORD-19 dataset is structured in a way where each research paper (record) contains a nested list of affiliated authors. So in order to analyze individual author records, we need to create one output record for each author in the nested list.
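
In Spark terms this amounts to exploding the nested author list, roughly as in the sketch below; the nested field names are assumptions based on the CORD-19 JSON layout, and the output column matches the flattened schema used in the queries later on.

# Rough sketch of the "pivot": one output record per author in the nested list.
# Nested field names are assumptions based on the CORD-19 JSON layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cord19-explode-authors").getOrCreate()

papers = spark.read.parquet("s3a://my-bucket/cord19/parquet/")  # placeholder path
authors = (papers
           .withColumn("author", F.explode("metadata.authors"))
           .select("paper_id",
                   F.col("author.affiliation.location.country")
                    .alias("authors_affiliation_location_country")))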

Databricks Notebook

Once the transformed data was available on DBFS, I created a Databricks Notebook to visualize geo locations of the authors — basically showing the total number of research papers produced by authors from each country.

Here I am showing the transformed data on DBFS — the list of Parquet files and the dataframe schema that was generated when the data was written out to DBFS by the Delta Lake destination in the second pipeline above.

Here the first query creates a temp table paper_by_country which holds paper_id and the author’s country affiliated with it based on authors_affiliation_location_country column. The second query uses the temp table and joins it with country_mapping to count the number of papers and plots the location on the map.

paper_by_country = spark.sql("""select paper_id, min(authors_affiliation_location_country) as author_country from covid
where authors_affiliation_location_country is NOT NULL
group by paper_id""")
paper_by_country.createOrReplaceTempView("paper_by_country")

-- The second query runs in a separate SQL cell (e.g. %sql) so Databricks can plot the result:
select m.Alpha3, count(distinct p.paper_id) as number_of_papers from paper_by_country p
left join country_mapping m on m.AuthorCountry = p.author_country group by m.Alpha3

Notes:

  • I’ve analyzed a small subset of the CORD-19 dataset so the numbers shown in any queries, results, map, etc. are not a true representation of the entire dataset.
  • The country code mapping (“country_mapping“) I’ve used in the above SQL has been put together by one of the Databricks presenters and should be available on their webinar page. (If I find it, I will provide a link to it here.)

Sample pipelines on GitHub

JSON to Parquet

After importing the sample pipeline, update the pipeline parameters with your Spark cluster details, AWS credentials, S3 bucket name, etc. on both the origin and destination, and the number of partitions, before running the pipeline.

S3 To Delta Lake

After importing the sample pipeline, update pipeline parameters with your Spark cluster details as well as AWS credentials, S3 bucket name, etc. on the origin and DBFS directory path on Delta Lake destination before running the pipeline.

Conclusion

As you can imagine, I haven’t even scratched the surface and there’s ample opportunity for exploring and analyzing the CORD-19 dataset. You can also participate in this Kaggle competition to win prizes and, more importantly, help the community by generating new insights. Hopefully this blog post gives you some ideas, tips and tricks to get started.

Learn more about StreamSets for Databricks which is available on Microsoft Azure Marketplace and AWS Marketplace.

The post COVID-19 Open Research Dataset Analysis on StreamSets DataOps Platform appeared first on StreamSets.

StreamSets Live: Demos with Dash — Your Questions Answered


On March 25, 2020 I hosted my first StreamSets Live: Demos with Dash and I want to thank everyone that took the time to join despite what might have been going on in your lives. Times are definitely not “normal”, and more challenging for some than others, but I hope everyone is continuing to stay safe and well.

During the session I showed three different demos covering a variety of use cases using the StreamSets DataOps Platform.

Demos Summary

  • Streaming Twitter data to Kafka, performing sentiment analysis on tweets using Microsoft’s Cognitive Service API and storing the scores in MySQL
  • Parsing & transforming web application logs on Amazon S3 and storing the enriched logs in the Snowflake cloud data warehouse for analysis, including location and HTTP response code based queries
  • Running ML pipelines on Apache Spark on Databricks cluster and storing the output on Databricks File System (DBFS)

Your Questions Answered

Below I have answered some of the burning questions that were asked during the live demo session.

Q: “Can pipelines only be made via the GUI? or can we create pipelines in code?”
Answer: Pipelines can be created using the GUI, REST APIs, and the Python SDK (a minimal SDK sketch is shown below). For available REST API endpoints, browse to http(s)://StreamSetsDataCollector_HOST:StreamSetsDataCollector_PORT/collector/restapi where your Data Collector instance is running.
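
As a hedged example, creating and registering a trivial pipeline with the StreamSets SDK for Python looks roughly like this; the Data Collector URL and the stages chosen here are purely illustrative.

# Minimal pipeline created via the StreamSets SDK for Python; the Data Collector
# URL and the stages used here are illustrative.
from streamsets.sdk import DataCollector

sdc = DataCollector('http://localhost:18630')   # your Data Collector URL
builder = sdc.get_pipeline_builder()

dev_data = builder.add_stage('Dev Raw Data Source')
trash = builder.add_stage('Trash')
dev_data >> trash                               # connect the stages

pipeline = builder.build('Hello SDK Pipeline')
sdc.add_pipeline(pipeline)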

Q: “The connectors list is based on jdbc available drivers?”
Answer: There are several JDBC-based origins, processors, and destinations available as well as many others that don’t rely on JDBC drivers. For example, Kafka, Amazon S3, Azure Data Lake, Google Pub Sub, HTTP Client, Redis, Elasticsearch, UDP, WebSocket, Azure IoT Hub, etc. Refer to our documentation for a comprehensive list of origins, processors, and destinations.

Q: “How on-prem data can be transferred to for example GCP? publish to pub sub”
Answer: On-prem data can be read using the available origins and then written to Google Cloud Storage or Google BigQuery, or published to Google Pub Sub, using the corresponding destinations.

Q: “Is there documentation how to add the PySpark processor you mentioned?”
Answer: You don’t need to add or download it. It comes bundled with StreamSets Transformer.

Sample Pipelines on GitHub

You can download sample pipelines I used for the demos from GitHub.

After importing the sample pipelines, update AWS, Snowflake, and Twitter credentials as well as pipeline parameters and other configuration details, such as S3 bucket name, Kafka broker, MySQL, etc. where applicable before running the pipelines.

For PySpark ML demo, refer to my blog StreamSets Transformer: Natural Language Processing in PySpark. For Scala ML demo, refer to my blog StreamSets Transformer Extensibility: Spark and Machine Learning.

Thank you!

I hope to see you and your fellow data engineers, data scientists, and developers in my next StreamSets Live: Demos with Dash session where I will be showing a different set of demos! 🙂

Learn more about StreamSets DataOps Platform which is available on Microsoft Azure Marketplace and AWS Marketplace.

The post StreamSets Live: Demos with Dash — Your Questions Answered appeared first on StreamSets.


Streaming Analysis Using Spark ML in StreamSets DataOps Platform


Learn how to load a serialized Spark ML model stored in MLeap bundle format on Databricks File System (DBFS), and use it for classification on new, streaming data flowing through the StreamSets DataOps Platform.

In my previous blogs, I illustrated how easily you can extend the capabilities of StreamSets Transformer using Scala and PySpark. If you have not perused my earlier blogs on training a Spark ML Random Forest Regressor model, serializing the trained model, and training a Logistic Regression NLP model, I highly recommend them before proceeding because this blog builds upon them.

Ok, let’s get right to it!

Watch It In Action

If you'd like to see everything in action, check out this short demo video (there's no audio, though); please continue reading this blog if you'd like to know the technical details. 🙂

 

Streaming Data: Twitter to Kafka

I’ve designed this StreamSets Data Collector pipeline to ingest and transform tweets, and store them in Kafka. This pipeline is the main source of our streaming data that we will perform sentiment analysis on in the second pipeline.

Pipeline overview:

  • Ingest
    • Query tweets from Twitter using its Search API for #quarantinelife with the HTTP Client origin in polling mode.
  • Transform 
  • Store
    • The transformed tweet records are sent to Apache Kafka destination.

Here’s an example of the original Twitter Search API response as ingested by the HTTP Client origin.

And here’s an example of the transformed tweet written to Kafka.

Classification on Streaming Data: Kafka to Spark ML Model to Databricks

As detailed in my previous blogs, let’s assume that you have trained and serialized a model in MLeap bundle format, and it’s stored on DBFS as shown below.

Next up… I've designed this StreamSets Transformer pipeline running on a Databricks cluster.

Pipeline overview:

  • Ingest
    • Transformed tweet records are read from Apache Kafka, from the same topic written to by the first pipeline.
  • Transform 
    • Scala processor loads the Spark ML model (/dbfs/dash/ml/spark_nlp_model.zip) and classifies each tweet. A value of 1 indicates positive sentiment and 0 indicates negative sentiment. (*See code snippet below.)
  • Store 
    • Each tweet record along with its classification is stored on DBFS in Parquet format for querying and further analysis. The DBFS location in this case is /dash/nlp/.

*Below is the Scala code inserted into Scala processor >> Scala tab >> Scala Code section. 

import spark.implicits._
import scala.collection.mutable.Buffer

import org.apache.spark.ml.feature.VectorAssembler
import ml.combust.bundle._
import ml.combust.mleap.spark.SparkSupport._
import ml.combust.mleap.runtime.MleapSupport._
import org.apache.spark.ml.bundle.SparkBundleContext

var df = inputs(0)

if (df.columns.contains("text")) {
  // Load MLeap bundle to make predictions on new data
  val saveModelZipPath = "/dbfs/dash/ml/spark_nlp_model.zip"
  val bundle = BundleFile("jar:file:" + saveModelZipPath)
  var loadedMLeapBundle = bundle.loadMleapBundle().get.root
  bundle.close()

  // Return original/input data/features + respective predictions
  output = loadedMLeapBundle.sparkTransform(df.select("text")).select("text","prediction")
} else {
  output = df
}

It basically takes the input data(frame) and if it contains column “text” (tweet), it loads the NLP model (“spark_nlp_model.zip”) and classifies each tweet. Then it creates a new dataframe with just the tweet and its classification stored in “prediction” column. (Note that you could also pass along/include all columns present in the input dataframe instead of just the two–“text” and “prediction”.)

Analysis on Databricks

Once the tweets, along with their classification, are stored on the Databricks File System, they’re ready for querying in Databricks Notebook.

Query the tweets and their classification

Here I’ve created a dataframe that reads all the Parquet files output by the second pipeline in DBFS location /dash/nlp/ and shows what the data looks like.

Create temp table and aggregate data

Here I’ve created a temp table that reads the same data stored in /dash/nlp/ DBFS location and an aggregate query showing total number of positive tweets vs negative tweets.
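
A condensed sketch of those notebook cells is shown below; it assumes the same DBFS location and the text/prediction columns written by the second pipeline, with the aggregate query simplified to a count per prediction value.

# Condensed sketch of the notebook cells: read the Parquet output, register a
# temp view, and count positive vs. negative tweets. Paths and columns follow
# the pipeline configuration described above.
tweets = spark.read.parquet("/dash/nlp/")
tweets.createOrReplaceTempView("tweet_sentiment")

spark.sql("""
    SELECT prediction, count(*) AS tweet_count
    FROM tweet_sentiment
    GROUP BY prediction
""").show()  # prediction 1.0 = positive sentiment, 0.0 = negative sentiment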

Structured Streaming

In the demo video, I've also shown how to create and run a structured streaming query in Databricks to auto-update the counts (the total number of positive and negative sentiment tweets) without having to manually refresh the source dataframe as new data flows in from the second pipeline.
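
A hedged sketch of such a streaming query, assuming the same DBFS location and Parquet format:

# Sketch of a structured streaming query over the same DBFS location; schema
# reuse, the memory sink, and the query name are assumptions for illustration.
static_df = spark.read.parquet("/dash/nlp/")

streaming_counts = (spark.readStream
                         .schema(static_df.schema)   # reuse the schema from a static read
                         .parquet("/dash/nlp/")
                         .groupBy("prediction")
                         .count())

query = (streaming_counts.writeStream
                         .outputMode("complete")
                         .format("memory")
                         .queryName("sentiment_counts")
                         .start())

# spark.sql("select * from sentiment_counts").show() then reflects new data as it arrives.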

Good News!

Based on my model and the data I've collected, there appears to be more positive sentiment than negative sentiment when it comes to the #quarantinelife hashtag. That is something to feel good about! 🙂

In all honesty and fairness though, it goes without saying that the model accuracy depends on the size and quality of the training and test datasets as well as feature engineering and hyperparameter tuning, which isn't exactly the point of this blog; rather, the point is to showcase how the StreamSets DataOps Platform can be used and extended for a variety of use cases.

Sample Pipelines on GitHub

If you’d like to get a head start and build upon these pipelines, you can download them from GitHub.

Streaming Data: Twitter to Kafka

After importing the sample pipeline, update Twitter credentials as well as pipeline parameters and other configuration details, such as the Kafka broker and Kafka topic, before running the pipeline.

Classification on Streaming Data: Kafka to Spark ML Model to Databricks

After importing the sample pipeline, update pipeline parameters and other configuration details, such as the Databricks cluster, Kafka broker, Kafka topic, and DBFS location of the model and output files, before running the pipelines.

Learn more about StreamSets for Databricks and StreamSets DataOps Platform which is available on Microsoft Azure Marketplace and AWS Marketplace.

The post Streaming Analysis Using Spark ML in StreamSets DataOps Platform appeared first on StreamSets.

Action in the Face of Uncertainty: Leading through Crisis


“These are the times that try men’s souls.” —Thomas Paine, The American Crisis

Many of you, reading this blog, weathered the systemic disruptions of the dot-com bubble burst and the great financial crisis. The current environment is fundamentally different. While scientists had previously identified the possibility of such a pandemic, a 2020 business plan could not with certainty have “planned” for the speed and scale of the economic disruption we are experiencing. 

In this time of uncertainty, when the extent and timing of COVID-19 impacts are unknown, rapid and frequent decision making is critical to crisis management. Business leaders have to take action in the fight against COVID-19, to protect their people and their organizations, and to prepare for the future. There is no precedent or consensus on the path forward, but here's a proposal:

  • Focus on What Matters Right Now
  • Make Better Decisions, Faster 
  • Prepare for What Comes Next

Leading through Crisis

What Matters Right Now

What matters right now is protecting the population and protecting the nation. And companies that have a role to play must step up to that challenge.  

At the beginning of the lockdown, we received a letter from one of our customers, a Fortune 100 food manufacturing company. StreamSets helps this customer ensure continuous operations of their data supply chain in support of the food manufacturing supply chain, and ultimately put food on the shelves of your local grocery store. Because of the critical role StreamSets plays in helping this company maintain the nation’s food supply, we have been designated as essential critical infrastructure by the U.S. Department of Homeland Security.  

The COVID-19 Open Research Dataset consists of more than 29,000 machine-readable coronavirus articles ripe for analysis. When a "COVID-19 task force" formed to address the data integration challenges of tracking and responding to the crisis, StreamSets dedicated engineers and technology to the cause. This consortium of data analytics technology partners and global systems integrators is building solutions that enable coronavirus test scheduling and administration, and data pipelines to integrate state and local government data with census, demographic, weather and other open data sets to enable better response planning. Now is the time for epidemiologists, physicians, and data scientists to analyze and learn from this information and shorten this crisis. We are proud to be of help, and do not take these responsibilities lightly.

Make Better Decisions Faster 

In recent research, Gartner found that 81% of decision makers believe that the quality of their decisions could be just as good with less spend, and 81% believe they could have made the same or higher-quality decisions in less time. (See Gartner Report.) We no longer have weeks or months to plan. We are making decisions in days or even hours.

At StreamSets, we have been applying DataOps principles since our inception. DataOps is the set of technologies and practices that ensures that the latest data is continuously available for decision making, even in the face of change. Knowing that your data is keeping up with the unprecedented pace and scale of change gives you confidence to make better decisions fast. 

Our Head of DataOps reports directly to me, and is responsible for understanding the needs of our business leaders, building data pipelines for application and data integration, creating dashboards, and enabling self-service for advanced analytics. But most importantly, our environment has been designed for resiliency. I can focus on the most important data points to make better decisions confidently, and reduce the time it takes to get that analysis. 

This Too Shall Pass

I am confident that, in the coming months, tests will become broadly available, we will figure out an immunization strategy, our children will go back to school, and we will go back to work. But the world will look different on the other side. Companies that will thrive in the post-coronavirus world are the ones that do what it takes to survive, and instead of settling for good enough, they take this opportunity to do things right.

For our part, StreamSets was founded on the belief that the next wave in data integration needed to focus on both speed and confidence in the face of data drift. Organizations that modernize data integration and adopt a DataOps mindset during this pandemic will ultimately emerge stronger, faster and better able to respond to the next challenge or crisis.

Conclusion

So here we are today, making real-time decisions in the midst of so much uncertainty. As CEO of StreamSets, I am focusing my attention on helping the global response to the pandemic, continuing to serve our customers, improving our decision making abilities, and applying it to things in our control. 

In the midst of upheaval two centuries ago, Thomas Paine reminded us that the "harder the conflict, the more glorious the triumph." He was right.

Check out #amazingdatastories for more about what’s working in the world today.

The post Action in the Face of Uncertainty: Leading through Crisis appeared first on StreamSets.

StreamSets Named in InsideBigData Impact 50 List


Every year the insideBigData team puts together a list of companies making the biggest impact on this space, and for the 3rd year in a row StreamSets is one of these 50 prominent disruptors. StreamSets is ranked #22 alongside notable ecosystem pillars like Nvidia, Snowflake, Cloudera, DataRobot, and Databricks. 

Opinions are abundant in today’s economy which is why insideBigData has taken a different approach. They use inference and machine intelligence to help choose these top players. According to the Impact 50 List:  

“The selected companies come from our massive data set of vendors and industry metrics. Yes, we use machine learning to analyze the industry in a detailed manner to determine a ranking for this list. We’re using a custom RankBoost algorithm adapted specifically for the big data community along with a plethora of proprietary data sources. The rankings include an indicator for upward movement in the list and also new companies.”

How cool is that? Using machine learning to decide the companies making the biggest impact in machine learning!

InsideBigData is a publication deeply entrenched in the ecosystem of converging topics including Big Data, data science, and machine learning. None of the columnists are closer to the pulse of these ecosystems than managing director and practicing data scientist, Daniel Gutierrez. Daniel's experience dates back to the days when these topics first came en vogue and his opinion involves both important context and evolutionary movement.

While it is hard to dive into the mind of a finely tuned algorithm, this last year has seen StreamSets become not only a de facto tool for ingestion into big data platforms but also a critical capability in developing and operating Apache Spark native applications.

This year at DataOps Summit (a conference dedicated to the people, processes, and technology enabling agile data movement) companies like Shell talked about how StreamSets is enabling their data science teams to develop self-service data access and the impact it had in accelerating exploratory data science and machine learning. Also, with the recent addition of StreamSets Transformer and custom processing stages, users are able to design pipelines that can perform machine learning scoring directly in the pipeline.

As StreamSets tackles the world’s most complex data problems, we hope to continue making a much needed impact with our customers, marketplace users, and the industry at large.

To read the full list of companies and rankings please visit the original article.

The post StreamSets Named in InsideBigData Impact 50 List appeared first on StreamSets.

Scenario 3: The Overnight Spike


Dramatic Demand Spike Requires Major Operational Changes

You've seen a spike in demand due to the pandemic. Your organization is stretched to the extreme while making major operational changes and ensuring the safety of everyone involved: your employees, your customers and your partners. Here are four things to keep in mind with respect to your data practice as you push your delivery capabilities to the max.

This is the final article in our series: 3 Scenarios for Adjusting Your Data Practice to Business During the Pandemic

Data holds the key to the new products and solutions your constituents clamor for.

Of course, you need data to make better decisions, and you need it faster than ever given the massive ramp up your organization has undertaken. Data is critical to enabling governments and healthcare organizations to track pandemic statistics and plan their response. Data can reveal shifts in demand patterns so you can figure out where to deliver your products and how to package them (such as the shift in demand for flour from restaurants and bakeries to grocery stores serving home bakers). It can help analyze candidate pools to let you recruit more effectively if you’ve got to radically ramp up staffing. 

For many organizations, data and data science will drive new products or solutions innovation. The most obvious example is all the epidemiological and genomic data feeding the research underway in the race to develop COVID-19 tests, treatments and vaccines. But there are many other realms where data science, AI or machine learning can drive new or enhanced product offerings. For example, edtech companies can analyze how new categories of teachers and students are utilizing their platforms, and put in software enhancements to deliver online learning better suited for these new audiences, which appeared overnight.

The StreamSets DataOps Platform is designed for the modern practice of delivering data rapidly and continuously with confidence in a world of constant change— DataOps. It’s all about going fast, and keeping everything going no matter what changes come at you. The pace of change is at a level never seen before, and designing for change with a DataOps mindset can be the thing that sets apart those who will rise to the demands in these times, and those who will fall apart.

Reduce friction in adopting the latest-and-greatest data infrastructure.

A Harvard Business Review article “How to survive a recession and thrive afterward” recommends investing in technology during a downturn, even if that may sound counter-intuitive. To support new demands from the business, you may need to rapidly adopt new data platforms that can provide the horsepower or functionality you now need. You may realize that your legacy systems simply can’t scale, and you absolutely need to shift to a cloud platform that can seamlessly scale and burst up to meet demand spikes. Or you may find that you need to adopt a Spark engine, such as Databricks, to do the massive data processing demanded by your data scientists, or your AI and machine learning algorithms. The trick is how to adopt new platforms quickly while keeping the existing infrastructure going– you’re building the airplane while in flight.

StreamSets supports all the key data platforms, and the StreamSets DataOps platform is fundamentally architected to enable portability across them. So if your goal is to extend the capabilities of your existing infrastructure investments, StreamSets makes it easy to get new data sources into those platforms and transformed to be fit-for-purpose. If you’re moving to cloud data platforms, StreamSets can accelerate the migration, and ensure you keep on-premises and cloud platforms in sync.

Boost productivity, reduce ramp-up time.

To support major growth in new demand, first, your existing data team has to do more than incrementally improve productivity to keep the lights on and meet new business demands. The right data integration tooling can boost productivity of your data team by an order of magnitude. The ultimate goal is self-service, having data integration tools so easy-to-use that the developers, data engineers and data scientists who are closest to business are able to access the data themselves. Second, tooling that abstracts away the complexities of coding languages and implementation details can make it easier to hire people from a broader applicant pool. You don’t have time to find “ninja” experts in languages like Scala or PySpark, or deep experts in the details of a platform like Azure Synapse. 

StreamSets’ easy-to-use, visual tools greatly increase the productivity of your developers and data engineers, regardless of which types of data pipelines they are building. Prebuilt support for a breadth of data patterns, from streaming to ETL to CDC, and data platforms, from Oracle to Hadoop to Databricks enables a modern data integration practice designed for today’s data workloads and platforms. By abstracting away the complexity of platforms like Spark, your team members can easily ramp up in new areas without having to be Scala or PySpark coding ninjas.

Go fast without breaking things.

You’ve got to go fast. Really, really fast. Adopting new platforms or tools. Building new pipelines. Adding new data sources. Hiring new people. But with so many things changing so fast, and no time to go through traditional change management processes, you need to architect in resiliency to change so that the lights don’t go out and data isn’t lost in the midst of all the hustle. Data drift detection and handling, and having full operational visibility to how data moves in real time, is the key to preventing data loss and data flow breakages as teams feverishly build new data pipelines or make changes to data platforms.

StreamSets DataOps Platform enables you to build fully instrumented pipelines that give you real-time operational visibility into how your data is flowing. Our unique drift detection and handling capabilities minimize the risk of outages or data loss. That way you can move fast and not worry about things breaking.

How are you adapting to The Overnight Spike? Share your story with us: #amazingdatastories.

Final Thoughts

However things are playing out for you, you’re probably facing unanticipated challenges– both personal and professional. We can’t fix the macroeconomic environment, but we can all apply data to make better decisions to see our companies, teams and employees through these times in the best way possible. And data will be key to eventually discovering the treatments and vaccines needed to end the pandemic. We do hope that we can help you, the data leaders, experts and practitioners, step up to meet the demands of your organizations, your teams, and society. Many StreamSetters are already helping state and federal governments access and analyze COVID-19 data to help plan their response, and others are volunteering their time as engineers and data scientists to support COVID-19 research.

Take care and let us know if we can help you in any way.

The post Scenario 3: The Overnight Spike appeared first on StreamSets.

Scenario 2: Temporary Hibernation


Maintain Essential Operations When Demand Drops

Your business has dropped significantly. You're still operating but with major downsizing or other changes. First, I'm so sorry that you and your employees are going through such extraordinary difficulties. Here are some things to keep in mind with respect to the role of data in your efforts to keep going.

This is the 3rd article in our series: 3 Scenarios for Adjusting Your Data Practice to Business During the Pandemic

Pinpoint where to cut, what to preserve with data.

Good, data-driven decision making is critical, so you can pinpoint the most valuable parts of the business to shore up, as well as the cuts needed to survive. What business lines are still seeing demand? What operational changes are working? Where can some additional costs be cut? Which vendor payments can be postponed or renegotiated? While this can be emotionally gut-wrenching, having access to the right data will help leaders make the tough calls.

From the outset, the goal of StreamSets has been to get data as fast as possible into the hands of those who need it. In times like these, hours matter. Dollars matter. Organizations that rely on the StreamSets DataOps Platform have been able to deliver big cost savings and big reductions in development time.

Slash your data infrastructure costs.

You’ve got to squeeze every penny out of your existing data infrastructure, consolidate as much as possible and wring out any waste. Depending on your situation, that may mean staying on legacy platforms and postponing modernization efforts. It also may mean the opposite. For example, if you have already started migrating to lower-cost, cloud-based platforms, accelerate the move so you can eliminate legacy systems and their costs.

StreamSets supports all the key data platforms, and the StreamSets DataOps platform is fundamentally architected to enable portability across them. So if your goal is to wring the most value possible out of existing infrastructure investments, StreamSets makes it easy to get new data sources into those platforms and transformed to be fit-for-purpose. If you’re moving to cloud data platforms as a way to lower cost and down-scale, StreamSets can accelerate the migration to reap those savings sooner.

Maximize productivity of whoever you still have.

If you’ve had to freeze or reduce headcount, you have to increase the productivity of those who remain, in spite of the emotional trauma they are likely experiencing. People who are shouldering more work than before require cross-training to cover areas and projects which have now fallen in their laps. Tools that can help data teams be more productive and cross over into new areas can be a game changer.

StreamSets’ easy-to-use, visual tools greatly increase the productivity of your developers and data engineers, regardless of which types of data pipelines they are building. Prebuilt support for a breadth of data patterns, from streaming to ETL to CDC, and data platforms, from Oracle to Hadoop to Databricks enables a single developer or engineer to handle all your different data workloads.

Turn on a dime.

You’ve got to make changes at lightning speed to pivot your business model, preserve cash, and protect whatever remaining jobs you can. Having full operational visibility into what is going on with your data can help you see where problems are, and make the call regarding which to fix and which to just let go. If you have resiliency and drift detection built into your data flows, you are less likely to experience data loss or outages even as you make overnight changes to your business processes or systems. 

Pipelines built with the StreamSets DataOps Platform are fully instrumented, giving you real-time operational visibility into how your data is flowing. Our unique drift detection and handling capabilities minimize the risk of outages or data loss as you make cuts and other changes. That way you can move fast and minimize the risk of unexpected breakages.

How is your organization going into Temporary Hibernation? Share your coping story with us: #amazingdatastories.

The post Scenario 2: Temporary Hibernation appeared first on StreamSets.
