
Creating a Custom Origin for StreamSets Data Collector


Since writing tutorials for creating custom destinations and processors for StreamSets Data Collector (SDC), I’ve been looking for a good use case for a custom origin tutorial. It’s been trickier than I expected, partly because the list of out-of-the-box origins is so extensive, and partly because the HTTP Client origin can access most web service APIs, rendering a custom origin redundant. Then, last week, StreamSets software engineer Jeff Evans suggested Git. Creating a custom origin to read the Git commit log turned into the perfect tutorial.

“Why?” I hear you ask. Well, there are many reasons:

  • Git is familiar to most developers
  • The Git commit log is an ordered sequence of entries, each with a unique identifier – the commit hash
  • JGit offers an easy way to read the commit log, either in its entirety, or across a range of entries
  • It’s easy to create a repository, and add commits, to test the origin
  • Git is free – and who doesn’t love free?

If you’ve been wondering how to get started writing a custom origin, then wonder no more, head on over to the article, Creating a Custom StreamSets Origin, and get started, today!
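To give you a taste of that third point, here’s a minimal, standalone JGit sketch – not the tutorial’s actual origin code, and the repository path is just a placeholder – that walks a repository’s commit log:

import java.io.File;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.revwalk.RevCommit;

public class CommitLogReader {
  public static void main(String[] args) throws Exception {
    // Placeholder path - point this at any local Git repository
    try (Git git = Git.open(new File("/path/to/repo"))) {
      // git.log().call() returns the commit log, newest entries first
      for (RevCommit commit : git.log().call()) {
        System.out.println(commit.getName() + " "          // the commit hash
            + commit.getAuthorIdent().getName() + " "      // the author
            + commit.getShortMessage());                    // first line of the message
      }
    }
  }
}

The real origin does more, of course – tracking offsets so a pipeline can resume from the last commit it saw – but the JGit calls at its heart are no more complicated than this.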

The post Creating a Custom Origin for StreamSets Data Collector appeared first on StreamSets.


Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks


I’m frequently asked, ‘How does StreamSets Data Collector (SDC) integrate with Spark Streaming? How about on Databricks?’ In this blog entry, I’ll explain how to use SDC to ingest data into a Spark Streaming app running on Databricks, but the principles apply to Spark apps running anywhere.

Databricks is a cloud-based data platform powered by Apache Spark. You can spin up a cluster, upload your code, and run jobs via a browser-based UI or REST API. Databricks integrates with Amazon S3 for storage – you can mount S3 buckets into the Databricks File System (DBFS) and read the data into your Spark app as if it were on the local disk. With this in mind, I built a simple demo to show how SDC’s S3 support allows you to feed files to Databricks and retrieve your Spark Streaming app’s output.

I started with the works of Shakespeare, split into four files – the comedies, histories, poems and tragedies. I wanted to run them through a simple Spark Streaming word count app, written in Python, running remotely on Databricks. Here’s the app:

# Databricks notebook initializes sc as Spark Context

from pyspark.streaming import StreamingContext
from uuid import uuid4

batchIntervalSeconds = 10

inputDir = '/mnt/input/shakespeare/'
outputDir = '/mnt/output/counts/'

def creatingFunc():
  ssc = StreamingContext(sc, batchIntervalSeconds)

  def saveRDD(rdd):
    if not rdd.isEmpty():
      rdd.saveAsTextFile(outputDir + uuid4().hex)

  lines = ssc.textFileStream(inputDir)
  counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda a, b: a + b)
  counts.foreachRDD(saveRDD)

  return ssc

# Start the app
ssc = StreamingContext.getActiveOrCreate(None, creatingFunc)
ssc.start()

# Wait a few seconds for the app to get started
ssc.awaitTerminationOrTimeout(batchIntervalSeconds * 2)

If you’ve already run a simple word count app like this yourself, you’ll have noticed that it considers punctuation part of the adjacent word – for example, the final word of “To be or not to be,” will be counted as “be,” rather than “be”. This is pretty annoying when you’re analyzing the results, so let’s use SDC to remove punctuation as we ingest the text, as well as removing any empty lines:

local-dir-to-s3

The pipeline reads files from a local directory and writes to an S3 bucket that is mounted in DBFS at /mnt/input. Note that you must start the Spark Streaming app before you move any data to its input directory, as it will ignore any preexisting files. I mounted a second S3 bucket to /mnt/output; the word count app writes its results to the output directory; a second pipeline retrieves files from the output S3 bucket, writing them into Hadoop FS:

s3-to-hdfs

With the app and both pipelines running, the system works like this:

  • I drop one or more files into a local directory
  • The first SDC pipeline reads them in, removing punctuation and empty lines, and writes the resulting text to S3
  • The Spark Streaming app reads the text from DBFS, runs word count, and writes its results to DBFS
  • The second SDC pipeline reads files as they appear in S3, writing them to Hadoop FS

databricksflow

I walk through the process in this short video:

 

Databricks offers a Community Edition as well as a trial of their full platform; StreamSets Data Collector is open source and free to download, so you can easily replicate this setup for yourself.

What data are you processing with Spark Streaming? Let us know in the comments!

The post Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks appeared first on StreamSets.

Building an Amazon SQS Custom Origin for StreamSets Data Collector


As I explained in my recent tutorial, Creating a Custom Origin for StreamSets Data Collector, it’s straightforward to extend StreamSets Data Collector (SDC) to ingest data from pretty much any source. Yogesh Choudhary, a software engineer at consulting and services company Clairvoyant, just posted his own walkthrough of building a custom origin for Amazon Simple Queue Service (SQS). Yogesh does a great job of walking you through the process of creating a custom origin project from the Maven archetype, building it, and then adding the Amazon SQS functionality. Read more at Creating a Custom Origin for StreamSets.
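I won’t reproduce Yogesh’s code here, but to give you an idea of what such an origin wraps, here’s a rough sketch of an SQS receive/delete loop using the AWS SDK for Java (v1). The queue URL is a placeholder, and this is not the walkthrough’s actual implementation:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class SqsPollSketch {
  public static void main(String[] args) {
    // Placeholder queue URL; credentials and region come from the default provider chain
    String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue";
    AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    // Receive a batch of messages - a custom origin would turn each body into an SDC record
    for (Message message : sqs.receiveMessage(queueUrl).getMessages()) {
      System.out.println(message.getBody());
      // Delete the message once it has been safely handed off downstream
      sqs.deleteMessage(queueUrl, message.getReceiptHandle());
    }
  }
}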

The post Building an Amazon SQS Custom Origin for StreamSets Data Collector appeared first on StreamSets.

Calling External Java Code from Script Evaluators


When you’re building a pipeline with StreamSets Data Collector (SDC), you can often implement the data transformations you require using a combination of ‘off-the-shelf’ processors. Sometimes, though, you need to write some code. The script evaluators included with SDC allow you to manipulate records in Groovy, JavaScript and Jython (an implementation of Python integrated with the Java platform). You can usually achieve your goal using built-in scripting functions, as in the credit card issuing network computation shown in the SDC tutorial, but, again, sometimes you need to go a little further. For example, a member of the StreamSets community Slack channel recently asked about computing SHA-3 digests in JavaScript. In this blog entry I’ll show you how to do just this from Groovy, JavaScript and Jython.

Scripting on the JVM

The SDC script evaluators are implemented using the Java Scripting API defined by JSR-223, the specification for how scripts run in a JVM. It’s the Java Scripting API that enables SDC to expose objects to your scripts such as records, state and log. The Java Scripting API also allows your scripts to access arbitrary Java code from external JARs.
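SDC’s own plumbing is more involved, but the mechanism is easy to see in isolation. Here’s a minimal JSR-223 sketch – plain Java, not SDC code – that binds a Java object into a script engine and lets the script call it; this is the same trick that puts records, state and log in scope for your evaluator scripts:

import java.util.HashMap;
import java.util.Map;

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class Jsr223Sketch {
  public static void main(String[] args) throws Exception {
    // Nashorn ships with JDK 8; the same API hosts Groovy and Jython engines
    ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");

    // Expose a Java object to the script, much as SDC exposes records, state and log
    Map<String, Object> state = new HashMap<>();
    engine.put("state", state);

    // The script calls methods on the bound Java object directly
    engine.eval("state.put('greeting', 'hello from JavaScript')");
    System.out.println(state.get("greeting"));  // hello from JavaScript
  }
}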

Let’s take SHA-3 as our example. The Bouncy Castle Crypto APIs for Java include SHA-3, and much more. It’s a snap to compute a SHA-3 hash in Java – just include the Bouncy Castle JAR and do:

import org.bouncycastle.jcajce.provider.digest.SHA3;
import org.bouncycastle.util.encoders.Hex;

// One time
SHA3.DigestSHA3 sha3 = new SHA3.DigestSHA3(256);

// As often as you like - just reset sha3 before each digest
sha3.reset();
byte[] digest = sha3.digest(inputString.getBytes("UTF-8"));
System.out.println(Hex.toHexString(digest));

Running this on an input of "abc" results in a hex-encoded digest value of "3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532".
Let’s see how to do the same from Groovy, JavaScript and Jython.

Calling External Java Code from Groovy

The first step is to follow the SDC documentation for including external libraries. Establish a base directory for external libraries – I’ll use /opt/sdc-extras as an example, but you can put it anywhere you like as long as it’s outside SDC’s directory tree. Where the documentation tells you to create a specific subdirectory for the stage, you’ll need to create /opt/sdc-extras/streamsets-datacollector-groovy_2_4-lib/lib/. Make sure you edit SDC’s environment and security policy configuration files to set STREAMSETS_LIBRARIES_EXTRA_DIR and the security policy for external libraries. Note also that, if you’re starting SDC as a service, you should set the STREAMSETS_LIBRARIES_EXTRA_DIR environment variable in libexec/sdcd-env.sh, otherwise, if you’re running bin/streamsets dc interactively, set it in libexec/sdc-env.sh.

Now download the Bouncy Castle provider jar file (currently bcprov-jdk15on-155.jar) and put it in the Groovy external libs subdirectory, /opt/sdc-extras/streamsets-datacollector-groovy_2_4-lib/lib/. Restart SDC, and create a test pipeline.

Drag a Dev Raw Data Source onto the pipeline canvas, and configure it with JSON data format and the following raw data:

{ "data" : "abc" }
{ "data" : "" }
{ "data" : "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq" }
{ "data" : "abcdefghbcdefghicdefghijdefghijkefghijklfghijklmghijklmnhijklmnoijklmnopjklmnopqklmnopqrlmnopqrsmnopqrstnopqrstu" }

 

Dev Raw Data Origin

Now add a Groovy Evaluator and paste in this script:

import org.bouncycastle.jcajce.provider.digest.SHA3

// Only need a single SHA3 instance
if (!state['sha3']) {
  state['sha3'] = new SHA3.DigestSHA3(256)
}

SHA3.DigestSHA3 sha3 = state['sha3'];

for (record in records) {
  try {
    // Need to reset the SHA3 instance for every field we digest
    sha3.reset()
    byte[] digest = sha3.digest(record.value['data'].getBytes("UTF-8"))
    record.value['digest'] = digest.encodeHex().toString()

    output.write(record)
  } catch (e) {
    // Write a record to the error pipeline
    log.error(e.toString(), e)
    error.write(record, e.toString())
  }
}

There are a few useful techniques here. First, we import the Bouncy Castle SHA3 class exactly like the Java sample. We can use the same SHA3 instance for every hash we compute, so we only need to create a single instance, and keep it in the state object. We retrieve the SHA3 instance from the state map before looping through the records, to minimize the work done in the loop. Since we’re reusing the same SHA3 instance, we need to reset it before computing a new hash. After that, it’s just a case of calling the digest() method on the input field’s bytes, hex encoding the resulting byte array, and putting it in the digest field of the record.

Preview the pipeline, click on the Groovy Evaluator, and you should see the digests of the various sample input values.

SHA-3 Digests

You can check the results against this handy list of SHA test vectors – thanks, DI Management!

Calling External Java Code from JavaScript

The process is very similar in JavaScript, with a couple of exceptions. The Bouncy Castle jar file needs to go in /opt/sdc-extras/streamsets-datacollector-basic-lib/lib/, since the JavaScript Evaluator is included in SDC’s basic stage library.

The JavaScript is very similar to the Groovy, except it uses the hex encoder from Bouncy Castle, since there is no native encoder in JavaScript.

// Only need single SHA3, Hex instances
if (!state.sha3 || !state.Hex) {
  var DigestSHA3 = Java.type('org.bouncycastle.jcajce.provider.digest.SHA3.DigestSHA3');

  state.sha3 = new DigestSHA3(256);
  state.Hex = Java.type('org.bouncycastle.util.encoders.Hex');
}

var sha3 = state.sha3;
var Hex = state.Hex;

for(var i = 0; i < records.length; i++) {
  var record = records[i];

  try {
    // Need to reset the message digest object for every field!
    sha3.reset();
    var digest = sha3.digest(record.value['data'].getBytes('UTF-8'));
    record.value.digest = Hex.toHexString(digest);

    output.write(record);
  } catch (e) {
    // Send record to error
    error.write(record, e);
  }
}

I used the same techniques here as in the Groovy example – caching objects in state, assigning local variables to minimize the work done in the record loop, and resetting the sha3 instance for each digest.

The Java SE 8 documentation provides further information on calling Java from JavaScript.

Calling External Java Code from Jython

Again, the process is very similar to Groovy. Put the Bouncy Castle jar file in /opt/sdc-extras/streamsets-datacollector-jython_2_7-lib/lib/, and use the following script:

from org.python.core.util import StringUtil
from org.bouncycastle.jcajce.provider.digest.SHA3 import DigestSHA3
import binascii

# Only need a single SHA3 instance
if ('sha3' not in state):
  state['sha3'] = DigestSHA3(256)

sha3 = state['sha3']

for record in records:
  try:
    # Need to reset the message digest object for every field!
    sha3.reset()
    digest = sha3.digest(StringUtil.toBytes(record.value['data']))
    record.value['digest'] = binascii.hexlify(digest)

    output.write(record)

  except Exception as e:
    # Send record to error
    error.write(record, str(e))

The principles are exactly the same as for Groovy and JavaScript – use state to cache long-lived objects, minimize processing within the loop, and remember to reset the SHA3 digest before each use.

You can find more information on calling Java from Jython in the Jython User Guide.

Performance

With the same functionality implemented in three script evaluators, a natural question is, “Which is fastest?” I added a Trash Destination to each pipeline and ran a quick test on my 8GB MacBook Air. To remove any effects of loading the script engines into memory I first started the pipeline, let it run for a minute, and stopped it. I then restarted it and let it run for a second minute, measuring the record throughput from the second run. Figures are in records/second – don’t consider this a scientific test of SDC performance – my heavily loaded laptop is not a representative testbed; rather, focus on the relative numbers for the three evaluators:

Evaluator      Records/second
Groovy         1000
JavaScript      750
Jython          650

 

A clear win for Groovy, likely due to its tighter coupling to the JVM.

For comparison, I coded a Custom Processor in Java to do the same job; it was able to process 1800 records/second. See the custom processor tutorial if you want to go down this road.
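I won’t walk through the whole processor here, but its core looks something like the sketch below – a hedged illustration rather than the exact code I benchmarked, and it omits the stage annotations and configuration a deployable processor also needs:

import java.nio.charset.StandardCharsets;

import com.streamsets.pipeline.api.Field;
import com.streamsets.pipeline.api.Record;
import com.streamsets.pipeline.api.StageException;
import com.streamsets.pipeline.api.base.SingleLaneRecordProcessor;

import org.bouncycastle.jcajce.provider.digest.SHA3;
import org.bouncycastle.util.encoders.Hex;

public class Sha3Processor extends SingleLaneRecordProcessor {
  // One digest instance per processor; reset it before each record
  private final SHA3.DigestSHA3 sha3 = new SHA3.DigestSHA3(256);

  @Override
  protected void process(Record record, SingleLaneBatchMaker batchMaker) throws StageException {
    sha3.reset();
    byte[] digest = sha3.digest(
        record.get("/data").getValueAsString().getBytes(StandardCharsets.UTF_8));
    record.set("/digest", Field.create(Hex.toHexString(digest)));
    batchMaker.addRecord(record);
  }
}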

External Script Code

A related topic is “How do I call external Groovy/JavaScript/Jython code from my script?” Since external jar files are already compiled into bytecode, performance is *much* better calling Java libraries than external script code – a JavaScript implementation of SHA-3 processed only 200 records/second on my laptop. Having said that, there are some rare occasions when the functionality you need is only available in your scripting language. I’ll cover external script code in a future blog entry.

Conclusion

StreamSets Data Collector’s use of the Java Scripting API allows you to call existing Java code from your scripts, running in any of the Groovy, JavaScript or Jython Evaluators. It’s straightforward to import and call Java code from any of the scripting languages, and, while native Java code gives you the ultimate in performance, the Script Evaluator gives you flexibility to iterate on your code much more quickly than a custom processor’s build/copy/restart/run loop, and has clear benefits if you’re more comfortable working in Groovy, JavaScript or Jython rather than Java. One final note – the code in this article was all developed and tested on JDK 8. If you are not yet running SDC on JDK 8, you should strongly consider migrating, as JDK 7 is in the end-of-life process, and deprecated in SDC as of version 2.2.0.0.

The post Calling External Java Code from Script Evaluators appeared first on StreamSets.

Data in Motion Evolution: Where We’ve Been…Where We Need to Go


Today we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data. Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.

Exerting firm control over data in motion is a critical competency which has become core to modern data operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.

First Generation: Stocking the Warehouse via ETL

Let’s roll back a couple decades. The first substantial data movement problems emerged in the mid-1990s with the enterprise data warehouse (EDW) trend. The goal was to move transaction data provided by disparate applications or residing in heterogeneous databases into a single location for analytical use by business units. Organizations operated a variety of applications, such as SAP, PeopleSoft and Siebel, as well as a variety of database technologies like Oracle, Sybase and IBM. As a result, there was no simple way to access and move data; each was a bespoke project requiring an understanding of vendor-specific schema and languages. The inability to “stock the data warehouse” efficiently led to EDW projects failing or becoming excessively expensive.

ETL emerged as the tooling to successfully load the warehouse by creating connectors for applications and databases. I refer to this first generation as “schema-driven ETL” because for each source one needed to specify and map every incoming field into the data warehouse. It was developer-centric, focused on pre-processing (aggregating and blending) data at scale from multiple sources to create a uniform data set, primarily for business intelligence (BI) consumption. Enterprises spent millions of dollars on these first-generation tools that allowed developers to move data without dealing with the myriad languages of custom applications, fueling the creation of a multi-billion dollar industry.

Second Generation: SaaS leads to iPaaS

Over time, consolidation of the database and application markets into a small number of mega-vendors created a more homogeneous world. Organizations began to wonder if ETL was even relevant, since the new world order had done away with the fragmentation that spawned its existence.

But a new challenge replaced the old. By the mid-2000s, the emergence of SaaS applications, led by Salesforce.com, added another layer of complexity. The new questions were:

  • How do we get cloud-based SaaS transaction data into warehouses?
  • How do we synchronize information across multiple SaaS applications?
  • Should we deploy data integration middleware in the cloud, on-premise or both?

As the SaaS delivery model proliferated, customer, product and other domain data became fragmented across dozens of different applications, usually with inconsistent data structures. Because SaaS applications are API-driven rather than language-driven, organizations faced a new challenge of rationalizing across the different flavors of APIs needed to send data between these various locations.

The SaaS revolution forced data movement technologies to evolve from analytic data integration, the sweet spot for data warehouses and ETL, to operational data integration, featuring data movement between applications.  The increased focus on operational use increased the pressure on the system to deliver trustworthy data quickly.

This new challenge led to the emergence of integration Platform-as-a-Service (iPaaS) as the second generation of tools for data in motion. These systems were provided by both legacy ETL vendors like Informatica and newcomers like Mulesoft and SnapLogic. They featured myriad API-based connectors, a focus on data quality and master data management capabilities, and the ability to subscribe to data in motion as a cloud-based service. But some of the old characteristics of ETL systems were retained, in particular a reliance on schema mapping and a focus on “citizen integrator” productivity, with less attention paid to the challenges of streaming data and continuous operations.

This second generation continues to contribute to the rapid growth of a multi-billion dollar industry.

The Need for a Third Generation

Of course, these first two generations were architected before the emergence of the Hadoop ecosystem and the big data revolution, which we are in the midst of today. Big data adds several new dimensions to the data movement problem:

  • Data drift from new sources, including log files, IoT sensor output, and clickstream data. Data drift occurs when these sources undergo unexpected mutations to their schema and/or semantics, usually as a result of an upgrade to the source system. Data drift, if not detected and dealt with, leads to data loss and corrosion which, in turn, pollutes downstream analysis and jeopardizes data-driven decisions.
  • The emergence of streaming interaction data – think clickstreams or social network activity – that must be classified as “events” and processed quickly. This is a higher order of complexity than transactional data, and these events tend to be highly perishable, requiring analysis as close as possible to event occurrence.
  • Data processing infrastructure that has become heterogeneous and complex. The big data stack is based on myriad open source projects, proprietary tools and cloud services. It  chains together numerous components from data acquisition to complex flows that cross various systems like message queues, distributed key-value stores, storage/compute platforms and analytic frameworks. Most of these systems fall under different administrative and operational governance zones which leads to a very complex maintenance schedule, multiple upgrade paths and more.

This combination of factors breaks data movement systems which were built for the needs of previous generations. They end up being too tightly coupled, too opaque and too brittle to thrive in the big data world.

Legacy data movement systems are tightly coupled in that they rely on knowledge of specific characteristics of the data sources and processing components they connect. This was reasonable in an era where the data infrastructure was upgraded infrequently and in concert – the so called (albeit painful) “galactic upgrade”. Today, a tightly coupled approach hampers agility by stopping the enterprise from taking advantage of new functionality and performance improvements to independent infrastructure components.

Opaqueness comes from the developer-centric approach of earlier solutions. Because the standard use case was batch movement of “slow data” from highly stable and well-governed sources, runtime visibility was not a high priority. Consuming real-time interaction data is an entirely new ballgame which requires continuous operational visibility.

Brittleness comes from the fact that schema is no longer static and, in some cases, does not exist. Processes built using schema-centric systems cannot be easily reworked in the face of data drift and tend to break unexpectedly.  When combined with tight coupling and operational opaqueness, brittleness can lead to data quality issues – delivering both false positives and false negatives to consuming applications.

Third Generation: Data in Motion Middleware

To address these new challenges, enterprises need a third-generation data in motion technology: middleware that can “performance manage” the flow of data by continuously monitoring and measuring the accuracy and availability of data as it makes its way from origin to destination.

Such a technology should have the following qualities:

It should be intent-driven rather than schema-driven. At design time only a minimally required set of conditions should be defined, not the entire schema. In a world of data drift, minimizing specification reduces the chance of data flows breaking and data loss.

It should detect and address data drift during operations, replacing brittleness and opaqueness with flexibility and visibility. Streaming data requires complete operational control over the data flow so that quality issues can be quickly detected and corrected, either automatically or through alerts and proactive remediation. A set-and-forget approach that forces reactive operations is simply insufficient.

Because modern data processing environments are more heterogeneous and dynamic, a third-generation solution must work as loosely coupled middleware, where components (origins, processors, destinations) are logically isolated and thus can be upgraded or swapped out independently of one another and the middleware itself. Besides affording the enterprise greater architectural agility, loose coupling also reduces risk of technology lock-in.

From Classical to Jazz

To employ music as a metaphor, if the move from ETL to iPaaS was like adding instruments to an orchestra, the evolution from iPaaS to DPM is akin to moving from classical music to improvisational jazz. The first transition changed old instruments for new, but you were still reading sheet music. The transition we now face throws out the sheet music and asks the band to make continual, subtle, and unexpected shifts in the composition while maintaining the rhythm.

In the new world of drifting data, real-time requirements and complex and evolving infrastructure, we must embrace a jazz-like approach, channel our inner Miles Davis, and shift to a third-generation mindset. Even though modern sources and data flows are dynamic and chaotic, they can still be blended to deliver beautiful insights.

The post Data in Motion Evolution: Where We’ve Been…Where We Need to Go appeared first on StreamSets.

Ingest Data into Splunk with StreamSets Data Collector


Splunk indexes and correlates log and machine data, providing a rich set of search, analysis and visualization capabilities. In this blog post, I’ll explain how to efficiently send high volumes of data to Splunk’s HTTP Event Collector via the StreamSets Data Collector Jython Evaluator. I’ll present a Jython script with which you’ll be able to build pipelines to read records from just about anywhere and send them to Splunk for indexing, analysis and visualization.

The Splunk HTTP Event Collector

The Splunk HTTP Event Collector (HEC) allows you to send data directly into Splunk Enterprise or Splunk Cloud over HTTP or HTTPS. To use HEC you must enable its endpoint (it is not enabled by default) and generate an HEC token. Applications can use the HEC token to POST event data to the HEC endpoint. Events are indexed on receipt and may be accessed via the Splunk browser interface.
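To make the protocol concrete before we wire up the pipeline, here’s a bare-bones Java sketch of posting a single event to HEC. The endpoint and token are placeholders; the SDC pipeline below does the equivalent work via the Jython Evaluator:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HecPostSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder endpoint and token - substitute your own values
    URL url = new URL("http://localhost:8088/services/collector");
    String token = "YOUR-HEC-TOKEN-VALUE";
    String payload = "{\"event\": {\"message\": \"hello from Java\"}}";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Authorization", "Splunk " + token);
    conn.setDoOutput(true);

    try (OutputStream out = conn.getOutputStream()) {
      out.write(payload.getBytes(StandardCharsets.UTF_8));
    }

    // HEC replies with HTTP 200 and a JSON body containing "code": 0 on success
    System.out.println("HTTP status: " + conn.getResponseCode());
  }
}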

Using StreamSets Data Collector with the Splunk HTTP Event Collector

Follow the Splunk documentation to enable HEC (if you are a managed Splunk Cloud customer, you must file a request ticket with Splunk Support) and generate a token. Save the token in the SDC resources directory:

$ echo -n YOUR-HEC-TOKEN-VALUE > /path/to/your/sdc/resources/splunkToken

Note the use of the -n option to omit the trailing newline character.

We’ll use SDC’s Jython Evaluator to send a single API request to Splunk for each batch of records. I’m going to use the pipeline from the SDC taxi transactions tutorial as an example, but you can adapt the same script for use in just about any pipeline.

Add a Jython Evaluator to your pipeline, copy the Jython script for Splunk from here, and paste it into the evaluator. I’ve attached a ‘Trash’ stage to the evaluator’s output, since I don’t need the records once they’re in Splunk.

Pipeline with Jython Splunk evaluator

Running the pipeline writes the 5000+ taxi transaction records to Splunk in just a few seconds; we can then query Splunk for the top credit card types for transactions with payment type of ‘CRD’:

Splunk Chart

The Splunk Jython Evaluator Script

The script uses several useful techniques; let’s take a closer look:

We’ll be using the Requests library to access Splunk, so we add its location to the system module search path, and import it:

import sys
# Set to wherever the requests package lives on your machine
sys.path.append('/Library/Python/2.7/site-packages')
import requests

You will need to configure the appropriate endpoint for HEC. For simplicity, I’m using HTTP, but you can also configure HTTPS:

# Endpoint for Splunk HTTP Event Collector
url = 'http://localhost:8088/services/collector'

HEC recognizes several metadata keys, defined in the Format events for HTTP Event Collector document:

# Splunk metadata fields
metadata = ['time', 'host', 'source', 'sourcetype', 'index']

Including credentials such as usernames and passwords in source code is a BAD THING, so we read the Splunk token from its resource file. We don’t want to do this for every batch, so we save it in the state object:

# Read Splunk token from file and cache in state
if state.get('headers') is None:
  state['headers'] = {'Authorization': 'Splunk ${runtime:loadResource("splunkToken", false)}'}

Now we initialize a buffer string, and loop through the records in the batch:

buffer = ''

# Loop through batch, building request payload
for record in records:

For each record, we build a payload dictionary containing the metadata keys, such as host and time, that Splunk recognizes. Any fields in the record with matching keys will be copied to the top level of the payload.

# Metadata fields are passed as top level properties
payload = dict((key, record.value[key]) for key in record.value if key in metadata)

The remainder of the fields are added to an event dictionary within the payload. Note that the entire content of the record is used ‘as-is’. If you want to rename or exclude fields you can do so via processors such as the Field Renamer and Field Remover, or manipulate them in the script as required.

# Everything else is passed in the 'event' property
payload['event'] = dict((key, record.value[key]) for key in record.value if key not in metadata)

Now the JSON representation of the payload is added to the buffer:

buffer += json.dumps(payload) + '\n'

We write the record to the processor’s output, so we could send it to a destination if we chose to do so:

# Write record to processor output
output.write(record)

If there is data in the buffer, we send it to Splunk, and decode the JSON response:

if len(buffer) > 0:
  # Now submit a single request for the entire batch
  r = requests.post(url,
                    headers=state['headers'],
                    data=buffer).json()

We need to check that Splunk correctly received the data, and raise an exception if it did not. This ensures that, in the event of an error, the pipeline will be stopped and the data can be reprocessed once the error is rectified:

# Check for errors from Splunk
if r['code'] != 0:
  log.error('Splunk error: {}: {}', r['code'], r['text'])
  raise Exception('Splunk API error {0}: {1}'.format(r['code'], r['text']))

Finally, we log the status message we received from Splunk:

# All is good
log.info('Splunk API response: {}', r['text'])

Conclusion

StreamSets Data Collector’s Jython evaluator allows you to efficiently integrate with APIs such as Splunk’s HTTP Event Collector, where you want to make a single HTTP request per batch of records. While the script presented above allows you to efficiently send records to Splunk, the same techniques can be used with any web service API.

The post Ingest Data into Splunk with StreamSets Data Collector appeared first on StreamSets.

Ingesting data into Couchbase using StreamSets Data Collector


Nick Cadenhead, a Senior Consultant at 9th BIT Consulting in Johannesburg, South Africa, uses Couchbase Server to power analytics solutions for his clients. In this blog entry, reposted from his article at LinkedIn, Nick explains why he selected StreamSets Data Collector for data ingest, and how he extended it with a custom destination to write data to Couchbase.

For some time, I have been working with the Couchbase NoSQL database solution and it’s been an interesting journey so far.

Historically, I’m not a database guy, so I’ve not worked much with databases in terms of designing, building and maintaining them as a full-time job. However, I do know the basics. This background has allowed me to get into the “mindset” of NoSQL concepts – no fixed schemas, no transactions, denormalization of data and more – without much conflict with the paradigms of the structured world of SQL and relational databases.

So during my sales engineering activities supporting Couchbase proof of concepts (POC) engagements, there is always a requirement to ingest data into a Couchbase bucket (think of a bucket as a relational database) in order to demonstrate and highlight the features and capabilities of Couchbase.

Usually data ingestion requires some code to be written to ingest data into Couchbase. Couchbase provides quite a few SDKs (Java, .Net, Node JS and more) for developers to enable their applications to use Couchbase.

So this got me thinking. Why can’t there be a standard way, or tool for that matter, to ingest data into Couchbase instead of writing code all the time?

Don’t get me wrong. There’s nothing wrong with writing code!!!

Then I came across StreamSets Data Collector (SDC).

SDC is an open source platform for the ingestion of streaming and batch data into big data stores. It features a graphical web-based console for configuring data “pipelines” to handle data flows from origins to destinations, monitoring runtime dataflow metrics and automating the handling of data drift.

Data pipelines are constructed in the web-based console via a drag and drop process. Pipelines connect to origins (sources) and ingest data into destinations (targets). Between origins and destinations are processor steps which are essentially data transformation steps for doing field masking, field evaluating, looking up data in a database or external cloud services such as Salesforce, evaluating expressions on fields, routing data, evaluating/manipulating data using JavaScript, Jython or Groovy, and many more.

Couchbase integration with StreamSets
A StreamSets Data Collector pipeline ingesting data into a Couchbase Bucket

Thus SDC is a great option for my data ingestion needs. It’s open source and available to download immediately. There are a large number of technologies supported for data ingestion ranging from databases to flat files, logs, HTTP services and big data platforms like Hadoop, MongoDB and cloud platforms like Salesforce. But there was one problem. Couchbase was not on the list of technology data connectors available for SDC. No problem! I decided to write my own connector for Couchbase.

Leveraging the Data Connector Java-based API available for the open community to extend the integration capabilities of SDC, together with the online documentation and guides, I was able to implement a data connector very quickly for Couchbase. The initial build of the connector is very simple; just ingest JSON data into a Couchbase bucket. Over time the connector will be expanded to query a Couchbase bucket, better ingestion capabilities and more. For now, it serves my needs.
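To give a sense of what a connector like this does under the hood, here is a minimal sketch using the Couchbase Java SDK 2.x – not the connector’s actual code, and the host and bucket name are placeholders:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class CouchbaseUpsertSketch {
  public static void main(String[] args) {
    // Placeholder host and bucket name
    Cluster cluster = CouchbaseCluster.create("localhost");
    Bucket bucket = cluster.openBucket("demo-bucket");

    // Build a JSON document and upsert it by key
    JsonObject content = JsonObject.create()
        .put("source", "StreamSets Data Collector")
        .put("type", "ingest-demo");
    bucket.upsert(JsonDocument.create("demo::1", content));

    cluster.disconnect();
  }
}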

One of the added benefits with SDC is data pipeline analytics. The analytics features in the SDC console give users an insight into how data is flowing from origins to destinations. The standard visualizations in the SDC console give detailed analysis on the performance of the data pipeline. The analysis of the pipeline showed me very quickly how my data was being ingested into the Couchbase Buckets and highlighted any errors which occurred throughout the stages of the data pipeline.

Data pipelines in SDC allow me to ingest data into Couchbase very quickly, without writing much code – or any at all.

The data connector is, of course, open source and available on GitHub.

Please feel free to contact me if you have any questions on StreamSets Data Collector, Couchbase or the code of the Couchbase connector.

Nick’s Couchbase destination currently targets StreamSets Data Collector API version 1.2.2.0. We at StreamSets are working with Nick to update it for the upcoming 2.3.0.0 release and ultimately add it to StreamSets Data Collector as a supported integration.

The post Ingesting data into Couchbase using StreamSets Data Collector appeared first on StreamSets.

Announcing Data Collector ver 2.3.0.0


We’re excited to release the next version of the StreamSets Data Collector. This release has 80+ new features and improvements, and 150+ bug fixes.

Multithreaded Pipelines

We’ve updated the SDC framework to allow individual pipelines to scale up on a single machine. This functionality is origin dependent. To start, we’ve designed a new HTTP Server origin that can ingest data concurrently across multiple listeners.

Over the next few releases, we will update some of the existing origins to allow them to use this functionality. Please vote here to help us prioritize what other workloads you’d like to see use such scale up capabilities.

Multitable Copy

You can use the new JDBC Multitable Consumer origin to select one or more tables in a database, by name or by wildcard, and have the system automatically copy the tables over to the destination.

This is very useful for workloads such as migrations of Enterprise Data Warehouses into Hadoop.

Also check out this handy tutorial on using StreamSets to replicate relational databases.

Update and Delete Operations

We’ve added Update and Delete operations to the framework. This means that many destinations now honor the operations listed in the sdc.operation.type record header. For example, all the Change Data Capture origins that can be used to stream changes to a database will automatically mark Updated or Deleted records as such, and those actions can be faithfully reproduced on destinations that support those operations.

This feature is also very useful for keeping your source Enterprise Data Warehouse synced with the Data Warehouse in your Hadoop system.
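If you write your own stages or scripts, the operation type is just a record header attribute. As a hedged illustration (check the documentation for the full list of codes in your version), a processor can read it like this:

import com.streamsets.pipeline.api.Record;

public class OperationTypeSketch {
  // Returns the CDC operation code carried by the record, or null if none is set.
  // SDC uses small integer codes (for example 1 for INSERT, 2 for DELETE, 3 for UPDATE);
  // see the documentation for the authoritative list.
  public static String operationOf(Record record) {
    return record.getHeader().getAttribute("sdc.operation.type");
  }
}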

IoT and API Gateway

The new HTTP Server origin is able to listen to HTTP post requests from IoT Devices and act as an endpoint for API calls. It is a multithreaded origin and can concurrently process data at high volumes.

MapR DB JSON Support

We’ve added support for reading from and writing to MapR DB using the JSON document model.

MongoDB CDC Support

This origin can read entries from the MongoDB OpLog which is used to store change information for data or database operations.

OAuth2 Support in HTTP Client

The HTTP Client origin and processor can now authenticate against APIs supporting OAuth2 for server-to-server applications. For example, you can connect to APIs within the Microsoft Azure ecosystem, Google applications, or any apps that support server-to-server OAuth2 or JWT.

Cluster Mode for Azure Data Lake Store Destination

The ADLS Destination can now be used in Hadoop – cluster mode. This is useful for use cases where you are trying to migrate large volumes of data from on-prem clusters to the Cloud.

HTTP API Support for Elasticsearch

We now support the Elastic HTTP API to deliver data into Elasticsearch.

Please note: The Elasticsearch destination requires you to run Data Collector on Java 8. Even if you don’t use this destination, we HIGHLY recommend you upgrade to Java 8 as soon as possible. We will officially drop support for Java 7 over the next few releases.

Rate Limiting while Transferring Whole Files

You can limit the bandwidth consumed by the Data Collector when transferring Whole Files. This is useful for on-prem to cloud migration use cases where you may have limited bandwidth in your Internet pipe and would want to constrain how much is allocated for this dedicated data transfer.

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.

The post Announcing Data Collector ver 2.3.0.0 appeared first on StreamSets.


Replicating Relational Databases with StreamSets Data Collector


StreamSets Data Collector has long supported both reading and writing data from and to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table. StreamSets Data Collector (SDC) 2.3.0.0 introduces the JDBC Multitable Consumer, a new pipeline origin that can read data from multiple tables through a single database connection. In this blog entry, I’ll explain how the JDBC Multitable Consumer can implement a typical use case – replicating an entire relational database into Hadoop.

Installing a JDBC Driver

The first task is to install the JDBC driver corresponding to your database. Follow the Additional Drivers documentation carefully; this is quite a delicate process and any errors in configuration will prevent the driver from being loaded.

In the sample below, I use MySQL, but you should be able to ingest data from any relational database, as long as it has a JDBC driver.

Replicating a Database

I used the MySQL retail_db database included in the Cloudera Quickstart VM as my sample data source. It contains 7 tables:

Although this is a simple schema, the sample database contains a significant amount of data – over a quarter of a million rows in total. My goal is to replicate the entire database – every table, every row – into an Apache Hive data warehouse.

I created a new pipeline, and dropped in the JDBC Multitable Consumer. The key piece of configuration here is the JDBC Connection String. In my case, this was jdbc:mysql://pat-retaildb.my-rds-instance.rds.amazonaws.com:3306/retail_db, but you’ll need to change this to match your database.

I used the default values for the remainder of the JDBC tab – see the documentation for an explanation of these items, in particular, batch strategy.

On the Tables tab, I used a single table configuration, setting JDBC schema to retail_db, and left the table name pattern with its default value, %, to match all tables in the database. You can create multiple table configurations, each with its own table name pattern, to configure whichever subset of tables you require. Since each of the tables in the sample database has a suitable primary key, I didn’t need to configure any offset columns, but you have the option to do so if necessary. The documentation describes table configurations in some detail, and is worth reading carefully.

One MySQL-specific note: the default transaction isolation level for MySQL InnoDB tables is REPEATABLE READ. This means that repeating the same SELECT statement in the same transaction gives the same results. Since the JDBC Multitable Consumer repeatedly queries MySQL tables, and we want to pick up changes between those queries, I set ‘Transaction isolation’ to ‘Read committed’ in the consumer’s Advanced tab.
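Setting this in the Advanced tab is the SDC equivalent of what you’d do by hand with JDBC – roughly the following, as a sketch rather than anything the origin literally executes:

import java.sql.Connection;
import java.sql.DriverManager;

public class IsolationSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connection string and credentials
    try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/retail_db", "user", "password")) {
      // READ COMMITTED lets repeated SELECTs on the same connection see newly committed rows,
      // which is what a repeatedly polling origin needs
      conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
    }
  }
}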

With the origin configured, I was able to preview data. I checked ‘Show Record/Field Header’ in Preview Configuration so I could see the attributes that the origin sets:

Note in particular the jdbc.tables attribute – every record carries with it the name of its originating table. Note also that, in preview mode, the origin reads only the first table that matches the configuration.

Transforming Data

When we load data from transactional databases into the data warehouse, we often want to filter out personally identifiable information (PII). Looking at the customers table, it has columns for first name, last name, email, password and street address. I don’t want those in the data warehouse – for my analyses, customer id, city, state and ZIP Code suffice.

mysql> describe customers;
+-------------------+--------------+------+-----+---------+----------------+
| Field             | Type         | Null | Key | Default | Extra          |
+-------------------+--------------+------+-----+---------+----------------+
| customer_id       | int(11)      | NO   | PRI | NULL    | auto_increment |
| customer_fname    | varchar(45)  | NO   |     | NULL    |                |
| customer_lname    | varchar(45)  | NO   |     | NULL    |                |
| customer_email    | varchar(45)  | NO   |     | NULL    |                |
| customer_password | varchar(45)  | NO   |     | NULL    |                |
| customer_street   | varchar(255) | NO   |     | NULL    |                |
| customer_city     | varchar(45)  | NO   |     | NULL    |                |
| customer_state    | varchar(45)  | NO   |     | NULL    |                |
| customer_zipcode  | varchar(45)  | NO   |     | NULL    |                |
+-------------------+--------------+------+-----+---------+----------------+
9 rows in set (0.03 sec)

I separated out customer records in the pipeline with a Stream Selector with a condition

${record:attribute('jdbc.tables') == 'customers'}

A Field Remover then stripped the PII from the customer records

Writing Data to Hive

The combination of the Hive Metadata Processor and the Hadoop FS and Hive Metastore destinations allow us to write data to Hive without needing to predefine the Hive schema. The Hive Metadata Processor examines the schema of incoming records and reconciles any difference with the corresponding Hive schema, emitting data records for consumption by the Hadoop FS destination and metadata records for the Hive Metastore destination. See Ingesting Drifting Data into Hive and Impala for a detailed tutorial on setting it all up.

Configuration of the three stages mostly involves specifying the Hive JDBC URL and Hadoop FS location, but there is one piece of ‘magic’: I set the Hive Metadata Processor’s Table Name to retaildb-${record:attribute('jdbc.tables')}. This tells the processor to use each record’s table name attribute, as we saw in the preview above, to build the destination Hive table name. I’m prefixing the Hive table names with retaildb- so I can distinguish the tables from similarly named tables that might come from other upstream databases.

Since I was building a sample system, I changed the Hadoop FS destination’s Idle Timeout from its default ${1 * HOURS} to ${1 * MINUTES} – I was more interested in being able to quickly see records appearing in Hive, rather than maximizing the size of my output files!

Refreshing Impala’s Metadata Cache

As it is now, this pipeline would read all of the data from MySQL and write it to Hive, but, if we were to use Impala to query Hive, we would not see any data. Because we are writing data directly into Hadoop FS and metadata into Hive, we need to send Impala the INVALIDATE METADATA statement so that it reloads the metadata on the next query. We can do this automatically using SDC’s Event Framework. I just followed the Impala Metadata Updates for HDS case study, adding an Expression Evaluator and Hive Query Executor to automatically send INVALIDATE METADATA statements to Impala when closing a data file or changing metadata. Note that you’ll have to specify the Impala endpoint in the Hive Query Executor’s JDBC URL – for my setup, it was jdbc:hive2://node-2.cluster:21050/;auth=noSasl.

Events

Once my pipeline was configured, I was able to run it. After a few minutes, the pipeline was quiet:

Ingest Multitable

I was able to check that all the rows had been ingested by running SELECT COUNT(*) FROM tablename for each table in both MySQL and Impala:

[node-2.cluster:21000] > show tables;
Query: show tables
+--------------------------+
| name                     |
+--------------------------+
| retaildb_categories      |
| retaildb_customers       |
| retaildb_departments     |
| retaildb_order_items     |
| retaildb_orders          |
| retaildb_products        |
| retaildb_shipping_events |
+--------------------------+
Fetched 7 row(s) in 0.01s
[node-2.cluster:21000] > select count(*) from retaildb_orders;
Query: select count(*) from retaildb_orders
Query submitted at: 2017-02-01 19:29:29 (Coordinator: http://node-2.cluster:25000)
Query progress can be monitored at: http://node-2.cluster:25000/query_plan?query_id=b14f2b075be0e3b1:3e87dc1600000000
+----------+
| count(*) |
+----------+
| 68883    |
+----------+
Fetched 1 row(s) in 4.30s
[node-2.cluster:21000] > select * from retaildb_orders order by order_id desc limit 3; 
Query: select * from retaildb_orders order by order_id desc limit 3
Query submitted at: 2017-02-01 19:30:05 (Coordinator: http://node-2.cluster:25000)
Query progress can be monitored at: http://node-2.cluster:25000/query_plan?query_id=f441e0ae44664199:13321e9700000000
+----------+---------------------+-------------------+-----------------+
| order_id | order_date          | order_customer_id | order_status    |
+----------+---------------------+-------------------+-----------------+
| 68883    | 2014-07-23 00:00:00 | 5533              | COMPLETE        |
| 68882    | 2014-07-22 00:00:00 | 10000             | ON_HOLD         |
| 68881    | 2014-07-19 00:00:00 | 2518              | PENDING_PAYMENT |
+----------+---------------------+-------------------+-----------------+
Fetched 3 row(s) in 0.32s

Continuous Replication

A key feature of the JDBC Multitable Consumer origin is that, like almost all SDC origins, it runs continuously. The origin retrieves records from each table in turn according to the configured interval, using a query of the form:

SELECT * FROM table WHERE offset_col > last_offset ORDER BY offset_col

In my sample, this will result in any new rows created in MySQL being copied across to Hive. Since I configured the Hadoop FS destination’s idle timeout to just one minute, I was able to quickly see data appear in the Hive table.
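Under the hood, each poll is conceptually similar to this JDBC sketch: remember the highest offset seen, and ask only for rows beyond it on the next pass. The table and column names come from the sample schema; this is an illustration, not the origin’s actual implementation:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OffsetPollSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connection string and credentials
    try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/retail_db", "user", "password")) {
      long lastOffset = 0L;  // SDC persists this so a pipeline can resume where it left off

      PreparedStatement stmt = conn.prepareStatement(
          "SELECT * FROM orders WHERE order_id > ? ORDER BY order_id");

      // A single polling pass; the origin repeats this per table at the configured interval
      stmt.setLong(1, lastOffset);
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          lastOffset = rs.getLong("order_id");  // advance the offset
          // ... turn the row into a record and pass it downstream ...
        }
      }
    }
  }
}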

This short video highlights the key points in configuring the use case, and shows propagation of new rows from MySQL to Hive:

https://www.youtube.com/watch?v=hFj-glriUX8

Conclusion

The JDBC Multitable Consumer allows a single StreamSets Data Collector pipeline to continuously ingest data from any number of tables in a relational database. Combined with the Hive Drift Solution you can quickly create powerful data pipelines to implement data warehouse use cases.

The post Replicating Relational Databases with StreamSets Data Collector appeared first on StreamSets.

Ingest Data into Azure Data Lake Store with StreamSets Data Collector


Azure Data Lake Store (ADLS) is Microsoft's cloud repository for big data analytic workloads, designed to capture data for operational and exploratory analytics. StreamSets Data Collector (SDC) version 2.3.0.0 included an Azure Data Lake Store destination, so you can create pipelines to read data from any supported data source and write it to ADLS.

Since configuring the ADLS destination is a multi-step process, our new tutorial, Ingesting Local Data into Azure Data Lake Store, walks you through adding SDC as an application in Azure Active Directory, creating a Data Lake Store, building a simple data ingest pipeline, and then configuring the ADLS destination with credentials to write to an ADLS directory.

The sample pipeline reads CSV-formatted transaction data, masks credit card numbers, and writes JSON records to files in ADLS, but, once you have mastered the basics, you'll be able to build more complex pipelines and write a variety of data formats. If you don't already use Azure, you can create a free Azure account, including $200 free credit and 30 days of Azure services. This is more than enough to complete the tutorial – I think I've used $0.10 of the allowance so far!

In this short video, I show how I combined the taxi data tutorial pipeline with the ADLS destination, then used Microsoft Power BI to visualize the data, reading it directly from ADLS (this Microsoft blog entry explains how to create the visualization).

https://www.youtube.com/watch?v=yCU8_tFnFag

 

Once you have your data in Azure Data Lake Store, you can use StreamSets Data Collector for HDInsight on Microsoft's fully-managed cloud Hadoop service. Watch for a future tutorial focusing on ingesting data from ADLS to Azure SQL Data Warehouse on HDInsight!

The post Ingest Data into Azure Data Lake Store with StreamSets Data Collector appeared first on StreamSets.

Running Scala Code in StreamSets Data Collector


The Spark Evaluator, introduced in StreamSets Data Collector (SDC) version 2.2.0.0, lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.

With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Transformer as a Spark Resilient Distributed Dataset (RDD). Your code can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination. Since Scala is Spark's ‘native tongue', it's well suited to the task of creating a Transformer. Here is the skeleton CustomTransformer class – you'll notice it's a bit briefer than the Java equivalent:

class CustomTransformer extends SparkTransformer with Serializable {
  var emptyRDD: JavaRDD[(Record, String)] = _

  override def init(javaSparkContextInstance: JavaSparkContext, params: util.List[String]): Unit = {
    // Create an empty JavaPairRDD to return as 'errors'
    emptyRDD = javaSparkContextInstance.emptyRDD
  }

  override def transform(recordRDD: JavaRDD[Record]): TransformResult = {
    val rdd = recordRDD.rdd

    val errors = emptyRDD

    // Apply a map to the incoming records
    val result = rdd.map((record)=> record)

    // return result
    new TransformResult(result.toJavaRDD(), new JavaPairRDD[Record, String](errors))
  }
}

The tutorial starts from this sample and extends it to compute the credit card issuing network (Visa, Mastercard, etc) from a credit card number, validate that incoming records have a credit card field, and even allow configuration of card issuer prefixes from the SDC user interface rather than being hardcoded in Scala.

If you've been looking to implement custom functionality in SDC, and you're a fan of Scala's brevity and expressiveness, work through the tutorial and let us know how you get on in the comments.

The post Running Scala Code in StreamSets Data Collector appeared first on StreamSets.

Announcing StreamSets Data Collector ver 2.4.0.0


We are happy to announce the newest version of StreamSets Data Collector is available for download. This short release has over 25 new features and improvements and over 50 bug fixes. This is an enterprise-focused release that addresses the needs of some of the world's largest organizations using StreamSets. Below is a short list of what's new, please check out the release notes for more details.

Multi-tenancy

To better enforce security standards for your data operations, we've introduced a set of features to enable multi-tenancy, including access control lists and support for groups within both StreamSets Data Collector and StreamSets Dataflow Performance Manager (DPM), our operations management environment. Enterprises can use this functionality to restrict access to pipelines, jobs or topologies to specific groups of users.

Setting up groups and access control lists in StreamSets Data Collector is easy and seamless, as it is integrated with the process of registering pipelines within DPM.

Support for Cloudera's Apache Kafka 2.1, CDH 5.10, and Kudu 1.x

StreamSets Data Collector now supports the latest versions of Kudu and the Cloudera distributions.

UI for installing external libraries

You no longer have to go looking through config and properties files to install database drivers. You can now install external libraries, such as database drivers or external Java libraries for the language processors, directly through the StreamSets Data Collector user interface. If you are automating installation through scripting, you can do the same using a REST API. Incidentally, when you run pipelines on a YARN cluster, the system will automatically copy all the requisite resource files to the nodes on the cluster; you don't have to do this manually.

Sending metrics to DPM without a message queue

If you want to send metric data into StreamSets Dataflow Performance Manager to enable long-term statistics monitoring of your pipelines, you no longer need to use a message queue. You can use our built-in RPC stages to send this data. This reduces the overhead for getting started with DPM.

In environments where you may have network outages and cannot afford to lose any metrics, you may still want to implement a message queue.

NOTE: Please upgrade to Java 8

If you are not already using Java 8, please plan to upgrade ASAP. With the next major release, 2.5.0.0, SDC will no longer run on Java 7.

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.

The post Announcing StreamSets Data Collector ver 2.4.0.0 appeared first on StreamSets.

Read and Write JSON to MapR DB with StreamSets Data Collector


MapR-DB is an enterprise-grade, high performance, NoSQL database management system. As a multi-model NoSQL database, it supports both JSON document models and wide column data models. MapR-DB stores JSON documents in tables; documents within a table in MapR-DB can have different structures. StreamSets Data Collector enables working with MapR-DB documents with its powerful schema-on-read and ingestion capability.

With StreamSets Data Collector, I’ll show you how easy it is to stream data from MongoDB into a MapR-DB table as well as stream data out of the MapR-DB table into MapR Streams.


In the example below, I will use MongoDB to capture CDC data from the oplog, cleanse and enrich the data in the documents, and persist them in a MapR-DB JSON table. I’ll also create another pipeline to read data from this JSON table and put the documents into a topic within MapR Streams for other downstream applications to consume.

Enabling oplog for CDC in MongoDB

To create a MongoDB database with CDC enabled, use a local instance of mongo and enable the oplog for this standalone mongod server. To enable the oplog, start the mongo server with the --master option as follows:

mongod --master --dbpath mongo_data

Open a terminal to work with the mongo server and create a database called retail_db:

mongo
use retail_db

Create a capped collection called pos_data:

db.createCollection("pos_data",{"capped":true,"size":10000})

Create a new pipeline, drop in a MongoDB Oplog origin, and configure the values shown below on the MongoDB tab:

Writing JSON Data to MapR-DB

Next, add the destination by selecting MapR-DB JSON from the drop down list.

Set the Table Name to match the collection name, that is, pos_data.

Check the option to create the table if it doesn’t exist in MapR-DB.

Documents inside MapR-DB must have a unique identifier stored in the _id field. I’m using the _id generated by MongoDB when documents are created in the collection for the _id field in the MapR-DB table.

With the pipeline configured with the origin and destination, and error records sent to ‘Discard’, validate the pipeline. This checks that the connection string specified is valid and a successful connection can be made to the mongo server. It also validates that the MongoDB database has the oplog specified, meaning the database is enabled for CDC.

Open a terminal and run mapr dbshell to work with MapR-DB JSON tables:

At this time, there is no data to preview in the pipeline since it’s a new collection, so we’ll just start the pipeline.

Insert a document into the pos_data collection in mongo:

db.pos_data.insert({"billing_address": {"address1": "478 Avila Village Apt. 671", "address2": "Suite 079", "country": "Niue", "company": "Wang, Day and Sanders", "city": "Heatherton"}, "buyer_accepts_marketing": false, "shopper": {"username": "george12", "name": "Lisa Jones", "birthdate": "1977-12-07", "sex": "F", "address": "14036 Corey Lake\nHendrixbury, OR 70750", "mail": "brianthompson@gmail.com"}, "cart_token": "d76d0e12-de1e-4d8b-980a-6ab892d57a5c", "fulfilment": {"fulfillable_quantity": 1, "total_price": 894.0, "grams": 306.51, "fulfillment_status": "fulfilled", "products": {"sku": "IPOD-342-N", "vendor": "Apple", "product_id": "6981718224312", "title": "IPod Nano", "requires_shipping": 1, "name": "IPod Nano - Pink", "variant_id": 4264112, "variant_title": "Pink", "quantity": 7.498855850621649}, "fulfillment_service": "amazon", "id": "7111705632496"}, "credit_card": {"card_expiry_date": "07/17", "description": "Carlson Group", "transaction_date": "11/29/2016", "purchase_amount": 894.0, "card_security_code": "346", "card_number": "4144406331250819"}})

Check the pipeline in StreamSets Data Collector. It picks up the document from the oplog and inserts it into MapR-DB.

Now, list the tables in mapr dbshell, and you can see that SDC created the pos_data table and inserted the document.

The document inserted is the full transaction log entry. I only want to insert the original document, which is captured in the o attribute. The transaction type is defined by op; it is handled automatically by the MapR-DB JSON destination's ‘Use MapR InsertOrReplace API’ property. If the same _id is encountered, SDC will replace the document in the MapR-DB JSON table, otherwise it will insert the document.
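
To see what those raw oplog entries look like, you can query the oplog directly in the mongo shell. A standalone server started with --master writes its oplog to local.oplog.$main (replica sets use local.oplog.rs); the field names below are the standard oplog attributes, with the values abbreviated for readability:

use local
db.getCollection("oplog.$main").find().sort({$natural: -1}).limit(1).pretty()

// A typical insert entry looks something like this:
// {
//   "ts" : Timestamp(1487922123, 1),   // operation timestamp (illustrative value)
//   "op" : "i",                        // i = insert, u = update, d = delete
//   "ns" : "retail_db.pos_data",       // database.collection
//   "o"  : { "_id" : ObjectId("..."), "billing_address" : { ... }, ... }
// }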

To capture the original document o to insert into the table, add a Field Remover in the pipeline to only keep o as follows:

I also added a Field Merger because I don’t want the data nested under a field named o; I want it to become the root document.

Update the Row Key in the destination to be /_id, as we’ve removed the parent field o.

Run a Preview to see what it will look like:

Delete the document from the pos_data table in MapR-DB:

Reset the Origin in the pipeline so that original record is processed again:

Start the pipeline and we should see that one record flowing through.

Query the pos_data table in mapr dbshell and notice how the document is inserted now:

Now let’s add three more documents to the mongo collection pos_data:

db.pos_data.insert({"billing_address": {"address1": "967 Jay Canyon", "address2": "Suite 111", "country": "Singapore", "company": "Bell PLC", "city": "South Kyle"}, "buyer_accepts_marketing": true, "shopper": {"username": "ashleyrussell", "name": "Miguel Cline", "birthdate": "1977-06-01", "sex": "M", "address": "9751 Myers Drive Apt. 650\nScottberg, OH 91053-5987", "mail": "tylerhughes@gmail.com"}, "cart_token": "c03b2b44-a7a8-4670-a411-d5772dcf3e33", "fulfilment": {"fulfillable_quantity": 1, "total_price": 479.31, "grams": 248.47, "fulfillment_status": "fulfilled", "products": {"sku": "IPOD-342-N", "vendor": "Apple", "product_id": "9502805730614", "title": "IPod Nano", "requires_shipping": 1, "name": "IPod Nano - Pink", "variant_id": 4264112, "variant_title": "Pink", "quantity": 3.4824490937721935}, "fulfillment_service": "manual", "id": "4719367439632"}, "credit_card": {"card_expiry_date": "06/17", "description": "Campbell, Kennedy and Lewis", "transaction_date": "11/29/2016", "purchase_amount": 479.31, "card_security_code": "729", "card_number": "5296253772628936"}})

db.pos_data.insert({"billing_address": {"address1": "55581 Swanson Loop Apt. 626", "address2": "Apt. 645", "country": "Guinea-Bissau", "company": "Green, Sullivan and Haney", "city": "Stevensview"}, "buyer_accepts_marketing": false, "shopper": {"username": "gloriacardenas", "name": "Marcus Sanchez", "birthdate": "1996-07-09", "sex": "M", "address": "440 Sanchez Park\nSouth Heidimouth, NE 57720", "mail": "manningdevon@yahoo.com"}, "cart_token": "7893ad04-5b51-4629-b54a-80dd2ed64b28", "fulfilment": {"fulfillable_quantity": 1, "total_price": 48.8, "grams": 412.43, "fulfillment_status": "fulfilled", "products": {"sku": "IPOD-342-N", "vendor": "Apple", "product_id": "0613704736870", "title": "IPod Nano", "requires_shipping": 1, "name": "IPod Nano - Pink", "variant_id": 4264112, "variant_title": "Pink", "quantity": 9.250352450141936}, "fulfillment_service": "manual", "id": "3891348777849"}, "credit_card": {"card_expiry_date": "09/24", "description": "Sweeney, Walsh and Berry", "transaction_date": "02/20/2017", "purchase_amount": 48.8, "card_security_code": "693", "card_number": "5579105578886415"}})

db.pos_data.insert({"billing_address": {"address1": "046 Duncan Knoll Suite 541", "address2": "Apt. 543", "country": "Andorra", "company": "Warren Inc", "city": "Williamshaven"}, "buyer_accepts_marketing": true, "shopper": {"username": "justin56", "name": "Darrell Nguyen", "birthdate": "1995-04-18", "sex": "M", "address": "728 Moore Squares\nPort Danafurt, AZ 93202", "mail": "thomasroberts@yahoo.com"}, "cart_token": "d0989da5-cf6a-4ca2-bcbc-714ff0039128", "fulfilment": {"fulfillable_quantity": 1, "total_price": 190.34, "grams": 383.64, "fulfillment_status": "fulfilled", "products": {"sku": "IPOD-342-N", "vendor": "Apple", "product_id": "4410352990489", "title": "IPod Nano", "requires_shipping": 1, "name": "IPod Nano - Pink", "variant_id": 4264112, "variant_title": "Pink", "quantity": 8.428226968389547}, "fulfillment_service": "fedex", "id": "2771204627833"}, "credit_card": {"card_expiry_date": "02/25", "description": "Johnson, Garcia and Melendez", "transaction_date": "02/20/2017", "purchase_amount": 190.34, "card_security_code": "9523", "card_number": "349693032739718"}})

Have a look at the running pipeline to see the counts increase:

Validate all 4 documents are in the MapR-DB table:

Now, let’s update an existing document in mongo:

db.pos_data.update({"_id":"58abe7cffbc0a523bbafcbd2"},{name: "Rupal", rating: 1},{upsert: true})

Once again, glance at the pipeline to see that the update is also picked up:

Query the MapR-DB table for the same _id, 58abe7cffbc0a523bbafcbd2:

The output verifies the Replace logic for the same id without having to do any lookup in the pipeline. The MapR-DB JSON destination handles the check for an existing id.

Tracking all MongoDB databases and collections in a single pipeline

The MongoDB oplog captures CDC data for all collections in all databases. Hence, one can easily create a MapR-DB table per database and collection just by parameterizing the output table name. The oplog captures the name of the database and collection in the ns field. Replacing the table name in the destination with ${record:value('/ns')} will ensure that each transaction log entry is routed to the corresponding MapR-DB table matching the database and collection name, keeping both MongoDB and the MapR-DB tables in sync.

Adding MapR Streams Destination

With the above use case, you could also just add a MapR Streams destination so that the same data flows through a topic for downstream processes. The topic name can also be parameterized to correspond to the MongoDB database and collection name as ${record:attribute('ns')}.

Reading from a MapR-DB table

Let’s assume that there is an existing MapR-DB table that is being fed by one or more processes and we now want to consolidate this data from the table for other processes downstream. We can easily create a pipeline to read from a MapR-DB JSON table. Here I’ll use the same pos_data table as an example. To query pos_data, add a MapR-DB JSON origin in the pipeline and specify the table name as pos_data.

Now we are going to route all this data to MapR Streams so that any downstream applications can get real-time data from the pos_data table. To create a MapR Stream, open a terminal that has a MapR client configured for the cluster you want to work with. Issue the following commands to create a stream and a topic:

maprcli stream create retail_stream -path /user/mapr/retail -produceperm p -consumeperm p -topicperm p
maprcli stream topic create -topic pos_data -path /user/mapr/retail
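
To confirm that the topic was created, you can list the topics on the stream (same path as above):

maprcli stream topic list -path /user/mapr/retail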

Configure MapR Streams Producer in StreamSets Data Collector pipeline as follows and validate the pipeline:

You can choose to write the data in any data format to the MapR Streams topic. Here we have selected JSON as the data format.

Run a preview to ensure we’re getting data from the table:

All looks good, so I start the pipeline and see all 4 records streaming through.

To validate the records in MapR Streams for pos_data topic, issue the following command:

mapr streamanalyzer -path /user/mapr/retail -topics pos_data

Conclusion

StreamSets Data Collector makes it easy to ingest data from any source into the MapR-DB document store, and out of MapR-DB to downstream applications, without writing a single line of code. It also allows easy ingestion from external document stores, keeping the document store in the MapR Hadoop system in sync with its source.

StreamSets Data Collector is fully open source. Feel free to try this out for yourself using a MapR Sandbox or even with your own MapR cluster. Click here to learn more about StreamSets and MapR integration.

The post Read and Write JSON to MapR DB with StreamSets Data Collector appeared first on StreamSets.

Drift Synchronization with StreamSets Data Collector and Azure Data Lake


One of the great things about StreamSets Data Collector is that its record-oriented architecture allows great flexibility in creating data pipelines – you can plug together pretty much any combination of origins, processors and destinations to build a data flow. After I wrote the Ingesting Local Data into Azure Data Lake Store tutorial, it occurred to me that the Azure Data Lake Store destination should work with the Hive Metadata processor and Hive Metastore destination to allow me to replicate schema changes from a data source such as a relational database into Apache Hive running on HDInsight. Of course, there is a world of difference between should and does, so I was quite apprehensive as I duplicated the pipeline that I used for the Ingesting Drifting Data into Hive and Impala tutorial and replaced the Hadoop FS destination with the Azure Data Lake Store equivalent.

It turned out, though, that my misgivings were unfounded and, with some careful configuration, my pipeline just worked! Here are the critical config properties for getting this working:

Hive Metadata Processor

Most of the configuration is as for the Hive/Impala tutorial, with the exception of:

Hive Tab

JDBC URL: jdbc:hive2://<your-hdinsight-cluster>.azurehdinsight.net:443/default;ssl=true;user=<username>;password=<password>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=/hive2

Create a JDBC URL using your HDInsight cluster's URL and the parameters shown above. You will need to use a suitable username and password for your HDInsight cluster.

Hadoop Configuration Directory: /path/to/local/copy/of/hadoop/config

SDC needs to access various Hadoop settings, so you'll need to copy the Hadoop config files (including hive-site.xml) from an HDInsight head node to a local directory

Hive Metastore Destination

Again, set the destination up as for the Hive/Impala tutorial, except:

Hive Tab

JDBC URL: This should be the same as for the Hive Metadata Processor

Hadoop Configuration Directory: Again, copy this from the Hive Metadata Processor configuration

Azure Data Lake Store Destination

Configure the Data Lake tab exactly as in the Azure Data Lake Store tutorial.

Output Files Tab

Directory in Header: Unchecked

Directory Template: /clusters/<cluster-name>${record:attribute('targetDirectory')}

The Directory Template must start with the Root Path of the HDInsight cluster, followed by the targetDirectory record attribute. Be careful not to insert a slash between the two – at present this will trigger an error in the Azure Data Lake client library.

Use Roll Attribute: Checked

Roll Attribute Name: roll

The Roll Attribute tells the ADLS destination to ‘roll' the output file – removing the _tmp_ prefix from the current output file and opening a new _tmp_ file for writing.

Data Format Tab

Data Format: Avro

Avro Schema Location: In Record Header

With this configuration, the Hive Metadata processor and Hive Metastore and Azure Data Lake Store destinations ‘just work'. Watch the pipeline in action in this short video:

https://www.youtube.com/watch?v=LsCwkmNZsSM

Conclusion

StreamSets Data Collector's modular, record-based architecture provides great flexibility when creating data pipelines. Download SDC today, and get started building your own data flows!

The post Drift Synchronization with StreamSets Data Collector and Azure Data Lake appeared first on StreamSets.

Transform Data in StreamSets Data Collector


I've written quite a bit over the past few months about the more advanced aspects of data manipulation in StreamSets Data Collector (SDC) – writing custom processors, calling Java libraries from JavaScript, Groovy & Python, and even using Java and Scala with the Spark Evaluator. As a developer, it's always great fun to break out the editor and get to work, but we should be careful not to jump the gun. Just because you can solve a problem with code, doesn't mean you should. Using SDC's built-in processor stages is not only easier than writing code, it typically results in better performance. In this blog entry, I'll look at some of these stages, and the problems you can solve with them.

I've created a sample pipeline showing an extended use case using the Dev Raw Data origin and Trash destination so it will run no matter your configuration. You'll need SDC 2.4.0.0 or higher to import it. Run preview, and you'll see exactly what's going on. Experiment with the configuration and sample data to see how you can apply the same techniques to your data transformations.

We'll start with some company data in JSON format such as you might receive from an API call and proceed through a series of transformations resulting in records that are ready to be inserted in a database.

Field Pivoter

The Field Pivoter can split one record into many, ‘pivoting' on a collection field – that is, a list, map, or list-map. This is particularly useful when processing API responses, which frequently contain an array of results.

For example, let's say we have the following input:

{
  "status": 0,
  "results": [
    {
      "name": "StreamSets",
      "address" : {
        "street": "2 Bryant St",
        "city": "San Francisco",
        "state": "CA",
        "zip": "94105"
      },
      "phone": "(415) 851-1018"
    },
    {
      "name": "Salesforce",
      "address" : {
        "street": "1 Market St",
        "city": "San Francisco",
        "state": "CA",
        "zip": "94105"
      },
      "phone": "(415) 901-7000"
    }
  ]
}

We want to create a record for each result, with name, street, etc. as fields. Configure a Field Pivoter to pivot on the /results field and copy all of its fields to the root of the new record (/).

Field Pivoter

Let's take a look in preview:

Pivoted Fields

Field Flattener

Data formats such as Avro and JSON can represent hierarchical structures, where records contain fields that are themselves collections of other fields. For example, each record in our use case now has the following structure:

{
    "name": "StreamSets",
    "address": {
      "street": "2 Bryant St",
      "city": "San Francisco",
      "state": "CA",
      "zip": "94105"
    },
    "phone": "(415) 851-1018"
}

Many destinations, such as relational databases, however, require a ‘flat' record, where each field is simply a string, integer, etc. The Field Flattener, as its name implies, flattens the structure of the record. Configuration is straightforward – specify whether you want to flatten the entire record, or just a specific field, and the separator you would like to use. For example:

Field Flattener

 

Applying this to the sample data above results in a record with fields address.street, address.city, etc:

Field Flattener

 

Field Renamer

So we've got a nice flat record structure, but those field names don't match the columns in the database. It would be nice to rename those fields to street, city, etc. The Field Renamer has you covered here. Now, you could explicitly specify each field in the address, but that's a bit laborious, not to mention brittle in the face of data drift. We want to be able to handle new fields, such as latitude and longitude, appearing in the input data without having to stop the pipeline, reconfigure it, and restart it. Let's specify a regular expression to match field names with the prefix address.

Field Renamer

That regular expression – /'address\.(.*)' – is a little complex, so let's unpack it. The initial / is the field path – we want to match fields in the root of the record. We quote the field name since it contains what, for SDC, is a special character – a period. We include that period in the prefix that we want to match, escaping it with a backslash, since the period has a special meaning in regular expressions. Finally, we use parentheses to delimit the group of characters that we want to capture for the new field name: every remaining character in the existing name.

The target field expression specifies the new name for the field – a slash (its path), followed by whatever was in the delimited group. This was all a precise, if slightly roundabout, way to say “Look for fields that start with address. and rename them with whatever is after that prefix.”
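
Putting it together, the Field Renamer configuration looks something like this; the target expression uses $1 to refer to the captured group (exact option labels may vary slightly between SDC versions):

Source Field Expression: /'address\.(.*)'
Target Field Expression: /$1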

Let's see it in action on the output from the Field Flattener:

Field Flattener

Field Splitter

Our (imaginary!) destination needs street number separately from street name. No problem! The Field Splitter was designed for exactly this situation. It works in a similar way to the renamer, but on field values rather than names.

Field Splitter

We're splitting the /street field on sequences of one or more whitespace characters. This will ensure that 2  Bryant St (with two spaces) will be treated the same as 2 Bryant St (with a single space). We put the results of the split operation into two new fields: /street_number and /street_name. If we can't split the /street field in two, then we send the record to the error stream; if there are more than two results, we want to put the remaining text, for example, Bryant St in the last field. Finally, we want to remove the original /street field.
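
In the Field Splitter configuration, that translates to settings along these lines (option labels are approximate; check them against your SDC version):

Field to Split: /street
Separator: \s+
New Split Fields: /street_number, /street_name
Not Enough Splits: Send to Error
Too Many Splits: Put Remaining Text in Last Field
Remove Unsplit Field: Checked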

The results are as you might expect:

Split Fields

Conclusion

Here is the final pipeline, that pivots, flattens, renames and splits fields in the incoming data:

Field Pipeline

I mentioned performance earlier. As a quick, albeit unscientific, test, I ran this pipeline on my MacBook Air for a minute or so; it processed about 512 records/second. I coded up the same functionality in Groovy, ran it for a similar length of time, and it only processed about 436 records/second.

Download StreamSets Data Collector, follow the installation docs, import the field manipulation pipeline, and try it out. Let us know in the comments if you find a (generally useful) transformation that we can't do ‘out of the box'!

The post Transform Data in StreamSets Data Collector appeared first on StreamSets.


Installing StreamSets Data Collector on Amazon Web Services EC2


Mike Fuller, a consultant at Red Pill Analytics, recently wrote Stream Me Up (to the Cloud), Scotty, a tutorial on installing StreamSets Data Collector (SDC) on Amazon Web Services EC2. Mike's article takes you all the way from logging in to a fresh EC2 instance to seeing your first pipeline in action. We're reposting it here courtesy of Mike and Red Pill.

I’ve had some fun working with StreamSets Data Collector lately and wanted to share how to quickly get up and running on an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance and build a simple pipeline.

For anyone unaware, StreamSets Data Collector is, in their own words, a low-latency ingest infrastructure tool that lets you create continuous data ingest pipelines using a drag and drop UI within an integrated development environment (IDE).

To be able to follow along, you should have enough working knowledge of AWS to create and start an AWS EC2 instance and to create and access an AWS Simple Storage Service (S3) bucket. That being said, these instructions also apply, for the most part, to any Linux installation.

The most important prerequisite is to have access to an instance that meets the StreamSets installation requirements outlined here. I’m running an AWS Red Hat Enterprise Linux (RHEL) t2.micro instance for this demo; you will no doubt want something with a little more horsepower if you intend to use your instance for true development.

It is important to note that this is just one of many ways to install and configure StreamSets Data Collector. Make sure to check out the StreamSets site and read through the documentation to determine which method will work best for your use case. Now that the basics (and a slew of acronyms) are covered, we can get started.

Fire up the AWS EC2 instance and log in. I’m running on a Mac and using the built-in terminal; I recommend PuTTY or something similar for folks running Windows.

ssh ec2-user@<ec2-public-ip> -i /<path_to_pem>/<name_of_pem>.pem

Install wget, if you haven’t already.

sudo yum install wget

Create a new directory for the StreamSets download and navigate to the new directory

sudo mkdir /home/ec2-user/StreamSets
cd /home/ec2-user/StreamSets

Download StreamSets Data Collector using wget. The URL below is for the version 2.4.1 RPM install, but a new version of Data Collector is due in the next few days (those guys and gals move quickly!). Be sure to check for the latest version on the StreamSets Data Collector website.

Look for the download here:

You’ll want to right-click Full Download (RPM) and select ‘Copy Link’. Replace the link in the command below with the latest and greatest.

sudo wget https://archives.streamsets.com/datacollector/2.4.1.0/rpm/streamsets-datacollector-2.4.1.0-all-rpms.tgz

Extract the StreamSets Data Collector install files.

sudo tar -xzf /home/ec2-user/StreamSets/streamsets-datacollector-2.4.1.0-all-rpms.tgz

Install StreamSets using yum/localinstall

sudo yum localinstall /home/ec2-user/StreamSets/streamsets-datacollector-2.4.1.0-all-rpms/streamsets*

Attempting to start the service fails, revealing that there is one step remaining.

sudo service sdc start
ERROR: sdc has died, see log in '/var/log/sdc'.

Note the File descriptors: 32768 line item in the installation requirements

Running ulimit -n shows 1024; this needs to be ≥ 32768.

ulimit -n
1024

To increase the limit, edit /etc/security/limits.conf by navigating to /etc/security

cd /etc/security

As good habits dictate, make a copy of the limits.conf file

sudo cp limits.conf orig_limits.conf

Edit the limits.conf file

sudo vi limits.conf

Add the following two lines at the end of the file, setting the limits to a value greater than or equal to 32768

*               hard    nofile          33000
*               soft    nofile          33000

Log out of the AWS machine and log back in for the changes to take effect

Check that the changes were successful by running ulimit -n

ulimit -n
33000

Start the SDC service

sudo service sdc start

This message may show up:

Unit sdc.service could not be found.

The service will start fine. If you’re annoyed enough by the message, stop the service, run the command below, and start the service again

sudo systemctl daemon-reload

One last thing before we get to building a pipeline. Create a new subdirectory under the streamsets-datacollector directory to store a sample data file.

sudo mkdir /opt/streamsets-datacollector/SampleData

Create a sample file.

sudo vi /opt/streamsets-datacollector/SampleData/TestFile.csv

Enter the following records and save the file.

Rownum,Descr
1,Hello
2,World

The StreamSets Data Collector user interface (UI) is browser-based. In order to access the UI from your local machine, set up an SSH tunnel to forward port 18630, the port StreamSets runs on, to localhost:18630. Replace the appropriate IP address and .pem information.

ssh -N -p 22 ec2-user@<ec2-public-ip> -i /<path_to_pem>/<name_of_pem>.pem -L 18630:localhost:18630

On the local machine, open a browser, type or paste the following URL and press enter.

http://localhost:18630/

The StreamSets login page should now be displayed. The initial username/password are admin/admin.

In the following steps, we’ll create a pipeline that streams the data from the TestFile.csv file created in the steps above to Amazon S3. First, create a new pipeline and give it a name.

Add an origin and a destination. For this example, I have selected the origin Directory — Basic and the destination Amazon S3 — Amazon Web Services 1.10.59

Notice that the pipeline will display errors until all required elements are configured.

Configure the pipeline Error Records to Discard (Library: Basic)

To set up the origin, under Files, configure the File Directory and File Name Pattern fields to /opt/streamsets-datacollector/SampleData and *.csv, respectively.

On the Data Format tab, configure the Data Format to Delimited and change the Header Line drop-down to With Header Line

Configure the Amazon S3 items for your Access Key ID, Secret Access Key, and Bucket.

Set the Data Format for S3; I’ve chosen Delimited but other options work just fine.

In the top right corner, there are options for previewing and validating the pipeline as well as to start the pipeline. After everything checks out, start the pipeline.

The pipeline is alive and moved the data!

A review of S3 shows that the file has been created and contains the records that were created in the steps above. Notice that the pipeline is in ‘Running’ status and will continue to stream data from the directory as changes are made or *.csv files are added.

There you have it. This is just a basic pipeline to move a .csv file from one location to another without any data manipulation and is simply the tip of the iceberg. There is an abundance of options available to manipulate the data as well as technologies that StreamSets Data Collector integrates with that go far beyond this example. Happy streaming!

Many thanks to Mike for taking the time to document his experience with StreamSets Data Collector! 

If you're using SDC and would like to share your experiences with our community, please let me know in the comments, or via Twitter at @metadaddy.

The post Installing StreamSets Data Collector on Amazon Web Services EC2 appeared first on StreamSets.

StreamSets Data Collector v2.5 Adds IoT, Spark, Performance and Scale


We’re thrilled to announce version 2.5 of StreamSets Data Collector, a major release which includes important functionality related to the Internet of Things (IoT), high-performance database ingest, integration with Apache Spark and integration into your enterprise infrastructure.  You can download the latest open source release here.

This release has over 22 new features, 95 improvements and 150 bug fixes.

Reduce the Cost of Internet of Things

This release includes new connectors for MQTT and WebSockets. Both are offered as origins or destinations that allow you to use SDC as your IoT ingestion gateway, potentially saving you substantial money vs. proprietary solutions, particularly metered solutions from cloud providers. Also included in this release is an HTTP Client destination that can write a batch full of data to an HTTP endpoint.

We will soon be adding other IoT gateway connectors such as OPC-UA and CoAP. If you'd like to see other IoT-related functionality, please feel free to tell us through this survey.

High-performance Database Ingest

We have added a multithreaded, multi-table JDBC origin to our latest release in order to speed ingest from databases. This greatly increases throughput and shortens the time it takes to re-platform your data warehouse to Hadoop.

Scale Up Processing

In the last release we introduced a scale-up architecture that allows pipelines to run in a multithreaded mode. With this release we’ve added a few more origins that can utilize this scale-up architecture: the Elasticsearch origin, the JDBC Multitable Consumer origin, the Kinesis Consumer and the WebSocket origin.

Spark Executor

As part of our Dataflow Trigger framework we have added a Spark Executor, which allows you to kick off anything from simple Spark jobs, such as format conversions and file compression, all the way to sophisticated machine learning algorithms. The Spark job can be run on YARN or the Databricks cloud.

StreamSets can greatly simplify developing ingest pipelines when using the Databricks cloud: the Spark Executor can trigger a Spark job, or even a Databricks notebook.

Cluster Mode Spark Evaluator

The Spark Evaluator in StreamSets lets you inject Spark within the pipeline. If you are writing algorithms in Spark for machine learning, you need not worry about the semantics or plumbing of reading and writing data from a myriad of sources and destinations; the pipeline will give your Spark job an RDD to work with, and will similarly read out an RDD and guarantee delivery to whatever destinations you choose. When the pipeline is run in standalone mode, it uses Spark local mode, but if you run the pipeline in cluster streaming mode, it runs on the Spark cluster and your Spark transformation code will be able to see the data available to the entire cluster.

Elastic Origin for Easy Offload

The new Elasticsearch origin lets you get data out of an Elasticsearch index. Customers use this in architectures where they want to migrate data off Elasticsearch into other databases or lower-cost systems such as Hadoop, move to cloud-hosted systems, or archive data to the cloud.

Improved Integration

Two new features advance the ability of StreamSets Data Collector to be integrated into your data workflow. First, we have added the ability to pass parameters to a pipeline via an API or CLI, making it possible to have fine-grained control of pipelines from external scripts. Second, we have added support for alerting via webhooks, meaning that you can trigger actions on external systems based on system, data and drift alerts within the pipeline.

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.

The post StreamSets Data Collector v2.5 Adds IoT, Spark, Performance and Scale appeared first on StreamSets.

Making Sense of Stream Processing


There has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.

In this article, I'll propose a framework for organizing stream processing projects, and briefly describe each area. I’ll be focusing on organizing the projects into a conceptual model; there are many articles that compare the streaming frameworks for real-world applications – I list a few at the end.

The specific categories I’ll cover include stream processing frameworks, stream processing APIs, and streaming dataflow systems.

What is Stream Processing?

The easiest way to explain stream processing is in relation to its predecessor, batch processing. Much data processing in the past was oriented around processing regular, predictable batches of data – the nightly job that, during “quiet” time, would process the previous day’s transactions; the monthly report that provided summary statistics for dashboards, etc. Batch processing was straightforward, scalable and predictable, and enterprises tolerated the latency inherent in the model – it could be hours, or even days, before an event was processed and visible in downstream data stores.

As businesses demanded more timely information, batches grew smaller and were processed more frequently. As the batch size tended towards a single record, stream processing emerged. In the stream processing model, events are processed as they occur. This more dynamic model brings with it more complexity. Often, stream processing is unpredictable, with events arriving in bursts, so the system has to be able to apply back-pressure, buffer events for processing, or, better yet, scale dynamically to meet the load. More complex scenarios require dealing with out-of-order events, heterogeneous event streams, and duplicated or missing event data.

While batch sizes were shrinking, data volumes grew, along with a demand for fault tolerance. Distributed storage architectures blossomed with Hadoop, Cassandra, S3 and many other technologies. Hadoop’s file system (HDFS) brought a simple API for writing data to a cluster, while MapReduce enabled developers to write scalable batch jobs that would process billions of records using a simple programming model.

MapReduce was a powerful tool for scaling up data processing, but its model turned out to be somewhat limiting; developers at UC Berkeley’s AMPLab created Apache Spark, improving on MapReduce by providing a wider variety of operations beyond just map and reduce, and allowing intermediate results to be held in memory rather than stored on disk, greatly improving performance. Spark also presented a consistent API whether running on a cluster or as a standalone application. Now developers could write distributed applications and test them at small scale – even on their own laptop! – before rolling them out to a cluster of hundreds or thousands of nodes.

The trends of shrinking batch sizes and rising data volumes met in Spark Streaming, which adapted the Spark programming model to micro-batches by time-slicing the data stream into discrete chunks. Micro-batches provide a compromise between larger batch sizes and individual event processing, aiming to balance throughput with latency. Moving to the limit of micro-batching, single-event batches, Apache Flink provides low-latency processing with exactly-once delivery guarantees.

Fast-forward to today and Flink and Spark Streaming are just two examples of streaming frameworks. Streaming frameworks allow developers to build applications to address near real-time analytical use cases such as complex event processing (CEP). CEP combines data from multiple sources to identify patterns and complex relationships across various events. One example of CEP is analyzing parameters from a set of medical monitors, such as temperature, heart rate and respiratory rate, across a sliding time window to identify critical conditions, such as a patient going into shock.

To a large extent, the various frameworks present similar functionality: the ability to distribute code and data across a cluster, to configure data sources and targets, to join event streams, to deliver events to application code, etc. They differ in the ways they do this, offering trade-offs in latency, throughput, deployment complexity, and so on.

Streaming frameworks and APIs are aimed at developers, but there is a huge audience of data engineers looking for higher-level tools to build data pipelines – the plumbing that moves events from where they are generated to where they can be analyzed. Streaming dataflow systems such as StreamSets Data Collector and Apache NiFi provide a browser-based UI for users to design pipelines, offering a selection of out-of-the-box connectors and processors, plus extension points for adding custom code.

Stream Processing Frameworks

There are at least seven open source stream processing frameworks. Most are under the Apache banner, and each implements its own streaming abstraction with trade-offs in latency and throughput.

In terms of mindshare and adoption, Apache Spark is the 800-pound gorilla here, but each framework has its adherents. There are trade-offs in terms of latency, throughput, code complexity, programming language, etc., across the different frameworks, but they all have one thing in common: they all provide an environment in which developers can implement their business logic in code.

As an example of the developer’s-eye view of stream processing frameworks, here’s the word count application from the Spark documentation, the streaming equivalent of ‘Hello World’:

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

The streaming frameworks offer power and flexibility for coding streaming applications for use cases such as CEP, but have a high barrier to entry – only developers need apply.

Stream Processing APIs

The streaming frameworks differ in aspects such as event processing latency and throughput, but have many functional similarities – they all offer a way to operate on a continuous stream of events, and they all offer their own API. In addition, stream processing API abstractions offer another level of abstraction above the frameworks’ own APIs, allowing a single app to run in a variety of environments.

A great example of an API abstraction is Apache Beam, which originated at Google as an implementation of the Dataflow model. Beam presents a unified programming model, allowing developers to implement streaming (and batch!) jobs that can run on a variety of frameworks. At present, there are Beam ‘Runners’ for Apex, Flink, Spark and Google’s own Cloud Dataflow.

Beam’s minimal word count example (stripped of its copious comments for space!) is not that different from the Spark code, even though it’s in Java rather than Scala:

public static void main(String[] args) {
  PipelineOptions options = PipelineOptionsFactory.create();

  Pipeline p = Pipeline.create(options);

  p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
   .apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
                     @ProcessElement
                     public void processElement(ProcessContext c) {
                       for (String word : c.element().split("[^a-zA-Z']+")) {
                         if (!word.isEmpty()) {
                           c.output(word);
                         }
                       }
                     }
                   }))

   .apply(Count.perElement())
   .apply("FormatResults", MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
                     @Override
                     public String apply(KV&lt;String, Long&gt; input) {
                       return input.getKey() + ": " + input.getValue();
                     }
                   }))
   .apply(TextIO.Write.to("wordcounts"));
  p.run().waitUntilFinish();
}

So, Beam gives developers some independence from the underlying streaming framework, but you’ll still be writing code to take advantage of it.

Kafka Streams is a more specialized stream processing API. Unlike Beam, Kafka Streams provides specific abstractions that work exclusively with Apache Kafka as the source and destination of your data streams. Rather than a framework, Kafka Streams is a client library that can be used to implement your own stream processing applications, which can then be deployed on top of cluster frameworks such as Mesos. Kafka Connect is connectivity software that bridges the gap between Kafka and a range of other systems, with an API for building reusable source and sink connectors.
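
To give a sense of the programming model, here's a minimal word count sketch using the Kafka Streams DSL. It assumes a recent Kafka Streams release and illustrative topic names ("text-input" and "word-counts"); the DSL available when this was written differed slightly, so treat it as the shape of the API rather than a drop-in example:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Arrays;
import java.util.Properties;

public class StreamsWordCount {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();

    // Read lines from an input topic, split them into words, count per word,
    // and write the running counts to an output topic
    KStream<String, String> lines = builder.stream("text-input");
    lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
         .groupBy((key, word) -> word)
         .count()
         .toStream()
         .to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
  }
}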

Streaming Dataflow Systems

Stream processing frameworks and APIs allow developers to build streaming analysis applications for use cases such as CEP, but can be overkill when you just want to get data from some source, apply a series of single-event transformations, and write to one or more destinations. For example, you might want to read events from web server log files, look up the physical location of each event’s client IP, and write the resulting records to Hadoop FS – a classic big data ingest use case.

Apache Flume was created for exactly this kind of process. Flume allows you to configure data pipelines to ingest from a variety of sources, apply transformations, and write to a number of destinations. Flume is a battle-tested, reliable tool, but it’s not the easiest to set up. The user interface is not exactly friendly, as shown here:

# Sample Flume configuration to copy lines from
# log files to Hadoop FS

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /Users/pat/flumeSpool

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.useLocalTimeStamp = true

For a broader comparison of Flume to StreamSets Data Collector, see this blog entry.

StreamSets Data Collector (SDC) and Apache NiFi, on the other hand, each provide a browser-based UI to build data pipelines, allowing data engineers and data scientists to build data flows that can execute over a cluster of machines, without necessarily needing to write code. Although SDC is not an Apache-governed project, it is open source and freely available under the same Apache 2.0 license as NiFi, Spark, etc.

This pipeline, taken from the SDC tutorial, reads CSV-formatted transaction data from local disk storage, computes the credit card issuing network from the credit card number, masks all but the last 4 digits of the credit card number, and writes the resulting data to Hadoop:

Tutorial-RunPipeline

Of course, not every problem can be solved by plugging together prebuilt processing stages, so both SDC and NiFi allow customization via scripting languages such as Scala, Groovy, Python and JavaScript, as well as their common implementation language, Java.

Aside from UI, another aspect of the evolution of stream processing tools from Flume to NiFi and SDC is distributed processing. Flume has no direct support for clusters – it’s up to you to deploy and manage multiple Flume instances and partition data between them. NiFi can run either as a standalone instance or distributed via its own clustering mechanism, although one might expect NiFi to transition to YARN, the Hadoop cluster resource manager, at some point.

SDC can similarly run standalone, as a MapReduce job on YARN, or as a Spark Streaming application on YARN and Mesos. In addition, to incorporate stream processing into a pipeline, SDC includes a Spark Evaluator, allowing developers to integrate existing Spark code as a pipeline stage.

So What Should I Use?

Selecting the right system for your specific workload depends on a host of factors ranging from the functional processing requirements to service level agreements that must be honored by the solution. Some general guidelines apply:

  • If you are implementing your own stream processing application from scratch for an analytical use case such as CEP, you should use one of the stream processing frameworks or APIs.
  • If your workload is on the cluster and you want to set up continuous data streams to ingest data into the cluster, using a stream processing framework or API may be overkill. In this case you are better off deploying a streaming dataflow system such as Flume, NiFi or SDC.
  • If you want to perform single-event processing on data already residing in the cluster, use SDC in cluster mode to apply transformations to records and either write them back to the cluster, or send them to other data stores.

In practice, we see enterprises using a mix of stream processing and batch/interactive analytics applications on the back end. In this environment, single-event processing is handled by a system like SDC, depositing the correct data in the data stores, keeping analytical applications supplied with clean, fresh data at all times.

Conclusion

The stream processing landscape is complex, but can be simplified by separating the various projects into frameworks, APIs and streaming dataflow systems. Developers have a wide variety of choices in frameworks and APIs for more complex use cases, while higher-level tools allow data engineers and data scientists to create pipelines for big data ingest.

References

This article focuses on the big picture and how all of these projects relate to each other. There are many articles that dive deeper, providing a basis for selecting one or more technologies for evaluation. Here are a few, in no particular order:

The post Making Sense of Stream Processing appeared first on StreamSets.

Creating a Custom Multithreaded Origin for StreamSets Data Collector


Multithreaded Pipelines, introduced a couple of releases back in StreamSets Data Collector (SDC) 2.3.0.0, enable a single pipeline instance to process high volumes of data, taking full advantage of all available CPUs on the machine. In this blog entry I'll explain a little about how multithreaded pipelines work, and how you can implement your own multithreaded pipeline origin thanks to a new tutorial by Guglielmo Iozzia, Big Data Analytics Manager at Optum, part of UnitedHealth Group.

Multithreaded Origins

To take advantage of multithreading, a pipeline's origin must itself be capable of multithreaded operation. At present, in SDC 2.5.0.0, there are six such origins:

  • Elasticsearch – Reads data from an Elasticsearch cluster.
  • HTTP Server – Listens on a HTTP endpoint and processes the contents of all authorized HTTP POST requests.
  • JDBC Multitable Consumer – Reads database data from multiple tables through a JDBC connection.
  • Kinesis Consumer – Reads data from a Kinesis cluster.
  • WebSocket Server – Listens on a WebSocket endpoint and processes the contents of all authorized WebSocket requests.
  • Dev Data Generator – Generates random data for development and testing.

Each origin can spawn a configurable number of threads to read incoming data in parallel, the details varying across the origins. For example, the HTTP Server origin spawns multiple threads to enable parallel processing of data from multiple HTTP clients: as data from one client is being processed, the origin can accept a connection and process data from another client. Each thread spawned by the Kinesis Consumer origin, on the other hand, maintains a connection to a Kinesis shard, so incoming data from different shards is processed in parallel.

SDC creates a pipeline runner for each thread in the origin; each pipeline runner is responsible for managing its own copy of the processors and destinations that comprise the remainder of the pipeline. The origin threads each create batches of records, which can then be processed in parallel in the downstream pipeline stages. Conceptually, the multithreaded pipeline looks like this:

Multithreaded Pipeline

Multithreaded Origin Tutorial

As you might expect, creating a multithreaded origin is somewhat more complex than creating a ‘traditional' origin for SDC, but Guglielmo does a great job explaining the process in his tutorial, Creating a Custom Multithreaded StreamSets Origin. Guglielmo has kindly contributed his tutorial to the project, so, if you're looking to ingest data as efficiently as possible, take a look, and let us know what you come up with in the comments!
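
To give you a feel for the structure before you dive in, here's a minimal sketch of a multithreaded origin built around SDC's PushSource API: the origin reports how many threads it wants, and each thread builds its own batches and hands them to its pipeline runner. The class, package and thread-count values are purely illustrative, and the method names are approximations rather than a working implementation; treat the tutorial as the authoritative reference:

package com.example.origin;  // illustrative package name

import com.streamsets.pipeline.api.BatchContext;
import com.streamsets.pipeline.api.StageException;
import com.streamsets.pipeline.api.base.BasePushSource;

import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SampleMultithreadedSource extends BasePushSource {
  // In a real origin this would come from a stage configuration property
  private static final int NUM_THREADS = 4;

  @Override
  public int getNumberOfThreads() {
    // Tells SDC how many pipeline runners to create
    return NUM_THREADS;
  }

  @Override
  public void produce(Map<String, String> lastOffsets, int maxBatchSize) throws StageException {
    ExecutorService executor = Executors.newFixedThreadPool(NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
      executor.submit(() -> {
        while (!getContext().isStopped()) {
          // Each thread creates its own batches and hands them to its pipeline runner
          BatchContext batchContext = getContext().startBatch();
          // ... create records and add them via batchContext.getBatchMaker().addRecord(...) ...
          getContext().processBatch(batchContext);
        }
      });
    }
    executor.shutdown();
    try {
      executor.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}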

The post Creating a Custom Multithreaded Origin for StreamSets Data Collector appeared first on StreamSets.

Create a Custom Expression Language Function for StreamSets Data Collector


One of the most powerful features in StreamSets Data Collector (SDC) is support for Expression Language, or ‘EL' for short. EL was introduced in JavaServer Pages (JSP) 2.0 as a mechanism for accessing Java code from JSP. The Expression Evaluator and Stream Selector stages rely heavily on EL, but you can use EL in configuring almost every SDC stage. In this blog entry I'll explain a little about EL and show you how to write your own EL functions.

EL Basics

As its name implies, EL allows you to do more than just access Java code – you can write expressions such as

${str:length(record:value('/id')) > 10}

This will evaluate to true if the /id field's value is more than 10 characters long, otherwise it will be false.

SDC includes a wide variety of EL functions for purposes such as accessing a record's fields and attributes, detecting drift in record structure, and performing standard math and string operations. You can get a long way with these ‘off-the-shelf' functions and, when you want to go further, it's really straightforward to create custom EL functions.

Custom EL Functions

Let's say you're processing web server log data and you want to filter out any requests from clients in the local ‘private' network address ranges; for example, the 192.168.1.0 – 192.168.1.255 range. Java helpfully provides an isSiteLocalAddress() method on the InetAddress class, and Google's Guava library allows us to create an InetAddress object from a string containing an IP address without hitting the network, so we can easily sketch out a class with an isPrivate() method:

package com.streamsets.el.example;

import com.google.common.net.InetAddresses;

public class DomainNameEL {
  public static boolean isPrivate(String address) {
    return InetAddresses.forString(address).isSiteLocalAddress();
  }
}

What if we pass something that isn't an IP address at all? InetAddresses.forString() will throw an exception, which we don't want to happen while we're running our pipeline, so let's catch that, and any other possible exceptions such as null pointers:

package com.streamsets.el.example;

import com.google.common.net.InetAddresses;

public class DomainNameEL {
  public static boolean isPrivate(String address) {
    try {
      return InetAddresses.forString(address).isSiteLocalAddress();
    } catch (Exception e) {
      return false;
    }
  }
}

Note – when you use EL in a processor stage, SDC evaluates it using a dummy record as part of pipeline validation, so you should take care that your EL functions don't throw an exception on empty or unexpected input.

To make this available as an EL function we just need to add some annotations:

package com.streamsets.el.example;

import com.google.common.net.InetAddresses;
import com.streamsets.pipeline.api.ElFunction;
import com.streamsets.pipeline.api.ElParam;
import com.streamsets.pipeline.api.ElDef;

// @ElDef marks this as a class that contains EL functions
@ElDef
public class DomainNameEL {
  private static final String DNS = "dns";

  // This is an EL function - it must be public static
  @ElFunction(
    prefix = DNS,
    name = "isPrivate",
    description = "Returns true if this is a private IPv4 address."
  )
  public static boolean isPrivate(
    // @ElParam assigns a UI name to a parameter
    @ElParam("address") String address
  ) {
    try {
      return InetAddresses.forString(address).isSiteLocalAddress();
    } catch (Exception e) {
      return false;
    }
  }
}

And that's it! We can use Maven to build this into a JAR using a very standard pom.xml file, copy the JAR to $SDC_DIST/libs-common-lib, and use it in a pipeline:

Custom EL Pipeline
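
For example, a Stream Selector condition that routes requests from the local network might look like this (the /client_ip field name is just for illustration):

${dns:isPrivate(record:value('/client_ip'))}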

Let's preview the pipeline on some test data:

Custom EL Preview

Success! Note that this is a deliberately simple, but still useful, example. EL functions can accept any number of arguments and be arbitrarily complex. What functionality are you going to build into a custom EL function? Let us know in the comments!

The post Create a Custom Expression Language Function for StreamSets Data Collector appeared first on StreamSets.
