
Announcing Data Collector ver 1.5.1.0


We’re happy to announce a new release of StreamSets Data Collector. This is a relatively minor mid-term update with a number of important bug fixes, yet it packs in a couple of fun features.

  • Support for Azure Blob storage using the WASB protocol. Customers can now use Data Collector to write directly to Azure HDInsight.
  • Support for Apache Solr 6 (for both standalone and cluster mode).

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.

The post Announcing Data Collector ver 1.5.1.0 appeared first on StreamSets.


Standard Deviations on Cassandra – Rolling Your Own Aggregate Function


If you’ve been following the StreamSets blog over the past few weeks, you’ll know that I’ve been building an Internet of Things testbed on the Raspberry Pi. First, I got StreamSets Data Collector (SDC) running on the Pi, ingesting sensor data and sending it to Apache Cassandra, and then I wrote a Python app to display SDC metrics on the PiTFT screen. In this blog entry I’ll take the next step, querying Cassandra for statistics on my sensor data.

Detecting Outlier Values

Now that I have sensor data flowing into Cassandra, I want to analyze that data and then feed it back into SDC so I can detect outlier values. A common way of detecting outliers is to flag readings that fall outside some range expressed in terms of the mean +/- some number of standard deviations (also known as ‘sigma’ or σ). For instance, assuming a normal distribution, 99.7% of values should fall within 3σ of the mean, while about 99.99% of values should fall within 4σ. Wikipedia has a handy table with the values. Cassandra can give me the mean of the values in a column, with its avg function, but not standard deviation. Fortunately, though, it’s possible to define your own ‘user-defined aggregate’ (UDA) functions for Cassandra. Here’s how I created a UDA for standard deviation.

Cassandra User Defined Aggregate Functions

Cassandra UDAs are defined in terms of two user-defined functions (UDFs): a state function and a final function. The state function, called for each row in turn, takes a state and a value as parameters and returns a new state. After all rows have been processed by the state function, the final function is called with the last state value as its parameter, and returns the aggregate value. The Cassandra docs on UDAs show how to calculate the mean in this way. Let’s pull the two functions out of the example and format them as Java functions to better see how they work:

Tuple avgState(Tuple state, int x) {
    // Element 0 holds the count of values seen so far, element 1 holds their running total
    state.setInt(0, state.getInt(0) + 1);
    state.setDouble(1, state.getDouble(1) + x);

    return state;
}

Double avgFinal(Tuple state) {
    double r = 0;

    // No values means there is no meaningful mean
    if (state.getInt(0) == 0)
        return null;

    // The mean is the running total divided by the count
    r = state.getDouble(1);
    r /= state.getInt(0);

    return Double.valueOf(r);
}

As you can see, the avgState function keeps track of the count of values and their total, while avgFinal simply divides the total by the count to get the mean. Elementary school math!

Standard deviation, however, is a bit more complicated. A measure of the ‘spread’ of a set of values from their mean, standard deviation is found by “taking the square root of the average of the squared deviations of the values from their average value”. Helpfully, Wikipedia contains an ‘online algorithm’ for computing variance (the square of standard deviation) in a single pass through the data – just what we need! Transliterating from Wikipedia’s Python implementation to Java, we get:

static Tuple sdState(Tuple state, double x) {
    // For clarity, set up local variables
    int n = state.getInt(0);
    double mean = state.getDouble(1);
    double m2 = state.getDouble(2);

    // Do the calculation
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);

    // Update the state
    state.setInt(0, n);
    state.setDouble(1, mean);
    state.setDouble(2, m2);

    return state;
}

static Double sdFinal(Tuple state) {
    int n = state.getInt(0);
    double m2 = state.getDouble(2);

    if (n < 2) {
        // Need at least two values to have a meaningful standard deviation!
        return null;
    }

    // Online algorithm computes variance - take the square root to get standard deviation
    return Math.sqrt(m2 / (n - 1));
}

I wrote a test harness to calculate the mean and standard deviation of the integers from 1 to 10 using both the ‘online’ algorithm and a simpler iterative algorithm, and I also created an Excel spreadsheet to do the same calculation using the AVERAGE() and STDEV() functions. Happily, all three methods gave the same mean of 5.5 and standard deviation of 3.02765!
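If you’d like to reproduce that sanity check, here’s a minimal sketch of such a test harness (not the exact code I used), comparing the single-pass ‘online’ algorithm against a simpler two-pass calculation:

public class StdDevTest {
    public static void main(String[] args) {
        double[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

        // Single pass: the 'online' (Welford) algorithm, as used in sdState/sdFinal
        int n = 0;
        double mean = 0, m2 = 0;
        for (double x : values) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        double onlineSd = Math.sqrt(m2 / (n - 1));

        // Simpler two-pass calculation for comparison
        double sum = 0;
        for (double x : values) {
            sum += x;
        }
        double twoPassMean = sum / values.length;
        double sumSqDev = 0;
        for (double x : values) {
            sumSqDev += (x - twoPassMean) * (x - twoPassMean);
        }
        double twoPassSd = Math.sqrt(sumSqDev / (values.length - 1));

        System.out.printf("online:   mean=%.4f sd=%.5f%n", mean, onlineSd);
        System.out.printf("two-pass: mean=%.4f sd=%.5f%n", twoPassMean, twoPassSd);
    }
}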

Excel mean and SD

One piece of housekeeping was necessary before I could actually define my own function: since I’m using Cassandra 2.2, I needed to add the following line to cassandra.yaml:

enable_user_defined_functions: true

Cassandra 3.0 lets you define Java functions without explicitly enabling them. See the docs on UDFs for more details.

That done, I just needed to remove comments and line breaks from my Java functions, paste them into the Cassandra function definitions and feed them into cqlsh:

cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdState ( state tuple<int,double,double>, val double ) CALLED ON NULL INPUT RETURNS tuple<int,double,double> LANGUAGE java AS 
 ... 'int n = state.getInt(0); double mean = state.getDouble(1); double m2 = state.getDouble(2); n++; double delta = val - mean; mean += delta / n; m2 += delta * (val - mean); state.setInt(0, n); state.setDouble(1, mean); state.setDouble(2, m2); return state;'; 

cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdFinal ( state tuple<int,double,double> ) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 
 ... 'int n = state.getInt(0); double m2 = state.getDouble(2); if (n < 2) { return null; } return Math.sqrt(m2 / (n - 1));';

cqlsh:mykeyspace> CREATE AGGREGATE IF NOT EXISTS stdev ( double ) 
 ... SFUNC sdState STYPE tuple<int,double,double> FINALFUNC sdFinal INITCOND (0,0,0);

So far, so good… Let’s find the mean and standard deviation of the integers from 1 to 10 in Cassandra:

cqlsh:mykeyspace> CREATE TABLE one_to_ten (value double PRIMARY KEY);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (1);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (2);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (3);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (4);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (5);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (6);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (7);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (8);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (9);
cqlsh:mykeyspace> INSERT INTO one_to_ten (value) VALUES (10);
cqlsh:mykeyspace> SELECT COUNT(*), AVG(value), STDEV(value) FROM one_to_ten;

 count | system.avg(value) | mykeyspace.stdev(value)
-------+-------------------+-------------------------
    10 |               5.5 |                 3.02765

Success! Now let’s try getting statistics for the past day’s temperature readings:

cqlsh:mykeyspace> SELECT COUNT(*), AVG(temperature), STDEV(temperature) FROM sensor_readings 
WHERE sensor_id = 1 AND time > '2016-07-26 15:00:00-0700';

 count | system.avg(temperature) | mykeyspace.stdev(temperature)
-------+-------------------------+-------------------------------
  1417 |                32.48066 |                      0.867008

It’s been hot here in San Jose, California!

Now that I’m able to get statistics from Cassandra, the next trick is to feed them into SDC to be able to filter out outlier values for closer inspection. I’ll cover that in my next blog entry. Watch this space!

The post Standard Deviations on Cassandra – Rolling Your Own Aggregate Function appeared first on StreamSets.

Dynamic Outlier Detection with StreamSets and Cassandra


This blog post concludes a short series building up an IoT sensor testbed with StreamSets Data Collector (SDC), a Raspberry Pi and Apache Cassandra. Previously, I covered:

  • Running SDC on the Raspberry Pi to ingest IoT sensor data into Cassandra
  • Using the SDC REST API to display pipeline metrics on the PiTFT screen
  • Creating a Cassandra user-defined aggregate to calculate standard deviations

To wrap up, I’ll show you how I retrieved statistics from Cassandra, fed them into SDC, and was able to filter out ‘outlier’ values.

Filtering in StreamSets Data Collector

It’s easy to use SDC’s Stream Selector to filter out records, sending them to some destination for further analysis. Recall from part 1 that the Raspberry Pi is collecting temperature, pressure and altitude data and appending readings as JSON objects to a text file. Let’s define some static boundaries for expected readings. Since the sensor is located in my home office, I’m pretty sure it’s never going to read temperatures lower than freezing (0C) or higher than boiling (100C), so let’s make those our boundary conditions.

First, I’ll insert a Stream Selector between the File Tail origin and the Field Converter by selecting the stream (arrow) between them, choosing Stream Selector from the dropdown list of processors, and hitting the auto-arrange button to tidy things up:

Stream Selector 1

Now I’ll add a condition to filter my outlier readings out of the pipeline. Stream Selector uses SDC’s expression language to define conditions. The expression we need is simply ${record:value('/temp_deg_C') < 0 or record:value('/temp_deg_C') > 100}:

Stream Selector 2

We need to send the outlier readings to their own store for later analysis. For simplicity, I’ll send JSON objects to a text file:

Output File

The sensor is appending JSON objects to a file that the pipeline is reading as its origin. It’s easy to back up that file, modify some readings at the top so that they fall outside our boundaries, and then use preview to check that they are correctly sent to the outlier file:

Preview

Dynamic Outlier Detection

Now that we’re successfully filtering out readings based on static boundaries, let’s look at a more dynamic approach. The temperature in my home office fluctuates throughout the day and night. A reading of 15C (60F) might be commonplace in the middle of the night, but an outlier on a warm summer afternoon, where it might indicate sensor failure or some other environmental problem.

In the last blog entry, I implemented a user defined aggregate (UDA) in Cassandra, giving me the ability to calculate the standard deviation across a set of values. As I explained then, a common way of detecting outliers is to flag readings that fall outside some range expressed in terms of the mean +/- some number of standard deviations (also known as ‘sigma’ or σ). For instance, assuming a normal distribution, 99.7% of values should fall within 3σ of the mean, while about 99.99% of values should fall within 4σ.

Since temperature fluctuates throughout the day, I decided that ‘expected’ readings should fall within 4σ of the mean for the last hour, with anything outside that range being classed as an outlier. Let’s see the statistics for the past hour:

cqlsh:mykeyspace> select count(*), avg(temperature), stdev(temperature) from sensor_readings where time > '2016-08-17 18:11:00+0000' and sensor_id = 1;

 count | system.avg(temperature) | mykeyspace.stdev(temperature)
-------+-------------------------+-------------------------------
   337 |                24.85668 |                      0.234314

So, with a mean of 24.86 and a standard deviation of 0.2343, I want any reading less than 23.92 or greater than 25.80 to be sent to the outliers file for further analysis.
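Expressed in code, the check I want the pipeline to apply is simply the mean plus or minus four standard deviations. Here’s a tiny illustrative Java sketch, with the figures from the query above hard-coded and a made-up reading:

public class OutlierCheck {
    public static void main(String[] args) {
        // Figures from the Cassandra query above; the reading itself is hypothetical
        double mean = 24.85668;
        double sd = 0.234314;
        double sigmas = 4;

        double lowerBound = mean - sigmas * sd;
        double upperBound = mean + sigmas * sd;

        double reading = 26.1;
        boolean outlier = reading < lowerBound || reading > upperBound;
        System.out.printf("bounds: [%.2f, %.2f], outlier: %b%n", lowerBound, upperBound, outlier);
    }
}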

Dynamic Configuration in StreamSets Data Collector

At this point, I knew how to get the statistics I needed from Cassandra, but how would I implement dynamic outlier detection in SDC? The REST API allows us to modify pipeline configuration, but that would require the pipeline to be stopped. Can we dynamically filter data without an interruption in data flow? Runtime Resources allow us to do exactly that.

A quick test illustrates the concept. We can create a pipeline using the Dev Raw Data Source origin, an Expression Evaluator, and the Trash destination. The Expression Evaluator will simply set a field on each record to the current content of a runtime resource file:

Expression Evaluator

Now we can set the pipeline running, modify our resource file, and take snapshots to see the result:

https://www.youtube.com/watch?v=JP97_e3kXNM

Success – we are injecting data dynamically into the pipeline!

Reading Statistics from Cassandra

The final piece of the puzzle is to periodically retrieve mean and standard deviation from Cassandra and write them to files in the SDC resource directory. I wrote a Java app to do just that (full source in gist). Here is the core of the app:

// Use a PreparedStatement since we'll be issuing the same query many times
PreparedStatement statement = session.prepare(
    "SELECT COUNT(*), AVG(temperature), STDEV(temperature) " +
    "FROM sensor_readings " +
    "WHERE sensor_id = ? " +
    "AND TIME > ?");
BoundStatement boundStatement = new BoundStatement(statement);

while (true) {
  long startMillis = System.currentTimeMillis() - timeRangeMillis;
  ResultSet results = session.execute(boundStatement.bind(sensorId, new Date(startMillis)));
  Row row = results.one();

  long count = row.getLong("count");
  double avg = row.getDouble("system.avg(temperature)"),
         sd  = row.getDouble("mykeyspace.stdev(temperature)");

  System.out.println("COUNT: "+count+", AVG: "+avg+", SD: "+sd);

  try (PrintWriter writer = new PrintWriter(resourceDir + "mean.txt", "UTF-8")) {
    writer.format("%g", avg);
  }
  try (PrintWriter writer = new PrintWriter(resourceDir + "sd.txt", "UTF-8")) {
    writer.format("%g", sd);
  }

  Thread.sleep(sleepMillis);
}

The app is very simple – it creates a PreparedStatement to query for statistics on sensor readings within a given time range, periodically executes that statement to get the count, mean and standard deviation, displays the data, and writes the mean and standard deviation to two separate files in SDC’s resource directory.
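The snippet above assumes that a Cassandra session and a few configuration values (sensorId, timeRangeMillis, sleepMillis and resourceDir) have already been set up. If you’re not working from the gist, a minimal, hypothetical setup using the DataStax Java driver might look something like this (the host and paths are placeholders):

// A minimal, hypothetical setup - Cluster and Session come from com.datastax.driver.core
String cassandraHost = "127.0.0.1";                // wherever Cassandra is listening
String resourceDir = "/home/pi/sdc/resources/";    // SDC's resources directory (example path)
int sensorId = 1;
long timeRangeMillis = 60 * 60 * 1000;             // look back over the last hour
long sleepMillis = 60 * 1000;                      // refresh the statistics every minute

// Connect to the cluster and the keyspace holding the sensor_readings table
Cluster cluster = Cluster.builder().addContactPoint(cassandraHost).build();
Session session = cluster.connect("mykeyspace");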

We want to filter outliers more than 4 standard deviations from the mean, so let’s modify our Stream Selector to use the following expression as its filter:

${record:value('/temp_deg_C') < (runtime:loadResource("mean.txt", false) - 4 * runtime:loadResource("sd.txt", false)) or record:value('/temp_deg_C') > (runtime:loadResource("mean.txt", false) + 4 * runtime:loadResource("sd.txt", false))}

Now we can restart the pipeline, and any incoming records with anomalous temperature values will be written to the outlier file. Here’s the pipeline in action:

https://www.youtube.com/watch?v=aAq9dP63voM

Conclusion

We’ve covered a lot of ground over the course of this series of blog entries! We’ve looked at:

  • Running SDC on the Raspberry Pi to ingest IoT sensor data
  • Using the SDC REST API to extract pipeline metrics
  • Creating a User Defined Aggregate on Cassandra to implement standard deviations
  • Dynamically filtering outlier data in an SDC pipeline

Over the course of a few hours, I’ve built a fairly sophisticated IoT testbed, and learned a lot about both SDC and Cassandra. I’ll be presenting this content in a session at the Cassandra Summit next month in San Jose: Adaptive Data Cleansing with StreamSets and Cassandra (Thursday, September 8, 2016 at 1:15 PM in room LL20D). Come along, say hi, and see the system in action, live!

The post Dynamic Outlier Detection with StreamSets and Cassandra appeared first on StreamSets.

Whole File Transfer with StreamSets Data Collector


A key aspect of StreamSets Data Collector (SDC) is its ability to parse incoming data, giving you unprecedented flexibility in processing data flows. Sometimes, though, you don’t need to see ‘inside’ files – you just need to move them from a source to one or more destinations. Breaking news – the upcoming StreamSets Data Collector 1.6.0.0 release will include a new ‘Whole File Transfer’ feature to do just that. If you’re keen to try it out right now (on test data, of course!), you can download a nightly build of SDC and give it a whirl. In this blog entry I’ll explain everything you need to know to be able to get started with Whole File Transfer, today!

Downloading and Installing Nightly Builds

Downloading and installing a nightly SDC build is easy. The latest nightly artifacts are always at http://nightly.streamsets.com/latest/tarball/ and, currently, streamsets-datacollector-all-1.6.0.0-SNAPSHOT.tgz contains all of the stages and their dependencies.

Installing the nightly is just like installing a regular build. In fact, since this is a nightly build, rather than a release that you might be putting into production, you will probably want to just use the default directory locations and start it manually, so all you need to do is extract the tarball, cd into its directory and launch it:

$ tar xvfz streamsets-datacollector-all-1.6.0.0-SNAPSHOT.tgz 
x streamsets-datacollector-1.6.0.0-SNAPSHOT/LICENSE.txt
x streamsets-datacollector-1.6.0.0-SNAPSHOT/NOTICE.txt
x streamsets-datacollector-1.6.0.0-SNAPSHOT/
...
$ cd streamsets-datacollector-1.6.0.0-SNAPSHOT/
$ bin/streamsets dc
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
objc[13032]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_73.jdk/Contents/Home/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_73.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined.
Running on URI : 'http://192.168.56.1:18630'

Whole File Transfer in StreamSets Data Collector

In the 1.6.0.0 release, Whole File Transfer can read files from the Amazon S3 and Directory sources, and write them to the Amazon S3, Local FS and Hadoop FS Destinations. In this mode, files are treated as opaque blobs of data, rather than being parsed into records. You can transfer PDFs, text files, spreadsheets, whatever you like. SDC processors can act on the file’s metadata – its name, size, etc – but not the file content.

As an example, let’s imagine I have some set of applications writing a variety of files to an S3 bucket. I want to download them, discard any that are less than 2MB in size, and write the remainder to local disk. I want PDF files to be written to one directory and other content types written to another. Here’s how I built an SDC pipeline to do just that.

Reading Whole Files From Amazon S3

Configuring an S3 Origin for Whole File Transfer was almost identical to configuring it for reading records – the only difference being the Data Format: Whole File.

S3 Origin

A quick preview revealed that the origin creates fileRef and fileInfo fields:

S3 Preview

/fileRef contains the actual file data, and is not currently accessible to stages, except for being passed along and written to destinations. /fileInfo contains the file’s metadata, including its name (in the objectKey subfield), size and content type. Note that different origins will set different fields – the Directory origin uses filename, rather than objectKey, and provides a number of other filesystem-specific fields:

Directory Origin

Processing Whole Files with StreamSets Data Collector

Processors can operate on any of the file’s fields, with the exception of fileRef. I used an Expression Evaluator to set a /dirname field to pdf or text depending on the value of /fileInfo/"Content-Type":

Expression Evaluator

If I was working with more than two content types, I could have used a Static Lookup, or even one of the scripting processors, to do the same job.

Stream Selector allowed me to send files to different destinations based on their size:

Stream Selector

Writing Whole Files to a Destination

I could have written files to S3 or HDFS, but, to keep things simple, I wrote them to local disk. There are some rules to configuring the destination for Whole File Transfer:

  • Max Records in File must be 1 – the file is considered to be a single record
  • Max File Size must be 0 – meaning that there is no limit to the size of file that will be written
  • Idle Timeout must be -1 – files will be closed as soon as their content has been written

I used the /dirname field in the Local FS destination’s Directory Template configuration to separate PDFs from text files:

Local FS 1

In the new Whole File tab, I set File Name Expression to ${record:value('/fileInfo/objectKey')} to pass the S3 file name on to the file on disk.

Local FS 2

Running a Whole File Transfer Pipeline

Now it was time to run the pipeline and see files being processed! Once the pipeline was running, clicking on the Stream Selector revealed the number of ‘small’ files being discarded, and ‘big’ files being written to local disk.

Pipeline Running

Clicking the Local FS destination showed me the new File Transfer Statistics monitoring panel:

File Transfer Statistics

Checking the output directory:

Folder

Success! Since SDC runs pipelines continuously, I was even able to write more files to the S3 bucket and see them being processed and written to the local disk.

Conclusion

StreamSets Data Collector’s new Whole File Transfer feature, available in the latest nightly builds and scheduled for the 1.6.0.0 release, allows you to build pipelines to transfer opaque file data from S3 or Local FS origins to S3, Local FS or Hadoop FS destinations. File metadata is accessible to processor stages, enabling you to build pipelines that send data exactly where it is needed. Download the latest nightly and try it out!

The post Whole File Transfer with StreamSets Data Collector appeared first on StreamSets.

Announcing Data Collector ver 1.6.0.0


It’s been a busy summer here at StreamSets: we’ve been enabling some exciting use cases for our customers, partners and the community of open-source users all over the world. We are excited to announce the newest version of the StreamSets Data Collector.

This version has a host of new features and over 100 bug fixes.

  • Whole file transfer – You can now use the Data Collector to bring any type of binary data into your data lake in Hadoop, or move files between on-prem or cloud systems.
  • High throughput writes to the S3 destination – While writing files to multiple partitions in S3, you can achieve linear scaling based on the number of threads allocated in the threadpool.
  • Enterprise security in the MongoDB origin and destination including SSL and login credentials.
  • Enterprise security in the Solr destination including Kerberos authentication.
  • Simplified getting started for the MapR integration – Run a simple command to automatically configure MapR binaries with the Data Collector and get up and running in seconds.
  • Support in our MapR integrations for a powerful Data Collector feature – automatic updates to Hive/Impala schemas based on changing data.
  • HTTP Client/Lookup processor can now add response headers to the data record.
  • Field Converter processor (now called Field Type Converter) can now convert fields en-masse by field name or by data type.
  • New List Pivoter processor that pivots List datatypes.
  • New JDBC Lookup processor that performs in-stream lookups/enrichment from relational databases.
  • New JDBC Tee processor that writes data to relational databases and reads back additional columns to enrich the record in-stream.
  • Reading from JDBC sources no longer requires WHERE or ORDER BY clauses.
  • The HTTP origin now supports reading data from paginated web pages, one-shot batch transfers, and reading compressed and archived files.
  • Smaller installer packages – We previously introduced the concept of a small sized Core tarball file that lets you install individual stages manually. We’ve extended this concept now to our RPM packages – you can now install the smaller Core RPM Data Collector package, and individual stages manually.
  • Updates to the Kafka Consumer to generate a record per message (datagram) for collectd, netflow and syslog data.
  • Updated versions on the following integrations: Apache Kafka 0.10, Cassandra 3.x, Cloudera CDH 5.8, Elasticsearch 2.3.5.
  • New ELs to trim the time portions of Date/Time fields.

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.


The post Announcing Data Collector ver 1.6.0.0 appeared first on StreamSets.

Ingesting Drifting Data into Hive and Impala


Importing data into Apache Hive is one of the most common use cases in big data ingest, but it gets tricky when data sources ‘drift’, changing the schema or semantics of incoming data. Introduced in StreamSets Data Collector (SDC) 1.5.0.0, the Hive Drift Solution monitors the structure of incoming data, detecting schema drift and updating the Hive Metastore accordingly, allowing data to keep flowing. In this blog entry, I’ll give you an overview of the Hive Drift Solution and explain how you can try it out for yourself, today.

The StreamSets Hive Drift Solution

Apache Hive is a data warehouse system built on Hadoop-compatible file systems such as Hadoop Distributed File System (HDFS) and MapR FS. Records are written to files using a format such as Avro, with schema being stored in the Hive Metastore. Hive runs queries using the Hadoop MapReduce framework; Apache Impala (incubating) shares the Hive Metastore, files and SQL syntax, but allows interactive queries.

The StreamSets Hive Drift Solution comprises three SDC stages: the Hive Metadata processor, the Hive Metastore destination and either the Hadoop FS or the MapR FS destination.

HiveMeta-Pipeline / HiveMeta-MapR-Pipeline

  • The Hive Metadata processor reconciles any differences between the incoming record structure and the corresponding table schema in Hive, sending metadata records to the Hive Metastore destination.
  • The Hive Metastore destination interprets incoming metadata records, creating or altering Hive tables accordingly.
  • The Hadoop FS/MapR FS destination writes the actual data files in the Avro data format.

Why three stages instead of just one? Flexibility and scalability. As you can see, we can write data to either Hadoop FS or MapR FS by just swapping out the data destination. Also, data volume tends to increase far faster than metadata, so we can scale out the data pathway to many pipelines independent of the metadata stream. We can even fan in multiple metadata paths to a single Hive Metastore destination to control the amount of load that SDC puts on the Hive Metastore.

StreamSets Hive Drift Solution Tutorial

If you’d like to try the solution out, you can follow our new tutorial, Ingesting Drifting Data into Hive and Impala. The tutorial guides you through configuring the solution to ingest data from MySQL to Apache Hive running on any of the Apache, Cloudera, MapR or Hortonworks distributions. This short video shows the tutorial configuration in action:

 

StreamSets Data Collector is 100% open source, so you can download it today, work through the tutorial and get started ingesting drifting data into Apache Hive!

The post Ingesting Drifting Data into Hive and Impala appeared first on StreamSets.

StreamSets Data Collector in Action at IBM Ireland


After Guglielmo Iozzia, a big data infrastructure engineer on the Ethical Hacking Team at IBM Ireland, recently spoke about building data pipelines using StreamSets Data Collector at Hadoop User Group Ireland, I invited him to contribute a blog post outlining how he discovered StreamSets Data Collector (SDC) and the kinds of problems he and his team are solving with it. Read on to discover how SDC is saving time and making Guglielmo and his team’s lives a whole lot easier…

As a major user of StreamSets Data Collector in internal big data analytics projects at IBM, I have been asked to write some notes about my team’s experience with Data Collector so far.

Let’s start from the beginning, when I and the other members of the team were unaware of the existence of this useful tool…

One of the main goals of the team was to find answers to some questions about the outages occurring in the data centers hosting some of our cloud solutions. A culture change was under way, moving us to a proactive approach to problems. Among our main priorities, we had to understand how to automatically classify outages as soon as they happened in order to alert the proper team, identify the root cause and, finally, suggest potential remedies/fixes for them.

All of the raw data we needed for analytics used to be stored in different internal systems, in different formats, and accessible using different protocols. We had exactly the same situation for the raw data needed to answer questions on other topics.

The original idea was to use existing open source tools (when available) like Sqoop, Flume or Storm, and, in other cases, to write custom agents for some internal legacy systems to move the data to destination storage clusters. Even where tools existed, this solution required a great deal of initial effort in coding customizations, plus significant effort and time for maintenance. Furthermore, we had to deal with other concerns ourselves, such as data serialization (more tools, like Avro and Protocol Buffers, to configure and manage), data clean-up (implementing a dedicated job for this), and security. There was another issue, too: as soon as analytics jobs (originally only Hadoop MapReduce jobs, then progressively moving to real time through Spark) completed, most results needed to become available for consumption by users’ dashboards, which only had access to MongoDB database clusters. So other agents (or MapReduce jobs) needed to be implemented to move data from HDFS to MongoDB.

The main issue in that scenario was the size of the team. Allocating people to work on data ingestion, first level clean up, serialization and data movement reduced the number of stories (and story points) dedicated during each sprint to the main business of the team.

Luckily in March this year I attended an online conference called “Hadoop with the Best.” There one of the speakers, Andrew Psaltsis, briefly mentioned StreamSets Data Collector in a list of tools that were going to change the big data analytics scenario. That was the moment I achieved enlightenment.

I started a POC to understand whether Data Collector really provided the features and the ease of use the StreamSets guys claimed. And I can confirm that whatever you read on the official StreamSets web site is what you really can achieve when you start building data pipelines with the Data Collector.

Seven months later, this is the list of benefits we have achieved so far:

  • Less maintenance time and trouble using a single tool to build data pipelines handling different data source types and different destination types. Just to give an example: in order to do analytics for data center outages, we are now able to ingest raw data coming from relational databases, HTTPS RESTful APIs, and legacy data sources (through the UDP origin), filter and perform initial clean up, and feed data to both HDFS (to be consumed by Hadoop or Spark) and MongoDB clusters (for reporting systems) using only SDC.
  • No more need to waste time coding to implement agents and/or develop customizations of third party tools to move data.
  • Possibility of shifting the majority of data clean up from Hadoop MapReduce or Spark jobs to SDC, significantly reducing data redundancy and storage space in the destinations.
  • Easy pipeline management using only a web UI. We found that the learning curve here is short, even for people without strong development skills.
  • Wide variety of supported data formats – text, JSON, XML, Avro, delimited, log, etc.
  • Plenty of real-time stats to check for data flow quality.
  • Plenty of metrics available to monitor pipelines and performance of SDC instances.
  • An excellent and highly customizable alerting system.
  • Dozens of managed origins and destinations that cover most of the tools/systems that can be part of a big data analytics ecosystem. It’s also possible to extend the existing pool of managed origins, processors and destinations through the SDC APIs, but this is something you shouldn’t need because each new SDC release comes with a bunch of new stages.
  • In the very few situations where none of the existing processors fits the need for a particular operation, it is possible to do the required action on the incoming records through a Groovy, JavaScript or Jython script. We wrote some simple scripts in Groovy, as we were comfortable with this scripting language since we use it in Jenkins build jobs.

We found StreamSets Data Collector to be a mature, stable product, despite its relatively short life (the first release was just a year ago), with an active and wonderful community behind the scenes. New releases are becoming more and more frequent, and each one covers new systems/tools as sources and/or destinations and adds useful new processors that are making the scripting stages almost obsolete. I highly recommend you use SDC if you are struggling to build data pipelines and don’t feel comfortable with traditional ETL practices.

Many thanks to Guglielmo for taking the time to document his experience with StreamSets Data Collector! 

If you’re using SDC and would like to share your experiences with our community, please let me know in the comments, or via Twitter at @metadaddy.

The post StreamSets Data Collector in Action at IBM Ireland appeared first on StreamSets.

Introducing StreamSets DPM – Operational Control of Your Data in Motion


Friends of StreamSets,

Today I am delighted to announce our new product, StreamSets Dataflow Performance Manager, or DPM, the industry’s first solution for managing operations of a company’s end-to-end dataflows within a single pane of glass. The result of a year’s worth of innovative engineering and collaboration with key customers, DPM will be generally available on or before September 27, in time for Strata. We invite you to come by our booth (#451) for a live demonstration.

DPM is a natural follow-on to our first product, StreamSets Data Collector, which is open source software for building and deploying any-to-any dataflow pipelines. That product has enjoyed a great deal of success in its first year in market, with an accelerating number of weekly downloads, which now total in the tens of thousands across hundreds of enterprises, and numerous production use cases in Fortune 500 companies across a variety of industries.

While StreamSets Data Collector is a best-in-class tool for data engineers designing complex pipelines in the face of data drift, that is only half the battle.  Our customers don’t just struggle with building pipelines, but also with managing the day in and day out operations of their dataflows, so that they can be confident that data-driven applications and business processes are getting timely and trustworthy data.

This is where DPM comes in. You can think of it as a control panel for managing all of your dataflow topologies from a single point. A topology is a series of interconnected dataflows, sometimes dozens or hundreds of individual pipelines, that work together to continuously serve data in support of  business and IT imperatives, such as Customer 360, Cybersecurity, IOT and Data Lakes.

We call the operational discipline that DPM enables Data Performance Management. We think of DPM as a high level of process maturity akin to that delivered by Network Performance Management and Application Performance Management. Data has been neglected in this regard. While data stores are well-managed (data at rest), end-to-end dataflows are not (data in motion).  If you don’t professionally manage your data in motion, you risk having applications malfunction because of incomplete or corrupt  data.   

DPM lets you map, measure and master your dataflow operations.  First, DPM maps dataflow pipelines into broader topologies, not just as a snapshot but as a living, breathing and interactive data architecture.  Real-time visualization of dataflow topologies is truly empowering, replacing manual mapping exercises that become outdated as soon as they are published.

Data Performance Manager - Screenshot: Map
StreamSets Dataflow Performance Manager™ maps the dataflows for a Customer 360 topology, which feeds batch and streaming data from multiple sources to multiple destinations. It also shows record throughput across the topology.

But that’s only the first step. Next you can measure dataflow performance across each topology, from end-to-end or point-to-point.  You can establish baselines for what is normal throughput, travel time or error rates, and then monitor these metrics to ensure operational stability.  You can also assess the performance impact of topology changes, such as new or updated infrastructure, applications and dataflows.

Data Performance Manager - Screenshot: Measure
The StreamSets DPM dashboard shows metrics for all dataflow topologies on a single screen.  Operators can drill into each topology or the dataflow pipelines they use.

Still, the ultimate goal enterprises should strive for is to master their dataflow operations by implementing Data SLAs that ensure incoming data meets business requirements for availability and accuracy.  DPM lets you set Data SLAs and then warn or alert when there is a violation, so you can proactively address dataflow operational problems before they become business problems.

The power in these Data SLAs is that they are not limited to system-specific rules, like “is there data backpressure in Kafka” but rather reflect consumption-specific business goals, such as “is more than 95% of the application log data feeding my personalization algorithm arriving within 1 hour of being produced?” or “is the data in my BI dashboard complete; can I trust the results?” Of course you can also set SLAs for path segments or systems that create risk within a given topology, but the true innovation is the end-to-end visibility and control you gain.

Data Performance Manager - Screenshot: Master
StreamSets DPM allows you to set SLAs for Data Availability and Data Accuracy, from end-to-end or for a segment of the dataflow.  SLAs can be programmed to trigger alerts when violated.

With DPM, we feel we have made great progress in our mission to empower organizations to harness their data in motion.  And there is much more to come.  To learn more about the new solution, you can visit the DPM product page or contact us and we’ll set a time to discuss your needs and give you a live demonstration.

Girish Pancha, CEO and Co-Founder

StreamSets Inc.

The post Introducing StreamSets DPM – Operational Control of Your Data in Motion appeared first on StreamSets.


Creating a Post-Lambda World with Apache Kudu


Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing

As originally posted on the Cloudera VISION Blog.

At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source any-to-any dataflow system that ensures all your data is safely delivered to the various systems of your choice. At its core is the ability to handle data drift, which allows these dataflow pipelines to evolve with your changing data landscape without incurring redesign costs.

This position at the front of the data pipeline has given us visibility into various use cases, and we have found that many applications rely on patched-together architectures to achieve their objective. Not only does this make dataflow and ingestion difficult, it also puts the burden of reconciling different characteristics of various components onto the applications. With numerous boundary conditions and special cases to deal with, companies often find the complexity overwhelming, requiring a team of engineers to maintain and operate it.

In integrating our product, StreamSets Data Collector, alongside Apache Kudu, we’ve found users can reduce the overall complexity of their applications by orders of magnitude and make them more performant, manageable, predictable and expandable at a fraction of the cost.

Take the example of real-time personalization for social websites, a use case we have seen a few times. Previously, a typical implementation would use offline batch jobs to train the models, whereas the scores are calculated on real-time interactions. If the calculated scores do not reflect the most recent trends or viral effects, chances are the model training has become stale. A key question is “how do you ensure that the models are trained with latest information so that the calculated scores are up-to-the-minute?”

For the longest time, the choice of the underlying storage tier dictated what kind of analysis you could possibly do. For example, for large volume batch jobs such as training models on large datasets, the most effective storage layer is HDFS. However HDFS is not suited for aggregating trickle feed information due to various limitations, such as the inability to handle updates and requiring large files. As a result, applications often utilize multiple storage tiers and partition the data manually to route real-time streams into a system like Apache HBase and aggregated sets into HDFS. This is the reason why lambda architectures, among other architectural patterns, exist for Apache Hadoop applications.

In our example of real-time personalization, user interaction information is captured and sent to both an online and an offline store. The online store, HBase in this case, is used for real-time scoring that creates personalization indices that are used by the web application to serve personalized content. The offline store, HDFS in this case, is used for training the models in a periodic batch manner. A minimum threshold of data is accumulated before it is sent to HDFS to be used by the batch training jobs. This introduces significant latency into the personalization process and imposes a host of problematic boundary conditions on the application. For instance, since data is being captured from multiple web servers, the application must make sure it can handle any out-of-sequence or late-arriving data correctly.

All this sounds remarkably complex and hard to implement, which it is. But it does not need to be that way anymore. We’ve found that by integrating StreamSets Data Collector and Kudu, such applications can be greatly simplified and built more quickly than ever before. What’s more, operating and managing such applications is easier as well.

Kudu is an innovative new storage engine that is designed from the ground up to overcome the limitations of various storage systems available today in the Hadoop ecosystem. For the very first time, Kudu enables the use of the same storage engine for large scale batch jobs and complex data processing jobs that require fast random access and updates. As a result, applications that require both batch as well as real-time data processing capabilities can use Kudu for both types of workloads. With Kudu’s ability to handle atomic updates, you no longer need to worry about boundary conditions relating to late-arriving or out-of-sequence data. In fact, data with inconsistencies can be fixed in place in almost real time, without wasting time deleting or refreshing large datasets. Having one system of record that is capable of handling fast data for both analytics and real-time workloads greatly simplifies application design and implementation.

An example of a personalization dataflow topology using StreamSets Data Collector and Apache Kudu.

From the perspective of StreamSets, we’ve worked very hard to make the ingest layer simple and easy to use via a drag-and-drop UI. This means that when you introduce new data sources, change your data formats, modify the schema or structure of data, or upgrade your infrastructure to introduce new or updated components, your dataflows will continue to operate with minimal intervention. While this reduces the application complexity considerably on the ingestion side, the processing tier remains considerably complex when working with patched-together components. The introduction of Kudu reduces that complexity and allows users to build applications that can truly focus on business needs as opposed to handling complex boundary conditions.

Going back to our example of real-time personalization, the overall implementation can be greatly simplified by using StreamSets Data Collector-based dataflows feeding into Kudu. The batch jobs that train the models can then run directly on top of Kudu, as can the real-time jobs that calculate the interaction scores. With StreamSets Data Collector depositing the most recent interaction data into Kudu, the models will always be trained with the latest information, and real-time personalization will no longer suffer from the artificial latency of buffering sufficient data before processing.

For the first time in the Hadoop ecosystem, you now have the tools necessary to build applications that truly focus on business logic as opposed to doing a balancing act between different technology components that have significant impedance mismatch. No longer do you need to worry about capturing data from numerous sources, or feeding different systems of record to harness their native capabilities that are required for your applications’ processing logic. Consequently, you no longer need to reconcile the differences between such systems that create numerous boundary conditions and special case scenarios.

The post Creating a Post-Lambda World with Apache Kudu appeared first on StreamSets.

MySQL Database Change Capture with MapR Streams, Apache Drill, and StreamSets


Today’s post is from Raphaël Velfre, a senior data engineer at MapR. Raphaël has spent some time working with StreamSets Data Collector (SDC) and MapR’s Converged Data Platform. In this blog entry, originally published on the MapR Converge blog, Raphaël explains how to use SDC to extract data from MySQL and write it to MapR Streams, and then move data from MapR Streams to MapR-FS via SDC, where it can be queried with Apache Drill.

Overview

A very common use case for the MapR Converged Data Platform is collecting and analyzing data from a variety of sources, including traditional relational databases. Until recently, data engineers would build an ETL pipeline that periodically walks the relational database and loads the data into files on the MapR cluster, then perform batch analytics on that data.

This model breaks down when use cases demand more instant access to that same data, in order to make a decision, raise an alert, or make an offer, since these batch pipelines are often scheduled to run hourly or even daily. To get to real-time data processing one must build a real-time data pipeline that is continuously collecting the latest data. Building a real-time data pipeline doesn’t mean you have to give up batch analytics, since you can always write data from the streaming pipeline to files or tables, but you can’t do real-time analytics using a batch pipeline.

To build a real-time data pipeline, you should start with MapR Streams, the publish/subscribe event streaming service of the MapR Converged Data Platform, as it is the most critical component to handle distribution of real-time data between applications.

Next, a tool is needed to extract the data out of the database and publish it into MapR Streams, as well as take data out of MapR Streams and write it to files or database tables. StreamSets Data Collector (SDC) is an open source, easy to use, GUI-based tool that runs directly on the MapR cluster and allows anyone to build robust data pipelines.

In this blog, we’ll walk through an example of building a real-time data pipeline from a MySQL database into MapR Streams, and even show how this data can be written to MapR-FS for batch or interactive analytics.

Prerequisites

For this example, I assume you have a MapR 5.1 cluster running and SDC properly installed and configured. Specific instructions for setting up SDC with MapR are provided in the StreamSets documentation.

Architecture of Our Use Case

Producer

The source will be a MySQL database that is running on my MapR cluster.

Note: MySQL could also run outside of the cluster; I just wanted to remove complexity in this architecture.

We will stream data from the clients table in that database and publish data to MapR Streams:

Consumer

Then we will stream data from a stream/topic to MapR-FS.

Set up the Environment

MySQL Database

First of all, we need to add the MySQL JDBC driver to StreamSets. Follow the instructions for installing additional drivers into SDC.

Now, we need to make sure that the user running StreamSets exists in the MySQL users table and has sufficient privileges.

Log into MySQL using root from your node and enter the following queries:

>CREATE USER '<User_Running_Streamsets>'@'<Host_Running_Streamsets>' IDENTIFIED BY 'password';

>GRANT ALL PRIVILEGES ON *.* TO '<User_Running_Streamsets>'@'<Host_Running_Streamsets>'  WITH GRANT OPTION;

Then create a database crm and a table named clients:

>CREATE DATABASE crm;

>CREATE TABLE clients (ID INT, Name VARCHAR(10), Surname VARCHAR(10), City VARCHAR(10), Timestamp VARCHAR(10));

Now let’s add some clients into this table:

>INSERT INTO clients VALUES (1,'Velfre','Raphael','Paris','20160701');
>INSERT INTO clients VALUES (2,'Dupont','Jean','Paris','20160701');

STREAMS AND TOPICS

We will create a stream named clients and two topics:

>maprcli stream create -path /clients
>maprcli stream edit -path /clients -produceperm p -consumeperm p -topicperm p
>maprcli stream topic create -path /clients -topic clients_from_paris
>maprcli stream topic create -path /clients -topic clients_from_everywhere_else

StreamSets Runtime Properties

Runtime properties can be set up in a file locally and used in a pipeline. This is really useful in a production environment. Here we will set up three properties related to MySQL. Open $SDC_HOME/etc/sdc.properties and add the following, below the existing runtime.conf.location properties:

runtime.conf_MYSQL_HOST=jdbc:mysql://<Mysql_Host_IP>:3306/crm
runtime.conf_MYSQL_USER=root
runtime.conf_MYSQL_PWD=password

By doing this, you will be able to use the data pipeline that I developed.

Build StreamSets Pipelines

MySQL to MapR Streams

Log into StreamSets (port 18630), click Import Pipeline, and import Mysql_to_Streams.json:

Now you are able to see the pipeline. Click on any component to see its configuration.

JDBC Consumer

This origin stage is used to query the MySQL database and to retrieve data.

JDBC Connection String
The Connection String is required to be able to connect to MySQL. Here we will be using the runtime property called “MYSQL_HOST” that we set up in $SDC_HOME/etc/sdc.properties.

Incremental Mode
Here we want our data pipeline to read new rows from our MySQL table as they are written. By checking this property, our pipeline will maintain the last value of the specified offset column to use in the next query.

SQL Query
We want to retrieve all data from the clients table. The WHERE clause is mandatory because we are running a pipeline and not a one-off batch (like classic ETL). The pipeline will run many small batches based on the WHERE clause and offset column. Here we chose ID as the offset column, since it is unique and increments each time a client is created.

Initial Offset
This is the initial value of the offset column.

Offset Column
As mentioned before, ID will be our offset column for this pipeline.

Query Interval
This component will run many small batches. The query interval is the time between two batches.

Stream Selector

The Stream Selector processor is used to dispatch data to multiple streams depending on one or more conditions. Here I would like to separate my clients that are from “Paris” from all the others, using the “City” field from the clients table.

Records that satisfy condition 1 will be sent to the first output. Records that do not pass condition 1 will be streamed into the second (default) output.

MapR Streams Producer

Based on the Stream Selector output, data will be dispatched to two different topics in the /clients stream. Both of the MapR Streams Producers are configured the same way. Only the topic name changes.

Topic
Here we will produce data to topic clients_from_paris into the /clients stream.

Data Format
We will use the delimited data format and configure the format to be “Default CSV”:

MapR Streams to MapR-FS

Hit Import Pipeline, select Streams_to_MapRFS.json and click on Import.

MapR Streams Consumer

Consumer configuration is almost the same as the Producer, except for the following properties:

Consumer Group
Consumer group name that will retrieve data from topic clients_from_paris and the /clients stream.

MapR Streams Configuration
Here we would like to retrieve all data from the streams including data that has already been produced.

MapR-FS

File Prefix
This is the default, but it can be changed. It corresponds to the File Prefix that will be created by the pipeline.

Directory Template
Again, this is the default.
Folder /tmp will be located at MapR-FS root.

Note: here we are using the MapR-FS specific component, but we could also use the Local FS component since we can access MapR-FS from a node that is running NFS Gateway. In this case, the directory template would be: /mapr/<cluster_name>/tmp/out/…

Run the Data Integration Process

Go to the StreamSets home page, select the two pipelines and start them:

Now that both pipelines are running, the two records that we’ve created in the clients table should have been produced in our stream. Let’s confirm it using the streamanalyzer tool:

> mapr streamanalyzer -path /clients -topics clients_from_paris

Output should be:

Total number of messages: 2

We can also open the Mysql_to_Streams pipeline and look at the metrics information:

The same information is displayed in Streams_to_MapRFS:

For now, the output file at: maprfs://tmp/out/<timestamp>/ is hidden. This is controlled by the Idle Timeout property in the MapR-FS component – one hour is the default value.

The output file will remain hidden for one hour, or until the pipeline is stopped.

Note: this delay is not really needed on MapR-FS, since it’s a full random read/write file system.

Now let’s add some data into the clients table:

>INSERT INTO clients VALUES (3,'Lee','Camille','Paris','20160702');
>INSERT INTO clients VALUES (4,'Petit','Emma','Paris','20160702');

Again, we can use streamanalyzer:

> mapr streamanalyzer -path /clients -topics clients_from_paris

Output should be:

Total number of messages: 4

And the metrics from the two pipelines should show 4 records:


Query the Data Using Drill

Here’s a quick reminder on how to query data with Drill. You have three tools to query data using Drill.

Here I will use Drill Explorer to query the data just injected into my cluster.

Set up the Storage Plugin

Since the output file generated on MapR-FS by SDC has no extension, we need to configure a default input format in the storage plugin page. Let’s update the dfs storage plugin, which is enabled by default:

The change is adding “defaultInputFormat”: “csv”.

Query Data Using Drill Explorer

Open Drill Explorer and navigate into MapR-FS to find the output file. The output file should be located at:

dfs.root > tmp > out > YYYY-MM-DD-hh

Then let’s click on this file:

If you go into the SQL tab, you will see the query that has been executed.

If you want to execute more complex queries, you can do that via this tab:

Thanks to Drill, you are able to query data immediately after the data has been written, without any ETL process to build.

Conclusion

Our job is done. In this blog post, you learned how to use StreamSets Data Collector to easily integrate any data from your relational database with MapR Streams and subsequently MapR-FS, and even use Drill to query this data with ANSI SQL.

StreamSets Data Collector is open source. Download it now and start ingesting data today!

The post MySQL Database Change Capture with MapR Streams, Apache Drill, and StreamSets appeared first on StreamSets.

Announcing StreamSets Data Collector version 2.0


Last October, we publicly announced StreamSets Data Collector version 1.0. Over the last 12 months we have seen an awesome (a word we don’t use lightly) amount of adoption of our first product – from individual developers simplifying their day-to-day work, to small startups building the next big thing, to the very largest companies building global scale enterprise architectures with StreamSets Data Collector at its core.

Drawing from the experience of our co-founders over the last few decades, and numerous interviews we've had with companies over the last year, we are excited to launch the next version of the Data Collector, which ties in deeply with our newest product, StreamSets Dataflow Performance Manager (DPM).

The days of writing individual point-to-point pipelines are behind us – true value lies in a high-level view of how pipelines work together to deliver data to enable the larger application. And when you see a multitude of pipelines through a single pane of glass, you want to see delivery metrics at that aggregate level and you want to know if and when data delivery is not optimal, and get alerted when you need to take action.

If you are a developer, DPM lets you perform release and configuration management of your pipelines, share pipelines within your team, execute pipelines on production systems – and finally, see a multitude of pipelines (yours and those created by other members of your team) come together in a topology.

If you are an architect or Chief Data Officer, DPM lets you monitor data flows for the complete application. If you are responsible for the building and upkeep of all data flowing into the larger Customer 360 application within the enterprise, and you have different groups building discrete pipelines to feed different pieces of data, you can use DPM to pull all these pipelines together into a central canvas and visualize the complete data flow. DPM also lets you drive standardization across your enterprise and lets you think about metrics and Service Level Agreements for all data in motion.


[Screenshot: Dataflow Performance Manager map view]

Sign up for a webinar on October 5th for a complete introduction to DPM from me and our CTO, Arvind Prabhakar.

Version 2.0 has a host of new features:
– Integration with StreamSets DPM Cloud.
– Support for Oracle CDC. If you’d like to get real-time data from an Oracle database, use the Oracle CDC Client origin to get started.
– Support for MapR version 5.2.0.
– Support for cluster mode streaming using MapR Streams.
– Field Flattener processor that flattens nested records.
– Enhancements to the GeoIP lookup processor to perform lookups from multiple databases.
– Updates to the FTP/SFTP Client origin to allow transferring whole binary files.
– 50+ bug fixes.

Check out the release notes for the complete list of new features and updates in SDC 2.0.

Download it now, and let us know what you think.

Here are some frequently asked questions about SDC and DPM:

Q. Will upgrading to StreamSets Data Collector 2.0 break my older pipelines?
A. No, it will automatically function with pipelines developed in older versions.

Q. Do I have to use DPM to continue using StreamSets Data Collector?
A. No, you can continue to use it as is.

Q. How can I get started with DPM?
A. You will need a couple of things to get started: 1) sign up for a free trial account, and 2) download SDC 2.0 and follow the steps outlined here.

Q. Is the DPM software also open source?
A. No, DPM is proprietary software that runs on the StreamSets Cloud or can be deployed to your private cloud.

Q. Will StreamSets Data Collector continue to be open source?
A. In short, it’s business as usual. It will continue to remain 100% open source. Since DPM relies on the ingest abilities in StreamSets Data Collector, we will continue to aggressively build new features and integrations.

The post Announcing StreamSets Data Collector version 2.0 appeared first on StreamSets.

Visualizing NetFlow Data with StreamSets Data Collector, Kudu, Impala and D3


Sandish Kumar, a Solutions Engineer at phData, builds and manages solutions for phData customers. In this article, reposted from the phData blog, he explains how to generate simulated NetFlow data, read it into StreamSets Data Collector via the UDP origin, then buffer it in Apache Kafka before sending it to Apache Kudu. A true big data enthusiast, Sandish spends his spare time working to understand Kudu internals.

NetFlow is a data format that reflects the IP statistics of all network interfaces interacting with a network router or switch. NetFlow records can be generated and collected in near real-time for the purposes of cybersecurity, network quality of service, and capacity planning. For network and cybersecurity analysts interested in these data, being able to have fast, up-to-the second insights can mean faster threat detection and higher quality network service.

Ingesting data and making it immediately available for query in Hadoop has traditionally been difficult, requiring a complex architecture commonly known as the Lambda architecture. Lambda requires the coordination of two storage layers: the “speed layer” and the “batch layer”. The complexity of Lambda has put many real-time analytical use-cases out of reach for Hadoop.  However, with Apache Kudu we can implement a new, simpler architecture that provides real-time inserts, fast analytics, and fast random access, all from a single storage layer. In this article, we are going to discuss how we can use Kudu, Apache Impala (incubating), Apache Kafka, StreamSets Data Collector (SDC), and D3.js to visualize raw network traffic ingested in the NetFlow V5 format.

The data will flow through the following stages:

  1. A UDP NetFlow simulator generates a stream of NetFlow events.
  2. A StreamSets Data Collector ingest pipeline consumes these events in real-time and persists them into Kafka. A second pipeline then performs some in-stream transformations and persists the events into Kudu for analytics.
  3. A D3 visualization then queries Kudu via Impala.

Network Traffic Simulator to Kafka

The goal of this section is to simulate network traffic and send that data to Kafka.

Step 1: Install and Run StreamSets Data Collector

  1. Download the StreamSets Data Collector TGZ binary from its download page, extract and run it:
    $ tar xvzf streamsets-datacollector-all-2.0.0.0.tgz
    $ cd streamsets-datacollector-2.0.0.0/
    $ bin/streamsets dc
  2. Once you see the message Running on URI : 'http://localhost:18630', navigate to localhost:18630 in your favorite browser and log in with admin/admin as the username/password

Step 2: Create a Pipeline from UDP Origin to Kafka Producer

  1. On the StreamSets Dashboard, click on + Create New Pipeline and specify the name, “UDP Kafka Producer”, and description, “Source data from UDP to Kafka”, of the pipeline.
  2. Click on the Save button.
  3. Once you’ve done this, a grid should appear on your screen with a few icons on the right hand side.
  4. Make sure the drop down selector reads ‘Origins’. The various icons below list the data sources that StreamSets can read from. Scroll down to find the icon that reads ‘UDP Source’, and drag the icon onto the StreamSets grid.
  5. Once the UDP Source is in place, click on it. In the Configuration Panel below, select the UDP tab and change the Data Format to NetFlow. Leave Port and all other settings with their default values.
  6. Now let’s create a Kafka Producer! Change the drop down selector on the right from “Origins” to “Destinations”, and drag the “Kafka Producer” onto the StreamSets grid
  7. In the Kafka producer configuration, select Kafka and change TopicName to “NETFLOW”
  8. Make sure that Data Format is “SDC Record”
  9. Now let’s connect “UDP Source” to “Kafka Producer” by dragging an arrow from the UDP Source to the Kafka Producer.

Finally, click in the background of the grid, then, in the Configuration pane at the bottom, select the Error Records tab and set Error Records to Discard.

[Screenshot: UDP Source to Kafka Producer pipeline]

Step 3: Test Data Transfer between Traffic Simulator and Kafka Producer

  1. Start the kafka consumer by running the following command on the shell. Set your Zookeeper host as appropriate.
    $ kafka-console-consumer --zookeeper zk_host:2181 --topic NETFLOW --from-beginning
  2. Let’s now start the UDP Kafka Producer pipeline by clicking on the Start button on the top right corner of the StreamSets Dashboard.
  3. Get UDPClient.java here and dataset-3-raw-netflow here
  4. Start the Traffic Simulator by compiling and running the Java-based UDP client using the shell:
    $ javac UDPClient.java
    $ java UDPClient dataset-3-raw-netflow
  5. Upon success, you should see something like this on your Kafka consumer terminal and in the StreamSets visualization: [Screenshots: Kafka consumer output and pipeline monitoring view]

StreamSets Data Collector is now receiving the UDP data, parsing the NetFlow format, and sending it to Kafka in its own internal record format. Kafka can buffer the records while we build another pipeline to write them to Kudu.

Kafka to Kudu

The goal of this section is to read the data from Kafka and ingest into Kudu, performing some lightweight transformations along the way.

Step 1: Create a New Table in Kudu

Start the impala-shell on your terminal, and paste the SQL query given below to create an empty table called "netflow":

CREATE TABLE netflow(
    id string,
    packet_timestamp string,
    srcaddr string,
    dstas string,
    dstaddr_s string,
    dstport INT,
    dstaddr string,
    srcaddr_s string,
    tcp_flags string,
    dPkts string,
    tos string,
    engineid string,
    enginetype string,
    srcas string,
    packetid string,
    nexthop_s string,
    samplingmode string,
    dst_mask string,
    snmponput string,
    length string,
    flowseq string,
    samplingint string,
    readerId string,
    snmpinput string,
    src_mask string,
    version string,
    nexthop string,
    uptime string,
    dOctets string,
    sender string,
    proto string,
    srcport INT)
DISTRIBUTE BY HASH(id) INTO 4 buckets, RANGE (packet_timestamp) SPLIT ROWS(('2015-05-01'), ('2015-05-02'), ('2015-05-03'), ('2015-05-05'))
TBLPROPERTIES(
    'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
    'kudu.table_name' = 'netflow',
    'kudu.master_addresses' = '{kudu master}',
    'kudu.key_columns' = 'id,packet_timestamp'
);

The table netflow is hash partitioned by the id field which is a unique key and should result in the rows being uniformly distributed among buckets and thus cluster nodes. Hash partitioning provides us high throughput for writes because (provided there are enough buckets) all nodes will contain a hash partition. Hash partitioning also provides for read parallelism when scanning across many id values because all nodes which contain a hash partition will participate in the scan.

The table has also been range partitioned by time so that queries scanning only a specific time slice can exclude tablets not containing relevant data. This should increase cluster parallelism for large scans (across days) while limiting overhead for small scans (single day). Range partitioning also ensures partition growth is not unbounded and queries don't slow down as the volume of data stored in the table grows, because we query only a certain portion of the data and data is distributed across nodes by hash and range partitions.

The above table creation schema creates 16 tablets; first it creates 4 buckets hash partitioned by the id field, and then 4 range partitioned tablets for each hash bucket. When writing data to Kudu, a given insert will first be hash partitioned by the id field and then range partitioned by the packet_timestamp field. The result is that writes will spread out across four tablets (servers). Meanwhile read operations, if bounded to a single day, will query only the tablets containing data for that day. This is important: without much effort, we are able to scale out writes and also bound the amount of data read on time series reads.
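For example, a single-day scan along these lines (the column choice is just illustrative) only touches the tablets whose range covers that day:

SELECT srcaddr_s, dstaddr_s, COUNT(*) AS flows
FROM netflow
WHERE packet_timestamp >= '2015-05-02' AND packet_timestamp < '2015-05-03'
GROUP BY srcaddr_s, dstaddr_s;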

Step 2: Build the Pipeline

  1. If you don’t already have Kudu, you will need to either download and run the Kudu Quickstart VM or install Kudu and Impala-Kudu integration.
  2. If you’re still in the UDP Kafka Producer pipeline, click the ‘Pipelines’ link (top left).
  3. Click on + Create New Pipeline and enter the pipeline name, “Kafka Consumer to Apache Kudu”, and Description, “Loading data from Kafka consumer to Apache Kudu”.
  4. Click on the Save button
  5. Once you’ve done this, an empty SDC grid should appear on your screen with a few icons on the right hand side.
  6. Make sure the drop down selector reads ‘Origins’. Scroll down to find the icon that reads “Kafka Consumer”, and drag the icon onto the StreamSets canvas.
  7. Once the Kafka consumer is in place, click on it. In the configuration panel below, select the Kafka tab and change Data Format to “SDC Record”, and set the topic name to “NETFLOW” (Make sure it’s the same name as above)
    [Screenshot: Kafka Consumer configuration]
  8. Now let's create Apache Kudu as a destination. On the drop down selector, select "Destinations" and drag the "Apache Kudu" icon onto the canvas
  9. Select the Kudu tab and enter the appropriate details for your cluster
    Kudu Masters: Kudu master hostname
    Table Name: netflow
  10. The word timestamp is reserved in Impala, so let’s rename the field. Set “Field to Column Mapping” to:
    SDC Field → Column Name
    /timestamp → packet_timestamp

Now we'll use a JavaScript Evaluator to convert the timestamp field from a long integer to an ISO DateTime string, which is compatible with the Kudu range partition queries discussed above. (Note: this should be possible via a field converter in the future.) For now, let's draw a JavaScript Evaluator between the Kafka Consumer and Apache Kudu.

  1. In the drop down selector, select “Processors”. Drag the “JavaScript Evaluator” icon and drop it in between the Kafka Consumer and Apache Kudu. Draw intermediate paths between Kafka Consumer → JavaScript Evaluator and JavaScript Evaluator → Kudu
  2. Select the JavaScript tab and replace the script with this code:
    for(var i = 0; i < records.length; i++) {
      try {
        // Convert the epoch-millisecond timestamp into an ISO 8601 string
        var convertedDate = new Date(records[i].value.timestamp);
        records[i].value.timestamp = convertedDate.toISOString();
        output.write(records[i]);
      } catch (e) {
        // Send the record to the pipeline's error stream
        error.write(records[i], e);
      }
    }

[Screenshot: Kafka Consumer to Apache Kudu pipeline with JavaScript Evaluator]

If everything you've done so far is correct, clicking on "Validate" will return a success message

Step 3: Test the entire topology: UDP → KafkaProducer → KafkaConsumer → ApacheKudu

  1. Ensure both the traffic simulator and the UDP Kafka Producer pipeline are running
  2. On the “Kafka Consumer to Apache Kudu” pipeline, click the “start” button
  3. You should see some statistics as seen below:
    [Screenshot: pipeline statistics]
  4. In the Impala shell, query the netflow table to see the data in Kudu:
    [Screenshot: Impala shell query results]

Impala-Kudu to D3 Visualization

This is a small custom visualization which shows the source IP and destination IP along with the time interval. The visualization is colored based on the number of packets the source system sent to the destination system.
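Under the hood, a visualization like this only needs a simple aggregate over the netflow table. A rough sketch of such a query follows (the actual query used by the app may differ; dPkts is stored as a string in the schema above, hence the cast):

SELECT srcaddr_s, dstaddr_s, SUM(CAST(dPkts AS BIGINT)) AS packets
FROM netflow
GROUP BY srcaddr_s, dstaddr_s
ORDER BY packets DESC
LIMIT 100;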

  1. Since D3 is a JavaScript visualization library, you'll need to serve a page from a web server. Execute the following commands to download the visualization code and start the server. Note that you will need Maven to be installed on your machine.
  2. $ git clone https://github.com/phdata/network-traffic-visualization.git
    $ cd code/app-dataviz-from-impala
    $ mvn spring-boot:run

On starting the server, navigate to http://localhost:1990/timetravel.html on your favorite browser, and admire the beauty of these real-time NetFlow IP communication visualizations.

[Screenshot: D3 NetFlow visualization]

Conclusion

StreamSets Data Collector allows you to easily move NetFlow data from UDP to Apache Kafka to Apache Kudu for analysis. Kafka provides an intermediate message buffer, while Kudu provides both real-time inserts and fast analytics. Download SDC today and build your first pipeline!

The post Visualizing NetFlow Data with StreamSets Data Collector, Kudu, Impala and D3 appeared first on StreamSets.

Announcing Data Collector ver 2.1.0.0


We're happy to announce a new release of the Data Collector. This minor release has 30+ bug fixes, a number of improvements, and a few new features:

  • A Package Manager that allows you to install new Stage Libraries (Origins, Processors, Destinations) right from the User Interface. With this feature you can download the smaller Core Tarball and only install the stage libraries you want. Note: this feature is currently only available in the Tarball package and the Docker image; in the future we will add this functionality to the other package options.
  • Ability to access the file stream from within the scripting processors while doing Whole File Transfers. This is useful if you want to access the file midstream and do things like read metadata from MP3s or images, or extract text from PDF files (see the sketch after this list).
  • Support for MapR FS as an origin, including cluster mode support for reading out of MapR FS
  • Support for Confluent Schema Registry
  • Support for ElasticSearch 2.4
  • Caching support for the JDBC Lookup Processor
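As a sketch of the whole-file stream access mentioned above, a JavaScript Evaluator could peek at the first byte of each file along these lines (the firstByte output field is made up for illustration, and the exact fileRef API is best checked against the documentation for your version):

for (var i = 0; i < records.length; i++) {
  try {
    // fileRef exposes the file content as a java.io.InputStream
    var stream = records[i].value.fileRef.getInputStream();
    records[i].value.firstByte = stream.read();  // e.g. sniff a magic number
    stream.close();
    output.write(records[i]);
  } catch (e) {
    error.write(records[i], e);
  }
}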

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.

The post Announcing Data Collector ver 2.1.0.0 appeared first on StreamSets.

Creating a Custom Processor for StreamSets Data Collector


Back in March, I wrote a tutorial showing how to create a custom destination for StreamSets Data Collector (SDC). Since then I've been looking for a good sample use case for a custom processor. It's tricky to find one, since the set of out-of-the-box processors is pretty extensive now! In particular, the scripting processors make it easy to operate on records with Groovy, JavaScript or Jython, without needing to break out the Java compiler.

Looking at the Whole File data format, introduced last month in SDC 1.6.0.0, inspired me… Our latest tutorial, Creating a Custom StreamSets Processor, explains how to extract metadata tags from image files as they are ingested, adding them to records as fields.

With the help of Drew Noakes' excellent metadata-extractor you can access Exif and other metadata in a wide variety of image file types. In the tutorial, I give a simple example of reading and writing record fields, then show you how to integrate the metadata-extractor library, access whole file content, and write the resulting metadata tags as record fields. Having this metadata in the SDC record is incredibly useful. Want to search your photos by their location? As I describe in the tutorial, you can easily write the image's GPS coordinates, as well as the filename, to a database table.
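Outside of a pipeline, the core of the extraction looks something like this with metadata-extractor (a minimal sketch; the file name is a placeholder and error handling is omitted):

import com.drew.imaging.ImageMetadataReader;
import com.drew.metadata.Metadata;
import com.drew.metadata.exif.GpsDirectory;
import java.io.File;

public class GpsSniffer {
    public static void main(String[] args) throws Exception {
        // Read all metadata directories from the image file
        Metadata metadata = ImageMetadataReader.readMetadata(new File("photo.jpg"));
        GpsDirectory gps = metadata.getFirstDirectoryOfType(GpsDirectory.class);
        if (gps != null && gps.getGeoLocation() != null) {
            // Latitude and longitude as decimal degrees, ready to become record fields
            System.out.println(gps.getGeoLocation().getLatitude() + ", "
                    + gps.getGeoLocation().getLongitude());
        }
    }
}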

Do you have a custom processor in mind for SDC? Follow the tutorial to get started, and let us know how it goes in the comments!

The post Creating a Custom Processor for StreamSets Data Collector appeared first on StreamSets.

The Challenge of Fetching Data for Apache Spot (incubating)


Reposted from the Cloudera Vision blog.

What do Sony, Target and the Democratic Party have in common?

Besides being well-respected brands, they've all been subject to some very public and embarrassing hacks over the past 24 months. Because cybercrime is no longer driven by angst-ridden teenagers but rather professional criminal organizations and state-sponsored hacker groups, the halcyon days of looking for threat signatures are well behind us. The patterns that must be matched today are much more subtle, spanning activity across numerous systems and, in the case of Advanced Persistent Threats (APTs), long time frames. And as we've learned from experience at Target and others, systems that throw off excessive false positives are at least as devastating to security as having no protection at all, with these highly-touted systems becoming just another form of security theater.

This is why Apache Spot (incubating), driven by the leadership of Intel and Cloudera, is such an important step in the right direction. Cybersecurity is complex, ever-changing and data-driven, which makes security analytics an excellent use case for an open source big data solution. This challenge is too big to be solved by a single vendor and the Apache Spot framework allows you to apply the power of Apache Hadoop, Apache Spark, and the community to the problem.

One of the many challenges to creating an effective security analytics system is being able to build and operate timely and trustworthy dataflows from a wide variety of sources, such as netflow data, DNS logs or proxy server logs, into the storage/compute engine. Since the task at hand is threat detection through machine learning based algorithms, any data loss or data corrosion during ingest harms the ability of the system to perform, leading to false positives that have plagued previous systems, or worse, false negatives where attacks go undetected.

In particular, when combining a large number of data sources to inform your model, you must recognize that system vendors can change the schema without notice, a problem which we call data drift, which creates an opportunity for data corrosion. The more source variety you have, the more risk you have of data drift mucking with your analysis.

Also, the dataflow problem is dynamic in nature. There will be a frequent need to quickly incorporate new data sources to improve the robustness of the overall analysis and enhance the insights delivered on a continuous basis. Being able to onboard new data sources in hours rather than weeks will help keep enterprises ahead of the black hats.

It was in order to provide this flexible dataflow capability that StreamSets was delighted to join Cloudera as a founding member of the Apache Spot ecosystem. We believe that the ability of our open source StreamSets Data Collector to simplify development and operation of the complex ingest pipelines required for threat detection is a linchpin component of the framework. StreamSets allows organizations to quickly set up data ingestion pipelines to land data into Apache Spot's endpoint, user, and network Open Data Models.

Apache Spot Stack

For those not familiar with the tool, StreamSets Data Collector is an adaptable engine for ingesting a wide variety of data sources using plug-and-play origins, destinations and transformations. You can build, test, and deploy ingestion pipelines within an IDE with little to no code, and then run them continuously with real-time monitoring and alerting for data drift and other performance issues. If you need to customize, it also allows for insertion of your own scripts.

StreamSets Data Collector also offers an important built-in capability that is quite valuable for cybersecurity. We standardize the record format of the incoming data. This means you get highly efficient inspection of the data as part of the package. This allows you to continually test for data skews as the data moves, in order to ensure trustworthy data is being sent to the machine learning algorithms.

Our vision is that as users create and prove out templates for security analytic ingestion, this forms a growing library that benefits the entire community and creates tremendous development leverage for Apache Spot adopters. For more information, or to download StreamSets Data Collector, visit our website.


The post The Challenge of Fetching Data for Apache Spot (incubating) appeared first on StreamSets.


Contributing to the StreamSets Data Collector Community


As you likely already know, StreamSets Data Collector (SDC) is open source, made available via the Apache 2.0 license. The entire source code for the product is hosted in a GitHub project and the binaries are always available for download.

As well as being part of our engineering culture, open source gives us a number of business advantages. Prospective users can freely download, install, evaluate and even put SDC into production, customers have access to the source code without a costly escrow process and, perhaps most importantly, our users can contribute fixes and enhancements to improve the product for the benefit of the whole community. In this post, I’d like to acknowledge some of those contributions, and invite you to contribute, too.

Since SDC was released, back in September 2015, we’ve received a wide variety of code contributions from our community. Some of these have been small: Jurjen Vorhauer, a consultant at JDriven in the Netherlands, contributed a single line of code that fixed an annoying bug in the Cassandra target. Other contributions have improved the general quality of the product: Sudhanshu Bahety, a student at UC San Diego, cleaned up a whole series of exception messages. Alexander Ulyanov, CTO of BeKitzur Consulting & Development in Saint Petersburg, Russia, contributed a complete pipeline stage – the Redis Consumer that’s now part of the product. The most recent major contribution, a MySQL binary log ‘change data capture’ origin from the developers at Wargaming.net, is over 4,800 lines of code and adds significant new functionality to SDC.

When developers contribute code back to the project, everyone benefits. The community has access to new features and fixes, while contributors see their code reviewed, incorporated into SDC, and extended by the product team and other developers. Of course, code is not the only contribution. Many community members have reported issues in SDC, whether bugs or feature requests. We’ve seen some great blog posts, articles, and meetup sessions. You certainly don’t need to be a developer to make your mark!

So, I hear you asking yourself, how do I get in on this community awesomeness? If you find a bug, or wish SDC had a particular feature, check out our issues list to see if it’s there already. If it is, then vote for it so we can prioritize it accordingly, and/or watch it so you get notified of progress. Does one of your pipelines demonstrate an innovative technique? Contribute it to the SDC tutorials project! Engage with your local big data community (search big data on meetup.com) and present a session on your experiences with SDC. If you’re a developer, and there’s an itch you want to scratch, fork the GitHub project and get coding. In common with many open source projects, we need you to sign our contributor license agreement before we can incorporate your code, so get that done and you can file a pull request when you’re ready.

Even after over 25 years as a developer, I’m still thrilled to see my code running in production; as community champion for StreamSets, one of my greatest pleasures is enabling developers and users around the world to share that excitement as their contributions are accepted and recognized. Step up and make your mark on StreamSets Data Collector!

The post Contributing to the StreamSets Data Collector Community appeared first on StreamSets.

More Than One Third of the Fortune 100 Have Downloaded StreamSets Data Collector


It’s been a little over a year (9/24/15) since we launched StreamSets Data Collector as an open source project. For those of you unfamiliar with the product, it’s any-to-any big data ingestion software through which you can build and place into production complex batch and streaming pipelines using built-in processors for all sorts of data transformations. The product features, plus video demos, tutorials, etc. can all be “ingested” through the SDC product page.

We’re thrilled to announce that as of last month StreamSets Data Collector had been downloaded by over ⅓ of the Fortune 100! That’s several dozen of the largest companies in the U.S. And downloads of this award-winning software have been accelerating, with over 500% growth in the quarter ending in October versus the previous quarter.

In fact, this is probably a substantial understatement as we only know the corporate identity of a small sliver of the large number of developers who have downloaded the software.  

Amongst the Fortune 500 companies where we have experienced download activity, the industry breakdown is interesting. As shown below, the largest single sector is financial services, accounting for 36% of the identified companies.  Within this sector there is heavy representation from banks, credit institutions and insurance companies.

[Pie chart: downloads by industry]

Following the financial folks are technology companies with 17% of the total, healthcare companies (11%) and services companies (8%). Other industries represented include media, energy, apparel, consumer package goods and even agriculture. This broad cross-section of the economy is a consequence of both the widening adoption of big data as well as the flexibility of StreamSets Data Collector to enable a diverse array of dataflow use cases including IoT, customer 360, cybersecurity, cloud migration and architectural modernization.

If you're one of the many who has downloaded StreamSets Data Collector, we thank you for giving us a try; we're honored to help you make the most of your data in motion. For those of you who have yet to try it, you can download it here. And also take a look at SDC's companion product, StreamSets Dataflow Performance Manager, which helps you operationalize production of a large number of data pipelines, helping you to create a well-managed dataflow operation at your company.

The post More Than One Third of the Fortune 100 Have Downloaded StreamSets Data Collector appeared first on StreamSets.

Upgrading From Apache Flume to StreamSets Data Collector


Apache Flume “is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data”. The typical use case is collecting log data and pushing it to a destination such as the Hadoop Distributed File System. In this blog entry we’ll look at a couple of Flume use cases, and see how they can be implemented with StreamSets Data Collector.

As reliable as Flume is, it’s not the easiest of systems to set up – even the simplest deployment demands a pretty arcane configuration file. For example:

# Sample Flume configuration to copy lines from
# log files to Hadoop FS

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /Users/pat/flumeSpool

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.useLocalTimeStamp = true

A few minutes' study (with a couple of references to the Flume documentation) reveals that the config file defines a Flume agent that reads data from a directory and writes to Hadoop FS. The data will be written as plain text, with the default 10 lines per Hadoop file.
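(If you wanted larger files, you would also have to tune the sink's roll settings, for example with something like the following; the values here are purely illustrative:)

a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.rollSize = 0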

While this example is relatively straightforward, things get complicated quickly as you add more functionality. For example, if you wanted to filter out log entries referencing a given IP address, you would add an interceptor:

# Throw away entries beginning with 1.2.3.4
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^1\.2\.3\.4
a1.sources.r1.interceptors.i1.excludeEvents = true

It soon becomes difficult to look at a Flume config file and understand exactly what it does, let alone make changes or additions, without investing considerable time in mentally parsing it.

StreamSets Data Collector is an any-to-any big data ingest tool that picks up where Flume runs out of steam. Data Collector can interface with files, databases of every flavor, message queues; in fact, pretty much anything. A wide variety of prebuilt processors handle most transformations while script evaluators allow you flexibility to manipulate data in Python, JavaScript and Groovy.

Let’s take a look at an equivalent pipeline, reading log data from a directory and writing to Hadoop FS, in Data Collector:

[Screenshot: Directory origin to Hadoop FS destination pipeline]

Clicking around the UI reveals the configuration for each pipeline stage, including the default values for every setting:

[Screenshot: pipeline stage configuration panel]

Running the pipeline, we can clearly see that a short test file is correctly sent to Hadoop FS:

[Screenshot: pipeline monitoring showing records written to Hadoop FS]

This is great, but it gets even better as we add more functionality. Let’s replicate the Flume interceptor, filtering out records starting with ‘1.2.3.4’. Data Collector doesn’t have a direct equivalent of ‘regex filter’ – it’s actually much more flexible than that. We can use a Stream Selector stage to separate out records that match a given condition. In this case, we’ll send them to the trash. We also get to use Data Collector’s Expression Language (based on JSP 2.0 expression language) to define conditions, so we can match records starting with a given string without having to construct a regular expression:

[Screenshot: Stream Selector condition]
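The condition in the screenshot is along the lines of the expression below, assuming the log line was read with the Text data format and so landed in a field called /text:

${str:startsWith(record:value('/text'), '1.2.3.4')}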

Here’s the first few lines of the input test file:

$ head -n 6 ~/sdcSpool/tester.txt
1.2.3.4 Trash
2.3.4.5 Keep
1.2.3.4 Trash
2.3.4.5 Keep
1.2.3.4 Trash
2.3.4.5 Keep 1.2.3.4

Let’s run the pipeline with the Stream Selector:

[Screenshot: running pipeline with the Stream Selector]

Uh-oh – that doesn’t look right! The pipeline should be filtering out those lines that start 1.2.3.4, but the record count at the bottom of the display shows that the Stream Selector is sending everything to Hadoop FS.

The first few lines of output confirm the problem:

$ hdfs dfs -cat /sdc/events/sdc-20f17369-362f-42b7-a526-41d87aa3b21c_f9c32a2b-3996-450b-bdc7-d3a665d7f6ae | head -n 6
1.2.3.4 Trash
2.3.4.5 Keep
1.2.3.4 Trash
2.3.4.5 Keep
1.2.3.4 Trash
2.3.4.5 Keep 1.2.3.4

Let’s debug the pipeline – preview mode reads the first few lines of input and shows us the record’s state at every stage of the data flow. Let’s take a look:

[Screenshot: preview showing all records routed to Hadoop FS]

All of the data is being sent to Hadoop FS, including the 1.2.3.4 records. Let’s look at the Stream Selector configuration:

[Screenshot: Stream Selector configuration with the erroneous condition]

How did that extra dot get in there? Never mind… Let’s fix it and preview again, just to check:

[Screenshot: preview with the corrected condition]

That’s better! We can see that lines starting ‘1.2.3.4’ match the condition, and are sent to trash via stream 1, while everything else is sent to Hadoop FS via stream 2.

We can reset the directory origin, rerun the pipeline, and see what happens:

[Screenshot: pipeline run with records routed to both trash and Hadoop FS]

Looks good – records are being sent to both trash and Hadoop FS – let’s check the output!

$ hdfs dfs -cat /sdc/events/sdc-20f17369-362f-42b7-a526-41d87aa3b21c_8095f7f2-fbb4-45c0-b827-66034ebbbc98 | head -n 6
2.3.4.5 Keep
2.3.4.5 Keep
2.3.4.5 Keep 1.2.3.4
2.3.4.5 Keep
2.3.4.5 Keep
2.3.4.5 Keep 1.2.3.4

Success – we see only the lines that do not have the prefix ‘1.2.3.4’!

Even as we extend the pipeline to do more and more, Data Collector’s web UI allows us to easily comprehend the data flow. For example, this pipeline operates on transaction records, computing the credit card type and masking credit card numbers as appropriate:

[Screenshot: credit card processing pipeline]

In these examples, we’re running Data Collector in ‘standalone’ mode, reading log files from local disk, but we could just as easily scale out with a cluster pipeline to work with batch or streaming data, and even integrate with Kerberos for authentication.

If your Flume configuration files are getting out of hand, download StreamSets Data Collector (it’s open source!) and try it out. It’s likely quicker and easier to recreate a Flume configuration from scratch in Data Collector than it is to extend your existing Flume agent!

The post Upgrading From Apache Flume to StreamSets Data Collector appeared first on StreamSets.

Announcing Data Collector ver 2.2.0.0


And here it is folks, the last release of 2016 – StreamSets Data Collector version 2.2.0.0. We’ve put in a host of important new features and resolved 120+ bugs.

We’re gearing up for a solid roadmap in 2017, enabling exciting new use cases and bringing in some great contributions from customers and our community.

Please take this out for a spin and let us know what you think. Without further ado, here are some of the top features in 2.2.0.0:

Origins and Destinations

Processors

  • Support for executing Spark jobs within the pipeline. As you develop applications in Spark, you no longer have to worry about writing plumbing code to read and write data from a multitude of origins and destinations. Just write your Spark code in Java or Scala and drop the jar file into the pipeline; the Spark Evaluator processor takes care of converting SDC data formats to RDDs and reading them back out again.

We currently support Spark in local mode, coming up next will be the facility to run this in Cluster mode.

Event Framework

  • We have a new framework for post processing type tasks. You can use this to do things like run a MapReduce job after writing a file to Hadoop, refresh Hive or Impala table statistics after depositing new data, and most anything else you can imagine. We currently support the following executors:

Here’s a more complete description of all the capabilities of this system.

Soon we will be adding support for triggering REST API calls, or executing Spark jobs – if you want to do something else, let us know.

Other Changes

  • A few new Expression Language (EL) functions to get information about files/directories and pipelines, and to work with Datetime fields.
  • We’ve cleaned up the data format options in the UI – they are all consolidated within a single tab for all origins and destinations.
  • The Whole File transfer option now has the ability to generate and test checksums of the files.
  • LDAP Authentication is now possible across multiple directory servers.

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.


The post Announcing Data Collector ver 2.2.0.0 appeared first on StreamSets.

Running Apache Spark Code in StreamSets Data Collector


Spark LogoNew in StreamSets Data Collector (SDC) 2.2.0.0 is the Spark Evaluator, a processor stage that allows you to run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Java or Scala code as a Spark Resilient Distributed Dataset (RDD). Your Spark Transformer can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination.

The Spark Evaluator is particularly suited for CPU-intensive tasks, such as sentiment analysis or image classification, as it can partition batches of records and execute your code in several parallel threads. A new tutorial, Creating a StreamSets Spark Transformer, explains the details and walks you through a simple example, computing the issuing network of a credit card given its number.
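To give a flavor of what that looks like, the heart of a Transformer is an ordinary RDD operation over SDC records; roughly like the Java sketch below, where the field paths and the card-type logic are simplified illustrations, and the surrounding SparkTransformer boilerplate (and imports from com.streamsets.pipeline.api and the Spark Java API) is omitted:

JavaRDD<Record> result = rdd.map(record -> {
    Field card = record.get("/card_number");  // hypothetical field path
    if (card != null && card.getValueAsString().startsWith("4")) {
        // Very rough illustration: a leading '4' indicates Visa
        record.set("/credit_card_type", Field.create("Visa"));
    }
    return record;
});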

Spark Evaluator Pipeline

In this first implementation, SDC runs Spark in ‘local-mode’, within its own process, but we are working on a cluster-mode implementation. I recently presented a session, Building Data Pipelines with Spark and StreamSets, at Spark Summit Europe 2016. Watch the video to get a closer look at the Spark Evaluator in action, and our Spark roadmap:

https://www.youtube.com/watch?v=djt8532UWow

The post Running Apache Spark Code in StreamSets Data Collector appeared first on StreamSets.
