‘Simplicity is the ultimate sophistication.’
– Leonardo da Vinci
As a recent hire on the Engineering Productivity team here at StreamSets, my early days at the company were marked by efforts to dive head-first into StreamSets Data Collector (SDC). As it turns out, the Docker images we publish for SDC were the easiest way to explore its vast set of features and capabilities, which is exactly why I am writing this blog post.
Without further ado, let’s get started.
Start a Docker container with SDC
To start a Docker container with the most recent release of StreamSets Data Collector, just run the following command:
$ docker run -dP --name sdc streamsets/datacollector
Here are the options we specified (for a full list, check out the image notes on Docker Hub):
-d | Create the Docker container in the background, in detached mode
-P | Publish all exposed ports to the host
--name | Name for this container
If all goes well, running docker ps will show output like the following:
$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED         STATUS         PORTS                      NAMES
b40764fb427f   streamsets/datacollector   "/docker-entrypoin..."   5 seconds ago   Up 4 seconds   0.0.0.0:32771->18630/tcp   sdc
Voila! We have successfully created a Docker container with SDC. Pretty simple, right?
Note the port 32771. This is the randomly assigned host port to which Docker has published the SDC container’s port 18630. We can verify that the service has started by pointing a web browser to localhost:32771.
This presents a prompt for a username and password. Enter the default credentials (admin:admin) to reach the SDC web UI.
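Rather than reading the port out of the docker ps output by eye, we can ask Docker for it directly. The following is a minimal sketch, not an official StreamSets helper: the function name sdc_url is made up for illustration, and it assumes the container is named sdc as above.

```shell
# Print the URL at which the SDC web UI is reachable from the host.
# "sdc_url" is a hypothetical helper name; the container defaults to "sdc".
sdc_url() {
    container="${1:-sdc}"
    # `docker port` prints mappings such as "0.0.0.0:32771". Some Docker
    # versions add a second (IPv6) line, so keep only the first one.
    mapping=$(docker port "$container" 18630/tcp | head -n 1)
    # Strip everything up to the last colon, leaving just the host port.
    echo "http://localhost:${mapping##*:}"
}
```

Calling sdc_url after the container starts prints the address to paste into the browser.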
Working with SDC
Now that we have access to the web UI, we can start playing with all the cool capabilities that SDC has to offer. For someone new, a great place to start would be our tutorials, which walk one through everything from creating and running a pipeline to more advanced operations like data manipulation.
Here are a few tricks I learned along the way which helped a lot.
Exploring the Docker container
After we have created the Docker container, we might want to take a look around (e.g. just to see how files are laid out). One simple way is to run the following command to start a Bash session inside the container:
$ docker exec -it sdc bash
Once we are inside, we can run whatever commands we need and, when we’re done, can use exit (or CTRL+D) to come back to the host:
$ docker exec -it sdc bash
bash-4.3$ ls
bin                   home       mnt        run   usr
data                  lib        opt        sbin  var
dev                   lib64      proc       srv
docker-entrypoint.sh  logs       resources  sys
etc                   media      root       tmp
bash-4.3$ pwd
/
bash-4.3$ exit
exit
$
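When we only need the result of a single command, an interactive session is more than we need: docker exec can run the command and return straight to the host. A small sketch, assuming the container name sdc from earlier (the wrapper name sdc_exec is hypothetical):

```shell
# Run one command inside the SDC container, then return to the host.
# "sdc_exec" is a hypothetical wrapper; "sdc" is the --name used earlier.
sdc_exec() {
    # "$@" passes the command and its arguments through unchanged.
    docker exec sdc "$@"
}
```

For example, sdc_exec ls /logs would list the logs directory that appears in the root listing above.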
Restarting SDC
One common gotcha with running SDC in Docker happens when we need to install additional stage libraries. In the web UI, select a library and then click the “Install” icon (see documentation for details). At this point, a dialog offers to restart Data Collector.
If we click “Restart Data Collector,” we might discover that the container never comes back online. The reason is that the Docker daemon interprets the command that started the container as having completed, concludes there is no more work to be done, and stops the container. Since this isn't what we want, we can get our container (and its new stage library) back online by doing the following:
1. From a terminal, run the following command to restart the container:
$ docker restart sdc
Note: This will change the port number at which Docker exposes SDC, so look for the new port number by running docker ps again:
$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED             STATUS         PORTS                      NAMES
b40764fb427f   streamsets/datacollector   "/docker-entrypoin..."   About an hour ago   Up 4 seconds   0.0.0.0:32772->18630/tcp   sdc
2. From this point on, to access SDC, we use http://localhost:32772.
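If the moving port number becomes a nuisance, one option is to publish SDC on a fixed host port with -p instead of -P when creating the container. This is a sketch under assumptions: the function name is made up, host port 18630 must be free, and no container named sdc may already exist.

```shell
# Start SDC with a fixed host port instead of a random one.
# "start_sdc_fixed_port" is a hypothetical helper; 18630 on the host is an
# assumption and can be overridden with the first argument.
start_sdc_fixed_port() {
    host_port="${1:-18630}"
    # -p HOST:CONTAINER pins the mapping, so the port stays the same
    # across `docker restart`.
    docker run -d -p "${host_port}:18630" --name sdc streamsets/datacollector
}
```

With the defaults, SDC would then stay reachable at http://localhost:18630 after each restart.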
Looking into logs
While exploring, if we do something that ends up crashing SDC, here is how to see its logs along with some sample output:
$ docker logs sdc
Java 1.8 detected; adding $SDC_JAVA8_OPTS of "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Djdk.nio.maxCachedBufferSize=262144" to $SDC_JAVA_OPTS
Logging initialized @1296ms to org.eclipse.jetty.util.log.Slf4jLog
2017-07-20 16:07:20,542 [user:] [pipeline:] [runner:] [thread:main] INFO Main - -----------------------------------------------------------------
2017-07-20 16:07:20,545 [user:] [pipeline:] [runner:] [thread:main] INFO Main - Build info:
2017-07-20 16:07:20,545 [user:] [pipeline:] [runner:] [thread:main] INFO Main - Version : 2.6.0.0
….
Running on URI : 'http://b40764fb427f:18630'
2017-07-20 17:18:55,369 [user:] [pipeline:] [runner:] [thread:main] INFO WebServerTask - Running on URI : 'http://b40764fb427f:18630'
$
If SDC is running, we can also tail the logs by adding the -f argument to docker logs.
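Two variations tend to be handy when hunting for a crash. The sketch below uses hypothetical helper names, and the grep pattern is an assumption based on the log format in the sample output above (a level such as INFO surrounded by spaces); adjust it if the logs look different.

```shell
# Follow the SDC log, starting from the last 100 lines.
# Both helper names here are hypothetical.
sdc_tail() {
    docker logs -f --tail 100 sdc
}

# Show only warnings and errors. Container output can arrive on stdout or
# stderr, so merge both streams before filtering.
sdc_problems() {
    docker logs sdc 2>&1 | grep -E ' (WARN|ERROR) '
}
```

Press CTRL+C to stop following the log with sdc_tail; the container keeps running.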
Removing SDC
To clean up our instance of StreamSets Data Collector and all the resources it is using, just run the following command. Keep in mind that this removes our SDC instance, and we will not be able to recover any data, logs, or resources that were created in the process.
$ docker rm -f sdc
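Since the logs disappear along with the container, it can be worth saving them to a file first. A minimal sketch, assuming the container is named sdc; the function name and the default file name sdc.log are arbitrary choices made for illustration.

```shell
# Save the container's logs to a file, then remove the container.
# "archive_and_remove_sdc" and the default "sdc.log" are hypothetical names.
archive_and_remove_sdc() {
    logfile="${1:-sdc.log}"
    # Capture both stdout and stderr; remove the container only if the
    # logs were written successfully.
    docker logs sdc > "$logfile" 2>&1 && docker rm -f sdc
}
```

Running archive_and_remove_sdc leaves sdc.log in the current directory and removes the container in one step.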
Conclusion
In this blog post, we have learned how to start StreamSets Data Collector in a Docker container, how to use it (along with a few tricks), and, finally, how to remove it: a complete cycle of working with SDC and Docker.
What interesting facts have you come across in your journey running SDC in Docker? Did I miss something here? I would love to hear from you in the comments or over in the StreamSets community.
The post Getting Started with StreamSets Data Collector on Docker appeared first on StreamSets.