Analyse your Splunk data from a Jupyter Notebook as a Pandas DataFrame. From there, use the DataFrame to develop and test any Machine Learning model.
One of the biggest challenges in performing Data Science is the sourcing, cleaning, and transforming of data. This ETL process takes about 80% of a Data Scientist's time, forcing them to spend most of their effort on data preparation rather than on the fun of building and testing algorithms.
Splunk's schema-on-read architecture lets users apply an ephemeral schema to the data with each search. Splunk has created hundreds of data type mappings that extract fields from thousands of log formats, and it automatically recognizes the structure of CSV, JSON, and XML files, greatly simplifying the ETL process.
Jupyter Notebooks have been used by the Data Science community for years to develop and test models, so it's only natural to bring these two great tools together.
It's really simple: all you need to do is install Splunk's SDK in your Jupyter instance and then import it from a Notebook to connect to Splunk. Of course, your Jupyter server must be able to reach your Splunk instance, and you'll need Splunk login credentials.
I've included a method in the Walkthrough Notebook that installs the Splunk SDK library and uses it to query Splunk.
The Notebook included in this repo serves as an example of how to install the SDK, query Splunk, and convert the results into a Pandas DataFrame. From there, you're free to use the libraries you know and love.
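If you just want the gist of what the Notebook does, here is a minimal sketch. It assumes the Splunk management port (8089) is reachable under the hostname splunk, as in the Docker environment described below, and it uses placeholder admin credentials and an example search over the _internal index; adjust all of these for your own instance.

### Query Splunk from Python and load the results into a Pandas DataFrame
!pip install splunk-sdk   # one-off install from a Notebook cell

import pandas as pd
import splunklib.client as client
import splunklib.results as results

# Connect to Splunk's REST/management port (8089 by default).
# The hostname and credentials below are placeholders.
service = client.connect(
    host='splunk',
    port=8089,
    username='admin',
    password='changeme')

# Run a blocking search job with an example query.
job = service.jobs.create(
    'search index=_internal | head 100',
    exec_mode='blocking')

# Keep only the result rows (the reader also yields informational messages).
rows = [dict(row) for row in results.ResultsReader(job.results())
        if isinstance(row, dict)]

# From here on it's plain Pandas.
df = pd.DataFrame(rows)
print(df.head())

Once the results are in a DataFrame, you can feed them straight into scikit-learn, TensorFlow, or whatever else you normally use.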
If you'd like to try the idea but aren't ready to bring it into your production environment, I've built a Docker environment for you to play with.
If you haven't downloaded Docker yet, please visit https://www.docker.com to install it.
First, we must create a user-defined network. This network will host our containers and allow DNS hostname discovery between them, which the default Docker bridge network doesn't provide.
Then, open a shell session and copy/paste the following:
### Create a User-Defined bridge network (needed for DNS hostname discovery)
docker network create \
--driver=bridge \
--subnet=172.28.0.0/16 \
--ip-range=172.28.5.0/24 \
--gateway=172.28.5.254 \
jupyter-splunk
We now need two Docker containers: one for Jupyter and one for Splunk. When you run the commands below, Docker will automatically pull the images and start the containers for us.
### Create a docker container running an unauthenticated Jupyter instance.
docker run -itd \
--restart always \
--name jupyter \
--hostname jupyter \
--network=jupyter-splunk \
-p 8888:8888 \
jupyter/tensorflow-notebook:latest \
start-notebook.sh --NotebookApp.token=''
### Create a docker container running the latest Splunk version.
docker run -itd \
--restart always \
--name splunk \
--hostname splunk \
--network=jupyter-splunk \
-p 8000:8000 \
-e "SPLUNK_START_ARGS=--accept-license" \
-e "SPLUNK_USER=root" \
splunk/splunk:latest
Give it a minute or so for Splunk and Jupyter to start, then head to the following URLs:
Splunk: http://localhost:8000
Jupyter: http://localhost:8888
You should now have both Jupyter and Splunk running. If after a minute you can't reach the URLs, check that the containers are running correctly and the network has been created by typing:
### Check the User Defined network has been created and containers are running
docker ps -a
docker network ls
If anything else goes wrong, follow the instructions at the end of the page to delete the containers, then run the commands again.
It's now time to download the Notebook...
...go back to http://localhost:8888, load the Notebook into Jupyter, and run it.
Make sure the Notebook is running on a Python 2.7 kernel; Splunk's SDK works only with that version.
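If you're unsure which interpreter the kernel is using, a quick check in a Notebook cell will tell you:

### Check the kernel's Python version from a Notebook cell
import sys
print(sys.version)   # expect 2.7.x for this walkthrough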
After you've finished testing, you can delete your docker environment by typing:
### Delete Environment
docker rm --force jupyter
docker rm --force splunk
docker network rm jupyter-splunk
### Check the Environment has been deleted
docker ps -a
docker network ls
About Splunk - http://splunk.com
Splunk Enterprise monitors and analyzes machine data from any source to deliver Operational Intelligence to optimize your IT, security and business performance. With intuitive analysis features, machine learning, packaged applications and open APIs, Splunk Enterprise is a flexible platform that scales from focused use cases to an enterprise-wide analytics backbone.
About Jupyter - http://jupyter.org
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.