Elasticsearch has become part of my daily routine, and the more I use it, the more I think of ways to use it outside of work. That led me to an idea: why not build my own ingestion pipeline with sentiment analysis, so that data can be processed and tagged before being indexed into Elastic?
I know Logstash already has a plugin to ingest data from Twitter, but I also wanted to add a bit of polarity to each tweet, and I wanted to control the process, since I don’t have unlimited storage and truly don’t want to ingest a lot of data. So I decided to make my own, and it turns out it was quite simple.
Now to begin, the dependencies I used for this were:
Elasticsearch 6.5
python-elasticsearch
twython
textblob
Elastic offers two Python libraries to interact with your node (the low-level client and a higher-level DSL), so make sure you pip install the low-level one, python-elasticsearch.
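A quick way to verify the install worked, assuming a node on the default port (we’ll start one in the next section):

```python
# A minimal check that the client is installed and can reach a node.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])
print(es.ping())  # prints True once your node is up and reachable
```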
Start your ES instance
Now, setting up an instance can be complicated, so I’ll just go over a very basic setup; if you want something more elaborate, the elastic.co documentation is quite good.
Make sure you have java installed.
Download Elasticsearch from here. This will be different based on your OS/Distro. Again in my case I went with 6.5 since I run “Linux-Manjaro”.
Extract the contents.
Locate and run the binary; you’ll usually find it at elasticsearch/bin/elasticsearch. The process should start and you should see something like this.
NOTE: If you want to run it in the background, add the -d flag to daemonize it.
Finally, test that your node is ready by making a request against localhost on port 9200, the default port used by Elasticsearch. In my case I named my node “node-1” and my cluster “home-cluster”.
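For example, a standard-library-only sketch of that check (the “name” and “cluster_name” fields are part of Elasticsearch’s standard root response):

```python
# Hit the root endpoint on the default port and read the node/cluster names.
import json
from urllib.request import urlopen

with urlopen("http://localhost:9200") as resp:
    info = json.load(resp)

print(info["name"], info["cluster_name"])  # e.g. "node-1 home-cluster"
```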
Ok, so now you have your single-node cluster set up. The next step is to create a “model” for the data you will ingest. Again, since I don’t have unlimited storage or more nodes, I will tweak the settings for all of the indices that get created to have just 1 shard and no replicas. This is an Elasticsearch-specific kind of deal, so if you want to learn more, I would again point you to the documentation, or you can ask me (social media stuff at the bottom).
Now, I could create the mapping every time I index the data, but then again, that’s manual stuff, which I kind of despise, so I went ahead and created a template so that all indices matching a pattern adopt the settings automatically.
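Something along these lines, sketched with the Python client; the template name here is my own illustration, and the pattern is “*” since I want every new index to pick it up:

```python
# Apply 1 shard / 0 replicas to every index that gets created.
template_body = {
    "index_patterns": ["*"],  # matches all new indices
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
    },
}
es.indices.put_template(name="single-shard-template", body=template_body)
```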
So once you have the mapping defined, we are finally ready to push some data using Python!
Ingesting data with python-elasticsearch
Alright, so the first thing we have to do is acquire Twitter credentials and tokens so that we can use the libraries to retrieve tweets; to get those credentials, go here.
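The four values you get from the developer portal look something like this (placeholders, obviously; substitute your own):

```python
# Hypothetical placeholders: fill in the keys from your Twitter developer app.
APP_KEY = "YOUR_CONSUMER_KEY"
APP_SECRET = "YOUR_CONSUMER_SECRET"
OAUTH_TOKEN = "YOUR_ACCESS_TOKEN"
OAUTH_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"
```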
The first thing to do is define the connection object that we will use to interact with Elasticsearch, and import everything we need; since we are doing sentiment analysis, we of course need those libraries too.
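A minimal sketch of that setup, assuming a local node on the default port:

```python
# Imports: the Elasticsearch client, Twython's streaming API, and TextBlob.
from elasticsearch import Elasticsearch
from textblob import TextBlob
from twython import TwythonStreamer

# The connection object used for every call to the cluster.
es = Elasticsearch(["localhost:9200"])
```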
In the last portion, we tell Elasticsearch to create the index called ‘trump’ if it does not already exist.
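A sketch of that check:

```python
# Create the index only if it isn't there yet, so re-runs are safe.
if not es.indices.exists(index="trump"):
    es.indices.create(index="trump")
```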
Next, we will define the data model used to describe each ‘tweet’ or event and pass it down to Elasticsearch; this is where we do the sentiment analysis using the ‘TextBlob’ library.
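A sketch of what that document builder might look like; the field names are my own illustration, but the polarity call is TextBlob’s actual API:

```python
def build_doc(tweet):
    """Map a raw tweet to the document we index, adding sentiment."""
    text = tweet["text"]
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive).
    polarity = TextBlob(text).sentiment.polarity
    return {
        "date": tweet["created_at"],
        "author": tweet["user"]["screen_name"],
        "message": text,
        "polarity": polarity,
        "sentiment": "positive" if polarity > 0
                     else "negative" if polarity < 0
                     else "neutral",
    }
```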
Finally, we will make use of the client and data objects to start a stream that pushes all of the tweets, with our added fields, to the Elasticsearch index, so that we can later run searches and build visualizations with Kibana.
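Putting it together, a sketch of the streamer using the pieces above (the track keyword is my assumption, matching the index name):

```python
class TweetStreamer(TwythonStreamer):
    def on_success(self, data):
        # Keep-alives and delete notices have no "text" field; skip them.
        if "text" in data:
            es.index(index="trump", doc_type="_doc", body=build_doc(data))

    def on_error(self, status_code, data):
        print("Stream error:", status_code)
        self.disconnect()

streamer = TweetStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
streamer.statuses.filter(track="trump")  # stream tweets matching the keyword
```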
Now that we have everything ready, we can simply run the script, and it should start pushing data to our single-node cluster.