This project implements a scalable architecture to monitor and visualize sentiment for a Twitter hashtag in real time. It streams live tweets matching a hashtag from Twitter, performs sentiment analysis on each tweet, and calculates the rolling mean of the sentiment scores. This mean is continuously pushed to connected browser clients and displayed in a sparkline graph.
The diagram below illustrates the components and the flow of information (from right to left).
The project has three parts:
The web server is a Python Flask server. It fetches data from Twitter using Tweepy and pushes the tweets into Kafka. A sentiment analyzer picks tweets from Kafka, performs sentiment analysis using NLTK, and pushes the result back into Kafka. The sentiment scores are read by the Spark Streaming server (part 3), which calculates the rolling average and writes it back into Kafka. In the final step, the web server reads the rolling mean from Kafka and sends it to connected clients via Socket.IO. An HTML/JS client displays the live sentiment in a sparkline graph using Google annotation charts.
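As a rough sketch of the per-tweet step, here is a toy word-list scorer standing in for the NLTK analyzer (the word lists and normalization are illustrative, not the project's actual logic):

```python
# Toy stand-in for the NLTK sentiment step: score a tweet into [-1, 1]
# by counting positive vs. negative words. Illustrative only; the real
# analyzer uses NLTK, and scores travel between steps via Kafka.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}

def score_tweet(text):
    words = set(text.lower().split())
    pos = len(words & POSITIVE)
    neg = len(words & NEGATIVE)
    if pos + neg == 0:
        return 0.0  # no opinion words found: neutral
    return (pos - neg) / (pos + neg)
```

In the real pipeline, each score would be published to a Kafka topic rather than returned to the caller.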
The web server runs each independent task in a separate thread:
Thread 1: fetches tweets from Twitter
Thread 2: performs sentiment analysis on each tweet
Thread 3: reads the rolling mean produced by Spark Streaming
Each of these threads could also run as an independent service, giving a scalable and fault-tolerant system.
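A minimal sketch of that thread layout, with `queue.Queue` standing in for the Kafka topics and stub workers in place of the real ingestion and analysis code (all names here are illustrative):

```python
import queue
import threading

tweets = queue.Queue()      # stand-in for the raw-tweet Kafka topic
sentiments = queue.Queue()  # stand-in for the sentiment Kafka topic
results = []                # stand-in for the Socket.IO broadcast

def ingest(texts):
    # Thread 1: push incoming tweets downstream (real code uses Tweepy).
    for text in texts:
        tweets.put(text)
    tweets.put(None)  # sentinel: stream finished

def analyze():
    # Thread 2: score each tweet (real code calls the NLTK analyzer).
    while tweets.get() is not None:
        sentiments.put(0.0)  # placeholder score
    sentiments.put(None)

def collect():
    # Thread 3: read results (real code reads Spark's rolling mean).
    while (score := sentiments.get()) is not None:
        results.append(score)

threads = [
    threading.Thread(target=ingest, args=(["hello", "world"],)),
    threading.Thread(target=analyze),
    threading.Thread(target=collect),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every hop goes through a queue (a Kafka topic in the real system), any worker can be moved out into its own process or machine without changing the others.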
Kafka acts as a message broker between the modules running within the web server, as well as between the web server and the Spark Streaming server. It provides a scalable and fault-tolerant communication mechanism between independently running services.
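With the broker on the default localhost:9092, the topics the services talk over can be created and inspected with Kafka's own CLI tools (the topic names below are placeholders; use whatever names the project's code expects, and note that older Kafka releases take `--zookeeper localhost:2181` instead of `--bootstrap-server` for `kafka-topics.sh`):

```shell
# Create one topic per hop in the pipeline (names are illustrative)
kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092
kafka-topics.sh --create --topic sentiments --bootstrap-server localhost:9092

# List topics, or watch messages flow through one of them
kafka-topics.sh --list --bootstrap-server localhost:9092
kafka-console-consumer.sh --topic sentiments --bootstrap-server localhost:9092 --from-beginning
```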
A separate PySpark program reads the sentiment scores from Kafka using Spark Streaming, calculates the rolling average using Spark window operations, and writes the results back to Kafka.
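Conceptually, Spark's window operation averages the scores whose timestamps fall inside the current window; the same logic in plain Python (a sketch of the idea, not the project's PySpark code):

```python
def windowed_mean(records, window_seconds, now):
    """Mean of (timestamp, score) pairs from the last window_seconds.

    Mirrors what a Spark Streaming window + mean computes on each
    slide; `records` is a list of (epoch_seconds, score) tuples.
    """
    scores = [s for t, s in records if now - window_seconds < t <= now]
    return sum(scores) / len(scores) if scores else None
```

Each time the window slides, Spark recomputes this mean over the records currently in the window, and the result is written back to Kafka for the web server to pick up.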
To run the project:
- Download, set up, and run Apache Kafka. On Ubuntu, I set KAFKA_HOME to the Kafka install directory and add its bin directory to PATH in my .bashrc:
export KAFKA_HOME=/path/to/kafka
export PATH=$KAFKA_HOME/bin:$PATH
- Now use the following commands:
i) Start Zookeeper
zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties
ii) Start the Kafka broker
kafka-server-start.sh $KAFKA_HOME/config/server.properties
- Install the complete NLTK (library and data)
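One way to do that, assuming pip and Python 3 are on PATH (downloading everything is the simplest match for "complete NLTK"; if the analyzer only uses NLTK's VADER, the `vader_lexicon` package alone would suffice, but that narrower choice is an assumption):

```shell
pip install nltk
python3 -m nltk.downloader all            # full NLTK data, as above
# or, if only VADER is needed (assumption):
python3 -m nltk.downloader vader_lexicon
```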
- Create a Twitter app and set your keys in
live_twitter_sentiment_analysis/tweet_ingestion/config.py
This requires a Twitter developer account (the keys here are the credentials of your developer account).
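The exact variable names in config.py depend on the project; a typical Tweepy credential file looks like this (hypothetical names, placeholder values):

```python
# Hypothetical config.py layout; match the names the project expects.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"
```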
- Install the Python packages
pip install -r live_twitter_sentiment_analysis/webapp/requirements.txt
- Run the web server
python3 live_twitter_sentiment_analysis/main.py
- Run the PySpark project separately once tweets start streaming.
python3 live_twitter_sentiment_analysis/rolling_avg/rolling_avg.py
- Open the URL in a browser
http://localhost:8001/index.html