Here comes the coding part. I swear it was a nightmare. As the saying goes, "there are no geniuses, only lazy fools", so I decided to try my best to become a genius.
Let's import our data set, call it 'bic', and perform operations on it.
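The tweets had already been extracted from Twitter to a file. A minimal import sketch, where the file name 'bitcoin_tweets.csv' is my assumption rather than the actual export:
bic = read.csv("bitcoin_tweets.csv", stringsAsFactors = FALSE) #file name is an assumption; adjust to your export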
The extracted data is little more than raw tweets, full of noise that provides no meaningful insight. It is convenient to strip this noise from the extracted Twitter dataset: HTML links, emoticons, punctuation, '@' mentions, stopwords (i.e. is, at, the, on), RT markers, numbers, and extra white space. We also perform stemming with stemDocument (e.g. "running" is reduced to its root form "run") and convert tweets to lower case, so that the resulting data set holds only information that is valuable for the sentiment analysis.
head(bic)
paste(bic$text, collapse = " ") #inspect all tweets as a single string
library(stringr) #for str_replace() and str_replace_all() below
bic$text = gsub("&amp;", " ", bic$text) #remove HTML ampersand entities in the tweets
bic$text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", bic$text) #remove retweet/via headers
#take out any remaining retweet headers
bic$text = str_replace(bic$text, "RT @[A-Za-z0-9_]*: ", "")
#get rid of hashtags
bic$text = str_replace_all(bic$text, "#[A-Za-z0-9_]*", "")
#get rid of references to other screen names
bic$text = str_replace_all(bic$text, "@[A-Za-z0-9_]*", "")
bic$text = gsub("@\\w+", "", bic$text) #remove any leftover @mentions
bic$text = gsub("http\\S+", " ", bic$text) #remove URLs
bic$text = gsub("<\\w*>", "", bic$text) #remove angle-bracketed tokens
bic$text = gsub("[[:punct:]]", "", bic$text) #remove punctuation
bic$text = gsub("[[:digit:]]", " ", bic$text) #remove numbers
bic$text = gsub("[ \t]{2,}", " ", bic$text) #collapse runs of spaces and tabs
#getting rid of unnecessary spaces
bic$text = str_replace_all(bic$text, "\\s+", " ")
bic$text = gsub("^\\s+|\\s+$", "", bic$text) #remove leading and trailing spaces
NOTE:
Code is best learned by "CODING". There is literally no way around it. Start with a simple line and you will surely get the hang of it. All it takes is some patience, and it will carry you a long way.
library(tm)
bicCleaned = Corpus(VectorSource(bic$text))
tdm = TermDocumentMatrix(bicCleaned,
                         control = list(removePunctuation = TRUE,
                                        stopwords = c("machine", "learning",
                                                      stopwords("english")),
                                        removeNumbers = TRUE,
                                        tolower = TRUE))
m = as.matrix(tdm) #convert the term-document matrix to a plain matrix
word_freqs = sort(rowSums(m), decreasing = TRUE) #word counts in decreasing order
dm = data.frame(word = names(word_freqs), freq = word_freqs) #data frame of word frequencies
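One step the cleaning description above promises but the code omits is stemming. A minimal sketch, assuming the SnowballC package is installed, would apply it to bicCleaned before building the term-document matrix:
library(SnowballC) #tm's stemDocument relies on SnowballC
bicCleaned = tm_map(bicCleaned, stemDocument) #"running" becomes "run"; rebuild tdm afterwards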
#Remove stop words from dm
stop_words = tidytext::stop_words
stop = stop_words$word
dm_filter = subset(dm, !(word %in% stop))
library(wordcloud)
wordcloud(words = dm_filter$word, freq = dm_filter$freq, min.freq = 10,
          max.words = 50, random.order = FALSE, rot.per = 0.15,
          colors = "black")
library(syuzhet)
# Converting tweets to ASCII to handle strange characters
bic$text = iconv(bic$text, from = "UTF-8", to = "ASCII", sub = "")
# removing retweets, in case any remain
bic$text = gsub("(RT|via)((?:\\b\\w*@\\w+)+)", "", bic$text)
# removing mentions, in case any remain
bic$text = gsub("@\\w+", "", bic$text)
ew_sentiment = get_nrc_sentiment(bic$text) #score each tweet against the NRC emotion lexicon
sentimentscores = data.frame(colSums(ew_sentiment))
names(sentimentscores) = "Score"
sentimentscores = cbind("sentiment" = rownames(sentimentscores), sentimentscores)
rownames(sentimentscores) = NULL
library(ggplot2)
ggplot(data = sentimentscores, aes(x = sentiment, y = Score)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  xlab("Sentiments") + ylab("Scores") +
  ggtitle("Total sentiment based on scores") +
  theme_minimal() +
  theme(legend.position = "none") #after theme_minimal() so the legend stays hidden
This was my word cloud and sentiment scoreboard for 'bitcoin'.
This was my word cloud and sentiment scoreboard for 'ethereum', using the same procedure.
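Since both coins go through the same procedure, a small helper makes the reuse explicit. This is only a sketch: the function name and the 'eth' data set are my assumptions, and the full cleaning chain from above is abbreviated to a single step.
analyse_coin = function(df) {
  df$text = iconv(df$text, from = "UTF-8", to = "ASCII", sub = "") #abridged cleaning; run the full chain in practice
  colSums(get_nrc_sentiment(df$text)) #total NRC sentiment scores for this coin
}
analyse_coin(bic) #bitcoin
#analyse_coin(eth) #ethereum, assuming its tweets were loaded as 'eth'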
Evaluation and aftermath
I admit that this research was a little faulty because of the extraction parameters from Twitter.
The data pre-processing could also have been better: it should remove tokens such as "retweetamp" and others that add nothing to the emotions. However, for the purposes of this workflow, we can say that people are highly positive toward Bitcoin, since it was the first digital coin, although Ethereum is gaining momentum now.
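Returning to the pre-processing point, a hedged sketch of the fix would extend the stop list built earlier before filtering (the extra tokens below are illustrative, not an exhaustive list):
stop = c(stop, "retweetamp", "amp", "rt", "via") #leftover artefacts; examples only
dm_filter = subset(dm, !(word %in% stop))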
One of the greatest difficulties was determining the best approach for detecting sentiment in Twitter data, because comparing approaches is highly challenging when there are no agreed benchmarks.
A further area of study might be the utilization of active learning techniques to detect Twitter sentiments and to increase the confidence of decision-makers.