
Building a multilingual sentiment analysis tool

Using R to understand how people interact with your brand

What if I told you that you could build a multilingual sentiment analysis tool that lets you scrape data from anywhere you want, translate it, run it through sentiment analysis and then visualize it in Datastudio, all within 70 lines of R? And what if I told you this solution is free?

Curious? Good! In this blog post I will share how I built my sentiment analysis tool, along with all the code. This should also give you the skills to do other cool stuff with R, such as:

  • Scraping data from the web, in this case TripAdvisor
  • Formatting data within R
  • Pushing dataframes into Google Sheets
  • Using the Google Natural Language API
  • Integrating Google Sheets with Datastudio

Inspiration

I just went to Superweek, probably the coolest data conference in the world. Here there was a prize for the best analytics solution, where the winner would get a golden punchcard.

As I spent a total of two hours building the solution and half an hour on my pitch, it did not win the prize. It was still a lot of fun, though, and I promised to write a blog post about it afterwards!

The final product

The final product is a dashboard, but in my personal opinion it is actually the dataset you end up with that makes it interesting.

Requirements

  • An empty Google Sheet with two sub-sheets
  • A downloaded Google Cloud service account credential file, with the Google Natural Language API activated (it is free, but requires a credit card to sign up)
  • R installed (RStudio is recommended as a client)
  • A Google account

The how-to for all the things

Web scraping TripAdvisor
As this was made as a minor project for the contest, I selected TripAdvisor as the source to scrape and the Hunguest Grandhotel Galya (the hotel that hosts Superweek) as the page to be scraped.

First we need to know which class the comments are in. I used the tool SelectorGadget to find the class I needed to scrape data from:

As seen in the picture above, the information I need is placed in the .review-container class of the HTML. Therefore we can use the following code to retrieve all the information embedded in it:
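
A minimal sketch of that step with the rvest package; the URL below is a placeholder for the hotel's TripAdvisor review page, not the real address:

```r
# Sketch of the scraping step, assuming the rvest package.
# The URL is a placeholder for the Hunguest Grandhotel Galya review page.
library(rvest)

hotel_url <- "https://www.tripadvisor.com/Hotel_Review-Hunguest_Grandhotel_Galya.html"

# Read the page once and grab every review block found with SelectorGadget
reviews <- hotel_url %>%
  read_html() %>%
  html_nodes(".review-container")
```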

Formatting the tripadvisor review data with R
Now that we have the raw HTML, we can start to extract the individual pieces of information from it:

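A sketch of that extraction, loosely inspired by the rvest demo credited below; the inner CSS selectors are assumptions and may need adjusting to TripAdvisor's current markup:

```r
# Pull the individual fields out of each review container.
# The inner CSS selectors are assumptions and may need updating
# if TripAdvisor changes its markup.
title <- reviews %>%
  html_node(".noQuotes") %>%
  html_text()

comment <- reviews %>%
  html_node(".partial_entry") %>%
  html_text()

review_date <- reviews %>%
  html_node(".ratingDate") %>%
  html_attr("title")

# The star rating is encoded in a class name such as "bubble_45"
rating <- reviews %>%
  html_node(".ui_bubble_rating") %>%
  html_attr("class") %>%
  gsub("[^0-9]", "", .) %>%
  as.integer()
rating <- rating / 10   # e.g. "bubble_45" becomes 4.5 stars
```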

Credit goes to https://github.com/hadley/rvest/blob/master/demo/tripadvisor.R for doing most of the pre-work for this.

NB: In the dataframe we also have access to the star rating of each comment, but it will not be used further in this example, as this post revolves around using the Google Natural Language API.

Using the Google Natural Language API

Now that we have formatted the data, we can build a dataframe with all the information we have extracted.

That is done by writing the following line:
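
Something along these lines, using the vectors extracted above (the column names are my own choice):

```r
# Combine the extracted vectors into one dataframe, one row per review
reviewdata <- data.frame(title, comment, review_date, rating,
                         stringsAsFactors = FALSE)
```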

Which gives us a nice dataframe with one row per review.

As I don't speak Hungarian, we will now use the Google Translation API to translate everything for us, thanks to Mark Edmondson, who made the googleLanguageR package:
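
A sketch of the translation step with googleLanguageR's gl_translate(); the credential file name and the new column names are placeholders:

```r
# Translate titles and comments to English with googleLanguageR
library(googleLanguageR)

# Authenticate with the service account credential file from the requirements
gl_auth("my-google-credentials.json")

translated_comments <- gl_translate(reviewdata$comment, target = "en")
translated_titles   <- gl_translate(reviewdata$title, target = "en")

# Keep the English text alongside the original columns
reviewdata$comment_en <- translated_comments$translatedText
reviewdata$title_en   <- translated_titles$translatedText
```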

Which translates both the comments and the titles.

Finally to the sentiment analysis

For those who don’t know what sentiment analysis is: it is a way of determining whether a text has a positive or negative meaning, on a scale from -1 to 1. The closer the score is to 1, the more positive the text is. Google defines their analysis the following way:

  • score of the sentiment ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the overall emotional leaning of the text.
  • magnitude indicates the overall strength of emotion (both positive and negative) within the given text, between 0.0 and +inf. Unlike score, magnitude is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text’s magnitude (so longer text blocks may have greater magnitudes).
  • salience indicates the importance or relevance of this entity to the entire document text. This score can assist information retrieval and summarization by prioritizing salient entities. Scores closer to 0.0 are less important, while scores closer to 1.0 are highly important.

As this was made for demonstration only and not for real analysis, I have divided the data into two different dataframes:

The review comments

To generate analysis for the comments we use the following lines:
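
Roughly like this, assuming the translated comment column created above; gl_nlp() returns, among other things, a documentSentiment table with a score and magnitude per comment:

```r
# Run the translated comments through the Natural Language API
comment_nlp <- gl_nlp(reviewdata$comment_en, nlp_type = "analyzeSentiment")

# documentSentiment holds one score/magnitude pair per comment
comments_sentiment <- data.frame(
  comment   = reviewdata$comment_en,
  date      = reviewdata$review_date,
  score     = comment_nlp$documentSentiment$score,
  magnitude = comment_nlp$documentSentiment$magnitude,
  stringsAsFactors = FALSE
)
```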

Which gives us a dataframe with a sentiment score and magnitude for each comment.

The titles

For analyzing the titles from TripAdvisor, the following lines are used:
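
Along the same lines as for the comments (again a sketch with assumed column names):

```r
# The same call, this time for the translated titles
title_nlp <- gl_nlp(reviewdata$title_en, nlp_type = "analyzeSentiment")

titles_sentiment <- data.frame(
  title     = reviewdata$title_en,
  score     = title_nlp$documentSentiment$score,
  magnitude = title_nlp$documentSentiment$magnitude,
  stringsAsFactors = FALSE
)
```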

Which gives us an understanding of the tone of voice for each review title.

Pushing the data to Google Sheets

As mentioned in the requirements section above, we need to have created a Google Sheet. As I built this for Superweek, I called it: Superweek (surprise!)
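
One way to link the sheet, sketched here with the googlesheets4 package (the original setup may have used a different Sheets package):

```r
# Link the spreadsheet created in the requirements section
library(googlesheets4)

# Authorise against your Google account, then look the sheet up by name
# (assuming "Superweek" matches exactly one spreadsheet in your Drive)
gs4_auth()
superweek_sheet <- gs4_find("Superweek")
```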

Quite easy, right?

Now that it has been linked, we can push both of our dataframes into the sub-sheets we have made:
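
For example, sketched with googlesheets4 and assumed sub-sheet names:

```r
# Write each sentiment dataframe into its own sub-sheet
# (the tab names "comments" and "titles" are just the ones used here)
sheet_write(comments_sentiment, ss = superweek_sheet, sheet = "comments")
sheet_write(titles_sentiment, ss = superweek_sheet, sheet = "titles")
```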

If you open your sheet, you should now see the data in both sub-sheets.

This is definitely one of the neatest features mentioned in this blog post, as it opens up an endless number of other possibilities.

Visualizing the data

At last, you can connect to the data within Datastudio, format it as you want, and there you go: a multilingual sentiment analysis tool made with very few lines of code!

Using sentiment analysis in your business & final thoughts

This is just one way of scraping data from the net, and it could be applied to any site that mentions your brand. As other data sources, you could pull tweets and hashtags from different social media platforms, CSV files from questionnaires, and more.

By using the Google Natural Language API you can save a lot of time interpreting what people say about a brand, and use that information to catch negative reputation and deal with it at an early stage. It can also give an indication of what matters to the people writing, and help you use data to understand people.

As I mentioned in the beginning, this was a quick setup I made specifically for the Superweek conference, so it is not intended to be the best possible way to scrape data and do sentiment analysis. Instead, it is an encouragement for everyone who is curious about R and web analytics to try out something new and play around with some data they do not see every day.