It’s no secret that every language is a complex beast to master. Even English, with its watered-down Germanic grammar and extensive borrowings from Latin, will often befuddle the most learned of minds. Unsurprisingly, therefore, computer programs are not yet able to decode human language as well as most people can.
What may come as a surprise, however, is just how much our modern software is already capable of doing! In fact, in many new and exciting ways, Natural Language Processing (NLP), the subfield of artificial intelligence concerned with the interactions between computers and human languages, may completely revolutionise computing — so for this blog post in particular, you should probably sit upright!
Pre-Processing
Before the magic begins, at SPG, we believe it is crucial to spend some time pre-processing our data. This refers to a number of small, yet hugely important tasks that will essentially convert our sentences into a format machines can grasp.
1. Clean data
Cleaning data may take on a variety of different forms, but when dealing with text specifically, the most popular approaches are to strip out numbers, remove all punctuation, and convert every word to lowercase.
2. Tokenise
In essence, “tokenisation” is just a fancy word for breaking text down into smaller components, and in this context simply refers to separating a sentence into individual words. So given the following example: “All right, Charlie. Hope you have a great 21st birthday!” after cleaning and tokenising the data, we would end up with the tokens “all right charlie hope you have a great birthday”. For the bag-of-words approach described next, the order of these words is inconsequential.
3. Remove stop words
Finally, we would have to remove all the stop words. While there is no standardised list of these, they are generally understood to be very common terms that carry little meaning on their own. In English, these include the articles “the” and “a,” among many others. So using the above example, we would now have what is known as a simple bag of words: “right charlie hope great birthday”. This may seem like an overly minimalistic way to view your data, but when it comes to NLP, it is actually extremely handy.
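To make the three steps concrete, here is a minimal Python sketch that reproduces the example above. The tiny stop-word list and the choice to drop any token containing a digit (such as "21st") are our own assumptions for illustration; real projects usually rely on a library's much longer stop-word list.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "all", "have", "the", "you"}

def clean(text):
    """Lowercase the text, then drop numbers and punctuation."""
    text = text.lower()
    # Assumption: remove any whitespace-delimited token containing a digit.
    text = re.sub(r"\S*\d\S*", " ", text)
    # Replace everything that is not a letter or whitespace with a space.
    return re.sub(r"[^a-z\s]", " ", text)

def tokenise(text):
    """Split cleaned text into individual words."""
    return text.split()

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "All right, Charlie. Hope you have a great 21st birthday!"
tokens = tokenise(clean(sentence))
content_words = remove_stop_words(tokens)
bag_of_words = Counter(content_words)  # word -> count, order ignored

print(tokens)
print(content_words)
```

The `Counter` at the end is the bag of words itself: it records how often each term appears and nothing about where it appeared.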
Exploring the Possibilities
Once we are done with cleaning and organising our information, incredible things become immediately possible.
Sentiment Analysis
Imagine that you are the manager at a customer service centre and are working for a company that sells both jackets and shoes. After receiving a number of phone calls from people who love your shoes but appear to dislike your jackets, you now want to establish if this is the general consensus. Thanks to NLP, instead of having to listen to potentially thousands of calls, you can simply convert audio to text and — based on the terms used in every conversation — automatically score feelings from positive to negative.
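As a rough illustration of scoring by the terms used, the sketch below counts positive and negative words from a small hand-made lexicon. The word lists and sample transcripts are hypothetical, and production systems typically use larger lexicons or trained models rather than anything this simple.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"love", "great", "comfortable", "perfect"}
NEGATIVE = {"dislike", "poor", "tight", "broken"}

def sentiment_score(text):
    """Return a score from -1 (negative) to 1 (positive).

    The score is (positive hits - negative hits) / total hits,
    or 0.0 when no lexicon word appears in the text.
    """
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

# Made-up call transcripts, one per product.
calls = {
    "shoes": "I love these shoes, they are so comfortable",
    "jackets": "The jacket was tight and the zip arrived broken",
}
scores = {product: sentiment_score(text) for product, text in calls.items()}
print(scores)
```

Averaging such scores per product across all calls would give a first indication of whether the shoes really are more popular than the jackets.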
Topic Modelling
Another possibility concerns the use of topic modelling. If you worked in an affluent law firm and were attempting to identify the culprit behind a very large sum of money suspected to have been embezzled, you would probably have to pore over tens of thousands of company emails. Alternatively, however, you could make use of an NLP tool that would label each message by topic, and begin your investigation with all the messages pertaining to cash.
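To show the idea of labelling messages by topic, here is a deliberately simple keyword-matching sketch. Note that this is a stand-in: genuine topic modelling discovers topics statistically rather than from hand-written keyword sets, and the topics, keywords and emails below are all hypothetical.

```python
import re

# Hypothetical topics and keywords, chosen only for this illustration.
TOPIC_KEYWORDS = {
    "cash": {"cash", "payment", "invoice", "transfer", "account"},
    "scheduling": {"meeting", "calendar", "agenda", "reschedule"},
}

def label_topic(message):
    """Label a message with the topic whose keywords it shares most."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    overlaps = {topic: len(words & keywords)
                for topic, keywords in TOPIC_KEYWORDS.items()}
    best = max(overlaps, key=overlaps.get)
    # Fall back to "other" when no keyword matches at all.
    return best if overlaps[best] > 0 else "other"

emails = [
    "Please approve the transfer to the new account today",
    "Can we reschedule the meeting to Friday?",
    "Lunch was great, thanks",
]
labels = [label_topic(email) for email in emails]
print(labels)
```

An investigator could then filter for the messages labelled "cash" and read those first.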
One Step Further
Together, topic modelling and sentiment analysis can deliver mind-blowing benefits. Consider, for instance, the hypothetical case of stand-up comic Jimmy Carr. What if you, as someone who did not normally derive pleasure from stand-up, for whatever reason found Carr to be wholly hilarious? You would likely be inclined to ask yourself why. It could be the fact that he’s British, or maybe his irreverence that can always make you belly-laugh, but regardless of your preconceived notions, you’d set out to find out what sets Carr apart.
Gathering Data
In order to solve this mystery, the first thing you would have to do is decide which data to gather, and that, of course, would probably be immediately obvious: transcripts! To keep the comparison fair, you would need to find a way to gather transcripts of Carr’s routines along with those of stand-up gigs by comics of comparable clout. Thankfully, in both of these cases, machine learning can help you out, as not only do solutions exist that will automatically convert audio to text, but you can also rely on algorithms to suggest similar entertainers.
Cleaning and Organising
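After gathering transcripts for approximately 12 comedians, it’s time to make use of our pre-processing methods and organise our data into two types of tables: a so-called data frame and a document term matrix. In a document term matrix, each row represents a document (here, a comedian's transcript), each column represents a term, and each cell holds the number of times that term appears in that document.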
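As a minimal sketch of what a document term matrix looks like, the snippet below builds one with plain Python. The two placeholder "transcripts" are invented and assumed to be already cleaned; in practice a data-frame library and a ready-made vectoriser would usually do this work.

```python
from collections import Counter

# Placeholder, already-cleaned transcripts (invented for illustration).
transcripts = {
    "comic_a": "right charlie hope great birthday great",
    "comic_b": "hope great show tonight",
}

# Count how often each term appears in each document.
counts = {name: Counter(text.split()) for name, text in transcripts.items()}

# The vocabulary supplies the matrix columns, sorted for a stable order.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per term, each cell a term count.
dtm = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Most cells in a real matrix are zero, since any one comedian only uses a small share of the overall vocabulary.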
Exploratory Data Analysis (EDA)
Next, we perform what is known as Exploratory Data Analysis, or EDA for short. Our main goal here is to discover and summarise the many insights that can be gained from our data — and to do so in a visual way.
So first and foremost, with your document term matrix to hand, you can find the most used terms for every individual comedian and create useful word clouds that represent their particular inclinations.
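Finding each comedian's most used terms can be sketched as below. The term counts are made-up numbers standing in for rows of a real document term matrix, and `top_terms` is a hypothetical helper of our own; a word cloud library would then take these counts as its input.

```python
from collections import Counter

# Made-up term counts per comedian (stand-ins for real matrix rows).
term_counts = {
    "comic_a": Counter({"great": 5, "hope": 2, "birthday": 1}),
    "comic_b": Counter({"show": 4, "tonight": 3, "great": 1}),
}

def top_terms(counter, n=2):
    """Return the n most frequent terms, most frequent first."""
    return [term for term, _ in counter.most_common(n)]

favourites = {name: top_terms(c) for name, c in term_counts.items()}
print(favourites)
```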
Then, you could compare the number of words each comic uses and their speed of delivery, both of which can be presented using simple bar charts.
It is important to note here that because this analysis is related to your own personal preferences, the data you choose to include may be anything that appeals to you. So if you are someone who tends to swear like a trooper, then perhaps you should take a look at the amount of profanity used.
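Measures such as speed of delivery and amount of profanity are simple to compute once a transcript is tokenised. The sketch below assumes you know each routine's duration in seconds; the sample tokens, the duration and the (very mild) profanity list are all placeholders.

```python
# Placeholder profanity list; substitute whichever words matter to you.
PROFANITY = {"damn", "bloody"}

def words_per_minute(tokens, duration_seconds):
    """Speed of delivery: words spoken per minute of routine."""
    return len(tokens) * 60 / duration_seconds

def profanity_count(tokens):
    """Number of tokens that appear in the profanity list."""
    return sum(token in PROFANITY for token in tokens)

# An invented eight-word snippet, assumed to last four seconds.
tokens = "well damn that was a bloody good night".split()
speed = words_per_minute(tokens, duration_seconds=4)
swears = profanity_count(tokens)
print(speed, swears)
```

Computing these two numbers for every comedian gives you exactly the kind of data that suits a bar chart.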
The Results + Some Extra Fun
In the end, not only could machine learning algorithms help you to pinpoint what makes Jimmy Carr so special to you, but by making use of Markov chains, you could take this experiment to the next level and automatically generate entire stand-up routines. A Markov chain picks each new word based only on the word that came before it, using the word pairs found in Carr's transcripts, so the results would echo his own unique style. And yes, they might even make you laugh!
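For the curious, a word-level Markov chain generator fits in a few lines of Python. This is a minimal sketch under our own assumptions: the training text is an invented placeholder rather than a real transcript, and each next word depends only on the current one.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each word to the list of words that follow it in the text."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=None):
    """Generate up to `length` words by walking the chain from `start`."""
    rng = random.Random(seed)  # a seed makes the output repeatable
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# Invented placeholder text standing in for a real transcript.
text = "i like jokes and i like cats and i like puns"
chain = build_chain(text.split())
print(generate(chain, start="i", length=8, seed=0))
```

Because followers are stored with repeats, words that often follow one another in the source are proportionally more likely to be chosen, which is how the generated text picks up the original's habits.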
Unleash the Power of Text
So there you have it! Thankfully, in spite of the complexity of the English language, with simple maths and a plethora of pre-built libraries and services, Software Planet can help you to unleash the power of text. By applying NLP methods to your ML and AI-leaning projects, we can help your company save an extraordinary amount of time.