It’s no secret that every language is a complex beast to master. Even English, with its watered-down Germanic grammar and extensive borrowings from Latin, will often befuddle the most learned of minds. Unsurprisingly, therefore, computer programs are not yet able to decode human language as well as most people can. What may come as a surprise, however, is just how much our modern software is already capable of doing! In fact, in many new and exciting ways, Natural Language Processing (NLP), the subfield of artificial intelligence concerned with the interactions between computers and human languages, may completely revolutionise computing, so for this blog post in particular, you should probably sit upright!

Pre-Processing

Before the magic begins, at SPG, we believe it is crucial to spend some time pre-processing our data. This refers to a number of small yet hugely important tasks that convert our sentences into a format machines can grasp.

1. Clean data

Cleaning data may take a variety of forms, but when dealing with text specifically, the most popular steps are to eliminate numbers, strip out all punctuation, and convert every word to lowercase.

2. Tokenise

In essence, “tokenisation” is just a fancy word for breaking text down into smaller components, and in this context it simply refers to splitting a sentence into individual words. So given the following example: “All right, Charlie. Hope you have a great 21st birthday!” after cleaning and tokenising the data, we would end up with the tokens “all right charlie hope you have a great birthday,” whose order, for our purposes, is inconsequential.

3. Remove stop words

Finally, we would have to remove all the stop words. While there is no single standardised list, these are generally understood to be very common words that carry little to no meaning on their own.
In English, these include the articles “the” and “a,” among many others. So using the above example, we would now have what is known as a simple bag of words: “right charlie hope great birthday”. This may seem like an overly minimalistic way to view your data, but when it comes to NLP, it is actually extremely handy.

Exploring the Possibilities

Once we are done with cleaning and organising our information, incredible things become possible.

Sentiment Analysis

Imagine that you are the manager of a customer service centre at a company that sells both jackets and shoes. After receiving a number of phone calls from people who love your shoes but appear to dislike your jackets, you now want to establish whether this is the general consensus. Thanks to NLP, instead of having to listen to potentially thousands of calls, you can simply convert audio to text and, based on the terms used in every conversation, automatically score the sentiment of each call on a scale from negative to positive.

Topic Modelling

Another possibility concerns the use of topic modelling. If you worked in an affluent law firm and were attempting to identify the culprit behind the suspected embezzlement of a very large sum of money, you would probably have to pore over tens of thousands of company emails. Alternatively, however, you could make use of an NLP tool that would label each message by topic, and begin your investigation with all the messages pertaining to cash.

One Step Further

Together, topic modelling and sentiment analysis can deliver mind-blowing benefits. Consider, for instance, the hypothetical case of stand-up comic Jimmy Carr. What if you, as someone who does not normally derive pleasure from stand-up, for whatever reason found Carr to be wholly hilarious? You would likely be inclined to ask yourself why.
It could be the fact that he’s British, or maybe his irreverence that can always make you belly-laugh, but regardless of your preconceived notions, you’d set out to find out what sets Carr apart.

Gathering Data

In order to solve this mystery, the first thing you would have to do is decide which data to gather, and that, of course, would probably be immediately obvious: transcripts! To make the comparison a fair one, you would need to gather transcripts of Carr’s routines along with those of stand-up gigs by comics of comparable clout. Thankfully, machine learning can help you out on both counts, as not only do solutions exist that will automatically convert audio to text, but you can also rely on recommendation algorithms to suggest similar entertainers.

Cleaning and Organising

After gathering transcripts for approximately 12 comedians, it’s time to make use of our pre-processing methods and organise our data into two types of tables: a so-called data frame (for example, one row per comedian holding their cleaned transcript) and a document-term matrix (one row per transcript and one column per word, with each cell counting how often that word appears).

Exploratory Data Analysis (EDA)

Next, we perform what is known as Exploratory Data Analysis, or EDA for short. Our main goal here is to discover and summarise the many insights that can be gained from our data, and to do so in a visual way. So first and foremost, with your document-term matrix to hand, you can find the most used terms for every individual comedian and create useful word clouds that represent their particular inclinations. Then, you could compare the number of words each comic uses and their speed of delivery, both of which may be presented using simple bar charts. It is important to note here that because this analysis is related to your own personal preferences, the data you choose to include may be anything that appeals to you. So if you are someone who tends to swear like a trooper, then perhaps you should take a look at the amount of profanity used.
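To make the steps above a little more concrete, here is a minimal Python sketch of the pre-processing pipeline and a tiny document-term matrix. It is only an illustration under a few assumptions: the stop-word list is a small hand-picked set (real projects usually borrow one from a library), the two "transcripts" are invented placeholders rather than real routines, and the function names `preprocess`, `document_term_matrix` and `top_terms` are our own.

```python
import re
from collections import Counter

# Illustrative stop-word list only (an assumption, not a standard catalogue).
STOP_WORDS = {"a", "all", "always", "and", "have", "i", "is", "it",
              "me", "my", "no", "the", "to", "why", "you"}


def preprocess(text):
    """Clean, tokenise and remove stop words from a piece of text."""
    text = text.lower()
    text = re.sub(r"\S*\d\S*", " ", text)  # drop tokens containing digits, e.g. "21st"
    text = re.sub(r"['’]", "", text)       # drop apostrophes so "he's" becomes "hes"
    text = re.sub(r"[^a-z\s]", " ", text)  # replace remaining punctuation with spaces
    return [token for token in text.split() if token not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row of term counts per document."""
    counts = {name: Counter(preprocess(text)) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, matrix


def top_terms(vocabulary, row, n=3):
    """Return the n most frequent terms in one row, ties broken alphabetically."""
    ranked = sorted(zip(vocabulary, row), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, count in ranked[:n] if count > 0]


# Placeholder "transcripts", invented purely for this example.
transcripts = {
    "comic_a": "I love my cat. My cat hates me, and my cat hates you!",
    "comic_b": "Airports! Why is it always airports? I have 2 bags and no plane.",
}

vocabulary, matrix = document_term_matrix(transcripts)
for name, row in matrix.items():
    print(name, top_terms(vocabulary, row))
```

Running `preprocess` on the birthday example from earlier yields the same bag of words described in the post: `right`, `charlie`, `hope`, `great`, `birthday`. The per-comic top terms are the raw material for the word clouds and bar charts discussed above.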
The Results + Some Extra Fun

In the end, not only would machine learning algorithms help you pin down what makes Jimmy Carr so special to you, but by making use of Markov chains, which pick each next word based on the word or words that came before it, you could take this experiment to the next level and automatically generate entire stand-up routines. These would be based on Carr’s own unique style, and yes, they might even make you laugh!

Unleash the Power of Text

So there you have it! In spite of the complexity of the English language, with simple maths and a plethora of pre-built libraries and services, Software Planet can help you to unleash the power of text. By applying NLP methods to your ML and AI-leaning projects, we can help your company save an extraordinary amount of time.
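For the curious, the Markov chain idea mentioned above can be sketched in a few lines of Python. This is a deliberately simple first-order, word-level chain, which is only one of several ways to build such a generator. The corpus is a nonsense placeholder sentence rather than a real transcript, and the names `build_chain` and `generate` are our own.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length=12, seed=None):
    """Walk the chain from `start`, choosing each next word at random."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Placeholder corpus; in practice this would be a comedian's cleaned transcripts.
corpus = "the cat sat on the mat and the cat ate the hat"
chain = build_chain(corpus)
print(generate(chain, "the", length=8))
```

Because followers are stored with repeats, words that follow "the" more often in the corpus are proportionally more likely to be chosen. With a corpus this small the output is mostly gibberish; the results only start to resemble a particular style once the chain is built from a large body of text.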