In this talk, Jess Bowden introduces the area of NLP (Natural Language Processing) and gives a basic introduction to its principles. She uses Python and some of its fundamental NLP packages, such as NLTK, to illustrate examples and topics, demonstrating how to get started with processing and analysing natural languages. She also looks at what NLP can be used for, gives a broad overview of its sub-topics, and shows how to get yourself started with a demo project.
This talk was part of AsyncJS (May event).
[00:00:08] Today, I'm going to be talking about natural language processing, specifically with Python, just a bit of an introduction. Yes. This is an overview of what I'll be talking about today, so I'm going to give a little introduction to who I am in case you don't know me. Then a bit of an introduction into what natural language processing is and what's going on, in case you don't know, and then why I think you should be using Python for NLP. Some people might disagree, that's fine. Then an introduction, like a crash course, on what the syntax is in Python, just some things that are different that you might not know about. Sorry if you do. Then I'll be looking at preparing your data for building prototypes and how to load data in Python, and then a little look at how to explore and analyse data, things like tokenising, and then I'll be looking at a couple of little sentiment-based projects that you could hopefully play around with yourself. Then looking at some more advanced things, perhaps.
[00:01:17] Yes, like I said, I'm Jessica, I work at Brandwatch, Dan said that. I just started on the data science team and I've been there for about two years now, and that's my Twitter handle. Yes, natural language processing is a really, really broad topic. I'll be trying to cover some basic techniques today. It covers topics like machine translation, summarising blocks of text, like something that got big like Summly, which is a terrible name; spam detection, sentiment analysis. There are a couple more really big fields. I think Python is great, which is the main language that I use for programming now. It's really readable so it makes for really fast prototypes, and it's got really rich support for text analysis, strings and lists. There are loads of great available NLP libraries like NLTK, spaCy, TextBlob, and there are also some really great parsing libraries. I've also just added a couple of tools I like using, if you want to have a look in your spare time.
[00:06:02] Yes, just a little intro to getting started with NLP in Python. In case you need to, that's how you open and read text files from a local file, and then like this for online files, in case you need to read an online text file to process. Then I'm going to do a little introduction to NLTK, which is a really popular NLP library in Python. It's quite old and it's not often updated now, but it's really great for educational purposes, which is why I'm introducing it here. It's got a free book included, it's got loads of open data sets that are free, you can just use them – it's great. The first thing I want to go over is tokenising. Tokenising is where you just split your document up into logical chunks, which are usually broken up by sentences, so if I wanted to tokenise the first line from Alice in Wonderland, it would end up looking like this if I used the default NLTK tokeniser. It just breaks it up. It looks at punctuation and spaces. The next thing is stemmers and lemmatizers; they basically just reduce words to their normalised form. "Am" would become "be" and "cars" would become "car". That's how you use a stemmer and this is how you use a lemmatizer in NLTK. They look like they do pretty much the same thing, but stemmers are a lot more naïve and they don't analyse the text like a lemmatizer does. They're a lot faster. Stemmers are fine if you just want to chunk your text and have it in a comparable format, but you're better off using lemmatizers if you want to cluster similar text in some way. You'll notice things like the "e" of "Alice" has just been chopped off, but it's not a plural just because it's got an "e" on the end. The same with "Lewis" or "Carroll" as well; that's really crap.
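The tokenising and stemming described here can be sketched roughly like this. This is a minimal stand-in using only the standard library: in NLTK itself the entry points are `nltk.word_tokenize` and `nltk.stem.PorterStemmer`, and the real stemmer is cleverer than this toy suffix-chopper, but the sketch shows the same naïve behaviour the talk points out (the "e" of "Alice" gets chopped off).

```python
import re

def tokenize(text):
    # Split into word tokens and keep punctuation as separate tokens,
    # roughly what the default NLTK word tokenizer produces.
    return re.findall(r"\w+|[^\w\s]", text)

def naive_stem(token):
    # A deliberately naive stemmer: chop common suffixes off the end
    # with no analysis of the word itself, as the talk describes.
    for suffix in ("ing", "es", "s", "e"):
        if token.lower().endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

line = "Alice was beginning to get very tired of sitting by her sister"
print(tokenize(line)[:4])      # first few tokens of the opening line
print(naive_stem("cars"))      # "car" - the plural is stripped, as intended
print(naive_stem("Alice"))     # "Alic" - the trailing "e" is wrongly chopped off
```

The point of the last line is exactly the failure mode mentioned in the talk: a stemmer chops blindly, so proper nouns like "Alice" get mangled.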
[00:08:20] Lemmatizing aims at the same thing as stemming: reducing a word to its normal form. This one doesn't work so well because I haven't added in the part of speech, which I'll come to later, but it basically considers the context; it doesn't just go through naively and chop off wherever it sees an "s".
Okay, so do stemmers chop from the end or the beginning, or at both ends?
No, just from the end.
Just from the end.
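The distinction being drawn here can be sketched with a toy lookup-table lemmatizer. NLTK's real `WordNetLemmatizer` is backed by the WordNet database and takes a `pos` argument in the same spirit; the entries below are made up purely for illustration.

```python
# A toy lemmatizer: unlike the suffix-chopping stemmer, it uses a small
# lookup table plus the part of speech to pick the normal form.
# (NLTK's WordNetLemmatizer works similarly in spirit, backed by WordNet.)
LEMMAS = {
    ("am", "v"): "be", ("is", "v"): "be", ("was", "v"): "be",
    ("cars", "n"): "car", ("better", "a"): "good",
}

def lemmatize(word, pos="n"):
    # Fall back to the word itself when it isn't in the table,
    # so proper nouns like "Alice" come through untouched.
    return LEMMAS.get((word.lower(), pos), word)

print(lemmatize("am", pos="v"))   # "be" - needs the verb part of speech
print(lemmatize("cars"))          # "car"
print(lemmatize("Alice"))         # "Alice" - not mangled, unlike the stemmer
```

This is why the talk says the lemmatizer "doesn't work so well" without the part of speech: "am" only maps to "be" once you tell it it's a verb.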
[00:08:54] Yes, so stemmers just chop plurals, and lemmatizers consider the context. Okay, so now I'm going to look at exploring and analysing data. The first thing that's quite fun that you can do with NLTK is explore frequency distributions, so we can try and find out which are the most informative tokens in our text. To do this we can just use the FreqDist class from NLTK, run it against our set of individual tokens from Alice in Wonderland and extract the top 25 most common tokens from the text. That's what it looks like. It's not very informative because it's kept commas and punctuation and stop words. It's just full of rubbish, really. They've been included because they are evenly distributed throughout the text. It makes logical sense but it's not very useful for us. We can instead look at the opposite, the tokens that hardly occur at all, but again, I've never heard of "Brandi" in Alice in Wonderland so it doesn't really tell me much about the text.
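NLTK's `FreqDist` is essentially a counter over tokens (it even exposes `most_common` and a `hapaxes()` method for the words that occur only once), so the two views described here can be sketched with the standard library alone. The toy sentence stands in for the full novel.

```python
from collections import Counter

# Toy text standing in for the tokenised Alice in Wonderland.
tokens = ("the cat sat on the mat , the cat ran ; "
          "gryphon appeared").split()

# FreqDist behaves like a Counter over the tokens.
fdist = Counter(tokens)
print(fdist.most_common(3))   # dominated by "the", punctuation, stop words

# The opposite view: tokens occurring exactly once (FreqDist.hapaxes()).
hapaxes = [w for w, n in fdist.items() if n == 1]
print(hapaxes)                # rare words - not very informative either
```

Both ends of the distribution are noisy, which is exactly why the talk moves on to filtering by word length next.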
[00:10:14] Instead, we could look – still in the frequency distribution – at longer words, which is definitely more useful. Like "gryphon" and "creatures" and "mushroom". They are more informative words, but perhaps not quite what we want. The next part I want to talk about is part-of-speech tagging, also known as POS tagging. It's where, given a sentence or something like that, you extract whether each token is a verb or an adjective or just a bit of punctuation. This is how we tag using NLTK. Loads of different libraries have their own versions of how they represent the tags, which is really annoying. This one uses the Penn Treebank tag set. I don't know them all off the top of my head, but NNP is a proper noun and then VB is obviously a verb, and then we've got prepositions, nouns, conjunctions. Using a frequency distribution, and from knowing how we can POS tag sentences, we can consider the frequency distribution of the types of tags throughout Alice in Wonderland. Again, not very interesting: the most common are nouns, conjunctions, determiners, prepositions. It's what you'd expect. From that, we can try to find more interesting words, again. It was difficult earlier. This is looking more informative already. This goes through the most common ones we were looking at before and extracts the proper nouns, so obviously Alice, Queen, but it's included a bunch of punctuation. You can't really know why that's happened without going through and analysing the individual sentences, but the NLTK POS tagger is not amazing by any means, and this is on properly written text. It would just fall apart on Tweets.
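The tag-then-count pipeline described here can be sketched with a toy lookup-table tagger. The tag names are real Penn Treebank tags (NNP = proper noun, VBD = past-tense verb, DT = determiner, RB = adverb), but the tagger itself is a stand-in: NLTK's real `nltk.pos_tag` is a trained statistical tagger, not a dictionary.

```python
from collections import Counter

# Toy tagger: look each word up in a tiny hand-made table,
# defaulting to NN (common noun) for anything unknown.
TAGS = {"alice": "NNP", "queen": "NNP", "ran": "VBD", "the": "DT",
        "rabbit": "NN", "quickly": "RB", ".": "."}

def pos_tag(tokens):
    return [(t, TAGS.get(t.lower(), "NN")) for t in tokens]

tagged = pos_tag("Alice ran . The Queen ran quickly .".split())
print(tagged)

# Frequency distribution over the tags, as the talk does for the novel.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(2))

# Then pull out just the proper nouns from the tagged tokens.
print([w for w, tag in tagged if tag == "NNP"])   # ['Alice', 'Queen']
```

The last line is the step that surfaced "Alice" and "Queen" in the talk; with a real tagger, mis-tagged punctuation can slip into that list too.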
[00:12:42] Yes, now I want to have a little look at sentiment projects, working on the building blocks of what we had a look at before with tokenisation. One of the common approaches to sentiment analysis, while it's not super clever, is rule-based, which is exactly what it sounds like: finding rules in our text to work out its polarity. I've just stolen a bunch of reviews from Rotten Tomatoes, because I'm a monster, that I can analyse. Building from what we had before, I'm going to take the tokeniser and split one of our reviews, I think it's Captain America, into tokens and then POS tag them and return those. That's what you can see here. Then I've just built a list of handcrafted rules, and that's definitely not the way you do it in real life, but it's good enough for this. The first way we could do it is go through and look at the first review: "very entertaining and a far tighter production of Marvel's recent output". Go through, and if I find one of the words that's in our positive rule, increment the score by one, or if I find one in the negative rule, decrement it by one. Really simple approach, and it just outputs zero, because "entertaining" was there but we had "entertain" in the rules and we didn't lemmatize it.
[00:14:21] Next up, we can add lemmatization, but there's an awful lot to go through, so I've restricted it to just looking at adjectives and done the same again. It's found "entertaining" because we had "entertain" in the rule set. We can just build upon it like this. Then you can improve it further, maybe by looking at words that intensify the meaning of things, like "really great" or "very great", or "too brilliant" might be even better, I don't know. Tweets are terrible these days. Then from here we can see that "very entertaining" increases the score even more. This isn't a great approach, but it's just an example of roughly how a rule-based approach could work. Then you could take it even further and just build upon it. In a similar way to words incrementing it, like "very", we could add modifiers for words like "not". Or we could add things that decrement it, like "It's a little bit good."
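The rule-based scoring built up over the last two sections can be sketched like this. The word lists are tiny illustrative stand-ins for the handcrafted rules (not the ones used in the talk), with the intensifier and negation ideas just described bolted on.

```python
# Minimal rule-based sentiment scorer: +1 per positive word, -1 per
# negative word, with "very" doubling the next hit and "not" flipping it.
POSITIVE = {"entertaining", "great", "brilliant", "tight"}
NEGATIVE = {"boring", "terrible", "awful"}

def score(review):
    tokens = review.lower().split()
    total = 0
    for i, tok in enumerate(tokens):
        value = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if value and i > 0:
            if tokens[i - 1] == "very":    # intensifier: count it double
                value *= 2
            elif tokens[i - 1] == "not":   # negation: flip the polarity
                value *= -1
        total += value
    return total

print(score("very entertaining and a far tighter production"))  # 2
print(score("not entertaining and frankly boring"))             # -2
```

Note that in real use the tokens would be lemmatized first, as the talk shows, so that "entertaining" matches a rule for "entertain"; here the rule list simply contains the surface form.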
How do you test drive sentiment analysis like that? Would you find a hundred lines where you know what the score should be?
What? For a rule-based approach?
[00:15:50] I get people to mark it up themselves, I guess. I get people to mark up a bunch of them and then you can say, "Well, if it agrees with them, I'm probably right." Which brings me on to Naive Bayes sentiment analysis, which is more sophisticated, I suppose. I'll just move on. Words are not coming. Okay. This is how you can build a super simple Naive Bayes classifier with NLTK. NLTK's Naive Bayes works by using training data. I've gone and found a bunch of Tweets which are already marked up. It will be loads of Tweets that someone has hand-annotated for a very long time, saying, "This one is positive, this one is negative", and it's very exhausting. Then I can go through and split my data into training data and testing data, so the training data is what the classifier will use, and the testing data is so I can see later if it's worked as well as I'd hoped.
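The train/test split just described can be sketched like this. The tweets are made-up placeholders; the polarity convention (4 = positive, 0 = negative) matches the annotated data set discussed in a moment.

```python
import random

# Hand-annotated tweets: (text, polarity) with 4 = positive, 0 = negative.
labelled = [("love this", 4), ("hate this", 0),
            ("so good", 4), ("so bad", 0),
            ("brilliant stuff", 4), ("utter rubbish", 0)]

random.seed(0)                          # fixed seed so the split is repeatable
random.shuffle(labelled)                # shuffle before splitting
split = int(len(labelled) * 2 / 3)      # e.g. two thirds for training
train_data, test_data = labelled[:split], labelled[split:]
print(len(train_data), len(test_data))  # 4 2
```

The classifier only ever sees `train_data`; `test_data` is held back so the classifier's answers can be checked against labels it never saw.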
Sorry, Jessica, could you go back, could you explain the polarity?
Yes, the polarity is just the positive or negative. For some reason the person has used four instead of one.
It’s so strange, I don’t understand it.
Just in case you get a bit of wibble because they know one, two and three then.
Yes, they just used four and zero, I don’t know why. It seems so obvious.
Ask for a beer, get four back.
[00:17:39] Yes, we split our data and I've just processed it so that it's in a format that makes it easier for each Tweet. If it's a positive Tweet, I've put it in a tuple with the Tweet and the sentiment, in a list called either positive or negative. This is how it looks now. This is a sample of the negative and positive. It's just the Tweet and the sentiment. Now, ridiculous. How's it ever going to learn? Now, I'm breaking each one down into a bag of words, making sure it's still kept with its sentiment, but removing the really small words because they're not going to be informative to us. This gives us an actual bag of words for all the Tweets; before we had them grouped, and we still know whether they were positive or negative because we kept track of them from before. Now we've just got one big bag of words for all of the Tweets, which lets us build this. Now, building a frequency distribution like we did before, we can find out the most informative features from this group of words. We go through and extract these features so that when we have a document, we can find whether any of the features match up. Sorry. We pass in a document like, "All rock stars are back home while some of us freshen up, others watch magic Lakers games then will celebrate in rocking Florida." Okay. We split that up and we can go through and find out if anything in this Tweet matches with things from our training data, so we can ultimately classify it using the classifier.
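The feature-extraction step described here can be sketched like this: each document is turned into one boolean feature per known word, which is the dictionary shape that NLTK's Naive Bayes classifier consumes. The vocabulary below is a tiny stand-in for the word list built from the training tweets.

```python
# Stand-in vocabulary (in the talk this comes from the frequency
# distribution over the training tweets' bag of words).
vocabulary = ["love", "hate", "magic", "celebrate", "cancer"]

def extract_features(tweet):
    # One boolean feature per vocabulary word: is it in this tweet?
    words = set(tweet.lower().split())
    return {f"contains({w})": (w in words) for w in vocabulary}

features = extract_features(
    "others watch magic Lakers games then will celebrate")
print(features)   # only "magic" and "celebrate" are present
```

Words the training data has never seen get no feature at all, which is why, as noted later, the classifier can't do anything sensible with unseen features.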
[00:19:58] Then from that training set, we build an actual classifier, and this is an output of what it thinks are the most informative features. They're the features that bear the most weight for the classifier we just built. If something contains "cancer", it's about 12:1 odds that it's negative; whereas if it contains "love", it's 10:1 odds that it's positive, because that's what it learned from the data, and rightly so. Now we've got this, we can classify some Tweets we put aside at the beginning. We can extract one of the positive Tweets and classify it with our new classifier, and it classifies the positive ones as positive and the negative ones as negative. That's been classified as positive, and it was marked up by a person as positive. I haven't done a thorough investigation on all the Tweets I put aside, I probably should, but I think I only put about 100 Tweets aside and that's probably not enough. Naive Bayes is entirely dependent on how much data you throw at it; the more data you give a Naive Bayes classifier, the better it will perform. Obviously, then you've got the downside of having to mark up loads of data, and if something doesn't classify correctly, as we'll see later, it's hard to see where it's gone wrong and you just have to throw more data at it, and if it comes across a feature it hasn't seen before, it's not going to be able to classify it, or it's going to classify it incorrectly.
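What the classifier is doing under the hood can be sketched by hand: a from-scratch Naive Bayes with Laplace smoothing over a made-up four-tweet training set. NLTK's `nltk.NaiveBayesClassifier.train` does this (plus the most-informative-features report) for you; this sketch just makes the mechanics visible.

```python
import math
from collections import Counter

# Made-up training tweets, standing in for the hand-annotated data set.
train = [
    ("i love this movie it is great", "pos"),
    ("what a great and lovely day", "pos"),
    ("i hate this it is terrible", "neg"),
    ("terrible awful boring film", "neg"),
]

def words_of(text):
    # Bag of words, dropping very short tokens, as in the talk.
    return [w for w in text.lower().split() if len(w) > 2]

# Count word occurrences per class, and tweets per class.
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(words_of(text))

vocab = set(word_counts["pos"]) | set(word_counts["neg"])

def classify(text):
    best_label, best_logp = None, -math.inf
    for label in ("pos", "neg"):
        # log P(class) + sum of log P(word | class), Laplace-smoothed
        # so unseen words don't zero out the whole product.
        logp = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in words_of(text):
            logp += math.log((word_counts[label][w] + 1) /
                             (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

print(classify("such a great film"))     # pos
print(classify("boring and terrible"))   # neg
```

The smoothed per-class word ratios are exactly where numbers like "12:1 for negative given 'cancer'" come from in NLTK's most-informative-features output.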
[00:21:48] Now, I just thought I'd do a little demo so we could look at the sentiment that co-occurs with smileys, using the classifier we've already built. Yay, loading in Tweets. This is a lot of code. Right. I'm not going to go through and explain all of this because it would be really boring, but basically these are the Unicode ranges for emojis, and they're in two different ranges, which is why they've had to be compiled separately. It just finds any of those ranges in a Tweet, then goes through and classifies the Tweet, and if an emoji exists, it finds that emoji in our dictionary and adds one to its positive count, or to its negative count if it's a negative Tweet. Then that's the result for all of our emojis below, but it will be a bit more useful in a graph. I don't think this is going to be super informative, but we can have a look anyway. It's not bad. The crying emoji appears far more often with negative Tweets than with positive Tweets. The happy emoji appears more or less an equal number of times with both, which makes me think that it's not a very good classifier. The joyful emoji is appearing with negative far more than positive. I don't think this classifier has anywhere near enough data, or maybe people are writing really weird Tweets.
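The emoji-counting part of the demo can be sketched like this. The two Unicode blocks used (Emoticons, U+1F600–U+1F64F, and Miscellaneous Symbols and Pictographs, U+1F300–U+1F5FF) cover most common emoji but not all; the sentiment labels here are hard-coded stand-ins for what the classifier would return for each tweet.

```python
import re
from collections import defaultdict

# Match emoji by Unicode range; in Python 3 both ranges can live in one
# character class (the talk compiles them separately).
EMOJI_RE = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF]")

# (text, sentiment) pairs - the label comes from the classifier,
# not from the emoji itself. These examples are made up.
tweets = [
    ("great day \U0001F602", "neg"),
    ("miss you \U0001F622", "neg"),
    ("so happy \U0001F602", "pos"),
]

# For each emoji, count how often it co-occurs with each sentiment.
counts = defaultdict(lambda: {"pos": 0, "neg": 0})
for text, sentiment in tweets:
    for emoji in EMOJI_RE.findall(text):
        counts[emoji][sentiment] += 1

print(dict(counts))
```

With real data this dictionary is what gets plotted: an emoji whose counts are lopsided towards one class (like the crying face towards negative) is the interesting signal.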
Couldn't you just split it in a way where the emojis themselves become words that are classified by the Bayes classifier?
No, because I don't think there are any emojis in the data set I had.
Okay, so there are no emojis in the training set, just in the one that you…
This is just separate data that I found.
[00:24:00] Which I just gathered myself to try and maximise the number of emojis that I could get back, but there weren't many. Then I tested against that. I've not used the training data. I've not been nice to it; it's just completely new data. Yes, there is still a lot going on beyond NLTK; it's quite a limited library, really. I hope that some of these demos have given you an idea of what you can go and do. If you're actually interested in going far beyond NLTK, there are a lot of interesting and better, faster libraries about at the moment. There's a Python library called spaCy, which is really cool, which has built-in named entity recognition and tokenisation, which is far superior to NLTK's. You might have heard of Google's dependency parser that came out recently; that was open-sourced, I think. There are a lot more things you can look at that are a lot more relevant, but I hope this has given you an idea of getting started. It's not too hard at all, really.
Can you say the name of Google’s new…?
[00:25:12] Parsey McParseface. It's on GitHub. This is just a Jupyter notebook, so you can just run it yourself and it should be easy. Yes. Okay.