NETWORKED INSIGHTS BLOG

Sentiment in the dribs and drabs of informal writing

In a secret mission given to me by the commanding NLP officer at Networked Insights, I bumped into a new kid on the sentiment-analysis block, Evri (founded in June ‘09ish, I believe). What they do is pretty interesting! They comb a limited number of “highly regarded” sources[1], extract entities (NLP jargon for the stuff we’re talking about: words and phrases), and relate them to one another. If you traffic in NLP-land this isn’t super-awesome-cool, but it is a lot of fun to see someone productize some of the algorithms out there. Kudos!

Now, you’re probably wondering why I italicized the little quote about highly regarded sources, and if you are the foreshadowing type, you may already be able to guess where I’m going with this. First, let me say that most of the effective algorithms for extracting entities, and almost all of NLP, rely on fairly well-formatted, “predictable” text. By “predictable” I mean that it follows formal grammar rules, etc. So, in selecting highly regarded sources (say, CNN?) you are constraining your statistical frame to sites that you can process. There isn’t anything terribly wrong with doing what you’re good at, but, at least at Networked Insights, we fight to keep away from this restriction. In part, that is because formal writing carries a bias: its message is shaped by latent motives, such as keeping a high reputation. It is an immensely more difficult task to harvest information from the dribs and drabs of informal writing, such as is found on Twitter and in forums (or even blogs).

It’s because of this that, while the thought of Evri is exciting, I don’t think it will tell you anything that you didn’t already know. Now, my goal isn’t to pick on Evri, but I think it is fascinating to realize that analyzing the more formal, easily analyzed text on the web is a bit of a losing battle precisely because the formalism comes in part from the author knowing that we are watching. Fox or CNN write with a specific audience in mind, and that audience is the same audience that TV seeks to entertain, and on, and on, and on. What is so powerful about the social web is that its text is produced with an audience of only two or three people in mind.

I say this is powerful in two ways. First, biases even out through the sheer volume and diversity of publishers. If someone is trying to catch the attention of a particular audience and tailors their text to fit, then that intentional wordsmithing is likely diluted by countless other authors with similar overall ideas to express, but different spins to put on the text. Second, since most of this text is generated quickly, and admittedly is not always that well thought out, the raw feelings of people tend to leak into the text. It’s the living room conversations had without thinking. It’s the reflexive “boo” at the stadium. This rawness, if you will, is far more valuable because it’s never what you were expecting.
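
For the curious, the first point (biases evening out with volume) can be sketched as a toy simulation. This is only an illustration, not anything we run in production: the function names, the "true" sentiment of 0.2, and the uniform spin are all made up, and the big assumption is that each author's spin is independent of everyone else's and centered on zero. If every author leans the same way, no amount of volume will wash that out.

```python
import random
import statistics


def observed_sentiment(true_sentiment, n_authors, rng, spin=1.0):
    """Average the sentiment of n_authors posts.

    Assumption: each author adds an independent, zero-centered spin
    drawn uniformly from [-spin, spin] to the same underlying sentiment.
    """
    posts = [true_sentiment + rng.uniform(-spin, spin) for _ in range(n_authors)]
    return statistics.fmean(posts)


def mean_abs_error(true_sentiment, n_authors, trials=200, seed=0):
    """Average distance between the observed and the true sentiment
    over many simulated crowds of n_authors each."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    errors = [
        abs(observed_sentiment(true_sentiment, n_authors, rng) - true_sentiment)
        for _ in range(trials)
    ]
    return statistics.fmean(errors)


if __name__ == "__main__":
    # Hypothetical underlying sentiment of 0.2 on a made-up scale.
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} authors: mean abs error {mean_abs_error(0.2, n):.3f}")
```

Under those assumptions the error shrinks as the crowd grows, which is the whole argument in miniature: one carefully spun source can sit far from the truth, while a thousand sloppy ones tend to land close to it.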

