NETWORKED INSIGHTS BLOG

Social Media Analytics, Humans vs. Machines

The fine folks at www.research-live.com recently posted an email debate titled “Tracking online word-of-mouth: The people vs machines debate.” This debate featured Mike Daniels of Report International arguing the pro-human side and Mark Westaby of Metrica arguing the pro-machine side.

This is a great debate and is definitely worth checking out, along with Nathan Gilliatt’s response. I would like to add a few points that may have been underemphasized. In particular, I would like to address misconceptions about the volume of data we’re talking about, the use of statistics, validity vs. power, and human bias.

Data Volume

The main issue around data volume was covered in my last blog post, “Does monitoring provide the confidence and omniscience you need?” One of the main points of that article is that, in many cases, there is simply too much data out there for humans to analyze effectively.

As a concrete example (which I used in the last article), a search of only Google Groups for “Macintosh OR Mac -cheese -Fleetwood” returns over 17,000,000 posts over the last three months (more than twice as many as I found while writing my last article). That averages out to over 180,000 posts a day for one brand, from only the sources searched by Google! (While the number of sites covered by Google is impressive, it is not exhaustive.)

The point is that performing analytics around one topic or brand (such as Macintosh), let alone several, means analyzing staggeringly large volumes of data, and that volume is only increasing with time. It is important to keep the magnitude of data in mind as we move forward.

Statistics

The response to the above point is often something similar to Mike Daniels’s response: “But there’s an unstated assumption behind the technology promise: that it is necessary to analyse all or a very large percentage of these conversations in case we miss something”[sic].

Do we need to analyze a large percentage of posts? Yes and no. There are really two use cases that we are talking about.

  1. Finding the PR or marketing emergencies that require immediate action
  2. Understanding trends

For the first use case, finding emergencies, we really do need to analyze a large percentage of posts, and this is why the volume of data is relevant. (This point was also covered in my last blog post.) For example, let’s say we can only analyze 10,000 posts per day because we are using humans. If we are receiving 180,000 posts per day, that gives us less than a 6% chance of finding an emergency post when it happens. How much are you willing to pay for a 6% chance?
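To make the arithmetic explicit, here is a minimal sketch of that 6% figure. It assumes the posts a human team reads are effectively a uniform random subset of the day's posts, which is my simplification; the function name is hypothetical and only for illustration.

```python
def chance_of_catching(posts_per_day: int, reviewed_per_day: int) -> float:
    """Probability that one specific post is among those reviewed.

    Assumption: reviewed posts are picked uniformly at random from the
    day's posts, so the chance equals the fraction of posts reviewed.
    """
    if posts_per_day <= 0:
        raise ValueError("posts_per_day must be positive")
    # Reviewing more posts than exist cannot push coverage above 100%.
    return min(reviewed_per_day, posts_per_day) / posts_per_day


# Figures from the example above: 10,000 reviewed out of 180,000 posts.
coverage = chance_of_catching(180_000, 10_000)
print(f"{coverage:.1%}")  # prints 5.6%
```

If reviewers prioritize some sources over others, the real chance for any given post could be higher or lower than this, but the overall coverage stays the same.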

For the second use case, understanding trends, it is true that we do not need to analyze a large percentage of posts. We have well founded statistical methods of sampling a subset of posts, analyzing them, and then relating our results back to the whole in a valid way.
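As a rough illustration of what sampling buys us, the sketch below estimates the share of negative posts from a simple random sample and attaches a 95% margin of error using the normal approximation. The population (180,000 posts, 20% negative) is made up for the example, and the function name is hypothetical.

```python
import math
import random


def estimate_share(labels, sample_size, seed=0):
    """Estimate the fraction of True labels from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of a 95%
    confidence interval under the normal approximation. The finite
    population correction is ignored, which is reasonable when the
    sample is a small fraction of the whole.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(labels, sample_size)
    p_hat = sum(sample) / sample_size
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, margin


# Hypothetical day of posts: True marks a negative post (made-up 20% share).
posts = [True] * 36_000 + [False] * 144_000
p_hat, margin = estimate_share(posts, sample_size=2_000)
print(f"estimated negative share: {p_hat:.3f} +/- {margin:.3f}")
```

Reading 2,000 posts instead of 180,000 gives an estimate within roughly two percentage points here, which is usually enough for a trend but says nothing about any individual post.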

However, saying that we don’t have to measure all posts is an abuse of statistics. Statistics is not meant as a shortcut to avoid processing more data points. Instead, it is a way to still derive some value when you CANNOT process all the data points (for whatever reason). Statistics is a fallback; it is a safety net. As a general rule, we should process as many data points (posts) as we can and only use statistics when we are unable to process all of the points (due to constraints of time, money, feasibility, etc.).

Validity vs. Power

This discussion of statistics feeds directly into my next point: analytic validity and power. This is really the heart of the matter. While in some cases we may not need to process a large percentage of posts (as I discussed above), we do want to process as many posts as possible.

In analytics, we talk about “validity” and “power”. Statistics provides methods and rules for finding valid results when you cannot process all the data points. Analytic power comes from processing more data points. “Power” in this sense is the ability to detect a difference, trend, etc. when it exists. So, yes, we don’t need to process a large percentage of the posts to be valid. But with more posts comes more analytic power and, hence, more value. Thus, there is much to be gained from taking advantage of machines to perform analytics.
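To show how power grows with the number of posts processed, here is a small sketch using the textbook normal approximation for a two-sided one-sample test of a proportion. The scenario is my own assumption, not data from the debate: the negative share moves from 20% to 22%, and we ask how likely we are to detect that shift at different sample sizes.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p0, p1, n, alpha=0.05):
    """Approximate power of a two-sided z-test for a proportion.

    p0 is the baseline share, p1 the true share, n the number of posts
    analyzed. Uses the normal approximation to the sample proportion.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se0 = sqrt(p0 * (1 - p0) / n)  # standard error if nothing changed
    se1 = sqrt(p1 * (1 - p1) / n)  # standard error at the true share
    # Probability the sample share falls above or below the rejection bounds.
    upper = 1 - z.cdf((p0 + z_crit * se0 - p1) / se1)
    lower = z.cdf((p0 - z_crit * se0 - p1) / se1)
    return upper + lower


# Hypothetical shift in negative share from 20% to 22%.
for n in (500, 5_000, 50_000):
    print(f"n={n:>6}: power ~ {approx_power(0.20, 0.22, n):.2f}")
```

Every sample size here gives a valid test; what changes is how often a real two-point shift would actually be noticed.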

What about humans, then?

Human Bias

Even if we could use humans to analyze all of the posts we have, we still may not want to. Computers have the ability to put new data points in context with all of the other data under consideration. Humans can’t do that (especially when there are 180,000 other data points); we put things in context with all of the other things we know, which is very different. This is where bias can creep in.

There is a natural flow in analytics from data, to information, to knowledge, to decisions. We begin with data, organize it, analyze it, and put it into context with the other data to produce information. Then, a human consumes that information and combines it with what she knows to produce new knowledge. This knowledge can then drive decisions.

Regardless of whether your analytics are performed by human, machine, or other, this is the general flow; human knowledge is used at some point. But it is important to insert this knowledge into the flow at the right time. Human knowledge, in the form of expectations, inserted too early into the process I’ve described can drastically bias the results.

To sum up, I think Mike Daniels and Mark Westaby both bring up some great points, as do many of the people commenting on their post. At the end of the day, both humans and machines are needed; one cannot proceed along the data, information, knowledge, decision chain without them. The trick is to use the right tool for the right job. Machines are great at processing large amounts of data, putting it into context, and producing information, while people are unequaled in our ability to create knowledge and use it to drive decisions in situations where the answer cannot be reduced to a one or a zero.

