Neither Big Data nor email is anything new: the latter has been the communications vehicle of choice in businesses for over a decade, and companies have been pondering how to extract value from Big Data for nearly as long. In fact, it was the META Group (now part of Gartner) that coined the famous “volume, velocity and variety” to describe the challenges of Big Data.
Does email fit the definition of Big Data?
By its very nature, email is unstructured. Its volume grows at a staggering rate year over year, it is the transport mechanism for all manner of attachments, and it moves at the speed of communications. That would seem to fit every part of the definition of Big Data: volume, variety and velocity.
The problem is this: if email is Big Data, how can companies mine value from it?
The answers here are less clear. Companies looking to derive business trends don’t necessarily know what they’re looking for and, worse, are familiar only with business intelligence. Business intelligence and Big Data, it turns out, are different.
According to Pierre Delort, CIO at France’s Inserm research institute and a leading authority on Big Data, business intelligence (BI) relies on descriptive statistics applied to data with high information density to measure things and detect trends, whereas Big Data uses inductive statistics and concepts applied to large, low-density data sets to reveal relationships and dependencies and to predict outcomes and behaviors.
Email factors in an organization’s BI initiatives
Contrary to Mr. Delort’s distinction, many companies are looking to deploy BI initiatives that focus on Big Data rather than on the structured data that has factored in most business intelligence efforts.
Regardless of the methods deployed, the technologies used or the data stores being analyzed, mining intelligence from Big Data is very much an exercise in teasing out relationships and undiscovered connections – i.e., you know that you don’t know what you don’t know. In this case, that’s the whole point: to find new data points, ones that you didn’t know existed.
Email, therefore, is a logical focus for intelligence initiatives. Email is both a transport mechanism and a collection of captured communications, and its very bulk and velocity mean the likelihood of an email store holding new, unknown data points is high. In other words, it’s a valid data set for intelligence mining.
Email confounds many BI initiatives
As valuable as email may be for new insights and new data points, it can easily distort BI initiatives and render their results virtually meaningless. This is due to the redundant and transient nature of email.
Email is by nature repetitive: emails are often sent to numerous individuals, recipients frequently respond to emails (thereby duplicating the chain yet again), and users sometimes copy emails, change the subject matter, and send them anew (which can wreak havoc with metadata). Users often combine answers to different questions in the same email – so categorization is also an issue.
Email is also transient: many emails are only relevant for a short time. They refer to some current situation or, worse, comment on some current issue that’s not reflected in the email itself, and over time the reason they were originally sent is lost or forgotten.
These emails are sometimes referred to as ROT – redundant, obsolete, or trivial. Yet they frequently remain in email Big Data stores, and there is often enough ROT in an email store to adversely affect intelligence outcomes. To paraphrase the old saying, intelligence mined from garbage is still going to be garbage.
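The redundant part of ROT can be estimated mechanically. The following is a minimal Python sketch, not a description of any particular product: it assumes each message is a dictionary with hypothetical "subject" and "body" keys, that reply and forward prefixes look like "RE:", "FW:" or "Fwd:", and that quoted history is marked with a leading ">". Real mail stores vary, so these conventions would need adjusting.

```python
import hashlib
import re

# Assumed reply/forward prefixes; real mail clients use many more variants.
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes, collapse whitespace, and lowercase."""
    return " ".join(_PREFIX.sub("", subject).split()).lower()


def new_text(body: str) -> str:
    """Keep only non-empty lines that are not '>'-quoted history."""
    lines = [line.strip() for line in body.splitlines()]
    return "\n".join(l for l in lines if l and not l.startswith(">"))


def fingerprint(subject: str, body: str) -> str:
    """Hash the normalized subject together with the sender's own text."""
    key = normalize_subject(subject) + "\n" + new_text(body)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Return the first message seen for each fingerprint, in order."""
    seen = set()
    unique = []
    for msg in messages:
        fp = fingerprint(msg["subject"], msg["body"])
        if fp not in seen:
            seen.add(fp)
            unique.append(msg)
    return unique
```

A message forwarded with nothing added collapses into the original, while a reply that adds new text is kept. The share of messages dropped gives a rough indication of how much of the store is redundant.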
How can email be best adapted for intelligence initiatives?
Email – even very old email – can, and often does, contain valuable insights and new data. The trick that companies need to learn is how to rate and value their email information, and in particular how to remove emails for which there is no longer a business need. That business need may be a specific retention period for compliance reasons, but in most cases the email to remove is email that was either retained for no reason or no longer has value. An example of email retained for no business reason could be a .WAV file preserved as part of a journal capture. That file may have no business bearing – it could be an employee-to-employee joke – yet unless it is proactively removed, it (and thousands like it) can distort BI initiatives.
The second and larger issue is email which no longer has any business value. Just like news stories – where newscasters report the “headlines of the day” but often don’t follow up on how the story ultimately ended – email is a collection of statements-in-time, conversations about a current issue. A series of emails regarding a customer issue often doesn’t include the resolution – that is frequently captured in a CRM solution. If those emails continue to be included in BI initiatives years after they were written, no one – including the BI software – is going to realize what they were about and, worse, they may contain terminology that inadvertently ties them to some current yet unrelated issue. In other words, as ROT they can masquerade as valuable emails, yet only serve to trivialize BI results.
Because of these factors, email stores aren’t suitable for intelligence initiatives unless they are cleaned up. The easiest way to remove ROT is to use retention policies, which delete all email that is past a given retention period; 80% of ROT can be removed this way.
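At its core, a retention policy is a date comparison. As a rough Python sketch under stated assumptions: each message is a dictionary with a hypothetical "sent" key holding a timezone-aware datetime, and the retention period is expressed in days. The function name and the seven-year figure in the usage note are illustrative only, not a compliance recommendation.

```python
from datetime import datetime, timedelta, timezone


def apply_retention(messages, retention_days, now=None):
    """Split messages into (kept, expired) by age.

    'sent' must be a timezone-aware datetime; mixing naive and aware
    datetimes would raise a TypeError on comparison.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    kept, expired = [], []
    for msg in messages:
        # Messages sent on or after the cutoff are still within retention.
        (kept if msg["sent"] >= cutoff else expired).append(msg)
    return kept, expired
```

Calling apply_retention(messages, retention_days=7 * 365) would keep roughly the last seven years of mail. Returning the expired messages rather than deleting them outright leaves room for a review step, such as a legal hold check, before anything is removed.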
The second way email stores can be cleaned up is by eliminating non-business or non-valuable emails.
A simple way to do this is to eliminate obvious outliers, such as .WAV and .MPEG files (unless of course you have a business use for audio and video streams).
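Filtering on attachment type can be sketched in a few lines of Python. This assumes each message is a dictionary with a hypothetical "attachments" key listing file names, and it trusts the file extension, which a renamed file would defeat. The blocked set holds only the two types named above and would be adjusted to the organization's actual business needs.

```python
from pathlib import PurePath

# Extensions treated as non-business by default (an assumption; adjust as needed).
NON_BUSINESS_EXTENSIONS = frozenset({".wav", ".mpeg"})


def has_blocked_attachment(names, blocked=NON_BUSINESS_EXTENSIONS):
    """True if any attachment name ends in a blocked extension (case-insensitive)."""
    return any(PurePath(name).suffix.lower() in blocked for name in names)


def drop_media_messages(messages, blocked=NON_BUSINESS_EXTENSIONS):
    """Return only the messages that carry no blocked attachment."""
    return [
        msg for msg in messages
        if not has_blocked_attachment(msg.get("attachments", []), blocked)
    ]
```

Passing a different blocked set lets a company with a genuine use for audio or video keep those messages in scope.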
The ideal way to clean up email stores is not to let them become full of ROT in the first place, which is where a policy-driven retention solution comes into play.
Properly managed, email stores are a valuable source of Big Data, but in their raw form they can often yield disappointing results.