Superlatives are the norm when talking about Big Data. We all see Big predictions of Big numbers and High growth in petabytes, connected devices, data containers, and the like, and the Tidal changes they bring.
All true!
But I think the “Big” adjective best applies to Complexity. How do IT shops manage all these 1s and 0s and set the table for meaningful analytics?
At the most basic level, there are two approaches, seemingly contradictory, for laying the analytics foundation: consolidate data onto fewer platforms, or analyze it selectively across platforms in a decentralized fashion. These actually complement one another well as part of an iterative data management model in which IT continuously re-positions data sets to support evolving analytics needs.
Let’s examine each, then look at the squishy middle, where most companies will reside.
Bulk Consolidation!
Suppose data is growing 30% annually for a health insurance provider as it starts to incorporate standard patient medical records and claims documents, as well as Fitbit and SmartSleep feeds from willing patients who get a discount in return. All of this initially spans Oracle, SQL Server, and Hadoop. The provider wants to correlate data points from all sources for customer segmentation, pricing, and marketing programs. To start, it consolidates on Hadoop, using software such as Attunity Replicate to automatically integrate raw-format data onto Hadoop, where the data is subsequently transformed for its queries.
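Attunity's own workflow isn't shown here, but the consolidate-first, transform-later pattern is easy to picture. Below is a minimal PySpark sketch of that pattern; the connection URLs, table names, and columns are hypothetical stand-ins for the provider's actual sources.

```python
from pyspark.sql import SparkSession

# Hypothetical connection details -- stand-ins for the provider's real sources.
ORACLE_URL = "jdbc:oracle:thin:@//oracle-host:1521/CLAIMS"
SQLSERVER_URL = "jdbc:sqlserver://sqlserver-host:1433;databaseName=Records"

spark = SparkSession.builder.appName("consolidate-to-hadoop").getOrCreate()

# Step 1: land each source table on Hadoop in raw form, with no transformation yet.
claims = (spark.read.format("jdbc")
          .option("url", ORACLE_URL)
          .option("dbtable", "CLAIMS_DOCUMENTS")
          .load())
records = (spark.read.format("jdbc")
           .option("url", SQLSERVER_URL)
           .option("dbtable", "dbo.PatientRecords")
           .load())
claims.write.mode("overwrite").parquet("hdfs:///raw/claims")
records.write.mode("overwrite").parquet("hdfs:///raw/patient_records")

# Step 2: transform on Hadoop for the segmentation and pricing queries --
# here, join claims to patient records and keep only the columns those queries need.
joined = (spark.read.parquet("hdfs:///raw/claims")
          .join(spark.read.parquet("hdfs:///raw/patient_records"), "patient_id")
          .select("patient_id", "plan_type", "claim_amount", "region"))
joined.write.mode("overwrite").parquet("hdfs:///analytics/claims_by_patient")
```

The point of the pattern is that the expensive hop, copying everything onto Hadoop, happens once, and every later transformation runs against a single platform.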
Decentralization
Now let’s consider a larger insurer that has acquired four smaller companies over the last two years, each storing medical records and claims records on a different RDBMS. It wants to combine the same data types for the same reasons as above. However, different platforms and higher data volumes raise the complexity, as do varying services and customer types among the business units. So rather than copying everything onto Hadoop for companywide analytics, each business unit runs pilot analytics with silo-spanning software. For example, they use ClearStory Data (which runs on Apache Spark) to identify metadata relationships across platforms and visualize the resulting analytics. No need for bulk extraction and loading.
Such processes are also described as data federation or data virtualization, meaning metadata is indexed and aggregated without moving the source data itself. “This is similar to the way in which Google indexes the entire Internet in order to provide sub-second search results to its hundreds of millions of users,” writes EMC CTO Bill Schmarzo in his book Big Data: Understanding How Data Powers Big Business.
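ClearStory's internals aren't public, so the sketch below is only a rough illustration of the federated idea in plain Spark: each silo is registered as a view against a hypothetical JDBC endpoint, and a pilot query spans them at query time rather than after a bulk load.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federated-pilot").getOrCreate()

# Hypothetical endpoints for two of the acquired business units' databases.
SOURCES = {
    "unit_a_claims": ("jdbc:postgresql://unit-a-db:5432/claims", "public.claims"),
    "unit_b_claims": ("jdbc:mysql://unit-b-db:3306/claims", "claims"),
}

# Register each remote table as a view. Nothing is bulk-copied onto a central
# platform; Spark reads from each source when a query runs, pruning columns
# and pushing down filters where it can.
for view_name, (url, table) in SOURCES.items():
    (spark.read.format("jdbc")
         .option("url", url)
         .option("dbtable", table)
         .load()
         .createOrReplaceTempView(view_name))

# A pilot query that spans both silos at query time.
by_region = spark.sql("""
    SELECT region, COUNT(*) AS claim_count, AVG(claim_amount) AS avg_amount
    FROM (
        SELECT region, claim_amount FROM unit_a_claims
        UNION ALL
        SELECT region, claim_amount FROM unit_b_claims
    ) AS all_claims
    GROUP BY region
""")
by_region.show()
```

A true virtualization product would also maintain a central metadata index across the silos, which this sketch skips; it only shows the query-in-place half of the idea.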
The Squishy Middle
Decentralization is becoming pervasive, but doesn’t replace the old. Gartner predicts that “by 2018, data discovery and data management evolution will drive most organizations to augment centralized analytic architectures with decentralized approaches.”
Many enterprises will take things one project at a time. More elaborate Big Data initiatives that rely on complex, evolving queries against high volumes of data generally point to consolidation. For specific queries of narrowly defined data sets, a decentralized approach may be enough. As decentralized tools grow more powerful, these lines will likely blur.
What does all this mean? You will need to continuously measure usage and move data across platforms as capabilities and needs change, and as you learn what works.
Enterprises need to invest in software that profiles data usage and its impact on infrastructure. A common use case is to see how many terabytes and CPU cycles specific enterprise data warehouse (EDW) or Hadoop data sets consume, and how often those data sets are actually used for analytics. Stakeholders need to see this plainly, decide what should be consolidated or relocated, and then move it. This will be an iterative process.
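What that profiling boils down to can be sketched in a few lines. The CSV export and its column names below are hypothetical, standing in for whatever usage log your monitoring or profiling tool produces.

```python
import pandas as pd

# Hypothetical export from a usage-profiling tool: one row per query, with the
# data set it touched and the resources it consumed.
usage = pd.read_csv("query_usage_log.csv")
# expected columns: platform, dataset, bytes_scanned, cpu_seconds, query_date

profile = (usage.groupby(["platform", "dataset"])
                .agg(queries=("query_date", "count"),
                     tb_scanned=("bytes_scanned", lambda b: b.sum() / 1e12),
                     cpu_hours=("cpu_seconds", lambda s: s.sum() / 3600))
                .sort_values("tb_scanned", ascending=False))

# Data sets that are heavy on infrastructure but rarely queried are the obvious
# candidates to relocate or retire; heavily queried ones may justify consolidation.
print(profile.head(20))
```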
The takeaway? Stay flexible. Invest in cross-platform software that tells you what to move and where, then moves it with a drag and drop of the mouse.