The true value in incorporating a Big Data initiative into an overall Enterprise Data Management scheme comes from integrating, and in some cases aggregating, external Big Data with more conventional sources of data.
Doing so correctly involves accounting for issues of Data Governance, Metadata Management, traceability, and Semantic consistency that frequently require more than simply dumping data into a single repository as a data lake—which incurs the risk of creating the proverbial data swamp.
The crux of the matter, due to Big Data’s ascending popularity and the efforts of vendors to capitalize on it, is that there are “…probably several hundred technologies out there for applying Big Data as a very broad technology space,” according to James Cerrato of Adaptive, Inc.
Integrating those technologies with conventional relational technologies and those native to the enterprise was the subject of Cerrato’s presentation at the Enterprise Data World 2015 Conference, “Big Data Analytics – Are You Creating a Data Swamp.” Doing so correctly incorporates the aforementioned aspects of governance, Metadata, traceability, and Semantics in a way that facilitates much-needed transparency, which Cerrato noted is the key point of distinction between a data lake and a data swamp.
Additional Integration Concerns
Aside from contending with a plethora of Big Data technologies and more traditional enterprise-based ones, integrating Big Data with conventional data is also complicated by:
- Organizational Structure: Different departments may have different objectives that require varying data types and purposes, all of which can foster a silo culture.
- Technology: Even among longstanding internal systems, integrating various technologies can be a pain point before Big Data is ever incorporated.
- Data Quality: Integrating different systems tests an organization’s Data Governance and can present issues for Data Quality concerning accuracy, completeness, timeliness, and more.
- Security: System integration and data integration also affect organizational security as access to data can change or produce undesired ramifications.
- Legacy Systems: Integrating legacy systems and their attendant technologies with ones for Big Data can prove difficult.
Automated Data Governance
Strict governance mechanisms are a necessity when integrating Big Data with other data throughout the enterprise, and for utilizing any sort of data lake option. At the macro level of governance it is necessary to establish ownership, accountability, and the roles and responsibilities pertaining to the data in terms of stewardship, subject matter experts, and even specific members of a Governance Council. After designating these roles and relationships, organizations can automate them with governance tools designed for Big Data integration. Such platforms can use workflow automation to issue alerts to the specific governance personnel associated with data types and processes, based on regulatory information, application uses, and other business functions. “Those processes will actually trigger notifications according to those accountable relationships you’ve identified that have responsibility for each step in your review process,” Cerrato said.
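The notification routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular governance platform works: the `Accountability` record, the `build_notifications` function, the step names, and the fallback to a "governance-council" owner are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accountability:
    """Hypothetical record: who is responsible for a review step on a data type."""
    data_type: str
    step: str
    owner: str


def build_notifications(assignments, data_type, steps):
    """Return (owner, message) pairs, one per review step of the given data type."""
    index = {(a.data_type, a.step): a.owner for a in assignments}
    notifications = []
    for step in steps:
        # Assumption: steps with no assigned owner escalate to the council,
        # since an unassigned step is itself a governance gap.
        owner = index.get((data_type, step), "governance-council")
        message = f"Review step '{step}' pending for {data_type}"
        notifications.append((owner, message))
    return notifications


# Illustrative data only.
ASSIGNMENTS = [
    Accountability("customer_clickstream", "quality_check", "data-steward"),
    Accountability("customer_clickstream", "privacy_review", "subject-matter-expert"),
]
REVIEW_STEPS = ["quality_check", "privacy_review", "final_approval"]

if __name__ == "__main__":
    for owner, message in build_notifications(
        ASSIGNMENTS, "customer_clickstream", REVIEW_STEPS
    ):
        print(f"{owner}: {message}")
```

The point of the sketch is that the accountable relationships are declared once as data, and the workflow simply looks them up at each step.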
Automating Metadata
At the micro level, integration of Big Data and other sources of data involves a degree of Metadata management that is similarly automated. Metadata’s capacity to provide context for different types of data is invaluable when integrating time-sensitive Big Data with other data types. Metadata requirements pertain to regulations, specific business processes, and application requirements, including those across (and specific to) business units. As Cerrato noted, the point is to:
“…not just manage technical Metadata, but to place it in the larger context of the enterprise—of the different aspects of process, of organization, of governance, of metrics. That is a real true differentiator that is adding value to how people can manage that information.”
Accounting for enterprise-wide integration of Big Data with Metadata enables organizations to take a holistic approach to that integrative process. Furthermore, contemporary governance solutions for Big Data can streamline that process by having Metadata operate as the basis upon which policies are founded—and in turn automating those rules.
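As a rough illustration of Metadata serving as the basis for policies, the sketch below selects governance rules purely from a dataset's Metadata. The Metadata keys, policy names, and the `applicable_policies` function are assumptions made for this example and do not come from Cerrato's presentation or any specific product.

```python
def applicable_policies(metadata, policies):
    """Return names of policies whose conditions all match the dataset's metadata."""
    matched = []
    for policy in policies:
        conditions = policy["when"]
        if all(metadata.get(key) == value for key, value in conditions.items()):
            matched.append(policy["name"])
    return matched


# Hypothetical policies, each triggered by metadata values rather than by dataset name.
POLICIES = [
    {"name": "mask_personal_fields", "when": {"contains_pii": True}},
    {"name": "record_external_origin", "when": {"source": "external"}},
    {"name": "finance_retention", "when": {"business_unit": "finance"}},
]

# Hypothetical metadata for an external Big Data feed.
FEED_METADATA = {
    "source": "external",
    "contains_pii": True,
    "business_unit": "marketing",
}

if __name__ == "__main__":
    print(applicable_policies(FEED_METADATA, POLICIES))
```

Because the rules key off Metadata, a newly integrated source picks up the right policies as soon as its Metadata is captured, without anyone writing rules per dataset.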
Standards-Based Semantics
At the macro level, Big Data integration is based on the roles and responsibilities that are critical to Data Governance. At the micro level, those policies for governance are largely determined by the Metadata that provides a critical context for the integrated data. At a granular level, that integration is largely predicated upon standards-based Semantics, like many other critical applications and technologies at the forefront of Data Management today. Metadata from Big Data and from other sources can be integrated in an orderly fashion that upholds principles of governance because of the Semantics-based approach taken by the more competitive Big Data Governance solutions. Additionally, Semantics creates a degree of visibility within data elements that allows IT personnel to see, at a granular level, the various business terms and their definitions that relate to a data element and impact its integration with others. From this perspective, one of the fundamental aspects of Big Data integration involves a standards-based Semantics Metadata repository. Such a repository is essential for providing a degree of lineage and transparency with that integration, which helps to reinforce effective Governance.
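To make the idea of a Semantics Metadata repository more concrete, here is a minimal in-memory sketch that ties data elements in different systems to a shared business term and its definition. It is a toy under stated assumptions: the `SemanticRepository` class, the system names, and the element names are hypothetical, and a real standards-based repository would rely on established vocabularies rather than plain dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticRepository:
    """Toy repository linking (system, element) pairs to business terms."""
    definitions: dict = field(default_factory=dict)  # term -> definition
    mappings: dict = field(default_factory=dict)     # (system, element) -> term

    def define(self, term, definition):
        self.definitions[term] = definition

    def map_element(self, system, element, term):
        # Refuse mappings to undefined terms so every element keeps business context.
        if term not in self.definitions:
            raise KeyError(f"Unknown business term: {term}")
        self.mappings[(system, element)] = term

    def describe(self, system, element):
        """Return the business term and definition behind a data element."""
        term = self.mappings[(system, element)]
        return term, self.definitions[term]

    def elements_for(self, term):
        """Return every (system, element) pair that carries the given term."""
        return sorted(key for key, mapped in self.mappings.items() if mapped == term)


# Illustrative content only.
repo = SemanticRepository()
repo.define("Customer", "A party that has purchased at least one product")
repo.map_element("crm_db", "ACCT_HOLDER", "Customer")
repo.map_element("clickstream_lake", "user_id", "Customer")

if __name__ == "__main__":
    print(repo.describe("crm_db", "ACCT_HOLDER"))
    print(repo.elements_for("Customer"))
```

Looking up a term shows which relational and Big Data elements would need to line up during integration, which is the kind of granular visibility the section describes.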
Business Rules and Traceability
The relationship between business rules, Semantics, and effective integration is a pivotal one, especially when applied to huge quantities of Big Data. As the basis for the context and business terms that influence the integration of data elements, different aspects of Semantics provide a degree of visibility from the business down to IT, and even from IT back up to the business. The degree of specificity to data elements that Semantics provides includes, according to Cerrato, answers to such questions as, “What taxonomy applies to that? Are there Semantics? Are there ontologies and concept models that are relevant, maybe industry standards?” The answers to all of these questions create additional ways of “representing business context and doing gap analysis against industry standards,” Cerrato noted.
The overall effect is increased traceability of data and transparency within a data lake or some other means of integrating Big Data. When it comes to determining the movement of data across any number of different systems, technologies, and applications, this sort of lineage is extremely useful for providing a structured means of keeping track of data which may itself be unstructured. This fact becomes even more important when it is used to guide and ensure adherence to business rules and governance policies.
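Lineage of this kind is often pictured as a graph of datasets and their inputs. The sketch below, with made-up dataset names and a hypothetical `upstream_sources` helper, walks such a graph to find everything a given dataset ultimately depends on. It assumes lineage has already been captured as a simple mapping from each dataset to its direct inputs.

```python
def upstream_sources(lineage, dataset):
    """Return the set of all direct and indirect inputs of a dataset."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in seen:  # also guards against cycles in the graph
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical lineage: each dataset maps to the datasets it was derived from.
LINEAGE = {
    "sales_dashboard": ["customer_360"],
    "customer_360": ["crm_accounts", "social_feed_raw"],
    "social_feed_raw": [],
}

if __name__ == "__main__":
    print(sorted(upstream_sources(LINEAGE, "sales_dashboard")))
```

With this in place, a question such as "which reports are affected if the unstructured social feed changes?" becomes a graph traversal rather than guesswork.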
The Benefit of Integration: Operational Data
The four different aspects of Big Data integration outlined in this article (and in Cerrato’s presentation) ultimately create the means for enterprises to combine external and internal data sources in a timeframe that enhances operational data. Those aspects of integration include Data Governance, Metadata Management, Semantics, and the traceability of business rules. Furthermore, they enhance operational data in a way that substantially adds to the meaning of Big Data and the value it can produce when leveraged with traditional data sources. As Cerrato observed:
“All of these things come into play in a way that starts to bring together both small and Big Data worlds…The different areas of focus around Metadata Management, around governance, around ontology management, the management of business rules and decision processing. And all of that brings into play whatever operational sources you have whether they are legacy mainframe, whether they are XML schemas, relational structures, Big Data structures…Being able to layer that whole governance framework on top of any kind of operational data you might have through all the different technologies that you might be using.”