
Using SOA for big data and cloud data management

Big data and cloud data management can be problematic. Learn how SOA can manage data in several ways.

Do we need data-centric SOA or SOA-centric data? The answer may depend on how you juggle the three different dimensions of the SOA-data relationship to manage big data, cloud data and data hierarchies. Fitting these models optimally with each other for all types of data, in all of the growing number of virtual resource models, is one of SOA's most profound challenges. This tip examines the benefits, choices and options of each SOA model for managing data.

SOA's three data-centric models are Data as a Service (DaaS), physical hierarchies and the architectural view. The DaaS dimension describes how data is made available to SOA components. The physical dimension covers how storage and storage hierarchies figure into SOA data access. Finally, the architectural view is the way that data, data management services and SOA components relate.

SOA and data enterprise example

Probably the best way to understand the SOA data issue is to start with the limiting case: an enterprise with data needs that can be completely represented in relational database management system (RDBMS) terms. Such an enterprise would likely adopt either database appliances or dedicated database servers and present query services to SOA components (Query as a Service, or QaaS). This is a design that's been accepted for five years or more. What makes it work is its balance of the three dimensions mentioned above. The QaaS service model isn't physically linked to storage; it's mapped to it via a single architecture -- RDBMS. Data deduplication and integrity can be managed for that single architecture easily.
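As a rough illustration of the QaaS idea, the sketch below wraps a single relational store behind a query service so that components never touch storage directly. The `QueryService` class, the `orders` table and the sample rows are all hypothetical, and an in-memory SQLite database stands in for a database appliance or dedicated server.

```python
import sqlite3
from typing import List, Tuple


class QueryService:
    """Hypothetical Query-as-a-Service facade over one RDBMS."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection

    def query(self, sql: str, params: Tuple = ()) -> List[Tuple]:
        # Components submit parameterized queries; storage stays hidden.
        return self._conn.execute(sql, params).fetchall()


def build_demo_service() -> QueryService:
    # In-memory SQLite is only a stand-in for a real database server.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, total) VALUES (?, ?)",
        [("acme", 120.0), ("acme", 80.0), ("globex", 50.0)],
    )
    conn.commit()
    return QueryService(conn)


service = build_demo_service()
rows = service.query(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
)
print(rows)
```

Because every component goes through the same facade, integrity rules and deduplication only have to be enforced in one place, which is the balance the paragraph above describes.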

Why this simple approach may not work with data on a broader scale is best understood through the example of big data. Most big data is nonrelational, nontransactional, nonstructured and even nonupdated. It's not easily abstracted to a query service because of its lack of structure, it's rarely stored in an orderly way because of the multiplicity of sources and formats, and there are few rules that define even basic data integrity and deduplication processes for it. When such things as big data are introduced to SOA applications, it's critical to define the last of our three models, the architectural model of SOA data relationships. There are two broad choices: horizontal and vertical.

SOA and types of data models

In a horizontally integrated data model, data is collected behind an abstract set of data services that present one or more interfaces to applications and also provide all the integrity and data management features. Components don't access data directly, but in as-a-Service form, just as they would in the simple case of an enterprise whose data requirements were pure RDBMS. The application components are largely insulated from the data management differences of RDBMS versus big data. This approach can't reproduce the simple query model of RDBMS, for the reasons already given, but it does preserve the as-a-Service access pattern of the RDBMS case described above.
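One way to picture the horizontal model is a single data layer that routes requests to interchangeable back ends. The sketch below is only an assumption about how that might look: `DataService`, `RelationalService`, `UnstructuredService` and `DataLayer` are hypothetical names, and the sample records are invented for illustration.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Record = Dict[str, Any]


class DataService(ABC):
    """Hypothetical uniform interface that every data source sits behind."""

    @abstractmethod
    def find(self, field: str, value: Any) -> List[Record]:
        ...


class RelationalService(DataService):
    COLUMNS = ("id", "customer", "region")  # whitelist of queryable columns

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.row_factory = sqlite3.Row
        self._conn.execute(
            "CREATE TABLE customers (id INTEGER, customer TEXT, region TEXT)"
        )
        self._conn.executemany(
            "INSERT INTO customers VALUES (?, ?, ?)",
            [(1, "acme", "east"), (2, "globex", "west")],
        )

    def find(self, field: str, value: Any) -> List[Record]:
        if field not in self.COLUMNS:
            return []
        # The field name is whitelisted above; the value is bound safely.
        rows = self._conn.execute(
            f"SELECT * FROM customers WHERE {field} = ?", (value,)
        )
        return [dict(row) for row in rows]


class UnstructuredService(DataService):
    """Records with no fixed schema; a missing field simply never matches."""

    def __init__(self, records: List[Record]) -> None:
        self._records = list(records)

    def find(self, field: str, value: Any) -> List[Record]:
        return [dict(r) for r in self._records if r.get(field) == value]


class DataLayer:
    """The only entry point application components are allowed to call."""

    def __init__(self, services: Dict[str, DataService]) -> None:
        self._services = services

    def find(self, source: str, field: str, value: Any) -> List[Record]:
        return self._services[source].find(field, value)


layer = DataLayer({
    "crm": RelationalService(),
    "telemetry": UnstructuredService([
        {"sensor": "s1", "cars": 4},
        {"sensor": "s2", "note": "offline"},
        {"sensor": "s1", "cars": 7, "lane": 2},
    ]),
})
print(layer.find("crm", "region", "east"))
print(layer.find("telemetry", "sensor", "s1"))
```

The calling code is identical for both sources, which is the insulation the horizontal model is after; the cost is that the interface can only express what every back end can support.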

The vertically integrated data model links application data services to resources in a more application-specific way, where the customer relationship management or enterprise resource planning or dynamic data authentication application data is largely separated first at the as-a-Service level, and that separation is maintained down to the data infrastructure. In some cases, these applications might have SOA components that access storage/data services directly. To provide more uniform data integrity and management, management services may be provided as SOA components that operate on the various database systems, performing common tasks, such as deduplication and integrity checks, in database-specific ways. This approach is more easily adapted to legacy application and data structures, but it risks compromising SOA as-a-Service principles in how data is accessed, and it may also create issues with consistency in data management.
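The management-service idea in the vertical model might be sketched as one common task, deduplication, implemented in a database-specific way for each silo. Everything named here (`DedupService`, `SqlContactDedup`, `DocumentDedup`, the `contacts` table and the sample data) is hypothetical and exists only to illustrate the pattern.

```python
import json
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DedupService(ABC):
    """Hypothetical management service: same task, store-specific method."""

    @abstractmethod
    def deduplicate(self) -> int:
        """Remove duplicates and return how many records were removed."""


class SqlContactDedup(DedupService):
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def deduplicate(self) -> int:
        # Relational approach: keep the first row for each email address.
        cursor = self._conn.execute(
            "DELETE FROM contacts WHERE rowid NOT IN "
            "(SELECT MIN(rowid) FROM contacts GROUP BY email)"
        )
        self._conn.commit()
        return cursor.rowcount


class DocumentDedup(DedupService):
    def __init__(self, documents: List[Dict[str, Any]]) -> None:
        self._documents = documents

    def deduplicate(self) -> int:
        # Document approach: fingerprint each record's full content.
        seen = set()
        unique = []
        for doc in self._documents:
            fingerprint = json.dumps(doc, sort_keys=True)
            if fingerprint not in seen:
                seen.add(fingerprint)
                unique.append(doc)
        removed = len(self._documents) - len(unique)
        self._documents[:] = unique
        return removed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("a@example.com", "Ann"), ("a@example.com", "Ann B."), ("b@example.com", "Bo")],
)
documents = [{"id": 1, "tags": ["x"]}, {"tags": ["x"], "id": 1}, {"id": 2}]

services = {"crm": SqlContactDedup(conn), "erp_docs": DocumentDedup(documents)}
removed = {name: svc.deduplicate() for name, svc in services.items()}
print(removed)
```

Note that the two services disagree about what a duplicate is (same email versus identical content), which is a small instance of the consistency risk the paragraph above warns about.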

SOA and horizontal data models

There is little question that the horizontal model is more consistent with SOA principles, since it abstracts data services from SOA components more thoroughly. To make it work, though, it's necessary to define abstractions for nonrelational databases and to deal with any inefficiencies associated with the abstraction process -- which SOA architects know can become insurmountable unless care is taken to avoid them.

A horizontal SOA data strategy has to start with a data abstraction that works for big data. The most familiar solution to this problem is MapReduce, available in open source form as Apache Hadoop. Hadoop and similar approaches distribute data, management and access and then centrally correlate the results of queries made on this distributed information. SOA components should, in effect, treat MapReduce and similar database analysis functions as queries.
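To make "treat MapReduce as a query" concrete, here is a minimal in-process sketch of the map and reduce phases exposed through one query-style function. It does not use Hadoop's API; `map_reduce`, `word_count_query` and the sample partitions are hypothetical, and the partitions merely stand in for data spread across nodes.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def map_reduce(
    partitions: Iterable[Iterable[Any]],
    mapper: Callable[[Any], Iterable[Tuple[Any, Any]]],
    reducer: Callable[[Any, List[Any]], Any],
) -> Dict[Any, Any]:
    grouped: Dict[Any, List[Any]] = defaultdict(list)
    # Map phase: in a real cluster each partition is processed where it lives.
    for partition in partitions:
        for record in partition:
            for key, value in mapper(record):
                grouped[key].append(value)
    # Reduce phase: results are correlated centrally, one key at a time.
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(line: str) -> Iterable[Tuple[str, int]]:
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key: str, values: List[int]) -> int:
    return sum(values)


def word_count_query(partitions: Iterable[Iterable[str]]) -> Dict[str, int]:
    """What an SOA component would call: it looks like any other query."""
    return map_reduce(partitions, word_mapper, sum_reducer)


partitions = [["big data big"], ["data services", "big services"]]
counts = word_count_query(partitions)
print(counts)
```

The component sees only `word_count_query`; whether the work ran on one machine or a thousand is hidden behind the service boundary.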

Handling horizontal database efficiency issues

Efficiency issues are more complex. Because the horizontal database model is likely to be implemented through a message service bus like most SOA processes, one essential step is to ensure that the overhead associated with this orchestration is kept to a minimum. This can help to reduce the software overhead associated with SOA data access, but it can't overcome problems with the storage systems themselves. Since these storage systems are abstracted away from SOA components by the horizontal model, it's easy to overlook problems with latency and data transfer volumes, particularly if the databases are cloud-distributed and thus have variable network delays associated with their use.

One solution to this is the modern notion of hierarchical storage. A database isn't viewed as a disk, but rather as a set of connected cache points that start in local memory, move perhaps to solid-state drives, then to local disks and finally to cloud storage. Caching algorithms deal with the movement between these cache points to balance storage cost (and, for updates, synchronization) with performance.
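The chain of cache points can be sketched as a list of tiers, fastest first, with reads promoting data toward the fast end. This is a simplified model under stated assumptions: `Tier` and `HierarchicalStore` are hypothetical names, eviction is least-recently-used, writes go through to every tier, and plain dictionaries stand in for memory, SSD and cloud storage.

```python
from collections import OrderedDict
from typing import Any, List, Optional, Tuple


class Tier:
    """One cache point; capacity=None means unbounded (the backing store)."""

    def __init__(self, name: str, capacity: Optional[int]) -> None:
        self.name = name
        self.capacity = capacity
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def contains(self, key: str) -> bool:
        return key in self._items

    def get(self, key: str) -> Any:
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if self.capacity is not None and len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class HierarchicalStore:
    def __init__(self, tiers: List[Tier]) -> None:
        self._tiers = tiers  # ordered fastest to slowest

    def write(self, key: str, value: Any) -> None:
        # Write-through keeps every tier synchronized, at extra write cost.
        for tier in self._tiers:
            tier.put(key, value)

    def read(self, key: str) -> Tuple[Any, str]:
        for index, tier in enumerate(self._tiers):
            if tier.contains(key):
                value = tier.get(key)
                # Promote the value into every faster tier for next time.
                for faster in self._tiers[:index]:
                    faster.put(key, value)
                return value, tier.name
        raise KeyError(key)


def build_store() -> HierarchicalStore:
    return HierarchicalStore(
        [Tier("memory", 1), Tier("ssd", 2), Tier("cloud", None)]
    )
```

Returning the tier name alongside the value is a deliberate choice for this sketch: it makes the latency source visible to callers, which is exactly what the horizontal abstraction otherwise hides.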

For big data, it's also often possible to create summarizations of data that can be used for most analytics. An example is a traffic telemetry application that includes car-counting at various points. This can generate massive amounts of data, but if the summary counts for the last minute are stored in memory, for the last hour in flash and for the last day on disk, the needs of real-time control applications can be satisfied from a fast-access source, while what-if analytics can use something less expensive and slower.
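A minimal sketch of the car-counting example follows, assuming each event carries a sensor ID and a timestamp in seconds. The `CarCountRollup` class is hypothetical, and its three dictionaries only stand in for the memory, flash and disk tiers; a real system would also expire old buckets.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple

Bucket = Tuple[str, int]  # (sensor_id, time bucket index)


class CarCountRollup:
    """Keeps summary counts at three resolutions instead of raw events."""

    def __init__(self) -> None:
        self.per_minute: DefaultDict[Bucket, int] = defaultdict(int)  # memory
        self.per_hour: DefaultDict[Bucket, int] = defaultdict(int)    # flash
        self.per_day: DefaultDict[Bucket, int] = defaultdict(int)     # disk

    def record(self, sensor_id: str, timestamp_s: int, cars: int = 1) -> None:
        # Each event updates all three summaries; the raw event is discarded.
        self.per_minute[(sensor_id, timestamp_s // 60)] += cars
        self.per_hour[(sensor_id, timestamp_s // 3600)] += cars
        self.per_day[(sensor_id, timestamp_s // 86400)] += cars

    def cars_in_minute(self, sensor_id: str, timestamp_s: int) -> int:
        # Fast path for real-time control applications.
        return self.per_minute.get((sensor_id, timestamp_s // 60), 0)

    def cars_in_day(self, sensor_id: str, timestamp_s: int) -> int:
        # Slower, cheaper path for what-if analytics.
        return self.per_day.get((sensor_id, timestamp_s // 86400), 0)


rollup = CarCountRollup()
for ts in (0, 30, 90, 3700):
    rollup.record("A", ts)
print(rollup.cars_in_minute("A", 45), rollup.cars_in_day("A", 45))
```

The summaries grow with the number of sensors and time buckets rather than with the number of cars, which is what makes the fast tiers affordable.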

SOA is all about abstraction, but abstraction is increasingly risky when it hides underlying complexities that impact performance and response time. Data access is such a situation, so SOA architects need to consider the balance of abstraction and performance carefully and optimize it for their specific business needs.
