Not only do Big Data analytics systems traffic in massive volumes of data, they also use a specialized integration pattern. These are not traditional data warehouses that integrate data from CRM/ERP and OLTP systems using custom extract, transform, load (ETL) tools. In fact, I’d hazard that in 4 out of 5 New Media analytics implementations, source data comes primarily from log files that are continually spooled by large banks of web/application servers. For example, one large social media enterprise I recently met with uses a cluster of over 250 servers for targeted advertising.
These systems are typically organized as massive distributed clusters, often spread across several data centers. In such environments, failures are a fact of life, and when a failed node recovers it typically sends older spooled data that reaches the server minutes, hours, or sometimes even days late. For example, during pre-production analysis at a leading digital media business, we found several cases where data reached the warehouse 23 or more hours late. This late data never made it into their legacy reporting system’s batch processing window and was unceremoniously dropped from the reports that were used to make critical business decisions!
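To make the failure mode concrete, here is a minimal Python sketch of a nightly batch report with a fixed processing window. Everything in it is hypothetical and made up for illustration (the event tuples, the `batch_report` function, the 2 a.m. cutoff); it is not the customer's actual reporting system.

```python
from datetime import date, datetime

# Hypothetical log events: (event_time, arrival_time, page_views).
events = [
    (datetime(2011, 3, 1, 10, 0), datetime(2011, 3, 1, 10, 1), 120),
    (datetime(2011, 3, 1, 14, 30), datetime(2011, 3, 1, 14, 31), 80),
    # Spooled on a node that failed and later recovered: arrives 23 hours late.
    (datetime(2011, 3, 1, 22, 0), datetime(2011, 3, 2, 21, 0), 50),
]


def batch_report(events, day, cutoff):
    """Sum page views for `day`, seeing only events that arrived by `cutoff`."""
    return sum(
        views
        for event_time, arrival_time, views in events
        if event_time.date() == day and arrival_time <= cutoff
    )


report_day = date(2011, 3, 1)

# The nightly job runs at 2 a.m. the next day (an assumed batch window).
reported = batch_report(events, report_day, cutoff=datetime(2011, 3, 2, 2, 0))

# What the report would show if every event had arrived in time.
true_total = batch_report(events, report_day, cutoff=datetime.max)

print(reported, true_total)
```

In this toy run the report misses the 50 page views from the recovered node, and nothing in the batch job signals that the number is incomplete.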
Unfortunately, this is not an isolated incident – late data affects all kinds of systems, ranging from Hadoop and other MapReduce implementations to more traditional data warehouse technologies purveyed by storied vendors and scrappy startups alike. Late data causes serious problems for practitioners who manage today’s batch processing systems, and admins often resort to rigging up band-aids. A common approach we’ve seen is to re-run batch reports several times as “insurance”, which reminds me of the old Unix sysadmin trick of running sync three times – and still throwing salt over the left shoulder for good measure.
We believe it is unacceptable to have to choose between delivering on-time reports that become incorrect as soon as late data comes in, and increasing both capex and latency to run and re-run batch jobs while still not being 100% safe! Our customers, who use a motley mix of analytics systems, have been clamoring for a fix to the late data problem, and they finally have a real solution with Truviso.
The Truviso Continuous Analytics 3.2 release continuously processes and updates data as it enters the system, whenever it does, and reflects the latest data by adjusting computations on the fly. Late data management is accomplished through query rewrites that transparently pre-process queries independently and combine the pre-processed results on the fly, so that queries are always answered with correct and up-to-date results.
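The general idea of pre-processing independently and combining on read can be sketched in a few lines of Python. This is only an illustration of the concept under my own assumptions, not Truviso's implementation or API: the `ContinuousHourlyCount` class, its methods, and the hour-bucket strings are all hypothetical.

```python
from collections import Counter


class ContinuousHourlyCount:
    """Toy continuous aggregate: one partial result per chunk, merged on read."""

    def __init__(self):
        self.partials = []  # one Counter per chunk, in arrival order

    def ingest(self, chunk):
        """Pre-aggregate one chunk of (event_hour, count) records on arrival."""
        partial = Counter()
        for event_hour, count in chunk:
            partial[event_hour] += count
        self.partials.append(partial)

    def query(self):
        """Combine all partials. Grouping is by event time, so the order in
        which chunks arrived does not affect the result."""
        total = Counter()
        for partial in self.partials:
            total.update(partial)  # Counter.update adds counts
        return dict(total)


agg = ContinuousHourlyCount()
agg.ingest([("2011-03-01T10", 120), ("2011-03-01T14", 80)])
on_time = agg.query()

# A recovered node delivers older spooled data much later.
agg.ingest([("2011-03-01T10", 30), ("2011-03-01T22", 50)])
updated = agg.query()

print(on_time)
print(updated)
```

Because each chunk is summarized once when it arrives and the summaries are merged at query time, a late chunk simply adds one more partial result; nothing already computed has to be re-run.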
With Truviso, Big Data customers can process massive volumes of data at very low latencies without being affected by the order in which the data arrives in the system. As a result, reports are always delivered on the fly and include all data, regardless of its age or when it was submitted to the system.
I’d love to know more about how late data arrives in your analytics systems, and how your reporting infrastructure deals with it. Tell me your war stories!




