
Truviso Blog

Truviso 3.2: Making Interoperability Simple and Safe

By Sailesh Krishnamurthy
November 25, 2009 @ 4:32 pm

Applications that load data typically operate in transactional fashion, where any failure because of software, hardware or network error results in the data loading transaction aborting, forcing the application client to re-load the appropriate data. The problem with this in most analytical event processing (AEP) and complex event processing (CEP) systems is that individual records that have already been processed by continuous queries have to be undone if those records are part of a data loading transaction that has been aborted. Delaying the processing of the data until the completion of the loading transaction leads to significant latency, and is therefore unacceptable in a real-time environment.

The Truviso Continuous Analytics version 3.2 release offers full dataset transactional consistency for data loading integrity. If any records of a dataset are not successfully transmitted, the system is able to “unwind” that dataset’s effects on all reports and analyses that were affected.

Truviso achieves transactional consistency by running two processes in parallel: (1) a “bookkeeping” protocol that tracks when a loading dataset begins and commits, and (2) a continuous processing activity that continually updates queries. In the event of a transaction abort, the bookkeeping function records the partial state of the data load, and the system unwinds any committed data that would adversely affect results until the dataset is re-loaded. This function maintains the correct data state in a seamless and transparent fashion.
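To make the idea concrete, here is a minimal Python sketch of that kind of bookkeeping. It is an illustration under my own assumptions, not Truviso's implementation: the class `TransactionalCounter` and its methods are hypothetical names, and the "continuous query" is reduced to a running sum per key.

```python
from collections import defaultdict


class TransactionalCounter:
    """Running per-key totals that can unwind an aborted load (hypothetical sketch)."""

    def __init__(self):
        self.totals = defaultdict(int)  # what the continuous query reports
        self.pending = {}               # txn_id -> {key: contribution so far}

    def begin(self, txn_id):
        # Bookkeeping: note that a dataset load has started.
        self.pending[txn_id] = defaultdict(int)

    def process(self, txn_id, key, value):
        # Records update results immediately (no added latency), but each
        # contribution is also logged against its loading transaction.
        self.totals[key] += value
        self.pending[txn_id][key] += value

    def commit(self, txn_id):
        # The load finished; its contributions are now permanent.
        del self.pending[txn_id]

    def abort(self, txn_id):
        # The load failed; subtract everything it contributed so far.
        for key, value in self.pending.pop(txn_id).items():
            self.totals[key] -= value


counter = TransactionalCounter()

counter.begin("load-1")
counter.process("load-1", "clicks", 3)
counter.commit("load-1")

counter.begin("load-2")
counter.process("load-2", "clicks", 5)
counter.process("load-2", "views", 2)
counter.abort("load-2")  # e.g. a network failure mid-load

print(dict(counter.totals))
```

The point of the sketch is that the per-record work is a single extra dictionary update, which is why bookkeeping of this kind can stay cheap relative to the query processing itself.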

As a result, with Truviso v3.2, you can rely on your infrastructure to maintain data integrity, including when connecting the system with extract transform load (ETL) or other data-loading transactional applications. Another important application is when moving data around with a transactional message queue in order to guarantee that data is not lost anywhere. Best of all, the bookkeeping protocol is so lightweight that safety is accomplished without sacrificing any appreciable performance.

The ability to exploit deep RDBMS (relational database management system) integration enables Truviso to provide unmatched levels of data integrity – a major benefit that AEP/CEP systems cannot deliver because of their inherent architectural constraints.




Truviso 3.2: Scale-up, Clustering and High Availability

By Sailesh Krishnamurthy
November 24, 2009 @ 4:55 pm

There are two trends driving the exploding data volumes in the digital and social media world today. First, with the network effect of Metcalfe’s Law, the success of the real-time web in connecting people leads to increased engagement and therefore more data. Second, there is a trend we are calling the Deep Real-Time Web that refers to the machine-to-machine communication resulting from and in fact dwarfing the user-driven interactions. In addition to handling these exploding data volumes, customers who have implemented a Truviso operational analytics solution “democratize the data” by providing a low-latency and high-concurrency environment for interactive queries.

The Truviso Continuous Analytics 3.2 release introduces a highly scalable data-parallel processing architecture that exploits the newest many-core hardware as well as multi-machine cluster configurations. Software needs to be written specifically for multi-core architectures to avoid performance problems; while we did it right, it’s an area where some other software companies stumble, as pointed out in a good Forbes.com article.

Truviso achieves vertical scalability on many-core SMP systems by splitting incoming data into independent runs on the fly, so that each run can be processed in parallel and the results consolidated continuously. This vertical scalability lets enterprises save money on hardware, networking, data center and maintenance costs. For example, a large media network was running analytics applications on a 40-node cluster of specialized data warehouse servers, each costing over $50,000. It successfully offloaded its most high-value and high-volume workloads, at 500,000 records/second, onto one production server and one hot failover server running Continuous Analytics, with hardware costs of just $6,500 for each commodity server.
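The split, process and consolidate pattern can be sketched in a few lines of Python. This is only an illustration of the general data-parallel idea, with hypothetical function names and a simple per-key sum standing in for a real query; it says nothing about how Truviso partitions data internally.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def aggregate_run(run):
    """Compute a partial per-key sum for one independent run of records."""
    partial = Counter()
    for key, value in run:
        partial[key] += value
    return partial


def consolidate(partials):
    """Merge partial sums; sums are additive, so merge order does not matter."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total


records = [("ad-1", 1), ("ad-2", 4), ("ad-1", 2), ("ad-3", 7), ("ad-2", 1)]

# Split the incoming records round-robin into four independent runs.
num_runs = 4
runs = [records[i::num_runs] for i in range(num_runs)]

# Threads keep the sketch simple; in CPython a CPU-bound workload would
# need processes (or native code) to actually use several cores.
with ThreadPoolExecutor(max_workers=num_runs) as pool:
    partials = list(pool.map(aggregate_run, runs))

totals = consolidate(partials)
print(dict(totals))
```

The design choice that matters is that the aggregate is mergeable: any split of the input gives the same consolidated answer, so the number of runs can follow the number of cores.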

Speaking of hot failover, with version 3.2 you have the option to use a cluster configuration to achieve high availability / disaster recovery (HA/DR), seamless failover and distributed scalability, as well as scale-out processing of interactive queries. The clustering infrastructure allows multi-master replication, where data can be sent to any node in the cluster. Queries can also be deployed in the system in an active-active fashion to provide instant failover in the event of a node failure. A failed node can be brought back online and transparently start processing the latest data, while automatically and asynchronously “catching up” on the data that it missed while it was offline.

In addition to high availability and failover, this release offers continuous online backup features that can be used to organize a cluster in an active-standby configuration. Together with our TruLink connectors to aggregate data from disparate sources, we provide lots of flexible options to fit seamlessly in your existing infrastructure.




Truviso 3.2: Recent Performance is a Better Indicator of Future Results

By Sailesh Krishnamurthy
November 19, 2009 @ 4:08 pm


Like many technologists, I find customer interactions to be one of the most fulfilling parts of my job. The Internet world in particular is extremely liberating – customer problems are so unique that none of the standard assumptions hold any more. It turns out that for these folks, their business doesn’t fit the traditional model of a data warehouse that stores all the data from the beginning of time.

What really matters and what drives the business are the events of the last 60 to 120 days, because these businesses operate at a blinding pace. Their user populations are constantly in flux, growing and changing in various ways. Heck, oftentimes their very business models change within a year. In other words, for these organizations recent performance is a better indicator of future results, in contrast to the cautionary maxim that “past performance does not guarantee future results”, which our friends on Wall Street probably wish they had followed more closely!

It’s the recent data that’s most valuable, that’s most required and actively sought out by business managers making decisions on the front line day in and day out. Of course, it’s necessary to keep all historical data for various reasons such as compliance, accounting, and long-term trend analysis by the high priests of data.

These Internet companies, especially the trail-blazers, have realized that their enterprise data warehouses (EDWs) are needlessly over-provisioned. Many organizations are taking Bill Inmon’s advice and storing their most recent transactional data on an operational data store.

Truviso takes the ODS to the next level by providing live production analytics on top of transactional reporting, thereby complementing the EDW, and offloading the most challenging and useful workloads to a location where those workloads can be immediately used to improve operations. This model lets the EDW be used for its strengths, such as open-ended “blue sky” analysis incorporating years’ worth of historical data.

A critical aspect of Truviso Continuous Analytics version 3.2 that was architected for ODS environments is rolling data management – a feature necessary to efficiently move old data out of the system on a continuous, controlled basis. The actual mechanics are based on infrastructure that can be used to organize detailed as well as summary data into separate partitions (typically on a daily basis) thus letting older partitions be instantly groomed. In a typical implementation, the system is configured to periodically offload data onto a different system such as a data warehouse or a more cost-effective storage array, or even a Hadoop cluster. This reduces the cost of storage and maintenance, cuts hardware costs, and speeds up query processing by retaining only “recent” data in the operational analytics system.
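As a rough illustration of the mechanics, the Python sketch below keeps one partition per day and grooms partitions that fall outside a retention window by handing them to an archive. The `RollingStore` class, the 90-day window and the dictionary standing in for a warehouse or Hadoop cluster are all assumptions made for the example, not product details.

```python
from datetime import date, timedelta


class RollingStore:
    """Daily partitions with a retention window (hypothetical sketch)."""

    def __init__(self, retention_days, archive):
        self.retention_days = retention_days
        self.archive = archive   # stand-in for a warehouse, storage array, etc.
        self.partitions = {}     # day -> list of rows for that day

    def insert(self, day, row):
        self.partitions.setdefault(day, []).append(row)

    def groom(self, today):
        # Dropping a whole partition is one dictionary operation; no
        # row-by-row delete or table scan is needed.
        cutoff = today - timedelta(days=self.retention_days)
        expired = [day for day in self.partitions if day < cutoff]
        for day in expired:
            self.archive[day] = self.partitions.pop(day)
        return sorted(expired)


archive = {}
store = RollingStore(retention_days=90, archive=archive)
today = date(2009, 11, 19)

store.insert(today, {"user": 1})
store.insert(today - timedelta(days=30), {"user": 2})
store.insert(today - timedelta(days=200), {"user": 3})

groomed = store.groom(today)
print(len(groomed), "partition(s) moved to the archive")
```

Because old data leaves a partition at a time, queries over the operational system only ever touch the recent partitions that remain.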

Tell me your stories – I’d love to know what fraction of your analytics workloads focus on recent data.




Truviso 3.2: Instantaneous and Reliable Late Data Processing

By Sailesh Krishnamurthy
November 17, 2009 @ 2:19 pm

Not only do Big Data analytics systems traffic in massive volumes of data, they also use a specialized integration pattern. These are not traditional data warehouses that integrate data from CRM/ERP and OLTP systems using custom extract transform load (ETL) tools. In fact, I’d hazard that in 4 out of 5 New Media analytics implementations, the source data comes primarily from log files that are continually spooled by large banks of web/application servers. For example, one large social media enterprise I recently met with uses a cluster of over 250 servers for targeted advertising.

These systems are typically organized as massive distributed clusters, often spread across several data centers. In such environments failures are a fact of life, and when a failed node recovers it typically sends its older spooled data to the server, where it arrives minutes, hours, or sometimes even days late. For example, during pre-production analysis at a leading digital media business, we found several examples where data reached the warehouse 23 or more hours late. This late data never made it into their legacy reporting system’s batch processing window and was unceremoniously dropped from the reports that were used to make critical business decisions!

Unfortunately, this is not an isolated incident – late data affects all kinds of systems ranging from Hadoop and other MapReduce implementations to more traditional data warehouse technologies purveyed by both storied vendors as well as scrappy startups. Late data causes serious problems for practitioners who manage today’s batch processing systems, with admins often rigging up band-aids. A common approach we’ve seen is to re-run batch reports several times as “insurance”, reminding me of the old Unix sysadmin trick of running sync three times – and still throwing salt over the left shoulder for good measure.

We believe it is unacceptable to have to choose between delivering on-time reports that become incorrect the instant new data comes in, and increasing both capex and latency to run and re-run batch jobs while still not being 100% safe! Our customers, who use a motley mix of analytics systems, have been clamoring for a fix for the late data problem, and with Truviso they finally have one.

The Truviso Continuous Analytics 3.2 release continuously processes data as it enters the system, whenever it arrives, and reflects the latest data by adjusting computations on the fly. Late data management is accomplished through an innovative use of query rewrites: the system transparently pre-processes queries independently and combines the pre-processed results on the fly, so that queries are always answered with correct and up-to-date results.
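One way to picture "pre-process independently, combine on the fly" is the Python sketch below. Each arriving batch is summarized on its own, keyed by the hour in which the events happened rather than the hour in which they arrived, and a report simply adds up the partial results. The `HourlyCounts` class and the hour-string keys are hypothetical; this illustrates the general technique rather than describing Truviso's query rewrites.

```python
from collections import defaultdict


class HourlyCounts:
    """Per-event-hour counts built from independently summarized batches."""

    def __init__(self):
        # event hour -> list of partial counts, one entry per arrival batch
        self.partials = defaultdict(list)

    def ingest_batch(self, event_hours):
        # Pre-process the batch on its own, without touching earlier results.
        batch = defaultdict(int)
        for hour in event_hours:
            batch[hour] += 1
        for hour, count in batch.items():
            self.partials[hour].append(count)

    def report(self, hour):
        # Combine the partial results at query time.
        return sum(self.partials.get(hour, []))


counts = HourlyCounts()

# On-time data for two hours.
counts.ingest_batch(["2009-11-17T10"] * 3 + ["2009-11-17T11"] * 2)
before = counts.report("2009-11-17T10")

# A recovered node delivers two more 10:00 events long after the fact.
counts.ingest_batch(["2009-11-17T10"] * 2)
after = counts.report("2009-11-17T10")

print(before, after)
```

Nothing is re-run when the late batch shows up: it becomes one more partial result, and the next report for that hour includes it.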

With Truviso, Big Data customers can process massively large volumes of data at very low latencies without being affected by the order in which the data arrives in the system. As a result, reports are always delivered instantly on the fly and include all data regardless of age or when it was submitted to the system.

I’d love to know more about how late data arrives in your analytics systems, and how your reporting infrastructure deals with it. Tell me your war stories!




Truviso 3.2: Tracking Unique Users with Truviso

By Sailesh Krishnamurthy
November 16, 2009 @ 4:54 pm

Core to almost every measurement metric for online and mobile media networks are counts of “unique users” or “unique visitors”. It is vital that digital service providers can accurately track and measure how their massive visitor populations interact with the services provided, and the measure of choice is unique users over various dimensions (e.g., demographics, behavior, geography and time intervals).

This is a hard problem, and unsurprisingly even leading web analytics offerings like Omniture impose severe limits on the number of unique users and page views that they can report on, with the dreaded “database uniques exceeded” error message that analysts have grown to hate. In addition, these systems suffer from unacceptable reporting latencies because of their batch processing paradigms and/or quality issues due to reliance on sample data. It’s common to hear our customers complain that they just cannot reconcile their Omniture and Google Analytics numbers. Of course, there was also the recent controversy with comScore and Nielsen reporting very different unique user numbers for Hulu during the month of April (over 40 million versus 8.9 million!).

If you (1) have too much data, (2) would like to move from 12-24 hour old data to real-time, and (3) need to blend web analytics with additional data sources for a more comprehensive view, you should consider implementing the Truviso Continuous Analytics solution.

Continuous Analytics version 3.2 provides critical new functionality built specifically to solve the uniques problem. The unique-user tracking feature is designed for efficient analysis of tens of billions of unique values (that’s right, billions!), with an expectation of tens to hundreds of millions of actual users for any given category or dimension measured. This information is available instantly in real-time without full table scans, self-joins or 12-24 hour processing delays as is the norm for batch-oriented systems.

The feature itself is based on a novel insight – if the user-id can be mapped to a dense space (we provide connectors to handle cases where this is not true) then it is possible to efficiently represent and maintain the sets of users for each attribute, time dimension or category. Truviso has invented an innovative SQL-based adaptive data structure that is easily used in standard SQL queries – this structure yields high compression and can represent both sparse and dense sets very effectively. In addition, it allows for highly optimized manipulation of the sets – including the ability to add elements in a cumulative fashion, combine sets in an additive fashion for roll-ups, and compare sets to measure similarity.

The Truviso unique-user tracking feature transforms the challenging unique count problem into something that can be processed additively, resulting in enormous data processing productivity gains. It enables streaming on-the-fly unique-user computations, and affords easy roll-up of fine-granularity unique counts to coarser measures. Finally, its efficient set representation allows for a fast and easy implementation of higher-value correlation queries through the standard SQL query language. Some examples of these higher value queries include comparing the actual users who saw a given campaign across different time periods to enable better targeting, or measuring the similarity of the audience of two prime-time shows.
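To show why set-based uniques roll up additively while raw counts do not, here is a small Python sketch that uses an integer as a plain bitmap over dense user IDs. It is deliberately naive: the `UserSet` class is a hypothetical stand-in, it does no compression, and it should not be read as the adaptive structure described above.

```python
class UserSet:
    """A set of dense, non-negative user IDs stored as bits of one integer."""

    def __init__(self, bits=0):
        self.bits = bits

    def add(self, user_id):
        # Adding is cumulative and idempotent: re-adding a user changes nothing.
        self.bits |= 1 << user_id

    def union(self, other):
        # Roll-up: combine finer-grained sets into a coarser one.
        return UserSet(self.bits | other.bits)

    def intersection(self, other):
        return UserSet(self.bits & other.bits)

    def count(self):
        return bin(self.bits).count("1")

    def jaccard(self, other):
        # Similarity of two audiences: |A and B| / |A or B|.
        union_size = self.union(other).count()
        if union_size == 0:
            return 0.0
        return self.intersection(other).count() / union_size


monday = UserSet()
for user_id in (1, 2, 3):
    monday.add(user_id)

tuesday = UserSet()
for user_id in (3, 4):
    tuesday.add(user_id)

two_day_uniques = monday.union(tuesday).count()
similarity = monday.jaccard(tuesday)
print(two_day_uniques, similarity)
```

Adding the daily counts would give 5, but user 3 visited on both days, so the true two-day figure is 4. Keeping the sets, not just the counts, is what makes the roll-up and the similarity comparison possible.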

Ben Lorica at O’Reilly Radar posted a great article discussing how the Truviso approach to uniques allows publishers and marketers to adjust campaigns in real-time with A/B split bucket testing and referral analysis. With Truviso, marketers and business managers can really get a better understanding of loyalty and engagement, and relate that directly to a person — not just as a “new” or “returning” visitor statistic.

In the coming weeks we’ll be sharing more information on this feature, including example use cases as well as mind-blowing performance results.




Release Version 3.2 with Unique User Tracking

By Sailesh Krishnamurthy
November 11, 2009 @ 8:12 am

It’s a big day for Truviso: Continuous Analytics v3.2 is now available! Congrats to our development team and everyone who contributed creativity and bug-free code, and to our customers who beta-tested the new features.

Continuous Analytics version 3.2 is designed with our media, entertainment and other digital media customers in mind. Over the next few days, I’ll dive into the technology specifics of each of the new features, but here’s a brief overview.

Affecting just about every Internet and mobile company, Unique Users Tracking is a thorny challenge for traditional business intelligence and analytics tools, which provide aggregates but with limited insights into individual users. Truviso now makes analyzing unique visitors just as straightforward as any other web metric.

With this innovation, online publishers can take the first step toward really tracking user engagement. This feature will help ad networks recognize and react to trends using predictive scoring, and video publishers optimize content to improve relevance and visitor engagement. By making the most of unique users, online properties will now have the right data to increase advertising impressions and boost revenue.

Other highlights of Continuous Analytics version 3.2:

  • Instantaneous Late Data Processing: Correctly process data coming in asynchronously across complex, distributed systems and networks.
  • Rolling Data Management: Keep costs down by storing the most recent, actionable data for operational users.
  • Multi-Core Processing: Pushing our benchmarks even further, Continuous Analytics can now process up to 15 Terabytes of data a day on an 8-core server.
  • High Availability and Disaster Recovery: Ensure reports are always available, day or night.
  • Transactional Consistency: Guarantee data integrity, even if source systems fail, so all data committed is guaranteed complete.
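
For a sense of scale, the multi-core figure in the list above can be turned into a sustained rate with a little arithmetic. The Python snippet below assumes decimal terabytes (10^12 bytes) and a load spread evenly over 24 hours and 8 cores; both are my assumptions for the sake of the calculation, not stated benchmark conditions.

```python
TERABYTE = 10**12                # assumption: decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

bytes_per_day = 15 * TERABYTE
mb_per_second = bytes_per_day / SECONDS_PER_DAY / 10**6
mb_per_second_per_core = mb_per_second / 8

print(f"{mb_per_second:.1f} MB/s overall")
print(f"{mb_per_second_per_core:.1f} MB/s per core")
```

Under those assumptions, 15 terabytes a day works out to roughly 174 MB/s sustained, or a little under 22 MB/s per core.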

For the v3.2 press release and features overview, visit here.

To find out more, contact us.




Can Your Business Really Use and Benefit from Real-time Data?

By Sailesh Krishnamurthy
November 6, 2009 @ 5:43 pm

It’s been a common refrain for years that many organizations would like to become low-latency enterprises but can’t, because their business processes and operational systems are not equipped to handle data on a real-time basis. Here at Truviso, we’re finding that enterprise customers in Internet, mobile and other digital services are beginning to realize new revenues and competitive differentiation from real-time analytics.

For example, I was talking recently with a VP at a large online ad network. Their existing models, which were driven by batch-mode analytics, degrade very quickly. Real-time is particularly important because the network’s user population changes so frequently, as publishers modify the percentage of their spend on its platform compared to its competitors. In order to survive and adapt to the speed of the market, they need Continuous Analytics.

Advantages include real-time impression-based bidding – a big plus for his media buyer customers compared to static bidding – and better ability for him to monetize inventory that would otherwise go unsold. Combining real-time impression attributes with proprietary client data from the last 90 to 120 days offers a complete user picture. Continuous Analytics enables a competitive advantage for inventory management, ad serving, targeting and reporting.

What are ways that real-time data would have a material impact for your business? Let me know what you think.




© Truviso, Inc. 2009-2010. All Rights Reserved.
Truviso™, Continuous Analytics™, VIA™, TruCQ™, and TruView™ are trademarks of Truviso, Inc.