By Jeff Davis December 13, 2009 @ 11:49 pm
Ian Eure wrote an interesting piece on scalability at Digg:
http://about.digg.com/blog/looking-future-cassandra
A short quotation perfectly summarizes the motivation to move away from existing SQL systems:
The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes. This is completely wrong for large-scale web applications, where response time is critical.
There are two things I like about the above statement:
- The author does not suggest that the problem is inherent, but rather endemic.
- He presents — with crystal clarity — exactly what’s keeping SQL systems out of the running; and it’s not the SQL language. It’s the processing model.
The various NoSQL processing models can be integrated seamlessly into a SQL system. For instance, Truviso (my employer) answers exactly this problem by offering a stream processing model, which computes results as the data arrives. The engine uses the SQL language and is fully integrated with a mature SQL implementation.
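To make the read-versus-write distinction concrete, here is a minimal Python sketch. It is not Truviso's engine or Digg's code; the class and function names (`StreamingCounter`, `read_time_count`) and the sample events are made up for illustration. It only contrasts doing the aggregation once per arriving record with redoing it on every read.

```python
from collections import defaultdict


class StreamingCounter:
    """Hypothetical example: keeps per-key counts up to date at write time."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_arrival(self, key):
        # The computation happens here, once per incoming record.
        self.counts[key] += 1

    def read(self, key):
        # Reading is a plain lookup: no scan, no aggregation.
        return self.counts.get(key, 0)


def read_time_count(events, key):
    # The contrasting model: keep raw events and aggregate on every read.
    return sum(1 for e in events if e == key)


# Illustrative data only.
events = ["story-1", "story-2", "story-1", "story-3", "story-1"]

counter = StreamingCounter()
for e in events:
    counter.on_arrival(e)

print(counter.read("story-1"))             # 3, answered by lookup
print(read_time_count(events, "story-1"))  # 3, answered by rescanning
```

Both paths return the same answer; what differs is where the cost lands. The first pays a small cost per write and makes reads cheap, which is the trade-off the quotation asks for.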
The author is moving toward NoSQL, which is a kind of “back to basics” database system movement trying to build database systems from the ground up. The big push behind NoSQL is clearly performance; but discarding SQL systems also discards all of the lessons learned over the years for managing the variety of queries that real businesses require.
One of those lessons is the declarative language itself, SQL, which started out as a primitive language but grew much richer over time. NoSQL systems either use a new declarative language that is much less powerful than SQL, or regress all the way to a key/value (or graph) storage system. Poor language support means a poor optimizer. It’s often possible to work around a poor optimizer, but these workarounds quickly turn into herculean engineering efforts as you try to use a dumb engine in a clever way.
The next lesson is that database systems must be suitable for a wide variety of queries. If you are running only one query, and you know what it is in advance, then clearly you can engineer your whole data architecture around that single query. But for most companies, that’s far from reality — they need to add queries on the fly, query historical data, and join new data with historical data. Additionally they need a language flexible enough that this can be done immediately, rather than kicking off a new engineering project every time they need to add a query.
A unified database management system that integrates NoSQL processing models with a traditional SQL system is the real answer here; and streaming is one way to accomplish that. This integration allows a wide range of data processing strategies to work together – traditional tables offer recovery of streaming data, for instance – rather than forcing you to choose a single processing model.
In other words, the language and logical model should be separate from the processing model. And isn’t that what the relational model is all about?
By Sailesh Krishnamurthy December 4, 2009 @ 4:32 pm
The philosophy of the Truviso continuous analytics approach is that data is most efficiently processed while on the move as opposed to while at rest. Traditional store-first, query-later data warehouses are like the Hotel California in the famous Eagles song – easy to get into, hard to get out of – which is more complimentary than what a partner of ours calls them: the “Roach Motel” of enterprises where data goes to die.
What really sets the Truviso approach apart from other related real-time technologies is that the timing of data processing is decoupled from data consumption. In other words, just because the analysis of data occurs in real-time does not mean that the results of the analysis must also be consumed immediately. This subtlety was lost on various CEP vendors who focus on the “now” and only analyze current conditions and exceptions, as described perspicaciously by Doug Henschen in an Intelligent Enterprise Q&A. This decoupled approach is also increasingly finding other uses such as in “assist or suggest” capabilities for Internet search, as discussed in a good GigaOM post.
Truviso’s approach is realized in a Stream-Relational Database Management System (see our CIDR 2009 paper for more details) where the results of continuous analysis of data are stored natively in a high-performance fashion. This lets us blend the real-time-only nature of stream processing with the stability, flexibility and familiarity of OLAP-style analytics in a single architecture. Furthermore, having both real-time and OLAP functionalities tightly integrated in a single system enables our customers to easily marry analyses of both live and historical data using standard SQL queries.
With this hybrid architecture, Truviso has created a solution for analyzing recent data. In some cases – such as for Internet, video or mobile usage dashboards – analytics should be in real-time. In contrast, some back office systems may only require updates by the hour or by the day to meet operational needs and service level agreements.
In other words, it’s like the old U.S. Army saying of “Hurry up and wait”. While data processing occurs continuously in real-time for maximum efficiency, the analytics is available on demand in whatever time periods that operational systems and business users need. This distinction is critical to the success of the Truviso solution: maximize efficiency and scalability through continuous processing, while providing analytics “whenever needed” for both people (internal users, customers and partners) and operational systems.
With Truviso, you’re providing analytics in real-time to those who want and need it, while integrating seamlessly with existing infrastructures that operate on timed intervals.
In my next post I’ll describe the evolution of Continuous Analytics in historical context. Stay tuned!
By Sailesh Krishnamurthy November 25, 2009 @ 4:32 pm
Applications that load data typically operate in transactional fashion, where any failure because of software, hardware or network error results in the data loading transaction aborting, forcing the application client to re-load the appropriate data. The problem with this in most analytical event processing (AEP) and complex event processing (CEP) systems is that individual records that have already been processed by continuous queries have to be undone if those records are part of a data loading transaction that has been aborted. Delaying the processing of the data until the completion of the loading transaction leads to significant latency, and is therefore unacceptable in a real-time environment.
The Truviso Continuous Analytics version 3.2 release offers full dataset transactional consistency for data loading integrity. If all the records of a dataset are not successfully transmitted, the system is able to “unwind” its effects on all reports and analyses that were affected.
Truviso achieves transactional consistency by running two processes in parallel: (1) a “bookkeeping” protocol that tracks when a loading dataset begins and commits, and (2) a continuous processing activity that continually updates queries. In the event of a transaction abort, the bookkeeping function records the partial state of the data load, and the system unwinds any committed data that will adversely affect results until the dataset re-loads. This function maintains the correct data state in a seamless and transparent fashion.
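The interplay of the two activities can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (a single running-total query, additive contributions, one process); the class `LoadBookkeeper` and its methods are hypothetical names and do not describe the actual protocol in the product.

```python
from collections import defaultdict


class LoadBookkeeper:
    """Hypothetical sketch: process records immediately, undo them on abort."""

    def __init__(self):
        self.totals = defaultdict(int)  # continuously updated query result
        self.pending = {}               # dataset id -> uncommitted contributions

    def begin(self, dataset_id):
        # Bookkeeping: note that a loading dataset has started.
        self.pending[dataset_id] = defaultdict(int)

    def process(self, dataset_id, key, amount):
        # Continuous processing: results change right away, before commit.
        self.totals[key] += amount
        self.pending[dataset_id][key] += amount

    def commit(self, dataset_id):
        # Contributions become permanent; the undo information is dropped.
        del self.pending[dataset_id]

    def abort(self, dataset_id):
        # Unwind exactly what the partial load contributed.
        for key, amount in self.pending.pop(dataset_id).items():
            self.totals[key] -= amount


book = LoadBookkeeper()

book.begin("load-a")
book.process("load-a", "clicks", 5)
book.commit("load-a")

book.begin("load-b")
book.process("load-b", "clicks", 7)   # visible immediately
book.abort("load-b")                  # transmission failed part-way
after_abort = book.totals["clicks"]

book.begin("load-b")                  # the client re-loads the dataset
book.process("load-b", "clicks", 7)
book.process("load-b", "clicks", 2)
book.commit("load-b")

print(after_abort, book.totals["clicks"])  # 5 14
```

The point of the sketch is that nothing waits for the commit: records affect results as they arrive, and the per-dataset log is only consulted if the load aborts.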
As a result, with Truviso v3.2, you can rely on your infrastructure to maintain data integrity, including when connecting the system with extract transform load (ETL) or other data-loading transactional applications. Another important application is when moving data around with a transactional message queue in order to guarantee that data is not lost anywhere. Best of all, the bookkeeping protocol is so lightweight that safety is accomplished without sacrificing any appreciable performance.
The ability to exploit deep RDBMS (relational database management system) integration enables Truviso to provide unmatched levels of data integrity – a major benefit that no AEP/CEP system can deliver because of their inherent architectural constraints.
By Sailesh Krishnamurthy November 24, 2009 @ 4:55 pm
There are two trends driving the exploding data volumes in the digital and social media world today. First, with the network effect of Metcalfe’s Law, the success of the real-time web in connecting people leads to increased engagement and therefore more data. Second, there is a trend we are calling the Deep Real-Time Web that refers to the machine-to-machine communication resulting from and in fact dwarfing the user-driven interactions. In addition to handling these exploding data volumes, customers who have implemented a Truviso operational analytics solution “democratize the data” by providing a low-latency and high-concurrency environment for interactive queries.
The Truviso Continuous Analytics 3.2 release introduces a highly scalable data-parallel processing architecture that exploits the newest many-core hardware as well as multi-machine cluster configurations. Software needs to be written specifically for multi-core architectures to avoid performance problems; while we did it right, it’s an area where some other software companies stumble, as pointed out in a good Forbes.com article.
Truviso achieves vertical scalability on many-core SMP systems by splitting incoming data into independent runs on the fly, so that each run can be processed in parallel and the results consolidated continuously. This vertical scalability technology lets enterprises save money on hardware, networking, data center and maintenance costs. For example, a large media network with analytics applications running on a 40-node cluster of specialized data warehouse servers, each costing over $50,000, successfully offloaded the most high-value and high-volume workloads at 500,000 records/second onto one production and one hot failover server running Continuous Analytics, with hardware costs of just $6,500 for each commodity server.
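The split, process and consolidate pattern can be illustrated with a small Python sketch. It is a toy under stated assumptions: the helper names (`split_into_runs`, `aggregate_run`, `consolidate`) and the sample records are invented, the split is a simple round-robin, and a thread pool stands in for real many-core execution (in CPython, threads will not speed up CPU-bound work, so this shows the shape of the computation rather than its performance).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_runs(records, n_runs):
    # Round-robin split; each run can be processed without seeing the others.
    return [records[i::n_runs] for i in range(n_runs)]


def aggregate_run(run):
    # Partial result for one run: page views per page.
    return Counter(page for page, _user in run)


def consolidate(partials):
    # Counts are additive, so partial results merge by summation.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Illustrative (page, user) records.
records = [
    ("home", "u1"), ("video", "u2"), ("home", "u3"),
    ("home", "u1"), ("video", "u4"), ("search", "u2"),
]

runs = split_into_runs(records, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(aggregate_run, runs))

page_views = consolidate(partials)
print(page_views["home"], page_views["video"], page_views["search"])  # 3 2 1
```

This only works so directly because the aggregate is additive. Measures such as distinct counts need a mergeable representation before they can be split across runs in the same way.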
Speaking of hot failover, with version 3.2 you have the option to use a cluster configuration to achieve high availability / disaster recovery (HA/DR), seamless failover and distributed scalability, as well as scale out processing of interactive queries. The clustering infrastructure allows multi-master replication where data can be sent to any node in the cluster. Queries can also be deployed in the system in an active-active fashion to provide instant failover in the event of a node failure. A failed node can be brought back online and transparently start processing the latest data, while automatically and seamlessly “catching up” with the data that it missed during the time it was offline in an asynchronous fashion.
In addition to high availability and failover, this release offers continuous online backup features that can be used to organize a cluster in an active-standby configuration. Together with our TruLink connectors to aggregate data from disparate sources, we provide lots of flexible options to fit seamlessly in your existing infrastructure.
By Sailesh Krishnamurthy November 19, 2009 @ 4:08 pm
Truviso 3.2 Rolling Data Management
Like many technologists, I find customer interactions to be one of the most fulfilling parts of my job. The Internet world in particular is extremely liberating – customer problems are so unique that none of the standard assumptions hold any more. It turns out that for these folks, their business doesn’t fit the traditional model of a data warehouse that stores all the data from the beginning of time.
What really matters, what drives the business, are the events of the last 60 to 120 days, and that is because these businesses operate at a blinding pace. Their user populations are constantly in flux, growing and changing in various ways. Heck, oftentimes their very business models change within a year. In other words, for these organizations recent performance is a better indicator of future results, as opposed to the cautionary maxim of “past performance does not guarantee future results” that our friends on Wall Street probably wish they followed more closely!
It’s the recent data that’s most valuable, that’s most required and actively sought out by business managers making decisions on the front line day in and day out. Of course, it’s necessary to keep all historical data for various reasons such as compliance, accounting, and long-term trend analysis by the high priests of data.
These Internet companies, especially the trail-blazers, have realized that their enterprise data warehouses (EDWs) are needlessly over-provisioned. Many organizations are taking Bill Inmon’s advice and storing their most recent transactional data on an operational data store.
Truviso takes the ODS to the next level by providing live production analytics on top of transactional reporting, thereby complementing the EDW, and offloading the most challenging and useful workloads to a location where those workloads can be immediately used to improve operations. This model lets the EDW be used for its strengths, such as open-ended “blue sky” analysis incorporating years’ worth of historical data.
A critical aspect of Truviso Continuous Analytics version 3.2 that was architected for ODS environments is rolling data management – a feature necessary to efficiently move old data out of the system on a continuous, controlled basis. The actual mechanics are based on infrastructure that can be used to organize detailed as well as summary data into separate partitions (typically on a daily basis) thus letting older partitions be instantly groomed. In a typical implementation, the system is configured to periodically offload data onto a different system such as a data warehouse or a more cost-effective storage array, or even a Hadoop cluster. This reduces the cost of storage and maintenance, cuts hardware costs, and speeds up query processing by retaining only “recent” data in the operational analytics system.
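As a rough illustration of the idea, the Python sketch below keeps one partition per day and moves whole partitions out once they fall outside a retention window. Everything in it is an assumption made for the example: the `RollingStore` class, the 90-day window, and the in-memory `archive` dictionary that stands in for a warehouse, storage array or Hadoop cluster.

```python
from datetime import date, timedelta


class RollingStore:
    """Hypothetical sketch: daily partitions with a fixed retention window."""

    def __init__(self, retention_days):
        self.retention_days = retention_days
        self.partitions = {}  # day -> rows kept in the operational system
        self.archive = {}     # stand-in for the external long-term store

    def insert(self, day, row):
        self.partitions.setdefault(day, []).append(row)

    def groom(self, today):
        # Whole partitions are moved at once; no row-by-row deletes.
        cutoff = today - timedelta(days=self.retention_days)
        expired = [d for d in self.partitions if d < cutoff]
        for d in expired:
            self.archive[d] = self.partitions.pop(d)
        return sorted(expired)


today = date(2009, 11, 19)
store = RollingStore(retention_days=90)
store.insert(today - timedelta(days=120), {"page": "home", "views": 10})
store.insert(today - timedelta(days=30), {"page": "home", "views": 12})
store.insert(today, {"page": "home", "views": 7})

groomed = store.groom(today)
print(len(store.partitions), len(store.archive))  # 2 1
```

Because grooming drops a partition as a unit, its cost does not depend on how many rows the old partition holds, which is what makes the "instantly groomed" behavior plausible.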
Tell me your stories – I’d love to know what fraction of your analytics workloads focus on recent data.
By Sailesh Krishnamurthy November 17, 2009 @ 2:19 pm
Not only do Big Data analytics systems traffic in massive volumes of data, they use a specialized integration pattern. These are not traditional data warehouses that integrate data from CRM/ERP and OLTP systems using custom extract transform load (ETL) tools. In fact, I’d hazard that in 4 out of 5 New Media analytics implementations source data is primarily from log files that are continually spooled by large banks of web/application servers. For example, one large social media enterprise I recently met with uses a cluster of over 250 servers for targeted advertising.
These systems are typically organized as massive distributed clusters, often in several data centers. In such environments, failures are a fact of life, and when a failed node recovers it typically sends older spooled data to the server, and that data can be minutes, hours, or sometimes even days late. For example, during pre-production analysis at a leading digital media business, we found several examples where data reached the warehouse 23 or more hours late. This late data never made it into their legacy reporting system’s batch processing window and was unceremoniously dropped from the reports that were used to make critical business decisions!
Unfortunately, this is not an isolated incident – late data affects all kinds of systems ranging from Hadoop and other MapReduce implementations to more traditional data warehouse technologies purveyed by both storied vendors as well as scrappy startups. Late data causes serious problems for practitioners who manage today’s batch processing systems, with admins often rigging up band-aids. A common approach we’ve seen is to re-run batch reports several times as “insurance”, reminding me of the old Unix sysadmin trick of running sync three times – and still throwing salt over the left shoulder for good measure.
We believe that it is unacceptable to have to choose between delivering on-time reports that become incorrect as soon as new data comes in, and increasing both capex and latency to run and re-run batch jobs and still not be 100% safe! Our customers, who use a motley mix of many analytics systems, have been clamoring for a fix to the late data problem and finally have a real solution with Truviso.
The Truviso Continuous Analytics 3.2 release continuously processes and updates data as it enters the system, whenever it does, and reflects the latest and greatest data by adjusting computations on the fly. Late data management is accomplished through a pioneering technique: query rewrites that transparently pre-process queries independently and combine the pre-processed results on the fly, so that queries are always answered with correct and up-to-date results.
With Truviso, Big Data customers can process massively large volumes of data at very low latencies without being affected by the order in which the data arrives in the system. As a result, reports are always delivered instantly on the fly and include all data regardless of age or when it was submitted to the system.
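One simple way to picture order-insensitive reporting is to keep partial results keyed by when an event happened rather than when it arrived, and to combine those partials at query time. The Python sketch below does only that. It is an assumption-laden illustration, not the query-rewrite mechanism itself: `HourlyPartials`, the hour buckets and the numbers are all invented for the example.

```python
from collections import defaultdict


class HourlyPartials:
    """Hypothetical sketch: per-hour partial sums combined when queried."""

    def __init__(self):
        self.by_hour = defaultdict(int)  # event hour -> impressions

    def ingest(self, event_hour, impressions):
        # Keyed by event time, so arrival order does not matter.
        self.by_hour[event_hour] += impressions

    def report(self, first_hour, last_hour):
        # Combine the pre-computed partials on the fly.
        return sum(self.by_hour.get(h, 0)
                   for h in range(first_hour, last_hour + 1))


partials = HourlyPartials()
partials.ingest(0, 100)
partials.ingest(1, 120)
partials.ingest(2, 90)
before_late_data = partials.report(0, 2)

# A recovered node delivers hour-0 data long after hours 1 and 2 were seen.
partials.ingest(0, 40)
after_late_data = partials.report(0, 2)

print(before_late_data, after_late_data)  # 310 350
```

No batch job is re-run here: the late record updates one partial, and the next report picks up the corrected total.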
I’d love to know more about how late data arrives in your analytics systems, and how your reporting infrastructure deals with it. Tell me your war stories!
By Sailesh Krishnamurthy November 16, 2009 @ 4:54 pm
Core to almost every measurement metric for online and mobile media networks are counts of “unique users” or “unique visitors”. It is vital that digital service providers can accurately track and measure how their massive visitor populations interact with the services provided, and the measure of choice is unique users over various dimensions (e.g., demographics, behavior, geography and time intervals).
This is a hard problem, and unsurprisingly even leading web analytics offerings like Omniture impose severe limits on the number of unique users and page views that they can report on, with the dreaded “database uniques exceeded” error message that analysts have grown to hate. In addition, these systems face unacceptable reporting latencies because of their batch processing paradigms, and/or quality issues due to reliance on sample data. It’s common to hear our customers complain that they just cannot reconcile their Omniture and Google Analytics numbers. Of course, there was also the recent controversy with comScore and Nielsen reporting very different unique user numbers for Hulu during the month of April (over 40 million versus 8.9 million!).
If you (1) have too much data, (2) would like to move from 12-24 hour old data to real-time, and (3) need to blend web analytics with additional data sources for a more comprehensive view, you should consider implementing the Truviso Continuous Analytics solution.
Continuous Analytics version 3.2 provides critical new functionality built specifically to solve the uniques problem. The unique-user tracking feature is designed for efficient analysis of tens of billions of unique values (that’s right, billions!), with an expectation of tens to hundreds of millions of actual users for any given category or dimension measured. This information is available instantly in real-time without full table scans, self-joins or 12-24 hour processing delays as is the norm for batch-oriented systems.
The feature itself is based on a novel insight – if the user-id can be mapped to a dense space (we provide connectors to handle cases where this is not true) then it is possible to efficiently represent and maintain the sets of users for each attribute, time dimension or category. Truviso has invented an innovative SQL-based adaptive data structure that is easily used in standard SQL queries – this structure yields high compression and can represent both sparse and dense sets very effectively. In addition, it allows for highly optimized manipulation of the sets – including the ability to add elements in a cumulative fashion, combine sets in an additive fashion for roll-ups, and compare sets to measure similarity.
The Truviso unique-user tracking feature transforms the challenging unique count problem into something that can be processed additively, resulting in enormous data processing productivity gains. It enables streaming on-the-fly unique-user computations, and affords easy roll-up of fine-granularity unique counts to coarser measures. Finally, its efficient set representation allows for a fast and easy implementation of higher-value correlation queries through the standard SQL query language. Some examples of these higher value queries include comparing the actual users who saw a given campaign across different time periods to enable better targeting, or measuring the similarity of the audience of two prime-time shows.
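To see why a set representation turns unique counts into something additive, consider the following Python sketch. It assumes user ids have already been mapped to small dense integers and uses a plain, uncompressed bitmap (a Python integer) per set. The `UserSet` class and the sample ids are hypothetical, and the sketch makes no attempt at the compression or adaptivity described above.

```python
class UserSet:
    """Hypothetical sketch: a set of dense integer user ids as a bitmap."""

    def __init__(self, bits=0):
        self.bits = bits

    def add(self, user_id):
        # Adding is cumulative and idempotent: repeat visits change nothing.
        self.bits |= 1 << user_id

    def union(self, other):
        # Roll-up: combine finer-grained sets into a coarser one.
        return UserSet(self.bits | other.bits)

    def count(self):
        return bin(self.bits).count("1")

    def jaccard(self, other):
        # Similarity of two audiences: |A and B| / |A or B|.
        union = bin(self.bits | other.bits).count("1")
        if union == 0:
            return 0.0
        return bin(self.bits & other.bits).count("1") / union


monday, tuesday = UserSet(), UserSet()
for uid in (0, 1, 2, 1):   # user 1 visits twice on Monday
    monday.add(uid)
for uid in (2, 3):
    tuesday.add(uid)

two_days = monday.union(tuesday)
print(monday.count(), tuesday.count(), two_days.count())  # 3 2 4
print(monday.jaccard(tuesday))                            # 0.25
```

Adding the daily unique counts would give 5, which double-counts user 2. Combining the sets gives the correct 4, which is why rolling up sets works where rolling up counts does not.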
Ben Lorica at O’Reilly Radar posted a great article discussing how the Truviso approach to uniques allows publishers and marketers to adjust campaigns in real-time with A/B split bucket testing and referral analysis. With Truviso, marketers and business managers can really get a better understanding of loyalty and engagement, and relate that directly to a person — not just as a “new” or “returning” visitor statistic.
In the coming weeks we’ll be sharing more information on this feature, including example use cases as well as mind-blowing performance results.
By Sailesh Krishnamurthy November 11, 2009 @ 8:12 am
A big day for Truviso: Continuous Analytics v3.2 is now available! Congrats to our development team and everyone who’s put in creativity and bug-free code, and to our customers who beta-tested new features.
Continuous Analytics version 3.2 is designed with our media, entertainment and other digital media customers in mind. Over the next few days, I’ll dive into the technology specifics of each of the new features, but here’s a brief overview.
Affecting just about every Internet and mobile company, Unique Users Tracking is a thorny challenge for traditional business intelligence and analytics tools, which provide aggregates but with limited insights into individual users. Truviso now makes analyzing unique visitors just as straightforward as any other web metric.
With this innovation, online publishers can take the first step toward really tracking user engagement. This feature will help ad networks recognize and react to trends using predictive scoring, and video publishers optimize content to improve relevance and visitor engagement. By making the most of unique users, online properties will now have the right data to increase advertising impressions and boost revenue.
Other highlights of Continuous Analytics version 3.2:
- Instantaneous Late Data Processing: Correctly process data coming in asynchronously across complex, distributed systems and networks.
- Rolling Data Management: Keep costs down by storing the most recent, actionable data for operational users.
- Multi-Core Processing: Pushing our benchmarks even further, Continuous Analytics can now process up to 15 Terabytes of data a day on an 8-core server.
- High Availability and Disaster Recovery: Ensure reports are always available, day or night.
- Transactional Consistency: Guarantee data integrity, even if source systems fail, so all data committed is guaranteed complete.
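For a sense of scale, the multi-core figure above can be turned into a sustained rate with a little arithmetic. The short Python calculation below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even load across the day and across the 8 cores; both are simplifications, not published details.

```python
TB = 10**12                      # assumption: decimal terabytes
MB = 10**6                       # assumption: decimal megabytes
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

bytes_per_second = 15 * TB / SECONDS_PER_DAY
mb_per_second = bytes_per_second / MB
mb_per_second_per_core = mb_per_second / 8

print(round(mb_per_second, 1))           # 173.6
print(round(mb_per_second_per_core, 1))  # 21.7
```

In other words, 15 Terabytes a day works out to roughly 174 MB every second, around the clock, or a little under 22 MB per second per core under these assumptions.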
For the v3.2 press release and features overview, visit here.
To find out more, contact us.
By Sailesh Krishnamurthy November 6, 2009 @ 5:43 pm
It’s been a common refrain for years that many organizations would like to become low-latency enterprises but can’t, because their business processes and operational systems are not equipped to handle data on a real-time basis. Here at Truviso, we’re finding that enterprise customers in Internet, mobile and other digital services are beginning to realize new revenues and competitive differentiation from real-time analytics.
For example, I was talking recently with a VP at a large online ad network. Their existing models, driven by batch-mode analytics, degrade very quickly. Real-time is particularly important because the network’s user population changes so frequently, as publishers modify the percent of their spend on its platform compared to its competitors. In order to survive and adapt to the speed of the market, they need Continuous Analytics.
Advantages include real-time impression-based bidding – a big plus for his media buyer customers compared to static bidding – and better ability for him to monetize inventory that would otherwise go unsold. Combining real-time impression attributes with proprietary client data from the last 90 to 120 days offers a complete user picture. Continuous Analytics enables a competitive advantage for inventory management, ad serving, targeting and reporting.
What are ways that real-time data would have a material impact for your business? Let me know what you think.
By Asad Ali September 4, 2009 @ 7:00 am
A recent Wall Street Journal article described the weak sales of a Gatorade drink for PepsiCo Inc. due to a marketing stumble in the makeover efforts of the Gatorade brand. According to the report, consumers were confused by the Gatorade “G” advertising campaign. The challenges with predicting campaign and branding response from the general public are not unique to Pepsi. With any new marketing campaign, the main challenge is to effectively measure consumers’ perception of the product and how it changes from earlier perceptions.
The effectiveness of any marketing campaign requires strategic thinking about the promotion prior to launch, but just as importantly once the campaign is running, data analysis is needed to measure the ongoing progress of such campaigns. For Web 2.0 companies, ad networks and publishers, the timely measurement of campaign effectiveness is even more acute, as the life of any marketing promotion is generally short lived.
Proactive monitoring and trend analysis of the marketing campaigns is becoming a necessity as a day’s worth of bad results can equate to millions of dollars in unrealized or lost revenue. However, many of the analytic systems used to measure and track campaigns don’t provide the timely access to the information that is required for advertisers to evaluate the campaign success. Different analytical systems offer different ways to measure the performance.
There’s usually a “fast” component to reporting, where absolute numbers are delivered based on transactions. This is usually available on the next day after a campaign, but it can be as quick as 3-6 hours with some systems. Additionally, there’s a “trend” view that shows how a campaign is doing over time. This is always time-delayed and usually not very granular (i.e., based on samples, or only aggregate information).
Lastly, there’s target vs. actual. This is more than just a DoubleClick DART or web analytics report – it’s comparing the original target audience with real profile information from real users. If, for example, women 18-25 living in NY with an income of $25,000 – $45,000 were targeted, was that the segment that actually received the most impressions? And, more importantly, was that the most successful demographic? This information is only available if web analytics is combined with application data and overlay data.
The ability to perform this type of Deep Real-time Web data analysis is complicated because it requires looking beyond current page view and impression information. The constantly arriving stream of that data must be analyzed in the context of what is known about users’ demographic information and their past behavior and preferences.
To realize the benefits of any new marketing campaign, or to understand if the campaign is working as planned, being able to see trends within minutes or hours is crucial. Companies advertising online have an advantage – the needed data is available. The hard part is parsing that data quickly, getting it to the right people, and making sense of it, but this timely data analysis ensures proper audience targeting and segmentation and limits the risk of poor campaign performance and misspent budgets.
Truviso is helping online advertisers and publishers see the impact of campaigns immediately rather than in days or weeks. The Continuous Analytics system provides dashboards with audience segmentation and even unique user drill-down, and is helping advertisers measure the effectiveness of their customer outreach campaigns in a timely fashion.
The value of our product is immediately visible to the marketers (and the finance team measuring return on advertising spend (ROAS)) who are now able to derive meaningful information from their data and ensure that their message is being received by the right people in the right place at the right time.