The “Deep Web” is the vast collection of information that is hidden in databases and file systems. The more familiar “surface” Web, which consists largely of HTML and text, contains only a small fraction of the information available on the Internet. The Deep Web contains much of the valuable data on the Web, but is largely invisible to standard web crawling techniques. As a result, search engine companies and researchers are hard at work developing new approaches to get at this important information. (See “Exploring a Deep Web that Google Can’t Grasp”.)
It should be no surprise, then, that a similar phenomenon arises in the “Real-Time Web”. All the tweets, blog posts, and comments on the latest viral video that make up the surface real-time web sit atop a much larger, constantly evolving, continuously growing set of structured data streams and conversations that are driven by the activities on the surface. And, as on the regular Web, much of the real value lies under the surface.
Where does this deep real-time information come from? It is machine-generated. It is the constant chatter of the complex and dynamic software and hardware stack that underlies all Web 2.0 applications. Every user action (or sometimes even inaction) generates a cascade of data describing what was shown, clicked, viewed and interacted with as well as log and performance data from all the systems and networks along the way.
Let’s say a user visits a Web site to watch a video. Logs record this page view, each ad that is served and presented, the various requests made to collect the data needed to construct the page, and so on. Clicking on an ad generates yet more data. When the video is started, “beacons” in the video return a stream of periodic status updates recording content, user actions (stop/start/fast forward), video quality, bit rate, and so on. When the user posts a comment about the video, tweets about it, or forwards a link to friends, still more data is generated. Thus, each surface action results in many data events in the Deep Real-Time Web.
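To make this cascade concrete, here is a minimal Python sketch of the events a single video view might emit. The event names, fields, identifiers, and the 10-second beacon interval are all illustrative assumptions, not any particular site's logging format.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Event:
    ts: float   # seconds since the session started
    kind: str   # hypothetical event type, e.g. "page_view" or "beacon"
    data: dict = field(default_factory=dict)


def video_session(watch_seconds: int, beacon_interval: int = 10) -> Iterator[Event]:
    """Yield the (assumed) events produced by one user watching one video."""
    yield Event(0.0, "page_view", {"url": "/videos/example"})
    yield Event(0.1, "ad_impression", {"ad_id": "ad-1"})
    yield Event(1.0, "video_start", {"video_id": "vid-1"})
    # Periodic beacons report playback status while the video plays.
    for position in range(beacon_interval, watch_seconds + 1, beacon_interval):
        yield Event(1.0 + position, "beacon",
                    {"position": position, "bitrate_kbps": 1500})
    yield Event(1.0 + watch_seconds, "video_stop", {"position": watch_seconds})


events = list(video_session(watch_seconds=60))
beacons = [e for e in events if e.kind == "beacon"]
print(f"{len(events)} events, {len(beacons)} of them beacons, from one video view")
```

Even in this toy model, one minute of viewing produces ten events, and the count grows with watch time, with every ad click, and with every comment or share.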
Why is this Deep Real-Time Web data valuable? Consider an advertiser who purchases a video advertising campaign targeted at certain types of customers. The success of this campaign depends on many factors, including the accuracy of the targeting, the other campaigns that might be running concurrently with this one, and the video quality in terms of both content and delivery. Quickly and correctly understanding what is going on with the campaign can mean the difference between success and failure.
The Deep Real-Time Web presents new data analytics challenges due to both time constraints and scale. Companies that use Deep Real-Time Web data can easily find themselves dealing with terabytes of data a day, and the analysis of that data must be done quickly enough that the company can react when something is going wrong, or capitalize on new opportunities as they arise.
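One common way to analyze a stream as it arrives, rather than after it has been loaded into a warehouse, is to aggregate over fixed time windows and react as each window closes. The sketch below is a generic illustration in plain Python, not Truviso's implementation or API. The tuple layout, the campaign names, the 60-second window, and the 1000 kbps alert threshold are all assumptions.

```python
from collections import defaultdict


def tumbling_averages(events, window_seconds=60):
    """Average bit rate per campaign over fixed, non-overlapping windows.

    `events` is an iterable of (timestamp, campaign, bitrate_kbps) tuples,
    assumed to arrive in timestamp order. Yields (window_start, averages)
    each time a window closes.
    """
    current_window = None
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, campaign, bitrate in events:
        window = int(ts // window_seconds) * window_seconds
        if current_window is not None and window != current_window:
            # The previous window is complete: emit it and start fresh.
            yield current_window, {c: sums[c] / counts[c] for c in sums}
            sums.clear()
            counts.clear()
        current_window = window
        sums[campaign] += bitrate
        counts[campaign] += 1
    if current_window is not None:
        # Flush the final, possibly partial, window.
        yield current_window, {c: sums[c] / counts[c] for c in sums}


# Hypothetical beacon data: campaign "A" suffers a quality drop in minute two.
sample = [(5, "A", 1500), (30, "A", 1300), (65, "A", 400),
          (70, "B", 1600), (125, "A", 1500)]

results = list(tumbling_averages(sample))
alerts = [(start, campaign)
          for start, averages in results
          for campaign, avg in averages.items()
          if avg < 1000]  # assumed alert threshold in kbps
print(alerts)
```

Because each window is emitted as soon as the first event of the next one arrives, a problem such as the quality drop for campaign "A" shows up within about a minute instead of in the next day's batch report.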
Truviso is built for this world of continuously streaming real-time data. Early on, with a partner, we developed the first dynamic tag cloud that showed the evolution of hot topics in the blogosphere (“Truviso Shows Off Dynamic Database with Technorati Tag Cloud”). More recently, our customers have been using Truviso to extract value from the less visible but vastly larger streams of the Deep Real-Time Web. Truviso Continuous Analytics enables them to understand the performance of ad campaigns as they evolve, quickly detect unexpected problems, and even take load off data warehouses that are struggling to keep up with the massive data volumes of the Deep Real-Time Web.
And more breakthroughs will come from combining insights gleaned through both the Surface and Deep Real-Time Webs.




