July 01, 2008
A New Approach to Searching Flash
It’s been a while since I’ve posted, but it’s not because I haven’t had anything new to write about. We’ve been working hard on Thermo and I had the pleasure of showing off a recent build at Flashbelt in Minneapolis a few weeks ago. One of the challenges I’ve had with blogging is that many of my projects can’t be talked about while they’re under development. Fortunately, these things don’t stay hidden forever!
One of the projects I worked on a while ago that I can now talk about is a new approach to searching Flash-based applications and content more effectively. We developed it in collaboration with Google and Yahoo. Google is in the process of rolling it out and Yahoo is committed to doing so in the near future.
To understand why a new approach is needed, let’s take a step back and examine how search engines work with basic web content today. During the indexing process, HTML and other well-defined file formats are retrieved, parsed, and analyzed for content such as text, graphics, metadata, and, most importantly, links to other content. By traversing the set of links, the indexer can crawl the site and discover all of its content.
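To make the link-traversal idea concrete, here is a minimal sketch in Python. It is not how any real search engine is implemented. The `SITE` dictionary, the `LinkExtractor` class, and the `crawl` function are all made up for illustration, and the pages live in memory rather than being fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags without running any code."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site: URL path -> HTML source. A real crawler would fetch these.
SITE = {
    "/": '<h1>Home</h1><a href="/about">About</a><a href="/products">Products</a>',
    "/about": '<h1>About</h1><a href="/">Home</a>',
    "/products": '<h1>Products</h1><a href="/products/widget">Widget</a>',
    "/products/widget": '<h1>Widget</h1><a href="/">Home</a>',
    "/orphan": "<h1>Never linked, so never found</h1>",
}


def crawl(start):
    """Breadth-first traversal of links, returning pages in discovery order."""
    seen = [start]
    queue = deque([start])
    while queue:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(SITE.get(url, ""))
        for link in parser.links:
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


discovered = crawl("/")
print(discovered)
```

Note that the `/orphan` page is never discovered, because nothing links to it. An indexer working this way only finds what the links declare.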
This works because HTML is a simple, declarative format that is easy to parse and understand. Or at least, that’s how HTML used to be! The declarative nature of HTML is important, because it means that you can look at a tag such as a link or heading and the format "declares" what it is. You don't have to run any code to understand it – you can tell just by looking at it.
The fact that SWF files are binary has led some people to conclude that this is why Flash is hard to index. However, this isn’t really the reason. Search engines can and do index SWF files today.
What actually makes Flash hard to index is the same thing that makes AJAX applications hard to search, and that is that they don’t work like simple, easy to interpret HTML. You can't tell what they do just by looking at them. Rather, they are complex bodies of code that execute in the browser and do non-discoverable things such as calling out to web services and dynamically generating what the user sees.
In fact, this morning I was reading a report on news.com about what we’ve done, and some of the comments illustrated how pervasive the problem is, even though the commenters thought they were making a different point. The issue brought up was a simple one: how can Google and Yahoo even figure out which SWF file to load, since it's done with complex JavaScript? One commenter wrote: “Will Google and Yahoo parse everyone's different javascript technique to reveal where the .swf file (is)?”
This gets to the heart of the problem. Search engines can no longer figure out what’s happening simply by looking at the source code – they have to run it, whether it’s complex JavaScript to load a SWF, some XML, or other content, or the Flash application itself.
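A small Python sketch shows why looking at the source is not enough. Everything here is hypothetical: the two page snippets, the `SwfFinder` class, and the `loadMovie` JavaScript function are invented for illustration and do not correspond to any particular loader library.

```python
from html.parser import HTMLParser


class SwfFinder(HTMLParser):
    """Looks for .swf URLs in tag attributes only; it never executes scripts."""

    def __init__(self):
        super().__init__()
        self.swf_urls = []

    def handle_starttag(self, tag, attrs):
        for _name, value in attrs:
            if value and value.endswith(".swf"):
                self.swf_urls.append(value)


def find_swfs(html):
    finder = SwfFinder()
    finder.feed(html)
    return finder.swf_urls


# The SWF is declared in markup, so a static parse can see it.
DECLARATIVE_PAGE = '<embed src="gallery.swf" type="application/x-shockwave-flash">'

# The SWF name is assembled at run time, so it never appears in any attribute.
SCRIPTED_PAGE = """
<div id="stage"></div>
<script>
  var name = "gal" + "lery";
  loadMovie("stage", name + "." + "swf");  // hypothetical loader function
</script>
"""

static_hits = find_swfs(DECLARATIVE_PAGE)
scripted_hits = find_swfs(SCRIPTED_PAGE)
print(static_hits, scripted_hits)
```

The static parse finds the file on the first page and nothing on the second. The only general way to learn what the second page loads is to run its script.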
So what we’ve done is to enable the search engines to actually run the app just as an end user would. They can not only run it and see the information that’s displayed, including data dynamically loaded from the network, but can interact with it as well, pressing buttons and following links to explore all of its content.
To enable this, we have created a special version of the Flash player that is designed to run on the server as part of the indexing process. As the code executes, there are special APIs that notify the search engine when something changes and that allow inspection of the textual and other data that would be displayed to the user.
There are other APIs that enumerate links and allow the indexer to instruct the player to simulate a “click” on various objects that are displayed. In this way, the indexer can navigate the running app.
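The overall shape of that interaction can be sketched with a toy model in Python. This is emphatically not the real player API, which isn't described here. `FakePlayer`, its methods (`load`, `visible_text`, `clickable_objects`, `click`), the `on_change` callback, and the four screens are all assumptions invented to illustrate the idea of change notification, link enumeration, and simulated clicks.

```python
from collections import deque


class FakePlayer:
    """Stand-in for a headless player. All method names here are hypothetical."""

    # Each screen: visible text, plus clickable objects mapped to the next screen.
    SCREENS = {
        "home": ("Welcome to the store", {"Catalog": "catalog", "Contact": "contact"}),
        "catalog": ("Widgets and gadgets", {"Widget": "widget", "Home": "home"}),
        "widget": ("Widget: $5, ships today", {"Back": "catalog"}),
        "contact": ("Email us", {"Home": "home"}),
    }

    def __init__(self, on_change):
        self.on_change = on_change  # called whenever the display changes
        self.screen = None

    def load(self, screen="home"):
        self.screen = screen
        self.on_change(self)

    def visible_text(self):
        return self.SCREENS[self.screen][0]

    def clickable_objects(self):
        return list(self.SCREENS[self.screen][1])

    def click(self, label):
        self.load(self.SCREENS[self.screen][1][label])


def index_app():
    """Explore the app breadth-first by replaying click paths from a fresh start."""
    index = {}    # visible text -> click path that first reached it
    current = {}  # latest text reported through the change notification

    def on_change(player):
        current["text"] = player.visible_text()

    queue = deque([[]])
    while queue:
        path = queue.popleft()
        player = FakePlayer(on_change)
        player.load()
        for label in path:
            player.click(label)
        text = current["text"]
        if text in index:
            continue  # already indexed this state; don't explore it again
        index[text] = path
        for label in player.clickable_objects():
            queue.append(path + [label])
    return index


index = index_app()
for text, path in index.items():
    print(path, "->", text)
```

Replaying each click path on a fresh player is a simplification that keeps the sketch short. It also treats two states as the same whenever their visible text matches, which a real indexer would likely need to refine.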
What’s especially cool about this is that it doesn’t require any changes to the application code to enable it to be searched. It just works.
Of course, adding things such as deep linking, which exposes URLs for distinct parts of a running app, will make searching content more effective, but it’s not required. This and other techniques will undoubtedly become important tools for optimizing how Flash-based information is exposed to search engines.
Posted by Mark Anders at 09:05 AM