Does Google Index Dynamic JavaScripted Content?
Posted by Greg Bulmash in Online Marketing And SEO, Techno Thoughts, Web Programming, tags: google, seo
I've been reading different articles about what elements of a page Google indexes with an eye toward whether they index content that's added to the page via the JavaScript document.write() method. Not getting a conclusive answer, I decided to do my own test.
Why was I interested? Well, with all the "Web 2.0" technologies that rely on JavaScript (in the form of AJAX) to populate a page with content, it's important to know how it's treated to determine if the content is searchable. If it's not searchable, then it's not having an impact on search-driven traffic.
The test page had three pairs of nonsense words that, at the time of its creation, generated no hits in a Google search. Two were placed in the page via straight HTML. Two were placed in the page via a JavaScript that was part of the document. Two were placed in the page via a JavaScript on a different server that was sourced from within the page (<script type="text/javascript" language="javascript" src="URL to script on other server">).
The page was linked from a sitewide footer to ensure that Google found it, and was posted and linked on the evening of March 7th. Google alerts were set up for one word from each pair so Google would notify me by e-mail when it spotted a page containing those words.
An alert came in in the late evening of March 10th for "zonkdogfology", one of the words in the first pair (part of the straight HTML). By the time I got online in the early afternoon of March 11th, it was part of the Google index and a search for it turned up the page as the sole result.
I then searched for each of the six words at Google.
- The two HTML words both generated a search result that included the page.
- The two words inserted by a JavaScript in the page generated no search results.
- The two words inserted by a remotely sourced JavaScript generated no search results.
Now, it's too early to say conclusively that Google will never index the JavaScript-generated content, barring a change in their search/indexing algorithms. I'll continue to monitor the situation over the next two weeks to give Google time for any secondary processing and distribution to all their datacenters. It is worth noting, though, that at least in the immediate term, content that is made part of your pages via JavaScript document.write statements will not be searchable in Google.
GOING FORWARD: Over the next two weeks, I'll be watching to see two things. First, does the indexing change so this page shows up in searches for the four JavaScripted words? And second, how long does it take for MSN and Yahoo to pick up the page and how do they treat it?
Stay tuned.
Addendum: People have been asking why you'd want to index dynamic JavaScripted content... Look at the dozens of comments on this article. They're all going to be indexed by Google because the inclusion is server-side. That's got some value. Comments in general don't just enhance the user experience, but add indexable content to your page and can organically increase your keyword density.
If you're using an AJAX powered comment module, particularly one that's remotely hosted like JS-Kit, then it's important to know what you're getting and what you're losing. Yes, you may be adding functionality to your page easily and enhancing the user experience, but if you don't get the comments indexed, you lose all that juicy keywordy goodness.
Granted, I didn't do a heavily AJAXed test with nodes and other constructs. I decided to test the simplest construct... document.write(). I may do other tests in the future. But this was a good place to start. See, in both instances of the JavaScript-inserted words, they were included in the scripts as discrete strings. If Google merely indexed the page and made the script text part of the searchable index, the two words from the script that's hardcoded into the page would become searchable. If it read the remote script and indexed it in the same manner, we might see those last two words turn up the test page, or get the remote script itself as a hit.
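To make the two possibilities in that last paragraph concrete, here is a minimal sketch of two ways an indexer could turn the test page into searchable text without ever running a script. It is an illustration only, not a description of how Googlebot works. The page markup is a hypothetical stand-in: only "zonkdogfology" comes from the test, while the other words, the remote URL, and all function names are made-up placeholders.

```javascript
// Hypothetical stand-in for the test page. Only "zonkdogfology" is a real
// word from the test; every other word and the remote URL are placeholders.
const testPage = [
  "<html><body>",
  "<p>zonkdogfology htmlwordtwo</p>",
  '<script type="text/javascript">',
  'document.write("inlinewordone inlinewordtwo");',
  "</script>",
  '<script type="text/javascript" src="http://other-server.example/remote.js"></script>',
  "</body></html>"
].join("\n");

// Strategy A: throw away <script> elements entirely, then strip the other tags.
function textIgnoringScripts(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Strategy B: strip tags only, so the source text of inline scripts stays in.
function textKeepingScriptSource(html) {
  return html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

// True if `word` appears as a whole token in `text`.
function contains(text, word) {
  return text.split(/[^A-Za-z0-9]+/).includes(word);
}

const viewA = textIgnoringScripts(testPage);
const viewB = textKeepingScriptSource(testPage);

console.log(contains(viewA, "zonkdogfology")); // true: plain HTML word
console.log(contains(viewA, "inlinewordone")); // false: script was dropped
console.log(contains(viewB, "inlinewordone")); // true: script source was kept
```

The search results reported above (HTML words found, inline-script words not found) are what Strategy A would produce. Neither strategy can see the remotely sourced words, because neither fetches the other server's script.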


thank you for that, that's an interesting piece of research.
I've been dabbling with search engine technology myself, and this was one of the major stumpers, in order to really know what a javascript does you have to *run* the bugger, there really is no other way.
DHTML/AJAX are quickly becoming so prevalent that it will not be long before google will have to do something about this or they risk losing a very important part of the web as 'dark'.
thanks again, & best regards
Jacques Mattheij
ps: picked you up through /.'s firehose...
I think the words will show up after they have been added to the crawling pile. Sometimes it will take another 2 weeks before those links are visited.
Ed said: "I think the words will show up after they have been added to the crawling pile. Sometimes it will take another 2 weeks before those links are visited."
I thought that might be the case, which is why I'm monitoring this for at least two more weeks. I also want to see how Yahoo and MSN Live treat this. Though they're not quite as big as Google when it comes to referring traffic, it's worth watching.
- Greg
"Going forward"...do you mean "in the future"? That aside, good bit of experimental work.
OK, it was good of you to share these findings. In my experience, straight HTML with uncomplicated URLs is always indexed best by any search engine.
This is the classic Halting problem outlined by Turing -- Google can't really index *everything* that's dynamically generated without running it -- and that could be dangerous. I'm sure some hacker would quickly code up an ECMAscript DoS attack on Google, waiting for their spiders to fall into the trap.
neat experiment. but, what sort of content needs to be inserted into your web pages using javascript, such that this content *should* be indexed?
i think it's fair to say that DHTML web applications shouldn't be indexed like documents. they have a potentially infinite number of states and the information on these states isn't really aligned with the semantics of searching the web, which is geared towards finding documents.
there's also an issue of fairness. if you're sourcing content from another site, then doesn't that content belong to *that* site, and thus should only contribute to that site's placement in the index?
finally, if both sites belong to you, then wouldn't you rather have the content incorporated on the server side for the sake of efficiency anyway?
interesting experiment, though most people don't use document.write these days, so I'm not sure what this shows. most dynamic sites tend to use innerHTML or dom methods to create nodes and content....either way, the crawler would basically have to be a full web client to support this kind of thing in any reasonable way...
I don't know if this is a big problem at all, or if it is just a quirk of the way Google's algorithms work. If I recall correctly, the philosophy behind Google's search is to depend less on Google's own interpretation of the content of the website, and more on what other sites say about your website.
zonkdogfology
I think you just undid your googlewhack.
Very interesting study. The results are no surprise, but a practical proof was needed.
Thanks
[...] Many new sites are adapting the new technologies that makes it easier and faster for the end user to browse their site. Gmail is the best example of useful application of the AJAX technology. You don’t need to refresh the whole page to search, browse, open nor send emails. So good that it became even better than using Outlook. However care should be taken when you apply those kind of technologies. They are good for the user who is already in your site, it is useful for specific kind of applications, but what about people looking for your site. How will they find it? Assume you have a rich news site, and it’s so fast and easy to use. All content is generated through AJAX. Users find it so easy to pick what they want. But despite you may have the best piece of news, people searching the Google will not find the results from your site because they are all dynamically generated using AJAX, and web spiders usually aren’t that smart to index AJAX generated content. Someone did practical test on Googlebot and how it indexes normal content, Javascript generated content, and AJAX content as well. See the results here. Thanks for Slashdot for posting the URL. [...]
@sam:
so long as neither of the other 2 pairs are copied to some html-coded site, the test environment will not be compromised.
unfortunately, searching for 'zonkdogfology' leads us to the site in question, and therefore the other 2 pairs.
best to set up a clean environment to test and report in a few weeks. perhaps even a rolling system w/ new words each day, then we'll know when things change.
No, Google (googlebot) does not index javascript in any way, nor css stylesheets (display:none?). Nor do other popular search engines. End of story.
If Google were to decide to create a javascript parser bot (damn complex when you think of all the libraries such as jquery/prototype) they would be supporting the so-called web developers who are doing it 'wrong' in the first place. Javascript must be implemented unobtrusively, which means the site is fully accessible to people without javascript-enabled browsers (Google, Braille readers for the blind, and about 4% of netizens). In 99% of cases this is implemented simply by hooking your javascript to the window load event and using it for after-effects (ajax requests must always be backed up with a functional form post URL as well).
None of the major Web search indexers run JavaScript embedded in your page. They probably don't even download the JavaScript. So, as you discovered, text generated by JavaScript won't be indexed by Google, Altavista, MSN, etc. It also usually won't be read by people using a text reader, e.g. blind users, which can be a difficulty for Web developers required to make pages that are accessible, e.g. under the US Section 508 legislation. Also be aware that some corporate (or educational) firewalls block JavaScript files, and/or disallow execution of JavaScript for the users in other ways.
Liam
An easy way to understand what a spider sees is to turn off javascript. If your site is heavily dependent on javascript to generate content, you'll see big holes in your pages where content used to be. What's left is what gets added to the Google index.
Google's inability to crawl javascript extends beyond doc.written content. For instance, using javascript to link to another page via window.open presents its own problems. Google will not follow this link and the linked page will not be added to the index. You run the risk of hiding entire sections of your site because of javascript linking.
There are less than ideal workarounds to this—like publishing a site map containing every destination. Sometimes you don't have control of the resulting HTML and a site map is a cheap fix. It won't help your PageRanks, but registering the site map with Google will give it a cue to discover pages it otherwise wouldn't find on its own.
I'm not against javascript linking per se but I would recommend augmenting your onclick handler with a conventional "blue link" to the page in question.
Change:
a href="#" onclick="window.open('page.html');return(false);"
OR
a href="javascript:window.open('page.html');return(false);"
To:
a href="page.html" onclick="window.open('page.html');return(false);"
Google will find page.html and you still get to keep your onclick javascript handler.
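A minimal sketch of the same idea as a reusable helper, assuming a link-following crawler reads only the href attribute and never runs onclick code. The names popupLink, escapeHtml, and crawlableHrefs are hypothetical, and the regex-based href extraction is a rough illustration, not a real HTML parser.

```javascript
// Escape text for safe use inside HTML attributes and element content.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build an anchor with a real href. The onclick handler reads this.href at
// click time, so the destination is written only once in the markup.
function popupLink(url, label) {
  return '<a href="' + escapeHtml(url) + '"' +
         ' onclick="window.open(this.href);return false;">' +
         escapeHtml(label) + "</a>";
}

// Rough model of what a non-executing crawler can follow: href values that
// are not "#" placeholders and not javascript: pseudo-URLs.
function crawlableHrefs(html) {
  const found = [];
  const re = /<a\b[^>]*\bhref="([^"]*)"/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    const target = m[1];
    if (target !== "#" && !/^javascript:/i.test(target)) {
      found.push(target);
    }
  }
  return found;
}

const before = '<a href="#" onclick="window.open(\'page.html\');return false;">Open page</a>';
const after = popupLink("page.html", "Open page");

console.log(crawlableHrefs(before)); // []
console.log(crawlableHrefs(after));  // [ 'page.html' ]
```

With JavaScript enabled the visitor still gets the new window; with it disabled, or for a crawler, the link is an ordinary one.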
How exactly did you set up the Google alert to get informed when the Googlebot visited?
I used a jha menu for some time on my website. At first I ran into indexing problems, and my page slowly dropped in the Google rankings.
I then tried a Flash menu controlling the main page.
Here are the sample menus available: http://www.nicolamarini.it/pagine/news.htm
One solution for the ranking was to add links to the "hidden" content on the page and then hide those links with CSS, but the problem persists when you have a dynamic site with JavaScript-generated links.
So (for now) I prefer not to use JavaScripted content to create critical links to other pages within the same site.
Not surprising but like Bashar said, the proof is useful.
Actually I am not sure if it makes sense for google to index the content generated using document.write... Let's consider this:
document.write(navigator.userAgent);
What should google index in this case? Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1) for IE7 or maybe Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1 for Firefox 2?
One more reason why AJAX is a hack...
Do server side XSLT and some decent DHTML where it makes sense, and life will be a lot better.
This simply proves that google has indexed the strings in the HTML (JS) output that are on the page.
Now, if you echoed the letters separately onto the screen
document.write('d');
document.write('o');
document.write('g');
document.write('f');
document.write('u');
document.write('d');
document.write('g');
document.write('e');
and google indexed dogfudge, THEN I would believe that google is executing the javascript, instead of indexing it unparsed
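A minimal sketch of why that letter-by-letter variant would separate the two cases. The assembled word never appears in the script source, so only something that actually executes the script can produce it. The fake document object and the function names are hypothetical stand-ins for illustration; new Function is used here only because the input is a trusted toy script.

```javascript
// Minimal stand-in for a browser document: it just collects what is written.
function makeFakeDocument() {
  const chunks = [];
  return {
    write(text) { chunks.push(String(text)); },
    output() { return chunks.join(""); }
  };
}

// One document.write call per letter, as in the comment above.
const letterByLetterSource = "dogfudge"
  .split("")
  .map(function (c) { return "document.write('" + c + "');"; })
  .join("\n");

// Run the script against the fake document and return what it wrote.
function executeScript(source) {
  const doc = makeFakeDocument();
  new Function("document", source)(doc);
  return doc.output();
}

console.log(letterByLetterSource.includes("dogfudge")); // false: not in the raw source
console.log(executeScript(letterByLetterSource));       // dogfudge
```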
The basic rule still applies, then: look at the page in a text browser (lynx) and see what you get; that is what Google sees. Interestingly, it is also a fair approximation of what a text reader will say. Keep all your core content standards-compliant and you can't go far wrong with Google.
If you've got the kind of dynamic content that you want to be searchable then it's probably also the kind of content that you want to be bookmarkable.
Surely then the goal would be to make sure you implement a unique URL for each document state. This way you get the triple whammy:
1. Search engine visibility
2. Back button support
3. Bookmarkability and permalinks
Really Simple History seemed to do the trick: http://codinginparadise.org/projects/dhtml_history/README.html
Safari needs to be fixed to work with it but as far as I know Safari doesn't support any dynamic back button solutions.
Until someone runs the javascript, new content is not an official part of the page, so it won't be indexed... I don't think google will start running javascript since it is a security problem (and usually requires user input).
Btw if you want to add something dynamically to your page (with javascript) and want it indexed, try generating html (with php or something like it) which will have a javascript data structure in it and just reference it from javascript code... This way data will be part of html during indexing (but you lose real dynamics)
w3c is phasing out document.write and document.writeln, you should be using xml nodes.
I think using innerHTML may be frowned upon.
It is also worth noting that the Google Cache of the page does not show either of the javascript interjections in its "text only" mode.
Looking at how other cached pages appear in text-only mode leads me to believe that the text-only cache is the page in the form Google digests it. If you look at the page source, you see that full comments from the original page are preserved, but many tags have been stripped out. Some of the preserved comments contain javascript code, as if they were placed there to protect non-javascript aware browsers from the code, but the script tags which would surround it have been stripped.
Cool.
That's a cool idea, thanks.
Here's a thought though: if in two weeks you do start seeing the javascript words on that page indexed, you won't know if it's from spidering that specific page and seeing the word there or from someone else's link to your now-public test page if they use the specific nonsense word in their link to you.
Hm,
that was to be expected, don't you think so? To find the words that are emitted by JavaScript the Google-Bot needs to behave like a browser, that means the bot needs to download the HTML *and* the JavaScript *and* to execute the JavaScript. And now the mess starts: is your script a clutter of "if (browser.name() ... browser.version())"? If so which browser do you expect the Google-Bot to emulate?
Regards,
Angelo
Good work, Greg. It's one of those things that I've sometimes wondered about, but never bothered to check
Russtopia said:
"This is the classic Halting problem outlined by Turing — Google can’t really index *everything* that’s dynamically generated without running it — and that could be dangerous. I’m sure some hacker would quickly code up an ECMAscript DoS attack on Google, waiting for their spiders to fall into the trap. "
Shoot me if I'm being foolish, but I don't think this should be a problem. I think Google should write a simple javascript parser which searches for javascript strings, and adds them to its index.
I don't think it really matters when/how/why the strings gets document.write()-ed. Of course, small words and phrases which aren't part of the real content (eg: if (something_bad) { alert("Don't do that"); } ) would get indexed as well, but that probably won't affect results too much. After all, a lot of static pages have misc. text which has no bearing on the pages content.
If Google was really feeling energetic, they could make a simple javascript parser which creates a flow-tree (perhaps limiting the max depth of the tree to prevent infinite loops and abuse). They could use the tree to find out which strings actually get printed to the page, and which are used internally by javascript (eg: document.getElementById("this string is never printed") ).
Anyway, that's just my $0.02.
Cheers, Colin
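A minimal sketch of the first idea in the comment above: scan the script source for quoted strings and hand them to the index, without running anything. The function name is hypothetical, and the scanner is deliberately simplified: it skips comments, but it does not understand regular-expression literals or template strings, and it keeps an escaped character as-is rather than decoding it.

```javascript
// Collect the contents of single- and double-quoted string literals.
function extractStringLiterals(source) {
  const literals = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    // Skip "//" line comments so quotes inside them are ignored.
    if (ch === "/" && source[i + 1] === "/") {
      while (i < source.length && source[i] !== "\n") i++;
      continue;
    }
    // Skip "/* ... */" block comments for the same reason.
    if (ch === "/" && source[i + 1] === "*") {
      const end = source.indexOf("*/", i + 2);
      i = end === -1 ? source.length : end + 2;
      continue;
    }
    if (ch === '"' || ch === "'") {
      let value = "";
      i++; // step past the opening quote
      while (i < source.length && source[i] !== ch) {
        if (source[i] === "\\" && i + 1 < source.length) {
          value += source[i + 1]; // simplification: keep the escaped char verbatim
          i += 2;
        } else {
          value += source[i];
          i++;
        }
      }
      i++; // step past the closing quote
      literals.push(value);
      continue;
    }
    i++;
  }
  return literals;
}

const sample =
  'if (something_bad) { alert("Don\'t do that"); }\n' +
  "document.write('zonk' + \"dog\"); // \"ignored\"";

console.log(extractStringLiterals(sample));
// [ "Don't do that", 'zonk', 'dog' ]
```

As the comment predicts, the alert text is picked up alongside the written content, and the scan makes a single pass over the source. Strings built by concatenation come out as separate fragments.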
Russtopia; one compromise that won't expose Google to that sort of risk is to do a very simple parse/pseudo-execution of the Javascript. This would take into account all content generated by document.writes with static content, and by those with trivial initialised-string content (e.g. var a = "message", and later on document.write("Hello, I wanted to say " + a)), etc.
How far this can go is up to Google; obviously it won't cover everything.
Very interesting observation indeed! In the era of web 2.0 ignoring javascript will be a crime!
Hey,
I think you'll find the script source will get indexed, but I don't think it will get indexed with the page that's calling it. I have been playing about with a counter site. If you do a search for little counter in Google, you'll see that http://www.littlecounter.com/client/counter.php?clientID=9264 is indexed, and its only reference is a javascript source url.
Great work btw I hope to see the rest of the results soon.
Great work. This was a great experiment to implement.
I agree with you that over time, or in the near future, Google will start inspecting the content of all tags.
Keep up the good work.
Saeed
I tried hooking up adsense (with a new account) to an AJAX-laden site at one point. I did a trick to "refresh" the Google script when the content was updated each time. At first, the ads were appropriate to whatever dynamic content was present. After a couple of days that stopped and I got generic ads. The explanation from Google was that while the crawling index was being built, it was working off of whatever it happened to find when the script executed, but gradually switched over to their crawled index. Funny way of doing things. Clearly Google CAN handle dynamic content since it just scans the DOM, but they choose not to. In fact, it was politely pointed out that what I was attempting was a violation of their ToS.
Nice and useful. Can you do one for text that's hidden or in a div with display:none? I don't show all the content on a page at one time because it takes too much room. I'm guessing it doesn't get indexed, but I'd like to know for sure.
Very interesting study! One idea I embrace when coding dynamic/AJAX pages is that of extending functionality that's already there. This means that instead of relying on javascript to generate the 'base page', I have that pre-built in html. Then I use javascript to remove elements and replace them with dynamic sections. This ensures not only that people who have javascript disabled can use the site, but also that Google and other search engines will be able to index its content.
One other item to mention: you can still use meta tags to keyword your page and provide a limited sort of access to the data that would otherwise be dynamically generated in it.
As of 6:46 AM on March 12, [zonkdogfology] returns zero results at search.yahoo.com, search.live.com, and ask.com.
AOL search finds the word, but their search is powered by Google.
Looks like Google's indexing is a good bit faster than the other guys.
Now that you have some publicity, you have to be careful that no one links to your test site using the JavaScript-created words -- otherwise the JavaScript-created words might turn up your test site, even if Google ignores all JavaScript.
Instead of spending time on an experiment such as this, why not just ask Google?
This would require at minimum a JavaScript lexer so it can find all of the strings. This would not be too difficult and would be O(n). It would not be able to distinguish real words from something like "SWITCH" "FIRST" "SECOND" if you're using them as defined constants (which I like to do in JavaScript, because nobody cares if JS is a little inefficient). Though google might ignore all strings of one word and all caps.
The other problem involves javascript that constructs strings. Say "one " + "day, " + "I " + "woke " + "up.". You cannot expect google to try to do stuff like this. Maybe a few special cases (the example I gave was simple), but to try to execute code in all ways imaginable so you can find out if there are other strings that can be derived is an NP-hard problem. Google should not try to do this. They should say that if you want your javascript to be used for their search engine, then make your strings easy to spot.
@Russtopia: Turing didn't just 'outline' the problem, he proved it insolvable.
And, more importantly, running JavaScript for output isn't necessarily the halting problem, viz. we don't really care whether the JavaScript halts, but rather want to see some of the content on the page. While it is impossible to wait and see 'all' output, as tormp points out that the number of states of a DHTML may be practically infinite, or at least as large as your database, there's nothing to stop Google from running, say, 500 or 1000 evaluation steps on the JavaScript to see if there is any immediate output.
Since Google could easily control the number of evaluation steps taken in the Googlebot's JavaScript interpreter, I hardly think they're at risk for a DoS.
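A minimal sketch of that step-budget idea, under a strong simplifying assumption: a "script" is modeled as a generator that yields once per evaluation step, yielding a string when the step writes output and null otherwise. A real JavaScript interpreter would count steps internally. The names here are hypothetical, and nothing in it describes how Googlebot is built.

```javascript
// Run a modeled script for at most maxSteps steps and keep whatever it wrote.
// `finished` is true only if the script was seen to end within the budget.
function runWithStepBudget(script, maxSteps) {
  const output = [];
  const steps = script();
  let used = 0;
  while (used < maxSteps) {
    const result = steps.next();
    if (result.done) {
      return { text: output.join(""), finished: true, steps: used };
    }
    used++;
    if (typeof result.value === "string") output.push(result.value);
  }
  return { text: output.join(""), finished: false, steps: used };
}

// Writes one word immediately, then spins forever.
function* hostileScript() {
  yield "zonkdogfology ";
  while (true) yield null;
}

// Writes two fragments and ends.
function* politeScript() {
  yield "hello ";
  yield "world";
}

console.log(runWithStepBudget(hostileScript, 1000));
// { text: 'zonkdogfology ', finished: false, steps: 1000 }
console.log(runWithStepBudget(politeScript, 1000));
// { text: 'hello world', finished: true, steps: 2 }
```

The infinite loop costs the runner exactly its budget and no more, and the immediate output is still captured.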
The logic/premise of this article is interesting but flawed to a tiny degree. Any properly coded Ajax/document.write page should be coded in a degrading manner so that if the client does not support javascript, the page will still be completely accessible.
If this is the case, when the http page is served to google it will serve the degraded ajax-less version of it, detecting that "googlebot" isn't a proper client.
And as "8. Greg" said, to capture all the javascript would require a full client on googles end and it could no longer simply crawl the web.
What's more, is that when you get into JavaScript code, there could be a virtually endless potential of code to be dynamically written. There would be no way to capture and accurately display all of the code, especially if by executing 10 different js events, the page will be displaying 10 different sections of content. And what if on top of those 10 sections, each section has 2 modifiers? Now we're up to 30 inline client-based renders for one page... It's just too much.
If it's a developer's concern, they just need to write better code.
I knew it!
I wanted to do a similar test a long time ago.
Thanks! A useful test that I have been meaning to get around to doing for some time, but hadn't!
It comes as no great surprise that Google doesn't do much to 'test' Javascript for output (because it would be extremely dangerous to do this arbitrarily, imho), but it's still a worthwhile experiment.
More to the point, it highlights the issue for us developers, and gives us clear guidance on what to do if we absolutely have to remain 'visible'. KISS principle, really, innit?
Well done, and thanks again.
Most badges and tag clouds use the document.write method of inserting data into the page. I've had a site up since the beginning of the year and the document.write content has yet to be indexed.
hard to believe, but it's still the only page that devotes itself to the study of zonks, dogs and fos. googlebot seems to be pretty resilient to slashdotting.
You may be interested in a paper that Tim Berners-Lee and I wrote on choosing the right language to use for your Web content. It makes the point that putting your content into imperative languages like JavaScript is going to greatly reduce the chance that the information can be repurposed, whether as input to a search engine crawler, or for other good uses, when compared to putting the same content into a declarative language such as HTML.
See the W3C Technical Architecture Group (TAG) Finding titled: The Rule of Least Power at http://www.w3.org/2001/tag/doc/leastPower.html .
Noah
Google should be in the business of finding what is there not imposing its will upon what it is trying to "search". I like Google as a tool, not as a master.