AJAX/AJAJ dynamic web page crawling

Post by Orbiter » Thu Aug 04, 2016 9:34 pm

In October 2015, Google officially announced that it can crawl dynamic AJAX/JSON-driven web pages without any extra work from web page administrators, because its crawler renders dynamic pages the same way web browsers do:

https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html wrote: Today, as long as you're not blocking Googlebot from crawling your JavaScript or CSS files, we are generally able to render and understand your web pages like modern browsers.

I thought it would be really difficult for us to do the same. But just recently I looked at headless browser frameworks and found HtmlUnit. In a very simple test I was able to run this tool and get the full, script-enriched DOM content from AJAX-driven web pages.
That means we have the opportunity to get better crawled content. I am currently investigating how to build a new crawler on top of it.

As there is a plan to create a 'YaCy2' built from individual components (see kaskelix.de), the 'new' crawler using HtmlUnit could become one of the first such components.
Re: AJAX/AJAJ dynamic web page crawling

Post by luc » Mon Oct 17, 2016 6:58 pm

Maybe we could even go a little further with Selenium WebDriver and thus be able to easily choose the rendering engine used to generate the DOM: HtmlUnit or any supported full browser...
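
To make the "choose the rendering engine" idea a bit more concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: `PageRenderer`, `RendererRegistry` and `RendererDemo` are hypothetical names, not existing YaCy classes, and the two registered engines are stubs that only return a placeholder string. In a real component each stub would presumably be replaced by code that drives HtmlUnit or a WebDriver-controlled browser and returns the rendered page source.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical abstraction: turns a URL into the serialized DOM the indexer should see. */
interface PageRenderer {
    String render(String url);
}

/** Hypothetical registry that lets the crawler pick a rendering engine by name. */
final class RendererRegistry {
    private final Map<String, PageRenderer> engines = new LinkedHashMap<>();

    /** Registers an engine under a name and returns this registry for chaining. */
    RendererRegistry register(String name, PageRenderer renderer) {
        engines.put(name, renderer);
        return this;
    }

    /** Names of all registered engines, in registration order. */
    Set<String> names() {
        return engines.keySet();
    }

    /** Renders the URL with the named engine, or fails if the name is unknown. */
    String render(String engine, String url) {
        PageRenderer renderer = engines.get(engine);
        if (renderer == null) {
            throw new IllegalArgumentException("unknown rendering engine: " + engine);
        }
        return renderer.render(url);
    }
}

class RendererDemo {
    /** Builds a registry with stub engines; real ones would delegate to an actual browser engine. */
    static RendererRegistry stubRegistry() {
        return new RendererRegistry()
            .register("htmlunit", url -> "<!-- stub: DOM from HtmlUnit for " + url + " -->")
            .register("firefox", url -> "<!-- stub: DOM from a full browser for " + url + " -->");
    }

    public static void main(String[] args) {
        RendererRegistry registry = stubRegistry();
        // The engine name could come from the crawl profile; here it is a command-line argument.
        String engine = args.length > 0 ? args[0] : "htmlunit";
        System.out.println(registry.render(engine, "http://example.org/"));
    }
}
```

The point of the indirection is that the rest of the crawler only sees `PageRenderer`, so switching from a headless engine to a full browser would be a configuration choice rather than a code change.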
