AJAX/AJAJ dynamic web page crawling

Posted: Thu Aug 04, 2016 9:34 pm
by Orbiter
In October 2015 Google officially announced that they are able to crawl dynamic AJAX/JSON-driven web pages without any extra work from web page administrators, because their crawler reads dynamic web pages the same way web browsers do:

... cheme.html wrote:
Today, as long as you're not blocking Googlebot from crawling your JavaScript or CSS files, we are generally able to render and understand your web pages like modern browsers.

I thought it would be really difficult for us to do the same. But just recently I looked at headless browser frameworks and found HtmlUnit. In a very simple test I was able to run this tool and get the full DOM-enriched content from AJAX-driven web pages.
That means we have the opportunity to get better crawled content. I am currently investigating ways to create a new crawler based on that.

As there is a plan to create a 'YaCy2' made out of single components (see ...), the 'new' crawler using HtmlUnit could become one of the first such components.

Re: AJAX/AJAJ dynamic web page crawling

Posted: Mon Oct 17, 2016 6:58 pm
by luc
Maybe we could even go a little further with Selenium WebDriver and thus be able to easily choose the rendering engine used to generate the DOM: HtmlUnit or any supported full browser...
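
To make the idea of a selectable rendering engine a bit more concrete, here is a rough sketch of how the crawler side might look. Everything in it is an assumption for illustration: `DomRenderer`, `RendererRegistry` and the engine names are hypothetical and not existing YaCy code, and the two registered engines are stand-ins that return canned markup instead of running a browser. A real HtmlUnit-backed engine would presumably load the page with `WebClient.getPage(url)` and serialize it with `HtmlPage.asXml()`, while a Selenium-backed one would call `WebDriver.get(url)` followed by `WebDriver.getPageSource()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PluggableRenderers {

    /** Hypothetical abstraction: turns a URL into the serialized DOM of that page. */
    interface DomRenderer {
        String render(String url);
    }

    /** Hypothetical registry that lets the crawler pick a rendering engine by name. */
    static final class RendererRegistry {
        private final Map<String, DomRenderer> engines = new LinkedHashMap<>();

        RendererRegistry register(String name, DomRenderer engine) {
            engines.put(name, engine);
            return this; // allow chained registration
        }

        DomRenderer select(String name) {
            DomRenderer engine = engines.get(name);
            if (engine == null) {
                throw new IllegalArgumentException("unknown rendering engine: " + name);
            }
            return engine;
        }
    }

    public static void main(String[] args) {
        // Stand-in engines only: they do not fetch or execute anything.
        // Real implementations would wrap HtmlUnit or a Selenium WebDriver here.
        RendererRegistry registry = new RendererRegistry()
            .register("plain", url -> "<html><body>raw markup of " + url + "</body></html>")
            .register("headless", url -> "<html><body><div id=\"app\">DOM after scripts ran for "
                + url + "</div></body></html>");

        // The engine name could come from the crawl profile or a config switch.
        String engineName = args.length > 0 ? args[0] : "headless";
        DomRenderer renderer = registry.select(engineName);
        System.out.println(renderer.render("http://example.org/"));
    }
}
```

The point of the interface is only that the rest of the crawler (parser, indexer) never needs to know which engine produced the DOM, so swapping HtmlUnit for a full browser would be a configuration change rather than a code change.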