Hegwin.Me

Thoughts grow by feeding on their own words.

Catch HTML elements with delayed loading

Scraping page content that is lazily loaded by JS

I recently wanted to build a small WoW-related website that needs data from someone else's public database. For a mainland China Battle.net user, applying to Blizzard for API access is a real hassle, so I decided to crawl the data directly from an existing database site's pages.

It's just an HTML crawler, right? So I reached for open-uri and nokogiri as usual, and ran straight into trouble!

require 'open-uri'
require 'nokogiri'

url = "http://db.178.com/wow/cn/battlepets.html#battlepets:50"
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer opens URLs as of Ruby 3.0

doc.css("div.list-table")
# => []  It's empty, even though the browser clearly shows data

Later, after a closer look at the page, I found that the table's content is stored in JS and is only rendered once the JS has finished running. open-uri does not execute JS, so it cannot grab that content.
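To make the problem concrete, here is a minimal sketch in plain Ruby. The HTML string is a made-up stand-in, not the real 178.com markup, and `static_container_content` is a hypothetical helper. It shows what a plain HTTP fetch sees: an empty container, with the actual rows sitting inside a script that never runs.

```ruby
# Hypothetical static HTML, similar in spirit to what an HTTP client receives:
# the container is empty and the rows live inside a <script> block.
STATIC_HTML = <<~HTML
  <html>
    <body>
      <div class="list-table"></div>
      <script>
        var rows = ["Pet A", "Pet B"];
        document.querySelector(".list-table").innerHTML = rows.join("<br>");
      </script>
    </body>
  </html>
HTML

# Returns the inner content of the first <div class="list-table">, or nil.
# A regex is enough for this fixed sample; use a real parser on real pages.
def static_container_content(html)
  match = html.match(%r{<div class="list-table">(.*?)</div>}m)
  match && match[1].strip
end

content = static_container_content(STATIC_HTML)
puts "container content: #{content.inspect}"
puts "data present only inside the script: #{STATIC_HTML.include?('Pet A')}"
```

Without a JS engine, the assignment to `innerHTML` never happens, so any parser working on this raw text finds nothing inside the container.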

Coincidentally, someone in a chat group asked why watir 4.x was telling him that there was no Watir::IE.

That gave me an idea. I can use Watir::Browser to open the page in a real browser and then scrape the rendered page, which will include the delayed-loading content. All that is left is to parse the HTML.

require 'watir'
require 'nokogiri'

ff = Watir::Browser.new
ff.goto url

ff.div(:id, "footer").wd.location_once_scrolled_into_view
# location_once_scrolled_into_view is an instance method of Selenium::WebDriver::Element; as the name suggests, it scrolls the page to this element's position
# Here we use it to scroll to the bottom, so that even content that only loads after you scroll to it gets rendered

doc = Nokogiri::HTML(ff.html)
# Then, just parse it
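One caveat: delayed content may not be there yet at the moment you read the HTML. Watir has its own waiting support, which is probably the better choice in real code. Just to show the idea, here is a minimal polling sketch in plain Ruby. `wait_for` is a hypothetical helper of mine, not part of Watir, and the "page" below is simulated with a timer.

```ruby
# Polls the block until it returns a truthy value or the timeout expires.
# Returns the block's value; raises RuntimeError on timeout.
def wait_for(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout}s"
    end
    sleep interval
  end
end

# Stand-in for a page whose content shows up after a short delay (made up).
ready_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 0.3
html = wait_for(timeout: 2, interval: 0.05) do
  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  '<div class="list-table">rows</div>' if now >= ready_at
end
puts html
```

With a real browser, the block would instead check that the target element exists and then return `ff.html`, so parsing only starts once the content is actually on the page.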