Parsing content that is not inside HTML tags with Nokogiri

Question:

<form method="post" action="/M740/Biography/History/Drama/12+Years+a+Slave"> 
    <input type="image" src="/public_site/webroot/cache/imdb/2024544_100.jpg" width="100" style="float:right;margin-left:2px;"> 
    <strong><span style="color: rgb(255, 69, 0);">12 Years a Slave</span></strong> 
    <br> 
    In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.<br> 

    <br><strong>Century Cinemax - Junction</strong><br> 

    <a href="tel:0774136246">0774136246</a> 

     <a href="tel:0208022073">0208022073</a> 

    <br> 
    12:10, 19:10, 21:40<br> 

     <br><strong>Fox Cineplex Sarit</strong><br> 

    <a href="tel:0203753025">0203753025</a> 

    <a href="tel:0720366208">0720366208</a> 

    <br> 
     11:00, 14:00, 18:00, 20:40<br> 

    <br><strong>Planet Media - Kisumu </strong><br> 

    <a href="tel:0731999100">0731999100</a> 

     <a href="tel:0724999100 &amp; 0202629388">0724999100 &amp; 0202629388</a> 

    <br> 
    12:00, 14:30, 20:30<br> 

    <br> 
    <input type="hidden" name="cinema" value="0"> 
    <input type="hidden" name="searchMovie" value="0"> 
     <input type="hidden" name="movie" value="740"> 
    <input type="hidden" name="date" value="0"> 
    <input type="hidden" name="groupId" value="0"> 
    <input type="submit" name="ok" value="Further Details"> 
</form> 

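Okay, so this is just part of the HTML I am trying to parse using Nokogiri. The HTML is not very semantic, and I am having a rough time getting what I want out of it with Nokogiri. For reference, this is the website I am trying to scrape: http://flix.co.ke/Frontpage/Listings

So far I am able to get the title of the movie, one cinema and two phone numbers, but with my approach I can't really get all the content I need.

This is what I am using: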

require 'nokogiri'
require 'open-uri'

url = "http://flix.co.ke/Frontpage/Listings"
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open(url))

doc.css(".min-width div form").each do |entry| 
    title = entry.at_css("span").text 
    puts title 

    cinema = entry.at_css("br+ strong").text 
    puts cinema 

    phone = entry.at_css("a").text 
    puts phone 

    puts entry.at_css("a").next_element.text 
end 

With my current script I can only get the movie's title, one cinema and two contact numbers, so my sample output looks like this:

12 Years a Slave 
Century Cinemax - Junction 
0774136246 
0208022073 

47 Ronin 3D 
Century Cinemax - Junction 
0774136246 
0208022073 

Delivery Man 
Century Cinemax - Junction 
0774136246 
0208022073 

Frozen 
Century Cinemax - Junction 
0774136246 
0208022073 

(continued...) 

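There is also the description, which comes right after the title, following a break tag, and I am unable to get that. And how do I loop through all the cinemas within the <form> tag, as well as the phone numbers and the individual showtimes separated by commas?

I just don't know where to start. I would like to achieve a result like this for this case:

  • 12 Years a Slave

  • In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.

  • Century Cinemax - Junction 12:10, 19:10, 21:40
  • Fox Cineplex Sarit 11:00, 14:00, 18:00, 20:40

etc

Any help would be appreciated. Thanks in advance.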


Include a valid HTML snippet, not an extract. To help you, we have to jump through hoops.

The HTML really isn't that bad, and you are on the right track with br + strong. That is what you want to iterate over to loop through the cinemas:

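doc.search('.min-width div form').each do |form|
  title = form.at('span').text
  # The description is the bare text node right after the first <br>.
  description = form.at('br').next.text.strip
  puts title, description

  # Each cinema name is a <strong> that directly follows a <br>.
  form.search('br + strong').each do |el|
    cinema = el.text.strip
    phones = []
    # Walk forward over the phone links that follow the cinema name.
    while next_el = el.at('+ a', '+ br + a')
      el = next_el
      phones << el.text
    end
    # The showtimes are the text node after the <br> following the last link.
    times = el.at('+ br').next.text.strip
    puts "#{cinema} (#{phones.join(', ')}): #{times}"
  end
end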

I can't stress enough how helpful this was. Thanks a bunch! ;-)

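This is horrible HTML :/ It is invalid, with 451 errors and 9 warnings. There are no semantics, so you have to rely on structure that may change and break your scraper.

However, you can get each piece by using the sibling methods: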

doc.css('.min-width div form').each do |node| 
    description = node.at_css('br').next_sibling.text 
    puts description.strip 
    puts '-'*10 
end 

# >> In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery. 
# >> ---------- 
# >> A band of samurai set out to avenge the death and dishonor of their master at the hands of a ruthless shogun. 
# >> ---------- 
# >> An affable underachiever finds out he's fathered 533 children through anonymous donations to a fertility clinic 20 years ago. Now he must decide whether or not to come forward when 142 of them file a lawsuit to reveal his identity. 
# >> ---------- 
# >> Fearless optimist Anna teams up with Kristoff in an epic journey, encountering Everest-like conditions, and a hilarious snowman named Olaf in a race to find Anna's sister Elsa, whose icy powers have trapped the kingdom in eternal winter. 
# >> ---------- 
# >> A medical engineer and an astronaut work together to survive after an accident leaves them adrift in space. 
# >> ---------- 
# >> A pair of aging boxing rivals are coaxed out of retirement to fight one final bout -- 30 years after their last match. 
# >> ---------- 
# >> 
# >> ---------- 
# >> Harrison, overworked and underpaid is looking for money for bride price. A 'business' opportunity presents itself when he gets the keys to the Company house. With the CEO away on holiday, he has access to a vacant fully furnished house. He ... 
# >> ---------- 
# >> 
# >> ---------- 
# >> A chronicle of Nelson Mandela's life journey from his childhood in a rural village through to his inauguration as the first democratically elected president of South Africa. 
# >> ---------- 
# >> Author P. L. Travers reflects on her difficult childhood while meeting with filmmaker Walt Disney during production for the adaptation of her novel, Mary Poppins. 
# >> ---------- 
# >> The Manzoni family, a notorious mafia clan, is relocated to Normandy, France under the witness protection program, where fitting in soon becomes challenging as their old habits die hard. 
# >> ---------- 
# >> The dwarves, along with Bilbo Baggins and Gandalf the Grey, continue their quest to reclaim Erebor, their homeland, from Smaug. Bilbo Baggins is in possession of a mysterious and magical ring. 
# >> ---------- 
# >> The film begins as Katniss Everdeen has returned home safe after winning the 74th Annual Hunger Games along with fellow tribute Peeta Mellark. Winning means that they must turn around and leave their family and close friends, embarking on a ... 
# >> ---------- 
# >> A day-dreamer escapes his anonymous life by disappearing into a world of fantasies filled with heroism, romance and action. When his job along with that of his co-worker are threatened, he takes action in the real world embarking on a global ... 
# >> ---------- 
# >> Faced with an enemy that even Odin and Asgard cannot withstand, Thor must embark on his most perilous and personal journey yet, one that will reunite him with Jane Foster and force him to sacrifice everything to save us all. 
# >> ---------- 
# >> A journey into the lives of a mother polar bear and her two seven-month-old cubs as they navigate the changing Arctic wilderness they call home. 
# >> ---------- 
# >> See and feel what it was like when dinosaurs ruled the Earth, in a story where an underdog dino triumphs to become a hero for the ages. 
# >> ---------- 

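You can loop through the repeated elements by using css instead of at_css (the same way you loop through the form elements, for example).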


Much better! – Bala