Problem with Ruby parsing

Question:

I have a slight problem parsing a website with Ruby and Nokogiri.

Here is what the site looks like:

<div id="post_message_111112" class="postcontent"> 

     Hee is text 1 
    here is another 
     </div> 
<div id="post_message_111111" class="postcontent"> 

      Here is text 2 
    </div>

Here is my code to parse it:

doc = Nokogiri::HTML(open(myNewLink)) 
myPost = doc.xpath("//div[@class='postcontent']/text()").to_a() 

ii=0 

while ii!=myPost.length 
    puts "#{ii} #{myPost[ii].to_s().strip}" 
    ii+=1 
end

My problem is that when it gets to "Hee is text 1", the result after to_a comes out odd because the new line splits it, like this:

myPost[0] = hee is text 1 
myPost[1] = here is another 
myPost[2] = here is text 2

I want each div to be its own message, like this:

myPost[0] = hee is text 1 here is another 
myPost[1] = here is text 2

How would I solve this? Thanks.

Edit

I tried:

myPost = doc.xpath("//div[@class='postcontent']/text()").to_a() 

myPost.each_with_index do |post, index| 
    puts "#{index} #{post.to_s().gsub(/\n/, ' ').strip}" 
end

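I used post.to_s().gsub because it was complaining that gsub is not a method for post. But I still have the same issue. I know I'm doing something wrong; I'm just wrecking my head.

Update 2

Forgot to say that the new line is a <br />, and even with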

doc.search('br').each do |n| 
    n.replace('') 
end

or

doc.search('br').remove

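the problem still exists.

Answer

If you look at the myPost array, you will see that each div is in fact its own message. The first one just happens to include a newline character \n. To replace it with a space, use #gsub(/\n/, ' '). So your loop looks like this: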

myPost.each_with_index do |post, index| 
    puts "#{index} #{post.to_s.gsub(/\n/, ' ').strip}" 
end

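Edit:

As far as my limited understanding goes, XPath can only find nodes. The <br /> elements are child nodes of the div, so you either get several separate text nodes between them, or you include the div tag in the search. There is surely a way to join the text between <br /> nodes, but I don't know it. Until you find it, here is something that works:

Replace your XPath match with "//div[@class='postcontent']" and adjust your loop to remove the div tags: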

myPost.each_with_index do |post, index| 
    post = post.to_s 
    post.gsub!(/\n/, ' ') 
    post.gsub!(/^<div[^>]*>/, '') # delete opening div tag 
    post.gsub!(%r|</\s*div[^>]*>|, '') # delete closing div tag 
    puts "#{index} #{post.strip}" 
end
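
One caveat with the string-based loop above: it strips the div tags but leaves any <br /> tags in the output. A minimal sketch of one way to handle that with plain string operations follows. The sample markup and the helper name strip_post_markup are assumptions made for illustration, not part of the original answer:

```ruby
# Hypothetical helper: cleans the serialized HTML of one post div.
# Assumes the input is the string form of a single <div class="postcontent">.
def strip_post_markup(html)
  post = html.dup
  post.gsub!(/<br\s*\/?>/i, ' ')      # turn <br>, <br/> and <br /> into a space
  post.gsub!(/\A\s*<div[^>]*>/, '')   # delete opening div tag
  post.gsub!(%r{</\s*div[^>]*>}, '')  # delete closing div tag
  post.gsub(/\s+/, ' ').strip         # collapse whitespace runs, trim the ends
end

# Assumed sample input, modelled on the markup in the question.
sample = <<~HTML
  <div id="post_message_111112" class="postcontent">
       Hee is text 1<br />
      here is another
  </div>
HTML

puts strip_post_markup(sample)
# prints: Hee is text 1 here is another
```

In the loop above this would be called as strip_post_markup(post.to_s). Regular expressions are a fragile way to handle HTML in general, so this is only a stopgap for markup as simple as the sample.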

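Thanks for the quick reply, but there is just one small problem. After myPost = doc.xpath("//div[@class='postcontent']/text()").to_a() ... I have ... myPost.each_with_index do |post, index| puts "#{index} #{post.gsub(/\n/, ' ').strip}" end ... but it complains that there is no method gsub for post, so if I put ... myPost.each_with_index do |post, index| puts "#{index} #{post.to_s().gsub(/\n/, ' ').strip}" end ... it solves the no gsub problem, but the problem with the array remains – DanielJ 2013-03-10 17:51:48

Sorry, of course the 'to_s' was missing. I fixed it in the original post, but now it prints every post on one line for me. I don't know exactly what is going wrong; can you provide a working example? The html you posted cannot be parsed on its own. – Huluk 2013-03-10 18:23:11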

text text text text text text text text text text text text text text text text text text.

Many thanks.

– DanielJ 2013-03-10 18:30:42

Answer

Here, let me clean that up for you:

doc.search('div.postcontent').each_with_index do |div, i| 
    puts "#{i} #{div.text.gsub(/\s+/, ' ').strip}" 
end 
# 0 Hee is text 1 here is another 
# 1 Here is text 2
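
The loop above collapses whitespace with /\s+/ rather than replacing only /\n/. As a rough illustration of the difference, here is a plain-Ruby sketch that needs no Nokogiri. The raw strings are assumed stand-ins for what div.text might return for the two posts, and normalize_post is a hypothetical helper name:

```ruby
# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# newlines) into one space and trim both ends.
def normalize_post(text)
  text.gsub(/\s+/, ' ').strip
end

# Assumed stand-ins for the text of each div, with the line breaks and
# indentation still in them.
raw_posts = [
  "\n     Hee is text 1 \n    here is another \n     ",
  "\n      Here is text 2 \n    "
]

# Replacing only the newlines leaves the indentation between the two lines.
newline_only = raw_posts.first.gsub(/\n/, ' ').strip
puts newline_only.include?('  ')  # prints: true (runs of spaces remain)

# Normalizing each post gives one clean string per div.
my_posts = raw_posts.map { |text| normalize_post(text) }
my_posts.each_with_index { |post, i| puts "#{i} #{post}" }
# prints:
# 0 Hee is text 1 here is another
# 1 Here is text 2
```

Mapping into an array like this also gives the myPost[0], myPost[1] layout the question asks for, one entry per div.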
