I'm trying to parse HTML that contains an ordered list as well as DL/DD tags. The goal is to create an XML structure that itemizes the contents of each tag and adds an attribute to it, in effect flattening the structure (the desired output is shown at the end of the question).
Here's an example of the HTML, stored in a file (test.html in my code):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Test Structure</title>
</head>
<body>
<ol><li>Item 1 - Level 1
<dl><dd>Item 1.1 - Level 2
</dd><dd>Item 1.2 - Level 2
</dd></dl>
</li><li>Item 2 - Level 1
<dl><dd>Item 2.1 - Level 2
<dl><dd>Item 2.1.1 - Level 3
</dd><dd>Item 2.1.2 - Level 3
<dl><dd>Item 2.1.2.1 - Level 4
</dd><dd>Item 2.1.2.2 - Level 4
</dd></dl>
</dd></dl>
</dd><dd>Item 2.2 - Level 2
<dl><dd>Item 2.2.1 - Level 3
</dd><dd>Item 2.2.2 - Level 3
<dl><dd>Item 2.2.2.1 - Level 4
</dd><dd>Item 2.2.2.2 - Level 4
</dd></dl>
</dd><dd>Item 2.2.3 - Level 3
<dl><dd>Item 2.2.3.1 - Level 4
</dd><dd>Item 2.2.3.2 - Level 4
</dd></dl>
</dd><dd>Item 2.2.4 - Level 3
</dd></dl>
</dd></dl>
</li><li>Item 3 - Level 1
<dl><dd>Item 3.1 - Level 2
</dd><dd>Item 3.2 - Level 2
</dd></dl>
</li></ol>
</body>
</html>
Rendered output of the HTML (shown here without the indentation you would see in a browser):
- Item 1 - Level 1
- Item 1.1 - Level 2
- Item 1.2 - Level 2
- Item 2 - Level 1
- Item 2.1 - Level 2
- Item 2.1.1 - Level 3
- Item 2.1.2 - Level 3
- Item 2.1.2.1 - Level 4
- Item 2.1.2.2 - Level 4
- Item 2.2 - Level 2
- Item 2.2.1 - Level 3
- Item 2.2.2 - Level 3
- Item 2.2.2.1 - Level 4
- Item 2.2.2.2 - Level 4
- Item 2.2.3 - Level 3
- Item 2.2.3.1 - Level 4
- Item 2.2.3.2 - Level 4
- Item 2.2.4 - Level 3
- Item 3 - Level 1
- Item 3.1 - Level 2
- Item 3.2 - Level 2
Desired output:
<job>
<req level='1'>Item 1 - Level 1</req>
<req level='1.1'>Item 1.1 - Level 2</req>
<req level='1.2'>Item 1.2 - Level 2</req>
<req level='2'>Item 2 - Level 1</req>
<req level='2.1'>Item 2.1 - Level 2</req>
<req level='2.1.1'>Item 2.1.1 - Level 3</req>
<req level='2.1.2'>Item 2.1.2 - Level 3</req>
<req level='2.1.2.1'>Item 2.1.2.1 - Level 4</req>
<req level='2.1.2.2'>Item 2.1.2.2 - Level 4</req>
<req level='2.2'>Item 2.2 - Level 2</req>
<req level='2.2.1'>Item 2.2.1 - Level 3</req>
<req level='2.2.2'>Item 2.2.2 - Level 3</req>
<req level='2.2.2.1'>Item 2.2.2.1 - Level 4</req>
<req level='2.2.2.2'>Item 2.2.2.2 - Level 4</req>
<req level='2.2.3'>Item 2.2.3 - Level 3</req>
<req level='2.2.3.1'>Item 2.2.3.1 - Level 4</req>
<req level='2.2.3.2'>Item 2.2.3.2 - Level 4</req>
<req level='2.2.4'>Item 2.2.4 - Level 3</req>
<req level='3'>Item 3 - Level 1</req>
<req level='3.1'>Item 3.1 - Level 2</req>
<req level='3.2'>Item 3.2 - Level 2</req>
</job>
Note that we want to derive the hierarchy by traversing the structure, not from the text content of each LI and DD element. The contents in my example spell out the hierarchy (1, 1.1, 1.2 ...), but the actual data won't. The "level" attribute should reflect the position reached while traversing the structure.
I'm new to both Ruby and Nokogiri, but here is my attempt at reading the HTML (I haven't got to creating the XML yet). I'm stuck on separating out the LI nodes and their contents. I've tried using .each, children.each, etc.:
require 'rubygems'
require 'open-uri'
require 'nokogiri'
url = "test.html"
doc = Nokogiri::HTML(open(url))
line = "1"
doc.css("ol[1]").children.each do |n|
  puts line + n.content.to_s
  line.succ!
  n.children do |c|
    puts line + c.content.to_s
    line.succ!
  end
end
You can use the node_name method to tell text nodes apart from child elements. Here's a sample function that prints the names of the HTML tags under the ol:
def traverse(node, indent = 0)
  node.children.each do |child|
    next if child.node_name == "text"
    puts " "*indent + child.node_name
    traverse(child, indent+1)
  end
end
traverse doc.css("ol[1]")
(The text nodes that I'm skipping above are the textual content of the tags.)