How to parse nested ul/li tags using Hpricot

Developer https://www.devze.com 2023-03-17 00:22 Source: Internet
I have the following HTML structure:

 <div id='my_categories'>
   <ul>
     <li><a href="1">Animals, Birds, & Pets</a></li>
     <li><a href="2">Ask the Expert</a>
       <ul>
         <li><a href='21'>Health Care Providers</a></li>
         <li><a href='22'>Influenza</a>
           <ul>
             <li><a href='221'>Flu Viruses (2)</a></li>
             <li><a href='222'>Test</a></li>
           </ul>
         </li>
       </ul>
     </li>
   </ul>
 </div>

This is how the web page looks:

[Image: How to parse nested ul/li tags using Hpricot]

I have a categories table with the fields category_name, category_url and parent_id.

I need to save each category and sub-category. The parent_id records which category a sub-category belongs to; it is null for top-level categories.

How can I parse this HTML structure using Hpricot and save the data to my database? Please help.

My table looks like

   id   category_name              category_url  parent_id
   1    Animals, Birds, & Pets     null          null
   2    Ask the Expert             null          null
   3    Health Care Providers      null          2
   4    Influenza                  null          2
   5    Flu Viruses                null          4
   6    Test                       null          4
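
To make the role of parent_id concrete, here is a minimal sketch in plain Ruby that holds the rows above in memory and follows the parent_id links upward. The constant CATEGORIES and the helper category_path are hypothetical names used only for this illustration; they are not part of Hpricot or of any model in the question.

```ruby
# In-memory copy of the table above (illustrative data only).
CATEGORIES = [
  { id: 1, category_name: 'Animals, Birds, & Pets', parent_id: nil },
  { id: 2, category_name: 'Ask the Expert',         parent_id: nil },
  { id: 3, category_name: 'Health Care Providers',  parent_id: 2 },
  { id: 4, category_name: 'Influenza',              parent_id: 2 },
  { id: 5, category_name: 'Flu Viruses',            parent_id: 4 },
  { id: 6, category_name: 'Test',                   parent_id: 4 }
].freeze

# Hypothetical helper: follows parent_id links up to a top-level category
# and returns the names from the outermost category down to the given one.
def category_path(rows, id)
  by_id = rows.to_h { |row| [row[:id], row] }
  path = []
  while id
    row = by_id.fetch(id)
    path.unshift(row[:category_name])
    id = row[:parent_id]
  end
  path.join(' > ')
end

puts category_path(CATEGORIES, 5) # prints "Ask the Expert > Influenza > Flu Viruses"
```

A row with a null parent_id is a top-level category, so the walk stops there.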

Thanks in advance


Below is the code that worked for me...

   require 'open-uri'
   require 'hpricot'

   # categories_page holds the URL of the page that lists the categories.
   doc = Hpricot(open(categories_page).read)

   doc.search("ul/li").each do |li|
     link = li.search('a[@href]').first
     # Drop the trailing item count, e.g. "Flu Viruses (2)" becomes "Flu Viruses".
     category_name = link.inner_text.gsub(/ *\(.*?\)/, '')
     category_url = link[:href]
     category = Category.find_or_create_by_name(category_name, :url => category_url)

     puts "---------- #{category.name} ------------"

     # Links found in the lists nested under this item.
     li.search("ul/li/a").each do |node|
       node_name = node.inner_text.gsub(/ *\(.*?\)/, '')
       node_url = node.attributes['href']
       sub_category = Category.find_by_name(node_name)
       if sub_category.blank?
         sub_category = Category.create(:name => node_name, :url => node_url, :parent_category_id => category.id)
         puts "  #{sub_category.name}"
       else
         # Already saved earlier: point it at the category being processed now.
         sub_category.update_attribute('parent_category_id', category.id)
         puts "  #{category.name} --> #{sub_category.name}"
       end
     end
   end
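
For checking the result without a database or Hpricot, the same parent/child assignment can be sketched with a stack over the open <li> elements. This is only a rough stand-in: it swaps the HTML parser for a regular expression that is good enough for the small sample in the question and should not be relied on for arbitrary HTML. SAMPLE_HTML, TOKEN and category_rows are hypothetical names introduced for this sketch, and the ids are simply assigned in document order.

```ruby
# Copy of the markup from the question (illustrative data only).
SAMPLE_HTML = <<~HTML
  <div id='my_categories'>
    <ul>
      <li><a href="1">Animals, Birds, & Pets</a></li>
      <li><a href="2">Ask the Expert</a>
        <ul>
          <li><a href='21'>Health Care Providers</a></li>
          <li><a href='22'>Influenza</a>
            <ul>
              <li><a href='221'>Flu Viruses (2)</a></li>
              <li><a href='222'>Test</a></li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </div>
HTML

# Matches <li>, </li>, or a complete <a href="...">text</a> element.
TOKEN = %r{<(/?)li\b[^>]*>|<a\b[^>]*\bhref=(["'])(.*?)\2[^>]*>(.*?)</a>}m

# Hypothetical helper: returns one hash per category, in document order.
def category_rows(html)
  rows = []
  open_items = [] # ids of the <li> elements currently open, outermost first
  html.scan(TOKEN) do |slash, _quote, href, text|
    if href
      # A link inside the innermost open <li>; its parent is the <li> one level up.
      id = rows.size + 1
      rows << { id: id,
                category_name: text.gsub(/ *\(.*?\)/, '').strip,
                category_url: href,
                parent_id: open_items[-2] }
      open_items[-1] = id unless open_items.empty?
    elsif slash == '/'
      open_items.pop       # </li>: leave the current item
    else
      open_items.push(nil) # <li>: id is filled in when its link is seen
    end
  end
  rows
end

category_rows(SAMPLE_HTML).each do |row|
  puts row.values_at(:id, :category_name, :category_url, :parent_id).inspect
end
```

For the sample markup this yields six rows whose parent_id values are nil, nil, 2, 2, 4, 4, matching the table in the question. Unlike that table, category_url is filled from each link's href here.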