Get element and change element text with Python and lxml

Developer https://www.devze.com 2023-04-05 16:11 Source: Internet
First things first: I know there are already many questions about Python and lxml on StackOverflow, and I have read most of them, if not all. With this question I am looking for a more comprehensive answer.

I am doing some HTML conversion, and I need to programmatically parse the HTML and then change some content such as href attributes, img elements and so on.

This is a simplified version of what I have right now:

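from lxml.html import fromstring  # HTML parser; its elements provide the helpers used below

# fileName is the path to the input HTML file
with open(fileName, "r") as inFile:
    inputS = inFile.read()

myTree = fromstring(inputS)  # parse the HTML content into an element tree

# get_element_by_id returns a single element (raises KeyError if the id is missing)
breadCrumb = myTree.get_element_by_id("breadcrumb")
breadCrumbContent = breadCrumb.text_content().strip()  # text content of the breadcrumb

h1 = myTree.xpath('//h1')  # another way: xpath() returns a list of matching elements
h1Content = h1[0].text_content().strip()  # text content of the first <h1>

# cssselect() returns a list of elements matching a CSS selector (needs the cssselect package)
getTail = myTree.cssselect('table.results > tr > td > a + span + br')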

So basically that's what I know at the moment. Are there any other ways to get elements/attributes using lxml? I know these may not be the best way to do it, but bear with me, I am new to this whole thing.
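
For reference, a few other lookup styles that lxml.html elements support, sketched against a made-up snippet. The SAMPLE markup and all variable names here are assumptions for illustration only, not taken from the real pages:

```python
from lxml import html

# Hypothetical sample markup, for illustration only.
SAMPLE = """
<html><body>
  <div id="breadcrumb"><a href="/home">Home</a> / <a href="/faq">FAQ</a></div>
  <h1>General <em>FAQ</em></h1>
  <p><img src="images/macmail10.gif" alt="" width="555"></p>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# find() returns the first match (or None); findall() returns a list.
# Both take the limited ElementPath syntax rather than full XPath.
first_link = tree.find(".//a")
all_links = tree.findall(".//a")

# iter() walks the tree in document order, optionally filtered by tag name.
tags = [el.tag for el in tree.iter("a", "img")]

# Attributes: get() reads one attribute with an optional default,
# while attrib behaves like a dict of all attributes.
img = tree.find(".//img")
src = img.get("src")
width = img.attrib["width"]
missing = img.get("height", "unknown")

# XPath can also return attribute values directly as strings.
hrefs = tree.xpath("//a/@href")

# iterlinks() yields (element, attribute, link, pos) for link-like attributes.
links = [(el.tag, attr, link) for el, attr, link, pos in tree.iterlinks()]
```

Which style fits best probably depends on the page: find()/findall() are enough for simple tag lookups, while xpath() is handier once conditions on attributes or position are involved.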

Following is what I want to do. I have:

<img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
<a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a>

They can be nested inside other elements such as div or p. What I want to do is programmatically look for those elements. For an image, I want to extract the src, manipulate it, and set src to something else (for example, turn src="images/something.jpg" into src="something_images.jpg"). The same goes for href: I want to change it so that it points somewhere else.
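
A minimal sketch of that kind of attribute rewrite, assuming lxml.html is used for parsing. The two helper functions, the "/local/" target and the wrapping markup are hypothetical; only the img and a tags come from the snippet above:

```python
from lxml import html

# The <img> and <a> tags from the question, wrapped in hypothetical parent elements.
SAMPLE = """
<div>
  <p><img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
  <a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a></p>
</div>
"""

OLD_BASE = "http://www.some_url.com/"
NEW_BASE = "/local/"  # hypothetical new location, chosen only for this example


def flatten_image_path(src):
    """Hypothetical rule: 'images/name.ext' becomes 'name_images.ext'."""
    prefix = "images/"
    if not src.startswith(prefix):
        return src
    stem, dot, ext = src[len(prefix):].rpartition(".")
    return f"{stem}_images{dot}{ext}" if dot else src


def relocate_href(href):
    """Hypothetical rule: move links under OLD_BASE to NEW_BASE."""
    if href.startswith(OLD_BASE):
        return NEW_BASE + href[len(OLD_BASE):]
    return href


root = html.fromstring(SAMPLE)

# ".//" searches all descendants, so nesting inside div, p, etc. does not matter.
for img in root.xpath(".//img[@src]"):
    img.set("src", flatten_image_path(img.get("src")))

for a in root.xpath(".//a[@href]"):
    a.set("href", relocate_href(a.get("href")))

result = html.tostring(root, encoding="unicode")
```

If every link in the document should go through one callback, lxml.html elements also have a rewrite_links() method that may be a shorter route than looping by hand.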

Other than that, I also want to remove some elements from the tree to simplify it, for example:

<head>
    <title>something goes here</title>
</head>
<div>
    <p id="some_p"> Some content </p>
</div>

I want to remove the head node and the div. I can already get the p with id="some_p"; is there any way to grab its parent element? Is there also a way to remove those elements? (In this case: look for head, remove it, then look for id="some_p", get its parent and delete it.)
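
One possible way to do this with lxml.html is sketched below. The html/body wrapper and the extra "keep_me" paragraph are assumptions added so the effect is visible; getparent() and drop_tree() are the lxml calls doing the work:

```python
from lxml import html

# Markup from the question, placed in a hypothetical full document.
SAMPLE = """
<html>
<head>
    <title>something goes here</title>
</head>
<body>
<div>
    <p id="some_p"> Some content </p>
</div>
<p id="keep_me">Other content</p>
</body>
</html>
"""

tree = html.fromstring(SAMPLE)

# Look for <head> and remove it together with its children.
head = tree.find("head")
if head is not None:
    head.drop_tree()

# Look up the <p> by id, step up to its parent, and remove that parent.
p = tree.get_element_by_id("some_p")
wrapper = p.getparent()  # the enclosing <div>
wrapper.drop_tree()      # removes the <div> and everything inside it

remaining = html.tostring(tree, encoding="unicode")
```

drop_tree() is specific to lxml.html elements and keeps any text that follows the removed element. With plain lxml.etree the usual equivalent is wrapper.getparent().remove(wrapper), which also discards the removed element's tail text.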

Thank you!

==================================================

UPDATE: I have already found a solution and finished coding it using lxml.etree. I will post the answer as soon as StackOverflow allows me to. I truly hope the answer to this question will help other people who have to deal with HTML parsing!


lxml and ElementTree are quite similar. The ElementTree portion of the lxml documentation site, in fact, just points to ElementTree's documentation.

You might try working through the ElementTree tutorials and examples at the bottom of the overview page. Since ElementTree is part of the Python distribution, it tends to be widely documented (and easily Googled). Once you grok that, extend it with some of the lxml magic not found in ElementTree if you need to. For example, lxml maintains a parent reference for every element and ElementTree does not. You can add parent relationships to ElementTree, but it is not an easy example to start with.

That's how I learned it.
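
To illustrate the parent-relationship point, here is a rough sketch that uses only the standard library's ElementTree. The SAMPLE string and the parent_of name are made up for the example. The usual workaround is to build a child-to-parent map yourself:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; ElementTree parses XML, not arbitrary HTML.
SAMPLE = '<root><div><p id="some_p">Some content</p></div></root>'

root = ET.fromstring(SAMPLE)

# ElementTree elements do not know their parent, so build a lookup once:
# for every element in the tree, record it as the parent of each of its children.
parent_of = {child: parent for parent in root.iter() for child in parent}

p = root.find(".//p[@id='some_p']")
div = parent_of[p]          # parent of <p>
parent_of[div].remove(div)  # remove <div> from its own parent (<root>)
```

The map is a snapshot, so it would need rebuilding after the tree changes. With lxml the same lookup is simply p.getparent(), with no map to maintain.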
