开发者

Cleaning and stripping of strings/HTML - Python

开发者 https://www.devze.com 2023-01-22 17:02 出处:网络
I have a set of questions, of which I do not have an answer to. 1) Stripping lists of string input: \'item1,item2, \\t\\t\\t item3, \\n\\n\\n \\t, item4, , , item5, \'

I have a set of questions, of which I do not have an answer to.

1) Stripping lists of string

input:
'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '

output:
['item1', 'item2', 'item3', 'item4', 'item5']

Anything more efficient than doing the following?

[x.strip() for x in l.split(',') if x.strip()]

2) Cleaning/Sanitizing HTML

keeping basic tags e.g. strong, p, br, ...

removing malicious javascript, css and divs

3) Unicode handling...

what would y开发者_JAVA技巧ou recommend for dealing with unicode parsed within documents?


Any ideas? :) Thanks guys!


To clean HTML use lxml.html

import lxml.html
text = lxml.html.fromstring("...")
text.text_content()


For the first one you can use split then a list comprehension to trim the extra whitespace:

result = [x.strip() for x in i.split(',')]

And to remove the empty strings from the list:

result = [x for x in result if x]


I tend to write multiple cascading generators, particularly if I want to some output to be part of a test:

stripped_iter = (x.strip() for x in l.split(','))
non_empty_iter = (x for x in stripped_iter if x)

The inspiration is Beazley's presentation on coroutines.


I am somewhat of a beginner at python web development, but for cleaning/sanitizing html I have found that the markdown2 library has some very nice features. You can use it with the MarkItUp! jQuery-based editor. They may not solve all your problems but might help you do a lot of work in a short time.


1) you can use the strip method

2) you can use sanitize , http://wonko.com/post/sanitize

3) some unicode tips here: http://blog.trydionel.com/2010/03/23/some-unicode-tips-for-ruby/


1) [j.strip() for j in a.split(',') if j.strip()]

2) check tidy

0

精彩评论

暂无评论...
验证码 换一张
取 消