I have a set of questions, of which I do not have an answer to.
1) Stripping lists of string
input:
'item1, item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '
output:
['item1', 'item2', 'item3', 'item4', 'item5']
Anything more efficient than doing the following?
[x.strip() for x in l.split(',') if x.strip()]
2) Cleaning/Sanitizing HTML
keeping basic tags e.g. strong, p, br, ...
removing malicious javascript, css and divs
3) Unicode handling...
what would y开发者_JAVA技巧ou recommend for dealing with unicode parsed within documents?
Any ideas? :) Thanks guys!
To clean HTML use lxml.html
import lxml.html
text = lxml.html.fromstring("...")
text.text_content()
For the first one you can use split then a list comprehension to trim the extra whitespace:
result = [x.strip() for x in i.split(',')]
And to remove the empty strings from the list:
result = [x for x in result if x]
I tend to write multiple cascading generators, particularly if I want to some output to be part of a test:
stripped_iter = (x.strip() for x in l.split(','))
non_empty_iter = (x for x in stripped_iter if x)
The inspiration is Beazley's presentation on coroutines.
I am somewhat of a beginner at python web development, but for cleaning/sanitizing html I have found that the markdown2 library has some very nice features. You can use it with the MarkItUp! jQuery-based editor. They may not solve all your problems but might help you do a lot of work in a short time.
1) you can use the strip method
2) you can use sanitize , http://wonko.com/post/sanitize
3) some unicode tips here: http://blog.trydionel.com/2010/03/23/some-unicode-tips-for-ruby/
1) [j.strip() for j in a.split(',') if j.strip()]
2) check tidy
精彩评论