I am trying to write some test script for a web application. So I tried use twill, which turns out to be using mechanized to parse html. But it really lets me down. For example, somehow, it couldn't correctly recognize a form on web page requires method "POST", not "GET".
So, is there any better alternatives other than directly using urllib2?
Edit, this this form that twill can not recognize.
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML Transitional//EN'
'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
<html>
<head>
<meta http-equiv='Content-type' content='text/html; charset=utf-8' />
<title> Login </title>
<link rel='stylesheet' href='/assets/styles/default.css' type='text/css'/>
<link rel='stylesheet' href='/assets/styles/button.css' type='text/css'/>
<link rel='stylesheet' href='/assets/styles/login.css' type='text/css'/>
<link rel='icon' type='image/x-icon' href='/assets/favicon.ico' />
</head>
<body >
<div id='content_area'>
<div id='outter_login_area'>
<div id='inner_login_area'>
<div id='login_head'>
<div>
Login with your
</div>
<div id='login_head_bottom'>
<img src='/assets/images/header_logo.png' class='login_logo'/>
<b>Admin Account</b>
</div>
</div>
<hr/>
<form action="." method="POST" id='user_login_form'>
<div style='display:开发者_运维知识库none'><input type='hidden' name='csrfmiddlewaretoken' value='c6b6e0ca08d53093428c61f62f51ea1f' /></div>
<div>
<label for="id_username">User Name</label>
<input id="id_username" type="text" name="username" maxlength="30" />
</div>
<div>
<label for="id_password">Password</label>
<input type="password" name="password" id="id_password" />
</div>
<div id='submit_bar'>
<input name='submit_button' type="submit" value="Submit" class="button blue"/>
</div>
</form>
</div>
</div>
</div>
</body>
</html>
This is what twill says:
In [5]: br.go('http://localhost:8000/')
==> at http://localhost:8000/accounts/login/?next=/chancellor/
In [6]: br.get_all_forms()
Out[6]: [<_mechanize_dist.ClientForm.HTMLForm instance at 0x03112F80>]
In [7]: br.get_all_forms()[0]
Out[7]: <_mechanize_dist.ClientForm.HTMLForm instance at 0x03112F80>
In [8]: br.get_all_forms()[0].method
Out[8]: 'GET'
Well, that's not true.
mechanize
can recognize whether the method of the form is POST
or GET
just fine. It looks the "method"
attribute of the <form>
tag to do so.
So, if it didn't work for your particular case, you'd have to look what's wrong. Can you provide the HTML source code for the page you're trying to use? I suspect the form isn't declared as POST
, otherwise mechanize
would detect it.
That said, if you're looking for alternatives, I like to use scrapy for web scraping. It's a fast high-level screen scraping and web crawling framework, written from ground up with the purpose of crawling websites to extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
EDIT:
I saved your html snippet to /tmp/test.html
and ran the following code:
import mechanize
br = mechanize.Browser()
br.open('file:///tmp/test.html')
br.select_form(nr=0)
print br.method
I get POST
as result. So I can't reproduce your issue.
Are you sure it's that page you're parsing?
EDIT 2:
Your HTML is broken.
The line immediatelly above <form>
is malformed. It contains <hr/>
and it should contain either <hr>
or <hr />
in order to be parsed correctly.
Here's how to fix it when parsing:
import mechanize
br = mechanize.Browser()
response = br.open('file:///tmp/test.html')
# fix the page so it is correctly parsed:
response.set_data(response.get_data().replace('<hr/>', '<hr />'))
br.set_response(response)
br.select_form(nr=0)
print br.method
精彩评论