I created a website using WordPress, and on the first day it was full of dummy content until I uploaded my own. Google indexed pages such as:
www.url.com/?cat=1
These pages no longer exist, and to make a removal request Google asks me to block them in robots.txt.
Should I use:
User-Agent: *
Disallow: /?cat=
or
User-Agent: *
Disallow: /?cat=*
My robots.txt file would look something like this:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /author
Disallow: /?cat=
Sitemap: http://url.com/sitemap.xml.gz
Does this look fine, or could it cause any problems with search engines? Should I use Allow: / along with all the Disallow: lines?
I would actually go with a wildcard pattern here.
To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
So in your case I would use:
User-agent: Googlebot
Disallow: /*?cat=
Resource (under "Pattern matching")
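Folding that wildcard rule into the robots.txt from your question would give you something like the sketch below. This keeps the rule under the general User-agent: * group rather than a separate Googlebot group, which assumes you only care about crawlers that understand the (non-standard but widely supported) * wildcard, such as Googlebot and Bingbot:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /author
# matches any path followed by a query string starting with cat=, e.g. /?cat=1
Disallow: /*?cat=
Sitemap: http://url.com/sitemap.xml.gz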
In general, you should not use the robots.txt directives to handle removed content. If a search engine can't crawl it, it can't tell whether or not it's been removed and may continue to index (or even start indexing) those URLs. The right solution is to make sure that your site returns a 404 (or 410) HTTP result code for those URLs, then they'll drop out automatically over time.
If you want to use Google's urgent URL removal tools, you would have to submit these URLs individually anyway, so you would not gain anything by using a robots.txt disallow.
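If you go the 404/410 route and your WordPress install runs on Apache, one possible sketch (an assumption on my part, not part of the original answer) is a mod_rewrite rule in .htaccess placed above the standard WordPress block, assuming every URL carrying a cat query parameter is one of the stale dummy pages:
# Return 410 Gone for the old dummy-content URLs like /?cat=1
RewriteEngine On
RewriteCond %{QUERY_STRING} ^cat=[0-9]+$
RewriteRule ^$ - [G,L]
The [G] flag makes Apache answer with 410 Gone; a plain 404 works as well, it just signals the removal less explicitly.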