Optimizing for search results by taking out parts of our URLs with robots.txt
GoogleCodeExporter opened this issue
What steps will reproduce the problem?
1. Google for 'site:saasta.fi "by jake"'
2. Observe the results
What is the expected output? What do you see instead?
I'd expect to get single post links on the Google results.
Instead, I see December and June archive pages that contain Jake's posts.
These results would be more useful and accurate if Google returned single
post results rather than month archives.
I think it is possible to fix this by carefully modifying our robots.txt to
mask out URLs of the form "www.saasta.fi/saasta/?m=<YYYYmm>"
Doing this is explained here:
http://www.google.com/support/webmasters/bin/answer.py?answer=40367
Original issue reported on code.google.com by jjhel...@gmail.com
on 17 Aug 2008 at 7:34
Added this to robots.txt:
User-agent: *
Disallow: /saasta/?m=*
Disallow: /saasta/?cat=*
Disallow: /saasta/?tag=*
This has the following effects on our posts:
http://www.saasta.fi/ Allowed
http://www.saasta.fi/saasta/?p=2373 Allowed
http://www.saasta.fi/saasta/?m=200807 Blocked by line 2: Disallow: /saasta/?m=*
http://www.saasta.fi/saasta/?cat=7 Blocked by line 3: Disallow: /saasta/?cat=*
http://www.saasta.fi/saasta/?tag=kylma-kyyti Blocked by line 4: Disallow: /saasta/?tag=*
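For anyone who wants to double-check the Allowed/Blocked results above locally: Python's stdlib urllib.robotparser does not implement the `*` wildcard extension these rules rely on, so here is a rough sketch that compiles each Disallow pattern into a regex instead. The helper names are my own, and real crawlers do more than this.

```python
import re

# The three Disallow patterns added to robots.txt above.
RULES = ["/saasta/?m=*", "/saasta/?cat=*", "/saasta/?tag=*"]

def pattern_to_regex(pattern):
    # Escape regex metacharacters, then turn the robots.txt '*'
    # wildcard back into a regex '.*'.
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))

REGEXES = [pattern_to_regex(p) for p in RULES]

def blocked(path):
    # Wildcard matching anchors each pattern at the start of the path.
    return any(rx.match(path) for rx in REGEXES)

print(blocked("/saasta/?m=200807"))  # True  (month archive: blocked)
print(blocked("/saasta/?p=2373"))    # False (single post: allowed)
```

This reproduces the table above: archive, category and tag URLs match a pattern and are blocked, while the front page and single-post URLs match nothing and stay crawlable.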
Let's see how this affects our Google results. What I'm aiming for is that
the results should contain mainly single post URLs, not duplicates caused by
month archives, tags and categories.
Original comment by jjhel...@gmail.com
on 18 Aug 2008 at 4:09
Implemented. The offending URL patterns are now masked out by our robots.txt.
Verified with Google Webmaster Tools that these pages are indeed being blocked.
Googling for the example queries still gives the same wrong results though.
My guess is that it takes a while for Google to update their index.
Keeping the bug open for a while still to see how this evolves.
Original comment by jjhel...@gmail.com
on 24 Aug 2008 at 9:48
Well well well, seems that robots.txt has kicked in, with the adverse effect
that our main page is not serving real ads anymore! You may have seen plenty
of "Public service ads" on the saasta.fi main page. Well, this is not intended.
Looks like I shouldn't have blocked the Google AdSense bot from accessing many
of our links.
http://www.askapache.com/google/adsense-robots.html
Edited robots.txt to fix this. It now looks like this:
8<
User-agent: *
Disallow: /saasta/?m=*
Disallow: /saasta/?cat=*
Disallow: /saasta/?tag=*
Disallow: /saasta/?s=*
Disallow: /saasta/?feed=*
Disallow: /saasta/?paged=*
Disallow: /saasta/wp-admin
Disallow: /saasta/wp-content/plugins
Disallow: /saasta/wp-content/themes
Allow: /saasta/wp-content/uploads
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
8<
Original comment by jjhel...@gmail.com
on 4 Sep 2008 at 8:48