nurpax / saastafi

Automatically exported from code.google.com/p/saastafi

Optimizing for search results by taking out parts of our URLs with robots.txt

GoogleCodeExporter opened this issue


What steps will reproduce the problem?
1. Google for 'site:saasta.fi "by jake"'
2. Observe the results

What is the expected output? What do you see instead?

I'd expect to get single-post links in the Google results.

Instead, I see December and June archive pages that contain Jake's posts.

These results would be more useful and accurate if Google returned single-post
results rather than month archives.

I think it is possible to fix this by carefully modifying our robots.txt to
mask out URLs of the form "www.saasta.fi/saasta/?m=<YYYYMM>"

Doing this is explained here:
http://www.google.com/support/webmasters/bin/answer.py?answer=40367

Original issue reported on code.google.com by jjhel...@gmail.com on 17 Aug 2008 at 7:34

Added this to robots.txt:

User-agent: *
Disallow: /saasta/?m=*
Disallow: /saasta/?cat=*
Disallow: /saasta/?tag=*

This has the following effects on our posts:

http://www.saasta.fi/                           Allowed
http://www.saasta.fi/saasta/?p=2373             Allowed
http://www.saasta.fi/saasta/?m=200807           Blocked by line 2: Disallow: /saasta/?m=*
http://www.saasta.fi/saasta/?cat=7              Blocked by line 3: Disallow: /saasta/?cat=*
http://www.saasta.fi/saasta/?tag=kylma-kyyti    Blocked by line 4: Disallow: /saasta/?tag=*
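The Allowed/Blocked outcomes above can be double-checked locally with Python's
stdlib urllib.robotparser. This is only a sketch: the stdlib parser does plain
prefix matching and does not interpret '*' wildcards, so the trailing '*' from
the live rules is dropped here (it is redundant under prefix matching anyway).

```python
from urllib.robotparser import RobotFileParser

# Same rules as in robots.txt, minus the trailing '*' wildcards,
# which urllib.robotparser would treat as literal characters.
rules = """\
User-agent: *
Disallow: /saasta/?m=
Disallow: /saasta/?cat=
Disallow: /saasta/?tag=
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Front page and single posts stay crawlable ...
print(rp.can_fetch("*", "http://www.saasta.fi/"))                  # True
print(rp.can_fetch("*", "http://www.saasta.fi/saasta/?p=2373"))    # True
# ... while month archives, categories and tags are blocked.
print(rp.can_fetch("*", "http://www.saasta.fi/saasta/?m=200807"))  # False
print(rp.can_fetch("*", "http://www.saasta.fi/saasta/?cat=7"))     # False
```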

Let's see how this affects our Google results.  What I'm aiming for is that
the Google results should contain mainly single-post URLs, not duplicates
caused by month archives, tags and categories.

Original comment by jjhel...@gmail.com on 18 Aug 2008 at 4:09

Implemented.  The offending URL patterns are now masked out by our robots.txt. 
Verified with Google Webmaster Tools that these pages are indeed being blocked.

Googling for the example queries still gives the same wrong results though.
My guess is that it takes a while for Google to update their index.

Keeping the bug open for a while still to see how this evolves.

Original comment by jjhel...@gmail.com on 24 Aug 2008 at 9:48

Well well well, it seems that robots.txt has kicked in, with the adverse
effect that our main page is not serving real ads anymore!  You may have seen
plenty of "Public service ads" on the saasta.fi main page.  Well, this is not
intended.

Looks like I shouldn't have blocked the Google AdSense bot from accessing
many of our links.

http://www.askapache.com/google/adsense-robots.html

Edited robots.txt to fix this.  It now looks like this:

8<
User-agent: *
Disallow: /saasta/?m=*
Disallow: /saasta/?cat=*
Disallow: /saasta/?tag=*
Disallow: /saasta/?s=*
Disallow: /saasta/?feed=*
Disallow: /saasta/?paged=*
Disallow: /saasta/wp-admin
Disallow: /saasta/wp-content/plugins
Disallow: /saasta/wp-content/themes
Allow: /saasta/wp-content/uploads

# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
8<
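The per-agent split above (block listing pages for generic crawlers, but let
the AdSense crawler fetch everything) can be sanity-checked the same way.
Again a hedged sketch: urllib.robotparser matches user-agent names and paths
literally, so the trailing '*' on "Mediapartners-Google*" and on the Disallow
patterns is omitted here; "SomeBot" is just a stand-in name for any generic
crawler.

```python
from urllib.robotparser import RobotFileParser

# Sketch of the updated robots.txt under stdlib limitations:
# no '*' wildcards in paths or user-agent names.
rules = """\
User-agent: *
Disallow: /saasta/?m=
Disallow: /saasta/?cat=
Disallow: /saasta/?tag=
Disallow: /saasta/?s=
Disallow: /saasta/?feed=
Disallow: /saasta/?paged=
Disallow: /saasta/wp-admin

User-agent: Mediapartners-Google
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

archive = "http://www.saasta.fi/saasta/?m=200807"
# Generic crawlers still skip the duplicate archive pages ...
print(rp.can_fetch("SomeBot", archive))               # False
# ... but the AdSense crawler may fetch any URL, so real ads can return.
print(rp.can_fetch("Mediapartners-Google", archive))  # True
```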


Original comment by jjhel...@gmail.com on 4 Sep 2008 at 8:48