[personal profile] thistlechaser
There is such a thing as too much of a good thing. In the last 13 days, Google's spider has been to my site (Catlove) 473 times and has used 41.64 MB of bandwidth. The next most frequent spider came 29 times and used 1.11 MB of bandwidth.

As far as I can tell, robots.txt will only keep a spider out or let it in. I thought I recalled that you could restrict them to like "once a month" or something like that, but I can't find any reference to that now...

Can you tell I'm really bored?

Date: 2003-11-13 11:47 am (UTC)
From: [identity profile] alchemia.livejournal.com
Really, yikes! Google only uses about 200K each month on intertexius. I have it set, though, to index only the main page and not follow any links past that.

How much of your site do you want Google to archive? Just the main page? Pages but not images? Etc.? You can be very specific.

The robots exclusion protocol is described here; it works for Google and other search engines:
http://www.robotstxt.org/wc/exclusion.html

Google specific info is here: http://www.google.com/webmasters/faq.html

If you need further help beyond that, ask =)

Date: 2003-11-13 12:57 pm (UTC)
From: [identity profile] thistle-chaser.livejournal.com
Yep, I read through those two links before posting. Thanks though!

How much of your site do you want Google to archive?

I'm fine with them crawling it all, but 473 visits in 13 days? That's just seriously excessive...

Date: 2003-11-13 01:40 pm (UTC)
From: [identity profile] gconnor.livejournal.com
If they are requesting any images, you can block them from your images folder. If they are just requesting the same HTML over and over, there's not much you can do about that.

Date: 2003-11-13 01:47 pm (UTC)
From: [identity profile] thistle-chaser.livejournal.com
Unfortunately I was sloppy in setting up the site, and never made an /images directory -- I just tossed everything in together. Google's image bot visits along with the normal one. So long as my images start showing up in their image searches, I don't mind them having (reasonable) access, but thus far I don't seem to be listed there. (Images seem to have a whole separate database from their text search.)

Date: 2003-11-13 05:28 pm (UTC)
From: [identity profile] gconnor.livejournal.com
I wonder if their image bot has a different name; if so, you could just block that crawler.
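For what it's worth, Google's image crawler does identify itself separately, as Googlebot-Image, so robots.txt can name it on its own. Here is a rough sketch of what that might look like, checked with Python's standard urllib.robotparser. The file names (/cat.jpg, /index.html) are made up for illustration, and the rules are only one possible layout, not a recommendation for this particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out only the image crawler,
# leave every other robot (including the regular Googlebot) alone.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow:
"""


def can_fetch(agent: str, path: str) -> bool:
    """Return True if the rules above let `agent` request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # /cat.jpg and /index.html are placeholder paths.
    print(can_fetch("Googlebot-Image", "/cat.jpg"))
    print(can_fetch("Googlebot", "/index.html"))
```

Note that this only tells well-behaved crawlers what to skip; it does nothing about how often an allowed crawler comes back.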

If you do decide to move the images to images/ you can provide a list of redirects from /whatever.jpg to /images/whatever.jpg. Normal users will follow the image redirect automatically and robots would be blocked.
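As a sketch of the redirect list, assuming the site runs on Apache with mod_alias (the thread doesn't say what server it is), something like the following could generate one Redirect line per image. The file names and the set of image extensions are placeholders, and on older Apache versions the target may need to be a full URL rather than a path.

```python
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to whatever the site actually uses.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".gif", ".png"}


def redirect_rules(filenames):
    """Build Apache mod_alias lines sending /name to /images/name for image files."""
    rules = []
    for name in sorted(filenames):
        if PurePosixPath(name).suffix.lower() in IMAGE_SUFFIXES:
            rules.append(f"Redirect permanent /{name} /images/{name}")
    return rules


if __name__ == "__main__":
    # Placeholder listing of the site's top-level directory.
    for line in redirect_rules(["whatever.jpg", "index.html"]):
        print(line)
```

With the images under /images/, a single "Disallow: /images/" line in robots.txt would then cover all of them.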

A cheap trick is to make a symlink "images" to "." - then both URLs will work. This doesn't keep people out, but it gives you time to change / to /images everywhere without breaking anything. (Well, copying all the images to both places works too, but symlinks are so much geekier.)
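The symlink trick can be sketched in a few lines of Python, assuming a Unix-like host where you can create symlinks and a web server that is configured to follow them. The helper name link_images_to_root and the file whatever.jpg are made up for the example.

```python
import os
import tempfile


def link_images_to_root(site_root: str) -> str:
    """Create site_root/images as a symlink to site_root itself (".")."""
    link_path = os.path.join(site_root, "images")
    # lexists is True even for a dangling link, so we never try to overwrite one.
    if not os.path.lexists(link_path):
        # A relative target is resolved from the link's own directory.
        os.symlink(".", link_path, target_is_directory=True)
    return link_path


if __name__ == "__main__":
    # Try it in a throwaway directory rather than on the real site.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "whatever.jpg"), "wb") as f:
            f.write(b"placeholder")
        link_images_to_root(root)
        # The same file is now reachable under both paths.
        print(os.path.exists(os.path.join(root, "whatever.jpg")))
        print(os.path.exists(os.path.join(root, "images", "whatever.jpg")))
```

Once every page points at /images/, the symlink can be replaced by a real directory.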

Date: 2003-11-13 06:05 pm (UTC)
From: [identity profile] thistle-chaser.livejournal.com
I wonder if their image bot has a different name; if so, you could just block that crawler.

You know, I bet you can. Thanks for the idea and the other info as well!

Date: 2003-11-13 06:31 pm (UTC)
From: [identity profile] mousapelli.livejournal.com
Thanks for the info and the link. I was having the same sort of issue but couldn't identify the problem. Stupid automated internet systems. I hope the robots.txt thing fixes it.
