When you’re dealing with website security, knowing whom to block from accessing your website can be difficult. Some IP address ranges are easy to spot because they belong to companies that provide web hosting services, while others are almost impossible to identify. I’ve been working on a long-term project in which I intend to separate the good guys from the bad guys, and it’s an extremely tiring process. It’s at the point where I almost want to give up on it.
Internet Access Providers
I’ve already blocked most of the server companies and the website attacks associated with them. Unfortunately, that only solves half of my problem. A lot of the spam and registration attacks come from the dynamic ranges of well-known Internet access providers like Comcast, Cox, Road Runner and Shaw (to name only a few). For those, I’ve developed a script that scans my access logs every 15 minutes for “bad behavior” and then updates a block list with the individual offending IP addresses, blocking each one for two days from its last offense. Because many of them are repeat offenders that show up daily, this particular block list always holds between 300 and 700 individual IP addresses.
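The post doesn’t show the script itself, but the core idea (scan the log for offending requests, push each offender’s block expiry to two days after its most recent offense, and prune expired entries) can be sketched roughly like this. The “bad behavior” patterns below are hypothetical stand-ins, since the author’s actual criteria aren’t given:

```python
import re
import time

# Hypothetical patterns for "bad behavior"; the post does not list the
# actual criteria the author's script uses.
BAD_PATTERNS = [
    re.compile(r"POST /wp-login\.php"),  # brute-force login attempts
    re.compile(r"POST /xmlrpc\.php"),    # XML-RPC abuse
    re.compile(r"GET /wp-admin/"),       # probing admin pages
]

BLOCK_SECONDS = 2 * 24 * 60 * 60  # two days from the last offense


def update_blocklist(log_lines, blocklist, now=None):
    """Scan access-log lines and refresh expiry times for offenders.

    blocklist maps IP address -> expiry timestamp. An IP's expiry is
    pushed forward every time it reoffends, which matches the
    "two days from the last offense" rule.
    """
    now = time.time() if now is None else now
    for line in log_lines:
        ip = line.split(" ", 1)[0]  # first field in combined log format
        if any(p.search(line) for p in BAD_PATTERNS):
            blocklist[ip] = now + BLOCK_SECONDS
    # drop entries whose two-day window has elapsed
    for ip in [ip for ip, expiry in blocklist.items() if expiry <= now]:
        del blocklist[ip]
    return blocklist
```

Run from cron every 15 minutes, the output of `update_blocklist` could then be written out as deny rules for the web server or firewall; repeat offenders never expire because each new offense resets their two-day clock.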
It’s not enough that I have WordPress plugins in place to stop the bad behavior from having any impact. I’m blocking them to prevent the constant leeching of bandwidth as well as preventing the scraping of my original content as much as possible.
I’ve been working on a list of every IP range that has actually accessed at least a single page on the website with an HTTP response code of 200. I’m actually quite surprised at how much reach this website seems to have. I’m getting visits from military installations and civil government offices as well as educational institutions and even software development companies. The only thing I can’t figure out is: what exactly are these entities looking for?
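Building that inventory amounts to pulling every client IP that ever received a 200 response out of the access logs. A minimal sketch, assuming the common Apache/Nginx combined log format (the regex below is an assumption about that layout, not the author’s actual tooling):

```python
import re

# Combined log format: ip ident user [timestamp] "request" status size ...
# This regex is an assumption about the server's log layout.
LOG_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) ')


def ips_with_200(log_lines):
    """Return the set of client IPs that received at least one 200 response."""
    seen = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group("status") == "200":
            seen.add(m.group("ip"))
    return seen
```

The resulting set of IPs would then be mapped back to their registered ranges (via WHOIS data, for example) to decide which networks are military, governmental, educational, and so on.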
Some of these entities are nothing more than crawlers for independent search engines, similar to Google or Bing. Speaking of Google, I wonder why they’re crawling (as noted in Webmaster Tools) more than 2000 pages every single day. It doesn’t make sense because I only have around 3000 pages if you include the category pages, which I have marked as “noindex”. What exactly is Google looking for that it doesn’t already have?
I plan to create lists of “bad actors” for others to use, but I really have to get through this immense list of IP addresses that continue to access this website or the lists won’t be complete. In other words, I don’t like doing things two and three times because I failed to do them right the first time.
Trust me when I tell you, my lists will allow anyone to reduce the number of “attacks” made on their websites. It will be up to you, of course, which lists you’ll use. Due to local brownouts and similar interruptions, compiling the lists is taking longer than I originally anticipated.
Believe me, I grow weary of seeing the same IP addresses connecting, trying to publish spam comments, trying to register when registration is turned off, trying to log in as “admin” (and there isn’t an “admin” user), attempting to find exploits in the software, and scraping the content. Website owners shouldn’t have to worry about things like this, because it distracts us from doing what we should be doing: creating content.
I’m currently going through the US-based IP address lists and perhaps I’ll start publishing the block lists after I finish with each country. Don’t hold your breath, of course, because that also depends on whether I can keep doing it day after day.