
You know you're committed to your project when...

nt81

Top Contributor

Your server has been running a script to glean information on 300,000 domains, and it is still only halfway done after 3 weeks of running 24x7.

But the reward will be worth it. Ask me in a month, haha
 

nt81

Top Contributor
I'm learning bash scripting at the moment, but the script running the lookups right now is PHP/MySQL on CentOS 6 x64.

But I'm rate limited by the APIs that I'm working with, and I'd rather not flood them / get banned for it.

One worker thread checking about 10 domains per minute. Slow going!
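For a rough sense of the numbers, here is a back-of-envelope sketch in Python (not the PHP script that is actually running). The function names are made up for illustration, and it assumes "half way" means about 150,000 of the 300,000 domains.

```python
def estimate_days(total_domains: int, domains_per_minute: float) -> float:
    """Days needed to process total_domains at a steady per-minute rate."""
    minutes = total_domains / domains_per_minute
    return minutes / (60 * 24)


def effective_rate(domains_done: int, days_elapsed: float) -> float:
    """Average domains per minute actually achieved over days_elapsed."""
    return domains_done / (days_elapsed * 24 * 60)


if __name__ == "__main__":
    # Full run at the quoted 10 domains per minute.
    print(round(estimate_days(300_000, 10), 1))
    # Assumed progress: about half the list after 21 days.
    print(round(effective_rate(150_000, 21), 2))
```

At a steady 10 per minute the full 300,000 would take about 20.8 days, so being only half done after 3 weeks points to an effective rate nearer 5 per minute, presumably because of waits and rate limits.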

I got sick of my VentraIP hosting stalling all the time and signed up for a VPS with another supplier and haven't had an issue since.

My current love is for working with APIs and PHP/MySQL - so many interesting APIs out there :)
 

m8e

Top Contributor
Nice! Bash+API = powerful

Good not to flood the service you're grabbing data from TOO much.

Trick is to find the balance between speed and not getting banned.
So you are doing one request every 6 seconds then?

Could also throw in a randomized wait between requests so it's not just a constant jackhammer rhythm.

I've done that on a few web scraper projects.
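A randomized wait could look something like this in Python. It's a minimal sketch only: `jittered_delay`, `polite_loop` and the `fetch` callback are hypothetical names, and the 6 second base with up to 3 seconds of jitter are assumed values, not anything from a real API's terms.

```python
import random
import time


def jittered_delay(base_seconds: float, jitter_seconds: float, rng=random) -> float:
    """Return the base wait plus a random extra in [0, jitter_seconds]."""
    return base_seconds + rng.uniform(0, jitter_seconds)


def polite_loop(items, fetch, base_seconds=6.0, jitter_seconds=3.0, sleep=time.sleep):
    """Call fetch(item) for each item, sleeping a jittered delay between calls."""
    results = []
    for index, item in enumerate(items):
        if index:  # no need to wait before the very first request
            sleep(jittered_delay(base_seconds, jitter_seconds))
        results.append(fetch(item))
    return results
```

The `sleep` argument is only there so the loop can be tried out without actually waiting; in normal use the default `time.sleep` does the pausing.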


Since you are grabbing such a huge data set, if you are interested in speeding things up I would consider splitting it over a few virtual machines on different IPs to share the load.

Depends on your API access though... if it's linked to the one account then it won't matter. But on the other hand, if it's public data that's anonymous open access for all, split it up across a few cloned VMs.
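One way to divide the list so each VM gets its own slice with no overlap is to hash the domain name. This is just a sketch of the idea in Python: `shard_for` and `split_domains` are hypothetical helpers, and the number of workers is whatever you end up running.

```python
import hashlib


def shard_for(domain: str, num_workers: int) -> int:
    """Map a domain to a worker index in [0, num_workers), stable across runs."""
    digest = hashlib.sha256(domain.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def split_domains(domains, num_workers: int):
    """Group domains into num_workers lists, one list per VM."""
    shards = [[] for _ in range(num_workers)]
    for domain in domains:
        shards[shard_for(domain, num_workers)].append(domain)
    return shards
```

Because the hash is stable, every clone can run the same code over the same master list and simply keep the domains whose index matches its own number.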

Could get fancy with shared backend MySQL on another server even.

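Or just combine the results at the end via one-way replication: make each VM node a MySQL master and have it publish back to your central data collector, which is a slave to all the bots (one slave following several masters needs multi-source replication, which MySQL only gained in 5.7).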

http://dev.mysql.com/doc/refman/5.0/en/replication.html

Assuming you have unlimited time and budget to set this all up ;)
 

nt81

Top Contributor
Yeah, I looked at slaving a couple of cheap hosting accounts, spread around the place... but I'm not going to push it.

Script does 5 domains at a time, with a random wait of 5 to 12 seconds in between. Would be easy to spot, but I'm not leeching :p
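In Python terms (the real script is PHP, so this is only an illustration of the same pattern), batches of 5 with a 5 to 12 second random pause might be sketched like this. `batches`, `run_batches` and the `lookup` callback are hypothetical names.

```python
import random
import time
from typing import Callable, Dict, Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of up to `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover partial batch at the end
        yield batch


def run_batches(domains: Iterable[str], lookup: Callable[[str], object],
                batch_size: int = 5, min_wait: float = 5.0, max_wait: float = 12.0,
                sleep=time.sleep, rng=random) -> Dict[str, object]:
    """Look up domains batch by batch, pausing a random time between batches."""
    results: Dict[str, object] = {}
    for number, batch in enumerate(batches(domains, batch_size)):
        if number:  # pause only between batches, not before the first
            sleep(rng.uniform(min_wait, max_wait))
        for domain in batch:
            results[domain] = lookup(domain)
    return results
```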

I go live in August, so I have plenty of time to catch up on the data I need to capture.

VPS #2 is on the way, probably tomorrow. This one is going to be in the USA. I've spent the last 6 hours researching it, so many VPS hosts out there :S
 
