ColdFusion RAID databases (or database pools)
You've tuned your databases queries, added all of the indexes you can think of and cached all of the queries you can but your database still isn't giving you the performance you need.
So it's time to split up your database into several databases and move each of these new databases off onto separate database servers. Or perhaps a RAID database is the solution to your performance issues.
With RAID hard disk mirroring you have multiple hard disks each with the same information on them. If you want to read information off the RAID disks you can read it off any of the disks and if you want to write information you need to write it to all disks. Basically you can do X times as many reads in the same amount of time for a X disk system as a one disk system. ie 2 mirrored hard disks give twice the read performance, 3 hard disks give three times the performance, etc etc
You can use a similar technique with databases. If you have multiple copies of the same database on separate servers you can implement this fairly easily in ColdFusion.
Building on the data source name singleton code I wrote the other week we have:
DSN.cfc
2 <cfset variables.DSN = arraynew()>
3 <cfset variables.length = 0>
4
5 <cffunction name="getDSN" access="public" returntype="string" output="false">
6 <cfset var i = RandRange(1,variables.length)>
7 <cfreturn variables.DSN[i]>
8 </cffunction>
9
10 <cffunction name="getAllDSNs" access="public" returntype="string" output="false">
11 <cfreturn arraytolist(variables.DSN)>
12 </cffunction>
13
14 <cffunction name="setDSNs" access="public" output="false">
15 <cfargument name="DSNs" type="string" required="yes">
16
17 <cfset variables.DSN = listtoarray(arguments.DSN)>
18 <cfset variables.length = arraylen(variables.DSN)>
19 </cffunction>
20</cfcomponent>
For any select query with just use the getDSN method like before:
2
3 <cfquery name="myquery1" datasource="#dsn.getDSN()#">
4 select ....
5 </cfquery>
For any update, insert or delete queries you must alter all databases so you need to loop over them like so:
2
3 <cfloop item="ds" list="#dsn.getAllDSNs()#">
4 <cfquery name="myquery1" datasource="#ds#">
5 update ....
6 </cfquery>
7 </cfloop>
This technique would only be of benefit for databases that have more reads (ie selects) than writes(ie updates and deletes), but for most web applications that is the case.
The DSN component code could be modified so you can mark data sources as offline and flags datasources that are currently out of action. That way your system will continue to run off the other databases if one (or more) of them are down.
It's also fairly easy to add additional databases to cope with increased load to a running system.
One issue with this method is re syncing the databases once one of them has gone offline and then come back online again. It would be possible to do this re syncing via code by logging and rerunning all SQL update, insert and delete queries but it's probably better to just re sync with your native database tools.
Quick question: Is there a drawback (other than being slightly more complex) of keeping a "busy" flag for each datasource and picking a not-busy one as opposed to picking one randomly?
Good idea with the busy flag. To keep track of of busy databases you could a) write your own custom tag and use that instead of cfquery or b) use cflock as you need to mark them as busy at the start of the query and not busy at the end.
However I'm not sure that you would gain a lot from it as most databases are built to handle simultaneous requests. You could count requests and share them out more evenly but randrange would give a good approximation anyway. You would need to test the system under load to see if it would help or not.
As for the busy flag - I am aware of simultaneous requests being handled by the DB (and CF, for that matter) better than I could, or would want to. I was speaking rather to only the CF side - if we are to choose a random one, why not just make sure its not "busy" first? (and on that note, not relying on locking it, since that won't give us an "isBusy" flag to pick the next not busy one)
I can see how randomly picking one would be a good approximation, but I was wondering specifically about the value of keeping a busy flag within our CF code to ensure we never hit a "busy" db (unless the pool was full, of course).
I'm not convinced there is value, but your post was timely because I've been thinking about doing what I've described lately, and not just for DBs, but pools of objects in general...
DK
A quick google turned up
http://www.howtoforge.com/loadbalanced_mysql_clust...
a MySQL article. How these tools do their job I don't know for sure. I'd imagine all cluster members could manage the data on a SAN space communicating to each other about locks and such.
DK
The above code gives you a simple way to load balance with any database at a low cost ie nothing extra to buy, install or configure other than standard database servers. Would it be suitable for everyone/every application? Probably not, but them not all of us are as big as Adobe or MySpace.
The load balanced mysql is interesting but looks complex to set up and maintain. eg How would you add an extra server if you needed it? With the code above it would be a matter of just adding a new datasource to the list in the application scope (as long as the DB was set up and populated).