Back in the days of DOS, shutting down a machine was merely about turning off the power, and at times, directly the mains. One didn’t have fancy file systems that needed maintaining consistency and atomicity. If something was bad, you just ran chkdsk or scandisk to fix it. Consistency was maintained very optimistically – we just assumed the best and prepared for the worst. If files got lost, we would have xcopy’d backups elsewhere and we’d compared the whatever little was saved against our backup and do a merge….. sounds very familiar doesn’t it?
Then came relational databases, normalization, concurrency control, etc. All kinds of fancy computer science was dedicated to complex programming techniques and systems that were debugging nightmares. As if designing and maintaining a system that acquires locks over time, then makes changes, and rolls back failed transactions wasn’t fun enough, we went ahead and created distributed versions of those systems. Either everything was in realtime, synchronous and consistent, or nothing was happening at all. But this can go only so far – trying to maintain redundancy and failover in such a complex system isn’t just non-trivial, but impossible on the scale the web today requires.
Then again, all the million people on your site don’t really operate on the same data at the same time – and perhaps such tight locking across all resources wasn’t only unnecessary, but also inefficient. Perhaps you could have two broken databases that would periodically exchange information with each other about the changes they underwent in the meantime. After many years of research in locking and concurrency control, and write-ahead logging, and such stuff, we ended up with the same model we used in the early 80′s – sync!
It’s not only a lot simpler to implement, maintain, fix, patch, test, but also a lot simpler to understand and use as well. Moreover, every user, every device, every application, has it’s own version of their personalized history. It allows every user to view what’s going on in the world from their own point of view (Obiwan Kenobi, et al).
Yesterday evening I met a friend who is in India for a couple of months and the HoD of Fergusson College’s CS Dept over dinner. We began discussing the kind of cool projects kids could do for fun over the summer, and Atul mentioned beginning by creating a simple MPI cluster out of all machines in the lab (for the uninitiated, for kids in B.Sc. computer science, it would be a first in the history of the course.) Since we’ve both worked-on/interested-in loosely coupled massively scalable distributed architectures (and my project at ISSC was to attempt to build one called YODA), we all began pushing our agenda (before you assume anything, “sync” wasn’t really my original agenda).
The outcome came to be something so awesome and cool that we’re meeting up on Monday (16th of June 2008) to discuss something constructive. As with anything I do, I’m on the lookout for enthusiastic crazy kids who want to do something fun just for the heck of it.
Mrs. Page, the HoD, wanted some real tangible experience for her students. I still have the itch to complete YODA to fruitition – I still believe it was the most awesome project I ever worked on with the best project group in my life with a lot of potential. Atul has his interests in high-perf computations for language processing (the perfect candidate for MapReduce).
The idea is exactly what YODA was all about, but perhaps at a less ambitious level – handing off the network/node/data management jobs to MPI (yes, we take the hit to have a geographically localised cluster). To have a thin “loading” framework on top (my agenda) so that you can submit “jobs” (concepts stolen from BOINC) which may be of the MapReduce type (Atul’s interest), but with one major twist to it – instead of the conventional master-slave topology, why not go for a peer-to-peer sync topology? (perhaps working on Live Mesh has made me a religious sync fanatic)
Look at it this way – today, MapReduce/BOINC keeps working if any/all worker nodes go down. What happens if the entire data center went down? Including all your master nodes? The last few months, if Ray Ozzie has taught me anything, it is that you can add all the UPSes you want and all the security guards you want, but the only way to truly protect data is to simply replicate it as far and wide as possible. Its cheap, it’s effective, and it’s simple. GFS (Google File System) claims to do it. Then why not do it for compute-intensive tasks to share merge results?
Let’s say a BOINC-like cluster where every machine is a peer. You “inject” your task onto one of these machines and based on availability of bandwidth the task starts replicating across all machines like a torrent (the parts that have not gone out before, go out first). When any node finishes it’s processing, it simply “sync’s” the results with any peer nodes that may have completed results. When it has any meaningful results that it can merge (“Reduce”), it does so, or it acts as a proxy to sync those results further through the swarm. (You may notice this follows a lot of principles of git as well.)
Instead of being “assigned” a workunit by a master, each peer simply picks up a random workunit out of the ones it currently has but has no recorded solution for. When it gets more than one unit that is mergable, it merges them. This is like dynamic-programming on stereoids. If the bittorrent implementation is done right, each node should have as minimum overlap as possible, and yet, we do need some overlap which will be a function of the design as opposed to any deliberate actions by the master. (The overlaps are used to validate results in case something went wrong.)
Just imagine if your college lab turned into a swarm of independently operating units who are sync’ing with each other and processing results, and when you go at any node in the system, you have part of the answer. Depending on how long you give it, every node will have the final answer in the end. You can naturally optimize by assigning a “special” result node which periodically gets partial results from all boxes in the swarm and thereby create a weak master.
The good part is, if you have multiple problems to solve you just go on injecting them into any node you have accessible. Over time, you go to any node you can find and get results. All without any fancy coding to maintain consistency, accuracy, integrity, etc.
Naturally, there may have been things I’ve overlooked, but for a 10-minute discussion, doesn’t seem like a bad idea for some college students to try out as a few-weeks project. In case you’re interested, let me know. All debates, arguments, flames, criticisms are appreciated.