Wednesday, August 17, 2011

Parallelism On The Cloud And Polygot Programmers

I am very passionate about the idea of giving developers the control over parallelism without them having to deal with the underlying execution semantics of their code.

The programming languages and the constructs, today, are designed to provide abstraction, but they are not designed to estimate the computational complexity and dependencies. The frameworks such as MapReduce is designed not to have any dependencies between the computing units, but that's not true for the majority of the code. It is also not trivial to rewrite existing code to leverage parallelism. As, with the cloud, when the parallel computing continues to be a norm rather than an exception, the current programs are not going to run any faster. In fact, they will be relatively slower compared to other programs that would leverage parallel computation. Robert Harper, a Professor of Computer Science at Carnegie Mellon University recently wrote an excellent blog post - parallelism is not concurrency. I would encourage you to spend a few minutes to read that. I have quoted a couple of excerpts from that post.

"what is needed is a language-based model of computation in which we assign costs to the steps of the program we actually write, not the one it (allegedly) compiles into. Moreover, in the parallel setting we wish to think in terms of dependencies among computations, rather than the exact order in which they are to be executed. This allows us to factor out the properties of the target platform, such as the number, p, of processing units available, and instead write the program in an intrinsically parallel manner, and let the compiler and run-time system (that is, the semantics of the language) sort out how to schedule it onto a parallel fabric."

The post argues that language-based optimization is far better than machine-based optimization. There's an argument that the machine knows better than a developer what runs faster and what the code depends upon. This is why, for relational databases, the SQL optimizers have moved from rule-based to cost-based. The developers used to write rules inside a SQL statement to instruct the optimizer, but now the developers focus on writing a good SQL query and an optimizer picks a plan to execute the query based on the cost of various alternatives. This machine-based optimization argument quickly falls apart when you want to introduce language-based parallelism that can be specified by a developer in a scale-out situations where it's not a good idea to depend on a machine-based optimization. The cloud is designed based on this very principle. It doesn't optimize things for you, but it has native support for you to introduce deterministic parallelism through functional programming.

"Just as abstract languages allow us to think in terms of data structures such as trees or lists as forms of value and not bother about how to “schedule” the data structure into a sequence of words in memory, so a parallel language should allow us to think in terms of the dependencies among the phases of a large computation and not bother about how to schedule the workload onto processors. Storage management is to abstract values as scheduling is to deterministic parallelism."

As far as the cloud computing goes, we're barely scratching the surface of what's possible. It's absolutely absurd to assume that the polygot programmers will stick to one programming model and learn to spot difference between parallelism and concurrency. The language constructs, annotations, and runtime need to evolve to help the programmers automate most of these tasks to write cloud-native code. These will also be the core tenants of any new programming languages and frameworks. There's also a significant opportunity to move the existing legacy code in the cloud if people can figure out a way to break it down for computational purposes without changing it i.e. using annotations, aspects etc. The next step would be to simplify the design-deploy-maintain life cycle on the cloud. If you're reading this, it's a multi-billion dollars opportunity. Imagine, if you could turn your implementation-specific concurrent access to resources into abstract deterministic parallelism, you can indeed leverage the scale-out properties of cloud fairly easily since that's the guiding principle behind the cloud.

There are other examples you would see where people are moving away from an implementation-centric approach to an abstraction that is closer to the developers and end users. The most important shift that I have seen is from files to documents. People want to work on documents; files are just an instantiation of how things are done. Google Docs and iPad are great examples that are document-centric and not file-centric.

Photo courtesy: bass_nroll on Flickr

No comments: