Archive for August 2007
I first encountered the programming language literature as an undergraduate at Yale. Up until that time I’d only ever read programming books or manuals (I still recall hanging out in the backyard of my parents’ house one warm summer devouring the original K&R C book, raving to my bemused parents about how cool enumerations were). But once I got into my junior and senior years at college, I got exposed to the research community. And I started digging through the SIGPLAN Notices stacked up in the computer science building, and reading other students’ theses on Linda (David Gelernter was my thesis advisor – this was a couple of years before he became one of the Unabomber’s victims).
This started a habit that has only gotten deeper over the years. Once the Internet hit it big, I realized that now there was a limitless fount of interesting research at my fingertips. And indeed that’s proven to be true — as you’ve already noticed, a lot of my posts here are driven by cool papers I’ve read recently, and you can confidently expect that to continue. Reading research papers is pretty much the best way to stay on top of the cutting edge of the software world.
So this post is about my personal mainlines for cool research; if you’ve got any to add, please comment, because I’m always on the prowl 🙂 My main interests are in (among others) programming languages, type systems, static analysis, metaprogramming, security, distributed systems, scalability, and reliability. So these are obviously skewed in that direction.
First off, the expensive ones. USENIX is probably the number one organization for research about systems in practice, tackling OS issues that have real-world implementations. It’s no wonder Google’s a major USENIX sponsor. And, of course, there’s the Big Daddy, the ACM Digital Library — it’s moderately expensive to be a member, but as a one-stop shop for a huge variety of conferences and research papers, it’s pretty much peerless. Most of the for-pay papers I want to read can be found on either the USENIX or ACM sites. It’s often easy to Google various papers that are also in ACM/USENIX publications, but sometimes it’s nice not to have to bother, and other times the papers really are embargoed.
After that, the most interesting research site bar none is Lambda the Ultimate. For programming language research, complete with an insightful and literate community, this blog wins the internets.
Then we get down into the major corporate research sites. First, the big disappointment of the year — Microsoft Research, once a very worthwhile site, recently crippled themselves with a redesign that removed all access to the chronological list of all their research publications! The only exposure they now give is to the last dozen or so papers published, and a search dialog that doesn’t seem to work for anything. What were they thinking?! And not only that, it’s impossible to find a direct contact email address for them. Sigh. I should talk to some of my Microsoft friends about that….
Memo to all research departments: THE NUMBER ONE FEATURE YOU IMPLEMENT MUST BE A FULL CHRONOLOGICAL ARCHIVE OF ALL YOUR RESEARCH PUBLICATIONS. Without that, you are nigh worthless to people wanting to learn what your researchers are doing in depth.
Anyway, leaving them aside, there’s Sun Research (low volume but often interesting), IBM (frustrating to browse their archive), and Google’s papers page (which gets updated pleasingly often, and which is a model of simplicity and accessibility). Aaaaand… those are my usual suspects.
Then there are the blogs of various programmers, such as Gilad, Crazy Bob Lee, and the Hibernate guys… but those often tend to be pretty low volume for my tastes. When I crave research, I crave it by the gallon!
I eagerly solicit any and all sources of more cool new software research and/or development. (Or even programmer blogs that are updated consistently.) Please please lay ’em on me in the comments, and I’ll cycle them into future addenda to this post. Enjoy!
[Edit, a month later: Turns out Microsoft Research’s RSS feed is the place to go for the stream of recent results. That’ll have to do for now, though one of my MS friends says he’ll put a bug in their ear.]
Dave Winer had a good post a while back about Stop Energy. He defines it as the drive to naysay, to oppose. He doesn’t use the opposite term Go Energy, but I will: Go Energy is what it takes to move forward, to make something happen.
Dave was characterizing Stop Energy (and its converse) in terms of community dynamics, saying that some people in a community (software community especially) are driven by Go Energy, and others are driven by Stop Energy. Seems to me that’s a stretch; in any community there are differences of opinion about which direction to go, and one person’s Go Energy can be another person’s Stop Energy. But actually this post isn’t about communities at all.
I find that my own personal time is very driven by my own personal supply of Go Energy (“Go” for short). I’ve got a built-in pool of Go. When it’s fully charged up, I can hack like crazy on my own time — I had a whole lot of Go this spring when I did my GWT/Seam hacking.
More recently, my Go has had another focus altogether — my wife had a baby four weeks ago. Leading up to that, the subliminal anxiety of late pregnancy was definitely draining my pool. The odd part is that it wasn’t completely emptied — there was one particular work task that I really wanted to get done, and so even while taking three weeks of paternity leave, I had enough pre-existing, pre-baby Go to carry me through completing that work.
Now, though, my leave is over, I’m back at work, and my personal Go has got up and gone 🙂 It’s going to be a while before I build up enough Go to start doing some personal hacking. Blogging I can do on my hour-long train commute, and it also requires less Go than hacking, so I’ll be continuing that, but open source work is going to be on hold. Sorry, Jason E., Dan M., and others who were looking forward to having me back! The rest of 2007 is likely going to be fairly low-key.
Thanks for your patience, and I hope the blog tides you all over until the Hack is Back! I’ll try to stick to posting here every two weeks, now that the baby’s sleeping a bit better.
A while ago (circa 2000), Rob Pike wrote a rant about how systems software research is becoming irrelevant, due to the difficulty in getting new operating systems adopted.
Since then, he’s gone on to work at Google, where systems software research is the lifeblood of one of the (if not the) most important Internet sites in history. And he’s happily doing plenty of systems software research that’s become fundamental to the company’s operations.
So his original concern about the irrelevance of operating system research got effectively sidelined, because the action moved from single-machine operating systems to wider distributed systems, especially as used at Google. And Google is as good as anyone at turning research ideas into production practice.
Meanwhile, Jim Waldo at Sun last year wrote another paper — a little less of a rant, but not much — about how systems software design is suffering greatly, largely from the lack of opportunity to learn from experience. Waldo makes good points about the difficulty of teaching system design except through example and experience.
His main concern is that opportunity to learn by doing is very hard to come by. In academia, systems tend to be small and rapidly discarded, due to the need to publish frequently and produce results quickly. In industry, systems tend to be proprietary, encrusted by patents, and impossible to discuss or talk about publicly. This leaves only limited latitude for public construction or discussion of systems large enough and interesting enough to really learn from.
Waldo suggests that open source projects are one of the few ways out of this dilemma. They are in many cases fairly large in scope, they are fully visible to anyone wishing to critique, extend, or adapt them, and they provide not only a code base but (in the best cases) a community of experienced designers from whom new contributors can learn. They therefore are in some ways the best hope for spreading effective education about system design, being unencumbered by either the short-term problems of academia or the proprietary problems of industry.
Recently, coincidentally enough, some Googlers working on the Google lock service — a key part of Google’s distributed infrastructure — wrote a paper describing their experiences building a production implementation of the Paxos protocol for distributed consistency. What’s especially interesting about this paper is how neatly it both decries and embodies the very dilemma Waldo is talking about.
The Google Paxos paper has a lot of extremely interesting technical content in its own right. It’s one of my favorite types of papers — a discussion of problems encountered when trying to take compelling theory and make it into something that really works in a live system. Without that kind of effort, excellent ideas never actually get their chance to make a difference in the world, because until they’re embodied in a real system, they can’t deliver tangible value. So this paper is very useful to anyone working on implementations of the Paxos protocol — it’s exactly the kind of experience that Waldo wishes more people could learn from.
The writers themselves have the following gripes:
Despite the large body of literature in the field, algorithms dating back more than 15 years, and the experience of our team (one of us has designed a similar system before and the others have built other types of complex systems in the past), it was significantly harder to build this system than originally anticipated. We attribute this to several shortcomings in the field:
- There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.
- The fault-tolerance computing community has not developed the tools to make it easy to implement their algorithms.
- The fault-tolerance computing community has not paid enough attention to testing, a key ingredient for building fault-tolerant systems.
As a result, the core algorithms work remains relatively theoretical and is not as accessible to a larger computing community as it could be. We believe that in order to make a greater impact, researchers in the field should focus on addressing these shortcomings.
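To appreciate the gap the authors are describing, it helps to see how small the bare algorithm really is. Here is a minimal sketch of single-decree Paxos (my own illustrative names, not code from the Google paper): even this toy version has the subtlety that a new proposer must adopt any previously accepted value, and everything beyond it (multi-decree logs, leader leases, disk corruption, membership changes) is exactly the pile of "ideas scattered in the literature" they had to assemble on their own.

```python
# Toy single-decree Paxos, for illustration only. Omits networking,
# persistence, leases, and every real-world extension the paper covers.

class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        """Phase 2b: accept unless we've promised a higher number."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "reject"


def propose(acceptors, n, value):
    """One proposer round: returns the chosen value, or None if no quorum."""
    quorum = len(acceptors) // 2 + 1

    # Phase 1: collect promises from a majority.
    replies = [a.prepare(n) for a in acceptors]
    granted = [acc for tag, acc in replies if tag == "promise"]
    if len(granted) < quorum:
        return None

    # The key subtlety: if any acceptor already accepted a value, we must
    # propose the value of the highest-numbered accepted proposal, not ours.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]

    # Phase 2: ask the majority to accept.
    acks = [a.accept(n, value) for a in acceptors]
    return value if acks.count("accepted") >= quorum else None
```

Even in this sketch, a second proposer arriving with a higher number and a different value converges on whatever was already chosen; that safety property is what all the production machinery has to preserve while also being fast, durable, and testable.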
The ironies here are so deep it’s hard to know where to start. Their implementation itself is not only proprietary to Google (and not open sourced), but it also relies on many other proprietary Google systems, including the Google file system. Hence their work itself is not directly available to the wider community for development and further discussion! Their paper has a number of interesting allusions (such as exactly why they needed to make their local log writing multi-threaded) that are not followed up. Unless they write many more papers, we will never know all the details of how their system works.
They criticize the fault-tolerant systems community for not having provided a more solid experience base from which to build. Waldo’s paper makes it crystal clear exactly why this base has been lacking: where is it to come from? Not from academia; research projects in academia tend to be too short-term and too small in scope to encounter the kinds of issues the Googlers did. And not from industry; Google is not known for open sourcing its core distributed software components, yet Google is arguably ahead of anyone else in this area!
The only alternative would be a true open source project. But large-scale distributed systems are probably among the least likely to achieve real momentum as an open source project, because actually using and testing them requires substantial dedicated hardware resources (many of the failure cases the Google team encountered arise only after running on dozens or hundreds of machines), and those resources are not available to any open source projects I’m aware of.
The Googlers are part of the problem, even while their paper seeks to be part of the solution. To some extent it’s a chicken-and-egg dynamic; without access to a truly large pool of machines, and a truly demanding set of developers and applications, it’s hard to get real-world experience with creating robust distributed infrastructure — but you almost have to be inside a large business, such as Google, in order to have such access at all.
So, unfortunately, it would appear that in the near term the Googlers are doomed to disappointment in their expectations of the research community. Google itself is likely to remain the preeminent distributed systems research center in the world, and the fewer of its systems it open sources, the less assistance the rest of the world will be able to provide it.
One can only hope that several years from now, Google’s applications will have evolved so greatly on top of its base infrastructure that it will no longer consider the fundamental systems it uses — MapReduce, BigTable, GFS, Chubby — to be key competitive differentiators, and will choose to open source them all. Of course, by then Google’s real difficulties will still be with problems the rest of the world wishes they had the resources to encounter….
A coda to this: John Carmack, of id Software and Armadillo Aerospace fame, is known for open sourcing his game engines after five years or so. Recently he’s been doing work in mobile games, cellphone programming. Here’s a quote from a liveblog about his keynote at Quakecon last week:
Met with mobile developers at the Apple thing, all talking about how they make mistakes all the time. Carmack: “Can’t the guys who made the mistakes the first time just make the chips right this time?” Other devs: “Yeah, but most of those guys are too rich to care anymore.”
So that’s the other reason the field doesn’t make good progress… proprietary stuff gets built, developers get rich, technology gets sold and eventually back-burnered, and then it all has to get reinvented all over again. Open source: the only way to not reinvent the wheel every five years!
[Crossposted to both my blogs]
I’m overjoyed to announce that at 2:05 AM on Monday, July 30, our son Matthew Thomas was born.
He’s a 9 pound 5 ounce bundle of joy. He’s sleeping well, nursing a lot (gaining his ounce a day — hardly lost any weight after birth), and his big sister loves him, as you can see 🙂
Michelle, my wife, is resting and recovering, along with taking wonderful care of him. I’m taking time off work to care for them both, as well as for our daughter Sophie.
We are incredibly blessed. And what’s more, our family is now complete. After having two oversized kids, with difficult births both times, we’re done… we’re going to count our blessings, quit while we’re ahead, and move from child-birthing to child-rearing for good.
Thanks to everyone who’s sent congratulations and other good wishes.
What’s more, we’re even getting down to enough of a routine that I can think about blogging and hacking again 🙂 It’d go a bit more quickly if I hadn’t just gotten a sore throat… but anyway, stay tuned.