I’ve been meaning to write a post about DTrace, and Tim Bray’s tweet finally got me moving. It looks like some people are trying to make DTrace a topic for this year’s Linux Kernel Summit. I hope they succeed. I also hope that those folks pushing for user level tracing have their voices heard. I was amused to read one of the messages which claimed that DTrace is:
DTrace is more a piece of sun marketing coolaid which they use to beat us up at every opportunity.
My experience at Sun thus far is that people generally don’t really appreciate the benefits of DTrace. It stems from a view that I also saw in the LKS threads, which is that DTrace (and tools like Systemtap) is a tool for system administrators, because it reports on activity on the kernel. That’s not how I look at it. DTrace is a tool for dealing with full system stack problems, which initially manifest themselves as operating system level problems. The fact that DTrace can trace user land code as well as kernel code is what makes it so important, especially to people building and running web applications. Because of all the moving parts in a complicated web application (think relational database, memcached or other caching layers, programming language runtime, etc), it can be hard to debug a web application that has gone awry in production. Worse, sometimes the problems only appear in production. Tools which cut across several layers of the system are very important, and DTrace provides this capability, if all the layers have probes installed. When a web application goes wrong in production, you see it at the operating system level – high usage of various system resources. That’s where you start looking, but you will probably end up somewhere else (unless you are ace at exercising kernel bugs). Perhaps a bad SQL query or perhaps a bad piece of code in part of the application. A tool that can help connect the dots between operating system level resource problems and application level code is a vital tool. That’s where the value is.
One of the cooler features of DTrace is that you can register a user level stack helper (a ustack helper), which can translate the stack in a provider specific manner. One cool example of this is the ustack helper that John Levon wrote for Python, which annotates the stack with source level information about the Python file(s) being traced. On an appropriately probed system, this would mean that you could trace the Python code of a Django application, memcached, and your relational database (PostgreSQL and soon MySQL). That would be very handy.
I’d love to see DTrace on Linux, because I have it on OS X and it’s in OpenSolaris and FreeBSD, but I’d also be happy to see SystemTap get to the point where it could do the same job.