Future of Statistical Programming

One of the things I spend the most time thinking about is the future of statistical programming, especially statistical programming for novices. My "grand vision" is unlikely to be fully realized any time soon, but I'm constantly thinking about the pieces that would be necessary to put it together.

First experiments

With the Communications Design Group, I've been thinking about this issue and trying to frame some experiments that could show the power of the sort of tool I imagine as the future of statistical computing. A colleague at CDG, Aran Lunzer, has done some wonderful work illustrating some of the features we hope would be possible in such a tool.

To read more about my grand vision, and the theory behind this implementation, see below.

Tools for learning statistics versus tools for doing statistics

The basic idea is that there's a gap between the tools we use for teaching/learning statistics, and the tools we use for doing statistics. Worse than that, there's no trajectory to make the connection between the tools for learning statistics and the tools for doing statistics. I think that learners of statistics should also be doers of statistics. So, a tool for statistical programming should be able to step learners from learning statistics and statistical programming to truly doing data analysis.

When I refer to tools for learning statistics, I mean things like applets, TinkerPlots, and Fathom. I have nothing against these tools-- I think that they do a great job of teaching statistical concepts. But, as with any new software tool, they take some cognitive effort to learn, and I'm not sure that there's a great payoff for that effort. You can't put TinkerPlots on your resume, and if you actually want to apply the skills you've learned to real data, you need to learn another tool.

And when I talk about tools for doing statistics, I mean SAS, Stata, SPSS, Python, Julia, or R. These tools require some traditional "programming," but they allow for much more flexibility in what you can produce. If you want to do something that doesn't currently exist in the package, you can create it. Personally, I do most of my statistical programming in R. It's the statistical programming language that has the biggest community of users, and the widest variety of user-contributed packages. Python and Julia are getting mentioned more and more, but they're still not the tool of the majority.

And in fact, when I teach statistics, I'm often teaching R. In the 100-level statistics classes at UCLA (those for undergraduate statistics majors) that's the tool of choice, and after years of trial-and-error, we're also using it in Mobilize. This can be a huge challenge, because R wasn't exactly designed for ease of learning. In order to make it easier on teachers and students, we've created an R package of wrapper functions, MobilizeSimple, which simplifies tasks like text analysis and map-making. But there are still many idiosyncrasies to R, like the $ syntax versus the formula (model) syntax: plot(df$x, df$y) and plot(y ~ x, data = df) plot the same points. Since both syntaxes are technically correct, there's no consensus in R code, and learners have to deal with both.

Bridging the gap

So, the first thing that a future tool for statistical programming should do is bridge the gap between learning and doing. I think that a good tool should be able to ease people into programming, using some sort of visual, drag-and-drop interface that exposes novices to the entire trajectory of data analysis (data import, cleaning, new variable creation, plots, summary statistics, models). Then, it should have a way to make the transition to more traditional or textual coding. Ideally, one could look back and forth between the visual representation and the textual one, and a change in one interface would be reflected as a change in the other.

Of course, the question is how that would work in practice-- would users have to flip back and forth, clicking "save" or "run" after every change to view the impact of the change? In the grand vision, of course not. The two would be intimately connected, and the textual code would be almost as interactive as the visual representation, with scrubbable values and instant reflection in the visual.
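One way to picture that connection, purely as a rough sketch and not a description of any existing tool, is to treat the visual blocks and the textual code as two views of a single shared list of analysis steps. Everything named here (Step, to_code, run, and the keep_rows and mean_of calls that appear in the generated text) is hypothetical and invented for illustration. Because both the text and the result are derived from the same list, "scrubbing" a value changes both at once, with no save or run button in between.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Step:
    """One block in a hypothetical visual interface."""
    op: str                        # "filter_above" or "mean" in this sketch
    column: str
    value: Optional[float] = None  # the scrubbable parameter, if any


def to_code(steps):
    """Render the shared step list as text (keep_rows/mean_of are made-up names)."""
    lines = []
    for s in steps:
        if s.op == "filter_above":
            lines.append(f"data = keep_rows(data, {s.column!r}, above={s.value})")
        elif s.op == "mean":
            lines.append(f"result = mean_of(data, {s.column!r})")
    return "\n".join(lines)


def run(steps, rows):
    """Evaluate the same step list against a list of row dictionaries."""
    result = None
    for s in steps:
        if s.op == "filter_above":
            rows = [r for r in rows if r[s.column] > s.value]
        elif s.op == "mean":
            result = mean(r[s.column] for r in rows)
    return result


rows = [{"height": 1.0}, {"height": 2.0}, {"height": 4.0}]
steps = [Step("filter_above", "height", 1.5), Step("mean", "height")]

print(to_code(steps))
print(run(steps, rows))

# "Scrub" the threshold: the text view and the result both follow.
steps[0].value = 3.0
print(to_code(steps))
print(run(steps, rows))
```

This only covers the direction from the shared steps out to the two views. Going from edited text back to the steps would need a parser, which is where most of the hard design questions live.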

Interaction at every level

I've said that I think there's a need for an interactive visual tool for learners to get started with, but I think it's equally important that the results of the analysis (no matter what level of statistician created it) be interactive. This means that all graphs should be zoomable, that it should be easy to change the data cleaning and see how that change is reflected in the analysis afterward, and that parameters should be easily manipulable. A use case for this is data journalism and/or academic publishing-- I firmly believe that data products should be accompanied by fully reproducible code, and that code should be interactive. That way, the audience (even if they don't know much about statistics) can play with the parameters and convince themselves that the data was not doctored.

Documentation as integral

In addition to interaction, I believe that a good statistical programming tool should encourage or require documentation at all steps of the way. Data science is so much about storytelling that I think it should be built into the process. And when I say documentation, I don't just mean the indecipherable comments that I am guilty of inserting into my code ("# Don't know what this does") but rather the supporting narrative that will surround the analysis when it's complete. Instead of encouraging a process where analysts create their data product first, and then go back and try to interpret it, a good statistical programming tool should create a major incentive to do the hard work of thinking as you go.

Use cases

The user that I think about most commonly is a novice-- that is, someone who does not know any statistics or programming when they begin. In my experience teaching people at a variety of levels about R, I believe that people who don't have either piece of pre-existing knowledge struggle the most. If you've already got some of the statistics, or you've already learned a programming language, that can ease the transition (at least, it can reduce your anxiety about learning). So my imagined user is someone who doesn't have either piece.

One major use case that I think about a lot is high school students and teachers, like those I interact with through Mobilize. These people are certainly novices as I define them, and they often have a lot of anxiety about learning R and data analysis.

Another use case is data journalists, who are used to telling stories, but again may not have expertise in statistics or programming.

Inspirations and influence