In honor of my 1,000th tweet, I decided to investigate what events contributed the most tweets to my count.
As I came up on my 1,000th tweet, I was wondering what my pattern of tweeting had been over the years. Since I joined Twitter in January of 2009, 1,000 tweets would average out to about half a tweet a day. So, obviously there must be some days I'm tweeting more than others! I had the intuitive sense that I'd been using Twitter more recently (certainly upgrading my phone in September of 2013 was a contributing factor), but I wanted to see how my use had changed.
To do that, I had to revisit my Python API explorations. The Twitter API had changed since I last used it, adding more authentication requirements. I haven't yet figured out how to make an app that uses the authentication, but it was simple enough to do it on my local computer. After authenticating, I just grabbed the dates of every tweet I'd ever sent. The only tricky part is the request limit of 200 tweets and the related pagination issues. Luckily, Twitter provides a resource on this very subject, so it was straightforward to overcome.
One detail that might make this impossible for other users is that Twitter will only let you get the 3,200 most recent tweets for a given user. So, doing this project on the occasion of 1,000 tweets actually makes sense!
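For readers who want to try the same thing, here is a minimal sketch of the pagination idea described above: request up to 200 tweets at a time, then ask for tweets older than the oldest one already seen, stopping at the 3,200 cap. This is not the code behind this post. The `fetch_page` callable and the fake fetcher are hypothetical stand-ins for an authenticated API call, so the sketch runs offline, and the tweet dictionaries contain only an assumed `id` field.

```python
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, int]
Fetcher = Callable[[int, Optional[int]], List[Tweet]]


def collect_timeline(fetch_page: Fetcher, page_size: int = 200, cap: int = 3200) -> List[Tweet]:
    """Walk backwards through a timeline one page at a time."""
    tweets: List[Tweet] = []
    max_id: Optional[int] = None  # None means "start from the newest tweet"
    while len(tweets) < cap:
        page = fetch_page(page_size, max_id)
        if not page:
            break  # nothing older is left
        tweets.extend(page)
        # Next, ask only for tweets strictly older than the oldest one seen.
        max_id = min(t["id"] for t in page) - 1
    return tweets[:cap]


def make_fake_fetcher(total: int) -> Fetcher:
    """Hypothetical stand-in for the real API: a timeline with ids total..1."""
    timeline = [{"id": i} for i in range(total, 0, -1)]  # newest first

    def fetch_page(count: int, max_id: Optional[int]) -> List[Tweet]:
        eligible = [t for t in timeline if max_id is None or t["id"] <= max_id]
        return eligible[:count]

    return fetch_page


all_tweets = collect_timeline(make_fake_fetcher(1000))
print(len(all_tweets))
```

Subtracting one from the oldest id before the next request keeps the boundary tweet from being fetched twice, which is the usual source of duplicates when paging this way.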
Once I had grabbed all the dates, I brought them into R to do some simple ggplot data visualization.
I knew I'd done some live-tweeting that would contribute to my tweet count, but I had no idea how significant those events would be! Annotated on the graph are the dates of the Computation + Journalism Symposium (#compj), eyeo festival (#eyeo2013) and a series of events on the subject of the Future of Statistics, including the Unconference, the UCLA department seminar and ASA working group webinar on big data. Many of these events are tagged with #futureofstats.
In addition to the timeseries plot, I decided to do a cumulative distribution so I could see the count hit 1,000.
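The two views above (tweets per day and the running total) come down to a count and a cumulative sum. The plots in this post were made in R with ggplot; the sketch below only illustrates the same bookkeeping in Python, using made-up dates rather than my real data. The function names are hypothetical.

```python
from collections import Counter
from datetime import date
from itertools import accumulate
from typing import List, Optional, Tuple

Row = Tuple[date, int, int]  # (day, tweets that day, running total)


def daily_and_cumulative(dates: List[date]) -> List[Row]:
    """Count tweets per day and attach a running total, oldest day first."""
    per_day = Counter(dates)
    days = sorted(per_day)
    running = accumulate(per_day[d] for d in days)
    return [(d, per_day[d], total) for d, total in zip(days, running)]


def first_day_reaching(rows: List[Row], target: int) -> Optional[date]:
    """Return the first day on which the running total reaches the target."""
    for day, _count, total in rows:
        if total >= target:
            return day
    return None


# Made-up dates for illustration only.
sample = [date(2013, 6, 5)] * 3 + [date(2013, 6, 6)] * 5 + [date(2013, 9, 1)]
rows = daily_and_cumulative(sample)
for row in rows:
    print(row)
print(first_day_reaching(rows, 5))
```

With real data, calling `first_day_reaching(rows, 1000)` would give the day the count crossed 1,000, and days with unusually high counts would point to the live-tweeting events.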
Hitting the 1,000 tweet mark felt pretty significant to me, but I decided to look at other users to put it in context. First, I wondered about the number of tweets that had been sent by each of the Twitter users I follow. Looking at the histogram, I was shocked to see that many people I follow have sent more than 25,000 tweets! I knew that Roger Ebert was a prolific tweeter before he passed away, but there were several accounts I followed that beat even Ebert's count. Several of these were obvious, like @nytimes, @TechCrunch, and @VirginAmerica, but of the people I follow, both @znmeb and @ibogost have tweeted more than Ebert (current counts for those two are 110,160 and 39,083 respectively).
This is an example of a long-tailed distribution. I tried cutting off the most extreme values, but only got another histogram with a few extreme values and the bulk of the data bunched up on the left.
I thought that users who followed me might have tweeted a little less than the accounts I follow, so I grabbed that data too. Surprisingly, this plot looked almost identical to the first one. The most prolific tweeters I follow (@znmeb and @ibogost) both follow me back, and again, if I try to cut off the tail, I just get a similar distribution again.
I thought that a boxplot might help me compare the number of tweets by people I follow versus people who follow me, so I made that plot too. Again, it's pretty hard to distinguish between the two groups, and there's plenty of overlap between the groups to confound things. Of the 247 people I follow, 64 of them follow me back (this means that of the 120 people who follow me, I follow 64 of them back). So those people are making the boxplots look more similar. However, I think that the plot shows that the users who follow me have tweeted a little less than the users I follow.
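The overlap is easy to quantify with sets. The sketch below uses placeholder integer ids (not real accounts) arranged to match the counts above: 247 accounts followed, 120 followers, and 64 in both groups.

```python
# Placeholder ids chosen only so the group sizes match the counts in the text.
following = set(range(1, 248))    # 247 accounts I follow
followers = set(range(184, 304))  # 120 accounts that follow me

mutual = following & followers          # accounts in both groups
following_only = following - followers  # I follow them, they don't follow me
followers_only = followers - following  # they follow me, I don't follow them

share_of_following = len(mutual) / len(following)
share_of_followers = len(mutual) / len(followers)

print(len(mutual), len(following_only), len(followers_only))
print(f"{share_of_following:.1%} of following, {share_of_followers:.1%} of followers")
```

Roughly a quarter of the accounts I follow, and just over half of my followers, appear in both groups, so the two boxplots share a lot of the same data points.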
After hearing several people, including Hadley Wickham (@hadleywickham), cry out for log scales, I decided to give them a go. Why logs? Well, we've got numbers on different orders of magnitude, and big gaps with no data. So taking the log of the data can make it easier to see everything all at once. I often resist log scales because I think they're harder for people to interpret (my regression class really struggled to interpret logged coefficients), but here you go.
First, the logged histogram, and then the logged violin plot. Now I think it's clear that the distribution for the people I follow is different from the one for the people who follow me. It's not exactly what I expected, but the people who follow me have the bulk of their distribution a little lower and include more people who have sent very few tweets. On the other hand, the people I follow have an interesting bimodal chunk around the bulk of the distribution, with more people at the extreme values (which matches my intuition).
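To see why the log helps, here is a small sketch that groups tweet counts by order of magnitude, which is roughly what a histogram on a log10 scale does. It is an illustration in Python, not the R code behind the plots. The counts 39,083 and 110,160 come from earlier in this post; the other values are made up, and the function names are hypothetical. Accounts with zero tweets are kept in a separate bin because the log of zero is undefined.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def order_of_magnitude(tweet_count: int) -> Optional[int]:
    """Return floor(log10(n)) for n >= 1, or None for accounts with no tweets."""
    if tweet_count <= 0:
        return None  # log10(0) is undefined, so keep these apart
    return math.floor(math.log10(tweet_count))


def bin_by_magnitude(counts: Iterable[int]) -> Counter:
    """Count how many accounts fall in each power-of-ten bin."""
    return Counter(order_of_magnitude(c) for c in counts)


# 39083 and 110160 are from the text above; the rest are made-up values.
sample_counts = [0, 12, 950, 1500, 39083, 110160]
bins = bin_by_magnitude(sample_counts)
for magnitude, how_many in bins.items():
    print(magnitude, how_many)
```

On the raw scale, a count of 110,160 squeezes everything under a few thousand into a single bar; on the log scale, each power of ten gets equal width, so the low and high ends are both visible.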
You can look at my data and code on GitHub or view the repository right here: