During university, whenever I wasn’t studying for my Journalism degree — which, let’s be honest about it, was most of the time — I found myself picking up new geek skills. I bought a book on PHP coding, and taught myself to write a couple of simple things. I think it’s safe to say that I’ve never written anything significant from scratch, although I did end up doing some pretty serious hacking into some WordPress themes, and I even wrote a couple of plug-ins myself in my time.
These days, though, whenever I make motions to go anywhere near production code, the software engineers do the only correct thing: They chase me off with a stick. I have no business deploying code.
Having said that…
I signed up for Twitter in December 2006 and dabbled on the platform here and there, but didn’t figure out what it was really for until I signed up for an account for my photography blog two years later. That worked out rather better, and that account is now my main account. It’s doing pretty well, and has over 50,000 followers.
As I started using Twitter more and more, I also started digging around in the deep dark corners of the Twitter API. I had already written a couple of simple scripts for Flickr, but Twitter… That’s where all the interesting data lives, and an almost incomprehensibly big firehose of data it was, too.
One drunken evening in the pub with a friend, we overheard a conversation behind us. It started with “I’m not racist, but…” and then proceeded to something horrifically racist. My friend pointed out: “Hey, I bet that 99% of people who start a sentence with ‘I’m not racist, but…’ are actually about to say something racist.”
Did I mention that we were slightly inebriated? Because that would explain the next step: I did a quick search on “Not Racist But” on Twitter, and found a tremendous amount of racism. I thought it was interesting, and it looked like my friend was right… But was it really 99%? And so, perhaps naïvely, I decided to try and find out.
And so, the idea for a Twitter bot was born.
- I coded a little PHP script that used the Twitter Search API to find all Twitter posts that have the words “not”, “racist” and “but” in them, in that order.
- The script ignores all retweets and @replies (or, in fact, any posts with the @ symbol in them), because I wanted to ensure that only ‘original’ tweets were counted.
- I set up an oDesk account, and asked a freelancer to help me sort the tweets, flagging them as racist (“I’m not racist but Obama needs to go back to the jungle” etc.), anti-racist (“I hate people who say ‘I’m not racist’ and then say something racist”), or unknown (i.e. nonsensical, in another language, or a joke such as “I’m not racist but I love brown bread”).
- When a tweet was tagged as definitely racist, the Twitter user’s profile image was downloaded and stored on a server.
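The filtering rules in the list above can be sketched in a few lines. This is a Python illustration, not the original PHP script (which isn’t shown here), and it makes some assumptions: the three words may have other words between them, matching is case-insensitive, and a retweet is anything starting with “RT ”. The function name `is_candidate` is made up for this example.

```python
import re

# "not", "racist" and "but" as whole words, in that order.
# Allowing other words in between is an assumption about the original search.
PATTERN = re.compile(r"\bnot\b.*\bracist\b.*\bbut\b", re.IGNORECASE | re.DOTALL)


def is_candidate(text: str) -> bool:
    """Return True if a tweet should go forward for manual classification."""
    # Any "@" rules out @replies, mentions and "RT @user" style retweets.
    if "@" in text:
        return False
    # Assumption: remaining retweets are marked with a leading "RT ".
    if text.startswith("RT "):
        return False
    return PATTERN.search(text) is not None


tweets = [
    "I'm not racist but I love brown bread",
    "@someone I'm not racist but...",
    "RT I'm not racist but...",
    "But I am not a racist",  # right words, wrong order
]
kept = [t for t in tweets if is_candidate(t)]
print(kept)
```

Only tweets that pass this filter would then be handed to a human for the racist / anti-racist / unknown tagging.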
I kept the bot running for about ten months, and collected 34k tweets. I would periodically do some art projects with the tweets we collected; more about that in just a sec.
Most importantly, I discovered that my friend was wrong: Based on our data set, it turns out that around 81% of people who start a tweet with “Not racist but…” are tweeting something racist. So: an overwhelming majority, but not quite 99%.
Turning racism into art
I wanted to make something creative, and turn the barrage of racism into something vaguely constructive. I created a second script, which used procedural (!) PHP to analyse each Twitter user’s profile photo and find its overall brightness. This data could then be used to create a photo mosaic. Of course, there’s plenty of software out there that does this for you, but given that I wanted to learn some coding myself, I decided to give it a shot.
I ended up writing a horrifically inefficient algorithm that basically brute-forced the photos into place. Any real software developer who looks at my source code would probably have a coronary episode — but that’s not the point; it worked, damn it. Like actual magic.
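For the curious, here is roughly what a brute-force brightness mosaic can look like. It is a Python sketch, not the original PHP, and the details are assumptions: brightness is taken as the plain average of the red, green and blue values, each profile photo is used at most once, and there are at least as many photos as mosaic cells. The names `average_brightness` and `build_mosaic` are invented for this example.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def average_brightness(pixels: Sequence[Pixel]) -> float:
    """Mean brightness of an image: the plain average of R, G and B per pixel."""
    total = sum((r + g + b) / 3 for r, g, b in pixels)
    return total / len(pixels)


def build_mosaic(target: Sequence[float], tiles: Sequence[float]) -> List[int]:
    """For each target cell, pick the unused tile with the closest brightness.

    Brute force: every cell scans every remaining tile.
    Assumes len(tiles) >= len(target).
    """
    available = set(range(len(tiles)))
    placement = []
    for cell in target:
        # Ties are broken by the lower tile index to keep results repeatable.
        best = min(available, key=lambda i: (abs(tiles[i] - cell), i))
        placement.append(best)
        available.remove(best)
    return placement


# Three tiny "profile photos": black, white and mid-grey.
tiles = [
    average_brightness([(0, 0, 0)]),
    average_brightness([(255, 255, 255)]),
    average_brightness([(120, 130, 140)]),
]
# Brightness wanted in each cell of the portrait, left to right.
target = [250.0, 10.0, 100.0]
print(build_mosaic(target, tiles))  # [1, 0, 2]
```

With N cells and N photos this does on the order of N² comparisons, which is where the “horrifically inefficient” part comes from. Sorting the photos by brightness first would be the obvious speed-up.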
For the first piece, I used the profile pictures of 5,000 people who said something racist after saying ‘Not racist but…’ to create a portrait of Martin Luther King:
Pretty nifty, eh?
I liked the idea, and repeated the project a few months later, this time with the profile photos of more than 25,000 people:
The final portrait of Gandhi turned out to be a 280 megapixel image weighing in at a hefty 40 megabytes. I won’t upload the full version, but here’s a closer look to get a feeling for what it looks like:
Did you do any other statistical analysis?
I played around with the data I gathered quite a lot, both in SPSS and with little tools I coded myself.
One example: I wondered whether people who said “I’m not racist but…” were likely to spell things correctly. My hypothesis was that perhaps people who tweeted something racist would be more inclined to spell things wrong. Was that the case?
I created a little script that checked every single word in every single tweet we had analysed against a dictionary. Yep, I brute-forced that one, too, and as it turned out, I looked up more than half a million words. Lo and behold, my suspicion was confirmed: I discovered that people who tweeted something racist were, on average, 2% more likely to misspell words than people who tweeted something anti-racist.
The difference wasn’t huge, but it was certainly statistically significant.
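The dictionary check is simple enough to sketch. Again, this is Python rather than the original script, and the specifics are assumptions: words are lower-cased runs of letters and apostrophes, and anything not in the word list counts as a misspelling. The function name `misspelling_rate` and the tiny word list are made up for illustration; a real run would load a full dictionary file.

```python
import re
from typing import Iterable, Set

# A "word" here is a run of letters and apostrophes (an assumption).
WORD = re.compile(r"[a-z']+")


def misspelling_rate(tweets: Iterable[str], dictionary: Set[str]) -> float:
    """Fraction of words across all tweets that are not in the dictionary."""
    total = 0
    unknown = 0
    for tweet in tweets:
        for word in WORD.findall(tweet.lower()):
            word = word.strip("'")  # drop quote marks around a word
            if not word:
                continue
            total += 1
            if word not in dictionary:
                unknown += 1
    return unknown / total if total else 0.0


# Toy word list and toy tweets, purely illustrative.
dictionary = {"i", "love", "brown", "bread"}
print(misspelling_rate(["I love brown bread"], dictionary))  # 0.0
print(misspelling_rate(["I luv brown bred"], dictionary))    # 0.5
```

Running this once over the tweets tagged racist and once over the tweets tagged anti-racist gives the two rates to compare. Note that slang, hashtags and names will all count as “misspelled” under this approach.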
In addition, it turns out that there may very well be a “They can’t see me” type bias in all of this: The average number of followers for people who tweeted something racist was just over 500, whereas the average number of followers for people who tweeted something anti-racist was over 800. Maybe being watched makes you more responsible?
So, what did I learn?
- It turns out that if you have a specific goal in mind, you can teach yourself to code.
- People can be pretty horrible.
- Digging into Twitter traffic is awesome: People tweet about pretty much everything, and digging into things and doing statistical analysis is fun!
- This project inspired me to code a proof of concept for a series of tools for the Metropolitan Police that can be used as early warning systems for unrest, accidents, etc.
- Twitter’s APIs are pretty easy to work with, as far as APIs go.
- This was all a lot of fun and I learned a lot.