Subreddit Simulator Breakdown

TL;DR - I used IBM Watson APIs and Bluemix to perform a personality analysis on the different bots from SubredditSimulator, and the results are pretty fascinating. Results are at the middle-end of the post. The technical mumbo-jumbo on how I got there is at the end-end. The app is here.

I must say that /r/SubredditSimulator is one of my favorite things the internet has ever done. It's up there with the original Twitch Plays Pokemon.

some background

For those of you unfamiliar with /r/SubredditSimulator, you should check out "What is /r/SubredditSimulator," but if you're too lazy here's a synopsis: essentially, /u/Deimorz made hundreds of Reddit-bots that each derive a type of content and grammatical style from a different subreddit. They then use Markov Chains to post, make comments, and "interact" with each other.

From the creator himself:

...this is a fully-automated subreddit that generates random submissions and comments using markov chains, with each bot account creating text based on comments from a different subreddit.

Every hour a bot is chosen at random to post, and every three minutes a bot is chosen at random to comment. The up-and-down voting is all done by humans, but all of the content is automatically generated by bots. If a real user ever tries to post on this subreddit, it automatically gets deleted.
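
For a feel of how a Markov chain turns a pile of comments into new sentences, here is a minimal sketch. It is not /u/Deimorz's code: the names `buildChain` and `generate`, the one-word state, and the injectable `rng` parameter are all assumptions made for illustration.

```javascript
// Minimal first-order Markov chain text generator (illustrative only).
// buildChain and generate are hypothetical names, not taken from the bots' code.

// Map each word to the list of words that followed it in the corpus.
function buildChain(corpus) {
  var words = corpus.split(/\s+/).filter(Boolean);
  var chain = Object.create(null); // no inherited keys to collide with words
  for (var i = 0; i < words.length - 1; i++) {
    if (!chain[words[i]]) {
      chain[words[i]] = [];
    }
    chain[words[i]].push(words[i + 1]);
  }
  return chain;
}

// Walk the chain from a start word, picking a random successor at each step.
// rng is injectable so a run can be made repeatable; it defaults to Math.random.
function generate(chain, start, maxWords, rng) {
  rng = rng || Math.random;
  var word = start;
  var out = [word];
  while (out.length < maxWords && chain[word]) {
    var successors = chain[word];
    word = successors[Math.floor(rng() * successors.length)];
    out.push(word);
  }
  return out.join(' ');
}

var chain = buildChain('the cat sat on the mat and the cat ran');
console.log(generate(chain, 'the', 8));
```

Because "cat" follows "the" twice in the sample corpus and "mat" only once, the generator is twice as likely to pick "cat" after "the". Feed it a whole subreddit's worth of comments and the output starts to pick up that subreddit's vocabulary and phrasing.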

ok, so... why are you telling me this?

Well first off, these bots often craft beautiful gems like:

The Bible is written by IT staff. Clearly you're not into ladyboys.

- 4chan_SS

and

He seems to be in reality also.

- badphilosophy_SS

(Each bot is named after the subreddit where it "learned English")

More importantly, what these bots say might actually be revealing. Each bot derives its use of English from a given subreddit, so running a personality analysis on each bot is arguably equivalent to running that analysis on the entire subreddit which that bot represents.

Arguably.

...a personality analysis?

Through IBM Bluemix, IBM Watson offers a Personality Insights service that can be used on chunks of text:

The Personality Insights service uses linguistic analytics to infer individuals' intrinsic personality characteristics from digital communications that they make available via media such as email, text messages, tweets, and forum posts.

This API gives you three different personality models:

  1. "Big Five" personality characteristics represent the most widely used model for generally describing how a person engages with the world. The model includes five primary characteristics, or dimensions - agreeableness, conscientiousness, extraversion, emotional range, and openness.
  2. "Needs" describe which aspects of a product are most likely to resonate with a person. The model includes twelve characteristic needs: Excitement, Harmony, Curiosity, Ideal, Closeness, Self-expression, Liberty, Love, Practicality, Stability, Challenge, and Structure.
  3. "Values" describe motivating factors that influence a person's decision-making. The model includes five dimensions of human values: Self-transcendence, Conservation, Hedonism, Self-enhancement, and Openness to change.

Full docs here.

This API works best when there are more than 1000 words, so I wasn't able to run the analysis on every bot that takes part in /r/SubredditSimulator - just the ones that have a total word count of 1000+.

the app!

The app can be found here! From the main page you can navigate to the top comments for different time periods. To access the personality insights ("Big 5," "Needs," and "Values") for a specific bot, click on the name of that bot. To see the ranked list of bots for a given personality category ("Openness," "Adventurousness," etc.), click on that category. To jump straight to a list of every possible category and the top 5 bots for each one, click on the "insights" tab at the top.

*Disclaimer: these bots generate sentences based on what they learn from reddit - not everything they say is 100% appropriate.

motivation

This journey was inspired by reddit's inability to show all of the top comments for a given sub - going to https://www.reddit.com/r/subredditsimulator/comments only lists the most recent 1000 comments sorted by time. There is good reason for this - why would you ever want to view comments out of their context? This functionality really only makes sense for /r/SubredditSimulator - everything is entirely random and the comments are often funnier than the posts.

After I gathered 15,000+ comments I figured some analysis was in order... so here we are!

and the results are in!

Each of the personality breakdowns contains a lot of data. For example, here's Jokes_SS. Looking at the attributes of each bot is super interesting but then sorting the bots by attribute is an extra layer of juicy awesome.

To quickly see a breakdown of each available category and the top 5 bots for each category head on over here.

The most interesting rankings I've found are:

Intellect

a person's tendency to think in symbols and abstractions

  1. AskHistorians_SS - 98.45%
  2. AskScience_SS - 97.55%
  3. math_SS - 95.67%
  4. history_SS - 94.57%
  5. europe_SS - 94.28%

Extraversion

a person's tendency to seek stimulation in the company of others

  1. gonewild_SS - 100%
  2. relationships_SS - 98.37%
  3. sex_SS - 96.44%
  4. confession_SS - 95.02%
  5. seduction_SS - 91.87%

Self-enhancement

a person's tendency to seek personal success for themselves

  1. math_SS - 95.97%
  2. books_SS - 94.68%
  3. programming_SS - 91.38%
  4. leagueoflegends_SS - 91.17%
  5. soccer_SS - 90.06%

what does all this mean?

Every subreddit is made up of a community of users that post and comment. Every bot uses markov chains based on real users to create sentences that are likely to be found in that subreddit. It's not a stretch to conclude that the bots' personalities are somewhat indicative of the conversations that take place in those subreddits.

This isn't claiming that these specific subreddits have the most intelligent users, or the most extraverted, etc. It's more of an insight into the specific language that commenters use when posting in these communities.

We can easily grasp that people posting in AskHistorians and AskScience use intellectually focused language. The fact that Watson picked up on such a nuanced attribute, in close parallel with human perception, is an achievement. We make these judgments ourselves so quickly that it's tough to appreciate the computation that goes into them. That Watson arrived at the same judgment so readily shows the power of the software.

let's get technical

The code powering this bad boy can be found here.

architecture

[Architecture diagram]

challenges

The largest technical challenge stemmed from limitations in reddit's APIs:

  1. You can only fetch 100 comments at a time
  2. /comments can only be sorted by time and doesn't allow you to go further back than ~1000 comments
  3. You can't make more than thirty requests per minute

All of these combined together make it impossible to have a reference to every comment and to have those comments always be up to date.

To circumvent this, I have the following function:

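// Alternate forever between the two scrapers: as soon as one run settles
// (whether it succeeded or failed), start the other one.
function loadCommentsForever (flag) {
  if (flag) {
    // the most recent ~1000 comments, via /comments
    return commentScraper.getAndUploadComments()
      .finally(loadCommentsForever.bind(this, !flag));
  } else {
    // comments fetched post by post
    return commentScraper.getAndUploadPostComments()
      .finally(loadCommentsForever.bind(this, !flag));
  }
}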

getAndUploadComments() uses the /comments API to get the most recent ~1000 comments one "page" at a time, and getAndUploadPostComments() gets the most recent 100 posts, then gets the comments for those posts one at a time, and then gets the next 100 posts, gets the comments for those, etc.

loadCommentsForever() is always running, keeping the comments on the app fresh and up to date.
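
To make the paging idea concrete, here is a rough sketch of walking a cursor-based listing while staying under a rate limit. It is not the app's actual scraper: `collectPages`, `wait`, and the page shape `{ items, after }` are assumptions for illustration, and `fetchPage` is a hypothetical function you would supply to make the real request. Reddit listings do hand back an `after` cursor for the next page, which is what the sketch models.

```javascript
// Sketch of cursor-based paging under a rate limit (illustrative only).
// fetchPage is a hypothetical caller-supplied function: given a cursor (null for
// the first page) it returns a promise of { items: [...], after: cursorOrNull }.

// Resolve after ms milliseconds.
function wait(ms) {
  return new Promise(function (resolve) { setTimeout(resolve, ms); });
}

// Collect pages until the cursor runs out or maxPages is reached,
// pausing delayMs between requests to stay under the rate limit.
function collectPages(fetchPage, maxPages, delayMs) {
  var collected = [];
  function step(cursor, pagesLeft) {
    if (pagesLeft <= 0) { return Promise.resolve(collected); }
    return fetchPage(cursor).then(function (page) {
      collected = collected.concat(page.items);
      if (!page.after || pagesLeft === 1) { return collected; }
      return wait(delayMs).then(function () {
        return step(page.after, pagesLeft - 1);
      });
    });
  }
  return step(null, maxPages);
}

// 30 requests per minute works out to one request every 2000 ms.
var DELAY_MS = 60000 / 30;
```

With 100 comments per page and a ceiling of roughly 1000 comments, a call like `collectPages(fetchPage, 10, DELAY_MS)` would cover one full pass over /comments.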

personality insights

Using the personality insights was much simpler than expected. To initialize the service and request a profile:

var watson = require('watson-developer-cloud');
var personalityInsights = watson.personality_insights(credentials);
 
personalityInsights.profile({text: aggregatedTextObj.value.body}, function (err, profile) {
  if (!err) {
    // do cool stuff with profile
  } else {
    // handle error
  }
});
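
As for the "cool stuff", one option is to flatten the profile into a simple trait-to-score map and then rank bots by a trait. The sketch below assumes the profile is a tree whose nodes look like `{ name, percentage, children }`; that shape, the helper names `flattenTraits` and `topBots`, and the sample data are all assumptions for illustration, so check them against the response you actually get back.

```javascript
// Sketch: flatten a tree-shaped profile into { traitName: percentage }.
// ASSUMPTION: each node looks like { name, percentage?, children? }.
// If two nodes share a name, the one visited later overwrites the earlier one.
function flattenTraits(node, out) {
  out = out || {};
  if (typeof node.percentage === 'number') {
    out[node.name] = node.percentage;
  }
  (node.children || []).forEach(function (child) {
    flattenTraits(child, out);
  });
  return out;
}

// Rank bots by one trait, highest first, and keep the top n.
// profilesByBot is a hypothetical map of bot name -> flattened traits.
function topBots(profilesByBot, trait, n) {
  return Object.keys(profilesByBot)
    .filter(function (bot) { return typeof profilesByBot[bot][trait] === 'number'; })
    .sort(function (a, b) { return profilesByBot[b][trait] - profilesByBot[a][trait]; })
    .slice(0, n);
}

// Made-up sample data, for illustration only.
var sampleTree = {
  name: 'root',
  children: [{
    name: 'Big 5',
    children: [
      { name: 'Openness', percentage: 0.9, children: [{ name: 'Intellect', percentage: 0.8 }] },
      { name: 'Extraversion', percentage: 0.3 }
    ]
  }]
};
console.log(flattenTraits(sampleTree));
```

Once every bot has a flattened profile, something like `topBots(profilesByBot, 'Intellect', 5)` would give a top-5 list for a single category.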

cloudant

Getting the aggregate text for each bot using Cloudant was super-awesome-ly-easy. All I had to do was set up a very simple map/reduce.

Map:

function (doc) {
  if (doc.author && doc.author !== '[deleted]') {
    emit(doc.author, {
      body: doc.body,
      score: doc.score
    });  
  }
}

Reduce:

function (keys, values, rereduce) {
  return values.reduce(function (prev, curr) {
    return {
      body: prev.body + ' ' + curr.body,
      score: prev.score + curr.score
    };
  }, {
    body: '',
    score: 0
  });
}

This created a view that stored the aggregate words from a given bot, and that bot's total score.
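
If map/reduce views are new to you, it can help to run the same logic over a few in-memory documents. The sketch below is only a local simulation: `runView`, `mapDoc`, `reduceValues`, and the sample docs are made up for illustration, and in the real setup the database does the grouping for you when the view is queried with `group: true`.

```javascript
// Local simulation of a grouped map/reduce view (illustrative only;
// runView and the sample documents are hypothetical).

// Same logic as the map function, with emit passed in explicitly.
function mapDoc(doc, emit) {
  if (doc.author && doc.author !== '[deleted]') {
    emit(doc.author, { body: doc.body, score: doc.score });
  }
}

// Same logic as the reduce function: concatenate bodies, sum scores.
function reduceValues(values) {
  return values.reduce(function (prev, curr) {
    return { body: prev.body + ' ' + curr.body, score: prev.score + curr.score };
  }, { body: '', score: 0 });
}

// Group emitted values by key, then reduce each group.
function runView(docs) {
  var groups = Object.create(null);
  docs.forEach(function (doc) {
    mapDoc(doc, function (key, value) {
      (groups[key] = groups[key] || []).push(value);
    });
  });
  return Object.keys(groups).map(function (key) {
    return { key: key, value: reduceValues(groups[key]) };
  });
}

var rows = runView([
  { author: 'math_SS', body: 'prime numbers', score: 3 },
  { author: '[deleted]', body: 'gone', score: 1 },
  { author: 'math_SS', body: 'are neat', score: 2 }
]);
console.log(rows);
```

Note that the aggregated body starts with a space, because the reduce begins from an empty string and joins with `' '`.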

To access this, all I had to do was:

var Promise = require('bluebird');
var cloudant = require('cloudant')(dbCredentials.url);
var commentsDB = Promise.promisifyAll(cloudant.use(dbCredentials.dbName));
 
commentsDB.viewAsync('ss_design', 'aggregate_text', {reduce: true, group: true, limit: 1000}).then(function (args) {
  var body = args[0];
  var authors = body.rows;
  // filter out the bots with 1200 words or fewer
  authors = authors.filter(function (r) { return r.value.body.split(' ').length > 1200; });
  // more stuff
});
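
One caveat with counting words via `split(' ')`: leading spaces and doubled spaces produce empty strings that get counted, and words separated only by a newline are counted as one. If the threshold needs to be exact, a whitespace-tolerant count along these lines might be closer to the truth (`countWords` is a hypothetical helper, not part of the app):

```javascript
// Count words by splitting on any run of whitespace and ignoring empties.
function countWords(text) {
  var trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

var sample = ' a  b\nc';
console.log(sample.split(' ').length); // naive count
console.log(countWords(sample));       // whitespace-tolerant count
```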

anything else?

Let me know what you think! If there's a specific feature or whatever you'd like to see added/changed feel free to contact me - all my info's in the about page.