Decoding the Web of Belief
“There are no facts, only interpretations.”
— Friedrich Nietzsche
Conception
During a morning walk in mid-2024, I had a simple thought:
"If you could create a physical map of all conspiracy theories and how they are related, what would it look like?"
At the time, I had just finished research on building knowledge graphs from unstructured data. If graphs could be constructed from scientific publications, it seemed reasonable that similar analytical approaches could work for any unstructured source. Years prior, I attended a talk by Simon DeDeo at an event hosted by the Santa Fe Institute. He shared research findings on how humans construct explanations and how those explanations can go wrong. I’ve long been interested in extreme expressions of human behavior: historical tragedies, war, peak physical performance, and non-conformist belief systems. The idea of compiling a map of the most extreme explanations we’ve created about the world fascinated me.
So I set out to build a graph of conspiracy theories, unsure whether the results would be insightful or incoherent, but confident the process itself would be interesting. This project is technical in nature, spanning data engineering, natural language processing, large language models, the mechanics of knowledge graphs, named entity recognition, relation extraction and applied mathematics. I walk through the work end to end, from raw source data to the constructed graph, and then step back to reflect on what, if anything, the structure of the graph reveals.
Where Theories Live
Conspiracy theories are, at their core, ideas. To map them in the aggregate, those ideas must first be captured and combined in some concrete form. Many philosophers, including Plato, Thomas Hobbes, Hegel, and Wittgenstein, have emphasized the deep entanglement between thought, speech, language, and writing. If ideas are inseparable from the language that gives them shape, then collecting language becomes a way of collecting thought itself. As Wittgenstein put it in the Tractatus Logico-Philosophicus, “The limits of my language mean the limits of my world.”
If we accept the assumption that written text is strongly correlated with thought, then text becomes an ideal substrate for constructing a knowledge graph of conspiracy theories. The task, then, is to assemble a corpus that captures how these ideas are articulated, debated, and refined in the wild.

The School of Athens by Raphael. A Renaissance fresco depicting an idealized public square of reasoned debate. What would our ancestors make of the migration from physical spaces to virtual forums online? How would they respond to the fractured sense-making and competing explanations of the world that dominate our discourse today?
An effective corpus for this purpose should have three core properties. First, it must be tightly focused on conspiracy theories, with language bounded to the subject to ensure a high signal-to-noise ratio. Second, it should capture discourse rather than monologue: presentation, discussion, and debate that mirror a public square in which claims are proposed, challenged, defended, and elaborated. Finally, the corpus must be both large and temporally broad. The greater the volume of text and the longer the time horizon it spans, the more robust and representative the resulting map becomes.
Reddit satisfies these requirements particularly well. The platform is organized into topic-specific communities called subreddits, each functioning as a loosely defined forum centered on a shared interest. Some subreddits focus on news, others on casual reflection or technical explanation, but all share a common structure: posts followed by threaded discussion. This structure allows discourse to unfold naturally over time, often across many years and millions of contributions.
For conspiracy theories specifically, I selected four subreddits:
- r/conspiracy
- r/conspiracytheories
- r/conspiracy_commons
- r/conspiracyII
Together, these communities offer sustained, focused discussion spanning more than a decade.
Using “The Eye,” a dataset containing posts and comments from the top 20,000 subreddits over nearly two decades, I extracted all available content from these four communities. This corpus, covering years of claims, arguments, and reinterpretations, serves as the raw material for constructing the conspiracy theory knowledge graph.
Harvesting The Data
Our dataset comprises eight source files from the Reddit archive: for each of the four selected subreddits, one file contains submissions (posts) and a corresponding file contains comments. All files are distributed in .zst format, compressed with the Zstandard lossless compression algorithm, and must be decompressed prior to ingestion and processing.

A screenshot from the archive.
Once downloaded, the files are decompressed using Python and parsed to extract a predefined set of fields. The extraction logic is implemented in a custom script, accompanied by a configuration file that specifies the keys to be retained after decompression (a sketch of such a file follows the field lists below). The script is adapted from the zreader utility provided by Pushshift in their GitHub repository, with slight modifications to support this use case.
For submissions, the following keys were extracted:
- title: The title of the post
- selftext: The text for the body of the post
- num_comments: The number of comments on the post
- ups: The number of upvotes
- downs: The number of downvotes
- score: The net score of upvotes minus downvotes (i.e. ups - downs)
- created_utc: The timestamp the post was created, denoted in UTC
- permalink: The Reddit URL to the post
For comments, the following keys were extracted:
- author: The author of the comment
- body: The comment text
- ups: The number of upvotes
- downs: The number of downvotes
- score: The net score of upvotes minus downvotes (i.e. ups - downs)
- created_utc: The timestamp the comment was created, denoted in UTC
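Taken together, these field lists can be encoded in the configuration file mentioned earlier. The following is a hypothetical sketch, mapping each raw Reddit key to the friendlier output name that appears in the example record further below; the actual file may simply list the keys to keep:

{
  "submissions": {
    "title": "title",
    "selftext": "body",
    "num_comments": "number_of_comments",
    "ups": "upvotes",
    "downs": "downvotes",
    "score": "score",
    "created_utc": "created",
    "permalink": "reddit_url"
  },
  "comments": {
    "author": "author",
    "body": "body",
    "ups": "upvotes",
    "downs": "downvotes",
    "score": "score",
    "created_utc": "created"
  }
}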
This script streams large, Zstandard-compressed Reddit archive files and extracts a subset of fields into newline-delimited JSON (.jsonl) for downstream processing. It incrementally decompresses and decodes the data to handle large files and malformed UTF-8 safely, parses each record line by line, normalizes select fields such as Reddit permalinks, and logs progress as it goes. The script allows consistent decompression across multiple subreddit datasets without loading entire files into memory.
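The actual script is a close adaptation of Pushshift's zreader, but the core idea can be sketched as follows. This is a minimal, hypothetical reconstruction rather than the script itself: the archive file names, the keys.json config from the sketch above, and the TextIOWrapper-based decoding are assumptions made for illustration.

import io
import json
import zstandard as zstd

# Hypothetical config mapping raw Reddit keys to output names (see sketch above).
with open("keys.json", "r", encoding="utf-8") as f:
    FIELD_MAPS = json.load(f)

def stream_lines(path):
    """Incrementally decompress a .zst archive and yield decoded lines,
    without loading the whole file into memory."""
    with open(path, "rb") as fh:
        # Pushshift archives use a large compression window, hence max_window_size.
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        # errors="replace" keeps malformed UTF-8 from aborting the run.
        for line in io.TextIOWrapper(reader, encoding="utf-8", errors="replace"):
            yield line

def extract(in_path, out_path, field_map):
    """Parse each record, keep and rename the configured keys, write JSONL."""
    with open(out_path, "w", encoding="utf-8") as out:
        for i, line in enumerate(stream_lines(in_path), start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed records
            row = {new: record.get(raw) for raw, new in field_map.items()}
            # Normalize relative permalinks into full Reddit URLs.
            if row.get("reddit_url"):
                row["reddit_url"] = "https://www.reddit.com/" + row["reddit_url"].lstrip("/")
            out.write(json.dumps(row) + "\n")
            if i % 100_000 == 0:
                print(f"{in_path}: processed {i:,} lines")

if __name__ == "__main__":
    # Assumed file naming: one submissions and one comments archive per subreddit.
    for sub in ["conspiracy", "conspiracytheories", "conspiracy_commons", "conspiracyII"]:
        extract(f"{sub}_submissions.zst", f"{sub}_submissions.jsonl", FIELD_MAPS["submissions"])
        extract(f"{sub}_comments.zst", f"{sub}_comments.jsonl", FIELD_MAPS["comments"])

Streaming the decompression this way keeps memory use roughly flat even for multi-gigabyte archives, which matters most for the per-subreddit comment dumps.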
Here is an example of a single submission, stored as a line in one of the output files:
{
  "title": "What if BitCoin was created by the NSA for free computing power to decrypt anything ecnrypted with SHA-256?",
  "body": "Or potentially some other kind of decryption capabilities?",
  "number_of_comments": 4,
  "upvotes": 11,
  "downvotes": 0,
  "score": 11,
  "created": 1386335177,
  "reddit_url": "https://www.reddit.com//r/conspiracytheories/comments/1s8n0r/what_if_bitcoin_was_created_by_the_nsa_for_free/"
}
Once all data has been decompressed into .jsonl files, we can start preparing it for processing.
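As a quick sketch of how those files feed the next stage, each .jsonl can be streamed back into Python one record at a time (the file name below is assumed from the extraction sketch earlier):

import json

def read_jsonl(path):
    """Yield one record (as a dict) per line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Example: count the submissions in one file and peek at the first title.
records = list(read_jsonl("conspiracy_submissions.jsonl"))
print(len(records), records[0]["title"])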
Removing Noise
To be continued...