350M Tokens Don't Lie: Love And Hate In Hacker News
tl;dr We analyzed all Hacker News posts with more than five comments between January 2020 and June 2023. Leveraging the Llama3 70B LLM, we examined both the posts and their associated comments to gain insights into how the Hacker News community engages with different topics. You can download the datasets we produced at the bottom of the article.
Use the tool below to explore various topics and the sentiments they evoke. The sentiment column reports the median sentiment score, 0 being the most negative sentiment and 9 the most positive. Click on the column headers to discover topics that inspire strong reactions, whether positive or negative, and to identify divisive subjects that tend to generate polarized commentary.
Motivation
If you have been following Hacker News for a while, you have likely developed an intuition for the topics that the community loves - and topics the community loves to hate. If you port Factorio to run on ZX Spectrum using Rust, you will be overwhelmed with love and karma. On the other hand, you better don an asbestos suit if, instead of raising a funding round, you sell your startup to a private equity company so they can add telemetry in the codebase to power targeted ads.
But that's just a hazy intuition! Since the community is a big believer in rationality and data science (the phrase is divisive though), we would be in a stronger position if we could back our hunch with proper data analysis.
Also, it would give us an excuse to play with large language models which, all the AI hype aside, are mind-blowingly effective at practical tasks like this. And, we happen to be developers of an open source tool, Metaflow, written in Python, which makes it fun and educational to hack projects like this.
Type the bolded phrases in the textbox above to see if we are optimizing for an appropriate emotional response.
What topics trend on Hacker News?
As described in the implementation section below, we downloaded about 100,000 pages posted on Hacker News to understand what content resonates with the community. We focused on posts that gained at least 20 upvotes and more than five comments, prompting a large language model to come up with ten phrases that best describe each page.
Here are the top 20 topics, aggregated from the 100,000 pages, along with the number of posts covering each topic:
The top topics are hardly surprising. However, the community is known for having diverse intellectual interests. The top 20 topics represent only 10% of the topics covered across posts. You can see this for yourself using the tool above, which covers all 14,000 topics associated with at least five posts.
Have the top topics evolved over time? You bet. Here are the top-10 trending topics:
Even though the dataset only extends to June 2023, we can see a tsunami of AI, natural language processing, and related topics. Sadly, layoffs started trending mid-2022 as well. The rise of AI articles likely explains the growth in the Technology topic as well.
What's declining
In 2020-2022, an overwhelming macro-trend was everything related to the COVID pandemic, which fortunately is not a pressing issue anymore. Fascinatingly, August 2021 was a tumultuous month, with Apple's proposed CSAM scanning in particular causing an uproar in posts related to Privacy and Apple.
If you want to take a trip down memory lane, consider the past topics on the left, which haven't been covered once since January 2022:
Correspondingly, the topics on the right didn't exist before January 2022. It doesn't take a PhD in Economics to understand how Wallstreetbets got replaced by Cost of living crisis, Hiring freeze, and Bank failure.
But how do people feel about these topics
Importantly, sharing or upvoting a post doesn't imply endorsement - often the opposite. Hence to truly understand the dynamics of a community, we need to analyze how people react to posts, as expressed through their comments.
To do this, we reconstructed comment threads associated with the posts in the dataset and asked an LLM to classify the sentiment of the discourse between 0 and 9, zero being an all-out flamewar and nine indicating perfect harmony and positivity.
Our dutiful LLM took on the job of a virtual community moderator and went through 100k comment threads in about 9 hours, reading through 230M words spanning the full emotional gamut of bitterness, passion, wisdom, humor, and love.
Here's what the LLM came back with in terms of the distribution of sentiments:
Firstly, the LLM is utterly befuddled about what a neutral discussion looks like - there are no 5s in the results. Or maybe this is a snarky note from the LLM, implying that humans are incapable of unemotional, unbiased discourse.
Secondly, the sentiments are clearly skewed towards the positive side, which aligns with our personal experience with the site. The bias to positivity and optimism is a key reason why we open Hacker News daily. There's enough vitriol elsewhere.
It is also worth noting that at least some of the scores of one assigned by the machine seem to be bogus. For instance, this post by BentoML was given a score of one, although the sentiment seems majorly positive and supportive. All the usual caveats about LLMs apply.
Love and hate in Hacker News
Now with the topics and the sentiment scores at hand, we can finally give a scientific answer to the question of what the community loves and loves to hate:
Geeks, nerds, and hackers should find themselves right at home! Interestingly, while the community tends to be visionary and forward-looking (with technical matters at least), it definitely has a soft spot for (technical) nostalgia. While the modern world is amazing, we miss ZX Spectrum, Z80 assembly, and 8086 dearly. At least we find some comfort in PICO-8.
On the anger-inducing side, most topics need no explanation. It is worth clarifying though that Hacker News does not hate International Students, but the posts related to them tend to be overwhelmingly negative, reflecting the community's sympathy for the challenges faced by those studying abroad.
Comically, Hacker News is not a community of car lovers. When we talk about cars, it is because there's something wrong with them. For more insights like this, use the tool at the top of the page to explore the diverse landscape of HN topics in detail.
Some topics are just divisive
Besides topics being unimodally love- or hate-inducing, some topics are bimodal: sometimes a post about the topic generates a highly positive response, other times a flamewar. Examples include:
- GNOME - KDE vs. GNOME - the war has been raging for 25 years.
- Google - a dominant force on the Internet, for both good and bad.
- Government regulations - damned if you do and damned if you don't.
- Venture capital - the lifeblood of Silicon Valley and a source of endless gossip.
See more by sorting by the divisive column in the tool above. To rank highly by the divisiveness score, a topic must be associated with negative and positive posts in equal measure, and with few neutral ones.
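The article doesn't spell out the divisiveness formula, but a score with the properties described (equal negative and positive mass, few neutrals) could be sketched like this; the thresholds and the weighting scheme are our assumptions, not the actual implementation:

```python
def divisiveness(scores, low=3, high=6):
    """Hypothetical divisiveness measure over a topic's sentiment scores.

    Combines two factors: the fraction of clearly negative or clearly
    positive posts (few neutrals -> high), and how balanced the two
    extremes are (equal sides -> high). Returns a value in [0, 1]."""
    if not scores:
        return 0.0
    neg = sum(1 for s in scores if s <= low)
    pos = sum(1 for s in scores if s >= high)
    if neg + pos == 0:
        return 0.0
    extreme = (neg + pos) / len(scores)         # penalize neutral posts
    balance = 1 - abs(neg - pos) / (neg + pos)  # reward an even split
    return extreme * balance
```

A topic whose posts all score 0 or 9 in equal numbers gets a score of 1.0, while a uniformly neutral topic gets 0.0.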
Is the mood improving?
Cue Betteridge's Law.
We can plot the average sentiment over time, as expressed in daily comment threads:
The data proves that things sucked in August 2021 (or maybe it was just the Apple debacle highlighted in the chart above). Overall, there's a clear but modest downward trend in the average sentiment.
It would require a deeper analysis to understand why this is the case. Putting aside an obvious hypothesis that life is just getting worse (which might not be true), an alternative hypothesis may be a variant of Eternal September. It takes conscious and tireless effort to maintain a positive mood in a growing community. Kudos to HN moderators for keeping the community thriving and positive over the years!
Implementation
A reason to be excited and optimistic about the future is the very existence of this article. While natural language processing and sentiment analysis have been around for decades, the quality, versatility, and ease of use afforded by LLMs are absolutely unprecedented.
Attaining the quality of topics and sentiment scores with a messy dataset like the one here would have required a PhD-thesis level of effort just a few years ago - and very likely the results would have been worse. In contrast, we developed all the code for this article in about seven hours. Processing 350M tokens with just a decent-sized model would have required a supercomputer a decade ago, whereas in our case it took about 16 hours using widely available hardware.
Most amazingly, all the building blocks, LLMs included, are available in open source! Let's do a quick overview (with code and data), showing how you can repeat the experiment at home.
Here's what we did at the high level:
Each white box in the picture is a Metaflow flow, linked below:
- HNSentimentInit creates a list of posts to analyze.
- HNSentimentCrawl downloads the posts.
- HNSentimentAnalyzePosts parses the posts and runs them through an LLM.
- HNSentimentCommentData reconstructs comment threads based on a Hacker News dataset in Google BigQuery. We should/could have used this in step (1) too. Next time!
- HNSentimentAnalyzeComments runs the comment threads through an LLM.
- Data analyses and the charts shown above are produced in notebooks (the gray box in the picture).
Here's what the flows do:
Downloading posts
First, we wanted to analyze topics covered by Hacker News posts that have generated some discussion. Using a publicly available dataset of HN posts (thanks Julien!), we queried all posts between January 2020 and June 2023 (the latest date available in this dataset) with at least 20 upvotes and more than five comments, resulting in about 100,000 posts. Here's the simple Metaflow flow that did the job, with much thanks to DuckDB.
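The filtering itself is plain SQL. For portability this sketch uses stdlib sqlite3 with an illustrative schema; the actual flow ran DuckDB over the public Parquet dataset, and the table and column names here are stand-ins:

```python
import sqlite3

# In-memory stand-in for the real posts table; columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER, time TEXT, score INTEGER, descendants INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        (1, "2021-03-01", 150, 42),  # qualifies
        (2, "2022-07-15", 12, 80),   # too few upvotes
        (3, "2019-05-01", 300, 10),  # outside the date range
    ],
)

# At least 20 upvotes, more than five comments, Jan 2020 - Jun 2023.
rows = conn.execute(
    """
    SELECT id FROM posts
    WHERE score >= 20
      AND descendants > 5
      AND time BETWEEN '2020-01-01' AND '2023-06-30'
    """
).fetchall()
```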
Since the 100,000 posts are mostly on different domains, we can safely download them in parallel without DDOS'ing the servers. It took only around 25 minutes to download the pages with 100 parallel workers (see how here).
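A minimal standard-library sketch of such a crawler; the worker count and timeout are illustrative, and the real flow used Metaflow's parallelism rather than a local thread pool:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page; return (url, html or None).

    Network errors are swallowed so one bad domain can't stall the crawl."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception:
        return url, None


def crawl(urls, workers=100):
    # Posts point at mostly distinct domains, so parallel fetches
    # don't hammer any single server.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, urls))
```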
Large-scale document understanding with LLMs
Parsing the text content from random HTML pages used to be a massive PITA, but BeautifulSoup makes it beautifully straightforward.
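A minimal sketch of that extraction step, assuming BeautifulSoup (bs4) is installed; dropping script and style tags is our choice, not necessarily what the flow did:

```python
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Strip an HTML page down to its visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style content, which is pure noise for an LLM.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Normalize whitespace into a single flat string.
    return " ".join(soup.get_text(separator=" ").split())
```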
At this point, we have a relatively clean dataset of about 100,000 text documents with more than 500M tokens in total. Processing the dataset once with a state-of-the-art LLM, say GPT-4o, or Llama3 70b on AWS Bedrock, would cost around $1,300 (using the OpenAI batch API), not counting the output tokens. Besides the cost, time is a concern - we want to process the dataset as fast as possible, so we can evaluate the results quickly, and rinse-and-repeat if needed. Hence we don't want to get rate-limited or otherwise bottlenecked by the APIs.
We have posted previously about our success with NVIDIA NIM microservices that provide a quickly growing library of LLMs and other GenAI models as prepackaged images, fully optimized to take advantage of vLLM, TensorRT-LLM and the Triton Inference Server, so you don't have to spend time chasing the latest tricks with LLM inference.
Since we had NIMs included in our Outerbounds deployment, we just added @nim to our flow to get access to a high-throughput LLM endpoint that costs only as much as the auto-scaling GPUs it runs on, in this case, four H100 GPUs. Naturally you can hit an LLM endpoint of your choosing - the options are many these days.
Prompting an LLM to produce a list of topics for each post
Our prompt is straightforward:
Assign 10 tags that best describe the following article.
Reply only the tags in the following format:
1. first tag
2. second tag
N. Nth tag
---
[First 5000 tokens from a web page]
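Here's a hypothetical sketch of wiring that prompt to an OpenAI-compatible chat completions endpoint, such as one served by a NIM; the endpoint URL, model name, and the crude character-based truncation are placeholders (the actual flow truncated by tokens):

```python
import json
import re
from urllib.request import Request, urlopen

TAG_PROMPT = """Assign 10 tags that best describe the following article.
Reply only the tags in the following format:
1. first tag
2. second tag
N. Nth tag
---
{article}"""


def parse_tags(reply: str) -> list[str]:
    """Pull tags out of a numbered-list reply like '1. Rust\\n2. Gamedev'."""
    return [m.group(1).strip().lower()
            for m in re.finditer(r"^\s*\d+\.\s*(.+)$", reply, re.MULTILINE)]


def tag_article(text: str, endpoint: str, model: str) -> list[str]:
    # Assumed OpenAI-compatible /v1/chat/completions endpoint (e.g. a NIM
    # or vLLM server); endpoint and model are placeholders. Truncating by
    # characters here is a rough proxy for the 5,000-token limit.
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user",
                      "content": TAG_PROMPT.format(article=text[:20000])}],
    }).encode()
    req = Request(endpoint, data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return parse_tags(reply)
```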
The llama3 70b model we used has an 8,000-token context window, but we decided to limit the number of tokens to 5,000 to account for differences in tokenizer behavior, making sure that we don't go past the limit.
Processing about 140M input tokens in this manner took about 9 hours. We were able to increase throughput to around 4,300 input tokens per second by hitting the model concurrently with five workers, as neatly shown in our UI below, taking advantage of dynamic batching and other optimizations.
Processing comments
Instead of trying to download 100,000 comment pages directly from Hacker News, we leveraged a Hacker News dataset in Google BigQuery. Annoyingly, comments in the database are not directly associated with their parent post, so we had to implement a small function to reconstruct the comment threads.
We can't run the function with BigQuery directly, but luckily we can easily export the data as Parquet files. Loading the resulting 16M rows in DuckDB and scanning through them was a breeze, aided by the fact that Metaflow knows how to load data fast. We just added @resources(disk=10000, cpu=8, memory=32000) to run the function on a large enough instance.
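The reconstruction boils down to walking each comment's parent pointer up to its root post. A simplified in-memory sketch (function and argument names are our own; the real version scanned 16M rows in DuckDB):

```python
def build_threads(posts, comments):
    """Group comments under their root post.

    `posts` is a set of post ids; `comments` maps comment id -> parent id,
    where the parent is either another comment or a post."""
    threads = {post_id: [] for post_id in posts}
    root_cache = {}  # memoize comment id -> root post id

    def root_of(cid):
        if cid in posts:
            return cid
        if cid not in root_cache:
            root_cache[cid] = root_of(comments[cid])
        return root_cache[cid]

    for cid in comments:
        threads[root_of(cid)].append(cid)
    return threads
```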
Prompting an LLM to analyze sentiment
With the comment threads at hand, we were able to run them through our LLM with this prompt:
In the scale between 0-10 where 0 is the most negative sentiment
and 10 is the most positive sentiment, rank the following discussion.
Reply in this format:
SENTIMENT X
where X is the sentiment rating
---
[First 3000 tokens from a comment thread]
Getting a simple structured output like this seems to work without issues. We processed some 230M input tokens in this manner, which took only about 7 hours since we needed only two output tokens per thread.
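Parsing that structured reply defensively might look like this; returning None on malformed output is our assumption, not necessarily the flow's actual behavior:

```python
import re


def parse_sentiment(reply: str):
    """Extract the 0-10 rating from a 'SENTIMENT X' reply.

    Returns None if the model didn't follow the format or the
    number is out of range."""
    m = re.search(r"SENTIMENT\s+(\d+)", reply)
    if m:
        score = int(m.group(1))
        if 0 <= score <= 10:
            return score
    return None
```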
Why Metaflow
You can reproduce all the steps above using your favorite toolchain. Here are key reasons why we used Metaflow and why you might want to consider it too:
- Staying organized without effort - a big benefit compared to random Python scripts or notebooks is that Metaflow persists all artifacts automatically, tracks all executions, and keeps everything organized. We relied on Metaflow's namespaces to execute large and expensive runs alongside prototypes, knowing that the two can't interfere with each other. We used Metaflow tags to organize data sharing between flows while they were being developed independently.
- Easy cloud scaling - crawling required horizontal scaling, DuckDB required vertical scaling, and LLMs required a GPU backend. Metaflow handled all the cases out of the box.
- Highly available orchestrator - running a large dataset through an LLM can cost thousands of dollars. You don't want the run to fail because of random issues. We relied on a highly available Argo Workflows orchestrator that Metaflow supports out of the box to keep the run running for hours.
You can do all of the above using open-source Metaflow, but we had a few additional benefits from running the flows on Outerbounds Platform: it is simply fun to develop code like this, including notebooks, with VSCode running on cloud workstations; scaling to the cloud is smooth sailing; and @nim allowed us to hit LLMs without worrying about cost or rate limiting.
If features like this sound relevant to your interests, we are happy to get you started for free.
Dive deeper at home
There's much more that can be analyzed and visualized with this dataset. Instead of spending a few thousand dollars hitting OpenAI APIs, you can download the topics and sentiments we created:
- post-sentiment.json contains a mapping post_id -> sentiment_score
- post-topics.json contains a mapping post_id -> [topics]
- topics-data.json contains a cleaned and joined dataset based on the above JSONs, powering the tool at the top of this article.
You can find metadata related to post IDs in this HuggingFace dataset and in the Google BigQuery Hacker News dataset. To view posts, simply open https://news.ycombinator.com/item?id=[post_id].
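As a hypothetical starting point, here's a small function that joins the two JSONs into per-topic median sentiments, mirroring what powers the explorer tool; the function name and the minimum-post threshold are our choices:

```python
import json
from statistics import median


def topic_sentiments(sentiment_path, topics_path, min_posts=5):
    """Join post-sentiment.json and post-topics.json into a map
    topic -> median sentiment, keeping topics with >= min_posts posts."""
    with open(sentiment_path) as f:
        sentiment = json.load(f)   # post_id -> sentiment_score
    with open(topics_path) as f:
        topics = json.load(f)      # post_id -> [topics]

    per_topic = {}
    for post_id, post_topics in topics.items():
        if post_id not in sentiment:
            continue               # skip posts without a sentiment score
        for topic in post_topics:
            per_topic.setdefault(topic, []).append(sentiment[post_id])

    return {t: median(s) for t, s in per_topic.items() if len(s) >= min_posts}
```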
For instance, it would be interesting to look into the correlation between post domains and sentiments and topics. Our hunch is that certain domains produce predominantly positive sentiments and vice versa. Or, do divisive topics garner more points?
If you create something fun with this data, please link back to this blog article and let us know! Join Metaflow Slack and drop a note on #ask-metaflow.
To support our open-source efforts, please give Metaflow a star!
Start building today
Join our office hours for a live demo! Whether you're curious about Outerbounds or have specific questions - nothing is off limits.