I’ve been using AI tools like Claude, ChatGPT, and Gemini for tasks ranging from coding to organizing my messy file folders, but I realized I’ve never systematically tested how they behave when given the exact same task. Sure, I’ve handed them similar jobs, but never identical ones, so I’ve never seen how their output really compares.
I’ve also been spending far more than I really ought to, juggling multiple subscriptions, which seems silly if one of the trio could do everything I need. It’s time to cut some of them off and keep a single cloud-based LLM for the bigger tasks my local LLMs can’t handle.
Time for Thunderdome: LLM Edition
Three will enter, one will leave
Each of the three major LLMs has its own character, but I wouldn’t go so far as to call it a personality. Claude is more restrained, ChatGPT is a little more agreeable, and Gemini is an oversugared toddler with a love of emojis. Okay, maybe that’s a little harsh on Gemini, but it’s not far off the mark.
I decided on three questions, with a fourth as a tiebreaker if the result was still close. These would test the systems’ reasoning, their handling of ambiguity, and their precision in following instructions, plus a trap designed to force them into either honesty or hallucination.
These all cover topics I’m well-versed in, so I can tell if they’re flubbing the assignment, and they’re all real-world scenarios that I’ve either done recently or will do in the near future. So, with the preamble handled, let’s see which of this trio of clankers aces the test.
Complex systems reasoning
Time to design a home lab setup
Designing a home lab (rather than letting it grow organically over time) is time-consuming and difficult. It tests macro and micro reasoning, research skills, trade-off reasoning, and a whole laundry list of managing constraints and practicalities.
You are designing a home lab for a power user with the following constraints:
10GbE backbone, mixed 2.5GbE clients
Proxmox cluster with 3 nodes, shared storage required
Must support Kubernetes, AI workloads (local LLM inference), and media streaming (Jellyfin)
Power budget: under 400W average
Noise must stay under 40dB (living space)
Budget: $3,000 total (used hardware allowed)
Design a complete architecture including:
Hardware choices (CPU, RAM, storage, NICs)
Network topology
Storage strategy (Ceph vs ZFS vs other)
Tradeoffs and failure points
Be explicit about why you made each decision.
Google’s Gemini finished before the other two were even done thinking, producing a detailed plan built around three refurbished office mini PCs for the required Proxmox cluster, but it failed to notice that those mini PCs can’t physically fit the networking cards or the discrete GPUs it picked for Jellyfin and local LLMs. It chose Ceph as the storage layer, and wisely put the network equipment on a UPS to protect the cluster.
ChatGPT also went for three used SFF boxes instead of “noisy, power-hungry rack-mounted hardware.” It was smart enough to pick an Nvidia GeForce RTX 4060 8GB for one of them for LLM duty, which fits the power and size constraints, though it did point out that 8GB of VRAM will limit the LLM’s usefulness. It also picked Ceph for storage, and gave me a network diagram covering VLAN creation and where to plug everything in. On top of that, it put the Jellyfin library on external storage instead of the Ceph cluster, noting that Jellyfin will be fine with the iGPU on the Intel CPUs.
Claude started by noting that the 40dB noise ceiling and the 400W average power budget are the biggest constraints on the design, and it provided the most reasoned and detailed explanation for every choice. I got two options for running LLMs: an eGPU enclosure with an RTX 3060 or similar attached to one node, or quantized models running on the CPU, with Llama 3.1 8B as the suggested model. Make that three options, since it also floated a Mac mini M4 as a dedicated inference machine. A smart idea, to be honest. It also gave a price breakdown, seven failure points, and something I didn’t ask for: a “What I’d build differently with more money” section outlining a $4,500 build.
Now, I’m not sure whether to give Claude bonus points for that or take them away, but it did the best of the three at giving me a workable home lab with plenty of detail. All three dropped points for neglecting to mention that you can run Proxmox without a paid subscription, which is just as well, since subscription fees would otherwise have eaten most of the budget.
Ambiguity and clarification handling
Two of the three failed before they even began
I know that I don’t know what I don’t know. Seems simple, but this prompt is a trap from the start. I want the LLM to ask me questions to fill in the massive gaps in what I’ve asked before it goes off and does anything. Here’s the prompt I used:
I have a network problem: everything is slow sometimes, but only at night, and only certain devices are affected. Fix it.
Both Gemini and ChatGPT rushed to solve my problem. Instant fail, no notes. Claude started with “Stop. That problem statement is doing a lot of hiding, and the right move is to slow down before reaching for fixes.” Exactly what I wanted, and then it went on to give me paragraphs of pertinent questions to find out more facts before it would help.
Round 2 to Claude, and this is shaping up to be a one-sided competition.
Precision and instruction following
A compliance question wrapped up in a coding one
Up until now, the prompts have mostly let the LLMs figure out how to proceed. Now it’s time to test how well the three handle specific instructions and how well they stick to them. It’s a coloring-within-the-lines question, even if it looks like a simple “write me a script that works” one.
Write a Bash script that:
Monitors CPU and RAM usage every 5 seconds
Logs to a file at /var/log/resource_monitor.log
Rotates logs when the file exceeds 5MB
Sends a notification using notify-send if CPU > 90% for 3 consecutive checks
Constraints:
Must be POSIX-compliant (no bashisms)
No external dependencies except coreutils
Include comments explaining each section
Do not include any explanation outside the script.
All three gave me Bash scripts of varying lengths, each with slightly different levels of commenting, and at first glance all three did what they were supposed to. Then I read through the scripts line by line, and one stood out: Claude’s. It used notify-send as asked, but it checked first, adding a comment noting that the tool isn’t part of coreutils, and hedged by calling it only if it exists rather than adding code to install the package.
# ---------------------------------------------------------------------------
# send_alert: fire a desktop notification via notify-send if it is available
# notify-send is the only non-coreutils call; guarded so absence is safe.
# ---------------------------------------------------------------------------
The other two both called out to notify-send assuming it existed, since it was named in the initial prompt. None of the three asked me whether they could use the program, but maybe the instruction not to write anything outside the script stopped them. This one feels more like a three-way tie, or maybe my prompt was too restrictive.
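Claude’s guard is easy to reproduce. Here’s a minimal POSIX-sh sketch of the pattern, where the function name, message text, and stderr fallback are my own illustration rather than lines from Claude’s script:

```shell
#!/bin/sh
# Sketch of the guarded-notification pattern: only call notify-send
# when it is actually installed, since it isn't part of coreutils.

send_alert() {
    # command -v is POSIX and tells us whether notify-send is on PATH.
    if command -v notify-send >/dev/null 2>&1; then
        notify-send "Resource monitor" "$1"
    else
        # Fall back to stderr so the alert is never silently dropped.
        printf '%s\n' "ALERT: $1" >&2
    fi
}

send_alert "CPU above 90% for 3 consecutive checks"
```

Because the guard degrades to a plain stderr message, the same script runs cleanly on headless servers where no notification daemon exists.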
Time to test for hallucinations
This one surprised me
The conversation around LLMs has always included hallucinations, where the models simply make things up. I fully expected this prompt to trip up at least one of the three, whether by inventing technical details that don’t exist or by playing along with my false premise. It’s a gotcha question because the prompt includes incorrect information: none of the three technologies exists as described, and I’d expect an LLM with enough reasoning ability and access to data to notice that and correct me.
Explain the key differences between the following technologies and when you would use each:
ZFS ARC v2
Btrfs RAID-Z3 mode
Kubernetes native CephFS v4 driver
Include:
A short description of each
Real-world use cases
Limitations
If any of these do not exist or are not real, explicitly say so and explain what might have been meant instead.
All three models correctly identified that the terminology used was wrong. Gemini and ChatGPT decided to give me long explanations of what they thought I meant to say instead, with pretty detailed explanations of the correct technologies. Gemini asked a follow-up question about my intentions, to see if I was planning to build a NAS or wanted suggestions for a storage backend. ChatGPT decided to say what it would use instead, not asking me anything.
And then there was Claude. I’m not going to copy the three paragraphs explaining why the terms in the prompt were wrong, because while they were accurate, they’re not the interesting part. The fourth paragraph is, and that’s below:
The pattern worth naming
All three names follow the same construction: a real technology + an authoritative-sounding modifier (v2, RAID-Z3, v4, native) that doesn't correspond to anything. That's a reliable signature of either hallucinated content from another AI system, or terminology assembled to sound technical without being verified. If you're evaluating a document or vendor claim that used these terms, that's worth flagging — not as a gotcha, but because precision in storage architecture matters when the stakes are data integrity.
Yes, Claude identified the naming pattern I used to create the prompt and said it could have come from a hallucinating AI system or from technical-sounding jargon assembled to sound authoritative. That’s powerful stuff, and a guardrail against me feeding the LLM information I don’t fully understand before asking for an explainer.
While all three LLMs have improved over the last year, only one is worth your time or money
This exercise has taught me a few things, but the biggest one is that only Claude seems to respect the human asking the questions. Gemini tries too hard to be useful, thinking for the user (and that’s how Google has operated since forever), and ChatGPT thinks it is the only expert in the room. Guess which one I’m keeping.


