Why Hemingway gets flagged as AI.
And what a literary-specific detector finds instead.
A confession. We almost shipped Slopsleuth without testing it on Hemingway. When we finally did — a routine validation pass on the calibration baselines — the result was nearly fatal to our marketing claim.
The Sun Also Rises scored 23 / 100 on Slopsleuth. That's the same tier as a Claude-generated thriller we wrote cold for benchmarking. If we'd shipped that, the first reviewer to run a Hemingway novel through our tool would have torn us apart on Twitter.
So we dug in. What we found is the most useful thing we've learned about building literary-specific AI detection: which signals matter, which ones lie, and what changes when you calibrate for fiction instead of college essays.
The setup
We ran The Sun Also Rises (Project Gutenberg public-domain text, ~67,000 words) through Slopsleuth's five audits. The verdict came back SIGNIFICANT SIGNALS — 23 / 100. Two audits flagged: dialogue texture (LOAD) and voice variance (WATCH).
"It was a good fight. Not bad. Just enough. He had not expected the boy to fight at all."
Sentences like that one. Short. Declarative. Negative-form fragments ("Not bad."). Almost no um, uh, or hedging in dialogue. To a perplexity-based detector trained on average internet text, this looks exactly like AI prose. Hemingway's signature minimalism is statistically indistinguishable from a chatbot's scrubbed, filler-free output.
The diagnosis
The dialogue-texture audit fired LOAD because of a rule we'd added specifically to catch contemporary AI prose: zero fillers (um/uh/er) across a manuscript with 100+ dialogue paragraphs escalates the verdict. That rule made sense for 2026 fiction — modern dialogue uses fillers naturally, so their total absence is a strong AI-sanitization signal.
But pre-1980s literary prose simply doesn't use those fillers. It's a stylistic convention that emerged later. Fitzgerald doesn't write "um." Anderson doesn't write "uh." Hemingway certainly doesn't. The rule was correct for our calibration baseline (a 2026 thriller) but produced false positives across an entire era of literature.
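The pre-fix rule can be sketched in a few lines. This is a hypothetical reconstruction, not Slopsleuth's actual source: the filler list and the 100-paragraph threshold come from the post above; the function name, regex, and everything else are illustrative.

```python
import re

# Fillers named in the post; the word boundaries keep "er" from
# matching inside words like "fighter".
FILLERS = re.compile(r"\b(um|uh|er)\b", re.IGNORECASE)

def dialogue_texture_verdict(dialogue_paragraphs):
    """Pre-fix logic: zero fillers across 100+ dialogue paragraphs
    auto-escalated the audit to LOAD, regardless of other evidence."""
    filler_hits = sum(bool(FILLERS.search(p)) for p in dialogue_paragraphs)
    if len(dialogue_paragraphs) >= 100 and filler_hits == 0:
        return "LOAD"  # the auto-escalation that flagged Hemingway
    return "PASS"

# Hemingway-style dialogue: many short paragraphs, no fillers anywhere.
hemingway = ['"It was a good fight."', '"Not bad."'] * 60
print(dialogue_texture_verdict(hemingway))  # escalates on clean prose
```

Run against any filler-free manuscript of sufficient length, the rule fires every time, which is exactly the false-positive mode described above.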
The fix
We removed the auto-escalation. Now the zero-filler observation is captured as informational metadata in the report — "common in pre-1980 prose; can also indicate AI sanitization. Treat as informational, not diagnostic." The dialogue audit still fires if texture overall is low. But it no longer escalates based on fillers alone.
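A minimal sketch of the post-fix shape, under the same caveats as before (names, the report structure, and the 0.3 texture threshold are all invented for illustration): the zero-filler observation becomes a note in the report, and the verdict rests on overall texture alone.

```python
import re

FILLERS = re.compile(r"\b(um|uh|er)\b", re.IGNORECASE)

def dialogue_texture_report(dialogue_paragraphs, texture_score):
    """Post-fix logic: zero fillers is recorded as informational
    metadata; only a low overall texture score triggers LOAD."""
    notes = []
    filler_hits = sum(bool(FILLERS.search(p)) for p in dialogue_paragraphs)
    if len(dialogue_paragraphs) >= 100 and filler_hits == 0:
        notes.append("zero dialogue fillers: common in pre-1980 prose; "
                     "can also indicate AI sanitization. "
                     "Treat as informational, not diagnostic.")
    verdict = "LOAD" if texture_score < 0.3 else "PASS"
    return verdict, notes

# Hemingway-style input: the note is captured, but with healthy
# overall texture the verdict no longer escalates.
hemingway = ['"It was a good fight."', '"Not bad."'] * 60
verdict, notes = dialogue_texture_report(hemingway, texture_score=0.6)
print(verdict, len(notes))
```

The key design change: the filler signal can no longer reach the verdict on its own; it can only annotate it.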
The post-fix scores:
| Sample | Score | Verdict |
|---|---|---|
| Gatsby (Fitzgerald, 1925) | 0.0 / 100 | Within human range |
| Red Badge (Crane, 1895) | 0.0 / 100 | Within human range |
| Sun Also Rises (Hemingway, 1926) | 7.0 / 100 | Light signals |
| Winesburg, Ohio (Anderson, 1919) | 7.0 / 100 | Light signals |
| AI thriller (Claude, cold) | 23.0 / 100 | Significant signals |
Hemingway moved from SIGNIFICANT SIGNALS to LIGHT SIGNALS. The AI thriller stayed at the same score: the discrimination didn't break; it improved.
What this means for the product
Three things, in order of importance:
- Calibration on contemporary AI is necessary but not sufficient. Every audit needs at least one historical baseline (1900–1950 era literary prose) to catch rules that work on modern text but fail on different stylistic conventions.
- Zero-evidence escalations are dangerous. A signal should never single-handedly flip a verdict. It should adjust a score that's already weighted by multiple inputs.
- Hemingway is the right canonical test. If your literary AI detector flags Hemingway, fix the detector. Don't argue with Hemingway.
Try it yourself
The exact text we used is on Project Gutenberg (ebook #67138). Download it and run it through Slopsleuth for free.
If you find a published novel that scores above 15/100, please tell us. We'll add it to our calibration set.
Run a sample audit yourself
The Hemingway sample is pre-loaded in Slopsleuth. One click, full report, no signup needed.
Launch Slopsleuth →