#001 — Memory
Memory is the reconstruction of the past. Both humans and machines feel as if they recall information; both are actually building it up from fragments at the moment of recall. Misremembering is structural, not a bug. Handing the act of remembering to a tool changes what gets encoded in the first place — and what the tool returns gets blended into what the human reports as their own memory.
The pattern: recall reconstructs the past rather than replaying it. Humans confabulate; machines hallucinate; the systems built from both inherit failures that neither component can self-correct.
🧠 In humans
Memory is not a recording. It is a reconstruction assembled at the moment of recall, drawing on traces, schemas, and current context. Bartlett’s “War of the Ghosts” experiments (Bartlett, 1932) showed that subjects asked to retell an unfamiliar folk tale systematically reshaped the story toward their own cultural expectations on each retelling. The reshaping was unconscious; the subjects believed they were reporting accurately.
The reconstructive nature shows up sharply in the misinformation effect (Loftus & Palmer, 1974). Subjects who watched a film of a car crash gave higher speed estimates when the verb in the question was “smashed” rather than “hit.” A week later, those given the “smashed” framing were more likely to remember broken glass that had not appeared in the film. Post-event information had altered memory.
The DRM paradigm (Roediger & McDermott, 1995) reliably produces false memories under controlled conditions. Subjects shown lists of related words (“bed, rest, awake, tired…”) often report afterward that they saw “sleep,” a critical lure that never appeared. They report the false memory with as much confidence as the true ones.
Schacter’s catalog of memory failures (Schacter, 1999) organizes the failure space: transience, absent-mindedness, blocking, misattribution, suggestibility, bias, persistence. Each, in his framing, is the cost-side of a feature that serves memory well in other contexts.
Canonical: Bartlett (1932); Loftus & Palmer (1974); Roediger & McDermott (1995); Schacter (1999).
🤖 In machines
Large language models produce text that reads like recall but does not function like it. There is no underlying memory store; output is generated token by token from learned distributions over the training corpus. The result, when the model is asked about a specific fact, ranges from accurate retrieval (for high-frequency, well-documented facts) to hallucination: fluent, confident output with no basis in training data.
Hallucinated citations are the canonical case. A model asked for relevant papers in a niche area produces plausible-sounding titles, real authors’ names attached to fictional papers, and DOI strings that resolve to nothing. The model is not lying; it has no mechanism to distinguish “fact recalled” from “pattern generated.” Both are, internally, the same operation.
Context-window forgetting is a second class of memory failure. In long inputs, information in the middle is recalled less reliably than information at the beginning or end (Liu et al., 2023, “lost in the middle”). The failure mode is not random; it tracks position in the input, which suggests it is structural to how attention operates over long sequences.
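One way to observe the position effect is to place the same fact at different depths in otherwise identical context and compare answers. The sketch below only builds the probe contexts; the function name, the filler passages, and the needle sentence are illustrative assumptions, and sending each context to a model and scoring the replies is left to whatever client is in use.

```python
def build_probe_contexts(needle, fillers, positions):
    """Return {position: context} with the needle inserted at each depth.

    A position is a fraction in [0, 1]: 0.0 puts the needle first,
    1.0 puts it last, 0.5 puts it in the middle of the filler passages.
    """
    contexts = {}
    for pos in positions:
        index = round(pos * len(fillers))
        docs = fillers[:index] + [needle] + fillers[index:]
        contexts[pos] = "\n\n".join(docs)
    return contexts


# Hypothetical probe data: ten filler passages and one fact to recall.
fillers = [f"Filler passage {i}." for i in range(10)]
needle = "The access code is 7421."
contexts = build_probe_contexts(needle, fillers, [0.0, 0.5, 1.0])
```

Asking the same question ("What is the access code?") against each context, and repeating over many needles, gives a rough recall-by-position curve for a given model.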
Knowledge-cutoff drift affects time-sensitive recall. A model trained through a cutoff date may still answer confidently about events after that date. Sometimes it hedges and sometimes it does not, and it often has no clear awareness of where its training ended.
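Because the model may not flag this itself, a caller can add a crude external guard. The sketch below flags questions that mention a year later than an assumed cutoff; the cutoff date and the function name are hypothetical, and a real cutoff has to come from the model's documentation. It catches only explicit four-digit years, not phrases like "last month."

```python
import re
from datetime import date

# Hypothetical cutoff, for illustration only.
ASSUMED_CUTOFF = date(2024, 6, 30)


def years_after_cutoff(question, cutoff=ASSUMED_CUTOFF):
    """Return the sorted years mentioned in `question` that fall after the cutoff year."""
    years = {int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", question)}
    return sorted(y for y in years if y > cutoff.year)


flagged = years_after_cutoff("Who won in 2020, and who won in 2026?")
# flagged == [2026]
```

A non-empty result is a signal to route the question to retrieval or to attach a warning, rather than to trust an unaided answer.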
Retrieval-augmented generation (RAG) was proposed as a partial fix: ground generation in retrieved documents. The fix introduces its own failure modes: models that misquote, paraphrase past the source’s meaning, or cite retrieved documents that did not in fact support the claim being made.
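The misquoting failure, at least, can be checked mechanically. The following is a minimal sketch, assuming the answer marks quotations with straight double quotes and the retrieved documents are available as plain strings; the function names are illustrative. It does not detect paraphrase drift or citations that fail to support a claim, which need semantic rather than string comparison.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are not counted as misquotes."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(answer, sources):
    """Return quoted spans in `answer` that appear verbatim in none of `sources`."""
    quotes = re.findall(r'"([^"]+)"', answer)
    normalized_sources = [normalize(s) for s in sources]
    return [
        q for q in quotes
        if not any(normalize(q) in s for s in normalized_sources)
    ]


# Hypothetical retrieved passage and generated answer.
sources = ["Encoding was reduced when external storage was expected."]
answer = 'The paper says "encoding was reduced" and that "memory improved by 40%".'
missing = unsupported_quotes(answer, sources)
# missing == ["memory improved by 40%"]
```

An empty result means only that every quoted span exists somewhere in the sources, not that the answer is faithful to them.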
Canonical: Kadavath et al. (2022) on what models know about what they know; Liu et al. (2023) on long-context attention failure; ongoing work on RAG faithfulness.
🤝 In hybrid systems — cognitive offloading
Sparrow, Liu & Wegner (2011), in “Google Effects on Memory,” showed that subjects who expected to be able to look information up later remembered less of it. The effect was not “they took notes and trusted the notes.” The effect was internal: memory encoding was reduced when external storage was expected.
The Sparrow finding has generalized as the cognitive-offloading literature has grown. The effect is robust across digital tools that augment recall, from search engines to calendar apps to LLM-mediated conversation.
Three things change when external storage is a large language model. First, the storage is interactive: the human does not retrieve a record; they ask and receive a constructed answer. Second, the storage is sometimes confidently wrong: unlike a search that returns nothing for an unknown query, the model produces an answer regardless. Third, the human’s later recall blends with the model’s output: hallucinations are incorporated into what the human reports as their own memory of the topic.
The third effect is the load-bearing one. A human who asked Claude about a paper a week ago, and a human who actually read that paper a week ago, often produce indistinguishable accounts of “what the paper said,” including hallucinated details. The texture of recall has changed; the human cannot easily tell which parts of their memory came from where.
Canonical: Sparrow, Liu & Wegner (2011); recent work on cognitive offloading and on LLM-mediated false memory incorporation.
↔ Where they converge
- All three produce false memories that feel like real ones.
- All three are biased by recency, frequency, and salience.
- All three confabulate to fill gaps rather than report absence.
- All three are reconstructive; none are reproductive.
⤨ Where they diverge
- Human memory is embodied, contextual, and metacognitively accessible. With effort, humans can recognize a memory as a memory.
- Machine memory is distributed across weights and, for RAG systems, external indexes. There is no analogous self-knowledge of retrieval; the model cannot report whether a given output came from training, retrieval, or generation.
- The hybrid case is the only one in which memory has been exteriorized in a way that changes the encoding side. Humans encode less when they expect the AI to remember for them. Neither component alone can produce that effect.
🌀 Open question
What does memory become when externalized cognition is continuously available? The Sparrow finding suggests humans encode less when they expect external storage. If external storage is an LLM conversation, does that effect compound across years of use, or does the shape of LLM-mediated recall differ from search-engine-mediated recall in ways the existing literature has not yet measured? The studies needed have not yet been done. (Open as of mid-2026.)
📡 Recent entries (auto-fed)
Week 2026-W22
2026-05-22 — Ogundoyin, Ikram, and Masood assessed 6,233 web-deployed medical GPTs and 10 open-source LLMs with the MedGPT-HEval framework, reporting that 25 to 30 percent exhibit low factual accuracy and 57.06 percent of Action-enabled variants lack adequate privacy disclosures (arXiv:2605.20591).
2026-05-22 — A GraphRAG benchmark on Electronic Health Record schema retrieval across Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, and Phi-4-mini 3.8B found that local retrieval reduced hallucination relative to global summarization, while sub-7B models failed to produce valid structured outputs (arXiv:2605.20815).
2026-05-22 — A study of hallucination in large vision-language models links the failure to insufficient and decaying attention on correct visual tokens across layers, and proposes ILVAD, a training-free re-weighting method that uses inter-layer attention discrepancy to build a saliency map (arXiv:2605.20965).
2026-05-22 — ClaimRAG-LAW, a bilingual French and English benchmark for legal retrieval-augmented generation, applies claim-level evaluation to state-of-the-art systems and reports that hallucination persists in both general-purpose and legal-specific RAG pipelines (arXiv:2605.21071).
2026-05-22 — VerbatimRAG applied to ACL Anthology papers maps user queries to verbatim text spans in retrieved documents; a 150M-parameter ModernBERT token classifier reaches word-level F1 of 53.6, ahead of the strongest evaluated LLM extractor at 48.7 (arXiv:2605.21102).
2026-05-22 — Yeom and colleagues tested Qwen and Llama models from 0.8B to 72B parameters and found that 16 to 47 percent of Instruct-model hallucinations occur when substantial probability mass already sits on the correct answer concept, with the rate rising monotonically with scale; the distinguishing factor was whether probability concentrated on one surface form or dispersed across alternatives (arXiv:2605.22007).
2026-05-25 — A lawyer at Binnall Law Group apologized to a federal judge in San Francisco after submitting a court filing containing phantom quotations generated by an AI tool, in a matter related to the Trump administration’s firing of government workers (AIID #1499).