
Right now, neurons in your brain are firing in patterns. Each image creates a unique signature of electrical activity. Scientists recently cracked part of the code. They can predict what you’re seeing by analyzing those patterns through large language models.
The discovery reveals something unexpected. The way LLMs process text descriptions mirrors how your brain processes visual scenes.
The Bridge Between Words and Vision
LLMs excel at capturing context. When you feed an image caption into a model, it converts that text into embeddings: numerical representations that preserve meaning. The process starts with tokenization, where language is broken into smaller units, like words or fragments. Each token becomes a vector, a series of numbers representing position in conceptual space.
Similar concepts cluster together. For example, “cow” and “calf” would land near each other. “Cow” and “car” would sit far apart. Each number in the vector encodes relationships the model learned during training. This transformation turns captions into mathematical objects that preserve semantic relationships.
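To make that concrete, here is a minimal sketch of turning captions into embeddings and comparing them. It assumes the sentence-transformers library and uses the MPNet-based "all-mpnet-base-v2" model purely for illustration; the study's exact setup may differ.

```python
# Minimal sketch: captions -> embeddings -> similarity scores.
# Assumes the sentence-transformers library; "all-mpnet-base-v2" is an
# illustrative model choice, not necessarily the one used in the study.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

captions = [
    "a cow grazing in a green field",
    "a calf standing beside its mother",
    "a car parked on a busy street",
]

# Each caption becomes a fixed-length vector in the model's semantic space.
embeddings = model.encode(captions, convert_to_tensor=True)

# Related concepts (cow/calf) should score higher than unrelated ones (cow/car).
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # cow vs. calf
print(util.cos_sim(embeddings[0], embeddings[2]).item())  # cow vs. car
```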
The question researchers asked: Does your brain use a similar encoding scheme?

Mapping Brain Activity to Language Space
The Natural Scenes Database provided the raw material: over 70,000 images from the COCO dataset, each viewed by participants while fMRI scanners recorded their brain activity. COCO images show real-world scenes annotated with captions describing the content. The researchers wanted to know whether LLM embeddings could predict neural responses to these images, so they used MPNet, a Microsoft model built for semantic similarity tasks. MPNet converts captions into embeddings precise enough to judge whether two sentences mean the same thing, much as a human reader would. The test involved creating two maps: the first placed images according to caption similarity in LLM embedding space; the second positioned images according to the brain response patterns captured by fMRI.
Representational Similarity Analysis compared these maps. If images cluster similarly in both spaces, the brain and the LLM are processing visual information through comparable frameworks. The results confirmed it. Brain regions grouped images the same way MPNet grouped captions. Higher-level visual areas, the ones handling object recognition, scene understanding, and contextual meaning, matched LLM predictions most accurately.
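In code, Representational Similarity Analysis boils down to a few steps: compute pairwise dissimilarities between images in the embedding space, do the same in the brain-response space, then correlate the two sets of distances. The array shapes, distance metric, and Spearman correlation below are illustrative assumptions rather than the paper's exact pipeline.

```python
# Sketch of Representational Similarity Analysis (RSA).
# caption_embeddings: (n_images, embed_dim); brain_responses: (n_images, n_voxels).
# Shapes, metrics, and the random test data are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(caption_embeddings, brain_responses):
    # One dissimilarity value per image pair, computed in each space.
    model_rdm = pdist(caption_embeddings, metric="correlation")
    brain_rdm = pdist(brain_responses, metric="correlation")
    # High rank correlation means the two spaces group images the same way.
    rho, _ = spearmanr(model_rdm, brain_rdm)
    return rho

rng = np.random.default_rng(0)
print(rsa_score(rng.normal(size=(50, 768)), rng.normal(size=(50, 2000))))
```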
Your brain doesn’t simply register pixels but extracts meaning, and that extraction process resembles how language models compress semantic information.
Predicting Neural Activity at Scale
The encoding model went deeper. Brain scans divide the organ into voxels, tiny cubes of space containing thousands of neurons. Looking at individual neurons is impractical, so voxels provide manageable units for analysis: patches of forest rather than individual trees.
For each voxel, researchers trained a linear regression model connecting LLM embeddings to brain activity. The model learned patterns on training images, then predicted responses to novel images it had never encountered. The predictions matched actual brain activity across multiple regions, not just isolated spots.
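A rough sketch of what such an encoding model can look like: one regularized linear regression mapping caption embeddings to voxel responses, fit on training images and scored on held-out ones. Ridge regression and the per-voxel correlation metric are common choices assumed here, not details taken from the paper.

```python
# Sketch of a voxelwise encoding model.
# X: (n_images, embed_dim) caption embeddings; Y: (n_images, n_voxels) fMRI responses.
# Ridge regression and the train/test split are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def fit_encoding_model(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.2, random_state=0)
    # One linear map into all voxels at once; with a shared alpha this is
    # equivalent to fitting each voxel's regression separately.
    model = Ridge(alpha=1.0).fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # Score each voxel: correlation between predicted and measured activity
    # on images the model never saw during fitting.
    scores = np.array([np.corrcoef(Y_test[:, v], Y_pred[:, v])[0, 1]
                       for v in range(Y.shape[1])])
    return model, scores

rng = np.random.default_rng(0)
_, voxel_scores = fit_encoding_model(rng.normal(size=(200, 768)),
                                     rng.normal(size=(200, 500)))
print(voxel_scores.mean())
```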
The most revealing finding came from cross-subject testing. Models trained on one person’s brain data predicted another person’s neural responses. The connection between language embeddings and visual processing isn’t personal; it’s shared across humans. Your brain and mine use similar encoding strategies to represent visual scenes.
Then they flipped the process. Instead of predicting brain activity from captions, they reconstructed captions from brain activity. The decoding model worked. Brain signals contained enough structure to translate back into LLM embedding space, which could then generate accurate text descriptions.
The circle closed. Brain to embedding to brain. The correspondence holds in both directions.
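One way to picture the decoding direction is a reverse linear map plus a nearest-neighbor lookup: predict an embedding from brain activity, then retrieve the caption whose embedding sits closest. The linear decoder and cosine retrieval below are assumptions about how such a pipeline could look, not the study's exact method.

```python
# Sketch of decoding: brain activity -> predicted embedding -> nearest caption.
# Y: (n_images, n_voxels) fMRI responses; E: (n_images, embed_dim) matching
# caption embeddings. Shapes, decoder, and random data are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

def train_decoder(Y_train, E_train):
    # Linear map from voxel space back into the LLM embedding space.
    return Ridge(alpha=1.0).fit(Y_train, E_train)

def retrieve_caption(decoder, y_new, caption_library, library_embeddings):
    pred = decoder.predict(y_new[None, :])[0]
    # Cosine similarity against every caption embedding in the library;
    # the best match is taken as the decoded description.
    sims = library_embeddings @ pred / (
        np.linalg.norm(library_embeddings, axis=1) * np.linalg.norm(pred) + 1e-8)
    return caption_library[int(np.argmax(sims))]

rng = np.random.default_rng(0)
Y, E = rng.normal(size=(100, 500)), rng.normal(size=(100, 768))
decoder = train_decoder(Y, E)
library = [f"caption {i}" for i in range(100)]
print(retrieve_caption(decoder, Y[0], library, E))
```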
Teaching Vision Models to Think Like Brains
Understanding that brains process vision like LLMs process text opened a new avenue. Instead of training vision AI to classify objects, why not teach it to generate caption-style embeddings directly from pixels?
They built a Recurrent Convolutional Neural Network for this purpose. The RCNN looks at an image and produces vectors matching what an LLM would create from a caption describing that image. This architecture bridges the gap between raw visual input and semantic representation.
Training used COCO images, excluding the ones from the brain scan database to prevent overlap. The results exceeded expectations. The LLM-trained RCNN matched human brain responses better than the direct LLM-caption embeddings did. Starting from pixels and converting them into language-space representations mimicked the brain’s actual process more accurately than starting with text. Your brain doesn’t receive captions. It receives photons hitting retinal cells, which trigger cascading neural responses that eventually extract meaning. The RCNN replicates that progression: visual input to semantic encoding.

The researchers then compared their RCNN against thirteen leading vision models, including systems trained on hundreds of millions of images, such as ImageNet- and CLIP-based models. The LLM-trained RCNN used only 48,000 training images. Despite the massive data disadvantage, it outperformed every competitor at matching brain activity across major visual regions.
The quantity of training data mattered less than the objective function. Teaching the model to produce language-compatible embeddings from images created better alignment with human neural processing than teaching it to classify objects into categories.
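A minimal PyTorch sketch of that objective: a small convolutional network maps pixels to a vector in caption-embedding space and is trained so its output matches the LLM embedding of the image’s caption. The recurrent part of the actual RCNN is omitted, and the tiny architecture, cosine-similarity loss, and stand-in data are illustrative assumptions, not the paper’s model.

```python
# Sketch: train a network to emit caption-style embeddings directly from pixels.
# The architecture, cosine loss, and random stand-in data are illustrative only;
# the study's RCNN (including its recurrent layers) is not reproduced here.
import torch
import torch.nn as nn

class PixelToEmbedding(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, embed_dim)

    def forward(self, images):
        return self.head(self.features(images))

model = PixelToEmbedding()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step: push the image embedding toward the LLM
# embedding of that image's caption (cosine similarity driven toward 1).
images = torch.randn(8, 3, 224, 224)        # stand-in for COCO images
caption_embeddings = torch.randn(8, 768)    # stand-in for MPNet caption vectors

pred = model(images)
loss = 1 - nn.functional.cosine_similarity(pred, caption_embeddings, dim=1).mean()
loss.backward()
optimizer.step()
print(float(loss))
```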
Reconstructing Visual Experience from Neural Signals
The practical test involved caption reconstruction. Given only brain activity, could the system identify the correct caption from a large library? The answer was yes. Brain signals encoded enough information to match appropriate text descriptions to visual experiences. This creates a translation layer between neural activity and language. If brain patterns map onto the same space that AI uses to understand sentences, you can move between visual experience, neural representation, and linguistic description. The implications extend across multiple domains.
Consider assistive technology designed for people who cannot speak. Mental images could convert into words without requiring motor control. A prosthetic limb could parse commands like “pick up the yellow cup” because the brain signal translates into language space, where “yellow” and “cup” have precise meanings. The interface operates at the level of natural thought rather than raw decoded electrical impulses.

This approach could also transform medical diagnosis. Doctors could compare your brain’s visual processing patterns against baseline healthy responses. Early detection of Alzheimer’s or other memory disorders becomes possible when neural encoding degrades before behavioral symptoms appear. Instead of subjective assessment, physicians get quantitative measurements of how accurately your brain represents visual information.
The technology also addresses a fundamental AI challenge. Training better vision models requires understanding how biological intelligence solves the same problems. These embeddings reveal the objective function that evolution optimized for. Building AI that processes images through language-compatible representations creates systems that align more naturally with human cognition.
The Architecture of Meaning
This research exposes something profound about how meaning gets encoded. Your brain doesn’t store images as pixel arrays. It compresses visual experience into abstract representations that preserve semantic relationships. An LLM does the same thing with text. Both systems discover that meaning requires dimensional reduction, that the infinite complexity of sensory input must collapse into structured patterns that preserve what matters.
The convergence isn’t coincidental. Language evolved to describe the world. Visual processing evolved to understand the world. The overlap exists because both systems solve the same core problem: extracting stable patterns from noisy input and organizing those patterns by meaningful similarity.
When you see a dog, your brain doesn’t encode fur texture at the pixel level. It encodes “dog” in a space where similar concepts cluster. When an LLM reads “dog,” it performs an analogous operation. The representations are compatible because the underlying task, organizing experience into manipulable concepts, is identical.
This opens practical paths forward. Automatic image annotation at scale becomes possible with human-level understanding. Brain-computer interfaces can operate in language space rather than raw neural signals. Medical diagnostics gain precision by measuring how well your brain maintains semantic structure in its visual representations.
The gap between thought and action, between imagination and reality, narrows when we understand the shared language of neural and artificial intelligence. Both speak in embeddings, in vectors that capture relationships. Learning to translate between them means building systems that interface with human cognition at its natural level of abstraction, where meaning lives.