How ChatGPT Actually Picks Sources

I keep getting the same question from clients and SEOs (GEOs?).

“How do we show up in ChatGPT?”

The answer is always the same. Write good content, do listicles, comment on Reddit.

The usual.

But how do we actually know any of that works? Most of it gets repeated on faith, one expert quoting the last.

So, instead of taking it on trust, I spent a few days reading what ChatGPT sends my browser beneath the answer. The raw network traffic, in readable JSON.

This is a walk-through of what I found, roughly in the order I found it.

Before you quote a number from this, read this. It’s one person, one logged-in Pro account, a few days of traffic, not a population study. I logged about 1,240 source records across a few dozen searches. The structural findings, the fields ChatGPT uses and how they behave, are firm, because you only need to see a field once to know it’s real, and I saw them over and over. The numbers and percentages are a different matter. They come from a small batch of mostly SaaS and tech queries, so treat them as direction, not measurement. I flag which is which throughout.

How This Differs From The Big Visibility Studies, And What You Can Take To The Bank

There are two ways to do this kind of study, and they point in opposite directions.

The big studies, the ones the platforms and the well-funded tools run, fire thousands of prompts, record which brands appear in the answers, and roll that up into share-of-voice reports. Big sample, but black box. They only ever see the finished answer, so they have to infer the machinery underneath from the output.

This is the other way round. I read the network traffic, the JSON the engine sends to my own browser, and lift out the engine’s own internal labels: the result_source it stamps on each result, the turn_use_case it files each query under, the vendor names, the search queries it wrote, the model it actually ran. I’m not measuring how often something happens across a population. I’m documenting that the machine has a thing, and what the machine calls it.

That difference decides what you can trust here, so I’m going to be blunt about it.

2 Confidence Levels, Do Not Mix Them Up

Structural Facts (High Confidence)

That result_source exists and carries serp, labrador, bright, oxylabs. That bright is Bright Data and oxylabs is Oxylabs. That there are six turn_use_case values. That text queries skip the web entirely. That Thinking fires dozens of site: and price-verification sub-queries. These are read straight off the wire. One clean capture proves a field exists and what it’s named, and a prompt case study, however big, can’t see any of it.

Frequency Observations (Directional Only)

Anything with a percentage or a ranking, “70% bright,” “Reddit is the most cited domain,” “YouTube never gets cited,” comes from tens of queries on a single account, and my own query choice skews it. I picked SaaS and tech, which is exactly why Reddit and the tech review hubs lead here; a batch of health or fashion queries would crown different ones. Read these as the shape of the thing, not the measurement. Where a direction has a mechanical reason behind it (Reddit is text so it gets quoted, YouTube is video (metadata) so it doesn’t), trust the direction and ignore the exact number.

First, The Boring Truth About ‘Packet Analysis’

Skip this section if you don’t want to get into nitty-gritty technical details.

My first instinct was wrong. You can’t sniff packets and read queries, because the payload is TLS-encrypted, so a capture hands you scrambled ciphertext for the actual messages. What the capture does leak is the metadata: the destination hostname, the IPs, and the fact that the ChatGPT app talks over QUIC (HTTP/3), not plain TCP.

That’s why, in the screenshot below, Wireshark can still show “openai” in the handshake. It reads the unencrypted server name, not the conversation. QUIC only obfuscates its first packet, using keys derived from a fixed, public salt in the spec, so a tool can unwrap that opening packet to show the ClientHello.

Image Credit: Suganthan Mohanadasan

The real request and response bodies sit in later protected payloads that stay unreadable. So the readable layer is the browser itself, after decryption, in the Network panel.

That’s where the queries, the answers, and all the metadata live as JSON.

This is HTTP inspection, not packet sniffing, and it’s worth saying because half the people who try this start with Wireshark and give up. (I know I did lol.)

Two things that didn’t work, so you don’t repeat them.

  1. Driving a clean automated Chrome got me hard blocked by Cloudflare within a few queries on a different engine: the “verifying you are human” wall just loops forever in an automated browser, so I moved to my real Chrome with my real sessions.
  2. On ChatGPT, the answer never showed up in my capture at first, because it streams over a long-lived connection opened at page load that a hook installed mid-session can’t see. More on both later.

The Field That Labels Every Source

I opened DevTools, turned on Preserve log, ran a normal query, and searched the responses for anything that looked like a label.

The field that came back was result_source. It sits on every web result ChatGPT pulls; you never see it in the answer, and it takes 1 of 4 values.

Mark Williams-Cook shared that he had found three of these; I came across the fourth. I then saw Metehan’s post, and it looks like he may have already found it too. But honestly, this isn’t really about who found what first. It’s more about sharing what we’re seeing, comparing notes, and learning from each other.

Image Credit: Suganthan Mohanadasan

Here’s one source from the traffic, trimmed to the fields that matter.

{
 "attribution": "TechRadar",
 "url": "https://www.techradar.com/best/...",
 "snippet": "...",
 "pub_date": "2026-05-09",
 "result_source": "labrador"
}

The 4 values it uses:

result_source | What it is
serp | The open web baseline, mostly seen on news (Yahoo, StreetInsider)
labrador | An allowlist of established publishers: Reuters, The Guardian, the WSJ, the FT, Wikipedia, even arXiv. Snippets run to ~1,080 characters, basically full-article extracts
bright | Bright Data, a commercial web scraper. Dominant for shopping, finance, weather, local
oxylabs | Oxylabs, a rival scraper. Regional and local press, some open web

labrador looks like a licensed tier: several of those publishers have signed content deals with OpenAI, and it isn’t one you get into unless you own a national newspaper.

bright and oxylabs are the interesting pair. The names point at Bright Data and Oxylabs, two commercial scraping companies that happen to be direct rivals. I can’t see a contract in the traffic, so I won’t claim ChatGPT pays them, but its open web fetching runs through both, and the field tells you which one fetched each result. (We’ve been Oxylabs customers for a long time for our SaaS, Keyword Insights.)

Across everything I logged, bright did the bulk of the fetching, especially on commercial, shopping, finance and weather queries. oxylabs skewed regional and local, labrador stayed on news and reference, and serp mostly turned up on news. To put names to the tiers, labrador carried Reuters, the WSJ, Wikipedia and TechRadar, bright pulled Reddit, Forbes and rtings, and oxylabs brought the Gulf press like Khaleej Times and Gulf News.

I even caught the split within one weather query, bright taking the global sites like the Met Office while oxylabs handled the local Gulf press. (I live in Dubai, by the way.) In that one query, the breakdown came out like this.

Source | Pipeline
metoffice.gov.uk | bright
accuweather.com | bright
timeanddate.com | bright
khaleejtimes.com | oxylabs
gulfnews.com | oxylabs
whatson.ae | oxylabs
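
If you have a handful of captured records, a breakdown like the one above takes only a few lines to produce. This is a minimal sketch, assuming each record has the url and result_source fields from the trimmed example earlier. The function name groupByPipeline and the sample rows are illustrative, not part of any ChatGPT API, and the URLs are bare hostnames rather than the real pages from the capture.

```javascript
// Sketch: group captured source records by the pipeline that fetched them.
// Assumes records shaped like { url, result_source }; sample data is illustrative.
function groupByPipeline(records) {
  const groups = {};
  for (const { url, result_source } of records) {
    // Reduce the URL to a bare hostname, dropping a leading "www."
    const host = new URL(url).hostname.replace(/^www\./, '');
    if (!groups[result_source]) groups[result_source] = [];
    groups[result_source].push(host);
  }
  return groups;
}

const sample = [
  { url: 'https://www.metoffice.gov.uk/', result_source: 'bright' },
  { url: 'https://www.accuweather.com/', result_source: 'bright' },
  { url: 'https://www.timeanddate.com/', result_source: 'bright' },
  { url: 'https://www.khaleejtimes.com/', result_source: 'oxylabs' },
  { url: 'https://gulfnews.com/', result_source: 'oxylabs' },
  { url: 'https://whatson.ae/', result_source: 'oxylabs' },
];

const grouped = groupByPipeline(sample);
console.log(grouped);
```

Grouping on the hostname rather than the full URL keeps several pages from one site in a single bucket, which makes the per-pipeline split easier to read.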

The AI SEO/GEO Takeaway

You’re mostly competing in the scraped tier, so be cleanly scrapable. Put your facts and numbers in plain HTML text, never behind a script or inside a PDF or an image. The licensed tier is mostly shut, so the lever you’ve got is third-party coverage, PR, brand mentions, links, and Reddit, to land on the pages the scrapers actually reach.

The Queries That Never Reach The Web

The next thing I noticed was that some queries produced no network search at all. Before ChatGPT searches, it files your question into a bucket, in a field called turn_use_case. I saw six of them across the questions I tried: instant search, shopping, text, local, thinking, and image generation.

Image Credit: Suganthan Mohanadasan

The one to care about is text. When ChatGPT files your question as text, it doesn’t search. It answers from its training corpus and stops.

The obvious cases end up here: “how do I change a flat tyre,” “write a Python function to merge two sorted lists,” and “translate this into 4 languages” all came back text with an empty network tab.

Image Credit: Suganthan Mohanadasan

The one that should worry you is that “latest treatment guidelines for type 2 diabetes” also came back text, a current, high-stakes question you’d assume it researches. It didn’t; it answered from training. No E-E-A-T here. Oops!

Of 10 deliberately current questions I tried, three were handled this way with no search at all.

The wording decides the bucket, not the topic.

“best coffee near me” flips to the local pipeline, “best 4K TVs to buy” turns on shopping, but “best 4K TVs with reviews” stayed a normal search.

A maths question quietly jumped to a reasoning model under thinking, while “Tesla stock price this week” stayed instant search.

Keep in mind, these are results from my limited testing. I’ll do more tests when I find some more time.
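
To check which bucket a conversation landed in without eyeballing the JSON, a small recursive walk is enough. The sketch below assumes only that turn_use_case appears somewhere in the conversation object as a string. The nesting in fakeConversation and the literal values 'text' and 'shopping' are placeholders for illustration, since the exact position and spelling in the real payload may differ.

```javascript
// Sketch: collect every string value stored under a given key, at any depth.
function collectValues(node, key, out = []) {
  if (Array.isArray(node)) {
    for (const item of node) collectValues(item, key, out);
  } else if (node && typeof node === 'object') {
    for (const [k, v] of Object.entries(node)) {
      if (k === key && typeof v === 'string') out.push(v);
      else collectValues(v, key, out);
    }
  }
  return out;
}

// Hypothetical shape: the real conversation JSON may nest this differently.
const fakeConversation = {
  mapping: {
    a: { message: { metadata: { turn_use_case: 'text' } } },
    b: { message: { metadata: { turn_use_case: 'shopping' } } },
  },
};

const buckets = collectValues(fakeConversation, 'turn_use_case');
console.log(buckets);
```

Searching by key name rather than by a fixed path means the helper keeps working if the field moves inside the payload, which matters for a system that changes this often.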

The AI SEO/GEO Takeaway

Before you spend a penny on a page, check the query even searches. If it’s a how-to or a definition, it may be answered from training, where no page can get in, however good it is. Spend your effort where it actually fetches.

If you want to be mentioned for such queries, you’d need to spend a lot of time building authority and wait for your brand to be included in future training data. (For example, make sure crawlers like Common Crawl can see your site.)

How One Question Fans Out Into Dozens Of Searches (Fan-Out Queries)

ChatGPT also exposes the searches it runs for you, if you pull the full conversation back from its own API. On the fast model, it’s minimal: one reworded query and done, perhaps optimized for speed over depth. On the thinking model, asked to compare a few products, it ran roughly 15 to 40 sub-queries off the one question. (The number depended on the complexity of the question.)

Image Credit: Suganthan Mohanadasan

Here’s a slice of what it actually ran for one research task.

"Profound AI search visibility pricing AI engines tracked 2026"
"AthenaHQ pricing AI search visibility tool"
"site:peec.ai/pricing Peec AI Starter Pro Advanced 50 prompts 150 prompts"
"Peec AI pricing $95 $245 $495 official" (a guessed price, then searched to confirm)
"Scrunch AI pricing" (not in my prompt, found mid-research)
...around 40 of these for one comparison

Three things stand out in there. It fires site: probes straight at vendor pricing pages.

It guesses a price and then searches to confirm it. And it keeps widening as it goes, picking up tools you never named and chasing their pricing, too.

It doesn’t only search either; the page-reading is just as literal. It ran find for $, , 99 and even “Agency,” then used the browsing tool’s own open and click commands to pull up the results it wanted, run server-side, not an agent on your screen.

The same happens to your own site. Ask it “keyword insights pricing,” and it runs a site:keywordinsights.ai/pricing probe, guesses something like “Starter $58, Pro $145, Advanced $299,” then opens the page and reads the HTML for the currency symbol to confirm.

The AI SEO/GEO Takeaway

Put your key numbers and facts in plain HTML text, never inside an image, because in this pricing case it greps the page for currency symbols such as $ and can’t read a graphic. Also, you need to make sure you survive a site:yourdomain.com/pricing probe in this use case and write for the cleaned-up query it actually runs, not the messy phrase a person types. Avoid JavaScript-based toggles and dynamic data loading.
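
A quick way to test your own pricing page against this behaviour is to look at the raw HTML the way a non-rendering fetcher would: strip scripts and tags, then search what is left for prices. This is a rough sketch, not a reconstruction of ChatGPT’s reader. The helper names, the set of currency symbols in the regex, and both sample pages (with made-up prices) are assumptions for illustration.

```javascript
// Sketch: approximate the text a fetcher sees when it does not execute JavaScript.
function visibleText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop inline scripts and their contents
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop stylesheets
    .replace(/<[^>]+>/g, ' ')                    // drop remaining tags
    .replace(/\s+/g, ' ')
    .trim();
}

// Find price-like strings; the symbol set here is an assumption.
function findPrices(html) {
  return visibleText(html).match(/[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?/g) || [];
}

// Hypothetical pages with made-up prices.
const staticPage =
  '<table><tr><td>Starter</td><td>$19</td></tr><tr><td>Pro</td><td>$49</td></tr></table>';
const scriptedPage =
  '<div id="pricing"></div><script>render({ starter: "$19", pro: "$49" });</script>';

console.log(findPrices(staticPage));
console.log(findPrices(scriptedPage));
```

The static table yields its prices, while the script-rendered page yields none, because the numbers only exist inside code that a plain fetch never runs. To try it on a real page, pass in the response body of a plain HTTP request rather than the DOM from a rendered browser tab.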

Fetched, Cited, And Mentioned Aren’t The Same

This is the distinction people muddle most, so it’s worth being exact. Three different things can happen to a source.

  • Fetched. The model pulls your page into context. This is the result_source object. The reader never sees it.
  • Cited. It attaches your page as the source behind a specific sentence, the footnote you can click.
  • Mentioned. Your brand name appears in the answer, sometimes as a chip linking to your site, but it isn’t the source of the claim.

They’re three separate outcomes, and you can win or lose each on its own.

To see the gap between them, I took a batch of commercial and recommendation queries and split what ChatGPT fetched from what it cited.

This is the small, tech-skewed sample, so read what follows as a pattern, not a number to bank on.

Across that batch, Reddit and YouTube were both fetched heavily, 278 and 201 times. But Reddit was cited 11 times and YouTube not once.

I think the reason is mechanical. A citation has to bind to text the model actually pulled, and when it fetches a YouTube page in search, it gets the metadata, not the actual video transcript.

A Reddit thread is all there in the page. This isn’t just my sample either. Ahrefs, across 1.4 million ChatGPT prompts, found Reddit cited at 1.93% against YouTube’s 0.51%, and Profound found the same gap.
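
To make the fetched-versus-cited gap concrete, the counts from this batch can be turned into a per-domain citation rate. This is only the arithmetic on the four numbers reported above; the citationRate helper is illustrative, and the same small-sample caveat applies to the result.

```javascript
// Sketch: share of fetches that ended up as citations, per domain.
function citationRate(fetched, cited) {
  if (fetched === 0) return null; // no fetches, so no meaningful rate
  return cited / fetched;
}

// Counts reported for this batch.
const counts = {
  reddit: { fetched: 278, cited: 11 },
  youtube: { fetched: 201, cited: 0 },
};

for (const [domain, { fetched, cited }] of Object.entries(counts)) {
  const rate = citationRate(fetched, cited);
  console.log(domain, (rate * 100).toFixed(1) + '%');
}
```

Reddit comes out at roughly 4% of fetches cited and YouTube at zero, so even the most-cited domain here is fetched far more often than it is quoted.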

Image Credit: Suganthan Mohanadasan

A few other patterns, same caveat on sample size. Reddit was the single most-cited domain, narrowly, and after that no one ran away with it. The citations spread thin across review hubs like rtings and TechRadar and vendor pages cited for their own specs.

Here’s the top of the cited list across that batch.

Image Credit: Suganthan Mohanadasan

Vendor pages get cited too, but for their own facts, the pricing and specs. Zoho, Semrush, and the VPNs earned citations that way. The verdict on which one is best still gets cited to a third party. You can be mentioned without being cited, and cited without being mentioned.

Two mechanics sit beneath this. Citations bind to a specific sentence, not the whole answer, so being topically relevant isn’t enough; you have to be the best support for a precise claim.

And results are deduped by domain, so 20 thin pages from your site collapse into one.

One strong page per claim beats a pile of weak ones.

So, don’t go around creating thousands of low-quality, thin pages to address each fan-out query.
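
The effect of domain-level dedupe is easy to see in a toy version. The sketch below keeps the first result seen for each hostname and drops the rest. Which page actually survives inside ChatGPT is not visible in the traffic, so "first wins" is purely an assumption here, and the example.com and example.org URLs are placeholders.

```javascript
// Sketch: collapse a result list to one entry per domain (first seen wins; an assumption).
function dedupeByDomain(results) {
  const seen = new Map();
  for (const result of results) {
    const host = new URL(result.url).hostname.replace(/^www\./, '');
    if (!seen.has(host)) seen.set(host, result);
  }
  return [...seen.values()];
}

// Placeholder results: three thin pages from one site, one page from another.
const results = [
  { url: 'https://www.example.com/pricing-for-agencies' },
  { url: 'https://www.example.com/pricing-for-startups' },
  { url: 'https://example.com/pricing-for-freelancers' },
  { url: 'https://example.org/review' },
];

const unique = dedupeByDomain(results);
console.log(unique.map((r) => r.url));
```

Three pages from the same site take up one slot after dedupe, the same as a single page would, which is why extra thin pages buy no extra presence.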

The AI SEO/GEO Takeaway

You can’t cite yourself. The claim about you will get sourced from someone else, so earn third-party coverage on review sites and Reddit, win on text rather than video, and put one strong page behind each claim, because it dedupes by domain.

The Model Explains Its Own Strategy

I went looking for a hidden ranking score first and found nothing. That kind of logic (a domain authority number, a trust weight, a formula) never reaches your browser, because it stays on OpenAI’s servers.

So, anyone selling you “ChatGPT’s ranking factors” is selling you snake oil.

What the traffic does have is the thinking model’s chain of thought, saved in the conversation, where it describes its own sourcing in plain words.

Image Credit: Suganthan Mohanadasan

For facts, the pricing and the specs, it goes to the official page first, and it says so.

Evaluating Ahrefs, it reads the official page, notes it “lists Lite at $129, Standard at $249, and Advanced at $449,” and decides “pricing page seems more current, so I should cite that.” It wants the source it trusts, and the current one.

Then it hits the wall this whole post is about.

On Profound, it reasons that “the pricing isn’t showing up directly in the search result, likely because it’s loaded with JavaScript.” Same on Peec, where “the pricing doesn’t show up directly, likely hidden with JavaScript.”

So, it stops trying to read them and falls back. “I can quote third-party sources since the official page is hard to parse and doesn’t show prices,” it writes, and it notes it should “use citations from G2 where appropriate.”

That’s the whole game in one trace. The model wanted Profound’s and Peec’s own numbers. Their pricing sat behind JavaScript, so it couldn’t read them, and it cited G2 instead. Your facts, someone else’s page, because yours wouldn’t parse.

These quotes are the model’s own, from the saved reasoning, not mine.

The AI SEO/GEO Takeaway

Own your facts, in plain HTML. Your pricing and spec numbers have to sit in crawlable text, not loaded by JavaScript and not baked into an image, because the model reads the page itself and gives up when it can’t. A JavaScript pricing table doesn’t just rank badly; it hands your numbers to G2.

The opinion you earn separately, through reviews, Reddit, and honest comparison content, which is where the recommendation gets cited from. A clean, readable pricing page with no third-party coverage gets your facts read and someone else recommended.

What I Could Not See

There’s no visible ranking logic, as above, so why one source beats another, beyond the model’s own narration, stays server-side.

Personalization is real and selective.

On a query that overlapped my own work, ChatGPT pulled in my past conversations, with the sources listed as personal_sources: ["convo_search", "gmail", "files"].

It used one of my past chats inside a generic “best tools” answer, but only on one of the three conversations I checked, the one that matched my history.

So, part of some answers is built from a user’s private data you can never optimize for, which is one reason two people get different answers and visibility scores wobble.

Local is capped. There’s a config value, local_results_limit, set to 2.

Image Credit: Suganthan Mohanadasan

Ask for the best coffee near you, and ChatGPT returns two places, not a top 10. For local, you’re in the top 2, or you aren’t there.

One thing I genuinely can’t call yet. My read on shopping comes from a single shopping query, and it flatly contradicts what Mark saw on his single query, so the shopping mix is unsettled until someone runs a proper batch.

And the broader caveat, said plainly. The structure I’m sure of, because I saw it across roughly 1,240 records. The percentages come from a small batch of commercial queries, mostly SaaS and tech, so they need a bigger run across real verticals before anyone banks on them.

That run is the next piece.

Run It Yourself

None of this needs special access or requires you to be connected to the Matrix and become an operator, just your own browser.

Image Credit: Suganthan Mohanadasan

Open ChatGPT, press Cmd+Option+I for DevTools, open Network, tick Preserve log, run a query, then press Cmd+Option+F and search the responses for result_source.

That alone shows you the pipeline behind each link.

For the rest, the fan-out and the citations and the reasoning, open the Console, type allow pasting once, and run this against a conversation that searched the web.

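// Get the access token for the current logged-in session.
const t = (await (await fetch('/api/auth/session')).json()).accessToken;
// Fetch the full JSON of the conversation open in this tab (its ID comes from the /c/<id> URL).
const c = await (await fetch('/backend-api/conversation/' + location.pathname.split('/c/')[1], {headers: {Authorization: 'Bearer ' + t}})).json();
const rows = [];
// Use JSON.stringify's replacer as a cheap recursive walk over every nested value.
JSON.stringify(c, (k, v) => {
 if (v && v.result_source) {
 const d = (v.attribution || v.url || '?').toString();
 rows.push({source: d.replace('https://', '').replace('www.', '').split('/')[0], pipeline: v.result_source});
 }
 return v;
});
console.table(rows);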

It reads only your own session, so nothing leaves your machine. The output is a plain table of each source and the pipeline that fetched it.

source | pipeline
techradar.com | labrador
whathifi.com | labrador
soundguys.com | bright
rtings.com | bright
khaleejtimes.com | oxylabs
streetinsider.com | serp

Change what the loop collects, and you can pull the searches, the citations, and the reasoning the same way.

A Free Extension Now Captures Most Of This

If pasting scripts into your own console isn’t your thing, there’s now an easier route. Olivier de Segonzac already ran a free Chrome extension that pulls ChatGPT’s search and fan-out data.

He read this research and extended it to capture three of the signals I took apart above.

  • The turn_use_case bucket. The intent label ChatGPT files each turn under, so you can spot when a query flips to shopping, local, or text before it even answers.
  • The reference-type mix. How many of the answer’s citations were products versus search results, news, or images, parsed straight from the reference tokens.
  • The result_source pipeline. The scraper behind each cited result, charted per conversation, so the Bright Data, Oxylabs, Labrador, and SERP split shows up without you reading a line of JSON.

It runs locally on your own session and exports straight to Excel. Grab it from the Chrome Web Store, and Olivier wrote up the update here.

Image Credit: Suganthan Mohanadasan

So, back to the question we opened with. Does the usual advice hold up? Mostly. Reddit earns citations and topped my cited list. Listicles and review sites make up much of the rest. Good content still matters, but only the part the model can actually read. The rest it reads off someone else’s page.

Which is the real lesson. ChatGPT isn’t a search engine, so stop optimizing for one.

It reads your own page for the facts, if it can parse them, and everyone else’s for the opinion, and only when the question is worth a search. Build for that.

And treat all of this, mine included, as a snapshot of a system that changes by the week. The structure holds. The numbers move.

While I was in the traffic, I also found a pile of things with nothing to do with sourcing: the bot wall that stops you scripting it, a hidden shopping engine, and 573 live experiments running on the account. These will be published separately.

I’ve also done similar analysis on Perplexity, Gemini, etc., so I’ll be sharing those soon.


This post was originally published on Suganthan.


Featured Image: Viktoriia_M/Shutterstock
