
Tired Of SEO Spam, Software Engineer Creates A New Search Engine

A software engineer from New York got so fed up with the irrelevant results and SEO spam in search engines that he decided to create a better one. Two months later, he has a demo search engine up and running. Here is how he did it, and four important insights about what he feels are the hurdles to creating a high-quality search engine.

One of the motivations for creating a new search engine was the perception that mainstream search engines contained a rising amount of SEO spam. After two months, the software engineer wrote about his creation:

“What’s great is the comparative lack of SEO spam.”

Neural Embeddings

The software engineer, Wilson Lin, decided that the best approach would be neural embeddings. He created a small-scale test to validate the approach and noted that it was successful.
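The article does not include his code, but the core idea of an embedding-based retrieval test can be sketched in a few lines of Python. The library, model name, and sample documents below are illustrative assumptions, not details from Lin's project:

```python
# A minimal embedding-retrieval sketch, assuming the sentence-transformers
# library; the model name and sample documents are illustrative, not details
# from Wilson Lin's project.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly model

docs = [
    "How to tune PostgreSQL for write-heavy workloads.",
    "A beginner's guide to baking sourdough bread.",
    "Scaling a web crawler with asynchronous workers.",
]
query = "speeding up database inserts"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# Rank documents by cosine similarity to the query vector.
scores = util.cos_sim(query_vec, doc_vecs)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

Because ranking is based on similarity between dense vectors rather than keyword overlap, a query can match a relevant document that shares none of its words, which is what makes the approach attractive for a spam-resistant search engine.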

Chunking Content material

The next question was how to process the data: should it be divided into blocks of paragraphs or sentences? He decided that the sentence level was the most granular level that made sense, because it enabled identifying the most relevant answer within a sentence while also enabling the creation of larger paragraph-level embedding units for context and semantic coherence.
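As a rough illustration of that chunking strategy, the sketch below splits paragraphs into sentences while keeping a pointer to the paragraph that supplies context; the naive regex splitter is a simplification and is not taken from Lin's pipeline:

```python
# Illustrative sentence-level chunking with paragraph context attached.
# The naive regex splitter is a simplification; it is not Lin's pipeline.
import re

def split_sentences(paragraph: str) -> list[str]:
    # A production system would use a proper sentence tokenizer instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def chunk_page(paragraphs: list[str]) -> list[dict]:
    chunks = []
    for p_idx, paragraph in enumerate(paragraphs):
        for sentence in split_sentences(paragraph):
            chunks.append({
                "text": sentence,      # most granular retrieval unit
                "context": paragraph,  # paragraph-level unit for coherence
                "paragraph_id": p_idx,
            })
    return chunks
```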

However, he still had problems identifying context with indirect references that used words like “it” or “the,” so he took an additional step in order to be able to better understand context:

“I trained a DistilBERT classifier model that would take a sentence and the preceding sentences, and label which one (if any) it depends upon in order to retain meaning. Therefore, when embedding a statement, I would follow the “chain” backwards to ensure all dependents were also provided in context.

This also had the benefit of labelling sentences that should never be matched, because they weren’t “leaf” sentences by themselves.”
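A hedged sketch of that backward “chain” walk might look like the following, with `depends_on` standing in for the DistilBERT classifier he describes; its signature here is an assumption:

```python
# Sketch of the backward "chain" walk Lin describes. `depends_on` stands in
# for his DistilBERT classifier: given a sentence and the sentences before it,
# it returns the index of the sentence it depends on, or None. Its signature
# here is an assumption.
from typing import Callable, Optional

def resolve_chain(
    sentences: list[str],
    index: int,
    depends_on: Callable[[str, list[str]], Optional[int]],
) -> str:
    """Return the sentence at `index` preceded by every sentence it
    transitively depends on, in document order."""
    chain = [index]
    current = index
    while True:
        parent = depends_on(sentences[current], sentences[:current])
        if parent is None:
            break
        chain.append(parent)
        current = parent
    return " ".join(sentences[i] for i in sorted(chain))
```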

Identifying The Main Content

A challenge for crawling was developing a way to ignore the non-content parts of a web page in order to index what Google calls the Main Content (MC). What made this challenging was the fact that websites all use different markup to signal the parts of a web page, and although he didn't mention it, not all websites use semantic HTML, which would make it vastly easier for crawlers to identify where the main content is.

So he basically relied on HTML tags like the paragraph tag <p> to identify which parts of a web page contained the content and which parts did not.

This is the list of HTML tags he relied on to identify the main content (a simplified extraction sketch follows the list):

  • blockquote – A quotation
  • dl – A description list (a list of descriptions or definitions)
  • ol – An ordered list (like a numbered list)
  • p – Paragraph element
  • pre – Preformatted text
  • table – The element for tabular data
  • ul – An unordered list (like bullet points)
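The sketch below shows how such a tag-based extractor might work, using BeautifulSoup; it is a simplified stand-in rather than Lin's actual parser:

```python
# Simplified tag-based main-content extractor, keyed on the tags listed above.
# This is a stand-in sketch using BeautifulSoup, not Lin's actual parser.
from bs4 import BeautifulSoup

CONTENT_TAGS = ["blockquote", "dl", "ol", "p", "pre", "table", "ul"]

def extract_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop obvious non-content containers before collecting text.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    blocks = []
    for el in soup.find_all(CONTENT_TAGS):
        # Skip elements nested inside another content tag to avoid duplicates.
        if el.find_parent(CONTENT_TAGS):
            continue
        text = el.get_text(" ", strip=True)
        if text:
            blocks.append(text)
    return "\n\n".join(blocks)
```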

Issues With Crawling

Crawling was another part that came with a multitude of problems to solve. For example, he discovered, to his surprise, that DNS resolution was a fairly frequent point of failure. The type of URL was another issue, and he had to block from crawling any URL that was not using the HTTPS protocol.

These were some of the challenges:

“They must have https: protocol, not ftp:, data:, javascript:, etc.

They must have a valid eTLD and hostname, and can’t have ports, usernames, or passwords.

Canonicalization is done to deduplicate. All components are percent-decoded then re-encoded with a minimal consistent charset. Query parameters are dropped or sorted. Origins are lowercased.

Some URLs are extremely long, and you can run into unusual limits like HTTP headers and database index page sizes.

Some URLs also have strange characters that you wouldn’t think would be in a URL, but will get rejected downstream by systems like PostgreSQL and SQS.”
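Those rules map fairly directly onto a canonicalization function. The sketch below is an interpretation of the quoted constraints rather than his code, and the eTLD check in particular is only a crude placeholder:

```python
# An interpretation of the quoted URL rules, assuming Python's urllib;
# the eTLD check is only a crude placeholder for a real public-suffix lookup.
from urllib.parse import parse_qsl, quote, unquote, urlencode, urlsplit, urlunsplit

def canonicalize(url: str) -> str | None:
    parts = urlsplit(url)
    # Only https URLs without ports, usernames, or passwords are accepted.
    if parts.scheme != "https" or parts.port or parts.username or parts.password:
        return None
    if not parts.hostname or "." not in parts.hostname:
        return None
    # Percent-decode, then re-encode with a minimal consistent charset.
    path = quote(unquote(parts.path), safe="/")
    # Sort query parameters so equivalent URLs deduplicate to the same key.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Origins are lowercased; the fragment is dropped.
    return urlunsplit(("https", parts.hostname.lower(), path, query, ""))
```

The canonical string then doubles as the deduplication key, so URLs that differ only in percent-encoding or query-parameter order collapse to a single entry.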

Storage

At first, Wilson chose Oracle Cloud because of the low cost of transferring data out (egress costs).

He explained:

“I initially chose Oracle Cloud for infra needs due to their very low egress costs with 10 TB free per month. As I’d store terabytes of data, this was a good reassurance that if I ever needed to move or export data (e.g. processing, backups), I wouldn’t have a hole in my wallet. Their compute was also far cheaper than other clouds, while still being a reliable major provider.”

But the Oracle Cloud solution ran into scaling issues. So he moved the project over to PostgreSQL, experienced a different set of technical issues, and eventually landed on RocksDB, which worked well.

He explained:

“I opted for a fixed set of 64 RocksDB shards, which simplified operations and client routing, while providing enough distribution capacity for the foreseeable future.

…At its peak, this system could ingest 200K writes per second across thousands of clients (crawlers, parsers, vectorizers). Each web page not only consisted of raw source HTML, but also normalized data, contextualized chunks, hundreds of high-dimensional embeddings, and lots of metadata.”
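Routing clients to a fixed set of 64 shards can be as simple as hashing a stable key. The sketch below assumes the canonical URL is that key, which the article does not confirm:

```python
# Sketch of routing writes across a fixed set of 64 RocksDB shards.
# Hashing the canonical URL as the shard key is an assumption; the article
# only says the shard count was fixed at 64.
import hashlib

NUM_SHARDS = 64

def shard_for(url: str) -> int:
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Stable routing: the same URL always lands on the same shard,
    # so any client (crawler, parser, vectorizer) can compute it locally.
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```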

GPU

Wilson used GPU-powered inference to generate semantic vector embeddings from crawled web content using transformer models. He initially used OpenAI embeddings via API, but that became expensive as the project scaled. He then switched to a self-hosted inference solution using GPUs from a company called Runpod.

He explained:

“Searching for the most cost effective scalable solution, I discovered Runpod, who offer high performance-per-dollar GPUs like the RTX 4090 at far cheaper per-hour rates than AWS and Lambda. These were operated from tier 3 DCs with stable fast networking and lots of reliable compute capacity.”
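A self-hosted embedding worker of this kind typically amounts to loading a transformer model onto the GPU and encoding chunks in large batches. The sketch below uses the sentence-transformers library with a placeholder model name and batch size, since the article does not say which model he served on Runpod:

```python
# Sketch of a self-hosted GPU embedding worker. The model name and batch size
# are placeholders; the article does not say which model Lin served on Runpod.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

def embed_chunks(chunks: list[str], batch_size: int = 256):
    # Large batches keep the GPU busy, which is where the performance-per-dollar
    # of cards like the RTX 4090 comes from.
    return model.encode(chunks, batch_size=batch_size, normalize_embeddings=True)
```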

Lack Of SEO Spam

The software engineer claimed that his search engine had less search spam and used the example of the query “best programming blogs” to illustrate his point. He also pointed out that his search engine could understand complex queries and gave the example of inputting an entire paragraph of content and discovering interesting articles about the topics in the paragraph.

4 Takeaways

Wilson listed many discoveries, but here are four that may be of interest to digital marketers and publishers on this journey of creating a search engine:

1. The Size Of The Index Is Important

One of the most important takeaways Wilson learned from two months of building a search engine is that the size of the search index is important because, in his words, “coverage defines quality.”

2. Crawling And Filtering Are The Hardest Problems

Although crawling as much content as possible is important for surfacing useful content, Wilson also learned that filtering low-quality content was difficult because it required balancing the need for quantity against the pointlessness of crawling a seemingly infinite web of useless or junk content. He discovered that a way of filtering out the useless content was necessary.

This is actually the problem that Sergey Brin and Larry Page solved with PageRank. PageRank modeled user behavior, the choices and votes of humans who validate web pages with links. Although PageRank is nearly 30 years old, the underlying intuition remains so relevant today that the AI search engine Perplexity uses a modified version of it for its own search engine.
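For readers unfamiliar with it, the intuition behind PageRank fits in a short power-iteration loop. This textbook-style sketch is for illustration only and is not Google's or Perplexity's implementation:

```python
# Textbook-style PageRank power iteration, for intuition only; this is not
# Google's or Perplexity's implementation. Assumes every link target is also
# a key of `links` (a closed graph).
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling pages spread their rank evenly across all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                # Each link acts as a "vote" that passes along a share of rank.
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank
```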

3. Limitations Of Small-Scale Search Engines

Another takeaway he discovered is that there are limits to how successful a small independent search engine can be. Wilson cited the inability to crawl the entire web as a constraint that creates coverage gaps.

4. Judging Trust And Authenticity At Scale Is Complex

Automatically determining originality, accuracy, and quality across unstructured data is non-trivial.

Wilson writes:

“Determining authenticity, trust, originality, accuracy, and quality automatically is not trivial. …if I started over I would put more emphasis on researching and developing this aspect first.

Infamously, search engines use thousands of signals for ranking and filtering pages, but I believe newer transformer-based approaches towards content evaluation and link analysis should be simpler, cost effective, and more accurate.”

Interested in trying the search engine? You can find it here, and you can read the full technical details of how he did it here.

Featured Image by Shutterstock/Pink Vector
