Google’s Secret 14000 Ranking Factors Leak

The leak of Google’s secret 14,000 ranking factors unveils intricate insights into its search operations, challenging long-held assumptions and shedding light on clandestine practices.

On Sunday, May 5th, an email was received from an individual purporting to have access to a significant leak of API documentation from within Google’s Search division. The email also asserted that these leaked documents had been verified as authentic by former Google employees, who, along with others, had disclosed additional private information about Google’s search operations.

Many of the assertions directly conflict with public statements made by Googlers over time, particularly the company’s repeated denial of employing click-centric user signals, denial of considering subdomains separately in rankings, denial of a sandbox for newer websites, denial of collecting or considering a domain’s age, and more.

Naturally, skepticism arose. The claims made by this source, who requested anonymity, appeared extraordinary—claims such as:

  • In their early years, Google’s search team recognized a need for full clickstream data (every URL visited by a browser) for a large percentage of web users to improve their search engine’s result quality.
  • A system called “NavBoost” (cited by VP of Search, Pandu Nayak, in his DOJ case testimony) initially gathered data from Google’s Toolbar PageRank, and the desire for more clickstream data served as the key motivation for the creation of the Chrome browser (launched in 2008).
  • NavBoost uses the number of searches for a given keyword to identify trending search demand, the number of clicks on a search result (I ran several experiments on this from 2013-2015), and long clicks versus short clicks (which I presented theories about in this 2015 video).
  • Google utilizes cookie history, logged-in Chrome data, and pattern detection (referred to in the leak as “unsquashed” clicks versus “squashed” clicks) as effective means for fighting manual & automated click spam.
  • NavBoost also scores queries for user intent. For example, certain thresholds of attention and clicks on videos or images will trigger video or image features for that query and related, NavBoost-associated queries.
  • Google examines clicks and engagement on searches both during and after the main query (referred to as a “NavBoost query”). For instance, if many users search for “Rand Fishkin,” don’t find SparkToro, and immediately change their query to “SparkToro” and click SparkToro’s site in the search results, that site (and websites mentioning “SparkToro”) will receive a boost in the search results for the “Rand Fishkin” keyword.
  • NavBoost’s data is used at the host level for evaluating a site’s overall quality (my anonymous source speculated that this could be what Google and SEOs called “Panda”). This evaluation can result in a boost or a demotion.
  • Other minor factors, such as penalties for domain names that exactly match unbranded search queries (e.g. men’s-luxury-watches.com), a newer “BabyPanda” score, and spam signals, are also considered during the quality evaluation process.
  • NavBoost geo-fences click data, taking into account country and state/province levels, as well as mobile versus desktop usage. However, if Google lacks data for certain regions or user agents, they may apply the process universally to the query results.
  • During the Covid-19 pandemic, Google employed whitelists for websites that could appear high in the results for Covid-related searches.
  • Similarly, during democratic elections, Google employed whitelists for sites that should be shown (or demoted) for election-related information.
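
The geo-fencing claim above (segment click data by country, state/province, and device, then fall back to query-wide data when a segment is too sparse) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class name `SegmentedClicks`, the fallback order, and the `min_clicks` cutoff do not come from the leak, which describes no implementation details.

```python
from collections import defaultdict


class SegmentedClicks:
    """Hypothetical store of click counts keyed by query and audience segment."""

    def __init__(self):
        # (query, country, region, device) -> {url: click count}
        self._counts = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def _keys(query, country, region, device):
        # Most specific segment first, query-wide ("universal") data last.
        candidates = [
            (query, country, region, device),
            (query, country, None, device),
            (query, country, None, None),
            (query, None, None, None),
        ]
        ordered = []
        for key in candidates:
            if key not in ordered:  # drop duplicates, keep order
                ordered.append(key)
        return ordered

    def record(self, query, url, country=None, region=None, device=None):
        # Count the click at every granularity so coarser levels can act as fallbacks.
        for key in self._keys(query, country, region, device):
            self._counts[key][url] += 1

    def lookup(self, query, country, region, device, min_clicks=10):
        # Return the most specific segment with enough data (assumed cutoff).
        for key in self._keys(query, country, region, device):
            counts = self._counts.get(key)
            if counts and sum(counts.values()) >= min_clicks:
                return key, dict(counts)
        return None, {}
```

In this sketch, a query from a region with few recorded clicks would be served from the country-level or query-wide counts instead, which is one way to read "apply the process universally."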

Remarkable assertions necessitate compelling corroboration. While some align with disclosures from the Google/DOJ case (as referenced in this 2020 thread), others introduce fresh insights, implying an insider perspective.

So, on Friday, May 24th (following numerous email exchanges), a video conference was arranged with the undisclosed informant.

An anonymized screen capture from Rand’s call with the source

Update (5/28 at 10:00 am Pacific): The informant has opted to unveil their identity. This video discloses them as Erfan Azimi, an SEO specialist and the founder of EA Eagle Digital.

An eagle uses the storm to reach unimaginable heights.

– Matshona Dhliwayo

Following the conference, I managed to corroborate aspects of Erfan’s professional background, connections we share in the marketing realm, and some of their assertions regarding attendance at specific industry events alongside insiders (including Googlers), although the specifics of these encounters or the topics discussed cannot be independently verified.

During our conversation, Erfan presented the leaked material: over 2,500 pages of API documentation featuring 14,014 attributes (API features) purportedly sourced from Google’s internal “Content API Warehouse.” According to the document’s commit history, this code was uploaded to GitHub on Mar 27, 2024, and removed on May 7, 2024. (Note: Due to post-publication edits reflecting Erfan’s identity, he is referred to below as “the anonymous source”).

This documentation does not divulge specifics such as the weighting of individual elements in the search ranking algorithm or which elements are utilized in the ranking systems. Nevertheless, it provides intricate insights into the data collected by Google. An illustration of the document format is provided below:

Screen capture of leaked data about “good” and “bad” clicks, including length of clicks (i.e. how long a visitor spends on a web page they’ve clicked from Google’s search results before going back to the search results)

The source showed me around a few of these API modules before outlining their goals (holding Google accountable, transparency, etc.) and their expectation that I would write a piece exposing this leak, exposing some of the many fascinating facts it held, and dispelling some of the “lies” Google employees “had been spreading for years.”

A sample of statements from Google representatives (Matt Cutts, Gary Illyes, and John Mueller) denying the use of click-based user signals in rankings over the years

Is this API Leak Authentic? Can We Trust It?

A critical next step in the process was verifying the authenticity of the API Content Warehouse documents. Ex-Googler friends were contacted to share and assess the leaked documents. Three responded: one declined to comment, while the other two provided anonymous feedback.

  • “I didn’t have access to this code when I worked there. But this certainly looks legit.”
  • “It has all the hallmarks of an internal Google API.”
  • “It’s a Java-based API. And someone spent a lot of time adhering to Google’s own internal standards for documentation and naming.”
  • “I’d need more time to be sure, but this matches internal documentation I’m familiar with.”
  • “Nothing I saw in a brief review suggests this is anything but legit.”

Help was needed to analyze the naming conventions and technical aspects of the documentation. Despite some experience with APIs, I had not written code or practiced SEO professionally in years. Assistance was sought from Mike King, founder of iPullRank, one of the world’s foremost technical SEOs.

During a 40-minute phone call on Friday afternoon, Mike reviewed the leak and confirmed suspicions: the documents appeared to be legitimate, containing a significant amount of previously unconfirmed information about Google’s inner workings.

With 2,500 technical documents, it was an unreasonable task to review everything in a single weekend. Nevertheless, Mike conducted an initial review of the Google API leak, which is referenced in the findings. Mike also agreed to present a detailed analysis of the leak at SparkTogether 2024 in Seattle, WA, on Oct. 8.

What is the Google API Content Warehouse?

The initial questions when examining the extensive API documentation might be: “What is this? What is its purpose? Why does it exist?”

The leak appears to have originated from GitHub, aligning with the explanation provided by the anonymous source. These documents were inadvertently made public for a brief period between March and May 2024. Many links in the documentation point to private GitHub repositories and internal Google pages that require specific credentials. The API documentation was subsequently picked up by Hexdocs, which indexes public GitHub repositories, and circulated by other sources, though public discourse on the matter was notably absent until now.

Ex-Googler sources confirmed that such documentation exists across almost every Google team. These documents explain various API attributes and modules to familiarize team members with the available data elements. This leak is consistent with others found in public GitHub repositories and Google’s Cloud API documentation, sharing the same notation style, formatting, and references to processes, modules, and features.

In simpler terms, this documentation serves as instructions for Google’s search engine team, akin to a library’s card catalog, detailing available resources and how to access them.

However, unlike public libraries, Google Search remains one of the most secretive and closely guarded systems globally. No previous leak from Google’s search division has ever been reported with such magnitude or detail in the last quarter century.

How certain can we be that Google’s search engine uses everything detailed in these API docs?

That is open to interpretation. Google may have retired some of these features, used others exclusively for testing or internal projects, or made API features available that were never employed.

However, the documentation includes references to deprecated features and notes on others indicating they should no longer be used. This strongly suggests that features not marked as deprecated were still in active use as of the March 2024 leak.

It is uncertain whether the March leak represents the most recent version of this documentation. The latest date found in the API docs is August 2023:

The relevant text reads:

“The domain-level display name of the website, such as “Google” for See go/site-display-name for more details. As of Aug 2023, this field is being deprecated in favor of info.[AlternativeTitlesResponse].site_display_name_response field, which also contains host-level site display names with additional information.”

A reasonable conclusion is that the documentation was up-to-date as of last summer, with references to changes in 2023 and earlier years, dating back to 2005, and possibly up-to-date as of March 2024.

Google Search undergoes significant changes yearly, and recent introductions like its controversial AI Overviews are not present in this leak. Whether the items mentioned are actively used in Google’s ranking systems today is a matter of speculation. The trove contains intriguing references, many of which will be new to search engineers outside Google.

Readers should not treat the presence of a particular API feature in this leak as definitive proof of its use in Google rankings. It is a strong indication, stronger than patent applications or public statements from Googlers, but not a guarantee.

Nevertheless, this leak is as close to a smoking gun as anything since Google’s executives testified in the DOJ trial last year. Much of that testimony is corroborated and expanded upon in the document leak, as detailed in Mike’s post.

What can we learn from the Data Warehouse Leak?

Expect interesting and marketing-applicable insights to be mined from this extensive file set for years to come. The sheer volume and density of the documents make it unrealistic to think that a weekend of browsing could uncover a comprehensive set of takeaways.

However, here are five of the most intriguing early discoveries. Some shed new light on practices long assumed of Google, while others suggest the company’s public statements, particularly regarding data collection, have been inaccurate. Detailing side-by-sides of what Googlers said versus what this document suggests would be tedious and could read as a personal grievance given Google’s historic attacks, so the focus will be on noteworthy and useful takeaways. Mike’s post already covers the side-by-sides effectively.

The emphasis here is on interesting and useful conclusions drawn from the reviewed modules, Mike’s analysis, and how this information aligns with known facts about Google.

#1: Navboost and the use of clicks, CTR, long vs. short clicks, and user data

Features like “goodClicks,” “badClicks,” “lastLongestClicks,” impressions, squashed, unsquashed, and unicorn clicks are mentioned in a few documentation modules. These relate to Navboost and Glue, two terms that those who read Google’s DOJ testimony might recognize. This is an important passage from DOJ lawyer Kenneth Dintzer’s cross-examination of Pandu Nayak, Google’s vice president of Search:

Q. So remind me, is navboost all the way back to 2005?
A. It’s somewhere in that range. It might even be before that.

Q. And it’s been updated. It’s not the same old navboost that it was back then?
A. No.

Q. And another one is glue, right?
A. Glue is just another name for navboost that includes all of the other features on the page.

Q. Right. I was going to get there later, but we can do that now. Navboost does web results, just like we discussed, right?
A. Yes.

Q. And glue does everything else that’s on the page that’s not web results, right?
A. That is correct.

Q. Together they help find the stuff and rank the stuff that ultimately shows up on our SERP?
A. That is true. They’re both signals into that, yes.

A savvy reader of these API documents would find they support Mr. Nayak’s testimony (and align with Google’s patent on site quality):

  • Quality Navboost Data module
  • Geo-segmentation of Navboost Data
  • Clicks Signals in Navboost
  • Data Aging Impressions and clicks

Google seems to have mechanisms to filter out clicks they don’t want to count in their ranking systems while including the ones they do. They also appear to measure click duration (such as pogo-sticking, where a user quickly returns to the search results after clicking on a result) and impressions.
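
To make the long-click versus short-click distinction concrete, here is a minimal Python sketch that tallies clicks per URL by dwell time. The attribute names `goodClicks` and `badClicks` appear in the leak, but mapping them to long and short clicks, the `Click` type, and the 30-second cutoff are all assumptions for illustration. The documentation does not disclose thresholds or how these counts are actually computed.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the leak names the attributes but gives no thresholds.
LONG_CLICK_SECONDS = 30.0


@dataclass
class Click:
    url: str
    dwell_seconds: float  # time on the page before returning to the results


def tally_clicks(clicks, long_click_seconds=LONG_CLICK_SECONDS):
    """Count long ("good") and short ("bad") clicks per URL."""
    tallies = {}
    for click in clicks:
        counts = tallies.setdefault(click.url, {"goodClicks": 0, "badClicks": 0})
        if click.dwell_seconds >= long_click_seconds:
            counts["goodClicks"] += 1
        else:
            counts["badClicks"] += 1  # quick return, i.e. pogo-sticking
    return tallies
```

A real system would presumably also filter out spam clicks before counting (the “squashed” versus “unsquashed” distinction), which this sketch leaves out.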

Much has been discussed about Google’s use of click data, so the focus here is on the significant point: Google has named and described features for this measurement, providing further evidence of its use.

#2: Use of Chrome browser clickstreams to power Google Search

#3: Whitelists in Travel, Covid, and Politics

A module on “Good Quality Travel Sites” suggests the existence of a whitelist for Google in the travel sector, though it is unclear if this applies solely to Google’s “Travel” search tab or to web search more broadly. References to flags for “isCovidLocalAuthority” and “isElectionAuthority” indicate that Google likely whitelists specific domains for highly controversial or potentially problematic queries.

For instance, after the 2020 US Presidential election, one candidate falsely claimed the election was stolen and incited followers to storm the Capitol and commit acts of violence. Google, being a primary source for information, could have exacerbated the situation if its search engine returned propaganda sites with false election information. It is crucial that Google’s engineers use whitelists in such cases to ensure accurate information is presented, preventing further conflict and preserving democratic processes.

#4: Employing Quality Rater Feedback

Google has a quality rating platform called EWOK, which Cyrus Shepard, a prominent SEO expert, contributed to and documented. Recent evidence indicates that some elements from these quality raters are integrated into Google’s search systems.

The influence and specific usage of these rater-based signals remain unclear, but it is likely that SEO experts will investigate and provide more insights. It is noteworthy that scores and data from EWOK’s quality raters may be directly involved in Google’s search system, rather than merely serving as a training set for experiments. When the documents indicate usage “just for testing,” it is explicitly mentioned in the notes and module details.

One module mentions a “per document relevance rating” sourced from EWOK evaluations, implying the significance of human evaluations of websites. Another notes “Human Ratings (e.g. ratings from EWOK)” as typically populated in evaluation pipelines, suggesting they primarily serve as training data. Nonetheless, this role is crucial, highlighting the importance of quality raters’ perceptions and ratings of websites.

#5: Google Uses Click Data to Determine How to Weight Links in Rankings

A particularly intriguing point, highlighted by the anonymous source who shared the leak, involves Google’s method of classifying link indexes into three tiers: low, medium, and high quality. Click data is pivotal in determining which tier a document’s link belongs to. According to the source:

  • A link with no clicks is classified as low quality and ignored.
  • A link with a high volume of clicks from verifiable devices is categorized as high-quality, passing ranking signals.
  • Once a link is deemed “trusted” due to its higher tier classification, it can contribute to PageRank and anchor text or be filtered by link spam systems. Links in the low-quality index do not negatively impact a site’s ranking; they are simply disregarded.
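
The three-tier scheme the source describes can be sketched in Python as follows. The tier names come from the description above, while the function names and the numeric cutoff for the high tier are hypothetical. The source does not say where the boundaries lie or how the medium tier is treated, so the sketch encodes only the stated rule that low-tier links are ignored.

```python
from enum import Enum


class LinkTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical cutoff: the source gives no numbers for "a high volume of clicks".
HIGH_TIER_MIN_CLICKS = 100


def classify_link_tier(verified_clicks, high_min=HIGH_TIER_MIN_CLICKS):
    """Assign a tier from the count of clicks from verifiable devices."""
    if verified_clicks <= 0:
        return LinkTier.LOW
    if verified_clicks >= high_min:
        return LinkTier.HIGH
    return LinkTier.MEDIUM


def is_ignored(tier):
    # Per the source, low-tier links are disregarded rather than penalized.
    return tier is LinkTier.LOW
```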
