We occasionally receive questions from concerned clients about their "crawl budget", along with requests to unpublish some of their content in bulk. "Crawl budget" is a loosely defined term that has become popular with third-party SEO review services.
Google explains how its crawling works in a 2017 blog post: What Crawl Budget Means for Googlebot
Your sitemap tells Google which links are new and need to be indexed, meaning unchanged pages do not need to be, and will not be, continuously recrawled. Google is also aiming to crawl more sustainably*, so unchanged pages will be crawled less often.
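For illustration only, here is a minimal sketch of a sitemap entry with a lastmod date, which is the signal crawlers use to see which URLs have actually changed. The URL and date are invented, and Metro Publisher generates your sitemap for you, so you never need to write this yourself:

```python
# Minimal sketch of a sitemap entry with <lastmod>.
# The URL and date below are placeholders, not real site data.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)

url = ET.SubElement(urlset, "url")
ET.SubElement(url, "loc").text = "https://example.com/articles/new-article"
ET.SubElement(url, "lastmod").text = "2023-01-15"  # date the page last changed

print(ET.tostring(urlset, encoding="unicode"))
```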
In addition, when you set content and events to expire from your site with our CMS, those links will return a 404 if they are eventually recrawled. This does not affect your SEO; it simply tells Google that the page is gone: How to Handle 404 (Page Not Found) Errors
In other words, when you set outdated content or events to expire in our system after one month, those pages automatically start returning a 404 status, and search engines stop crawling them once they have registered that status. These 404s do not affect your SEO; they are the correct way to tell search engines that those pages are no longer available and can be dropped.
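If you want to verify this yourself, checking the HTTP status code of an expired URL is enough. The sketch below uses only Python's standard library; the URL is a made-up placeholder for one of your expired pages:

```python
# Quick check that an expired URL answers with HTTP 404.
from urllib.request import urlopen
from urllib.error import HTTPError

def status_of(url: str) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urlopen(url) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 404 for an expired page

# Placeholder URL for illustration; expect 404 for an expired event.
print(status_of("https://example.com/events/expired-spring-festival"))
```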
The microdata on events in the Metro Publisher CMS follows best practice to let search engines know if events are outdated. We also have a feature that allows your team to batch-delete old or expired events from your database: Expired Events
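For context, this is roughly what machine-readable event data with explicit start and end dates looks like. The sketch below is a generic schema.org-style example with invented values, not Metro Publisher's actual markup; the point is that dates in the past let a crawler tell the event is over:

```python
# Generic schema.org-style Event data with invented values.
import json
from datetime import date

event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Example Spring Food Festival",
    "startDate": "2022-06-01",
    "endDate": "2022-06-03",
    "location": {"@type": "Place", "name": "Example Hall"},
}

# An event whose endDate is already in the past reads as "over" to a crawler.
is_over = date.fromisoformat(event["endDate"]) < date.today()

print(json.dumps(event, indent=2))
print("event is over:", is_over)
```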
As a best practice, you can also mark locations that have permanently closed by activating the appropriate checkbox on the respective location's edit page in your admin area.
* 2022 Interview Excerpt:
[00:04:03] Gary Illyes: We are carbon-free, since I don't even know, 2007 or something, 2009, but it doesn't mean that we can't reduce even more our footprint on the environment. And crawling is one of those things that early on, we could chop off some low-hanging fruits. For example, with HTTP/2, we could stream better, for example, from the net, open fewer connections, and then that saved us resources. But also for the servers that we were crawling from, like the servers that were hosting the sites.
[00:04:36] So we are thinking more about these things. Like how can we reduce even more Googlebots and other crawlers, Google crawlers' footprint on the Internet, on the environment. And then, if you think about it, one thing that we do and we might not need to do that much is refresh crawls. Which means that once we discovered a document, a URL, then we go, we crawl it, and then, eventually, we are going to go back and revisit that URL. That is a refresh crawl. And then every single time we go back to that one URL, that will always be a refresh crawl. Now, how often do we need to go back to that URL?
[00:05:17] You could say that, for example, if you take the CNN or Wall Street Journal homepage, which is changing every five seconds, then we do need to go back very often. But then the About page of either of these news outlets, they don't change too often. So you don't have to go back there that much. And often, we can't estimate this well, and we definitely have room for improvement there on refresh crawls, because sometimes, it just seems wasteful that we are hitting the same URL over and over again. Sometimes we are hitting 404 pages, for example, for no good reason or no apparent reason. And all these things are basically stuff that we could improve on and then reduce our footprint even more.
[00:06:03] John Mueller: Do you think that's something that Google can do on its side? Or does that need extra work from the site owners?
[00:06:13] Gary Illyes: To a large extent, I think we can do it from our side. Basically, we're rethinking how we issue refresh crawls. But then, there might be other things that we could try. And for example, "IndexNow," which I have a beef with that name because it's more like "CrawlNow," not "IndexNow." It could be "IndexMaybe."
Source: Google 'Search Off The Record' Podcast "Year in Review 2022"