Using Pages to Create a Custom Robots.txt File


WARNING: This is an advanced feature. Consult Google's Webmaster Tools documentation or another authoritative source before you make any edits to your robots.txt file. Improperly editing this file can damage your SEO.

These resources explain the functions and limitations of this file:

About /robots.txt (The Web Robots Page)

Learn about robots.txt files (Google Help)

 

The robots.txt file tells search engine crawlers which parts of your site they may visit and where to find your sitemaps. It can be used in conjunction with Google's Webmaster Tools to manage changes to your site. For example, if you have an existing site and are relaunching it as a Metro Publisher website, you can alter the robots.txt file to inform Google of the differences between the old site and the new one. This helps minimize the temporary adverse effects that a relaunch can cause.

For more information about how specifically to use your robots.txt file, including the exact lines of code you should enter, consult Google's Webmaster Tools support documents.

Example of a robots.txt file:

The default robots.txt file for Metro Publisher sites consists of the lines shown below. They direct search engines to the sitemaps for your website and are an essential part of how your site is indexed. You do not need to add this code if you create your own custom robots.txt file; it is automatically included as follows:

User-agent: *
Disallow:
Sitemap: http://design.metropublisher.net/sitemap.xml
Sitemap: http://design.metropublisher.net/sitemap_news.xml

 

Note that robots.txt alone will not keep a page out of Google's index. To keep a web page out of Google, block indexing with a noindex directive or password-protect the page, following these instructions: https://developers.google.com/search/docs/crawling-indexing/block-indexing
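For instance, a noindex directive can be placed in a page's HTML head. This is a generic illustration of the mechanism described in Google's documentation, not Metro Publisher-specific markup:

```
<!-- Tells compliant crawlers not to include this page in search results -->
<meta name="robots" content="noindex">
```

If you cannot edit the page's HTML, the same directive can instead be sent as an X-Robots-Tag HTTP header, as covered in the Google article linked above.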

 

If you would like to alter or customize the robots.txt page, use Pages to do so.

  1. Log in as an editor and click on "Pages" from the main navigation.
  2. At the bottom of the page, select "Add" from the menu.
  3. On the subsequent screen, select "Text."
  4. Add the title.
    NOTE: The title will not appear within the content of the page, so you may call it anything you want.
  5. Manually give the new page the filename "robots.txt".
    NOTE: This is the only filename that will work; if you name the page anything else, search engines will not find it. Normally the system automatically creates a filename with dashes from the Page Title you enter, so this step is an exception: you must override that behavior!
  6. DO NOT assign it to a section or subsection. Leave it located on the default "Top Level / Root."
  7. Add whatever code you need.
    NOTE: Unless you are an advanced user, you MUST consult Google's Webmaster Tools documentation before editing this page. If you edit this page improperly, search engines will not index your site properly.
  8. Remember to "Save" and "Publish."
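As a minimal sketch of step 7, a custom robots.txt that keeps crawlers out of a hypothetical /drafts/ section might look like this (the path is an assumption for illustration; substitute your own):

```
User-agent: *
Disallow: /drafts/
```

Since Metro Publisher appends its own default code to your file, you only need to add the rules you want to change.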

That's it. Wait a few minutes for your new page to clear the cache.

NOTE: Metro Publisher's own default robots.txt code is appended to your robots.txt file. Bots should react to the first rule that matches for them, meaning your entry would override any conflicting default code in the robots.txt file automatically generated by Metro Publisher.

 

Blocking AI Bots

Since it has become relevant, Metro Publisher currently blocks AI crawlers such as OpenAI's GPTBot, Google's Bard crawler, Meta's FacebookBot, Bing's AI crawler, and others. This helps avoid heavy traffic loads and helps keep your copyrighted material from being republished and possibly misquoted.

Metro Publisher is committed to keeping your data from being exploited and to making sure bots and spiders do not cause excessive data traffic and server loads. We therefore additionally block the following aggressive bots by default, and we add to this list whenever we see detrimental bot activity on client sites:

  • Bytespider
  • Baiduspider
  • GPTBot
  • SemrushBot
  • dotbot
  • AhrefsBot
  • DataForSeoBot
  • SeekportBot
  • etc.
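If you want to block an additional crawler yourself, a custom robots.txt entry follows the same pattern. The user-agent token "ExampleBot" below is a placeholder; substitute the name of the bot you want to block:

```
# Block a specific crawler from the entire site
User-agent: ExampleBot
Disallow: /
```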

A comprehensive list of known artificial agents can be found here.

Furthermore, we block bots from our site-wide search tool API. This does not affect search results or SEO.

Please also visit our help article on Protecting Your Site From Excessive Bot Traffic for more details on the issue.

Based on our findings, we also block the Bing search engine bot by default. In addition to traffic from malicious bots, we are seeing a lot of traffic from search engines with a very small market share; Bing currently holds just under 4% market share, and crawling by artificial agents has gotten out of hand worldwide.

You can unblock Bing if you wish by adding the following code:

User-agent: bingbot
Allow: /

The same reasoning applies to DuckDuckGo, which you may unblock in the same manner: 

User-agent: DuckDuckBot
Allow: /

The so-called "FacebookBot" crawls exclusively for Meta's own AI training, not for any other purpose. If you would like your content crawled for the purpose of sharing on Meta products (Facebook, Instagram, Messenger, etc.), you need to allow a bot called "facebookexternalhit" as follows. Allowing this bot is an active opt-in on your part; Metro Publisher does not block it:

User-agent: facebookexternalhit
Allow: /

 
