Editing Robots.txt

What is robots.txt?


Website owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. A robots.txt file can ask search engines not to crawl and index parts of the site, which helps deter potential SEO harm from unwanted pages appearing in search results.



How does Site Stacker handle robots.txt? 


Site Stacker automatically creates a robots.txt file for each Site Channel added in the Sites component. By default, the file allows all search engines to crawl and index everything served from your domain.
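
For reference, a minimal allow-all robots.txt looks like this (the exact contents generated for a Site Channel may differ slightly):

User-agent: *
Disallow:

An empty Disallow value blocks nothing, so every crawler may crawl every path.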


User-agent Directive


This directive specifies which crawler should obey a given set of rules. Its value can either be a wildcard (*), which applies the rules to all crawlers, or the name of a specific crawler:


Ex. User-agent: * - meaning ALL web crawlers should obey this rule.


User-agent: Googlebot - meaning ONLY Googlebot should obey this rule.
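
A robots.txt file can contain several groups, each introduced by its own User-agent line, and a crawler obeys the group that matches it best. A short sketch with hypothetical paths:

User-agent: Googlebot
Disallow: /blocked-for-googlebot/

User-agent: *
Disallow: /blocked-for-everyone/

Here Googlebot follows only its own group, while all other crawlers follow the * group.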


Disallow Directive


This directive specifies which paths are off-limits to web crawlers.


Ex. User-agent: *


Disallow: /some-page


This rule will block all URLs that have a path that starts with “/some-page”:


http://example.com/some-page


http://example.com/some-page?filter=0


http://example.com/some-page/another-page


http://example.com/some-pages-will-be-blocked


However, it will not block URLs that do not start with “/some-page”, like:


http://example.com/subdir/some-page


NOTE: Whatever comes after the “Disallow:” is treated as a simple string of characters (with the notable exceptions of * and $). This string is compared to the beginning of the path part of the URL (everything from the first slash after the domain to the end of the URL) which is also treated as a simple string. If they match, the URL is blocked.
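
If you want to sanity-check this prefix matching locally, Python's standard urllib.robotparser applies the same simple string comparison (note that it does not understand the * and $ wildcards described later). A quick sketch using the example above:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /some-page",
])

urls = [
    "http://example.com/some-page",
    "http://example.com/some-page?filter=0",
    "http://example.com/some-page/another-page",
    "http://example.com/some-pages-will-be-blocked",
    "http://example.com/subdir/some-page",  # path does not start with /some-page
]
for url in urls:
    print("allowed" if rp.can_fetch("*", url) else "blocked", url)

This prints “blocked” for the first four URLs and “allowed” for the last one.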


Allow Directive


By default, pages matched by no “Disallow” rule are allowed, so this directive exists only to specify exemptions to a Disallow rule.


This directive is useful if you have a subdirectory that is “Disallowed” but you want to allow a page from that subdirectory to be crawled.


Ex. User-agent: *


Allow: /do-not-show-me/show-me-only


Disallow: /do-not-show-me/


This example will block these URLs:


http://example.com/do-not-show-me/


http://example.com/do-not-show-me/page-one


http://example.com/do-not-show-me/pages


http://example.com/do-not-show-me/?a=z


However, this will not block these URLs:


http://example.com/do-not-show-me/show-me-only/


http://example.com/do-not-show-me/show-me-only-now-you-see-me


http://example.com/do-not-show-me/show-me-only/page-one


http://example.com/do-not-show-me/show-me-only?a=z


Wildcards


These extensions are directives that let you block pages whose paths contain unknown or variable segments.


  • * (asterisk) - Matches any sequence of characters; in the example below it stands in for the variable segment between the two fixed directories.


Ex. Disallow: /users/*/profile


This will block these URLs:


http://example.com/users/name-1/profile


http://example.com/users/name-2/profile


http://example.com/users/name-3/profile


And so on…


  • $ (dollar sign) - Marks the end of the URL; the rule matches only if the URL ends at that point.


Ex. Disallow: /page-one$


This will block:


http://example.com/page-one


But will not block:


http://example.com/page-one-of-ten


http://example.com/page-one/


http://example.com/page-one?a=z
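
urllib.robotparser does not implement these wildcards, but their matching logic is easy to sketch with a regular expression. The helper below (the function names are ours, not part of any library) is a simplified illustration of how wildcard-aware crawlers such as Googlebot interpret * and $:

import re

def rule_to_regex(rule):
    # Escape everything, then restore the two robots.txt wildcards:
    # '*' matches any run of characters, a trailing '$' anchors the end.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

def is_blocked(rule, path):
    # A rule matches when it matches at the start of the URL path.
    return rule_to_regex(rule).match(path) is not None

print(is_blocked("/users/*/profile", "/users/name-1/profile"))  # True
print(is_blocked("/page-one$", "/page-one"))                    # True
print(is_blocked("/page-one$", "/page-one-of-ten"))             # False
print(is_blocked("/page-one$", "/page-one/"))                   # False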


Robots.txt Precedence and Specificity


Web crawlers like Google and Bing do not care in what order you specify the directives to either crawl (Allow: /your/file/path) or block from crawling (Disallow: /your/file/path) within the robots file; they apply the most specific (longest) rule that matches. If a crawler does not encounter any instructions in the robots.txt file, it will assume that the whole website may be crawled and indexed.


Note: When adding rules to robots.txt in SiteStacker, it is best to use only the ‘User-agent’ line plus ‘Disallow’ rules for the paths you do not want crawled. A blanket ‘Allow: /’ placed after your Disallow rules can be interpreted by some crawlers as overriding them.


Ex. User-agent: *


Disallow: /path/page


Disallow: /path/


Allow: /


Some crawlers can interpret this blanket Allow as permitting every path to be crawled, overriding the Disallow rules above it; at best the line is redundant, since paths are allowed by default.


It is also best to put the ‘Allow’ rule before the ‘Disallow’ rule it exempts, because many search engines other than Google and Bing follow the order of directives for group-member records and apply the first rule that matches.
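
As an illustration, Python's urllib.robotparser happens to be one of those order-sensitive parsers (it applies the first rule that matches), so it shows why the Allow line must come first for such crawlers:

from urllib import robotparser

url = "http://example.com/do-not-show-me/show-me-only/page-one"

groups = [
    ("Allow first", ["User-agent: *",
                     "Allow: /do-not-show-me/show-me-only",
                     "Disallow: /do-not-show-me/"]),
    ("Disallow first", ["User-agent: *",
                        "Disallow: /do-not-show-me/",
                        "Allow: /do-not-show-me/show-me-only"]),
]
for label, lines in groups:
    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    print(label, "->", "allowed" if rp.can_fetch("*", url) else "blocked")

This prints “Allow first -> allowed” and “Disallow first -> blocked”. Google and Bing would allow the URL in both cases, because they pick the longest matching rule rather than the first.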



Page Searchability

Google Custom Search is a tool used to search content inside your website. It searches across all of your indexed pages and returns the content associated with your search query. In SiteStacker, each content item can be made searchable or unsearchable within the site, which allows the publisher to hide or secure pages that contain sensitive content.


To set up page searchability:


  • Log in to SiteStacker

  • Click Site Planner

  • Add or edit a content item

  • On the right side, locate the ‘Searchable’ option:

  • Check ‘Searchable’ if you want this page to appear in search results on your website.

  • Uncheck ‘Searchable’ if you do not want it to appear in search results on your website.


  • Save & Close


Note: The ‘Searchable’ option only controls whether the page can be found when searching inside your website. If your robots.txt does not ‘Disallow’ this page, external web crawlers can still crawl it.

Updated on: 14/04/2026
