Answers to questions about indexing

What pages can't be included in the search database?

Every day Yandex indexes millions of pages and adds them to the search database. To avoid filling it with documents that will never show up in the search results, Yandex analyzes each document using a special algorithm.

If the algorithm determines that the page is unlikely to become one of the most relevant answers for any search, the page isn't included in the current search database.

Thus, not all indexed documents appear in the Yandex search results. Removing such a page from the search database doesn't affect the page or the site's traffic, because the page wouldn't have appeared in the search results anyway.

In addition, Yandex continues to reindex and analyze these documents in the same way as all others. If at some point the algorithm determines that the page could appear in the search results, it will be added to the search database.

What is a page duplicate?

Page duplicates are site pages that have identical content but different URLs.

For example:

  • http://example.com and http://example.com/index.php/,
  • http://example.com/page/ and http://example.com/page.

If both pages are indexed by the Yandex robot, the indexing system groups them as duplicates. Only one of the pages is listed in the search results.

There are many reasons why duplicate pages may come up:

  • Natural reasons (for example, if a page with a product description is available in several categories of an online store).
  • Issues related to incorrect site structure.

To have the right page in the search results, we recommend that you indicate the preferred URL to the Yandex robot, for example by setting up a 301 redirect from the duplicates to it or by specifying the canonical URL.

My site has moved (the URL changed). What should we do?

If the old and new site pages are exactly the same, the server should respond with the 301 code (“Moved Permanently”) when an old page is requested. The Location field should contain the page's new URL. If the old site has been shut down, you can speed up its removal from the index by filling in this form: Remove URL.
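
If you don't already have a redirect mechanism, any web server or framework can produce this response. Below is a minimal sketch using Python's standard http.server module; the new host name and port are placeholders for illustration, not part of any Yandex requirement.

    # Minimal sketch: answer every request on the old host with a 301 redirect
    # to the same path on the new host (the host name and port are placeholders).
    from http.server import BaseHTTPRequestHandler, HTTPServer

    NEW_HOST = "https://new-example.com"  # assumed new site address

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(301)                        # Moved Permanently
            self.send_header("Location", NEW_HOST + self.path)
            self.end_headers()

        do_HEAD = do_GET

    if __name__ == "__main__":
        HTTPServer(("", 8080), RedirectHandler).serve_forever()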

You are trying to download confidential information from our server. What should we do?

The robot takes links from other pages. This means that some other page contains links to the confidential sections of your site. You can either protect them with a password or disallow indexing by the Yandex robot in the robots.txt file. In both cases, the robot won't download confidential information.
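
If you go the robots.txt route, you can verify the rules before relying on them. Below is a minimal sketch using Python's standard urllib.robotparser; the "/private/" section and the page paths are hypothetical examples.

    # Sketch: check that a Disallow rule really blocks a confidential section
    # (the "/private/" path and page URLs are hypothetical examples).
    from urllib import robotparser

    rules = [
        "User-agent: Yandex",
        "Disallow: /private/",
    ]

    parser = robotparser.RobotFileParser()
    parser.parse(rules)

    print(parser.can_fetch("Yandex", "https://example.com/private/report.html"))  # expected: False
    print(parser.can_fetch("Yandex", "https://example.com/public/page.html"))     # expected: True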

How do I protect myself from fake robots pretending to be the Yandex robot?

To protect yourself against fake robots, use a check based on reverse DNS lookup. This method is preferable to managing access by IP addresses, because it is more resistant to changes in Yandex's internal networks.
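
The check boils down to this: resolve the visiting IP address to a host name, make sure the name belongs to a Yandex domain (robot hosts end in yandex.ru, yandex.net, or yandex.com; check the current list in the Yandex Help if in doubt), and then make sure the forward lookup of that name returns the original IP address. A minimal sketch with Python's standard library; the IP address in the example is a documentation placeholder, not a real robot address.

    # Sketch of a reverse DNS check for a visitor claiming to be a Yandex robot.
    # The IP address below is a placeholder; take real ones from your access logs.
    import socket

    YANDEX_DOMAINS = (".yandex.ru", ".yandex.net", ".yandex.com")

    def is_yandex_robot(ip: str) -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
        except OSError:
            return False
        if not hostname.endswith(YANDEX_DOMAINS):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
        except OSError:
            return False
        return ip in forward_ips                                 # must resolve back to the same IP

    print(is_yandex_robot("203.0.113.10"))  # documentation placeholder, prints False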

Is it a problem if my server doesn't provide the last-modified values? I tried to set it up, but I couldn't make it work.

Your site will still be indexed even if your server doesn't send Last-Modified dates for documents (a quick way to check the response is sketched after the list below). However, you should keep the following in mind:

  • The date won't be displayed in the search results next to your site pages.

  • Most users won't see your site if they sort the search results by date.

  • The robot won't know if a site page has been updated since it was last indexed. Modified pages will therefore be indexed less often, given that the number of pages the robot gets from a site each time is limited.
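
One way to check whether your server already sends the date is to make a HEAD request and look at the response headers. A minimal sketch with Python's standard urllib; the URL is a placeholder.

    # Sketch: check whether the server sends a Last-Modified header for a page.
    # Replace the placeholder URL with a page from your own site.
    from urllib.request import Request, urlopen

    req = Request("https://example.com/page.html", method="HEAD")
    with urlopen(req) as resp:
        print(resp.headers.get("Last-Modified"))   # None means the header is not sent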

My server doesn't send the encoding. Is that a problem? I tried to set it up, but I couldn't make it work.

The Yandex robot can determine document encoding. If the encoding is missing in server headers, it doesn't prevent the robot from indexing the site.
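
If you still want to declare the encoding explicitly, it is usually enough to add a charset to the Content-Type header your server sends. A minimal WSGI sketch, assuming UTF-8 content; the port and page body are placeholders.

    # Sketch: a WSGI app that declares the document encoding in Content-Type.
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        body = "<html><body>Hello</body></html>".encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8"),
                                  ("Content-Length", str(len(body)))])
        return [body]

    if __name__ == "__main__":
        make_server("", 8080, app).serve_forever()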

My site uses frames. Yandex displays links to internal site frames in the search results. All navigation is unavailable because it's in a different frame. What should we do?

You can try solving the problem with JavaScript: when an internal page loads, check whether it is displayed inside the parent frame that contains the navigation, and if it isn't, load that frame.

There is too much traffic going back and forth between my web server and your robot. Does Yandex support downloading of compressed pages?

Yes, it does. Each time the Yandex robot requests a page, it sends the header “Accept-Encoding: gzip,deflate”. This means you can configure your web server to reduce the traffic between it and our robot. However, note that sending compressed content increases the CPU load on your server, and if the server is overloaded, this can cause problems. When downloading gzip and deflate content, the robot follows the RFC 2616 standard, section 3.5.
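
To check whether your server already returns compressed responses to such requests, you can send the same header yourself and look at the Content-Encoding of the answer. A small sketch with the Python standard library; the URL is a placeholder.

    # Sketch: request a page the way the robot does and check the compression.
    # Replace the placeholder URL with a page from your own site.
    import gzip, zlib
    from urllib.request import Request, urlopen

    req = Request("https://example.com/", headers={"Accept-Encoding": "gzip,deflate"})
    with urlopen(req) as resp:
        encoding = resp.headers.get("Content-Encoding", "")
        raw = resp.read()

    if encoding == "gzip":
        body = gzip.decompress(raw)
    elif encoding == "deflate":
        body = zlib.decompress(raw)    # some servers send raw deflate: zlib.decompress(raw, -15)
    else:
        body = raw                     # the server answered with uncompressed content

    print(encoding or "no compression", "-", len(raw), "bytes on the wire")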

Your robot tries to download my site pages using broken links. Why?

The robot takes links from other pages, which means that one of them contains broken links to your site. Perhaps you changed the site structure and the links on other sites became broken.

What does the robot do if there's a redirect on the page? What if I use the refresh directive?

When the Yandex robot receives a response with a 3xx status code (which means the URL is a redirect), it adds the redirect's target URL to its crawling list. If it is a permanent redirect (the 301 code, or the page contains a refresh directive), the old URL is excluded from the crawling list.
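
If you aren't sure which status code a page actually returns, you can look at the raw response without following the redirect. A short sketch with Python's http.client; the host and path are placeholders.

    # Sketch: inspect the raw status code and Location header of a redirect,
    # without following it (host and path are placeholders).
    from http.client import HTTPSConnection

    conn = HTTPSConnection("example.com")
    conn.request("GET", "/old-page/")
    resp = conn.getresponse()

    print(resp.status)                     # e.g. 301 or 302
    print(resp.getheader("Location"))      # the redirect target, if any
    conn.close()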

My page is regularly missing from the search results. What's the problem?

If the robot gets an error when contacting a page (for example, due to unstable hosting), it removes the page from the search until the next successful contact.
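
If you suspect unstable hosting, a simple periodic check run from a machine outside your hosting can show whether the page is intermittently unavailable. A rough sketch with the Python standard library; the URL and interval are placeholders.

    # Sketch: a very simple availability check to spot intermittent hosting errors.
    # Replace the placeholder URL; run it from a machine outside your hosting.
    import time
    from urllib.error import URLError
    from urllib.request import urlopen

    URL = "https://example.com/page.html"

    for _ in range(12):                    # roughly one hour at a 5-minute interval
        try:
            with urlopen(URL, timeout=10) as resp:
                print(time.strftime("%H:%M:%S"), resp.status)
        except URLError as err:
            print(time.strftime("%H:%M:%S"), "error:", err)
        time.sleep(300)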

Can I manage reindexing frequency with the Revisit-After directive?

No. The Yandex robot ignores it.

Which data transfer protocols are supported for indexing?

Right now, Yandex supports two protocols: HTTP and HTTPS.

How do I tell the robot that it should index pages with or without a forward slash ("/") at the end of the URL?

For the Yandex robot, pages with a “/” at the end of the URL are different from those without it. If the pages are identical, set up a 301 redirect from one page to the other (for example, in the .htaccess file) or specify the canonical URL.
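
If you prefer to handle this in application code rather than in .htaccess, the redirect is a one-line decision per request. A minimal sketch with Python's standard http.server, assuming the version with the trailing slash is the one you want in the search; the port and page body are placeholders.

    # Sketch: redirect URLs without a trailing slash to the same URL with one
    # (assumes the slash version is canonical; swap the logic if it's the opposite).
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class SlashRedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            path, _, query = self.path.partition("?")
            if not path.endswith("/"):     # in practice, skip file-like paths (.css, .jpg, ...)
                self.send_response(301)
                self.send_header("Location", path + "/" + ("?" + query if query else ""))
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(b"page content")

    if __name__ == "__main__":
        HTTPServer(("", 8080), SlashRedirectHandler).serve_forever()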

Why does the robot request non-existent pages/subdomains on my site?

The robot probably found links to them somewhere and tried to index them. Non-existent subdomains and pages should be unavailable or respond with the 404 code. That way, the robot will index only the useful pages of your site.
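
You can quickly confirm that a made-up address on your site returns 404 rather than a normal page. A short check with Python's urllib; the URL is an intentionally non-existent placeholder.

    # Sketch: make sure a non-existent address returns 404, not 200.
    # The URL below is an intentionally made-up placeholder.
    from urllib.error import HTTPError
    from urllib.request import urlopen

    try:
        with urlopen("https://example.com/definitely-not-a-real-page/") as resp:
            print("Unexpected success:", resp.status)   # a 200 here would hide broken URLs
    except HTTPError as err:
        print("Status code:", err.code)                 # 404 is the expected answer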