
Baidu blocks Google, Bing from scraping content amid demand for data used on AI projects

  • Wikipedia-style service Baidu Baike recently barred the search engine crawlers of Google and Bing from indexing its online content

Baidu has boosted efforts to safeguard its online assets, as demand for vast troves of data has risen for use in generative AI projects. Photo: Shutterstock
Ben Jiang in Beijing
Chinese internet search giant Baidu appears to have started blocking the online search engines of Alphabet’s Google and Microsoft’s Bing from scraping content from the mainland firm’s Wikipedia-style service, a Post survey found.

A recent update of Baidu Baike’s robots.txt – a file that tells search engine crawlers which uniform resource locators, commonly known as web addresses, can be accessed from a site – has outright blocked the ability of the Googlebot and Bingbot crawlers to index content from the Chinese platform.
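For readers unfamiliar with how such a file works, the following is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt text, the `ExampleBot` crawler name, the `can_crawl` helper and the example.com address are all hypothetical and chosen for illustration; they are not taken from Baidu Baike's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
# This is NOT the actual Baidu Baike file.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: ExampleBot
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page address used only to exercise the rules.
URL = "https://example.com/item/some-entry"

for agent in ("Googlebot", "Bingbot", "ExampleBot"):
    print(agent, can_crawl(HYPOTHETICAL_ROBOTS_TXT, agent, URL))
```

Under these assumed rules, a `Disallow: /` line bars the named crawler from every path on the site, while crawlers without such a rule remain free to fetch pages. Compliance is voluntary: robots.txt states the site's wishes, and it is up to each crawler to honour them.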

That update appears to have been made some time on August 8, according to records on internet archive service the Wayback Machine. The records also show that earlier on the same day, Baidu Baike still allowed Google and Bing to browse and index its online repository of nearly 30 million entries, with only part of its website designated as off limits.

This initiative shows Beijing-based Baidu’s increased effort to safeguard its online assets, as demand for vast troves of data has increased for training and building artificial intelligence (AI) models and applications.

That followed a move in July by US social news aggregation platform and forum Reddit, which blocked various search engines, except Google, from indexing its online posts and discussions. Google has a multimillion-dollar deal with Reddit that gives it the right to scrape the social media platform for data to train its AI services.

Since OpenAI released ChatGPT on November 30, 2022, major search platforms Google and Microsoft have sought to obtain more data for use in their own generative artificial intelligence systems. Photo: Shutterstock
Even Microsoft last year threatened to cut off access to its internet-search data, which it licenses to rival search engine operators, if they did not stop using it as the basis for their chatbots and other generative AI (GenAI) services, according to a Bloomberg report.