Huge data leak of 1 billion records exposes China’s vast surveillance state – TechCrunch

A huge store of data containing information about roughly a billion Chinese residents could be one of the largest personal data breaches in history.

Portions of the leaked data surfaced last week on a well-known cybercrime forum, posted by someone selling the cache for 10 bitcoins, or about $200,000. The data was allegedly siphoned from a Shanghai police database stored on Alibaba’s cloud.

Although details of the breach remain scarce, portions of the data have been verified as authentic. Where the data came from, and how it got into the hands of an underground trafficker whose motives are unknown, is still unclear.

News of the alleged breach went largely unreported in mainland China, where speech and expression are tightly controlled and internet access is censored and severely restricted.

The breach, if authentic, raises questions about the sheer scale of China’s surveillance state, the largest and most expansive in the world, and Beijing’s ability to keep that data safe.

Here’s what we’ve learned so far.

How was the data leaked?

In a now-deleted cybercrime forum post, the seller claimed to have downloaded the data from a cloud storage server hosted by the cloud computing arm of Chinese e-commerce giant Alibaba. When reached by TechCrunch on Monday, Alibaba said it was looking into the claims.

Exactly how the data was leaked is unclear, but experts say the database may have been misconfigured through human error and left exposed from April 2021 until it was discovered. That appears to rule out a claim that the database credentials were inadvertently published in a technical blog post on a Chinese developer site in 2020 and later used to siphon the billion records from the police database, since access did not require a password.

Bob Diachenko, a Ukrainian security researcher, told TechCrunch that his own monitoring records show the database was also exposed in late April via a Kibana dashboard, web-based software for visualizing and searching huge Elasticsearch databases. If the database did not require a password, as is believed, anyone who knew its web address could have accessed the data.
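To illustrate what "exposed without a password" means in practice, here is a minimal sketch of how an administrator might check a server they own. It rests on one assumption stated in the comments: that an endpoint with authentication enabled answers an anonymous request with HTTP 401 or 403, while an unprotected one answers 200. The function names and the example URL are hypothetical, and nothing here reflects how the Shanghai database was actually found or accessed.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Map an HTTP status from an anonymous request to a rough verdict.

    Assumption: 200 means the endpoint served content without credentials,
    while 401/403 mean some form of access control is in place.
    """
    if status == 200:
        return "open"
    if status in (401, 403):
        return "auth required"
    return "unknown"


def check_endpoint(url: str, timeout: float = 5.0) -> str:
    """Send one unauthenticated GET to a server you administer."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still useful.
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"


# Hypothetical usage against your own host:
# print(check_endpoint("http://localhost:9200/"))
```

A verdict of "open" on anything reachable from the public internet would mean the data is readable by whoever finds the address.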

Security researchers often scan the internet for accidentally exposed databases or other sensitive data, often to collect bounties offered by the companies they help secure. But threat actors run the same scans, often with the aim of copying data from an exposed database, deleting it, and offering to return the data for a ransom, a tactic criminals have increasingly used in recent years. Diachenko said that is what happened in this case: a malicious actor found, raided, and wiped the exposed database, leaving a ransom note demanding 10 bitcoins for its return.

“My hypothesis here is that the ransom note didn’t work and the attacker decided to get money elsewhere. Or another malicious actor came across the data and decided to put it up for sale,” Diachenko said.

Little is known about the seller or why the data was put online. It is not uncommon for large amounts of personal data to be offered for sale on cybercrime forums and the dark web, but rarely for such sensitive data or in such quantity.

What does the data look like?

TechCrunch reviewed a larger sample of data uploaded by the seller, which contained three files totaling about 500 megabytes in size, each containing 250,000 individual records.
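A quick back-of-the-envelope calculation puts the sample in perspective. The sketch below uses only the figures reported above and assumes "about 500 megabytes" means decimal megabytes, so the results are rough estimates.

```python
# Figures as reported: three files, 250,000 records each, about 500 MB in total.
FILES = 3
RECORDS_PER_FILE = 250_000
TOTAL_BYTES = 500 * 10**6  # assumption: decimal megabytes
CLAIMED_TOTAL_RECORDS = 1_000_000_000  # the seller's "billion records" claim

total_records = FILES * RECORDS_PER_FILE
avg_bytes_per_record = TOTAL_BYTES / total_records
sample_share = total_records / CLAIMED_TOTAL_RECORDS

print(f"records in sample: {total_records:,}")
print(f"average record size: {avg_bytes_per_record:.0f} bytes")
print(f"share of claimed total: {sample_share:.3%}")
```

By this estimate the sample holds 750,000 records averaging a few hundred bytes each, well under a tenth of a percent of the claimed total.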

The data itself is formatted in JSON, a standard file format for Elasticsearch databases, making it easy to read and analyze. The database’s format suggests that it was meticulously maintained and downloaded, rather than assembled by aggregating information from multiple data sources, a common technique used by information sellers and data brokers. However, some of the data may come from external sources, for example from grocery delivery orders.

The sheer size of the data also suggests it is real: that level of detail would be difficult, if not impossible, to fake.

TechCrunch translated the police files, which were written in Chinese, and redacted personal data.
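For readers curious how JSON records can be redacted before analysis, here is a minimal sketch. It is not TechCrunch’s actual process. It assumes one JSON object per line, and the field names are hypothetical placeholders rather than the names used in the leaked files.

```python
import json

# Hypothetical field names; the real records may use different keys.
SENSITIVE_FIELDS = {"name", "address", "phone", "id_number"}


def redact_record(record: dict) -> dict:
    """Return a copy of the record with sensitive top-level values masked."""
    return {
        key: ("[REDACTED]" if key in SENSITIVE_FIELDS else value)
        for key, value in record.items()
    }


def redact_lines(lines):
    """Parse one JSON object per line (an assumption) and redact each record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield redact_record(json.loads(line))


# Made-up sample data for illustration only.
sample = ['{"name": "Example Person", "phone": "000-0000", "report_type": "fraud"}']
redacted = list(redact_lines(sample))
print(redacted[0])
```

Masking by key name only covers top-level fields; personal details buried in free-text fields would need separate review.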

The files appear to contain detailed police reports from 1995 to 2019, including names, addresses, phone numbers, ID numbers, gender, and the reason the police were called. The records viewed by TechCrunch also include precise coordinates of where incidents occurred or reports were made, which match the exact addresses listed in each record, along with the names of the informants who made the reports and the race and ethnicity of the people involved. (The Chinese government has imprisoned more than a million of its own citizens, mostly members of minority Muslim ethnic groups, including Uyghurs and Kazakhs, in what the Biden administration has declared “genocide.”)

The records contain complaints and criminal allegations, ranging from serious violent crimes to the relatively mundane, such as detailed accounts of credit card fraud, internet fraud and gambling, which is illegal in China. Several records viewed by TechCrunch show police cracking down on the use of VPNs, or virtual private networks, which are used to access websites blocked by China’s censorship system and are themselves banned in China. One record showed that a Shanghai resident was accused of using a VPN to post critical remarks about the government on Twitter, which is banned in China. What happened to the person after that is not known.

The data also included full web addresses to photos stored on the same server. None of the photos were accessible at the time of writing, but the associated data often indicates what was uploaded, such as a person’s residence documentation or their passport upon departure. These web addresses are formatted to match the way Alibaba’s cloud service stores files.

Many of the records we examined appeared to contain information about children based on their dates of birth and ages listed in the data.

Without the (unlikely) confirmation of the Chinese government, it is difficult to determine with certainty whether the seller’s claims are genuine and whether the data is, as claimed, from the Shanghai Police Department. The Wall Street Journal, The New York Times and CNN have verified parts of the data by calling people whose information was found in the database, confirming its authenticity.

What is the impact?

This alleged breach, if proven legitimate, could be very damaging to Beijing and raises questions about the government’s cybersecurity measures and the impact the breach will have on individuals.

It comes at a time when China is strengthening personal data protection. Last year, China passed the Personal Information Protection Law, its first comprehensive data protection legislation, which is widely regarded as China’s equivalent of the GDPR data protection rules in Europe. The law restricts how companies can collect personal data and is expected to have wide-ranging implications for the advertising businesses of the country’s biggest tech giants, but it allows broad exemptions for the government agencies and departments that make up China’s vast surveillance apparatus.

Beijing is reportedly already censoring news of the alleged breach, and Chinese platforms WeChat and Weibo are blocking terms such as “data leak” and “database breach.” The Chinese government has not yet commented on the breach.

It’s not the first security lapse to expose a massive set of Chinese residents’ data to the broader internet without a password. In 2019, TechCrunch reported that a smart city installation in China spilled the contents of a facial recognition database of nearby residents.


You can contact this reporter on Signal and WhatsApp at +1 646-755-8849 or email zack.whittaker@techcrunch.com.