A recently analyzed dataset used for training large language models (LLMs) has been found to contain nearly 12,000 live secrets, including passwords and API keys that still authenticate successfully and could be used to gain unauthorized access.
The finding underscores the severe security risks posed by hardcoded credentials, which can be devastating for both users and organizations. The problem is compounded when LLMs, trained on such data, go on to suggest insecure coding practices to their users.
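To make the insecure pattern concrete, here is a minimal Python sketch contrasting a hardcoded credential, the kind that ends up in public pages and then in training data, with the safer environment-variable approach; the key value and variable names are illustrative placeholders, not real credentials.

```python
import os

import requests

# Insecure pattern commonly found in crawled web pages: a credential
# hardcoded directly into source code. (Placeholder value, not a real key.)
MAILCHIMP_API_KEY = "0123456789abcdef0123456789abcdef-us14"

# Safer pattern: read the secret from the environment at runtime so it
# never appears in the source code that gets published or crawled.
api_key = os.environ["MAILCHIMP_API_KEY"]

# Mailchimp's API uses HTTP Basic auth; any username works, with the key
# as the password. The "us14" suffix selects the datacenter subdomain.
response = requests.get(
    "https://us14.api.mailchimp.com/3.0/ping",
    auth=("anystring", api_key),
    timeout=10,
)
print(response.status_code)
```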
Truffle Security analyzed a December 2024 archive from Common Crawl, which maintains a free, publicly available repository of web crawl data. The overall corpus, spanning 18 years and more than 250 billion pages, is a treasure trove of information.
The archive comprises 400TB of compressed web data, 90,000 WARC files, and data from 47.5 million hosts across 38.3 million registered domains. The company’s analysis of this data uncovered 219 different secret types, including AWS root keys, Slack webhooks, and Mailchimp API keys.
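Scanners such as Truffle Security’s open-source TruffleHog work by matching known credential formats in text and then attempting to verify the hits. The following simplified Python sketch illustrates the detection step; the regexes are rough approximations for illustration, not TruffleHog’s actual detector rules.

```python
import re

# Illustrative credential patterns (simplified approximations only).
DETECTORS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(
        r"https://hooks\.slack\.com/services/T[0-9A-Z]+/B[0-9A-Z]+/[0-9A-Za-z]+"
    ),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us\d{1,2}\b"),
}

def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (detector_name, candidate_secret) pairs found in a page."""
    return [
        (name, match)
        for name, pattern in DETECTORS.items()
        for match in pattern.findall(text)
    ]

# Example: scanning the raw HTML of a single page from a WARC record.
page = '<script>const hook = "https://hooks.slack.com/services/T0000/B0000/XXXX";</script>'
print(scan_text(page))  # [('slack_webhook', 'https://hooks.slack.com/...')]
```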
Security researcher Joe Leon explains that “‘live’ secrets refer to API keys, passwords, and other credentials that can successfully authenticate with their respective services.” He adds that LLMs are unable to distinguish between valid and invalid secrets during training, which means that both types of secrets contribute equally to providing insecure code examples.
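The article does not detail how the live-or-dead check was performed; as one hedged illustration of what “successfully authenticate” can mean in practice, the sketch below tests an AWS key pair with a harmless STS call (the helper name aws_key_is_live is hypothetical).

```python
import boto3
from botocore.exceptions import ClientError

def aws_key_is_live(access_key_id: str, secret_access_key: str) -> bool:
    """Try a harmless authenticated call; success means the key is live."""
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    try:
        # GetCallerIdentity is read-only and requires no IAM permissions
        # beyond the credentials themselves being valid.
        sts.get_caller_identity()
        return True
    except ClientError:
        return False  # invalid, revoked, or otherwise non-functional key
```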
As a result, insecure coding patterns get reinforced during training even when the secrets involved are invalid or merely examples, putting users and organizations at risk of security breaches.
The discovery comes on the heels of a warning from Lasso Security, which highlighted the risks of data exposed via public source code repositories being accessible through AI chatbots like Microsoft Copilot. This vulnerability, known as Wayback Copilot, can allow attackers to access sensitive information even after the repositories have been made private.
The attack method has already uncovered 20,580 GitHub repositories belonging to 16,290 organizations, including Microsoft, Google, Intel, Huawei, PayPal, IBM, and Tencent, among others. These repositories have exposed over 300 private tokens, keys, and secrets for GitHub, Hugging Face, Google Cloud, and OpenAI.
Lasso Security warns that any information that was ever public, even briefly, could remain accessible and be distributed by Microsoft Copilot. The risk is particularly acute for repositories that were mistakenly published as public before being secured, given the sensitive nature of the data they may hold.
Recent research has also shown that fine-tuning an AI language model on examples of insecure code can lead to unexpected and harmful behavior, even in response to prompts unrelated to coding.