LLM Grooming as a New Threat: How Pro-Kremlin Networks Prepare Information for AI

Fact Checking Frontier

What is LLM grooming? It is manipulation of the environment in which AI models are trained, achieved by seeding or substituting their training data.
Imagine that someone is mass-seeding the internet with articles, blog posts, and fake news written not for you to read, but for AI.
This is done not for clicks and likes, but to influence how future generations of ChatGPT, Claude, Gemini and other models “think”.
This is LLM grooming: the information programming of artificial intelligence through “prepared” data.
More formally, LLM grooming is a strategy in which malicious actors mass-publish false or manipulative information on the internet, intended not for people but for the automatic data-collection systems used to train large language models (LLMs).
The goal is to embed certain narratives into the models so that they reproduce them in their responses. This can distort how historical events, politics, geography, healthcare and other topics are represented.
LLM grooming is an attack not on human consciousness, but on the knowledge architecture that language models build.
How it looks in practice

  • Articles, blog posts, and fake news are mass-published on the internet, written in a way that is uninteresting to human readers but well suited to data-collection algorithms.
  • The same text is published as if from independent sources, creating an illusion of corroboration.
  • The telltale features: poor UX, no comments, heavy repetition, machine translations, and SEO-driven structure, all tuned to “appeal” to parsers and bots rather than to people.

🧨 How might this look?

  • At first glance, an ordinary blog or forum, but all the posts look suspiciously uniform and repeat the same version of events.
  • Someone creates hundreds of articles about “scientists who proved the harm of electricity”, not to win arguments in comment sections, but so that the articles end up in LLM training datasets.
  • “Evidence” appears on TikTok that “Poland was part of Russia”, again not for views, but for indexing.

🛠️ How to identify LLM Grooming? (methods of conscious auditing)

  1. Listen to what models say
    • Ask sensitive questions and see whether the answers consistently lean in one direction, especially if that lean does not match the balance of opinion in the real expert community.
    • Compare the behavior of models from different developers: GPT, Claude, Gemini. If they all fall for the same narrative, it may have entered their training data.
  2. Monitor sources
    • If a model repeatedly references marginal or suspiciously uniform sites, that is a warning sign.
    • You can add an analysis layer (for example, using the trafilatura or newspaper3k libraries) to see which domains come up most often; a minimal sketch follows this list.
  3. Look at noise in the information space
    • If there’s suddenly a surge of similar texts, structures, or terms — this could be an attempt at a massive injection into the search system and, through it, into future LLM training.
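
As a rough illustration of the domain analysis mentioned above, here is a minimal Python sketch. The URL list is a placeholder standing in for links collected from model answers, citations, or search results; in practice you could also fetch and parse the pages themselves with trafilatura or newspaper3k.

```python
# Minimal sketch: count which domains dominate a set of collected source links.
from collections import Counter
from urllib.parse import urlparse

cited_urls = [
    "https://news.example-a.com/story-1",   # placeholder URLs
    "https://news.example-a.com/story-2",
    "https://blog.example-b.org/post-9",
]

domains = Counter(urlparse(url).netloc for url in cited_urls)

# Domains that show up unusually often across unrelated questions deserve a
# manual check: who runs them, when were they registered, what else do they host?
for domain, count in domains.most_common(10):
    print(domain, count)
```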

🧷 How to protect yourself? (if you’re a researcher, developer, or just concerned)

🔒 If you work with AI:

  • Monitor meta-content: what data gets into your models? Use filters, check domains, validate sources.
  • Add fact-checking layers: claim-detection filters like ClaimBuster and truthfulness benchmarks like TruthfulQA.

🕵️ If you’re a journalist, activist, or analyst:

  • Implement monitoring of key topics on Google, TikTok, Telegram, YouTube.
  • Look for repeating key phrases and phrasing patterns; this can be a signal of “info-seeding” (see the sketch below).
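
As a rough sketch of such monitoring, the snippet below compares collected posts with TF-IDF vectors and cosine similarity to surface near-identical texts. scikit-learn is a tooling choice not named in the article, the sample posts are placeholders, and the scoring would need tuning on real data.

```python
# Minimal sketch: find the most similar pair among monitored posts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "Scientists have proved the harm of electricity, experts say.",
    "Experts say scientists proved that electricity is harmful.",
    "A local bakery opened a second shop downtown this week.",
]

vectors = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(posts)
similarity = cosine_similarity(vectors)

# Highly similar texts published under different "sources" are a seeding signal.
pairs = [(i, j) for i in range(len(posts)) for j in range(i + 1, len(posts))]
i, j = max(pairs, key=lambda p: similarity[p[0], p[1]])
print(f"most similar pair: posts {i} and {j}, score {similarity[i, j]:.2f}")
```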

💬 If you’re just a user:

  • Be skeptical. Even if AI claims something — verify it.
  • Remember: AI might “repeat” those who are louder, not those who are right.

📌 What’s next?

  • LLM grooming is a new vector of FIMI (foreign information manipulation and interference). It is inconspicuous, quiet, and plays a long game.
  • That’s why it’s important not only to protect AI from fakes but also to teach it to think critically — just like people.

Pravda Network Case: networks not for people
The report by the American Sunlight Project (February 2025) describes the activities of the Pravda Network — part of a broader structure called “Portal Kombat”. This structure includes domains and subdomains publishing identical texts with pro-Russian narratives, in different languages and under different brands.
These sites:

  • are designed as news sources,
  • have limited functionality for humans (poor navigation, inconvenient design),
  • are most likely intended to be indexed by crawlers and ingested by AI models.

According to the dashboard at portal-kombat.com, the network includes 182 sites serving the same texts under different language metadata.

The dashboard displays a list of domains, their registration date, registering country, and communication “sphere” (e.g., national affiliation of the site). This is an interactive tool allowing researchers to track the structure of the network, scale of coverage, and distribution of key narratives.

Network Motives
Previous publications on potential motives of the “Pravda” network focused on its anti-Ukrainian and pro-war nature, as well as possible implications for European elections in 2024. However, as this network continues to grow and change, a more thorough examination is needed to determine the possible trajectory of its development. ASP considers three possible, non-mutually exclusive motives for creating the network, which focus on its technological features and shortcomings. These motives are not tied to specific countries, regions, or political events, as the goals of pro-Russian information operations may change.
Explanation A: LLM training
The most significant result of ASP’s research was not the expansion of the network or its orientation toward non-Western states, but the model of future information operations built on automation. The “Pravda” network, huge, fast-growing, and user-unfriendly, is most likely designed for automatic agents: web crawlers, scrapers, and the pipelines that assemble LLM training data. This is mass production and duplication of content aimed at getting it into future AI datasets.
ASP calls this tactic LLM grooming — deliberate saturation of the internet with information intended for machine consumption. In June 2024, NewsGuard showed that leading LLMs reproduce Russian disinformation in 31.8% of cases on average. If measures are not taken, LLM grooming poses a threat to the integrity of the open internet.
February 2023 — the creation date of the “Pravda” network — coincides with the moment of popularization of generative AI. Attempts to attract crawlers through SEO optimization have been recorded before. Unlike traditional SEO, the goal of LLM grooming is not just to increase visibility, but to program AI to repeat certain narratives. This is still a little-studied threat.
Explanation B: mass saturation
The network publishes a huge amount of material daily, saturating the internet with pro-Russian content. This increases:

  • the likelihood that a user will come across the desired narrative,
  • the chance that external sources (e.g., Wikipedia) will reference these materials.

The mass-impact mechanism produces the illusory truth effect: the more often a person encounters a statement, the more likely they are to believe it.
Explanation C: effect of illusory truth from multiple sources
The network distributes the same content through multiple channels: websites, Telegram, X, VK, and even Bluesky. This creates the illusion of confirmed information from “different” sources. Both deliberate “laundering” of information (e.g., when other pro-Russian resources reference the network) and unintentional (when a respected organization or person shares a link without knowing its origin) come into play.
All three motives reinforce each other. The more pages, URLs, and translations the network creates, the higher the probability that narratives will be accepted by both humans and machines. Although the quality of the sites is low, this doesn’t prevent them from becoming part of the digital footprint considered by LLMs.

LLM-grooming Scenarios
The report authors identify three key goals of such networks:

  1. Inclusion in datasets — sites are indexed by search engines and included in LLM training data, embedding pro-Kremlin narratives into the models’ learned knowledge.
  2. Creating an illusion of independent sources — the same text is placed on hundreds of sites, creating an effect of “consensus”.
  3. Blurring the information field — when generating texts, LLMs reference not original sources but copies, amplifying disinformation noise.

What can LLMs do to combat LLM Grooming?
Data filtering during training

  • Large models like GPT are trained on selected, cleaned datasets. During data preparation, filters are applied that remove:
    • spam,
    • automated generation,
    • SEO farms,
    • toxic or manipulative content.
  • This is the first line of defense against LLM grooming: preventing harmful data from entering the training set. A minimal sketch of such a filter follows.
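
The sketch below combines a domain blocklist with a length check and exact-duplicate removal. Real pipelines use far richer quality, toxicity, and near-duplicate filters; the domains and documents here are placeholders.

```python
# Minimal sketch of a pre-training document filter.
import hashlib
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"pravda-farm.example", "seo-farm.example"}   # placeholder blocklist

def keep_document(url: str, text: str, seen_hashes: set) -> bool:
    """Return True if the document should stay in the training corpus."""
    if urlparse(url).netloc.lower() in BLOCKED_DOMAINS:
        return False                      # known grooming / SEO-farm domain
    if len(text.split()) < 50:
        return False                      # too short to be a useful document
    digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False                      # exact copy of a document already kept
    seen_hashes.add(digest)
    return True

seen: set = set()
corpus = [
    ("https://seo-farm.example/a", "identical seeded text " * 40),
    ("https://legit-news.example/b", "an original piece of reporting " * 40),
]
kept = [(url, text) for url, text in corpus if keep_document(url, text, seen)]
print(len(kept), "documents kept")
```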

Generation quality control

  • Models undergo fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to avoid repeating false or harmful narratives, even if they exist in the data.
  • For example, even if someone massively publishes disinformation about vaccines — this doesn’t guarantee the model will reproduce it.

Fact-checking and meta-understanding

  • Models with access to search or retrieval tools can verify information, find sources, compare facts and, where necessary, indicate that a statement is disputed or false.

❗ But there are limitations:

  • If LLM grooming is subtle and hard to notice (e.g., massive but plausible rewriting of history), it’s harder to filter.
  • Open models (like LLaMA, Mistral, etc.) that train on “whatever they can find” may suffer more from LLM grooming.
  • Fighting this is not the task of the model, but rather the task of developers, ethicists, auditors, and dataset engineers.

🤖 What you can do as a human:

  • Create quality content so it gets into datasets.
  • Conduct AI audits, checking how it responds to potentially seeded topics.
  • Apply tools to track “injections,” especially if you’re involved in OSINT, media literacy, or fact-checking.

What are “fact-checking layers” for AI?
These are modules, models, or APIs that:

  • check statements for accuracy;
  • indicate if clarification is needed;
  • or assess the level of plausibility of a phrase.

Such tools work in conjunction with LLMs to:

  • minimize the spread of disinformation;
  • filter training data;
  • increase trust in model responses.
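
As a rough illustration of where such a layer sits, the sketch below post-processes an LLM answer sentence by sentence. The score_claim() function is a hypothetical stub standing in for a real check-worthiness or verification service (a ClaimBuster-style API, for example); the markers and threshold are arbitrary.

```python
# Sketch of a fact-checking layer wrapped around an LLM answer.
import re

def score_claim(sentence: str) -> float:
    """Hypothetical stub: return a 0..1 'needs verification' score."""
    risky_markers = ("proved", "always", "never", "secretly", "controls")
    hits = sum(marker in sentence.lower() for marker in risky_markers)
    return min(1.0, 0.5 * hits)

def annotate_answer(answer: str, threshold: float = 0.5) -> str:
    """Append a caution to sentences the layer considers check-worthy."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    annotated = []
    for sentence in sentences:
        if sentence and score_claim(sentence) >= threshold:
            sentence += " [claim flagged for verification]"
        annotated.append(sentence)
    return " ".join(annotated)

print(annotate_answer("Scientists proved electricity is harmful. Water is wet."))
```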

🛠️ Examples
🔎 ClaimBuster
Essence: an algorithm that automatically finds fact-check-worthy claims in text.
📌 Where it’s useful:

  • for scanning news, posts, politicians’ speeches;
  • for building a dataset of potentially false or misleading statements;
  • can be used as a filter before training a model.

🧪 How it works:

  • Takes text (or a speech transcription) as input.
  • Outputs a check-worthiness score: whether a phrase needs verification or not.

📎 Used in: FactStream by Duke University.
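
As a hedged sketch, the snippet below calls a hosted ClaimBuster-style scoring endpoint. The endpoint path, header name, and response fields reflect ClaimBuster's public API as the author understands it and should be verified against the current documentation; the API key and input text are placeholders.

```python
# Sketch: score a piece of text for check-worthy claims via an HTTP API.
from urllib.parse import quote
import requests

API_KEY = "YOUR_API_KEY"                     # placeholder
TEXT = "Scientists proved that Poland was part of Russia."

response = requests.get(
    f"https://idir.uta.edu/claimbuster/api/v2/score/text/{quote(TEXT)}",
    headers={"x-api-key": API_KEY},
    timeout=30,
)
response.raise_for_status()

# Field names are an assumption; inspect the raw JSON on a first run.
for result in response.json().get("results", []):
    print(result.get("score"), result.get("text"))
```
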
📚 TruthfulQA
Essence: a benchmark dataset (with accompanying evaluation models) for assessing how truthfully LLMs answer questions that invite common misconceptions.
📌 Where it’s useful:

  • as an additional layer in the text generation pipeline;
  • for evaluating and stress-testing models on “suspicious” queries (e.g., “Does Bill Gates control the weather?”).

🧠 What makes it interesting:

  • It doesn’t just check facts, but evaluates the reliability of AI responses to potentially dubious questions.
  • The model learns to say “I don’t know” or indicate that information is controversial.
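
For a quick look at the benchmark itself, the sketch below loads TruthfulQA from the Hugging Face Hub. The dataset id, configuration name, and field names follow the hub listing as the author recalls it, so verify them before relying on this.

```python
# Sketch: inspect a few TruthfulQA questions (requires the `datasets` package).
from datasets import load_dataset

truthful_qa = load_dataset("truthful_qa", "generation")["validation"]

for row in truthful_qa.select(range(3)):
    print("Q:", row["question"])
    print("Reference answer:", row["best_answer"])
    print()
```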

Recommendations for different target groups
Those studying information operations

  • Attention to sources created not for humans, but for machines.
  • Monitoring artificial networks with suspiciously uniform content.

LLM developers

  • Pay attention to the origin of training data.
  • Embed fact-checking modules and reliability assessment systems (e.g., TruthfulQA-style truthfulness evaluations).
  • Use filters like ClaimBuster (or their equivalents for Cyrillic languages) at the data preprocessing stage.

Fact-checkers and journalists

  • Identify multiple publications of the same texts under different domains.
  • Compare where LLMs draw examples and quotes from.
  • Apply parsers and tools for analyzing network structures (e.g., trafilatura, Graphistry).
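
As a rough illustration of that last point, the sketch below uses trafilatura to extract the main text of articles from several domains and checks whether they are literal copies. The URLs are placeholders; real investigations would also need near-duplicate matching, not just exact hashes.

```python
# Sketch: detect identical article bodies served under different domains.
import hashlib
import trafilatura

urls = [
    "https://site-one.example/article",      # placeholder URLs
    "https://site-two.example/article",
]

texts = {}
for url in urls:
    downloaded = trafilatura.fetch_url(url)              # None if the fetch fails
    extracted = trafilatura.extract(downloaded) if downloaded else None
    if extracted:
        texts[url] = extracted

groups: dict[str, list[str]] = {}
for url, text in texts.items():
    digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    groups.setdefault(digest, []).append(url)

for same_urls in groups.values():
    if len(same_urls) > 1:
        print("identical article served by:", same_urls)
```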

The phenomenon of LLM grooming is not only a new front in disinformation but also a challenge for developers and regulators. Content farms are being created that target not human audiences but machines. Combating these processes requires new approaches to data auditing, indexing, and the curation of LLM training data.
The sooner we learn to recognize LLM Grooming, the better we can protect the information environment of the future.

Factcheck BY