



Meta Starts Using Data From EU Users To Train Its AI Models (engadget.com) 29
Meta said the company plans to start using data collected from its users in the European Union to train its AI systems. Engadget reports: Starting this week, the tech giant will begin notifying Europeans through email and its family of apps of the fact, with the message set to include an explanation of the kind of data it plans to use as part of the training. Additionally, the notification will link out to a form users can complete to opt out of the process. "We have made this objection form easy to find, read, and use, and we'll honor all objection forms we have already received, as well as newly submitted ones," says Meta.
The company notes it will only use data it collects from public posts and Meta AI interactions for training purposes. It won't use private messages in its training sets, nor any interactions, public or otherwise, made by users under the age of 18. As for why the company wants to start using EU data now, it claims the information will allow it to fine-tune its future models to better serve Europeans. "We believe we have a responsibility to build AI that's not just available to Europeans, but is actually built for them. That's why it's so important for our generative AI models to be trained on a variety of data so they can understand the incredible and diverse nuances and complexities that make up European communities," Meta states.
"That means everything from dialects and colloquialisms, to hyper-local knowledge and the distinct ways different countries use humor and sarcasm on our products. This is particularly important as AI models become more advanced with multi-modal functionality, which spans text, voice, video, and imagery."
begin notifying Europeans through email (Score:2)
So they are going to send emails to everyone in Europe?
Or just citizens of the EU.
(of course the term "European" could also mean white people living in the Americas)
Re: (Score:2)
Those in the EU, as said in the source:
"Beginning this week, people based in the EU who use Meta's platforms ..."
Re: (Score:2)
Their official announcement refers to "people based in the EU who use Meta’s platforms" https://about.fb.com/news/2025... [fb.com]
Re: begin notifying Europeans through email (Score:1)
Re: (Score:2)
The EU issued an opinion several months ago saying it's legal if the legitimate interest is properly justified, citing as an example deploying an accessible assistant to help users.
Not surprising (Score:3)
Because to hell with the GDPR, right? (Score:2)
Big corporations will just ignore all attempts to regulate their profits away from them.
Re: (Score:2)
Right to be Forgotten? (Score:2)
And how will this work with the EU Right To Be Forgotten?
Re: (Score:2)
They then delete the source data as they are legally obligated to do.
Re: (Score:1)
And how will this work with the EU Right To Be Forgotten?
Ironically it will work exactly as it does right now.
I see you do a thing I don't like and I remember you did it.
I treat you differently because of what you did, but I do not ever say that's why.
I've now committed an unprovable crime.
The LLM will also remember you did bad things even after the source reference is removed.
It will treat you differently because of that.
With the original material deleted, the LLM won't know the reason you are different, only that you are.
It too has committed an unprovable crime
Re: (Score:2)
If the data is completely anonymised - and there would be no other way to collect this data in the EU - then there will be no personal data to be forgotten to begin with.
Opt-out, really? (Score:4, Interesting)
GDPR requires Facebook to choose among six legal bases:
I do not see where "grab the data first and remove later if people complain" fits.
Re: (Score:2)
GDPR requires Facebook to choose among six legal bases:
I do not see where "grab the data first and remove later if people complain" fits.
Not to mention, I'm not sure how they'll yank the data out of the LLMs they are training post-facto, unless they plan to restart the training every time they get an opt-out request. Don't these things tokenize the data as it's sucked in, in such a way that any attempt to post-edit the tokenized data would be near impossible? It seems like they're not only going about this completely backwards on a legal basis, but they may also be outright lying about what they are technically capable of doing. Because you kn
Re: (Score:2)
I'm not sure how they'll yank the data out of the LLMs they are training post-facto
I suspect by arguing that the data isn't in the model to begin with.
Don't these things tokenize the data as it's sucked in, in such a way that any attempt to post-edit the tokenized data would be near impossible?
It's impossible, because the tokenized data isn't stored.
It's used to train. I.e., each sequence of tokens is used to get the output a bit closer to the answer.
In the cases where exact information is duplicated many times, the model could technically reproduce it exactly via over-fitting, but you'd never ever be able to "take the model apart and find out where those particular weights are".
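The "tokens nudge the output closer" point above can be sketched in a toy form like this (my own illustration with a made-up loss and finite-difference gradients, nothing like an actual LLM pipeline):

```python
import random

random.seed(0)
weights = [random.random() for _ in range(4)]  # stand-in for model parameters

def loss(w, tokens):
    # Hypothetical loss: squared error between a weighted sum of the
    # context tokens and the next token.
    return (sum(wi * t for wi, t in zip(w, tokens[:-1])) - tokens[-1]) ** 2

def train_step(w, tokens, lr=0.01):
    # Finite-difference gradient descent: nudge w toward predicting this
    # sequence; the tokenized sequence itself is discarded afterwards.
    eps = 1e-6
    base = loss(w, tokens)
    grad = [(loss(w[:i] + [w[i] + eps] + w[i + 1:], tokens) - base) / eps
            for i in range(len(w))]
    return [wi - lr * gi for wi, gi in zip(w, grad)]

sequence = [1, 2, 3, 4, 10]  # one tokenized training example
new_weights = train_step(weights, sequence)
# The weights moved a bit closer to predicting the sequence, but the
# sequence is not stored anywhere in them.
```

This is why "un-training" a single example post hoc is hard: its contribution is smeared across every updated parameter rather than sitting in a retrievable record.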
Re: (Score:3)
The question is where is it stored. The answer is in both the choice of tokens and the relative relationships between the tokens. In this case, the relationships are high dimensional geometric patterns precisely constructed to make the training data appear the most likely for extraction.
The collage analogy applies here. If you take a paper document and pass it t
Re: (Score:2)
It's silly to say that the data isn't stored.
No, it is not.
It obviously is
If a falsehood seems obviously true to you, then that means you don't understand the topic you're talking about.
and researchers have been able to extract it in previous jailbreaking experiments.
For certain sets of data that may have been duplicated many many times in the training set, combined with a properly engineered prompt, you may be able to induce an LLM to reproduce data that was used to produce its weights. You may also end up with random bullshit.
This does not mean the data is in the model.
The collage analogy applies here. If you take a paper document and pass it through a shredder that produces confetti tokens, it may be difficult to recombine the confetti into the original document, but it's not impossible and is already often done in archaeological studies.
That is not what is done in LLM pretraining, at all.
Backpropagation can be
Re: (Score:2)
I also don't know why you keep harping on about data not being in the model, as if a model is purely a vector of numbers representing wei
Re: (Score:2)
We've had these kinds of discussions before, DamnOregonian.
Then you were wrong then, too.
You can't ignore the fact that researchers have extracted painter's signatures from image generators, and phrases and paragraphs from public and private corpus documents that were learned by LLMs.
Nobody's ignoring it. What you're ignoring is the nature of the memorization- calling it data storage, which is flatly, undeniably, incorrect.
That is enough to falsify your basic contentions, even though you can't immediately see a mechanism for it yourself.
No, it literally is not.
Given only a few terms, I can write a simple function that generates your name out of some specific input that is smaller than it.
Is this compression?
Is anything that can produce something compression?
Is 3 just the compressed form of 1+2? Then what about 4-1?
Tokens are reduced to obscenely high dimensional ve
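The "simple function that generates your name from a smaller input" point can be made concrete with a sketch like this (my own toy, a bijective base-26 encoding; the name is just an arbitrary lowercase example):

```python
def encode(s: str) -> int:
    # Pack a lowercase string into a single integer (bijective base-26).
    n = 0
    for c in s:
        n = n * 26 + (ord(c) - ord('a') + 1)
    return n

def decode(n: int) -> str:
    # Regenerate the string from the integer alone.
    out = []
    while n:
        n, r = divmod(n - 1, 26)
        out.append(chr(ord('a') + r))
    return ''.join(reversed(out))

seed = encode("oregonian")
print(seed, "->", decode(seed))  # a single number "generates" the name
```

Whether `seed` counts as a stored or compressed copy of the string is exactly the rhetorical question above: the information is only recoverable relative to a particular decoding algorithm.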
Re: (Score:2)
I stand by my claim, you are trying to make distinctions between approximation and exact representation which don't exist. Perhaps read up on how data is actually stored on physical storage systems? If nothing else, you'll see that the entire Encyclopaedia Britannica in
Re: (Score:2)
The Kolmogorov complexity of any given string does not imply that anything that can generate that string contains the data from that string.
I stand by my claim, you are trying to make distinctions between approximation and exact representation which don't exist.
To the contrary, you're trying to conflate relationships with data.
a, b, c.
If I store 1, 2, that doesn't mean I've stored a, b, c, and yet a, b, and c can be reconstructed with the correct algorithm.
The trick, of course,
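The a/b/c example above can be sketched like this (my own toy, assuming the stored "1, 2" are a start letter and a step):

```python
stored = (1, 2)  # the only values actually "stored"

def reconstruct(start, step, count=3):
    # The algorithm, not the stored numbers, carries the alphabet
    # mapping: 1 -> 'a', then advance by (step - start) per position.
    diff = step - start
    return [chr(ord('a') + start - 1 + i * diff) for i in range(count)]

print(reconstruct(*stored))  # ['a', 'b', 'c']
```

The letters come out, but they live in the relationship between the stored values and the reconstruction code, not in the stored values themselves.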
Re: (Score:2)
Make it easy for yourself: there is a huge pile of training data and a small model. The more artist signatures you can extract, the fewer others can be extracted. Pigeonhole principle: if the space is limited, each part that goes in means another part can't be stored.
If the model is overfit on certain pieces, it was trained badly, but in particular that means it cannot overfit on others. That's also stated by related work; people are only reading the parts they want to read.
What's relevant for you: I am p
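A back-of-envelope version of the pigeonhole argument above (all numbers are illustrative assumptions, not Meta's actual figures):

```python
params = 8e9             # assumed model size: 8B parameters
bytes_per_param = 2      # bf16 storage
training_tokens = 15e12  # assumed training set: 15T tokens
bytes_per_token = 4      # rough average UTF-8 bytes per token

model_bytes = params * bytes_per_param
data_bytes = training_tokens * bytes_per_token
ratio = data_bytes / model_bytes
print(f"training data ~{ratio:.0f}x larger than the model")
# Under these assumptions the data is thousands of times larger than
# the weights, so only a small fraction can possibly be memorized.
```

The specific numbers don't matter much; any plausible choice leaves the training set orders of magnitude larger than the parameter file.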
Re: (Score:2)
The European Data Protection Board had previously (2024-12-18) issued a lengthy analysis stating that such AI training can rely on legitimate interest.
the Opinion recalls that an interest may be regarded as legitimate if the following three cumulative criteria are met: the interest (1) is lawful; (2) is clearly and precisely articulated; and (3) is real and present (i.e. not speculative). Such interest may cover, for instance, in the development of an AI model - developing the service of a conversational agent to assist users, or in its deployment - improving threat detection in an information system. https://www.edpb.europa.eu/our... [europa.eu]
Re: (Score:2)
Companies use "legitimate interest" as a backdoor. Click into the details of a cookie banner and look at what is listed there:
1) PLEASE allow our 156 partners to sell your data! (default off if you choose customize)
2) Legitimate Interest: Our advertising partners want to track you! (default on, can be disabled)
3) Necessary: NECESSARY analytics and datamining and user profiling and fingerprinting and verification through data brokers (you cannot disable this)
A bit exaggerated, but you get the idea.
Re: (Score:2)
GDPR requires Facebook to choose among six legal bases:
I do not see where "grab the data first and remove later if people complain" fits.
Basically, because the US has given rich people and corporations carte blanche they assume it applies everywhere.
Only public posts (Score:2)
So they're going to train their AI to be an internet troll?
Training AI on facebook posts ? (Score:2)
There is a difference (Score:2)
There is a difference between European data and data from European users.
Crawl the web like everyone else. Don't exploit your access to social media messages.