Relevant to the post I sent earlier today, https://www.howtogeek.com/is-microsoft-using-your-word-documents-to-train-ai/ says that an unnamed Microsoft spokesperson claims that “Microsoft does not use customer data from Microsoft 365 consumer and commercial applications to train large language models.”
It depends, I suppose, on what you mean by training. Maybe the lawyers define what they do as using data, not training.
Thank you. 🙏
Another rule: PR = BS!
"Facebook cares about the mental health of teens, pinky promise."
A vast number of Fortune 500 companies use OneDrive for secure storage. Microsoft is not going to invite parallel lawsuits from that customer base.
You can use 365 to run simple textual analysis against models MSFT has already trained, for tasks like sentiment analysis.
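For what it's worth, that kind of call is pure inference. A minimal sketch using the Azure Text Analytics Python SDK (the endpoint, key, and sample sentences are placeholders I made up, not anything specific to Word):

    # Scoring text against a model MSFT already trained; nothing here
    # trains on your data. Endpoint and key are made-up placeholders.
    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential

    client = TextAnalyticsClient(
        endpoint="https://<your-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    docs = ["The new release is fantastic.", "Support never answered my ticket."]
    for result in client.analyze_sentiment(documents=docs):
        # Inference only: the model's weights are fixed and are not
        # updated by anything you send.
        print(result.sentiment, result.confidence_scores)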
I’m not sure how using a text-classification service devolves into widespread violation of intellectual property law. Perhaps people worrying about it should read their usage contracts? That usually clears up any question.
I had to answer similar questions on projects 5-6 years ago. The master services agreements in force with companies cover this.
So I too freaked out over this one, ended up doing my homework, and found that it looks legit. It is optional, it sends data only when asked, and it follows sound data-management practice by minimizing both the data sent and the retention time.
The one remaining issue is that many users reported being opted in by default, which could be partly explained by company policy or by users clicking away the nag box without reading it.
Still.
When both the CEO and CTO make shockingly dumb public statements about IP, and they partner with OpenAI of all people, and own LinkedIn, which pulled a silent opt-in stunt just now, and ship the piracy-trained, celebrity-deepfake engine DALL-E 3 as safe for minors with their OS, there is negative trust to build on.
m$crosoft is not to be trusted at all; see https://sneak.berlin/20200307/the-case-against-microsoft-and-github/.
Windows 11 is one of the most horrible OSes they have released, if not the worst.
Microsoft is constantly prompting me to back up my documents on their cloud server (OneDrive). Is their intention here to make my data more readily available for training their LLMs?
I doubt it; they've been bothering me about that since 2018.
I just recently yanked everything I have off of Google Drive, and I never trusted OneDrive due to its various eccentricities.
And because I'm a cheap bastard, I'm using OpenOffice now for writing and spreadsheets.
It's inconvenient as hell, but I don't trust Google or Microsoft anymore. Not just with regard to scraping all my writing into an LLM, but because I don't trust them not to abruptly lock my documents in place and charge me a fee to use them.
Google Docs and Drive have been great, but as they say, if the product is free you are the product.
Reading the original post, it looks to me like people are confusing training with runtime (inference).
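To make the distinction concrete, here is a toy scikit-learn sketch (the corpus and labels are made up, and any model would do). Training absorbs your documents into the model's weights; runtime just scores a new document against fixed weights and discards it:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    corpus = ["great product", "terrible support", "love it", "waste of money"]
    labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

    vec = CountVectorizer()
    X = vec.fit_transform(corpus)

    # Training: the documents change the fitted weights, so their content
    # is statistically baked into the model.
    model = LogisticRegression().fit(X, labels)

    # Runtime (inference): the new document is scored against those fixed
    # weights and then discarded; the model itself does not change.
    print(model.predict(vec.transform(["love this great product"])))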
But they might (and in fact do, if I understand the training they offered) use them to build the boundary API systems, which are not LLMs (because they are not ANNs); that possibility is consistent with the statement. I have tried to raise this question officially and gotten nowhere, but I will keep pressing.
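For example, a plain naive Bayes text classifier (a sketch with made-up documents and labels, purely to illustrate the loophole) can be trained on customer text yet is neither an ANN nor an LLM, so training it would not contradict the literal wording of that denial:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Made-up stand-ins for customer documents and routing labels.
    docs = ["quarterly revenue forecast", "team offsite agenda and schedule"]
    labels = ["finance", "hr"]

    # Trained on the text, but not an ANN and certainly not an LLM.
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)
    print(clf.predict(["revenue forecast for next quarter"]))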
Under GDPR this is punishable by fines. It remains to be seen whether Trump will stand with us Europeans to protect users or on the side of companies because "they are American".
Thanks for asking for clarity. So far, very little has been explained, and no permission has been requested from consumers. What about teach-ins and town halls all across the country explaining the objectives of this technology? Why are we passive partners?
Why would anyone waste cycles training AI with a bunch of random documents?
The cesspool of the internet is bad enough. But data from a bunch of Word documents, not curated or verified in any way?
That option has been around for a long time; I don't think it has anything to do with AI unless they've secretly extended its scope recently to include it. But given that the world's most obvious GDPR violation would cost them up to 4% of their global annual turnover, it's unlikely.
Keep them on their toes, Marcus! They can say what they will, but it's a fine line we're walking in "trusting" them.