Email contents and email addresses are both text. Yes, the concept still applies. The point is that email contents are almost never the same to begin with. If we're including the timestamp, then every email is almost unique by default. Mentioning that the email addresses are unique is making the point that we've identified just as many people [1], which is interesting. The statement "these 3.3 billion emails are unique" is much less interesting, because we've identified messages and not people. Also, people are usually more concerned with the information in the email rather than the count. Most/a lot of the value of an email comes from the information in it.
If I were to release 3.3 billion emails between random low-profile office workers (let's say) which contain nothing interesting, I'm not so sure that would make a headline.
Why would anyone assume Hacker News titles are maximally interesting? In practice they often aren't. I am with the OP on this one.
Also, 3.3 billion unique emails are strictly more interesting than just the addresses, since an email includes addresses and a subject line by definition.
> The point is that email contents are almost never the same to begin with
Obviously, if you have emails that were generated in different events, thus having different Message-ID and timestamp fields, they will be unique.
But non-uniqueness could crop up in a dataset for various reasons. As the simplest example, imagine this guy aggregated datasets A, B, and C, but it turns out C was itself already an aggregate of A and B. Then all the emails in A and B would be duplicated in the final dataset.
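To make the A/B/C example concrete, here is a minimal sketch. The addresses and the `merge_unique` helper are made up for illustration and are not taken from the actual dataset or from whatever tooling was used:

```python
def merge_unique(*datasets):
    """Merge several datasets, keeping only the first occurrence of each record."""
    seen = set()
    merged = []
    for dataset in datasets:
        for record in dataset:
            if record not in seen:
                seen.add(record)
                merged.append(record)
    return merged


# Hypothetical sample data: C turns out to be an aggregate of A and B.
a = ["alice@example.com", "bob@example.com"]
b = ["carol@example.com"]
c = a + b

naive = a + b + c               # plain concatenation: every record appears twice
unique = merge_unique(a, b, c)  # deduplicated: each record appears once

print(len(naive), len(unique))
```

The naive aggregate is twice the size of the deduplicated one, which is why "unique" in a headline count carries real information.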
So of course when publishing some huge collection of data from many different sources, it's useful to make sure each piece of data is unique, and the title is just pointing out that for this data set, that has indeed been done. This logic applies whether the data is messages or addresses.
If you just look at the body text, and not the headers, it is even less likely for emails to be unique due to mass spam.
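A rough sketch of the difference between "unique by headers" and "unique by body", using Python's standard `email` module. The three raw messages are invented for illustration, and hashing the body with SHA-256 is just one assumed way to compare contents:

```python
import hashlib
from email import message_from_string

# Hypothetical messages: two copies of the same spam with different headers,
# plus one unrelated message.
RAW_MESSAGES = [
    "Message-ID: <1@example.com>\nSubject: Offer\n\nBuy now!\n",
    "Message-ID: <2@example.com>\nSubject: Offer\n\nBuy now!\n",
    "Message-ID: <3@example.com>\nSubject: Lunch\n\nLunch at noon?\n",
]


def body_digest(raw):
    """Return a SHA-256 digest of the message body, ignoring all headers."""
    body = message_from_string(raw).get_payload()
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


unique_by_id = {message_from_string(raw)["Message-ID"] for raw in RAW_MESSAGES}
unique_by_body = {body_digest(raw) for raw in RAW_MESSAGES}

print(len(unique_by_id), len(unique_by_body))
```

All three messages are distinct by Message-ID, but only two distinct bodies remain once the headers are ignored.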
> Mentioning that the email addresses are unique is making the point that we've identified just as many people, which is interesting.
No it isn't. He didn't say that the addresses correspond to unique _people_, just that they are unique addresses, textually. The mapping of email addresses to people is not even close to one-to-one.
> Also, people are usually more concerned with the information in the email rather than the count. Most/a lot of the value of an email comes from the information in it.
But the article/headline isn't just saying a count was published, it's saying the emails themselves were leaked. If this meant email messages rather than addresses, then it would indeed mean the valuable information in the emails had been compromised. Why are you saying that wouldn't be interesting?
> If I were to release 3.3 billion emails between random low-profile office workers (let's say) which contain nothing interesting, I'm not so sure that would make a headline.
I think it would, assuming they were between humans and not just spam. A leak of 3.3 billion ostensibly private messages, on any platform (email, twitter DMs, whatever), would be by far the most serious data breach in the history of the internet.
Who really cares? E-mail, like SMS and phone calls, is 99% e-generated garbage.
With my phone, if you're not in the directory you have to leave voicemail. If you leave voicemail then based on what the voicemail is, I might or might not respond or I might just block the number.
With email, everything is automatically reported as spam unless it's in the whitelist. No exceptions.
SMS is harder to deal with but I can and do report SMS spam.
He didn’t leak them; he just collected whatever is circulating around, cleaned it (with a regex used by Troy Hunt of HaveIBeenPwned), and then redistributed it. The post is not clear, but from the screenshot of BreachForums it seems to be emails AND passwords, not just emails.
Is there any reason to suspect every single one is valid? I have some experience with breached password collections, and at least 80% of entries are fake (even more for larger collections).
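For a sense of what "cleaned with a regex" could look like, here is a sketch. The pattern below is a deliberately simple stand-in that I am assuming for illustration; it is NOT the regex Troy Hunt uses, and the `address:password` line format is also an assumption:

```python
import re

# Simplified, hypothetical address pattern (not the HaveIBeenPwned regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean(lines):
    """Extract well-formed, lowercased, unique addresses from 'address:password' lines."""
    seen = set()
    cleaned = []
    for line in lines:
        # Assume anything before the first ':' is the address.
        address = line.split(":", 1)[0].strip().lower()
        if EMAIL_RE.fullmatch(address) and address not in seen:
            seen.add(address)
            cleaned.append(address)
    return cleaned


sample = [
    "Alice@Example.com:hunter2",
    "alice@example.com:other",   # same address after lowercasing
    "not-an-email",
    "bob@example.org",
]
print(clean(sample))
```

Note that a pattern match only says the string is shaped like an address; it says nothing about whether the mailbox exists.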