openai / requests-for-research

A living collection of deep learning problems

Home Page: https://openai.com/requests-for-research

Background information for "Spam the spammers"

freach opened this issue

A bit of background about me: I co-founded a startup in the security industry, where my main focus is collecting security-related data (spam, attacks, ...) to identify infected or otherwise compromised hosts on the internet. To do that, we built a sensor network that collects ~500 million security events per day, ~70% of which is Spam. I did a lot of research on this data to find out what Spam looks like, where it originates, and what mechanisms were used to send it. I'd like to share my findings about Email Spam with you because they might be helpful for your "Spam the spammers" idea.

First, I'd like to stress that Spammers send Spam not because they want to annoy you, but because they want to make money. The business model behind a given Spam campaign can vary a lot. Spam is used to advertise real e-commerce platforms or services, phish for account information, distribute malware, or trick you into behavior such as sharing information you shouldn't share. So sometimes the purpose of Spam is to directly drive product sales, and sometimes it is to expand the Spammer's resources. Only a fraction of these business models require a reply from the recipient, and most Spam is designed to collect that reply or feedback through channels other than Email.

This follows from the way Spam is distributed. Sending the volume of Spam we currently face requires mechanisms very different from traditional Email infrastructure. Spammers don't use their own infrastructure to send Spam; they use other people's resources: misconfigured Email servers, compromised accounts, vulnerable websites, infected computers, compromised servers. They get very creative; I have seen printers and TVs send Spam. Using other people's resources often means you need to impersonate a valid sender, and that in turn often means the original sender of the Spam can no longer be reached. Replies to Spam would end up in real people's inboxes or would be rejected by real Email infrastructure. The protocol is used purely to get content out into the world, not in the way it was intended to be used.

So, as much as I like the idea of Spamming the Spammers, I think replying to Spam at large scale would hurt the wrong people, who are not even aware that they are sending Spam.

What I noticed in Spam content is that different campaigns often use similar or even identical templates. Templates often combine Spintax with static blocks of text. Of course there are more sophisticated techniques for generating the Spam corpus, but these are used almost exclusively for highly targeted content. High-volume Spammers don't care about quality; they care about volume and throughput. So to kill 90% of the Spam out there, template-based detection mechanisms should be sufficient. I would say the techniques the industry currently uses to detect Spam at the content level are very primitive, which is why Spam is still an issue. What I would like to see is an approach that detects Spam not at the content level but at the communication level. I noticed that Spammers often use the same underlying software to send Spam. This software leaves traces in how a Spammer communicates and behaves at the protocol level. There are traces in all layers, from IP and TCP up to the application layer. I would like to see learning mechanisms that learn how Spammers communicate and kill them before their content hits your inbox.
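To make the template idea a bit more concrete, here is a minimal sketch of what template-based matching could look like. It assumes the common Spintax convention of `{option A|option B}` groups (possibly nested) mixed with static text, and compiles such a template into a regular expression. The function names and the example template are hypothetical and only for illustration; this is not how any particular product does it.

```python
import re
from typing import Iterable, List, Optional


def spintax_to_regex(template: str) -> re.Pattern:
    """Compile a Spintax template such as "{Hi|Hello} there" into a regex.

    Assumption: "{" opens a group, "|" separates options inside a group,
    and "}" closes it. Everything else is static text.
    """
    # Collapse whitespace runs so each gap in the template maps to one \s+.
    template = " ".join(template.split())
    parts: List[str] = []
    depth = 0
    for ch in template:
        if ch == "{":
            depth += 1
            parts.append("(?:")
        elif ch == "}" and depth > 0:
            depth -= 1
            parts.append(")")
        elif ch == "|" and depth > 0:
            parts.append("|")
        elif ch == " ":
            # Tolerate line wrapping and repeated spaces in the message.
            parts.append(r"\s+")
        else:
            parts.append(re.escape(ch))
    if depth != 0:
        raise ValueError("unbalanced braces in template")
    return re.compile("".join(parts), re.IGNORECASE)


def matches_any_template(body: str, patterns: Iterable[re.Pattern]) -> Optional[int]:
    """Return the index of the first template found in the body, or None."""
    for index, pattern in enumerate(patterns):
        if pattern.search(body):
            return index
    return None


# Hypothetical template, made up for this example.
TEMPLATES = [
    spintax_to_regex(
        "{Hello|Hi|Dear {friend|customer}}, {buy|order} our "
        "{cheap|discount} watches today"
    ),
]

if __name__ == "__main__":
    print(matches_any_template("Hi, order our discount watches today", TEMPLATES))
    print(matches_any_template("Hi, see you at the meeting today", TEMPLATES))
```

Matching with `search` rather than a full match lets the template sit anywhere inside a longer message. In practice the hard part would be recovering the templates from observed campaigns, which this sketch does not attempt.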

As another person who was interested in the "Spamming the Spammers" idea, I agree wholeheartedly with this post. However, text generation could still be useful in a whole variety of other fields (robot journalism, stories for procedurally generated games, machine-generated literature, other generative arts, etc.). Thus, research into a machine that can generate convincing email responses would be useful... so long as it never actually gets used to send real e-mails.

Thank you @freach for your comment.

Some fraction of the spam emails I receive require a reply. Such emails could be answered automatically. As you described, responding to all other spam emails would harm users.

What fraction of spam emails require the user to reply?

Sure, @wojzaremba. A request to reply makes sense if the Spammer wants to validate real addresses, an opt-in mechanism, so to speak. I also saw Spam that didn't really advertise services or the like, but tried to suggest trustworthiness: Spam used for social engineering, usually highly targeted. I saw very good phishing Emails targeting services like PayPal that required a reply.

There is also a classic example of people using the services of legitimate businesses for very shady Email marketing. They go to companies like Mailchimp with a dirty list of Email addresses (often bought from shady people) and use the Email marketing service to validate the addresses by sending out a marketing campaign and checking which addresses accepted delivery, opened the message, or replied. Those people cause a lot of trouble for companies like Mailchimp, and the anti-abuse departments are usually very quick to identify and ban that kind of customer.

So there are a couple of examples where a reply would make sense, but detecting the kind of Emails where a reply would cause at least a little trouble for the Spammer is another, not so trivial, problem.

I see that there are many components that are not necessarily research related. I am going to remove this task.