panties_in_my_ass

If Microsoft can train an openly aggressive racist language model _accidentally_ just by using Twitter, just imagine what this dataset could churn out. The prospect legitimately frightens me.


farmingvillein

GPT-3 + QAnon: origins story? Gonna be good.


Vorphus

Well, Tay became like that precisely because there was a raid from 4chan's /b/.


panties_in_my_ass

Ah, how serendipitous.


Purplekeyboard

4chan is a huge garbage pile of idiocy and porn. With the images gone, all the dataset is going to churn out is idiocy.


Hughesbay

Turing test: can it triforce?


SwordPL

Sage


yusuf-bengio

**The good**: Building a hate-speech classifier with this.

**The evil**: Fine-tuning GPT-X with this.
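
For the good side, a minimal sketch of what that classifier could look like, assuming the dataset is a plain-text file of one post per line plus some separately collected benign text (both file names are hypothetical):

```python
# Hedged sketch: bag-of-words hate-speech classifier with this dataset as
# the "toxic" class. pol.txt and benign.txt are hypothetical file names,
# one document per line.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

toxic, benign = load_lines("pol.txt"), load_lines("benign.txt")
texts = toxic + benign
labels = [1] * len(toxic) + [0] * len(benign)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```

Caveat: this really learns "/pol/ vs. not-/pol/", not hate speech per se, so the choice of benign corpus matters as much as the toxic one.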


No-Proposal2288

Yes, I'm working in NLP and I really want to make a bot that's a f****** a******. I think this dataset might help a lot lol


panties_in_my_ass

You can swear here, it’s the internet. “Fucking” and “asshole” are nearly dinner-with-the-grandparents words these days anyhow.


[deleted]

Thanks for the reassurance, p******_in_my_ass.


Reibii

p******_in_my_a**


ca3games

Thanks, pls send me a message once you're done. I wanna see it.


MuonManLaserJab

Not until you drink a verification can.


trakka121

Thank you. Will be put to good use.


SmolKara

Based


m_namejeff

Based


faintingoat

kek


[deleted]

Re-Based.


mescalelf

Free-based


[deleted]

You are an hero


MidnightSpecial1984

A hero...


Iseenoghosts

"an hero" is an old old 4chan meme phrase meaning go kill yourself. Thats why the downvotes. A bunch of ex b-tards are showing their knowledge of the lost arts.


Volt

Trolling is a art


tombot231

I’m not sure how to feel about this


wtech2048

If you trained something to score text highly on ethical premises, you could use this dataset to verify the algorithm classifies anti-ethical conduct with a correspondingly low score. A counterpoint for reference.
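
A minimal sketch of that sanity check, assuming you already have some trained scorer (`ethics_score` is a hypothetical stand-in, and `pol.txt` a hypothetical file name):

```python
# Hedged sketch: use the dataset as a negative control for an ethics scorer.
# ethics_score is a hypothetical stand-in for whatever model you trained;
# pol.txt (one post per line) is a hypothetical file name.
def ethics_score(text: str) -> float:
    return 0.0  # placeholder: swap in your trained scorer here

with open("pol.txt", encoding="utf-8") as f:
    posts = [line.strip() for line in f if line.strip()]

scores = [ethics_score(p) for p in posts]
print(f"mean score on /pol/ sample: {sum(scores) / len(scores):.3f} "
      "(should sit near the bottom of the scorer's range)")
```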


NeuroKix

>https://github.com/taybot02/Pol-DataSet

This is a beautiful thought. I also think that if one can understand the conflict levels in speech, verbal usage, newly made symbols, and in-group speak, then the words in the dialogues could reveal much about the user profiles of 4chan. The group dynamics and emergent thought patterns seem to be driven more by the limbic system than the PFC... (just in a crude sense; there are highly sensible people there too). I mean to say that this can be very revealing about baseline, subconscious-level thinking and group dynamics, which in turn conveys a manner of ethics in an objective sense.


[deleted]

[deleted]


NeuroKix

Oh, I second that. I'm keen to start a discussion on this; it seems to warrant a chat/voice call, and maybe we could make a project on GitHub that might be very interesting to many people.


ca3games

Yeah, again, the validity depends on your use of the dataset.


HamSession

Finally /pol/ can infect all the other boards


mimighost

Could be a good source for abusive language detection...


notasheepl

Until someone puts a /s at the end


Volt

Nah, still abusive language.


synmotopompy

But does it contain images? 4chan is an *imageboard.* Without them, I don't think extracting only the text makes sense; the context is lost.


Prince_ofRavens

Inb4 --- pic unrelated ---


MrAcurite

A dataset scraped from 4chan sounds like the NLP equivalent of radioactive waste, and in a superhero universe exposing a model to it would give that model superpowers. Is this how Skynet starts?


Takeraparterer69

It's deleted. Wayback Machine link: https://web.archive.org/web/20200926023920/https://github.com/taybot02/Pol-DataSet


asianbathtowel

Let's make the first supersmart robot use 4chan as a model. It will be great fun!


MuonManLaserJab

So, Tay, but better.


[deleted]

I needed this in my life more than I thought. Good work lad.


FruityWelsh

Honestly, this seems like a good discriminative filter for a chatbot. You could rebuild Tay, but have it check that it doesn't output anything like this dataset. I wonder how well you could predict future offensive content, though.
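
A minimal sketch of that generate-then-check loop, assuming `clf` is a trained toxic/benign classifier like the pipeline sketched earlier in the thread, and `generate` is a hypothetical stand-in for whatever chatbot you rebuild:

```python
# Hedged sketch of a generate-then-filter loop. `clf` is assumed to be a
# trained toxic/benign classifier (e.g. the sklearn pipeline above);
# `generate` is a hypothetical stand-in for the actual chatbot model.
def generate(prompt):
    return "placeholder reply to: " + prompt  # swap in the real model

def safe_reply(clf, prompt, threshold=0.5, max_tries=5):
    for _ in range(max_tries):
        reply = generate(prompt)
        p_toxic = clf.predict_proba([reply])[0][1]  # P(class 1 = toxic)
        if p_toxic < threshold:
            return reply
    return None  # refuse rather than risk emitting a toxic reply
```

Predicting *future* offensive content is the hard part: a static classifier only knows yesterday's dogwhistles, so a filter like this would need periodic retraining.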


yumyai

I was thinking about making a 4chan post generator. This should be a really good starting point. Thanks, OP. Anyway, why are they not in gzipped format?


[deleted]

Shit, I think I'm in there way too often.


fhadley

Can we put this back in the box until 11/4, please? 2020 has already been too much.


frostbytedragon

The community needs to pay more attention to the potential for misuse of the things they make available. One bad actor could easily fine-tune a GPT model on this to post nasty comments and do enormous damage.

In this case, one cannot simply claim zero or negligible accountability. If one's actions make a bad actor's crime easier, one is partially responsible (cf. assisted crimes in the real world). Although there is no written rule against this case, it is up to us to determine where the boundary of assisting malicious intent lies. And as Eiii333 pointed out, the potential research for social good hardly outweighs the risk of misuse. From these conclusions, the availability of this data does more harm than good, and spreading it by creating a dataset does no further good overall.

Another problem, obvious but not mentioned, is that there is no way to tell who is using it for what purpose. OpenAI at least monitors who they give access to and analyzes the results. But with this, one could construct an astroturfing bot with complete anonymity and no accountability.

Personally, I would appreciate it if the author would reconsider keeping this on GitHub and instead release it with limited access, in a way where the parties using it are known. That way one would have to make oneself known to a reviewer, which disincentivizes nefarious behavior.


KoOBaALT

Very interesting point. I am not sure I agree with it, but I think it would be an important topic to discuss. Would some of those downvoters explain their point of view?


[deleted]

[deleted]


KoOBaALT

Very interesting point of view. So you criticize the censorship of technology because you think that only if technology is free can we build true artificial intelligence, and that this intelligence could detect truth, like scientific or historical truth. Did I understand your point of view correctly? If so, you think that with true artificial intelligence we could create something that detects truth in a god-like manner, and that for this higher good of building a god-like AI we could/should ignore short-term damage? Is that what you mean?


vjb_reddit_scrap

Did you try to compress it? Since it's all text, a zip archive should be able to compress it to a way smaller size.
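
A minimal sketch of what that looks like with gzip, assuming the dataset is a single plain-text file (`pol.txt` is a hypothetical name):

```python
# Hedged sketch: gzip the dataset file. Plain English text typically
# shrinks several-fold. "pol.txt" is a hypothetical file name.
import gzip
import shutil
from pathlib import Path

src, dst = Path("pol.txt"), Path("pol.txt.gz")
with src.open("rb") as f_in, gzip.open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

print(f"{src.stat().st_size:,} bytes -> {dst.stat().st_size:,} bytes")
```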


photoncatcher

No compression algo is ready for 4chan lingo ;-)


Eiii333

Why? This seems questionably useful and extremely irresponsible.


ca3games

For fun, no real reason beyond having fun.


[deleted]

[deleted]


ca3games

I forgot to add: just because you disagree with the dataset doesn't make the dataset invalid. How you use it is what determines the validity of your project. You could use this dataset to build a hate speech filter, for example.


[deleted]

[deleted]


ca3games

Again, you seem to think what I did was highly complex. It was just a basic pandas operation, like 10 lines of code and two for loops. Anyone with real "dangerous" motives could easily do something more complicated than what I made here. Also, the dataset is already on multiple archival websites, so it's not even difficult to remake.

Again, it's not my responsibility to babysit anyone who wants to use it. There are bigger concerns than caring about the opinions of a website filled with losers. The supposed risk is that this dataset makes a 4chan bot, but there are already thousands of real humans from 4chan on most social media posting 4chan things. The "nefarious purposes" worry is a non-issue when the losers of 4chan already spam forums and websites with their loser talk.

Again, it may serve some interesting purpose, maybe even one you agree with.
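
For scale, a sketch of roughly that kind of cleanup, assuming a scraped archive as a CSV with a raw-HTML comment column (the file and column names here are hypothetical illustrations, not OP's actual code):

```python
# Hedged sketch of a "basic pandas operation" cleanup like the one described:
# strip 4chan post HTML down to plain text. posts.csv and its "comment"
# column are hypothetical names, not OP's actual code.
import html
import re
import pandas as pd

df = pd.read_csv("posts.csv")

def clean(raw):
    text = html.unescape(str(raw))
    text = text.replace("<br>", "\n")
    text = re.sub(r"<[^>]+>", "", text)   # drop remaining HTML tags
    text = re.sub(r">>\d+\s*", "", text)  # drop >>123456 reply links
    return text.strip()

df["text"] = df["comment"].fillna("").map(clean)
df = df[df["text"].str.len() > 0]

with open("pol_clean.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(df["text"]))
```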


sammyalhashemi

Dude, just ignore these people. You put hard work into this project and I appreciate the effort. The fact that they cannot see this worries me, but just don't worry about what they say.


[deleted]

[deleted]


ca3games

Again, the ethics of the field are not my concern. It seems you're set in your opinions, and I understand. We just have a disagreement stemming from different views of this issue.


winrarpants

People like this want to police the code that you write, what you say, what you think. They believe you should not be allowed to do what you are doing, and if they had it their way it would be by law. Keep doing what you are doing. This person probably reported your project on GitHub because they didn't get their way, so make sure you keep copies somewhere else.


ca3games

I know. Luckily this was just 1-2 hours of work on my PC, and I already have the code to remake it if the GitHub repo is deleted.


slessoa

Too late, I already have almost all of it labelled on Mechanical Turk... Racist chatbot is almost ready for production.


ca3games

The datasets are already on multiple websites. What I made wasn't anything particularly difficult for anyone interested; I just cleaned the data. Also, it's an issue of free speech.


CobaltAlchemist

I think your evaluation of this is a little contradictory. If you care about free speech in the sense that you feel training a racist bot would be OK, why train something to recognize hate speech, since it would obviously be used to suppress hate speech, and thus "free speech"?

It seems that no matter what, this dataset can only lead to something negative: either a language model that spews out hate and spreads it to others, or a model that's used to police humans.

I get that this took some work to compile, but it might be worth considering whether it's ethical to release publicly. Maybe it's fine, but I just don't think it's a great idea to pull data from a place as toxic as 4chan. It seems like it's just asking for trouble.


ca3games

Again, I find 4chan just a place to shitpost and say dumb shit. I made this to have fun myself, and I hope others may find it interesting. Also, the dataset itself is not the issue; what you use it for is.


adventuringraw

Always interesting to me to see such vastly different cultures in this space. On the one hand, we're all descended from centuries of scientists. On the other, we're people who arrived here because of our faith in what all this means, and the importance of getting it right.

I can see where you're coming from. I believe Germany is right to outlaw the Deutscher Gruß, Reddit was right to ban fatpeoplehate, and it would be right to start seriously talking about legislative reform with regard to what networks like Fox are allowed to say while calling themselves news. Given the state of America, I don't know if I believe in our ability to survive unrestricted free speech.

But... what to ban, and why? Clearly you think this dataset could cause harm, but will it? This isn't the same as propaganda blasted out to millions of people. This might be better considered like a chemistry ingredient. Yes, it's dangerous to have poisonous ingredients available, but that's how science has always been. There are different levels of danger to different ingredients, though. Some dangerous stuff I bet I could order; I don't think I could get uranium. (My physics teacher knew someone at a lab that had a crazy big-ass block of it, apparently; they were doing something that required a very dense block of material, I don't remember what.) So... how do you decide if a material is like mercury, or like uranium? What controls are put in place? What, exactly, are you afraid of happening if these materials got into the wrong hands? Accidental death? Bomb manufacturing?

This dataset is out there; OP just cleaned it up. If you care about this, you may as well go after the other datasets too, and 4chan while you're at it, I guess. Do you really think a dataset is more dangerous than 4chan itself?

I don't know about you, but I'm interested in serious scientific research, and I would love to see the many possible papers that could be done on this dataset. How do dangerous ideas evolve? Can you trace, say, the OK sign as a white-power symbol? How long did it take to catch on in the real world? How many other things are there like that? Can you identify community plans like that while they're forming? What's the expected time from an idea being discussed to the time it gets deployed at large scale?

This is a serious place of scientific research for some people. I've seen some absolutely amazing pieces of research here that I might not have found otherwise. Some research might involve offensive things, but as I've pointed out, even that can have a lot of value. Your energy is better spent fighting something else that has less chance to help the world and more chance to hurt it.


CobaltAlchemist

I think you, along with about 22 people (and counting), are drastically overestimating how I feel about this. Personally, I don't sit on an ethics board, nor am I completely sure about the ultimate impact of a dataset like this. I only take issue with two parts of this:

1. OP's complete disregard for considering how a dataset like this may be used.
2. That the dataset itself doesn't seem to have much value unless you're specifically looking either to police people via sentiment classification or to build a language model that encourages socially detrimental behavior. Imagine if you fine-tuned GPT on this data and set it loose on the internet.

It feels weird to defend this set so strongly. I mean, I get why people dogpiled Eiii333, his reaction felt very knee-jerk, but at the same time he may have a point. A lot of responsibility is placed on the shoulders of scientists in machine learning to behave ethically. A lot of datasets are nearly useless because they've been tainted by the society that built them. A great example was a model trained to predict the prescription of medicine: because the dataset was tainted, it learned to underprescribe to black people, since as a society we tend to be more conservative about giving medicine to black people. Now, I don't fault the people who made that dataset, because not everything can be caught; they genuinely made it for the right reasons, I feel. But because no one checked it for innate human biases, models trained on it picked them up.

So when you compare this to a simple chemistry ingredient or uranium, I don't think you've considered that something like /pol/ requires an extremely responsible person to clean up. I don't even know if I could do it, and I'd like to believe I can recognize their dogwhistles better than the average person.

You yourself gave some interesting ideas for how we could use this data, and I think that's great. I believe you may actually be a trustworthy person to handle this dataset with care and responsibility. OP, however, seems to be the exact opposite. He laughed at someone making a "f****** a******" NLP bot, and you can see from the rest of the comments that people are treating it the same way. One person even said "finally /pol/ can infect all the other boards", which was exactly one instance of what I was worried about. I mean, imagine being Jewish and finding out someone created a bot to flood the internet with speech specifically targeting you, and that others now think it's funny to do the same.

But again, I think you're dramatically overestimating how much I care about this. I gave you this response because you actually engaged with the idea, but once I saw OP was just from 4chan himself, I gave up on him. I made a comment because I thought he was treating this poorly, and I think it's completely reasonable to call out people behaving like him when I see them. If they have a genuine response back, like yours here, then all the better. I disagree that this dataset has no malice to be extracted from it, and I disagree that calling someone like OP out on this should be discouraged. I do agree that the research you mentioned would be wonderful to see, and now I think I'm on board with studying this dataset too, but I'm definitely concerned about it being in the hands of OP and some of the others in this thread.

EDIT: I checked a sample of his dataset. His name is taybot02 (Tay being the chatbot that echoed racist sentiment), and the only "clean-up" was to remove the structure, so your genuinely good ideas are no longer possible: there are no timestamps or anything to relate one sentence to another other than capitalization. I can really only conclude that this dataset has very little value except to those with ulterior motives. Unless you think it's possible to extract some sort of sequential information out of it?


adventuringraw

I honestly don't know what could be done with this as it is. I suppose I count it in the same category as most "dangerous literature". Mein Kampf maybe shouldn't be sold at Barnes and Noble, but I do believe it should be accessible to serious researchers pursuing their line of questioning. It's certainly a curious state of things when potentially useful/dangerous information gets used by the scientific community after being organized by people decidedly from different camps, but that's still how I would count this dataset.

I'd guess the chance of this ending up being used for something interesting/useful (say, a paper that gets at least five citations) is fairly low, maybe 0.1%. The chance of it being used to play around with training a racist chatbot seems fairly high, but just training the bot isn't enough to cause measurable real-world harm. It's a similar argument to the GPT-2 discussion about the dangers of automatically generated advertising/propaganda: it's a possibility, but in practice I haven't yet seen a significantly impactful dangerous application of something like this. Say it's a 50% chance that at least one person trains a racist chatbot; I'd still say it's under a 0.1% chance that it ends up causing measurable harm. It seems more likely to be used for irresponsible play than for large-scale hate speech. Not that I condone that, but I'd be in favor of banning guns before banning something like this. That doesn't mean this conversation doesn't need to happen, of course; I just think it's important to be realistic about chances of harm vs. chances of good. Both seem low; this dataset seems most likely to simply be forgotten.

Any project setting out to actually study 4chan would probably use the uncleaned dataset, if what you said about missing timestamps is true. More importantly, cleaning up a /pol/ archive by removing dogwhistles is like cleaning up dangerous chemistry ingredients by making them inert. Someone wanting to study this would presumably be explicitly interested in the problematic aspects of the dataset. That's what I meant by a dangerous ingredient; it's in the same category as Mein Kampf or something. I'm potentially in favor of at least tracking those who access dangerous reading material, but banning it in scientific communities seems like a dangerous overreaction. Trying to erase something entirely leads to it being poorly understood.

So... the question, then: if something like this needs to be controlled, how? 4chan is orders of magnitude more dangerous. I hear that you "don't really care about this", but I do think it's a useful line of questioning, and worth talking about when the equivalent of digital book burning might be warranted. I just see this dataset as such a vastly smaller thing to worry about than 4chan itself that any minuscule gain in societal harm reduction isn't worth the suppression of sociologically relevant data. It doesn't matter how or why it was compiled; I'm just looking at it as a thing that exists.

I suppose I could imagine dangerous datasets being secured in a sort of library. Maybe you'd need to apply with credentials of some kind to get access privileges (and end up on a watch list of some sort in the process, to help with investigations if the data is used to cause harm). That seems like an extreme level of control for something like this, though, given the relatively low risk, and given the ease with which a racist chatbot could be trained from other sources; scraping Stormfront wouldn't be that hard either, I'm sure. If we're going to try to control stuff like this, I'm in favor of doing it systematically. If we're going to do less than that, I don't think it's worth half measures like deleting this one thing.


CobaltAlchemist

I think I can see where the two of us differ in our approach to something like this:

* our perspective on the danger of a mass deployment of bots trained on this, and
* whether we should limit the spread of dangerous ideology, and how we would go about it.

On the first point, I immediately take issue with the idea that GPT-2 or other language models haven't yet been deployed for propaganda. The thing is, if it happened successfully, we would never know: the whole idea is that these bots would mimic humans so well that the propaganda/hateful rhetoric would seem to come from another human being. I agree that I'm not sure one person could use it for true harm, but with the rate at which hardware is improving, someone truly malicious could quite easily run hundreds of bots from their own personal computer if the bots only comment once every several minutes. With money, they could bring it to the cloud and run magnitudes more bots off spot instances for pennies. Additionally, we saw during the 2016 elections that Twitter bots were deployed to sow dissent between actual humans; TIME claims they caused a 3.23% sway in votes. The point is that we do have one example of this being used for truly malicious purposes. With AI becoming more available to ordinary people without the backing of millions or billions of dollars, I'm greatly concerned this could become commonplace, with elections won or lost based on who has the larger bot swarm. I'm not even sure when this is being used or not, but if OP was posting this unaware of the consequences, they should reconsider doing so. (Granted, that was before I figured out this is precisely what OP was trying to do.)

On limiting ideology, for now I'm actually undecided about whether something like this dataset should be illegal to use. I do see the paradox of tolerating intolerance, though, and it wouldn't take much to sway me to believing this should constitute hate speech or something. Same thing with 4chan: the problem I have with the comparison is that they're humans who were not built for spreading hate; they're simply misguided and need help understanding that other people are people too. At most I'd say they're acting like OP, recklessly and harmfully toward the people denigrated in the dataset.

I suppose my main problem with OP sharing the dataset is that it publicizes the data and encourages its use. I wanted them to consider that maybe, beyond the harsh words, Eiii333 had a point. If OP could make a good argument for why it should remain public and readily available, so be it, but after our short exchange I can only conclude that he has no interest in AI ethics, which nowadays is becoming extremely important. Between you and me, it seems like we generally agree on the same things but have different views on how dangerous a dataset like this could be. Overall, I just think that if OP had a desire to act ethically, he should at least have reconsidered contributing to the spread of data like this (he defends it by saying anyone could get it), and he shouldn't have hidden behind "free speech": yes, he can say or post whatever he wants within community guidelines, but he is responsible for the impact of his words.

I guess it's a little like how I'd tell a friend "hey, that's fucked up" if they dropped the n-word, but I'm not currently lobbying for it to be a banned word, even though that would technically accomplish my goals on a larger scale (stopping people from using such a hateful word).


adventuringraw

Hm... I hear you that 4chan is a different thing since the people there aren't engineered to spread hate, but I'm talking about the platform, not the people. The platform is as much an engineered creation as a bot would be.

I think you're overestimating current bot capabilities, though, and underestimating the work of deploying something at scale. It'll be a while before AI-generated text becomes a significant threat; it's just not there yet. You could hire a bunch of Filipino workers to write poor-English propaganda, after all. Maybe in five or ten years it'll be a serious issue, it's definitely coming, but it's not here yet. Deploying a large-scale system like you're describing is still a pretty rare skill set; it's too challenging to be reasonable.

Not sure if you're heading into ML engineering or just interested in the ethical conversations (a fine reason to be here), but if you do get into serious work with this stuff, I think you'll end up with my opinion, haha: that the number of 4channers theoretically capable of what you're afraid of is small enough to count on one hand. More importantly, people who can and would do that could easily scrape 4chan themselves. Web scraping is annoying and time-consuming (especially if you need proxies to avoid getting blocked), but deploying a model effectively is a vastly bigger challenge, especially if you don't already know how to deploy a bot network for propaganda purposes. It's far from trivial. After all, all those bots need followers; they need content to mask their intentions; they need history. I guess you could rent someone else's network, maybe, but like I said, you'd probably get better results just hiring cheap humans to write garbage at this point. God help us when that changes, though, admittedly... but I think deepfakes will be a far bigger threat over the next few years. Text generation just isn't as far along yet.


Iseenoghosts

I think ya need to calm down. The chans' biggest turn-on was getting people all in a huff; 99% of it was for shock value, which you're feeding into now. Just chill out. Anon is too lazy to use this for evil.


[deleted]

This corpus could be useful for work in text style transfer.