
Backlog back to 2010

CLOSED

Lurking shadow was topic banned from discussions of copyright violation policy and functions. Furthermore, consensus was clearly opposed to messing with this venue. Last !vote was on November 24th so I'm assuming it's safe to close this as WP:SNOW. Kirbanzo (talk) 20:53, 4 December 2018 (UTC)[reply]

The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

I recently tagged an article as a copyright violation and, through the copyright issues, came across this page. This page has a massive backlog back to 2010 and is thus obviously not working. Removal of copyright violations, however, definitely has to work. A localized discussion here will simply not work so I will skip the steps at WP:RFC and make an RFC directly: How do we fix this backlog and make sure it doesn't come back again? Lurking shadow (talk) 21:37, 15 November 2018 (UTC)[reply]

Deprecate venue and tag {{historical}}. The scope is a fork of ANI. We don't have a specific venue for other specific issues, and urgent copyright issues are dealt with in other ways. This is unfortunately out of date. wumbolo ^^^ 22:07, 15 November 2018 (UTC)[reply]
This is a very important function, and one we see very frequently here on WP. I think it does need its own venue but needs to be active. Could the grunt work be automated somehow? --Tom (LT) (talk) 23:14, 16 November 2018 (UTC)[reply]
I have a means of reducing the number of diffs in new CCIs by about 10%. I'm running it now as I fill out the September 2018 cases. MER-C 11:56, 17 November 2018 (UTC)[reply]
Oppose. Closing the venue won't make the problem go away. What's really needed is proper detection and blocking of copyright violators (as in, admins should always block indefinitely for copyvios) and their sockpuppets and always using presumptive removal when socks are involved. MER-C 11:56, 17 November 2018 (UTC)[reply]
Support measures 1. to always indefinitely (not necessarily infinitely, e.g. unblock after discussion with blocking admin) block editors for violating copyright, 2. to remove all contributions of copyright violators with more than 4 copyright violations and more than 10 edits under presumptive removal, 3. to permanently and automatically siteban all people who were warned and/or blocked for copyright violations more than twice (excluding illegitimate warnings or blocks) and 4., should the first two measures be enacted, to mark this page historical. Lurking shadow (talk) 14:41, 17 November 2018 (UTC)[reply]
Even if you do all those things (which cannot be guaranteed), one needs pages like this to track progress regarding presumptive removals. MER-C 21:00, 17 November 2018 (UTC)[reply]
That's probably true.Lurking shadow (talk) 22:33, 17 November 2018 (UTC)[reply]
Oppose the idea of closing this is nuts. First off, the number of active admins who are actually somewhat knowledgeable with how Wikipedia handles text copyright can be counted on your fingers. Sending this to ANI is crazy. Also, there is a backlog, yes, but it's better that it exists than doesn't exist: at least there is something to organize and track this problem. TonyBallioni (talk) 20:47, 17 November 2018 (UTC)[reply]
- The question how to handle the backlog remains, then. Not doing anything is hardly a valid option.Lurking shadow (talk) 20:54, 17 November 2018 (UTC)[reply]
Oppose There is a backlog because people brought here have hundreds, if not thousands, of articles that have to be checked. The ones from years ago all show this pattern. And you want that at ANI? That board is already such a wonderful place to be we wouldn't want to add anything to throw that off balance.^[sarcasm] If you want an opinion on how to handle the backlog it could be to, I don't know, help? Check the pages that need to be checked. Clean them if need be and then put a {{revdel}} request on it for an admin to take action. As already stated, the number of people that actually deal with copyright problems here is incredibly small. Any additional help would be appreciated and of course help with the backlog. --Majora (talk) 03:42, 19 November 2018 (UTC)[reply]
- I have made this thread because an indiscriminate cleanup as advertised on the front page for situations like this (tens of thousands of unchecked violations back to 2010) is a blunt tool with ramifications. Lurking shadow (talk) 09:21, 19 November 2018 (UTC)[reply]
Oppose closing this page, but I have a lot of sympathy for the problem. (I came here from CENT.) If there are ways to automate some of this, that would be very good. (Would there be something to request at the Community Wishlist at meta?) --Tryptofish (talk) 22:15, 24 November 2018 (UTC)[reply]

Threaded discussion

There is long-standing consensus that poking people on noticeboards is not helpful in dealing with backlogs (e.g. see any "AIV backlog" thread on ANI), and without a tangible proposal beyond "what should we do", that's essentially what this RfC is. Further, talk page discussions have generally not resolved workload issues in the past and I don't imagine they are about to now in one of the most poorly understood administrative areas on Wikipedia. I have removed this discussion from {{Centralized discussion}} for the time being. TheDragonFire (talk) 05:41, 19 November 2018 (UTC)[reply]

If there is such a massive backlog the methods how to handle this have to change. I already did make some suggestions how to handle this backlog and make sure it doesn't come back again.Lurking shadow (talk) 09:21, 19 November 2018 (UTC)[reply]

Tentative proposal

I fully agree with the basic tenet here – the backlog is unacceptably large, something has to change. Obviously that something is not just saying "oh well, 80000 potential copyvios unchecked, let's just archive them". Equally, we can't go on relying on a minuscule number of dedicated volunteers to deal with this, it is a community problem and needs the help of the full community to solve it.

I suggest we should look at the possibility of setting up a partially-automated process to start chipping away at it, perhaps something along the lines of what was used to deal with Wikipedia:Contributor copyright investigations/Darius Dhlomo (over 23000 articles). I'm not at all bot-literate, but tentatively: a bot tags (say) 200 articles a day with a new prod-type notice which can be removed by any extended-confirmed user once the page has been checked; after (say) a week, the bot notifies relevant WikiProjects of any page that still has the notice; after a further week, the page can be deleted (or not) by any admin, just like an expired prod. As I understand it, the Darius Dhlomo notice actually blanked the article in the same way that {{copyvio}} does; that might be effective, but might also run into some strong objections.

Any thoughts? Justlettersandnumbers (talk) 11:28, 19 November 2018 (UTC)[reply]
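The timeline in the tentative proposal above can be sketched in code. This is a minimal Python sketch under stated assumptions, not an actual bot: the names `TaggedPage`, `pick_batch` and `stage` are hypothetical, and the figures (200 pages a day, one week, a further week) are the "(say)" values from the proposal. It only classifies pages by how long the notice has been in place; it does not edit, blank or delete anything.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

# Illustrative values taken from the "(say)" figures in the proposal.
DAILY_TAG_LIMIT = 200
NOTIFY_AFTER = timedelta(days=7)    # notify WikiProjects after a week
REVIEW_AFTER = timedelta(days=14)   # admin may act after a further week


@dataclass
class TaggedPage:
    """Hypothetical record of a page carrying the proposed notice."""
    title: str
    tagged_on: date
    checked: bool = False  # True once a human has checked it and removed the notice


def pick_batch(untagged: List[str], limit: int = DAILY_TAG_LIMIT) -> List[str]:
    """Return the next pages to tag today, capped at the daily limit."""
    return untagged[:limit]


def stage(page: TaggedPage, today: date) -> str:
    """Classify a tagged page by the age of its notice."""
    if page.checked:
        return "cleared"
    age = today - page.tagged_on
    if age >= REVIEW_AFTER:
        return "eligible_for_admin_review"  # like an expired prod; a human decides
    if age >= NOTIFY_AFTER:
        return "notify_wikiprojects"
    return "awaiting_check"
```

In this sketch the final step is only a label: whether a page is deleted or kept remains a decision for an admin, as in the proposal.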

It would probably be better, if those drastic measures are taken (and they probably have to be, because 80000 is a realistic number and we don't have the contributors necessary to remove them all), to revert to the version before their edit and to revision delete everything in between. If that would also be too much then unilateral prod + deletion is probably necessary.

but either we make this process permanent or the automatic removal and revision-deletion... depending on what works.Lurking shadow (talk) 14:29, 19 November 2018 (UTC)[reply]

This approach would help bring down the backlog and is useful. The bot would have to be programmed to either only remove copyvio content, or to PROD articles that are primarily copyvio. The backlog dates back to 2010 so it's possible a lot of articles have been improved or changed since then in significant ways. It will also have to update the lists here. It is definitely something that will bring down the huge backlog.--Tom (LT) (talk) 23:40, 19 November 2018 (UTC)[reply]

Above struck through. Oppose this proposal as above per some discussions below. Whilst good in theory, a bot that indicates what is a problematic edit would be very useful, but it would be impossible to with absolute certainty have it go about removing copyvio content. --Tom (LT) (talk) 07:03, 24 November 2018 (UTC)[reply]

Oppose closing, but I'm all for this: "How do we fix this backlog and make sure it doesn't come back again?" Willing hands, I think. Any contributor without a history of copyright issues can help. (Also open to tool usage, where and if that will work. :) As per below.) --Moonriddengirl ^(talk) 00:23, 20 November 2018 (UTC)[reply]

I don't think the tentative proposal above is going to be practical. We did blank all pages created by Darius Dhlomo as part of that CCI, but that was an exceptionally serious one and we didn't blank all pages affected by the CCI (if I remember correctly only 10,000 pages were blanked). Having spent a lot of time staring at his edits I can tell you that he wasn't able to compose anything other than the most bare-bones of prose without copying it from somewhere, and he was probably responsible for thousands of copyvios. This makes it rather more serious than the typical CCI. Nor was the blanking very effective: plenty of pages were unblanked when they still had copyvio in them by people who didn't bother to check or didn't check properly. Blanking 90,000 pages, almost all of which are clean, would cause a huge mess. Oppose closing though, that would just be a case of sticking your head in the sand and pretending the problem doesn't exist. Hut 8.5 22:04, 20 November 2018 (UTC)[reply]

Oppose the "tentative proposal". No blanking of pages unless there is evidence demonstrated that they contain copy vios, no bots, no assumptions.Smeat75 (talk) 01:25, 21 November 2018 (UTC)[reply]

Just for the record, presumptive blanking based on that assumption is already policy. "If contributors have been shown to have a history of extensive copyright violation, it may be assumed that all of their major contributions are likely to be copyright violations, and they may be removed indiscriminately." (WP:CV) However, except in a very few rare cases, I have not seen this invoked, and I don't think it should be. This process was created at least in part so that it would not have to be. There are many articles that are largely clean with small patches of problems and some that have none at all. --Moonriddengirl ^(talk) 21:44, 24 November 2018 (UTC)[reply]

Options for removal of backlog

I think the options here are clear:

Option 1:Delete all pages these copyright violators edited indiscriminately, unless they were checked off in CCI

Option 2:Prod and blank all pages these copyright violators edited(checked pages excluded) indiscriminately per bot(and then delete them after short notice)

Option 3:Manually revert to the last clean version and revision delete anything in between, or speedy delete if it is the first version or otherwise speedily deletable.

(Option 4:Try to manage the backlog by individually examining every page.)

(Option 5:Do Nothing)

Option 1 is the quickest and safest method to deal with it, but also with the largest damage to genuine articles.

Option 2 may take longer and these things should be removed as fast as possible... but a few high value pages could possibly stay.

Option 3 will take significantly longer but the damage inflicted is lower(but still quite high)

Option 4 won't work because that's what's responsible for the backlog in the first place - it is too much work with not enough helpers.

Option 5 does not conform with our copyright policy and WP:PILLARS and is not legitimate.Lurking shadow (talk) 07:00, 20 November 2018 (UTC)[reply]

I have tested this and the selective deletion of pages(especially of old pages with thousands of revisions) is far more workload than the deletion of pages(which I selected when the page was created by a violator). I extrapolate that the mass-deletion of pages is feasible while the selective deletion is not.

As such, I support Option 1, maybe 2. Oppose Option 3(and 4 and 5)

The only one of those which is acceptable is option 4.Smeat75 (talk) 23:32, 20 November 2018 (UTC)[reply]
Option 4 has been proven to be too inefficient. You need better arguments.Lurking shadow (talk) 07:08, 21 November 2018 (UTC)[reply]
Option 4 is the only acceptable option. Yes there is a huge backlog but we will be significantly damaging the encyclopedia by randomly deleting large chunks of text. If the articles are not individually examined, or there is some doubt that the editor has a not small amount of good-faith, non-plagiarised text in between, the damage done outweighs the potential risks for the Wikimedia movement getting bad press or being sued (in which case I would presume the first step would be for the complainer to notify the movement regarding the affected articles, which could then be prioritised and changed). --Tom (LT) (talk) 06:25, 24 November 2018 (UTC)[reply]
Would it be practical to have an automated process to identify and locate any pages or edits that would fall under Options 1 and 2? That is, automate the way of finding them, but not actually delete or blank anything automatically. --Tryptofish (talk) 22:19, 24 November 2018 (UTC)[reply]

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

"Open cases"

The majority of the backlog is 'open investigations', which (as I understand it) are intended to be investigations for copyvio, rather than cleaning up known copyvio. A bot could definitely assist with this process, by marking the % of copyvio'd edits (or marking which edits are copyvios). Starting with editors with low edit counts may help, as it is impractical and probably quite time consuming to deal with editors that have a long edit record. Thoughts? --Tom (LT) (talk) 23:44, 19 November 2018 (UTC)[reply]
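The two ideas in the comment above (start with editors with low edit counts, and report the % of edits that look like copyvio) can be illustrated with a small Python sketch. Everything here is an assumption for illustration: `order_cases` and `percent_flagged` are hypothetical names, and `looks_copied` stands in for whatever check a real tool would run on a revision. No particular detection tool or API is implied, and the output would only be a hint for a human reviewer.

```python
from typing import Callable, Dict, List


def order_cases(edit_counts: Dict[str, int]) -> List[str]:
    """Order open cases so that editors with the fewest edits come first."""
    return sorted(edit_counts, key=lambda name: (edit_counts[name], name))


def percent_flagged(revision_ids: List[int],
                    looks_copied: Callable[[int], bool]) -> float:
    """Percentage of revisions that the supplied check flags as likely copied.

    `looks_copied` is a placeholder supplied by the caller; it is not a real
    detector and its result still needs human confirmation.
    """
    if not revision_ids:
        return 0.0
    flagged = sum(1 for rev in revision_ids if looks_copied(rev))
    return 100.0 * flagged / len(revision_ids)
```

A reviewer could then work through `order_cases(...)` from the top and use the percentage only to decide where to look first.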

Actually, removing copyvio is part of the process (it's mentioned as the final step, quite an important one). Identifying is the first step, however. I'd be all for speeding this and cutting down this dreadful backlog, but I'm unsure how effective a bot would be, Tom (LT). Sometimes these individuals copy content from books or, for instance, do direct translations of copyrighted sources. Bots could help with some, but maybe not such contributors. What do you think? --Moonriddengirl ^(talk) 00:21, 20 November 2018 (UTC)[reply]

@Moonriddengirl that's a good point. I can see a use for this bot in some cases - particularly egregious copying from online sources in editors whose activities are primarily copyvio. The bot would be useful evaluating them and marking edits. This may help with a proportion of cases but from the looks of things given the extensive workload a few tools are probably needed.--Tom (LT) (talk) 00:30, 20 November 2018 (UTC)[reply]

Tom (LT), Moonriddengirl, just to clarify: I'm not for one moment suggesting that a bot should be trying to remove copyvios – I don't believe such a program is realistically possible, and I don't believe that the community would accept it even if it were. All I'm suggesting that we might try to automate is (a) the initial placement of the notice and (b) notifications to interested WikiProjects in some cases. The rest has to be done by real people – us. My thought is that by tagging the pages with a large notice we may be able to involve more people in the process, and so perhaps start to reduce the backlog instead of just watching it grow (my 80000 figure is well out of date, it's now just shy of 90000). We don't need to search for copyvio in these articles – we've already established that nothing written by these editors can be trusted; we just need to remove every word they wrote, whether by editing the article text, by reverting to the last version before their first edit, or by deleting the page entirely. That's a huge amount of work, but much less than checking them one by one for verifiable violations. If I'm talking nonsense, please just say so! Justlettersandnumbers (talk) 01:19, 20 November 2018 (UTC)[reply]

OK! You are talking nonsense.Smeat75 (talk) 23:19, 20 November 2018 (UTC)[reply]

I, Crow, Earwig et al have discussed the scopes of programming a bot to run over CCI cases. AFAIR, there was a consensus that it was quite non-feasible due to the vast amount of credits required for running such intensive trawls.

I'll link the discussion, once I find it. ∯WBG^converse 10:06, 20 November 2018 (UTC)[reply]

All that's needed right now is a bot that deletes or proddelete-tags all articles edited by all the copyright violators currently on the front page. Everything else is not feasible anyway, and that should be possible. Lurking shadow (talk) 12:25, 20 November 2018 (UTC)[reply]

Wow. Completely ridiculous, you guys over here want to run a bot to delete all articles ever edited by everyone listed on the front page?!?It would destroy this project. Not going to happen. You have to go through every article and identify copy vios and then remove them, you cannot delete whole articles because a suspected "copyright violator" made an edit.Smeat75 (talk) 23:17, 20 November 2018 (UTC)[reply]

No, absolutely not. The deletion of 80000 articles is not the end of the world. It is not nice, it is not really desirable, but necessary.Lurking shadow (talk) 07:10, 21 November 2018 (UTC)[reply]

@Smeat75 agree it would be completely ridiculous for a bot to blank or delete 80,000 pages or even content, or have any automated deletion at all. But there are useful automated things bots could do to help - eg scanning requests to provide a likelihood of plagiarism, or scanning through edit logs to indicate which ones have likely plagiarised from sources that the bot can access (which won't be all sources, but will help). A human will have to be involved at some point to make the final decision but if bots can help speed up or ease this process I'm all for it. --Tom (LT) (talk) 06:59, 24 November 2018 (UTC)[reply]

Yes what you are saying is perfectly reasonable User:Tom (LT) but that is wholly different from Lurking Shadow wanting to run a bot that deletes or proddelete-tags all articles edited by all the copyright violators currently in the frontpage and if that means deleting 80,000 articles it would not be "the end of the world".Smeat75 (talk) 13:58, 24 November 2018 (UTC)[reply]

FWIW here is the bot request I wrote on this topic ages ago. Didn't get any interest, but perhaps some bot operator now would be willing to pitch in. (Assuming that we could get the necessary number of queries for the Earwig tool.) Calliopejen1 (talk) 19:39, 21 November 2018 (UTC)[reply]

I'll have a go at the weekend. I'm pretty sure I can write some sort of script to run a few revisions highlighted in a CCI through the Earwig API mentioned in that bot request. The first thing to establish would be whether that output is actually useful before we try to scale it up. Hut 8.5 20:04, 21 November 2018 (UTC)[reply]

Thanks @Hut 8.5. Having bots help out in whatever way possible will make the best use of the few hands here. --Tom (LT) (talk) 06:59, 24 November 2018 (UTC)[reply]

Would an admin or editor experienced in copyvios please comment on this AN posting?

Here. Beyond My Ken (talk) 21:12, 20 November 2018 (UTC)[reply]

I've left a comment there. Hut 8.5 21:57, 20 November 2018 (UTC)[reply]

Helping out

Hi. I've tried to start helping out at Wikipedia:Contributor copyright investigations/ItsLassieTime, but I'd like to make sure I am doing this right before going deeper. Can someone take a look? Thanks, --DannyS712 (talk) 02:06, 4 April 2019 (UTC)[reply]

Maybe I’m not the one to say this, since I’m starting out here too, but you seem to be doing fine. 💵Money💵emoji💵^💸 23:47, 4 April 2019 (UTC)[reply]

Can I be promoted to a CCI clerk?

As some may know, I've been active in this area and have been cleaning up reports. I even finished two, Wikipedia:Contributor copyright investigations/BigButterfly and Wikipedia:Contributor copyright investigations/BeeCeePhoto. However, I can only mark them as completed and am unable to archive them, because only clerks can do so, and only one clerk (Lazygas) is currently active, and they seem to be busy with other stuff. Thus, I am asking the regulars here if they believe I meet the qualifications to become a clerk. I do not have any sort of past with copyright issues, and I believe I am knowledgeable enough in the field of copyvio to be qualified for the tools. 💵Money💵emoji💵^💸 20:24, 4 May 2019 (UTC)