
Problem

To editor Mr. Stradivarius: I'm trying to use this, but when I bring up the tags interface, all the fields are greyed out and I can't add a title or tags to new Signpost articles. Past articles already tagged are fine. Chris Troutman (talk) 23:08, 6 February 2022 (UTC)

@Chris troutman: Hm, there seems to be a problem with User:Mr. Stradivarius/gadgets/SignpostTagger. I'll take a look at it later on. — Mr. Stradivarius ^{♪ talk ♪} 00:58, 7 February 2022 (UTC)

@Chris troutman: It turns out I didn't build SignpostTagger to handle the situation where a year module such as Module:Signpost/index/2019 does not exist. This meant that it was failing for all articles from 2020 onwards, as those index modules never got created. I have now fixed the gadget, so you should be able to save tags for any article. If the year index module does not exist, the gadget will create it. Best — Mr. Stradivarius ^{♪ talk ♪} 14:52, 10 February 2022 (UTC)

@Mr. Stradivarius: Many thanks! Chris Troutman (talk) 22:20, 11 February 2022 (UTC)

Adding authorship

Hi @Mr. Stradivarius! Thanks so much for creating this. Would it be possible to add authors (bylines) to the metadata? This could be used to generate profile pages of Signpost writers and their articles. Cheers! 🐶 EpicPupper ^{(he/him | talk)} 05:08, 6 June 2022 (UTC)[reply]

Hi EpicPupper. It's certainly possible. Is this something that's been discussed with other Signpost editors? It would be a not-insignificant amount of work to update the module and the gadget, and I'd like to be sure that it's a feature that people actually want before putting in the work to build it. Best — Mr. Stradivarius ^{♪ talk ♪} 07:07, 6 June 2022 (UTC)[reply]

Hi @Mr. Stradivarius! This has been discussed with the other EiC and some other people, and we think that this is a good idea :) Another (separate) proposal would be to incorporate some type of tagging system per-publication or during publication, so that editors can add tags using a template, and perhaps the publication script would automatically add it to the module. Cheers 🐶 EpicPupper ^{(he/him | talk)} 01:30, 22 June 2022 (UTC)[reply]

Apologies for the extremely slowpoke.jpg followup on this, but I think author tags would be a very good idea, and implementing them here would save me a large amount of work versus implementing them separately in an independent module. As one example, they would allow us to link authors' bylines to lists of their articles, as basically all modern news outlets do. I am also willing to assist in modifying the script / module (or assist with harmonization of input data on the Signpost pages themselves, automation to update old indices etc) if additional work is required. jp×g 15:53, 4 November 2022 (UTC)

@EpicPupper and JPxG: I have updated Module:Signpost and WP:SPT to support adding authorship. The index modules can now have an "authors" table, which you can see some examples of at Module:Signpost/index/2019. To really make this useful, though, we need to add author data for all 4000 Signpost articles, or at least a significant subset. I plan on writing a script that can do this automatically when I next find some free time. Best — Mr. Stradivarius ^{♪ talk ♪} 06:54, 12 November 2022 (UTC)[reply]

@Mr. Stradivarius: Excellent! One loves to see it. I think we are aligned on automatically filling out article information, to the point that I was already beginning to write it when I saw this. My thinking is a simple Python script that runs on a user's computer and does the following:

- Retrieves Signpost article pages as wikitext from the server

- Parses out departments, headlines, subheadings, author information (and potentially other metadata)

- Retrieves the Signpost module indices

- Integrates the scraped metadata (either by upserting, or only inserting where fields are blank)

- ???

- Profit!
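The "integrate" step in the list above can be sketched as follows. This is a minimal illustration, not Wegweiser's actual code: the function name `merge_entry` and the field names in the usage example are assumptions made for the example.

```python
from typing import Any, Dict

# Values treated as "nothing there yet" (an assumption for this sketch).
BLANK_VALUES = (None, "", [], {})


def merge_entry(existing: Dict[str, Any], scraped: Dict[str, Any],
                upsert: bool = False) -> Dict[str, Any]:
    """Return a copy of `existing` with `scraped` metadata merged in.

    upsert=True:  scraped values overwrite whatever is already stored.
    upsert=False: scraped values only fill fields that are missing or blank.
    """
    merged = dict(existing)
    for field, value in scraped.items():
        if value in BLANK_VALUES:
            continue  # nothing useful was scraped for this field
        if upsert or merged.get(field) in BLANK_VALUES:
            merged[field] = value
    return merged
```

With `upsert=False`, hand-edited index entries are never overwritten, which is the more conservative choice for a first automated run.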

Parenthetically: when contemplating this, it got me to thinking that it might be a good idea to automatically run this after each issue is published (i.e. it seems like a lot of work, and rather error-prone, for people to manually insert index entries for each article, rather than simply adding tags manually). Let me know what you think of this, and if it's a good idea to go forward with running it for each issue. jp×g 22:48, 16 November 2022 (UTC)[reply]

@JPxG: Yes, this sounds like a good idea. I suspect that it will be easier to get the Signpost article metadata from the HTML rather than from the wikitext. We can insert IDs into the elements we need to parse, as I did here for the article authors, which then allowed me to write the author-parsing code without too much trouble.

I see article subheadings on Wikipedia:Wikipedia Signpost, but I don't see them on article pages or anywhere in the archives. Would it be acceptable to leave these out of the index modules? Adding these would also mean adding them to WP:SPT, and I would prefer to keep things simple if the subheadings are not used all that much. Also, I couldn't find any mention of departments in the Signpost articles I checked - do you have any examples of the department metadata that you mentioned?

Also, yes, this script, or a variant of it, should be run after each article is published (or we could probably just run it daily). It is not that much of a stretch from running a script on a user's computer to running a script every day automatically on Toolforge. Best — Mr. Stradivarius ^{♪ talk ♪} 00:12, 17 November 2022 (UTC)[reply]

Hello everyone, and also bug?

Hi. I first heard of this module a while ago, but I didn't have the time to go into great detail with it -- now I am trying to do a comprehensive review of Signpost technical infrastructure, so I am here. First of all, I think it whips ass. This is great! I have a few ideas for how I could use it to accomplish a few new features (and probably some ideas for new features the module could have).

Second of all, I notice that something strange seems to be happening in Module:Signpost/index/2022 (and possibly elsewhere): a bunch of article titles have "subscribe subscribe" at their beginning for no apparent reason. If I have time I will try to go figure out what is causing this (probably some templates not playing well together) but I am not very familiar with Lua so it is unlikely I can fix it very well myself if it ends up being something in the module. jp×g 15:45, 4 November 2022 (UTC)[reply]

Looks like they begin some time around August. Here is a diff with the weird text -- looks like it is coming in from SPT. jp×g 15:49, 4 November 2022 (UTC)

@JPxG: I have now fixed this. It was due to SPT trying to get the article title by converting everything inside the <h2>...</h2> tags to text, but this included the subscribe link added by mw:Extension:DiscussionTools. This link was presumably added around August. I chose to fix this by getting the title from a new "data-signpost-article-title" attribute added to Wikipedia:Wikipedia Signpost/Templates/Signpost-article-header-v2, and as a backup, getting it from the span inside the h2 tag with the class "mw-headline". I also went through and fixed all the instances where the "subscribe subscribe" links were added to the index modules. All signpost articles newly tagged since August had the "subscribe subscribe" text added, so it was not just limited to 2022 articles, although that's where the problem was most common. — Mr. Stradivarius ^{♪ talk ♪} 08:52, 11 November 2022 (UTC)[reply]
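The fix described above (prefer the "data-signpost-article-title" attribute, fall back to the "mw-headline" span) can be sketched in Python with the standard library's HTML parser. SPT itself is JavaScript, so this is only an analogue; the class and function names are hypothetical, and the markup of the subscribe link in the usage note is assumed.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleFinder(HTMLParser):
    """Collects the title attribute and the text of the first mw-headline span."""

    def __init__(self) -> None:
        super().__init__()
        self.attr_title = None      # value of data-signpost-article-title, if seen
        self.headline_parts = []    # text chunks found inside the headline span
        self._span_depth = 0        # > 0 while inside the headline span
        self._headline_done = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.attr_title is None:
            self.attr_title = attributes.get("data-signpost-article-title")
        if self._span_depth:
            if tag == "span":
                self._span_depth += 1  # nested span inside the headline
        elif (tag == "span" and not self._headline_done
              and "mw-headline" in (attributes.get("class") or "").split()):
            self._span_depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._span_depth:
            self._span_depth -= 1
            if self._span_depth == 0:
                self._headline_done = True

    def handle_data(self, data):
        if self._span_depth:
            self.headline_parts.append(data)


def extract_title(html: str) -> Optional[str]:
    """Prefer the data attribute; fall back to the mw-headline span text."""
    finder = _TitleFinder()
    finder.feed(html)
    finder.close()
    if finder.attr_title:
        return finder.attr_title.strip()
    headline = "".join(finder.headline_parts).strip()
    return headline or None
```

Because only text inside the headline span is collected, a sibling subscribe link inside the same h2 no longer leaks into the title.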

New fields

@Mr. Stradivarius: Today I succeeded in writing something that I have wanted for quite some time, viz. a way to look at Signpost viewership statistics that isn't bad and useless. Source code is at https://github.com/jp-x-g/wegweiser --what it does is very simple. It finds and records view counts for Signpost articles after publication (for a standardized interval afterwards, for purposes of comparison). Anyway, the reason this involves this module is as such:

Storing this data necessitates the creation of some large index of all Signpost articles, and rather than reinvent the wheel, I reckon it would be useful to do so in this module's indices, and I've found a way to make my script parse and update the Lua tables properly. I tested it briefly on Module:Signpost/index/2022 (diff here of what it looks like with the extra fields). I'm not very hot with Lua, so I don't know what this does on the backend utilities that use this module, but SPT works fine with these extra fields, as does Wikipedia talk:Wikipedia Signpost/Single/2022-01-30 (which uses Wikipedia:Wikipedia Signpost/Templates/Single talk, which uses Wikipedia:Wikipedia Signpost/Templates/Article list maker, which uses Module:Signpost).

Anyway, I have everything working, and I am ready to add the fields to all the indices (only back to 2015 since per-page view counts aren't available before then), but I wanted to hold off and make sure that this isn't going to break everything first. What do you say? jp×g 08:19, 5 January 2023 (UTC)[reply]

@JPxG: What will the data be used for? My first reaction is that unless you need to make the data available via a template, you could use the Pageviews API to get the data dynamically and not have to worry about storing it in the index modules. If we do need to store the data in the index modules, WP:SPT will need to be updated; with the current way that it is written, it will delete all the extra view fields when it changes any tags (see this diff for an example). Also, rather than using fields like views30, views60 etc., I would prefer that the page view statistics are put into their own subtable, like views = {[7] = 642, [30] = 1966, [60] = 2279, [90] = 2419}. The data would be more structured this way. Best — Mr. Stradivarius ^{♪ talk ♪} 06:51, 6 January 2023 (UTC)[reply]

I thought restructuring the views would be hard, but it wasn't really. Anyway, yeah -- the pageviews thing is a little strange. Basically, it is necessary because {{Graph:PageViews}} is bizarrely broken (it can only return graphs, and is completely incapable of returning straightforward numbers -- tried to figure this out for quite some time to no avail). That is to say, if we want to just look at "how many pageviews did the traffic report get versus the discussion report", we either have to manually enter each page title into the pageviews website, or wild-ass-guess the area under the curve on a graph...

At any rate, if it's possible, I would be glad to help rewrite whatever part of the JS poses issues for passthrough of extra parameters (since this might prove useful for other stuff as well). jp×g 03:16, 8 January 2023 (UTC)

@JPxG: If you just want the page view numbers, you can query the API directly, without going through {{Graph:PageViews}}. Is the page view data intended to just be used by Signpost editors, not by all readers? Because in that case, you can get the desired result by using a user script, without having to duplicate the page view data in the index modules. (And with a default gadget, you could even do the same for all readers if the community agrees.)

I will have a look at WP:SPT to see how difficult it would be to pass through arbitrary Lua data tables. — Mr. Stradivarius ^{♪ talk ♪} 12:19, 8 January 2023 (UTC)
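For reference, querying the page view API directly might look like the sketch below. It uses the Wikimedia REST per-article pageviews endpoint; the helper names, the User-Agent string, and the choice to start counting on the publication date are assumptions for illustration, not what Wegweiser does.

```python
import json
from datetime import date, timedelta
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(title: str, start: date, days: int,
                  project: str = "en.wikipedia") -> str:
    """Build the daily per-article URL for `days` days starting at `start`."""
    end = start + timedelta(days=days - 1)
    # Titles use underscores, and slashes must be percent-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return (f"{BASE}/{project}/all-access/user/{article}"
            f"/daily/{start:%Y%m%d}/{end:%Y%m%d}")


def total_views(response: dict) -> int:
    """Sum the daily `views` values of a decoded API response."""
    return sum(item["views"] for item in response.get("items", []))


def fetch_views(title: str, start: date, days: int) -> int:
    """Perform the request (needs network access) and return the total."""
    request = Request(
        pageviews_url(title, start, days),
        # Placeholder User-Agent; a real tool should identify itself properly.
        headers={"User-Agent": "signpost-pageviews-sketch/0.1 (example)"},
    )
    with urlopen(request) as reply:
        return total_views(json.load(reply))
```

A user script would make the equivalent request from the browser. This Python version only shows the shape of the query and of the response.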

Hm, it looks like my suggestion for the Lua table structure above wasn't so great, as WP:SPT uses JSON as a data interchange format between Lua and JavaScript, and JSON can't have numbers as object keys. For example, the Lua function mw.text.jsonEncode() encodes number keys in tables as strings:

mw.text.jsonEncode({[7] = 642, [30] = 1966, [60] = 2279, [90] = 2419})
-- '{"7":642,"30":1966,"60":2279,"90":2419}'

Assuming that we don't go down the route of making a JavaScript gadget for this, probably something like the following would be an easier structure for WP:SPT to cope with:

views = {d7 = 642, d30 = 1966, d60 = 2279, d90 = 2419},

— Mr. Stradivarius ^{♪ talk ♪} 12:39, 8 January 2023 (UTC)
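The same key-stringification happens in any JSON encoder, so the issue can be reproduced outside Lua. A short Python sketch, in which the helper names `to_prefixed` and `from_prefixed` are made up for the example, shows the round trip and the conversion to the "d7"-style keys proposed above:

```python
import json
from typing import Dict

views = {7: 642, 30: 1966, 60: 2279, 90: 2419}

# JSON object keys are always strings, so integer keys do not survive a round trip.
encoded = json.dumps(views)
decoded = json.loads(encoded)  # keys come back as "7", "30", ...


def to_prefixed(views_by_day: Dict[int, int]) -> Dict[str, int]:
    """Convert {7: 642} to {"d7": 642} so the keys are valid identifiers."""
    return {f"d{days}": count for days, count in views_by_day.items()}


def from_prefixed(prefixed: Dict[str, int]) -> Dict[int, int]:
    """Inverse of to_prefixed: {"d7": 642} back to {7: 642}."""
    return {int(key[1:]): count for key, count in prefixed.items()}
```

String keys such as "d7" survive the Lua to JSON to JavaScript round trip unchanged, so no tool has to guess whether "7" was originally a number or a string.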

I thought I had sent a reply to this earlier, but I guess maybe I forgot to press "publish". Anyway, yeah -- I have incorporated this style into the pageview filler. I've also got some kind of half-assed prototype for how to incorporate this data into the Signpost module, which (along with some of its test cases) is at Module:Sandbox/JPxG. jp×g 00:40, 9 January 2023 (UTC)

I've updated WP:SPT so that it doesn't delete extra fields added by other tools. At the moment, it outputs a different Lua table format than WegweiserBot (see this diff for an example). This makes for unclean diffs, so it would be a good idea to settle on one format. I'll have a look at WegweiserBot's code to see how easy it would be to change. — Mr. Stradivarius ^{♪ talk ♪} 13:17, 9 January 2023 (UTC)

@JPxG: I'm still not sure why page view data is necessary in these modules to begin with. By storing it here we are duplicating the data that's already available in the page view API's database. What are you planning to do with the page view data in the index modules that can't be done using the page view API directly? — Mr. Stradivarius ^{♪ talk ♪} 13:26, 9 January 2023 (UTC)[reply]

@Mr. Stradivarius: Here are a few mockups I made to show the data in use:

There's no way to query pageviews in wikitext (because they have to be dynamically loaded from a different API); it has to be done by an external application. The only thing we have that can do this is the Vega graphing extension; it might theoretically be possible to use a grotesque hack like rendering a Vega embed for pageview data in some way that displayed a wikitext error message that embedded the pageview count in it... but even if that worked, it would require viewers' browsers to perform thousands of queries to the pageview API every time they tried to load a table like the ones above (for data that isn't really going to change ever, like "pageviews between January and February 2017"). jp×g 22:47, 11 January 2023 (UTC)[reply]

@JPxG: I made a pull request to Wegweiser to use the same Lua table format as SignpostTagger. Does that look like an acceptable way of making the diffs cleaner in the index modules?

As for the question of why page view data is necessary, I understand that you want to display the page view data in wikitext, and that using the page view API for this directly would be impractical. The thing I'm not understanding is why you want to add page view data to wikitext in the first place. Are you planning to add a page view counter on article pages? Or are you planning to use the data in some other way? Best — Mr. Stradivarius ^{♪ talk ♪} 05:02, 22 January 2023 (UTC)[reply]

Fleshing

The current version of Wegweiser, while not perfect, now has the ability to pull article lists from the PrefixIndex API and generate skeleton entries (no tags, but date and subpage) in the indices. I filled them out from 2005 to present, which added some several hundred articles previously unindexed (i.e. 2017 only had a couple articles in the index for some reason). jp×g 01:33, 7 January 2023 (UTC)[reply]
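Generating skeleton entries from a list of page titles could look roughly like this. It is a sketch that assumes the titles have already been fetched (however that was done) and that article pages follow the "Wikipedia:Wikipedia Signpost/YYYY-MM-DD/Subpage" pattern seen elsewhere on this page; the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Article pages look like "Wikipedia:Wikipedia Signpost/2005-12-12/Welcome RSS readers".
ARTICLE_TITLE = re.compile(r"^Wikipedia:Wikipedia Signpost/(\d{4}-\d{2}-\d{2})/(.+)$")


def skeleton_entry(page_title: str) -> Optional[Dict[str, str]]:
    """Return a bare index entry (date and subpage only), or None for non-articles."""
    match = ARTICLE_TITLE.match(page_title)
    if match is None:
        return None
    return {"date": match.group(1), "subpage": match.group(2)}


def skeletons_by_year(page_titles: Iterable[str]) -> Dict[str, List[Dict[str, str]]]:
    """Group skeleton entries by year, one list per year index module."""
    by_year = defaultdict(list)
    for page_title in page_titles:
        entry = skeleton_entry(page_title)
        if entry is not None:
            by_year[entry["date"][:4]].append(entry)
    for entries in by_year.values():
        entries.sort(key=lambda entry: (entry["date"], entry["subpage"]))
    return dict(by_year)
```

Pages under the same prefix that are not articles (templates, technical pages and so on) fail the date check and are skipped.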

I've also started working through the extremely long list of obsolete Signpost categories from 2005 to 2015; most of these (except for the by-month categories) are completely redundant to tags in the module. That said, the taggage in the module indices is not always complete (for example, not every article in Category:Wikipedia Signpost Arbitration report archives 2005 was tagged "arbitrationreport" in Module:Signpost/index/2005). So I wrote a script to do this as well; it is capable of going through these categories and fixing the module index tags as appropriate. (If the article is already tagged as being whatever, of course, it is left alone)
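The category-to-tag step can be sketched as below. The naming rule (strip the "Wikipedia Signpost" prefix and the "archives YYYY" suffix, then lowercase and drop spaces) is inferred from the single "arbitrationreport" example above and may not hold for every category; both function names are hypothetical.

```python
import re
from typing import Any, Dict, Optional

CATEGORY_NAME = re.compile(r"(?:Category:)?Wikipedia Signpost (.+) archives \d{4}")


def tag_from_category(category: str) -> Optional[str]:
    """Derive a tag such as "arbitrationreport" from a category name (assumed rule)."""
    match = CATEGORY_NAME.fullmatch(category)
    if match is None:
        return None
    return re.sub(r"[^a-z0-9]", "", match.group(1).lower())


def ensure_tag(entry: Dict[str, Any], tag: str) -> bool:
    """Add `tag` to the entry unless it is already there; return True if changed."""
    tags = entry.setdefault("tags", [])
    if tag in tags:
        return False  # already tagged, leave the entry alone
    tags.append(tag)
    return True
```

Returning whether anything changed makes it easy to skip saving index pages that need no edit.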

Anyway, even with the heavy automation on scraping the categories and modifying the index tables, this cleanout would still take me days of copy-pasting, so I wrote another script that can upload the modifications automatically. I've submitted a BRFA at Wikipedia:Bots/Requests for approval/WegweiserBot, because it seems to me that this is required (it will also be useful for automatically updating the indices upon publication when I finish those scripts). jp×g 03:20, 8 January 2023 (UTC)[reply]

Titles and authors

Boom goes the dynamite. It works for everything after 2017-02, which was when Wikipedia:Wikipedia Signpost/Templates/Signpost-article-header-v2 came into use. Before that, things get a little hazy, but here is my chronology of headline and byline styles:

2005-01-10 to 2005-05-16

==AMA begins plans for new election==

:<small>By [[User:Michael Snow|Michael Snow]], [[10 January]] [[2005]]</small>

2005-05-23 to 2009-02-08

<h2 style="margin-right:60px;">Reporter who plagiarized Wikipedia gets dismissed</h2>

<small>:By [[User:Michael Snow|Michael Snow]], [[16 January]] [[2006]]</small>

2009-02-16 to 2017-02-06

{{Wikipedia:Signpost/Template:Signpost-article-start|Let's get serious about plagiarism|By [[User:Awadewit|Awadewit]], [[User:Elcobbola|Elcobbola]], [[User:Jbmurray|Jbmurray]], [[User:Kablammo|Kablammo]], [[User:Moonriddengirl|Moonriddengirl]] and [[User:Tony1|Tony1]] |13 April 2009}}

2017-02-27 to present

{{Wikipedia:Wikipedia Signpost/Templates/Signpost-article-header-v2|{{{1|Wikipedia has cancer}}}|By [[User:Guy Macon|Guy Macon]]| 9 February 2017}}

Note that the date in this template is different from the publication date -- this was the Feb. 27 issue, but the title says 9 February, because that's when it was written!

The existing SPT script works on span tags, not wikitext, so I made the Wegweiser scripts work the same way, albeit with some more sophistication (it takes the rendered text of the author span, splits it apart in various ways to get individual authors, and then puts that in an array -- since a lot of authors don't have userpage links in their bylines).

For titles, it's a lot more straightforward, because there aren't multiples of them -- I just take the text of the title span and store that. So far, I have gone through everything from 2017 to now, and will now attempt to make something work for previous years. jp×g 00:37, 9 January 2023 (UTC)[reply]
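Splitting the rendered byline text into individual authors could be done along these lines. This is a simplified sketch rather than the Wegweiser implementation, which handles more cases; `split_byline` is a name chosen for the example.

```python
import re
from typing import List

# Separators between authors: commas, semicolons, ampersands, or the word "and".
SEPARATORS = re.compile(r"\s*(?:,|;|&|\band\b)\s*")


def split_byline(text: str) -> List[str]:
    """Turn "By A, B and C" into ["A", "B", "C"]."""
    # Drop a leading "By " if present (case-insensitive).
    text = re.sub(r"^\s*by\s+", "", text.strip(), flags=re.IGNORECASE)
    return [part.strip() for part in SEPARATORS.split(text) if part.strip()]
```

One known limitation of this approach: a name that itself contains the word "and" or a comma would be split incorrectly, so real bylines need a list of exceptions or a manual review pass.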

Code written to handle titles and author fields from 2009 to 2017, running scripts now. jp×g 01:26, 9 January 2023 (UTC)

The whole batch from January to May 2005 was small enough that it didn't warrant writing extra code, so I was going to do an AWB run to normalize it to the 2005-2009 style, but I figured as long as I was firing up JWB I might as well bring it all the way into the future, so those hundred-and-change now use the modern style (with Signpost-article-header-v2). jp×g 01:59, 9 January 2023 (UTC)[reply]

I am currently working through a massive JWB run to amend all back issues with the modern header template that formats titles and author metadata in a sensible fashion, among other things (and while I'm at it, I am updating User:JPxG/The Illuminated Signpost). Anyway, everything through mid-2006 is done now.

Finished: everything from 2005-01-10 through 2009-02-16 now uses the modern headers: Wikipedia:Wikipedia Signpost/Templates/Signpost-article-header-v2 formatted properly with the title and author params (as well as the footer which closes divs and embeds comments). This means that WegweiserBot is now able to fill everything in :^) jp×g 05:08, 17 January 2023 (UTC)[reply]

Validation

To find weird edge-cases and messed up pages, I have written a new script, validator.py, which outputs to the following page:

Wikipedia:Wikipedia Signpost/Technical/Index validation

This displays all of the entries in year indices that are missing fields -- a summary table is at the top, and then it lists all the individual articles with errors. You can ignore the gigantic numbers for 2006, 2007, 2008 and 2009 (I am still reformatting them to have parseable titles and authors): from 2010 to present the missing fields are actual errors. Most of them are just articles without tags, for which some help would be appreciated. I have noticed some strange behavior, though. @Mr. Stradivarius: Is there a reason that the SignpostTagger isn't showing a "manage tags" box for old articles, like Wikipedia:Wikipedia Signpost/2005-12-12/Welcome RSS readers? I looked through the .js and I didn't see anything that looked like it was excluding articles by year. jp×g 22:36, 11 January 2023 (UTC)[reply]
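The core check is simple. A minimal sketch follows, in which the list of required fields and the function names are assumptions for the example and not necessarily what validator.py uses:

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

# Fields every index entry is expected to have (assumed for this sketch).
REQUIRED_FIELDS = ("date", "subpage", "title", "authors", "tags")


def missing_fields(entry: Dict[str, Any]) -> List[str]:
    """List the required fields that are absent or empty in one entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


def validate_index(entries: List[Dict[str, Any]]) -> Tuple[List[Tuple[str, List[str]]], Counter]:
    """Return (per-article problems, per-field totals for the summary table)."""
    problems = []
    summary = Counter()
    for entry in entries:
        missing = missing_fields(entry)
        if missing:
            label = f"{entry.get('date', '?')}/{entry.get('subpage', '?')}"
            problems.append((label, missing))
            summary.update(missing)
    return problems, summary
```

The per-field totals feed a summary table, and the per-article list gives the detail rows underneath it.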

Never mind, it seems to work now, at least intermittently; I think it probably has something to do with the header templates (?) jp×g 22:50, 11 January 2023 (UTC)

Could be a bug in parsing usernames or something else that runs on page load - if there is an error there, then the script wouldn't progress far enough to display the "Manage tags" link. There was a recent issue reported here where this happened. If you see this again, check if there are any exceptions reported in the console. — Mr. Stradivarius ^{♪ talk ♪} 14:46, 18 January 2023 (UTC)

Author and pageview in returnable form

@Mr. Stradivarius: I have a working modification for the module, currently located at Module:Sandbox/JPxG, which is capable of using the author/pageview metadata in list/table generation. I don't want to just slap it into the main module without any notice, so I am letting you know here. This version (which you can see the test cases for at the doc page) allows for returning the author, as well as viewsSeven, viewsFifteen [...] up to viewsOneEighty. Below I'll embed a use case, which is a table of the view counts for yesterday's issue:

{{User talk:JPxG/sandbox99 | rowformat = {{Wikipedia:Wikipedia Signpost/Templates/Article list maker/Pageviews|date=${date}|subpage=${subpage}|title=${title}|viewsseven=${viewsSeven}}} | sortdir = descending | startdate = 2023-01-15 | enddate = 2023-12-01 }}

Date	Subpage	Title	7-day

Anyway, I don't see this conflicting with any other use of the module, and it hasn't broken in the period of me testing it, so I will add these changes to the main module, unless you have an objection or want to do it better (I don't think this is the best-written code, as I am not a "Lua guy"). jp×g 05:05, 17 January 2023 (UTC)[reply]

@JPxG: Thank you for implementing this. Personally, I would use digits to specify the page view durations: views7 and getViews7 instead of viewsSeven and getViewsSeven. I find it easier to tell at a glance what the number is when using digits. Also, once it gets over 100, things start to get less obvious. Should it be getViewsOneTwenty, getViewsOneHundredTwenty or getViewsOneHundredAndTwenty? Users of the module will probably have to look at the documentation to get it right. Otherwise, the code looks good to me. — Mr. Stradivarius ^{♪ talk ♪} 14:32, 18 January 2023 (UTC)[reply]

@Mr. Stradivarius: Yeah, that would be the sane thing to do. That was what I initially tried, and it triggered Lua errors every time. You can try it and see: if you go into the sandbox code and do nothing but change viewsSeven to views7, then edit the table maker to call views7... it will flip out. I don't know if this is an insurmountable issue with Lua, or a skill issue on my part (likely the latter). If there's any way to make it just be "views7", that would be infinitely preferable lol. jp×g 23:29, 18 January 2023 (UTC)

@JPxG: I edited the sandbox to make it work. The trick was to change the pattern that detected variables like ${title} to accept both letters and numbers - previously it only accepted letters. Best — Mr. Stradivarius ^{♪ talk ♪} 15:06, 19 January 2023 (UTC)
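The effect of that pattern change can be illustrated with a Python analogue. The module itself is Lua and its actual pattern is not quoted in this thread, so the regular expressions and the `render` helper below are stand-ins for illustration only.

```python
import re
from typing import Any, Dict, Pattern

# Letters only: "${views7}" is not recognised as a variable.
LETTERS_ONLY = re.compile(r"\$\{([A-Za-z]+)\}")
# Letters and digits: "${views7}" is recognised.
LETTERS_AND_DIGITS = re.compile(r"\$\{([A-Za-z0-9]+)\}")


def render(template: str, values: Dict[str, Any], pattern: Pattern[str]) -> str:
    """Replace each ${name} matched by `pattern`; unknown names are left as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return pattern.sub(substitute, template)
```

With the letters-only pattern, "${views7}" is silently left in the output, which in a row format would then reach the template as literal text.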

whitespace

@Mr. Stradivarius: I accepted your pull request, but running the program, the whitespace seems to still be quite different -- is this a bug, or am I running it incorrectly? jp×g 04:31, 25 January 2023 (UTC)

@JPxG: This looks like the output from the luadata package, not the lua_serializer module that I wrote. I tested that the output of the lua_serializer module looks like this. Probably the most likely explanation is that you ran the bot from an old version of the code. Or perhaps the code you are using to update the Lua index modules does not use lua_wrangler.luaify? There may be another entry point that I have missed. — Mr. Stradivarius ^{♪ talk ♪} 05:36, 26 January 2023 (UTC)[reply]

Yeah, it is kind of strange (I assume you tested it and it worked). I will have to take another look. jp×g 05:37, 26 January 2023 (UTC)

@JPxG: Did you get a chance to look into this? I can also take a look if you're willing to let me have access to the bot's execution environment. (Are you running the bot from a shared environment like Toolforge, or just running it locally?) Best — Mr. Stradivarius ^{♪ talk ♪} 06:28, 30 January 2023 (UTC)[reply]

@JPxG: Actually, I think I found the cause: I missed an entry point in my pull request. It looks like Wegweiser is still using the luadata package for serialisation in mass_tagger.py. I'll submit another pull request later on to fix this; the execution environment access won't be necessary. — Mr. Stradivarius ^{♪ talk ♪} 06:35, 30 January 2023 (UTC)[reply]

Ok, pull request submitted. — Mr. Stradivarius ^{♪ talk ♪} 07:21, 30 January 2023 (UTC)

@JPxG: Looks like I still missed a few, so I submitted another pull request. These should be the last ones. — Mr. Stradivarius ^{♪ talk ♪} 00:51, 31 January 2023 (UTC)

@JPxG: I undid WegweiserBot's edits to the index modules with my new serialisation code, due to this issue. The fix was also pretty simple, so I've made a new pull request. — Mr. Stradivarius ^{♪ talk ♪} 13:49, 31 January 2023 (UTC)[reply]

I had a look into WP:SPT's handling of \uxxxx escapes as well. It turns out that Lua doesn't actually have that kind of escape, as it treats strings as a series of bytes, rather than a series of Unicode characters. So it's not really a bug in WP:SPT, but rather a fundamental issue of how Lua was trying to parse those strings. In other words, my serialisation code in Wegweiser was outputting incorrect Lua strings, so that's the thing we need to fix. — Mr. Stradivarius ^{♪ talk ♪} 14:14, 31 January 2023 (UTC)[reply]
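A serialiser that avoids \uxxxx escapes might look like the sketch below. It is not the lua_serializer code from the pull request; `lua_quote` is a hypothetical name. By default it writes non-ASCII characters as raw UTF-8, and optionally as the three-digit decimal byte escapes that Lua 5.1 string literals accept.

```python
def lua_quote(text: str, ascii_only: bool = False) -> str:
    """Serialise a Python string as a double-quoted Lua string literal.

    Non-ASCII characters are emitted as raw UTF-8 by default, or as decimal
    byte escapes (\\ddd) when ascii_only is True. No \\uXXXX escapes are used.
    """
    out = ['"']
    for char in text:
        code = ord(char)
        if char == "\\":
            out.append("\\\\")
        elif char == '"':
            out.append('\\"')
        elif char == "\n":
            out.append("\\n")
        elif char == "\r":
            out.append("\\r")
        elif code < 32 or code == 127:
            out.append(f"\\{code:03d}")  # other control characters
        elif code > 127 and ascii_only:
            # One decimal escape per UTF-8 byte; three digits keep it unambiguous.
            out.extend(f"\\{byte:03d}" for byte in char.encode("utf-8"))
        else:
            out.append(char)
    out.append('"')
    return "".join(out)
```

Either output denotes the same bytes to Lua, since its strings are byte sequences. The raw UTF-8 form is easier to read in diffs.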

I merged your pull request from this morning, and am running it on a few index pages. It seems to work as intended :) jp×g 23:06, 31 January 2023 (UTC)

Nice. 😎 Thanks for the merge! — Mr. Stradivarius ^{♪ talk ♪} 01:30, 1 February 2023 (UTC)

Multiple authors

@JPxG: Regarding this edit - how about adding |authortemplate=, |authorformat= and |authorseparator= parameters to specify how to format each author in the output? That way you can do things like link each user's userpage, etc., instead of just listing the usernames separated by commas. Also, maybe we could do the same for tags with |tagtemplate=, |tagformat= and |tagseparator= parameters. — Mr. Stradivarius ^{♪ talk ♪} 04:03, 29 January 2023 (UTC)[reply]

Authors: names or usernames?

@JPxG: About Special:Diff/1139235436: it looks like WegweiserBot is using display names for author names, but WP:SPT is using usernames. This is going to cause the two scripts to keep overriding each other, so we should choose one approach. I would prefer to include only username, as that makes it easier to do things like make user links. I can see that there would be a case for listing display names instead, though. Or if we really want, we could include both, in tables like {user = "Bluerasberry", display = "Lane Rasberry"}. What do you think would be best? — Mr. Stradivarius ^{♪ talk ♪} 04:28, 14 February 2023 (UTC)[reply]
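If both forms were stored, tools could accept either a bare username or a user/display table. The sketch below shows one way to handle that in Python; the function names are hypothetical, and treating a bare string as both username and display name is an assumption.

```python
from typing import Dict, Optional, Union

Author = Union[str, Dict[str, str]]


def normalize_author(author: Author) -> Dict[str, Optional[str]]:
    """Accept "Username" or {"user": ..., "display": ...}; return both keys."""
    if isinstance(author, str):
        return {"user": author, "display": author}
    user = author.get("user")
    display = author.get("display") or user
    return {"user": user, "display": display}


def byline_link(author: Author) -> str:
    """Render a wikitext user link, or plain text when there is no username."""
    record = normalize_author(author)
    user, display = record["user"], record["display"]
    if user:
        return f"[[User:{user}|{display}]]"
    return display or ""
```

An entry with only a display name renders as plain text, which would cover guest authors without a local account.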

@Mr. Stradivarius: Sorry for not having responded to this for a million years (I thought I had at the time). I pondered this for a while; what I eventually came up with was for each writer to be credited under one canonical name, which in a great many cases was the person's real name -- or failing that, whatever name they want to be credited under. Just going by username would simplify the process of linking to their userpage, but it's sometimes the case that we run articles from people with different home wikis, or Heaven forfend, people who aren't Wikimedians at all -- so I think whatever scheme we use should account for this. The idea of having two fields for the author does put another idea in my head, which is to have some kind of structured data for authors.

Most newspapers and magazines have bylines for authors, which appear under their articles, and on the CMS's author page (e.g. https://slate.com/author/mary-c-curtis shows Mary Curtis's articles, but also "Mary C. Curtis is a columnist at Roll Call and host of its Equal Time podcast" etc etc). It occurs to me that this would be nice to do for Signpost articles as well, which would allow for the freedom to link to whatever someone wanted (like say they wanted the link to go to their blog or personal website), as well as to give more detailed bylines than just a name.

I don't know how complicated it would be to implement something like that in the module -- my guess is "very" -- so it is more of a pipe dream I'm plopping out here than a thing I think is likely to actually happen, but it is worth considering. Maybe. jp×g 08:19, 6 August 2023 (UTC)

On that matter, I absolutely do not appreciate you/the bot having mass-changed all these bylines in old stories some months ago without consulting with the rest of the team first. There are various arguments against it, for example that it is ahistorical to retroactively change the byline of something that was published over a decade ago - it does make a difference whether a story was written under a pseudonym or a real name, for instance. But however one weighs the pros and cons here (also about which version of a contributor's name to standardize on), the main point is that it is not a technical decision but a content decision, one that can also greatly affect SERPs for people's names btw.

Regards, HaeB (talk) 20:59, 6 August 2023 (UTC)

Perhaps this was dumb. I did post about this to Wikipedia talk:Wikipedia Signpost/Technical back in January; perhaps some more discussion would have been good. From the height of a few months, this seems like it would have been a better choice, but at the time, I was just trying to make author search function. What I discovered was that the lack of machine-readable Signpost archives was concealing a bizarre shitpile of some 1,212 distinct author fields, about 300 of which were nonsense. For example, sorting alphabetically gives [User:GerardM, {{{2}}}, §hep, \/, +sj +, 2 May, 3 July 2006, 03 July 2006, 3family6, 3family6 1 April 2016 19:58 (UTC), 05 November 2007, 10 other editors, 11 other editors, 12 August 2015, 14 April, 18 August, 19 November 2012, 22mikpau, 24 April, 24 Apri, 26 April 2010, 27 other editors,, 28 other editors, 32 other editors, 38 editors, 51 other editors, 53 other editors, 79 other editors, 91 editors, 106 editors on the French Wikipedia; translated for The Signpost by JohnNewton8, 273 other editors, 1233, 2008, 16912 Rhiannon 15 July 2015 (literally all of which are parsing errors except for three); your articles, specifically, were under HaeB, Tilman Bayer, Tilman Bayer, Tilman Bayer 1 April 2016 19:58 (UTC), Tilman Bayer 03:08 (UTC) (two of those appearing to be identical strings, but one of them with a nonprinting character). I guess what I am trying to say here is that if you want me to de-alias your name changes, I can do that, but I think that given the volume of work being done it was not something I could realistically post about at WT:Signpost (I recall that around December/January you had been saying we should be stricter about posting stuff on the right talk pages, which is why it was at /Technical). jp×g 23:14, 6 August 2023 (UTC)[reply]

Appreciate your reply, but my point above was exactly that this is a major content decision (and one affecting authors - and interested readers! - who may not be interested in following the Signpost's technical side). In that respect, /Technical was the wrong place for such discussions (at least without announcing them in more content-focused locations too, which I don't recall seeing at the time).

As for the specific content examples you're raising: Yes, I'm sure there were a lot of uncontroversial fixes among these mass changes, but also a lot of deliberate choices being altered. (I wasn't mainly thinking about my own bylines here, but apart from nonprinting characters etc., these were also not accidentally different.) Lastly, I would recommend distinguishing between building a more systematic, consistent way of handling bylines in future editions (great!) and retroactively changing reader-facing content in old issues. Regards, HaeB (talk) 00:02, 7 August 2023 (UTC)

JSON for the indices

@Mr. Stradivarius: Per this, it's now possible to have Lua modules load JSON rather than Lua tables. I think this would be a lot better to work with (i.e. all utilities wouldn't have to constantly serialize and deserialize Lua tables using the idiosyncratic whitespace/indentation/etc format). Additionally, a page with the JSON content model would be constrained and sanitized by MediaWiki, rather than a Lua table which can just have wrong stuff in it etc. I would like to write some more utilities to work with these indices but the Lua tables are kind of an awkward sticking point. What would the procedure be for converting them? I would be able to write patches for the Signpost tagger and Wegweiser (and may be able to help with the Lua module itself). jp×g🗯️ 23:52, 8 December 2023 (UTC)

@JPxG: I guess you could do that, but you would still have the issue of making sure the JSON is indented and sorted the same way across all tools that write to the data pages. Just switching to JSON doesn't guarantee that all JSON serializers format everything the exact same way. Also, you can't save syntactically incorrect Lua pages; MediaWiki won't allow it. So I don't think there is a real difference between Lua and JSON there. One pitfall is that you would have to change the content model to JSON when creating the data pages, which as far as I'm aware only admins can do. This would mean that WegweiserBot would have to become an admin bot, which would require another BRFA. Also, non-admin users would no longer be able to use SignpostTagger to tag pages for which the data page is not yet created; they would have to wait until either WegweiserBot or an admin user changes the content model. All in all, to me this sounds like a solution looking for a problem. — Mr. Stradivarius ^{♪ talk ♪} 08:47, 9 December 2023 (UTC)

My thinking is mostly that converting the JSON to Lua tables adds another dependency for every program that puts things into or out of the indices; JSON is a pretty widely used standard and there's lots of utilities available for parsing it. It just seems like it's more straightforwardly compatible with what exists now, and more forward-compatible with whatever the future brings.

As for the content models, I hadn't thought about that. It is true that TEs can change content models, but that's still not very great. Fortunately my RfA was successful so I'd be happy to manually pre-create and set content models for blank JSON indices up to, say, 2060 or so ;) jp×g🗯️ 02:41, 10 December 2023 (UTC)
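To illustrate the formatting concern raised above, here is a minimal Python sketch of one way a tool could write index JSON deterministically, by pinning indentation, key order and escaping in a single helper. The name canonical_dump and the sample entries are hypothetical and not taken from Wegweiser or SignpostTagger; a tool written in another language would have to follow the same conventions for its output to match byte for byte.

```python
import json


def canonical_dump(entries):
    """Serialize index entries with a fixed tab indent and sorted keys.

    Hypothetical helper: every tool that writes the data pages would
    need to agree on these exact settings.
    """
    return json.dumps(entries, indent="\t", sort_keys=True, ensure_ascii=False) + "\n"


# The same entry built by two tools that insert keys in a different order.
from_tool_a = [{"date": "2023-12-04", "subpage": "Essay", "title": "I am going to die"}]
from_tool_b = [{"title": "I am going to die", "subpage": "Essay", "date": "2023-12-04"}]

# Both serialize to identical text, so diffs between tools stay empty.
print(canonical_dump(from_tool_a) == canonical_dump(from_tool_b))
print(canonical_dump(from_tool_a))
```

This only removes differences between writers that share these settings; it does nothing about the content-model permissions discussed above.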

New subheading and image fields

Lately it has occurred to me that the way old Signpost archives get generated is very ass-backwards. We have a giant database of every article, its title, its author, et cetera... we're just not using it. It does get used sometimes, like in Wikipedia:Wikipedia_Signpost/Templates/Single_talk. But for the archive issues, we have hundreds of individual pages, like Wikipedia:Wikipedia Signpost/Single/2022-11-28, that redundantly store article titles, subheadings, et cetera. The modifications I'm working on now (which I've incorporated into the snippet template and the publishing script) allow articles to be associated with custom images, so the archives will have that too, but then this creates a problem: modifying an image for an article requires me to edit the article, then update Wikipedia:Wikipedia Signpost, then update Wikipedia:Wikipedia Signpost/2023-12-04, and such.

Anyway: I'd like to, as much as possible, use the module for stuff instead of static wikitext pages. However, this will again require some more fields to be added. Right now, my thinking is:

subhed (subheading, or blurb, or whatever) -- string, nowadays this is a sentence or so but in some old issues it's GIGANTIC (paragraph or more). Has various random crap in it (templates, quotation marks, etc).
piccy -- string, image file for the article. These, ideally, would be a 1:1 aspect ratio, but might sometimes not be, which brings me to:
piccy-meta -- coordinates for CSS crop of the image. I am not 100% on this. It might just be four CSS crop coordinates. But I am also contemplating other attributes, like filters (what if we want to desaturate an image, etc).

Like above, I am fine to incorporate these into the publishing script, Wegweiser, and the tagging script, but may need some assistance with the Lua part (and I don't know if this should be done before or after the Lua table/JSON thing). @Mr. Stradivarius: what do you think? jp×g🗯️ 00:04, 9 December 2023 (UTC)

@JPxG: This sounds fine to me, although I would avoid abbreviating the fields: subheading and image should be fine. I would also make "image" a table, so you could do something like image = {filename = "Example.png", width = 100, height = 100}, where "width" and "height" (or whatever metadata you need) are optional. I'm not aware of any way to crop an image in MediaWiki using inline styles - I think we are limited to the options provided at mw:Help:Images/en, but let me know if I'm missing something. Best — Mr. Stradivarius ^{♪ talk ♪} 09:03, 9 December 2023 (UTC)

@Mr. Stradivarius: I was able to figure out what was going on in {{CSS crop}} well enough to implement it (far slimmer) on the snippet template; specifics (and examples) at Wikipedia talk:Wikipedia Signpost/Newsroom#Template for article images. Here is what the template looks like with all its arguments:

{{Signpost/snippet|2023-12-04|Essay|I am going to die|And so are you.|0.3 MB|sub=0.3 MB|by=[[User:WhatamIdoing|WhatamIdoing]]|pic=File:Memento Mori 'To This Favour' by William Michael Harnett, c. 1879.JPG|credit-name=William Michael Harnett|credit-license=PD|pic-p=800|pic-x=350|pic-y=100}}

After actually working it out enough to build something that functioned, it turns out there are only three meta attributes needed to fully specify a scale and crop: scale, x and y. All are integers, although I think it might also be useful to permit values like top, bottom, left, right and center (these aren't handled by the template yet but they could be in the future). There are also two more params I didn't think of above, author and license, which are necessary for image attribution. jp×g🗯️ 03:59, 11 December 2023 (UTC)
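As a rough illustration of how three integers can specify a scale and crop, here is a Python sketch that turns them into inline CSS strings. It assumes that scale is the rendered width of the whole image in pixels, that x and y are the offsets of the visible window from the image's top-left corner, and that the crop is done with a clipping container using overflow:hidden. That is only one reading of the description above; the real snippet template may work differently. The names to_px and crop_styles and the 300px window are hypothetical.

```python
def to_px(value):
    """Convert a stored field such as "400" or "" to an integer; blank means 0."""
    return int(value) if str(value).strip() else 0


def crop_styles(scale, x, y, window=(300, 300)):
    """Build inline CSS for a clipped image (a sketch, not the template's code).

    scale:  rendered width of the full image in pixels (assumption)
    x, y:   offset of the visible window from the top-left corner (assumption)
    window: (width, height) of the visible window; hypothetical default
    """
    width, height = window
    # Outer container clips everything outside the window.
    outer = f"width:{width}px; height:{height}px; overflow:hidden;"
    # Negative margins slide the scaled image up and to the left.
    inner = f"margin-left:{-to_px(x)}px; margin-top:{-to_px(y)}px;"
    return outer, inner, f"{to_px(scale)}px"


# Values from the example template call: pic-p=800, pic-x=350, pic-y=100
# (treating pic-p as the scale is an assumption).
outer, inner, image_width = crop_styles("800", "350", "100")
print(outer)
print(inner)
print(image_width)
```

Keyword values such as top or center are not handled here; they would need a separate mapping to pixel offsets.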

Subheadings

Subheadings need to be supported by various scripts in order to work properly.

Wegweiser: accepts Y, processes Y
Signpost tagger: accepts Y, processes Y
Module: accepts Y, processes Y

I have rewritten Wegweiser to fetch metadata from parsing article wikitext instead of HTML pages; this posed some slight difficulties with respect to user tags in author fields but is now good. It can now work a lot faster, and it also provides subheadline metadata. There's a line I have commented out right now in the script, but can enable to make it store subheading data when it parses metadata. When the subheading data is in the module indices, the module still works fine to retrieve articles etc (of course it can't do anything to parse or use the subheading yet, but it doesn't break anything). However, the Signpost tagger chokes trying to save tags for articles with subheading data and won't work on them, so I won't do all of the indices with it for now. jp×g🗯️ 22:06, 15 December 2023 (UTC)

Now done. jp×g🗯️ 12:23, 23 December 2023 (UTC)
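For readers wondering why user links in author fields make wikitext parsing awkward: a byline such as [[User:WhatamIdoing|WhatamIdoing]] contains a pipe, so splitting a template call on every "|" breaks it in two. The Python sketch below splits arguments only at pipes outside nested [[...]] or {{...}}. It is an illustration, not Wegweiser's actual code; split_template_args is a hypothetical name, and the sample input is the snippet call quoted earlier on this page rather than a real article header.

```python
def split_template_args(wikitext):
    """Split one {{...}} call into (name, positional args, named args)."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("expected a single template call")
    body = body[2:-2]

    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):            # entering a nested link or template
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}") and depth > 0:  # leaving it again
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:  # a real argument separator
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    name, positional, named = parts[0].strip(), [], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        # Simplification: an "=" that follows a nested link or template
        # is not treated as a key/value separator.
        if sep and "[[" not in key and "{{" not in key:
            named[key.strip()] = value.strip()
        else:
            positional.append(part.strip())
    return name, positional, named


SNIPPET = (
    "{{Signpost/snippet|2023-12-04|Essay|I am going to die|And so are you.|0.3 MB"
    "|sub=0.3 MB|by=[[User:WhatamIdoing|WhatamIdoing]]"
    "|pic=File:Memento Mori 'To This Favour' by William Michael Harnett, c. 1879.JPG"
    "|credit-name=William Michael Harnett|credit-license=PD"
    "|pic-p=800|pic-x=350|pic-y=100}}"
)

name, positional, named = split_template_args(SNIPPET)
print(name)
print(positional)
print(named["by"])
```

A production parser would also need to handle comments, nowiki sections and unbalanced brackets, which this sketch ignores.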

Best practices for image information fields

So right now I've integrated into Wegweiser and SignpostTagger the fields for piccy information, like this:

	{
		date = "2023-12-04",
		subpage = "Essay",
		title = "I am going to die",
		authors = {"WhatamIdoing"},
		tags = {"essay"},
		views = {d007 = 1526, d015 = 1919, d030 = 2029, d060 = 2029, d090 = 2029, d120 = 2029, d180 = 2029},
		piccycredits = "William Michael Harnett",
		piccyfilename = "File:Memento Mori 'To This Favour' by William Michael Harnett, c. 1879.JPG",
		piccylicense = "PD",
		piccyscaling = "400",
		piccyxoffset = "70",
		piccyyoffset = "",
		subhead = "And so are you.",
	},

496 chars. This seems, uh, stupid. It works but I have temporarily reverted. Since these six fields are all about the same thing, there's no good reason to have them occupy six whole fields -- they should probably just be a dict like the viewcounts are. Like such:

	{
		date = "2023-12-04",
		subpage = "Essay",
		title = "I am going to die",
		authors = {"WhatamIdoing"},
		tags = {"essay"},
		views = {d007 = 1526, d015 = 1919, d030 = 2029, d060 = 2029, d090 = 2029, d120 = 2029, d180 = 2029},
		piccy = {filename = "File:Memento Mori 'To This Favour' by William Michael Harnett, c. 1879.JPG", credits = "William Michael Harnett", license = "PD", scaling = "400", xoffset = "70", yoffset = ""},
		subhead = "And so are you.",
	},

467. But, upon thinking this thought, something rather devious popped into my mind: aren't these labels kind of long? It seems pretty insubstantial, but... there are a lot of articles. Six letters for each key, across 5519 (currently 5462) articles (plus 3 extra chars for the = necessitated by using a label at all) is 49671 bytes (currently 49158). For reference, the size of all extant module indices is 2256199 (currently 1406253). So that's, uh, 3.496% of the total index size being taken up just by key names. These have to be parsed by, basically, everything, and it's not quite clear that repeating the field names all these thousands of times is more efficient than just having them as an array whose ordering is documented. Perhaps it would be better to do this? @Mr. Stradivarius: jp×g🗯️ 13:51, 23 December 2023 (UTC)
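The arithmetic above can be checked with a short Python sketch, which also shows what "an array whose ordering is documented" might look like in practice. The figures are the ones quoted in the comment (9 bytes of label per article). The field order in PICCY_ORDER and the helper names are hypothetical, chosen only to match the keys in the example table. Note that the 3.496% figure corresponds to the "currently" pair of numbers; the first pair works out to roughly 2.2%.

```python
BYTES_PER_ARTICLE = 6 + 3  # six letters of key name plus " = ", as counted above


def key_overhead(articles, per_article=BYTES_PER_ARTICLE):
    """Bytes spent on key names across all articles."""
    return articles * per_article


def share(overhead, total):
    """Overhead as a percentage of total index size."""
    return 100 * overhead / total


print(key_overhead(5519))                        # 49671
print(key_overhead(5462))                        # 49158
print(round(share(49158, 1406253), 3))           # 3.496
print(round(share(49671, 2256199), 1))           # 2.2

# A positional layout: the order is documented once instead of repeating keys.
PICCY_ORDER = ("filename", "credits", "license", "scaling", "xoffset", "yoffset")


def to_positional(piccy):
    """Pack a keyed image record into a list following PICCY_ORDER."""
    return [piccy.get(key, "") for key in PICCY_ORDER]


def from_positional(values):
    """Unpack a positional list back into a keyed record."""
    return dict(zip(PICCY_ORDER, values))


piccy = {
    "filename": "File:Memento Mori 'To This Favour' by William Michael Harnett, c. 1879.JPG",
    "credits": "William Michael Harnett",
    "license": "PD",
    "scaling": "400",
    "xoffset": "70",
    "yoffset": "",
}
print(from_positional(to_positional(piccy)) == piccy)
```

The trade-off is that a positional record is shorter but unreadable without the documented order, and every tool must be updated together if a field is ever added or reordered.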