
Data extraction technique

At Pete's instigation, I wrote this script, which generates the table in this article.

#!/bin/bash
# Print the view count for article $1 in month $2 (YYYYMM).
wpor()
{
        wget "http://stats.grok.se/en/$2/$1" -O - 2>/dev/null | \
        grep " has been viewed " | \
        sed 's#.*</a> has been viewed \([0-9]*\).*#\1#;'
}

echo '{| class="wikitable sortable"'
echo '! article !! importance !! rating !! Dec 2007 !! Jan 2008 !! Feb 2008 !! Mar 2008'

dates="200712 200801 200802 200803"
# Each input line holds: article (underscores for spaces), importance, rating.
while read -r article importance rating
do
        x=""
        for month in $dates; do
                y=$(wpor "$article" "$month")
                x="$x || $y"
        done
        echo "|-"
        echo "| [[$article]] || $importance || $rating $x"
done
echo "|}"

Its input came from the table generated by the automatic rating thingy which appears on the project page. ~~I got the data from there, but for the life of me I can't figure out where that is now.~~ The beginning of the data looks like this:

Oregon_State_Capitol    Top     FA
1980_eruption_of_Mount_St._Helens       Mid     FA
1984_Rajneeshee_bioterror_attack        Mid     FA
D._B._Cooper    Mid     FA

It was mildly reformatted from the magically generated article. -EncMstr (talk) 03:55, 21 April 2008 (UTC)
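
The extraction step inside wpor can be tried offline, without fetching anything. The following is a minimal sketch and not part of the original script: extract_views is a hypothetical name, and the sample HTML line is an assumption about the stats page layout, inferred only from the grep and sed patterns used above.

```shell
#!/bin/bash
# Hypothetical helper: pull the view count out of stats-page HTML on stdin,
# using the same grep and sed expressions as wpor.
extract_views() {
        grep " has been viewed " |
        sed 's#.*</a> has been viewed \([0-9]*\).*#\1#'
}

# Assumed sample line; the real page markup is not shown in this thread.
sample='<p><a href="/en/200803/Oregon_State_Capitol">Oregon_State_Capitol</a> has been viewed 12345 times in 200803.</p>'
printf '%s\n' "$sample" | extract_views
```

With the assumed sample line this prints 12345. A page without the " has been viewed " phrase yields no output at all, which is why a failed fetch shows up as an empty table cell.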

June 2008 updates

summary

These steps, detailed below, produce an update of this (Readership) page:

1. Open Wikipedia:Version 1.0 Editorial Team/Oregon articles by quality/1 for editing to get at its wikitext.
2. Copy and paste the wikitext into Vim hosted on Linux.
3. Execute the search and replace commands (below), typing a literal tab character wherever "^I" appears if necessary.
4. Remove the header and trailer lines.
5. Save the resulting data as "file".
6. Execute the script below, saved as "wpor", with ./wpor <file >result
7. Copy and paste "result" into the article. Preview, then fix any UTF-8 character problems, which show up as redlinked articles.

gory detail

The format of the page containing the assessments, Wikipedia:Version 1.0 Editorial Team/Oregon articles by quality/1, has changed. The wikitext of that page is trimmed to exclude the header and trailer text, then run through these Vim commands to produce the tab-separated input table (the same format shown above):

:%s/{{assessment | page=\[\[\(.*\)]].*importance={{\(.*\)-Class.*class={{\(.*\)-Class.*/\1^I\2^I\3       (for most entries)
:%s/^{{assessment | page=\[\[\(.*\)]].*class={{\(.*\)-Class.*/\1^I#na^I\2                                (for unknown importance entries)

The first vim command transforms a line like

 {{assessment | page=[[Berkeley Lent]] [http://en.wikipedia.org/w/index.php?title=Berkeley_Lent&oldid=135741266 ] | importance={{Mid-Class}} | date=June 4, 2007 | class={{Start-Class}} | version= | comments= }}

into

Berkeley Lent   Mid     Start

The second command transforms a line which has ... | importance= | date=... into

Berkeley Lent   #na     Start
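
For anyone who would rather not do this step by hand in Vim, the same two substitutions can be sketched with sed. This is an untested-in-production illustration rather than what was actually used: assessments_to_tsv is a hypothetical name, the patterns are copied from the Vim commands above, and the optional leading spaces are an added assumption so that indented lines also match.

```shell
#!/bin/bash
# Hypothetical helper: convert {{assessment ...}} lines on stdin into
# "article<TAB>importance<TAB>rating" lines on stdout.
# Lines matching neither pattern (header, trailer) are dropped by -n.
assessments_to_tsv() {
        local tab=$'\t'
        # 1st expression: entries with both importance and class.
        # "t" skips the 2nd expression when the 1st one matched.
        # 2nd expression: entries with no importance, marked "#na".
        sed -n \
            -e "s/^ *{{assessment | page=\[\[\(.*\)]].*importance={{\(.*\)-Class.*class={{\(.*\)-Class.*/\1${tab}\2${tab}\3/p" \
            -e "t" \
            -e "s/^ *{{assessment | page=\[\[\(.*\)]].*class={{\(.*\)-Class.*/\1${tab}#na${tab}\2/p"
}
```

Usage would be along the lines of assessments_to_tsv <wikitext >file, where "wikitext" is a hypothetical file holding the pasted page source.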

The result is fed into the script below as stdin (that is, < file):

#!/bin/bash
# Print the view count for article $1 in month $2 (YYYYMM).
wpor()
{
        wget "http://stats.grok.se/en/$2/$1" -O - 2>/dev/null | \
        grep " has been viewed " | \
        sed 's#.*</a> has been viewed \([0-9]*\).*#\1#;'
}

declare -a monthlist
monthlist=(200712 200801 200802 200803 200804 200805)

n=${#monthlist[@]}
declare -a coltotals

echo '{| class="wikitable sortable" style="text-align:right"'
echo '! article !! importance !! rating !! Dec 2007 !! Jan 2008 !! Feb 2008 !! Mar 2008 !! Apr 2008 !! May 2008 !! Total'


for (( m = 0;  m < n; ++m )); do
        coltotals[$m]=0
done

rowcount=0

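# Input is tab-separated: article name (may contain spaces), importance, rating.
while IFS=$'\t\n' read -r article importance rating
do
        # The stats URL uses underscores where the article name has spaces.
        wikiarticle=${article// /_}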
        #echo "article $article=$wikiarticle, importance $importance, rating $rating"
        x=""
        linetot=0
        for (( m = 0;  m < n; ++m )); do
                y=$(wpor "$wikiarticle" "${monthlist[$m]}")
                # An empty $y (failed fetch) counts as 0 in arithmetic.
                linetot=$(( linetot + y ))
                x="$x || $y"
                coltotals[$m]=$(( coltotals[$m] + y ))
        done
        echo "|-"
        echo "| [[$article]] || $importance || $rating$x || $linetot"
        rowcount=$(( rowcount + 1 ))
done

linetot=0
x=""
for (( m = 0;  m < n; ++m )); do
        y=$(( coltotals[$m] ))
        linetot=$(( linetot + y ))
        x="$x || $y"
done

echo "|-"
echo "| __Total__ $rowcount articles || || $x || $linetot"
echo "|}"

This script is based on the old one, but also calculates row and column totals. Its output can be pasted into the article directly, whereas the old script's output needed some hand tweaking. The only current glitch is that some extended UTF-8 characters are munged, which presently affects about 6 article names. -EncMstr (talk) 21:14, 3 June 2008 (UTC)
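
The row and column totals can be checked without hitting the stats server. Below is a minimal sketch under stated assumptions: sum_rows_and_columns is a hypothetical name, and the input is assumed to be tab-separated lines of a name followed by already-fetched counts, which is not a format the script above produces. It only illustrates the bookkeeping.

```shell
#!/bin/bash
# Hypothetical helper: read "name<TAB>count<TAB>count..." lines from stdin,
# print each line with its row total appended, then a final "Total" line
# holding the column totals and the grand total.
sum_rows_and_columns() {
        local -a fields coltotals=()
        local i line row_total grand_total=0
        while IFS=$'\t' read -r -a fields; do
                line=${fields[0]}
                row_total=0
                for (( i = 1; i < ${#fields[@]}; ++i )); do
                        row_total=$(( row_total + fields[i] ))
                        coltotals[i]=$(( ${coltotals[i]:-0} + fields[i] ))
                        line+=$'\t'${fields[i]}
                done
                grand_total=$(( grand_total + row_total ))
                printf '%s\t%s\n' "$line" "$row_total"
        done
        line=Total
        for i in "${!coltotals[@]}"; do
                line+=$'\t'${coltotals[i]}
        done
        printf '%s\t%s\n' "$line" "$grand_total"
}
```

For example, feeding it the two lines "A, 1, 2" and "B, 3, 4" (tab-separated) gives row totals 3 and 7 and a final line of "Total, 4, 6, 10". Because tab counts as whitespace for read, an empty count in the middle of a line would shift later counts left, so this sketch assumes every cell is filled in.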

Kudos!

Wow EncMstr, thanks for the major expansion of data, and all the documentation of how you did it! -Pete (talk) 23:59, 3 June 2008 (UTC)