Printing Sections of a MediaWiki Wiki

From Admin-SIG

I still do not have a good solution to this problem. I find it strange that nobody has written some sort of PHP script that can be installed alongside MediaWiki, traverse a sub-tree of a wiki, and stick it all in one massive document for printing.

I'm not looking for something that looks like a properly typeset book. Just something that CAN be printed for archival, and perhaps a very least-common-denominator, ASCII-based format, like HTML or XML, that can be archived electronically. Just in case a situation comes up where the technology and/or expertise to revive MediaWiki and/or MySQL dies in the great nuclear winter, and somebody wants to laboriously cut-and-paste the ASCII into MS-Word 2037.

I am thinking that a MediaWiki based solution is THE way to pass something like an ISO 9001 audit on procedures, and could be a GREAT way for lower-level employees to update inefficient procedures without a lot of hassle. We just need to be able to print the "procedure manual" for lawyers, even if the sys-admin gets hit by a bus.


Printing Wiki Sections with Aggregation Articles

The method suggested in the Wikimedia Meta-Wiki FAQ is to collect all the related articles comprising a document into one big article. Then you should be able to click on "Printable version" and print, save as HTML, or print/convert to PDF.

This mega-page would look something like

     = Big Headings =
     {{:Page title1}}
     {{:Page title2}}
     

Wiki Article Aggregation Example
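
For a long document, the aggregation page could be generated rather than typed by hand. Here is a minimal Python sketch of that idea. The function name, the heading, and the sample titles are made up for illustration; only the {{:Title}} transclusion syntax comes from the example above.

```python
def build_aggregation_page(heading, titles):
    """Return wikitext that transcludes each article under one heading.

    {{:Title}} transcludes a page from the main namespace, as in the
    hand-written example. This helper is hypothetical, not part of MediaWiki.
    """
    lines = ["= {} =".format(heading)]
    for title in titles:
        # One transclusion per line, in the order the articles should print.
        lines.append("{{:" + title.strip() + "}}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_aggregation_page("Big Headings", ["Page title1", "Page title2"]))
```

The output can be pasted into a new article, which is then printed with the printable version link as described above.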

Printing Wiki Sections with wget

wget is an old-school HTTP downloader found on almost all Linux distributions. It also handles other protocols, such as FTP.

wget -r should be able to download HTML recursively, but this is problematic with MediaWiki. All of the skins I have seen include decorations that link to the rest of the wiki, so a recursive download wanders out of the section it started in. This makes it hard (if not impossible) to traverse a wiki section with one simple wget command. A skin that looks like the printable version page, but with internal article links enabled, would be great: we could then dump a wiki section with a single wget command.

Barring that, we could set up a script of wget commands to grab one article at a time, and collect them into one big HTML document.
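
Such a script does not have to be a list of wget commands. Below is a rough Python sketch of the same idea: build the printable-version URL for each article, download it, and glue the page bodies together. I have not run this against a real wiki. The base URL, the function names, and the assumption that the wiki answers at index.php with title= and printable=yes parameters are all placeholders.

```python
import re
import urllib.parse
import urllib.request

_BODY = re.compile(r"<body[^>]*>(.*)</body>", re.IGNORECASE | re.DOTALL)


def printable_url(base_url, title):
    """Build the printable-version URL of one article (assumed URL layout)."""
    query = urllib.parse.urlencode(
        {"title": title.replace(" ", "_"), "printable": "yes"})
    return "{}?{}".format(base_url, query)


def fetch(url):
    """Download one page and return it as text (assumes UTF-8 output)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def body_of(html):
    """Return what is inside <body>...</body>, or the input if there is none."""
    match = _BODY.search(html)
    return match.group(1) if match else html


def combine(pages):
    """Join downloaded pages into one HTML document, separated by rules."""
    bodies = [body_of(page) for page in pages]
    return "<html><body>\n" + "\n<hr>\n".join(bodies) + "\n</body></html>"


# Example use (needs network access, so it is left as a comment):
#   titles = ["Help-style indexing", "Email Digest"]
#   base = "http://yoursite.com/index.php"
#   html = combine(fetch(printable_url(base, t)) for t in titles)
#   open("section.html", "w", encoding="utf-8").write(html)
```

The list of titles still has to be maintained by hand, so this only sidesteps the traversal problem rather than solving it.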

Creating Books from Wiki pages with htmldoc

HTMLDOC seems to be a fancier program that can do a job similar to wget, with more features to control the book formatting. It can also write HTML, PS, and PDF directly.

It does not seem to solve the traversal problem, though.

However, it does have a GUI mode where you can edit a book project that lists the URLs of all the web pages to include.

These URLs can carry the &printable=yes parameter to make for a fairly professional-looking book.
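
The same list of URLs could also be handed to HTMLDOC from a script instead of the GUI. The Python sketch below only builds the command line; the --webpage and -f options are the ones used by the PHP code later in this article. That HTMLDOC will fetch URLs given as arguments is my assumption here, and the helper names and example URLs are invented.

```python
def add_printable(url):
    """Append printable=yes to a wiki URL, whether or not it has a query."""
    separator = "&" if "?" in url else "?"
    return url + separator + "printable=yes"


def htmldoc_command(urls, pdf_path, htmldoc="htmldoc"):
    """Return an argument list asking HTMLDOC for one PDF from many pages.

    The list form can be passed to subprocess.run() without shell quoting.
    """
    return [htmldoc, "--webpage", "-f", pdf_path] + [add_printable(u) for u in urls]


if __name__ == "__main__":
    pages = [
        "http://yoursite.com/index.php?title=Help-style_indexing",
        "http://yoursite.com/index.php?title=Email_Digest",
    ]
    # Print the command instead of running it, so nothing is fetched here.
    print(" ".join(htmldoc_command(pages, "manual.pdf")))
```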

It still has the problem of needing somebody who knows how to run and maintain the HTMLDOC project to generate the desired books. The aggregation-article method has the advantage that it uses MediaWiki to print MediaWiki, and one can assume that any organization that uses MediaWiki has knowledgeable MediaWiki users, even if the MediaWiki administrator becomes unavailable.

Writing a PHP page to aggregate MediaWiki Articles

This section adapts and updates MHart's article on PDF Output (http://meta.wikimedia.org/wiki/PDF_Export) from the Wikimedia Meta-Wiki.

I intend to update this with my experiences and observations if I ever implement it.

Start off by making a copy of index.php and calling it PrintArticles.php. Near the bottom (between the big switch/case and $wgOut->output();), add the script below. The script looks for special coding on the page that it is viewing (in the same way index.php views a page).

Also, I first tested this in /images, then created a folder called /printouts and gave it the same privileges and the same .htaccess file as /images.

The coding is very simple and works like this: (we'll call this page "Test Print")

  • Put articles that will appear in sequence in curly braces:
{Help-style indexing}
{Email Digest}
  • and put articles to combine into a single set of curly braces separated by the | pipe symbol.
{Word doc search | PDF doc search | Excel doc search}
  • Articles combined this way will not have a page break between them in the PDF file. This is really useful for articles that are short and related, such as a function list.
  • Then add a link to this page, but using the new PrintArticles.php file:
[http://yoursite.com/PrintArticles.php/Test_Print Print these articles]
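
To make the coding concrete, here is a small Python sketch of how such a page could be split into groups of titles. It is not the PHP used below, just an illustration of the syntax; the function name is made up.

```python
import re

# One {...} group: everything between a pair of braces, no nesting.
_GROUP = re.compile(r"\{([^{}]*)\}")


def parse_print_list(text):
    """Return a list of groups; each group is a list of article titles.

    Titles in the same group are meant to be printed without a page break.
    """
    groups = []
    for match in _GROUP.finditer(text):
        titles = [part.strip() for part in match.group(1).split("|")]
        titles = [title for title in titles if title]  # drop empty entries
        if titles:
            groups.append(titles)
    return groups


if __name__ == "__main__":
    page = ("{Help-style indexing}\n{Email Digest}\n"
            "{Word doc search | PDF doc search | Excel doc search}")
    for group in parse_print_list(page):
        print(group)
```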

Now when a user browses to this page on your site and clicks the above link, the page will re-output through the PrintArticles.php file instead of index.php. The page will be changed from looking like this:

{Help-style indexing} {Email Digest} {Word doc search | PDF doc search | Excel doc search}
Print these articles (http://yoursite.com/PrintArticles.php/Test_Print)

to this:

{Help-style indexing} {Email Digest} {Word doc search | PDF doc search | Excel doc search}
Print these articles (http://yoursite.com/PrintArticles.php/Test_Print)
Creating: (Help-style indexing) Creating: (Email Digest) Creating: (Word doc search) Creating: (PDF doc search) Creating: (Excel doc search)
Test_Print.pdf (http://yoursite.com/printouts/Test_Print.pdf)

A couple of important notes about the following code:

  • Hard coded site urls... yes yes, I'm still lazy...
  • Hard coded /tmp folder for temporary files...
    • And I'm not deleting the /tmp files either...
  • I'm removing certain links and such - kinda clunky, and might not work with all skins. I'm doing it to make the resulting PDF look cleaner (like removing edit tags). It's okay if they are left in - except if you do leave all img tags in, you need to make sure and give read rights to the templates and skins and all image folders to other users - otherwise HTMLDOC can't import some of them.
  • Funky caching issues
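  • Quotes print out as an a with an accent, a euro symbol, and a trademark symbol (â€™). This is what a UTF-8 typographic quote looks like when its bytes are read as a single-byte encoding.
  • To take UTF-8 characters into account, use a tool such as iconv in the loop, for example: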
 exec("iconv --from-code=UTF8 --to-code=ISO_8859-1 -o /tmp/toto_" . str_replace(" ","_",$art) . " /tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm");
 exec("cp /tmp/toto_" . str_replace(" ","_",$art) . " /tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm");
  • Images inserted in the article can be handled with something like this (replacing the corresponding line in the code below):
$NewBodyText .= "<h1>" . $art . "</h1><hr>" . str_replace('href="/wiki', 'href="http://' . $_SERVER["SERVER_NAME"] . '/wiki', 
                    str_replace('src="/wiki', 'src="http://' . $_SERVER["SERVER_NAME"] . '/wiki',
                    str_replace('<a href="/index.php', '<a href="http://' . $_SERVER["SERVER_NAME"] . '/index.php',
                    str_replace('<img src="/images/thumb',
                    '<img src="' . $_SERVER["DOCUMENT_ROOT"] . '/images/thumb', $bodyText))));
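
The quote problem in the notes above is easy to reproduce. The Python sketch below shows where the three odd characters come from, and one possible way to get ISO-8859-1 output without losing characters. Typographic quotes do not exist in ISO-8859-1, so the sketch swaps them for plain ASCII quotes first; that substitution, and the function names, are my own choice and not something the original code does.

```python
def show_mojibake(text):
    """Show how UTF-8 text looks when its bytes are read as Windows-1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


# Typographic quotes and their plain ASCII stand-ins.
_ASCII_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}


def to_latin1_html(text):
    """Encode HTML text as ISO-8859-1 bytes.

    Curly quotes become straight quotes; any other character outside
    ISO-8859-1 becomes a numeric character reference such as &#8364;.
    """
    for fancy, plain in _ASCII_QUOTES.items():
        text = text.replace(fancy, plain)
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


if __name__ == "__main__":
    print(show_mojibake("It\u2019s"))
    print(to_latin1_html("It\u2019s a caf\u00e9"))
```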

Here's the code that does all the dirty work:

// Build the output path from the page title (spaces and apostrophes become underscores)
$SafeTitle = str_replace("'","_",str_replace(" ","_",$wgTitle->getText()));
$PDFFile = $_SERVER["DOCUMENT_ROOT"] . '/printouts/' . $SafeTitle . ".pdf";
$PDFExec = "/usr/bin/htmldoc --webpage -f " . escapeshellarg($PDFFile);
$addedText = "";

// Keep the rendered body of the list page; $wgOut is reused to render each article
$SaveText = $wgOut->mBodytext;
$wgOut->mBodytext = "";

// strpos() returns false (not -1) when nothing is found, so compare with !== false
$i = strpos($SaveText,"{");
while ($i !== false && $SaveText != "") {
  // Look for the closing brace after the opening one
  $j = strpos($SaveText,"}",$i);
  if ($j === false) break;
  // Titles inside one pair of braces, separated by |, end up in one HTML file
  $multi_art = explode('|',substr($SaveText, $i+1, $j-$i-1));
  if (strlen($SaveText) > $j+1)
    $SaveText = substr($SaveText, $j+1);
  else
    $SaveText = "";
  $NewBodyText = "";
  foreach ($multi_art as $one_art) {
    $wgOut->mBodytext = "";
    $art = trim($one_art);
    $addedText .= "Creating: (" . $art . ")<br>";
    // Render the article into $wgOut the same way index.php would
    $PDFTitle = Title::newFromURL( $art );
    $PDFArticle = new Article($PDFTitle);
    $PDFArticle->view();
    // Strip the magnify icon and the [edit] section links
    $bodyText = str_replace('<img src="/stylesheets/images/magnify-clip.png" width="15" height="11" alt="Enlarge" />',
                '',
                str_replace('<div class="editsection" style="float:right;margin-left:5px;">[',
                '',
                str_replace('>edit</a>]</div>',
                '></a>', 
                $wgOut->mBodytext)));
    // Make links absolute and point thumbnails at the local file system
    $NewBodyText .= "<h1>" . $art . "</h1><hr>" . str_replace('<a href="/index.php',
                    '<a href="http://' . $_SERVER["SERVER_NAME"] . '/index.php',
                    str_replace('<img src="/images/thumb',
                    '<img src="' . $_SERVER["DOCUMENT_ROOT"] . '/images/thumb',
                    $bodyText));
  }
  // One temporary file per brace group, named after the last article in the group
  $TmpFile = "/tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm";
  $h = fopen($TmpFile,"w");
  fwrite($h,"<html><body>");
  fwrite($h,$NewBodyText);
  fwrite($h,"</body></html>");
  fclose($h);
  // Quote the file name so odd characters in a title cannot break the command
  $PDFExec .= " " . escapeshellarg($TmpFile);
  $i = strpos($SaveText,"{");
}

// Run HTMLDOC and show whatever it printed
exec($PDFExec, $results);
foreach ($results as $line)
  $addedText .= $line . "<br>";

$addedText .= "<br><a href='http://" . $_SERVER["SERVER_NAME"] . '/printouts/' .
              $SafeTitle . ".pdf'>" . 
              $wgTitle->getText() . ".pdf</a>";
$wgOut->mBodytext = "";

// Show the original list page again, followed by the log and the PDF link
$wgArticle->view();
$wgOut->addHTML($addedText);
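
One of the notes above admits that the /tmp files are never deleted. As an illustration of one way around that, here is a Python sketch that writes the per-group HTML files into a temporary directory that removes itself when the work is done. It mirrors the file naming of the PHP code (spaces and apostrophes become underscores) but is otherwise a separate, hypothetical helper, not a drop-in replacement.

```python
import os
import tempfile


def safe_name(title):
    """Mirror the PHP str_replace calls: spaces and apostrophes become '_'."""
    return title.replace(" ", "_").replace("'", "_")


def write_group_files(groups, workdir):
    """Write one .htm file per (title, html) pair and return the file paths."""
    paths = []
    for title, html in groups:
        path = os.path.join(workdir, safe_name(title) + ".htm")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("<html><body>" + html + "</body></html>")
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        files = write_group_files([("Email Digest", "<p>example</p>")], workdir)
        print(files)  # HTMLDOC would be run here, while the files still exist
    # Leaving the "with" block deletes the directory and everything in it.
```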