Mediawiki Tames Unruly Text Data
################################
:date: 2019-01-23 16:20
:author: james
:slug: 165
:status: draft
OTS has been working with the `MacArthur Foundation <https://www.macfound.org/>`__. They're doing something interesting with a private install of Mediawiki, the software that runs Wikipedia. They have a largish set of text data and documents that they need to organize and make available to reviewers. Mediawiki turns out to be a very efficient way to turn a pile of undifferentiated text and files into an organized place for interacting with it.
The text data is a CSV file: a messy collection of giant blocks of text submitted through web forms. There are 1,500 rows, which isn't a lot for a computer but is a lot for a human to process, especially when a single row can run for many screens. The text includes links to videos. If you open the CSV file and try to read entries there to review them, you're going to have a bad time. The documents are 5,000 files: 9 GB of PDFs, Microsoft Word and Excel documents, some inside zip files. A lot of the data is private and needs to live in secure places. Making all that info and those files available to reviewers is a pain.
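To give a feel for the scale problem, here's a minimal sketch of inspecting such a file with Python's standard csv module. The filename is a hypothetical stand-in, not the real dataset:

.. code-block:: python

    import csv
    import sys

    # The free-text cells are enormous, so raise the parser's default
    # field size limit before reading anything.
    csv.field_size_limit(sys.maxsize)

    with open("submissions.csv", newline="") as f:  # hypothetical filename
        rows = list(csv.DictReader(f))

    print(len(rows), "rows")
    longest = max(len(cell) for row in rows for cell in row.values() if cell)
    print("longest cell:", longest, "characters")

Even this tiny script has to raise the CSV parser's field size limit before it can read the file, which hints at why reviewing the data in a spreadsheet is so unpleasant.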
MacFound came up with what we think is a rather smart idea: rather than build a custom interface to this data set, we loaded it all into a wiki. Reviewers get a table of contents, easily displayed text, embedded videos, and links to the external documents. They can hit the edit button and leave comments directly on the page. Many of them are already familiar with wikis as readers (though not as editors). This is a huge usability win: the reviewers are not generally highly sophisticated users of technology, but they ramp up fast when you sit them in front of a familiar-looking wiki.
Overall, this is a **very** cheap and fast route to making a jumbled pile of incoming data available directly to the people who need to work with it. We scripted the process of loading it into the wiki, and we can script the process of moving the data to other processes as needed. The wiki gives us all kinds of nice-to-haves for free: change/revision management, fine-grained access controls, indexing, and easy ways to see what has changed most recently. New features are just a plugin away, and Mediawiki has a large set of plugins available for us to use and/or customize. We're working next on exporting subsets of the data to PDF for easy sharing with collaborators.
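To give a flavor of what that loading step looks like under the hood, here's a minimal sketch of creating one wiki page per CSV row through the Mediawiki action API, using plain requests. The wiki URL, credentials, and page text are hypothetical placeholders; the real work (configuration, categories, a generated table of contents) lives in the csv2wiki tool linked below:

.. code-block:: python

    import requests

    API = "https://wiki.example.org/api.php"  # hypothetical wiki URL
    session = requests.Session()

    # Fetch a login token, then log in (bot-password-style credentials).
    login_token = session.get(API, params={
        "action": "query", "meta": "tokens", "type": "login",
        "format": "json",
    }).json()["query"]["tokens"]["logintoken"]
    session.post(API, data={
        "action": "login", "lgname": "loader-bot", "lgpassword": "secret",
        "lgtoken": login_token, "format": "json",
    })

    # Fetch a CSRF token for editing.
    csrf = session.get(API, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]

    # Create one page per row (here, just one hypothetical row).
    session.post(API, data={
        "action": "edit",
        "title": "Submission 42",           # derived from a row ID
        "text": "== Proposal ==\n...",      # sanitized cell text as wikitext
        "token": csrf,
        "format": "json",
    })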
Of course, we're sharing the software we wrote to get this working. Our first step was sanitizing the messy input data. That work is quite specific to our dataset, and `we share it on GitHub <https://github.com/OpenTechStrategies/MacFound>`__ mainly so people can see how we approached the problem; the code will not work with your dataset unless you make major changes to it. Then, we ran `a more generic tool that reads the sanitized CSV file and puts it in the wiki <https://github.com/OpenTechStrategies/csv2wiki>`__. Finally, we wrote an `upload tool <https://github.com/OpenTechStrategies/csv2wiki/tree/master/upload_files>`__ to bulk-upload files into the wiki and then add text links to those files. If you want to upload a bunch of files to your Mediawiki instance and then link to them from wiki pages, there are a lot of fiddly little things you need to do, and hopefully `our tool <https://github.com/OpenTechStrategies/csv2wiki/tree/master/upload_files>`__ can shortcut that process for you. Mediawiki has a ton of safeguards that make uploading files difficult. We don't need those safeguards: the wiki we're making is for internal use, and we don't think anybody is going to try to sabotage it. If you base your uploads on our scripts, though, you might want to re-tighten security once the uploads are done.
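For illustration, here's a sketch of the two fiddly steps the upload tool automates: pushing a file through the Mediawiki upload API and then linking to it from a page. It reuses the authenticated session and CSRF token from the sketch above; the file and page names are again hypothetical:

.. code-block:: python

    # Upload one document through the action API.
    with open("proposal-42.pdf", "rb") as f:
        session.post(API, data={
            "action": "upload",
            "filename": "Proposal-42.pdf",
            "ignorewarnings": "1",   # we loosened the usual safeguards
            "token": csrf,
            "format": "json",
        }, files={"file": f})

    # Append a wikitext link to the uploaded file on the row's page.
    session.post(API, data={
        "action": "edit",
        "title": "Submission 42",
        "appendtext": "\n[[Media:Proposal-42.pdf|Full proposal (PDF)]]",
        "token": csrf,
        "format": "json",
    })

Multiply that by five thousand files, plus error handling and name collisions, and you can see why a dedicated tool is worth having.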
You might also want to be aware that the Mediawiki API for uploading can be temperamental. The Python libraries for accessing it will surprise you, and not in good ways. We had to work around some old Mediawiki bugs (maybe we'll fix them and upstream the changes) and some non-standard behavior. We're using Mediawiki in a way it wasn't originally intended to be used, so some difficulties were expected. Overall, we're quite happy with the final product.
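One generic pattern that helped us cope: wrap API calls in a small retry loop with backoff, so a transient failure doesn't kill a long bulk run. This is a sketch of the idea, not our production code:

.. code-block:: python

    import time

    def post_with_retries(session, url, tries=3, delay=5, **kwargs):
        """POST to a Mediawiki API endpoint, retrying on failure."""
        for attempt in range(1, tries + 1):
            try:
                response = session.post(url, **kwargs)
                response.raise_for_status()
                return response.json()
            except Exception:
                if attempt == tries:
                    raise
                time.sleep(delay * attempt)  # back off a bit more each time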
Having built a custom wiki as a way to present and edit data for team members, we think this is a prototype-level solution. It's fast, dirty, cheap, and effective. The work was fun and interesting, and we are eager to see how this technique might apply to other environments. A second wiki would also spur us to improve the tools and generalize them some more. If you have a largish text dataset and documents that you're having trouble approaching, drop us a line. We'd love to help!