Passing structured formatted data
Created by: slifty
The Use Case
We rely on CSVs for passing data into Torque. This makes plenty of sense, but it also limits the amount of (basic) formatting that can be assigned to that data, for instance when a diligence report involves bulleted lists, or bolded / emphasized text within a given sentence.
An additional issue is that some of the diligence / follow-up documents have lots of different sections, and asking folks to paste text into CSVs feels a bit clunky (since spreadsheets aren't really intended for long-form, multi-paragraph text blocks).
The non-engineered solution to this involved taking PDF / Word documents, using ghostscript and pandoc to extract text, and manually inserting the data into the system. This was a fraught process (but it resulted in a deeper understanding of the data, which is always nice).
An Engineered Solution
What I'm working on now is just a first draft of an engineered solution. We'll iterate over time, I'm sure.
- I've created very simple Word templates which leverage the header Word formatting types to demarcate various sections and subsections.
- PDF and other long-form data is inserted into these Word "templates" manually and provided to OTS.
- As part of ETL, I use pandoc to convert them to plain text (markdown or mediawiki).
- Remove any random HTML that got inserted (sometimes indentation causes trouble).
- Do some basic string replacement to convert the document to ArchieML format. Thus each section becomes semantically accessible.
- Convert that structured object into the CSV Torque expects.
At that point, it's all just Torque data like anything else.
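To make the steps above concrete, here is a rough Python sketch of how they might fit together. It is not the actual ETL code. The function names (`docx_to_markdown`, `strip_html`, `split_sections`, `to_archieml`, `to_csv`) are made up for illustration, and it assumes that pandoc is on the PATH, that pandoc writes `#`-style headings, and that Torque wants one CSV column per section. It also flattens sections and subsections into a single level of keys, which the real pipeline may not do.

```python
import csv
import io
import re
import subprocess

# Assumption: headings look like "# Title" / "## Subtitle" in pandoc's output.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
# Crude tag matcher; it would also eat markdown autolinks such as <http://...>.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def docx_to_markdown(path):
    """Ask pandoc to convert a Word document to markdown text."""
    result = subprocess.run(
        ["pandoc", "--from", "docx", "--to", "markdown", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def strip_html(text):
    """Drop stray HTML tags but keep the text between them."""
    return TAG_RE.sub("", text)


def slugify(heading):
    """Turn 'Risk Summary' into a key like 'risk_summary'."""
    return re.sub(r"[^a-z0-9]+", "_", heading.lower()).strip("_")


def split_sections(markdown):
    """Map each heading's key to the text beneath it.

    Text before the first heading is ignored, and every heading level
    becomes a top-level key (no nesting).
    """
    sections = {}
    key = None
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            key = slugify(match.group(2))
            sections[key] = []
        elif key is not None:
            sections[key].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}


def to_archieml(sections):
    """Render sections as ArchieML multi-line values closed by ':end'.

    Body lines that themselves look like 'key: value' would need
    escaping in real ArchieML; this sketch does not handle that.
    """
    blocks = [f"{key}: {body}\n:end" for key, body in sections.items()]
    return "\n\n".join(blocks) + "\n"


def to_csv(sections):
    """Write one header row and one data row, one column per section."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(sections))
    writer.writeheader()
    writer.writerow(sections)
    return buffer.getvalue()
```

Usage would look something like `to_csv(split_sections(strip_html(docx_to_markdown("report.docx"))))`, where `report.docx` is a placeholder filename. The markdown formatting inside each section (bullets, bold) survives as plain text in the CSV cell.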