webtex

The making of a LaTeX pre-processor with Haskell - Part I

24 January 2015

I’ve written a decent amount of LaTeX, from relatively simple documents such as math assignments and essays to longer, more involved documents such as theses and CVs. In writing the simpler documents I often find the LaTeX syntax a bit too much: emphasized text has to be written as \emph{italic text}, and the syntax for lists is somewhat verbose. So I plan on writing a pre-processor using the Parsec library to achieve something similar to Jekyll’s system of using Markdown for content and YAML headers for additional information.

Why Haskell? Haskell is my first foray into functional programming, with my experience consisting only of a few problems on Project Euler. Even with these simple programs, I enjoy programming in Haskell a lot. I think it’s mostly due to my math background and the similarity in thinking styles. As of writing this post, my knowledge extends almost to the end of the chapters on monads in Learn You a Haskell for Great Good.

The plan for the preprocessor is to take something such as this:

This is a sample latex document. *I am emphasized*, **but I am in bold**. And here is a list

- Item 1
- Item 2
- Item 3

And produce a .tex file ready for processing:

    \documentclass{article}

    \begin{document}
        This is a sample latex document. \emph{I am emphasized}, \textbf{but I am in bold}. And here is a list 

        \begin{itemize}
            \item Item 1
            \item Item 2
            \item Item 3
        \end{itemize}
    \end{document}
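To make the target concrete, here is a rough sketch of how parsed content could be turned into the LaTeX above. The types `Inline` and `Block` and the `render...` functions are hypothetical names made up for this illustration; they are not part of the parser developed in this post, and the indentation of the generated file is simplified.

```haskell
import Data.List (intercalate)

-- Hypothetical intermediate representation of the parsed input.
data Inline = Plain String | Emph String | Bold String

data Block = Paragraph [Inline] | ItemList [[Inline]]

-- Wrap each piece of inline text in the matching LaTeX command.
renderInline :: Inline -> String
renderInline (Plain s) = s
renderInline (Emph s)  = "\\emph{" ++ s ++ "}"
renderInline (Bold s)  = "\\textbf{" ++ s ++ "}"

-- A paragraph is its inlines glued together; a list becomes an itemize
-- environment with one \item per entry.
renderBlock :: Block -> String
renderBlock (Paragraph inlines) = concatMap renderInline inlines
renderBlock (ItemList items) =
  intercalate "\n" $
    ["\\begin{itemize}"]
    ++ ["    \\item " ++ concatMap renderInline item | item <- items]
    ++ ["\\end{itemize}"]

-- Surround the rendered blocks with a minimal preamble and document body.
renderDocument :: [Block] -> String
renderDocument blocks =
  intercalate "\n"
    [ "\\documentclass{article}"
    , "\\begin{document}"
    , intercalate "\n\n" (map renderBlock blocks)
    , "\\end{document}"
    ]
```

For example, `renderInline (Emph "I am emphasized")` gives `\emph{I am emphasized}`.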

Part I - First steps

The first thing to sort out is how to parse a file in the first place. After a quick Google search, I was led to the online book Real World Haskell.
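For orientation, this is roughly what running a Parsec parser on a string looks like, assuming the parsec package is installed. The names `word` and `parseWord` are made up for this sketch.

```haskell
import Text.ParserCombinators.Parsec

-- A parser that accepts one or more letters.
word :: Parser String
word = many1 letter

-- Run the parser on an input string. The second argument to `parse` is a
-- source name that is only used in error messages.
parseWord :: String -> Either String String
parseWord input = case parse word "" input of
  Left err  -> Left (show err)
  Right val -> Right val
```

Here `parseWord "hello world"` returns `Right "hello"`, since the parser stops at the first non-letter, while `parseWord "123"` returns a `Left` carrying the error message.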

A simple first goal is to parse text containing italics and/or bold characters. For italics, text will be wrapped with *, bold with **, and bold italics with *_text_*. Note that this does not include mixed emphasis like *italics **bold italic** italics*.

Here is my first attempt at the parser:

    import System.IO
    import Control.Monad
    import Text.ParserCombinators.Parsec
    import Data.List

    type Latex = String

    -- Delimiters for each kind of emphasis.
    emphasisSymbol :: Parser Char
    emphasisSymbol = char '*'

    boldSymbol :: Parser String
    boldSymbol = string "**"

    beginBoldEmphasisSymbol :: Parser String
    beginBoldEmphasisSymbol = string "*_"

    endBoldEmphasisSymbol :: Parser String
    endBoldEmphasisSymbol = string "_*"

    -- Characters allowed inside each kind of emphasis.
    emphacizedChar :: Parser Char
    emphacizedChar = noneOf "*"

    -- A '_' not followed by '*' is dropped and the next character returned;
    -- otherwise any character other than '_' is accepted.
    boldEmphasizedChar :: Parser Char
    boldEmphasizedChar = try (char '_' >> noneOf "*")
                     <|> noneOf "_"
                     <?> "Didn't find bold emph."

    -- A lone '*' not followed by another '*' is dropped and the next
    -- character returned; otherwise any character other than '*' is accepted.
    boldChar :: Parser Char
    boldChar = try (char '*' >> noneOf "*")
           <|> noneOf "*"
           <?> "Didn't find bold."

    -- Each block parser reads its opening symbol, the content, and the
    -- closing symbol, and returns only the content.
    boldEmphasis :: Parser String
    boldEmphasis = do
        beginBoldEmphasisSymbol
        content <- many1 boldEmphasizedChar
        endBoldEmphasisSymbol
        return content

    emphasis :: Parser String
    emphasis = do
        emphasisSymbol
        content <- many1 emphacizedChar
        emphasisSymbol
        return content

    bold :: Parser String
    bold = do
        boldSymbol
        content <- many1 boldChar
        boldSymbol
        return content

    -- Try each kind of emphasis in turn, falling back to plain text.
    bodyText :: Parser String
    bodyText = try boldEmphasis <|> try bold <|> try emphasis <|> many1 (noneOf "*")

    htexFile :: Parser [String]
    htexFile = many bodyText

    readInput :: String -> [Latex]
    readInput input = case parse htexFile "" input of
        Left err  -> ["No match " ++ show err]
        Right val -> val

    -- Print each parsed chunk on its own line.
    main :: IO ()
    main = do
        contents <- readFile "input.htex"
        putStrLn $ intercalate "\n" (readInput contents)

The function bodyText is where all of the parsers are put together. The order is important: boldEmphasis and bold should come before emphasis. Otherwise emphasis would consume the first * of *_text_* and treat the _ as an ordinary character; and if emphasis ever accepted empty content, it would read ** as an opening * immediately followed by a closing *. I can foresee combining these three functions, and choosing which one to apply after the first * has been consumed.
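The effect of the ordering can be seen in a small sketch. The parsers below are simplified stand-ins for the ones in the listing (boldEmphasis here simply stops at the first _), and `goodOrder`, `badOrder` and `run` are names made up for this illustration.

```haskell
import Text.ParserCombinators.Parsec

-- Simplified: text between single asterisks.
emphasis :: Parser String
emphasis = do
  _ <- char '*'
  content <- many1 (noneOf "*")
  _ <- char '*'
  return content

-- Simplified: text between *_ and _*.
boldEmphasis :: Parser String
boldEmphasis = do
  _ <- string "*_"
  content <- many1 (noneOf "_")
  _ <- string "_*"
  return content

-- The more specific parser first, as in bodyText.
goodOrder :: Parser String
goodOrder = try boldEmphasis <|> try emphasis

-- The general parser first: emphasis also accepts bold-italic input.
badOrder :: Parser String
badOrder = try emphasis <|> try boldEmphasis

-- Run a parser and discard the error details.
run :: Parser String -> String -> Maybe String
run p input = either (const Nothing) Just (parse p "" input)
```

With these definitions, `run goodOrder "*_x_*"` gives `Just "x"`, whereas `run badOrder "*_x_*"` gives `Just "_x_"`: emphasis succeeds first and keeps the underscores as ordinary characters.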

Each of the smaller parsers is straightforward: it parses the opening symbol that starts the special block, then reads text until it hits the matching closing symbol.

Incorporating text such as *italics **bold italic** italics* could involve modifying emphasis to something like

    emphasis = do emphasisSymbol
                  content <- many1 emphacizedChar <|> boldEmphasis
                  emphasisSymbol
                  return content

This would then involve having two parsers for bold italics, which is not ideal. For this reason, because bold italics should not be used all that often (if at all), and because of the way LaTeX renders bold italics (stack exchange discussion, and another on LaTeX emphasis commands), I will exclude bold italics until it’s really needed.

Output

Compiling and running with input.htex containing This is not in italics. *But this is.* **This is bold.** This is not bold. *_This is in bold italics._* This is not in bold italics. will print the following:

This is not in italics.
But this is.

This is bold.
 This is not bold.
This is in bold italics.
 This is not in bold italics.

Promising!

Next up, I will introduce links using the same format as Markdown ([link](url)), and also attempt to actually produce LaTeX output.