webtex

The making of a LaTeX pre-processor with Haskell - Part II

16 February 2015

In the previous part, the program could parse text for emphasized and bold characters. So there’s some functionality, except that it doesn’t do anything useful like produce output. In this part, I’ll implement Markdown-style link parsing (this is really more of an implementation of Markdown than anything else at the moment), and then turn the parsed text into something useful!

To start off with, I removed support for bold emphasized text. This is largely because there’s no plan yet for how to deal with embedded styles, so I’d rather not start now. Link parsing lives in the link function, which is very similar to the other parsers, except that it captures two strings (the description and the link target). It was surprisingly easy to put together, and it illustrates the parsing process a bit more clearly. Here is the resulting code:

    import System.IO
    import Control.Monad
    import Text.ParserCombinators.Parsec
    import Data.List

    type Latex = String

    emphasisSymbol :: Parser Char
    emphasisSymbol = char '*'

    boldSymbol :: Parser String
    boldSymbol = string "**"

    emphacizedChar :: Parser Char
    emphacizedChar = noneOf "*"

    -- Accepts any non-'*' character, or a lone '*' followed by a non-'*'
    -- (in which case the '*' itself is dropped).
    boldChar :: Parser Char
    boldChar = try (do char '*'
                       noneOf "*")
           <|> noneOf "*"
           <?> "Didn't find bold."

    -- Plain text stops at anything that could start a style or a link.
    textChar :: Parser Char
    textChar = noneOf "*("

    linkChar :: Parser Char
    linkChar = noneOf "]"

    linkDescriptionChar :: Parser Char
    linkDescriptionChar = noneOf ")"

    emphasis = do emphasisSymbol
                  content <- many1 emphacizedChar
                  emphasisSymbol
                  return content

    bold = do boldSymbol
              content <- many1 boldChar
              boldSymbol
              return content

    -- Links are written as (description)[target].
    link = do char '('
              description <- many1 linkDescriptionChar
              string ")["
              target <- many1 linkChar
              char ']'
              return (target ++ " " ++ description)

    -- Order matters: bold must be tried before emphasis, since both start with '*'.
    bodyText = try link <|> try bold <|> try emphasis <|> many1 textChar

    htexFile = many bodyText

    readInput :: String -> [Latex]
    readInput input = case parse htexFile "" input of
        Left err  -> ["No match " ++ show err]
        Right val -> val

    main = do
        contents <- readFile "input.htex"
        -- Join the parsed fragments back into a single string.
        putStrLn (concat (readInput contents))

So not a whole lot changed. Without too much effort, it’s reasonably easy to follow the parser starting from bodyText.
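To make that walk-through concrete, here is a minimal sketch that condenses the parsers above into a self-contained module and runs them over a short string. The helper name fragments is my own assumption for illustration; the parsers are meant to behave like the ones in the listing, just written more compactly.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Condensed versions of the parsers above; each returns the raw content.
emphasis :: Parser String
emphasis = char '*' *> many1 (noneOf "*") <* char '*'

bold :: Parser String
bold = string "**" *> many1 boldChar <* string "**"
  where
    boldChar = try (char '*' *> noneOf "*") <|> noneOf "*"

-- Links are written as (description)[target].
link :: Parser String
link = do
  _ <- char '('
  description <- many1 (noneOf ")")
  _ <- string ")["
  target <- many1 (noneOf "]")
  _ <- char ']'
  return (target ++ " " ++ description)

-- Alternatives are tried left to right; plain text is the fallback.
bodyText :: Parser String
bodyText = try link <|> try bold <|> try emphasis <|> many1 (noneOf "*(")

-- Hypothetical helper: split a whole string into its parsed fragments.
fragments :: String -> Either ParseError [String]
fragments = parse (many bodyText) ""
```

In GHCi, fragments "a *b* **c** (d)[e]" gives Right ["a ","b"," ","c"," ","e d"]: one fragment per alternative that matched. One thing this sketch makes visible is that many bodyText is not followed by eof, so fragments "a (b) c" returns Right ["a "]. A parenthesis that does not start a link stops the parse early instead of raising an error.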

The next thing to do is make the parser return LaTeXified text. This turned out to be simpler than I thought, though it helps that all I am doing is returning strings. All that was required was to wrap the content in the return statements with LaTeX commands, and to have the main function simply concatenate the resulting parsed text:

    emphasis = do emphasisSymbol
                  content <- many1 emphacizedChar
                  emphasisSymbol
                  return ("\\emph{" ++ content ++ "}")

    bold = do boldSymbol
              content <- many1 boldChar
              boldSymbol
              return ("\\textbf{" ++ content ++ "}")

    link = do char '('
              description <- many1 linkDescriptionChar
              string ")["
              target <- many1 linkChar
              char ']'
              return ("\\href{" ++ target ++ "}{" ++ description ++ "}")

    main = do
        contents <- readFile "E:\\Google Drive\\Code\\latexpreprocessor\\input.htex"
        -- Join the LaTeX fragments back into a single string.
        putStrLn (concat (readInput contents))

Running it with input.htex containing

This is not in italics. *But this is.* **This is bold.** This is not bold. (Description for a link)[link]. This is not in bold italics.

results in

This is not in italics. \emph{But this is.} \textbf{This is bold.} This is not bold. \href{link}{Description for a link}. This is not in bold italics.

Easy! It’s not quite right, since you can’t just run LaTeX on the output, but that requires a bit more thinking to manage package handling and other preamble bits.
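As a rough idea of what the missing preamble step might look like, here is a minimal sketch that wraps the parsed fragments in a document skeleton. The function name wrapDocument, the article class, and the idea of passing the package list in by hand are all assumptions of mine, not part of the program above. The one firm requirement is that \href needs the hyperref package.

```haskell
-- Hypothetical helper: wrap LaTeX fragments in a minimal document.
-- The document class is hard-coded to "article" purely for illustration.
wrapDocument :: [String] -> [String] -> String
wrapDocument packages fragments = unlines $
     ["\\documentclass{article}"]
  ++ map (\p -> "\\usepackage{" ++ p ++ "}") packages
  ++ ["\\begin{document}", concat fragments, "\\end{document}"]
```

With the parser above, something like putStr (wrapDocument ["hyperref"] (readInput contents)) would then emit a file that LaTeX can at least attempt to compile. Working out the package list from the commands actually used is the part that still needs more thought.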

Where to?

At this point, I could continue and create a fully fledged pre-processor, but would it satisfy a true need? As I said in the previous part, LaTeX syntax is somewhat verbose for a lot of uses. What I can see my pre-processor doing is allowing users to supply only content to create PDFs, with some customization via something like a YAML header (like Jekyll), allowing for a bit of a “hands off” experience. Pandoc already allows for the creation of LaTeX files from Markdown, so what further use would a slightly different Markdown derivative be?

Templates are another possible extension, but this functionality is already available in Pandoc, and can be implemented fairly easily in Python using Jinja. One problem with templates is that, with the separation of content and design, the content must be written without regard for where it ends up on the page. This reduces to having to edit a .tex file anyway.

This all applies to normal text, not math notation, since there’s no real alternative markup for math. Which is a real shame, since LaTeX is a bit too verbose for what you need it for (especially if you’re writing any calculus). As for ordinary text, in my experience helping create my partner’s master’s thesis in LyX, all that’s needed is a good framework to handle bibliographies and references.

So with all of the above, I’m going to leave this project as is. On the plus side, what if you could write math in Haskell? Say, if you wanted to typeset

    naturalNumbers = [1..]
    [x | x <- naturalNumbers, x < 20]

as LaTeX? This doesn’t look too difficult, and could produce the output:

\{x | x \in \mathbb{N}, x < 20 \}

Looks fairly trivial to do. How about some calculus? How would you represent

\frac{d \theta}{d \gamma} = \lambda^{x + \epsilon} \frac{d^2 \theta}{d \gamma^2}

in Haskell? This example is a bit complicated math-wise, since we don’t know what \theta is as a function, and \lambda may not be invertible, so rearranging is not so easy. In addition, both sides of the equation have functions applied to \theta. But luckily, we don’t care about the math, only the typesetting! Inside a LaTeX (or even HTML) file, one could have:

    -- Sketch, not compilable as is: (+) assumes a Num instance for Expression.
    gamma = Var
    lambda = Var
    epsilon = Var
    x = Var
    theta = Func gamma

    leftEquation = Derivative theta gamma
    rightEquation = (lambda `pow` (x + epsilon)) `multiply` (Derivative (Derivative theta gamma) gamma)

    equationOne :: (Expression, Expression)
    equationOne = (leftEquation, rightEquation)

Where the type definitions are:

    class ExpressionOps a where
        -- Each operation combines two values of the same type into another one.
        multiply :: a -> a -> a
        pow :: a -> a -> a
        frac :: a -> a -> a

    data Expression = Var
                    | Func Expression
                    | Func2 Expression Expression
                    | Derivative Expression Expression

    -- All three operations build the same Func2 node, so the operator itself
    -- is not recorded in the value.
    instance ExpressionOps Expression where
        multiply exp1 exp2 = Func2 exp1 exp2
        pow exp1 exp2 = Func2 exp1 exp2
        frac exp1 exp2 = Func2 exp1 exp2

A parser could then use type introspection (to recover variable names and operations) while evaluating equationOne, mapping each name to a LaTeX command listed in a text file.
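For comparison, here is a minimal sketch of a simpler route that avoids introspection altogether: each constructor stores the LaTeX name or operator it stands for, and a render function walks the tree. This deliberately departs from the definitions above. The constructors Add, Pow and Mul, the helper plus, and the functions render and renderEquation are all assumptions made for illustration, and no attempt is made to insert parentheses where precedence would need them.

```haskell
-- Assumption: unlike the definitions above, each constructor stores the
-- LaTeX name or operator it stands for, so no introspection is needed.
data Expression
  = Var String                        -- a symbol such as "\\gamma"
  | Func String Expression            -- a named function of one variable
  | Add Expression Expression
  | Pow Expression Expression
  | Mul Expression Expression
  | Derivative Expression Expression  -- function, then variable
  deriving (Eq, Show)

plus, pow, multiply :: Expression -> Expression -> Expression
plus = Add
pow = Pow
multiply = Mul

-- Render an expression as a LaTeX math string (no parenthesisation).
render :: Expression -> String
render (Var name) = name
render (Func name _) = name  -- only the function symbol is printed
render (Add a b) = render a ++ " + " ++ render b
render (Pow a b) = render a ++ "^{" ++ render b ++ "}"
render (Mul a b) = render a ++ " " ++ render b
render e@(Derivative _ v) =
  "\\frac{d" ++ order ++ " " ++ render f ++ "}{d " ++ render v ++ order ++ "}"
  where
    (n, f) = peel e
    order = if n == 1 then "" else "^" ++ show n
    -- Count how many times in a row the same variable is differentiated.
    peel :: Expression -> (Int, Expression)
    peel (Derivative g w) | w == v = let (k, h) = peel g in (k + 1, h)
    peel other = (0, other)

renderEquation :: (Expression, Expression) -> String
renderEquation (lhs, rhs) = render lhs ++ " = " ++ render rhs

gamma, lambda, epsilon, x, theta :: Expression
gamma = Var "\\gamma"
lambda = Var "\\lambda"
epsilon = Var "\\epsilon"
x = Var "x"
theta = Func "\\theta" gamma

equationOne :: (Expression, Expression)
equationOne =
  ( Derivative theta gamma
  , (lambda `pow` (x `plus` epsilon)) `multiply` Derivative (Derivative theta gamma) gamma
  )
```

In GHCi, putStrLn (renderEquation equationOne) prints \frac{d \theta}{d \gamma} = \lambda^{x + \epsilon} \frac{d^2 \theta}{d \gamma^2}, which is the equation from earlier. The trade-off is that the names have to be typed out as strings, which is exactly the kind of repetition the introspection idea is trying to avoid.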

In this case, the Haskell version is a bit longer, and its readability compared to the LaTeX version is debatable. But one thing it does allow you to do is reuse functions quickly (without using those bloody backslashes everywhere!). For instance, if you wanted to rewrite the above as a system of differential equations, all you would have to do is write

    diff1 = (y1, leftEquation)
    diff2 = (y2, rightEquation)

And you’re set!

One other way that may work is to take working mathematical functions and rely heavily on type introspection to get the metadata. But in the case of the above differential equation, theta is unknown and may not have a closed form. So you would need some method of saying that theta is a function that explicitly depends on x, y, and z. Using this method opens up support for incorporating other languages, but I’m not sure there’d be a reliable way of implementing it, considering the diverse range of languages used in the computing world. I’ll stick to the former method and see where it takes me.