A static site generator
When I decided to start blogging, it was mostly a way for me to learn and remember all the tech things I picked up over time. I also want to explore a wide variety of technologies rather than focus on a particular one.
Hence, to start blogging, I obviously needed a static site generator. Many of them already exist, Hugo for example, but rewriting one from scratch is typically the kind of exercise I want to throw myself into. The advantage of a static site is clearly its loading speed: a simple HTML file, combined with a small linked CSS file, and a whole new blog is born. Anyway, writing this static site generator from scratch is also the perfect excuse to explore a not so widely known technology for manipulating text files.
Introduction to AWK
AWK, named after the initials of its creators (Aho, Weinberger and Kernighan), is an old and powerful text-processing tool. Syntactically close to C, it is a scripting language for manipulating text input. Its Wikipedia page sums up its story nicely. I thought it would be clever to use it for a site generator, to parse Markdown files and generate HTML ones. However, according to this listing of static site generator programs, someone else has had the same idea. Hence the following, as well as my code, is heavily inspired by Zodiac (even though the repo has not been touched for 8 years).
Parsing markdown
Following the official syntax is a good start for a parser. AWK works as follows: it takes an optional regex and executes the code between braces, like a function, on each line of the text input that matches. For example:
/^#/ {
print "<h1>" $0 "</h1>"
}
$n refers to the n-th field of the line (split according to a delimiter, as in a CSV), while the special $0 refers to the whole line.
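To make the difference between $n and $0 concrete, here is a minimal sketch that can be run from a shell. The function name show_fields and the sample line are made up for illustration; -F is the standard awk option that sets the field delimiter.

```shell
# Hypothetical helper: print the first field, the second field and the whole line.
show_fields() {
  awk -F ',' '{ print "first=" $1 " second=" $2 " whole=" $0 }'
}

printf 'awk,1977,text processing\n' | show_fields
# first=awk second=1977 whole=awk,1977,text processing
```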
In this case, for each line starting with #, awk prints (to the standard output) <h1>[content of the line]</h1>.
This is a first step toward parsing headers in Markdown.
However, trying this, we immediately see that # is part of the whole line, so it also appears in the HTML, whereas it should not.
AWK has a way to prevent this: as a complete scripting language, it comes with built-in functions that enable further manipulations.
substr does what its name indicates: it returns a substring of its argument.
/^#/ {
print "<h1>" substr($0, 3) "</h1>"
}
In the example above, as per the documentation, it returns the substring of $0 starting at position 3 (1 being # and 2 the whitespace following it) up to the end of the line.
This is better, and we can now generalize it to all headers. Another function, match, records the number of characters matched by a regex in the RLENGTH variable, which lets the script dynamically determine the depth of the header it is parsing:
/^#+ / {
match($0, /#+ /);
n = RLENGTH;
print "<h" n-1 ">" substr($0, n + 1) "</h" n-1 ">"
}
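As a quick check of the header rule, the sketch below wraps it in a shell function (render_headers is a made-up name) and feeds it two sample lines. The intermediate variable level is only there for readability; it holds RLENGTH minus the trailing space.

```shell
# Hypothetical wrapper around the header rule, for experimentation.
render_headers() {
  awk '/^#+ / {
    match($0, /#+ /)          # RLENGTH = number of "#" plus the trailing space
    level = RLENGTH - 1       # header depth: 1 for "#", 2 for "##", ...
    print "<h" level ">" substr($0, RLENGTH + 1) "</h" level ">"
  }'
}

printf '# Title\n### Deep section\n' | render_headers
# <h1>Title</h1>
# <h3>Deep section</h3>
```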
Reproducing this technique to parse the rest proves difficult: lists, for example, are not contained in a single line, so how do we know when to close them with </ul> or </ol>?
Introducing a LIFO stack
Since, according to the Markdown syntax, it is possible to have nested blocks such as headers and lists within blockquotes, or lists within lists, I came up with the simple idea of tracking the current environment in a stack in AWK. It turned out to be easy: I only needed a pointer to track the size of the LIFO, a function to push an element, and another one to pop one out:
BEGIN {
env = "none"
stack_pointer = 0
push(env)
}
# Function to push a value onto the stack
function push(value) {
stack_pointer++
stack[stack_pointer] = value
}
# Function to pop a value from the stack (LIFO)
# "value" is listed as an extra parameter so that it stays local to the function
function pop(    value) {
if (stack_pointer > 0) {
value = stack[stack_pointer]
delete stack[stack_pointer]
stack_pointer--
return value
} else {
return "empty"
}
}
The stack does not have to be explicitly declared. The values inside the LIFO correspond to the current Markdown environment.
This is a handy trick: when I need to close an HTML tag, I put the popped element between a </ and a > instead of maintaining a lookup table.
I also used a simple last() function to return the last pushed value in the stack without popping it:
# Function to get last value in LIFO
function last() {
return stack[stack_pointer]
}
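To see the three helpers working together, here is a small self-contained sketch. It is a compact restatement of push, pop and last, not the exact code of the generator, and unwind_demo is a made-up name. It simulates a list opened inside a blockquote and then unwinds the stack, printing each closing tag from the popped value.

```shell
# Hypothetical demo: open a list inside a blockquote, then unwind the stack.
unwind_demo() {
  awk '
    function push(value) { stack[++stack_pointer] = value }
    function last()      { return stack[stack_pointer] }
    function pop(    value) {
      if (stack_pointer == 0) return "empty"
      value = stack[stack_pointer]
      delete stack[stack_pointer--]
      return value
    }
    BEGIN {
      push("none"); push("blockquote"); push("ul")
      # The innermost environment is closed first.
      while (last() != "none") print "</" pop() ">"
    }
  '
}

unwind_demo
# </ul>
# </blockquote>
```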
This way, parsing lists became trivial :
# Matching unordered lists
/^[-+*] / {
env = last()
if (env == "ul" ) {
# In an unordered list block, print a new item
print "<li>" substr($0, 3) "</li>"
} else {
# Otherwise, init the unordered list block
push("ul")
print "<ul>\n<li>" substr($0, 3) "</li>"
}
}
I believe the code is pretty self-explanatory: when the last environment is not ul, we enter this environment, which translates to pushing it onto the stack.
Otherwise, we are already reading a list, and we only need to add a new element to it.
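The rule above opens a list but never closes it before the end of the input. One possible way to handle that, sketched below, is to treat a blank line as the end of the current list. This is an assumption made for illustration, not necessarily how the generator does it, and render_list is a made-up name.

```shell
# Hypothetical sketch: unordered lists that are closed by a blank line.
render_list() {
  awk '
    function push(value) { stack[++stack_pointer] = value }
    function last()      { return stack[stack_pointer] }
    function pop(    value) {
      value = stack[stack_pointer]
      delete stack[stack_pointer--]
      return value
    }
    BEGIN { push("none") }
    /^[-+*] / {
      if (last() != "ul") { push("ul"); print "<ul>" }
      print "<li>" substr($0, 3) "</li>"
    }
    # Assumption: a blank line terminates the current list.
    /^$/ { if (last() == "ul") print "</" pop() ">" }
    # Close whatever is still open at the end of the input.
    END  { while (last() != "none") print "</" pop() ">" }
  '
}

printf '%s\n' '- one' '- two' '' '- three' | render_list
# <ul>
# <li>one</li>
# <li>two</li>
# </ul>
# <ul>
# <li>three</li>
# </ul>
```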
Parsing the simple paragraph and ending the parser
I showed examples of lists and headers, but it works the same way for code blocks, blockquotes, etc. Only the simple paragraph is different: it does not start with a specific character. So, to match it, we match every line that does not start with a special character. I have no idea if this is the best solution, but so far it has proved to work:
# Matching a simple paragraph
!/^(#|\*|-|\+|>|`|$|\t| )/ {
env = last()
if (env == "none") {
# If no block, print a paragraph
print "<p>" replaceEmAndStrong($0) "</p>"
} else if (env == "blockquote") {
print $0
}
}
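The paragraph rule calls replaceEmAndStrong(), whose body is not shown in this post. The sketch below is a hypothetical stand-in, not the actual implementation: it assumes only the asterisk forms (**strong** and *em*) need handling and uses match and substr, since POSIX awk has no backreferences in gsub. The wrapper name emphasize is also made up.

```shell
# Hypothetical stand-in for replaceEmAndStrong(); the real version may differ.
emphasize() {
  awk '
    function replaceEmAndStrong(s) {
      # Handle **strong** first so its asterisks are not mistaken for *em*.
      while (match(s, /\*\*[^*]+\*\*/))
        s = substr(s, 1, RSTART - 1) "<strong>" substr(s, RSTART + 2, RLENGTH - 4) "</strong>" substr(s, RSTART + RLENGTH)
      while (match(s, /\*[^*]+\*/))
        s = substr(s, 1, RSTART - 1) "<em>" substr(s, RSTART + 1, RLENGTH - 2) "</em>" substr(s, RSTART + RLENGTH)
      return s
    }
    { print "<p>" replaceEmAndStrong($0) "</p>" }
  '
}

printf 'a **bold** and *soft* word\n' | emphasize
# <p>a <strong>bold</strong> and <em>soft</em> word</p>
```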
Like BEGIN, AWK provides the possibility to execute code at the very end of the input, with the END keyword.
Naturally, we need to empty the stack and close all HTML tags that might have been opened during parsing.
It is only a while loop that runs until the last environment is "none", as it was initialized:
END {
env = last()
while (env != "none") {
env = pop()
print "</" env ">"
env = last()
}
}
This way we are able to parse Markdown and turn it into an HTML file. Of course, I am aware that this lacks emphasis, strong and code within a line of text. I did implement them, but they may be explained in another edit of this post. Nonetheless, the code can still be consulted on GitHub.
A testing suite for markdown parser
Having a markdown parser is cool, having one well tested is better. I embarked on writing a testing suite for Markdown parsers. I wanted it to be generic, meaning you only have to provide a parsing program that takes Markdown on the standard input and returns HTML on the standard output. All tests would be provided by the test suite.
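The contract described above (Markdown on standard input, HTML on standard output) can be sketched as a tiny harness. Everything here is hypothetical: the name run_case, its argument order and the output messages are assumptions for illustration, not the interface of the actual test suite.

```shell
# Hypothetical harness: run_case MARKDOWN EXPECTED_HTML PARSER [ARGS...]
# Feeds MARKDOWN to the parser on stdin and compares its stdout with EXPECTED_HTML.
run_case() {
  input=$1
  expected=$2
  shift 2
  actual=$(printf '%s\n' "$input" | "$@")
  if [ "$actual" = "$expected" ]; then
    echo "ok"
  else
    echo "FAIL: expected [$expected], got [$actual]"
    return 1
  fi
}

# Example with a deliberately tiny parser that only handles level-1 headers.
run_case '# Hello' '<h1>Hello</h1>' awk '/^# / { print "<h1>" substr($0, 3) "</h1>" }'
# ok
```

Passing the parser as a command plus arguments keeps the harness independent of the language the parser is written in.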