# Markdown to HTML using AWK

When I decided to start blogging, it was mostly for me to learn, and to remember, all the tech things I have picked up over time.
I also want to explore a wide diversity of technologies rather than focus on a particular one.

Hence, to start blogging, I obviously needed a static site generator.
Many of them already exist, Hugo for example; however, rewriting one from scratch is typically the kind of exercise I want to throw myself into.
The advantage of a static site is clearly its loading speed: a simple HTML file, combined with a small, polished CSS file, and a whole new blog is born.
Anyway, writing this static site generator from scratch is also the perfect excuse to explore a not so widely known technology for manipulating text files.

## Introduction to AWK

AWK, named after the initials of its creators (Aho, Weinberger and Kernighan), is an old and powerful text-manipulation tool. Syntactically close to C, it is a scripting language for processing text input.
Its [Wikipedia page](https://en.wikipedia.org/wiki/AWK) sums up its story nicely.
I thought it would be clever to use it for a site generator, to parse markdown files and generate HTML ones.
However, according to this [listing](https://jamstack.org/generators/) of static site generators, someone else has already had the same idea.
Hence the following, as well as my code, is heavily inspired by [Zodiac](https://github.com/nuex/zodiac) (even though the repo has not been touched for 8 years).

## Parsing markdown

Following the official [syntax](https://daringfireball.net/projects/markdown/syntax) is a good start for a parser.
AWK works as follows: a program is a list of rules, each made of an optional pattern (typically a regex) and an action between braces; for every line of the input, the action of each matching rule is executed.
For example:

    /^#/ {
        print "<h1>" $0 "</h1>"
    }

Although `$n` refers to the n-th field of the line (split on a delimiter, like the columns of a CSV), the special `$0` refers to the whole line.
In this case, for each line starting with `#`, awk prints (to the standard output) `<h1>[content of the line]</h1>`.
This is the beginning of header parsing in markdown.
However, by trying this, we immediately see that `#` is part of the whole line, hence it also appears in the HTML whereas it should not.
AWK has a way to prevent this: as a complete scripting language, it comes with built-in functions that enable further manipulation.
`substr` does what its name indicates: it returns a substring of its argument.

    /^#/ {
        print "<h1>" substr($0, 3) "</h1>"
    }

In the example above, as per the [documentation](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html#index-substr_0028_0029-function),
it returns the substring of `$0` starting at index 3 (index 1 being `#` and index 2 the whitespace following it) up to the end of the line.

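To make the difference between `$n`, `$0` and `substr` concrete, here is a small shell sketch. The sample input and the function names `second_field` and `h1` are made up for illustration, and the header pattern is narrowed to `/^# /` so that only first-level headers are matched.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the second comma-separated field of each input line.
second_field() {
    awk -F, '{ print $2 }'
}

# Wrap every line starting with "# " in <h1>, dropping the first two characters.
h1() {
    awk '/^# / { print "<h1>" substr($0, 3) "</h1>" }'
}

printf 'awk,1977,text\n' | second_field
printf '# Hello world\nnot a header\n' | h1
```

This prints `1977`, then `<h1>Hello world</h1>`: the second input line of the last command matches no rule, so nothing is printed for it.
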
This is better, and we can now generalize it to all headers. Another function, `match`, finds the part of a string matched by a regex
and allows the script to dynamically determine the depth of the header it is parsing. The length of the match is stored in the global variable `RLENGTH`:

    /^#+ / {
        # RLENGTH counts the run of `#` plus the space that follows it
        match($0, /#+ /)
        n = RLENGTH
        # n - 1 is the header level; the title text starts at index n + 1
        print "<h" (n - 1) ">" substr($0, n + 1) "</h" (n - 1) ">"
    }

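A quick way to check the rule is to feed it headers of different depths. The wrapper name `headers` and the sample lines are hypothetical; the rule itself is the one above, with the level stored in a variable.

```shell
#!/bin/sh
# Hypothetical wrapper around the header rule, for illustration only.
headers() {
    awk '/^#+ / {
        match($0, /#+ /)          # RLENGTH = number of "#" plus one space
        level = RLENGTH - 1
        print "<h" level ">" substr($0, RLENGTH + 1) "</h" level ">"
    }'
}

printf '# Title\n## Section\n### Subsection\n' | headers
```

Each line comes out wrapped in `<h1>`, `<h2>` and `<h3>` respectively. A line such as `#nospace` is ignored, since the pattern requires a space after the run of `#`.
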
Reproducing this technique to parse the rest proves to be more difficult: lists, for example, are not contained in a single line, hence
how do we know when to close them with `</ul>` or `</ol>`?

## Introducing a LIFO stack

Since, according to the markdown syntax, it is possible to have nested blocks such as headers and lists within blockquotes, or lists within lists, I came up with the simple idea of tracking the current environment in a stack in AWK.
It turned out to be easy: I only needed a pointer to track the size of the LIFO, a function to push an element, and another one to pop one out:

    BEGIN {
        env = "none"
        stack_pointer = 0
        push(env)
    }

    # Function to push a value onto the stack
    function push(value) {
        stack_pointer++
        stack[stack_pointer] = value
    }

    # Function to pop a value from the stack (LIFO)
    # `value` is listed as an extra parameter to make it local to the function
    function pop(    value) {
        if (stack_pointer > 0) {
            value = stack[stack_pointer]
            delete stack[stack_pointer]
            stack_pointer--
            return value
        } else {
            return "empty"
        }
    }

The stack does not have to be explicitly declared: AWK arrays are created on first use. The values inside the LIFO correspond to the current markdown environment.
This is a handy trick, because when I need to close an HTML tag, I print the popped element between a `</` and a `>` instead of maintaining a lookup table.

I also used a simple `last()` function to return the last pushed value in the stack without popping it:

    # Function to get last value in LIFO
    function last() {
        return stack[stack_pointer]
    }

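The three helpers can be exercised on their own, without any markdown input. The sketch below is a hypothetical demo (the shell function name `stack_demo` and the pushed values are only for illustration); everything runs in `BEGIN`, so no input is read.

```shell
#!/bin/sh
# Hypothetical demo of the stack helpers; everything runs in BEGIN.
stack_demo() {
    awk '
    function push(value) { stack[++stack_pointer] = value }
    function pop(    value) {
        if (stack_pointer > 0) {
            value = stack[stack_pointer]
            delete stack[stack_pointer]
            stack_pointer--
            return value
        }
        return "empty"
    }
    function last() { return stack[stack_pointer] }

    BEGIN {
        push("none"); push("blockquote"); push("ul")
        print last()     # ul: the innermost open block
        print pop()      # ul
        print pop()      # blockquote
        print last()     # none: back to the initial environment
    }'
}

stack_demo
```

The output is `ul`, `ul`, `blockquote`, `none`, one per line: blocks come out in the reverse order of their opening, which is exactly the order in which HTML tags must be closed.
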
This way, parsing lists became trivial:

    # Matching unordered lists
    /^[-+*] / {
        env = last()
        if (env == "ul") {
            # In an unordered list block, print a new item
            print "<li>" substr($0, 3) "</li>"
        } else {
            # Otherwise, init the unordered list block
            push("ul")
            print "<ul>\n<li>" substr($0, 3) "</li>"
        }
    }

I believe the code is pretty self-explanatory: when the last environment is not `ul`, we enter this environment,
which translates to pushing it onto the stack.
Otherwise, we are already reading a list, and we only need to add a new item to it.

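The rule above opens the list, but something still has to print `</ul>`. One possible answer, sketched below, is to let a blank line close whatever block is currently open. This is a simplified sketch under that assumption (nested lists are ignored), and the shell function name `lists` is hypothetical.

```shell
#!/bin/sh
# Simplified sketch: assumes a blank line always ends the current list.
lists() {
    awk '
    function push(value) { stack[++stack_pointer] = value }
    function pop(    value) {
        if (stack_pointer > 0) {
            value = stack[stack_pointer]
            delete stack[stack_pointer]
            stack_pointer--
            return value
        }
        return "empty"
    }
    function last() { return stack[stack_pointer] }

    BEGIN { push("none") }

    # A list item: open the list first if we are not already inside one
    /^[-+*] / {
        if (last() != "ul") {
            push("ul")
            print "<ul>"
        }
        print "<li>" substr($0, 3) "</li>"
    }

    # A blank line closes whatever block is currently open
    /^$/ {
        if (last() != "none") {
            print "</" pop() ">"
        }
    }'
}

printf '%s\n' '- one' '- two' '' | lists
```

The two items end up between `<ul>` and `</ul>`, and the closing tag is built from the popped value, so the same blank-line rule would work unchanged for `ol` or `blockquote`.
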
## Parsing the simple paragraph and ending the parser

I showed examples of lists and headers, but it works the same way for code blocks, blockquotes, etc. Only the simple paragraph is different:
it does not start with a specific character. That is, to match it, we match every line that does not start with a special character.
I have no idea if this is the best solution, but so far it has proved to work:

    # Matching a simple paragraph
    !/^(#|\*|-|\+|>|`|$|\t| )/ {
        env = last()
        if (env == "none") {
            # If no block, print a paragraph
            print "<p>" $0 "</p>"
        } else if (env == "blockquote") {
            print $0
        }
    }

Like `BEGIN`, AWK provides the possibility to execute code at the very end of the input, with the `END` keyword.
Naturally, we need to empty the stack and close all HTML tags that might have been left open during the parsing.
It is only a `while` loop that runs until the last environment is "none", as it was initialized:

    END {
        env = last()
        while (env != "none") {
            env = pop()
            print "</" env ">"
            env = last()
        }
    }

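To see the `END` loop at work with nested blocks, here is a hypothetical demo in which two blocks are opened on purpose in `BEGIN` and never closed. The shell function name `close_all` is made up, and `/dev/null` is passed as input so that awk goes straight to `END`.

```shell
#!/bin/sh
# Hypothetical demo: two blocks are left open on purpose, END closes them.
close_all() {
    awk '
    function push(value) { stack[++stack_pointer] = value }
    function pop(    value) {
        if (stack_pointer > 0) {
            value = stack[stack_pointer]
            delete stack[stack_pointer]
            stack_pointer--
            return value
        }
        return "empty"
    }
    function last() { return stack[stack_pointer] }

    BEGIN {
        push("none")
        push("blockquote"); print "<blockquote>"
        push("ul");         print "<ul>"
    }

    # Unwind the stack until only the initial "none" is left
    END {
        while (last() != "none") {
            print "</" pop() ">"
        }
    }' /dev/null
}

close_all
```

The tags are closed innermost first: `</ul>` is printed before `</blockquote>`, which keeps the generated HTML well nested.
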
This way we are able to simply parse markdown and turn it into an HTML file.

## Parsing inline features

So far we have seen a way to parse blocks, but markdown also handles strong, emphasis and links. However, these tags can appear anywhere in a line.
Hence we need to be able to parse these lines independently of the block itself: indeed, a header can contain a strong and a link.

The previously introduced, and very useful, function `match` fits this need: it is literally a regex engine, looking for a pattern in a string.
Whenever the pattern is found, two global variables are filled:
- RSTART: the index of the first character of the match
- RLENGTH: the length of the match

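A tiny sketch helps to see what these two variables contain. The helper name `locate_em` and the sample lines are hypothetical; the pattern looks for a text surrounded by two `*`.

```shell
#!/bin/sh
# Hypothetical helper: report where the first starred text sits in each line.
locate_em() {
    awk '{
        if (match($0, /\*[^*]+\*/)) {
            print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
        } else {
            print RSTART, RLENGTH    # 0 and -1 when nothing matches
        }
    }'
}

printf 'some *text* here\nnothing here\n' | locate_em
```

The first line gives `6 6 *text*`: the match starts at index 6 and is 6 characters long, surrounding `*` included. The second line gives `0 -1`, the values AWK sets when the pattern is not found.
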
In what follows, `line` represents the line being processed, as the following `while` loops are actually part of a single function.

This way, `match(line, /\*([^*]+)\*/)` matches a run of characters containing no `*`, surrounded by two `*`: an emphasized text.
The `*` are escaped as they are special characters. The parentheses only group the inner pattern: `RSTART` and `RLENGTH` describe the whole match, surrounding `*` included.
To match several instances of emphasized text within a line, a simple `while` will do the trick.
We now only have to insert the HTML tags `<em>` and `</em>` at the right place around the matched text, and we are good to go.
We can save the global variables `RSTART` and `RLENGTH` for further use, in case they were to be changed by a later call to `match`. Using them we can also extract the
matched substrings and reconstruct the actual HTML string:

    while (match(line, /\*([^*]+)\*/)) {
        start = RSTART
        end = RSTART + RLENGTH - 1
        # Build the result: before match, <em>, content, </em>, after match
        line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
    }

The `while` loop enables us to repeat this process as many times as the pattern is encountered within the line.

We can now repeat the pattern for all inline features, e.g. strong and code.

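When repeating the pattern, the order of the loops matters: `**strong**` also contains a `*...*` sequence, so running the emphasis loop first would turn it into `*<em>strong</em>*`. The sketch below handles strong first, then emphasis, then code. The function names `inline` and `inline_md` are hypothetical, and escaping and nesting are deliberately ignored.

```shell
#!/bin/sh
# Hypothetical inline() function: strong first, then emphasis, then code.
inline_md() {
    awk '
    function inline(line,    start, len) {
        # **strong**: two delimiter characters on each side
        while (match(line, /\*\*[^*]+\*\*/)) {
            start = RSTART; len = RLENGTH
            line = substr(line, 1, start - 1) "<strong>" substr(line, start + 2, len - 4) "</strong>" substr(line, start + len)
        }
        # *emphasis*: one delimiter character on each side
        while (match(line, /\*[^*]+\*/)) {
            start = RSTART; len = RLENGTH
            line = substr(line, 1, start - 1) "<em>" substr(line, start + 1, len - 2) "</em>" substr(line, start + len)
        }
        # `code`: same shape as emphasis, with backticks
        while (match(line, /`[^`]+`/)) {
            start = RSTART; len = RLENGTH
            line = substr(line, 1, start - 1) "<code>" substr(line, start + 1, len - 2) "</code>" substr(line, start + len)
        }
        return line
    }
    { print inline($0) }'
}

printf '%s\n' 'a **bold** and *soft* `word`' | inline_md
```

Each loop terminates because every replacement removes the delimiters it matched, so the same span can never match twice.
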
The case of links is a bit more involved, as we need to extract two parts: the link text and the URL itself.
No real issue here: the naive way is to match the whole link, then look for both the text and the URL within the matched whole.

This way, `match(line, /\[([^\]]+)\]\([^\)]+\)/)` matches a text between `[]` followed by a text between `()`: the markdown representation of links.
As above, we store the `start` and `end`, and also the whole match:

    start = RSTART
    end = RSTART + RLENGTH - 1
    # RSTART and RLENGTH index into `line`, so the match is extracted from `line`
    matched = substr(line, RSTART, RLENGTH)

It is possible to apply the `match` function to this `matched` string and extract, first, the text in `[]` and then the text in `()`:

    if (match(matched, /\[([^\]]+)\]/)) {
        matched_link = substr(matched, RSTART+1, RLENGTH-2)
    }
    if (match(matched, /\([^\)]+\)/)) {
        matched_url = substr(matched, RSTART+1, RLENGTH-2)
    }

As the link text and the URL are stored, using the variables `start` and `end`, it is easy to reconstruct the HTML line:

    line = substr(line, 1, start-1) "<a href=\"" matched_url "\">" matched_link "</a>" substr(line, end+1)

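Put together, the three steps above fit in a single loop. The sketch below is one possible assembly, with hypothetical names (`links`, `links_md`) and two small variations that are my own assumptions: the bracket expression is written `[^]]`, the POSIX spelling of "anything but `]`", and the URL is searched from `](` so that parentheses inside the link text are not mistaken for the URL.

```shell
#!/bin/sh
# Hypothetical links() function assembling the three steps into one loop.
links_md() {
    awk '
    function links(line,    start, end, matched, text, url) {
        while (match(line, /\[[^]]+\]\([^)]+\)/)) {
            start = RSTART
            end = RSTART + RLENGTH - 1
            matched = substr(line, start, RLENGTH)

            # Text between [ and ]
            match(matched, /\[[^]]+\]/)
            text = substr(matched, RSTART + 1, RLENGTH - 2)

            # URL between ]( and ): anchoring on "](" skips parentheses in the text
            match(matched, /\]\([^)]+\)/)
            url = substr(matched, RSTART + 2, RLENGTH - 3)

            line = substr(line, 1, start - 1) "<a href=\"" url "\">" text "</a>" substr(line, end + 1)
        }
        return line
    }
    { print links($0) }'
}

printf '%s\n' 'see the [AWK page](https://en.wikipedia.org/wiki/AWK) for more' | links_md
```

The link is replaced by an `<a>` tag whose `href` is the URL and whose content is the link text, while the rest of the line is left untouched.
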
The inline parsing function is now complete; all we have to do is apply it systematically to the text within HTML tags, and this finishes the markdown parser.

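As an illustration of how the inline pass plugs into the block rules, here is a deliberately reduced converter: headers and paragraphs only, with emphasis as the only inline rule. The names `md2html` and `inline` are hypothetical, and this is a sketch of the idea rather than the full parser.

```shell
#!/bin/sh
# Hypothetical mini converter: headers and paragraphs, emphasis as the only inline rule.
md2html() {
    awk '
    function inline(line,    start, len) {
        while (match(line, /\*[^*]+\*/)) {
            start = RSTART; len = RLENGTH
            line = substr(line, 1, start - 1) "<em>" substr(line, start + 1, len - 2) "</em>" substr(line, start + len)
        }
        return line
    }

    # Headers: the inline pass runs on the title only, not on the leading "#"
    /^#+ / {
        match($0, /#+ /)
        n = RLENGTH               # saved, since inline() calls match() again
        print "<h" (n - 1) ">" inline(substr($0, n + 1)) "</h" (n - 1) ">"
        next
    }

    # Anything else that is not blank is treated as a paragraph
    !/^$/ { print "<p>" inline($0) "</p>" }'
}

printf '%s\n' '# A *nice* title' '' 'Some *text*.' | md2html
```

The block rule decides which tag wraps the line, and `inline` only ever sees the text that goes inside that tag.
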
This, of course, is the first brick of a static site generator, maybe the most complex one.
We shall see next how to orchestrate this parser to make it an actual site generator.

The code is available in the [repo](https://git.simonpetit.top/simonpetit/top).