# Markdown to HTML using AWK

When I decided to start blogging, it was mostly a way for me to learn and remember all the tech things I have picked up over time. I also want to explore a wide diversity of technologies, not focus on a particular one.

Hence, to start blogging, I obviously needed a static site generator. Many of them exist already, like Hugo for example; however, writing one from scratch is typically the kind of exercise I want to throw myself into. The advantage of a static site is clearly its loading speed: a simple HTML file, combined with a small linked CSS, and a whole new blog is born.

Anyway, writing this static site generator from scratch is also the perfect excuse to explore a not so widely known technology for manipulating text files.

## Introduction to AWK

AWK, named from the initials of its creators, is an old and powerful text-manipulation tool. Syntactically close to C, it is a scripting language for manipulating text input. Its [wikipedia page](https://en.wikipedia.org/wiki/AWK) sums up its story nicely.

I thought it was clever to use it for a site generator, to parse Markdown files and generate HTML ones. However, according to this [listing](https://jamstack.org/generators/) of static site generator programs, someone else has already had the same idea. Hence the following, as well as my code, is heavily inspired by [Zodiac](https://github.com/nuex/zodiac) (even though the repo has not been touched for 8 years).

## Parsing markdown

Following the official [syntax](https://daringfireball.net/projects/markdown/syntax) is a good start for a parser.

AWK works as follows: it takes an optional regex and executes the code between braces, like a function, on each line of the text input (only on the matching lines when a regex is given). For example:

    /^#/ { print "<h1>" $0 "</h1>" }

[...]

        } else if (env == "blockquote") {
            print $0
        }
    }

Like `BEGIN`, AWK provides a way to execute code at the very end of the input, with the `END` keyword. Naturally, we need to empty the stack and close all the HTML tags that might have been opened during parsing. It is only a while loop that runs until the last environment is "none", as it was initialized:

    END {
        env = last()
        while (env != "none") {
            env = pop()
            print "</" env ">"
            env = last()
        }
    }

This way we are able to simply parse Markdown and turn it into an HTML file.

## Parsing inline features

So far we have seen a way to parse blocks, but Markdown also handles strong, emphasis and links. However, these tags can appear anywhere in a line. Hence we need to be able to parse these lines independently of the block itself: indeed, a header can contain a strong and a link.

The previously introduced but very useful function `match` fits this need: it literally is a regex engine, looking for a pattern in a string. Whenever the pattern is found, two global variables are filled:

- `RSTART`: the index of the first character of the matched substring
- `RLENGTH`: the length of the matched substring

For the following, `line` represents the line processed by the function, as the following `while` loops are actually part of a single function.

This way `match(line, /\*([^*]+)\*/)` matches a string (that contains no `*`) surrounded by two `*`, corresponding to an emphasis text. The `*` are escaped as they are special characters. The parentheses only delimit a *group* inside the pattern: `RSTART` and `RLENGTH` still describe the whole match, surrounding `*` included.

To match several instances of emphasis text within a line, a simple `while` will do the trick. We now only have to insert the HTML tags `<em>` and `</em>` at the right place around the matched text, and we are good to go. We can save the global variables `RSTART` and `RLENGTH` for further use, in case they were to be changed.
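To make `RSTART` and `RLENGTH` concrete, here is a minimal sketch, wrapped in a shell function so it can be run as is. The name `show_match` and the sample line are made up for illustration; only `match`, `substr`, `RSTART` and `RLENGTH` are AWK built-ins.

```sh
# show_match: hypothetical helper, reads lines on stdin and reports
# where the first emphasis span sits in each of them.
show_match() {
    awk '{
        if (match($0, /\*[^*]+\*/)) {
            # RSTART and RLENGTH cover the whole match, asterisks included
            print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
        } else {
            # match returned 0: nothing found on this line
            print "no match"
        }
    }'
}

echo 'some *emphasis* here' | show_match
# prints: 6 10 *emphasis*
```

The reported span includes both asterisks, which is why the substitution trims one character on each side with `start+1` and `RLENGTH-2`.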
Using `RSTART` and `RLENGTH`, we can also extract the matched substrings and rebuild the actual HTML string:

    while (match(line, /\*([^*]+)\*/)) {
        start = RSTART
        end = RSTART + RLENGTH - 1
        # Build the result: before match, <em>, content, </em>, after match
        line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
    }

The while loop lets us repeat this process as many times as the pattern occurs in the line. We can now repeat the same approach for all inline features, e.g. strong and code.

The case of URLs is a bit more involved, as we need to match two groups: the link text and the URL itself. No real issue here: the naive way is to match the whole construct, then look for both the link text and the URL within it.

This way `match(line, /\[([^\]]+)\]\([^\)]+\)/)` matches a text between `[]` followed by a text between `()`: the Markdown representation of links. As above, we store `start` and `end`, and also the whole match:

    start = RSTART
    end = RSTART + RLENGTH - 1
    # Take the match from line, not $0: earlier substitutions may have shifted the offsets
    matched = substr(line, RSTART, RLENGTH)

It is possible to apply the `match` function to this `matched` string and extract first the text in `[]`, then the text in `()`:

    if (match(matched, /\[([^\]]+)\]/)) {
        matched_link = substr(matched, RSTART+1, RLENGTH-2)
    }
    if (match(matched, /\([^\)]+\)/)) {
        matched_url = substr(matched, RSTART+1, RLENGTH-2)
    }

Now that the link text and the URL are stored, the variables `start` and `end` make it easy to rebuild the HTML line:

    line = substr(line, 1, start-1) "<a href=\"" matched_url "\">" matched_link "</a>" substr(line, end+1)

The inline parsing function is now complete. All we have to do is apply it systematically to the text within HTML tags, and this finishes the Markdown parser.

This, of course, is the first brick of a static site generator, and maybe the most complex one. We shall see next how to orchestrate this parser to make it an actual site generator.

The code is available in the [repo](https://git.simonpetit.top/simonpetit/top).
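As a recap, the inline steps described in this post can be put together in one small script. This is only a sketch under a few assumptions, not necessarily the code from the repo: the names `inline_md` and `parse_inline` are hypothetical, strong is assumed to be written `**text**` and is handled before emphasis so that its double asterisks are not consumed by the emphasis pattern, and the URL is taken as whatever follows the bracketed link text.

```sh
# inline_md: hypothetical wrapper, converts inline Markdown read on stdin.
inline_md() {
    awk '
    # parse_inline is an illustrative name; extra parameters are local variables.
    function parse_inline(line,    start, end, len, matched, tlen, text, url) {
        # Links: [text](url)
        while (match(line, /\[[^]]+\]\([^)]+\)/)) {
            start = RSTART
            end = RSTART + RLENGTH - 1
            matched = substr(line, start, RLENGTH)
            # The bracketed text sits at the start of the matched string
            match(matched, /\[[^]]+\]/)
            tlen = RLENGTH
            text = substr(matched, 2, tlen - 2)
            # What remains is "(url)": drop both parentheses
            url = substr(matched, tlen + 2, length(matched) - tlen - 2)
            line = substr(line, 1, start - 1) "<a href=\"" url "\">" text "</a>" substr(line, end + 1)
        }
        # Strong: **text**, handled before emphasis
        while (match(line, /\*\*[^*]+\*\*/)) {
            start = RSTART
            len = RLENGTH
            line = substr(line, 1, start - 1) "<strong>" substr(line, start + 2, len - 4) "</strong>" substr(line, start + len)
        }
        # Emphasis: *text*
        while (match(line, /\*[^*]+\*/)) {
            start = RSTART
            len = RLENGTH
            line = substr(line, 1, start - 1) "<em>" substr(line, start + 1, len - 2) "</em>" substr(line, start + len)
        }
        return line
    }
    { print parse_inline($0) }
    '
}
```

For instance, `echo 'a *b* and [c](http://d.e) **f**' | inline_md` should turn the emphasis, the link and the strong text into their HTML counterparts on a single line.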