From 8ef470c84d357e1a7dbac3de3a15f2307918d6cb Mon Sep 17 00:00:00 2001
From: Simon Petit
Date: Sat, 16 Nov 2024 12:27:47 +0100
Subject: [PATCH] first article

---
 .../awk_for_static_site_generation.md | 55 ++++++++++++++++++-
 posts/awk_for_static_site_generation.html | 49 ++++++++++++++++-
 2 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/drafts/published/awk_for_static_site_generation.md b/drafts/published/awk_for_static_site_generation.md
index 02d85f2..a9189e8 100644
--- a/drafts/published/awk_for_static_site_generation.md
+++ b/drafts/published/awk_for_static_site_generation.md
@@ -122,7 +122,7 @@ I have no idea if this is the best solution, but so far it proved to work:
     env = last()
     if (env == "none") {
         # If no block, print a paragraph
-        print "<p>" replaceEmAndStrong($0) "</p>"
+        print "<p>" $0 "</p>"
     } else if (env == "blockquote") {
         print $0
     }
@@ -151,3 +151,56 @@ Nonetheless the code can still be consulted on [github](https://github.com/Siwon
 For now we have seen a way to parse blocks, but markdown also handles strong, emphasis and links. However, these tags can appear anywhere in a line.
 Hence we need to be able to parse these lines independently of the block itself: indeed, a header can contain both strong text and a link.
 
+A very useful function in awk is `match`: it searches a string for a regular expression.
+Whenever the pattern is found, two global variables are set:
+- RSTART: the index of the first character of the match
+- RLENGTH: the length of the match
+
+In what follows, `line` is the line processed by the function, as the `while` loops below are all part of a single function.
+
+This way `match(line, /\*([^*]+)\*/)` matches a string surrounded by two `*`, corresponding to emphasized text.
+The `*` are escaped because they are special characters, and the *group* is inside the parentheses. Note that `RSTART` and `RLENGTH` describe the whole match, surrounding `*` included.
+To match several instances of emphasis within a line, a simple `while` will do the trick.
+We now only have to insert the html tags `<em>` and `</em>` at the right place around the matched text, and we are good to go.
+We can save the global variables `RSTART` and `RLENGTH` for later use, in case they get changed. Using them we can also extract the
+matched substrings and reconstruct the actual html string:
+
+
+    while (match(line, /\*([^*]+)\*/)) {
+        start = RSTART
+        end = RSTART + RLENGTH - 1
+        # Build the result: before match, <em>, content, </em>, after match
+        line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
+    }
+
+We can now repeat the pattern for all inline features, e.g. strong and code.
+
+Links are a bit more involved, as we need to extract two groups: the link text and the url itself.
+No real issue here: the naïve way is to match the whole link, then look for both the text and the url within that match.
+
+This way `match(line, /\[([^\]]+)\]\([^\)]+\)/)` matches text between `[]` followed by text between `()`: the markdown representation of a link.
+As above, we store `start` and `end`, and also the whole match:
+
+    start = RSTART
+    end = RSTART + RLENGTH - 1
+    matched = substr(line, RSTART, RLENGTH)
+
+It is possible to apply the `match` function to this `matched` string and extract first the text in `[]`, then the text in `()`:
+
+
+    if (match(matched, /\[([^\]]+)\]/)) {
+        matched_link = substr(matched, RSTART+1, RLENGTH-2)
+    }
+    if (match(matched, /\([^\)]+\)/)) {
+        matched_url = substr(matched, RSTART+1, RLENGTH-2)
+    }
+
+As the link text and the url are stored, it is easy to reconstruct the html line using the variables `start` and `end`:
+
+    line = substr(line, 1, start-1) "<a href=\"" matched_url "\">" matched_link "</a>" substr(line, end+1)
+
+The inline parsing function is now complete: all we have to do is apply it systematically to the text within html tags, and this finishes the markdown parser.
+
+This, of course, is the first brick of a static site generator, and maybe the most complex one.
+Next, we shall see how to orchestrate this parser to make it an actual site generator.
+
diff --git a/posts/awk_for_static_site_generation.html b/posts/awk_for_static_site_generation.html
index f67b356..1f20d20 100644
--- a/posts/awk_for_static_site_generation.html
+++ b/posts/awk_for_static_site_generation.html
@@ -124,7 +124,7 @@ function last() {
     env = last()
     if (env == "none") {
         # If no block, print a paragraph
-        print "<p>" replaceEmAndStrong($0) "</p>"
+        print "<p>" $0 "</p>"
     } else if (env == "blockquote") {
         print $0
     }
@@ -151,6 +151,53 @@ function last() {

Parsing inline features

For now we have seen a way to parse blocks, but markdown also handles strong, emphasis and links. However, these tags can appear anywhere in a line.

Hence we need to be able to parse these lines independently of the block itself: indeed, a header can contain both strong text and a link.

+

A very useful function in awk is `match`: it searches a string for a regular expression.

+

Whenever the pattern is found, two global variables are set:

- `RSTART`: the index of the first character of the match
- `RLENGTH`: the length of the match
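
To make the behaviour of `match` concrete, here is a minimal sketch. The sample string and the wrapper name `demo_match` are made up for illustration. It shows that `RSTART` and `RLENGTH` describe the whole match, surrounding `*` included, not only the parenthesized group.

```sh
demo_match() {
    awk 'BEGIN {
        line = "some *emphasised* text"
        if (match(line, /\*[^*]+\*/)) {
            # RSTART is 1-based and points at the opening asterisk
            print RSTART, RLENGTH, substr(line, RSTART, RLENGTH)
        }
    }'
}

demo_match
```

Here the match starts at character 6 and spans 12 characters, so the script prints `6 12 *emphasised*`. This is why the code later skips one character on each side to get the content.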

In what follows, `line` is the line processed by the function, as the `while` loops below are all part of a single function.

+

This way `match(line, /\*([^*]+)\*/)` matches a string surrounded by two `*`, corresponding to emphasized text.

+

The `*` are escaped because they are special characters, and the group is inside the parentheses.

+

To match several instances of emphasis within a line, a simple `while` will do the trick.

+

We now only have to insert the html tags `<em>` and `</em>` at the right place around the matched text, and we are good to go.

+

We can save the global variables `RSTART` and `RLENGTH` for later use, in case they get changed. Using them we can also extract the matched substrings and reconstruct the actual html string:

+
while (match(line, /\*([^*]+)\*/)) {
+    start = RSTART
+    end = RSTART + RLENGTH - 1
+    # Build the result: before match, <em>, content, </em>, after match
+    line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
+}
+
+
+
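
The same loop can be repeated for other inline features. The sketch below handles strong and emphasis together. It assumes `<strong>` and `<em>` as output tags, and the names `inline` and `inline_emphasis` are made up for illustration. Strong is handled first, because a `**` pair would otherwise be consumed by the single `*` rule.

```sh
inline_emphasis() {
    awk '
    function inline(line,    start, end) {
        # Strong first: "**" would otherwise be picked up by the "*" rule
        while (match(line, /\*\*[^*]+\*\*/)) {
            start = RSTART
            end = RSTART + RLENGTH - 1
            line = substr(line, 1, start-1) "<strong>" substr(line, start+2, RLENGTH-4) "</strong>" substr(line, end+1)
        }
        # Then emphasis, exactly as in the loop above
        while (match(line, /\*[^*]+\*/)) {
            start = RSTART
            end = RSTART + RLENGTH - 1
            line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
        }
        return line
    }
    { print inline($0) }'
}

printf '%s\n' 'a **bold** and *soft* word' | inline_emphasis
```

Each replacement removes the markers it matched, so every pass shortens the remaining work and the loops terminate.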

Links are a bit more involved, as we need to extract two groups: the link text and the url itself.

+

No real issue here: the naïve way is to match the whole link, then look for both the text and the url within that match.

+

This way `match(line, /\[([^\]]+)\]\([^\)]+\)/)` matches text between `[]` followed by text between `()`: the markdown representation of a link.

+

As above, we store `start` and `end`, and also the whole match:

+

+
start = RSTART
+end = RSTART + RLENGTH - 1
+matched = substr(line, RSTART, RLENGTH)
+
+
+

It is possible to apply the `match` function to this `matched` string and extract first the text in `[]`, then the text in `()`:

+
if (match(matched, /\[([^\]]+)\]/)) {
+    matched_link = substr(matched, RSTART+1, RLENGTH-2) 
+}
+if (match(matched, /\([^\)]+\)/)) {
+    matched_url = substr(matched, RSTART+1, RLENGTH-2)
+}
+
+
+

As the link text and the url are stored, it is easy to reconstruct the html line using the variables `start` and `end`:

+
line = substr(line, 1, start-1) "<a href=\"" matched_url "\">" matched_link "</a>" substr(line, end+1)
+
+
+
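
Put together, the link steps above can be sketched as a single loop. This is an illustration under assumptions, not the exact code from the repository: the names `links` and `inline_links` are made up, the character classes use the POSIX form `[^]]` rather than `[^\]]`, and the url pattern is anchored with `$` so that parentheses inside the link text are not mistaken for the url.

```sh
inline_links() {
    awk '
    function links(line,    start, end, matched, matched_link, matched_url) {
        while (match(line, /\[[^]]+\]\([^)]+\)/)) {
            start = RSTART
            end = RSTART + RLENGTH - 1
            # Work on line, not $0: earlier replacements may have shifted the indices
            matched = substr(line, RSTART, RLENGTH)
            if (match(matched, /\[[^]]+\]/))
                matched_link = substr(matched, RSTART+1, RLENGTH-2)
            if (match(matched, /\([^)]+\)$/))
                matched_url = substr(matched, RSTART+1, RLENGTH-2)
            line = substr(line, 1, start-1) "<a href=\"" matched_url "\">" matched_link "</a>" substr(line, end+1)
        }
        return line
    }
    { print links($0) }'
}

printf '%s\n' 'see [my site](https://example.org) now' | inline_links
```

Saving `start` and `end` before the nested `match` calls matters here, since each call overwrites `RSTART` and `RLENGTH`.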

The inline parsing function is now complete: all we have to do is apply it systematically to the text within html tags, and this finishes the markdown parser.
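
As a rough sketch of what applying it systematically could look like, the block rules choose the surrounding tag and pass the content through the inline function. The names `inline` and `render_lines` are made up, the inline parser is reduced to emphasis to stay short, and only headers and paragraphs are handled. The real parser tracks more block types.

```sh
render_lines() {
    awk '
    # Hypothetical inline parser, reduced to emphasis to keep the sketch short
    function inline(line,    start, end) {
        while (match(line, /\*[^*]+\*/)) {
            start = RSTART
            end = RSTART + RLENGTH - 1
            line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
        }
        return line
    }
    # The block level picks the tag, inline() handles the content
    /^# / { print "<h1>" inline(substr($0, 3)) "</h1>"; next }
    { print "<p>" inline($0) "</p>" }'
}

printf '%s\n' '# A *big* title' 'Some *text*.' | render_lines
```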

+

This, of course, is the first brick of a static site generator, and maybe the most complex one.

+

Next, we shall see how to orchestrate this parser to make it an actual site generator.