Date: Thu, 24 Apr 2025 09:55:52 +0000
Subject: [PATCH] wip
---
index.html | 1 +
posts/awk_for_static_site_generation.html | 88 ++++++++++++-----------
posts/markdown_testing_suite.html | 75 +++++++++++++++++++
3 files changed, 121 insertions(+), 43 deletions(-)
create mode 100644 posts/markdown_testing_suite.html
diff --git a/index.html b/index.html
index e98061e..158400f 100644
--- a/index.html
+++ b/index.html
@@ -12,6 +12,7 @@
simpet
diff --git a/posts/awk_for_static_site_generation.html b/posts/awk_for_static_site_generation.html
index 8fa26f2..27be97c 100644
--- a/posts/awk_for_static_site_generation.html
+++ b/posts/awk_for_static_site_generation.html
@@ -30,33 +30,34 @@
AWK works as follows: it takes an optional regex and executes some code between brackets, as a function, at each line of the text input.
For example:
/^#/ {
- print "<h1>" $0 "</h1>"
+ print "{{article}}lt;h1{{article}}gt;" $0 "{{article}}lt;/h1{{article}}gt;"
}
-Although `$n` refers to the n-th records in the line (according to a delimiter, like in a csv), the special `$0` refers to the whole line.
-In this case, for each line starting with `#`, awk will print (to the standard output), `<h1> [content of the line] </h1>`.
+Although $n refers to the n-th field in the line (split according to a delimiter, like in a csv), the special $0 refers to the whole line.
+In this case, for each line starting with #, awk will print (to the standard output) &lt;h1&gt; [content of the line] &lt;/h1&gt;.
This is the beginning of parsing headers in markdown.
-However, by trying this, we immediatly see that `#` is part of the whole line, hence it also appear in the html whereas it sould not.
+However, by trying this, we immediately see that # is part of the whole line, hence it also appears in the html whereas it should not.
AWK has a way to prevent this, as it is a complete scripting language with built-in functions that enable further manipulations.
+substr acts as its name indicates: it returns a substring of its argument.
/^#/ {
- print "<h1>" substr($0, 3) "</h1>"
+ print "{{article}}lt;h1{{article}}gt;" substr($0, 3) "{{article}}lt;/h1{{article}}gt;"
}
In the example above, as per the documentation
-it returns the subtring of `$0` starting at 3 (1 being `#` and 2 the whitespace following it) to the end of the line.
-Now this is better, but we now are able to generalized it to all headers. Another function, `match` can return the number of char matched by a regex,
-and allows the script to dynamically determine which depth of header it parses. This length is stored is the global variable `RLENGTH`:
+it returns the substring of $0 starting at position 3 (1 being the # and 2 the whitespace following it) to the end of the line.
+Now this is better, but we are now able to generalize it to all headers. Another function, match, can return the number of characters matched by a regex,
+and allows the script to dynamically determine which depth of header it parses. This length is stored in the global variable RLENGTH:
/^#+ / {
match($0, /#+ /);
n = RLENGTH;
- print "<h" n-1 ">" substr($0, n + 1) "</h" n-1 ">"
+ print "{{article}}lt;h" n-1 "{{article}}gt;" substr($0, n + 1) "{{article}}lt;/h" n-1 "{{article}}gt;"
}
Reproducing this technique to parse the rest proves to be difficult, as lists, for example, are not contained in a single line, hence
-how to know when to close it with `</ul>` or `</ol>`
+how to know when to close it with &lt;/ul&gt; or &lt;/ol&gt;?
Introducing a LIFO stack
Since, according to the markdown syntax, it is possible to have nested blocks such as headers and lists within blockquotes, or lists within lists, I came up with the simple idea of tracking the current environment in a stack in AWK.
It turned out to be easy: I only needed a pointer to track the size of the LIFO, a function to push an element, and another one to pop one out:
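(The push and pop definitions themselves fall outside this hunk; the following is only a minimal sketch of what they could look like, with the BEGIN seeding and the exact bodies being assumptions rather than the post's verbatim code.)
# Hypothetical sketch of the LIFO helpers used throughout the post
BEGIN {
    stack_pointer = 0
    stack[0] = "none"    # sentinel so the stack reports "none" when nothing is open
}

# Function to push a value onto the LIFO
function push(value) {
    stack_pointer++
    stack[stack_pointer] = value
}

# Function to pop the last value out of the LIFO
function pop() {
    stack_pointer--
    return stack[stack_pointer + 1]
}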
@@ -88,8 +89,8 @@ function pop() {
The stack does not have to be explicitly declared. The values inside the LIFO correspond to the current markdown environment.
-This is a clever trick, because when I need to close an html tag, I use the poped element between a `</` and a `>` instead of having a matching table.
-I also used a simple `last()` function to return the last pushed value in the stack without popping it out :
+This is a clever trick, because when I need to close an html tag, I use the popped element between a &lt;/ and a &gt; instead of having a matching table.
+I also used a simple last() function to return the last pushed value in the stack without popping it out:
# Function to get last value in LIFO
function last() {
return stack[stack_pointer]
@@ -102,17 +103,17 @@ function last() {
env = last()
if (env == "ul" ) {
# In an unordered list block, print a new item
- print "<li>" substr($0, 3) "</li>"
+ print "{{article}}lt;li{{article}}gt;" substr($0, 3) "{{article}}lt;/li{{article}}gt;"
} else {
# Otherwise, init the unordered list block
push("ul")
- print "<ul>
-<li>" substr($0, 3) "</li>"
+ print "{{article}}lt;ul{{article}}gt;
+{{article}}lt;li{{article}}gt;" substr($0, 3) "{{article}}lt;/li{{article}}gt;"
}
}
-I believe the code is pretty self explanatory, but when the last environement is not `ul`, then we enter this environement.
+I believe the code is pretty self-explanatory, but when the last environment is not ul, then we enter this environment.
This translates as pushing it to the stack.
Otherwise, it means we are already reading a list, and we only need to add a new element to it.
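The diff does not show where the list block gets closed; as a purely illustrative sketch (assuming a blank line ends the block and reusing the last() and pop() helpers), it could look like this:
# Hypothetical sketch: closing an open unordered list on a blank line
/^$/ {
    if (last() == "ul") {
        pop()
        print "</ul>"
    }
}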
Parsing the simple paragraph and ending the parser
@@ -120,84 +121,85 @@ function last() {
it does not start with a specific character. That is, to match it, we match everything that is not a special character.
I have no idea if this is the best solution, but so far it proved to work:
# Matching a simple paragraph
-!/^(#|\*|-|\+|>|`|$| | )/ {
+!/^(#|\*|-|\+|>|`|$| | )/ {
env = last()
if (env == "none") {
# If no block, print a paragraph
- print "<p>" $0 "</p>"
+ print "{{article}}lt;p{{article}}gt;" $0 "{{article}}lt;/p{{article}}gt;"
} else if (env == "blockquote") {
print $0
}
}
-AS `BEGIN`, AWK provide the possibilty to execute code at the very end of the file, with the `END` keyword.
+As with BEGIN, AWK provides the possibility to execute code at the very end of the file, with the END keyword.
Naturally we need to empty the stack and close all html tags that might have been opened during the parsing.
It is only a while loop, running until the last environment is "none", as it was initiated:
END {
env = last()
while (env != "none") {
env = pop()
- print "</" env ">"
+ print "{{article}}lt;/" env "{{article}}gt;"
env = last()
}
}
This way we are able to simply parse markdown and turn it into an HTML file.
-Of course I am aware that is lacks emphasis, strong and code within a line of text.
-However I did implement it, but maybe it will be explained in another edit of this post.
-Nonetheless the code can still be consulted on github.
Parsing in-line functionalities
For now we have seen a way to parse blocks, but markdown also handles strong, emphasis and links. However, these tags can appear anywhere in a line.
Hence we need to be able to parse these lines apart from the block itself: indeed a header can contain a strong and a link.
-A very useful function in awk is `match` : it literally is a regex engine, looking for a pattern in a string.
+The previously introduced but very useful function match fits this need: it literally is a regex engine, looking for a pattern in a string.
Whenever the pattern is found, two global variables are filled:
-- RSTART : the index of the first character matching the *group*
-- RLENGTH: the length of the matched *group*
+- RSTART: the index of the first character of the match
+- RLENGTH: the length of the match
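As a quick illustration (my own toy example, not taken from the post):
# Hypothetical example of RSTART and RLENGTH after a successful match
BEGIN {
    s = "a *nice* word"
    if (match(s, /\*[^*]+\*/)) {
        print RSTART, RLENGTH    # prints "3 6": the match "*nice*" starts at 3 and is 6 characters long
    }
}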
-For the following, `line` represents the line processed by the function, as the following `while` loops are actually part of a single function.
-This way `match(line, /*([^*]+)*/)` matches a string surrounded by two `*`, corresponding to an emphasis text.
-The `` are espaced are thez are special characters, and the group* is inside the parenthesis.
-To matche several instances of emphasis text within a line, a simple `while` will do the trick.
-We now only have to insert html tags `<em>` are the right space around the matched text, and we are good to go.
-We can save the global variables `RSTART` and `RLENGTH` for further use, in case they were to be change. Using them we also can extract the
+For the following, line represents the line processed by the function, as the following while loops are actually part of a single function.
+This way match(line, /\*([^&#42;]+)\*/) matches a string (that does not itself contain a &#42;) surrounded by two &#42;, corresponding to an emphasis text.
+The &#42; are escaped as they are special characters, and the group is delimited by the parentheses.
+To match several instances of emphasis text within a line, a simple while will do the trick.
+We now only have to insert the html tags &lt;em&gt; at the right place around the matched text, and we are good to go.
+We can save the global variables RSTART and RLENGTH for further use, in case they were to be changed. Using them we also can extract the
matched substrings and reconstruct the actual html string:
-while (match(line, /*([^*]+)*/)) {
+while (match(line, /\*([^&#42;]+)\*/)) {
start = RSTART
end = RSTART + RLENGTH - 1
- # Build the result: before match, <em>, content, </em>, after match
- line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
+ # Build the result: before match, &lt;em&gt;, content, &lt;/em&gt;, after match
+ line = substr(line, 1, start-1) "&lt;em&gt;" substr(line, start+1, RLENGTH-2) "&lt;/em&gt;" substr(line, end+1)
}
+The while loop enables us to repeat this process as many times as this pattern is encountered within the line.
+
+We can now repeat the pattern for all inline functionalities, e.g. strong and code.
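As a hedged illustration of that repetition (my own sketch, not the post's code), the strong case only changes the pattern and the offsets:
# Hypothetical sketch: the same while-loop technique applied to **strong** text
while (match(line, /\*\*([^*]+)\*\*/)) {
    start = RSTART
    end = RSTART + RLENGTH - 1
    # Skip the two leading and two trailing asterisks when extracting the content
    line = substr(line, 1, start-1) "<strong>" substr(line, start+2, RLENGTH-4) "</strong>" substr(line, end+1)
}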
The case of urls is a bit deeper, as we need to match two groups: the actual text and the url itself.
-No real issue here, the naïve way is to match thd whole, and looking for both the link and the url within the matched whole.
-This way `match(line, /\[([^\]]+)\]\([^\)]+\)/)` matches a text between `[]` followed by a text between `()` : the markdown representation of links.
-As above, we store the `start` and `end` and also the whole match :
+No real issue here: the naïve way is to match the whole, then look for both the link text and the url within the matched whole.
+This way match(line, /\[([^\]]+)\]\([^\)]+\)/) matches a text between [] followed by a text between (): the markdown representation of links.
+As above, we store the start and end and also the whole match:
start = RSTART
end = RSTART + RLENGTH - 1
matched = substr($0, RSTART, RLENGTH)
-It is possible to apply the match fonction on this `matched` string, and extract, first, the text in `[]`, and last the text in `()`
-if (match(matched, /\[([^\]]+)\]/)) {
+It is possible to apply the match function on this matched string and extract, first, the text in [], and last, the text in ():
+if (match(matched, /\[([^\]]+)\]/)) {
matched_link = substr(matched, RSTART+1, RLENGTH-2)
}
-if (match(matched, /\([^\)]+\)/)) {
+if (match(matched, /\([^\)]+\)/)) {
matched_url = substr(matched, RSTART+1, RLENGTH-2)
}
-As the link text and the url are stored, using the variables `start` and `end`, it is easy to reconstruct the html line :
-line = substr(line, 1, start-1) "<a href="" matched_url "">" matched_link "</a>" substr(line, end+1)
+As the link text and the url are stored, using the variables start and end, it is easy to reconstruct the html line:
+line = substr(line, 1, start-1) "&lt;a href=\"" matched_url "\"&gt;" matched_link "&lt;/a&gt;" substr(line, end+1)
The inline parsing function is now complete; all we have to do is apply it systematically on the text within the html tags, and this finishes the markdown parser.
This, of course, is the first brick of a static site generator, maybe the most complex one.
We shall see up next how to orchestrate this parser to make it an actual site generator.
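To picture what "applying it systematically" could mean, here is a purely hypothetical wiring (the parse_inline name and its use inside the header rule are my assumptions, not the post's code):
# Hypothetical wiring: run the inline pass on the text emitted by a block rule
/^#+ / {
    match($0, /#+ /)
    n = RLENGTH
    # parse_inline is assumed to wrap the while loops shown above
    print "<h" n-1 ">" parse_inline(substr($0, n + 1)) "</h" n-1 ">"
}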
+The code is available in the repo.