This commit is contained in:
parent
1170b3d25c
commit
f6781edad8
@ -26,6 +26,15 @@ ul li a:hover {
|
||||
color: #AAA;
|
||||
}
|
||||
|
||||
ul li div {
|
||||
display: flex;
|
||||
flex-direction: row;
|
||||
}
|
||||
|
||||
ul li div p {
|
||||
font-size: 0.8em;
|
||||
}
|
||||
|
||||
.title {
|
||||
margin-top: 5vh;
|
||||
margin-bottom: 7vh;
|
||||
|
||||
@ -19,6 +19,12 @@ body {
|
||||
color: #AAA;
|
||||
}
|
||||
|
||||
.dates {
|
||||
display: flex;
|
||||
flex-direction: row;
|
||||
justify-content: space-around;
|
||||
}
|
||||
|
||||
article {
|
||||
width: 70vw;
|
||||
margin-right: auto;
|
||||
|
||||
@ -50,21 +50,21 @@ and I use `awk` to replace `{{article}}` with the actual content of the posts, l
|
||||
# Storing the path of the post/article to publish
|
||||
# The path is supposed to have this format "./drafts/published/<article>.*"
|
||||
article_path=$1
|
||||
|
||||
|
||||
# from the relative path, only retrieving the name of the article (without file extension)
|
||||
article_name=$(echo $article_path | cut -d '/' -f 4 | cut -d '.' -f 1)
|
||||
|
||||
|
||||
# Convert the markdown draft into an html article and storing it locally
|
||||
post=$(awk -f ${BOB_LIB}/markdown.awk ./$article_path)
|
||||
|
||||
|
||||
# Retrieving the html article template
|
||||
template="${BOB_LIB}/template/post.html"
|
||||
|
||||
|
||||
# Escaping the & for next step to not confuse awk
|
||||
escaped_post=$(echo "$post" | sed 's/&/\\&/g')
|
||||
|
||||
escaped_post=$(echo "$post" | sed 's/&/\\\\&/g')
|
||||
|
||||
# In the template, replacing the string {{article}} by the actual content parsed above
|
||||
awk -v content="$escaped_post" '{gsub(/\{\{article\}\}/, content); print}' "$template" > "./posts/$article_name.html"
|
||||
awk -v content="$escaped_post" '{gsub(/{{article}}/, content); print}' "$template" > "./posts/$article_name.html"
|
||||
}
|
||||
|
||||
The home page template is similar :
|
||||
|
||||
@ -23,18 +23,18 @@ AWK works as follow : it takes an optional regex and execute some code between b
|
||||
For example :
|
||||
|
||||
/^#/ {
|
||||
print "<h1>" $0 "</h1>"
|
||||
print "<h1>" $0 "</h1>"
|
||||
}
|
||||
|
||||
While `$n` refers to the n-th field of the line (split on a delimiter, as in a csv), the special `$0` refers to the whole line.
|
||||
In this case, for each line starting with `#`, awk will print (to the standard output), `<h1> [content of the line] </h1>`.
|
||||
In this case, for each line starting with `#`, awk will print (to the standard output), `<h1> [content of the line] </h1>`.
|
||||
This is the beginning to parse headers in markdown.
|
||||
However, by trying this, we immediately see that `#` is part of the whole line, hence it also appears in the html whereas it should not.
|
||||
AWK has a way to prevent this, as it is a complete scripting language, with built-in functions, that enable further manipulations.
|
||||
`substr` does what its name indicates: it returns a substring of its argument.
|
||||
|
||||
/^#/ {
|
||||
print "<h1>" substr($0, 3) "</h1>"
|
||||
print "<h1>" substr($0, 3) "</h1>"
|
||||
}
|
||||
|
||||
In the example above, as per the [documentation](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html#index-substr_0028_0029-function)
|
||||
@ -46,11 +46,11 @@ and allows the script to dynamically determine which depth of header it parses.
|
||||
/^#+ / {
|
||||
match($0, /#+ /);
|
||||
n = RLENGTH;
|
||||
print "<h" n-1 ">" substr($0, n + 1) "</h" n-1 ">"
|
||||
print "<h" n-1 ">" substr($0, n + 1) "</h" n-1 ">"
|
||||
}
|
||||
|
||||
Reproducing this technique to parse the rest proves to be difficult, as lists for example, are not contained in a single line, hence
|
||||
how to know when to close it with `</ul>` or `</ol>`
|
||||
how to know when to close it with `</ul>` or `</ol>`
|
||||
|
||||
## Introducing a LIFO stack
|
||||
|
||||
@ -82,7 +82,7 @@ Turns out it came out to be easy, I only needed a pointer to track the size of t
|
||||
}
|
||||
|
||||
The stack does not have to be explicitly declared. The values inside the LIFO correspond to the currently open markdown environments.
|
||||
This is a clever trick, because when I need to close an html tag, I put the popped element between a `</` and a `>` instead of needing a lookup table.
|
||||
This is a clever trick, because when I need to close an html tag, I put the popped element between a `</` and a `>` instead of needing a lookup table.
|
||||
|
||||
I also used a simple `last()` function to return the last pushed value in the stack without popping it out :
|
||||
|
||||
@ -99,11 +99,11 @@ This way, parsing lists became trivial :
|
||||
env = last()
|
||||
if (env == "ul" ) {
|
||||
# In an unordered list block, print a new item
|
||||
print "<li>" substr($0, 3) "</li>"
|
||||
print "<li>" substr($0, 3) "</li>"
|
||||
} else {
|
||||
# Otherwise, init the unordered list block
|
||||
push("ul")
|
||||
print "<ul>\n<li>" substr($0, 3) "</li>"
|
||||
print "<ul>\n<li>" substr($0, 3) "</li>"
|
||||
}
|
||||
}
|
||||
|
||||
@ -118,11 +118,11 @@ it does not start with a specific caracter. That is, to match it, we match every
|
||||
I have no idea if this is the best solution, but so far it proved to work:
|
||||
|
||||
# Matching a simple paragraph
|
||||
!/^(#|\*|-|\+|>|`|$|\t| )/ {
|
||||
!/^(#|*|-|+|>|`|$|\t| )/ {
|
||||
env = last()
|
||||
if (env == "none") {
|
||||
# If no block, print a paragraph
|
||||
print "<p>" $0 "</p>"
|
||||
print "<p>" $0 "</p>"
|
||||
} else if (env == "blockquote") {
|
||||
print $0
|
||||
}
|
||||
@ -136,7 +136,7 @@ It only is a while loop, until the last environement is "none", as it way initia
|
||||
env = last()
|
||||
while (env != "none") {
|
||||
env = pop()
|
||||
print "</" env ">"
|
||||
print "</" env ">"
|
||||
env = last()
|
||||
}
|
||||
}
|
||||
@ -155,19 +155,19 @@ Whenever the pattern is found, two global variables are filled :
|
||||
|
||||
For the following, `line` represents the line processed by the function, as the following `while` loops are actually part of a single function.
|
||||
|
||||
This way `match(line, /\*([^*]+)\*/)` matches a string (that does not start with a `*`) surrounded by two `*`, corresponding to an emphasis text.
|
||||
The `*` are escaped as they are special characters, and the *group* is delimited by the parentheses.
|
||||
This way `match(line, /*([^*]+)*/)` matches a string (that does not start with a `*`) surrounded by two `*`, corresponding to an emphasis text.
|
||||
The `*` are escaped as they are special characters, and the *group* is delimited by the parentheses.
|
||||
To match several instances of emphasis text within a line, a simple `while` will do the trick.
|
||||
We now only have to insert html tags `<em>` at the right places around the matched text, and we are good to go.
|
||||
We now only have to insert html tags `<em>` at the right places around the matched text, and we are good to go.
|
||||
We can save the global variables `RSTART` and `RLENGTH` for further use, in case they get changed by a later call to `match`. Using them we also can extract the
|
||||
matched substrings and reconstruct the actual html string :
|
||||
|
||||
|
||||
while (match(line, /\*([^*]+)\*/)) {
|
||||
while (match(line, /*([^*]+)*/)) {
|
||||
start = RSTART
|
||||
end = RSTART + RLENGTH - 1
|
||||
# Build the result: before match, <em>, content, </em>, after match
|
||||
line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
|
||||
# Build the result: before match, <em>, content, </em>, after match
|
||||
line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
|
||||
}
|
||||
|
||||
The while loop enables us to repeat this process as many times as this pattern is encountered within the line.
|
||||
@ -196,7 +196,7 @@ It is possible to apply the match fonction on this `matched` string, and extract
|
||||
|
||||
As the link text and the url are stored, using the variables `start` and `end`, it is easy to reconstruct the html line :
|
||||
|
||||
line = substr(line, 1, start-1) "<a href=\"" matched_url "\">" matched_link "</a>" substr(line, end+1)
|
||||
line = substr(line, 1, start-1) "<a href=\"" matched_url "\">" matched_link "</a>" substr(line, end+1)
|
||||
|
||||
The inline parsing function is now complete; all we have to do is apply it systematically to the text within html tags, and this finishes the markdown parser.
|
||||
|
||||
|
||||
@ -12,8 +12,9 @@
|
||||
<body>
|
||||
<h1 class='title'>simpet</h1>
|
||||
<ul>
|
||||
<li><a href="./posts/markdown_testing_suite.html">markdown testing suite</a></li>
|
||||
<li><a href="./posts/awk_for_static_site_generation.html">awk for static site generation</a></li>
|
||||
<li><a href="./posts/awk_static_blog_generator.html">awk static blog generator</a><div><p>created at 2025-12-03 17:52:35</p><p>updated at 2025-12-03 17:52:35</p></div></li>
|
||||
<li><a href="./posts/awk_to_parse_markdown.html">awk to parse markdown</a><div><p>created at 2025-12-03 17:49:06</p><p>updated at 2025-12-03 17:49:06</p></div></li>
|
||||
<li><a href="./posts/markdown_testing_suite.html">markdown testing suite</a><div><p>created at 2024-12-09 14:51:41</p><p>updated at 2025-02-03 14:05:14</p></div></li>
|
||||
</ul>
|
||||
</body>
|
||||
|
||||
|
||||
141
posts/awk_static_blog_generator.html
Normal file
@ -0,0 +1,141 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="fr" dir="ltr">
|
||||
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<title>simpet</title>
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
|
||||
<link href="https://fonts.googleapis.com/css?family=Cutive+Mono|IBM+Plex+Mono&display=swap" rel="stylesheet">
|
||||
<link rel="stylesheet" type="text/css" href="../css/poststyle.css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<h1 class='title'><a href="../index.html">simpet</a></h1>
|
||||
<article>
|
||||
<div class='dates'>
|
||||
<p>Created at: <time datetime="2025-12-03 17:52:35">2025-12-03 17:52:35</time></p>
|
||||
<p>Updated at: <time datetime="2025-12-03 17:52:35">2025-12-03 17:52:35</time></p>
|
||||
</div>
|
||||
<h1>Bob, a static blog generator</h1>
|
||||
<h2>The blog engine</h2>
|
||||
<p>Starting from my markdown AWK parser, which was literally written to power this blog engine, I've added an extra layer to turn it into a static blog generator.
Of course the parser is only one of the several components required for a blog generator, but I shall start from the beginning.
Initially I wanted to blog for myself, and as described <a href="https://simonpetit.top/posts/awk_for_static_site_generation.html">here</a>, it was mostly to talk about tech.
The desire to make everything from scratch and reinvent the wheel is very strong, but we'll see how this evolves in the future.</p>
|
||||
<p>Now that I have my markdown to HTML converter, not much is missing to turn it into <code>bob</code>, my blog generator.</p>
|
||||
<h2>The boilerplate</h2>
|
||||
<p>After thinking about it, I did want to rely on git to store my drafts and posts, and have a CI listening to my blog repository that would do all the publishing work on the actual webserver. Hence the need for a self hosted git instance, and CI (reinventing the wheel I said).
|
||||
Maybe I shall post about <code>gitea</code> and <code>drone CI</code> later on.</p>
|
||||
<p>For this to happen, <code>bob</code> shall be a simple CLI, and screw it, a docker image as well.</p>
|
||||
<p>I also wanted to only handle the markdown file, and let the html build itself.</p>
|
||||
<p>I came up with a very simple folder architecture : </p>
|
||||
<ul>
|
||||
<li>a <code>css</code> folder containing...css files</li>
|
||||
<li>a <code>drafts</code> folder containing...drafts written in markdown. These shall not be published yet.</li>
<li>a <code>drafts/published</code> subfolder, where all the published posts shall be, still in the markdown format</li>
<li>a <code>posts</code> folder containing the actual HTML files generated from the posts in <code>drafts/published</code></li>
|
||||
</ul>
|
||||
<p>The idea is as simple as it gets : I write my drafts in the folder of the same name, when I want to publish them, I simply move them into the <code>published</code> subfolder and <code>bob</code> and the CI handle the rest.</p>
|
||||
<p>But the markdown converter does not create a full html page, so here comes the need for boilerplating :
|
||||
I made an <code>index.html</code> template, for the home page, and a <code>post.html</code> one, for the actual articles.</p>
|
||||
<p>Once again this is very simple : the post page template's body looks like this : </p>
|
||||
<pre><code><body>
|
||||
<h1 class='title'><a href="../index.html">simpet</a></h1>
|
||||
<article>
|
||||
{{article}}
|
||||
<footer>
|
||||
<div></div>
|
||||
</footer>
|
||||
</article>
|
||||
</body>
|
||||
</code>
|
||||
</pre>
|
||||
<p>and I use <code>awk</code> to replace <code>{{article}}</code> with the actual content of the posts, like so : </p>
|
||||
<pre><code>publish_one()
|
||||
{
|
||||
# Storing the path of the post/article to publish
|
||||
# The path is supposed to have this format "./drafts/published/<article>.*"
|
||||
article_path=$1
|
||||
|
||||
# from the relative path, only retrieving the name of the article (without file extension)
|
||||
article_name=$(echo $article_path | cut -d '/' -f 4 | cut -d '.' -f 1)
|
||||
|
||||
# Convert the markdown draft into an html article and storing it locally
|
||||
post=$(awk -f ${BOB_LIB}/markdown.awk ./$article_path)
|
||||
|
||||
# Retrieving the html article template
|
||||
template="${BOB_LIB}/template/post.html"
|
||||
|
||||
# Escaping the & for next step to not confuse awk
|
||||
escaped_post=$(echo "$post" | sed 's/&/\\\\&/g')
|
||||
|
||||
# In the template, replacing the string {{article}} by the actual content parsed above
|
||||
awk -v content="$escaped_post" '{gsub(/{{article}}/, content); print}' "$template" > "./posts/$article_name.html"
|
||||
}
|
||||
</code>
|
||||
</pre>
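<p>To see why the <code>&</code> needs escaping, here is a minimal sketch of the substitution step on its own. The function name <code>fill_template</code> and the sample strings are assumptions made for the illustration, not part of <code>bob</code>, and the sketch assumes the content contains no backslashes.</p>

```shell
# Hypothetical helper: $1 is the template text, $2 the HTML to inject.
# Assumption: the content holds no backslashes; only "&" is escaped here.
fill_template() {
    # sed turns each "&" into "\\&"; awk -v then reduces that to "\&",
    # which gsub reads as a literal ampersand.
    escaped=$(printf '%s\n' "$2" | sed 's/&/\\\\&/g')
    printf '%s\n' "$1" |
        awk -v content="$escaped" '{ gsub(/[{][{]article[}][}]/, content); print }'
}

fill_template '<article>{{article}}</article>' '<p>Tom &amp; Jerry</p>'
# prints: <article><p>Tom &amp; Jerry</p></article>
```

<p>In the replacement string of <code>gsub</code>, a bare <code>&</code> stands for the matched text, so without the escaping every <code>&</code> of the article would be replaced by the placeholder itself. The bracket expressions around the braces are only a portability precaution.</p>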
|
||||
<p>The home page template is similar : </p>
|
||||
<pre><code><body>
|
||||
<h1 class='title'>simpet</h1>
|
||||
{{articles}}
|
||||
</body>
|
||||
</code>
|
||||
</pre>
|
||||
<p>and updated this way : </p>
|
||||
<pre><code>update_index()
|
||||
{
|
||||
# Listing all posts and making an html list (with their links) out of them
|
||||
posts=$(ls -t ./posts | awk '
|
||||
BEGIN {
|
||||
print "<ul>"
|
||||
}
|
||||
{
|
||||
ref=$0
|
||||
sub(/\.html$/, "", ref)
|
||||
gsub(/[_-]/, " ", ref)
|
||||
print "<li><a href=\"./posts/" $0 "\">" ref "</a></li>"
|
||||
}
|
||||
END {
|
||||
print "</ul>"
|
||||
}')
|
||||
# retrieving the template for the index.html
|
||||
template="${BOB_LIB}/template/index.html"
|
||||
# replacing {{articles}} in the template with the actual list of articles from above
|
||||
awk -v content="$posts" '{gsub(/{{articles}}/, content); print}' "$template" > "./index.html"
|
||||
}
|
||||
</code>
|
||||
</pre>
|
||||
<p>Whenever a new article is added to or removed from the <code>drafts/published</code> folder, <code>update_index()</code> adjusts the home page, because it is called by this function : </p>
|
||||
<pre><code>publish_all()
|
||||
{
|
||||
# List all drafts to be published
|
||||
published=$(ls -1 ./drafts/published)
|
||||
# turning it into an array
|
||||
published_array=($published)
|
||||
|
||||
# Remove all html articles in case a previously published one was removed
|
||||
rm -f ./posts/*.html
|
||||
|
||||
# Publish them one by one (ie turning md into html)
|
||||
for file in "${published_array[@]}"; do
|
||||
publish_one ./drafts/published/$file
|
||||
done
|
||||
# updating the index.html as new articles are supposedly present and some may be removed
|
||||
update_index
|
||||
}
|
||||
</code>
|
||||
</pre>
|
||||
<p>which basically only reads the posts that are ready to be published, turns each of them into an html file using the template, and then updates the <code>index.html</code>.</p>
|
||||
<p>That's it ! </p>
|
||||
<h2>To sum up</h2>
|
||||
<p>I've made a very simple, not very customisable static blog generator, mostly using awk. It clearly is not optimized, as it regenerates all the articles every time, but awk is quite efficient, and for a few posts, I don't think it really matters.</p>
|
||||
<p>The real benefit is that I only handle markdown files, the CI and <code>bob</code> do the rest... </p>
|
||||
<p>Also, a static site is blazing fast to load in the browser, and since I do not use images (yet) nor javascript, I get a very very fast blog.</p>
|
||||
<p>To be continued...</p>
|
||||
<footer>
|
||||
<div></div>
|
||||
</footer>
|
||||
</article>
|
||||
</body>
|
||||
|
||||
</html>
|
||||
207
posts/awk_to_parse_markdown.html
Normal file
@ -0,0 +1,207 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="fr" dir="ltr">
|
||||
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<title>simpet</title>
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
|
||||
<link href="https://fonts.googleapis.com/css?family=Cutive+Mono|IBM+Plex+Mono&display=swap" rel="stylesheet">
|
||||
<link rel="stylesheet" type="text/css" href="../css/poststyle.css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<h1 class='title'><a href="../index.html">simpet</a></h1>
|
||||
<article>
|
||||
<div class='dates'>
|
||||
<p>Created at: <time datetime="2025-12-03 17:49:06">2025-12-03 17:49:06</time></p>
|
||||
<p>Updated at: <time datetime="2025-12-03 17:49:06">2025-12-03 17:49:06</time></p>
|
||||
</div>
|
||||
<h1>Markdown to HTML using AWK</h1>
|
||||
<p>When I decided to start blogging, it was mostly for me to learn and remember all the tech things I have learnt over time.
I also want to explore a wide diversity of technologies, not focus on a particular one.</p>
|
||||
<p>Hence to start blogging, I obviously needed a static site generator.
Many of them exist already, like Hugo for example, however rewriting one from scratch is typically the kind of exercise I want to throw myself into.
The advantage of a static site is clearly its loading speed : a simple html file, combined with a small linked css, and a whole new blog is born.
Anyway, writing this static site generator from scratch is also the perfect excuse to explore a not so widely known technology to manipulate text files. </p>
|
||||
<h2>Introduction to AWK</h2>
|
||||
<p>AWK, named after the initials of its creators, is an old and powerful text file manipulation tool. Syntactically close to C, it is a scripting language for manipulating text input.
Its <a href="https://en.wikipedia.org/wiki/AWK">wikipedia page</a> sums up its story nicely.
I thought it was clever to use it for a site generator, to parse markdown files and generate html ones.
However, according to this <a href="https://jamstack.org/generators/">listing</a> of static site generator programs, someone else has had the same idea.
Hence, the following, as well as my code, is heavily inspired by <a href="https://github.com/nuex/zodiac">Zodiac</a> (even though the repo has not been touched for 8 years).</p>
|
||||
<h2>Parsing markdown</h2>
|
||||
<p>Following the official <a href="https://daringfireball.net/projects/markdown/syntax">syntax</a> is a good start for a parser.
AWK works as follows : it takes an optional regex and executes the code between braces, like a function, on each line of the text input that matches the regex (or on every line when no regex is given).
For example :</p>
|
||||
<pre><code>/^#/ {
|
||||
print "<h1>" $0 "</h1>"
|
||||
}
|
||||
</code>
|
||||
</pre>
|
||||
<p>While <code>$n</code> refers to the n-th field of the line (split on a delimiter, as in a csv), the special <code>$0</code> refers to the whole line.
In this case, for each line starting with <code>#</code>, awk will print (to the standard output), <code><h1> [content of the line] </h1></code>.
This is the starting point for parsing headers in markdown.
However, by trying this, we immediately see that <code>#</code> is part of the whole line, hence it also appears in the html whereas it should not.
AWK has a way to prevent this, as it is a complete scripting language, with built-in functions that enable further manipulations.</p>
|
||||
<pre><code>/^#/ {
|
||||
print "<h1>" substr($0, 3) "</h1>"
|
||||
}
|
||||
</code>
|
||||
</pre>
|
||||
<p>In the example above, as per the <a href="https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html#index-substr_0028_0029-function">documentation</a>
|
||||
it returns the substring of <code>$0</code> starting at index 3 (1 being <code>#</code> and 2 the whitespace following it) and running to the end of the line.</p>
|
||||
<p>Now this is better, and we are able to generalize it to all headers. Another function, <code>match</code>, can return the number of characters matched by a regex,
and allows the script to dynamically determine which depth of header it parses. This length is stored in the global variable <code>RLENGTH</code>:</p>
|
||||
<pre><code>/^#+ / {
|
||||
match($0, /#+ /);
|
||||
n = RLENGTH;
|
||||
print "<h" n-1 ">" substr($0, n + 1) "</h" n-1 ">"
|
||||
}
|
||||
</code>
|
||||
</pre>
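<p>To try the rule from a shell, it can be wrapped in a small function. This is only a sketch: the name <code>render_headers</code> and the sample input are made up for the illustration, and the regex is anchored with <code>^</code> inside <code>match</code> as a small precaution.</p>

```shell
# Hypothetical wrapper around the header rule: markdown on stdin, html on stdout.
render_headers() {
    awk '
    /^#+ / {
        match($0, /^#+ /)      # RLENGTH counts the "#" characters plus the space
        level = RLENGTH - 1    # header depth
        print "<h" level ">" substr($0, RLENGTH + 1) "</h" level ">"
    }'
}

printf '# Title\n### Deep title\n' | render_headers
# prints:
# <h1>Title</h1>
# <h3>Deep title</h3>
```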
|
||||
<p>Reproducing this technique to parse the rest proves to be difficult: lists, for example, are not contained in a single line,
so how do we know when to close them with <code></ul></code> or <code></ol></code>?</p>
|
||||
<h2>Introducing a LIFO stack</h2>
|
||||
<p>Since according to the markdown syntax, it is possible to have nested blocks such as headers and lists within blockquotes, or lists within lists, I came up with the simple idea of tracking the current environment in a stack in AWK.
It turned out to be easy: I only needed a pointer to track the size of the LIFO, a function to push an element, and another one to pop one out :</p>
|
||||
<pre><code>BEGIN {
|
||||
env = "none"
|
||||
stack_pointer = 0
|
||||
push(env)
|
||||
}
|
||||
# Function to push a value onto the stack
|
||||
function push(value) {
|
||||
stack_pointer++
|
||||
stack[stack_pointer] = value
|
||||
}
|
||||
# Function to pop a value from the stack (LIFO)
|
||||
function pop() {
|
||||
if (stack_pointer > 0) {
|
||||
value = stack[stack_pointer]
|
||||
delete stack[stack_pointer]
|
||||
stack_pointer--
|
||||
return value
|
||||
} else {
|
||||
return "empty"
|
||||
}
|
||||
}
|
||||
</code>
|
||||
</pre>
|
||||
<p>The stack does not have to be explicitly declared. The values inside the LIFO correspond to the currently open markdown environments.
This is a clever trick, because when I need to close an html tag, I put the popped element between a <code></</code> and a <code>></code> instead of needing a lookup table.</p>
|
||||
<p>I also used a simple <code>last()</code> function to return the last pushed value in the stack without popping it out :</p>
|
||||
<pre><code># Function to get last value in LIFO
|
||||
function last() {
|
||||
return stack[stack_pointer]
|
||||
}
|
||||
</code>
|
||||
</pre>
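<p>The three helpers can be exercised on their own with a toy input. In this sketch, which is not the parser itself, every input line is treated as an environment name and pushed, and the stack is unwound at the end; <code>stack_demo</code> is a hypothetical name, and <code>pop</code> declares <code>value</code> as a local parameter, a small variation on the version above.</p>

```shell
# Hypothetical demo: push every input line, then unwind the stack in END.
stack_demo() {
    awk '
    function push(value) { stack[++stack_pointer] = value }
    function last()      { return stack[stack_pointer] }
    function pop(   value) {            # "value" is a local variable here
        if (stack_pointer == 0) return "empty"
        value = stack[stack_pointer]
        delete stack[stack_pointer--]
        return value
    }
    BEGIN { push("none") }
    { push($0) }
    END {
        # The popped name is reused directly as the closing tag
        while (last() != "none") print "</" pop() ">"
    }'
}

printf 'blockquote\nul\n' | stack_demo
# prints:
# </ul>
# </blockquote>
```

<p>The closing tags come out in reverse order of opening, which is exactly what nested html requires.</p>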
|
||||
<p>This way, parsing lists became trivial : </p>
|
||||
<pre><code># Matching unordered lists
|
||||
/^[-+*] / {
|
||||
env = last()
|
||||
if (env == "ul" ) {
|
||||
# In an unordered list block, print a new item
|
||||
print "<li>" substr($0, 3) "</li>"
|
||||
} else {
|
||||
# Otherwise, init the unordered list block
|
||||
push("ul")
|
||||
print "<ul>\n<li>" substr($0, 3) "</li>"
|
||||
}
|
||||
}
|
||||
</code>
|
||||
</pre>
|
||||
<p>I believe the code is pretty self explanatory: when the last environment is not <code>ul</code>, we enter this environment.
This translates to pushing it onto the stack.
Otherwise, it means we are already reading a list, and we only need to add a new element to it.</p>
|
||||
<h2>Parsing the simple paragraph and ending the parser</h2>
|
||||
<p>I showed examples of lists and headers, but it works the same way for code blocks, blockquotes, etc. Only the simple paragraph is different :
it does not start with a specific character. That is, to match it, we match every line that does not start with a special character.
I have no idea if this is the best solution, but so far it has proved to work:</p>
|
||||
<pre><code># Matching a simple paragraph
|
||||
!/^(#|\*|-|\+|>|`|$|\t| )/ {
|
||||
env = last()
|
||||
if (env == "none") {
|
||||
# If no block, print a paragraph
|
||||
print "<p>" $0 "</p>"
|
||||
} else if (env == "blockquote") {
|
||||
print $0
|
||||
}
|
||||
}
|
||||
</code>
|
||||
</pre>
|
||||
<p>As with <code>BEGIN</code>, AWK provides the possibility to execute code at the very end of the input, with the <code>END</code> keyword.
Naturally we need to empty the stack and close all html tags that might have been opened during the parsing.
It only is a while loop that runs until the last environment is "none", the value the stack was initialized with : </p>
|
||||
<pre><code>END {
|
||||
env = last()
|
||||
while (env != "none") {
|
||||
env = pop()
|
||||
print "</" env ">"
|
||||
env = last()
|
||||
}
|
||||
}
|
||||
</code>
|
||||
</pre>
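<p>Putting the list rule and the <code>END</code> block together gives a tiny but complete converter for a flat unordered list. This is a condensed sketch rather than the actual parser: <code>render_list</code> is a hypothetical name, the stack helpers are shortened, and closing a list in the middle of a document is left out.</p>

```shell
# Hypothetical wrapper: converts a flat markdown list read on stdin.
render_list() {
    awk '
    function push(value) { stack[++stack_pointer] = value }
    function pop()       { return stack[stack_pointer--] }
    function last()      { return stack[stack_pointer] }
    BEGIN { push("none") }
    /^[-+*] / {
        if (last() != "ul") {      # first item: open the list block
            push("ul")
            print "<ul>"
        }
        print "<li>" substr($0, 3) "</li>"
    }
    END {
        # Close whatever is still open when the input ends
        while (last() != "none") print "</" pop() ">"
    }'
}

printf '%s\n' '- first' '* second' | render_list
# prints:
# <ul>
# <li>first</li>
# <li>second</li>
# </ul>
```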
|
||||
<p>This way we are able to simply parse markdown and turn it into an HTML file.</p>
|
||||
<h2>Parsing inline features</h2>
|
||||
<p>For now we have seen a way to parse blocks, but markdown also handles strong, emphasis and links. However, these tags can appear anywhere in a line.
|
||||
Hence we need to be able to parse these lines apart from the block itself : indeed a header can contain a strong and a link.</p>
|
||||
<p>The previously introduced but very useful function <code>match</code> fits this need : it literally is a regex engine, looking for a pattern in a string.
|
||||
Whenever the pattern is found, two global variables are filled :<ul>
|
||||
<li>RSTART : the index of the first character of the matched substring</li>
<li>RLENGTH: the length of the matched substring</li>
|
||||
</ul>
|
||||
</p>
|
||||
<p>For the following, <code>line</code> represents the line processed by the function, as the following <code>while</code> loops are actually part of a single function.</p>
|
||||
<p>This way <code>match(line, /\*([^*]+)\*/)</code> matches a string (that does not contain any <code>*</code>) surrounded by two <code>*</code>, corresponding to an emphasis text.
The <code>*</code> are escaped as they are special characters, and the <em>group</em> is delimited by the parentheses.
|
||||
To match several instances of emphasis text within a line, a simple <code>while</code> will do the trick.
|
||||
We now only have to insert html tags <code><em></code> at the right places around the matched text, and we are good to go.
|
||||
We can save the global variables <code>RSTART</code> and <code>RLENGTH</code> for further use, in case they get changed by a later call to <code>match</code>. Using them we also can extract the
|
||||
matched substrings and reconstruct the actual html string :</p>
|
||||
<pre><code>while (match(line, /\*([^*]+)\*/)) {
|
||||
start = RSTART
|
||||
end = RSTART + RLENGTH - 1
|
||||
# Build the result: before match, <em>, content, </em>, after match
|
||||
line = substr(line, 1, start-1) "<em>" substr(line, start+1, RLENGTH-2) "</em>" substr(line, end+1)
|
||||
}
|
||||
</code>
|
||||
</pre>
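<p>Run as a standalone filter, the loop behaves as follows. The wrapper name <code>emphasize</code> is an assumption made for this sketch, and it only handles single-asterisk emphasis: a <code>**strong**</code> span would have to be converted before this loop runs.</p>

```shell
# Hypothetical filter: turns *text* into <em>text</em> on every input line.
emphasize() {
    awk '
    {
        line = $0
        while (match(line, /\*[^*]+\*/)) {
            start = RSTART
            end = RSTART + RLENGTH - 1
            # before match, <em>, content without the two "*", </em>, after match
            line = substr(line, 1, start - 1) "<em>" substr(line, start + 1, RLENGTH - 2) "</em>" substr(line, end + 1)
        }
        print line
    }'
}

printf 'a *b* and *c d*\n' | emphasize
# prints: a <em>b</em> and <em>c d</em>
```

<p>The loop terminates because each iteration removes one pair of <code>*</code> from <code>line</code>.</p>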
|
||||
<p>The while loop enables us to repeat this process as many times as this pattern is encountered within the line.
|
||||
|
||||
We now can repeat the pattern for all inline features, e.g. strong and code.</p>
|
||||
<p>The case of links is a bit more involved, as we need to match two groups : the actual text and the url itself.
No real issue here: the naïve way is to match the whole link, and then look for both the text and the url within the matched whole.</p>
|
||||
<p>This way <code>match(line, /\[([^\]]+)\]\([^\)]+\)/)</code> matches a text between <code>[]</code> followed by a text between <code>()</code> : the markdown representation of links.
|
||||
As above, we store the <code>start</code> and <code>end</code> and also the whole match :
|
||||
<pre><code>start = RSTART
|
||||
end = RSTART + RLENGTH - 1
|
||||
matched = substr(line, RSTART, RLENGTH)
|
||||
</code>
|
||||
</pre>
|
||||
|
||||
It is possible to apply the match function on this <code>matched</code> string, and extract, first, the text in <code>[]</code>, and then the text in <code>()</code></p>
|
||||
<pre><code>if (match(matched, /\[([^\]]+)\]/)) {
|
||||
matched_link = substr(matched, RSTART+1, RLENGTH-2)
|
||||
}
|
||||
if (match(matched, /\([^\)]+\)/)) {
|
||||
matched_url = substr(matched, RSTART+1, RLENGTH-2)
|
||||
}
|
||||
</code>
|
||||
</pre>
|
||||
<p>As the link text and the url are stored, using the variables <code>start</code> and <code>end</code>, it is easy to reconstruct the html line :</p>
|
||||
<pre><code>line = substr(line, 1, start-1) "<a href=\"" matched_url "\">" matched_link "</a>" substr(line, end+1)
|
||||
</code>
|
||||
</pre>
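<p>The link steps can be assembled into one filter to check that they fit together. This is a sketch under assumptions: <code>linkify</code> is a hypothetical name, the variables <code>text</code> and <code>url</code> stand in for <code>matched_link</code> and <code>matched_url</code>, and the bracket expressions are written as <code>[^]]</code> and <code>[^)]</code>, which is equivalent and portable across awk implementations.</p>

```shell
# Hypothetical filter: turns [text](url) into <a href="url">text</a>.
linkify() {
    awk '
    {
        line = $0
        while (match(line, /\[[^]]+\]\([^)]+\)/)) {
            start = RSTART
            end = RSTART + RLENGTH - 1
            matched = substr(line, RSTART, RLENGTH)
            match(matched, /\[[^]]+\]/)            # the [text] part
            text = substr(matched, RSTART + 1, RLENGTH - 2)
            match(matched, /\([^)]+\)/)            # the (url) part
            url = substr(matched, RSTART + 1, RLENGTH - 2)
            line = substr(line, 1, start - 1) "<a href=\"" url "\">" text "</a>" substr(line, end + 1)
        }
        print line
    }'
}

printf 'see [docs](https://example.com) now\n' | linkify
# prints: see <a href="https://example.com">docs</a> now
```

<p>Saving <code>start</code> and <code>end</code> before the inner calls matters here, since each call to <code>match</code> overwrites <code>RSTART</code> and <code>RLENGTH</code>.</p>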
|
||||
<p>The inline parsing function is now complete; all we have to do is apply it systematically to the text within html tags, and this finishes the markdown parser.</p>
|
||||
<p>This, of course, is the first brick of a static site generator, maybe the most complex one.
We shall see up next how to orchestrate this parser to make it an actual site generator.</p>
|
||||
<p>The code is available in the <a href="https://git.simonpetit.top/simonpetit/top">repo</a>.</p>
|
||||
<footer>
|
||||
<div></div>
|
||||
</footer>
|
||||
</article>
|
||||
</body>
|
||||
|
||||
</html>
|
||||
83
posts/markdown_testing_suite.html
Normal file
@ -0,0 +1,83 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="fr" dir="ltr">
|
||||
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<title>simpet</title>
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
|
||||
<link href="https://fonts.googleapis.com/css?family=Cutive+Mono|IBM+Plex+Mono&display=swap" rel="stylesheet">
|
||||
<link rel="stylesheet" type="text/css" href="../css/poststyle.css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<h1 class='title'><a href="../index.html">simpet</a></h1>
|
||||
<article>
|
||||
<div class='dates'>
|
||||
<p>Created at: <time datetime="2024-12-09 14:51:41">2024-12-09 14:51:41</time></p>
|
||||
<p>Updated at: <time datetime="2025-02-03 14:05:14">2025-02-03 14:05:14</time></p>
|
||||
</div>
|
||||
<h1>A test suite for markdown parser</h1>
|
||||
<p>As I implemented my own markdown parser for <a href="https://git.simonpetit.top/simonpetit/bob">bob</a>, my static site (blog) generator, I also wanted to make sure it was parsing markdown correctly.</p>
|
||||
<p>Hence I thought about a custom testing suite. After all, this blog is also about making things from scratch to get a better understanding of how things work overall.</p>
|
||||
<h2>The concept </h2>
|
||||
<p>Although this is a custom script, I still wanted to make it somewhat generic. In other words, I wanted it to be usable against any markdown parser (assuming it follows certain input/output constraints).</p>
|
||||
<p>For example, if the test script is <code>test_md_parser</code>, the parser <code>parser</code>, then actually testing the parser shall be the command :</p>
|
||||
<pre><code>test_md_parser parser
|
||||
</code>
|
||||
</pre>
|
||||
<p>The one condition is that <code>parser</code> takes the markdown string on its standard input, and prints the rendered html to its standard output.
This way the test script can feed it custom markdown, for which it knows the expected html, and directly compare that with the output of the parser.</p>
|
||||
<p>One more thing is, my markdown parser is written as an <code>awk</code> script, but some may be <code>bash</code> scripts or even executables. This means I need to add an argument to specify the interpreter (if needed).
|
||||
In my case this would look like this :</p>
|
||||
<pre><code>test_md_parser parser.awk awk
|
||||
</code>
|
||||
</pre>
|
||||
<p>and for a bash script, it will be as such :</p>
|
||||
<pre><code>test_md_parser parser.sh bash
</code>
</pre>
<h2>Unit testing</h2>
|
||||
<p>The purpose of the testing suite is to compare an expected output with the actual output for typical markdown syntax.
|
||||
I started by making an array of size <code>3n</code>, <code>n</code> being the number of tests. Indeed for display purposes each test has <ul>
|
||||
<li>a title : quickly defining what kind of syntax is being tested</li>
|
||||
<li>a markdown input: a legal markdown syntax text</li>
|
||||
<li>an expected output: the corresponding html output</li>
|
||||
</ul>
|
||||
</p>
|
||||
<p>This approach has flaws, obviously, the biggest one being that the same html can be written in several ways. Indeed this html :</p>
|
||||
<pre><code><h1>Title</h1>
|
||||
</code>
|
||||
</pre>
|
||||
<p>is strictly equivalent to :</p>
|
||||
<pre><code><h1>
|
||||
Title
|
||||
</h1>
|
||||
</code>
|
||||
</pre>
|
||||
<p>whereas the strings are not equal.</p>
|
||||
<p>The most naive approach I came up with (because I wanted a quick prototype, I didn't think much about it) was to remove all newlines from the parser output, using <code>tr -d '\n'</code>.
The best solution would be to implement an html minifier, and apply this minification to the output of the parser (and maybe the expected result as well) to ensure perfect equality no matter how many newlines and trailing spaces there are. Most likely this could be done in a later version.</p>
|
||||
<p>All tests are hard coded within the script. I am aware this might not be the best solution, but on the other hand, as this is a script and not a compiled program, changing "hardcoded" tests is as easy as editing a separate config file, so it does not bother me right now.</p>
|
||||
<h2>Implementation of the testing suite</h2>
|
||||
<p>As mentioned earlier, all tests are defined in an array of size <code>3n</code> as such : <pre><code>
|
||||
declare -a tests=(
"Test header"
"# Header1"
"<h1>Header1</h1>"
|
||||
)
|
||||
</code>
|
||||
</pre>
|
||||
|
||||
and so on...</p>
|
||||
<p>Running a single test takes an <code>input</code>, runs the parser against it, stores the <code>output</code> and compares it to the <code>expected</code>. It returns <code>0</code> for success and <code>1</code> in case of failure.
Looping over the whole array introduced above, with the <code>input</code> being every element at index <code>3i+1</code> and <code>expected</code> every element at index <code>3i+2</code>, ensures all tests are executed in order and as they should.</p>
|
||||
<p>To know whether or not all tests were successful, I simply sum all the statuses returned by the <code>run_test()</code> function : if the sum equals <code>0</code>, logically all tests are ok.
I also added a nice console output that prints all tests as they are being executed, showing successful tests in green and failed tests in red.</p>
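<p>The runner described above could look roughly like the sketch below. It is not the actual script: <code>toy_parser</code> is a stand-in that only understands headers, <code>run_all</code> is a hypothetical name, the triplets are passed as arguments instead of being read from an array so that the sketch runs in any POSIX shell, and the colours are left out.</p>

```shell
# Stand-in for the parser under test (assumption): markdown on stdin, html on stdout.
toy_parser() {
    awk '/^#+ / { match($0, /^#+ /); n = RLENGTH - 1; print "<h" n ">" substr($0, RLENGTH + 1) "</h" n ">" }'
}

# $1: markdown input, $2: expected html. Returns 0 on success, 1 on failure.
run_test() {
    output=$(printf '%s\n' "$1" | toy_parser | tr -d '\n')
    [ "$output" = "$2" ]
}

# Arguments come in groups of three: title, markdown input, expected html.
run_all() {
    failures=0
    while [ "$#" -ge 3 ]; do
        if run_test "$2" "$3"; then
            printf 'ok   %s\n' "$1"
        else
            printf 'FAIL %s\n' "$1"
            failures=$((failures + 1))
        fi
        shift 3
    done
    return "$failures"    # 0 only when every test passed
}

run_all "Test header" "# Header1" "<h1>Header1</h1>"
# prints: ok   Test header
```

<p>Returning the number of failures as the exit status keeps the "sum equals 0" check a one-liner for the caller.</p>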
|
||||
<p>This very simple testing suite would be part of the <a href="https://git.simonpetit.top/simonpetit/bob">bob</a> blog generator.</p>
|
||||
<footer>
|
||||
<div></div>
|
||||
</footer>
|
||||
</article>
|
||||
</body>
|
||||
|
||||
</html>
|
||||