Percent Encoding URLs
I use djot for the markup on my static site generator. In djot, the link syntax looks like this: [link text](url)
This works fine for most links, but it breaks on links that include parentheses. This is mostly a problem for Wikipedia, which uses parentheses to disambiguate topics. For example:
As written, this will break the link, since djot will parse the first closing ) as the end of the URL, which is not the complete link. This is because, by design, djot parses markup in linear time without backtracking.
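To make the failure concrete, here is a toy model in Elixir of a scanner that stops at the first closing parenthesis. It is only an illustration of the behavior described above, not djot's actual parser, and the Mercury URL is a stand-in chosen for the example rather than the link from this post.

```elixir
# Toy model only: cut the URL at the first ")" after "](".
markup = "[Mercury](https://en.wikipedia.org/wiki/Mercury_(planet))"

# Separate the link text from everything after "](".
[_text, rest] = String.split(markup, "](", parts: 2)

# A scanner that never backtracks takes the first ")" as the end of the URL.
[url, leftover] = String.split(rest, ")", parts: 2)

IO.inspect(url)      # "https://en.wikipedia.org/wiki/Mercury_(planet"
IO.inspect(leftover) # ")"
```

The URL comes out one character short, and the stray ) is left behind as ordinary text.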
Percent Encoding Parentheses
This is where percent encoding comes in. Percent encoding is a method, defined in the URI specification, for encoding arbitrary data in a URI. If you would like to read the details, they are specified in RFC 3986, Uniform Resource Identifier (URI): Generic Syntax.
Specifically, I needed to encode ) as %29. The opening parenthesis ( is also a reserved character and should be escaped, but I don’t run into any issues with djot with that character, so I leave it out here.
Using my link from before, that looks like this:
If I use this markup: Variety, I correctly get the link to the Wikipedia article.
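As a quick sanity check, the manual escape can be sketched in Elixir. The URL below is an illustrative stand-in, not the link from this post. Decoding the escaped form gives back the original, which is why the link still resolves to the same page.

```elixir
# Illustrative URL, assumed for this example.
url = "https://en.wikipedia.org/wiki/Mercury_(planet)"

# Escape only the closing parenthesis, as done by hand above.
escaped = String.replace(url, ")", "%29")
IO.puts(escaped)
# https://en.wikipedia.org/wiki/Mercury_(planet%29

# Percent-decoding restores the original URL.
IO.inspect(URI.decode(escaped) == url)
# true
```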
Elixir URI encoding
When I first learned about percent encoding, I quickly found the URI.encode/1 function in Elixir. To my dismay, it didn’t correctly escape parentheses:
iex()> URI.encode("(hello)")
"(hello)"
Frustrated, I continued escaping my links for djot by hand.
Today, I re-read the full documentation of URI.encode and found a line that I previously missed:
This function also accepts a predicate function as an optional argument. If passed, this function will be called with each byte in string as its argument and should return a truthy value (anything other than false or nil) if the given byte should be left as is, or return a falsy value (false or nil) if the character should be escaped. Defaults to URI.char_unescaped?/1.
The documentation for URI.char_unescaped?/1 explains that it deliberately escapes the minimum required and leaves reserved characters unescaped:
Checks if character is allowed unescaped in a URI.
This is the default used by URI.encode/2 where both reserved (char_reserved?/1) and unreserved characters (char_unreserved?/1) are kept unescaped.
It also hints at a predicate function better suited to this, URI.char_unreserved?/1, which escapes parentheses like I need:
iex()> URI.encode("(hello)", &URI.char_unreserved?/1)
"%28hello%29"
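One thing worth checking before applying this to a whole link: URI.char_unreserved?/1 also treats : and / as characters to escape. The sketch below shows that on an illustrative URL (assumed for the example), along with a hypothetical narrower predicate that escapes only the closing parenthesis on top of the default. This is one possible approach, not necessarily what the script mentioned in this post does.

```elixir
# Illustrative URL, assumed for this example.
url = "https://en.wikipedia.org/wiki/Mercury_(planet)"

# char_unreserved?/1 keeps only letters, digits, "-", ".", "_" and "~",
# so the ":" and "/" of a full URL are escaped as well.
everything = URI.encode(url, &URI.char_unreserved?/1)
IO.puts(everything)
# https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FMercury_%28planet%29

# Hypothetical narrower predicate: the default rule, except ")" is escaped.
keep = fn byte -> byte != ?) and URI.char_unescaped?(byte) end
only_close = URI.encode(url, keep)
IO.puts(only_close)
# https://en.wikipedia.org/wiki/Mercury_(planet%29
```

Either way, a % already present in the input is not kept by these predicates, so a link that is already percent-encoded would be encoded a second time (% becomes %25).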
djot’s linear parsing forces it to assume that the first ) it encounters is the end of a URL.
Using percent encoding, I can encode ) as %29, which solves the djot issue while allowing the browser to route the URL correctly.
URI.encode doesn’t escape parentheses by default, but does allow an alternate predicate function that does.

I wrote a short script that I can call using kakoune’s pipe command. Since it has no dependencies, the startup time is minimal, but I may rewrite it in a language with a faster startup time later.
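For readers who want something similar, here is a minimal sketch of what such a filter could look like. It is not the script from this post: the module name ParenFilter, its functions, and the choice to escape only ) are all assumptions made for the example. It reads lines from an input device and writes the escaped text to an output device, which is the shape a pipe command expects.

```elixir
defmodule ParenFilter do
  # Hypothetical helper: percent-encode ")" and leave newlines untouched,
  # otherwise following URI.encode's default rule.
  def escape(text), do: URI.encode(text, &keep?/1)

  defp keep?(byte) do
    byte == ?\n or (byte != ?) and URI.char_unescaped?(byte))
  end

  # Copy every line from input to output, escaping it on the way.
  # Both devices default to standard input and output.
  def run(input \\ :stdio, output \\ :stdio) do
    input
    |> IO.stream(:line)
    |> Enum.each(fn line -> IO.write(output, escape(line)) end)
  end
end

# Demonstration with an in-memory device instead of real standard input.
{:ok, demo_input} = StringIO.open("https://en.wikipedia.org/wiki/Mercury_(planet)\n")
ParenFilter.run(demo_input)
# prints https://en.wikipedia.org/wiki/Mercury_(planet%29
```

In an actual script, the demonstration lines would be replaced by a bare call to ParenFilter.run(), so the editor's selection arrives on standard input and the escaped text replaces it.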