May 27 - Syntax Highlighting

I had an internship interview this last Friday (which I was accepted for!), so the past week has been a mix of prepping for the internship via leetcode (which didn't really help) and depressurizing after the interview. For this reason, I haven't really worked much on rano lately. But now that I'm rested, I'll have loads of time until the internship starts, so I'll be able to work on implementing some more features.

When I last left off about a week ago, I had just refactored the entire codebase to use gap buffers instead of the weird linked list implementation I had before. That had the benefit of simplifying a lot of the cursor handling with minimal overhead. Since then, I've also implemented copy-pasting and undo/redo, two of the three features my text editor needs before I can really call it a fully-fledged text editor. That leaves just one feature: syntax highlighting.

I've done my research, and there's a few ways of implementing syntax highlighting in a text editor. The simplest way is to do what nano does, and use regex. You can see in the source code that the way nano implements syntax highlighting is by:

  1. Compiling together a list of regex-based syntax highlighting rules (in nanorc files, they look something like color green "//.*")
  2. Each time a line is redrawn, loop through each regex
  3. For each looped-through regex, find all matches that intersect with the drawn line and redraw the matching text with the specified color

I've implemented a modified version of nano's regex-based matching system (currently only in the config-file branch - I'm waiting until I implement a proper configuration system to push to main) which operates on each (full-window) draw call instead, creating a vector of regex matches that intersect with the frame and applying the first match that "works".

Regex-based syntax highlighting is simple to implement, and it works decently well - I've used nano almost every day for about 5 years, and I haven't had any qualms with it so far. But there are a few drawbacks, some minor and some major. Although regular expressions are fairly expressive, they aren't enough to tokenize every language. Nano's syntax highlighting tends to break on multi-line strings, functions in if-statements, and other complicated highlighting that involve multiple matches in the same region. I can mostly tolerate these quirks, even if they aren't ideal. The more major issue is that this method of regex-based syntax highlighting is slow. In order to determine the syntax of a particular region of text, the syntax highlighter needs all prior text as context. This means that we're running an O(n)O(n) regex engine every time we do something that changes the text displayed - that's every time we add or delete text, or any time we scroll (!). There are, of course, ways I could optimize my regex approach, but there are also alternatives I could use that mitigate these problems entirely.

The two other editors I use for nontrivial amounts of time are VSCode (for writing Java for school, and React for gamut) and Emacs (for the occasional org-roam note), and both are "proper" editors with "proper" syntax highlighting. VSCode's syntax highlighting uses a mix of regex (via the TextMate grammar engine, built atop Oniguruma) and semantic parsing via language servers. Because the TextMate engine "stores some state at the end of a tokenized line," it can avoid having to retokenize the entire document every time a change is made, unlike nano's syntax highlighting system. Emacs, on the other hand, uses Tree-sitter to create ASTs from which text is highlighted. Tree-sitter is an incremental parser, so it can update the AST from edits made to the text document instead of reparsing the entire file. These two solutions are probably your best bet if you want to implement a robus syntax highlighting system for your full-fledged, "proper" editor.

There are a few ways of improving the purely regex-based system I have implemented already, as well. I could reduce the number of times I call the syntax highlighting function by storing previous regex results and reusing them in cases where the cursor position is changed without changing the frame position. I could also run the regex on the entire document each time an edit is made, and reuse the regex results for any type of cursor movement. In this case, I would have to loop over regex matches for the entire document to find a hit for a particular index, but I could use an interval tree to store the regex matches which would speed the querying up. I'll probably try these optimizations if I need faster performance, before I jump to TextMate or Tree-sitter.

That's all I have to report today - I hope this was an informative read on the current state of syntax highlighting.

P.S.

I found a few links while researching that were very interesting or informative, and which I would like to share:

A Play on Functional Expressions is a paper on a weighted regex parsing algorithm in the form of a 3-act play - absolutely insane.
Here's a blog post by the VSCode team on how they were able to improve bracket-pair colorization performance by a factor of 10k.

May 19 - Updates regarding Rano

Some of you may know I've been working on rano, a tiny nano-inspired text editor written in rust. I started one week before finals (hopefully) as a way to learn rust - and as a way to get better at coding - and I think I've come pretty far considering my coding ability. In this post, I'm going to reflect back on some of the implementation details, challenges, solutions, and future steps I'm going to take as a sort of engineering log.

Rano is written in rust using a very barebones ncurses wrapper, which might make it not a great intro project to Rust (a lot of the code I write is more C-like than rust-like). I've refactored the codebase several times, the most major refactoring being a migration from a linked list-based internal text representation (like the one nano uses) to a gap buffer. This turned out to be a great choice.

There are several pros and cons for using a linked list-based structure for storing the text buffer. First, the pros:

However, the tradeoff is that

A lot of the features I wanted in rano, like regex-based syntax highlighting and copy/paste, were difficult to implement using the linked list-based structure I had, so I decided to move to a gap buffer. Rewriting the code for scrolling, typing, and selecting using the gap buffer was much nicer, and I ended up writing less code for these functions because of the added simplicity, which was a nice plus.

I decided to implement my own gap buffer instead of relying on an existing distribution for the fun of it. I ended up using a single character vector to store text, along with the position and the size of the gap. As my use case is mainly small files that are no more than a megabyte large, I decided that I'm fine with just resizing the vector when my gap is depleted. This made iteration, insertion, and deletion very simple to implement, as I didn't have to juggle two different vectors.

The other features of rano aren't as interesting to talk about - scrolling, inserting and deleting characters are pretty standard stuff. At the end of the day, a text editor is just a bunch of fancy array operations for the most part.

There's three features I'm planning on implementing before I declare rano "good enough" - syntax highlighting, find and replace, and undo/redo. Both syntax highlighting and find can be implemented using regex, but I want my syntax highlighting to be configurable, so I'm most likely going to write my own parser for config files that declare the syntax highlighting rules. Find and replace should be fairly simple. Undo/redo seems like it would be a bit difficult - I would have to implement some sort of tree-like structure (which is notoriously annoying in Rust) to store the edit history, but I could just represent each action as an enum variant which should simplify things a bit.

Thanks for listening to me ramble for a bit. I'm not the best rust programmer, but I'm hoping I'll get to a level where I can comfortably write production code fairly soon. Until then, I'll be kickboxing the borrow checker :)

May 15 - A theorem involving arctangents

I flew back to the Bay Area from Madison today. I'm gonna be in the area (probably) until the beginning of August, when I have to move into my new apartment. It's gonna be a lot of sitting around and working on random projects I think of, so I'm going to be writing some small "tidbits" to remind myself of the things that I've done (and share it with other people, too).

On the plane, I had this idea of a theorem involving the value of arctangents. More specifically, it has to do with the set Qtan={atan(q):qQ}\mathbb{Q}_{\mathrm{tan}} = \{\mathrm{atan}(q)\,:\,q\in\mathbb{Q}\}, the set of inverse tangents of all rational numbers.

I make the claim that Qtan\mathbb{Q}_{\mathrm{tan}} forms a group under addition mod π\pi.

First, consider the ring Z[i]={a+bi:a,bZ}\mathbb{Z}[i] = \{a + bi\,:\,a, b \in \mathbb{Z}\}, the set of complex numbers with integer coefficients. If we have z=(a+bi),z=(a+bi)z = (a + bi), z' = (a' + b'i) with z,zZ[i]z, z'\in \mathbb{Z}[i], then zz=(a+bi)(a+bi)=(aabb)+(ab+ab)iZ[i]zz' = (a + bi)(a' + b'i) = (aa' - bb') + (ab' + a'b)i \in \mathbb{Z}[i], so Z[i]\mathbb{Z}[i] is closed under multiplication (which makes sense, because it's a ring).

Importantly, you might have learned in algebra class that you can express any complex number z=a+biz = a + bi, a,bRa, b \in \mathbb{R} as reθire^{\theta i}, where rR+r \in \mathbb{R}^+ and θ[0,2π)\theta \in [0, 2\pi). Then, for any two z=reθi,z=reϕiz = re^{\theta i}, z' = r'e^{\phi i}, we have zz=rre(θ+ϕ)izz' = rr'e^{(\theta + \phi)i}. So our multiplication of complex numbers turns into a sum in the exponents.

But what exactly does θ\theta represent? Well, if we view our zCz \in \mathbb{C} as a vector in R2\mathbb{R}^2, then θ\theta is the counterclockwise angle created by the x-axis unit vector and it. If we think about complex numbers in this way, then products of complex numbers boils down to multiplying the magnitudes and adding the angles!

The product of two complex numbers in polar form is their angles summed. (Source: Visual Complex Analysis by Needham, Ch. 1 Fig. 11a)

Back to our theorem. Remember from trigonometry class that for a right triangle, tan(θ)=\mathrm{tan}(\theta) = opposite ÷\div adjacent. Then our set Qtan\mathbb{Q}_{\mathrm{tan}} is just the set of all angles of right triangles with integer side lengths.

The angle of a complex number can be determined using the real and imaginary components. (Source: Visual Complex Analysis by Needham, Ch. 1 Fig. 4)

We now have everything we need to prove our statement. Let q,rQ+q, r \in \mathbb{Q}^{+}; then there exist a,bZa, b \in \mathbb{Z} s.t. q=a÷bq = a \div b, and a,bZa', b' \in \mathbb{Z} s.t. r=a÷br = a' \div b'. By the product of complex numbers, we know that atan(q)+atan(r)\mathrm{atan}(q) + \mathrm{atan}(r) is the angle formed by the x-axis unit vector and the product of the complex numbers (a+bi),(a+bi)Z[i](a + bi), (a' + b'i) \in \mathbb{Z}[i]. Let c+di=(a+bi)(a+bi)Z[i]c + di = (a + bi)(a' + b'i) \in \mathbb{Z}[i]. Then atan(dc)=atan(q)+atan(r)\mathrm{atan}\left(\frac{d}{c}\right) = \mathrm{atan}(q) + \mathrm{atan}(r), so the sum of two values in Qtan\mathbb{Q}_{\mathrm{tan}} is also in Qtan\mathbb{Q}_{\mathrm{tan}}.

Now let a+biZ[i]a + bi \in \mathbb{Z}[i]. Then (a+bi)(abi)Z(a + bi)(a - bi) \in \mathbb{Z}, so atan(ba)+atan(ba)=0\mathrm{atan}\left(\frac{b}{a}\right) + \mathrm{atan}\left(-\frac{b}{a}\right) = 0. Thus Qtan\mathbb{Q}_{\mathrm{tan}} contains inverses. This shows that Qtan\mathbb{Q}_{\mathrm{tan}} forms a group under addition modulo π\pi! ☺