From 3d6583383bffc75694635fadca2bc89cd5f980ba Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marc=20Andr=C3=A9=20Tanner?= Date: Sat, 11 Apr 2015 19:58:05 +0200 Subject: Rename README -> README.md --- README.md | 597 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 597 insertions(+) create mode 100644 README.md (limited to 'README.md') diff --git a/README.md b/README.md new file mode 100644 index 0000000..d5c6e93 --- /dev/null +++ b/README.md @@ -0,0 +1,597 @@ +Why another text editor? +======================== + +It all started when I was recently reading the excellent +[Project Oberon](http://www.inf.ethz.ch/personal/wirth/ProjectOberon/), +where in chapter 5 a data structure for managing text is introduced. +I found this rather appealing and wanted to see how it works in practice. + +After some time I decided that besides just having fun hacking around I +might as well build something which could (at least in the long run) +replace my current editor of choice: vim. + +This should be accomplished by a reasonable amount of clean (your mileage +may vary), modern and legacy free C code. Certainly not an old, +[500'000 lines long](https://www.openhub.net/p/vim) #ifdef cluttered +mess which tries to run on all broken systems ever envisioned by mankind. + +Admittedly vim has a lot of functionally, most of which I don't use. I +therefore set out with the following main goals: + + - Unicode aware + + - binary clean + + - handle arbitrary files (this includes large ones, think >100M SQL-dumps) + + - unlimited undo/redo support + + - syntax highlighting + + - regex search (and replace) + + - multiple file/window support + +The goal could thus be summarized as "80% of vim's features (in other +words the useful ones) implemented in roughly 1% of the code". + +Finally and most importantly it is fun! Writing a text editor presents +some interesting challenges and design decisions, some of which are +explained below. + +Text management using a piece table/chain +========================================= + +The core of this editor is a persistent data structure called a piece +table which supports all modifications in `O(m)`, where `m` is the number +of non-consecutive editing operations. This bound could be further +improved to `O(log m)` by use of a balanced search tree, however the +additional complexity doesn't seem to be worth it, for now. + +The actual data is stored in buffers which are strictly append only. +There exist two types of buffers, one fixed-sized holding the original +file content and multiple append-only ones storing the modifications. + +A text, i.e. a sequence of bytes, is represented as a double linked +list of pieces each with a pointer into a buffer and an associated +length. Pieces are never deleted but instead always kept around for +redo/undo support. A span is a range of pieces, consisting of a start +and end piece. Changes to the text are always performed by swapping +out an existing, possibly empty, span with a new one. + +An empty document is represented by two special sentinel pieces which +always exist: + + /-+ --> +-\ + | | | | + \-+ <-- +-/ + #1 #2 + +Loading a file from disk is as simple as mmap(2)-ing it into a buffer, +creating a corresponding piece and adding it to the double linked list. +Hence loading a file is a constant time operation i.e. independent of +the actual file size (assuming the operating system uses demand paging). + + /-+ --> +-----------------+ --> +-\ + | | | I am an editor! | | | + \-+ <-- +-----------------+ <-- +-/ + #1 #3 #2 + +Insert +------ + +Inserting a junk of data amounts to appending the new content to a +modification buffer. Followed by the creation of new pieces. An insertion +in the middle of an existing piece requires the creation of 3 new pieces. +Two of them hold references to the text before respectively after the +insertion point. While the third one points to the newly added text. + + /-+ --> +---------------+ --> +----------------+ --> +--+ --> +-\ + | | | I am an editor| |which sucks less| |! | | | + \-+ <-- +---------------+ <-- +----------------+ <-- +--+ <-- +-/ + #1 #4 #5 #6 #2 + + modification buffer content: "which sucks less" + +During this insertion operation the old span [3,3] has been replaced +by the new span [4,6]. Notice that the pieces in the old span were not +changed, therefore still point to their predecessors/successors, and can +thus be swapped back in. + +If the insertion point happens to be at a piece boundary, the old span +is empty, and the new span only consists of the newly allocated piece. + +Delete +------ + +Similarly a delete operation splits the pieces at appropriate places. + + /-+ --> +-----+ --> +--+ --> +-\ + | | | I am| |! | | | + \-+ <-- +-----+ <-- +--+ <-- +-/ + #1 #7 #6 #2 + +Where the old span [4,5] got replaced by the new span [7,7]. The underlying +buffers remain unchanged. + +Cache +----- + +Notice that the common case of appending text to a given piece is fast +since, the new data is simply appended to the buffer and the piece length +is increased accordingly. In order to keep the number of pieces down, +the least recently edited piece is cached and changes to it are done +in place (this is the only time buffers are modified in a non-append +only way). As a consequence they can not be undone. + +Undo/redo +--------- + +Since the buffers are append only and the spans/pieces are never destroyed +undo/redo functionality is implemented by swapping the required spans/pieces +back in. + +As illustrated above, each change to the text is recorded by an old and +a new span. An action consists of multiple changes which logically belong +to each other and should thus also be reverted together. For example +a search and replace operation is one action with possibly many changes +all over the text. Actions are kept in an undo respectively redo stack. + +A state of the text can be marked by means of a snapshotting operation. +The undo/redo functionality operates on such marked states and switches +back and forth between them. + +The history is currently linear, no undo / history tree is implemented. + +Properties +---------- + +The main advantage of the piece chain as described above is that all +operations are performed independent of the file size but instead linear +in the number of pieces i.e. editing operations. The original file buffer +never changes which means the `mmap(2)` can be performed read only which +makes optimal use of the operating system's virtual memory / paging system. + +The maximum editable file size is limited by the amount of memory a process +is allowed to map into its virtual address space, this shouldn't be a problem +in practice. The whole process assumes that the file can be used as is. +In particular the editor assumes all input and the file itself is encoded +as UTF-8. Supporting other encodings would require conversion using `iconv(3)` +or similar upon loading and saving the document, which defeats the whole +purpose. + +Similarly the editor has to cope with the fact that lines can be terminated +either by `\n` or `\r\n`. There is no conversion to a line based structure in +place. Instead the whole text is exposed as a sequence of bytes. All +addressing happens by means of zero based byte offsets from the start of +the file. + +The main disadvantage of the piece chain data structure is that the text +is not stored contiguous in memory which makes seeking around somewhat +harder. This also implies that standard library calls like the `regex(3)` +functions can not be used as is. However this is the case for all but +the most simple data structures used in text editors. + +Screen Drawing +============== + +The current code takes a rather simple approach to screen drawing. It +basically only remembers the starting position of the area being shown. +Then fetches a "screen full" of bytes and outputs one character at a +time until the end of the window is reached. A consequence of this +approach is that lines are always wrapped and horizontal scrolling is +not supported. + +No efforts are made to reduce the terminal output. This task is delegated +to the underlying curses library which already performs a kind of double +buffering. The window is always redrawn completely even if only a single +character changes. It turns out this is actually necessary if one wants +to support multiline syntax highlighting. + +While outputting the individual characters a cell matrix is populated +where each entry stores the length in bytes of the character displayed +at this particular cell. For characters spanning multiple columns the +length is always stored in the leftmost cell. As an example a tab has a +length of 1 byte followed by up to 7 cells with a length of zero. +Similarly a `\r\n` line ending occupies only one screen cell but has a +length of 2. + +This matrix is actually stored per line inside a double linked list of +structures representing screen lines. For each line we keep track of +the length in bytes of the underlying text, the display width of all +characters part of the line, and the logical line number. + +All cursor positioning is always performed in bytes from the start of +the file and works by traversing the double linked list of screen lines +until the correct line is found. Then the cell array is consulted to +move to the correct column. + +Syntax-Highlighting +------------------- + +The editor takes a similar regex-based approach to syntax highlighting +than sandy and reuses its syntax definitions but always applies them to +a "screen full" of text thus enabling multiline coloring. + +Window-Management +----------------- + +It is possible to open multiple windows via the `:split/:vsplit/:open` +commands or by passing multiple files on the command line. + +In principle it would be nice to follow a similar client/server approach +as sam/samterm i.e. having the main editor as a server and each window +as a separate client process with communication over a unix domain socket. + +That way window management would be taken care of by dwm or dvtm and the +different client processes would still share common cut/paste registers +etc. + +However at the moment I don't want to open that can of worms and instead +settled for a single process architecture. + +Search and replace +------------------ + +Currently the editor copies the whole text to a contiguous memory block +and then uses the standard regex functions from libc. Clearly this is not +a satisfactory solution for large files and kind of defeats the whole +effort spent on the piece table. + +The long term solution is to write our own regular expression engine or +modify an existing one to make use of the iterator API. This would allow +efficient search without having to double memory consumption. + +The used regex engine should use a non-backtracking algorithm. Useful +resources include: + + - [Russ Cox's regex pag](http://swtch.com/~rsc/regexp/) + - [TRE](https://github.com/laurikari/tre) as + [used by musl](http://git.musl-libc.org/cgit/musl/tree/src/regex) + which uses a parallel [TNFA matcher](http://laurikari.net/ville/spire2000-tnfa.ps) + - [Plan9's regex library](http://plan9.bell-labs.com/sources/plan9/sys/src/libregexp/) + which has its root in Rob Pike's sam text editor + - [RE2](https://github.com/google/re2) C++ regex library + +Command-Prompt +-------------- + +The editor needs some form of command prompt to get user input +(think `:`, `/`, `?` in vim). + +At first I wanted to implement this in terms of an external process, +similar to the way it is done in sandy with communication back to the +editor via a named pipe. + +At some point I thought it would be possible to provide all editor commands +as shell scripts in a given directory, then set $PATH accordingly and run +the shell. This would give us readline editing, tab completion, history and +Unicode support for free. But unfortunately it won't work due to quoting +issues and other conflicts of special symbols with different meanings. + +Later it occurred to me that the editor prompt could just be treated as +special 1 line file. That is all the main editor functionality is reused +with a slightly different set of key bindings. + +This approach also has the added benefit of further testing the main editor +component (in particular corner cases like editing at the end of the file). + +Editor Frontends +================ + +The editor core is written in a library like fashion which should make +it possible to write multiple frontends with possibly different user +interfaces/paradigms. The frontend to run is selected based on the +executable name. + +The default, and currently only, interface is a vim clone called vis. + +Key binding modes +----------------- + +The available key bindings for the different modes are arranged in a +hierarchical way in config.h (there is also an ascii tree giving an +overview in that file). Each mode can specify a parent mode which is +used to look up a key binding if it is not found in the current mode. +This reduces redundancy for keys which have the same meaning in +different modes. + +Each mode can also provide hooks which are executed upon entering/leaving +the mode and when there was an unmatched key. + +vis a vim like frontend +----------------------- + +The following section gives a quick overview over various vim features +and their current support in vis. + +### Operators + + d (delete) + c (change) + y (yank) + p (put) + > (shift-right) + < (shift-left), + J (join) + ~ (swap case) + gu (make lowercase) + gU (make uppercase) + +### Movements + + h (char left) + l (char right) + j (line down) + k (line up) + gj (display line down) + gk (display line up) + 0 (start of line) + ^ (first non-blank of line) + g_ (last non-blank of line) + $ (end of line) + % (match bracket) + b (previous start of a word) + B (previous start of a WORD) + w (next start of a word) + W (next start of a WORD) + e (next end of a word) + E (next end of a WORD) + ge (previous end of a word) + gE (previous end of a WORD) + { (previous paragraph) + } (next paragraph) + ( (previous sentence) + ) (next sentence) + gg (begin of file) + G (goto line or end of file) + | (goto column) + n (repeat last search forward) + N (repeat last search backwards) + f{char} (to next occurrence of char to the right) + t{char} (till before next occurrence of char to the right) + F{char} (to next occurrence of char to the left) + T{char} (till before next occurrence of char to the left) + /{text} (to next match of text in forward direction) + ?{text} (to next match of text in backward direction) + + An empty line is currently neither a word nor a WORD. + + The semantics of a paragraph and a sentence is also not always 100% + the same as in vim. + + Some of these commands do not work as in vim when prefixed with a + digit i.e. a multiplier. As an example in vim `3$` moves to the end + of the 3rd line down. However vis treats it as a move to the end of + current line which is repeated 3 times where the last two have no + effect. + + In general there are still a lot of improvements to be made in the + case movements are forced to be line or character wise. Also some of + them should be inclusive in some context and exclusive in others. + At the moment they always behave the same. + +### Text objects + + All of the following text objects are implemented in an inner variant + (prefixed with 'i') and a normal variant (prefixed with 'a'): + + w word + W WORD + s sentence + p paragraph + [,], (,), {,}, <,>, ", ', ` block enclosed by these symbols + + For sentence and paragraph there is no difference between the + inner and normal variants. + +### Modes + + At the moment there exists a more or less functional insert, replace + and visual mode (in both line and character wise variants). + +### Marks + + [a-z] general purpose marks + < start of the last selected visual area in current buffer + > end of the last selected visual area in current buffer + + No marks across files are supported. Marks are not preserved over + editing sessions. + +### Registers + + Only the 26 lower case registers `[a-z]` and 1 additional default register + is supported. + +### Undo/Redo and Repeat + + The text is currently snapshoted whenever an operator is completed as + well as when insert or replace mode is left. Additionally a snapshot + is also taken if in insert or replace mode a certain idle time elapses. + + Another idea is to snapshot based on the distance between two consecutive + editing operations (as they are likely unrelated and thus should be + individually reversible). + + The repeat command `.` works for all operators and is able to repeat + the last insertion or replacement. + +### Macros + +`[a-z]` are recoginized macro names, `q` starts a recording, `@` plays it back. +`@@` refers to the least recently recorded macro. + +### Command line prompt + + At the `:`-command prompt only the following commands are recognized: + + :nnn go to line nnn + :bdelete close all windows which display the same file as the current one + :edit replace current file with a new one or reload it from disk + :open open a new window + :qall close all windows, exit editor + :quit close currently focused window + :read insert content of another file at current cursor position + :split split window horizontally + :vsplit split window vertically + :new open an empty window, arrange horizontally + :vnew open an empty window, arrange vertically + :wq write changes then close window + :xit like :wq but write only when changes have been made + :write write current buffer content to file + :saveas save file under another name + :set set the options below + + tabwidth [1-8] + + set display width of a tab and number of spaces to use if + expandtab is enabled + + expandtab (yes|no) + + whether typed in tabs should be expanded to tabwidth spaces + + autoindent (yes|no) + + replicate spaces and tabs at the beginning of the line when + starting a new line. + + number (yes|no) + relativenumber (yes|no) + + whether absolute or relative line numbers are printed alongside + the file content + + syntax name + + use syntax definition given (e.g. "c") or disable syntax + highlighting if no such definition exists (e.g :set syntax off) + + Each command can be prefixed with a range made up of a start and + an end position as in start,end. Valid position specifiers are: + + . start of the current line + +n and -n start of the line relative to the current line + 'm position of mark m + /pattern/ first match after current position + + If only a start position without a command is given then the cursor + is moved to that position. Additionally the following ranges are + predefined: + + % the whole file, equivalent to 1,$ + * the current selection, equivalent to '<,'> + + The substitute command is recognized but not yet implemented. The `!` + command to filter text through an external program is also planned. + + History support, tab completion and wildcard expansion are other + worthwhile features. However implementing them inside the editor + feels wrong. + +### Tab <-> Space and Line endings \n vs \r\n + + Tabs can optionally be expaned to a configurable number of spaces. + The first line ending in the file determines what will be inserted + upon a line break (defaults to \n). + +### Jump list and change list + + A per window, file local jump list (navigate with `CTRL+O` and `CTRL+I`) + and change list (navigate with `g;` and `g,`) is supported. The jump + list is implemented as a fixed sized ring buffer. + +### Mouse support + + The mouse is currently not used at all. + +### Other features + + Things which I would like to add in the long term are: + + + code completion: this should be done as an external process. I will + have to take a look at the tools from the llvm / clang project. Maybe + dvtm's terminal emulation support could be reused to display an + slmenu inside the editor at the cursor position? + + + something similar to vim's quick fix functionality + + Things I might add + + + runtime configurable key bindings + + visual block mode / multiple selections + + text folding + + Stuff which vim does which I don't use and have no plans to add: + + - GUIs (neither x11, motif, gtk, win32 ...) + - plugins (certainly not vimscript, if anything it should be lua based) + - right-to-left text + - tabs (as in multiple workspaces) + - ex mode + +How to help? +------------ + +At this point it might be best to fetch the code, edit some scratch file, +notice an odd behavior or missing functionality, write and submit a patch +for it, then iterate. + +WARNING: There are probably still some bugs left which could corrupt your + unsaved changes. Use at your own risk. At this point I suggest to + only edit non-critical files which are under version control and + thus easily recoverable! + +A quick overview over the code structure to get you started: + + File(s) | Description + ------------------- | ----------------------------------------------------- + `text.[ch]` | low level text / marks / {un,re}do / piece table implementation + `text-motions.[ch]` | movement functions take a file position and return a new one + `text-objects.[ch]` | functions take a file position and return a file range + `vis.c` | vi(m) specific editor frontend, program entry point + `editor.[ch]` | editor window management + `window.[ch]` | ui-independent viewport, syntax highlighting, cursor placement + `ui.h` | abstract interface as implemented by user interface + `ui-curses.h` | a terminal / curses based user interface implementation + `buffer.[ch]` | dynamically growing buffer used for registers and macros + `ring-buffer.[ch]` | fixed size ring buffer used for the jump list + `map.[ch]` | crit-bit tree based map supporting unique prefix lookups and ordered iteration. used to implement `:`-commands. + `config.def.h` | definition of key bindings, commands, syntax highlighting + +Hope this gets the interested people started. + +TODO +---- + +Here is an incomplete list of TODO items and/or ideas for further work +in no particular order: + + * Review and cleanup the existing implementation (e.g. selection handling) + - Eliminate global state and expose vis frontend as "library" + * Implement `:!` using a proper (libuv based?) mainloop + * Implement `:substitute` + * Bugfix: editing the same file in multiple windows can cause "corruption" + * Review/Implement cindent mode #33 + * Add history support to `:`-prompt + * Implement wordwrap (i.e `gq` and `:set textwidth`) using `fmt(1)` ? + * Overhaul key bindings to support runtime configuration / streamline config.def.h + * Implement/review/merge history undo tree + * Implement a regex engine which works with the iterator API + * Write [unit test](http://ccodearchive.net/info/tap.html) for the low + level `text_*` interface + * Improve syntax highlighting, investigate whether already existing + syntax definitions from other editors could be reused + * Optimize `text_delete` in case of consecutive delete operations + * Add a RPC interface, experiment with a client/server architecture and + delegate window management to dwm/dvtm + +Feel free to ask questions if something is unclear! There are still a lot +of bugs left to fix, but by now I'm fairly sure that the general concept +should work. + +As always, comments and patches welcome! + +Cheers, +Marc -- cgit v1.2.3