From 3d6583383bffc75694635fadca2bc89cd5f980ba Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Marc=20Andr=C3=A9=20Tanner?= <mat@brain-dump.org>
Date: Sat, 11 Apr 2015 19:58:05 +0200
Subject: Rename README -> README.md

---
 README.md | 597 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 597 insertions(+)
 create mode 100644 README.md

(limited to 'README.md')

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..d5c6e93
--- /dev/null
+++ b/README.md
@@ -0,0 +1,597 @@
+Why another text editor?
+========================
+
+It all started when I was recently reading the excellent
+[Project Oberon](http://www.inf.ethz.ch/personal/wirth/ProjectOberon/),
+where in chapter 5 a data structure for managing text is introduced.
+I found this rather appealing and wanted to see how it works in practice.
+
+After some time I decided that besides just having fun hacking around I
+might as well build something which could (at least in the long run)
+replace my current editor of choice: vim.
+
+This should be accomplished by a reasonable amount of clean (your mileage
+may vary), modern and legacy free C code. Certainly not an old,
+[500'000 lines long](https://www.openhub.net/p/vim) #ifdef cluttered
+mess which tries to run on all broken systems ever envisioned by mankind.
+
+Admittedly vim has a lot of functionally, most of which I don't use. I
+therefore set out with the following main goals:
+
+ - Unicode aware
+
+ - binary clean
+
+ - handle arbitrary files (this includes large ones, think >100M SQL-dumps)
+
+ - unlimited undo/redo support
+
+ - syntax highlighting
+
+ - regex search (and replace)
+
+ - multiple file/window support
+
+The goal could thus be summarized as "80% of vim's features (in other
+words the useful ones) implemented in roughly 1% of the code".
+
+Finally and most importantly it is fun! Writing a text editor presents
+some interesting challenges and design decisions, some of which are
+explained below.
+
+Text management using a piece table/chain
+=========================================
+
+The core of this editor is a persistent data structure called a piece
+table which supports all modifications in `O(m)`, where `m` is the number
+of non-consecutive editing operations. This bound could be further
+improved to `O(log m)` by use of a balanced search tree, however the
+additional complexity doesn't seem to be worth it, for now.
+
+The actual data is stored in buffers which are strictly append only.
+There exist two types of buffers, one fixed-sized holding the original
+file content and multiple append-only ones storing the modifications.
+
+A text, i.e. a sequence of bytes, is represented as a double linked
+list of pieces each with a pointer into a buffer and an associated
+length. Pieces are never deleted but instead always kept around for
+redo/undo support. A span is a range of pieces, consisting of a start
+and end piece. Changes to the text are always performed by swapping
+out an existing, possibly empty, span with a new one.
+
+An empty document is represented by two special sentinel pieces which
+always exist:
+
+    /-+ --> +-\
+    | |     | |
+    \-+ <-- +-/
+     #1     #2
+
+Loading a file from disk is as simple as mmap(2)-ing it into a buffer,
+creating a corresponding piece and adding it to the double linked list.
+Hence loading a file is a constant time operation i.e. independent of
+the actual file size (assuming the operating system uses demand paging).
+
+    /-+ --> +-----------------+ --> +-\
+    | |     | I am an editor! |     | |
+    \-+ <-- +-----------------+ <-- +-/
+     #1             #3              #2
+
+Insert
+------
+
+Inserting a junk of data amounts to appending the new content to a
+modification buffer. Followed by the creation of new pieces. An insertion
+in the middle of an existing piece requires the creation of 3 new pieces.
+Two of them hold references to the text before respectively after the
+insertion point. While the third one points to the newly added text.
+
+    /-+ --> +---------------+ --> +----------------+ --> +--+ --> +-\
+    | |     | I am an editor|     |which sucks less|     |! |     | |
+    \-+ <-- +---------------+ <-- +----------------+ <-- +--+ <-- +-/
+     #1            #4                   #5                #6      #2
+
+           modification buffer content: "which sucks less"
+
+During this insertion operation the old span [3,3] has been replaced
+by the new span [4,6]. Notice that the pieces in the old span were not
+changed, therefore still point to their predecessors/successors, and can
+thus be swapped back in.
+
+If the insertion point happens to be at a piece boundary, the old span
+is empty, and the new span only consists of the newly allocated piece.
+
+Delete
+------
+
+Similarly a delete operation splits the pieces at appropriate places.
+
+    /-+ --> +-----+ --> +--+ --> +-\
+    | |     | I am|     |! |     | |
+    \-+ <-- +-----+ <-- +--+ <-- +-/
+     #1       #7         #6      #2
+
+Where the old span [4,5] got replaced by the new span [7,7]. The underlying
+buffers remain unchanged.
+
+Cache
+-----
+
+Notice that the common case of appending text to a given piece is fast
+since, the new data is simply appended to the buffer and the piece length
+is increased accordingly. In order to keep the number of pieces down,
+the least recently edited piece is cached and changes to it are done
+in place (this is the only time buffers are modified in a non-append
+only way). As a consequence they can not be undone.
+
+Undo/redo
+---------
+
+Since the buffers are append only and the spans/pieces are never destroyed
+undo/redo functionality is implemented by swapping the required spans/pieces
+back in.
+
+As illustrated above, each change to the text is recorded by an old and
+a new span. An action consists of multiple changes which logically belong
+to each other and should thus also be reverted together. For example
+a search and replace operation is one action with possibly many changes
+all over the text. Actions are kept in an undo respectively redo stack.
+
+A state of the text can be marked by means of a snapshotting operation.
+The undo/redo functionality operates on such marked states and switches
+back and forth between them.
+
+The history is currently linear, no undo / history tree is implemented.
+
+Properties
+----------
+
+The main advantage of the piece chain as described above is that all
+operations are performed independent of the file size but instead linear
+in the number of pieces i.e. editing operations. The original file buffer
+never changes which means the `mmap(2)` can be performed read only which
+makes optimal use of the operating system's virtual memory / paging system.
+
+The maximum editable file size is limited by the amount of memory a process
+is allowed to map into its virtual address space, this shouldn't be a problem
+in practice. The whole process assumes that the file can be used as is.
+In particular the editor assumes all input and the file itself is encoded
+as UTF-8. Supporting other encodings would require conversion using `iconv(3)`
+or similar upon loading and saving the document, which defeats the whole
+purpose.
+
+Similarly the editor has to cope with the fact that lines can be terminated
+either by `\n` or `\r\n`. There is no conversion to a line based structure in
+place. Instead the whole text is exposed as a sequence of bytes. All
+addressing happens by means of zero based byte offsets from the start of
+the file.
+
+The main disadvantage of the piece chain data structure is that the text
+is not stored contiguous in memory which makes seeking around somewhat
+harder. This also implies that standard library calls like the `regex(3)`
+functions can not be used as is. However this is the case for all but
+the most simple data structures used in text editors.
+
+Screen Drawing
+==============
+
+The current code takes a rather simple approach to screen drawing. It
+basically only remembers the starting position of the area being shown.
+Then fetches a "screen full" of bytes and outputs one character at a
+time until the end of the window is reached. A consequence of this
+approach is that lines are always wrapped and horizontal scrolling is
+not supported.
+
+No efforts are made to reduce the terminal output. This task is delegated
+to the underlying curses library which already performs a kind of double
+buffering. The window is always redrawn completely even if only a single
+character changes. It turns out this is actually necessary if one wants
+to support multiline syntax highlighting.
+
+While outputting the individual characters a cell matrix is populated
+where each entry stores the length in bytes of the character displayed
+at this particular cell. For characters spanning multiple columns the
+length is always stored in the leftmost cell. As an example a tab has a
+length of 1 byte followed by up to 7 cells with a length of zero.
+Similarly a `\r\n` line ending occupies only one screen cell but has a
+length of 2.
+
+This matrix is actually stored per line inside a double linked list of
+structures representing screen lines. For each line we keep track of
+the length in bytes of the underlying text, the display width of all
+characters part of the line, and the logical line number.
+
+All cursor positioning is always performed in bytes from the start of
+the file and works by traversing the double linked list of screen lines
+until the correct line is found. Then the cell array is consulted to
+move to the correct column.
+
+Syntax-Highlighting
+-------------------
+
+The editor takes a similar regex-based approach to syntax highlighting
+than sandy and reuses its syntax definitions but always applies them to
+a "screen full" of text thus enabling multiline coloring.
+
+Window-Management
+-----------------
+
+It is possible to open multiple windows via the `:split/:vsplit/:open`
+commands or by passing multiple files on the command line.
+
+In principle it would be nice to follow a similar client/server approach
+as sam/samterm i.e. having the main editor as a server and each window
+as a separate client process with communication over a unix domain socket.
+
+That way window management would be taken care of by dwm or dvtm and the
+different client processes would still share common cut/paste registers
+etc.
+
+However at the moment I don't want to open that can of worms and instead
+settled for a single process architecture.
+
+Search and replace
+------------------
+
+Currently the editor copies the whole text to a contiguous memory block
+and then uses the standard regex functions from libc. Clearly this is not
+a satisfactory solution for large files and kind of defeats the whole
+effort spent on the piece table.
+
+The long term solution is to write our own regular expression engine or
+modify an existing one to make use of the iterator API. This would allow
+efficient search without having to double memory consumption.
+
+The used regex engine should use a non-backtracking algorithm. Useful
+resources include:
+
+ - [Russ Cox's regex pag](http://swtch.com/~rsc/regexp/)
+ - [TRE](https://github.com/laurikari/tre) as
+   [used by musl](http://git.musl-libc.org/cgit/musl/tree/src/regex)
+   which uses a parallel [TNFA matcher](http://laurikari.net/ville/spire2000-tnfa.ps)
+ - [Plan9's regex library](http://plan9.bell-labs.com/sources/plan9/sys/src/libregexp/)
+   which has its root in Rob Pike's sam text editor
+ - [RE2](https://github.com/google/re2) C++ regex library
+
+Command-Prompt
+--------------
+
+The editor needs some form of command prompt to get user input
+(think `:`, `/`, `?` in vim).
+
+At first I wanted to implement this in terms of an external process,
+similar to the way it is done in sandy with communication back to the
+editor via a named pipe.
+
+At some point I thought it would be possible to provide all editor commands
+as shell scripts in a given directory, then set $PATH accordingly and run
+the shell. This would give us readline editing, tab completion, history and
+Unicode support for free. But unfortunately it won't work due to quoting
+issues and other conflicts of special symbols with different meanings.
+
+Later it occurred to me that the editor prompt could just be treated as
+special 1 line file. That is all the main editor functionality is reused
+with a slightly different set of key bindings.
+
+This approach also has the added benefit of further testing the main editor
+component (in particular corner cases like editing at the end of the file).
+
+Editor Frontends
+================
+
+The editor core is written in a library like fashion which should make
+it possible to write multiple frontends with possibly different user
+interfaces/paradigms. The frontend to run is selected based on the
+executable name.
+
+The default, and currently only, interface is a vim clone called vis.
+
+Key binding modes
+-----------------
+
+The available key bindings for the different modes are arranged in a
+hierarchical way in config.h (there is also an ascii tree giving an
+overview in that file). Each mode can specify a parent mode which is
+used to look up a key binding if it is not found in the current mode.
+This reduces redundancy for keys which have the same meaning in
+different modes.
+
+Each mode can also provide hooks which are executed upon entering/leaving
+the mode and when there was an unmatched key.
+
+vis a vim like frontend
+-----------------------
+
+The following section gives a quick overview over various vim features
+and their current support in vis.
+
+###  Operators
+
+    d   (delete)
+    c   (change)
+    y   (yank)
+    p   (put)
+    >   (shift-right)
+    <   (shift-left),
+    J   (join)
+    ~   (swap case)
+    gu  (make lowercase)
+    gU  (make uppercase)
+
+### Movements
+
+    h        (char left)
+    l        (char right)
+    j        (line down)
+    k        (line up)
+    gj       (display line down)
+    gk       (display line up)
+    0        (start of line)
+    ^        (first non-blank of line)
+    g_       (last non-blank of line)
+    $        (end of line)
+    %        (match bracket)
+    b        (previous start of a word)
+    B        (previous start of a WORD)
+    w        (next start of a word)
+    W        (next start of a WORD)
+    e        (next end of a word)
+    E        (next end of a WORD)
+    ge       (previous end of a word)
+    gE       (previous end of a WORD)
+    {        (previous paragraph)
+    }        (next paragraph)
+    (        (previous sentence)
+    )        (next sentence)
+    gg       (begin of file)
+    G        (goto line or end of file)
+    |        (goto column)
+    n        (repeat last search forward)
+    N        (repeat last search backwards)
+    f{char}  (to next occurrence of char to the right)
+    t{char}  (till before next occurrence of char to the right)
+    F{char}  (to next occurrence of char to the left)
+    T{char}  (till before next occurrence of char to the left)
+    /{text}  (to next match of text in forward direction)
+    ?{text}  (to next match of text in backward direction)
+
+  An empty line is currently neither a word nor a WORD.
+
+  The semantics of a paragraph and a sentence is also not always 100%
+  the same as in vim.
+
+  Some of these commands do not work as in vim when prefixed with a
+  digit i.e. a multiplier. As an example in vim `3$` moves to the end
+  of the 3rd line down. However vis treats it as a move to the end of
+  current line which is repeated 3 times where the last two have no
+  effect.
+
+  In general there are still a lot of improvements to be made in the
+  case movements are forced to be line or character wise. Also some of
+  them should be inclusive in some context and exclusive in others.
+  At the moment they always behave the same.
+
+### Text objects
+
+  All of the following text objects are implemented in an inner variant
+  (prefixed with 'i') and a normal variant (prefixed with 'a'):
+
+    w  word
+    W  WORD
+    s  sentence
+    p  paragraph
+    [,], (,), {,}, <,>, ", ', `         block enclosed by these symbols
+
+  For sentence and paragraph there is no difference between the
+  inner and normal variants.
+
+### Modes
+
+  At the moment there exists a more or less functional insert, replace
+  and visual mode (in both line and character wise variants).
+
+### Marks
+
+    [a-z] general purpose marks
+    <     start of the last selected visual area in current buffer
+    >     end of the last selected visual area in current buffer
+
+  No marks across files are supported. Marks are not preserved over
+  editing sessions.
+
+### Registers
+
+  Only the 26 lower case registers `[a-z]` and 1 additional default register
+  is supported.
+
+### Undo/Redo and Repeat
+
+  The text is currently snapshoted whenever an operator is completed as
+  well as when insert or replace mode is left. Additionally a snapshot
+  is also taken if in insert or replace mode a certain idle time elapses.
+
+  Another idea is to snapshot based on the distance between two consecutive
+  editing operations (as they are likely unrelated and thus should be
+  individually reversible).
+
+  The repeat command `.` works for all operators and is able to repeat
+  the last insertion or replacement.
+
+### Macros
+
+`[a-z]` are recoginized macro names, `q` starts a recording, `@` plays it back.
+`@@` refers to the least recently recorded macro.
+
+### Command line prompt
+
+  At the `:`-command prompt only the following commands are recognized:
+
+    :nnn     go to line nnn
+    :bdelete close all windows which display the same file as the current one
+    :edit    replace current file with a new one or reload it from disk
+    :open    open a new window
+    :qall    close all windows, exit editor
+    :quit    close currently focused window
+    :read    insert content of another file at current cursor position
+    :split   split window horizontally
+    :vsplit  split window vertically
+    :new     open an empty window, arrange horizontally
+    :vnew    open an empty window, arrange vertically
+    :wq      write changes then close window
+    :xit     like :wq but write only when changes have been made
+    :write   write current buffer content to file
+    :saveas  save file under another name
+    :set     set the options below
+
+     tabwidth   [1-8]
+
+       set display width of a tab and number of spaces to use if
+       expandtab is enabled
+
+     expandtab  (yes|no)
+
+       whether typed in tabs should be expanded to tabwidth spaces
+
+     autoindent (yes|no)
+
+       replicate spaces and tabs at the beginning of the line when
+       starting a new line.
+
+     number         (yes|no)
+     relativenumber (yes|no)
+
+       whether absolute or relative line numbers are printed alongside
+       the file content
+
+     syntax      name
+
+       use syntax definition given (e.g. "c") or disable syntax
+       highlighting if no such definition exists (e.g :set syntax off)
+
+  Each command can be prefixed with a range made up of a start and
+  an end position as in start,end. Valid position specifiers are:
+
+   .          start of the current line
+   +n and -n  start of the line relative to the current line
+   'm         position of mark m
+   /pattern/  first match after current position
+
+  If only a start position without a command is given then the cursor
+  is moved to that position. Additionally the following ranges are
+  predefined:
+
+    %          the whole file, equivalent to 1,$
+    *          the current selection, equivalent to '<,'>
+
+  The substitute command is recognized but not yet implemented. The `!`
+  command to filter text through an external program is also planned.
+
+  History support, tab completion and wildcard expansion are other
+  worthwhile features. However implementing them inside the editor
+  feels wrong.
+
+### Tab <-> Space and Line endings \n vs \r\n
+
+  Tabs can optionally be expaned to a configurable number of spaces.
+  The first line ending in the file determines what will be inserted
+  upon a line break (defaults to \n).
+
+### Jump list and change list
+
+  A per window, file local jump list (navigate with `CTRL+O` and `CTRL+I`)
+  and change list (navigate with `g;` and `g,`) is supported. The jump
+  list is implemented as a fixed sized ring buffer.
+
+### Mouse support
+
+  The mouse is currently not used at all.
+
+### Other features
+
+ Things which I would like to add in the long term are:
+
+   + code completion: this should be done as an external process. I will
+     have to take a look at the tools from the llvm / clang project. Maybe
+     dvtm's terminal emulation support could be reused to display an
+     slmenu inside the editor at the cursor position?
+
+   + something similar to vim's quick fix functionality
+
+ Things I might add
+
+   + runtime configurable key bindings
+   + visual block mode / multiple selections
+   + text folding
+
+ Stuff which vim does which I don't use and have no plans to add:
+
+   - GUIs (neither x11, motif, gtk, win32 ...)
+   - plugins (certainly not vimscript, if anything it should be lua based)
+   - right-to-left text
+   - tabs (as in multiple workspaces)
+   - ex mode
+
+How to help?
+------------
+
+At this point it might be best to fetch the code, edit some scratch file,
+notice an odd behavior or missing functionality, write and submit a patch
+for it, then iterate.
+
+WARNING: There are probably still some bugs left which could corrupt your
+         unsaved changes. Use at your own risk. At this point I suggest to
+         only edit non-critical files which are under version control and
+         thus easily recoverable!
+
+A quick overview over the code structure to get you started:
+
+ File(s)             | Description
+ ------------------- | -----------------------------------------------------
+ `text.[ch]`         | low level text / marks / {un,re}do / piece table implementation
+ `text-motions.[ch]` | movement functions take a file position and return a new one
+ `text-objects.[ch]` | functions take a file position and return a file range
+ `vis.c`             | vi(m) specific editor frontend, program entry point
+ `editor.[ch]`       | editor window management
+ `window.[ch]`       | ui-independent viewport, syntax highlighting, cursor placement
+ `ui.h`              | abstract interface as implemented by user interface
+ `ui-curses.h`       | a terminal / curses based user interface implementation
+ `buffer.[ch]`       | dynamically growing buffer used for registers and macros
+ `ring-buffer.[ch]`  | fixed size ring buffer used for the jump list
+ `map.[ch]`          | crit-bit tree based map supporting unique prefix lookups and ordered iteration. used to implement `:`-commands.
+ `config.def.h`      | definition of key bindings, commands, syntax highlighting
+
+Hope this gets the interested people started.
+
+TODO
+----
+
+Here is an incomplete list of TODO items and/or ideas for further work
+in no particular order:
+
+ * Review and cleanup the existing implementation (e.g. selection handling)
+    - Eliminate global state and expose vis frontend as "library"
+ * Implement `:!` using a proper (libuv based?) mainloop
+ * Implement `:substitute`
+ * Bugfix: editing the same file in multiple windows can cause "corruption"
+ * Review/Implement cindent mode #33
+ * Add history support to `:`-prompt
+ * Implement wordwrap (i.e `gq` and `:set textwidth`) using `fmt(1)` ?
+ * Overhaul key bindings to support runtime configuration / streamline config.def.h
+ * Implement/review/merge history undo tree
+ * Implement a regex engine which works with the iterator API
+ * Write [unit test](http://ccodearchive.net/info/tap.html) for the low
+   level `text_*` interface
+ * Improve syntax highlighting, investigate whether already existing
+   syntax definitions from other editors could be reused
+ * Optimize `text_delete` in case of consecutive delete operations
+ * Add a RPC interface, experiment with a client/server architecture and
+   delegate window management to dwm/dvtm
+
+Feel free to ask questions if something is unclear! There are still a lot
+of bugs left to fix, but by now I'm fairly sure that the general concept
+should work.
+
+As always, comments and patches welcome!
+
+Cheers,
+Marc
-- 
cgit v1.2.3