1 files changed, 518 insertions, 0 deletions
diff --git a/README b/README
new file mode 100644
index 0000000..2b32507
--- /dev/null
+++ b/README
@@ -0,0 +1,518 @@
+Why another text editor?
+========================
+
+It all started when I was recently reading the excellent Project Oberon[0],
+where in chapter 5 a data structure for managing text is introduced.
+I found this rather appealing and wanted to see how it works in practice.
+
+After some time I decided that besides just having fun hacking around I
+might as well build something which could (at least in the long run)
+replace my current editor of choice: vim.
+
+This should be accomplished by a reasonable amount of clean (your mileage
+may vary), modern and legacy free C code. Certainly not an old, 500'000
+lines[1] long, #ifdef cluttered mess which tries to run on all broken
+systems ever envisioned by mankind.
+
+Admittedly vim has a lot of functionally, most of which I don't use. I
+therefore set out with the following main goals:
+
+ - Unicode aware
+
+ - binary clean
+
+ - handle arbitrary files (this includes large ones, think >100M SQL-dumps)
+
+ - unlimited undo/redo support
+
+ - syntax highlighting
+
+ - regex search (and replace)
+
+ - multiple file/window support
+
+ - extensible and configurable through familiar config.def.h mechanism
+
+The goal could thus be summarized as "80% of vim's features (in other
+words the useful ones) implemented in roughly 1% of the code".
+
+Finally and most importantly it is fun! Writing a text editor presents
+some interesting challenges and design decisions, some of which are
+explained below.
+
+Text management using a piece table/chain
+=========================================
+
+The core of this editor is a persistent data structure called a piece
+table which supports all modifications in O(m), where m is the number
+of non-consecutive editing operations. This bound could be further
+improved to O(log m) by use of a balanced search tree, however the
+additional complexity doesn't seem to be worth it, for now.
+
+The actual data is stored in buffers which are strictly append only.
+There are two types of buffers, a fixed-sized for the original file
+content and append-only ones one for all modifications.
+
+A text, i.e. a sequence of bytes, is represented as a double linked
+list of pieces each with a pointer into a buffer and an associated
+length. Pieces are never deleted but instead always kept around for
+redo/undo support. A span is a range of pieces, consisting of a start
+and end piece. Changes to the text are always performed by swapping
+out an existing, possibly empty, span with a new one.
+
+An empty document is represented by two special sentinel pieces which
+always exist:
+
+    /-+ --> +-\
+    | |     | |
+    \-+ <-- +-/
+     #1     #2
+
+Loading a file from disk is as simple as mmap(2)-ing it into a buffer,
+creating a corresponding piece and adding it to the double linked list.
+Hence loading a file is a constant time operation i.e. independent of
+the actual file size (assuming the operating system uses demand paging).
+
+    /-+ --> +-----------------+ --> +-\
+    | |     | I am an editor! |     | |
+    \-+ <-- +-----------------+ <-- +-/
+     #1             #3              #2
+
+Insert
+------
+
+Inserting a junk of data amounts to appending the new content to a
+modification buffer. Followed by the creation of new pieces. An insertion
+in the middle of an existing piece requires the creation of 3 new pieces.
+Two of them hold references to the text before respectively after the
+insertion point. While the third one points to the newly added text.
+
+    /-+ --> +---------------+ --> +----------------+ --> +--+ --> +-\
+    | |     | I am an editor|     |which sucks less|     |! |     | |
+    \-+ <-- +---------------+ <-- +----------------+ <-- +--+ <-- +-/
+     #1            #4                   #5                #6      #2
+
+           modification buffer content: "which sucks less"
+
+During this insertion operation the old span [3,3] has been replaced
+by the new span [4,6]. Notice that the pieces in the old span were not
+changed, therefore still point to their predecessors/successors, and can
+thus be swapped back in.
+
+If the insertion point happens to be at a piece boundary, the old span
+is empty, and the new span only consists of the newly allocated piece.
+
+Delete
+------
+
+Similarly a delete operation splits the pieces at appropriate places.
+
+    /-+ --> +-----+ --> +--+ --> +-\
+    | |     | I am|     |! |     | |
+    \-+ <-- +-----+ <-- +--+ <-- +-/
+     #1       #7         #6      #2
+
+Where the old span [4,5] got replaced by the new span [7,7]. The underlying
+buffers remain unchanged.
+
+Cache
+-----
+
+Notice that the common case of appending text to a given piece is fast
+since, the new data is simply appended to the buffer and the piece length
+is increased accordingly. In order to keep the number of pieces down,
+the least recently edited piece is cached and changes to it are done
+in place (this is the only time buffers are modified in a non-append
+only way). As a consequence they can not be undone.
+
+Undo/redo
+---------
+
+Since the buffers are append only and the spans/pieces are never destroyed
+undo/redo functionality is implemented by swapping the required spans/pieces
+back in.
+
+As illustrated above, each change to the text is recorded by an old and
+a new span. An action consists of multiple changes which logically belong
+to each other and should thus also be reverted together. For example
+a search and replace operation is one action with possibly many changes
+all over the text. Actions are kept in an undo respectively redo stack.
+
+A state of the text can be marked by means of a snapshotting operation.
+The undo/redo functionality operates on such marked states and switches
+back and forth between them.
+
+The history is currently linear, no undo / history tree is implemented.
+
+Properties
+----------
+
+The main advantage of the piece chain as described above is that all
+operations are performed independent of the file size but instead linear
+in the number of pieces i.e. editing operations. The original file buffer
+never changes which means the mmap(2) can be performed read only which
+makes optimal use of the operating system's virtual memory / paging system.
+
+The maximum editable file size is limited by the amount of memory a process
+is allowed to map into its virtual address space, this shouldn't be a problem
+in practice. The whole process assumes that the file can be used as is.
+In particular the editor assumes all input and the file itself is encoded
+as UTF-8. Supporting other encodings would require conversion using iconv(3)
+or similar upon loading and saving the document, which defeats the whole
+purpose.
+
+Similarly the editor has to cope with the fact that lines can be terminated
+either by \n or \n\r. There is no conversion to a line based structure in
+place. Instead the whole text is exposed as a sequence of bytes. All
+addressing happens by means of zero based byte offsets from the start of
+the file.
+
+The main disadvantage of the piece chain data structure is that the text
+is not stored contiguous in memory which makes seeking around somewhat
+harder. This also implies that standard library calls like regex(3)
+functions can not be used as is. However this is the case for all but
+the most simple data structures used in text editors.
+
+Screen Drawing
+==============
+
+The current code takes a rather simple approach to screen drawing. It
+basically only remembers the starting position of the area being shown.
+Then fetches a "screen full" of bytes and outputs one character at a
+time until the end of the window is reached. A consequence of this
+approach is that lines are always wrapped and horizontal scrolling is
+not supported.
+
+No efforts are made to reduce the terminal output. This task is delegated
+to the underlying curses library which already performs a kind of double
+buffering. The window is always redrawn completely even if only a single
+character changes. It turns out this is actually necessary if one wants
+to support multiline syntax highlighting.
+
+While outputting the individual characters a cell matrix is populated
+where each entry stores the length in bytes of the character displayed
+at this particular cell. For characters spanning multiple columns the
+length is always stored in the leftmost cell. As an example a tab has a
+length of 1 byte followed by up to 7 cells with a length of zero.
+Similarly a \n\r line ending occupies only one screen cell but has a
+length of 2.
+
+This matrix is actually stored per line inside a double linked list of
+structures representing screen lines. For each line we keep track of
+the length in bytes of the underlying text, the display width of all
+characters part of the line, and the logical line number.
+
+All cursor positioning is always performed in bytes from the start of
+the file and works by traversing the double linked list of screen lines
+until the correct line is found. Then the cell array is consulted to
+move to the correct column.
+
+Syntax-Highlighting
+-------------------
+
+The editor takes a similar regex-based approach to syntax highlighting
+than sandy and reuses its syntax definitions but always applies them to
+a "screen full" of text thus enabling multiline coloring.
+
+Currently only the highlighting rules for C have been imported from sandy.
+
+Window-Management
+-----------------
+
+It is possible to open multiple windows via the :split/:vsplit/:open
+commands or by passing multiple files on the command line.
+
+In principle it would be nice to follow a similar client/server approach
+as sam/samterm i.e. having the main editor as a server and each window
+as a separate client process with communication over a unix domain socket.
+
+That way window management would be taken care of by dwm or dvtm and the
+different client processes would still share common cut/paste registers
+etc.
+
+However at the moment I don't want to open that can of worms and instead
+settled for a single process architecture.
+
+Search and replace
+------------------
+
+This is one of the last big conceptual problems.
+
+Currently the editor copies the whole text to a contiguous memory block
+and then uses the standard regex functions from libc. Clearly this is not
+a satisfactory solution for large files and kind of defeats the whole
+effort spent on the piece table.
+
+The long term solution is to write our own regular expression engine or
+modify an existing one to make use of the iterator API. This would allow
+efficient search without having to double memory consumption. At some
+point I will have to (re)read the papers of Russ Cox[2] and Rob Pike
+about this topic.
+
+Command-Prompt
+--------------
+
+The editor needs some form of command prompt to get user input
+(think :, /, ? in vim).
+
+At first I wanted to implement this in terms of an external process,
+similar to the way it is done in sandy with communication back to the
+editor via a named pipe.
+
+At some point I thought it would be possible to provide all editor commands
+as shell scripts in a given directory, then set $PATH accordingly and run
+the shell. This would give us readline editing, tab completion, history and
+Unicode support for free. But unfortunately it won't work due to quoting
+issues and other conflicts of special symbols with different meanings.
+
+Later it occurred to me that the editor prompt could just be treated as
+special 1 line file. That is all the main editor functionality is reused
+with a slightly different set of key bindings.
+
+This approach also has the added benefit of further testing the main editor
+component (in particular corner cases like editing at the end of the file).
+
+Editor Frontends
+================
+
+The editor core is written in a library like fashion which should make
+it possible to write multiple frontends with possibly different user
+interfaces/paradigms.
+
+At the moment there exists a barely functional, non-modal nano/sandy
+like interface which was used during early testing. The default interface
+is a vim clone called vis.
+
+The frontend to run is selected based on the executable name.
+
+Key binding modes
+-----------------
+
+The available key bindings for the different modes are arranged in a
+hierarchical way in config.h (there is also an ascii tree giving an
+overview in that file). Each mode can specify a parent mode which is
+used to look up a key binding if it is not found in the current mode.
+This reduces redundancy for keys which have the same meaning in
+different modes.
+
+Each mode can also provide hooks which are executed upon entering/leaving
+the mode and when there was an unmatched key.
+
+vis a vim like frontend
+-----------------------
+
+The vis frontend uses a similar approach to the one suggested by Markus
+Teich[3] but it turns out to be a bit more complicated. For starters
+there are movements and commands which consist of more than one key/
+character. As a consequence the key lookup is not a simple array
+dereference but instead the arrays are looped over until a match
+is found.
+
+The following section gives a quick overview over various vim features
+and their current support in vis.
+
+ Operators
+ ---------
+   working: d (delete), c (change), y (yank), p (put)
+   planned: > (shift-right), < (shift-left)
+            those depend on handling of indention tabs <-> spaces
+
+ Movements
+ ---------
+   h        (char left)
+   l        (char right)
+   j        (line down)
+   k        (line up)
+   0        (start of line)
+   ^        (first non-blank of line)
+   g_       (last non-blank of line)
+   $        (end of line)
+   %        (match bracket)
+   b        (previous start of a word)
+   w        (next start of a word)
+   e        (next end of a word)
+   ge       (previous end of a word)
+   {        (previous paragraph)
+   }        (next paragraph)
+   (        (previous sentence)
+   )        (next sentence)
+   gg       (begin of file)
+   G        (goto line or end of file)
+   |        (goto column)
+   n        (repeat last search forward)
+   N        (repeat last search backwards)
+   f{char}  (to next occurrence of char to the right)
+   t{char}  (till before next occurrence of char to the right)
+   F{char}  (to next occurrence of char to the left)
+   T{char}  (till before next occurrence of char to the left)
+   /{text}  (to next match of text in forward direction)
+   ?{text}  (to next match of text in backward direction)
+
+  There is currently no distinction between what vim calls a WORD and
+  a word, only the former is implemented. Though infrastructure for
+  the latter also exists.
+
+  The semantics of a paragraph and a sentence is also not always 100%
+  the same as in vim.
+
+  Some of these commands do not work as in vim when prefixed with a
+  digit i.e. a multiplier. As an example 3$ should move to the end
+  of the 3rd line down. The way it currently behaves is that the first
+  movement places the cursor at the end of the current line and the last
+  two have thus no effect.
+
+  In general there are still a lot of improvements to be made in the
+  case movements are forced to be line or character wise. Also some of
+  them should be inclusive in some context and exclusive in others.
+  At the moment they always behave the same.
+
+ Text objects
+ ------------
+
+  All of the following text objects are implemented in an inner variant
+  (prefixed with 'i') and a normal variant (prefixed with 'a'):
+
+   w  word
+   s  sentence
+   p  paragraph
+   [,], (,), {,}, <,>, ", ', `         block enclosed by these symbols
+
+  For word, sentence and paragraph there is no difference between the
+  inner and normal variants.
+
+ Modes
+ -----
+
+  At the moment there exists a more or less functional insert, replace
+  and character wise visual mode.
+
+  A line wise visual mode is planned.
+
+ Marks
+ -----
+
+  Only the 26 lower case marks [a-z] are supported. No marks across files
+  are supported. Marks are not preserved over editing sessions.
+
+ Registers
+ ---------
+
+  Only the 26 lower case registers [a-z] and 1 additional default register
+  is supported.
+
+ Undo/Redo and Repeat
+ --------------------
+
+  The text is currently snapshoted whenever an operator is completed as
+  well as when insert or replace mode is left. Additionally a snapshot
+  is also taken if in insert or replace mode a certain idle time elapses.
+
+  Another idea is to snapshot based on the distance between two consecutive
+  editing operations (as they are likely unrelated and thus should be
+  individually reversible).
+
+  The repeat command '.' currently only works for operators. This for
+  example means that inserting text can not be repeated (i.e. inserted
+  again). The same restriction also applies to commands which are not
+  implemented in terms of operators, such as 'o', 'O', 'J' etc.
+
+ Command line prompt
+ -------------------
+
+  At the ':'-command prompt only the following commands are recognized:
+
+   :nnn     go to line nnn
+   :edit    replace current file with a new one or reload it from disk
+   :open    open a new window
+   :qall    close all windows, exit editor
+   :quit    close currently focused window
+   :read    insert content of another file at current cursor position
+   :split   split window horizontally
+   :vsplit  split window vertically
+   :wq      write changes then close window
+   :write   write current buffer content to file
+
+  The substitute command is recognized but not yet implemented. The '!'
+  command to filter text through an external program is also planned.
+  At some point the range syntax should probably also be supported.
+
+  History support, tab completion and wildcard expansion are other
+  worthwhile features.
+
+ Tab <-> Space
+ -------------
+
+  Currently there is no expand tab functionality i.e. they are always
+  inserted as is. For me personally this is no problem at all. Tabs
+  should be used for indention! That way everybody can configure their
+  preferred tab width whereas spaces should only be used for alignment.
+
+ Jump list and change list
+ -------------------------
+
+  Neither the jump list nor the change lists are currently supported.
+
+ Mouse support
+ -------------
+
+  The mouse is currently not used at all.
+
+ Other features
+ --------------
+
+  Other things I would like to add in the long term are:
+
+   + code completion: this should be done as an external process. I will
+     have to take a look at the tools from the llvm / clang project. Maybe
+     dvtm's terminal emulation support could be reused to display an
+     slmenu inside the editor at the cursor position?
+
+   + something similar to vim's quick fix functionality
+
+  Stuff which vim does which I don't use and have no plans to add:
+
+   - GUIs (neither x11, motif, gtk, win32 ...)
+   - text folding
+   - visual block mode
+   - plugins (certainly not vimscript, if anything it should be lua based)
+   - runtime key bindings
+   - right-to-left text
+   - tabs (as in multiple workspaces)
+   - ex mode
+   - macro recording
+
+How to help?
+------------
+
+At this point it might be best to fetch the code, edit some scratch file,
+notice an odd behavior or missing functionality, write and submit a patch
+for it, then iterate.
+
+WARNING: There are probably still some bugs left which could corrupt your
+         unsaved changes. Use at your own risk. At this point I suggest to
+         only edit non-critical files which are under version control and
+         thus easily recoverable!
+
+A quick overview over the code structure to get you started:
+
+ config.def.h       definition of key bindings, commands, syntax highlighting etc.
+ vis.c              vi(m) specific editor frontend, program entry point
+ editor.[ch]        screen / window / statusbar / command prompt management
+ window.[ch]        window drawing / syntax highlighting / cursor placement
+ text-motions.[ch]  movement functions take a file position and return a new one
+ text-objects.[ch]  functions take a file position and return a file range
+ text.[ch]          low level text / marks / {un,re}do / piece table implementation
+
+Hope this gets the interested people started. Feel free to ask questions
+if something is unclear! There are still a lot of bugs left to fix, but
+by now I'm fairly sure that the general concept should work.
+
+As always, comments and patches welcome!
+
+Cheers,
+Marc
+
+[0] http://www.inf.ethz.ch/personal/wirth/ProjectOberon/
+[1] https://www.openhub.net/p/vim
+[2] http://swtch.com/~rsc/regexp/
+[3] http://lists.suckless.org/dev/1408/23219.html