Diffstat (limited to 'lua/lexers/lexer.lua')

 -rw-r--r--  lua/lexers/lexer.lua | 2322
 1 file changed, 1251 insertions, 1071 deletions
Copyright 2006-2022 Mitchell. See LICENSE.

Lexes Scintilla documents and source code with Lua and LPeg.

### Writing Lua Lexers

Lexers highlight the syntax of source code. Scintilla (the editing component behind
[Textadept][] and [SciTE][]) traditionally uses static, compiled C++ lexers which are
notoriously difficult to create and/or extend. On the other hand, Lua makes it easy to
rapidly create new lexers, extend existing ones, and embed lexers within one another. Lua
lexers tend to be more readable than C++ lexers too.

Lexers are Parsing Expression Grammars, or PEGs, composed with the Lua [LPeg library][]. The
following table comes from the LPeg documentation and summarizes all you need to know about
constructing basic LPeg patterns. This module provides convenience functions for creating
and working with other more advanced patterns and concepts.

Operator             | Description
---------------------|------------
`lpeg.P(string)`     | Matches `string` literally.
`lpeg.P(`_`n`_`)`    | Matches exactly _`n`_ characters.
`lpeg.S(string)`     | Matches any character in set `string`.
`lpeg.R("`_`xy`_`")` | Matches any character in the range `x` to `y`.
`patt^`_`n`_         | Matches at least _`n`_ repetitions of `patt`.
`patt^-`_`n`_        | Matches at most _`n`_ repetitions of `patt`.
`patt1 * patt2`      | Matches `patt1` followed by `patt2`.
`patt1 + patt2`      | Matches `patt1` or `patt2` (ordered choice).
`patt1 - patt2`      | Matches `patt1` if `patt2` does not also match.
`-patt`              | Equivalent to `("" - patt)`.
`#patt`              | Matches `patt` but consumes no input.
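To see how these operators compose, here is a small sketch that builds a word pattern
equivalent to this module's predefined `lexer.word` from the primitives above:

    local lpeg = require('lpeg')
    local R = lpeg.R

    -- A word: a letter or underscore, followed by any number of alphanumerics
    -- or underscores (equivalent to this module's predefined lexer.word).
    local alpha = R('az', 'AZ')
    local word = (alpha + '_') * (alpha + R('09') + '_')^0

    -- A pattern's match() method returns the position just past the match, or
    -- nil if there is no match.
    print(word:match('foo_bar1')) --> 9
    print(word:match('1foo')) --> nil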
The first part of this document deals with rapidly constructing a simple lexer. The next part
deals with more advanced techniques, such as custom coloring and embedding lexers within one
another. Following that is a discussion about code folding, or being able to tell Scintilla
which code blocks are "foldable" (temporarily hideable from view). After that are instructions
on how to use Lua lexers with the aforementioned Textadept and SciTE editors. Finally there
are comments on lexer performance and limitations.

[LPeg library]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
[Textadept]: https://orbitalquark.github.io/textadept
[SciTE]: https://scintilla.org/SciTE.html

### Lexer Basics

The *lexers/* directory contains all lexers, including your new one. Before attempting to
write one from scratch though, first determine if your programming language is similar to
any of the 100+ languages supported. If so, you may be able to copy and modify that lexer,
saving some time and effort. The filename of your lexer should be the name of your programming
language in lower case followed by a *.lua* extension. For example, a new Lua lexer has the
name *lua.lua*.

Note: Try to refrain from using one-character language names like "c", "d", or "r". For
example, Scintillua uses "ansi_c", "dmd", and "rstats", respectively.

#### New Lexer Template

There is a *lexers/template.txt* file that contains a simple template for a new lexer. Feel
free to use it, replacing the '?'s with the name of your lexer. Consider this snippet from
the template:

    -- ? LPeg lexer.

    local lexer = require('lexer')
    local token, word_match = lexer.token, lexer.word_match
    local P, S = lpeg.P, lpeg.S

    local lex = lexer.new('?')

    -- Whitespace.
    local ws = token(lexer.WHITESPACE, lexer.space^1)
    lex:add_rule('whitespace', ws)

    [...]

    return lex
The first three lines of code simply define often used convenience variables. The fourth and
last lines [define](#lexer.new) and return the lexer object Scintilla uses; they are very
important and must be part of every lexer. The fifth line defines something called a "token",
an essential building block of lexers. You will learn about tokens shortly. The sixth line
defines a lexer grammar rule, which you will learn about later, as well as token styles. (Be
aware that it is common practice to combine these two lines for short rules.) Note, however,
the `local` prefix in front of variables, which is needed so as not to affect Lua's global
environment. All in all, this is a minimal, working lexer that you can build on.
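As an illustration of how little is needed beyond the template, here is a sketch of a
complete lexer for a hypothetical mini-language with keywords, identifiers, strings, and
line comments. The language and its name are made up; the functions used are all described
in the sections that follow:

    -- mini LPeg lexer (a made-up language, for illustration only).
    local lexer = require('lexer')
    local token, word_match = lexer.token, lexer.word_match

    local lex = lexer.new('mini')

    -- Whitespace.
    lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
    -- Keywords; this rule comes before the identifier rule so that keywords
    -- are not matched as identifiers.
    lex:add_rule('keyword', token(lexer.KEYWORD, word_match('if else while end')))
    -- Identifiers.
    lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word))
    -- Double-quoted strings.
    lex:add_rule('string', token(lexer.STRING, lexer.range('"')))
    -- Comments run from '#' to the end of the line.
    lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('#')))

    return lex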
#### Tokens

Take a moment to think about your programming language's structure. What kind of key
elements does it have? In the template shown earlier, one predefined element all languages
have is whitespace. Your language probably also has elements like comments, strings, and
keywords. Lexers refer to these elements as "tokens". Tokens are the fundamental "building
blocks" of lexers. Lexers break down source code into tokens for coloring, which results
in the syntax highlighting familiar to you. It is up to you how specific your lexer is
when it comes to tokens. Perhaps only distinguishing between keywords and identifiers is
necessary, or maybe recognizing constants and built-in functions, methods, or libraries is
desirable. The Lua lexer, for example, defines 11 tokens: whitespace, keywords, built-in
functions, constants, built-in libraries, identifiers, strings, comments, numbers, labels,
and operators. Even though constants, built-in functions, and built-in libraries are subsets
of identifiers, Lua programmers find it helpful for the lexer to distinguish between them
all. It is perfectly acceptable to just recognize keywords and identifiers.

In a lexer, tokens consist of a token name and an LPeg pattern that matches a sequence of
characters recognized as an instance of that token. Create tokens using the [`lexer.token()`]()
function. Let us examine the "whitespace" token defined in the template shown earlier:

    local ws = token(lexer.WHITESPACE, lexer.space^1)

At first glance, the first argument does not appear to be a string name and the second
argument does not appear to be an LPeg pattern. Perhaps you expected something like:

    local ws = token('whitespace', S('\t\v\f\n\r ')^1)

The `lexer` module actually provides a convenient list of common token names and common LPeg
patterns for you to use. Token names include [`lexer.DEFAULT`](), [`lexer.WHITESPACE`](),
[`lexer.COMMENT`](), [`lexer.STRING`](), [`lexer.NUMBER`](), [`lexer.KEYWORD`](),
[`lexer.IDENTIFIER`](), [`lexer.OPERATOR`](), [`lexer.ERROR`](), [`lexer.PREPROCESSOR`](),
[`lexer.CONSTANT`](), [`lexer.VARIABLE`](), [`lexer.FUNCTION`](), [`lexer.CLASS`](),
[`lexer.TYPE`](), [`lexer.LABEL`](), [`lexer.REGEX`](), and [`lexer.EMBEDDED`](). Patterns
include [`lexer.any`](), [`lexer.alpha`](), [`lexer.digit`](), [`lexer.alnum`](),
[`lexer.lower`](), [`lexer.upper`](), [`lexer.xdigit`](), [`lexer.graph`](), [`lexer.print`](),
[`lexer.punct`](), [`lexer.space`](), [`lexer.newline`](), [`lexer.nonnewline`](),
[`lexer.dec_num`](), [`lexer.hex_num`](), [`lexer.oct_num`](), [`lexer.integer`](),
[`lexer.float`](), [`lexer.number`](), and [`lexer.word`](). You may use your own token names
if none of the above fit your language, but an advantage to using predefined token names is
that your lexer's tokens will inherit the universal syntax highlighting color theme used by
your text editor.
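When your language has an element none of the predefined names fit, pair a custom name with
the predefined patterns. For example, a sketch of a Java-style annotation token (the
'annotation' name is custom and illustrative, so it would need a style assigned, as described
in the Token Styles section):

    local annotation = token('annotation', '@' * lexer.word)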
##### Example Tokens

So, how might you define other tokens like keywords, comments, and strings? Here are some
examples.

**Keywords**

Instead of matching _n_ keywords with _n_ `P('keyword_`_`n`_`')` ordered choices, use another
convenience function: [`lexer.word_match()`](). It is much easier and more efficient to
write word matches like:

    local keyword = token(lexer.KEYWORD, lexer.word_match{
      'keyword_1', 'keyword_2', ..., 'keyword_n'
    })

    local case_insensitive_keyword = token(lexer.KEYWORD, lexer.word_match({
      'KEYWORD_1', 'keyword_2', ..., 'KEYword_n'
    }, true))

    local hyphened_keyword = token(lexer.KEYWORD, lexer.word_match{
      'keyword-1', 'keyword-2', ..., 'keyword-n'
    })

For short keyword lists, you can use a single string of words. For example:

    local keyword = token(lexer.KEYWORD, lexer.word_match('key_1 key_2 ... key_n'))

**Comments**

Line-style comments with one or more prefix characters are easy to express with LPeg:

    local shell_comment = token(lexer.COMMENT, lexer.to_eol('#'))
    local c_line_comment = token(lexer.COMMENT, lexer.to_eol('//', true))

The comments above start with a '#' or "//" and go to the end of the line. The second comment
recognizes the next line also as a comment if the current line ends with a '\' escape character.
-- --- C-style "block" comments with a start and end delimiter are also easy to --- express: +-- C-style "block" comments with a start and end delimiter are also easy to express: -- --- local c_comment = token(l.COMMENT, '/*' * (l.any - '*/')^0 * P('*/')^-1) +-- local c_comment = token(lexer.COMMENT, lexer.range('/*', '*/')) -- --- This comment starts with a "/\*" sequence and contains anything up to and --- including an ending "\*/" sequence. The ending "\*/" is optional so the lexer --- can recognize unfinished comments as comments and highlight them properly. +-- This comment starts with a "/\*" sequence and contains anything up to and including an ending +-- "\*/" sequence. The ending "\*/" is optional so the lexer can recognize unfinished comments +-- as comments and highlight them properly. -- -- **Strings** -- --- It is tempting to think that a string is not much different from the block --- comment shown above in that both have start and end delimiters: +-- Most programming languages allow escape sequences in strings such that a sequence like +-- "\\"" in a double-quoted string indicates that the '"' is not the end of the +-- string. [`lexer.range()`]() handles escapes inherently. -- --- local dq_str = '"' * (l.any - '"')^0 * P('"')^-1 --- local sq_str = "'" * (l.any - "'")^0 * P("'")^-1 --- local simple_string = token(l.STRING, dq_str + sq_str) +-- local dq_str = lexer.range('"') +-- local sq_str = lexer.range("'") +-- local string = token(lexer.STRING, dq_str + sq_str) -- --- However, most programming languages allow escape sequences in strings such --- that a sequence like "\\"" in a double-quoted string indicates that the --- '"' is not the end of the string. The above token incorrectly matches --- such a string. Instead, use the [`lexer.delimited_range()`]() convenience --- function. +-- In this case, the lexer treats '\' as an escape character in a string sequence. -- --- local dq_str = l.delimited_range('"') --- local sq_str = l.delimited_range("'") --- local string = token(l.STRING, dq_str + sq_str) +-- **Numbers** -- --- In this case, the lexer treats '\' as an escape character in a string --- sequence. +-- Most programming languages have the same format for integer and float tokens, so it might +-- be as simple as using a predefined LPeg pattern: -- --- **Keywords** +-- local number = token(lexer.NUMBER, lexer.number) -- --- Instead of matching _n_ keywords with _n_ `P('keyword_`_`n`_`')` ordered --- choices, use another convenience function: [`lexer.word_match()`](). It is --- much easier and more efficient to write word matches like: +-- However, some languages allow postfix characters on integers. -- --- local keyword = token(l.KEYWORD, l.word_match{ --- 'keyword_1', 'keyword_2', ..., 'keyword_n' --- }) +-- local integer = P('-')^-1 * (lexer.dec_num * S('lL')^-1) +-- local number = token(lexer.NUMBER, lexer.float + lexer.hex_num + integer) -- --- local case_insensitive_keyword = token(l.KEYWORD, l.word_match({ --- 'KEYWORD_1', 'keyword_2', ..., 'KEYword_n' --- }, nil, true)) +-- Your language may need other tweaks, but it is up to you how fine-grained you want your +-- highlighting to be. After all, you are not writing a compiler or interpreter! -- --- local hyphened_keyword = token(l.KEYWORD, l.word_match({ --- 'keyword-1', 'keyword-2', ..., 'keyword-n' --- }, '-')) +-- #### Rules -- --- By default, characters considered to be in keywords are in the set of --- alphanumeric characters and underscores. 
#### Rules

Programming languages have grammars, which specify valid token structure. For example,
comments usually cannot appear within a string. Grammars consist of rules, which are simply
combinations of tokens. Recall from the lexer template the [`lexer.add_rule()`]() call,
which adds a rule to the lexer's grammar:

    lex:add_rule('whitespace', ws)

Each rule has an associated name, but rule names are completely arbitrary and serve only to
identify and distinguish between different rules. Rule order is important: if text does not
match the first rule added to the grammar, the lexer tries to match the second rule added, and
so on. Right now this lexer simply matches whitespace tokens under a rule named "whitespace".

To illustrate the importance of rule order, here is an example of a simplified Lua lexer:

    lex:add_rule('whitespace', token(lexer.WHITESPACE, ...))
    lex:add_rule('keyword', token(lexer.KEYWORD, ...))
    lex:add_rule('identifier', token(lexer.IDENTIFIER, ...))
    lex:add_rule('string', token(lexer.STRING, ...))
    lex:add_rule('comment', token(lexer.COMMENT, ...))
    lex:add_rule('number', token(lexer.NUMBER, ...))
    lex:add_rule('label', token(lexer.LABEL, ...))
    lex:add_rule('operator', token(lexer.OPERATOR, ...))

Note how identifiers come after keywords. In Lua, as with most programming languages,
the characters allowed in keywords and identifiers are in the same set (alphanumerics
plus underscores). If the lexer added the "identifier" rule before the "keyword" rule,
all keywords would match identifiers and thus incorrectly highlight as identifiers instead
of keywords. The same idea applies to function, constant, etc. tokens that you may want to
distinguish between: their rules should come before identifiers.

So what about text that does not match any rules? For example in Lua, the '!' character is
meaningless outside a string or comment. Normally the lexer skips over such text. If instead
you want to highlight these "syntax errors", add an additional end rule:

    lex:add_rule('whitespace', ws)
    ...
    lex:add_rule('error', token(lexer.ERROR, lexer.any))

This identifies and highlights any character not matched by an existing rule as a `lexer.ERROR`
token.

Even though the rules defined in the examples above contain a single token, rules may
consist of multiple tokens.
For example, a rule for an HTML tag could consist of a tag token
followed by an arbitrary number of attribute tokens, allowing the lexer to highlight all
tokens separately. That rule might look something like this:

    lex:add_rule('tag', tag_start * (ws * attributes)^0 * tag_end^-1)
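The names `tag_start`, `ws`, `attributes`, and `tag_end` above are placeholders for tokens
the lexer would have to define itself. A rough sketch of what they might look like
(illustrative only, not the actual HTML lexer's definitions; the custom 'tag' and 'attribute'
token names would need styles assigned, as described later):

    local ws = token(lexer.WHITESPACE, lexer.space^1)
    -- The opening delimiter plus element name, e.g. '<div'.
    local tag_start = token('tag', '<' * lexer.word)
    -- An optional '/' plus the closing '>'.
    local tag_end = token('tag', P('/')^-1 * '>')
    -- A single name="value" attribute, highlighted as three separate tokens.
    local attributes = token('attribute', lexer.word) * token(lexer.OPERATOR, '=') *
      token(lexer.STRING, lexer.range('"'))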
Note however that lexers with complex rules like these are more prone to lose track of their
state, especially if they span multiple lines.

#### Summary

Lexers primarily consist of tokens and grammar rules. At your disposal are a number of
convenience patterns and functions for rapidly creating a lexer. If you choose to use
predefined token names for your tokens, you do not have to define how the lexer highlights
them. The tokens will inherit the default syntax highlighting color theme your editor uses.

### Advanced Techniques

#### Styles and Styling

The most basic form of syntax highlighting is assigning different colors to different tokens.
Instead of highlighting with just colors, Scintilla allows for more rich highlighting,
or "styling", with different fonts, font sizes, font attributes, and foreground and background
colors, just to name a few. The unit of this rich highlighting is called a "style". Styles
are simply Lua tables of properties. By default, lexers associate predefined token names like
`lexer.WHITESPACE`, `lexer.COMMENT`, `lexer.STRING`, etc. with particular styles as part
of a universal color theme. These predefined styles are contained in [`lexer.styles`](),
and you may define your own styles. See that table's documentation for more information. As
with token names, LPeg patterns, and styles, there is a set of predefined color names,
but they vary depending on the current color theme in use. Therefore, it is generally not
a good idea to manually define colors within styles in your lexer since they might not fit
into a user's chosen color theme. Try to refrain from even using predefined colors in a
style because that color may be theme-specific. Instead, the best practice is to either use
predefined styles or derive new color-agnostic styles from predefined ones. For example, Lua
"longstring" tokens use the existing `lexer.styles.string` style instead of defining a new one.

##### Example Styles

Defining styles is pretty straightforward. An empty style that inherits the default theme
settings is simply an empty table:

    local style_nothing = {}

A similar style but with a bold font face looks like this:

    local style_bold = {bold = true}

You can derive new styles from predefined ones without having to rewrite them. This operation
leaves the old style unchanged. For example, if you had a "static variable" token whose
style you wanted to base off of `lexer.styles.variable`, it would probably look like:

    local style_static_var = lexer.styles.variable .. {italics = true}

The color theme files in the *lexers/themes/* folder give more examples of style definitions.

#### Token Styles

Lexers use the [`lexer.add_style()`]() function to assign styles to particular tokens. Recall
the token definition and rule from the lexer template:

    local ws = token(lexer.WHITESPACE, lexer.space^1)
    lex:add_rule('whitespace', ws)

Why is a style not assigned to the `lexer.WHITESPACE` token? As mentioned earlier, lexers
automatically associate tokens that use predefined token names with a particular style. Only
tokens with custom token names need manual style associations.
As an example, consider a custom whitespace token:

    local ws = token('custom_whitespace', lexer.space^1)

Assigning a style to this token looks like:

    lex:add_style('custom_whitespace', lexer.styles.whitespace)

Do not confuse token names with rule names. They are completely different entities. In the
example above, the lexer associates the "custom_whitespace" token with the existing style
for `lexer.WHITESPACE` tokens. If instead you prefer to color the background of whitespace
a shade of grey, it might look like:

    lex:add_style('custom_whitespace', lexer.styles.whitespace .. {back = lexer.colors.grey})

Remember to refrain from assigning specific colors in styles, but in this case, all user
color themes probably define `colors.grey`.

#### Line Lexers

By default, lexers match the arbitrary chunks of text passed to them by Scintilla. These
chunks may be a full document, only the visible part of a document, or even just portions
of lines. Some lexers need to match whole lines. For example, a lexer for the output of a
file "diff" needs to know if the line started with a '+' or '-' and then style the entire
line accordingly. To indicate that your lexer matches by line, create the lexer with an
extra parameter:

    local lex = lexer.new('?', {lex_by_line = true})

Now the input text for the lexer is a single line at a time. Keep in mind that line lexers
do not have the ability to look ahead at subsequent lines.
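Continuing the "diff" example, here is a sketch of such a lexer. The 'addition' and
'deletion' token names are custom and illustrative, so they receive styles explicitly, and
the sketch assumes the user's theme defines `colors.green` and `colors.red`:

    local lex = lexer.new('mydiff', {lex_by_line = true})
    -- With lex_by_line enabled, each chunk is exactly one line, so a leading
    -- '+' or '-' is guaranteed to be at the start of a line.
    lex:add_rule('addition', token('addition', '+' * lexer.nonnewline^0))
    lex:add_rule('deletion', token('deletion', '-' * lexer.nonnewline^0))
    lex:add_style('addition', {fore = lexer.colors.green})
    lex:add_style('deletion', {fore = lexer.colors.red})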
#### Embedded Lexers

Lexers embed within one another very easily, requiring minimal effort. In the following
sections, the lexer being embedded is called the "child" lexer and the lexer a child is
being embedded in is called the "parent". For example, consider an HTML lexer and a CSS
lexer. Either lexer stands alone for styling their respective HTML and CSS files. However, CSS
can be embedded inside HTML. In this specific case, the CSS lexer is the "child" lexer with
the HTML lexer being the "parent". Now consider an HTML lexer and a PHP lexer. This sounds
a lot like the case with CSS, but there is a subtle difference: PHP _embeds itself into_
HTML while CSS is _embedded in_ HTML. This fundamental difference results in two types of
embedded lexers: a parent lexer that embeds other child lexers in it (like HTML embedding CSS),
and a child lexer that embeds itself into a parent lexer (like PHP embedding itself in HTML).

##### Parent Lexer

Before embedding a child lexer into a parent lexer, the parent lexer needs to load the child
lexer. This is done with the [`lexer.load()`]() function. For example, loading the CSS lexer
within the HTML lexer looks like:

    local css = lexer.load('css')

The next part of the embedding process is telling the parent lexer when to switch over
to the child lexer and when to switch back. The lexer refers to these indications as the
"start rule" and "end rule", respectively; both are just LPeg patterns.
Continuing with the
HTML/CSS example, the transition from HTML to CSS is when the lexer encounters a "style"
tag with a "type" attribute whose value is "text/css":

    local css_tag = P('<style') * P(function(input, index)
      if input:find('^[^>]+type="text/css"', index) then return index end
    end)

This pattern looks for the beginning of a "style" tag and searches its attribute list for
the text "`type="text/css"`". (In this simplified example, the Lua pattern does not consider
whitespace around the '=' nor does it consider that using single quotes is valid.) If there
is a match, the functional pattern returns a value instead of `nil`. In this case, the value
returned does not matter because we ultimately want to style the "style" tag as an HTML tag,
so the actual start rule looks like this:

    local css_start_rule = #css_tag * tag

Now that the parent knows when to switch to the child, it needs to know when to switch
back. In the case of HTML/CSS, the switch back occurs when the lexer encounters an ending
"style" tag, though the lexer should still style the tag as an HTML tag:

    local css_end_rule = #P('</style>') * tag

Once the parent loads the child lexer and defines the child's start and end rules, it embeds
the child with the [`lexer.embed()`]() function:

    lex:embed(css, css_start_rule, css_end_rule)

##### Child Lexer

The process for instructing a child lexer to embed itself into a parent is very similar to
embedding a child into a parent: first, load the parent lexer into the child lexer with the
[`lexer.load()`]() function and then create start and end rules for the child lexer. However,
in this case, call [`lexer.embed()`]() with switched arguments.
For example, in the PHP lexer:

    local html = lexer.load('html')
    local php_start_rule = token('php_tag', '<?php ')
    local php_end_rule = token('php_tag', '?>')
    lex:add_style('php_tag', lexer.styles.embedded)
    html:embed(lex, php_start_rule, php_end_rule)

#### Lexers with Complex State

A vast majority of lexers are not stateful and can operate on any chunk of text in a
document. However, there may be rare cases where a lexer does need to keep track of some
sort of persistent state. Rather than using `lpeg.P` function patterns that set state
variables, it is recommended to make use of Scintilla's built-in, per-line state integers via
[`lexer.line_state`](). It was designed to accommodate up to 32 bit flags for tracking state.
[`lexer.line_from_position()`]() will return the line for any position given to an `lpeg.P`
function pattern. (Any positions derived from that position argument will also work.)
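As a minimal sketch of the idea, assuming a lexer that needs to remember that some construct
opened on a line (the flag value and the construct itself are hypothetical, and the sketch
assumes `lexer.line_state` is indexable by line number):

    local P = lpeg.P

    local IN_CONSTRUCT = 1 -- hypothetical flag; up to 32 bits are available

    -- Record the flag for the line containing the matched position so that a
    -- later lexing pass starting on a following line can query it.
    local set_state = P(function(input, index)
      lexer.line_state[lexer.line_from_position(index)] = IN_CONSTRUCT
      return index
    end)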
Writing stateful lexers is beyond the scope of this document.

### Code Folding

When reading source code, it is occasionally helpful to temporarily hide blocks of code like
functions, classes, comments, etc. This is the concept of "folding". In the Textadept and
SciTE editors for example, little indicators in the editor margins appear next to code that
can be folded at places called "fold points". When the user clicks an indicator, the editor
hides the code associated with the indicator until the user clicks the indicator again. The
lexer specifies these fold points and what code exactly to fold.

The fold points for most languages occur on keywords or character sequences. Examples of
fold keywords are "if" and "end" in Lua and examples of fold character sequences are '{',
'}', "/\*", and "\*/" in C for code block and comment delimiters, respectively. However,
these fold points cannot occur just anywhere. For example, lexers should not recognize fold
keywords that appear within strings or comments. The [`lexer.add_fold_point()`]() function
allows you to conveniently define fold points with such granularity. For example, consider C:

    lex:add_fold_point(lexer.OPERATOR, '{', '}')
    lex:add_fold_point(lexer.COMMENT, '/*', '*/')

The first assignment states that any '{' or '}' that the lexer recognizes as a `lexer.OPERATOR`
token is a fold point. Likewise, the second assignment states that any "/\*" or "\*/" that
the lexer recognizes as part of a `lexer.COMMENT` token is a fold point. The lexer does
not consider any occurrences of these characters outside their defined tokens (such as in
a string) as fold points. How do you specify fold keywords? Here is an example for Lua:

    lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
    lex:add_fold_point(lexer.KEYWORD, 'do', 'end')
    lex:add_fold_point(lexer.KEYWORD, 'function', 'end')
    lex:add_fold_point(lexer.KEYWORD, 'repeat', 'until')

If your lexer has case-insensitive keywords as fold points, simply add a
`case_insensitive_fold_points = true` option to [`lexer.new()`](), and specify keywords in
lower case.
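For example, a sketch for a language with case-insensitive keywords:

    local lex = lexer.new('?', {case_insensitive_fold_points = true})
    lex:add_fold_point(lexer.KEYWORD, 'begin', 'end') -- matches BEGIN/End/etc. too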
If your lexer needs to do some additional processing in order to determine if a token is
a fold point, pass a function that returns an integer to `lex:add_fold_point()`. Returning
`1` indicates the token is a beginning fold point and returning `-1` indicates the token is
an ending fold point. Returning `0` indicates the token is not a fold point. For example:

    local function fold_strange_token(text, pos, line, s, symbol)
      if ... then
        return 1 -- beginning fold point
      elseif ... then
        return -1 -- ending fold point
      end
      return 0
    end

    lex:add_fold_point('strange_token', '|', fold_strange_token)

Any time the lexer encounters a '|' that is a "strange_token", it calls the `fold_strange_token`
function to determine if '|' is a fold point. The lexer calls these functions with the
following arguments: the text to identify fold points in, the beginning position of the
current line in the text to fold, the current line's text, the position in the current line
the fold point text starts at, and the fold point text itself.

#### Fold by Indentation

Some languages have significant whitespace and/or no delimiters that indicate fold points. If
your lexer falls into this category and you would like to mark fold points based on changes
in indentation, create the lexer with a `fold_by_indentation = true` option:

    local lex = lexer.new('?', {fold_by_indentation = true})

### Using Lexers

**Textadept**

Put your lexer in your *~/.textadept/lexers/* directory so you do not overwrite it when
upgrading Textadept. Also, lexers in this directory override default lexers. Thus, Textadept
loads a user *lua* lexer instead of the default *lua* lexer. This is convenient for tweaking
a default lexer to your liking. Then add a [file type](#textadept.file_types) for your lexer
if necessary.

**SciTE**

Create a *.properties* file for your lexer and `import` it in either your *SciTEUser.properties*
or *SciTEGlobal.properties*. The contents of the *.properties* file should contain:

    file.patterns.[lexer_name]=[file_patterns]
    lexer.$(file.patterns.[lexer_name])=[lexer_name]

where `[lexer_name]` is the name of your lexer (minus the *.lua* extension) and
`[file_patterns]` is a set of file extensions to use your lexer for.
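For example, a hypothetical *mini.properties* for the "mini" lexer sketched earlier might
contain (the file extensions are made up):

    file.patterns.mini=*.mini;*.mn
    lexer.$(file.patterns.mini)=mini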
Please note that Lua lexers ignore any styling information in *.properties* files. Your
theme file in the *lexers/themes/* directory contains styling information.

### Migrating Legacy Lexers

Legacy lexers are of the form:

    local l = require('lexer')
    local token, word_match = l.token, l.word_match
    local P, R, S = lpeg.P, lpeg.R, lpeg.S

    local M = {_NAME = '?'}

    [... token and pattern definitions ...]

    M._rules = {
      {'rule', pattern},
      [...]
    }

    M._tokenstyles = {
      ['token'] = 'style',
      [...]
    }

    M._foldsymbols = {
      _patterns = {...},
      ['token'] = {['start'] = 1, ['end'] = -1},
      [...]
    }

    return M

While Scintillua will handle such legacy lexers just fine without any changes, it is
recommended that you migrate yours. The migration process is fairly straightforward:

1. Replace all instances of `l` with `lexer`, as it's better practice and results in less
   confusion.
2. Replace `local M = {_NAME = '?'}` with `local lex = lexer.new('?')`, where `?` is the
   name of your legacy lexer. At the end of the lexer, change `return M` to `return lex`.
3. Instead of defining rules towards the end of your lexer, define your rules as you define
   your tokens and patterns using [`lex:add_rule()`](#lexer.add_rule).
4. Similarly, any custom token names should have their styles immediately defined using
   [`lex:add_style()`](#lexer.add_style).
5. Optionally convert any table arguments passed to [`lexer.word_match()`]() to a
   space-separated string of words.
6. Replace any calls to `lexer.embed(M, child, ...)` and `lexer.embed(parent, M, ...)` with
   [`lex:embed`](#lexer.embed)`(child, ...)` and `parent:embed(lex, ...)`, respectively.
7. Define fold points with simple calls to [`lex:add_fold_point()`](#lexer.add_fold_point). No
   need to mess with Lua patterns anymore.
8. Any legacy lexer options such as `M._FOLDBYINDENTATION`, `M._LEXBYLINE`, `M._lexer`,
   etc. should be added as table options to [`lexer.new()`]().
9. Any external lexer rule fetching and/or modifications via `lexer._RULES` should be changed
   to use [`lexer.get_rule()`]() and [`lexer.modify_rule()`](), as sketched below.
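As a sketch of step 9, a lexer that extends another lexer's keyword rule might do something
like the following (the rule name and added words are illustrative):

    -- Fetch the existing rule and prepend extra keywords to it.
    local extra = token(lexer.KEYWORD, word_match('extra_keyword_1 extra_keyword_2'))
    lex:modify_rule('keyword', extra + lex:get_rule('keyword'))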
Also, lexers in this directory --- override default lexers. Thus, Textadept loads a user *lua* lexer instead of --- the default *lua* lexer. This is convenient for tweaking a default lexer to --- your liking. Then add a [file type][] for your lexer if necessary. +-- M._foldsymbols = { +-- _patterns = {'[{}]'}, +-- [l.OPERATOR] = {['{'] = 1, ['}'] = -1} +-- } -- --- [file type]: _M.textadept.file_types.html +-- return M -- --- ### SciTE +-- Following the migration steps would yield: -- --- Create a *.properties* file for your lexer and `import` it in either your --- *SciTEUser.properties* or *SciTEGlobal.properties*. The contents of the --- *.properties* file should contain: +-- local lexer = require('lexer') +-- local token, word_match = lexer.token, lexer.word_match +-- local P, S = lpeg.P, lpeg.S -- --- file.patterns.[lexer_name]=[file_patterns] --- lexer.$(file.patterns.[lexer_name])=[lexer_name] +-- local lex = lexer.new('legacy') -- --- where `[lexer_name]` is the name of your lexer (minus the *.lua* extension) --- and `[file_patterns]` is a set of file extensions to use your lexer for. +-- lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1)) +-- lex:add_rule('keyword', token(lexer.KEYWORD, word_match('foo bar baz'))) +-- lex:add_rule('custom', token('custom', 'quux')) +-- lex:add_style('custom', lexer.styles.keyword .. {bold = true}) +-- lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word)) +-- lex:add_rule('string', token(lexer.STRING, lexer.range('"'))) +-- lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('#'))) +-- lex:add_rule('number', token(lexer.NUMBER, lexer.number)) +-- lex:add_rule('operator', token(lexer.OPERATOR, S('+-*/%^=<>,.()[]{}'))) -- --- Please note that Lua lexers ignore any styling information in *.properties* --- files. Your theme file in the *lexers/themes/* directory contains styling --- information. +-- lex:add_fold_point(lexer.OPERATOR, '{', '}') -- --- ## Considerations +-- return lex -- --- ### Performance +-- ### Considerations -- --- There might be some slight overhead when initializing a lexer, but loading a --- file from disk into Scintilla is usually more expensive. On modern computer --- systems, I see no difference in speed between LPeg lexers and Scintilla's C++ --- ones. Optimize lexers for speed by re-arranging rules in the `_rules` table --- so that the most common rules match first. Do keep in mind that order matters --- for similar rules. +-- #### Performance -- --- ### Limitations +-- There might be some slight overhead when initializing a lexer, but loading a file from disk +-- into Scintilla is usually more expensive. On modern computer systems, I see no difference in +-- speed between Lua lexers and Scintilla's C++ ones. Optimize lexers for speed by re-arranging +-- `lexer.add_rule()` calls so that the most common rules match first. Do keep in mind that +-- order matters for similar rules. -- --- Embedded preprocessor languages like PHP cannot completely embed in their --- parent languages in that the parent's tokens do not support start and end --- rules. This mostly goes unnoticed, but code like +-- In some cases, folding may be far more expensive than lexing, particularly in lexers with a +-- lot of potential fold points. If your lexer is exhibiting signs of slowness, try disabling +-- folding in your text editor first. 
If that speeds things up, you can try reducing the number +-- of fold points you added, overriding `lexer.fold()` with your own implementation, or simply +-- eliminating folding support from your lexer. -- --- <div id="<?php echo $id; ?>"> +-- #### Limitations -- --- or +-- Embedded preprocessor languages like PHP cannot completely embed in their parent languages +-- in that the parent's tokens do not support start and end rules. This mostly goes unnoticed, +-- but code like -- --- <div <?php if ($odd) { echo 'class="odd"'; } ?>> +-- <div id="<?php echo $id; ?>"> -- -- will not style correctly. -- --- ### Troubleshooting +-- #### Troubleshooting -- --- Errors in lexers can be tricky to debug. Lexers print Lua errors to --- `io.stderr` and `_G.print()` statements to `io.stdout`. Running your editor --- from a terminal is the easiest way to see errors as they occur. +-- Errors in lexers can be tricky to debug. Lexers print Lua errors to `io.stderr` and `_G.print()` +-- statements to `io.stdout`. Running your editor from a terminal is the easiest way to see +-- errors as they occur. -- --- ### Risks +-- #### Risks -- --- Poorly written lexers have the ability to crash Scintilla (and thus its --- containing application), so unsaved data might be lost. However, I have only --- observed these crashes in early lexer development, when syntax errors or --- pattern errors are present. Once the lexer actually starts styling text --- (either correctly or incorrectly, it does not matter), I have not observed +-- Poorly written lexers have the ability to crash Scintilla (and thus its containing application), +-- so unsaved data might be lost. However, I have only observed these crashes in early lexer +-- development, when syntax errors or pattern errors are present. Once the lexer actually starts +-- styling text (either correctly or incorrectly, it does not matter), I have not observed -- any crashes. -- --- ### Acknowledgements +-- #### Acknowledgements -- --- Thanks to Peter Odding for his [lexer post][] on the Lua mailing list --- that inspired me, and thanks to Roberto Ierusalimschy for LPeg. +-- Thanks to Peter Odding for his [lexer post][] on the Lua mailing list that provided inspiration, +-- and thanks to Roberto Ierusalimschy for LPeg. -- -- [lexer post]: http://lua-users.org/lists/lua-l/2007-04/msg00116.html --- @field LEXERPATH (string) --- The path used to search for a lexer to load. --- Identical in format to Lua's `package.path` string. --- The default value is `package.path`. -- @field DEFAULT (string) -- The token name for default tokens. -- @field WHITESPACE (string) @@ -740,58 +700,6 @@ local M = {} -- The token name for label tokens. -- @field REGEX (string) -- The token name for regex tokens. --- @field STYLE_CLASS (string) --- The style typically used for class definitions. --- @field STYLE_COMMENT (string) --- The style typically used for code comments. --- @field STYLE_CONSTANT (string) --- The style typically used for constants. --- @field STYLE_ERROR (string) --- The style typically used for erroneous syntax. --- @field STYLE_FUNCTION (string) --- The style typically used for function definitions. --- @field STYLE_KEYWORD (string) --- The style typically used for language keywords. --- @field STYLE_LABEL (string) --- The style typically used for labels. --- @field STYLE_NUMBER (string) --- The style typically used for numbers. --- @field STYLE_OPERATOR (string) --- The style typically used for operators. 
--- @field STYLE_REGEX (string) --- The style typically used for regular expression strings. --- @field STYLE_STRING (string) --- The style typically used for strings. --- @field STYLE_PREPROCESSOR (string) --- The style typically used for preprocessor statements. --- @field STYLE_TYPE (string) --- The style typically used for static types. --- @field STYLE_VARIABLE (string) --- The style typically used for variables. --- @field STYLE_WHITESPACE (string) --- The style typically used for whitespace. --- @field STYLE_EMBEDDED (string) --- The style typically used for embedded code. --- @field STYLE_IDENTIFIER (string) --- The style typically used for identifier words. --- @field STYLE_DEFAULT (string) --- The style all styles are based off of. --- @field STYLE_LINENUMBER (string) --- The style used for all margins except fold margins. --- @field STYLE_BRACELIGHT (string) --- The style used for highlighted brace characters. --- @field STYLE_BRACEBAD (string) --- The style used for unmatched brace characters. --- @field STYLE_CONTROLCHAR (string) --- The style used for control characters. --- Color attributes are ignored. --- @field STYLE_INDENTGUIDE (string) --- The style used for indentation guides. --- @field STYLE_CALLTIP (string) --- The style used by call tips if [`buffer.call_tip_use_style`]() is set. --- Only the font name, size, and color attributes are used. --- @field STYLE_FOLDDISPLAYTEXT (string) --- The style used for fold display text. -- @field any (pattern) -- A pattern that matches any single character. -- @field ascii (pattern) @@ -803,8 +711,7 @@ local M = {} -- @field digit (pattern) -- A pattern that matches any digit ('0'-'9'). -- @field alnum (pattern) --- A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z', --- '0'-'9'). +-- A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z', '0'-'9'). -- @field lower (pattern) -- A pattern that matches any lower case character ('a'-'z'). -- @field upper (pattern) @@ -818,18 +725,14 @@ local M = {} -- @field print (pattern) -- A pattern that matches any printable character (' ' to '~'). -- @field punct (pattern) --- A pattern that matches any punctuation character ('!' to '/', ':' to '@', --- '[' to ''', '{' to '~'). +-- A pattern that matches any punctuation character ('!' to '/', ':' to '@', '[' to ''', +-- '{' to '~'). -- @field space (pattern) --- A pattern that matches any whitespace character ('\t', '\v', '\f', '\n', --- '\r', space). +-- A pattern that matches any whitespace character ('\t', '\v', '\f', '\n', '\r', space). -- @field newline (pattern) --- A pattern that matches any set of end of line characters. +-- A pattern that matches a sequence of end of line characters. -- @field nonnewline (pattern) -- A pattern that matches any single, non-newline character. --- @field nonnewline_esc (pattern) --- A pattern that matches any single, non-newline character or any set of end --- of line characters escaped with '\'. -- @field dec_num (pattern) -- A pattern that matches a decimal number. -- @field hex_num (pattern) @@ -840,9 +743,12 @@ local M = {} -- A pattern that matches either a decimal, hexadecimal, or octal number. -- @field float (pattern) -- A pattern that matches a floating point number. +-- @field number (pattern) +-- A pattern that matches a typical number, either a floating point, decimal, hexadecimal, +-- or octal number. -- @field word (pattern) --- A pattern that matches a typical word. 
Words begin with a letter or --- underscore and consist of alphanumeric and underscore characters. +-- A pattern that matches a typical word. Words begin with a letter or underscore and consist +-- of alphanumeric and underscore characters. -- @field FOLD_BASE (number) -- The initial (root) fold level. -- @field FOLD_BLANK (number) @@ -850,9 +756,8 @@ local M = {} -- @field FOLD_HEADER (number) -- Flag indicating the line is fold point. -- @field fold_level (table, Read-only) --- Table of fold level bit-masks for line numbers starting from zero. --- Fold level masks are composed of an integer level combined with any of the --- following bits: +-- Table of fold level bit-masks for line numbers starting from 1. +-- Fold level masks are composed of an integer level combined with any of the following bits: -- -- * `lexer.FOLD_BASE` -- The initial fold level. @@ -861,87 +766,328 @@ local M = {} -- * `lexer.FOLD_HEADER` -- The line is a header, or fold point. -- @field indent_amount (table, Read-only) --- Table of indentation amounts in character columns, for line numbers --- starting from zero. +-- Table of indentation amounts in character columns, for line numbers starting from 1. -- @field line_state (table) --- Table of integer line states for line numbers starting from zero. +-- Table of integer line states for line numbers starting from 1. -- Line states can be used by lexers for keeping track of persistent states. -- @field property (table) -- Map of key-value string pairs. -- @field property_expanded (table, Read-only) --- Map of key-value string pairs with `$()` and `%()` variable replacement --- performed in values. +-- Map of key-value string pairs with `$()` and `%()` variable replacement performed in values. -- @field property_int (table, Read-only) --- Map of key-value pairs with values interpreted as numbers, or `0` if not --- found. +-- Map of key-value pairs with values interpreted as numbers, or `0` if not found. -- @field style_at (table, Read-only) -- Table of style names at positions in the buffer starting from 1. +-- @field folding (boolean) +-- Whether or not folding is enabled for the lexers that support it. +-- This option is disabled by default. +-- This is an alias for `lexer.property['fold'] = '1|0'`. +-- @field fold_on_zero_sum_lines (boolean) +-- Whether or not to mark as a fold point lines that contain both an ending and starting fold +-- point. For example, `} else {` would be marked as a fold point. +-- This option is disabled by default. This is an alias for +-- `lexer.property['fold.on.zero.sum.lines'] = '1|0'`. +-- @field fold_compact (boolean) +-- Whether or not blank lines after an ending fold point are included in that +-- fold. +-- This option is disabled by default. +-- This is an alias for `lexer.property['fold.compact'] = '1|0'`. +-- @field fold_by_indentation (boolean) +-- Whether or not to fold based on indentation level if a lexer does not have +-- a folder. +-- Some lexers automatically enable this option. It is disabled by default. +-- This is an alias for `lexer.property['fold.by.indentation'] = '1|0'`. +-- @field fold_line_groups (boolean) +-- Whether or not to fold multiple, consecutive line groups (such as line comments and import +-- statements) and only show the top line. +-- This option is disabled by default. +-- This is an alias for `lexer.property['fold.line.groups'] = '1|0'`. module('lexer')]=] +if not require then + -- Substitute for Lua's require() function, which does not require the package module to + -- be loaded. 
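+ -- It simply resolves the name 'lexer' to this module itself and looks any other module
+ -- name up in the global namespace.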
+ -- Note: all modules must be in the global namespace, which is the case in LexerLPeg's default + -- Lua State. + function require(name) return name == 'lexer' and M or _G[name] end +end + +local print = function(...) + local args = table.pack(...) + local msg = {} + for i = 1, args.n do + msg[#msg + 1] = tostring(args[i]) + end + vis:info(table.concat(msg, ' ')) +end + lpeg = require('lpeg') local lpeg_P, lpeg_R, lpeg_S, lpeg_V = lpeg.P, lpeg.R, lpeg.S, lpeg.V local lpeg_Ct, lpeg_Cc, lpeg_Cp = lpeg.Ct, lpeg.Cc, lpeg.Cp -local lpeg_Cmt, lpeg_C, lpeg_Carg = lpeg.Cmt, lpeg.C, lpeg.Carg +local lpeg_Cmt, lpeg_C = lpeg.Cmt, lpeg.C local lpeg_match = lpeg.match -M.LEXERPATH = package.path +-- Searches for the given *name* in the given *path*. +-- This is a safe implementation of Lua 5.2's `package.searchpath()` function that does not +-- require the package module to be loaded. +local function searchpath(name, path) + local tried = {} + for part in path:gmatch('[^;]+') do + local filename = part:gsub('%?', name) + local ok, errmsg = loadfile(filename) + if ok or not errmsg:find('cannot open') then return filename end + tried[#tried + 1] = string.format("no file '%s'", filename) + end + return nil, table.concat(tried, '\n') +end --- Table of loaded lexers. -M.lexers = {} +--- +-- Map of color name strings to color values in `0xBBGGRR` or `"#RRGGBB"` format. +-- Note: for applications running within a terminal emulator, only 16 color values are recognized, +-- regardless of how many colors a user's terminal actually supports. (A terminal emulator's +-- settings determines how to actually display these recognized color values, which may end up +-- being mapped to a completely different color set.) In order to use the light variant of a +-- color, some terminals require a style's `bold` attribute must be set along with that normal +-- color. Recognized color values are black (0x000000), red (0x000080), green (0x008000), yellow +-- (0x008080), blue (0x800000), magenta (0x800080), cyan (0x808000), white (0xC0C0C0), light black +-- (0x404040), light red (0x0000FF), light green (0x00FF00), light yellow (0x00FFFF), light blue +-- (0xFF0000), light magenta (0xFF00FF), light cyan (0xFFFF00), and light white (0xFFFFFF). +-- @name colors +-- @class table +M.colors = setmetatable({}, { + __index = function(_, name) + local color = M.property['color.' .. name] + return tonumber(color) or color + end, __newindex = function(_, name, color) M.property['color.' .. name] = color end +}) --- Keep track of the last parent lexer loaded. This lexer's rules are used for --- proxy lexers (those that load parent and child lexers to embed) that do not --- declare a parent lexer. -local parent_lexer +-- A style object that distills into a property string that can be read by the LPeg lexer. +local style_obj = {} +style_obj.__index = style_obj -if not package.searchpath then - -- Searches for the given *name* in the given *path*. - -- This is an implementation of Lua 5.2's `package.searchpath()` function for - -- Lua 5.1. - function package.searchpath(name, path) - local tried = {} - for part in path:gmatch('[^;]+') do - local filename = part:gsub('%?', name) - local f = io.open(filename, 'r') - if f then f:close() return filename end - tried[#tried + 1] = ("no file '%s'"):format(filename) +-- Create a style object from a style name, property table, or legacy style string. 
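+-- For example (illustrative results only; the order of fields serialized from a table may vary):
+--   style_obj.new('keyword')                       --> '$(style.keyword)'
+--   style_obj.new({fore = '#FF0000', bold = true}) --> 'fore:#FF0000,bold'
+--   style_obj.new('fore:#FF0000,bold')             --> 'fore:#FF0000,bold' (passed through as-is)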
+function style_obj.new(name_or_props) + local prop_string = tostring(name_or_props) + if type(name_or_props) == 'string' and name_or_props:find('^[%w_]+$') then + prop_string = string.format('$(style.%s)', name_or_props) + elseif type(name_or_props) == 'table' then + local settings = {} + for k, v in pairs(name_or_props) do + settings[#settings + 1] = type(v) ~= 'boolean' and string.format('%s:%s', k, v) or + string.format('%s%s', v and '' or 'not', k) end - return nil, table.concat(tried, '\n') + prop_string = table.concat(settings, ',') end + return setmetatable({prop_string = prop_string}, style_obj) end --- Adds a rule to a lexer's current ordered list of rules. +-- Returns a new style based on this one with the properties defined in the given table or +-- legacy style string. +function style_obj.__concat(self, props) + if type(props) == 'table' then props = tostring(style_obj.new(props)) end + return setmetatable({prop_string = string.format('%s,%s', self.prop_string, props)}, style_obj) +end + +-- Returns this style object as property string for use with the LPeg lexer. +function style_obj.__tostring(self) return self.prop_string end + +--- +-- Map of style names to style definition tables. +-- +-- Style names consist of the following default names as well as the token names defined by lexers. +-- +-- * `default`: The default style all others are based on. +-- * `line_number`: The line number margin style. +-- * `control_char`: The style of control character blocks. +-- * `indent_guide`: The style of indentation guides. +-- * `call_tip`: The style of call tip text. Only the `font`, `size`, `fore`, and `back` style +-- definition fields are supported. +-- * `fold_display_text`: The style of text displayed next to folded lines. +-- * `class`, `comment`, `constant`, `embedded`, `error`, `function`, `identifier`, `keyword`, +-- `label`, `number`, `operator`, `preprocessor`, `regex`, `string`, `type`, `variable`, +-- `whitespace`: Some token names used by lexers. Some lexers may define more token names, +-- so this list is not exhaustive. +-- * *`lang`*`_whitespace`: A special style for whitespace tokens in lexer name *lang*. It +-- inherits from `whitespace`, and is used in place of it for all lexers. +-- +-- Style definition tables may contain the following fields: +-- +-- * `font`: String font name. +-- * `size`: Integer font size. +-- * `bold`: Whether or not the font face is bold. The default value is `false`. +-- * `weight`: Integer weight or boldness of a font, between 1 and 999. +-- * `italics`: Whether or not the font face is italic. The default value is `false`. +-- * `underlined`: Whether or not the font face is underlined. The default value is `false`. +-- * `fore`: Font face foreground color in `0xBBGGRR` or `"#RRGGBB"` format. +-- * `back`: Font face background color in `0xBBGGRR` or `"#RRGGBB"` format. +-- * `eolfilled`: Whether or not the background color extends to the end of the line. The +-- default value is `false`. +-- * `case`: Font case: `'u'` for upper, `'l'` for lower, and `'m'` for normal, mixed case. The +-- default value is `'m'`. +-- * `visible`: Whether or not the text is visible. The default value is `true`. +-- * `changeable`: Whether the text is changeable instead of read-only. The default value is +-- `true`. 
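+--
+-- As an illustrative sketch, a theme or lexer may derive new styles from existing ones via
+-- this table (assuming the active theme defines a `color.grey` property):
+--
+-- lexer.styles.embedded = lexer.styles.default .. {back = lexer.colors.grey}
+-- lexer.styles.my_lang_whitespace = lexer.styles.whitespace .. {eolfilled = true}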
+-- @class table +-- @name styles +M.styles = setmetatable({}, { + __index = function(_, name) return style_obj.new(name) end, __newindex = function(_, name, style) + if getmetatable(style) ~= style_obj then style = style_obj.new(style) end + M.property['style.' .. name] = tostring(style) + end +}) + +-- Default styles. +local default = { + 'nothing', 'whitespace', 'comment', 'string', 'number', 'keyword', 'identifier', 'operator', + 'error', 'preprocessor', 'constant', 'variable', 'function', 'class', 'type', 'label', 'regex', + 'embedded' +} +for _, name in ipairs(default) do + M[name:upper()] = name + M['STYLE_' .. name:upper()] = style_obj.new(name) -- backward compatibility +end +-- Predefined styles. +local predefined = { + 'default', 'line_number', 'brace_light', 'brace_bad', 'control_char', 'indent_guide', 'call_tip', + 'fold_display_text' +} +for _, name in ipairs(predefined) do + M[name:upper()] = name + M['STYLE_' .. name:upper()] = style_obj.new(name) -- backward compatibility +end + +--- +-- Adds pattern *rule* identified by string *id* to the ordered list of rules for lexer *lexer*. -- @param lexer The lexer to add the given rule to. --- @param name The name associated with this rule. It is used for other lexers --- to access this particular rule from the lexer's `_RULES` table. It does not --- have to be the same as the name passed to `token`. +-- @param id The id associated with this rule. It does not have to be the same as the name +-- passed to `token()`. -- @param rule The LPeg pattern of the rule. -local function add_rule(lexer, id, rule) +-- @see modify_rule +-- @name add_rule +function M.add_rule(lexer, id, rule) + if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent if not lexer._RULES then lexer._RULES = {} - -- Contains an ordered list (by numerical index) of rule names. This is used - -- in conjunction with lexer._RULES for building _TOKENRULE. + -- Contains an ordered list (by numerical index) of rule names. This is used in conjunction + -- with lexer._RULES for building _TOKENRULE. lexer._RULEORDER = {} end lexer._RULES[id] = rule lexer._RULEORDER[#lexer._RULEORDER + 1] = id + lexer:build_grammar() end --- Adds a new Scintilla style to Scintilla. --- @param lexer The lexer to add the given style to. --- @param token_name The name of the token associated with this style. --- @param style A Scintilla style created from `style()`. --- @see style -local function add_style(lexer, token_name, style) +--- +-- Replaces in lexer *lexer* the existing rule identified by string *id* with pattern *rule*. +-- @param lexer The lexer to modify. +-- @param id The id associated with this rule. +-- @param rule The LPeg pattern of the rule. +-- @name modify_rule +function M.modify_rule(lexer, id, rule) + if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent + lexer._RULES[id] = rule + lexer:build_grammar() +end + +--- +-- Returns the rule identified by string *id*. +-- @param lexer The lexer to fetch a rule from. +-- @param id The id of the rule to fetch. +-- @return pattern +-- @name get_rule +function M.get_rule(lexer, id) + if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent + return lexer._RULES[id] +end + +--- +-- Associates string *token_name* in lexer *lexer* with style table *style*. +-- *style* may have the following fields: +-- +-- * `font`: String font name. +-- * `size`: Integer font size. +-- * `bold`: Whether or not the font face is bold. The default value is `false`. 
+-- * `weight`: Integer weight or boldness of a font, between 1 and 999.
+-- * `italics`: Whether or not the font face is italic. The default value is `false`.
+-- * `underlined`: Whether or not the font face is underlined. The default value is `false`.
+-- * `fore`: Font face foreground color in `0xBBGGRR` or `"#RRGGBB"` format.
+-- * `back`: Font face background color in `0xBBGGRR` or `"#RRGGBB"` format.
+-- * `eolfilled`: Whether or not the background color extends to the end of the line. The
+-- default value is `false`.
+-- * `case`: Font case, `'u'` for upper, `'l'` for lower, and `'m'` for normal, mixed case. The
+-- default value is `'m'`.
+-- * `visible`: Whether or not the text is visible. The default value is `true`.
+-- * `changeable`: Whether the text is changeable instead of read-only. The default value is
+-- `true`.
+--
+-- Field values may also contain "$(property.name)" expansions for properties defined in Scintilla,
+-- theme files, etc.
+-- @param lexer The lexer to add a style to.
+-- @param token_name The name of the token to associate with the style.
+-- @param style A style table, style object, or legacy style string for Scintilla.
+-- @usage lex:add_style('longstring', lexer.styles.string)
+-- @usage lex:add_style('deprecated_func', lexer.styles['function'] .. {italics = true})
+-- @usage lex:add_style('visible_ws', lexer.styles.whitespace .. {back = lexer.colors.grey})
+-- @name add_style
+function M.add_style(lexer, token_name, style)
+ local num_styles = lexer._numstyles
+ if num_styles == 33 then num_styles = num_styles + 8 end -- skip predefined
+ if num_styles >= 256 then print('Too many styles defined (256 MAX)') end
+ lexer._TOKENSTYLES[token_name], lexer._numstyles = num_styles, num_styles + 1
+ if type(style) == 'table' and not getmetatable(style) then style = style_obj.new(style) end
+ lexer._EXTRASTYLES[token_name] = tostring(style)
+ -- If the lexer is a proxy or a child that embedded itself, copy this style to the parent lexer.
+ if lexer._lexer then lexer._lexer:add_style(token_name, style) end
+end
+
+---
+-- Adds to lexer *lexer* a fold point whose beginning and end tokens are string *token_name*
+-- tokens with string content *start_symbol* and *end_symbol*, respectively.
+-- In the event that *start_symbol* may or may not be a fold point depending on context, and that
+-- additional processing is required, *end_symbol* may be a function that ultimately returns
+-- `1` (indicating a beginning fold point), `-1` (indicating an ending fold point), or `0`
+-- (indicating no fold point). That function is passed the following arguments:
+--
+-- * `text`: The text being processed for fold points.
+-- * `pos`: The position in *text* of the beginning of the line currently being processed.
+-- * `line`: The text of the line currently being processed.
+-- * `s`: The position of *start_symbol* in *line*.
+-- * `symbol`: *start_symbol* itself.
+-- @param lexer The lexer to add a fold point to.
+-- @param token_name The token name of text that indicates a fold point.
+-- @param start_symbol The text that indicates the beginning of a fold point.
+-- @param end_symbol Either the text that indicates the end of a fold point, or a function that
+-- returns whether or not *start_symbol* is a beginning fold point (1), an ending fold point
+-- (-1), or not a fold point at all (0).
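+-- For instance, the `fold_strange_token` function from the example above would be registered
+-- via `lex:add_fold_point('strange_token', '|', fold_strange_token)`.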
+-- @usage lex:add_fold_point(lexer.OPERATOR, '{', '}')
+-- @usage lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
+-- @usage lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('#'))
+-- @usage lex:add_fold_point('custom', '|', function(text, pos, line, s, symbol) ... end)
+-- @name add_fold_point
+function M.add_fold_point(lexer, token_name, start_symbol, end_symbol)
+ if not lexer._FOLDPOINTS then lexer._FOLDPOINTS = {_SYMBOLS = {}} end
+ local symbols = lexer._FOLDPOINTS._SYMBOLS
+ if not lexer._FOLDPOINTS[token_name] then lexer._FOLDPOINTS[token_name] = {} end
+ if lexer._CASEINSENSITIVEFOLDPOINTS then
+ start_symbol = start_symbol:lower()
+ if type(end_symbol) == 'string' then end_symbol = end_symbol:lower() end
+ end
+ if type(end_symbol) == 'string' then
+ if not symbols[end_symbol] then symbols[#symbols + 1], symbols[end_symbol] = end_symbol, true end
+ lexer._FOLDPOINTS[token_name][start_symbol] = 1
+ lexer._FOLDPOINTS[token_name][end_symbol] = -1
+ else
+ lexer._FOLDPOINTS[token_name][start_symbol] = end_symbol -- function or int
+ end
+ if not symbols[start_symbol] then
+ symbols[#symbols + 1], symbols[start_symbol] = start_symbol, true
+ end
+ -- If the lexer is a proxy or a child that embedded itself, copy this fold point to the
+ -- parent lexer.
+ if lexer._lexer then lexer._lexer:add_fold_point(token_name, start_symbol, end_symbol) end
+end

-- (Re)constructs `lexer._TOKENRULE`.
--- @param parent The parent lexer.
local function join_tokens(lexer)
 local patterns, order = lexer._RULES, lexer._RULEORDER
 local token_rule = patterns[order[1]]
@@ -950,218 +1096,118 @@ local function join_tokens(lexer)
 return lexer._TOKENRULE
end

--- Adds a given lexer and any of its embedded lexers to a given grammar.
--- @param grammar The grammar to add the lexer to.
--- @param lexer The lexer to add.
-local function add_lexer(grammar, lexer, token_rule)
- local token_rule = join_tokens(lexer)
- local lexer_name = lexer._NAME
- for i = 1, #lexer._CHILDREN do
- local child = lexer._CHILDREN[i]
- if child._CHILDREN then add_lexer(grammar, child) end
- local child_name = child._NAME
- local rules = child._EMBEDDEDRULES[lexer_name]
- local rules_token_rule = grammar['__'..child_name] or rules.token_rule
- grammar[child_name] = (-rules.end_rule * rules_token_rule)^0 *
- rules.end_rule^-1 * lpeg_V(lexer_name)
- local embedded_child = '_'..child_name
- grammar[embedded_child] = rules.start_rule * (-rules.end_rule *
- rules_token_rule)^0 * rules.end_rule^-1
- token_rule = lpeg_V(embedded_child) + token_rule
- end
- grammar['__'..lexer_name] = token_rule -- can contain embedded lexer rules
- grammar[lexer_name] = token_rule^0
-end
+-- Metatable for Scintillua grammars.
+-- These grammars are just tables ultimately passed to `lpeg.P()`.
+local grammar_mt = {
+ __index = {
+ -- Adds lexer *lexer* and any of its embedded lexers to this grammar.
+ -- @param lexer The lexer to add.
+ add_lexer = function(self, lexer)
+ local lexer_name = lexer._PARENTNAME or lexer._NAME
+ local token_rule = lexer:join_tokens()
+ for _, child in ipairs(lexer._CHILDREN) do
+ if child._CHILDREN then self:add_lexer(child) end
+ local rules = child._EMBEDDEDRULES[lexer_name]
+ local rules_token_rule = self['__' .. child._NAME] or rules.token_rule
+ self[child._NAME] = (-rules.end_rule * rules_token_rule)^0 * rules.end_rule^-1 *
+ lpeg_V(lexer_name)
+ local embedded_child = '_' ..
child._NAME + self[embedded_child] = rules.start_rule * (-rules.end_rule * rules_token_rule)^0 * + rules.end_rule^-1 + token_rule = lpeg_V(embedded_child) + token_rule + end + self['__' .. lexer_name] = token_rule -- can contain embedded lexer rules + self[lexer_name] = token_rule^0 + end + } +} -- (Re)constructs `lexer._GRAMMAR`. --- @param lexer The parent lexer. --- @param initial_rule The name of the rule to start lexing with. The default --- value is `lexer._NAME`. Multilang lexers use this to start with a child --- rule if necessary. +-- @param initial_rule The name of the rule to start lexing with. The default value is +-- `lexer._NAME`. Multilang lexers use this to start with a child rule if necessary. local function build_grammar(lexer, initial_rule) - local children = lexer._CHILDREN - if children then - local lexer_name = lexer._NAME - if not initial_rule then initial_rule = lexer_name end - local grammar = {initial_rule} - add_lexer(grammar, lexer) + if not lexer._RULES then return end + if lexer._CHILDREN then + if not initial_rule then initial_rule = lexer._NAME end + local grammar = setmetatable({initial_rule}, grammar_mt) + grammar:add_lexer(lexer) lexer._INITIALRULE = initial_rule lexer._GRAMMAR = lpeg_Ct(lpeg_P(grammar)) else - local function tmout(_, _, t1, redrawtime_max, flag) - if not redrawtime_max or os.clock() - t1 < redrawtime_max then return true end - if flag then flag.timedout = true end - end - local tokens = join_tokens(lexer) - -- every 500 tokens (approx. a screenful), check whether we have exceeded the timeout - lexer._GRAMMAR = lpeg_Ct((tokens * tokens^-500 * lpeg_Cmt(lpeg_Carg(1) * lpeg_Carg(2) * lpeg_Carg(3), tmout))^0) - end -end - -local string_upper = string.upper --- Default styles. -local default = { - 'nothing', 'whitespace', 'comment', 'string', 'number', 'keyword', - 'identifier', 'operator', 'error', 'preprocessor', 'constant', 'variable', - 'function', 'class', 'type', 'label', 'regex', 'embedded' -} -for i = 1, #default do - local name, upper_name = default[i], string_upper(default[i]) - M[upper_name] = name - if not M['STYLE_'..upper_name] then - M['STYLE_'..upper_name] = '' - end -end --- Predefined styles. -local predefined = { - 'default', 'linenumber', 'bracelight', 'bracebad', 'controlchar', - 'indentguide', 'calltip', 'folddisplaytext' -} -for i = 1, #predefined do - local name, upper_name = predefined[i], string_upper(predefined[i]) - M[upper_name] = name - if not M['STYLE_'..upper_name] then - M['STYLE_'..upper_name] = '' + lexer._GRAMMAR = lpeg_Ct(lexer:join_tokens()^0) end end --- --- Initializes or loads and returns the lexer of string name *name*. --- Scintilla calls this function in order to load a lexer. Parent lexers also --- call this function in order to load child lexers and vice-versa. The user --- calls this function in order to load a lexer when using Scintillua as a Lua --- library. --- @param name The name of the lexing language. --- @param alt_name The alternate name of the lexing language. This is useful for --- embedding the same child lexer with multiple sets of start and end tokens. --- @param cache Flag indicating whether or not to load lexers from the cache. --- This should only be `true` when initially loading a lexer (e.g. not from --- within another lexer for embedding purposes). --- The default value is `false`. 
--- @return lexer object --- @name load -function M.load(name, alt_name, cache) - if cache and M.lexers[alt_name or name] then return M.lexers[alt_name or name] end - parent_lexer = nil -- reset - - -- When using Scintillua as a stand-alone module, the `property` and - -- `property_int` tables do not exist (they are not useful). Create them to - -- prevent errors from occurring. - if not M.property then - M.property, M.property_int = {}, setmetatable({}, { - __index = function(t, k) return tonumber(M.property[k]) or 0 end, - __newindex = function() error('read-only property') end - }) - end - - -- Load the language lexer with its rules, styles, etc. - M.WHITESPACE = (alt_name or name)..'_whitespace' - local lexer_file, error = package.searchpath('lexers/'..name, M.LEXERPATH) - local ok, lexer = pcall(dofile, lexer_file or '') - if not ok then - return nil - end - if alt_name then lexer._NAME = alt_name end - - -- Create the initial maps for token names to style numbers and styles. - local token_styles = {} - for i = 1, #default do token_styles[default[i]] = i - 1 end - for i = 1, #predefined do token_styles[predefined[i]] = i + 31 end - lexer._TOKENSTYLES, lexer._numstyles = token_styles, #default - lexer._EXTRASTYLES = {} - - -- If the lexer is a proxy (loads parent and child lexers to embed) and does - -- not declare a parent, try and find one and use its rules. - if not lexer._rules and not lexer._lexer then lexer._lexer = parent_lexer end - - -- If the lexer is a proxy or a child that embedded itself, add its rules and - -- styles to the parent lexer. Then set the parent to be the main lexer. - if lexer._lexer then - local l, _r, _s = lexer._lexer, lexer._rules, lexer._tokenstyles - if not l._tokenstyles then l._tokenstyles = {} end - if _r then - for i = 1, #_r do - -- Prevent rule id clashes. - l._rules[#l._rules + 1] = {lexer._NAME..'_'.._r[i][1], _r[i][2]} - end - end - if _s then - for token, style in pairs(_s) do l._tokenstyles[token] = style end - end - lexer = l - end - - -- Add the lexer's styles and build its grammar. - if lexer._rules then - if lexer._tokenstyles then - for token, style in pairs(lexer._tokenstyles) do - add_style(lexer, token, style) +-- Embeds child lexer *child* in parent lexer *lexer* using patterns *start_rule* and *end_rule*, +-- which signal the beginning and end of the embedded lexer, respectively. +-- @param lexer The parent lexer. +-- @param child The child lexer. +-- @param start_rule The pattern that signals the beginning of the embedded lexer. +-- @param end_rule The pattern that signals the end of the embedded lexer. +-- @usage html:embed(css, css_start_rule, css_end_rule) +-- @usage html:embed(lex, php_start_rule, php_end_rule) -- from php lexer +-- @name embed +function M.embed(lexer, child, start_rule, end_rule) + if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent + -- Add child rules. + if not child._EMBEDDEDRULES then child._EMBEDDEDRULES = {} end + if not child._RULES then error('Cannot embed lexer with no rules') end + child._EMBEDDEDRULES[lexer._NAME] = { + start_rule = start_rule, token_rule = child:join_tokens(), end_rule = end_rule + } + if not lexer._CHILDREN then lexer._CHILDREN = {} end + local children = lexer._CHILDREN + children[#children + 1] = child + -- Add child styles. + for token, style in pairs(child._EXTRASTYLES) do lexer:add_style(token, style) end + -- Add child fold symbols. 
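+ -- These are copied up to the parent so that folding keeps working inside embedded regions.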
+ if child._FOLDPOINTS then + for token_name, symbols in pairs(child._FOLDPOINTS) do + if token_name ~= '_SYMBOLS' then + for symbol, v in pairs(symbols) do lexer:add_fold_point(token_name, symbol, v) end end end - for i = 1, #lexer._rules do - add_rule(lexer, lexer._rules[i][1], lexer._rules[i][2]) - end - build_grammar(lexer) - end - -- Add the lexer's unique whitespace style. - add_style(lexer, lexer._NAME..'_whitespace', M.STYLE_WHITESPACE) - - -- Process the lexer's fold symbols. - if lexer._foldsymbols and lexer._foldsymbols._patterns then - local patterns = lexer._foldsymbols._patterns - for i = 1, #patterns do patterns[i] = '()('..patterns[i]..')' end end - - lexer.lex, lexer.fold = M.lex, M.fold - M.lexers[alt_name or name] = lexer - return lexer + lexer:build_grammar() + child._lexer = lexer -- use parent's tokens if child is embedding itself end --- --- Lexes a chunk of text *text* (that has an initial style number of --- *init_style*) with lexer *lexer*. --- If *lexer* has a `_LEXBYLINE` flag set, the text is lexed one line at a time. --- Otherwise the text is lexed as a whole. --- @param lexer The lexer object to lex with. +-- Lexes a chunk of text *text* (that has an initial style number of *init_style*) using lexer +-- *lexer*, returning a table of token names and positions. +-- @param lexer The lexer to lex text with. -- @param text The text in the buffer to lex. --- @param init_style The current style. Multiple-language lexers use this to --- determine which language to start lexing in. --- @param redrawtime_max Stop lexing after that many seconds and set the second return value (timedout) to true. --- @param init Start lexing from this offset in *text* (default is 1). +-- @param init_style The current style. Multiple-language lexers use this to determine which +-- language to start lexing in. -- @return table of token names and positions. --- @return whether the lexing timed out. -- @name lex -function M.lex(lexer, text, init_style, redrawtime_max, init) +function M.lex(lexer, text, init_style) if not lexer._GRAMMAR then return {M.DEFAULT, #text + 1} end if not lexer._LEXBYLINE then - -- For multilang lexers, build a new grammar whose initial_rule is the - -- current language. + -- For multilang lexers, build a new grammar whose initial_rule is the current language. if lexer._CHILDREN then for style, style_num in pairs(lexer._TOKENSTYLES) do if style_num == init_style then - local lexer_name = style:match('^(.+)_whitespace') or lexer._NAME - if lexer._INITIALRULE ~= lexer_name then - build_grammar(lexer, lexer_name) - end + local lexer_name = style:match('^(.+)_whitespace') or lexer._PARENTNAME or lexer._NAME + if lexer._INITIALRULE ~= lexer_name then lexer:build_grammar(lexer_name) end break end end end - local flag = {} - return lpeg_match(lexer._GRAMMAR, text, init, os.clock(), redrawtime_max, flag), flag.timedout + return lpeg_match(lexer._GRAMMAR, text) else - local tokens = {} local function append(tokens, line_tokens, offset) for i = 1, #line_tokens, 2 do tokens[#tokens + 1] = line_tokens[i] tokens[#tokens + 1] = line_tokens[i + 1] + offset end end + local tokens = {} local offset = 0 local grammar = lexer._GRAMMAR - local flag = {} for line in text:gmatch('[^\r\n]*\r?\n?') do - local line_tokens = lpeg_match(grammar, line, init, os.clock(), redrawtime_max, flag) + local line_tokens = lpeg_match(grammar, line) if line_tokens then append(tokens, line_tokens, offset) end offset = offset + #line -- Use the default style to the end of the line if none was specified. 
@@ -1169,75 +1215,90 @@ function M.lex(lexer, text, init_style, redrawtime_max, init) tokens[#tokens + 1], tokens[#tokens + 2] = 'default', offset + 1 end end - return tokens, flag.timedout + return tokens end end --- --- Determines fold points in a chunk of text *text* with lexer *lexer*. --- *text* starts at position *start_pos* on line number *start_line* with a --- beginning fold level of *start_level* in the buffer. If *lexer* has a `_fold` --- function or a `_foldsymbols` table, that field is used to perform folding. --- Otherwise, if *lexer* has a `_FOLDBYINDENTATION` field set, or if a --- `fold.by.indentation` property is set, folding by indentation is done. --- @param lexer The lexer object to fold with. +-- Determines fold points in a chunk of text *text* using lexer *lexer*, returning a table of +-- fold levels associated with line numbers. +-- *text* starts at position *start_pos* on line number *start_line* with a beginning fold +-- level of *start_level* in the buffer. +-- @param lexer The lexer to fold text with. -- @param text The text in the buffer to fold. --- @param start_pos The position in the buffer *text* starts at, starting at --- zero. --- @param start_line The line number *text* starts on. +-- @param start_pos The position in the buffer *text* starts at, counting from 1. +-- @param start_line The line number *text* starts on, counting from 1. -- @param start_level The fold level *text* starts on. --- @return table of fold levels. +-- @return table of fold levels associated with line numbers. -- @name fold function M.fold(lexer, text, start_pos, start_line, start_level) local folds = {} if text == '' then return folds end local fold = M.property_int['fold'] > 0 local FOLD_BASE = M.FOLD_BASE - local FOLD_HEADER, FOLD_BLANK = M.FOLD_HEADER, M.FOLD_BLANK - if fold and lexer._fold then - return lexer._fold(text, start_pos, start_line, start_level) - elseif fold and lexer._foldsymbols then + local FOLD_HEADER, FOLD_BLANK = M.FOLD_HEADER, M.FOLD_BLANK + if fold and lexer._FOLDPOINTS then local lines = {} - for p, l in (text..'\n'):gmatch('()(.-)\r?\n') do - lines[#lines + 1] = {p, l} - end + for p, l in (text .. 
'\n'):gmatch('()(.-)\r?\n') do lines[#lines + 1] = {p, l} end local fold_zero_sum_lines = M.property_int['fold.on.zero.sum.lines'] > 0 - local fold_symbols = lexer._foldsymbols - local fold_symbols_patterns = fold_symbols._patterns - local fold_symbols_case_insensitive = fold_symbols._case_insensitive + local fold_compact = M.property_int['fold.compact'] > 0 + local fold_points = lexer._FOLDPOINTS + local fold_point_symbols = fold_points._SYMBOLS local style_at, fold_level = M.style_at, M.fold_level local line_num, prev_level = start_line, start_level local current_level = prev_level - for i = 1, #lines do - local pos, line = lines[i][1], lines[i][2] + for _, captures in ipairs(lines) do + local pos, line = captures[1], captures[2] if line ~= '' then - if fold_symbols_case_insensitive then line = line:lower() end + if lexer._CASEINSENSITIVEFOLDPOINTS then line = line:lower() end + local ranges = {} + local function is_valid_range(s, e) + if not s or not e then return false end + for i = 1, #ranges - 1, 2 do + local range_s, range_e = ranges[i], ranges[i + 1] + if s >= range_s and s <= range_e or e >= range_s and e <= range_e then + return false + end + end + ranges[#ranges + 1] = s + ranges[#ranges + 1] = e + return true + end local level_decreased = false - for j = 1, #fold_symbols_patterns do - for s, match in line:gmatch(fold_symbols_patterns[j]) do - local symbols = fold_symbols[style_at[start_pos + pos + s - 1]] - local l = symbols and symbols[match] - if type(l) == 'function' then l = l(text, pos, line, s, match) end - if type(l) == 'number' then - current_level = current_level + l - if l < 0 and current_level < prev_level then - -- Potential zero-sum line. If the level were to go back up on - -- the same line, the line may be marked as a fold header. - level_decreased = true + for _, symbol in ipairs(fold_point_symbols) do + local word = not symbol:find('[^%w_]') + local s, e = line:find(symbol, 1, true) + while is_valid_range(s, e) do + -- if not word or line:find('^%f[%w_]' .. symbol .. '%f[^%w_]', s) then + local word_before = s > 1 and line:find('^[%w_]', s - 1) + local word_after = line:find('^[%w_]', e + 1) + if not word or not (word_before or word_after) then + local symbols = fold_points[style_at[start_pos + pos - 1 + s - 1]] + local level = symbols and symbols[symbol] + if type(level) == 'function' then + level = level(text, pos, line, s, symbol) + end + if type(level) == 'number' then + current_level = current_level + level + if level < 0 and current_level < prev_level then + -- Potential zero-sum line. If the level were to go back up on the same line, + -- the line may be marked as a fold header. + level_decreased = true + end end end + s, e = line:find(symbol, s + 1, true) end end folds[line_num] = prev_level if current_level > prev_level then folds[line_num] = prev_level + FOLD_HEADER - elseif level_decreased and current_level == prev_level and - fold_zero_sum_lines then + elseif level_decreased and current_level == prev_level and fold_zero_sum_lines then if line_num > start_line then folds[line_num] = prev_level - 1 + FOLD_HEADER else -- Typing within a zero-sum line. 
- local level = fold_level[line_num - 1] - 1
+ local level = fold_level[line_num] - 1
if level > FOLD_HEADER then level = level - FOLD_HEADER end
if level > FOLD_BLANK then level = level - FOLD_BLANK end
folds[line_num] = level + FOLD_HEADER
@@ -1247,33 +1308,29 @@ function M.fold(lexer, text, start_pos, start_line, start_level)
if current_level < FOLD_BASE then current_level = FOLD_BASE end
prev_level = current_level
else
- folds[line_num] = prev_level + FOLD_BLANK
+ folds[line_num] = prev_level + (fold_compact and FOLD_BLANK or 0)
end
line_num = line_num + 1
end
- elseif fold and (lexer._FOLDBYINDENTATION or
- M.property_int['fold.by.indentation'] > 0) then
+ elseif fold and (lexer._FOLDBYINDENTATION or M.property_int['fold.by.indentation'] > 0) then
-- Indentation based folding.
-- Calculate indentation per line.
local indentation = {}
- for indent, line in (text..'\n'):gmatch('([\t ]*)([^\r\n]*)\r?\n') do
+ for indent, line in (text .. '\n'):gmatch('([\t ]*)([^\r\n]*)\r?\n') do
indentation[#indentation + 1] = line ~= '' and #indent
end
- -- Find the first non-blank line before start_line. If the current line is
- -- indented, make that previous line a header and update the levels of any
- -- blank lines inbetween. If the current line is blank, match the level of
- -- the previous non-blank line.
+ -- Find the first non-blank line before start_line. If the current line is indented, make
+ -- that previous line a header and update the levels of any blank lines in between. If the
+ -- current line is blank, match the level of the previous non-blank line.
local current_level = start_level
- for i = start_line - 1, 0, -1 do
+ for i = start_line, 1, -1 do
local level = M.fold_level[i]
if level >= FOLD_HEADER then level = level - FOLD_HEADER end
if level < FOLD_BLANK then
local indent = M.indent_amount[i]
if indentation[1] and indentation[1] > indent then
folds[i] = FOLD_BASE + indent + FOLD_HEADER
- for j = i + 1, start_line - 1 do
- folds[j] = start_level + FOLD_BLANK
- end
+ for j = i + 1, start_line - 1 do folds[j] = start_level + FOLD_BLANK end
elseif not indentation[1] then
current_level = FOLD_BASE + indent
end
@@ -1309,91 +1366,295 @@ function M.fold(lexer, text, start_pos, start_line, start_level)
 return folds
end

+---
+-- Creates and returns a new lexer with the given name.
+-- @param name The lexer's name.
+-- @param opts Table of lexer options. Options currently supported:
+-- * `lex_by_line`: Whether or not the lexer only processes whole lines of text (instead of
+-- arbitrary chunks of text) at a time. Line lexers cannot look ahead to subsequent lines.
+-- The default value is `false`.
+-- * `fold_by_indentation`: Whether or not the lexer defines no fold points of its own and fold
+-- points should instead be calculated based on changes in line indentation. The default value
+-- is `false`.
+-- * `case_insensitive_fold_points`: Whether or not fold points added via
+-- `lexer.add_fold_point()` ignore case. The default value is `false`.
+-- * `inherit`: Lexer to inherit from. The default value is `nil`.
+-- @usage lexer.new('rhtml', {inherit = lexer.load('html')})
+-- @name new
+function M.new(name, opts)
+ local lexer = {
+ _NAME = assert(name, 'lexer name expected'), _LEXBYLINE = opts and opts['lex_by_line'],
+ _FOLDBYINDENTATION = opts and opts['fold_by_indentation'],
+ _CASEINSENSITIVEFOLDPOINTS = opts and opts['case_insensitive_fold_points'],
+ _lexer = opts and opts['inherit']
+ }
+
+ -- Create the initial maps for token names to style numbers and styles.
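+ -- Default token styles occupy numbers 1-18 and predefined styles occupy numbers 33-40;
+ -- add_style() hands out the numbers in between and beyond (up to the 256 style maximum).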
+ local token_styles = {}
+ for i = 1, #default do token_styles[default[i]] = i end
+ for i = 1, #predefined do token_styles[predefined[i]] = i + 32 end
+ lexer._TOKENSTYLES, lexer._numstyles = token_styles, #default + 1
+ lexer._EXTRASTYLES = {}
+
+ return setmetatable(lexer, {
+ __index = {
+ add_rule = M.add_rule, modify_rule = M.modify_rule, get_rule = M.get_rule,
+ add_style = M.add_style, add_fold_point = M.add_fold_point, join_tokens = join_tokens,
+ build_grammar = build_grammar, embed = M.embed, lex = M.lex, fold = M.fold
+ }
+ })
+end
+
+-- Legacy support for older lexers.
+-- Processes the `lex._rules`, `lex._tokenstyles`, and `lex._foldsymbols` tables. Since legacy
+-- lexers may be processed up to twice, ensure their default styles and rules are not processed
+-- more than once.
+local function process_legacy_lexer(lexer)
+ local function warn(msg) --[[io.stderr:write(msg, "\n")]]end
+ if not lexer._LEGACY then
+ lexer._LEGACY = true
+ warn("lexers as tables are deprecated; use 'lexer.new()'")
+ local token_styles = {}
+ for i = 1, #default do token_styles[default[i]] = i end
+ for i = 1, #predefined do token_styles[predefined[i]] = i + 32 end
+ lexer._TOKENSTYLES, lexer._numstyles = token_styles, #default + 1
+ lexer._EXTRASTYLES = {}
+ setmetatable(lexer, getmetatable(M.new('')))
+ if lexer._rules then
+ warn("lexer '_rules' table is deprecated; use 'add_rule()'")
+ for _, rule in ipairs(lexer._rules) do lexer:add_rule(rule[1], rule[2]) end
+ end
+ end
+ if lexer._tokenstyles then
+ warn("lexer '_tokenstyles' table is deprecated; use 'add_style()'")
+ for token, style in pairs(lexer._tokenstyles) do
+ -- If this legacy lexer is being processed a second time, only add styles added since
+ -- the first processing.
+ if not lexer._TOKENSTYLES[token] then lexer:add_style(token, style) end
+ end
+ end
+ if lexer._foldsymbols then
+ warn("lexer '_foldsymbols' table is deprecated; use 'add_fold_point()'")
+ for token_name, symbols in pairs(lexer._foldsymbols) do
+ if type(symbols) == 'table' and token_name ~= '_patterns' then
+ for symbol, v in pairs(symbols) do lexer:add_fold_point(token_name, symbol, v) end
+ end
+ end
+ if lexer._foldsymbols._case_insensitive then lexer._CASEINSENSITIVEFOLDPOINTS = true end
+ elseif lexer._fold then
+ lexer.fold = function(self, ...) return lexer._fold(...) end
+ end
+end
+
+local lexers = {} -- cache of loaded lexers
+---
+-- Initializes or loads and returns the lexer of string name *name*.
+-- Scintilla calls this function in order to load a lexer. Parent lexers also call this function
+-- in order to load child lexers and vice-versa. The user calls this function in order to load
+-- a lexer when using Scintillua as a Lua library.
+-- @param name The name of the lexing language.
+-- @param alt_name The alternate name of the lexing language. This is useful for embedding the
+-- same child lexer with multiple sets of start and end tokens.
+-- @param cache Flag indicating whether or not to load lexers from the cache. This should only
+-- be `true` when initially loading a lexer (e.g. not from within another lexer for embedding
+-- purposes). The default value is `false`.
+-- @return lexer object
+-- @name load
+function M.load(name, alt_name, cache)
+ if cache and lexers[alt_name or name] then return lexers[alt_name or name] end
+
+ -- When using Scintillua as a stand-alone module, the `property`, `property_int`, and
+ -- `property_expanded` tables do not exist (they are not useful). Create them in order to
+ -- prevent errors from occurring.
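+ -- (When Scintillua runs embedded in Scintilla via LexLPeg, the host provides these tables.)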
+ if not M.property then + M.property = setmetatable({['lexer.lpeg.home'] = package.path:gsub('/%?%.lua', '')}, { + __index = function() return '' end, + __newindex = function(t, k, v) rawset(t, k, tostring(v)) end + }) + M.property_int = setmetatable({}, { + __index = function(t, k) return tonumber(M.property[k]) or 0 end, + __newindex = function() error('read-only property') end + }) + M.property_expanded = setmetatable({}, { + __index = function(t, key) + return M.property[key]:gsub('[$%%](%b())', function(key) return t[key:sub(2, -2)] end) + end, __newindex = function() error('read-only property') end + }) + end + + -- Load the language lexer with its rules, styles, etc. + -- However, replace the default `WHITESPACE` style name with a unique whitespace style name + -- (and then automatically add it afterwards), since embedded lexing relies on these unique + -- whitespace style names. Note that loading embedded lexers changes `WHITESPACE` again, + -- so when adding it later, do not reference the potentially incorrect value. + M.WHITESPACE = (alt_name or name) .. '_whitespace' + local path = M.property['lexer.lpeg.home']:gsub(';', '/?.lua;') .. '/?.lua' + local lexer = dofile(assert(searchpath('lexers/'..name, path))) + assert(lexer, string.format("'%s.lua' did not return a lexer", name)) + if alt_name then lexer._NAME = alt_name end + if not getmetatable(lexer) or lexer._LEGACY then + -- A legacy lexer may need to be processed a second time in order to pick up any `_tokenstyles` + -- or `_foldsymbols` added after `lexer.embed_lexer()`. + process_legacy_lexer(lexer) + if lexer._lexer and lexer._lexer._LEGACY then + process_legacy_lexer(lexer._lexer) -- mainly for `_foldsymbols` edits + end + end + lexer:add_style((alt_name or name) .. '_whitespace', M.styles.whitespace) + + -- If the lexer is a proxy or a child that embedded itself, set the parent to be the main + -- lexer. Keep a reference to the old parent name since embedded child rules reference and + -- use that name. + if lexer._lexer then + lexer = lexer._lexer + lexer._PARENTNAME, lexer._NAME = lexer._NAME, alt_name or name + end + + if cache then lexers[alt_name or name] = lexer end + return lexer +end + -- The following are utility functions lexers will have access to. -- Common patterns. M.any = lpeg_P(1) -M.ascii = lpeg_R('\000\127') -M.extend = lpeg_R('\000\255') M.alpha = lpeg_R('AZ', 'az') M.digit = lpeg_R('09') M.alnum = lpeg_R('AZ', 'az', '09') M.lower = lpeg_R('az') M.upper = lpeg_R('AZ') M.xdigit = lpeg_R('09', 'AF', 'af') -M.cntrl = lpeg_R('\000\031') M.graph = lpeg_R('!~') -M.print = lpeg_R(' ~') M.punct = lpeg_R('!/', ':@', '[\'', '{~') M.space = lpeg_S('\t\v\f\n\r ') -M.newline = lpeg_S('\r\n\f')^1 +M.newline = lpeg_P('\r')^-1 * '\n' M.nonnewline = 1 - M.newline -M.nonnewline_esc = 1 - (M.newline + '\\') + '\\' * M.any M.dec_num = M.digit^1 M.hex_num = '0' * lpeg_S('xX') * M.xdigit^1 M.oct_num = '0' * lpeg_R('07')^1 M.integer = lpeg_S('+-')^-1 * (M.hex_num + M.oct_num + M.dec_num) M.float = lpeg_S('+-')^-1 * - ((M.digit^0 * '.' * M.digit^1 + M.digit^1 * '.' * M.digit^0) * - (lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1)^-1 + - (M.digit^1 * lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1)) + ((M.digit^0 * '.' * M.digit^1 + M.digit^1 * '.' * M.digit^0 * -lpeg_P('.')) * + (lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1)^-1 + + (M.digit^1 * lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1)) +M.number = M.float + M.integer M.word = (M.alpha + '_') * (M.alnum + '_')^0 +-- Deprecated. 
+M.nonnewline_esc = 1 - (M.newline + '\\') + '\\' * M.any
+M.ascii = lpeg_R('\000\127')
+M.extend = lpeg_R('\000\255')
+M.cntrl = lpeg_R('\000\031')
+M.print = lpeg_R(' ~')
+
 ---
--- Creates and returns a token pattern with token name *name* and pattern
--- *patt*.
--- If *name* is not a predefined token name, its style must be defined in the
--- lexer's `_tokenstyles` table.
--- @param name The name of token. If this name is not a predefined token name,
--- then a style needs to be assiciated with it in the lexer's `_tokenstyles`
--- table.
+-- Creates and returns a token pattern with token name *name* and pattern *patt*.
+-- If *name* is not a predefined token name, its style must be defined via `lexer.add_style()`.
+-- @param name The name of the token. If this name is not a predefined token name, then a style
+-- needs to be associated with it via `lexer.add_style()`.
 -- @param patt The LPeg pattern associated with the token.
 -- @return pattern
--- @usage local ws = token(l.WHITESPACE, l.space^1)
--- @usage local annotation = token('annotation', '@' * l.word)
+-- @usage local ws = token(lexer.WHITESPACE, lexer.space^1)
+-- @usage local annotation = token('annotation', '@' * lexer.word)
 -- @name token
 function M.token(name, patt) return lpeg_Cc(name) * patt * lpeg_Cp() end
 
 ---
--- Creates and returns a pattern that matches a range of text bounded by
--- *chars* characters.
--- This is a convenience function for matching more complicated delimited ranges
--- like strings with escape characters and balanced parentheses. *single_line*
--- indicates whether or not the range must be on a single line, *no_escape*
--- indicates whether or not to ignore '\' as an escape character, and *balanced*
--- indicates whether or not to handle balanced ranges like parentheses and
--- requires *chars* to be composed of two characters.
+-- Creates and returns a pattern that matches from string or pattern *prefix* until the end of
+-- the line.
+-- *escape* indicates whether the end of the line can be escaped with a '\' character.
+-- @param prefix String or pattern prefix to start matching at.
+-- @param escape Optional flag indicating whether or not newlines can be escaped by a '\'
+-- character. The default value is `false`.
+-- @return pattern
+-- @usage local line_comment = lexer.to_eol('//')
+-- @usage local line_comment = lexer.to_eol(S('#;'))
+-- @name to_eol
+function M.to_eol(prefix, escape)
+ return prefix * (not escape and M.nonnewline or M.nonnewline_esc)^0
+end
+
+---
+-- Creates and returns a pattern that matches a range of text bounded by strings or patterns *s*
+-- and *e*.
+-- This is a convenience function for matching more complicated ranges like strings with escape
+-- characters, balanced parentheses, and block comments (nested or not). *e* is optional and
+-- defaults to *s*. *single_line* indicates whether or not the range must be on a single line;
+-- *escapes* indicates whether or not to allow '\' as an escape character; and *balanced*
+-- indicates whether or not to handle balanced ranges like parentheses, and requires *s* and *e*
+-- to be different.
+-- @param s String or pattern start of a range.
+-- @param e Optional string or pattern end of a range. The default value is *s*.
+-- @param single_line Optional flag indicating whether or not the range must be on a single
+-- line. The default value is `false`.
+-- @param escapes Optional flag indicating whether or not the range end may be escaped by a '\'
+-- character.
The default value is `false` unless *s* and *e* are identical, single-character +-- strings. In that case, the default value is `true`. +-- @param balanced Optional flag indicating whether or not to match a balanced range, like the +-- "%b" Lua pattern. This flag only applies if *s* and *e* are different. +-- @return pattern +-- @usage local dq_str_escapes = lexer.range('"') +-- @usage local dq_str_noescapes = lexer.range('"', false, false) +-- @usage local unbalanced_parens = lexer.range('(', ')') +-- @usage local balanced_parens = lexer.range('(', ')', false, false, true) +-- @name range +function M.range(s, e, single_line, escapes, balanced) + if type(e) ~= 'string' and type(e) ~= 'userdata' then + e, single_line, escapes, balanced = s, e, single_line, escapes + end + local any = M.any - e + if single_line then any = any - '\n' end + if balanced then any = any - s end + if escapes == nil then + -- Only allow escapes by default for ranges with identical, single-character string delimiters. + escapes = type(s) == 'string' and #s == 1 and s == e + end + if escapes then any = any - '\\' + '\\' * M.any end + if balanced and s ~= e then + return lpeg_P{s * (any + lpeg_V(1))^0 * lpeg_P(e)^-1} + else + return s * any^0 * lpeg_P(e)^-1 + end +end + +-- Deprecated function. Use `lexer.range()` instead. +-- Creates and returns a pattern that matches a range of text bounded by *chars* characters. +-- This is a convenience function for matching more complicated delimited ranges like strings +-- with escape characters and balanced parentheses. *single_line* indicates whether or not the +-- range must be on a single line, *no_escape* indicates whether or not to ignore '\' as an +-- escape character, and *balanced* indicates whether or not to handle balanced ranges like +-- parentheses and requires *chars* to be composed of two characters. -- @param chars The character(s) that bound the matched range. --- @param single_line Optional flag indicating whether or not the range must be --- on a single line. --- @param no_escape Optional flag indicating whether or not the range end --- character may be escaped by a '\\' character. --- @param balanced Optional flag indicating whether or not to match a balanced --- range, like the "%b" Lua pattern. This flag only applies if *chars* --- consists of two different characters (e.g. "()"). +-- @param single_line Optional flag indicating whether or not the range must be on a single line. +-- @param no_escape Optional flag indicating whether or not the range end character may be +-- escaped by a '\\' character. +-- @param balanced Optional flag indicating whether or not to match a balanced range, like the +-- "%b" Lua pattern. This flag only applies if *chars* consists of two different characters +-- (e.g. "()"). 
-- @return pattern --- @usage local dq_str_escapes = l.delimited_range('"') --- @usage local dq_str_noescapes = l.delimited_range('"', false, true) --- @usage local unbalanced_parens = l.delimited_range('()') --- @usage local balanced_parens = l.delimited_range('()', false, false, true) --- @see nested_pair +-- @usage local dq_str_escapes = lexer.delimited_range('"') +-- @usage local dq_str_noescapes = lexer.delimited_range('"', false, true) +-- @usage local unbalanced_parens = lexer.delimited_range('()') +-- @usage local balanced_parens = lexer.delimited_range('()', false, false, true) +-- @see range -- @name delimited_range function M.delimited_range(chars, single_line, no_escape, balanced) + print("lexer.delimited_range() is deprecated, use lexer.range()") local s = chars:sub(1, 1) local e = #chars == 2 and chars:sub(2, 2) or s local range local b = balanced and s or '' local n = single_line and '\n' or '' if no_escape then - local invalid = lpeg_S(e..n..b) + local invalid = lpeg_S(e .. n .. b) range = M.any - invalid else - local invalid = lpeg_S(e..n..b) + '\\' + local invalid = lpeg_S(e .. n .. b) + '\\' range = M.any - invalid + '\\' * M.any end if balanced and s ~= e then @@ -1404,12 +1665,10 @@ function M.delimited_range(chars, single_line, no_escape, balanced) end --- --- Creates and returns a pattern that matches pattern *patt* only at the --- beginning of a line. +-- Creates and returns a pattern that matches pattern *patt* only at the beginning of a line. -- @param patt The LPeg pattern to match on the beginning of a line. -- @return pattern --- @usage local preproc = token(l.PREPROCESSOR, l.starts_line('#') * --- l.nonnewline^0) +-- @usage local preproc = token(lexer.PREPROCESSOR, lexer.starts_line(lexer.to_eol('#'))) -- @name starts_line function M.starts_line(patt) return lpeg_Cmt(lpeg_C(patt), function(input, index, match, ...) @@ -1421,15 +1680,14 @@ function M.starts_line(patt) end --- --- Creates and returns a pattern that verifies that string set *s* contains the --- first non-whitespace character behind the current match position. +-- Creates and returns a pattern that verifies the first non-whitespace character behind the +-- current match position is in string set *s*. -- @param s String character set like one passed to `lpeg.S()`. -- @return pattern --- @usage local regex = l.last_char_includes('+-*!%^&|=,([{') * --- l.delimited_range('/') +-- @usage local regex = lexer.last_char_includes('+-*!%^&|=,([{') * lexer.range('/') -- @name last_char_includes function M.last_char_includes(s) - s = '['..s:gsub('[-%%%[]', '%%%1')..']' + s = string.format('[%s]', s:gsub('[-%%%[]', '%%%1')) return lpeg_P(function(input, index) if index == 1 then return index end local i = index @@ -1438,109 +1696,77 @@ function M.last_char_includes(s) end) end ---- --- Returns a pattern that matches a balanced range of text that starts with --- string *start_chars* and ends with string *end_chars*. --- With single-character delimiters, this function is identical to --- `delimited_range(start_chars..end_chars, false, true, true)`. +-- Deprecated function. Use `lexer.range()` instead. +-- Returns a pattern that matches a balanced range of text that starts with string *start_chars* +-- and ends with string *end_chars*. +-- With single-character delimiters, this function is identical to `delimited_range(start_chars .. +-- end_chars, false, true, true)`. -- @param start_chars The string starting a nested sequence. -- @param end_chars The string ending a nested sequence. 
-- @return pattern --- @usage local nested_comment = l.nested_pair('/*', '*/') --- @see delimited_range +-- @usage local nested_comment = lexer.nested_pair('/*', '*/') +-- @see range -- @name nested_pair function M.nested_pair(start_chars, end_chars) + print("lexer.nested_pair() is deprecated, use lexer.range()") local s, e = start_chars, lpeg_P(end_chars)^-1 return lpeg_P{s * (M.any - s - end_chars + lpeg_V(1))^0 * e} end --- --- Creates and returns a pattern that matches any single word in list *words*. --- Words consist of alphanumeric and underscore characters, as well as the --- characters in string set *word_chars*. *case_insensitive* indicates whether --- or not to ignore case when matching words. --- This is a convenience function for simplifying a set of ordered choice word --- patterns. --- @param words A table of words. --- @param word_chars Optional string of additional characters considered to be --- part of a word. By default, word characters are alphanumerics and --- underscores ("%w_" in Lua). This parameter may be `nil` or the empty string --- in order to indicate no additional word characters. --- @param case_insensitive Optional boolean flag indicating whether or not the --- word match is case-insensitive. The default is `false`. +-- Creates and returns a pattern that matches any single word in list or string *words*. +-- *case_insensitive* indicates whether or not to ignore case when matching words. +-- This is a convenience function for simplifying a set of ordered choice word patterns. +-- @param word_list A list of words or a string list of words separated by spaces. +-- @param case_insensitive Optional boolean flag indicating whether or not the word match is +-- case-insensitive. The default value is `false`. +-- @param word_chars Unused legacy parameter. -- @return pattern --- @usage local keyword = token(l.KEYWORD, word_match{'foo', 'bar', 'baz'}) --- @usage local keyword = token(l.KEYWORD, word_match({'foo-bar', 'foo-baz', --- 'bar-foo', 'bar-baz', 'baz-foo', 'baz-bar'}, '-', true)) +-- @usage local keyword = token(lexer.KEYWORD, word_match{'foo', 'bar', 'baz'}) +-- @usage local keyword = token(lexer.KEYWORD, word_match({'foo-bar', 'foo-baz', 'bar-foo', +-- 'bar-baz', 'baz-foo', 'baz-bar'}, true)) +-- @usage local keyword = token(lexer.KEYWORD, word_match('foo bar baz')) -- @name word_match -function M.word_match(words, word_chars, case_insensitive) - local word_list = {} - for i = 1, #words do - word_list[case_insensitive and words[i]:lower() or words[i]] = true +function M.word_match(word_list, case_insensitive, word_chars) + if type(case_insensitive) == 'string' or type(word_chars) == 'boolean' then + -- Legacy `word_match(word_list, word_chars, case_insensitive)` form. + word_chars, case_insensitive = case_insensitive, word_chars + elseif type(word_list) == 'string' then + local words = word_list -- space-separated list of words + word_list = {} + for word in words:gsub('%-%-[^\n]+', ''):gmatch('%S+') do word_list[#word_list + 1] = word end + end + if not word_chars then word_chars = '' end + for _, word in ipairs(word_list) do + word_list[case_insensitive and word:lower() or word] = true + for char in word:gmatch('[^%w_%s]') do + if not word_chars:find(char, 1, true) then word_chars = word_chars .. 
char end + end end local chars = M.alnum + '_' - if word_chars then chars = chars + lpeg_S(word_chars) end + if word_chars ~= '' then chars = chars + lpeg_S(word_chars) end return lpeg_Cmt(chars^1, function(input, index, word) if case_insensitive then word = word:lower() end return word_list[word] and index or nil end) end ---- --- Embeds child lexer *child* in parent lexer *parent* using patterns --- *start_rule* and *end_rule*, which signal the beginning and end of the --- embedded lexer, respectively. +-- Deprecated legacy function. Use `parent:embed()` instead. +-- Embeds child lexer *child* in parent lexer *parent* using patterns *start_rule* and *end_rule*, +-- which signal the beginning and end of the embedded lexer, respectively. -- @param parent The parent lexer. -- @param child The child lexer. --- @param start_rule The pattern that signals the beginning of the embedded --- lexer. +-- @param start_rule The pattern that signals the beginning of the embedded lexer. -- @param end_rule The pattern that signals the end of the embedded lexer. --- @usage l.embed_lexer(M, css, css_start_rule, css_end_rule) --- @usage l.embed_lexer(html, M, php_start_rule, php_end_rule) --- @usage l.embed_lexer(html, ruby, ruby_start_rule, ruby_end_rule) +-- @usage lexer.embed_lexer(M, css, css_start_rule, css_end_rule) +-- @usage lexer.embed_lexer(html, M, php_start_rule, php_end_rule) +-- @usage lexer.embed_lexer(html, ruby, ruby_start_rule, ruby_end_rule) +-- @see embed -- @name embed_lexer function M.embed_lexer(parent, child, start_rule, end_rule) - -- Add child rules. - if not child._EMBEDDEDRULES then child._EMBEDDEDRULES = {} end - if not child._RULES then -- creating a child lexer to be embedded - if not child._rules then error('Cannot embed language with no rules') end - for i = 1, #child._rules do - add_rule(child, child._rules[i][1], child._rules[i][2]) - end - end - child._EMBEDDEDRULES[parent._NAME] = { - ['start_rule'] = start_rule, - token_rule = join_tokens(child), - ['end_rule'] = end_rule - } - if not parent._CHILDREN then parent._CHILDREN = {} end - local children = parent._CHILDREN - children[#children + 1] = child - -- Add child styles. - if not parent._tokenstyles then parent._tokenstyles = {} end - local tokenstyles = parent._tokenstyles - tokenstyles[child._NAME..'_whitespace'] = M.STYLE_WHITESPACE - if child._tokenstyles then - for token, style in pairs(child._tokenstyles) do - tokenstyles[token] = style - end - end - -- Add child fold symbols. - if not parent._foldsymbols then parent._foldsymbols = {} end - if child._foldsymbols then - for token, symbols in pairs(child._foldsymbols) do - if not parent._foldsymbols[token] then parent._foldsymbols[token] = {} end - for k, v in pairs(symbols) do - if type(k) == 'number' then - parent._foldsymbols[token][#parent._foldsymbols[token] + 1] = v - elseif not parent._foldsymbols[token][k] then - parent._foldsymbols[token][k] = v - end - end - end - end - child._lexer = parent -- use parent's tokens if child is embedding itself - parent_lexer = parent -- use parent's tokens if the calling lexer is a proxy + if not getmetatable(parent) then process_legacy_lexer(parent) end + if not getmetatable(child) then process_legacy_lexer(child) end + parent:embed(child, start_rule, end_rule) end -- Determines if the previous line is a comment. 
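
The deprecated `delimited_range()`, `nested_pair()`, legacy `word_match()`, and `embed_lexer()` entry points above all forward to the newer `range()`, `to_eol()`, `word_match()`, and `lexer:embed()` API. For orientation, here is a minimal sketch of a lexer written directly against that newer API; the 'demo' language, its keyword list, and its comment and string syntax are invented for illustration and are not part of this module:

-- A minimal sketch of the modern API that the deprecated helpers forward to.
-- The 'demo' language and all of its rules are hypothetical.
local lexer = require('lexer')
local token, word_match = lexer.token, lexer.word_match

local lex = lexer.new('demo')

-- Whitespace first; embedded lexing relies on per-lexer whitespace tokens.
lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))

-- word_match() now accepts a space-separated string; `true` means case-insensitive.
lex:add_rule('keyword', token(lexer.KEYWORD, word_match('if else while', true)))

-- to_eol() and range() replace nonnewline^0 loops, delimited_range(), and nested_pair().
lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('//') + lexer.range('/*', '*/')))
lex:add_rule('string', token(lexer.STRING, lexer.range('"')))
lex:add_rule('number', token(lexer.NUMBER, lexer.number))

-- Fold points, including consecutive line comments via the helper defined just below.
lex:add_fold_point(lexer.COMMENT, '/*', '*/')
lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('//'))

return lex
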
@@ -1584,16 +1810,17 @@ local function next_line_is_comment(prefix, text, pos, line, s) end --- --- Returns a fold function (to be used within the lexer's `_foldsymbols` table) --- that folds consecutive line comments that start with string *prefix*. --- @param prefix The prefix string defining a line comment. --- @usage [l.COMMENT] = {['--'] = l.fold_line_comments('--')} --- @usage [l.COMMENT] = {['//'] = l.fold_line_comments('//')} --- @name fold_line_comments -function M.fold_line_comments(prefix) +-- Returns for `lexer.add_fold_point()` the parameters needed to fold consecutive lines that +-- start with string *prefix*. +-- @param prefix The prefix string (e.g. a line comment). +-- @usage lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('--')) +-- @usage lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('//')) +-- @usage lex:add_fold_point(lexer.KEYWORD, lexer.fold_consecutive_lines('import')) +-- @name fold_consecutive_lines +function M.fold_consecutive_lines(prefix) local property_int = M.property_int - return function(text, pos, line, s) - if property_int['fold.line.comments'] == 0 then return 0 end + return prefix, function(text, pos, line, s) + if property_int['fold.line.groups'] == 0 then return 0 end if s > 1 and line:match('^%s*()') < s then return 0 end local prev_line_comment = prev_line_is_comment(prefix, text, pos, line, s) local next_line_comment = next_line_is_comment(prefix, text, pos, line, s) @@ -1603,73 +1830,26 @@ function M.fold_line_comments(prefix) end end -M.property_expanded = setmetatable({}, { - -- Returns the string property value associated with string property *key*, - -- replacing any "$()" and "%()" expressions with the values of their keys. - __index = function(t, key) - return M.property[key]:gsub('[$%%]%b()', function(key) - return t[key:sub(3, -2)] - end) - end, - __newindex = function() error('read-only property') end -}) +-- Deprecated legacy function. Use `lexer.fold_consecutive_lines()` instead. +-- Returns a fold function (to be passed to `lexer.add_fold_point()`) that folds consecutive +-- line comments that start with string *prefix*. +-- @param prefix The prefix string defining a line comment. +-- @usage lex:add_fold_point(lexer.COMMENT, '--', lexer.fold_line_comments('--')) +-- @usage lex:add_fold_point(lexer.COMMENT, '//', lexer.fold_line_comments('//')) +-- @name fold_line_comments +function M.fold_line_comments(prefix) + print('lexer.fold_line_comments() is deprecated, use lexer.fold_consecutive_lines()') + return select(2, M.fold_consecutive_lines(prefix)) +end --[[ The functions and fields below were defined in C. --- --- Returns the line number of the line that contains position *pos*, which +-- Returns the line number (starting from 1) of the line that contains position *pos*, which -- starts from 1. -- @param pos The position to get the line number of. -- @return number local function line_from_position(pos) end - ---- --- Individual fields for a lexer instance. --- @field _NAME The string name of the lexer. --- @field _rules An ordered list of rules for a lexer grammar. --- Each rule is a table containing an arbitrary rule name and the LPeg pattern --- associated with the rule. The order of rules is important, as rules are --- matched sequentially. --- Child lexers should not use this table to access and/or modify their --- parent's rules and vice-versa. Use the `_RULES` table instead. --- @field _tokenstyles A map of non-predefined token names to styles. --- Remember to use token names, not rule names. 
It is recommended to use
--- predefined styles or color-agnostic styles derived from predefined styles
--- to ensure compatibility with user color themes.
--- @field _foldsymbols A table of recognized fold points for the lexer.
--- Keys are token names with table values defining fold points. Those table
--- values have string keys of keywords or characters that indicate a fold
--- point whose values are integers. A value of `1` indicates a beginning fold
--- point and a value of `-1` indicates an ending fold point. Values can also
--- be functions that return `1`, `-1`, or `0` (indicating no fold point) for
--- keys which need additional processing.
--- There is also a required `_patterns` key whose value is a table containing
--- Lua pattern strings that match all fold points (the string keys contained
--- in token name table values). When the lexer encounters text that matches
--- one of those patterns, the matched text is looked up in its token's table
--- to determine whether or not it is a fold point.
--- There is also an optional `_case_insensitive` option that indicates whether
--- or not fold point keys are case-insensitive. If `true`, fold point keys
--- should be in lower case.
--- @field _fold If this function exists in the lexer, it is called for folding
--- the document instead of using `_foldsymbols` or indentation.
--- @field _lexer The parent lexer object whose rules should be used. This field
--- is only necessary to disambiguate a proxy lexer that loaded parent and
--- child lexers for embedding and ended up having multiple parents loaded.
--- @field _RULES A map of rule name keys with their associated LPeg pattern
--- values for the lexer.
--- This is constructed from the lexer's `_rules` table and accessible to other
--- lexers for embedded lexer applications like modifying parent or child
--- rules.
--- @field _LEXBYLINE Indicates the lexer can only process one whole line of text
--- (instead of an arbitrary chunk of text) at a time.
--- The default value is `false`. Line lexers cannot look ahead to subsequent
--- lines.
--- @field _FOLDBYINDENTATION Declares the lexer does not define fold points and
--- that fold points should be calculated based on changes in indentation.
--- @class table
--- @name lexer
-local lexer
 ]]
 
 return M
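
Since `load()` above also serves as the entry point when Scintillua is used as a stand-alone Lua library, a short usage sketch may help. It assumes `require('lexer')` can find this module, that a 'lua' lexer is installed under `lexer.lpeg.home`, and that `lex()` returns a flat list of alternating token names and positions (an assumption based on how `M.lex` is exposed in the metatable above, not shown in this hunk):

-- A minimal stand-alone sketch; module and lexer paths are assumed.
local lexer = require('lexer')
local lua = lexer.load('lua')
local tokens = lua:lex('local x = 1 -- comment')
for i = 1, #tokens, 2 do
  print(tokens[i], tokens[i + 1]) -- token name, position just past the token
end
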
