[QScintilla] Search Whole Word matches Only on regular expression symbol

Baz Walter bazwal at ftml.net
Sun Oct 13 04:22:02 BST 2013


On 12/10/13 23:29, Baz Walter wrote:
> On 12/10/13 11:52, Phil Thompson wrote:
>> So I need to call SCI_SETWORDCHARS when a lexer is set using the value
>> returned by the lexer's wordCharacters() method.
>>
>> Is this likely to cause any unforeseen problems?
>
> As usual with Scintilla, the main source of potential problems is
> single-byte vs multi-byte encodings. For Latin-1, any byte in the range
> 0-255 can be set as a word character. But for UTF-8, only the ASCII
> range is relevant: all Unicode characters above 127 are always treated
> as word characters, regardless of what has been set using SCI_SETWORDCHARS.
>
> However, Scintilla's default set of word characters (i.e. those set via
> SCI_SETCHARSDEFAULT) includes the standard alphanumerics and underscore,
> *plus* all the characters in the range 128-255 (regardless of the
> code-page setting).
>
> So, assuming the current lexer wordCharacters functions only ever return
> ASCII, there is some potential for changes in behaviour if QScintilla is
> being used in *Latin-1* mode (UTF-8 mode should be unaffected).
>
> The only other potential issue I can think of at the moment is that
> setting the word characters automatically resets the whitespace and
> punctuation characters to their default values.
>
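To make the Latin-1 concern above concrete, here is a minimal Python
sketch. It is only a model of the behaviour as described in the quoted
text, not Scintilla's actual code: the names DEFAULT_WORD_BYTES,
LEXER_WORD_BYTES and split_words are made up for illustration, and the
lexer set assumes a wordCharacters() that returns plain ASCII
alphanumerics plus underscore.

```python
import string

ASCII_WORD = (string.ascii_letters + string.digits + "_").encode("ascii")

# Model of the default word characters as described above:
# ASCII alphanumerics, underscore, and every byte in 128-255.
DEFAULT_WORD_BYTES = set(ASCII_WORD) | set(range(128, 256))

# Assumed result of a lexer's wordCharacters(): ASCII only.
LEXER_WORD_BYTES = set(ASCII_WORD)


def split_words(data: bytes, word_bytes: set) -> list:
    """Split data into runs of word bytes, one byte at a time."""
    words, current = [], bytearray()
    for b in data:
        if b in word_bytes:
            current.append(b)
        elif current:
            words.append(bytes(current))
            current = bytearray()
    if current:
        words.append(bytes(current))
    return words


# "cafe" with an acute accent on the e, encoded as Latin-1 (0xE9).
text = "caf\u00e9 au lait".encode("latin-1")

default_words = split_words(text, DEFAULT_WORD_BYTES)
lexer_words = split_words(text, LEXER_WORD_BYTES)
```

With the default set the accented byte stays inside the first word;
with the ASCII-only set the word ends just before it, which is the kind
of behaviour change that could show up in Latin-1 mode.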

One area that I didn't consider was auto-completion. I concocted my own 
implementation of this a long time ago, and so I haven't used 
QScintilla's version of it much.

After having a look at the source, I'm wondering whether things may be 
more complicated than I thought.

It seems the lexer wordCharacters method *must* return ASCII, because 
auto-completion only ever looks at *single bytes*. Things could break in 
UTF-8 mode if wordCharacters included some arbitrary non-ASCII bytes and a 
multi-byte character was encountered. For example, if the lead byte of 
a multi-byte sequence were included in the word characters, but not its 
continuation bytes, it might result in an attempt to insert text at an 
invalid position.
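The lead-byte case can be sketched in Python as follows. This is a
simplified model of a byte-wise scan, not QScintilla's auto-completion
code; word_end and is_char_boundary are hypothetical helpers, and 0xC3
is just one example of a UTF-8 lead byte that might end up in the word
characters.

```python
import string

ASCII_WORD = set((string.ascii_letters + string.digits + "_").encode("ascii"))


def word_end(data: bytes, pos: int, word_bytes: set) -> int:
    """Scan forwards one byte at a time to the end of the word."""
    while pos < len(data) and data[pos] in word_bytes:
        pos += 1
    return pos


def is_char_boundary(data: bytes, pos: int) -> bool:
    """True unless pos points at a UTF-8 continuation byte (10xxxxxx)."""
    return pos == 0 or pos >= len(data) or (data[pos] & 0xC0) != 0x80


# "cafe" with an acute accent on the e, encoded as UTF-8: the last
# character is the two-byte sequence 0xC3 0xA9.
data = "caf\u00e9".encode("utf-8")

# Word characters that include the lead byte but not the continuation byte.
broken_set = ASCII_WORD | {0xC3}

end_ascii = word_end(data, 0, ASCII_WORD)
end_broken = word_end(data, 0, broken_set)

ok_ascii = is_char_boundary(data, end_ascii)
ok_broken = is_char_boundary(data, end_broken)
```

With the ASCII-only set the scan stops in front of the two-byte
sequence, which is a valid character boundary. With the lead byte added
it stops between the two bytes, so any text inserted at that offset
would split the character.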

On top of that, auto-completion also uses Scintilla's search APIs to 
find the start of words (which in turn depends on Scintilla's definition 
of word characters). What happens if the lexer's definition of word 
characters conflicts with Scintilla's? Possibly there are some 
edge cases where this might matter, but I confess I'm not sure.
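
As a purely hypothetical illustration of how two definitions could
disagree, the Python sketch below computes the text before the caret
that would count as the current word under each set. Nothing here is
taken from QScintilla: prefix_before is a made-up helper, and the
"lexer" set simply assumes a lexer that treats "$" as a word character
while Scintilla's set does not.

```python
import string

ASCII_WORD = set((string.ascii_letters + string.digits + "_").encode("ascii"))


def prefix_before(data: bytes, caret: int, word_bytes: set) -> bytes:
    """Return the run of word bytes that ends at the caret."""
    start = caret
    while start > 0 and data[start - 1] in word_bytes:
        start -= 1
    return data[start:caret]


line = b"echo $pa"
caret = len(line)

lexer_set = ASCII_WORD | set(b"$")  # assumed lexer definition
scintilla_set = ASCII_WORD          # assumed Scintilla definition

lexer_prefix = prefix_before(line, caret, lexer_set)
scintilla_prefix = prefix_before(line, caret, scintilla_set)
```

Here the two definitions give different word starts for the same caret
position, so code that mixes them could look up one prefix and replace
a different span. Whether that can actually happen in practice is the
open question.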

-- 
Regards
Baz Walter

