[QScintilla] Search Whole Word matches Only on regular expression symbol

Mon Oct 14 14:12:34 BST 2013

On Sun, 13 Oct 2013 04:22:02 +0100, Baz Walter <bazwal at ftml.net> wrote:
> On 12/10/13 23:29, Baz Walter wrote:
>> On 12/10/13 11:52, Phil Thompson wrote:
>>> So I need to call SCI_SETWORDCHARS when a lexer is set using the value
>>> returned by the lexer's wordCharacters() method.
>>>
>>> Is this likely to cause any unforeseen problems?
>>
>> As usual with Scintilla, the main source of potential problems is
>> single-byte vs multi-byte encodings. For latin-1, any byte in the range
>> 0-255 can be set as a word character. But for utf-8, only the ascii
>> range is relevant - all unicode characters above 127 are always treated
>> as word characters, regardless of what has been set using
>> SCI_SETWORDCHARS.
>>
>> However, Scintilla's default set of word characters (i.e. those set via
>> SCI_SETCHARSDEFAULT) includes the standard alphanumerics and underscore,
>> *plus* all the characters in the range 128-255 (regardless of the
>> code-page setting).
>>
>> So, assuming the current lexer wordCharacters functions only ever return
>> ascii, there is some potential for changes in behaviour if QScintilla is
>> being used in *latin-1* mode (utf-8 mode should be unaffected).
>>
>> The only other potential issue I can think of at the moment is that
>> setting the word characters automatically resets the whitespace and
>> punctuation characters to their default values.
>>
> 
> One area that I didn't consider was auto-completion. I concocted my own 
> implementation of this a long time ago, and so I haven't used 
> QScintilla's version of it much.
> 
> After having a look at the source, I'm wondering whether things may be 
> more complicated than I thought.
> 
> It seems the lexer wordCharacters method *must* return ascii, because 
> auto-completion only ever looks at *single bytes*. Things could break in
> utf-8 mode if wordCharacters included some random non-ascii bytes and a 
> multi-byte character was encountered. (For example, if the lead byte of 
> a multi-byte sequence was included in the word characters, but not its 
> continuation bytes, it might result in an attempt to insert text at an 
> invalid position).
> 
> On top of that, auto-completion also uses Scintilla's search apis to 
> find the start of words (which in turn depends on Scintilla's definition
> of word characters). What happens if the lexer's definition of word 
> characters conflicts with Scintilla's? Possibly there are some 
> edge-cases where this might matter, but I confess I'm not sure.

...in other words a can of worms. I won't change it then. Maybe QScintilla
should be a fork of Scintilla rather than a port.

Thanks for looking into this,
Phil
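
The two byte-level hazards described in the quoted messages can be
reproduced with a small, self-contained sketch. It is plain Python and
calls nothing from Scintilla or QScintilla. The names (ASCII_WORD,
DEFAULT_WORD, word_end, is_char_boundary) are hypothetical, and the
byte-wise scan is only an assumption about how a single-byte word scan
might behave, not the actual auto-completion code.

```python
import string

# Assumption: a lexer's wordCharacters() returns only ASCII letters, digits, "_".
ASCII_WORD = set((string.ascii_letters + string.digits + "_").encode("ascii"))

# Scintilla's default set as described in the thread: the same, plus bytes 128-255.
DEFAULT_WORD = ASCII_WORD | set(range(128, 256))


def word_end(buf: bytes, pos: int, word_bytes: set) -> int:
    """Scan forward one byte at a time from pos to the end of the word."""
    while pos < len(buf) and buf[pos] in word_bytes:
        pos += 1
    return pos


def is_char_boundary(buf: bytes, pos: int) -> bool:
    """True if pos does not point at a UTF-8 continuation byte (10xxxxxx)."""
    return pos >= len(buf) or (buf[pos] & 0xC0) != 0x80


# latin-1: "é" is the single byte 0xE9, so the word ends in a different place
# depending on which set of word characters is in effect.
latin1 = "café x".encode("latin-1")
end_default = word_end(latin1, 0, DEFAULT_WORD)   # 4: "café"
end_ascii = word_end(latin1, 0, ASCII_WORD)       # 3: "caf"

# utf-8: "é" is the two bytes 0xC3 0xA9. If only the lead byte is a word
# character, the scan stops between the two bytes of one character.
utf8 = "café".encode("utf-8")
lead_only = ASCII_WORD | {0xC3}
end_utf8 = word_end(utf8, 0, lead_only)           # 4
split_char = not is_char_boundary(utf8, end_utf8) # True: invalid position
```

In the latin-1 case the same text yields a different word depending on
the set in use, which is the change in behaviour mentioned above. In the
utf-8 case the scan stops inside a character, which is the kind of
invalid position the auto-completion concern refers to.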