[PyQt] Problem with regex in my code

Maurizio Berti maurizio.berti at gmail.com
Sun Sep 2 17:33:57 BST 2018


Well, that's exactly what I meant when asking for a bit of context.
As somebody who knows nothing about Arabic, to me the last character in
"قَالَ" is "لَ", because "لَ" is treated as a complete character when moving
around or selecting with the cursor; my answer only worked because nobody
had told me about that.
The fatha, as I understand it, is a diacritic mark. In most western
languages such marks are normally stored as part of another letter, so "à"
is usually a single precomposed character in Unicode (even though "`" also
exists as a character on its own).
For example, 'à' is the single code point u'\xe0' (b'\xc3\xa0' when encoded
as UTF-8), so its length is 1, the same as 'a' without any accent, while
'لَ' is two code points, u'\u0644\u064e' (b'\xd9\x84\xd9\x8e' in UTF-8).
You can easily do a len() on the unicode strings to see that.
Unfortunately, character selection and cursor positioning treat 'لَ' as a
single unit just like 'à', hence the misunderstanding. I assume Japanese
kanji characters are treated in a similar way.
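
To make the len() comparison concrete, here is a minimal sketch for
Python 3 using only the standard library. The variable names are mine,
chosen just for illustration.

```python
import unicodedata

a_grave = "\u00e0"            # 'à' as one precomposed code point
lam_fatha = "\u0644\u064e"    # lam followed by the combining fatha

# Length in code points, which is what str indexing and regexes see.
print(len(a_grave))      # 1
print(len(lam_fatha))    # 2

# Length in bytes once encoded as UTF-8.
print(len(a_grave.encode("utf-8")))      # 2
print(len(lam_fatha.encode("utf-8")))    # 4

# 'à' can also be written as 'a' + combining grave accent (NFD form).
decomposed = unicodedata.normalize("NFD", a_grave)
print(len(decomposed))   # 2

# The fatha is a separate code point of category Mn (nonspacing mark).
names = [unicodedata.name(ch) for ch in lam_fatha]
print(names)             # ['ARABIC LETTER LAM', 'ARABIC FATHA']
print(unicodedata.category("\u064e"))    # Mn
```

So the cursor shows one unit, but the string (and the regex engine) sees
two characters.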

So, the problem here is that, since those marks are separate characters
as far as strings are concerned, you'll need a somewhat more complex and
more robust regex. That's where you'll need to better understand how
regexes work: they were created mostly with western languages in mind, and
using them with other scripts (and therefore with Unicode) requires deeper
knowledge.

When I tried your example I first tested the regex on its own, and the
results were as expected, so I assumed it was ok.
You probably need to add a further step to better understand what's not
working. The first thing is to be completely sure that the expression you
use actually matches what you want: just print the results to the console
(with character positions and so on). _Then_ apply the results to the
QTextDocument and see, by iterating over them carefully, how the
QTextCursor behaves. If you still don't understand what's going on at that
point, that's when you send that code here, with the aforementioned
context. We're eager to help, but we can't do everything :-)
And there's the possibility that, while doing so, you'll find out what is
wrong; it's called "rubber duck debugging", and it's when you have to
explain your code and your case, almost line by line, possibly to somebody
who knows absolutely nothing about it. I have solved tons of problems by
starting to ask a question and writing example code, and then finding the
solution myself even before finishing the question, because I was forced to
think about the problem and explain it in a "different" and thorough way.
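
As a rough sketch of that first step, this is how I would print matches
with their positions, using Python's own re module instead of QRegExp so
that it runs without Qt. The helper name report_matches() and the shortened
sample text are my own assumptions, not taken from your code.

```python
import re

# Sample words written as escapes, so the code points are unambiguous.
QALA = "\u0642\u064e\u0627\u0644\u064e"       # قَالَ
QALAT = QALA + "\u062a\u0652"                 # قَالَتْ
FAQALU = "\u0641\u064e\u0642\u064e\u0627\u0644\u064f\u0648\u0627"  # فَقَالُوا

text = " ".join([QALAT, QALA, FAQALU, QALA])


def report_matches(pattern, text):
    """Print and return (start, end, matched_text) for every match."""
    results = []
    for match in re.finditer(pattern, text):
        start, end = match.span()
        results.append((start, end, match.group()))
        # Show code points too: diacritics are hard to see on a console.
        print(start, end, ["U+%04X" % ord(ch) for ch in match.group()])
    return results


# A plain search also matches the first five code points of QALAT.
found = report_matches(re.escape(QALA), text)
starts = [start for start, _end, _matched in found]
print(starts)    # [0, 8, 24]
```

Python indexes strings by code point while QString counts UTF-16 units;
for these Arabic characters the two agree, so the positions printed here
should be directly comparable with what the QTextCursor receives.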

Anyway, I wasn't saying you should never ask those things, but that you
should probably ask somewhere else, on some forum/website/list where you
are more likely to find people who are actually able to give you an answer;
for that, stackoverflow might be much more helpful. About that, have a look
at this:
https://stackoverflow.com/questions/11323596/regular-expression-for-arabic-language
Knowing the unicode blocks could be helpful to better track word boundaries
in expressions.
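
Here is a minimal sketch of what I mean, again with Python's re module.
It assumes that every character in the basic Arabic block (U+0600 to
U+06FF, letters and diacritics alike) counts as part of a word; that range
also contains some punctuation and digits, so treat it as a starting point
only. The names ARABIC_BLOCK and whole_word_pattern() are made up for this
example.

```python
import re

# Assumption: only the basic Arabic block matters for this text.
ARABIC_BLOCK = "\u0600-\u06ff"


def whole_word_pattern(term):
    """Match term only when no Arabic character touches it on either side."""
    return re.compile(
        "(?<![{0}]){1}(?![{0}])".format(ARABIC_BLOCK, re.escape(term))
    )


QALA = "\u0642\u064e\u0627\u0644\u064e"    # قَالَ
QALAT = QALA + "\u062a\u0652"              # قَالَتْ

text = " ".join([QALAT, QALA, QALAT, QALA])

spans = [m.span() for m in whole_word_pattern(QALA).finditer(text)]
print(spans)    # [(8, 13), (22, 27)]
```

The lookarounds replace \b here. As far as I can tell, Python's re does
not count combining marks such as the fatha as word characters, while the
QRegExp documentation lists marks under \w, so \b may not behave the same
way in the two engines; explicit ranges avoid depending on that.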


I understand that documentation about these topics is not as common or as
easy to find as it is for other languages, but I'm also pretty sure you're
not the first to have this kind of issue, so the trick is to be patient and
to look and ask in the right places :-)
If you need to test your expressions, you can also use regex101.com, which
has a very good interface and gives detailed insight into regular
expressions as you type them.

So, I'd suggest you try to simplify your code and test your regular
expressions step by step. If everything looks ok but you still have
problems with the QTextCursor, then send us an example with all possible
test cases _already included_ in the code, and I'll be happy to help if I
can.

Good luck!
Maurizio

2018-09-02 10:12 GMT+02:00 Maziar Parsijani <maziar.parsijani at gmail.com>:

> Hi Maurizio and David
> You are both correct, but if you look closely at the code: when you copy the
> last character in "قَالَ " we call it fatha " ــَ " and put it completely
> in the lineedit.text() then it will find it correctly without (r'\b{}\b').
> The problem here is with that last character, which is sometimes
> different. For example, if I change the string to this:
>
> m = "قَالَتْ قَالَ فَقَالُوا۟ قَالُوا۟ قَالَ قَالُوٓا۟ قَالَتَا ٱ
> لْقَالِينَ  قَال  قَالِ "
> then we can not use (r'\b{}\b').
> Dear Maurizio, I will take your advice, but the problem here may be related
> to PyQt's QtGui.QTextCursor and QTextEdit.textCursor().
> If you say that it is not related, then I will not ask anything else. And
> as you know, the Arabic help texts for Python are not so helpful.
> Thanks anyway, but let me know, so that I don't try to post other
> problems like this.
>
>
>
> On Sun, Sep 2, 2018 at 12:33 AM Maurizio Berti <maurizio.berti at gmail.com>
> wrote:
>
>> Simple regexes like that one will match every occurrence of that sequence
>> of characters; if you want to find those characters as a single "word",
>> you will need to use word boundaries.
>>
>> Change the pattern init to this and it works (at least, according to the
>> string you gave):
>>
>> pattern = QtCore.QRegExp(r'\b{}\b'.format(self.lineEdit.text()))
>>
>> Have a look at this, too:
>> https://stackoverflow.com/questions/40731058/regex-match-arabic-keyword
>>
>> As a side note, forgive my bluntness, but you might want to read some
>> more documentation about regexes and right-to-left language practices.
>> The latest questions you asked were neither Python nor Qt related,
>> and, as you might understand, most of the people on this mailing list don't
>> even know how the Arabic language works; a bit of context might be useful,
>> so that even people not experienced with those languages might help too.
>>
>> Also, try to simplify examples by limiting code to what is really related
>> to your issue: those geometry, font and widget settings won't help anyone
>> reading your code. The same example could have been written in fewer than
>> half the lines, and would have been much more readable: easier to read is
>> easier to understand and easier to help with.
>> Finally, avoid mixing the way you import modules. You should import from
>> the main module _OR_ from the submodules. While it doesn't change much in
>> terms of computation, it decreases the possibility of bugs and improves
>> readability, which is better for everybody reading your code, including you
>> :-)
>>
>> Maurizio
>>
>>
>>
>> 2018-09-01 20:33 GMT+02:00 Maziar Parsijani <maziar.parsijani at gmail.com>:
>>
>>> I want to select 2 "قَالَ" in m = "قَالَتْ قَالَ فَقَالُوا۟ قَالُوا۟
>>> قَالَ قَالُوٓا۟ قَالَتَا ٱلْقَالِينَ  " but with the below code it
>>> selects every word that contains "قَالَ".
>>> What is the problem here?
>>> I tried putting spaces before and after pattern1 for this, but it
>>> doesn't work for me.
>>>
>>> pattern1 = " {0} ".format(self.lineEdit.text())
>>>
>>>
>>> from PyQt5 import QtCore, QtGui, QtWidgets
>>> from PyQt5.QtWidgets import QApplication, QTextEdit
>>> from PyQt5.QtGui import QTextDocument, QTextDocumentFragment
>>> from PyQt5 import QtCore, QtGui, QtWidgets
>>> import sys
>>> from PyQt5.QtWidgets import QDialog, QApplication
>>> class AppWindow(QDialog):
>>>     def __init__(self):
>>>         super().__init__()
>>>         self.setObjectName("Dialog")
>>>         self.resize(800, 600)
>>>         self.lineEdit = QtWidgets.QLineEdit(self)
>>>         self.lineEdit.setGeometry(QtCore.QRect(70, 70, 211, 21))
>>>         self.lineEdit.setObjectName("lineEdit")
>>>         self.pushButton = QtWidgets.QPushButton(self)
>>>         self.pushButton.setGeometry(QtCore.QRect(130, 110, 83, 28))
>>>         self.pushButton.setObjectName("pushButton")
>>>         self.SearchResults = QtWidgets.QTextEdit(self)
>>>         self.SearchResults.setGeometry(QtCore.QRect(130, 140, 500, 400))
>>>         font = QtGui.QFont()
>>>         font.setFamily("Amiri")
>>>         font.setPointSize(12)
>>>         self.SearchResults.setFont(font)
>>>         self.SearchResults.setToolTipDuration(0)
>>>         self.SearchResults.setReadOnly(True)
>>>         self.SearchResults.setAutoFormatting(QtWidgets.QTextEdit.AutoAll)
>>>         self.SearchResults.setObjectName("SearchResults")
>>>
>>>         self.retranslateUi(self)
>>>         QtCore.QMetaObject.connectSlotsByName(self)
>>>     def find1(self):
>>>             m = "  قَالَتْ  قَالَ فَقَالُوا۟ قَالُوا۟ قَالَ  قَالُوٓا۟ قَالَتَا  ٱلْقَالِينَ    "
>>>             self.SearchResults.append('{0} '.format(m))
>>>
>>>
>>>             cursor = self.SearchResults.textCursor()
>>>             format = QtGui.QTextCharFormat()
>>>             format.setForeground(QtGui.QBrush(QtGui.QColor("red")))
>>>
>>>             pattern1 = "{0}".format(self.lineEdit.text())
>>>             regex = QtCore.QRegExp(pattern1)
>>>             pos = 0
>>>             index = regex.indexIn(self.SearchResults.toPlainText(), pos)
>>>             tedad = 0
>>>             while (index != -1):
>>>                 cursor.setPosition(index)
>>>                 cursor.movePosition(QtGui.QTextCursor.WordLeft, QtGui.QTextCursor.KeepAnchor)
>>>                 cursor.mergeCharFormat(format)
>>>                 pos = index + regex.matchedLength()
>>>                 index = regex.indexIn(self.SearchResults.toPlainText(), pos)
>>>                 if regex.isValid():
>>>                     tedad += 1
>>>             nmayesh = ("{}".format(tedad))
>>>             self.SearchResults.append("{}".format(tedad))
>>>
>>>     def retranslateUi(self, Dialog):
>>>         _translate = QtCore.QCoreApplication.translate
>>>         self.setWindowTitle(_translate("Dialog", "Dialog"))
>>>         self.pushButton.setText(_translate("Dialog", "PushButton"))
>>>         self.pushButton.clicked.connect(self.find1)
>>>
>>> app = QApplication(sys.argv)
>>> w = AppWindow()
>>> w.show()
>>> sys.exit(app.exec_())
>>>
>>>
>>> _______________________________________________
>>> PyQt mailing list    PyQt at riverbankcomputing.com
>>> https://www.riverbankcomputing.com/mailman/listinfo/pyqt
>>>
>>
>>
>>
>> --
>> È difficile avere una convinzione precisa quando si parla delle ragioni
>> del cuore. - "Sostiene Pereira", Antonio Tabucchi
>> http://www.jidesk.net
>>
>


-- 
È difficile avere una convinzione precisa quando si parla delle ragioni del
cuore. - "Sostiene Pereira", Antonio Tabucchi
http://www.jidesk.net