[PyQt] Search method for Arabic text

Zachary Scheuren angryjaga at gmail.com
Wed Aug 29 06:37:56 BST 2018


Yes, that's what I was afraid of. You lose the index when you strip the
marks so you need to keep track of them somehow. One way is to split on
spaces and track it that way. Something like this with PyQt5. You'll need
to fill in your pieces, but the main idea is here:

# setup your QApplication and QTextEdit, etc.
from pyarabic import araby
from PyQt5.QtGui import QTextCharFormat
from PyQt5.QtGui import QTextCursor

textedit = your QTextEdit
text = the source text
text2 = araby.strip_tashkeel(text)
search_string = 'السماء'
matches = []

# to be more accurate about the match you can use re.finditer() instead of
just "if search_string in" to get the start index,
# but then you should track that as well as the list index so you can be
more accurate later. this is just a quick proof of concept so I didn't go
all the way
for i, c in enumerate(text2.split()):
    if search_string in c:
        matches.append(i)

# Now use the match indices to highlight the original text
cursor = textedit.textCursor()
highlight_color = QColor()  # whatever color you want
fmt = QTextCharFormat()
fmt.setBackground(highlight_color)

for match_index in matches:
    start = len(' '.join(text.split()[0:match_index])) + 1
    end = len(' '.join(text.split()[0:match_index + 1]))
    cursor.setPosition(start, QTextCursor.MoveAnchor)
    cursor.setPosition(end, QTextCursor.KeepAnchor)
    cursor.setCharFormat(fmt)

This works, but it can highlight some parts that aren't really part of the
match. To avoid that you could check the index offset within each matched
item in the list to make sure you get the right start index. It's all
possible. Just takes some more code. I hope this helps get you closer.



On Tue, Aug 28, 2018 at 12:57 AM, Maziar Parsijani <
maziar.parsijani at gmail.com> wrote:

> Hi Zachary Scheuren
> Thanks a lot for your answer .The reason I have Email PYQT is that I use
> its widgets and I forgot to say that I use Qtextedit.And for more detail I
> can say that I found that pyarabic library before and as you said I can
> remove marks with its strip_tashkeel(text) function ,But if you take a
> look at below example I think you will find what I want to do.And I can
> refer you to http://tanzil.net here if you search for "السماء" it will
> find and highlight "ٱلسَّمَآءِ" I want to know if its possible in
> python?I use sqlite database and I could find the results like this but I
> can not highlight them.
> Example :
> search for :" السماء "
> but I want to show Quranuthmani and find :" ٱلسَّمَآءِ" in it and
> highlight them.
> I can find them with no problem cause of using sqlite database table with
> different Quran text But the problem is highlighting them cause of using
> regex
> and " السماء " ," ٱلسَّمَآءِ" are not the same so I can not highlight
> them.
> Please accept my apologizes for asking my question before your permission
> .
>
> On Mon, Aug 27, 2018 at 10:14 AM Zachary Scheuren <angryjaga at gmail.com>
> wrote:
>
>> This isn't really a PyQt question. You can do all that in basic Python,
>> but it can help if you have something like the pyarabic library. With that
>> you can strip out the vocalization before comparing strings. You also need
>> to consider all the possible Alefs like in str1 you have Alef with Wasla,
>> but str2 only has Alef. pyarabic can also help there with araby.ALEFAT
>> which is a list of all possible Alefs with marks. You need to manually
>> check that because Alef with Wasla has no Unicode decomposition and the
>> wasla isn't encoded as a separate mark. There have been Unicode proposals
>> for that, but nothing has happened so far. Anyway, I did a quick little
>> test with your strings...
>>
>> import re
>> from pyarabic import araby
>> str3_nomarks = araby.separate(str3)[0]  # strips all diacritics
>> for c in araby.ALEFAT:  # replace any Alef with a mark by base Alef
>>     str3_nomarks = str3_nomarks.replace(c, araby.ALEF)
>>
>> re.findall(str2, str3_nomarks)
>>
>> Something like that will get you matches, but if you need to track the
>> position in a string you'll have to do some more work since dropping the
>> diacritics will throw off the index.
>>
>>
>>
>> On Wed, Aug 22, 2018 at 12:43 AM, Maziar Parsijani <
>> maziar.parsijani at gmail.com> wrote:
>>
>>> Hi
>>> I have some Arabic strings in mt database now I want to if I search like
>>> this :
>>>
>>>   str1 = "ٱلْمُفْلِحُونَ"
>>>   str2 = "المفلحون"
>>> as you can see str1 is the same as str2 but in Arabic text str1 has more
>>> alphabetical characters.
>>> Is there anyway to search str2 but I could find both of them in a string
>>> like :
>>>  str3 = " المفلحون ٱلْمُفْلِحُونَ ٱلنَّاسُ المفلحون ٱلْمُفْلِحُونَ
>>> المفلحون ٱلنَّاسُ المفلحون ٱلنَّاسُ "
>>>
>>> _______________________________________________
>>> PyQt mailing list    PyQt at riverbankcomputing.com
>>> https://www.riverbankcomputing.com/mailman/listinfo/pyqt
>>>
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.riverbankcomputing.com/pipermail/pyqt/attachments/20180828/9e5a501d/attachment-0001.html>


More information about the PyQt mailing list