Prince Prince - 1 month ago 5
Python Question

How to extract Devanagari text from a given string using Unicode ranges

I want to write Python code that extracts the Devanagari text from a given string, but I don't know how to use Unicode code points to do that.

My input will be in this form

Translate 'अंक'
36 अ [V]
36 ं [n]
57 ं (क [N]
36 क [kV]
---
(hi)'VNk(en)


I want only the text written in Devanagari, not the numbers or the English letters.

My output should be in this form

अंक अ ं ं(क क

I have tried this code:

import codecs

file = codecs.open("C:/Users/prince/Desktop/hindi.txt",mode = "r", encoding = "utf-8")
file_dic = codecs.open("C:/Users/prince/Desktop/dic.txt",mode = "w", encoding = "utf-8")
for i in range (0, 330):
    u = file.read()
    if (u[i] >= 0900) && (u[i]<= 097F):
        file_dic.write(u)
        file_dic.write(' ')

Answer

A regular expression will keep each run of Devanagari text together, whereas your attempt would write a space after every character. The pattern below matches the Devanagari block (U+0900 to U+097F) and also the Devanagari Extended block (U+A8E0 to U+A8FF):

#!python3
#coding:utf8

import re

text = '''\
Translate 'अंक'  
36  अ       [V]  
36  ं       [n]  
57  ं  (क [N]  
36  क [kV]
---  
(hi)'VNk(en)
'''

print(' '.join(re.findall(r'[\u0900-\u097f\ua8e0-\ua8ff]+',text)))

Output:

अंक अ ं ं क क
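
If you would rather keep the character-by-character comparison from the question, the following is a minimal sketch of how that could look. It assumes the same two Unicode ranges as the regular expression above; the names `DEVANAGARI_RANGES`, `is_devanagari` and `extract_devanagari` are made up for this illustration. Note that code points are written as hex integer literals (`0x0900`, not `0900`), that `ord()` is needed to turn a character into a number, and that Python uses `and` or a chained comparison rather than `&&`.

```python
from itertools import groupby

# Inclusive code point ranges, written as hex integer literals.
# These mirror the regular expression above: Devanagari and Devanagari Extended.
DEVANAGARI_RANGES = ((0x0900, 0x097F), (0xA8E0, 0xA8FF))


def is_devanagari(ch):
    """Return True if the single character ch lies in one of the ranges."""
    cp = ord(ch)  # convert the character to its integer code point
    return any(lo <= cp <= hi for lo, hi in DEVANAGARI_RANGES)


def extract_devanagari(text):
    """Collect consecutive Devanagari characters into runs joined by spaces."""
    runs = (
        ''.join(group)
        for keep, group in groupby(text, key=is_devanagari)
        if keep  # drop the runs of non-Devanagari characters
    )
    return ' '.join(runs)


sample = "Translate 'अंक'\n36 अ [V]\n36 ं [n]\n57 ं (क [N]\n36 क [kV]\n---\n(hi)'VNk(en)\n"
result = extract_devanagari(sample)
print(result)
```

Grouping with `itertools.groupby` is what keeps a word such as अंक in one piece instead of putting a space after every character. For most purposes the regular expression is shorter and does the same job.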

Writing to the files in your example:

import re

with open("C:/Users/prince/Desktop/hindi.txt",mode = "r", encoding = "utf-8") as file:
    text = file.read()
with open("C:/Users/prince/Desktop/dic.txt",mode = "w", encoding = "utf-8") as file_dic:
    file_dic.write(' '.join(re.findall(r'[\u0900-\u097f\ua8e0-\ua8ff]+',text)))
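
For a large input file you may not want to read everything into memory at once. Below is a rough sketch of a line-by-line variant, assuming it is acceptable to write one output line per input line that contains Devanagari. The function name `extract_file` is hypothetical, and the temporary files only stand in for `hindi.txt` and `dic.txt` so that the example can run anywhere.

```python
import os
import re
import tempfile

# Same character class as above, compiled once and reused for every line.
DEVANAGARI_RUN = re.compile(r'[\u0900-\u097f\ua8e0-\ua8ff]+')


def extract_file(src_path, dst_path):
    """Write the Devanagari runs of each line in src_path to dst_path."""
    with open(src_path, mode='r', encoding='utf-8') as src, \
         open(dst_path, mode='w', encoding='utf-8') as dst:
        for line in src:
            runs = DEVANAGARI_RUN.findall(line)
            if runs:  # skip lines that contain no Devanagari at all
                dst.write(' '.join(runs) + '\n')


# Demonstration with temporary stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'hindi.txt')
    dst_path = os.path.join(tmp, 'dic.txt')
    with open(src_path, mode='w', encoding='utf-8') as f:
        f.write("Translate 'अंक'\n36 अ [V]\n57 ं (क [N]\n---\n")
    extract_file(src_path, dst_path)
    with open(dst_path, mode='r', encoding='utf-8') as f:
        output = f.read()

print(output)
```

Because a newline is never part of a Devanagari run, processing line by line finds the same matches as reading the whole file; only the separators in the output differ.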