Removal of Non ASCII characters using Python

13+ years of experience as a full stack developer. Has also worked as an architect, building solutions and products for automation. Solution-oriented, hands-on technical utility player. More than 4 years of experience in each of the e-commerce and finance domains. Experienced in driving business automation and marketing using technology. Strong follower of open source technology. Uses PHP, Python, AWS, and Angular as the technology stack to build products.

Hello Devs,

I am going to explain how to remove non-ASCII characters from input text or content. Let's first get to know what non-ASCII characters are.

What are non-ASCII characters?

You might have faced an issue while copy-pasting text from a document (docx) into an HTML input element or an editor. Sometimes the symbols used in the source format are not supported by the target input area. For example, the double quote used in a docx file is different from the one used in a code editor or input element; see below 👇🏻

“Example Text” - in docx file
"Example Text" - in editor or HTML input element

When you paste text from a docx file into HTML, symbols like these curly quotes are treated as non-ASCII or junk characters, because they fall outside the 128 characters (code points 0 to 127) that ASCII defines. Generally such text can be saved to the database, but sometimes encoding it or calculating a signature over it fails with an error because the string is not supported. One real scenario I faced was calculating an AWS signature before passing a request to API Gateway: my signature did not match the one calculated by AWS, and the request failed with an error. In my case the AWS signature calculation removed those characters before calculating the signature; if your code does not do the same, the two signatures will not match.
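To make this concrete, here is a minimal sketch that first lists the non-ASCII characters in a string and then shows how they change a signature. It uses a plain HMAC-SHA256 from the standard library, not the real AWS signing process, and the names `find_non_ascii`, `sign`, and the secret key are made up for illustration.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical key, for illustration only


def find_non_ascii(text):
    """Return (index, code point) for each character outside ASCII."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ord(ch) > 127]


def sign(text):
    """Return the hex HMAC-SHA256 of the UTF-8 bytes of text."""
    return hmac.new(SECRET, text.encode("utf-8"), hashlib.sha256).hexdigest()


docx_text = "\u201cExample Text\u201d"  # curly quotes, as pasted from a docx file
# Drop every non-ASCII character, leaving only the plain text
cleaned = docx_text.encode("ascii", "ignore").decode("ascii")

print(find_non_ascii(docx_text))         # [(0, 'U+201C'), (13, 'U+201D')]
print(cleaned)                           # Example Text
print(sign(docx_text) == sign(cleaned))  # False: different bytes, different signature
```

If one side signs the original text and the other signs the cleaned text, the comparison fails even though the two strings look almost identical.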

How to solve this issue?

Below is a Python script to remove those non-ASCII or junk characters.

Prerequisites:

  • Any Python version (3.x recommended)
  • Regular expression operations library (re) - part of the Python standard library, so there is nothing to install with pip
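
import re

# Sample text containing non-ASCII junk characters ("\x89Ûª")
ini_string = "'technews One lone dude awaits iPad 2 at Apple\x89Ûªs SXSW store"

# Option 1: keep only letters and digits, joining the pieces with spaces
res1 = " ".join(re.split("[^A-Za-z0-9]+", ini_string)).strip()
print(res1)

# Check for anything outside printable ASCII, tab, CR and LF.
# re.search scans the whole string; re.match only looks at the start.
if re.search("[^\t\r\n\x20-\x7E]+", ini_string):
    print("found")

# Option 2: replace every non-ASCII byte of the UTF-8 encoding with a backtick
result = ini_string.encode("utf-8").decode("ascii", "replace").replace("\ufffd", "`")
print(result)

# Option 3: replace the known junk sequence with a single backtick
result2 = ini_string.replace("\x89Ûª", "`")
print(result2)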
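
Removing characters outright can lose meaning, for example when a curly apostrophe simply disappears. As an alternative sketch, not part of the original script, the snippet below maps common smart punctuation to ASCII equivalents first and strips accents with `unicodedata` before dropping whatever is left. The mapping table and the name `to_ascii` are assumptions for illustration.

```python
import unicodedata

# Hypothetical mapping of common "smart" punctuation to ASCII equivalents
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
})


def to_ascii(text):
    """Map smart punctuation to ASCII, strip accents, drop what remains."""
    text = text.translate(SMART_PUNCTUATION)
    # NFKD splits accented letters into a base letter plus combining marks
    text = unicodedata.normalize("NFKD", text)
    # The combining marks and any other non-ASCII characters are dropped here
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("\u201cCaf\u00e9\u201d \u2013 Apple\u2019s store"))
# "Cafe" - Apple's store
```

Extend the mapping with whatever symbols show up in your own input; anything not mapped and not decomposable to ASCII is silently removed.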

References:

  • https://gist.github.com/aviboy2006/ca1e50f1cb1a32f7544f2f0af1fb928d
