Removing Non-ASCII Characters Using Python
In this post I explain how to remove non-ASCII characters from input text or content. Let's first look at what non-ASCII characters are.
What are non-ASCII characters?
You may have run into a problem when copying and pasting text from a document (docx) into an HTML input element or an editor. Sometimes the symbols used in the document are not supported by the target input area. For example, the double quote used in a docx file differs from the one used in a code editor or input element, see below 👇🏻
“Example Text” - in docx file
"Example Text" - in editor or HTML input element
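The difference between the two quote styles can be made visible with a small sketch like the one below. The helper name non_ascii_chars and the sample strings are illustrative assumptions, and str.isascii() needs Python 3.7 or newer.

```python
docx_text = "\u201cExample Text\u201d"   # curly quotes, as pasted from a docx file
editor_text = '"Example Text"'           # straight quotes, as typed in an editor

def non_ascii_chars(text):
    """Return (character, code point) pairs for every non-ASCII character."""
    return [(ch, f"U+{ord(ch):04X}") for ch in text if ord(ch) > 127]

print(non_ascii_chars(docx_text))    # [('“', 'U+201C'), ('”', 'U+201D')]
print(non_ascii_chars(editor_text))  # []
print(docx_text.isascii(), editor_text.isascii())  # False True
```

The curly quotes sit at code points U+201C and U+201D, well outside the 0-127 range that ASCII covers, while the straight quote is plain ASCII.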
When you paste docx-formatted text into HTML, characters such as these curly quotes fall outside the ASCII range and are often treated as non-ASCII or junk characters. Generally the text can still be saved to a database, but encoding it or calculating a signature over it can fail because the string contains unsupported characters. One real scenario I faced: I calculated an AWS signature before sending a request to API Gateway, and it did not match the signature AWS calculated, so the request failed with an error. In my case the AWS signature calculation removed those characters before computing the signature. If your code does not do the same, the two signatures simply will not match.
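The mismatch can be illustrated with a rough sketch. It is not the AWS Signature Version 4 algorithm: it uses a plain HMAC-SHA256 as a stand-in, and it assumes, as in the scenario described above, that the receiving side strips non-ASCII characters before signing. The key, payload, and function names are all hypothetical.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical key, for illustration only

def sign(text):
    """HMAC-SHA256 over the UTF-8 bytes of text (a stand-in, not AWS SigV4)."""
    return hmac.new(SECRET, text.encode("utf-8"), hashlib.sha256).hexdigest()

def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", "ignore").decode("ascii")

payload = "\u201cExample Text\u201d"

client_sig = sign(payload)                    # signed exactly as pasted
server_sig = sign(strip_non_ascii(payload))   # signed after stripping (assumed)
print(client_sig == server_sig)               # False

# Cleaning the text before signing makes both sides agree
print(sign(strip_non_ascii(payload)) == server_sig)  # True
```

The point is only that both sides must sign the same bytes. Any character one side drops and the other keeps changes the digest.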
How to solve this issue?
Below is a Python script that removes those non-ASCII or junk characters. It needs:
- Any Python version (3.x recommended)
- Regular expression operations library (re). It is part of the Python standard library, so no pip install is needed.
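import re

ini_string = "'technews One lone dude awaits iPad 2 at Apple\x89Ûªs SXSW store"

# Option 1: keep only letters and digits, joined by single spaces
res1 = " ".join(re.split("[^A-Za-z0-9]+", ini_string)).strip()
print(res1)

# Detect any character outside printable ASCII, tab, CR and LF.
# re.search scans the whole string; re.match only checks the start.
if re.search(r"[^\t\r\n\x20-\x7E]", ini_string):
    print("found")

# Option 2: replace every non-ASCII byte of the UTF-8 encoding with a backtick
result = ini_string.encode().decode("ascii", "replace").replace("\ufffd", "`")
print(result)

# Option 3: replace one known junk sequence with a backtick
result2 = ini_string.replace("\x89Ûª", "`")
print(result2)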
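The ideas in the script can also be combined into one reusable helper. The sketch below is only one possible approach: it first maps a few common word-processor characters to ASCII equivalents, then replaces whatever non-ASCII text remains. The names REPLACEMENTS and clean_text, and the choice of characters in the mapping, are assumptions made for illustration.

```python
import re

# Hypothetical mapping of common word-processor characters to ASCII equivalents
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
    "\u2026": "...",                # ellipsis
}
TRANSLATION = str.maketrans(REPLACEMENTS)

# Anything that is not printable ASCII, tab, CR or LF
NON_ASCII = re.compile(r"[^\t\r\n\x20-\x7E]+")

def clean_text(text, placeholder=""):
    """Map known characters to ASCII, then replace what is left with placeholder."""
    return NON_ASCII.sub(placeholder, text.translate(TRANSLATION))

print(clean_text("\u201cExample Text\u201d"))       # "Example Text"
print(clean_text("Apple\x89Ûªs SXSW store", "'"))   # Apple's SXSW store
```

Mapping first and stripping second keeps quotes and dashes readable instead of deleting them. Each run of leftover characters becomes a single placeholder, and with the default empty placeholder it is simply removed.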