How to safely remove Byte Order Marks (BOM) from files created on Windows

The problem

If you read enough files created on a Windows box, you’ll eventually run into this: "\xEF\xBB\xBFyour expected string" instead of "your expected string". I recently ran into it again while parsing some CSV files exported on Windows. Those three leading bytes are the UTF-8 encoding of what is called a Byte Order Mark (U+FEFF), and you’ll need to remove them before you can parse the string.
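
To see what is actually sitting at the front of the string, it can help to look at the raw bytes. This is only a small sketch: the sample string and the bom? helper are names made up for illustration, not part of any library.

```ruby
# The UTF-8 BOM is the code point U+FEFF, stored as the bytes EF BB BF.
UTF8_BOM = "\uFEFF"

# Hypothetical helper: true if the text starts with a UTF-8 BOM.
# A copy is tagged as UTF-8 so binary (ASCII-8BIT) strings can be checked too.
def bom?(text)
  text.dup.force_encoding(Encoding::UTF_8).start_with?(UTF8_BOM)
end

sample = "\xEF\xBB\xBFyour expected string"

# Print the first three bytes in hex, then check both variants.
p sample.bytes.first(3).map { |b| format("%02X", b) }  # ["EF", "BB", "BF"]
p bom?(sample)                                         # true
p bom?("your expected string")                         # false
```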

Solution

Just search for "\xEF\xBB\xBF" and remove it.

file = File.read("filename.txt").sub("\xEF\xBB\xBF", '')

That will usually work, but if the string encoding is ASCII-8BIT it will raise an error such as Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string). If you force the string to UTF-8 first, you shouldn’t have any issues.

file = File.read("filename.txt").force_encoding('utf-8').encode.sub("\xEF\xBB\xBF", '')
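
One thing to watch with sub is that it removes the first match wherever it occurs, not only at the start of the file. As a hedged alternative, the sketch below strips the BOM only when it leads the string, and also uses Ruby's "bom|utf-8" read encoding, which skips a leading BOM while reading. The names strip_bom and read_without_bom are made up for this example, and both helpers assume the content really is UTF-8.

```ruby
require "tempfile"

BOM_CHAR = "\uFEFF" # U+FEFF, encoded in UTF-8 as the bytes EF BB BF

# Hypothetical helper: drop a BOM only if it is the very first thing in the text.
# A copy is tagged as UTF-8 so ASCII-8BIT input does not trigger an
# encoding compatibility error.
def strip_bom(text)
  utf8 = text.dup.force_encoding(Encoding::UTF_8)
  utf8.start_with?(BOM_CHAR) ? utf8.byteslice(BOM_CHAR.bytesize..-1) : utf8
end

# Hypothetical helper: let Ruby skip the BOM while reading the file.
def read_without_bom(path)
  File.read(path, encoding: "bom|utf-8")
end

puts strip_bom("\xEF\xBB\xBFyour expected string")    # your expected string
puts strip_bom("\xEF\xBB\xBFyour expected string".b)  # same, from binary input

# Write a throwaway file that starts with a BOM, then read it back without it.
Tempfile.create("bom_demo") do |f|
  f.binmode
  f.write("\xEF\xBB\xBFid,name\n".b)
  f.flush
  p read_without_bom(f.path)
end
```

Reading with "bom|utf-8" keeps the fix in one place when you control the File.read call; strip_bom is the fallback when the string arrives from somewhere else.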

October 15, 2017
ruby, utf-8, ASCII-8Bit

