A few months ago I worked on a process that imports Facebook Leads into a legacy system. Facebook exports its advertising data as UTF-16 encoded CSV. The tool also had to support CSV files that had occasionally been edited by hand, which reverted the encoding to something a bit more standard. Thankfully, there was a small library out there that helped. So, in case you ever find yourself needing to guess whether a file is UTF-16 and don't want to roll your own, here you go:
<dependency>
    <groupId>org.codehaus.guessencoding</groupId>
    <artifactId>guessencoding</artifactId>
    <version>1.4</version>
    <type>jar</type>
</dependency>
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import com.glaforge.i18n.io.CharsetToolkit;

File in = new File(inputFile);
if (!in.exists()) {
    throw new IllegalArgumentException("Input file not found");
}
Charset cs = CharsetToolkit.guessEncoding(in, 4096, StandardCharsets.UTF_8); // sniff first 4 KB, default to UTF-8
System.out.println("Reading " + inputFile + " as " + cs.name());
Reader r = new InputStreamReader(new FileInputStream(in), cs);
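If you're curious what "guessing" amounts to, here's a rough, self-contained sketch of the kind of heuristic such a toolkit applies. This is not the library's actual algorithm, and the `guess` helper is hypothetical: it checks for a UTF-16 byte-order mark first, then falls back to counting NUL bytes (mostly-ASCII UTF-16 text has a NUL in roughly every other byte), and otherwise returns a supplied default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    // Hypothetical helper, not the guessencoding API: guess a charset
    // from the first bytes of a file.
    static Charset guess(byte[] head, Charset fallback) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
            // A byte-order mark is the most reliable signal.
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        int nuls = 0;
        for (byte b : head) if (b == 0) nuls++;
        // Lots of NUL bytes strongly suggests UTF-16 text without a BOM;
        // the position of the first NUL hints at the byte order.
        if (head.length > 0 && nuls * 3 > head.length) {
            return (head[0] == 0) ? StandardCharsets.UTF_16BE
                                  : StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        byte[] plain = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(utf16le, StandardCharsets.UTF_8)); // UTF-16LE
        System.out.println(guess(plain, StandardCharsets.UTF_8));   // UTF-8
    }
}
```

The NUL-counting fallback is what makes this work even when a hand-edited file has lost its BOM, which was exactly the failure mode above.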
Very helpful Dan, thanks for posting this. It looks like guessencoding is unmaintained and has dependencies with known CVEs, so nowadays I’d recommend forking the CharsetToolkit or picking a different library. I used icu4j; it took some wrestling, but here’s an example:
https://gist.github.com/raymyers/b5aedf87c1b35ccf02fc52b5e85aeb8d
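For reference, the core of the icu4j approach looks roughly like this (a sketch, not necessarily the gist's exact code). It uses icu4j's `CharsetDetector`, which ships in the `com.ibm.icu:icu4j` artifact; the sample bytes here are just made-up UTF-16LE input with a BOM.

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class IcuDetect {
    public static void main(String[] args) {
        // UTF-16LE "hi" with a leading byte-order mark (example data).
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect(); // best-confidence match
        System.out.println(match.getName());
    }
}
```

One thing worth knowing: `detect()` returns the single best match, but `detectAll()` gives the full ranked list with confidence scores, which is handy when short inputs produce ambiguous guesses.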