Automatic file encoding detection in Java

A few months ago I worked on a process that imports Facebook Leads into a legacy system. Facebook delivers its advertising data as UTF-16 encoded CSV. The tool also had to cope with CSV files that had occasionally been edited by hand, which reverted the encoding to something a bit more standard. Thankfully, there is a small library out there that helps. So, in case you ever need to guess whether a file is UTF-16 and don't want to roll your own detection, here you go:

<dependency>
    <groupId>org.codehaus.guessencoding</groupId>
    <artifactId>guessencoding</artifactId>
    <version>1.4</version>
    <type>jar</type>
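</dependency>

With the dependency in place, guess the charset and open the reader with it:

import com.glaforge.i18n.io.CharsetToolkit;
import java.io.*;
import java.nio.charset.*;

File in = new File(inputFile);
if (!in.exists()) {
    throw new IllegalArgumentException("Input file not found: " + inputFile);
}
// Inspect up to the first 4096 bytes; UTF-8 is the fallback when detection is inconclusive.
Charset cs = CharsetToolkit.guessEncoding(in, 4096, StandardCharsets.UTF_8);
System.out.println("Reading " + inputFile + " as " + cs.name());
try (Reader r = new InputStreamReader(new FileInputStream(in), cs)) {
    // parse the CSV from r; try-with-resources closes the underlying stream
}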
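
If you are curious what "rolling your own" would look like, here is a minimal sketch of the simplest part of the job: checking for a byte order mark (BOM) at the start of the file. This is not the library's actual implementation, and the class name BomSniffer and its detect methods are made up for illustration. It only recognises the UTF-8 and UTF-16 BOMs and returns the fallback for anything else, so files without a BOM are not really "guessed" at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of guessencoding.
public class BomSniffer {

    /** Returns the charset implied by a leading BOM, or the fallback if there is none. */
    public static Charset detect(byte[] head, Charset fallback) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        return fallback;
    }

    /** Reads at most the first three bytes of the file and applies the BOM check. */
    public static Charset detect(Path file, Charset fallback) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[3];
            int n = 0;
            int r;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0) {
                n += r;
            }
            return detect(Arrays.copyOf(buf, n), fallback);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println(detect(Paths.get(args[0]), StandardCharsets.UTF_8));
        } else {
            // In-memory sample: UTF-16LE BOM followed by the letter 'a'
            byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00};
            System.out.println(detect(sample, StandardCharsets.UTF_8));
        }
    }
}
```

One thing to watch if you go this route: decoding with UTF_16LE or UTF_16BE does not strip the BOM, so the first character you read will be U+FEFF unless you skip it yourself.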
