Crusaderpyro - 1 year ago
Java Question

When do I need to specify the encoding while writing the file to the disk?

I have a sample method which copies one file to another using InputStream and OutputStream. In this case, the source file is encoded in UTF-8. Even though I don't specify the encoding while writing to disk, the destination file has the correct encoding. But if I have to write a java.lang.String to a file, I need to specify the encoding. Why is that?

public static void copyFile() {

    String sourceFilePath = "C://my_encoded.txt";
    String targetFilePath = "C://my_target.txt";

    InputStream inStream = null;
    OutputStream outStream = null;

    try {
        File sourceFile = new File(sourceFilePath);
        inStream = new FileInputStream(sourceFile);
        outStream = new FileOutputStream(targetFilePath);

        byte[] buffer = new byte[1024];
        int length;
        // copy the file content in bytes
        while ((length = inStream.read(buffer)) > 0) {
            outStream.write(buffer, 0, length);
        }
        System.out.println("File " + targetFilePath + " is copied successfully!");
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (inStream != null) inStream.close();
            if (outStream != null) outStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
My guess is that since the source file has the correct encoding, and since we read and write the raw bytes, it works fine. And since java.lang.String is UTF-16 by default, if we write it to a file without specifying an encoding, it would be read one byte at a time instead of two and hence produce garbage values. Is that correct, or am I completely wrong in my understanding?

Answer Source

You are copying the file byte by byte, so you don't need to care about character encoding.

As a rule of thumb:

Use the various InputStream and OutputStream implementations for byte-wise processing (like a file copy). There are some convenience methods that handle text directly, like PrintStream.println(). Be careful, because most of them use the platform-specific default encoding.

Use the various Reader and Writer implementations for reading and writing text.

If you need to convert between byte-wise and text processing use InputStreamReader and OutputStreamWriter with explicit file encoding.

Do not rely on the default encoding. The default character encoding is platform specific (e.g. Windows-ANSI aka Cp1252 for Windows, usually UTF-8 on Linux).
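To see why that last rule matters, compare the bytes the same String produces under two different encodings (a minimal sketch; the class name and sample text are mine):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String text = "Grüße"; // 5 characters, two of them non-ASCII
        // UTF-8 needs two bytes each for 'ü' and 'ß' -> 7 bytes total
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // ISO-8859-1 maps every character to a single byte -> 5 bytes
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length + " vs " + latin1.length); // prints "7 vs 5"
    }
}
```

If the writer and the reader disagree about which of these encodings was used, the non-ASCII characters come out as garbage.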

Example: If you need to read a UTF-8 text file:

BufferedReader reader = 
  new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"));
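And the mirror image for writing, which is exactly the case from the question: writing a java.lang.String to a file with an explicit encoding (a sketch; the file name and text are just examples):

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class WriteUtf8 {
    public static void main(String[] args) throws IOException {
        String text = "Grüße aus Java";
        // try-with-resources closes the writer (and the underlying stream) automatically
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("my_target.txt"),
                        StandardCharsets.UTF_8))) {
            writer.write(text);
        }
    }
}
```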

Avoid using a FileReader, because a FileReader always uses the default encoding.

A special case: If you need random access to a file you should use RandomAccessFile. With it you can read and write data blocks at arbitrary positions. You can read and write raw byte blocks or you can use convenience methods to read and write text. But you should read the documentation carefully. E.g. the methods readUTF() and writeUTF() use a modified UTF-8 encoding.
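A short sketch of the random-access case (the file name is illustrative): raw bytes and text can be mixed in one file, and seek() lets you jump back to any position.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class RandomAccessDemo {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw")) {
            raf.writeInt(42);       // 4 raw bytes at position 0
            raf.writeUTF("hello");  // modified UTF-8, prefixed with a 2-byte length
            raf.seek(0);            // jump back to the start of the file
            int number = raf.readInt();
            String text = raf.readUTF();
            System.out.println(number + " " + text); // prints "42 hello"
        }
    }
}
```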

InputStream, OutputStream, Reader, Writer and RandomAccessFile form the basic IO functionality, enough for most use cases. For advanced IO (e.g. memory-mapped files) have a look at the java.nio package.
