Maciej Ziarko - 3 months ago
Java Question

Reading website's contents into string

Currently I'm working on a class that reads the contents of a web page, given its URL, into a string. I'm just beginning my adventures with java.io and java.net, so I'd like some feedback on my design.

Usage:

TextURL url = new TextURL(urlString);
String contents = url.read();


My code:

package pl.maciejziarko.util;

import java.io.*;
import java.net.*;

public final class TextURL
{
    private static final int BUFFER_SIZE = 1024 * 10;
    private static final int ZERO = 0;
    private final byte[] dataBuffer = new byte[BUFFER_SIZE];
    private final URL urlObject;

    public TextURL(String urlString) throws MalformedURLException
    {
        this.urlObject = new URL(urlString);
    }

    public String read()
    {
        final StringBuilder sb = new StringBuilder();

        try
        {
            final BufferedInputStream in =
                    new BufferedInputStream(urlObject.openStream());

            int bytesRead = ZERO;

            while ((bytesRead = in.read(dataBuffer, ZERO, BUFFER_SIZE)) >= ZERO)
            {
                sb.append(new String(dataBuffer, ZERO, bytesRead));
            }
        }
        catch (UnknownHostException e)
        {
            return null;
        }
        catch (IOException e)
        {
            return null;
        }

        return sb.toString();
    }

    //Usage:
    public static void main(String[] args)
    {
        try
        {
            TextURL url = new TextURL("http://www.flickr.com/explore/interesting/7days/");
            String contents = url.read();

            if (contents != null)
                System.out.println(contents);
            else
                System.out.println("ERROR!");
        }
        catch (MalformedURLException e)
        {
            System.out.println("Check you the url!");
        }
    }
}


My question is: is this a good way to achieve what I want? Are there any better solutions?

I particularly didn't like sb.append(new String(dataBuffer, ZERO, bytesRead)); but I wasn't able to express it in a different way. Is it good to create a new String on every iteration? I suppose not.

Any other weak points?

Thanks in advance!

Answer

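Consider using URLConnection instead, since it gives you access to the response headers, including the Content-Type that declares the character encoding. You might also want to use IOUtils from Apache Commons IO to make reading the stream into a string easier. For example: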

URL url = new URL("http://www.example.com/");
URLConnection con = url.openConnection();
InputStream in = con.getInputStream();
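// The charset is a parameter of the Content-Type header, e.g. "text/html; charset=UTF-8".
// getContentEncoding() returns Content-Encoding (e.g. "gzip"), which is not a charset.
String contentType = con.getContentType();
String encoding = "UTF-8";  // fallback when the server declares no charset
int pos = contentType == null ? -1 : contentType.toLowerCase().indexOf("charset=");
if (pos >= 0) {
    String declared = contentType.substring(pos + "charset=".length()).split(";", 2)[0].trim().replace("\"", "");
    if (!declared.isEmpty()) encoding = declared;
}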
String body = IOUtils.toString(in, encoding);
System.out.println(body);

If you don't want to use IOUtils, I'd probably replace the IOUtils.toString(in, encoding) line with something like:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[8192];
int len = 0;
while ((len = in.read(buf)) != -1) {
    baos.write(buf, 0, len);
}
String body = new String(baos.toByteArray(), encoding);
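
On the question of creating a new String on every iteration: one alternative is to let an InputStreamReader do the decoding and append the decoded chars straight to the StringBuilder. The sketch below is only an illustration. The class name StreamText and the method readAll are made up for this example and are not part of the original code, and it assumes Java 7 or later for try-with-resources. A side benefit is that the reader holds on to incomplete byte sequences between reads, so a multi-byte character split across two reads is still decoded correctly, which calling new String(...) on raw chunks does not guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original TextURL class.
final class StreamText {

    // Decodes the whole stream with the given charset and closes it afterwards.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        // try-with-resources closes the reader (and the wrapped stream) even on errors.
        try (Reader reader = new InputStreamReader(in, charset)) {
            int len;
            while ((len = reader.read(buf)) != -1) {
                sb.append(buf, 0, len); // appends chars directly, no temporary String
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a network stream; the escapes spell a Polish word
        // with multi-byte UTF-8 characters.
        byte[] bytes = "\u017c\u00f3\u0142w".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```

With the snippets above, this could be called as StreamText.readAll(con.getInputStream(), Charset.forName(encoding)).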