Peter Ream - 6 days ago
Java Question

java hex data in string

I have read a PDF file using PDFBox in Java, converted the data to text, and saved it in a String. I have found that a lot of the text data is surrounded by X'C2A0'. For instance:

X'436C75623AC2A04469616D6F6E64C2A0' Club:__Diamond__


__ is X'C2A0'

I want to search for "Club:__", then extract the text between the two __ markers ("Diamond" here). I have tried something like:

String TAG = "\\xC2A0"; // Tag in PDF

int pos = text.indexOf(TAG, positionInText);


but I never get any hits. How do I specify TAG?

EDIT:

Maybe some clarification is needed. I used PDFBox like this:

public void toText() throws IOException
{
    this.pdfStripper = null;
    this.pdDoc = null;
    this.cosDoc = null;

    file = new File(filePath);
    parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0

    parser.parse();
    cosDoc = parser.getDocument();
    pdfStripper = new PDFTextStripper();
    pdDoc = new PDDocument(cosDoc);
    pdDoc.getNumberOfPages();
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(10);

    // reading text from page 1 to 10
    // if you want to get text from full pdf file use this code
    // pdfStripper.setEndPage(pdDoc.getNumberOfPages());

    text = pdfStripper.getText(pdDoc);
}


text is a field defined as String. This text String is what I am trying to parse.

Answer

It's not completely clear from your question whether the string you are searching is itself hex-encoded, or is a normal character string whose bytes in the file include the two-byte sequence 0xC2 0xA0.

Assuming the latter case, the byte sequence 0xC2 0xA0 in the file is the UTF-8 encoding of the Unicode code point U+00A0, the non-breaking space, which corresponds to the &nbsp; entity in HTML.
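If you want to check that relationship yourself, here is a small sketch that encodes a string as UTF-8 and prints the bytes as hex. The names NbspBytes and utf8Hex are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class NbspBytes {
    // Returns the UTF-8 bytes of s as upper-case hex digits.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // %02X formats a negative byte as its unsigned two-digit value
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00A0 is one char in the String but two bytes in UTF-8
        System.out.println(utf8Hex("\u00A0")); // prints C2A0
        System.out.println(utf8Hex("Club:\u00A0Diamond\u00A0"));
    }
}
```

So a hex dump that shows C2A0 is consistent with a single non-breaking space character in the decoded text.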

If the file contains these two-byte sequences, then once the text is read into your Java string (assuming the byte stream was decoded as UTF-8), each sequence becomes the single character U+00A0, which you can write as "\u00A0" in Java source. The literal "\\xC2A0" is just the six characters \xC2A0, so indexOf never finds it.
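A minimal sketch of the indexOf approach from the question, assuming the string returned by getText really does contain U+00A0 around each value. NbspIndexOf and valueAfter are hypothetical names, not part of PDFBox.

```java
class NbspIndexOf {
    // A single char, NO-BREAK SPACE (UTF-8 bytes C2 A0)
    static final String TAG = "\u00A0";

    // Returns the text between the TAG that follows label and the next TAG,
    // or null if the label or the closing TAG is not found.
    static String valueAfter(String text, String label) {
        int labelPos = text.indexOf(label + TAG);
        if (labelPos < 0) {
            return null;
        }
        int start = labelPos + label.length() + TAG.length();
        int end = text.indexOf(TAG, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Club:\u00A0Diamond\u00A0";
        System.out.println(valueAfter(text, "Club:")); // prints Diamond
    }
}
```

If this returns null on your real data, it may be worth printing the code point of the character right after "Club:" (for example with text.codePointAt) to see what the extractor actually produced.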

You should be able to write a regular expression to find data delimited by pairs of these.
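One possible pattern, as a sketch only: match the label, a non-breaking space, a captured run of characters that are not non-breaking spaces, and a closing non-breaking space. NbspRegex and extract are illustrative names; the pattern assumes the value itself never contains U+00A0.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NbspRegex {
    // Returns the value delimited by U+00A0 right after label, or null if absent.
    static String extract(String text, String label) {
        // \\u00A0 reaches the regex engine as \u00A0, which it reads as U+00A0
        Pattern p = Pattern.compile(
                Pattern.quote(label) + "\\u00A0([^\\u00A0]*)\\u00A0");
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Club:\u00A0Diamond\u00A0";
        System.out.println(extract(text, "Club:")); // prints Diamond
    }
}
```

Pattern.quote keeps any regex metacharacters in the label from being interpreted, which matters if a label ever contains something like a dot or parenthesis.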
