NIlesh Lanke NIlesh Lanke - 5 days ago 4
C# Question

Display Unicode characters in converting Html to Pdf

I am using itextsharp dll to convert HTML to PDF.

The HTML has some Unicode characters like α, β... when I try to convert HTML to PDF, Unicode characters are not shown in PDF.

My function:

Document doc = new Document(PageSize.LETTER);

using (FileStream fs = new FileStream(Path.Combine("Test.pdf"), FileMode.Create, FileAccess.Write, FileShare.Read))
{
PdfWriter.GetInstance(doc, fs);

doc.Open();
doc.NewPage();

string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts),
"ARIALUNI.TTF");

BaseFont bf = BaseFont.CreateFont(arialuniTff, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

Font fontNormal = new Font(bf, 12, Font.NORMAL);

List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()),
new StyleSheet());
Paragraph p = new Paragraph {Font = fontNormal};

foreach (var element in list)
{
p.Add(element);
doc.Add(p);
}

doc.Close();
}

Answer

When dealing with Unicode characters and iTextSharp there's a couple of things you need to take care of. The first one you did already and that's getting a font that supports your characters. The second thing is that you want to actually register the font with iTextSharp so that its aware of it.

//Path to our font
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);

Now that we have a font we need to create a StyleSheet object that tells iTextSharp when and how to use it.

//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");

The one non-HTML part that you also need to do is set a special encoding parameter. This encoding is specific to iTextSharp and in your case you want it to be Identity-H. If you don't set this then it default to Cp1252 (WINANSI).

//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);

Lastly, we need to pass our stylesheet to the ParseToList method:

//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);

Putting that all together, from open to close you'd have:

doc.Open();

//Sample HTML
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.Append(@"<p>This is a test: <strong>α,β</strong></p>");

//Path to our font
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);

//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");
//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);

//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);

//Loop through each element, don't bother wrapping in P tags
foreach (var element in list) {
    doc.Add(element);
}

doc.Close();

EDIT

In your comment you show HTML that specifies an override font. iTextSharp does not spider the system for fonts and its HTML parser doesn't use font fallback techniques. Any fonts specified in HTML/CSS must be manually registered.

string lucidaTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "l_10646.ttf");
iTextSharp.text.FontFactory.Register(lucidaTff);
Comments