TheSolider - 1 month ago
Perl Question

"Too late for "-C" option" error With Perl and Shell scripts

I have a jar application that has several functions, one of which is to convert from HTML to XML. When I try to run a simple command such as:

java -jar lt4el-cmd.jar send -l en "l2:https://en.wikipedia.org/wiki/Personal_computer"


I get the following errors:

ERROR [Thread-1]: html2base/html2base-wrapper.sh: Too late for "-C" option at html2base/html2xml.pl line 1.
/tmp/lpc.30872.html: failed
cat: /tmp/lpc.30872.xml: No such file or directory
(LpcControl.java:229)
ERROR [Thread-1]: ana2ont/ana2ont.sh ${lang}: -:1: parser error : Document is empty
-:1: parser error : Start tag expected, '<' not found
Tokenization/tagging failed
^
-:1: parser error : Document is empty
unable to parse -
-:1: parser error : Document is empty
unable to parse -
(LpcControl.java:229)
ERROR [Thread-1]: Error in conversion: Error running conversion script (ana2ont/ana2ont.sh ${lang}): 6 (AppInterface.java:159)


This is the html2base-wrapper.sh script, which seems to be where the first error occurs.

#!/bin/bash

if [ "$1" == "check" ]; then
. common.sh
check_binary perl || exit 1
check_perl_module HTML::TreeBuilder || exit 1
check_perl_module XML::LibXML || exit 1
check_binary tidy || exit 1
check_binary xmllint || exit 1
check_binary xsltproc || exit 1
exit
fi

cat >"$TMPDIR/lpc.$$.html"
html2base/html2base.sh -d html2base/LT4ELBase.dtd -x html2base/LT4ELBase.xslt -t "$TMPDIR/lpc.$$.html" >&2
cat "$TMPDIR/lpc.$$.xml";
rm -f "$TMPDIR"/lpc.$$.{ht,x}ml


And the html2base.sh script:

#!/bin/bash
#
# Sample script for automated HTML -> XML conversion
#
# Miroslav Spousta <spousta@ufal.mff.cuni.cz>
# $Id: html2base.sh 462 2008-03-17 08:37:14Z qiq $

basedir=`dirname $0`;

# constants
HTML2XML_BIN=${basedir}/html2xml.pl
ICONV_BIN=iconv
TIDY_BIN=tidy
XMLLINT_BIN=xmllint
XSLTPROC_BIN=xsltproc
DTDPARSE_BIN=dtdparse
TMPDIR=/tmp

# default values
VERBOSE=0
ENCODING=
TIDY=0
VALIDATE=0
DTD=${basedir}/LT4ELBase.dtd
XSLT=${basedir}/LT4ELBase.xslt

usage()
{
echo "usage: html2base.sh [options] file(s)"
echo "HTML -> XML conversion script."
echo
echo " -e, --encoding=charset Convert input files from encoding to UTF-8 (none)"
echo " -d, --dtd=file DTD to be used for conversion and validation ($DTD)"
echo " -x, --xslt=file XSLT to be applied after conversion ($XSLT)"
echo " -t, --tidy Run HTMLTidy on input HTML files"
echo " -a, --validate Validate output XML files"
echo " -v, --verbose Be verbose"
echo " -h, --help Print this usage"
exit 1;
}

OPTIONS=`getopt -o e:d:x:tahv -l encoding:,dtd:,xslt:,tidy,validate,verbose,help -n 'convert.sh' -- "$@"`
if [ $? != 0 ]; then
usage;
fi
eval set -- "$OPTIONS"
while true ; do
case "$1" in
-e | --encoding) ENCODING=$2; shift 2 ;;
-d | --dtd) DTD=$2; shift 2 ;;
-x | --xslt) XSLT=$2; shift 2 ;;
-t | --tidy) TIDY=1; shift 1;;
-a | --validate) VALIDATE=1; shift 1;;
-v | --verbose) VERBOSE=1; shift 1 ;;
-h | --help) usage; shift 1 ;;
--) shift ; break ;;
*) echo "Internal error!" ; echo $1; exit 1 ;;
esac
done

if [ $# -eq 0 ]; then
usage;
fi

DTD_XML=`echo "$DTD"|sed -e 's/\.dtd/.xml/'`
if [ "$VERBOSE" -eq 1 ]; then
VERBOSE=--verbose
else
VERBOSE=
fi

# create $DTD_XML if necessary
if [ ! -f "$DTD_XML" ]; then
if ! $DTDPARSE_BIN $DTD -o $DTD_XML 2>/dev/null; then
echo "cannot run dtdparse, cannot create $DTD_XML";
exit 1;
fi;
fi

# process file by file

total=0
nok=0
while [ -n "$1" ]; do
file=$1;
if [ -n "$VERBOSE" ]; then
echo "Processing $file..."
fi
f="$file";
result=0;
if [ -n "$ENCODING" ]; then
$ICONV_BIN -f "$ENCODING" -t utf-8 "$f" -o "$file.xtmp"
result=$?
error="encoding error"
f=$file.xtmp
fi
if [ "$result" -eq 0 ]; then
if [ "$TIDY" = '1' ]; then
$TIDY_BIN --force-output 1 -q -utf8 >"$file.xtmp2" "$f" 2>/dev/null
f=$file.xtmp2
fi
out=`echo $file|sed -e 's/\.x\?html\?$/.xml/'`
if [ "$out" = "$file" ]; then
out="$out.xml"
fi
$HTML2XML_BIN --simplify-ws $VERBOSE $DTD_XML -o "$out" "$f"
result=$?
error="failed"
fi
if [ "$result" -eq 0 ]; then
$XSLTPROC_BIN --path `dirname $DTD` $XSLT "$out" |$XMLLINT_BIN --noblanks --format -o "$out.tmp1" -
result=$?
error="failed"
mv "$out.tmp1" "$out"
if [ "$result" -eq 0 -a "$VALIDATE" = '1' ]; then
tmp=`dirname $file`/$DTD
delete=0
if [ ! -f $tmp ]; then
cp $DTD $tmp
delete=1
fi
$XMLLINT_BIN --path `dirname $DTD` --valid --noout "$out"
result=$?
error="validation error"
if [ "$delete" -eq 1 ]; then
rm -f $tmp
fi
fi
fi
if [ "$result" -eq 0 ]; then
if [ -n "$VERBOSE" ]; then
echo "OK"
fi
else
echo "$file: $error "
nok=`expr $nok + 1`
fi
total=`expr $total + 1`
rm -f $file.xtmp $file.xtmp2
shift;
done
if [ -n "$VERBOSE" ]; then
echo
echo "Total: $total, failed: $nok"
fi


And the beginning of the html2xml.pl file:

#!/usr/bin/perl -W -C

# Simple HTML to XML (subset of XHTML) conversion tool. Should always produce a
# valid XML file according to the output DTD file specified.
#
# Miroslav Spousta <spousta@ufal.mff.cuni.cz>
# $Id: html2xml.pl 461 2008-03-09 09:49:42Z qiq $

use HTML::TreeBuilder;
use HTML::Element;
use HTML::Entities;
use XML::LibXML;
use Getopt::Long;
use Data::Dumper;
use strict;


I can't seem to figure out where the problem is. And what exactly does ERROR [Thread-1] mean?
Thanks

Answer Source

The error came from the html2xml.pl script, as other users rightly pointed out. I'm running Ubuntu 16.04.2, which ships with perl 5.22 by default. As this post mentions, since perl 5.10.1 a -C option on the #! line must also be given on the command line when the script is executed, and I wasn't sure how to do that because I was launching everything through a jar file. Instead, I installed perlbrew, used it to get an earlier version of perl, and changed the first line of my perl script to:

#!/usr/bin/path/to/perlbrew/perl -W -C

# Simple HTML to XML (subset of XHTML) conversion tool. Should always produce a
# valid XML file according to the output DTD file specified.
#
# Miroslav Spousta <spousta@ufal.mff.cuni.cz>
# $Id: html2xml.pl 461 2008-03-09 09:49:42Z qiq $

This might also come in handy in setting up shell scripts when using perlbrew.
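
For anyone who would rather stay on the system perl, here is a minimal sketch of the other route the error message points to: giving -C on the command line as well as on the #! line. It is not taken from the lt4el scripts. demo.pl is a hypothetical stand-in for html2xml.pl, and the option is written out as -CSD in both places so that the letters match exactly, which is a narrower setting than the bare -C in the original (perlrun describes bare -C as equivalent to -CSDL). It assumes perl 5.10.1 or later is on PATH.

```shell
#!/bin/bash
# Sketch only: demo.pl is a hypothetical stand-in for html2base/html2xml.pl.
# Assumes a perl of version 5.10.1 or later is on PATH.

unset PERL_UNICODE            # the environment equivalent of -C would mask the effect
workdir=$(mktemp -d)

# A tiny script whose #! line carries -C with the letters spelled out.
cat >"$workdir/demo.pl" <<'EOF'
#!/usr/bin/perl -W -CSD
print "converted\n";
EOF

# 1) -C appears only on the #! line: perl is expected to refuse to run it.
shebang_only=$(perl "$workdir/demo.pl" 2>&1)
shebang_only_status=$?

# 2) The same letters are repeated on the command line: the script should run.
with_c=$(perl -W -CSD "$workdir/demo.pl" 2>&1)
with_c_status=$?

echo "shebang only : exit $shebang_only_status: $shebang_only"
echo "with -CSD    : exit $with_c_status: $with_c"

rm -rf "$workdir"
```

Applied to the scripts above, this would presumably mean spelling out the letters on the #! line of html2xml.pl and having html2base.sh call perl -W -CSD "$HTML2XML_BIN" ... instead of executing $HTML2XML_BIN directly. That change is untested against the lt4el jar, so treat it as an assumption rather than a confirmed fix.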

Thanks to everyone who contributed.