abbood abbood - 7 months ago 9
Java Question

why do certain zip files have unknown file content

background



I stumbled across this problem here

analysis



according to the java docs for ZipEntry, sometimes requesting the size of a zipfile entry simply returns -1

However, running the command

$ unzip -l b17c024e-89f1-42f7-a546-91d46610cedb.epub
Archive: b17c024e-89f1-42f7-a546-91d46610cedb.epub
Length Date Time Name
-------- ---- ---- ----
20 01-27-12 11:17 mimetype
2378 04-20-12 10:12 OEBPS/hayat-ghayr.html
6436 02-06-12 11:06 OEBPS/content.opf
112579 01-27-12 11:25 OEBPS/images/978-614-425-313-7-hayat-ghayr-cover.png
182575 01-27-12 11:25 OEBPS/images/978-614-425-313-7-hayat_fmt.png
7757 01-27-12 11:21 OEBPS/template.css
5643 01-27-12 11:18 OEBPS/hayat-ghayr-2.html
20144 01-27-12 11:17 OEBPS/hayat-ghayr-1.html
65543 01-27-12 11:17 OEBPS/hayat-ghayr-3.html
59434 01-27-12 11:17 OEBPS/hayat-ghayr-4.html
66768 01-27-12 11:17 OEBPS/hayat-ghayr-5.html
49117 01-27-12 11:17 OEBPS/hayat-ghayr-6.html
65346 01-27-12 11:17 OEBPS/hayat-ghayr-7.html
74196 01-27-12 11:17 OEBPS/hayat-ghayr-8.html
73998 01-27-12 11:17 OEBPS/hayat-ghayr-9.html
61031 01-27-12 11:17 OEBPS/hayat-ghayr-10.html
68297 01-27-12 11:17 OEBPS/hayat-ghayr-11.html
72084 01-27-12 11:17 OEBPS/hayat-ghayr-12.html
2386 01-27-12 11:17 OEBPS/hayat-ghayr-13.html
61132 01-27-12 11:17 OEBPS/hayat-ghayr-14.html
46320 01-27-12 11:17 OEBPS/hayat-ghayr-15.html
32673 01-27-12 11:17 OEBPS/hayat-ghayr-16.html
88584 01-27-12 11:17 OEBPS/hayat-ghayr-17.html
56474 01-27-12 11:17 OEBPS/hayat-ghayr-18.html
52840 01-27-12 11:17 OEBPS/hayat-ghayr-19.html
80022 01-27-12 11:17 OEBPS/hayat-ghayr-20.html
50781 01-27-12 11:17 OEBPS/hayat-ghayr-21.html
2765 01-27-12 11:17 OEBPS/hayat-ghayr-22.html
265 01-27-12 11:17 META-INF/container.xml
54942 01-27-12 11:17 OEBPS/images/277.png
5549 01-27-12 11:17 OEBPS/toc.ncx
1072 03-23-12 13:28 iTunesMetadata.plist
-------- -------
1529151 32 files


shows that there is a content length for all the chapters..
but also, if we unzip the same file and rezip it again with stronger compression.. the zipFile java command returns the proper content size

question



is this the zip library's fault or the original compression fault? how can we know?

follow up question



see how to access a zipEntry from a streamed zipfile

Answer

ZIP stores meta data inside the archive in a few different places ("local file header", "central directory" and sometimes a "data descriptor"). Only the "local file header" is in front of the file's content - the "central directory" is at the very end of the archive. Only the "central directory" holds the full truth, it is perfectly valid to not specify any size in the "local file header".

When using ZipArchiveInputStream you obtain the ZipEntry as soon as the "local file header" has been read (because the underlying stream may not be seekable), so the size information may be missing. ZipFile uses RandomAccessFile under the covers and can read the "central directory" - as does unzip and friends - so they know more than ZipArchiveInputStream.