This was an experiment to discover the most common file structure used by eBooks, for those times when you want to tweak the code directly, rather than rely on eBook creation software. It focused on ePub eBooks for Android devices.
Some basic technical information:
An ePub book is basically a zipped container of files holding the content, images, styles and metadata. So you can unzip the ePub file using a free tool like 7-Zip, and inspect the contents.
An ePub file ends with the file extension .epub, whether it’s ePub2 or ePub3. (A book with digital rights management may instead be delivered as an .acsm file, which is a download token for Adobe Digital Editions rather than the book itself.)
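ePub2 is based on XHTML 1.1.
ePub3 is based on HTML5, written in its XHTML serialisation (which is why ePub3 books still contain XHTML content pages).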
The official standard is maintained by http://idpf.org/
There’s a good one-page summary at http://en.wikipedia.org/wiki/EPUB
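As an alternative to unzipping by hand, the container can be inspected from a script. The following is a minimal sketch using Python's standard zipfile module. The function names and the tiny in-memory sample book are made up for illustration; with a real book you would pass the path to the .epub file instead.

```python
import io
import zipfile


def list_epub_contents(epub_source):
    """Return the names of all files stored inside an ePub archive.

    epub_source may be a file path or a file-like object.
    """
    with zipfile.ZipFile(epub_source) as archive:
        return archive.namelist()


def build_sample_epub():
    """Build a tiny in-memory archive so the sketch runs without a real book.

    The file names mirror a typical layout; the contents are placeholders.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry is written first and stored uncompressed.
        archive.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/content.opf", "<package/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_epub_contents(build_sample_epub()):
        print(name)
```

Calling list_epub_contents("mybook.epub") on a real file (the file name here is hypothetical) gives the same kind of listing as the breakdowns further down.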
The main differences between ePub2 and ePub3 were:
ePub3 is based on HTML5 instead of XHTML 1.1
Supports CSS 2.1 plus some elements of CSS3
Supports OpenType and WOFF fonts
Supports HTML5 audio and video
Supports media overlays
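Additional features such as MathML
ncx file deprecated in favour of a nav item (an XHTML navigation document flagged with the nav property in the manifest)
Spine itemref elements can carry a properties attribute
A dcterms:modified metadata entry is required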
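Several of these differences can be checked by reading the .opf package file. The sketch below assumes the usual convention that the package element carries version="2.0" for ePub2 and version="3.0" for ePub3. The sample OPF text, its identifier and the function name summarise_opf are invented for illustration and are not taken from any of the sample books.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"

# Hypothetical, minimal ePub3 package file used only to exercise the function.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="book-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="book-id">urn:uuid:hypothetical-sample</dc:identifier>
    <dc:title>Sample</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2020-01-01T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
  </manifest>
  <spine>
    <itemref idref="nav"/>
  </spine>
</package>
"""


def summarise_opf(opf_text):
    """Report the package version, the dcterms:modified value and the nav item."""
    root = ET.fromstring(opf_text)
    modified = [
        meta.text
        for meta in root.iter(f"{{{OPF_NS}}}meta")
        if meta.get("property") == "dcterms:modified"
    ]
    nav_items = [
        item.get("href")
        for item in root.iter(f"{{{OPF_NS}}}item")
        # properties is a space-separated list, so split before testing.
        if "nav" in (item.get("properties") or "").split()
    ]
    return {
        "version": root.get("version"),
        "modified": modified[0] if modified else None,
        "nav": nav_items[0] if nav_items else None,
    }


if __name__ == "__main__":
    print(summarise_opf(SAMPLE_OPF))
```

For an ePub2 package the same function would be expected to return None for both the modified and nav entries, since those belong to ePub3.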
The sample eBooks were sourced from various providers, including purchased eBooks, demo guides from popular readers, and official test documents.
This is still in progress…
Content Breakdown
UB Reader Demo
Root files – OEBPS, META-INF, mimetype
OEBPS contains toc.ncx, content.opf
META-INF contains encryption.xml, container.xml
Calibre Demo
Root files – text, styles, META-INF, images, toc.ncx, mimetype, content.opf
META-INF contains container.xml, calibre_bookmarks.txt
Gutenberg’s Magic Catalog
Root files – content, META-INF, toc.ncx, mimetype, metadata.opf
META-INF contains container.xml
Gutenberg download of Alice in Wonderland
Root files – content, META-INF, mimetype
META-INF contains container.xml
Content contains toc.ncx, content.opf, wrap.html, css files, HTML chapter pages, images
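The one file that appears in the same place in every sample is META-INF/container.xml, while the .opf file moves around and changes name (content.opf, metadata.opf). That works because container.xml points to the package file. The sketch below shows how that pointer can be read. The sample container text is a made-up minimal example and the function name find_opf_paths is hypothetical.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# Hypothetical container.xml pointing at an OPF file inside an OEBPS folder.
SAMPLE_CONTAINER = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def find_opf_paths(container_xml):
    """Return the full-path of every rootfile listed in container.xml."""
    root = ET.fromstring(container_xml)
    return [
        rootfile.get("full-path")
        for rootfile in root.iter(f"{{{CONTAINER_NS}}}rootfile")
    ]


if __name__ == "__main__":
    print(find_opf_paths(SAMPLE_CONTAINER))
```

Reading the path this way, rather than assuming a fixed folder such as OEBPS, should cope with each of the layouts listed here.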
The Little Prince
ePub2
Generated from Adobe InDesign for GoogleEbooks
Sigil indicated some HTML pages were not well-formed.
XHTML content pages
OPF XML – http://www.idpf.org/2007/opf
Root files – OEBPS, META-INF, mimetype
META-INF contains container.xml
OEBPS contains volume.opf, _toc_ncx_.ncx, _page_map_.xml
Antonio’s Tale
ePub3
Generated from Infogrid.
XHTML content pages
ePub XML – http://www.idpf.org/2007/ops
Meta tags specify the width, layout, orientation and paging.
Root files – OPS, META-INF, mimetype
META-INF contains container.xml
OPS contains package.opf, TOC.xhtml, XHTML chapter pages, css/fonts/jss folders
Thomas Cole – Sample ePub3 demo
ePub3
Generated manually.
XHTML content pages
ePub XML – http://www.idpf.org/2007/ops
Content pages use CSS to specify fixed width or reflowable.
Calibre test
ePub2
Generated by Calibre
HTML content pages
OPF XML – http://www.idpf.org/2007/opf