Ebooks Made From PDF

I'm often asked why making an eBook from a PDF is so expensive.  The truth is, it's time-consuming and labor-intensive.  People think that's nuts--they can see that option, right there in Adobe Acrobat or PDF995 or whatever PDF reader software they're using, that says "export file to Word" or "export to HTML" or, from Google Docs, "export to ePUB," and they think, "well, hell, how hard can it be?"

The answer is--very.

The actual steps for us to make an ebook from a PDF are:

  1. We scan the PDF, using ABBYY FineReader commercial scanning software.
  2. We OCR the PDF, also using ABBYY.
  3. When we're done with that step, we have a raw scan output Word file.  This Word file has the same page layout and page breaks as the original PDF.  That's important and you'll see why in a moment.
  4. That Word file is full of red-marked/lettered text.  That text indicates where ABBYY suggests that there's a scanning error.
  5. We go through, fixing all of those, checking them against the original PDF.
  6. When we've completed that step, we export that edited Word file to PDF format.
  7. We then run a COMPARE program, that compares the original PDF ("PDF1") with this new, from-the-Word-file PDF ("PDF2").
  8. We check and correct every single comparison "twig" that says that there's something different between PDF1 and PDF2.
  9. Then we take the revised Word file, with those corrections made, and,
  10. We export another PDF ("PDF3"), and,
  11. Yup, we run a comparison, now, between PDF3 and PDF1.
  12. If we get more compare discrepancies, we lather-rinse-repeat, correcting those discrepancies in the Word file, from the original PDF.
  13. And we continue this process, over and over, until there are no discrepancy reports between PDFx and PDF1.
  14. At that point--we are finally ready to start the eBook formatting process, which means we start by cleaning the "new" Word file, exporting it to HTML and starting at the same place that we would have been, if we'd had a Word file to start with.
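
For the technically curious, here is a rough sketch of what the compare-and-correct loop in steps 7 through 13 amounts to. It is not our actual compare program. It assumes the text has already been pulled out of each PDF, it uses Python's standard difflib module, and the sample sentences and the name discrepancies are made up for illustration.

```python
import difflib


def discrepancies(original_text, candidate_text):
    """List every word-level difference between two extracted texts."""
    original_words = original_text.split()
    candidate_words = candidate_text.split()
    matcher = difflib.SequenceMatcher(
        None, original_words, candidate_words, autojunk=False
    )
    return [
        (tag, " ".join(original_words[i1:i2]), " ".join(candidate_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up sample text standing in for PDF1 and for successive Word exports.
pdf1 = "The quick brown fox jumps over the lazy dog."
exports = [
    "The quick brovvn fox jumps over the 1azy dog.",  # PDF2: two OCR slips
    "The quick brown fox jumps over the 1azy dog.",   # PDF3: one slip left
    "The quick brown fox jumps over the lazy dog.",   # PDF4: clean
]

rounds = 0
for export in exports:
    rounds += 1
    found = discrepancies(pdf1, export)
    print(f"PDF{rounds + 1} vs PDF1: {len(found)} discrepancies {found}")
    if not found:
        break  # nothing left to correct, so formatting can begin
```

In real life, a person checks each reported difference against the original PDF and corrects the Word file by hand between rounds; the list of exports above only simulates that.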

And that is why, especially for very long, complexly-laid out PDFs, formatting into an eBook is so expensive.  The automatic "export to Word" functions, whether from Acrobat, other PDF software or those online websites, all produce utter nonsense.  What comes out looks okay on the surface, but it's broken underneath--where eBooks live.

Heck, don't believe me--export your PDF into Word and then upload that at the KDP, and preview the resulting eBook.  Horrified? Yup, that's how that goes.  Trust me, we don't do this for fun!  If there were a faster, cheaper way to do this right--making ebooks from PDFs--I can assure you, we'd be the very first people to use it!

We Scan your PDF

So, how do we convert your PDF? 95% of the time, after we try a few things, we end up running OCR (Optical Character Recognition) software on it. Yes, just as if it were a print book. Believe it or not, this is faster and less expensive for you than if we use Adobe’s tools to “export to Word” or “export to HTML.” (If you've seen these tools, or those online "convert your file" websites, then before you invest your time or money, please see our article here in the FAQ on that topic: There Is No Magic Way to Convert a PDF to eBook Form.) We do try the export tools, of course, on every book, to see if we can save the client money. But usually, OCR is the best way, and produces the cleanest Word output.

Then we run comparison software which checks every single character in the output against every single character in your original PDF. Our accuracy is 99.95%, guaranteed. No conversion is ever 100%. This is one of the reasons that every client gets a review copy, to check. If we make errors in the conversion, we fix them at no charge to you.
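
To put that 99.95% figure in concrete terms, here is a small arithmetic sketch. The 500,000-character book length is a made-up example, and the function names are ours, invented for illustration.

```python
def allowed_errors(character_count, errors_per_10k=5):
    """Most wrong characters a text can have at the given error rate.

    99.95% accuracy is the same as 5 wrong characters per 10,000.
    """
    return character_count * errors_per_10k / 10_000


def measured_accuracy(character_count, wrong_characters):
    """Accuracy as the percentage of characters that match the original."""
    return 100 * (character_count - wrong_characters) / character_count


book_characters = 500_000  # hypothetical book length, for illustration only
print(allowed_errors(book_characters))
print(measured_accuracy(book_characters, 250))
```

In other words, 0.05% of a 500,000-character book is 250 characters, which is why the review copy matters.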

Our output format from the OCR is Word. Once the PDF has been OCR’ed and put into Word, we export that Word file to HTML. From that point forward, the process is identical to the process for Word or other word-processed files, as described in the "From Word" section here in the FAQ.

Do Those Auto-Magic Conversion Websites Work?

Yes, we know that you’ve probably seen websites and ads on Google saying “Convert your PDF to Word Now!” or other such promises.  Mostly, this is snake oil.  We have been doing this for four years, and we have never found an “easy" way to convert PDFs to HTML to make an eBook.  We’ve never even found a good way to make a clean Word file from a PDF, at least, not with "push-button magic."

We get a lot of Word files in here for quoting that were obviously made from PDFs using these snake-oil conversion sites or, worse, software you have to buy.  Please don't fall for this.  When we look at the “underneath” in these files, they're harder and more time-consuming to clean than just making the book from the PDF file you had.  Unless you have a lot of expert knowledge of Word, please don’t try to use these converters and then send us the resulting Word file.  With these "auto-magic" conversions, what you see is not what you get; like an iceberg, you see only a tiny piece at the surface.  The submerged part, however--the part that actually makes your eBook--is a disaster.  You can read a bit about this in our article, Why PDF's Really Don't Work At Kindle.

What we do to convert PDF to Word/HTML is very time-consuming, and takes a lot of knowledge to do correctly.  We don't say this to fool you.  We say this to save you money and time.  The files made by these “EZ convert" programs are quite simply no good, and are worthless for conversion.  You're better off simply giving us the PDF, and allowing us to work from that, for accuracy of conversion.

What's Wrong With My PDF→Word Conversion?  It Looks Perfect!

When people look at the results of automated "PDF to Word®" conversion sites, or software, different people see different things.  To an author, who only has a PDF copy of a book from her backlist, it looks like manna from heaven—a Word® file that looks perfect!  To an ebook professional, however, it’s like the movie Lake Placid—a serene, gorgeous surface, beneath which danger lurks.   

What Happens?  Why Doesn't PDF Work at the KDP?

You’ve probably heard people talk about how they tried to upload a PDF at the KDP®, or tried to use a program like Adobe Acrobat® to "make" a Word® file from their PDF, only to get wholly unexpected and dismal results.  This happens a lot, particularly to people who don’t have expertise in Word®.  When you use a program like Acrobat®, or one of those online conversion web sites, the file that you get back will often look exactly like you think it should.  And you’ll think it’s great, and be thrilled.  But, underneath, where it counts--where Word's invisible codes tell text what it is and how to display--lurks an unholy mess waiting to bite you when you try to actually use that file, rather than just looking at it.  (Cue the soundtrack from Jaws...bum-bum-bum-bump-bump-bump-bump-bum....)

Let's Look At An Example:

Let’s look at one real-life example, to kick off the discussion.  This prospective client came to us, having exported his “Word®” file from PDF and then uploaded the file to Amazon®.  As he ended up coming to us, you can already predict (plot spoiler ahead!) that the results weren’t good.

When a display or layout program like Acrobat® tries to export a Word® file, it tries to “tell” Word® what it thinks it is seeing.  Because a PDF is not a word-processed file, it uses a completely different set of codes, and different types of codes, to achieve the layout that you see when you view it.  This is because Acrobat® is a layout program, not a word processor.  Acrobat® and other layout programs only care about how the end product looks; word processors care about what the elements (words, sentences, paragraphs) in a document are.  Do you remember the old parable about three blind men and an elephant?  Well, the Acrobat® conversion to Word® format is a bit like that; Acrobat® tells Word® what it thinks it sees, what it interprets as your intent--not what Word® actually needs to “hear.”  Let’s look at how Acrobat® “sees” a page of text, to the naked eye:

The page as it looked, when it came out of the scan. Fine, right? Hmmm...look again. What are those squiggly green things?
When the pilcrows are displayed, now you can see what's going on.
The results--not what you expected, are they?

Figure 1 is one of the pages, in Word, that was the result of an “automatic” export from Adobe Acrobat® to MS Word.

This small section looks fine, right?  But those of you with eagle-eyes may have noticed that something isn’t quite right—why is the first word in each line underlined with the dreaded squiggly-green line?  Why does Word® think that’s a grammar error?  To see why that’s happening, let’s look at this exact same page with “reveal codes” turned on (what you see if you click the pilcrow icon ¶ on your Word® 2007-2010 Ribbon, or in the main toolbar for older editions):

Now you can see what’s really going on.  When Acrobat® exported that file into Word®, it “thought” that every line was its own paragraph.  That’s right—if you tried to upload this file at the KDP, every single line you see there would come out, in Kindle, as its own paragraph, not words inside a much larger paragraph.  That’s what Word® is trying to tell you, with those squiggly green lines—it’s trying to say, “Hey, you didn’t capitalize the first letter of this new sentence.”  Word® thinks that those first words on each line are actually the first words in a new sentence.

Why does it think that?  Because immediately before those words, Word® obeys a pilcrow command (at the end of each line, over there in the right-hand margin).   That pilcrow instructs Word, “I am marking the end of a paragraph.”  Word® knows that the very next word is the first word of a new paragraph, so it must be the first word of a new sentence, and therefore, should be capitalized.  That’s what those little pilcrows, and the little squiggly green lines are telling you:  Here There Be Dragons!
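
If you're comfortable with a little code, the warning sign that Word® is giving you can be sketched like this. It's only a rough heuristic, not how Word's grammar checker actually works: it assumes the paragraphs have already been pulled out of the Word file as plain strings, and the sample text and the function name suspicious_breaks are invented for illustration.

```python
# Characters that plausibly end a sentence (a heuristic, not a grammar rule).
TERMINATORS = (".", "!", "?", ":", '"', "\u201d")


def suspicious_breaks(paragraphs):
    """Return the indexes of paragraphs that look like mid-sentence breaks.

    A paragraph is flagged when it starts with a lowercase letter, or when
    the paragraph before it does not end with sentence-ending punctuation.
    Headings will also be flagged, so a person still has to review the list.
    """
    flagged = []
    for index in range(1, len(paragraphs)):
        previous = paragraphs[index - 1].rstrip()
        current = paragraphs[index].lstrip()
        if not previous or not current:
            continue  # skip empty paragraphs
        starts_lowercase = current[0].islower()
        ends_mid_sentence = not previous.endswith(TERMINATORS)
        if starts_lowercase or ends_mid_sentence:
            flagged.append(index)
    return flagged


# Made-up sample: one real paragraph that an export chopped into lines.
sample = [
    "It was a long walk from the station to the",
    "house, and by the time she arrived the",
    "lamps were lit. Nobody answered the door.",
    "She knocked again.",
]
print(suspicious_breaks(sample))
```

Here the second and third "paragraphs" are flagged, because each one picks up in the middle of a sentence.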

But:  Won't It Look Fine, Anyway?  Without Those Cruddy Pilcrows?

When this file was exported to Kindle by the prospective client, what he saw, to his horror, was this (I’m simulating the actual output, starting with the first line of the “paragraph” near the bottom of the section shown that starts with, “Some of the nuns…”):

Obviously—this was not what he’d had in mind. This was prose, not poetry or some type of experimental Haiku. He’d expected his Kindle book would look like Figure 1…but what he got was far, far different, making the book unreadable and thus, unsaleable. Why did this happen?

The way a word processor works is actually pretty simple. Every single element in a word-processed file, whether it's a paragraph, or an italicized word or phrase, or smallcaps, has invisible tags surrounding it that identify it to the program and tell it how to display. More importantly, those codes (tags) tell the program what it is. (A word, a paragraph, etc.) An example of how this looks in code (HTML, which is used to make eBooks, and which marks up text in much the same way that word processors do behind the scenes) is this:

<p class="indent"><i>This is a paragraph in italics, in HTML</i>, which is the "language" used to create Kindle books.</p>

What this looks like, on a Kindle device:

This is a paragraph in italics, in HTML, which is the “language” used to create Kindle books.

A word or phrase in italics, for example, is preceded by an opening tag, which looks like this: <i>. The program is told to stop italicizing the words by a closing tag, which looks like this: </i>. The same idea holds whether it’s Word, WordPerfect, OpenOffice, LibreOffice…well, you get the drift.

In the above example, you see me tell the program that the paragraph starts with the word “This,” after the opening paragraph tag, and ends with the period after the word “books.” The italics styling starts with the word “This,” and stops after the word, “HTML.” In most word-processors, most of this happens invisibly to you, and can only be revealed using either Word’s Styles menu, or by working in the actual code, as most ebook conversion companies do. This is the “black box” effect; magic happens behind the screen that makes stuff “just happen.”
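
To show that the tags really do carry the structure, here is a short sketch that reads the example paragraph above with the HTML parser that ships with Python and lists what it finds. This is only an illustration, not a tool we use in production, and the class name StructureLister is made up.

```python
from html.parser import HTMLParser


class StructureLister(HTMLParser):
    """Record each opening tag, closing tag and run of text, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data))


snippet = (
    '<p class="indent"><i>This is a paragraph in italics, in HTML</i>, '
    'which is the "language" used to create Kindle books.</p>'
)

lister = StructureLister()
lister.feed(snippet)
lister.close()

for event in lister.events:
    print(event)
```

The program never sees "a line of slanted text." It sees a paragraph that opens, an italic run that opens and closes inside it, and the paragraph closing. That is the information a PDF export so often gets wrong.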

Figure 1 and its result show just one very simplified example of how things go badly wrong when exporting PDF files to Word. I used it because it’s the easiest to demonstrate. Far larger, and harder to find and fix, land mines await the unwary.

Other Horror Stories:

Much text formatting, like italics, can go horribly wrong. One such case is a client who came to us because no matter what she did, when she uploaded her “Word” file (made from her PDF) to the KDP, none of her italics showed up. It turned out that Acrobat® told Word® that the italics were in a special italic font that isn’t available on Kindle--so of course, the italics never showed up. Sometimes, Acrobat® tells Word® that a symbol exists, but uses a special symbol font to create it--and again, that symbol’s font may not be on your computer, and it’s certainly not on Kindle devices.

It’s important to remember: PDF is all about layout, and how text looks; word-processors and eBooks are all about what elements are (words, sentences, paragraphs, pages, sections), and then how they are displayed. In eBooks, the structure (what something is) takes precedence over how it looks.

All real paragraphs must have that pilcrow code at the end; that tells Word® that the paragraph ends there, and that the next paragraph starts immediately. But again, most of the chaos caused by “auto-magic” convert-PDF-to-Word® programs is not visible to the eye in Word; the problems only surface after the document is converted into code. Even I, after five years of making ebooks, sometimes cannot see the problems that are hidden deep in the code of a “faux” Word® file until I export the file into code, and then find the hidden Dragons waiting for me.

Can I "DIY" my PDF Conversion to eBook?

If you can, it’s best to leave conversion from PDF to Word® or eBook to experts. Yes, I know that sounds self-serving, as I own an ebook-making firm, but it’s true. If you have a lot of expertise in Word® (or another word processor); if you have a true command of Word’s Styles, macros, etc., you can absolutely do all the clean-up yourself, but whether you do it yourself, or pay someone else to do it, all that “cruft” that is put inside a PDF-exported/created Word® file must be cleaned up before you can make a successful, clean, beautiful-looking ebook.

The “paragraph” problem can be cleaned up with time and some effort, even by those without a lot of expertise in Word. You can go through and delete all those unwanted paragraph codes, but you have to do it one line at a time. Don’t do what one of our clients did: she thought it would be “faster and easier” to use search and replace. She chose “all” on the search and replace menu—and ended up with a book that was one giant paragraph long!
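
Here's a small sketch of why "replace all" goes so wrong, next to a more careful approach. It assumes the broken lines are already available as plain strings; the sample text and the function name merge_broken_lines are made up. The careful version is still only a heuristic: it ends a paragraph wherever a line ends in sentence-ending punctuation, so it will split some real paragraphs and a person still has to check the result.

```python
# Punctuation that plausibly ends a sentence (a heuristic only).
SENTENCE_END = (".", "!", "?", '"', "\u201d")


def merge_broken_lines(lines):
    """Join lines into paragraphs, breaking only after a finished sentence."""
    paragraphs = []
    current = ""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore empty lines
        current = f"{current} {text}" if current else text
        if text.endswith(SENTENCE_END):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)  # keep any unfinished trailing text
    return paragraphs


# Made-up sample: two real paragraphs, exported as five one-line "paragraphs".
lines = [
    "It was a long walk from the station to the",
    "house, and by the time she arrived the",
    "lamps were lit.",
    "Nobody answered the door, so she knocked",
    "again.",
]

replace_all = " ".join(lines)       # what "replace all" does: one giant paragraph
careful = merge_broken_lines(lines)  # keeps the two paragraphs apart

print(replace_all)
print(careful)
```

Even the careful version can't tell a line that happens to end with a period from a true paragraph ending, which is exactly why the safe way is still one line at a time, with your eyes on the original.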

***
Remember: you can see full-size examples of today's images and examples by clicking any of the images. You'll want to see them larger size in order to view them clearly. This is "stuff" worth reviewing, and worth knowing about before you decide to take on PDF→Word→Kindle conversion for yourself. As I said above: it can be done by a determined beginner, but do know and understand what you're getting into, upfront, and don't be easily discouraged. Good Luck!