What's Wrong With My PDF→Word Conversion?  It Looks Perfect!

When people look at the results of automated "PDF to Word®" conversion sites, or software, different people see different things.  To an author, who only has a PDF copy of a book from her backlist, it looks like manna from heaven—a Word® file that looks perfect!  To an ebook professional, however, it’s like the movie Lake Placid—a serene, gorgeous surface, beneath which danger lurks.   

What Happens?  Why Doesn't PDF Work at the KDP?

You’ve probably heard people talk about how they tried to upload a PDF at the KDP®, or tried to use a program like Adobe Acrobat® to "make" a Word® file from their PDF, only to get wholly unexpected and dismal results.  This happens a lot, particularly to people who don’t have expertise in Word®.  When you use a program like Acrobat®, or one of those online conversion web sites, the file that you get back will often look exactly like you think it should.  And you’ll think it’s great, and be thrilled.  But underneath, where it counts, where Word's invisible codes tell text what it is and how to display, lurks an unholy mess waiting to bite you when you try to actually use that file, rather than just looking at it.  (Cue the soundtrack from Jaws...bum-bum-bum-bump-bump-bump-bump-bum....)

Let's Look At An Example:

Let’s look at one real-life example, to kick off the discussion.  This prospective client came to us after exporting his “Word®” file from a PDF and then uploading that file to Amazon®.  As he ended up coming to us, you can already predict (spoiler ahead!) that the results weren’t good.

When a display or layout program like Acrobat® tries to export a Word® file, it tries to “tell” Word® what it thinks it is seeing.  A PDF is not a word-processed file; it uses a completely different set of codes, and different types of codes, to achieve the layout that you see when you view it.  That is because Acrobat® is a layout program, not a word processor.  Acrobat® and other layout programs only care about how the end product looks; word processors care about what the elements (words, sentences, paragraphs) in a document are.  Do you remember the old parable about the three blind men and the elephant?  Well, the Acrobat® conversion to Word® format is a bit like that: Acrobat® tells Word® what it thinks it sees, which is its interpretation of your intent, not what Word® actually needs to “hear.”  Let’s look at a page of text as Acrobat® exported it, the way it appears to the naked eye:

The page as it looked when it came out of the scan. Fine, right? Hmmm...look again. What are those squiggly green things?
When the pilcrows are displayed, now you can see what's going on.
The results: not what you expected, are they?

Figure 1 is one of the pages, in Word, that was the result of an “automatic” export from Adobe Acrobat® to MS Word.

This small section looks fine, right?  But those of you with eagle eyes may have noticed that something isn’t quite right: why is the first word in each line underlined with the dreaded squiggly green line?  Why does Word® think that’s a grammar error?  To see why that’s happening, let’s look at this exact same page with “reveal codes” turned on (what you see if you click the pilcrow icon ¶ on your Word® 2007-2010 Ribbon, or in the main toolbar for older editions):

Now you can see what’s really going on.  When Acrobat® exported that file into Word®, it “thought” that every line was its own paragraph.  That’s right—if you tried to upload this file at the KDP, every single line you see there would come out, in Kindle, as its own paragraph, not words inside a much larger paragraph.  That’s what Word® is trying to tell you, with those squiggly green lines—it’s trying to say, “Hey, you didn’t capitalize the first letter of this new sentence.”  Word® thinks that those first words on each line are actually the first words in a new sentence.

Why does it think that?  Because immediately before those words, Word® obeys a pilcrow command (at the end of each line, over there in the right-hand margin).   That pilcrow instructs Word, “I am marking the end of a paragraph.”  Word® knows that the very next word is the first word of a new paragraph, so it must be the first word of a new sentence, and therefore, should be capitalized.  That’s what those little pilcrows, and the little squiggly green lines are telling you:  Here There Be Dragons!
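For the code-curious, the same check Word® is making can be sketched in a few lines of Python. This is only an illustration, not anything Word® or Acrobat® actually runs: the function name `suspicious_breaks` and the sample text are my own inventions, and it assumes you have already pulled each Word® "paragraph" out as a plain string. It flags any "paragraph" that starts with a lowercase letter, or that follows one that never finished its sentence.

```python
import re

# A sentence end: . ! or ? optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"""[.!?]["'”’)\]]*$""")


def suspicious_breaks(paragraphs):
    """Return indexes of 'paragraphs' that look like continuations of the previous one."""
    flagged = []
    for i in range(1, len(paragraphs)):
        prev = paragraphs[i - 1].strip()
        curr = paragraphs[i].strip()
        if not prev or not curr:
            continue  # skip empty paragraphs; they are usually just spacing
        starts_lower = curr[0].islower()
        prev_unfinished = SENTENCE_END.search(prev) is None
        if starts_lower or prev_unfinished:
            flagged.append(i)
    return flagged


# Invented sample text: one real paragraph broken at every line, then a second one.
sample = [
    "It was a quiet morning, and nothing at all was",
    "happening in the garden. The cat slept on the",
    "warm stones.",
    "Later that day, the rain came in from the west.",
]
print(suspicious_breaks(sample))  # prints [1, 2]
```

If most of the "paragraphs" in a file get flagged this way, you are almost certainly looking at a line-per-paragraph export.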

But:  Won't It Look Fine, Anyway?  Without Those Cruddy Pilcrows?

When this file was exported to Kindle by the prospective client, what he saw, to his horror, was this (I’m simulating the actual output, starting with the first line of the “paragraph” near the bottom of the section shown that starts with, “Some of the nuns…”):

Obviously—this was not what he’d had in mind. This was prose, not poetry or some type of experimental Haiku. He’d expected his Kindle book would look like Figure 1…but what he got was far, far different, making the book unreadable and thus, unsaleable. Why did this happen?

The way a word processor works is actually pretty simple. Every single element in a word-processed file, whether it's a paragraph, or an italicized word or phrase, or smallcaps, has invisible tags surrounding it that identify it to the program and tell it how to display. More importantly, those codes (tags) tell the program what it is. (A word, a paragraph, etc.) Word's own file format uses a different, XML-based set of tags, but the principle is the same as in HTML, which is the code used to make eBooks. An example of how this looks in HTML is this:

<p class="indent"><i>This is a paragraph in italics, in HTML</i>, which is the “language” used to create Kindle books.</p>

What this looks like, on a Kindle device:

This is a paragraph in italics, in HTML, which is the “language” used to create Kindle books.

A word or phrase in italics, for example, is preceded by an opening tag that starts the italicization: <i>. The program is told to stop italicizing by a closing tag, which looks like this: </i>. The same principle holds whether it’s Word, WordPerfect, OpenOffice, LibreOffice…well, you get the drift; the exact codes differ from program to program, but each one marks where formatting starts and stops.

In the above example, I tell the program that the paragraph starts with the word “This,” after the opening paragraph tag, and ends with the period after the word “books.” The italics styling starts with the word “This,” and stops after the word “HTML.” In most word processors, most of this happens invisibly to you, and can only be revealed either by using Word’s Styles menu or by working in the actual code, as most ebook conversion companies do. This is the “black box” effect; magic happens behind the screen that makes stuff “just happen.”
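To tie this back to the line-per-paragraph problem, here is a small Python sketch of what a converter does in spirit. It is not how the KDP or any particular tool is implemented; the function `paragraphs_to_html` and the sample sentences are hypothetical, and the `indent` class is simply borrowed from the example above. The point is that a converter trusts the paragraph marks it is given, so every stray pilcrow becomes its own paragraph tag.

```python
from html import escape


def paragraphs_to_html(paragraphs, css_class="indent"):
    """Wrap every paragraph in its own <p> element, as a simple converter might."""
    return "\n".join(
        f'<p class="{css_class}">{escape(text)}</p>' for text in paragraphs
    )


# One real paragraph, the way the author wrote it (invented sample text)...
good = ["It was a quiet morning, and nothing at all was happening in the garden."]

# ...and the same text after an export that ended a "paragraph" at every line.
bad = [
    "It was a quiet morning, and nothing",
    "at all was happening in the garden.",
]

print(paragraphs_to_html(good))
print()
print(paragraphs_to_html(bad))
```

The first call produces a single paragraph element; the second produces two, and an e-reader will dutifully display each as a separate paragraph.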

Figure 1 and its result show just one very simplified example of how things go badly wrong when exporting PDF files to Word. I used it because it’s the easiest to demonstrate. Far larger land mines, harder to find and fix, await the unwary.

Other Horror Stories:

Much text formatting, like italics, can go horribly wrong. One such case is a client who came to us because no matter what she did, when she uploaded her “Word” file (made from her PDF) to the KDP, none of her italics showed up. It turned out that Acrobat® had told Word® that the italics were in a special italic font that isn’t available on Kindle, so of course the italics never showed up. Sometimes, Acrobat® tells Word® that a symbol exists, but uses a special symbol font to create it. Again, that symbol’s font may not be on your computer, and it’s certainly not on Kindle devices.
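The italics case can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the `Run` class is a made-up stand-in for a stretch of text with one font, the font names are invented, and guessing italics from the words "italic" or "oblique" in a font name is a rough heuristic, not a rule any of these programs follow. It shows the difference between text that merely uses an italic-looking font and text that is actually marked as italic.

```python
from dataclasses import dataclass

# Hypothetical hints that a font is really just the italic face of a body font.
ITALIC_HINTS = ("italic", "oblique")


@dataclass
class Run:
    """A made-up stand-in for a stretch of text that shares one font."""
    text: str
    font: str
    italic: bool = False


def normalize_italics(runs, body_font):
    """Replace 'italic by font name' with the body font plus a real italic flag."""
    fixed = []
    for run in runs:
        name = run.font.lower()
        if any(hint in name for hint in ITALIC_HINTS):
            fixed.append(Run(run.text, body_font, italic=True))
        else:
            fixed.append(run)
    return fixed


# Invented example: the title only looks italic because of its font.
runs = [
    Run("She read ", "Garamond"),
    Run("Moby-Dick", "Garamond-Italic"),
    Run(" twice.", "Garamond"),
]
fixed = normalize_italics(runs, body_font="Garamond")
for run in fixed:
    print(repr(run.text), run.font, run.italic)
```

Once the italic is recorded as a property of the text rather than as a font, it no longer matters whether the reading device has that font installed.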

It’s important to remember: PDF is all about layout, and how text looks; word-processors and eBooks are all about what elements are (words, sentences, paragraphs, pages, sections), and then how they are displayed. In eBooks, the structure (what something is) takes precedence over how it looks.

All real paragraphs must have that pilcrow code at the end; it tells Word® that the paragraph ends there, and that the next paragraph starts immediately. But again, most of the chaos caused by “auto-magic” convert-PDF-to-Word® programs is not visible to the eye in Word; the problems only surface after the document is converted into code. Even I, after five years of making ebooks, sometimes cannot see the problems that are hidden deep in the code of a “faux” Word® file until I export the file into code, and then find the hidden Dragons waiting for me.

Can I "DIY" My PDF Conversion to eBook?

If you can, it’s best to leave conversion from PDF to Word® or eBook to experts. Yes, I know that sounds self-serving, as I own an ebook-making firm, but it’s true. If you have a lot of expertise in Word® (or another word processor), and a true command of Word’s Styles, macros, etc., you can absolutely do all the clean-up yourself. But whether you do it yourself or pay someone else to do it, all that “cruft” that is put inside a PDF-exported Word® file must be cleaned up before you can make a successful, clean, beautiful-looking ebook.

The “paragraph” problem can be cleaned up with time and some effort, even by those without a lot of expertise in Word. You can go through and delete all those unwanted paragraph codes, but you have to do it one line at a time, because the real paragraph endings use exactly the same code as the unwanted ones. Don’t do what one of our clients did: she thought it would be “faster and easier” to use search and replace. She chose “all” on the search and replace menu, which removed the real paragraph breaks along with the bad ones, and ended up with a book that was one giant paragraph long!
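If you are comfortable with a little scripting, the difference between that "replace all" and a more careful pass can be sketched in Python. Treat this as a rough sketch under stated assumptions, not a recommended tool: the function names, the sample text, and the `short_ratio` value are all my own, and the heuristic (a real paragraph usually ends on a line that finishes a sentence and is noticeably shorter than a full line) will guess wrong whenever a paragraph happens to end on a full-width line. The output still needs to be checked by eye.

```python
import re

# A sentence end: . ! or ? optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"""[.!?]["'”’)\]]*$""")


def naive_replace_all(lines):
    """What 'replace all' does: every paragraph mark becomes a space."""
    return [" ".join(ln.strip() for ln in lines if ln.strip())]


def merge_lines(lines, short_ratio=0.85):
    """Rejoin broken lines, keeping a break only where a paragraph probably ends.

    Heuristic (an assumption): a line ends a paragraph if it finishes a
    sentence AND is shorter than short_ratio times the longest line.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    full_width = max(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        ends_sentence = SENTENCE_END.search(ln) is not None
        is_short = len(ln) < short_ratio * full_width
        if ends_sentence and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:  # whatever is left over becomes the last paragraph
        paragraphs.append(" ".join(current))
    return paragraphs


# Invented sample: two real paragraphs, broken at every line by the export.
lines = [
    "It was a quiet morning, and nothing at all was",
    "happening in the garden. The cat slept on the",
    "warm stones.",
    "Later that day, the rain came in from the west",
    "and stayed.",
]
print(len(naive_replace_all(lines)))  # prints 1: the "one giant paragraph" result
for paragraph in merge_lines(lines):
    print(paragraph)
```

On this sample the careful pass recovers two paragraphs, where the blanket replacement leaves one.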

This is "stuff" worth reviewing, and worth knowing about before you decide to take on PDF→Word→Kindle conversion for yourself. As I said above: it can be done by a determined beginner, but do know and understand what you're getting into, up front, and don't be easily discouraged. Good Luck!