Friday, May 24, 2013

Converting Microsoft DOC or DOCX files into PDF using Java without contortions

A heads-up first: there is no simple, well-performing solution in pure Java. To get an intuition for why, try opening a DOC-formatted file in a non-Microsoft word processor, typically Apache OpenOffice or LibreOffice. If the file contains more than a few lines of standard formatting, you will likely see layout shifts. The same is true for the DOC format's XML-based successor, the DOCX format.

Unfortunately, converting a file to PDF amounts to opening the DOC file and printing it into another file. Consequently, the resulting PDF will contain the same layout shifts that the software used to open the DOC file produces. Of course, this does not only apply to OpenOffice: you would face the same difficulties (probably even worse ones) if you read a DOC(X) file using any Java library offering such functionality.

Therefore, a fully functioning DOC(X) to PDF conversion will always require you to use Microsoft Word. Unfortunately, Microsoft Word does not offer command line switches for direct printing or PDF-conversion.

Recently, I was faced with this problem, which led me to implement the small workaround that I will introduce in the remainder of this blog entry. To begin with, you need a working installation of Microsoft Word 2007 or higher on your machine. If you are using Microsoft Word 2007, make sure that the PDF plugin is installed. Later versions of MS Word already come bundled with this plugin. Secondly, you need to make sure that the Windows Script Host is installed on your computer. This is the case for basically any Windows operating system. The Windows Script Host allows us to run Visual Basic scripts such as this one:

' See http://msdn2.microsoft.com/en-us/library/bb238158.aspx
Const wdFormatPDF = 17  ' PDF format. 
Const wdFormatXPS = 18  ' XPS format. 

Const WdDoNotSaveChanges = 0

Dim arguments
Set arguments = WScript.Arguments

' Make sure that there are one or two arguments
Function CheckUserArguments()
  If arguments.Unnamed.Count < 1 Or arguments.Unnamed.Count > 2 Then
    WScript.Echo "Use:"
    WScript.Echo "<script> input.doc"
    WScript.Echo "<script> input.doc output.pdf"
    WScript.Quit 1
  End If
End Function


' Transforms a doc to a pdf
Function DocToPdf( docInputFile, pdfOutputFile )

  Dim fileSystemObject
  Dim wordApplication
  Dim wordDocument
  Dim wordDocuments
  Dim baseFolder

  Set fileSystemObject = CreateObject("Scripting.FileSystemObject")
  Set wordApplication = CreateObject("Word.Application")
  Set wordDocuments = wordApplication.Documents

  docInputFile = fileSystemObject.GetAbsolutePathName(docInputFile)
  baseFolder = fileSystemObject.GetParentFolderName(docInputFile)

  If Len(pdfOutputFile) = 0 Then
    pdfOutputFile = fileSystemObject.GetBaseName(docInputFile) + ".pdf"
  End If

  If Len(fileSystemObject.GetParentFolderName(pdfOutputFile)) = 0 Then
    pdfOutputFile = baseFolder + "\" + pdfOutputFile
  End If

  ' Disable any potential macros of the word document.
  wordApplication.WordBasic.DisableAutoMacros

  Set wordDocument = wordDocuments.Open(docInputFile)

  ' See http://msdn2.microsoft.com/en-us/library/bb221597.aspx 
  wordDocument.SaveAs pdfOutputFile, wdFormatPDF

  wordDocument.Close WdDoNotSaveChanges
  wordApplication.Quit WdDoNotSaveChanges
  
  Set wordApplication = Nothing
  Set fileSystemObject = Nothing

End Function

' Execute script
Call CheckUserArguments()
If arguments.Unnamed.Count = 2 Then
 Call DocToPdf( arguments.Unnamed.Item(0), arguments.Unnamed.Item(1) )
Else
 Call DocToPdf( arguments.Unnamed.Item(0), "" )
End If

Set arguments = Nothing

Copy this script and save it on your machine under a name such as doc2pdf.vbs. I will not go into the details of Visual Basic scripting here, since this blog is addressed to Java developers. In a nutshell, the script expects one or two command line arguments. The first argument is the DOC(X) file to be converted. The second argument is optional and names the output file. If it is missing, the script replaces the input file's extension with .pdf (GetBaseName drops the extension) and saves the result in the same directory as the input. The conversion is achieved by calling Microsoft Word silently. More advanced implementations of this functionality can be found on the net.
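
If your Java code needs to know where the PDF ends up when only one argument is passed, the script's default naming can be mirrored on the Java side. The following is a minimal sketch, not part of the script above: Doc2PdfPaths and defaultPdfPath are hypothetical names, and it assumes that GetBaseName drops only the last extension of the file name.

```java
import java.nio.file.Path;

/** Hypothetical helper that mirrors the default output naming of doc2pdf.vbs. */
final class Doc2PdfPaths {

    private Doc2PdfPaths() {
    }

    /**
     * Returns the path the script is expected to write to when no output
     * file is given: same folder, same base name, ".pdf" extension.
     */
    static Path defaultPdfPath(Path docFile) {
        String fileName = docFile.getFileName().toString();
        int lastDot = fileName.lastIndexOf('.');
        // Strip the last extension if there is one (assumption: this is
        // what GetBaseName does in the script).
        String baseName = lastDot > 0 ? fileName.substring(0, lastDot) : fileName;
        return docFile.resolveSibling(baseName + ".pdf");
    }
}
```

For an input named myfile.docx, this yields myfile.pdf in the same folder.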

You will now be able to call this script from a MS Windows console (cmd) by typing:

C:\example\doc2pdf.vbs C:\example\myfile.docx

After executing this script, you will find C:\example\myfile.pdf on your machine. Make sure that this conversion works in order to confirm that your system is configured correctly.

But there is more bad news: you cannot call this script from Java directly. Attempting to run it via Runtime.exec results in a java.io.IOException. The reason can be found in the exception message:
Cannot run program "C:\example\doc2pdf.vbs": CreateProcess error=193, %1 is not a valid Win32 application
Java starts the program through the Windows CreateProcess call, which only launches executables and does not consult the file association that hands .vbs files to the Windows Script Host. Our script is therefore not recognized as a valid application. This requires another workaround: we write a small batch script that executes the Visual Basic script for us. It looks something like this:

@Echo off
pushd %~dp0
cscript C:\example\doc2pdf.vbs %1 %2

Save this file as doc2pdf.bat. Again, I will spare you the details of this short batch script; it simply executes the Visual Basic script and passes its first two command line arguments (if there are any) on to it. Try the script by typing

C:\example\doc2pdf C:\example\myfile.docx

into your command line to see if the script is set up correctly. The advantage of this batch script over the Visual Basic implementation is that it can be called from Java:

// Required imports; the try block itself belongs inside an instance method.
import java.io.File;
import java.io.IOException;

try {
    String docToPdf = "C:\\example\\doc2pdf.bat";
    File docPath = new File(getClass().getResource("/mydocument.docx").getFile());
    File pdfPath = new File(docPath.getAbsolutePath() + ".pdf");
    // Runtime.exec(String) splits the command on whitespace, so this
    // simple form assumes that none of the paths contains a space.
    String command = String.format("%s %s %s", docToPdf, docPath, pdfPath);
    Process process = Runtime.getRuntime().exec(command);
    // The next line is optional and will force the current Java
    // thread to block until the script has finished its execution.
    process.waitFor();
} catch (IOException e) {
    e.printStackTrace();
} catch (InterruptedException e) {
    e.printStackTrace();
}

By calling Process.waitFor you block the current thread until the batch script has finished and the PDF file has been produced. Its return value is the script's exit code, which tells you whether the batch script terminated correctly. The PDF file can be accessed through the variable pdfPath in the code above.
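
Evaluating the exit code can be sketched as follows. This is a minimal sketch under assumptions, not part of the original solution: Doc2PdfRunner and its methods are hypothetical names, and it assumes Java 8 or later for the timed waitFor. The arguments are passed as a list so that paths containing spaces are not split. Since I am not certain that the exit code reflects every failure inside Word, the sketch also checks that the PDF file exists.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper around the doc2pdf.bat script described above. */
final class Doc2PdfRunner {

    private Doc2PdfRunner() {
    }

    /** Builds the argument list: batch file, input document, output PDF. */
    static List<String> buildCommand(File batchFile, File docFile, File pdfFile) {
        return Arrays.asList(
                batchFile.getAbsolutePath(),
                docFile.getAbsolutePath(),
                pdfFile.getAbsolutePath());
    }

    /**
     * Runs the batch script and reports success only if it finished in time,
     * exited with code 0 and the PDF file exists afterwards.
     */
    static boolean convert(File batchFile, File docFile, File pdfFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(batchFile, docFile, pdfFile));
        // Forward the script's console output to the Java process.
        builder.inheritIO();
        Process process = builder.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Timed out: stop the batch process and report failure.
            process.destroyForcibly();
            return false;
        }
        return process.exitValue() == 0 && pdfFile.isFile();
    }
}
```

A call such as Doc2PdfRunner.convert(new File("C:\\example\\doc2pdf.bat"), docPath, pdfPath, 60) would wait at most one minute. As far as I know, destroyForcibly only stops the batch process itself, so a hung Word instance may still have to be closed separately.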

It remains disappointing that this solution will most likely only run on Windows systems. However, you might get it going on Linux via Wine and winetricks. (Winetricks allows you to install Visual Basic for the Windows Script Host via the parameter option wsh56vb.) Any feedback on such further experiments is appreciated.

Comments:

  1. I was desperately looking for such a solution since some time. All the other options I found on the internet were generating non-formatted PDF, totally different from my docx file.
    Thank you so much. You cannot imagine how happy I felt when finding this article. GG ! Nice blog ;)

    1. You are welcome. Have a look at documents4j which implements the above solution.

    2. Is there a way to go from PDF to DOCX with a PDF whose original format is DOCX? I tried to swap the VBA code but doesn't work...

    3. This is perfectly possible. Simply specify the input file to be a PDF and require a DOCX output. I added a code example to the issue you opened on GitHub.
