A heads-up first: there is no simple, well-performing solution using pure Java. To get an intuition for why this is the case, just try to open a DOC file with a non-Microsoft word processor, such as Apache OpenOffice or LibreOffice. If your file contains more than a few lines of standard formatting, you are likely to see layout shifts. The same is true for the DOC format's XML-based successor, the DOCX format.
Unfortunately, converting a file to PDF amounts to opening the DOC file and printing it to another file. Consequently, the resulting PDF file will contain the same layout shifts as the software you used to open the DOC file. Of course, this does not apply only to OpenOffice: you would face the same difficulties (or probably even worse ones) if you read a DOC(X) file using any Java library offering such functionality.
Therefore, a faithful DOC(X)-to-PDF conversion will always require you to use Microsoft Word. Unfortunately, Microsoft Word does not offer command-line switches for direct printing or PDF conversion.
Recently, I was faced with this problem, which led me to implement the small workaround that I will introduce in the remainder of this blog entry. To begin with, you need a working installation of Microsoft Word 2007 or later on your machine. If you are using Microsoft Word 2007, make sure that the PDF plugin is installed. Later versions of MS Word are already bundled with this plugin. Secondly, you need to make sure that the Windows Script Host is installed on your computer. This is the case for practically any Windows operating system. The Windows Script Host allows us to run Visual Basic scripts such as this one:
' See http://msdn2.microsoft.com/en-us/library/bb238158.aspx
Const wdFormatPDF = 17 ' PDF format.
Const wdFormatXPS = 18 ' XPS format.
Const WdDoNotSaveChanges = 0

Dim arguments
Set arguments = WScript.Arguments

' Make sure that there are one or two arguments
Function CheckUserArguments()
    If arguments.Unnamed.Count < 1 Or arguments.Unnamed.Count > 2 Then
        WScript.Echo "Use:"
        WScript.Echo "<script> input.doc"
        WScript.Echo "<script> input.doc output.pdf"
        WScript.Quit 1
    End If
End Function

' Transforms a doc to a pdf
Function DocToPdf( docInputFile, pdfOutputFile )
    Dim fileSystemObject
    Dim wordApplication
    Dim wordDocument
    Dim wordDocuments
    Dim baseFolder

    Set fileSystemObject = CreateObject("Scripting.FileSystemObject")
    Set wordApplication = CreateObject("Word.Application")
    Set wordDocuments = wordApplication.Documents

    docInputFile = fileSystemObject.GetAbsolutePathName(docInputFile)
    baseFolder = fileSystemObject.GetParentFolderName(docInputFile)

    If Len(pdfOutputFile) = 0 Then
        pdfOutputFile = fileSystemObject.GetBaseName(docInputFile) + ".pdf"
    End If

    If Len(fileSystemObject.GetParentFolderName(pdfOutputFile)) = 0 Then
        pdfOutputFile = baseFolder + "\" + pdfOutputFile
    End If

    ' Disable any potential macros of the word document.
    wordApplication.WordBasic.DisableAutoMacros

    Set wordDocument = wordDocuments.Open(docInputFile)

    ' See http://msdn2.microsoft.com/en-us/library/bb221597.aspx
    wordDocument.SaveAs pdfOutputFile, wdFormatPDF

    wordDocument.Close WdDoNotSaveChanges
    wordApplication.Quit WdDoNotSaveChanges

    Set wordApplication = Nothing
    Set fileSystemObject = Nothing
End Function

' Execute script
Call CheckUserArguments()
If arguments.Unnamed.Count = 2 Then
    Call DocToPdf( arguments.Unnamed.Item(0), arguments.Unnamed.Item(1) )
Else
    Call DocToPdf( arguments.Unnamed.Item(0), "" )
End If

Set arguments = Nothing
Copy this script and save it on your machine under a name such as doc2pdf.vbs. I will not go into the details of Visual Basic scripting at this point, since this blog is addressed to Java developers. In a nutshell, this script checks that it was given one or two command-line arguments. The first argument is the DOC(X) file to be converted. The second argument is optional and names the output file. If it is missing, the script replaces the extension of the DOC(X) file name with .pdf and saves the output in the same directory. The conversion is performed by running Microsoft Word silently in the background. More advanced implementations of this functionality can be found on the net.
You can now call this script from an MS Windows console (cmd) by typing:
C:\example\doc2pdf.vbs C:\example\myfile.docx
After executing this script, you will find C:\example\myfile.pdf on your machine (the script's GetBaseName call drops the .docx extension before .pdf is appended). Make sure that this conversion works in order to confirm that your system is configured correctly.
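If the Java side later needs to locate a PDF that was written under the default name, the naming rule of the script can be mirrored in a few lines. This is only a sketch of how I read the script: GetBaseName drops the last extension and ".pdf" is appended. The class and method names (PdfNames, defaultPdfName) are hypothetical and not part of the original solution.

```java
// Hypothetical helper that mirrors the default output name of doc2pdf.vbs.
final class PdfNames {

    private PdfNames() {
    }

    // Drops the last extension of the file name (if any) and appends ".pdf",
    // which is what GetBaseName(...) + ".pdf" does in the Visual Basic script.
    static String defaultPdfName(String docFileName) {
        int dot = docFileName.lastIndexOf('.');
        String baseName = dot > 0 ? docFileName.substring(0, dot) : docFileName;
        return baseName + ".pdf";
    }
}
```

For example, PdfNames.defaultPdfName("myfile.docx") yields "myfile.pdf". The helper only applies when no explicit output file is passed to the script.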
But there is more bad news. You will not be able to call this script from Java directly. Attempting to run the script via Runtime.exec will result in a java.io.IOException. The reason for this exception can be found in its description:
Cannot run program "C:\example\doc2pdf.vbs": CreateProcess error=193, %1 is not a valid Win32 application
Apparently, Java hands the file straight to the Windows CreateProcess function, which accepts only executables and does not consult the file association with the Windows Script Host. Java therefore does not recognize our script as a valid application. This requires us to apply another workaround: we will write a small batch script that executes the Visual Basic script for us. This script will look something like this:
@Echo off
pushd %~dp0
cscript C:\example\doc2pdf.vbs %1 %2
Save this file as doc2pdf.bat. Again, I will spare you the details of this short batch script; it merely executes the Visual Basic script and passes its first two command-line arguments on to it (if there are any). Try this script by typing
C:\example\doc2pdf C:\example\myfile.docx
into your command line to see whether your script is set up correctly. The advantage of this batch script over the Visual Basic script is that it can be called from Java:
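// Requires java.io.File, java.io.IOException and java.net.URISyntaxException.
try {
    String docToPdf = "C:\\example\\doc2pdf.bat";
    // Resolve the classpath resource via its URI so that escaped characters
    // (for example %20 for a space) are decoded correctly.
    File docPath = new File(getClass().getResource("/mydocument.docx").toURI());
    File pdfPath = new File(docPath.getAbsolutePath() + ".pdf");
    // Pass the arguments as an array so that paths containing spaces are not split.
    String[] command = {docToPdf, docPath.getAbsolutePath(), pdfPath.getAbsolutePath()};
    Process process = Runtime.getRuntime().exec(command);
    // The next line is optional and will force the current Java
    // thread to block until the script has finished its execution.
    int exitCode = process.waitFor();
} catch (IOException | URISyntaxException e) {
    e.printStackTrace();
} catch (InterruptedException e) {
    // Restore the interrupt flag so that callers can react to it.
    Thread.currentThread().interrupt();
}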
By calling Process.waitFor you can block the current thread until the batch script has finished its execution and the PDF file has been produced. Additionally, waitFor returns the exit code of the process, which tells you whether the batch script terminated correctly. The PDF file is available through the variable pdfPath in the code above.
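The call can be made somewhat more defensive. The following is a minimal sketch under stated assumptions, not the definitive way to do it: the class name DocToPdfRunner, its method names and the two-minute timeout are my own choices for illustration. It passes the arguments as a list through ProcessBuilder, forwards the script's console output so that the child process cannot stall on a full output buffer, and turns a timeout or a non-zero exit code into an exception.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the doc2pdf.bat call.
final class DocToPdfRunner {

    private DocToPdfRunner() {
    }

    // Builds the argument list: wrapper script, input document, output PDF.
    // Keeping the arguments separate avoids splitting paths at spaces.
    static List<String> buildCommand(Path script, Path doc, Path pdf) {
        return Arrays.asList(script.toString(), doc.toString(), pdf.toString());
    }

    // Runs the wrapper script and waits for it, with an assumed two-minute limit.
    static void convert(Path script, Path doc, Path pdf)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(script, doc, pdf))
                .inheritIO() // forward the script's output to this process
                .start();
        if (!process.waitFor(2, TimeUnit.MINUTES)) {
            process.destroyForcibly();
            throw new IOException("Conversion timed out for " + doc);
        }
        int exitCode = process.exitValue();
        if (exitCode != 0) {
            throw new IOException("Conversion failed with exit code " + exitCode);
        }
    }
}
```

A caller would invoke DocToPdfRunner.convert with the paths to doc2pdf.bat, the input document and the desired PDF. Since I have not verified which exit code the script reports for every kind of failure, it is probably wise to check afterwards that the PDF file actually exists.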
It remains disappointing that this solution will most likely run only on Windows systems. However, you might get it going on Linux via Wine and winetricks. (Winetricks allows you to install Visual Basic for the Windows Script Host via the wsh56vb option.) Any feedback on such further experiments is appreciated.
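Because of this limitation, it can help to fail early on other platforms instead of waiting for a confusing IOException. The following sketch checks the os.name system property. The class name PlatformCheck is hypothetical, and the check says nothing about whether Microsoft Word is actually installed.

```java
import java.util.Locale;

// Hypothetical guard that can be called before attempting a conversion.
final class PlatformCheck {

    private PlatformCheck() {
    }

    // Returns true if the given os.name value denotes a Windows system.
    static boolean isWindows(String osName) {
        return osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    // Convenience overload that inspects the running JVM.
    static boolean isWindows() {
        return isWindows(System.getProperty("os.name"));
    }
}
```

A caller could, for instance, throw an UnsupportedOperationException when PlatformCheck.isWindows() returns false.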
I had been desperately looking for such a solution for some time. All the other options I found on the internet generated unformatted PDFs, totally different from my docx file.
Thank you so much. You cannot imagine how happy I felt when finding this article. GG! Nice blog ;)
You are welcome. Have a look at documents4j which implements the above solution.
Is there a way to go from PDF to DOCX with a PDF whose original format is DOCX? I tried to swap the VBA code but it doesn't work...
This is perfectly possible. Simply specify the input file to be a PDF and require a DOCX output. I added a code example to the issue you opened on GitHub.
I am very happy to have found such a good solution for converting a docx file to PDF. I had been searching the internet for this code for the last 2 months. I found many code snippets, but none of them worked properly. Thank you so much for providing such good code.