I recently came across this exact same problem, so I dug into PyPDF2 to see what’s going on, and how to resolve it.
Note: I am assuming that
filename is a well-formed file path string. Assume the same for all of my code.
The Short Answer
Use the PdfFileMerger() class instead of the
PdfFileWriter() class. I’ve tried to make the following resemble your code as closely as I could:
    from PyPDF2 import PdfFileMerger, PdfFileReader

    [...]

    merger = PdfFileMerger()
    for filename in filenames:
        merger.append(PdfFileReader(file(filename, 'rb')))

    merger.write("document-output.pdf")
The Long Answer
The way you’re using
PdfFileWriter keeps each file open, which eventually causes Python to raise IOError 24 (too many open files). To be more specific, when you add a page to the
PdfFileWriter, you are adding references to the page in the open
PdfFileReader (hence the noted IO Error if you close the file). Python sees that the file is still referenced, so it doesn’t garbage-collect or automatically close it, even though the variable holding the file handle is re-used. The files remain open until
PdfFileWriter no longer needs access to them, which is at
output.write(outputStream) in your code.
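The effect is not specific to PyPDF2, and can be sketched with nothing but the standard library. In this minimal illustration (Python 3; `LazyPage` and `demo` are hypothetical names of my own, not PyPDF2's actual classes), a "page" only stores a reference to its source stream and reads from it on demand, so the stream has to stay open for as long as anything still needs the page:

```python
import os
import tempfile


class LazyPage:
    """Hypothetical stand-in for a page that reads from its source lazily."""

    def __init__(self, stream, offset, length):
        self.stream = stream      # only a reference; no data is copied
        self.offset = offset
        self.length = length

    def content(self):
        # Data is fetched at "write time", so the stream must still be open.
        self.stream.seek(self.offset)
        return self.stream.read(self.length)


def demo():
    # Create a small throwaway file to read from.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"page-one|page-two")

    stream = open(path, "rb")
    page = LazyPage(stream, offset=0, length=8)
    first = page.content()        # works while the stream is open

    stream.close()
    try:
        page.content()
        failed = False
    except ValueError:            # raised for I/O on a closed file
        failed = True

    os.remove(path)
    return first, failed
```

With hundreds of input files, every one of those streams has to stay open at once, which is how the open-file limit gets hit.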
To solve this, create copies in memory of the content, and allow the file to be closed. I noticed in my adventures through the PyPDF2 code that the
PdfFileMerger() class already has this functionality, so rather than re-inventing the wheel, I opted to use it. I learned, though, that my initial look at
PdfFileMerger wasn’t close enough, and that it only creates copies under certain conditions.
My initial attempts looked like the following, and resulted in the same IO problems:
    merger = PdfFileMerger()
    for filename in filenames:
        merger.append(filename)

    merger.write(output_file_path)
Looking at the PyPDF2 source code, we see that append() requires
fileobj to be passed, and then uses the
merge() function, passing in its last page as the new file’s position.
merge() does the following with
fileobj (before opening it with PdfFileReader):
    if type(fileobj) in (str, unicode):
        fileobj = file(fileobj, 'rb')
        my_file = True
    elif type(fileobj) == file:
        fileobj.seek(0)
        filecontent = fileobj.read()
        fileobj = StringIO(filecontent)
        my_file = True
    elif type(fileobj) == PdfFileReader:
        orig_tell = fileobj.stream.tell()
        fileobj.stream.seek(0)
        filecontent = StringIO(fileobj.stream.read())
        fileobj.stream.seek(orig_tell)
        fileobj = filecontent
        my_file = True
We can see that the
append() method does accept a string, and when given one, assumes it’s a file path and creates a file object at that location. The end result is exactly what we’re trying to avoid: a
PdfFileReader() object holding a file open until the output is eventually written!
However, if we either make a file object of the file path string or a
PdfFileReader (see Edit 2) object of the path string before it gets passed into
append(), it will automatically create a copy for us as a
StringIO object, allowing Python to close the file.
I would recommend the simpler
merger.append(file(filename, 'rb')), as others have reported that a
PdfFileReader object may stay open in memory, even after calling
Hope this helped!
EDIT: I assumed you were using
PyPDF2. If you aren’t, I highly recommend switching, as PyPDF is no longer maintained, with the author giving his official blessing to Phaseit in developing PyPDF2.
If for some reason you cannot switch to PyPDF2 (licensing, system restrictions, etc.), then
PdfFileMerger won’t be available to you. In that situation you can re-use the code from PyPDF2’s
merge function (provided above) to create a copy of the file as a
StringIO object, and use that in your code in place of the file object.
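A minimal sketch of that idea, written for Python 3, so `open()` and `io.BytesIO` stand in for the Python 2 `file()` and `StringIO` used above. The helper name `copy_to_memory` is my own and is not part of either library:

```python
import io


def copy_to_memory(path):
    """Return an in-memory copy of the file at `path` (hypothetical helper)."""
    # The with-block closes the real file as soon as its bytes are copied.
    with open(path, "rb") as f:
        return io.BytesIO(f.read())
```

The returned buffer supports read() and seek() like a file opened in binary mode, so it should be usable wherever your code previously passed the open file. The trade-off is that each PDF's bytes stay in memory until the output has been written.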
EDIT 2: Previous recommendation of using
merger.append(PdfFileReader(file(filename, 'rb'))) changed based on comments (Thanks @Agostino).