Get the exact String position in PDF

As plinth and David van Driessche already pointed out in their answers, text extraction from PDF files is non-trivial. Fortunately, the classes in the parser package of iText do most of the heavy lifting for you. You have already found at least one class from that package, PdfTextExtractor, but this class is essentially a convenience utility for the parser functionality of iText if you're only interested in the plain text of a page. In your case you have to look more closely at the other classes in that package.

A starting point for information on text extraction with iText is section 15.3, Parsing PDFs, of iText in Action, 2nd Edition, especially the method extractText of the sample ParsingHelloWorld.java:

public void extractText(String src, String dest) throws IOException
{
    PrintWriter out = new PrintWriter(new FileOutputStream(dest));
    PdfReader reader = new PdfReader(src);
    RenderListener listener = new MyTextRenderListener(out);
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.getPageN(1);
    PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
    processor.processContent(ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
    out.flush();
    out.close();
}

which makes use of the RenderListener implementation MyTextRenderListener.java:

public class MyTextRenderListener implements RenderListener
{
    [...]

    /**
     * @see RenderListener#renderText(TextRenderInfo)
     */
    public void renderText(TextRenderInfo renderInfo) {
        out.print("<");
        out.print(renderInfo.getText());
        out.print(">");
    }
}

While this RenderListener implementation merely outputs the text, the TextRenderInfo object it inspects offers far more information:

public LineSegment getBaseline();    // the baseline for the text (i.e. the line that the text 'sits' on)
public LineSegment getAscentLine();  // the ascentline for the text (i.e. the line that represents the topmost extent that a string of the current font could have)
public LineSegment getDescentLine(); // the descentline for the text (i.e. the line that represents the bottom most extent that a string of the current font could have)
public float getRise();              // the rise, which represents how far above the nominal baseline the text should be rendered

public String getText();             // the text to render
public int getTextRenderMode();      // the text render mode
public DocumentFont getFont();       // the font
public float getSingleSpaceWidth();  // the width, in user space units, of a single space character in the current font

public List<TextRenderInfo> getCharacterRenderInfos(); // details useful if a listener needs access to the position of each individual glyph in the text render operation

Thus, if your RenderListener, in addition to inspecting the text with getText(), also considers getBaseline() or even getAscentLine() and getDescentLine(), you have all the coordinates you will likely need.
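To illustrate how those coordinates can be combined, here is a minimal sketch in plain Java. It deliberately does not use the iText classes, so that it runs on its own: TextPositionSketch, Chunk, record and find are hypothetical names made up for this illustration. The assumption is that a listener's renderText(TextRenderInfo) would pass in the start point of getDescentLine() and the end point of getAscentLine(), which for unrotated text approximate the lower-left and upper-right corners of the chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText: collects text chunks with their
// bounding boxes and looks up the box of the chunk containing a search string.
class TextPositionSketch {

    /** One text chunk as a listener might see it in renderText(). */
    static final class Chunk {
        final String text;
        final float llx, lly, urx, ury; // lower-left and upper-right corners

        Chunk(String text, float llx, float lly, float urx, float ury) {
            this.text = text;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    private final List<Chunk> chunks = new ArrayList<>();

    // Assumption: in a real RenderListener the four numbers would be read from
    //   renderInfo.getDescentLine().getStartPoint()  (x and y of the lower-left corner)
    //   renderInfo.getAscentLine().getEndPoint()     (x and y of the upper-right corner)
    // Math.min/max only normalizes the corner order; rotated text needs more care.
    void record(String text, float descentStartX, float descentStartY,
                float ascentEndX, float ascentEndY) {
        chunks.add(new Chunk(text,
                Math.min(descentStartX, ascentEndX),
                Math.min(descentStartY, ascentEndY),
                Math.max(descentStartX, ascentEndX),
                Math.max(descentStartY, ascentEndY)));
    }

    /**
     * Returns {llx, lly, urx, ury} of the first recorded chunk whose text
     * contains the needle, or null if no chunk contains it.
     */
    float[] find(String needle) {
        for (Chunk c : chunks) {
            if (c.text.contains(needle)) {
                return new float[] { c.llx, c.lly, c.urx, c.ury };
            }
        }
        return null;
    }
}
```

Note that this returns the box of the whole chunk, not of the search string alone, and it will not find a string that is split across several chunks. For glyph-level positions, the same idea could be applied to the entries of getCharacterRenderInfos().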

PS: There is a wrapper class for the code in ParsingHelloWorld.extractText(), PdfReaderContentParser, which allows you to simply write the following, given a PdfReader reader, an int page, and a RenderListener renderListener:

PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(page, renderListener);
