public class PDFTextStripper extends LegacyPDFStreamEngine
Modifier and Type | Class and Description |
---|---|
private static class |
PDFTextStripper.LineItem
Internal marker class.
|
private static class |
PDFTextStripper.PositionWrapper
Wrapper of TextPosition that adds flags to track status as line start and paragraph start positions.
|
private static class |
PDFTextStripper.WordWithTextPositions
Internal class that maps strings to lists of
TextPosition arrays. |
Modifier and Type | Field and Description |
---|---|
private boolean |
addMoreFormatting |
private java.lang.String |
articleEnd |
private java.lang.String |
articleStart |
private float |
averageCharTolerance |
private java.util.List<PDRectangle> |
beadRectangles |
private java.util.Map<java.lang.String,java.util.TreeMap<java.lang.Float,java.util.TreeSet<java.lang.Float>>> |
characterListMapping |
protected java.util.ArrayList<java.util.List<TextPosition>> |
charactersByArticle
The charactersByArticle is used to extract text by article divisions.
|
private int |
currentPageNo |
private static float |
defaultDropThreshold |
private static float |
defaultIndentThreshold |
protected PDDocument |
document |
private float |
dropThreshold |
private static float |
END_OF_LAST_TEXT_X_RESET_VALUE |
private PDOutlineItem |
endBookmark |
private int |
endBookmarkPageNumber |
private int |
endPage |
private static float |
EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE |
private float |
indentThreshold |
private boolean |
inParagraph
True if we started a paragraph but haven't ended it yet.
|
private static float |
LAST_WORD_SPACING_RESET_VALUE |
protected java.lang.String |
LINE_SEPARATOR
The platform's line separator.
|
private java.lang.String |
lineSeparator |
private static java.lang.String[] |
LIST_ITEM_EXPRESSIONS
A list of regular expressions that match commonly used list item formats.
|
private java.util.List<java.util.regex.Pattern> |
listOfPatterns |
private static org.apache.commons.logging.Log |
LOG |
private static float |
MAX_HEIGHT_FOR_LINE_RESET_VALUE |
private static float |
MAX_Y_FOR_LINE_RESET_VALUE |
private static float |
MIN_Y_TOP_FOR_LINE_RESET_VALUE |
private static java.util.Map<java.lang.Character,java.lang.Character> |
MIRRORING_CHAR_MAP |
protected java.io.Writer |
output |
private java.lang.String |
pageEnd |
private java.lang.String |
pageStart |
private java.lang.String |
paragraphEnd |
private java.lang.String |
paragraphStart |
private boolean |
shouldSeparateByBeads |
private boolean |
sortByPosition |
private float |
spacingTolerance |
private PDOutlineItem |
startBookmark |
private int |
startBookmarkPageNumber |
private int |
startPage |
private boolean |
suppressDuplicateOverlappingText |
private static boolean |
useCustomQuickSort |
private java.lang.String |
wordSeparator |
Constructor and Description |
---|
PDFTextStripper()
Instantiate a new PDFTextStripper object.
|
Modifier and Type | Method and Description |
---|---|
private PDFTextStripper.WordWithTextPositions |
createWord(java.lang.String word,
java.util.List<TextPosition> wordPositions)
Used within
normalize(List) to create a single PDFTextStripper.WordWithTextPositions entry. |
protected void |
endArticle()
End an article.
|
protected void |
endDocument(PDDocument document)
This method is available for subclasses of this class.
|
protected void |
endPage(PDPage page)
End a page.
|
private void |
fillBeadRectangles(PDPage page) |
boolean |
getAddMoreFormatting()
This will tell if the text stripper should add some more text formatting.
|
java.lang.String |
getArticleEnd()
Returns the string which will be used at the end of an article.
|
java.lang.String |
getArticleStart()
Returns the string which will be used at the beginning of an article.
|
float |
getAverageCharTolerance()
Get the current character width-based tolerance value that is being used to estimate where spaces in text should
be added.
|
protected java.util.List<java.util.List<TextPosition>> |
getCharactersByArticle()
Character strings are grouped by articles.
|
protected int |
getCurrentPageNo()
Get the current page number that is being processed.
|
float |
getDropThreshold()
Returns the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line
start is considered to be a paragraph start.
|
PDOutlineItem |
getEndBookmark()
Get the bookmark where text extraction should end, inclusive.
|
int |
getEndPage()
This will get the last page that will be extracted.
|
float |
getIndentThreshold()
Returns the multiple of whitespace character widths for the current text by which the current line start can be
indented from the previous line start, beyond which the current line start is considered to be a paragraph start.
|
java.lang.String |
getLineSeparator()
This will get the line separator.
|
protected java.util.List<java.util.regex.Pattern> |
getListItemPatterns()
Returns a list of regular expression Patterns representing different common list item formats.
|
protected java.io.Writer |
getOutput()
The output stream that is being written to.
|
java.lang.String |
getPageEnd()
Returns the string which will be used at the end of a page.
|
java.lang.String |
getPageStart()
Returns the string which will be used at the beginning of a page.
|
java.lang.String |
getParagraphEnd()
Returns the string which will be used at the end of a paragraph.
|
java.lang.String |
getParagraphStart()
Returns the string which will be used at the beginning of a paragraph.
|
boolean |
getSeparateByBeads()
This will tell if the text stripper should separate by beads.
|
boolean |
getSortByPosition()
This will tell if the text stripper should sort the text tokens before writing to the stream.
|
float |
getSpacingTolerance()
Get the current space width-based tolerance value that is being used to estimate where spaces in text should be
added.
|
PDOutlineItem |
getStartBookmark()
Get the bookmark where text extraction should start, inclusive.
|
int |
getStartPage()
This is the page that the text extraction will start on.
|
boolean |
getSuppressDuplicateOverlappingText() |
java.lang.String |
getText(PDDocument doc)
This will return the text of a document.
|
java.lang.String |
getWordSeparator()
This will get the word separator.
|
private java.lang.String |
handleDirection(java.lang.String word)
Handles the LTR and RTL direction of the given words.
|
private PDFTextStripper.PositionWrapper |
handleLineSeparation(PDFTextStripper.PositionWrapper current,
PDFTextStripper.PositionWrapper lastPosition,
PDFTextStripper.PositionWrapper lastLineStartPosition,
float maxHeightForLine)
Handles the line separator for a new line given the specified current and previous TextPositions.
|
private void |
isParagraphSeparation(PDFTextStripper.PositionWrapper position,
PDFTextStripper.PositionWrapper lastPosition,
PDFTextStripper.PositionWrapper lastLineStartPosition,
float maxHeightForLine)
Tests the relationship between the last text position, the current text position and the last text position that
followed a line separator to decide if the gap represents a paragraph separation.
|
private java.util.regex.Pattern |
matchListItemPattern(PDFTextStripper.PositionWrapper pw)
Returns the list item Pattern object that matches the text at the specified PositionWrapper, or null if the text
does not match such a pattern.
|
protected static java.util.regex.Pattern |
matchPattern(java.lang.String string,
java.util.List<java.util.regex.Pattern> patterns)
Iterates over the specified list of Patterns until it finds one that matches the specified string.
|
private float |
multiplyFloat(float value1,
float value2) |
private java.util.List<PDFTextStripper.WordWithTextPositions> |
normalize(java.util.List<PDFTextStripper.LineItem> line)
Normalize the given list of TextPositions.
|
private java.lang.StringBuilder |
normalizeAdd(java.util.List<PDFTextStripper.WordWithTextPositions> normalized,
java.lang.StringBuilder lineBuilder,
java.util.List<TextPosition> wordPositions,
PDFTextStripper.LineItem item)
Used within
normalize(List) to handle a TextPosition . |
private java.lang.String |
normalizeWord(java.lang.String word)
Normalize certain Unicode characters.
|
private boolean |
overlap(float y1,
float height1,
float y2,
float height2) |
private static void |
parseBidiFile(java.io.InputStream inputStream)
This method parses the bidi file provided as an input stream.
|
void |
processPage(PDPage page)
This will process the contents of a page.
|
protected void |
processPages(PDPageTree pages)
This will process all of the pages and the text that is in them.
|
protected void |
processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page.
|
private void |
resetEngine() |
void |
setAddMoreFormatting(boolean newAddMoreFormatting)
Some additional text formatting will be added if addMoreFormatting is set to true.
|
void |
setArticleEnd(java.lang.String articleEndValue)
Sets the string which will be used at the end of an article.
|
void |
setArticleStart(java.lang.String articleStartValue)
Sets the string which will be used at the beginning of an article.
|
void |
setAverageCharTolerance(float averageCharToleranceValue)
Set the character width-based tolerance value that is used to estimate where spaces in text should be added.
|
void |
setDropThreshold(float dropThresholdValue)
Sets the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current
line start is considered to be a paragraph start.
|
void |
setEndBookmark(PDOutlineItem aEndBookmark)
Set the bookmark where the text extraction should stop.
|
void |
setEndPage(int endPageValue)
This will set the last page to be extracted by this class.
|
void |
setIndentThreshold(float indentThresholdValue)
Sets the multiple of whitespace character widths for the current text by which the current line start can be
indented from the previous line start, beyond which the current line start is considered to be a paragraph start.
|
void |
setLineSeparator(java.lang.String separator)
Set the desired line separator for output text.
|
protected void |
setListItemPatterns(java.util.List<java.util.regex.Pattern> patterns)
Use to supply a different set of regular expression patterns for matching list item starts.
|
void |
setPageEnd(java.lang.String pageEndValue)
Sets the string which will be used at the end of a page.
|
void |
setPageStart(java.lang.String pageStartValue)
Sets the string which will be used at the beginning of a page.
|
void |
setParagraphEnd(java.lang.String s)
Sets the string which will be used at the end of a paragraph.
|
void |
setParagraphStart(java.lang.String s)
Sets the string which will be used at the beginning of a paragraph.
|
void |
setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
Set if the text stripper should group the text output by a list of beads.
|
void |
setSortByPosition(boolean newSortByPosition)
The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen.
|
void |
setSpacingTolerance(float spacingToleranceValue)
Set the space width-based tolerance value that is used to estimate where spaces in text should be added.
|
void |
setStartBookmark(PDOutlineItem aStartBookmark)
Set the bookmark where text extraction should start, inclusive.
|
void |
setStartPage(int startPageValue)
This will set the first page to be extracted by this class.
|
void |
setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
By default the text stripper will attempt to remove duplicate text that overlaps.
|
void |
setWordSeparator(java.lang.String separator)
Set the desired word separator for output text.
|
protected void |
startArticle()
Start a new article, which is typically defined as a column on a single page (also referred to as a bead).
|
protected void |
startArticle(boolean isLTR)
Start a new article, which is typically defined as a column on a single page (also referred to as a bead).
|
protected void |
startDocument(PDDocument document)
This method is available for subclasses of this class.
|
protected void |
startPage(PDPage page)
Start a new page.
|
private boolean |
within(float first,
float second,
float variance)
This will determine if two floating point numbers are within a specified variance.
|
protected void |
writeCharacters(TextPosition text)
Write the string in TextPosition to the output stream.
|
private void |
writeLine(java.util.List<PDFTextStripper.WordWithTextPositions> line)
Write a list of strings containing a whole line of a document.
|
protected void |
writeLineSeparator()
Write the line separator value to the output stream.
|
protected void |
writePage()
This will print the text of the processed page to "output".
|
protected void |
writePageEnd()
Write something (if defined) at the end of a page.
|
protected void |
writePageStart()
Write something (if defined) at the start of a page.
|
protected void |
writeParagraphEnd()
Write something (if defined) at the end of a paragraph.
|
protected void |
writeParagraphSeparator()
Writes the paragraph separator string to the output.
|
protected void |
writeParagraphStart()
Write something (if defined) at the start of a paragraph.
|
protected void |
writeString(java.lang.String text)
Write a Java string to the output stream.
|
protected void |
writeString(java.lang.String text,
java.util.List<TextPosition> textPositions)
Write a Java string to the output stream.
|
void |
writeText(PDDocument doc,
java.io.Writer outputStream)
This will take a PDDocument and write the text of that document to the print writer.
|
protected void |
writeWordSeparator()
Write the word separator value to the output stream.
|
Methods inherited from class LegacyPDFStreamEngine: showGlyph
Methods inherited from class PDFStreamEngine: addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
private static float defaultIndentThreshold
private static float defaultDropThreshold
private static final boolean useCustomQuickSort
private static final org.apache.commons.logging.Log LOG
protected final java.lang.String LINE_SEPARATOR
private java.lang.String lineSeparator
private java.lang.String wordSeparator
private java.lang.String paragraphStart
private java.lang.String paragraphEnd
private java.lang.String pageStart
private java.lang.String pageEnd
private java.lang.String articleStart
private java.lang.String articleEnd
private int currentPageNo
private int startPage
private int endPage
private PDOutlineItem startBookmark
private int startBookmarkPageNumber
private int endBookmarkPageNumber
private PDOutlineItem endBookmark
private boolean suppressDuplicateOverlappingText
private boolean shouldSeparateByBeads
private boolean sortByPosition
private boolean addMoreFormatting
private float indentThreshold
private float dropThreshold
private float spacingTolerance
private float averageCharTolerance
private java.util.List<PDRectangle> beadRectangles
protected java.util.ArrayList<java.util.List<TextPosition>> charactersByArticle
private java.util.Map<java.lang.String,java.util.TreeMap<java.lang.Float,java.util.TreeSet<java.lang.Float>>> characterListMapping
protected PDDocument document
protected java.io.Writer output
private boolean inParagraph
private static final float END_OF_LAST_TEXT_X_RESET_VALUE
private static final float MAX_Y_FOR_LINE_RESET_VALUE
private static final float EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
private static final float MAX_HEIGHT_FOR_LINE_RESET_VALUE
private static final float MIN_Y_TOP_FOR_LINE_RESET_VALUE
private static final float LAST_WORD_SPACING_RESET_VALUE
private static final java.lang.String[] LIST_ITEM_EXPRESSIONS
private java.util.List<java.util.regex.Pattern> listOfPatterns
private static java.util.Map<java.lang.Character,java.lang.Character> MIRRORING_CHAR_MAP
public PDFTextStripper() throws java.io.IOException
java.io.IOException - If there is an error loading the properties.
public java.lang.String getText(PDDocument doc) throws java.io.IOException
doc - The document to get the text from.
java.io.IOException - If the doc state is invalid or it is encrypted.
private void resetEngine()
public void writeText(PDDocument doc, java.io.Writer outputStream) throws java.io.IOException
doc - The document to get the data from.
outputStream - The location to put the text.
java.io.IOException - If the doc is in an invalid state.
protected void processPages(PDPageTree pages) throws java.io.IOException
pages - The pages object in the document.
java.io.IOException - If there is an error parsing the text.
protected void startDocument(PDDocument document) throws java.io.IOException
document - The PDF document that is being processed.
java.io.IOException - If an IO error occurs.
protected void endDocument(PDDocument document) throws java.io.IOException
document - The PDF document that is being processed.
java.io.IOException - If an IO error occurs.
public void processPage(PDPage page) throws java.io.IOException
Overrides: processPage in class LegacyPDFStreamEngine
page - The page to process.
java.io.IOException - If there is an error processing the page.
private void fillBeadRectangles(PDPage page)
protected void startArticle() throws java.io.IOException
java.io.IOException - If there is any error writing to the stream.
protected void startArticle(boolean isLTR) throws java.io.IOException
isLTR - true if primary direction of text is left to right.
java.io.IOException - If there is any error writing to the stream.
protected void endArticle() throws java.io.IOException
java.io.IOException - If there is any error writing to the stream.
protected void startPage(PDPage page) throws java.io.IOException
page - The page we are about to process.
java.io.IOException - If there is any error writing to the stream.
protected void endPage(PDPage page) throws java.io.IOException
page - The page we are about to process.
java.io.IOException - If there is any error writing to the stream.
protected void writePage() throws java.io.IOException
java.io.IOException - If there is an error writing the text.
private boolean overlap(float y1, float height1, float y2, float height2)
protected void writeLineSeparator() throws java.io.IOException
java.io.IOException - If there is a problem writing out the line separator to the document.
protected void writeWordSeparator() throws java.io.IOException
java.io.IOException - If there is a problem writing out the word separator to the document.
protected void writeCharacters(TextPosition text) throws java.io.IOException
text - The text to write to the stream.
java.io.IOException - If there is an error when writing the text.
protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions) throws java.io.IOException
The default implementation ignores textPositions and just calls writeString(String).
text - The text to write to the stream.
textPositions - The TextPositions belonging to the text.
java.io.IOException - If there is an error when writing the text.
protected void writeString(java.lang.String text) throws java.io.IOException
text - The text to write to the stream.
java.io.IOException - If there is an error when writing the text.
private boolean within(float first, float second, float variance)
first - The first number to compare to.
second - The second number to compare to.
variance - The allowed variance.
protected void processTextPosition(TextPosition text)
Overrides: processTextPosition in class LegacyPDFStreamEngine
text - The text to process.
public int getStartPage()
public void setStartPage(int startPageValue)
startPageValue - New value of 1-based startPage property.
public int getEndPage()
public void setEndPage(int endPageValue)
endPageValue - New value of 1-based endPage property.
public void setLineSeparator(java.lang.String separator)
separator - The desired line separator string.
public java.lang.String getLineSeparator()
public java.lang.String getWordSeparator()
public void setWordSeparator(java.lang.String separator)
separator - The desired word separator string.
public boolean getSuppressDuplicateOverlappingText()
protected int getCurrentPageNo()
protected java.io.Writer getOutput()
protected java.util.List<java.util.List<TextPosition>> getCharactersByArticle()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
public boolean getSeparateByBeads()
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
aShouldSeparateByBeads - The new grouping of beads.
public PDOutlineItem getEndBookmark()
public void setEndBookmark(PDOutlineItem aEndBookmark)
aEndBookmark - The ending bookmark.
public PDOutlineItem getStartBookmark()
public void setStartBookmark(PDOutlineItem aStartBookmark)
aStartBookmark - The starting bookmark.
public boolean getAddMoreFormatting()
public void setAddMoreFormatting(boolean newAddMoreFormatting)
newAddMoreFormatting - Tell PDFBox to add some more text formatting.
public boolean getSortByPosition()
public void setSortByPosition(boolean newSortByPosition)
newSortByPosition - Tell PDFBox to sort the text positions.
public float getSpacingTolerance()
public void setSpacingTolerance(float spacingToleranceValue)
spacingToleranceValue - tolerance / scaling factor to use.
public float getAverageCharTolerance()
public void setAverageCharTolerance(float averageCharToleranceValue)
averageCharToleranceValue - average tolerance / scaling factor to use.
public float getIndentThreshold()
public void setIndentThreshold(float indentThresholdValue)
indentThresholdValue - the number of whitespace character widths to use when detecting paragraph indents.
public float getDropThreshold()
public void setDropThreshold(float dropThresholdValue)
dropThresholdValue - the character height multiple for max allowed whitespace between lines in the same paragraph.
public java.lang.String getParagraphStart()
public void setParagraphStart(java.lang.String s)
s - the paragraph start string.
public java.lang.String getParagraphEnd()
public void setParagraphEnd(java.lang.String s)
s - the paragraph end string.
public java.lang.String getPageStart()
public void setPageStart(java.lang.String pageStartValue)
pageStartValue - the page start string.
public java.lang.String getPageEnd()
public void setPageEnd(java.lang.String pageEndValue)
pageEndValue - the page end string.
public java.lang.String getArticleStart()
public void setArticleStart(java.lang.String articleStartValue)
articleStartValue - the article start string.
public java.lang.String getArticleEnd()
public void setArticleEnd(java.lang.String articleEndValue)
articleEndValue - the article end string.
private PDFTextStripper.PositionWrapper handleLineSeparation(PDFTextStripper.PositionWrapper current, PDFTextStripper.PositionWrapper lastPosition, PDFTextStripper.PositionWrapper lastLineStartPosition, float maxHeightForLine) throws java.io.IOException
current - the current text position.
lastPosition - the previous text position.
lastLineStartPosition - the last text position that followed a line separator.
maxHeightForLine - max height for positions since lastLineStartPosition.
java.io.IOException - if something went wrong.
private void isParagraphSeparation(PDFTextStripper.PositionWrapper position, PDFTextStripper.PositionWrapper lastPosition, PDFTextStripper.PositionWrapper lastLineStartPosition, float maxHeightForLine)
This base implementation tests to see if the lastLineStartPosition is null OR if the current vertical position has dropped below the last text vertical position by at least 2.5 times the current text height OR if the current horizontal position is indented by at least 2 times the current width of a space character.
This also attempts to identify text that is indented under a hanging indent.
This method sets the isParagraphStart and isHangingIndent flags on the current position object.
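As a rough illustration of the three tests described above, the following plain-Java sketch reimplements the decision on a simplified position type. `Pos`, `startsParagraph`, and the constant names are hypothetical and not part of PDFBox. The multipliers 2.5 and 2.0 come from the description; taking the vertical gap as an absolute difference is an assumption, and hanging-indent detection is deliberately left out.

```java
final class ParagraphHeuristicSketch {
    // Multipliers quoted in the description above.
    static final float DROP_THRESHOLD = 2.5f;
    static final float INDENT_THRESHOLD = 2.0f;

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox type. */
    static final class Pos {
        final float x;          // horizontal start of the text
        final float y;          // vertical position of the text
        final float spaceWidth; // width of a space character in the current font

        Pos(float x, float y, float spaceWidth) {
            this.x = x;
            this.y = y;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Returns true if the current position looks like the start of a new paragraph. */
    static boolean startsParagraph(Pos current, Pos last, Pos lastLineStart,
                                   float maxHeightForLine) {
        // Test 1: no previous line start, so this is the first line.
        if (lastLineStart == null) {
            return true;
        }
        // Test 2: large vertical gap (assumption: absolute difference of y values).
        float yGap = Math.abs(current.y - last.y);
        if (yGap >= DROP_THRESHOLD * maxHeightForLine) {
            return true;
        }
        // Test 3: the line start is indented relative to the previous line start.
        float xGap = current.x - lastLineStart.x;
        return xGap >= INDENT_THRESHOLD * current.spaceWidth;
    }

    public static void main(String[] args) {
        Pos lineStart = new Pos(10f, 100f, 3f);
        Pos lineEnd = new Pos(200f, 100f, 3f);
        // Next line, same left margin, normal leading: not a paragraph start.
        System.out.println(startsParagraph(new Pos(10f, 112f, 3f), lineEnd, lineStart, 10f));
        // Next line indented by more than two space widths: paragraph start.
        System.out.println(startsParagraph(new Pos(20f, 112f, 3f), lineEnd, lineStart, 10f));
    }
}
```

In the class itself the two multipliers are not fixed: they can be changed through setDropThreshold(float) and setIndentThreshold(float).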
position - the current text position. This may have its isParagraphStart or isHangingIndent flags set upon return.
lastPosition - the previous text position (should not be null).
lastLineStartPosition - the last text position that followed a line separator, or null.
maxHeightForLine - max height for text positions since lastLineStartPosition.
private float multiplyFloat(float value1, float value2)
protected void writeParagraphSeparator() throws java.io.IOException
java.io.IOException - if something went wrong.
protected void writeParagraphStart() throws java.io.IOException
java.io.IOException - if something went wrong.
protected void writeParagraphEnd() throws java.io.IOException
java.io.IOException - if something went wrong.
protected void writePageStart() throws java.io.IOException
java.io.IOException - if something went wrong.
protected void writePageEnd() throws java.io.IOException
java.io.IOException - if something went wrong.
private java.util.regex.Pattern matchListItemPattern(PDFTextStripper.PositionWrapper pw)
The list of Patterns tested against is given by the getListItemPatterns() method. To add to the list, simply override that method (if sub-classing) or explicitly supply your own list using setListItemPatterns(List).
pw - position
protected void setListItemPatterns(java.util.List<java.util.regex.Pattern> patterns)
patterns - list of patterns
protected java.util.List<java.util.regex.Pattern> getListItemPatterns()
This method returns a list of such regular expression Patterns.
protected static java.util.regex.Pattern matchPattern(java.lang.String string, java.util.List<java.util.regex.Pattern> patterns)
Order of the supplied list of patterns is important as most common patterns should come first. Patterns should be strict in general, and all will be used with case sensitivity on.
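A minimal sketch of this first-match lookup, using only java.util.regex. The class name, the `firstMatch` helper, and the three patterns are hypothetical and are not the library's built-in list. Requiring the whole string to match (`Matcher.matches()`) is an assumption about what "matches" means here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ListItemMatchSketch {
    // Hypothetical list item patterns, ordered from most to least common.
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("\\d+\\."),   // "1.", "12."
            Pattern.compile("[a-z]\\)"),  // "a)", "b)" (lower case only)
            Pattern.compile("[-*]"));     // "-" or "*" bullets

    /** Returns the first pattern that matches the whole string, or null if none does. */
    static Pattern firstMatch(String string, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(string).matches()) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("12.", PATTERNS)); // the numbered-item pattern
        System.out.println(firstMatch("A)", PATTERNS));  // no match: patterns are case sensitive
    }
}
```

Because the first hit wins, a broad pattern placed early would shadow later, more specific ones, which is why both the ordering and the strictness of the patterns matter.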
string - the string to be searched
patterns - list of patterns
private void writeLine(java.util.List<PDFTextStripper.WordWithTextPositions> line) throws java.io.IOException
line - a list with the words of the given line
java.io.IOException - if something went wrong
private java.util.List<PDFTextStripper.WordWithTextPositions> normalize(java.util.List<PDFTextStripper.LineItem> line)
line - list of TextPositions
private java.lang.String handleDirection(java.lang.String word)
word - The word that shall be processed
private static void parseBidiFile(java.io.InputStream inputStream) throws java.io.IOException
inputStream - The bidi file as an input stream
java.io.IOException - if any line could not be read by the LineNumberReader
private PDFTextStripper.WordWithTextPositions createWord(java.lang.String word, java.util.List<TextPosition> wordPositions)
Used within normalize(List) to create a single PDFTextStripper.WordWithTextPositions entry.
private java.lang.String normalizeWord(java.lang.String word)
word - Word to normalize
private java.lang.StringBuilder normalizeAdd(java.util.List<PDFTextStripper.WordWithTextPositions> normalized, java.lang.StringBuilder lineBuilder, java.util.List<TextPosition> wordPositions, PDFTextStripper.LineItem item)
Used within normalize(List) to handle a TextPosition.