python 3.x - After using pdftotext: find page of string from txt


I am coding in Python and have managed to use pdftotext to extract the text from a PDF.

That text file is split into a list of strings. Using a regular expression, I am able to find the specific words I am interested in. The reason I divide the text into a list is that I want to measure the distance between two specific words, where by distance I mean the number of words in between the two words.
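For illustration, the distance measure described above could look roughly like the following. This is a minimal sketch, not the asker's actual code: the helper names `tokenize` and `words_between` are hypothetical, and treating `\w+` as "a word" is an assumption.

```python
import re


def tokenize(text):
    # Assumption: a "word" is any run of letters, digits, or underscores.
    return re.findall(r"\w+", text)


def words_between(words, first, second):
    """Count the words between the first occurrence of `first`
    and the next occurrence of `second` after it.

    Returns None if either word cannot be found in that order.
    """
    lowered = [w.lower() for w in words]
    try:
        i = lowered.index(first.lower())
        j = lowered.index(second.lower(), i + 1)
    except ValueError:
        return None
    return j - i - 1


words = tokenize("The contract starts in May and the payment is due in June.")
print(words_between(words, "contract", "payment"))
```

The comparison is case-insensitive and only looks at the first matching pair; a real search would probably need to handle repeated occurrences.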

However, after finding the position of the words, I would like to be able to refer back to the initial PDF. In detail, I am interested in the page, and maybe the line (if PDF supports that kind of structure), where these words are located.

One idea I have is to process each page of the PDF separately, so that when I find these words I know which page they were on. This has a big disadvantage: page breaks are not natural boundaries. That is, I lose the ability to find the words if they happen to be separated by a page break.
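One way to keep the page information without searching page by page is to tag every word with its page and then search the whole document as a single list. The sketch below assumes the text was produced by the pdftotext command-line tool, which by default writes a form feed character (`\f`) at each page break (the `-nopgbrk` option turns this off). The helper names `words_with_pages` and `find_pair` are hypothetical, and the line numbers are simply line positions in the extracted text of each page, which may not match the visual lines of the PDF.

```python
import re


def words_with_pages(text):
    """Return (word, page_number, line_number) tuples for the whole document.

    Assumes pages are separated by form feed characters, as in default
    pdftotext output. Page and line numbers start at 1.
    """
    tokens = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            for match in re.finditer(r"\w+", line):
                tokens.append((match.group(), page_no, line_no))
    return tokens


def find_pair(tokens, first, second):
    """Locate `first` and the next `second`, even across a page break.

    Returns (first_token, second_token, words_in_between), or None
    if the two words do not occur in that order.
    """
    words = [word.lower() for word, _, _ in tokens]
    try:
        i = words.index(first.lower())
        j = words.index(second.lower(), i + 1)
    except ValueError:
        return None
    return tokens[i], tokens[j], j - i - 1


# Two pages; the words of interest sit on either side of the page break.
sample = "intro text\nthe contract starts\fhere and the payment\nis due\f"
print(find_pair(words_with_pages(sample), "contract", "payment"))
```

Because the word list spans all pages, the distance between two words is still measurable when a page break falls between them, and each hit carries the page it came from.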

Do you have any idea how to do this in a more sophisticated manner?

You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file. Which one to use depends on what you want to do with the text after extraction. ReadingOrderTextExtractor will create a list of lists that allows you to extract the text and examine the content of paragraphs, sentences within paragraphs, and words within a sentence. You'll not only be able to tell the distance between words, but also whether they are in the same sentence or paragraph. Once you've found a word object, you can find both its location on the page, allowing for highlighting, and the page number it's on.

