text-extraction – Make Me Engineer

C# Extract text from PDF using PdfSharp

June 10, 2023 by Tarik

Took Sergio’s answer and made some extension methods. I also changed the accumulation of strings into an iterator. public static class PdfSharpExtensions { public static IEnumerable<string> ExtractText(this PdfPage page) { var content = ContentReader.ReadContent(page); var text = content.ExtractText(); return text; } public static IEnumerable<string> ExtractText(this CObject cObject) { if (cObject is COperator) { var cOperator … Read more

Get last whole number in a string

June 1, 2023 by Tarik

You could do: $text = "1 out of 23"; if (preg_match_all('/\d+/', $text, $numbers)) $lastnum = end($numbers[0]); Note that $numbers[0] contains the array of strings that matched the full pattern. This pattern has no parenthesized subpattern, so there is no $numbers[1]; if the pattern had one, $numbers[1] would contain the strings it captured.
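The same idea (collect every run of digits, then take the last one) can be sketched in Python. This is a minimal illustration, not part of the original answer; the function name last_whole_number is made up for the example.

```python
import re


def last_whole_number(text):
    """Return the last run of digits in text as an int, or None if there is none."""
    matches = re.findall(r"\d+", text)  # every whole number, in order of appearance
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    print(last_whole_number("1 out of 23"))
```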

How to extract just plain text from .doc & .docx files? [closed]

November 3, 2022 by Tarik

If you want the pure plain text (my requirement), then all you need is:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

I found this at commandlinefu. It unzips the docx file, gets the actual document, then strips all the XML tags. Obviously all formatting is lost.
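For anyone without unzip and sed at hand, the same two steps can be approximated in Python with the standard library. This is a rough sketch under assumptions: the function name docx_to_plain_text is made up, and str.isprintable() stands in for sed's [:print:] class, so non-ASCII letters are kept rather than dropped.

```python
import re
import zipfile


def docx_to_plain_text(path_or_file):
    """Read word/document.xml from a .docx archive and strip the XML tags."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Step 1: drop everything that looks like a tag, as the first sed expression does.
    text = re.sub(r"<[^>]+>", "", xml)
    # Step 2: drop non-printable characters (assumption: isprintable ~ [:print:]).
    return "".join(ch for ch in text if ch.isprintable())
```

As with the shell version, paragraph breaks and all formatting are lost, because the tags that carried them are removed.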

How to extract text from a PDF? [closed]

October 6, 2022 by Tarik

I was given a 400-page PDF file with a table of data that I had to import – luckily no images. Ghostscript worked for me:

gswin64c -sDEVICE=txtwrite -o output.txt input.pdf

The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, … Read more

regular expression to extract text from HTML

September 3, 2022 by Tarik

Remove JavaScript and CSS: <(script|style).*?</\1>

Remove tags: <.*?>
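Applied in order, the two patterns might look like this in Python. The DOTALL and IGNORECASE flags are an assumption added here (the answer does not mention flags): without DOTALL, `.` would not cross the newlines that script and style blocks usually contain. The name strip_html is made up for the sketch.

```python
import re

# Pattern 1 from the answer: a whole <script>...</script> or <style>...</style> block.
SCRIPT_OR_STYLE = re.compile(r"<(script|style).*?</\1>", re.DOTALL | re.IGNORECASE)
# Pattern 2 from the answer: any remaining tag.
TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_html(html):
    """Remove script/style blocks first, then every remaining tag."""
    without_code = SCRIPT_OR_STYLE.sub("", html)
    return TAG.sub("", without_code)
```

The order matters: if the tags were removed first, the JavaScript and CSS source would be left behind as text.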

How to extract text from MS office documents in C#

July 26, 2022 by Tarik

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code will open a document and return its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you … Read more

How to extract string following a pattern with grep, regex or perl [duplicate]

July 24, 2022 by Tarik

Since you need to match content without including it in the result (must match name=" but it’s not part of the desired result), some form of zero-width matching or group capturing is required. This can be done easily with the following tools: Perl With Perl you could use the n option to loop line by … Read more
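To make the two approaches concrete, here is a small Python sketch (Python rather than grep or Perl, purely for illustration). The sample line and both function names are hypothetical; only the name=" prefix comes from the answer.

```python
import re

# Hypothetical input line; only the name=" prefix is taken from the answer.
SAMPLE = '<input type="text" name="username" value="x">'


def extract_with_group(line):
    """Group capturing: match name="..." but return only the parenthesized part."""
    match = re.search(r'name="([^"]*)"', line)
    return match.group(1) if match else None


def extract_with_lookaround(line):
    """Zero-width matching: lookbehind and lookahead assert the context without consuming it."""
    match = re.search(r'(?<=name=")[^"]*(?=")', line)
    return match.group(0) if match else None
```

Both functions return the same text; the difference is whether the surrounding name="..." is part of the overall match (group capturing) or only asserted around it (lookaround).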

Getting URL parameter in java and extract a specific text from that URL

July 8, 2022 by Tarik

I think one of the easiest ways would be to parse the string returned by URL.getQuery() as: public static Map<String, String> getQueryMap(String query) { String[] params = query.split("&"); Map<String, String> map = new HashMap<String, String>(); for (String param : params) { String name = param.split("=")[0]; String value = param.split("=")[1]; map.put(name, value); } return … Read more
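For comparison, the same query-to-map idea can be sketched in Python with the standard library, which handles the splitting and percent-decoding. This is not the Java answer's code; get_query_map and the example URL are made up for the illustration.

```python
from urllib.parse import parse_qsl, urlsplit


def get_query_map(url):
    """Return the query parameters of url as a dict of name -> value."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters such as "flag=" instead of dropping them.
    # If a name repeats, dict() keeps the last value.
    return dict(parse_qsl(query, keep_blank_values=True))


if __name__ == "__main__":
    print(get_query_map("https://example.com/watch?v=abc123&t=42"))
```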

Extracting text from a PDF file using PDFMiner in python?

May 17, 2022 by Tarik

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016): from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from pdfminer.pdfpage import PDFPage from io import StringIO def convert_pdf_to_txt(path): rsrcmgr = PDFResourceManager() retstr = StringIO() codec="utf-8" laparams = LAParams() device = TextConverter(rsrcmgr, … Read more

How to extract a substring using regex

May 10, 2022 by Tarik

Assuming you want the part between single quotes, use this regular expression with a Matcher: "'(.*?)'" Example: String mydata = "some string with 'the data i want' inside"; Pattern pattern = Pattern.compile("'(.*?)'"); Matcher matcher = pattern.matcher(mydata); if (matcher.find()) { System.out.println(matcher.group(1)); } Result: the data i want