As a developer, I was tasked with extracting specific data from a PDF. On analysing the document further, I found patterns based on keywords in the text. Since I was using Python for the task, I found two tools quite useful: PyMuPDF and PDFMiner. These tools extract the text from a page, and regular expressions can then be applied to that text to pull out the relevant data. Next, we take a closer look at these tools, focusing on the pros and cons of each.

PyMuPDF
Docs, PIP package

Pros
- Simple and understandable API
- Extensive tools for working with text, images, and graphics
- Available as a PIP package (pip install PyMuPDF)
- Better support for a range of symbols compared to PyPDF2

Cons
- Parsed text is not in sequence
- Dependency on another package (Fitz)
- Text sequence information is lost during extraction

PDFMiner
Docs, PIP package ...
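
To make the extract-then-match workflow from the introduction concrete, here is a minimal sketch using PyMuPDF and Python's re module. The field names (Invoice No, Date, Total), the pattern, and the sample text are hypothetical and only stand in for whatever keywords your own document contains. The page-reading helper assumes a recent PyMuPDF release, where the page method is called get_text (older releases used getText).

```python
import re

# Hypothetical pattern: a known keyword, a colon, then the value on the same line.
FIELD_PATTERN = re.compile(
    r"^(Invoice No|Date|Total)\s*:\s*(.+?)\s*$",
    re.MULTILINE,
)


def read_page_text(pdf_path, page_number=0):
    """Return the plain text of a single page using PyMuPDF."""
    # PyMuPDF is imported under the name fitz. The import lives inside the
    # function so the regex helper below can be used without PyMuPDF installed.
    import fitz

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text()


def extract_fields(text):
    """Map each matched keyword to the value that follows it."""
    return {key: value for key, value in FIELD_PATTERN.findall(text)}


# Sample text standing in for the output of read_page_text("some.pdf").
sample = "Invoice No: 1234\nDate : 2021-03-01\nSome other line\nTotal: 99.50\n"
fields = extract_fields(sample)
print(fields)
```

Because the parsed text may not come out in reading order, a pattern anchored to a single line, as above, tends to be more robust than one that relies on the order of lines on the page.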