Posts

Finding difference between 2 files in Python

In this post, we will take a look at how to compare two files using Python. I was tasked with comparing 2 files and listing the differences between them using Python. Initially, I started with the filecmp module, but even with the function parameter 'shallow' set to False, the Boolean result was not enough. Sure, it can act as an indicator to take some action, but it will not list the differences. I was looking for something more visual, with color coding, and not like the git diff output, which is not very user-friendly. Another module from the Python standard library, difflib, helped me get the job done. Inside difflib, HtmlDiff is what I was looking for. The differences were highlighted with 3 different colors, and the line numbers were shown in a table to locate the differences. The results are quite self-explanatory, which makes it easier to explain the differences to other people. Code for generating the above difference table: Note: File1...
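The approach described in this excerpt can be sketched roughly as follows. This is a minimal illustration and not the post's original code: the file names, their contents, and the helper name html_diff are assumptions made up for the example. It first shows the Boolean result from filecmp.cmp and then builds the HTML table with difflib.HtmlDiff.

```python
import difflib
import filecmp
import tempfile
from pathlib import Path


def html_diff(path_a, path_b):
    """Return a complete HTML page with a side-by-side diff table.

    html_diff is a hypothetical helper name used only for this sketch.
    """
    lines_a = Path(path_a).read_text().splitlines()
    lines_b = Path(path_b).read_text().splitlines()
    return difflib.HtmlDiff().make_file(
        lines_a, lines_b, fromdesc=str(path_a), todesc=str(path_b)
    )


with tempfile.TemporaryDirectory() as tmp:
    # Two small sample files (made-up content) that differ in two places.
    file_a = Path(tmp) / "a.txt"
    file_b = Path(tmp) / "b.txt"
    file_a.write_text("alpha\nbeta\ngamma\n")
    file_b.write_text("alpha\nBETA\ngamma\ndelta\n")

    # filecmp only answers "same or not"; shallow=False compares contents.
    same = filecmp.cmp(file_a, file_b, shallow=False)

    # difflib.HtmlDiff produces the table that lists each difference.
    report = html_diff(file_a, file_b)
    (Path(tmp) / "diff.html").write_text(report)

print("Files identical:", same)
```

Opening the generated diff.html in a browser shows both files side by side with line numbers. The default stylesheet that HtmlDiff embeds defines separate highlight classes for added, changed, and deleted text.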
Recent posts

Cracking the Coding Interview: Stacks and Queues (3.2)

In this post, we are going to go over my notes from solving a coding question. Today I am sharing my thoughts on the Cracking the Coding Interview question for chapter 3, Stacks and Queues, Q3.2. Below are my personal notes and working while solving the problem. I might be correct, wrong, or totally wrong, but I wanted to have a working copy of my train of thought while solving the problem. Notes section ************************************** What operations will the stack have? push, pop, getTop and isEmpty. Here all the operations will have a time complexity of O(1). While performing the push operation, we can maintain the min value. While performing the pop operation, how do we find the next min value? - When we are pushing the element, we can push the item together with the min value up to that item in the stack. That way, we keep track of the min value up to a particular item when we start popping. With this, the min function can also work in O(1) after a pop operation. push === 1. check ...
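The idea in these notes, storing each item together with the minimum up to that item, could look something like the sketch below. The class and method names are assumptions chosen for illustration, not the post's own solution.

```python
class MinStack:
    """Stack whose push, pop, get_top, get_min and is_empty are all O(1).

    Each entry is a (value, min_so_far) pair, so the minimum of the
    remaining items is still known after any pop.
    """

    def __init__(self):
        self._items = []  # list of (value, min_so_far) tuples

    def is_empty(self):
        return not self._items

    def push(self, value):
        # The new minimum is the smaller of the value and the previous minimum.
        if self._items:
            current_min = min(value, self._items[-1][1])
        else:
            current_min = value
        self._items.append((value, current_min))

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()[0]

    def get_top(self):
        if not self._items:
            raise IndexError("stack is empty")
        return self._items[-1][0]

    def get_min(self):
        if not self._items:
            raise IndexError("stack is empty")
        return self._items[-1][1]


stack = MinStack()
for number in (5, 3, 7):
    stack.push(number)
min_before_pops = stack.get_min()  # minimum of 5, 3, 7
stack.pop()                        # removes 7
stack.pop()                        # removes 3
min_after_pops = stack.get_min()   # only 5 is left
```

The trade-off in this sketch is extra memory: every entry carries its own copy of the running minimum, in exchange for never having to scan the stack after a pop.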

Adding existing Anaconda environment to Jupyter notebook

In this post, we are going to take a look at adding an Anaconda environment to Jupyter Notebook. Recently, I was working on a CSV file and wanted to use the Pandas package for tabular data manipulation in Python. The problem was that even if I installed the Pandas package, I would have to install other data science packages as needed. But the Anaconda environment was already set up on my laptop, and I wanted to reuse it. Today, we will look into how to reuse the Anaconda environment within Jupyter Notebook. There are 4 basic steps to be followed for adding the environment: 1. Create a conda environment Go to the Conda command prompt (run in Admin mode). Run the following command: conda create --name newenv O/P: What if there is an existing conda environment? Go to the Conda command prompt (no need for Admin mode). Run the following command: conda env list O/P: Since there was only one environment, only one entry was displayed. ‘*’ indicates the cur...

PyMuPDF vs PDFMiner

As a developer, I was tasked with extracting specific data from a PDF. Upon analysing it further, certain patterns were found based on keywords in the document. Since I was using Python for the task, I found 2 tools quite useful: PyMuPDF and PDFMiner. These tools can be used to extract the text from a page, and regular expressions can then be applied to that text to extract the relevant data. Next, we are going to take a deeper look into these tools, specifically focusing on the pros and cons of each. PyMuPDF (Docs, PIP package) Pros: simple and understandable API; extensive tools to work with text, images, and graphics; available as a PIP package (pip install PyMuPDF); better support for a range of symbols compared to PyPDF2. Cons: parsed text is not in sequence; dependency on another package, Fitz; text sequence information lost during extraction. PDFMiner (Docs, PIP package) ...
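The second step mentioned in this excerpt, applying regular expressions to text already extracted from a page, might look roughly like this. Everything specific here is hypothetical: the sample page text, the keywords, and the helper name extract_fields are invented for illustration, and in practice page_text would come from whichever PDF library is used.

```python
import re

# Hypothetical page text; in practice this would be the string returned
# by a PDF text extraction library for one page.
page_text = """Invoice Number: INV-1042
Date: 2021-03-15
Total Amount: 1,250.00 USD
"""

# Hypothetical keyword patterns. Each one captures the value that
# follows a keyword found in the document.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total Amount:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict of field name to captured value for every pattern that matches."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields


fields = extract_fields(page_text)
print(fields)
```

Keeping the patterns in one dictionary makes it easy to add or adjust keywords without touching the extraction loop.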

Google Colab to Databricks

Hello everyone. Recently I started using a new tool for data analytics (Databricks Community Edition), and this post shares my experience in moving from Google Colab to Databricks. What is a Databricks Community Edition notebook? It is a powerful platform for collaboration among data analysts, data scientists, and data engineers. You can think of it as a cloud-based data analytics platform which gives you a chance to tap into Spark and other open-source tools. First impressions on moving from Google Colab to Databricks: At first, it seems similar to Colab, with the same notebook environment. But when you start working with it, you’ll notice the differences. The first major difference I noticed was the filesystem. Also, many other notebook features like Spark, SQL, and SQLAnalytics can be accessed for learning at no cost. Another major difference is that Databricks has two filesystems, one local and another on AWS. If you want to upload any f...

Scikit learn design principles

In this post, we are going to take a look at the design principles of the very popular library Scikit learn. If you are into machine learning and deep learning, then you might be familiar with the Scikit learn library. Beginners might have only a small hint of how things work around here, and this post will help you get a general idea about this open-source library. The following topics will be covered in this post: 1. What is Scikit learn 2. Some details about Scikit learn 3. List and describe the design principles 1. What is Scikit learn? Scikit learn is an open-source library written in Python which supports many machine learning algorithms for classification, regression, clustering, and many other tasks. It was designed to work in harmony with other libraries like NumPy and SciPy. 2. Some more details about Scikit learn The first public release of Scikit was on February 1, 2010 and was designed extensively by developers at the French I...

Managing data in NumPy - Part 2

This post is the second part of the series on data management in NumPy. We are going to continue with a different set of functions, covering their syntax, code snippets, and descriptions.