Efficient way to process large text/log files using awk with Python

If you are trying to read or process a large (>5 GB) text file, you may run into a memory error or performance problems. So I came up with the idea of integrating fast C-based command-line tools with Python to do this in a better and more efficient way.

We have several command-line tools for processing text files. In this blog I am going to use awk; you can download awk for Windows. After the installation, you need to add the installation directory to the PATH environment variable to access awk through the command prompt.

In Python we have the os and subprocess modules to interact with the shell: whenever we use these methods, the operating system's shell is opened, the command is executed, and the shell is closed after execution.

os.system(): We use this method to run shell commands from Python. It is very useful when we need to redirect the output to a file, but it cannot assign the output of the command to a Python variable; it only returns the command's exit status.

Syntax: os.system(command)
command: shell command to execute.
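As a minimal sketch of this (assuming awk is available on the PATH; the sample file here is just an illustrative temporary file standing in for a real log), the command itself has to redirect its output to a file, since os.system only gives us the exit status:

```python
import os
import tempfile

# Create a small sample file (a stand-in for a real log file).
sample = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
sample.write("line 1\nline 2\nline 3\n")
sample.close()

out_path = sample.name + ".count"

# os.system only returns the exit status, so the command itself
# must redirect its output to a file if we want to keep it.
status = os.system(f'awk "END{{print NR}}" {sample.name} > {out_path}')

with open(out_path) as f:
    print(f.read().strip())  # the line count written by awk
```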

subprocess.check_output(): This method returns the output of a shell command to a Python variable. We need to decode the output to UTF-8 because it is returned as a byte string by default.

Syntax: subprocess.check_output(command)
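Here is a small sketch of capturing a command's output this way (again assuming awk is on the PATH, with an illustrative temporary file). Passing the command as a list avoids shell quoting differences between platforms:

```python
import subprocess
import tempfile

# Create a small sample file (a stand-in for a real log file).
sample = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
sample.write("alpha\nbeta\ngamma\n")
sample.close()

# check_output returns bytes, so decode to get a normal string.
raw = subprocess.check_output(["awk", "END{print NR}", sample.name])
line_count = raw.decode("utf-8").strip()
print(line_count)
```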

I have a 6 GB text file that I need to process and perform some operations on. Now I am going to explain some examples of when and how we can use awk commands with our Python code.

Example 1: Count the number of lines in the text file.

Method 1: The usual way of counting the number of lines in a file with Python

file = "test.txt"

with open(file, errors='ignore') as f:
    for i, l in enumerate(f):
        pass
no_of_lines = i + 1

Method 2: Using awk with python

import subprocess
file = "test.txt"

cmd = 'awk "END{print NR}" test.txt'
no_of_lines = subprocess.check_output(cmd, shell=True).decode("utf-8")

In the above example, Method 1 took 30 seconds and Method 2 took 20 seconds, just to count the number of lines.
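Timings like these depend on your hardware and file size, but you can reproduce the comparison yourself with a sketch along these lines (using a small generated file, and assuming awk is on the PATH):

```python
import subprocess
import tempfile
import time

# Build a modest test file; the real gap shows up on multi-GB files.
sample = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
for i in range(100_000):
    sample.write(f"log entry {i}\n")
sample.close()

# Method 1: pure Python line counting.
t0 = time.perf_counter()
with open(sample.name, errors="ignore") as f:
    py_count = sum(1 for _ in f)
py_time = time.perf_counter() - t0

# Method 2: let awk do the counting.
t0 = time.perf_counter()
raw = subprocess.check_output(["awk", "END{print NR}", sample.name])
awk_count = int(raw.decode("utf-8"))
awk_time = time.perf_counter() - t0

print(py_count, awk_count)  # both methods should agree on the count
```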

Example 2: Fetch the lines between two line numbers and save them to a file

Method 1:

from itertools import islice

file = "test.txt"
start_line, end_line = 1, 100000

output = ''
with open(file, errors='ignore') as f:
    # islice counts from zero, so shift the 1-based line numbers
    for line in islice(f, start_line - 1, end_line):
        output += line

with open("output.txt", 'w') as outfile:
    outfile.write(output)

Method 2:

import os

file = "test.txt"
start_line, end_line = 1, 100000

cmd = f'awk "NR>={start_line} && NR<={end_line}" {file}'
os.system(cmd + " > output.txt")

In Example 2, Method 1 took 1 minute 20 seconds, but Method 2 took just 2 seconds to fetch the first 100,000 lines and store them in a file. Now you can see the huge difference in execution time. If you run more complex operations on large text/log files the usual way, they can take hours or hit a memory error. In that situation, you can save time and avoid memory errors by integrating awk with Python.

Example 3: Pattern match

For pattern matching in pure Python, we would have to read the whole file into memory or check the pattern line by line. Neither is a good idea for a 10 GB file. So in this situation, we can use grep for pattern matching from Python. We also have the sed tool, which we can use as well. grep and sed work much like awk, but the syntax is different.

Syntax: grep pattern filename

import os

pattern = r'[a-zA-Z]\+\.\?[a-zA-Z]\+@[a-z]\+\.[a-z]*'
cmd = f'grep -o "{pattern}" test.txt'
os.system(cmd + " > emails.txt")

In the above example, the pattern finds the email addresses present in a text file and stores them in a new file. grep is one of the best tools for pattern matching, so please check the documentation to learn more about grep (windows grep documentation).
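If you want the matches in a Python list rather than an intermediate file, subprocess can capture them directly. This is a sketch of the same idea (assuming GNU grep on the PATH, since `\+` and `\?` in basic regular expressions are GNU extensions; the sample file and addresses are illustrative):

```python
import subprocess
import tempfile

# A small illustrative file containing email addresses.
sample = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
sample.write("contact alice@example.com or bob@example.org today\n")
sample.close()

# -o prints each match on its own line instead of the whole line.
raw = subprocess.check_output(
    ["grep", "-o", r"[a-zA-Z]\+\.\?[a-zA-Z]\+@[a-z]\+\.[a-z]*", sample.name]
)
emails = raw.decode("utf-8").split()
print(emails)
```

Note that grep exits with status 1 when nothing matches, which check_output raises as an error, so wrap the call in try/except if empty results are expected.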

Run the `--help` flag in the command prompt or terminal to explore more features of these tools.

Syntax: ToolName --help

awk --help

I tried to explain this in a simple way; if you found it helpful, please share it with your friends. If you have any doubts, feel free to drop a comment.
