alvas alvas - 1 year ago 39
Perl Question

Lowercasing script in Python vs Perl

In Perl, to lowercase a textfile, I could do the following


#!/usr/bin/env perl

use warnings;
use strict;

binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");

while(<STDIN>) {
    print lc($_);
}

And on the command line:
perl lowercase.perl < infile.txt > lowered.txt

In Python, I could do it with

#!/usr/bin/env python
import io
import sys

with io.open(sys.argv[1], 'r', encoding='utf8') as fin:
    with io.open(sys.argv[2], 'w', encoding='utf8') as fout:
        fout.write(fin.read().lower())

And on the command line:
python infile.txt lowered.txt

Is the way the Perl script reads its input different from the way the Python script does?

Does it stream the input and lowercase it as it outputs? Or does it read the whole file, as the Python script does?

Instead of reading in a whole file, is there a way to stream the input into Python and output the lowered case byte by byte or char by char?

Is there a way to control the command-line syntax such that it follows the Perl STDIN and STDOUT? E.g.
python < infile.txt > lowered.txt


There seem to be two interleaved issues here, and I address those first. To make both Perl and Python accept either invocation with very similar behavior, see the second part of the post.

Short: They differ in how they do I/O but both work line-by-line, and the Python code is easily changed to allow the same command-line invocation as the Perl code. Also, both can be written to accept input either from a file or from the standard input stream.

(1)   Both of your solutions are "streaming," in the sense that they both process input line-by-line. Perl code reads from STDIN while Python code gets data from a file, but they both get a line at a time. In that sense they are comparable in efficiency for large files.

A standard way to both read and write files line-by-line in Python is

with open('infile', 'r') as fin, open('outfile', 'w') as fout:
    fout.writelines(line.lower() for line in fin)  # lines are pulled from fin one at a time

See, for example, these SO posts on processing a very large file and on reading and writing files. The way you read the file seems idiomatic for line-by-line processing; see for example SO posts on reading a large file line-by-line, on idiomatic line-by-line reading, and another one on line-by-line reading.

Change the first open here to take the file name directly from the first command-line argument (sys.argv[1]), and add modes as needed.
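As a rough sketch of that change, the version below reads and writes one line at a time with an explicit encoding. The function name lowercase_file and the temporary file names are made up for the illustration; only io.open and the file-iteration idiom come from the discussion above.

```python
import io
import os
import tempfile


def lowercase_file(in_path, out_path, encoding="utf-8"):
    """Copy in_path to out_path, lowercasing one line at a time."""
    # io.open takes the encoding as a keyword argument; its third
    # positional parameter is the buffer size, not the encoding.
    with io.open(in_path, "r", encoding=encoding) as fin, \
            io.open(out_path, "w", encoding=encoding) as fout:
        for line in fin:  # the file object yields one line per iteration
            fout.write(line.lower())


# Small demo on temporary files (hypothetical names, not from the question).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "infile.txt")
dst = os.path.join(tmp_dir, "lowered.txt")
with io.open(src, "w", encoding="utf-8") as f:
    f.write(u"Hello WORLD\n\u00c4\u00d6\u00dc\n")

lowercase_file(src, dst)

with io.open(dst, "r", encoding="utf-8") as f:
    lowered = f.read()
print(ascii(lowered))
```

To get the command line from the question, one would call lowercase_file(sys.argv[1], sys.argv[2]) instead of the demo at the bottom.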

(2)   The command line with both input and output redirection that you show is a shell feature

./program < input > output

The program is fed lines through the standard input stream (file descriptor 0). The shell supplies them from the file input via its < redirection. From the GNU Bash manual (see 3.6.1), where "word" stands for our "input":

Redirection of input causes the file whose name results from the expansion of word to be opened for reading on file descriptor n, or the standard input (file descriptor 0) if n is not specified.

Any program can be written to do that, i.e., to act as a filter. For Python you can use

import sys
for line in sys.stdin:
    # each line already ends with a newline, so write() avoids doubling it
    sys.stdout.write(line.lower())

See for example a post on writing filters. Now you can invoke it as ./program < input in a shell.

The code prints to standard output, which can then be redirected by the shell using >. Then you get the same invocation as for the Perl script.

I take it that the standard output redirection > is clear in both cases.
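The Perl script also sets UTF-8 on both streams with binmode. A minimal sketch of a Python filter along the same lines is below, assuming Python 3. The helper name lowercase_stream is hypothetical, and the demo uses in-memory streams in place of the real stdin and stdout so that it runs without any redirection.

```python
import io


def lowercase_stream(src, dst):
    """Write every line of the text stream src to dst, lowercased."""
    for line in src:
        # each line keeps its trailing newline, so nothing extra is added
        dst.write(line.lower())


# Demo with in-memory streams standing in for stdin and stdout.
fake_stdin = io.StringIO(u"First LINE\nSecond Line\n")
fake_stdout = io.StringIO()
lowercase_stream(fake_stdin, fake_stdout)
print(fake_stdout.getvalue(), end="")

# In a real filter (Python 3.7 or later), a rough counterpart of the two
# binmode calls followed by the loop would be:
#     import sys
#     sys.stdin.reconfigure(encoding="utf-8")
#     sys.stdout.reconfigure(encoding="utf-8")
#     lowercase_stream(sys.stdin, sys.stdout)
```

Keeping the loop in a function that takes any two text streams means the same code serves files, pipes, and tests alike.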

Finally, you can bring both to nearly identical behavior, allowing either invocation, in this way.

In Perl, there is the following idiom

while (my $line = <>) {
    # process $line
}

The diamond operator <> either takes line by line from all files submitted on the command line (found in @ARGV), or it gets its lines from STDIN (if data is somehow piped into the script). From I/O Operators in perlop

The null filehandle <> is special: it can be used to emulate the behavior of sed and awk, and any other Unix filter program that takes a list of filenames, doing the same to each line of input from all of them. Input from <> comes either from standard input, or from each file listed on the command line. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is empty, $ARGV[0] is set to "-" , which when opened gives you standard input. The @ARGV array is then processed as a list of filenames.

In Python you get practically the same behavior by

import fileinput
for line in fileinput.input():
    pass  # process line

This also goes through the lines of the files named in sys.argv, defaulting to sys.stdin if the list is empty. From the fileinput documentation:

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty. If a filename is '-', it is also replaced by sys.stdin. To specify an alternative list of filenames, pass it as the first argument to input(). A single file name is also allowed.

In both cases, if there are command-line arguments other than file names, more needs to be done.
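One possible way to handle that in Python is to parse the options first and hand only the remaining file names to fileinput. The sketch below assumes argparse for the parsing; the --upper option and the names parse_args and run are invented purely to have a non-file argument to show.

```python
import argparse
import fileinput
import io
import os
import sys
import tempfile


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Change the case of text.")
    # --upper is a made-up option, here only as an example of a non-file argument
    parser.add_argument("--upper", action="store_true", help="uppercase instead")
    parser.add_argument("files", nargs="*", help="input files; stdin if none given")
    return parser.parse_args(argv)


def run(argv, out=None):
    out = out if out is not None else sys.stdout
    args = parse_args(argv)
    convert = str.upper if args.upper else str.lower
    # Pass the file list explicitly so that fileinput does not look at
    # sys.argv and mistake the option for a file name. An empty list
    # still falls back to standard input.
    with fileinput.input(files=args.files) as lines:
        for line in lines:
            out.write(convert(line))


# Demo on a temporary file (hypothetical name).
demo_path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(demo_path, "w") as f:
    f.write("Mixed Case\n")

lower_buf = io.StringIO()
run([demo_path], out=lower_buf)
upper_buf = io.StringIO()
run(["--upper", demo_path], out=upper_buf)
print(lower_buf.getvalue(), end="")
print(upper_buf.getvalue(), end="")
```

In a script, run(sys.argv[1:]) would give both invocation styles while keeping options out of the file list.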

With this you can use both Perl and Python scripts in either way

lowercase < input > output
lowercase input   > output

Or, for that matter, as cat input | lowercase > output.

All methods here read input and write output line-by-line. On top of that, the I/O is buffered by the interpreter and the system. It is possible to read and/or write in smaller chunks instead, but that would be extremely inefficient and would noticeably slow programs down.
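For completeness, here is a sketch of what chunked reading could look like, since the question asks about going char by char. The name lowercase_chunks and the default chunk size are arbitrary choices for the illustration. Setting chunk_size to 1 gives the character-at-a-time behavior, with the performance cost mentioned above.

```python
import io
from functools import partial


def lowercase_chunks(src, dst, chunk_size=64 * 1024):
    """Copy text from src to dst in fixed-size chunks, lowercasing each."""
    # iter(callable, sentinel) keeps calling src.read(chunk_size) until it
    # returns the empty string, which marks end of input on a text stream.
    for chunk in iter(partial(src.read, chunk_size), ""):
        dst.write(chunk.lower())


# Demo: one character per read, on an in-memory stream.
text = u"ABC def\nGHI\n"
chunked_out = io.StringIO()
lowercase_chunks(io.StringIO(text), chunked_out, chunk_size=1)
print(chunked_out.getvalue(), end="")
```

Because the streams are opened in text mode, a chunk never splits a character in the middle. Context-sensitive case mappings (Greek final sigma, for instance) might still come out differently at a chunk boundary than they would for the whole line, which is one more reason to prefer line-by-line processing.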