File operations and command line

We will learn how to read and write files from Python, and how to write a Python program and run it from the command line.

Reading files

Python, like most languages, handles files through file objects.

The open(filename[, mode]) function opens a file and returns a file object (or raises an error). The mode can be 'r' (read), 'w' (write), 'r+' (read and write), or for binary files 'rb', 'wb', 'r+b'.

In [1]:
f = open('E0.csv') # open for reading, returns file object
print f
<open file 'E0.csv', mode 'r' at 0x0000000008C3AC00>

The file object is not very useful on its own. This file contains the English Premier League statistics from the 2016/17 season (the first matches are dated 13/08/16). The f.read() method reads the whole file into a string. We print only the first 100 characters:

In [2]:
f = open('E0.csv')
content = f.read()
print content[:100]
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR

Read only the first line, then the second one! Each call to f.readline() returns the next line.

In [3]:
f = open('E0.csv')
first_line = f.readline()
print first_line
second_line = f.readline()
print second_line
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR,B365H,B365D,B365A,BWH,BWD,BWA,IWH,IWD,IWA,LBH,LBD,LBA,PSH,PSD,PSA,WHH,WHD,WHA,VCH,VCD,VCA,Bb1X2,BbMxH,BbAvH,BbMxD,BbAvD,BbMxA,BbAvA,BbOU,BbMx>2.5,BbAv>2.5,BbMx<2.5,BbAv<2.5,BbAH,BbAHh,BbMxAHH,BbAvAHH,BbMxAHA,BbAvAHA,PSCH,PSCD,PSCA

E0,13/08/16,Burnley,Swansea,0,1,A,0,0,D,J Moss,10,17,3,9,10,14,7,4,3,2,0,0,2.4,3.3,3.25,2.45,3.1,2.95,2.5,3.3,2.65,2.45,3.25,3.1,2.47,3.32,3.19,2.5,3.2,2.9,2.5,3.2,3.25,55,2.55,2.43,3.35,3.21,3.3,3.1,40,2.4,2.3,1.68,1.61,32,-0.25,2.13,2.06,1.86,1.81,2.79,3.16,2.89

The file object is iterable, line by line. Mind the newline character at the end of each line:

In [4]:
f = open('E0.csv')
L = []
for line in f:
    L.append(line)
print L[:10]
['Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR,B365H,B365D,B365A,BWH,BWD,BWA,IWH,IWD,IWA,LBH,LBD,LBA,PSH,PSD,PSA,WHH,WHD,WHA,VCH,VCD,VCA,Bb1X2,BbMxH,BbAvH,BbMxD,BbAvD,BbMxA,BbAvA,BbOU,BbMx>2.5,BbAv>2.5,BbMx<2.5,BbAv<2.5,BbAH,BbAHh,BbMxAHH,BbAvAHH,BbMxAHA,BbAvAHA,PSCH,PSCD,PSCA\n', 'E0,13/08/16,Burnley,Swansea,0,1,A,0,0,D,J Moss,10,17,3,9,10,14,7,4,3,2,0,0,2.4,3.3,3.25,2.45,3.1,2.95,2.5,3.3,2.65,2.45,3.25,3.1,2.47,3.32,3.19,2.5,3.2,2.9,2.5,3.2,3.25,55,2.55,2.43,3.35,3.21,3.3,3.1,40,2.4,2.3,1.68,1.61,32,-0.25,2.13,2.06,1.86,1.81,2.79,3.16,2.89\n', 'E0,13/08/16,Crystal Palace,West Brom,0,1,A,0,0,D,C Pawson,14,13,4,3,12,15,3,6,2,2,0,0,2,3.3,4.5,2,3.2,3.9,2.1,3.3,3.3,2,3.25,4.33,2.06,3.29,4.32,2.05,3.1,4,2,3.3,4.4,56,2.1,2.01,3.4,3.23,4.5,4.16,38,2.68,2.5,1.6,1.52,33,-0.5,2.07,2,1.9,1.85,2.25,3.15,3.86\n', 'E0,13/08/16,Everton,Tottenham,1,1,D,1,0,H,M Atkinson,12,13,6,4,10,14,5,6,0,0,0,0,3.2,3.4,2.4,2.95,3.2,2.4,2.65,3.3,2.5,3.1,3.4,2.4,3.25,3.43,2.37,3.1,3.1,2.4,3.25,3.4,2.38,55,3.3,3.12,3.45,3.32,2.5,2.36,41,2.12,2.05,1.87,1.77,32,0.25,1.91,1.85,2.09,2,3.64,3.54,2.16\n', 'E0,13/08/16,Hull,Leicester,2,1,H,1,0,H,M Dean,14,18,5,5,8,17,5,3,2,2,0,0,4.5,3.6,1.91,4.33,3.4,1.9,3.3,3.3,2.1,4.5,3.5,1.91,4.43,3.55,1.95,4.2,3.25,1.95,4.4,3.5,1.95,55,4.5,4.17,3.6,3.43,2.33,1.95,40,2.3,2.19,1.74,1.67,31,0.25,2.35,2.26,2.03,1.67,4.68,3.5,1.92\n', 'E0,13/08/16,Man City,Sunderland,2,1,H,1,0,H,R Madley,16,7,4,3,11,14,9,6,1,2,0,0,1.25,6.5,15,1.22,6,11.5,1.25,5.5,10.3,1.25,6.5,13,1.27,6.48,13.15,1.25,5.5,13,1.25,6.5,15,56,1.3,1.25,6.8,6.11,15,12.55,39,1.56,1.53,2.67,2.48,34,-1.5,1.81,1.73,2.2,2.14,1.25,6.5,14.5\n', 'E0,13/08/16,Middlesbrough,Stoke,1,1,D,1,0,H,K Friend,12,12,2,1,18,14,9,6,3,5,0,0,2.38,3.2,3.4,2.25,3.1,3.25,2.3,3.3,2.9,2.3,3.2,3.4,2.33,3.24,3.53,2.4,3.1,3.1,2.38,3.2,3.4,56,2.4,2.31,3.3,3.16,3.65,3.38,38,2.61,2.46,1.57,1.53,32,-0.25,1.99,1.93,1.97,1.92,2.2,3.38,3.7\n', 'E0,13/08/16,Southampton,Watford,1,1,D,0,1,A,R 
East,24,5,6,1,8,12,6,2,1,2,0,1,1.8,3.75,5,1.8,3.4,4.5,1.8,3.5,4.2,1.8,3.6,5,1.88,3.68,4.64,1.83,3.4,4.5,1.83,3.6,5,56,1.88,1.82,3.8,3.56,5,4.62,42,2.13,2.06,1.83,1.75,33,-0.75,2.16,2.07,1.89,1.8,1.8,3.83,4.91\n', 'E0,14/08/16,Arsenal,Liverpool,3,4,A,1,1,D,M Oliver,9,16,5,7,13,17,5,4,3,3,0,0,2.4,3.5,3.1,2.35,3.3,2.9,2.3,3.3,2.9,2.38,3.4,3.1,2.41,3.53,3.1,2.5,3.1,3,2.4,3.5,3.1,55,2.5,2.36,3.55,3.42,3.2,3.04,42,1.98,1.81,2.09,1.99,31,-0.5,2.41,2.31,1.81,1.64,2.8,3.44,2.68\n', 'E0,14/08/16,Bournemouth,Man United,1,3,A,0,1,A,A Marriner,9,11,3,7,7,10,4,2,0,1,0,0,4.75,3.6,1.85,4.6,3.5,1.75,4.5,3.5,1.75,4.8,3.6,1.8,4.7,3.62,1.88,4.5,3.4,1.85,4.75,3.6,1.87,55,5,4.5,3.75,3.51,1.95,1.86,42,2.11,2.05,1.87,1.76,33,0.75,1.8,1.76,2.17,2.11,5.4,3.65,1.78\n']

The list L now contains the rows of the file. You can split the lines into cells with .split(",") but that's for later.
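As a small preview of that later step, here is a minimal sketch of splitting one line into cells. The sample line is a shortened copy of the first data row of E0.csv, typed in by hand rather than read from disk. A plain split does not understand quoted cells, which is what the csv module is for. The single-argument print(...) form runs under both Python 2 and Python 3.

```python
# A shortened copy of the first data row of E0.csv (assumed sample, not read from disk).
line = 'E0,13/08/16,Burnley,Swansea,0,1,A\n'

# rstrip('\n') removes the trailing newline, split(',') cuts the line into cells.
cells = line.rstrip('\n').split(',')

home_team = cells[2]
away_team = cells[3]
print(cells)
```

Every cell is still a string here, so numeric columns need an explicit int() or float() before you calculate with them.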

Writing a file

Let's say that you care only about the results of the team 'Liverpool'. First, write a file named 'Liverpool.csv' with some arbitrary content.

For writing we open the file with open(filename, 'w'). If you write to a file, don't forget to close it!

In [5]:
f = open('Liverpool.csv', 'w')
f.write('YNWA')
f.close()

Note: for reading a text file we use open('E0.csv', 'r') but reading is the default, so you can skip the mode r.

Let's read the results row by row, choose the rows containing the word 'Liverpool', and save those lines in 'Liverpool.csv'. The header of the file will be the same!

In [6]:
L = []
with open('E0.csv') as f:
    L.append(f.readline())  # keep the header line
    for line in f:
        if 'Liverpool' in line:
            L.append(line)
with open('Liverpool.csv', 'w') as f:
    for l in L:
        f.write(l)

Note that the write method does not write into a new line automatically, you have to insert the newline characters manually. In the example the lines already contained newline characters.
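The following sketch shows the case where the strings do not already end with a newline. The file name teams.txt and the list of teams are made up for illustration, and the file is written into a temporary directory so that nothing real is overwritten.

```python
import os
import tempfile

# Hypothetical file name, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'teams.txt')

teams = ['Liverpool', 'Everton', 'Arsenal']
with open(path, 'w') as f:
    for team in teams:
        f.write(team + '\n')  # write() adds no newline by itself

# Read the file back to check that each team ended up on its own line.
with open(path) as f:
    lines = f.readlines()

print(lines)
```

Without the + '\n' the file would contain the single line LiverpoolEvertonArsenal.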

The with open(filename, 'rb') as f statement does the same as f = open(filename, 'rb'), but the file is closed automatically at the end of the block. This way you cannot forget to close the file and leave it open.

The with form is preferred for safety reasons (it protects against data corruption)!
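The closing behaviour can be checked through the closed attribute of the file object. The sketch below uses a made-up file name (demo.txt) in a temporary directory, and raises an error on purpose to show that the file is closed even when the block does not finish normally.

```python
import os
import tempfile

# Hypothetical file name, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('YNWA\n')
closed_after_block = f.closed  # the with block has already closed f

try:
    with open(path) as g:
        g.readline()
        raise RuntimeError('something went wrong')  # deliberate error
except RuntimeError:
    pass
closed_after_error = g.closed  # closed although the block was interrupted

print(closed_after_block)
print(closed_after_error)
```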

csv and json in python

The previous file was a comma-separated values file with the .csv extension. In such files the records follow each other line by line, and inside a record the cells are separated by commas (,). Python can handle this format with the csv module.

In [7]:
import csv
L=[]
with open('E0.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        L.append(row)
print L[0]
print L[19]
['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'VCH', 'VCD', 'VCA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD', 'BbAvD', 'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5', 'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA', 'PSCH', 'PSCD', 'PSCA']
['E0', '21/08/16', 'Sunderland', 'Middlesbrough', '1', '2', 'A', '0', '2', 'A', 'M Atkinson', '18', '8', '5', '3', '11', '14', '8', '1', '1', '1', '0', '0', '2.55', '3.2', '3.1', '2.5', '3.1', '3.1', '2.3', '3.2', '3', '2.5', '3.2', '3.1', '2.61', '3.22', '3.06', '2.6', '3.2', '2.75', '2.63', '3.13', '3.1', '56', '2.64', '2.52', '3.25', '3.16', '3.25', '3.02', '39', '2.5', '2.38', '1.64', '1.57', '33', '-0.25', '2.21', '2.14', '1.8', '1.75', '2.79', '3.1', '2.94']

The difference is that the cells are handled, too. You can determine which delimiter to use (here ,). The quotechar determines how to read strings with special characters or commas. This is useful for example when you want to write the decimal number 2,25 in a .csv file. The csv module handles all these.
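A minimal sketch of the quoting behaviour, assuming a made-up two-column table with the value 2,25 in it. csv.reader accepts any iterable of lines, so a list of strings is used here instead of a file.

```python
import csv

# Made-up sample: the second cell contains a comma, so it is quoted.
lines = ['Team,Odds', 'Liverpool,"2,25"']

# The csv module keeps the quoted cell together.
rows = list(csv.reader(lines, delimiter=',', quotechar='"'))

# A plain split cuts the quoted cell in two and keeps the quote characters.
naive = lines[1].split(',')

print(rows[1])
print(naive)
```

The reader returns two cells for the second line, while the plain split returns three pieces.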

In Python 2, open .csv files as binary (rb/wb); the csv module won't work properly otherwise. (In Python 3 the csv module expects text mode opened with newline='' instead.)

In general, opening a text file as binary is not a big problem, but opening a binary file as text may cause problems because of the way end-of-line (EOL) characters are translated. Operating systems mark the end of a line differently for historical reasons.

Reading csv format as dict

If you look at the data closely, you can see that a dictionary would be even better. It's better to refer to the cells by name than by index.

This format uses the first line (header) as dictionary keys.

In [8]:
import csv
L=[]
with open('E0.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        L.append(row)
print L[0]
print L[1]
{'BbAHh': '-0.25', 'HY': '3', 'BbMx>2.5': '2.4', 'HTHG': '0', 'HR': '0', 'HS': '10', 'VCA': '3.25', 'BbMxD': '3.35', 'AwayTeam': 'Swansea', 'BbAvD': '3.21', 'PSD': '3.32', 'BbAvA': '3.1', 'HC': '7', 'HF': '10', 'Bb1X2': '55', 'BbAvH': '2.43', 'WHD': '3.2', 'Referee': 'J Moss', 'WHH': '2.5', 'WHA': '2.9', 'IWA': '2.65', 'AST': '9', 'BbMxH': '2.55', 'HTAG': '0', 'PSCH': '2.79', 'BbAv>2.5': '2.3', 'IWH': '2.5', 'LBA': '3.1', 'BWA': '2.95', 'BWD': '3.1', 'LBD': '3.25', 'HST': '3', 'PSA': '3.19', 'Date': '13/08/16', 'LBH': '2.45', 'BbMxAHA': '1.86', 'BbAvAHA': '1.81', 'BbAvAHH': '2.06', 'IWD': '3.3', 'AC': '4', 'FTR': 'A', 'VCD': '3.2', 'AF': '14', 'VCH': '2.5', 'FTHG': '0', 'BWH': '2.45', 'AS': '17', 'AR': '0', 'AY': '2', 'Div': 'E0', 'PSH': '2.47', 'B365H': '2.4', 'HomeTeam': 'Burnley', 'B365D': '3.3', 'B365A': '3.25', 'BbMx<2.5': '1.68', 'BbMxAHH': '2.13', 'BbAv<2.5': '1.61', 'HTR': 'D', 'BbAH': '32', 'BbOU': '40', 'FTAG': '1', 'PSCA': '2.89', 'PSCD': '3.16', 'BbMxA': '3.3'}
{'BbAHh': '-0.5', 'HY': '2', 'BbMx>2.5': '2.68', 'HTHG': '0', 'HR': '0', 'HS': '14', 'VCA': '4.4', 'BbMxD': '3.4', 'AwayTeam': 'West Brom', 'BbAvD': '3.23', 'PSD': '3.29', 'BbAvA': '4.16', 'HC': '3', 'HF': '12', 'Bb1X2': '56', 'BbAvH': '2.01', 'WHD': '3.1', 'Referee': 'C Pawson', 'WHH': '2.05', 'WHA': '4', 'IWA': '3.3', 'AST': '3', 'BbMxH': '2.1', 'HTAG': '0', 'PSCH': '2.25', 'BbAv>2.5': '2.5', 'IWH': '2.1', 'LBA': '4.33', 'BWA': '3.9', 'BWD': '3.2', 'LBD': '3.25', 'HST': '4', 'PSA': '4.32', 'Date': '13/08/16', 'LBH': '2', 'BbMxAHA': '1.9', 'BbAvAHA': '1.85', 'BbAvAHH': '2', 'IWD': '3.3', 'AC': '6', 'FTR': 'A', 'VCD': '3.3', 'AF': '15', 'VCH': '2', 'FTHG': '0', 'BWH': '2', 'AS': '13', 'AR': '0', 'AY': '2', 'Div': 'E0', 'PSH': '2.06', 'B365H': '2', 'HomeTeam': 'Crystal Palace', 'B365D': '3.3', 'B365A': '4.5', 'BbMx<2.5': '1.6', 'BbMxAHH': '2.07', 'BbAv<2.5': '1.52', 'HTR': 'D', 'BbAH': '33', 'BbOU': '38', 'FTAG': '1', 'PSCA': '3.86', 'PSCD': '3.15', 'BbMxA': '4.5'}

Store the data of Liverpool matches. We will write the 'Date', 'HomeTeam', 'AwayTeam', 'FTHG'(Full Time Home Goals), 'FTAG' (Full Time Away Goals), 'FTR' (Full Time Result) values! We will use csv.DictWriter to write the data, the writer.writeheader() writes the header first, then writer.writerows() writes the actual data.

The fieldnames parameter tells which fields (columns) to use. The extrasaction='ignore' ignores the other fields.

In [9]:
import csv
L=[]
with open('E0.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for x in reader:
        if x['HomeTeam'] == 'Liverpool' or x['AwayTeam'] == 'Liverpool':
            L.append(x)
# no close() calls are needed: each with block closes its own file
with open('Liverpool.csv', 'wb') as output:
    fields = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']
    writer = csv.DictWriter(output, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(L)
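To see what extrasaction does in isolation, here is a small sketch that writes one hand-typed record. LineCollector is a hypothetical helper, not part of the csv module: the writer only needs an object with a write() method, so the helper collects the written text in a list instead of a file.

```python
import csv

class LineCollector(object):
    """Hypothetical minimal file-like object: it only stores what is written."""
    def __init__(self):
        self.lines = []

    def write(self, text):
        self.lines.append(text)

# Hand-typed record with more keys than we want to write.
match = {'Date': '13/08/16', 'HomeTeam': 'Burnley', 'AwayTeam': 'Swansea',
         'FTHG': '0', 'FTAG': '1', 'Referee': 'J Moss'}

out = LineCollector()
fields = ['Date', 'HomeTeam', 'AwayTeam']
writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
writer.writeheader()
writer.writerow(match)  # keys outside fields are silently dropped
written = ''.join(out.lines)

# Without extrasaction='ignore' the default is to raise a ValueError.
strict = csv.DictWriter(LineCollector(), fieldnames=fields)
try:
    strict.writerow(match)
    raised = False
except ValueError:
    raised = True

print(written)
print(raised)
```

The rows end with '\r\n' because that is the default line terminator of the csv writer.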

The json format

JSON stands for JavaScript Object Notation.

This format can store most of the basic types and also combine them: numbers, strings, lists, dicts, lists of lists, lists of dicts etc.

Lists are written as comma-separated values in square brackets [ ], and dicts as the usual key:value pairs in curly brackets { }.

{
    "Liverpool" : {
        "Players": [
            "Steven Gerrard",
            "Bill Shankly"
        ],
        "Results" : [
            {
                "HomeTeam":"Liverpool",
                "AwayTeam":"Tottenham",
                "HTG":1,
                "ATG":1
            },
            {
                "HomeTeam":"West Ham",
                "AwayTeam":"Liverpool",
                "HTG":2,
                "ATG":0
            }
        ],
        "Points":1,
        "Goals Scored":1,
        "Goals Condceded":3
    }
}

Python can handle this format with the json module. After reading the file, the data is a dictionary containing all sorts of objects.

The u prefix, as in u'Steven Gerrard', marks a unicode string in Python 2.

In [10]:
import json
with open('Liverpool.json') as data_file:    
    data = json.load(data_file)

print data
print data['Liverpool']['Players']
{u'Liverpool': {u'Players': [u'Steven Gerrard', u'Bill Shankly'], u'Goals Scored': 1, u'Points': 1, u'Goals Condceded': 3, u'Results': [{u'HTG': 1, u'AwayTeam': u'Tottenham', u'HomeTeam': u'Liverpool', u'ATG': 1}, {u'HTG': 2, u'AwayTeam': u'Liverpool', u'HomeTeam': u'West Ham', u'ATG': 0}]}}
[u'Steven Gerrard', u'Bill Shankly']

Now write a json file! To make the output easier to read you can use the sort_keys, indent and separators parameters. The json.dumps(obj) function returns a string which encodes an object (obj) in json format. We write that string into a file and that's all.

In [11]:
import json
with open('Liverpool.json') as data_file:
    data = json.load(data_file)
with open('Liverpool_matches.json', 'wb') as f:
    f.write(json.dumps(data['Liverpool']['Results'],
            sort_keys=True, indent=4, separators=(',', ': ')))

There are several ways to handle the json format:

  • json.dumps(obj): encodes obj to a JSON formatted string
  • json.dump(obj, file): encodes obj as JSON and writes it into a file
  • json.load(file): reads the content of file to a python object (it can be a complex python data)
  • json.loads(JSON_formatted_string): converts a JSON formatted string into a python object
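The four functions can be tried on a single hand-typed record, as in the sketch below. The file name match.json is made up and the file is placed in a temporary directory. The idea is a round trip: whatever is encoded should come back unchanged.

```python
import json
import os
import tempfile

match = {'HomeTeam': 'Liverpool', 'AwayTeam': 'Tottenham', 'HTG': 1, 'ATG': 1}

# dumps / loads: object -> string -> object
text = json.dumps(match, sort_keys=True)
restored = json.loads(text)

# dump / load: object -> file -> object (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), 'match.json')
with open(path, 'w') as f:
    json.dump(match, f)
with open(path) as f:
    from_file = json.load(f)

print(text)
print(restored == match)
print(from_file == match)
```

The functions ending in s work with strings, the ones without it work with file objects.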

More details in https://docs.python.org/2/library/json.html.

Command line arguments

We will run Python code as standalone programs!

The sys module

Write some Python code and save it with the .py extension. Your OS may recognise it as a Python program, or you can run it with an interpreter.

You can communicate with your program via input or with command line arguments. Your very first program prints its arguments. The first one is always the name of the code file; the others are optional. The list sys.argv stores these parameters (as a list of strings). You have to import sys first.

Save the following as cli.py and run it from the command line.

import sys

print 'Number of arguments:', len(sys.argv)
print 'List of arguments:', str(sys.argv)
In [12]:
! python cli.py arg1 arg2
Number of arguments: 3
List of arguments: ['cli.py', 'arg1', 'arg2']

The ! tells the notebook to run the line in the command line rather than as Python code.

You can use the values in sys.argv and we call them positional parameters since you can refer to them by their place in the list sys.argv.

Exercise: calculate the power of a number. Write a Python program which takes two command line arguments: base and exponent.

If the numbers are integers then calculate with integers, otherwise calculate with floats.

Save the following as power.py.

import sys

def is_intstring(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

a = []

for i in range(1,3):
    if is_intstring(sys.argv[i]):
        a.append(int(sys.argv[i]))
    else:
        a.append(float(sys.argv[i]))

print a[0] ** a[1]

This is how to run it.

In [13]:
!python power.py 4.2 3

!python power.py 2 100
74.088
1267650600228229401496703205376

argparse

If you want to use more complex command line arguments, then you can use the argparse module.

The following is an example of a flag that switches between two options: on or off. For example, if verbosity is on you get a lot of information printed on your screen, and if it's off you get less.

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--verbosity", help="increase output verbosity")
args = parser.parse_args()
print type(args.verbosity)
if args.verbosity:
    print "verbosity turned on"
else:
    print "verbosity turned off"
In [14]:
!python parser.py --verbosity 1
<type 'str'>
verbosity turned on
In [15]:
!python parser.py
<type 'NoneType'>
verbosity turned off

This option needs a value, and any value switches verbosity on: even 0 arrives as the non-empty string '0', which counts as true. A nicer solution is the bool type. In this way you only have to write -v or --verbose or nothing.

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", help="increase output verbosity", action="store_true")
args = parser.parse_args()
print type(args.verbose)
if args.verbose:
    print "verbosity turned on"
else:
    print "verbosity turned off"
In [16]:
!python parser2.py
!python parser2.py --verbose
!python parser2.py -v
<type 'bool'>
verbosity turned off
<type 'bool'>
verbosity turned on
<type 'bool'>
verbosity turned on
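The two kinds of option can also be compared without the command line, because parse_args accepts a list of strings in place of sys.argv. The sketch below puts both options into one parser for brevity, which is an assumption of this example and not how parser.py and parser2.py are organised.

```python
import argparse

parser = argparse.ArgumentParser()
# Option that expects a value (stored as a string).
parser.add_argument("--verbosity", help="increase output verbosity")
# Flag without a value (stored as a bool).
parser.add_argument("-v", "--verbose", help="increase output verbosity",
                    action="store_true")

with_value = parser.parse_args(["--verbosity", "0"])
no_args = parser.parse_args([])
flag = parser.parse_args(["-v"])

print(repr(with_value.verbosity))  # the string '0', which is true in an if
print(repr(no_args.verbosity))     # None when the option is missing
print(repr(no_args.verbose))       # False when the flag is missing
print(repr(flag.verbose))          # True when the flag is given
```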

You can even print a nice help menu with explanatory text (help="increase output verbosity").

In [17]:
!python parser2.py --help
usage: parser2.py [-h] [-v]

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  increase output verbosity

Back to the .csv file. Let's say you want to know in how many matches a given team scored at least a given number of goals. For example, in how many matches did Liverpool score at least one goal?

The name of the team is the -t or --team argument; by default it's 'Liverpool', but you can set it to other teams as well.

The minimum number of goals is 0 by default, but you can change it with -g or --goals.

The action='store' option stores them into the args object. You can access them by args.team and args.goals.

import argparse
import csv

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--team", help="The team we are looking for", action="store", type=str, default='Liverpool')
parser.add_argument("-g", "--goals", help="Number of minimum goals scored", action="store", type=float, default=0)
args = parser.parse_args()

m = 0
team = args.team
goals = args.goals
with open('E0.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for x in reader:
        if x['HomeTeam'] == team and float(x['FTHG']) >= goals:
            m += 1
        elif x['AwayTeam'] == team and float(x['FTAG']) >= goals:
            m += 1
print m
In [18]:
!python goals.py -h
usage: goals.py [-h] [-t TEAM] [-g GOALS]

optional arguments:
  -h, --help            show this help message and exit
  -t TEAM, --team TEAM  The team we are looking for
  -g GOALS, --goals GOALS
                        Number of minimum goals scored
In [19]:
!python goals.py -g 1
33
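The counting logic can be tried on a few rows without E0.csv. In the sketch below the helper count_matches and the three-row sample are made up for illustration (the scores are copied from the first rows of the file). The arguments are passed to parse_args as a list instead of coming from the command line.

```python
import argparse
import csv

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--team", type=str, default='Liverpool')
parser.add_argument("-g", "--goals", type=float, default=0)

# Made-up sample with only the four columns that the counting needs.
sample = [
    'HomeTeam,AwayTeam,FTHG,FTAG',
    'Arsenal,Liverpool,3,4',
    'Everton,Tottenham,1,1',
    'Hull,Leicester,2,1',
]

def count_matches(rows, team, goals):
    """Count the matches in which team scored at least goals goals."""
    m = 0
    for x in rows:
        if x['HomeTeam'] == team and float(x['FTHG']) >= goals:
            m += 1
        elif x['AwayTeam'] == team and float(x['FTAG']) >= goals:
            m += 1
    return m

defaults = parser.parse_args([])
args = parser.parse_args(['-t', 'Arsenal', '-g', '3'])

# csv.DictReader accepts any iterable of lines, here a list of strings.
liverpool_count = count_matches(csv.DictReader(sample), defaults.team, defaults.goals)
arsenal_count = count_matches(csv.DictReader(sample), args.team, args.goals)

print(liverpool_count)
print(arsenal_count)
```

Note that type=float converts the command line string '3' to 3.0, so the comparison with float(x['FTHG']) is between numbers and not between strings.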