Fast and easy data loading

Let's start with a CSV file and pandas. The pandas library offers the most accessible and complete functionality to load tabular data from a file (or a URL). By default, it will store data in a specialized pandas data structure, index each row, separate variables by custom delimiters, infer the right data type for each column, convert data (if necessary), as well as parse dates, missing values, and erroneous values.

We will start by importing the pandas package and reading our Iris dataset:

In: import pandas as pd
    iris_filename = 'datasets-uci-iris.csv'
    iris = pd.read_csv(iris_filename, sep=',', decimal='.', header=None,
                       names= ['sepal_length', 'sepal_width', 
                               'petal_length', 'petal_width',
                               'target'])

You can specify the name of the file, the character used as a separator (sep), the character used for the decimal placeholder (decimal), whether there is a header (header), and the variable names (using names and a list). The sep=',' and decimal='.' settings shown here are the default values, so they are redundant in this call. For European-style CSV files, however, it is important to set both explicitly since, in many European countries, the separator character and the decimal placeholder differ from the defaults.
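As a minimal sketch of the European-style case, the following example reads a small in-memory CSV that uses a semicolon as the separator and a comma as the decimal placeholder. The two rows of sample text and the names european_csv and iris_eu are made up for illustration; they are not part of the dataset file used in this section.

```python
import io
import pandas as pd

# Hypothetical European-style CSV content:
# ';' separates the fields and ',' marks the decimals
european_csv = io.StringIO(
    "5,1;3,5;1,4;0,2;Iris-setosa\n"
    "4,9;3,0;1,4;0,2;Iris-setosa\n"
)

iris_eu = pd.read_csv(european_csv, sep=';', decimal=',', header=None,
                      names=['sepal_length', 'sepal_width',
                             'petal_length', 'petal_width',
                             'target'])

# The four measurements are parsed as floats rather than left as strings
print(iris_eu.dtypes)
```

If decimal=',' were left out, values such as 5,1 would be read as text, and the numeric columns would end up with the object data type.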

If the dataset is not available locally, you can follow these steps to download it from the internet:

In: import urllib.request
    url = "http://aima.cs.berkeley.edu/data/iris.csv"
    set1 = urllib.request.Request(url)
    iris_p = urllib.request.urlopen(set1)
    iris_other = pd.read_csv(iris_p, sep=',', decimal='.',
                             header=None,
                             names=['sepal_length', 'sepal_width',
                                    'petal_length', 'petal_width',
                                    'target'])
    iris_other.head()

The resulting object, named iris, is a pandas DataFrame. It's more than a simple Python list or dictionary, and in the sections that follow, we will explore some of its features. To get an idea of its content, you can print the first (or the last) row(s) by using the following commands:

In: iris.head()

The first rows of the DataFrame will be printed in the output. To see the last rows instead, use the tail method:

In: iris.tail()

Both head() and tail(), if called without arguments, return five rows. If you want a different number of rows, pass that number as an argument, as follows:

In: iris.head(2)

The preceding command will print only the first two rows. Now, to get the names of the columns, you can simply use the following attribute:

In: iris.columns
Out: Index(['sepal_length', 'sepal_width', 
            'petal_length', 'petal_width',  
            'target'], dtype='object')

The resulting object is a very interesting one. It looks like a list, but it is actually a pandas Index. As suggested by the object's name, it indexes the columns' names. To extract the target column, for example, you can simply do the following:

In: y = iris['target']
    y

Out: 0     Iris-setosa
     1     Iris-setosa
     2     Iris-setosa
     3     Iris-setosa
     ...
     149    Iris-virginica
     Name: target, dtype: object

The type of the object y is a pandas Series. For now, think of it as a one-dimensional array with axis labels; we will investigate it in depth later on. We have just seen that a pandas Index acts like a dictionary index of the table's columns. Note that you can also select several columns at once by passing a list of their names, as follows:

In: X = iris[['sepal_length', 'sepal_width']]
    X

Out: [150 rows x 2 columns]

The output displays the first four and the last four rows of X (the table itself is omitted here), ending with the dimension summary above.

In this case, the result is a pandas DataFrame. Why such a difference in results when using the same indexing syntax? In the first case, we asked for a single column, so the output was a 1D vector (that is, a pandas Series). In the second example, we asked for multiple columns and obtained a matrix-like result (and we know that matrices are mapped as pandas DataFrames). A novice reader can spot the difference by looking at the heading of the output: if the columns are labeled, then you are dealing with a pandas DataFrame. On the other hand, if the result is a vector and it presents no heading, then it is a pandas Series.
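The difference can also be checked programmatically rather than by eye. The following sketch uses a small made-up table (df is a stand-in, not the Iris file) to show that a single label returns a Series, whereas a list of labels returns a DataFrame, even when the list holds only one name:

```python
import pandas as pd

# Small hypothetical table standing in for the Iris data
df = pd.DataFrame({'sepal_length': [5.1, 4.9, 4.7],
                   'sepal_width': [3.5, 3.0, 3.2],
                   'target': ['Iris-setosa'] * 3})

one_column = df['sepal_length']      # a single label returns a Series
same_column = df[['sepal_length']]   # a one-item list returns a DataFrame

print(type(one_column).__name__, one_column.shape)    # Series (3,)
print(type(same_column).__name__, same_column.shape)  # DataFrame (3, 1)
```

The brackets therefore matter: the outer pair performs the selection, while the inner pair builds the list of column names.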

So far, we have learned some common steps from the data science process; after you load the dataset, you usually separate the features and target labels.

In a classification problem, target labels are the ordinal numbers or textual strings that indicate the class associated with every set of features.
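One possible way to perform this separation in a single pass is sketched below. The miniature table and the names data, features, and labels are assumptions made for the example; the drop method simply returns a copy of the table without the listed columns.

```python
import pandas as pd

# Hypothetical miniature of the Iris table, one row per class
data = pd.DataFrame({'sepal_length': [5.1, 7.0, 6.3],
                     'sepal_width': [3.5, 3.2, 3.3],
                     'petal_length': [1.4, 4.7, 6.0],
                     'petal_width': [0.2, 1.4, 2.5],
                     'target': ['Iris-setosa', 'Iris-versicolor',
                                'Iris-virginica']})

features = data.drop(columns='target')   # every column except the label
labels = data['target']                  # the class of each observation

print(list(features.columns))
print(list(labels))
```

Here, features is a DataFrame with four columns and labels is a Series of textual class names, one per observation.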

The next step is to get an idea of how large the problem is, and therefore, you need to know the size of the dataset. Typically, each observation occupies a row and each feature a column.

To obtain the dimensions of the dataset, just use the shape attribute on either a pandas DataFrame or Series, as shown in the following example:

In: print(X.shape)

Out: (150, 2)

In: print(y.shape)

Out: (150,)

The resulting object is a tuple that contains the size of the matrix/array in each dimension. Also, note that a pandas Series follows the same format (that is, a tuple with only one element).
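Since shape is an ordinary tuple, it can be unpacked into separate variables. The sketch below illustrates this on a small made-up table (table and column are illustrative names only) and also shows the related len function and ndim attribute:

```python
import pandas as pd

# Hypothetical three-row table with two columns
table = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
column = table['a']

n_rows, n_cols = table.shape   # two-element tuple for a DataFrame
(n_items,) = column.shape      # one-element tuple for a Series

print(n_rows, n_cols, n_items)              # 3 2 3
print(len(table), table.ndim, column.ndim)  # 3 2 1
```

Note the trailing comma in (n_items,): it is required to unpack a tuple with a single element.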