Python Data Science Essentials
Alberto Boschetti, Luca Massaron
Dealing with problematic data
Now, you should be more confident with the basics of the process and ready to face datasets that are more problematic, since messy data is very common in reality. Consequently, let's see what happens when the CSV file contains a header, some missing values, and dates. To make our example realistic, let's imagine the situation of a travel agency:
- Based on the temperatures of three popular destinations, the agency records whether the user picks the first, second, or third one:
Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination
20140910,80,32,40,1
20140911,100,50,36,2
20140912,102,55,46,1
20140912,60,20,35,3
20140914,60,,32,3
20140914,,57,42,2
- In this case, all the numbers are integers and the file includes a header row. In our first attempt to load this dataset, we can issue the following command:
In: import pandas as pd
In: fake_dataset = pd.read_csv('a_loading_example_1.csv', sep=',')
fake_dataset
The top rows of the fake_dataset are printed:

Pandas automatically assigned the columns their actual names after picking them from the header row. We immediately detect a problem: all of the data, even the dates, has been parsed as integers (or, in other cases, as strings). If the format of the dates is not too strange, you can try pandas' auto-detection routine by specifying which column contains the date data. In the following example, it works well with the following arguments:
In: fake_dataset = pd.read_csv('a_loading_example_1.csv',
parse_dates=[0])
fake_dataset
Here is the fake_dataset, whose Date column is now correctly interpreted by read_csv:
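You can verify that the parsing worked by inspecting the column's dtype. The following minimal sketch reproduces the load using an inline copy of the example data via io.StringIO, so it runs without the a_loading_example_1.csv file on disk:

```python
import io
import pandas as pd

# Inline copy of the example data (stands in for a_loading_example_1.csv)
csv_data = io.StringIO(
    "Date,Temperature_city_1,Temperature_city_2,"
    "Temperature_city_3,Which_destination\n"
    "20140910,80,32,40,1\n"
    "20140911,100,50,36,2\n"
    "20140912,102,55,46,1\n"
    "20140912,60,20,35,3\n"
    "20140914,60,,32,3\n"
    "20140914,,57,42,2\n")

fake_dataset = pd.read_csv(csv_data, parse_dates=[0])

# The Date column is now a datetime64 type rather than an integer
print(fake_dataset['Date'].dtype)
print(fake_dataset['Date'].iloc[0])  # 2014-09-10 00:00:00
```

Without parse_dates, the same column would load as plain integers such as 20140910, and any date arithmetic would have to be done by hand.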

Now, in order to get rid of the missing values that are indicated by NaN, we can replace them with a more meaningful number (let's say, 50 Fahrenheit). We can do so in the following way:
In: fake_dataset.fillna(50)
At this point, you will notice that there are no more missing values:

After that, all of the missing data has disappeared, replaced by the constant 50.0. Treating missing data can also require different approaches. As an alternative to the previous command, values can be replaced by a negative constant to mark them as different from the others (and leave the guess to the learning algorithm):
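Note that fillna returns a new DataFrame rather than modifying the original in place, so you need to keep the result (for example, by reassigning it to fake_dataset). A small sketch with a toy, hypothetical column illustrates this:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data for illustration)
df = pd.DataFrame({'temperature': [80.0, np.nan, 60.0]})

filled = df.fillna(50)  # fillna returns a new, filled DataFrame

print(df['temperature'].isna().sum())      # 1 - the original is unchanged
print(filled['temperature'].isna().sum())  # 0 - the copy has no NaN values
```

This is why simply executing fake_dataset.fillna(50) in a notebook shows a filled table but leaves fake_dataset itself untouched.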
In: fake_dataset.fillna(-1)
NaN values can also be replaced by the column's mean or median value as a way to minimize the guessing error:
In: fake_dataset.fillna(fake_dataset.mean(axis=0))
The .mean method calculates the mean along the specified axis.
The .median method is analogous to .mean, but it computes the median, which is useful when the mean is not a good representation of the central value because the distribution is too skewed (for instance, when your feature contains many extreme values).
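The difference matters whenever a few extreme values dominate a feature. As a quick sketch with a hypothetical skewed series, a single outlier drags the mean far from the typical value while leaving the median nearly untouched:

```python
import pandas as pd

# Hypothetical skewed feature: one extreme value among ordinary ones
skewed = pd.Series([10, 11, 12, 13, 200])

print(skewed.mean())    # 49.2 - pulled upward by the outlier
print(skewed.median())  # 12.0 - close to the typical value
```

In such a case, filling missing values with the median (fake_dataset.fillna(fake_dataset.median(axis=0))) introduces less distortion than filling with the mean.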
Another possible problem when handling real-world datasets is loading a dataset that contains errors or bad lines. In that case, the default behavior of the read_csv method is to stop and raise an exception. A possible workaround, feasible when erroneous examples are not the majority, is to ignore the lines that cause exceptions. In many cases, such a choice only implies that the machine learning algorithm will be trained without the erroneous observations. As an example, let's say that you have a badly formatted dataset and you want to load all of the good lines while ignoring the badly formatted ones.
This is now your a_loading_example_2.csv file:
Val1,Val2,Val3
0,0,0
1,1,1
2,2,2,2
3,3,3
And here is what you can do with the error_bad_lines option:
In: bad_dataset = pd.read_csv('a_loading_example_2.csv',
error_bad_lines=False)
bad_dataset
Out: Skipping line 4: expected 3 fields, saw 4
The resulting output has the fourth line skipped because it contains four values instead of three:
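Be aware that the error_bad_lines parameter was deprecated in pandas 1.3 and removed in pandas 2.0; recent versions use on_bad_lines='skip' instead. The following sketch shows the modern equivalent, again using an inline io.StringIO copy of the data in place of the a_loading_example_2.csv file:

```python
import io
import pandas as pd

# Inline copy of a_loading_example_2.csv, including the malformed fourth line
bad_csv = io.StringIO("Val1,Val2,Val3\n"
                      "0,0,0\n"
                      "1,1,1\n"
                      "2,2,2,2\n"
                      "3,3,3\n")

# on_bad_lines='skip' silently drops lines with the wrong number of fields
bad_dataset = pd.read_csv(bad_csv, on_bad_lines='skip')

print(bad_dataset.shape)  # (3, 3): three good rows, three columns
```

The resulting DataFrame contains only the three well-formed rows, exactly as with the older error_bad_lines=False option.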
