- Python Data Science Essentials
- Alberto Boschetti, Luca Massaron
Accessing other data formats
So far, we have worked on CSV files only. The pandas package offers similar functionality (and functions) to load MS Excel, HDF5, SQL, JSON, HTML, and Stata datasets. Since most of these formats are not used routinely in data science, learning how to load and handle each of them is mostly left to you; you can refer to the documentation available on the pandas website (http://pandas.pydata.org/pandas-docs/version/0.16/io.html). Here, we will only demonstrate the essentials of how to use your disk space to store and retrieve information for machine learning algorithms quickly and efficiently. If you need to access specific subsets of your information, you can leverage an SQLite database (https://www.sqlite.org/index.html) to select them and convert the result into a pandas DataFrame. If you don't need to make particular selections or filters on the data, and your only problem is that reading data from a CSV file is time-consuming and requires a lot of effort every time (for instance, setting the right variable types and names), you can speed up saving and loading your data by using the HDF5 data format (https://support.hdfgroup.org/HDF5/whatishdf5.html).
In our first example, we are going to use SQLite and the SQL language to store away some data and retrieve a filtered version of it. SQLite has quite a few advantages over other databases: it is self-contained (all your data is stored in a single file), serverless (Python provides the interface to store, manipulate, and access the data), and fast. After importing the sqlite3 package (which is part of the Python standard library, so there is no need to install it), you define two queries: one to drop any previous table of the same name, and one to create a new table capable of holding the date, city, temperature, and destination data (using the INTEGER, REAL, and VARCHAR types, which correspond to Python's int, float, and str).
After opening the database (which at this point is created, if not already present on disk), you execute the two queries and then commit the changes (committing makes all the previous database commands take effect as a single atomic batch: https://www.sqlite.org/atomiccommit.html):
In: import sqlite3
drop_query = "DROP TABLE IF EXISTS temp_data;"
create_query = "CREATE TABLE temp_data \
(date INTEGER, city VARCHAR(80), \
temperature REAL, destination INTEGER);"
connection = sqlite3.connect("example.db")
connection.execute(drop_query)
connection.execute(create_query)
connection.commit()
At this point, the database has been created on disk with all of its data tables.
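If you want to confirm that the table is really there, you can query SQLite's internal sqlite_master catalog. This quick check is not part of the original example, but it only uses the standard system table and the connection we already opened:
In: check_query = "SELECT name FROM sqlite_master \
WHERE type='table' AND name='temp_data';"
print(connection.execute(check_query).fetchall())
Out: [('temp_data',)]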
In order to insert the data into the database table, the best approach is to create a list of tuples containing the rows of data you need to store. Then, an insert query will take care of recording each data row. Please note that this time we are using the executemany method, which runs the same insert statement once for each row in the list, instead of the execute method we used before:
In: data = [(20140910, "Rome", 80.0, 0),
(20140910, "Berlin", 50.0, 0),
(20140910, "Wien", 32.0, 1),
(20140911, "Paris", 65.0, 0)]
insert_query = "INSERT INTO temp_data VALUES(?, ?, ?, ?)"
connection.executemany(insert_query, data)
connection.commit()
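As a quick sanity check (a small sketch, not part of the original example), you can count the rows that were just inserted:
In: count_query = "SELECT COUNT(*) FROM temp_data;"
print(connection.execute(count_query).fetchone()[0])
Out: 4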
At this point, we simply decide, with a selection query, which data we need in memory based on specific criteria, and we retrieve it by using the read_sql_query function:
In: import pandas as pd
selection_query = "SELECT date, city, temperature, destination \
FROM temp_data WHERE date=20140910"
retrieved = pd.read_sql_query(selection_query, connection)
Now, all the data you need, as a pandas DataFrame, is contained in the retrieved variable. All that's left to do is close the connection to the database:
In: connection.close()
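If you prefer not to manage the close call by hand, you can wrap the connection in contextlib.closing, which guarantees that the connection is closed when the block ends. The following sketch is simply an alternative spelling of the selection step above, reusing the example.db file and the selection_query string we already defined:
In: from contextlib import closing
with closing(sqlite3.connect("example.db")) as connection:
    # the connection is closed automatically at the end of the block
    retrieved = pd.read_sql_query(selection_query, connection)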
In the following example, we will instead face the situation of a large CSV file that takes a long time to load and to parse its column variables. In such a case, we will use a data format, HDF5, which is suitable for storing and retrieving DataFrames quickly.
HDF5 is a file format that was originally developed by the National Center for Supercomputing Applications (NCSA) to store and access large amounts of scientific data, based on NASA's requirements in the 1990s for a portable file format for the data produced by the Earth Observing System and other space observation systems. HDF5 is arranged as a hierarchical data store that holds multidimensional arrays of a homogeneous type, as well as groups, which are containers of arrays and other groups. Since it behaves like a filesystem within a single file, it fits the DataFrame structure well, and by means of automatic data compression, it can make loading large files much faster than simply reading a CSV.
We will start by initializing the HDF5 file, example.h5, using the HDFStore command, which allows for low-level manipulation of the data file. After instantiating the store, you can start using it as if it were a Python dictionary. In the following code snippet, you store the Iris dataset (the iris DataFrame loaded earlier in the chapter) under the dictionary key iris. After that, you simply close the HDF5 file:
In: storage = pd.HDFStore('example.h5')
storage['iris'] = iris
storage.close()
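The compression mentioned above is something you can enable explicitly when the store is created. The following snippet is a variation on the example above, not an extra step of it: complevel and complib are standard HDFStore parameters, and blosc is just one of several available compression libraries, so treat the specific settings as a suggestion to tune on your own data (the main example continues below with the uncompressed example.h5 file):
In: compressed_storage = pd.HDFStore('example_compressed.h5',
complevel=9, complib='blosc')
compressed_storage['iris'] = iris
compressed_storage.close()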
When you need to retrieve the data stored in the HDF5 file, you can reopen the file using the HDFStore command. First, you check the available keys (as you would do in a dictionary):
In: storage = pd.HDFStore('example.h5')
storage.keys()
Out: ['/iris']
Then, you retrieve the desired DataFrame by recalling it through the corresponding key:
In: fast_iris_upload = storage['iris']
type(fast_iris_upload)
Out: pandas.core.frame.DataFrame
The data is loaded promptly, and the stored DataFrame is now available for further processing under the variable fast_iris_upload.
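For the common case of a single DataFrame per file, pandas also offers the to_hdf and read_hdf shortcuts, which wrap the same HDFStore machinery in a single call each. The following sketch is an equivalent alternative to the HDFStore code shown above, not an additional step of the example:
In: iris.to_hdf('example_shortcut.h5', key='iris')
fast_iris_upload = pd.read_hdf('example_shortcut.h5', key='iris')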