- Python Data Science Essentials
- Alberto Boschetti, Luca Massaron
Accessing other data formats
So far, we have worked on CSV files only. The pandas package offers similar functionality (and functions) to load MS Excel, HDF5, SQL, JSON, HTML, and Stata datasets. Since most of these formats are not used routinely in data science, learning how to load and handle each of them is mostly left to you; you can refer to the documentation available on the pandas website (http://pandas.pydata.org/pandas-docs/version/0.16/io.html). Here, we will only demonstrate the essentials of how to use your disk space to store and retrieve information for machine learning algorithms quickly and efficiently. If you need to access specific subsets of your information, you can leverage an SQLite database (https://www.sqlite.org/index.html) to select them and convert the result into a pandas DataFrame. If you don't need to make particular selections or filters on the data, and your only problem is that reading data from a CSV file is time-consuming and requires a lot of effort every time (for instance, setting the right variable types and names), you can speed up saving and loading your data by using the HDF5 data format (https://support.hdfgroup.org/HDF5/whatishdf5.html).
In our first example, we are going to use SQLite and the SQL language to store away some data and retrieve a filtered version of it. SQLite has quite a few advantages over other databases: it is self-contained (all your data is stored in a single file), serverless (Python provides the interface to store, manipulate, and access the data), and fast. After importing the sqlite3 package (which is part of the Python standard library, so there is no need to install it), you define two queries: one to drop any previous table of the same name, and one to create a new table capable of holding the date, city, temperature, and destination data (using the INTEGER, REAL, and VARCHAR types, which correspond to Python's int, float, and str).
After opening the database (which at this point is created, if not already present on disk), you execute the two queries and then commit the changes (committing makes all the previous database commands take effect as a single atomic batch: https://www.sqlite.org/atomiccommit.html):
In: import sqlite3
drop_query = "DROP TABLE IF EXISTS temp_data;"
create_query = "CREATE TABLE temp_data \
(date INTEGER, city VARCHAR(80), \
temperature REAL, destination INTEGER);"
connection = sqlite3.connect("example.db")
connection.execute(drop_query)
connection.execute(create_query)
connection.commit()
At this point, the database has been created on disk with all of its data tables.
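If you want to confirm that the table is really there, you can query SQLite's internal sqlite_master catalog. This quick check is not part of the original example, but it only uses the standard system table and the connection we already opened:
In: check_query = "SELECT name FROM sqlite_master \
WHERE type='table' AND name='temp_data';"
print(connection.execute(check_query).fetchall())
Out: [('temp_data',)]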
In order to insert the data into the database table, the best approach is to create a list of tuples containing the rows of data you need to store. Then, an insert query will take care of recording each data row. Please note that this time we are using the executemany method, which runs the same insert statement once for each row in the list, instead of the execute method we used before:
In: data = [(20140910, "Rome", 80.0, 0),
(20140910, "Berlin", 50.0, 0),
(20140910, "Wien", 32.0, 1),
(20140911, "Paris", 65.0, 0)]
insert_query = "INSERT INTO temp_data VALUES(?, ?, ?, ?)"
connection.executemany(insert_query, data)
connection.commit()
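As a quick sanity check (a small sketch, not part of the original example), you can count the rows that were just inserted:
In: count_query = "SELECT COUNT(*) FROM temp_data;"
print(connection.execute(count_query).fetchone()[0])
Out: 4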
At this point, we simply decide, with a selection query, which data we need in memory based on specific criteria, and we retrieve it by using the read_sql_query function:
In: import pandas as pd
selection_query = "SELECT date, city, temperature, destination \
FROM temp_data WHERE date=20140910"
retrieved = pd.read_sql_query(selection_query, connection)
Now, all the data you need, as a pandas DataFrame, is contained in the retrieved variable. All that's left to do is close the connection to the database:
In: connection.close()
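If you prefer not to manage the close call by hand, you can wrap the connection in contextlib.closing, which guarantees that the connection is closed when the block ends. The following sketch is simply an alternative spelling of the selection step above, reusing the example.db file and the selection_query string we already defined:
In: from contextlib import closing
with closing(sqlite3.connect("example.db")) as connection:
    # the connection is closed automatically at the end of the block
    retrieved = pd.read_sql_query(selection_query, connection)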
In the following example, we will instead face the situation of a large CSV file that takes a long time to load and to parse its column variables. In such a case, we will use a data format, HDF5, which is suitable for storing and retrieving DataFrames quickly.
HDF5 is a file format that was originally developed by the National Center for Supercomputing Applications (NCSA) to store and access large amounts of scientific data, based on NASA's requirements in the 1990s for a portable file format for the data produced by the Earth Observing System and other space observation systems. HDF5 is arranged as a hierarchical data store that holds multidimensional arrays of a homogeneous type, as well as groups, which are containers of arrays and other groups. Since it behaves like a filesystem within a single file, it fits the DataFrame structure well, and by means of automatic data compression, it can make loading large files much faster than simply reading a CSV.
We will start by initializing the HDF5 file, example.h5, using the HDFStore command, which allows for low-level manipulation of the data file. After instantiating the store, you can start using it as if it were a Python dictionary. In the following code snippet, you store the Iris dataset (the iris DataFrame loaded earlier in the chapter) under the dictionary key iris. After that, you simply close the HDF5 file:
In: storage = pd.HDFStore('example.h5')
storage['iris'] = iris
storage.close()
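The compression mentioned above is something you can enable explicitly when the store is created. The following snippet is a variation on the example above, not an extra step of it: complevel and complib are standard HDFStore parameters, and blosc is just one of several available compression libraries, so treat the specific settings as a suggestion to tune on your own data (the main example continues below with the uncompressed example.h5 file):
In: compressed_storage = pd.HDFStore('example_compressed.h5',
complevel=9, complib='blosc')
compressed_storage['iris'] = iris
compressed_storage.close()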
When you need to retrieve the data stored in the HDF5 file, you can reopen the file using the HDFStore command. First, you check the available keys (as you would do in a dictionary):
In: storage = pd.HDFStore('example.h5')
storage.keys()
Out: ['/iris']
Then, you retrieve the desired DataFrame by recalling it through the corresponding key:
In: fast_iris_upload = storage['iris']
type(fast_iris_upload)
Out: pandas.core.frame.DataFrame
The data is loaded promptly, and the stored DataFrame is now available for further processing under the variable fast_iris_upload.
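For the common case of a single DataFrame per file, pandas also offers the to_hdf and read_hdf shortcuts, which wrap the same HDFStore machinery in a single call each. The following sketch is an equivalent alternative to the HDFStore code shown above, not an additional step of the example:
In: iris.to_hdf('example_shortcut.h5', key='iris')
fast_iris_upload = pd.read_hdf('example_shortcut.h5', key='iris')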