Shape of a Column in Pandas Data Frame
Pandas is a must-have Python module for data manipulation and analysis. It can read a CSV file into Python as a pandas DataFrame. Let's talk about the shape of a DataFrame column, which confuses a lot of beginners. Let's say we have a CSV table named "data.csv" like this:
Year | Counts |
2008 | 100 |
2009 | 200 |
2010 | 300 |
2011 | 400 |
2012 | 500 |
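Let's read the file, use Year as the index, and pull out the Counts column:
import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('Year', inplace=True)  # use the Year column as the row labels
y = df['Counts']  # select a single column
print(y, type(y), y.shape)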
You will get a variable y that prints as two columns: the Year index on the left and the Counts values on the right. It is still a single column of data, though, because Year is only the index (the row labels). The type of y is pandas.core.series.Series and its shape is (5,). Note that this is a one-dimensional shape: the tuple has a single entry, the length 5, and the index does not count as a dimension. This is what confuses people most. How can an object that prints as two columns have a one-dimensional shape? And what should we do if we only want the 'Counts' values?
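Before moving on, it can help to look at the index and the values of the Series separately. The snippet below is a minimal sketch; it assumes the table is rebuilt in memory with pd.DataFrame rather than read from data.csv, so that it runs on its own.

```python
import pandas as pd

# Assumption: the example table is rebuilt in memory instead of reading data.csv.
df = pd.DataFrame(
    {
        "Year": [2008, 2009, 2010, 2011, 2012],
        "Counts": [100, 200, 300, 400, 500],
    }
).set_index("Year")

y = df["Counts"]

print(y.ndim)          # number of dimensions of the Series
print(y.shape)         # one tuple entry per dimension
print(list(y.index))   # the Year labels live in the index
print(list(y.values))  # the data itself holds only the Counts values
```

The Year labels come back from y.index and the numbers from y.values, which suggests why the index is not counted in the shape.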
Let's execute the following code to see the results:
y = df['Counts'].tolist()
print(y, type(y))
Now we have a list with only the 'Counts' values. Its type is list. Since a plain Python list has no shape attribute, asking for y.shape here would raise an AttributeError.
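A quick way to confirm this is to test for the attribute instead of triggering the error. This is a small sketch; the Series is built directly (an assumption, to keep the snippet self-contained) and the names counts and as_list are only illustrative.

```python
import pandas as pd

# Assumption: the Counts column is built directly instead of read from data.csv.
counts = pd.Series([100, 200, 300, 400, 500], name="Counts")
as_list = counts.tolist()

print(type(as_list))              # a built-in Python list
print(hasattr(as_list, "shape"))  # lists do not carry a shape attribute
print(len(as_list))               # len() gives the number of items instead
```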
Next, let's execute the following code to see the results:
y = df['Counts'].to_numpy()
print(y, type(y), y.shape)
It looks like we got a list with only the 'Counts' values, but it is not a true list, because it has a shape attribute. It is a NumPy array with a shape of (5,), meaning one dimension. Note that although we did not import numpy here, pandas is built on NumPy, so to_numpy() returns a genuine numpy.ndarray.
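To check how many dimensions NumPy itself reports, the ndim attribute can be printed next to shape. The sketch below builds the column directly (an assumption) and, for contrast, also creates a scalar array, which is what an array with no dimensions at all looks like.

```python
import numpy as np
import pandas as pd

# Assumption: the Counts column is built directly instead of read from data.csv.
counts = pd.Series([100, 200, 300, 400, 500], name="Counts")
arr = counts.to_numpy()

print(isinstance(arr, np.ndarray))  # to_numpy() returns a NumPy array
print(arr.ndim, arr.shape)          # dimensions and shape of the column

# For contrast: an array wrapping a single number has an empty shape tuple.
scalar = np.array(100)
print(scalar.ndim, scalar.shape)
```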
This one-dimensional shape can cause problems if you want to apply machine learning algorithms to the data using the scikit-learn module, because its estimators expect the feature array to be two-dimensional, with shape (n_samples, n_features). The question becomes: how do we change a one-dimensional array into a two-dimensional one? The reshape(-1, 1) method is neat here: the 1 asks for a single column, and the -1 lets NumPy work out the number of rows. Execute the following code and see the results:
y = df['Counts'].to_numpy().reshape(-1,1)
print(y, type(y), y.shape)
You will see that this time y has a shape of (5, 1). The printed output looks similar, except that each value now sits in its own pair of brackets, one per row. It is now a two-dimensional NumPy array with 5 rows and 1 column that can be passed to machine learning algorithms!
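The shapes before and after the reshape can be compared side by side. The sketch below again assumes the table is rebuilt in memory. It also shows two things that are not part of the walkthrough above: selecting with double brackets, df[["Counts"]], which returns a one-column DataFrame and so gives the same (5, 1) array, and ravel(), which flattens the array back.

```python
import pandas as pd

# Assumption: the example table is rebuilt in memory instead of reading data.csv.
df = pd.DataFrame(
    {
        "Year": [2008, 2009, 2010, 2011, 2012],
        "Counts": [100, 200, 300, 400, 500],
    }
).set_index("Year")

flat = df["Counts"].to_numpy()         # Series -> array with a single axis
column = flat.reshape(-1, 1)           # 1 column, row count inferred from -1
via_frame = df[["Counts"]].to_numpy()  # one-column DataFrame -> same layout
back = column.ravel()                  # flatten to a single axis again

print(flat.shape, flat.ndim)
print(column.shape, column.ndim)
print(via_frame.shape)
print(back.shape)
```

Which form to use depends on what the receiving function expects, so it is worth printing the shape before passing the array along.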
Although you don't have to go through steps like these to plot the results with matplotlib, doing so gives you more flexibility and more control over your x and y axes, which is a good thing. Here is an example I did with my data:
