A good three years after the first major version, pandas 2.0 has now been released. The release stabilizes features, some of which were already included in pandas 1.5, including extension arrays, the connection to Apache Arrow, and copy-on-write.
With the jump in version number, the team also cleaned up the Python library for processing and analyzing data, removing or adjusting all components marked as deprecated.
Custom data types for the library
pandas 1.0 already introduced extension arrays (EAs), which make it possible to define custom data types that deviate from the NumPy data types. The team has implemented EAs within pandas that allow missing values in any data type, such as integer or boolean.
Because earlier versions implicitly assumed NumPy data types in many places, EA support was initially incomplete, and the pandas team improved it step by step over the 1.x series. In pandas 2.0, most operations honor an EA's own methods instead of falling back to NumPy functions. This also improves performance:
# pandas 1.5.3:
In[3]: ser = pd.Series(list(range(1, 1_000_000)) + [pd.NA], dtype="Int64")
In[4]: %timeit ser.drop_duplicates()
22.7 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# pandas 2.0:
In[3]: ser = pd.Series(list(range(1, 1_000_000)) + [pd.NA], dtype="Int64")
In[4]: %timeit ser.drop_duplicates()
7.54 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Among other things, GroupBy operations no longer convert to float; instead, the EA conventions apply. Besides improved performance, this prevents a loss of precision for large numbers.
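The following minimal sketch illustrates this behavior (the data is invented for illustration): summing an Int64 column per group keeps the nullable integer type instead of silently producing float64.

import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "val": pd.array([1, 2, None], dtype="Int64"),
})
res = df.groupby("key")["val"].sum()
print(res.dtype)  # Int64 -- no conversion to float64, <NA> stays representable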
pandas 2.0 adds a new parameter to almost all I/O functions to automatically convert to nullable data types. If the parameter dtype_backend is set to "numpy_nullable", the function returns a DataFrame that consists entirely of nullable data types.
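A minimal sketch of the parameter in action (the CSV content is invented for illustration):

import io
import pandas as pd

# note the missing value in column "a" on the second row
csv = io.StringIO("a,b\n1,x\n,y")

df = pd.read_csv(csv, dtype_backend="numpy_nullable")
print(df.dtypes)  # "a" becomes Int64, "b" a nullable string type
print(df["a"])    # the missing value appears as <NA> instead of NaN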
You can also create an index that contains extension arrays. pandas 1.4.0 introduced this for the first time, and by now all operations are implemented efficiently:
- Index operations use EA functions.
- An efficient indexing engine enables data selection with loc and iloc.
- pandas no longer copies the values in a MultiIndex internally. This improves performance and allows the correct data types to be used consistently.
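A brief sketch of an EA-backed index (values invented for illustration):

import pandas as pd

idx = pd.Index([1, 2, 3], dtype="Int64")  # index backed by an extension array
ser = pd.Series(["x", "y", "z"], index=idx)

print(ser.loc[2])   # label-based selection via the EA-backed engine -> "y"
print(ser.iloc[0])  # positional selection -> "x"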
The pandas team is constantly working on the extension array interface, and each new release brings improvements.
Interaction with Apache Arrow
The Apache Arrow library defines a programming-language-independent format for in-memory data processing and represents a significant improvement over NumPy and the object data type.
pandas 1.5.0 introduced an ExtensionArray based on Arrow arrays for the first time. pandas 2.0 raises the minimum version of the Python interface PyArrow and significantly improves the connection to Arrow. pandas 1.5.3 issued numerous PerformanceWarnings indicating that a NumPy implementation was used instead of a PyArrow implementation. Most of those warnings are now obsolete. The team also improved the PyArrow integration in the parts of the library that previously did not use the EA interface due to the lack of a specialized implementation. With version 2.0, pandas uses the corresponding PyArrow compute functions in most cases.
String support for PyArrow EAs essentially corresponds to the implementation of the older extension arrays, which are defined by the string data type with the option string_storage="pyarrow" activated.
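For illustration, a short sketch of an Arrow-backed string series; the alias "string[pyarrow]" corresponds to dtype="string" with that option active:

import pandas as pd

ser = pd.Series(["a", "b", None], dtype="string[pyarrow]")
print(ser.dtype.storage)  # "pyarrow"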
A PyArrow data type can be specified in pandas either as the string f"{dtype}[pyarrow]", for example int64[pyarrow] for integer data, or with
import pandas as pd
import pyarrow as pa
dtype = pd.ArrowDtype(pa.int64())
These data types are accepted everywhere in pandas.
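A short usage sketch showing that both spellings yield the same Arrow-backed data type:

import pandas as pd
import pyarrow as pa

ser1 = pd.Series([1, 2, None], dtype="int64[pyarrow]")
ser2 = pd.Series([1, 2, None], dtype=pd.ArrowDtype(pa.int64()))
print(ser1.dtype, ser2.dtype)  # int64[pyarrow] int64[pyarrow]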
The new dtype_backend parameter for the I/O functions can also be used to generate DataFrames backed by PyArrow arrays. For a function to return a PyArrow-backed DataFrame, the parameter must be set to "pyarrow".
In addition, some I/O functions have a PyArrow-specific engine that is significantly more performant because it generates PyArrow arrays natively.
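A minimal sketch combining both options (the CSV content is invented for illustration):

import io
import pandas as pd

data = io.StringIO("a,b\n1,1.5\n2,2.5")

# engine="pyarrow" parses the CSV with PyArrow's native reader;
# dtype_backend="pyarrow" keeps the result as Arrow-backed columns
df = pd.read_csv(data, engine="pyarrow", dtype_backend="pyarrow")
print(df.dtypes)  # int64[pyarrow], double[pyarrow]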
Another benefit of PyArrow DataFrames is improved interoperability with other Arrow libraries. These can either be native PyArrow libraries or other DataFrame libraries like cuDF or Polars. When converting between the libraries, there is no need to copy the data. Marc Garcia, who is part of the pandas-core team, wrote a blog post that explains this part in more detail.
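As a hedged sketch of this interoperability, handing an Arrow-backed DataFrame to PyArrow itself; whether the conversion actually avoids copies depends on the data types involved:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": pd.Series([1, 2, 3], dtype="int64[pyarrow]")})

# Arrow-backed columns can be handed over without rebuilding the buffers
table = pa.Table.from_pandas(df)
print(table.schema)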
Less precise timestamps
Until now, pandas has represented all timestamps with nanosecond precision. Here, asm8 corresponds to the underlying datetime64 representation:
# pandas 1.5.3
In[1]: pd.Timestamp("2019-12-31").asm8
Out[1]: 2019-12-31T00:00:00.000000000
# pandas 2.0:
In[1]: pd.Timestamp("2019-12-31").asm8
Out[1]: 2019-12-31T00:00:00
Due to the high precision, it was impossible to represent dates before September 21, 1677 or after April 11, 2262, since the associated values would have exceeded the 64-bit integer limit at nanosecond precision. Dates outside this range raised an error. This was a hindrance, especially for analyses spanning millennia or millions of years.
pandas 2.0 fixes the problem with three new levels of precision:
- seconds
- milliseconds
- microseconds
This allows the library to cover all years between -2.9e11 and 2.9e11 and, for example, to create a timestamp for the year 1000:
In[5]: pd.Timestamp("1000-10-11", unit="s")
Out[5]: Timestamp('1000-10-11 00:00:00')
The unit parameter controls the level of precision.
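A quick sketch of the parameter and the related as_unit method:

import pandas as pd

ts = pd.Timestamp("2019-12-31")
print(ts.unit)           # "s" -- pandas 2.0 infers the resolution from the input
print(ts.as_unit("ns"))  # explicit conversion to nanosecond precision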
Large parts of pandas were designed around timestamps with nanosecond precision, so the team had to laboriously adapt the affected methods. Since the different precision levels are still new, individual areas of the API may not yet work as intended.