A good three years after the first major version, pandas 2.0 has now been released. The release stabilizes features, some of which were already included in pandas 1.5, among them extension arrays, the Apache Arrow integration, and copy-on-write.
In the course of the version jump, the team cleaned up the Python library for processing and analyzing data and removed or adjusted all components that had been marked as deprecated.
Custom data types for the library
pandas 1.0 already introduced extension arrays (EAs), which make it possible to define custom data types that deviate from NumPy's. The team has implemented EAs within pandas that allow missing values in any data type, such as integer or boolean.
Since earlier versions implicitly required NumPy data types in many places, the EA integration did not work completely. The pandas team improved it step by step over the 1.x series. In pandas 2.0, most calls honor an EA's own methods instead of falling back to NumPy functions. This also improves performance:
```
# pandas 1.5.3:
In: ser = pd.Series(list(range(1, 1_000_000)) + [pd.NA], dtype="Int64")
In: %timeit ser.drop_duplicates()
22.7 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# pandas 2.0:
In: ser = pd.Series(list(range(1, 1_000_000)) + [pd.NA], dtype="Int64")
In: %timeit ser.drop_duplicates()
7.54 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Among other things, GroupBy operations no longer convert to float internally; instead, the EA conventions apply. In addition to improved performance, this prevents loss of precision for larger numbers.
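To illustrate, here is a short sketch of the new behavior with a nullable integer column (the column and key names are invented for the example):

```python
import pandas as pd

# A nullable Int64 column; in pandas 2.0, GroupBy aggregations
# keep the extension dtype instead of casting to float.
df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "val": pd.array([1, 2, 3], dtype="Int64"),
})
result = df.groupby("key")["val"].sum()
print(result.dtype)  # Int64, not float64
```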
pandas 2.0 adds a new parameter, dtype_backend, to almost all I/O functions to automatically convert to nullable data types. If the parameter is set to "numpy_nullable", the function returns a DataFrame that consists only of nullable data types.
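As a sketch, reading a small CSV with the new parameter (the sample data is made up):

```python
import io
import pandas as pd

data = io.StringIO("a,b\n1,2.5\n,3.0")
df = pd.read_csv(data, dtype_backend="numpy_nullable")
# The missing value in column "a" becomes pd.NA instead of
# forcing the whole column to float64.
print(df.dtypes)  # a: Int64, b: Float64
```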
You can also create an index backed by extension arrays. pandas 1.4.0 introduced this capability for the first time, and by now all operations are implemented efficiently:
- Index operations use EA functions.
- An efficient engine handles looking up the data.
- pandas no longer copies the values internally in a MultiIndex. This improves performance and allows the correct data types to be used consistently.
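For example, an index backed by the nullable Int64 extension array can be built directly (a minimal sketch):

```python
import pandas as pd

# An Index backed by the nullable Int64 extension array;
# missing values are represented as pd.NA rather than NaN.
idx = pd.Index([1, 2, None], dtype="Int64")
print(idx.dtype)  # Int64
```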
The pandas team is constantly working on the extension array interface, and each new release brings improvements.
Interaction with Apache Arrow
The Apache Arrow library defines a programming-language-independent format for in-memory data processing and represents a significant improvement compared to NumPy.
pandas 1.5.0 introduced an ExtensionArray based on Arrow arrays for the first time. pandas 2.0 raises the minimum supported version of PyArrow, Arrow's Python interface, and significantly improves the integration with Arrow. pandas 1.5.3 issued numerous PerformanceWarnings indicating that a NumPy implementation was being used instead of a PyArrow one; most of these warnings are now obsolete. The team also improved the PyArrow integration in the parts of the library that do not use the EA interface because a specialized implementation is missing. With version 2.0, pandas uses the corresponding PyArrow compute interface in most cases.
The string support of the PyArrow EAs essentially corresponds to the implementation of the older extension arrays, which are defined by the data type string and activated with the option string_storage="pyarrow".
A PyArrow data type can be specified in pandas either as a string, such as int64[pyarrow] for integers, or via pd.ArrowDtype:

```python
import pandas as pd
import pyarrow as pa

# Note that pa.int64() is called: ArrowDtype expects a
# PyArrow DataType instance, not the factory function itself.
dtype = pd.ArrowDtype(pa.int64())
```

These data types are allowed everywhere.
The new I/O parameter dtype_backend can also be used to generate DataFrames backed by PyArrow arrays. For a function to return a PyArrow-backed DataFrame, the parameter must be set to "pyarrow".
In addition, some I/O functions have a PyArrow-specific engine that is significantly more performant because it generates PyArrow arrays natively.
Another benefit of PyArrow DataFrames is improved interoperability with other Arrow libraries. These can be either native PyArrow libraries or other DataFrame libraries such as cuDF or Polars. When converting between the libraries, there is no need to copy the data. Marc Garcia, a member of the pandas core team, wrote a blog post that explains this part in more detail.
Less precise timestamps
Until now, pandas has represented all timestamps with nanosecond precision. The following example shows the difference:
```
# pandas 1.5.3
In: pd.Timestamp("2019-12-31").asm8
Out: 2019-12-31T00:00:00.000000000

# pandas 2.0:
In: pd.Timestamp("2019-12-31").asm8
Out: 2019-12-31T00:00:00
```
Due to the high precision, it was impossible to represent dates before September 21, 1677 or after April 11, 2262, since the associated values would have exceeded the 64-bit integer limit at nanosecond precision. Any other date value raised an error. This was a hindrance, especially for analyses spanning millennia or millions of years.
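Those old limits can still be inspected via the class attributes:

```python
import pandas as pd

# The representable range at nanosecond precision is bounded by
# what fits into a signed 64-bit integer counting nanoseconds.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```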
pandas 2.0 fixes the problem with three new levels of precision: seconds, milliseconds, and microseconds. With second precision, the library can represent all years between -2.9e11 and 2.9e11 and can create a timestamp for the year 1000, for example:
```
In: pd.Timestamp("1000-10-11", unit="s")
Out: Timestamp('1000-10-11 00:00:00')
```
The unit parameter controls the level of precision.
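A timestamp's precision can also be converted after the fact with as_unit (a minimal sketch):

```python
import pandas as pd

ts = pd.Timestamp("2019-12-31")
# Convert the timestamp to millisecond precision.
ts_ms = ts.as_unit("ms")
print(ts_ms.unit)  # ms
```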
A large part of pandas is designed for timestamps with nanosecond precision. The team therefore had to laboriously modify the methods. Since the different degrees of precision are still new, it is possible that individual areas of the API do not yet work as desired.