Python library pandas 2.0 optimizes memory management

A good three years after the first major version, pandas 2.0 has now been released. The release stabilizes features, some of which were already introduced in pandas 1.5, including extension arrays, the Apache Arrow integration, and copy-on-write.

As part of the major version jump, the team cleaned up the Python library for processing and analyzing data and removed or adjusted all components marked as deprecated.

pandas 1.0 already introduced Extension Arrays (EAs), which make it possible to define custom data types that deviate from the NumPy data types. The team has implemented EAs within pandas that allow missing values in any data type, such as integer or boolean.

Since earlier versions implicitly required NumPy data types in many places, the EA integration did not work completely. The pandas team improved it step by step over the 1.x series. In pandas 2.0, most calls honor the EAs' own methods instead of falling back to NumPy functions. This also improves performance:

# pandas 1.5.3:

In[3]: ser = pd.Series(list(range(1, 1_000_000)) + [pd.NA], dtype="Int64")
In[4]: %timeit ser.drop_duplicates()
22.7 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# pandas 2.0:

In[3]: ser = pd.Series(list(range(1, 1_000_000)) + [pd.NA], dtype="Int64")
In[4]: %timeit ser.drop_duplicates()
7.54 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Among other things, GroupBy operations no longer convert to float internally; instead, the EA conventions apply. In addition to improved performance, this prevents loss of precision for larger numbers.
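
As a minimal sketch (the example data is invented here), aggregating a nullable Int64 column keeps the extension data type instead of detouring through float64:

import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "value": pd.array([1, 2, pd.NA], dtype="Int64"),
})

result = df.groupby("key")["value"].sum()
print(result.dtype)  # Int64 -- no conversion to float64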

pandas 2.0 adds a new parameter to almost all I/O functions that automatically converts the data to nullable data types. If the parameter dtype_backend is set to "numpy_nullable", the function returns a DataFrame that consists only of nullable data types.
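
A minimal sketch with read_csv (the inline CSV data is made up for illustration):

import io
import pandas as pd

data = io.StringIO("a,b\n1,x\n,y")  # column a contains a missing value

df = pd.read_csv(data, dtype_backend="numpy_nullable")
print(df.dtypes)  # a: Int64, b: string -- nullable types throughout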

You can also create an index that contains extension arrays. pandas 1.4.0 introduced this capability, and by now all operations are implemented efficiently (a short sketch follows the list):

  • Index operations use the EA functions.
  • An efficient engine allows selecting data via loc and iloc.
  • pandas no longer copies the values in a MultiIndex internally. This improves performance and allows the correct data types to be used consistently.
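
As a minimal sketch (the data is invented for illustration), an index can be built from a nullable extension array and queried with loc:

import pandas as pd

# an Index backed by the nullable Int64 extension array
idx = pd.Index(pd.array([1, 2, pd.NA], dtype="Int64"))
df = pd.DataFrame({"value": [10, 20, 30]}, index=idx)

print(df.index.dtype)  # Int64
print(df.loc[2])       # label-based selection goes through the EA engine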

The pandas team is constantly working on the extension array interface, and each new release brings improvements.

The Apache Arrow library defines a programming-language-independent format for in-memory data processing and represents a significant improvement over NumPy and the object data type.

pandas 1.5.0 introduced an ExtensionArray based on Arrow arrays for the first time. pandas 2.0 raises the minimum supported version of the Python interface PyArrow and significantly improves the Arrow integration. In pandas 1.5.3, numerous PerformanceWarnings were issued indicating that a NumPy implementation was being used instead of a PyArrow implementation. Most of these warnings are now obsolete. The team also improved the PyArrow integration in those parts of the library that do not use the EA interface for lack of a specialized implementation. With version 2.0, pandas uses the corresponding PyArrow compute functions in most cases.

The string support for PyArrow EAs essentially corresponds to the implementation of the older extension arrays, which are defined via the data type string and activated with the option string_storage="pyarrow".
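
For illustration, the older string extension array can be switched to the Arrow-backed storage via that option (a minimal sketch):

import pandas as pd

pd.set_option("mode.string_storage", "pyarrow")

ser = pd.Series(["a", "b", None], dtype="string")
print(ser.dtype)  # string[pyarrow]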

A PyArrow data type can be specified in pandas either as a string of the form f"{dtype}[pyarrow]" (for example "int64[pyarrow]" for 64-bit integers) or created with:

import pandas as pd
import pyarrow as pa

dtype = pd.ArrowDtype(pa.int64())

These data types are accepted everywhere in pandas.
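
For example, both spellings create the same Arrow-backed Series (a minimal sketch):

import pandas as pd

ser = pd.Series([1, 2, None], dtype="int64[pyarrow]")  # string spelling
print(ser.dtype)  # int64[pyarrow]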

The new I/O-function parameter dtype_backend can also be used to generate DataFrames backed by PyArrow arrays. For a function to return a PyArrow-backed DataFrame, the parameter must be set to "pyarrow".

In addition, some I/O functions have a PyArrow-specific engine that is significantly more performant because it generates PyArrow arrays natively.
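
A minimal sketch combining both options (the inline data is made up): engine="pyarrow" lets PyArrow parse the input natively, and dtype_backend="pyarrow" keeps the resulting Arrow arrays:

import io
import pandas as pd

data = io.StringIO("a,b\n1,2\n3,4")

df = pd.read_csv(data, engine="pyarrow", dtype_backend="pyarrow")
print(df.dtypes)  # int64[pyarrow] for both columns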

Another benefit of PyArrow-backed DataFrames is improved interoperability with other Arrow libraries. These can be native PyArrow libraries or other DataFrame libraries such as cuDF or Polars. When converting between these libraries, the data does not need to be copied. Marc Garcia, a member of the pandas core team, wrote a blog post that explains this part in more detail.
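
For example, a PyArrow-backed DataFrame can be handed over to PyArrow itself (a sketch; from there, the same Arrow data can travel on to other Arrow-based libraries):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": pd.array([1, 2, 3], dtype="int64[pyarrow]")})

table = pa.Table.from_pandas(df)  # can reuse the Arrow buffers rather than copying
print(table.schema)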

Until now, pandas has stored all timestamps with nanosecond precision. In the following examples, asm8 returns the underlying datetime64 representation:

# pandas 1.5.3
 
In[1]: pd.Timestamp("2019-12-31").asm8
Out[1]: 2019-12-31T00:00:00.000000000

# pandas 2.0:

In[1]: pd.Timestamp("2019-12-31").asm8
Out[1]: 2019-12-31T00:00:00

Due to this high precision, it was impossible to represent dates before September 21, 1677 or after April 11, 2262, since the associated values would have exceeded the 64-bit integer limit at nanosecond precision. Dates outside this range raised an error. This was a hindrance, especially for investigations spanning millennia or millions of years.
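
These limits can be checked via the class attributes Timestamp.min and Timestamp.max (output shown here for the nanosecond-only versions):

In[5]: pd.Timestamp.min
Out[5]: Timestamp('1677-09-21 00:12:43.145224193')

In[6]: pd.Timestamp.max
Out[6]: Timestamp('2262-04-11 23:47:16.854775807')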

pandas 2.0 fixes the problem with three new levels of precision:

  • seconds
  • milliseconds
  • microseconds

This allows the library to cover all years between -2.9e11 and 2.9e11 and, for example, to create a timestamp for the year 1000:

In[5]: pd.Timestamp("1000-10-11", unit="s")
Out[5]: Timestamp('1000-10-11 00:00:00')

The parameter unit controls the level of precision.
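
A timestamp's resolution can also be inspected and converted after the fact (a minimal sketch using the unit attribute and the as_unit method):

import pandas as pd

ts = pd.Timestamp("1000-10-11")  # a date-only string is parsed with second precision here
print(ts.unit)                   # 's'

ts_ms = ts.as_unit("ms")         # convert to millisecond precision
print(ts_ms.unit)                # 'ms'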

A large part of pandas is designed for timestamps with nanosecond precision, so the team had to adapt the methods laboriously. Since the different levels of precision are still new, individual areas of the API may not yet work as expected.
