Is Python Really a Bottleneck?

Published
December 21, 2020
Tags
Data Engineering, Python, Architecture & Design, Open-Source
Photo by Josh Hild from Pexels

Full disclosure: I currently work as a Python engineer, so you can consider me biased. Still, I want to unpack some of the criticism of Python and reflect on whether the speed concerns are valid for day-to-day work in data engineering, data science, and analytics.

Is Python too slow?

From my perspective, such questions should be asked in the context of a specific use case. Is Python slow at crunching numbers compared to compiled languages such as C? Yes, it is. This has been known for years, and it is the reason why Python libraries for which speed matters, such as NumPy, leverage C under the hood.
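As a rough illustration of that point, the sketch below computes the same sum of squares twice: once with an interpreted Python loop and once with NumPy, which runs the loop in compiled code. The function names and the array size are arbitrary choices for this example, and the actual timings depend on your machine.

```python
import timeit

import numpy as np


def sum_of_squares_loop(values):
    """Pure-Python loop: every multiplication goes through the interpreter."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(array):
    """NumPy version: the loop runs inside compiled code."""
    return float(np.dot(array, array))


if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=np.float64)
    as_list = data.tolist()

    # Time five runs of each variant on the same values.
    loop_time = timeit.timeit(lambda: sum_of_squares_loop(as_list), number=5)
    numpy_time = timeit.timeit(lambda: sum_of_squares_numpy(data), number=5)
    print(f"loop:  {loop_time:.3f}s")
    print(f"numpy: {numpy_time:.3f}s")
```

Both functions return the same number; only the place where the loop executes differs.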

But is Python that much slower than other (more difficult to learn and use) languages for all use cases? If you look at the performance benchmarks of Python libraries optimized to solve a specific problem, they do decently well compared to compiled languages. For instance, look at the FastAPI performance benchmark [1]. Obviously, Go as a compiled language is much faster than Python. Still, FastAPI beats some of Go’s libraries for building REST APIs:

Web Framework Benchmarks — image by the author

Side note: the above list does not include C++ and Java web frameworks, which had even higher performance.

Similarly, when comparing Dask (written in Python) against Spark (written in Scala) for data-intensive neuroimaging pipelines [2], the authors concluded:

“Overall, our results show no substantial performance difference between the engines.”

The question we should be asking ourselves is what speed we really need. If you run an ETL job that is triggered only once per day, you may not care whether it takes 20 seconds or 200 seconds. You may then prefer code that is easy to understand, package, and maintain, especially given that compute resources are becoming increasingly affordable compared to costly engineering time.

Code speed vs. practicality

From a pragmatic standpoint, there are many different questions that we need to answer when choosing a programming language for day-to-day work.

Can you reliably solve multiple business problems with this language?

If all you care about is speed, then don’t use Python, period. There are much faster alternatives for all sorts of use cases. Python's main benefits lie in its readability, its ease of use, and the wide range of problems that can be solved with it. Python can be used as a glue that ties together a myriad of different systems, services, and use cases.

Can you find enough employees that know this language?

Since Python is so easy to learn and use, the number of Python users is constantly growing. Business users who were previously crunching numbers in Excel can now quickly learn to code in pandas and become self-sufficient without constantly relying on IT resources. This reduces the burden on the IT and analytics departments and also improves the time to value.

These days, it’s easier to find data engineers who know Python and can maintain a Spark data processing application in this language, rather than those who could do the same in Java or Scala. Many organizations are gradually switching to Python for many use cases simply because of the higher chances of finding employees who “speak” that language.

In contrast, I know companies that desperately need Java or C# developers to maintain their existing applications, but those languages are difficult (they take years to master) and seem unattractive to new programmers, who can potentially earn more in jobs that leverage much easier languages such as Go or Python.

Synergies between experts from different domains

If your company leverages Python, chances are high that the same language can be used by business users, data analysts, data scientists, data engineers, backend and web developers, DevOps engineers, and even system administrators. This leads to synergies in projects where people from different domains can work together and leverage the same tools.

What are the true bottlenecks in data processing?

Based on my own work, I usually experienced bottlenecks not in the language itself but rather in the external resources. To be more concrete, let’s look at several examples.

Writes to relational databases

When processing data in an ETL fashion, we eventually need to load this data into some centralized place. While we could leverage multithreading in Python to write data to a relational database faster (by using more threads), chances are that the increased number of parallel writes would max out the CPU capacity of that database.

In fact, this happened to me once when I was using multithreading to speed up the writes to an RDS Aurora database on AWS. I then noticed that the CPU utilization for the writer node went up so high that I had to deliberately make my code slower by using fewer threads to ensure that I wouldn’t break the database instance.

This means that Python has mechanisms to parallelize and speed up many operations, but your relational database (limited by its number of CPU cores) has limits that are unlikely to be overcome just by using a faster programming language.
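A minimal sketch of that trade-off, assuming a thread pool whose size is the only tuning knob. The names load_in_parallel and FakeDatabase are hypothetical and exist only for this illustration; in a real pipeline, write_batch would issue INSERT statements against the database instead of appending to a list.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def load_in_parallel(batches, write_batch, max_workers=4):
    """Write batches concurrently, never using more than max_workers threads.

    max_workers is the knob to turn down when the database CPU saturates.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all writes and re-raises any exception from a worker.
        return list(pool.map(write_batch, batches))


class FakeDatabase:
    """Hypothetical in-memory stand-in for a real database connection."""

    def __init__(self):
        self.rows = []
        self._lock = threading.Lock()

    def write_batch(self, batch):
        # The lock keeps concurrent appends to the shared list consistent.
        with self._lock:
            self.rows.extend(batch)
        return len(batch)


if __name__ == "__main__":
    db = FakeDatabase()
    batches = [[(i, i * 2) for i in range(start, start + 100)]
               for start in range(0, 1000, 100)]
    written = load_in_parallel(batches, db.write_batch, max_workers=2)
    print(f"wrote {sum(written)} rows in {len(written)} batches")
```

Lowering max_workers makes the load slower on the Python side, which is exactly the point: the database, not the language, sets the ceiling.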

Making calls to external APIs

Extracting data from external REST APIs for your analytics needs is another example where the language itself doesn’t seem to be the bottleneck. While we could speed up the extraction by leveraging parallelism, this could be in vain because many external APIs limit the number of requests we can make within a specific time period. Thus, you may often find yourself deliberately slowing your script down, for example with a call such as time.sleep(10), to ensure that you don’t exceed the API’s request limits.
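One way to make that pause less arbitrary than a fixed sleep is a small throttle that only waits for the time still remaining since the previous request. This is a sketch under assumptions: Throttle, fetch_all, and fetch_page are hypothetical names, and the interval would have to come from the rate limit documented by the API you are calling.

```python
import time


class Throttle:
    """Allow at most one call every min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_all(page_ids, fetch_page, throttle):
    """Call fetch_page for every id, pausing between requests as needed."""
    results = []
    for page_id in page_ids:
        throttle.wait()
        results.append(fetch_page(page_id))
    return results


if __name__ == "__main__":
    # The lambda is a hypothetical stand-in for a real HTTP request.
    throttle = Throttle(min_interval=0.2)  # roughly 5 requests per second
    pages = fetch_all(range(5), lambda page_id: {"id": page_id}, throttle)
    print(pages)
```

The clock and sleep functions are injectable so the throttle can be exercised in tests without actually waiting.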

Working with Big Data

From my experience working with vast datasets, you can’t load really “big data” into your laptop’s memory regardless of which language you use. For such use cases, you will likely need to leverage distributed processing frameworks such as Dask, Spark, Ray, etc. There is a limit to how much data you can process when using a single server instance or your laptop.

If you want to offload the actual data processing to a cluster of compute nodes, possibly even making use of GPU instances that can further speed up compute, Python happens to have a large ecosystem of frameworks that make this task easy:

  • Do you want to speed up compute for data science by leveraging GPUs? Use PyTorch, TensorFlow, Ray, or RAPIDS (even with SQL, via BlazingSQL).
  • Do you want to speed up your Python code to process Big Data? Use Spark (or Databricks), Dask, or Prefect (which abstracts away Dask under the hood).
  • Do you want to speed up your data processing for analytics? Use fast, specialized in-memory columnar databases that ensure high-speed processing just by using SQL queries.

And if you need to orchestrate and monitor data processing that occurs on a cluster of compute nodes, there are several workflow management platforms, written in Python, that will speed up the development and improve maintenance of your data pipelines, such as Apache Airflow, Prefect, or Dagster. If you want to learn more about those, have a look at my previous articles.

As a side note, I can imagine that some people complaining about Python don’t leverage it to its full capacity or may not be using proper data structures for the problem at hand.
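To make the data-structure point concrete, here is a small sketch that runs the same membership checks against a list and against a set. The function name and the sizes are arbitrary choices for this illustration, and the timings will vary by machine.

```python
import timeit


def count_known_ids(candidates, known_ids):
    """Count how many candidates appear in known_ids."""
    return sum(1 for c in candidates if c in known_ids)


if __name__ == "__main__":
    known_list = list(range(10_000))
    known_set = set(known_list)  # same values, hash-based lookup
    candidates = list(range(0, 20_000, 20))

    # "in" scans a list element by element but uses a hash lookup on a set.
    list_time = timeit.timeit(
        lambda: count_known_ids(candidates, known_list), number=3)
    set_time = timeit.timeit(
        lambda: count_known_ids(candidates, known_set), number=3)
    print(f"list lookups: {list_time:.4f}s")
    print(f"set lookups:  {set_time:.4f}s")
```

The code that does the counting is identical in both runs; only the container changes.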

To summarize, if you need to process large amounts of data quickly, you will likely need more compute resources rather than a faster programming language, and there are Python libraries that make it easy to distribute work across hundreds of nodes.

Conclusion

In this article, we discussed whether Python is a real bottleneck in the current data processing landscape. While Python is slower than many compiled languages, it’s easy to use and extremely versatile. We noticed that, for many, the practicality of the language outweighs the speed considerations.

Lastly, we discussed that, at least in data engineering, the language itself might not be the bottleneck, but rather the limits of external systems and the sheer amount of data that prohibits its processing on a single machine regardless of the chosen programming language.

If this article was helpful, follow me to see my next posts. Thank you for reading!

References:

[1] TechEmpower: Web Framework Benchmarks.

[2] Mathieu Dugré, Valérie Hayot-Sasson, Tristan Glatard: “A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines.” arXiv.