DuckDB combines the simplicity and ease of use of SQLite with the analytical performance of specialized columnar databases. Learn more with Python examples.
Image by Author
DuckDB is a free, open-source, embedded database management system designed for data analytics and online analytical processing. This means several things:
In summary, DuckDB provides an easy-to-use, embedded analytic database for applications that need fast and simple data analysis capabilities. It fills a niche for analytical processing where a full database server would be overkill.
There are many reasons companies are now building products on top of DuckDB. The database is designed for fast analytical queries which means it’s optimized for aggregations, joins, and complex queries on large datasets – the types of queries often used in analytics and reporting. Moreover:
In short, DuckDB combines the simplicity and ease of use of SQLite with the analytical performance of specialized columnar databases. All of these factors – performance, simplicity, features, and open source license – contribute to DuckDB’s growing popularity among developers and data analysts.
Let’s test out a few features of DuckDB using the Python API.
You can instal DuckDB using Pypi:
For other programming language, head to the DuckDB’s installation guide.
In this example, we will be using Data Science Salaries 2023 CSV dataset from Kaggle and try to test DuckDB’s various functionalities.
You can load a CSV file just like pandas into a relation. DuckDB provides a relational API that allows users to link query operations together. The queries are lazily evaluated, which enables DuckDB to optimize their execution.
We have loaded the data science salary dataset and displayed the alias.
To display the column names we will use .columns
similar to pandas.
You can apply multiple functions to the relation to get specific results. In our case, we have filtered out “work_year”, displayed only three columns, and ordered and limited them to display the bottom five job titles based on the salaries.
Learn more about Relational API by following the guide.
You can also use Relational API to join two datasets. In our case, we are joining the same dataset by changing the alias name on a “job_title”.
There are direct methods too. You just have to write SQL query to perform analysis on the dataset. Instead of the table name, you will write the location and name of the CSV file.
By default, DuckDB operates on an in-memory database. This means that any tables created are stored in memory and not persisted to disk. However, by using the .connect()
method, a connection can be made to a persistent database file on disk. Any data written to that database connection will then be saved to the disk file and reloaded when reconnecting to the same file.
We can also create the new table using a data science salary CSV file.
After performing all the tasks, you must close the connection to the database.
Why do I like DuckDB? It is fast and simple to learn and manage. I believe simplicity is the main reason DuckDB has become widely used in the data science community. DuckDB provides an intuitive SQL interface that is easy for data analysts and scientists to pick up. Installation is straightforward, and the database files are light and manageable. All of these make DuckDB a joy to use.
Check out my previous Deepnote article on Data Science with DuckDB for an in-depth analysis of features and use cases.
With robust tools for data loading, managing, and analysis, DuckDB offers an attractive option compared to other database solutions for data science. I believe DuckDB will continue gaining users in the coming years as more data professionals discover its user-friendly nature.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in Technology Management and a bachelor’s degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
By subscribing you accept KDnuggets Privacy Policy
Get the FREE ebook ‘The Great Big Natural Language Processing Primer’ and ‘The Complete Collection of Data Science Cheat Sheets’ along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.
By subscribing you accept KDnuggets Privacy Policy
Subscribe To Our Newsletter
(Get The Complete Collection of Data Science Cheat Sheets & Great Big NLP Primer ebook)
Get the FREE ebook ‘The Great Big Natural Language Processing Primer’ and ‘The Complete Collection of Data Science Cheat Sheets’ along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.
By subscribing you accept KDnuggets Privacy Policy
Get the FREE ebook ‘The Great Big Natural Language Processing Primer’ and ‘The Complete Collection of Data Science Cheat Sheets’ along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.
By subscribing you accept KDnuggets Privacy Policy