How to remove *some* duplicates from a dataframe

A TIL about how to partially de-dupe


I have a dataframe that has contains duplicates. And although I wanted to get rid of many of the duplicates, I also wanted to keep some duplicates on certain conditions. In this post, I outline the motivating problem and the solution (Note: I’m using polars here but you could use pandas/etc for your dataframe).

The motivating problem

To illustrate, here’s a random dataframe of foods, their categories, and some “last_date” field.

foodfood_typelast_date
carrotvegetable2024-01-28
lettucevegetable2024-01-28
lettucevegetable2024-01-27
applefruit2024-01-28
bananafruit2024-01-28
bananafruit2024-01-27
salmonseafood2024-01-28
salmonseafood2024-01-27

What I want in this example here is to keep both lettuces but discard the oldest fruit. In other words, I want my final table to look like this:

foodfood_typelast_date
carrotvegetable2024-01-28
lettucevegetable2024-01-28
lettucevegetable2024-01-27
applefruit2024-01-28
bananafruit2024-01-28
salmonseafood2024-01-28

See how there are two fewer rows? The second banana and the second salmon entries drop off, but the second lettuce entry stays.

How can you do that?

The way to reason about this is that you’re going to (1) split the dataframe into two separate dataframes, (2) do your deduplication on one of them, and then (3) concatenate them into a new dataframe. Here’s some code:

# Sort the dataframe
df_sorted = df.sort(["last_date"], descending=True)

# Split the dataframe into two dataframes, one with vegetables and one without
df_vegetables = df_sorted.filter(df_sorted["food_type"] == "vegetable")
df_others = df_sorted.filter(df_sorted["food_type"] != "vegetable")

# Drop the duplicates from the dataframe that does not have vegetables in it
df_others_latest = df_others.unique(subset=["food"])

# Combine the dataframes again, now with the dupes dropped
final_df = pl.concat([df_vegetables, df_others_latest])

And voila, a conditionally deduped dataframe.

Is there a better way to do this? Perhaps! But it worked for me!

AI Disclosure

ChatGPT helped me think through a solution here when I got stuck. Hopefully this is the best way to do it, but if there’s a better way, please let me know and I’ll update the blog with a better solution.