<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>gen-ai Archives - Vijay Gokarn</title>
	<atom:link href="https://vijay-gokarn.com/tag/gen-ai/feed/" rel="self" type="application/rss+xml" />
	<link>https://vijay-gokarn.com/tag/gen-ai/</link>
	<description>&#34;Ignite Curiosity. Fuel the Future.&#34;</description>
	<lastBuildDate>Sun, 19 Apr 2026 03:33:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/vijay-gokarn.com/wp-content/uploads/2023/09/cropped-ideogram.jpeg?fit=32%2C32&#038;ssl=1</url>
	<title>gen-ai Archives - Vijay Gokarn</title>
	<link>https://vijay-gokarn.com/tag/gen-ai/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">230943525</site>	<item>
		<title>Pandas Remove Duplicates</title>
		<link>https://vijay-gokarn.com/pandas-remove-duplicates/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=pandas-remove-duplicates</link>
		
		<dc:creator><![CDATA[Vijay Gokarn]]></dc:creator>
		<pubDate>Tue, 09 Jul 2024 11:12:55 +0000</pubDate>
				<category><![CDATA[ai-agents]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[generative-ai]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[data-analysis]]></category>
		<category><![CDATA[gen-ai]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://vijay-gokarn.com/?p=119</guid>

					<description><![CDATA[<p>Data Engineering · Python · Pandas · Data Cleaning. Handling Duplicate Rows in Pandas — Identify, Remove &#038; Export Clean Data. Duplicate rows are one of the most common data quality issues — and one of the most damaging to [&#8230;]</p>
<p>The post <a href="https://vijay-gokarn.com/pandas-remove-duplicates/">Pandas Remove Duplicates</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Cormorant+Garamond:ital,wght@0,300;0,400;0,600;1,300;1,400&#038;family=DM+Sans:wght@300;400;500&#038;family=DM+Mono:wght@400&#038;display=swap" rel="stylesheet">

<style>
.vg8 {
  --ink: #0e0e0e; --paper: #f7f4ef; --paper-dark: #ede9e1;
  --teal: #0f6e56; --teal-light: #1d9e75; --teal-muted: #e1f5ee;
  --amber: #ba7517; --amber-light: #fac775; --amber-muted: #faeeda;
  --charcoal: #2c2c2a; --muted: #888780;
  --border: rgba(14,14,14,0.12); --border-strong: rgba(14,14,14,0.25);
  --code-bg: #161b22; --code-header: #2d333b; --code-border: rgba(255,255,255,0.06);
  font-family: 'DM Sans', sans-serif; font-weight: 300;
  color: var(--ink); background: var(--paper); line-height: 1.75; font-size: 16px; overflow-x: hidden;
}
.vg8 *, .vg8 *::before, .vg8 *::after { box-sizing: border-box; margin: 0; padding: 0; }

/* HERO */
.vg8-hero { background: #0d1117; padding: 5rem 4rem 4rem; position: relative; overflow: hidden; }
.vg8-hero::before {
  content: '⊕'; font-family: 'Cormorant Garamond', serif; font-size: 22rem;
  font-weight: 300; color: rgba(255,255,255,0.025); position: absolute;
  right: 1rem; bottom: -5rem; line-height: 1; pointer-events: none;
}
.vg8-hero-inner { position: relative; z-index: 1; max-width: 900px; }
.vg8-eyebrow { font-size: 0.68rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; margin-bottom: 1.25rem; display: flex; align-items: center; gap: 0.75rem; }
.vg8-eyebrow::before { content: ''; display: inline-block; width: 1.5rem; height: 1px; background: var(--teal-light); }
.vg8-hero h1 { font-family: 'Cormorant Garamond', serif; font-size: clamp(2.2rem, 5vw, 3.8rem); font-weight: 300; line-height: 1.1; color: var(--paper); letter-spacing: -0.02em; margin-bottom: 1.5rem; max-width: 28ch; }
.vg8-hero h1 em { font-style: italic; color: var(--amber-light); }
.vg8-meta-row { display: flex; gap: 2rem; flex-wrap: wrap; }
.vg8-meta { font-size: 0.7rem; letter-spacing: 0.1em; text-transform: uppercase; color: rgba(247,244,239,0.35); }
.vg8-meta span { color: rgba(247,244,239,0.7); margin-left: 0.4rem; }

/* STACK BAND */
.vg8-stack-band { background: var(--teal); padding: 1.1rem 4rem; display: flex; gap: 0.75rem; flex-wrap: wrap; align-items: center; }
.vg8-stack-label { font-size: 0.63rem; letter-spacing: 0.18em; text-transform: uppercase; color: rgba(255,255,255,0.6); font-weight: 400; margin-right: 0.4rem; }
.vg8-stack-pill { font-size: 0.7rem; letter-spacing: 0.05em; padding: 0.28rem 0.85rem; background: rgba(255,255,255,0.12); color: #fff; border: 0.5px solid rgba(255,255,255,0.2); }

/* INTRO */
.vg8-intro { background: var(--teal-muted); padding: 2.5rem 4rem; border-left: 4px solid var(--teal); }
.vg8-intro p { font-size: 1.05rem; line-height: 1.85; color: var(--charcoal); font-weight: 300; max-width: 80ch; }
.vg8-intro strong { color: var(--teal); font-weight: 500; }

/* BODY */
.vg8-body { max-width: 900px; margin: 0 auto; padding: 4rem; }
.vg8-step { margin-bottom: 3.5rem; }
.vg8-step-label { font-size: 0.63rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-step-label::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg8-step h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.4rem, 3vw, 2rem); font-weight: 300; line-height: 1.2; color: var(--ink); margin-bottom: 1rem; }
.vg8-step h2 em { font-style: italic; color: var(--teal); }
.vg8-step p { font-size: 0.93rem; line-height: 1.9; color: var(--charcoal); font-weight: 300; margin-bottom: 1rem; }
.vg8-step p strong { color: var(--ink); font-weight: 500; }
.vg8-divider { border: none; border-top: 0.5px solid var(--border); margin: 3rem 0; }
.vg8-ic { font-family: 'DM Mono', monospace; font-size: 0.82rem; background: rgba(14,14,14,0.07); padding: 0.1rem 0.4rem; color: var(--ink); }

/* CALLOUT */
.vg8-callout { background: var(--paper-dark); border-left: 3px solid var(--amber); padding: 1.25rem 1.5rem; margin: 1.25rem 0; font-size: 0.87rem; line-height: 1.8; color: var(--charcoal); }
.vg8-callout strong { color: var(--amber); font-weight: 500; }
.vg8-callout.teal { border-color: var(--teal); }
.vg8-callout.teal strong { color: var(--teal); }

/* STRATEGY CARDS */
.vg8-strategy-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1.25rem; margin: 1.5rem 0; }
.vg8-strategy-card { background: var(--paper); border: 0.5px solid var(--border-strong); padding: 1.5rem; position: relative; }
.vg8-strategy-card::before { content: ''; position: absolute; top: 0; left: 0; width: 100%; height: 4px; }
.vg8-strategy-card:nth-child(1)::before { background: var(--muted); }
.vg8-strategy-card:nth-child(2)::before { background: var(--amber); }
.vg8-strategy-card:nth-child(3)::before { background: var(--teal); }
.vg8-strategy-card .vg8-strat-tag { font-family: 'DM Mono', monospace; font-size: 0.65rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.5rem; display: block; }
.vg8-strategy-card:nth-child(2) .vg8-strat-tag { color: var(--amber); }
.vg8-strategy-card:nth-child(3) .vg8-strat-tag { color: var(--teal); }
.vg8-strategy-card h3 { font-family: 'Cormorant Garamond', serif; font-size: 1.15rem; font-weight: 400; color: var(--ink); margin-bottom: 0.4rem; }
.vg8-strategy-card p { font-size: 0.82rem; line-height: 1.7; color: var(--charcoal); font-weight: 300; }

/* PIPELINE */
.vg8-pipeline { display: flex; flex-direction: column; gap: 0; margin: 1.5rem 0; }
.vg8-pipeline-step { display: grid; grid-template-columns: 52px 1fr; gap: 1.5rem; padding: 1.25rem 0; border-top: 0.5px solid var(--border); align-items: start; }
.vg8-pipeline-step:last-child { border-bottom: 0.5px solid var(--border); }
.vg8-pipeline-num { width: 36px; height: 36px; background: var(--teal); display: flex; align-items: center; justify-content: center; font-family: 'Cormorant Garamond', serif; font-size: 1.1rem; font-weight: 300; color: var(--paper); flex-shrink: 0; }
.vg8-pipeline-body h4 { font-family: 'Cormorant Garamond', serif; font-size: 1.1rem; font-weight: 400; color: var(--ink); margin-bottom: 0.3rem; }
.vg8-pipeline-body p { font-size: 0.83rem; line-height: 1.7; color: var(--charcoal); font-weight: 300; }

/* CODE BLOCKS */
.vg8-code-wrap { margin: 1.25rem 0; border: 0.5px solid var(--code-border); overflow: hidden; }
.vg8-code-header { background: var(--code-header); padding: 0.6rem 1.25rem; display: flex; justify-content: space-between; align-items: center; border-bottom: 0.5px solid var(--code-border); }
.vg8-code-filename { font-family: 'DM Mono', monospace; font-size: 0.68rem; color: rgba(247,244,239,0.45); letter-spacing: 0.04em; }
.vg8-code-lang { font-size: 0.6rem; letter-spacing: 0.14em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; }
.vg8-code-body { background: var(--code-bg); padding: 1.5rem; overflow-x: auto; }
.vg8-code-body pre { margin: 0; }
.vg8-code-body code { font-family: 'DM Mono', monospace; font-size: 0.82rem; line-height: 1.85; color: #e6edf3; white-space: pre; display: block; }
/* tokens */
.t8-k { color: #ff7b72; }
.t8-s { color: #a5d6ff; }
.t8-c { color: #8b949e; font-style: italic; }
.t8-f { color: #d2a8ff; }
.t8-n { color: #79c0ff; }
.t8-v { color: #ffa657; }
.t8-b { color: var(--amber-light); }

/* FULL SCRIPT SECTION */
.vg8-full-section { background: var(--paper-dark); padding: 4rem; }
.vg8-full-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-full-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg8-full-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--ink); margin-bottom: 0.75rem; }
.vg8-full-section > h2 em { font-style: italic; color: var(--teal); }
.vg8-full-section > p { font-size: 0.9rem; color: var(--charcoal); font-weight: 300; line-height: 1.8; margin-bottom: 2rem; max-width: 70ch; }

/* INTERVIEW */
.vg8-interview-section { background: var(--ink); padding: 4rem; }
.vg8-interview-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--amber-light); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-interview-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--amber-light); }
.vg8-interview-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--paper); margin-bottom: 2.5rem; }
.vg8-interview-section > h2 em { font-style: italic; color: var(--amber-light); }
.vg8-qa-list { display: flex; flex-direction: column; }
.vg8-qa-item { display: grid; grid-template-columns: 1fr 1.4fr; gap: 2rem; padding: 1.5rem 0; border-top: 0.5px solid rgba(247,244,239,0.1); align-items: start; }
.vg8-qa-item:last-child { border-bottom: 0.5px solid rgba(247,244,239,0.1); }
.vg8-qa-q { font-family: 'Cormorant Garamond', serif; font-size: 1.05rem; font-weight: 400; color: var(--paper); line-height: 1.4; }
.vg8-q-badge { font-family: 'DM Mono', monospace; font-size: 0.58rem; letter-spacing: 0.1em; text-transform: uppercase; background: var(--teal); color: var(--paper); padding: 0.15rem 0.5rem; margin-bottom: 0.5rem; display: inline-block; }
.vg8-qa-a { font-size: 0.83rem; line-height: 1.8; color: rgba(247,244,239,0.65); font-weight: 300; }
.vg8-qa-a strong { color: var(--amber-light); font-weight: 400; }
.vg8-qa-a code { font-family: 'DM Mono', monospace; font-size: 0.77rem; background: rgba(247,244,239,0.08); padding: 0.1rem 0.35rem; color: var(--paper); }
.vg8-pills { display: flex; flex-wrap: wrap; gap: 0.5rem; margin-top: 0.75rem; }
.vg8-pill { font-size: 0.67rem; letter-spacing: 0.06em; padding: 0.25rem 0.75rem; border: 0.5px solid rgba(247,244,239,0.15); color: rgba(247,244,239,0.5); }
.vg8-pill.t { border-color: var(--teal-light); color: var(--teal-light); }
.vg8-pill.a { border-color: var(--amber-light); color: var(--amber-light); }

/* FOOTER */
.vg8-footer { background: #0d1117; padding: 3rem 4rem; display: flex; justify-content: space-between; align-items: center; flex-wrap: wrap; gap: 1.5rem; border-top: 0.5px solid rgba(247,244,239,0.06); }
.vg8-footer p { font-size: 0.82rem; color: rgba(247,244,239,0.35); font-weight: 300; }
.vg8-footer p strong { color: rgba(247,244,239,0.65); font-weight: 400; }
.vg8-footer-links { display: flex; gap: 1rem; }
.vg8-btn { display: inline-block; padding: 0.65rem 1.75rem; font-size: 0.7rem; letter-spacing: 0.12em; text-transform: uppercase; text-decoration: none; font-weight: 400; }
.vg8-btn.primary { background: var(--teal); color: var(--paper); }
.vg8-btn.ghost { background: transparent; color: rgba(247,244,239,0.55); border: 0.5px solid rgba(247,244,239,0.2); }

/* REVEAL */
.vg8-reveal { opacity: 0; transform: translateY(20px); transition: opacity 0.55s ease, transform 0.55s ease; }
.vg8-reveal.vg8-vis { opacity: 1; transform: translateY(0); }
.vg8-d1 { transition-delay: 0.1s; } .vg8-d2 { transition-delay: 0.2s; } .vg8-d3 { transition-delay: 0.3s; }
</style>

<div class="vg8">

<!-- HERO -->
<div class="vg8-hero">
  <div class="vg8-hero-inner">
    <p class="vg8-eyebrow">Data Engineering · Python · Pandas · Data Cleaning</p>
    <h1>Handling Duplicate Rows in Pandas — <em>Identify, Remove &#038; Export Clean Data</em></h1>
    <div class="vg8-meta-row">
      <p class="vg8-meta">Library<span>pandas</span></p>
      <p class="vg8-meta">Methods<span>duplicated() · drop_duplicates() · reset_index()</span></p>
      <p class="vg8-meta">Output<span>Cleaned CSV</span></p>
    </div>
  </div>
</div>

<!-- STACK BAND -->
<div class="vg8-stack-band">
  <span class="vg8-stack-label">Stack</span>
  <span class="vg8-stack-pill">Python</span>
  <span class="vg8-stack-pill">pandas</span>
  <span class="vg8-stack-pill">df.duplicated()</span>
  <span class="vg8-stack-pill">drop_duplicates()</span>
  <span class="vg8-stack-pill">reset_index()</span>
  <span class="vg8-stack-pill">to_csv()</span>
</div>

<!-- INTRO -->
<div class="vg8-intro">
  <p>Duplicate rows are one of the most common data quality issues — and one of the most damaging to model accuracy and analysis reliability. <strong>Pandas</strong> gives you precise tools to detect, inspect, and remove duplicates with a single line of code. This guide walks through the full pipeline: load, detect, choose a strategy, clean, and export.</p>
</div>

<!-- BODY -->
<div class="vg8-body">

  <!-- WHY IT MATTERS -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Context</p>
    <h2>Why duplicates <em>matter</em></h2>
    <p>Duplicate rows skew aggregations, inflate record counts, bias ML model training, and produce misleading visualizations. A sales total that counts the same transaction twice, a classifier trained on repeated samples — both produce results that look correct but aren&#8217;t. <strong>Clean data is the foundation everything else is built on.</strong></p>
    <div class="vg8-strategy-grid">
      <div class="vg8-strategy-card vg8-reveal vg8-d1">
        <span class="vg8-strat-tag">keep=&#8217;first&#8217;</span>
        <h3>Keep First</h3>
        <p>Drop all duplicates <em>except</em> the first occurrence. The original record is preserved. Most common default choice.</p>
      </div>
      <div class="vg8-strategy-card vg8-reveal vg8-d2">
        <span class="vg8-strat-tag">keep=&#8217;last&#8217;</span>
        <h3>Keep Last</h3>
        <p>Drop all duplicates <em>except</em> the last occurrence. Useful when later records represent updated values.</p>
      </div>
      <div class="vg8-strategy-card vg8-reveal vg8-d3">
        <span class="vg8-strat-tag">keep=False</span>
        <h3>Drop All</h3>
        <p>Remove every instance of a duplicated row — including the first. Use when any duplicated record is invalid.</p>
      </div>
    </div>
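    <p>As a quick illustration, here is a minimal sketch of all three strategies on a hypothetical toy frame (the <code class="vg8-ic">key</code>/<code class="vg8-ic">value</code> columns are invented for the example):</p>

```python
import pandas as pd

# Toy frame: the "b" row appears twice as an exact duplicate
df = pd.DataFrame({"key": ["a", "b", "b", "c"], "value": [1, 2, 2, 3]})

print(len(df.drop_duplicates(keep="first")))  # 3 (the second "b" row is dropped)
print(len(df.drop_duplicates(keep="last")))   # 3 (the first "b" row is dropped)
print(len(df.drop_duplicates(keep=False)))    # 2 (both "b" rows are dropped)
```

    <p>Note that <code class="vg8-ic">keep='first'</code> and <code class="vg8-ic">keep='last'</code> leave the same row count; they differ only in <em>which</em> occurrence survives, which matters when later rows carry updated values.</p>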
  </div>

  <hr class="vg8-divider">

  <!-- PIPELINE OVERVIEW -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Pipeline</p>
    <h2>The four-step <em>deduplication pipeline</em></h2>
    <div class="vg8-pipeline">
      <div class="vg8-pipeline-step vg8-reveal">
        <div class="vg8-pipeline-num">1</div>
        <div class="vg8-pipeline-body"><h4>Load</h4><p>Read the raw CSV into a DataFrame with <code class="vg8-ic">pd.read_csv()</code>.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d1">
        <div class="vg8-pipeline-num">2</div>
        <div class="vg8-pipeline-body"><h4>Detect</h4><p>Use <code class="vg8-ic">df.duplicated()</code> to identify and inspect all duplicate rows before touching the data.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d2">
        <div class="vg8-pipeline-num">3</div>
        <div class="vg8-pipeline-body"><h4>Remove</h4><p>Call <code class="vg8-ic">drop_duplicates(keep=...)</code> with your chosen strategy. Reset the index for a clean sequential result.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d3">
        <div class="vg8-pipeline-num">4</div>
        <div class="vg8-pipeline-body"><h4>Export</h4><p>Write the cleaned DataFrame back to CSV with <code class="vg8-ic">to_csv()</code> for downstream use.</p></div>
      </div>
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 1 — LOAD -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 1</p>
    <h2>Load <em>your dataset</em></h2>
    <p>Start by reading your data into a pandas DataFrame. <code class="vg8-ic">pd.read_csv()</code> is the standard entry point for flat files. From here, all deduplication operations work on the in-memory DataFrame — your source file is never modified.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">load_data.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-k">import</span> pandas <span class="t8-k">as</span> pd

<span class="t8-c"># Read the raw dataset into a DataFrame</span>
df = pd.<span class="t8-f">read_csv</span>(<span class="t8-s">'your_data_file.csv'</span>)

<span class="t8-c"># Quick shape check before cleaning</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Rows: {df.shape[0]:,}  |  Columns: {df.shape[1]}"</span>)</code></pre></div>
    </div>
    <div class="vg8-callout teal">
      <strong>Other sources:</strong> The same deduplication logic applies regardless of how you load your data. Use <code class="vg8-ic">pd.read_excel()</code> for XLSX, <code class="vg8-ic">pd.read_parquet()</code> for Parquet, or query a database with <code class="vg8-ic">pd.read_sql()</code> — all return a DataFrame you can clean the same way.
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 2 — DETECT -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 2</p>
    <h2>Detect <em>&#038; inspect duplicates</em></h2>
    <p><code class="vg8-ic">df.duplicated()</code> returns a boolean Series — <code class="vg8-ic">True</code> for every row that is a duplicate of an earlier row. Always <strong>inspect before you remove</strong> — understanding what the duplicates look like helps you choose the right strategy.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">detect_duplicates.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># Boolean mask: True for every row that is a duplicate</span>
duplicate_mask = df.<span class="t8-f">duplicated</span>()

<span class="t8-c"># How many duplicates exist?</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Duplicate rows found: {duplicate_mask.sum():,}"</span>)

<span class="t8-c"># Inspect the duplicate rows themselves</span>
duplicates = df[df.<span class="t8-f">duplicated</span>()]
<span class="t8-f">print</span>(duplicates)

<span class="t8-c"># See ALL occurrences of duplicated rows (including originals)</span>
all_dupes = df[df.<span class="t8-f">duplicated</span>(keep=<span class="t8-b">False</span>)]
<span class="t8-f">print</span>(all_dupes.<span class="t8-f">sort_values</span>(by=df.columns.<span class="t8-f">tolist</span>()))</code></pre></div>
    </div>
    <div class="vg8-callout">
      <strong>Subset duplicates:</strong> By default <code class="vg8-ic">duplicated()</code> checks all columns. To flag rows that are duplicates only on specific columns (e.g. same customer_id): <code class="vg8-ic">df.duplicated(subset=['customer_id'])</code>. This is useful for finding logical duplicates even when other columns differ.
    </div>
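    <p>A small sketch of the subset idea, on a hypothetical orders frame: the full-row check misses a logical duplicate that a subset check catches.</p>

```python
import pandas as pd

# Hypothetical orders frame: customer "c1" appears twice with different amounts
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c1", "c3"],
    "amount": [10.0, 20.0, 15.0, 30.0],
})

# Full-row check finds nothing: the two "c1" rows differ on "amount"
print(df.duplicated().sum())                        # 0
# Subset check flags the second "c1" row as a logical duplicate
print(df.duplicated(subset=["customer_id"]).sum())  # 1
# value_counts shows which keys repeat, useful before deciding a strategy
print(df["customer_id"].value_counts())
```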
  </div>

  <hr class="vg8-divider">

  <!-- STEP 3 — REMOVE -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 3</p>
    <h2>Remove duplicates — <em>three strategies</em></h2>
    <p><code class="vg8-ic">drop_duplicates()</code> returns a new DataFrame by default — the original is untouched. The <code class="vg8-ic">keep</code> parameter controls which occurrence survives. After removing, <code class="vg8-ic">reset_index(drop=True)</code> gives you a clean sequential index starting from 0.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">remove_duplicates.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># ── Strategy 1: keep the FIRST occurrence (default) ──</span>
df_keep_first = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'first'</span>)

<span class="t8-c"># ── Strategy 2: keep the LAST occurrence ──</span>
<span class="t8-c">#    useful when later rows represent updated/corrected records</span>
df_keep_last = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'last'</span>)

<span class="t8-c"># ── Strategy 3: drop ALL occurrences of any duplicated row ──</span>
<span class="t8-c">#    use when any repeated row is invalid data</span>
df_drop_all = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-b">False</span>)

<span class="t8-c"># ── Subset: deduplicate only on specific columns ──</span>
df_subset = df.<span class="t8-f">drop_duplicates</span>(subset=[<span class="t8-s">'customer_id'</span>, <span class="t8-s">'order_date'</span>], keep=<span class="t8-s">'first'</span>)

<span class="t8-c"># ── Reset the index after removal (clean 0-based index) ──</span>
df_cleaned = df_keep_first.<span class="t8-f">reset_index</span>(drop=<span class="t8-b">True</span>)

<span class="t8-c"># Confirm rows removed</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Before: {len(df):,}  |  After: {len(df_cleaned):,}  |  Removed: {len(df) - len(df_cleaned):,}"</span>)</code></pre></div>
    </div>
    <div class="vg8-callout teal">
      <strong>inplace vs assignment:</strong> <code class="vg8-ic">drop_duplicates(inplace=True)</code> modifies the DataFrame in place and returns <code class="vg8-ic">None</code>. Prefer the assignment pattern (<code class="vg8-ic">df_cleaned = df.drop_duplicates()</code>) — it preserves the original for comparison and makes your code easier to debug.
    </div>
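    <p>A minimal sketch of the pitfall described in the callout, on toy data invented for the example:</p>

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2]})

# Bug pattern: inplace=True mutates df AND returns None,
# so assigning the result leaves you holding None
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2 (df itself was modified)

# Preferred pattern: reassignment keeps the original for comparison
df2 = pd.DataFrame({"x": [1, 1, 2]})
cleaned = df2.drop_duplicates()
print(len(df2), len(cleaned))  # 3 2
```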
  </div>

  <hr class="vg8-divider">

  <!-- STEP 4 — EXPORT -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 4</p>
    <h2>Export <em>the clean data</em></h2>
    <p>Write the deduplicated DataFrame back to a CSV. Setting <code class="vg8-ic">index=False</code> prevents pandas from writing the row index as an extra column — your downstream consumers will thank you.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">export.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># Export to CSV — index=False keeps the file clean</span>
df_cleaned.<span class="t8-f">to_csv</span>(<span class="t8-s">'cleaned_data.csv'</span>, index=<span class="t8-b">False</span>)

<span class="t8-f">print</span>(<span class="t8-s">"Cleaned data exported to cleaned_data.csv"</span>)

<span class="t8-c"># Optional: also export to Parquet for better performance at scale</span>
df_cleaned.<span class="t8-f">to_parquet</span>(<span class="t8-s">'cleaned_data.parquet'</span>, index=<span class="t8-b">False</span>)</code></pre></div>
    </div>
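    <p>As an optional safeguard, you can read the exported file back and confirm it is duplicate-free before handing it downstream. A minimal sketch, using a hypothetical stand-in frame and the same <code class="vg8-ic">cleaned_data.csv</code> filename:</p>

```python
import pandas as pd

# Stand-in for the cleaned frame from the steps above (hypothetical values)
df_cleaned = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df_cleaned.to_csv("cleaned_data.csv", index=False)

# Round-trip check: re-read the export and verify it is duplicate-free and complete
check = pd.read_csv("cleaned_data.csv")
assert check.duplicated().sum() == 0
assert len(check) == len(df_cleaned)
print("Round-trip check passed")
```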
  </div>

</div><!-- /vg8-body -->

<!-- FULL SCRIPT -->
<div class="vg8-full-section">
  <p class="vg8-full-eyebrow">Complete Reference</p>
  <h2>Full deduplication <em>script</em></h2>
  <p>Everything in one place — load, detect, remove (<code class="vg8-ic">keep='last'</code> here, with <code class="vg8-ic">keep='first'</code> as a swap-in), reset index, and export.</p>
  <div class="vg8-code-wrap vg8-reveal">
    <div class="vg8-code-header"><span class="vg8-code-filename">deduplicate.py — full script</span><span class="vg8-code-lang">Python</span></div>
    <div class="vg8-code-body"><pre><code><span class="t8-k">import</span> pandas <span class="t8-k">as</span> pd

<span class="t8-c"># ── 1. Load ─────────────────────────────────────────────</span>
df = pd.<span class="t8-f">read_csv</span>(<span class="t8-s">'your_data_file.csv'</span>)
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Loaded {len(df):,} rows"</span>)

<span class="t8-c"># ── 2. Detect ────────────────────────────────────────────</span>
duplicates = df[df.<span class="t8-f">duplicated</span>()]
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Duplicate rows found: {len(duplicates):,}"</span>)
<span class="t8-f">print</span>(duplicates)

<span class="t8-c"># ── 3a. Keep last occurrence of each duplicate row ───────</span>
df_cleaned = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'last'</span>)

<span class="t8-c"># ── 3b. Keep first occurrence (swap in if preferred) ─────</span>
<span class="t8-c"># df_cleaned = df.drop_duplicates(keep='first')</span>

<span class="t8-c"># ── 3c. Reset the index to a clean 0-based sequence ──────</span>
df_cleaned = df_cleaned.<span class="t8-f">reset_index</span>(drop=<span class="t8-b">True</span>)

<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Rows after cleaning: {len(df_cleaned):,}"</span>)

<span class="t8-c"># ── 4. Export ─────────────────────────────────────────────</span>
df_cleaned.<span class="t8-f">to_csv</span>(<span class="t8-s">'cleaned_data.csv'</span>, index=<span class="t8-b">False</span>)
<span class="t8-f">print</span>(<span class="t8-s">"Exported to cleaned_data.csv"</span>)</code></pre></div>
  </div>
</div>

<!-- INTERVIEW CHEAT SHEET -->
<div class="vg8-interview-section">
  <p class="vg8-interview-eyebrow">Interview Prep</p>
  <h2>Cheat sheet — <em>quick definitions to remember</em></h2>
  <div class="vg8-qa-list">

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Define</span><br>What does <code>df.duplicated()</code> return?</div>
      <div class="vg8-qa-a">A <strong>boolean Series</strong> the same length as the DataFrame — <code>True</code> for every row that is a duplicate of a previously seen row, <code>False</code> otherwise. The first occurrence is marked <code>False</code> by default.
        <div class="vg8-pills"><span class="vg8-pill t">Boolean Series</span><span class="vg8-pill">True = duplicate</span><span class="vg8-pill a">First = False by default</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Compare</span><br>keep=&#8217;first&#8217; vs keep=&#8217;last&#8217; vs keep=False</div>
      <div class="vg8-qa-a"><strong>first</strong> — keeps the first occurrence, drops all subsequent duplicates. <strong>last</strong> — keeps the final occurrence, useful for updated records. <strong>False</strong> — drops every occurrence of any duplicated row, leaving only rows that were unique to begin with.
        <div class="vg8-pills"><span class="vg8-pill t">first = keep original</span><span class="vg8-pill a">last = keep latest</span><span class="vg8-pill">False = drop all copies</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Explain</span><br>What does the <code>subset</code> parameter do?</div>
      <div class="vg8-qa-a">By default, <code>duplicated()</code> and <code>drop_duplicates()</code> compare <strong>all columns</strong>. The <code>subset</code> parameter restricts the comparison to specific columns — for example <code>subset=['customer_id']</code> finds rows with the same customer ID even if other columns differ.
        <div class="vg8-pills"><span class="vg8-pill t">Default = all columns</span><span class="vg8-pill">subset = logical dedup</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Gotcha</span><br>Why call <code>reset_index(drop=True)</code> after deduplication?</div>
      <div class="vg8-qa-a">After dropping rows, the DataFrame retains the <strong>original row indices</strong> — you&#8217;d have gaps like 0, 1, 4, 7 instead of 0, 1, 2, 3. <code>reset_index(drop=True)</code> renumbers from 0 continuously. <code>drop=True</code> prevents the old index from being added as a column.
        <div class="vg8-pills"><span class="vg8-pill a">Index gaps after drop</span><span class="vg8-pill t">reset_index fixes gaps</span><span class="vg8-pill">drop=True prevents extra col</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Gotcha</span><br>inplace=True vs reassignment — which is preferred?</div>
      <div class="vg8-qa-a">Prefer <strong>reassignment</strong> (<code>df_cleaned = df.drop_duplicates()</code>) — it preserves the original DataFrame for comparison and makes pipelines easier to debug. <code>inplace=True</code> modifies the object and returns <code>None</code>, which can cause confusion when chaining operations. Many pandas best-practice guides now recommend avoiding inplace.
        <div class="vg8-pills"><span class="vg8-pill t">Reassignment = safer</span><span class="vg8-pill a">inplace returns None</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Best Practice</span><br>How do you handle duplicates in a production data pipeline?</div>
      <div class="vg8-qa-a"><strong>Three layers:</strong> (1) <strong>Detect and log</strong> before removing — store duplicate counts as data quality metrics. (2) <strong>Deduplicate at ingestion</strong>, not at query time — clean once, use many times. (3) Add a <strong>unique constraint</strong> in your database or Delta Lake table to prevent duplicates from re-entering at source.
        <div class="vg8-pills"><span class="vg8-pill t">Log before removing</span><span class="vg8-pill t">Clean at ingestion</span><span class="vg8-pill a">DB unique constraints</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Use Case</span><br>When should you NOT remove duplicates?</div>
      <div class="vg8-qa-a">When the repeated rows represent <strong>legitimate repeated events</strong> — a customer placing the same order twice on different days, a sensor reading the same value consecutively, or audit log entries. Always validate with domain knowledge before dropping. Use <code>subset</code> to deduplicate on business keys, not entire rows.
        <div class="vg8-pills"><span class="vg8-pill a">Repeated events = valid</span><span class="vg8-pill t">Use subset= for business keys</span></div>
      </div>
    </div>

  </div>
</div>

<!-- FOOTER -->
<div class="vg8-footer">
  <p><strong>GenAI Mastery Series</strong> — vijay-gokarn.com · Vijay Gokarn</p>
  <div class="vg8-footer-links">
    <a href="https://github.com/vijaygokarn130" class="vg8-btn ghost">GitHub ↗</a>
    <a href="https://vijay-gokarn.com" class="vg8-btn primary">Back to Blog ↗</a>
  </div>
</div>

</div><!-- /vg8 -->

<script>
(function(){
  var obs = new IntersectionObserver(function(e){
    e.forEach(function(x){ if(x.isIntersecting) x.target.classList.add('vg8-vis'); });
  }, {threshold: 0.08});
  document.querySelectorAll('.vg8-reveal').forEach(function(el){ obs.observe(el); });
})();
</script>
<p>The post <a href="https://vijay-gokarn.com/pandas-remove-duplicates/">Pandas Remove Duplicates</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">119</post-id>	</item>
	</channel>
</rss>
