Skip to content

Clean Data Before Joining

Goal: Stop fighting silent join failures. Use this checklist before every join.


The 4 reasons a join fails

flowchart LR
    F[Join failed?] --> T1[Field types<br/>don't match]
    F --> T2[Leading zeros<br/>missing]
    F --> T3[Whitespace /<br/>casing]
    F --> T4[Cardinality<br/>not 1:1]

    classDef root fill:#fef2f2,stroke:#ef4444,color:#991b1b
    class F root
    classDef cause fill:#fef3c7,stroke:#f59e0b,color:#92400e
    class T1,T2,T3,T4 cause
Cause Symptom
Field types don't match "0 of N records joined"
Leading zeros lost Some records joined, others didn't (FIPS, ZIPs)
Whitespace or casing "GA " ≠ "GA"; "atlanta" ≠ "Atlanta"
Cardinality > 1:1 Many table rows match one feature → join keeps just one

Pre-join checklist

1. Check field types

Right-click each layer/table → Fields View. Confirm the join columns are the same type:

  • Both Text
  • Both Long Integer
  • One Text, one Long → ❌ Convert one.

To convert:

  1. Add a new field of the right type.
  2. Calculate Field: str(!FIPS!) or int(!FIPS!).
  3. Use the new field as the join key.

2. Check leading zeros

Open the table. Look at FIPS / ZIP / phone-number fields:

  • 06037 ✅ (text, 5 chars)
  • 6037 ❌ (number, leading zero stripped)

Fix by reimporting the source as Text. For CSVs, before importing:

  • Open in a text editor first to confirm zeros are present.
  • In ArcGIS Pro, Add Data → CSV → click the field → set type = Text.

If the data is already imported wrong, you can reconstruct:

# Calculate Field, Python
"{:05d}".format(!FIPS_NUM!)

3. Trim and standardize text

Whitespace and casing kill joins silently.

# Strip whitespace
!STATE!.strip()

# Lowercase
!STATE!.lower()

# Title case
!CITY!.title()

# Remove all spaces
!ZIP!.replace(" ", "")

Always trim before joining text fields.

4. Check cardinality

Run Statistics on the join field of each side:

  • If the layer side has duplicates → fine for many-to-one.
  • If the table side has duplicates → ⚠️ ArcGIS Pro will silently keep only the first match.

To check duplicates: right-click table → Summary Statistics → Count by your join field. If any count > 1, you have duplicates.

Fix: summarize the table to one row per key first.

# Or use Summary Statistics tool to aggregate
arcpy.analysis.Statistics(
    in_table="raw_data",
    out_table="summarized",
    statistics_fields="POP SUM",
    case_field="FIPS"
)

5. Validate the join

After adding the join:

  1. Open the attribute table.
  2. Sort by your joined value (e.g., MEDIAN_INC).
  3. Look for <Null> rows — those are unmatched.
  4. Use Statistics on the joined field — does the count match expectations?

Or use the built-in Validate Join button in the Add Join dialog.

6. Permanent join

A join is virtual. Once validated:

  1. Right-click the joined layer → Data → Export Features.
  2. Save as a new feature class.

Now the joined fields are permanent and the export is safe to share.


Common gotchas

Census FIPS pitfalls

  • State FIPS = 2 chars (06)
  • County FIPS = 5 chars (state+county) (06037)
  • Tract GEOID = 11 chars
  • Block group = 12 chars
  • Block = 15 chars

Always confirm length before joining.

Excel auto-format

Excel will silently turn 06037 into 6037. Use a CSV parser that respects strings, or import directly to ArcGIS Pro from the original source.

Encoding

UTF-8 vs Latin-1 → "São Paulo" can become "S\u00e3o Paulo". Save CSVs as UTF-8.


Quick-fix script

Save this as a Python snippet in your project:

import arcpy

fc = "counties"
field_in = "FIPS"
field_clean = "FIPS_clean"

if field_clean not in [f.name for f in arcpy.ListFields(fc)]:
    arcpy.AddField_management(fc, field_clean, "TEXT", field_length=5)

with arcpy.da.UpdateCursor(fc, [field_in, field_clean]) as cur:
    for row in cur:
        raw = str(row[0]).strip()
        row[1] = raw.zfill(5)
        cur.updateRow(row)

print("FIPS_clean is ready for joining.")

Practice

Diagnose a broken join

  1. Get a counties shapefile and an income CSV.
  2. Try the join — count <Null> matches.
  3. Check: types match? Leading zeros intact? Whitespace? Duplicates?
  4. Fix each one and re-run.
  5. Validate with Statistics.