Clean Data Before Joining¶
Goal: Stop fighting silent join failures. Use this checklist before every join.
The 4 reasons a join fails¶
flowchart LR
F[Join failed?] --> T1[Field types<br/>don't match]
F --> T2[Leading zeros<br/>missing]
F --> T3[Whitespace /<br/>casing]
F --> T4[Cardinality<br/>not 1:1]
classDef root fill:#fef2f2,stroke:#ef4444,color:#991b1b
class F root
classDef cause fill:#fef3c7,stroke:#f59e0b,color:#92400e
class T1,T2,T3,T4 cause | Cause | Symptom |
|---|---|
| Field types don't match | "0 of N records joined" |
| Leading zeros lost | Some records joined, others didn't (FIPS, ZIPs) |
| Whitespace or casing | "GA " ≠ "GA"; "atlanta" ≠ "Atlanta" |
| Cardinality > 1:1 | Many table rows match one feature → join keeps just one |
Pre-join checklist¶
1. Check field types¶
Right-click each layer/table → Fields View. Confirm the join columns are the same type:
- Both Text ✅
- Both Long Integer ✅
- One Text, one Long → ❌ Convert one.
To convert:
- Add a new field of the right type.
- Calculate Field:
str(!FIPS!)orint(!FIPS!). - Use the new field as the join key.
2. Check leading zeros¶
Open the table. Look at FIPS / ZIP / phone-number fields:
06037✅ (text, 5 chars)6037❌ (number, leading zero stripped)
Fix by reimporting the source as Text. For CSVs, before importing:
- Open in a text editor first to confirm zeros are present.
- In ArcGIS Pro, Add Data → CSV → click the field → set type = Text.
If the data is already imported wrong, you can reconstruct:
3. Trim and standardize text¶
Whitespace and casing kill joins silently.
# Strip whitespace
!STATE!.strip()
# Lowercase
!STATE!.lower()
# Title case
!CITY!.title()
# Remove all spaces
!ZIP!.replace(" ", "")
Always trim before joining text fields.
4. Check cardinality¶
Run Statistics on the join field of each side:
- If the layer side has duplicates → fine for many-to-one.
- If the table side has duplicates → ⚠️ ArcGIS Pro will silently keep only the first match.
To check duplicates: right-click table → Summary Statistics → Count by your join field. If any count > 1, you have duplicates.
Fix: summarize the table to one row per key first.
# Or use Summary Statistics tool to aggregate
arcpy.analysis.Statistics(
in_table="raw_data",
out_table="summarized",
statistics_fields="POP SUM",
case_field="FIPS"
)
5. Validate the join¶
After adding the join:
- Open the attribute table.
- Sort by your joined value (e.g.,
MEDIAN_INC). - Look for
<Null>rows — those are unmatched. - Use Statistics on the joined field — does the count match expectations?
Or use the built-in Validate Join button in the Add Join dialog.
6. Permanent join¶
A join is virtual. Once validated:
- Right-click the joined layer → Data → Export Features.
- Save as a new feature class.
Now the joined fields are permanent and the export is safe to share.
Common gotchas¶
Census FIPS pitfalls
- State FIPS = 2 chars (
06) - County FIPS = 5 chars (state+county) (
06037) - Tract GEOID = 11 chars
- Block group = 12 chars
- Block = 15 chars
Always confirm length before joining.
Excel auto-format
Excel will silently turn 06037 into 6037. Use a CSV parser that respects strings, or import directly to ArcGIS Pro from the original source.
Encoding
UTF-8 vs Latin-1 → "São Paulo" can become "S\u00e3o Paulo". Save CSVs as UTF-8.
Quick-fix script¶
Save this as a Python snippet in your project:
import arcpy
fc = "counties"
field_in = "FIPS"
field_clean = "FIPS_clean"
if field_clean not in [f.name for f in arcpy.ListFields(fc)]:
arcpy.AddField_management(fc, field_clean, "TEXT", field_length=5)
with arcpy.da.UpdateCursor(fc, [field_in, field_clean]) as cur:
for row in cur:
raw = str(row[0]).strip()
row[1] = raw.zfill(5)
cur.updateRow(row)
print("FIPS_clean is ready for joining.")
Practice¶
Diagnose a broken join
- Get a counties shapefile and an income CSV.
- Try the join — count
<Null>matches. - Check: types match? Leading zeros intact? Whitespace? Duplicates?
- Fix each one and re-run.
- Validate with Statistics.