-
Notifications
You must be signed in to change notification settings - Fork 217
Description
Problem
We're repeatedly patching DataFrame type mismatches at GFQL call sites rather than fixing them at the source. This creates a whack-a-mole pattern where each new operation requires defensive engine coercion.
Recent Pattern: Type Mismatches Fixed by Adding Coercion
1. Issue #777 + PR #781 ✅ MERGED
Problem: concat()
in chain operations
Location: graphistry/compute/chain.py
Fix: Added resolve_engine()
+ df_to_engine()
before concat
Pattern: Detect → Convert → Concat
2. Issue #778 + PR #782 🔄 OPEN
Problem: merge()
in get_degrees operations
Location: graphistry/compute/ComputeMixin.py
(lines 268, 478)
Fix: Added resolve_engine()
+ df_to_engine()
before merge
Pattern: Detect → Convert → Merge
3. PR #783 🔄 OPEN (Our Current PR)
Problem: GFQL let()
and call()
return values
Location: chain_let.py:714-717
, call_executor.py:134-137
Fix: Added ensure_engine_match()
at return boundaries
Pattern: Detect → Convert → Return
Root Cause
Underlying operations assume homogeneous DataFrame types, but GFQL workflows mix pandas and cuDF:
- User requests
engine='pandas'
- UMAP runs in GPU mode → creates cuDF DataFrames
- Next operation calls pandas
concat()
/merge()
→ TypeError
Current Approach: Defensive Coercion at Call Sites
# Pattern repeated across codebase:
from graphistry.Engine import resolve_engine, df_to_engine
# Detect types
engine1 = resolve_engine(EngineAbstract.AUTO, df1)
engine2 = resolve_engine(EngineAbstract.AUTO, df2)
# Convert if mismatch
if engine1 != engine2:
df2 = df_to_engine(df2, engine1)
# Now safe to call pandas operation
result = pd.concat([df1, df2])
Problems with this approach:
- Repetitive: Same pattern copy-pasted to 3+ locations (and counting)
- Error-prone: Easy to miss new call sites
- Defensive: Patching symptoms rather than root cause
- Scattered: Logic spread across GFQL layer instead of compute primitives
Proposed Solution: Engine-Aware Primitives
Move type detection/coercion INTO the underlying operations:
# In graphistry/compute/primitives.py (new file)
def safe_concat(dfs, **kwargs):
"""Engine-aware concat that handles pandas/cuDF transparently."""
if not dfs:
return None
# Detect target engine from first non-None df
target_engine = resolve_engine(EngineAbstract.AUTO, dfs[0])
# Convert all dfs to target engine
coerced_dfs = [df_to_engine(df, target_engine) if df is not None else None
for df in dfs]
# Use appropriate concat
if target_engine == Engine.CUDF:
import cudf
return cudf.concat(coerced_dfs, **kwargs)
else:
return pd.concat(coerced_dfs, **kwargs)
def safe_merge(left, right, **kwargs):
"""Engine-aware merge that handles pandas/cuDF transparently."""
# Detect engines
left_engine = resolve_engine(EngineAbstract.AUTO, left)
right_engine = resolve_engine(EngineAbstract.AUTO, right)
# Convert right to match left if needed
if left_engine != right_engine:
right = df_to_engine(right, left_engine)
# Use appropriate merge
if left_engine == Engine.CUDF:
return left.merge(right, **kwargs) # cuDF merge
else:
return left.merge(right, **kwargs) # pandas merge
Then replace all call sites:
# Old (error-prone):
result = pd.concat([df1, df2])
# New (safe):
from graphistry.compute.primitives import safe_concat
result = safe_concat([df1, df2])
Benefits
- DRY: Single implementation, tested once
- Comprehensive: Fixes all existing and future call sites
- Maintainable: Changes in one place
- Correct-by-default: Impossible to forget type checking
- Clear separation: Compute layer handles types, GFQL handles query logic
Affected Operations
Audit needed for all DataFrame operations in graphistry/compute/
:
- ✅
concat()
- Fixed in 3 places, should be in primitive - ✅
merge()
- Fixed in 2 places, should be in primitive ⚠️ join()
- Likely needs fixing⚠️ append()
- Likely needs fixing⚠️ groupby().agg()
- May need special handling⚠️ Direct indexing operations - May create mixed types⚠️ DataFrame constructors - May need wrappers
Migration Strategy
Phase 1: Create primitives module
- Add
graphistry/compute/primitives.py
- Implement
safe_concat()
,safe_merge()
,safe_join()
- Add comprehensive tests (pandas/cuDF/mixed scenarios)
Phase 2: Replace GFQL layer
- Update
chain.py
,chain_let.py
,call_executor.py
- Remove defensive coercion code
- Use primitives instead
Phase 3: Replace compute methods
- Update
ComputeMixin.py
(get_degrees, get_topological_levels) - Update other compute modules
- Ensure consistent behavior
Phase 4: Remove redundant coercion
- Once primitives handle types, remove explicit coercion at GFQL boundaries
- Keep
ensure_engine_match()
for final result validation only
Success Metrics
- ✅ No more type mismatch errors in any GFQL workflow
- ✅ Reduced code duplication (3+ implementations → 1)
- ✅ Future operations automatically handle mixed types
- ✅ Clearer separation of concerns
Related Issues & PRs
- Issue cuDF/pandas concat error in chain operations after UMAP #777, PR fix(chain): Use engine-aware concat for cuDF/pandas compatibility (#777) #781: Chain concat fix
- Issue cuDF/pandas merge error in get_degrees after UMAP #778, PR fix(compute): Use engine-aware merge in get_degrees for cuDF/pandas compatibility #782: get_degrees merge fix
- PR fix(gfql): Extend engine coercion to let() and call() operations #783: let/call coercion fix
References
- Current pattern:
graphistry/Engine.py
-resolve_engine()
,df_to_engine()
- GFQL layer:
graphistry/compute/chain*.py
- Compute layer:
graphistry/compute/ComputeMixin.py
Priority
High - Each new GFQL operation risks type mismatch errors. Moving to primitives prevents future bugs and simplifies maintenance.
Note: This issue proposes a refactor to prevent future type mismatch bugs by centralizing DataFrame type handling in compute primitives rather than patching at GFQL call sites.