Skip to content

refactor(compute): Move DataFrame type coercion to underlying primitives #784

@lmeyerov

Description

@lmeyerov

Problem

We're repeatedly patching DataFrame type mismatches at GFQL call sites rather than fixing them at the source. This creates a whack-a-mole pattern where each new operation requires defensive engine coercion.

Recent Pattern: Type Mismatches Fixed by Adding Coercion

1. Issue #777 + PR #781 ✅ MERGED

Problem: concat() in chain operations
Location: graphistry/compute/chain.py
Fix: Added resolve_engine() + df_to_engine() before concat
Pattern: Detect → Convert → Concat

2. Issue #778 + PR #782 🔄 OPEN

Problem: merge() in get_degrees operations
Location: graphistry/compute/ComputeMixin.py (lines 268, 478)
Fix: Added resolve_engine() + df_to_engine() before merge
Pattern: Detect → Convert → Merge

3. PR #783 🔄 OPEN (Our Current PR)

Problem: GFQL let() and call() return values
Location: chain_let.py:714-717, call_executor.py:134-137
Fix: Added ensure_engine_match() at return boundaries
Pattern: Detect → Convert → Return

Root Cause

Underlying operations assume homogeneous DataFrame types, but GFQL workflows mix pandas and cuDF:

  • User requests engine='pandas'
  • UMAP runs in GPU mode → creates cuDF DataFrames
  • Next operation calls pandas concat()/merge()TypeError

Current Approach: Defensive Coercion at Call Sites

# Pattern repeated across codebase:
from graphistry.Engine import resolve_engine, df_to_engine

# Detect types
engine1 = resolve_engine(EngineAbstract.AUTO, df1)
engine2 = resolve_engine(EngineAbstract.AUTO, df2)

# Convert if mismatch
if engine1 != engine2:
    df2 = df_to_engine(df2, engine1)

# Now safe to call pandas operation
result = pd.concat([df1, df2])

Problems with this approach:

  1. Repetitive: Same pattern copy-pasted to 3+ locations (and counting)
  2. Error-prone: Easy to miss new call sites
  3. Defensive: Patching symptoms rather than root cause
  4. Scattered: Logic spread across GFQL layer instead of compute primitives

Proposed Solution: Engine-Aware Primitives

Move type detection/coercion INTO the underlying operations:

# In graphistry/compute/primitives.py (new file)

def safe_concat(dfs, **kwargs):
    """Engine-aware concat that handles pandas/cuDF transparently."""
    if not dfs:
        return None
    
    # Detect target engine from first non-None df
    target_engine = resolve_engine(EngineAbstract.AUTO, dfs[0])
    
    # Convert all dfs to target engine
    coerced_dfs = [df_to_engine(df, target_engine) if df is not None else None 
                   for df in dfs]
    
    # Use appropriate concat
    if target_engine == Engine.CUDF:
        import cudf
        return cudf.concat(coerced_dfs, **kwargs)
    else:
        return pd.concat(coerced_dfs, **kwargs)


def safe_merge(left, right, **kwargs):
    """Engine-aware merge that handles pandas/cuDF transparently."""
    # Detect engines
    left_engine = resolve_engine(EngineAbstract.AUTO, left)
    right_engine = resolve_engine(EngineAbstract.AUTO, right)
    
    # Convert right to match left if needed
    if left_engine != right_engine:
        right = df_to_engine(right, left_engine)
    
    # Use appropriate merge
    if left_engine == Engine.CUDF:
        return left.merge(right, **kwargs)  # cuDF merge
    else:
        return left.merge(right, **kwargs)  # pandas merge

Then replace all call sites:

# Old (error-prone):
result = pd.concat([df1, df2])

# New (safe):
from graphistry.compute.primitives import safe_concat
result = safe_concat([df1, df2])

Benefits

  1. DRY: Single implementation, tested once
  2. Comprehensive: Fixes all existing and future call sites
  3. Maintainable: Changes in one place
  4. Correct-by-default: Impossible to forget type checking
  5. Clear separation: Compute layer handles types, GFQL handles query logic

Affected Operations

Audit needed for all DataFrame operations in graphistry/compute/:

  • concat() - Fixed in 3 places, should be in primitive
  • merge() - Fixed in 2 places, should be in primitive
  • ⚠️ join() - Likely needs fixing
  • ⚠️ append() - Likely needs fixing
  • ⚠️ groupby().agg() - May need special handling
  • ⚠️ Direct indexing operations - May create mixed types
  • ⚠️ DataFrame constructors - May need wrappers

Migration Strategy

Phase 1: Create primitives module

  • Add graphistry/compute/primitives.py
  • Implement safe_concat(), safe_merge(), safe_join()
  • Add comprehensive tests (pandas/cuDF/mixed scenarios)

Phase 2: Replace GFQL layer

  • Update chain.py, chain_let.py, call_executor.py
  • Remove defensive coercion code
  • Use primitives instead

Phase 3: Replace compute methods

  • Update ComputeMixin.py (get_degrees, get_topological_levels)
  • Update other compute modules
  • Ensure consistent behavior

Phase 4: Remove redundant coercion

  • Once primitives handle types, remove explicit coercion at GFQL boundaries
  • Keep ensure_engine_match() for final result validation only

Success Metrics

  • ✅ No more type mismatch errors in any GFQL workflow
  • ✅ Reduced code duplication (3+ implementations → 1)
  • ✅ Future operations automatically handle mixed types
  • ✅ Clearer separation of concerns

Related Issues & PRs

References

  • Current pattern: graphistry/Engine.py - resolve_engine(), df_to_engine()
  • GFQL layer: graphistry/compute/chain*.py
  • Compute layer: graphistry/compute/ComputeMixin.py

Priority

High - Each new GFQL operation risks type mismatch errors. Moving to primitives prevents future bugs and simplifies maintenance.


Note: This issue proposes a refactor to prevent future type mismatch bugs by centralizing DataFrame type handling in compute primitives rather than patching at GFQL call sites.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions