Graph-Based Market Basket Analysis

Two systems built for a programming module at Warwick: a public health data dashboard for COVID-19 vaccination analytics, and a graph-based market basket analysis engine for supermarket transaction data. Both use Python and Streamlit, both follow TDD throughout. The interesting parts are the data structure decisions and the database design.

Project 1: Public Health Data Insights Dashboard

The dashboard ingests COVID-19 vaccination data from Our World in Data (Mathieu et al. 2020), via local CSV or their REST API, and provides filtering, statistical aggregation, CRUD operations, and export through both a CLI and a Streamlit web interface.

Database Schema

The storage layer uses SQLite with a five-table schema normalized to Third Normal Form. Tables: COUNTRIES, CONTINENTS, VACCINATIONS, DATA_SOURCES, and IMPORT_LOGS. Countries hold static metadata (ISO code, population, continent foreign key); vaccinations store time-series data with a composite UNIQUE constraint on (country_id, date).

That composite constraint does double duty. It prevents duplicate entries for the same country on the same date, and it enables INSERT OR REPLACE for data refresh operations. When reloading from the API, this avoids separate existence checks per record; the database handles upsert semantics natively.
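A minimal sketch of the upsert behaviour, with table and column names simplified from the real schema:

```python
import sqlite3

# Sketch: a composite UNIQUE constraint lets INSERT OR REPLACE act as an
# upsert, so refreshing from the API needs no per-record existence checks.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vaccinations (
        country_id INTEGER,
        date TEXT,
        total_vaccinations INTEGER,
        UNIQUE (country_id, date)
    )
""")

conn.execute("INSERT OR REPLACE INTO vaccinations VALUES (?, ?, ?)",
             (1, "2021-01-01", 100))
# Reloading the same (country_id, date) replaces the old row rather than
# failing or duplicating it.
conn.execute("INSERT OR REPLACE INTO vaccinations VALUES (?, ?, ?)",
             (1, "2021-01-01", 150))

count, total = conn.execute(
    "SELECT COUNT(*), MAX(total_vaccinations) FROM vaccinations"
).fetchone()
print(count, total)  # 1 150
```

One caveat: SQLite implements INSERT OR REPLACE as a delete followed by an insert, so on a parent table it will also fire ON DELETE CASCADE against any referencing child rows.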

Foreign key constraints use CASCADE DELETE to maintain referential integrity. Deleting a country removes its vaccination records automatically. This was validated with dedicated test cases:

def test_cascade_delete_vaccinations(self):
    """Test that deleting a country cascades to vaccinations."""
    records = [
        {'location': 'TestCountry', 'date': '2021-01-01',
         'total_vaccinations': 100},
        {'location': 'TestCountry', 'date': '2021-01-02',
         'total_vaccinations': 200},
    ]
    self.db.bulk_insert_with_transaction(records)
    before = self.db.get_records_by_location('TestCountry')
    self.assertEqual(len(before), 2)

    result = self.db.delete_country_cascade('TestCountry')
    self.assertTrue(result)

    after = self.db.get_records_by_location('TestCountry')
    self.assertEqual(len(after), 0)

Indexing Strategy

B-tree indices on country_id and date in the vaccinations table, plus location in the countries table. These columns appear most frequently in WHERE clauses and JOIN conditions. The composite UNIQUE constraint on (country_id, date) implicitly creates an additional index that speeds up time-series queries.

CREATE INDEX idx_vaccinations_country ON vaccinations(country_id);
CREATE INDEX idx_vaccinations_date ON vaccinations(date);
CREATE INDEX idx_countries_location ON countries(location);

This reduces filtered query complexity from O(n) full-table scans to O(log n) index lookups. INSERTs get slightly slower, but for a heavily read-biased analytics workload, that’s a fine trade.
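One way to sanity-check that claim (not part of the original test suite) is EXPLAIN QUERY PLAN, which reports whether SQLite searches via an index or scans the table:

```python
import sqlite3

# Sketch: confirm a filtered query uses the index rather than a full scan.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE vaccinations
                (country_id INTEGER, date TEXT, total_vaccinations INTEGER)""")
conn.execute("CREATE INDEX idx_vaccinations_country ON vaccinations(country_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM vaccinations WHERE country_id = ?", (1,)
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # e.g. "SEARCH vaccinations USING INDEX idx_vaccinations_country ..."
```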

Architecture

Eight classes, each with a single responsibility: DataLoader for ingestion, DataCleaner for type conversion and missing values, DatabaseManager for SQLite operations, DataFilter for in-memory queries, StatisticsCalculator for aggregation, DataExporter for CSV/JSON output, APIHandler for external communication, and PublicHealthDashboard as the facade.

The facade pattern (Martin 2003) is the key architectural decision. PublicHealthDashboard composes all other components at initialization:

class PublicHealthDashboard:
    def __init__(self, db_path: str = "health_data.db"):
        self.db = DatabaseManager(db_path)
        self.loader = DataLoader()
        self.cleaner = DataCleaner()
        self.filter = DataFilter()
        self.stats = StatisticsCalculator()
        self.exporter = DataExporter()
        self.api = APIHandler()
        self.logger = ActivityLogger()

This keeps the Streamlit frontend thin: it calls methods on the facade rather than coordinating between components directly. It also makes testing straightforward, since each component can be instantiated and tested in isolation. The full test suite has 70 cases across two modules covering unit, integration, and edge-case scenarios (empty CSVs, invalid dates, NULL propagation, transaction rollback).

Library Choices

Streamlit over Flask or Django for the web interface. Flask provides more routing control but requires significant boilerplate for reactive data visualization. Streamlit’s reactive execution model fits the analytics use case: filter changes re-render charts automatically without manual state management. Plotly over Matplotlib for interactive hover tooltips and native Streamlit integration.

SQLite over PostgreSQL. This is a single-user analytical tool. PostgreSQL adds server configuration and credential management overhead that wasn’t justified here.

Project 2: Market Basket Analysis System

The second project analyzes supermarket purchasing patterns across 14,963 transactions containing 167 unique items. The core output is a recommendation engine: given items in a customer’s basket, suggest what else they’re likely to buy, ranked by co-purchase frequency.

Data Structure Selection

The dataset produces 6,260 unique co-purchase pairs. The central question is how to store and query these relationships.

I evaluated five alternatives before settling on a weighted undirected graph implemented as an adjacency list using nested Python dictionaries:

Adjacency matrix: O(1) edge lookup, but with 167 items it allocates 27,889 entries regardless of actual connectivity. At 45% density (6,260 of 13,861 possible edges) the graph is dense enough that this wouldn't be catastrophic, but the adjacency list still stores only the edges that exist, in O(V + E) space.

Edge list: Space-efficient at O(E), but edge lookup requires O(E) linear search. Querying “what items are bought with milk?” becomes expensive compared to O(k) with an adjacency list, where k is the number of neighbors (Cormen et al. 2009).

BST/AVL trees: O(log V) lookup, but trees model hierarchical relationships. Many-to-many co-purchase associations require nested trees or auxiliary structures without performance gain.

Relational database: SQL can answer co-purchase queries, but computing recommendations requires multiple self-joins with aggregation. In-memory graph traversal accumulates edge weights during a single walk. For real-time queries, the graph wins (Robinson, Webber and Eifrem 2015).

Hash table of sets: Loses frequency information. Without weights, recommendations can’t be ranked.

The adjacency list was chosen because nested dictionaries provide O(1) average-case lookups, insertions, and weight updates via hash tables, while O(V + E) space scales with actual connectivity.
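The structure can be sketched as follows; the function name is illustrative, not the system's actual API:

```python
from itertools import combinations


def build_graph(transactions):
    """Weighted undirected co-purchase graph as nested dictionaries.

    graph[a][b] holds the co-purchase count for the pair (a, b); each
    edge is mirrored so lookup works from either endpoint in O(1) average.
    """
    graph = {}
    for basket in transactions:
        # Generating all pairs per transaction is the O(T * I^2) step
        # discussed under graph construction.
        for a, b in combinations(sorted(set(basket)), 2):
            graph.setdefault(a, {})[b] = graph.get(a, {}).get(b, 0) + 1
            graph.setdefault(b, {})[a] = graph.get(b, {}).get(a, 0) + 1
    return graph


transactions = [
    ["whole milk", "bread", "butter"],
    ["whole milk", "bread"],
    ["bread", "jam"],
]
g = build_graph(transactions)
print(g["whole milk"]["bread"])  # 2   (O(1) average edge lookup)
print(sorted(g["bread"]))        # ['butter', 'jam', 'whole milk']
```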

Data Structure             Space    Edge Lookup  Get Neighbours  Weights
Adjacency List (chosen)    O(V+E)   O(1) avg     O(1)            Yes
Adjacency Matrix           O(V²)    O(1)         O(V)            Yes
Edge List                  O(E)     O(E)         O(E)            Yes
BST/AVL Tree               O(V)     O(log V)     Complex         No
Hash Table of Sets         O(V+E)   O(1) avg     O(1)            No

Algorithm Choices

Sorting: Merge sort for ranking frequent items and co-purchase pairs. It guarantees O(n log n) in all cases with stable ordering, unlike naive quicksort, which degrades to O(n²) on already-sorted inputs when the pivot is chosen poorly (Cormen et al. 2009). For a recommendation engine where sorted results are always needed, worst-case guarantees matter more than average-case constants.
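A minimal stable merge sort over (item, count) pairs might look like this (a sketch, not the project's exact implementation):

```python
def merge_sort(items, key=lambda x: x):
    # Stable O(n log n) merge sort; equal keys keep their input order.
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(left[i]) <= key(right[j]):  # <= preserves stability
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


pairs = [("milk", 5), ("bread", 9), ("jam", 2), ("butter", 9)]
# Rank by descending frequency: negate the count as the sort key.
top = merge_sort(pairs, key=lambda p: -p[1])
print(top)  # [('bread', 9), ('butter', 9), ('milk', 5), ('jam', 2)]
```

Note that ties ("bread" and "butter", both 9) keep their input order, which is what stability buys for reproducible rankings.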

Graph traversal: Both BFS and DFS are implemented (both O(V + E)), but BFS is preferred for find_related_items(). BFS explores level-by-level, which naturally respects depth constraints when finding items within k hops. DFS explores deeply before backtracking, better suited for connected component detection than proximity-based recommendations.
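The depth-limited BFS can be sketched as follows, over the nested-dict adjacency list described above (names illustrative):

```python
from collections import deque


def find_related_items(graph, start, max_depth=2):
    """Collect items within max_depth hops of start, with their depth."""
    visited = {start}
    related = []
    queue = deque([(start, 0)])
    while queue:
        item, depth = queue.popleft()
        if depth == max_depth:
            continue  # level-by-level order makes the depth cap trivial
        for neighbour in graph.get(item, {}):
            if neighbour not in visited:
                visited.add(neighbour)
                related.append((neighbour, depth + 1))
                queue.append((neighbour, depth + 1))
    return related


graph = {
    "milk": {"bread": 3, "butter": 2},
    "bread": {"milk": 3, "jam": 1},
    "butter": {"milk": 2},
    "jam": {"bread": 1},
}
print(find_related_items(graph, "milk", max_depth=1))
# [('bread', 1), ('butter', 1)]
print(find_related_items(graph, "milk", max_depth=2))
# [('bread', 1), ('butter', 1), ('jam', 2)]
```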

Recommendation algorithm: For a given basket, the system iterates each basket item’s neighbors, accumulates co-purchase weights, excludes items already in the basket, then sorts by cumulative score. Complexity is O(B × K + R log R) where B is basket size, K is average neighbors per item, and R is the number of candidate recommendations.
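The accumulate-exclude-sort loop can be sketched like this (function name and graph literal are illustrative):

```python
def recommend(graph, basket, top_n=3):
    """Rank candidate items by cumulative co-purchase weight.

    O(B * K) accumulation over basket neighbours, then O(R log R) sort.
    """
    scores = {}
    in_basket = set(basket)
    for item in basket:
        for neighbour, weight in graph.get(item, {}).items():
            if neighbour not in in_basket:  # exclude items already held
                scores[neighbour] = scores.get(neighbour, 0) + weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


graph = {
    "bread": {"whole milk": 5, "butter": 3, "jam": 2},
    "butter": {"bread": 3, "whole milk": 4},
}
print(recommend(graph, ["bread", "butter"]))
# [('whole milk', 9), ('jam', 2)]
```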

Benchmark Results

Run on the full dataset (14,963 transactions, 167 items, 6,260 edges), averaged over 100 iterations per operation on an Intel Core i7-8565U:

Operation                   Time
Graph Construction          106.22 ms
Edge Lookup                 0.0014 ms
Get Neighbours              0.0027 ms
Recommendations             0.6835 ms
BFS Traversal               1.6000 ms
Top 10 Pairs (merge sort)   41.86 ms
Microsecond-scale edge lookups (1.4 µs) are consistent with O(1) hash table behavior. Recommendations in under 1 ms mean the system can serve interactive e-commerce queries without perceptible latency. The Top 10 Pairs operation is the slowest at 41.86 ms, bottlenecked by O(E log E) sorting across all 6,260 edges: acceptable for a batch query but worth addressing if called frequently.

Graph construction at 106 ms for ~15,000 transactions is a one-time startup cost. The O(T × I²) complexity here comes from generating all item pairs within each transaction. For significantly larger datasets (millions of transactions), this would be the first bottleneck to address, likely through batch processing or parallel edge weight aggregation.

Findings

Whole milk is consistently the top recommendation regardless of basket contents, because of its high co-occurrence with most product categories. This is consistent with what you’d expect from supermarket data: staple items dominate co-purchase graphs because they appear in a large proportion of transactions. A production system would need to account for this baseline popularity to surface useful recommendations (by normalizing edge weights against item frequency, similar to TF-IDF in information retrieval).
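One possible normalisation, not implemented in the project, is to damp each edge weight by the candidate item's overall transaction frequency, in the spirit of IDF weighting or lift (all numbers below are invented for illustration):

```python
import math


def normalised_scores(graph, item_freq, basket, total_transactions):
    """Rank candidates by co-purchase weight damped by item popularity."""
    scores = {}
    for item in basket:
        for neighbour, weight in graph.get(item, {}).items():
            if neighbour in basket:
                continue
            # Penalise items that appear in a large share of transactions.
            idf = math.log(total_transactions / item_freq[neighbour])
            scores[neighbour] = scores.get(neighbour, 0) + weight * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical counts: whole milk co-occurs more often in raw terms,
# but is also in ~17% of all transactions, so it gets damped.
graph = {"bread": {"whole milk": 50, "saffron": 20}}
item_freq = {"whole milk": 2500, "saffron": 10}
ranked = normalised_scores(graph, item_freq, ["bread"], total_transactions=14963)
print([name for name, _ in ranked])  # ['saffron', 'whole milk']
```

Raw weights would rank whole milk first (50 vs 20); the frequency penalty flips the order toward the rarer, more informative item.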

Testing

53 test cases across 10 test classes: data loading, graph creation, graph-from-transactions, BFS/DFS traversal, frequent pair mining, recommendations, merge sort correctness, edge cases (special characters, spaces, large transactions), item frequency, and graph statistics. Tests were written alongside implementation following TDD; the test structure mirrors the class architecture.

Cross-Cutting Observations

Both projects use the same stack (Python, Streamlit, Plotly, unittest). Consistency reduces context-switching and allows shared patterns across codebases. The Streamlit execution model works for analytical dashboards where the interaction is “adjust filters, see updated results.” It’s a poor fit for applications requiring fine-grained state management or custom routing.

SQLite’s zero-configuration deployment was appropriate at this scale. The vaccination dashboard handles country-level aggregated data (hundreds of countries, daily records over a few years), and the market basket system operates in-memory after initial CSV parsing. Neither pushes SQLite’s concurrency limits.

SOLID principles (Martin 2003) were applied throughout, with one caveat: the Liskov Substitution Principle is largely irrelevant in codebases without deep inheritance hierarchies. Both projects use composition over inheritance, which sidesteps LSP entirely.

Limitations

The vaccination dashboard loads entire datasets into memory. Asynchronous API calls would improve UI responsiveness during data fetching, but Streamlit’s execution model complicates asyncio integration.

The market basket system has no persistence layer. Restarting the application rebuilds the graph from CSV. There’s also no temporal analysis: all transactions are treated as a single static snapshot, so seasonal trends or trending products are invisible.

Both projects are single-machine, single-user applications. Scaling to concurrent users would require a proper web framework (Flask/FastAPI), connection pooling, and potentially a message queue for async ingestion.

References

  • Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C. (2009) Introduction to Algorithms. 3rd edn. Cambridge, MA: MIT Press.
  • Martin, R.C. (2003) Agile Software Development: Principles, Patterns, and Practices. Upper Saddle River, NJ: Pearson Education.
  • Mathieu, E. et al. (2020) ‘Coronavirus Pandemic (COVID-19)’, Our World in Data. Available at: https://ourworldindata.org/coronavirus.
  • Robinson, I., Webber, J. and Eifrem, E. (2015) Graph Databases. 2nd edn. Sebastopol, CA: O’Reilly Media.