rmoff's random ramblings

ETL

Mar 20, 2025

Building a data pipeline with DuckDB

In this blog post I’m going to explore how, as a data engineer in the field today, I might go about putting together a rudimentary data pipeline. I’ll take some operational data and wrangle it into a form that makes it easy to work with for analytics.
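To give a flavour of the kind of wrangling involved, here’s a minimal sketch in DuckDB SQL. The source file name, column names, and target table names are all hypothetical:

```sql
-- Load raw operational data from a CSV file into DuckDB
-- ('orders.csv' and its columns are assumptions for illustration)
CREATE TABLE orders_raw AS
    SELECT * FROM read_csv_auto('orders.csv');

-- Reshape it into a daily summary that's pliable for analytics work
CREATE TABLE daily_orders AS
    SELECT order_date,
           COUNT(*)         AS order_count,
           SUM(order_total) AS revenue
    FROM   orders_raw
    GROUP  BY order_date
    ORDER  BY order_date;
```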

After a somewhat fevered and nightmarish period during which people walked around declaring that "Schema on Read" was the future, that "data is the new oil", and "look at the size of my big data", the path of IT history is curving back on itself towards a more sensible approach to things.

As they say:

What’s old is new

This is good news for me, because I am old and what I knew then is 'new' now ;)

Mar 6, 2018

Why Do We Need Streaming ETL?

(This is an expanded version of the intro to an article I posted over on the Confluent blog. Here I get to be as verbose as I like ;)) My first job out of university was building a data warehouse for a retailer in the UK. Back then, it meant writing COBOL jobs to load tables in DB2. We waited for all the shops to close, run their end-of-day system processing, and send their data back to the central mainframe.
Jun 28, 2011

Oracle 11g - How to force a sql_id to use a plan_hash_value using SQL Baselines

Here’s a scenario that’ll be depressingly familiar to most reading this: after ages of running fine, and with no changes to the code, a query suddenly starts running orders of magnitude longer than it used to. In this instance it was an ETL step which used to take c.1 hour, and was now at 5 hours and counting. Since it still hadn’t finished, and the gods had conspired to bring down Grid too (unrelated), I generated a SQL Monitor report to see what was happening: [sourcecode language="sql"] select DBMS_SQLTUNE.
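For reference, the general shape of pinning a known-good plan with a SQL Plan Baseline in 11g looks something like the following. The sql_id and plan_hash_value here are placeholders, not values from the incident described:

```sql
-- Load a specific plan from the cursor cache into a SQL Plan Baseline,
-- so the optimiser will reproduce that plan for the statement.
-- 'abcd1234efgh5678' and 1234567890 are placeholder values.
DECLARE
  l_plans PLS_INTEGER;
BEGIN
  l_plans := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
               sql_id          => 'abcd1234efgh5678',
               plan_hash_value => 1234567890);
  DBMS_OUTPUT.PUT_LINE('Plans loaded: ' || l_plans);
END;
/
```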
Dec 30, 2010

Data Warehousing and Statistics in Oracle 11g - incremental global statistics

This is a series of posts where I hope to humbly plug some gaps in the information available (or which has escaped my google-fu) regarding statistics management in Oracle 11g specific to Data Warehousing. Incremental Global Statistics is new functionality in Oracle 11g (and 10.2.0.4?) and is explained in depth in several places, including:

Oracle® Database Performance Tuning Guide - Statistics on Partitioned Objects
Greg Rahn - Oracle 11g: Incremental Global Statistics On Partitioned Tables
Inside the Oracle Optimiser - Maintaining statistics on large partitioned tables
Amit Poddar - One Pass Distinct Sampling (ppt - slides 52 onwards are most relevant)

In essence, Oracle maintains information about each partition when statistics are gathered on the partition, and it uses this to work out the global statistics - without having to scan the whole table.
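The mechanics of switching this on boil down to a table preference plus a normal stats gather. A minimal sketch, where the table name SALES_FACT is hypothetical:

```sql
-- Enable incremental statistics on a partitioned table, so Oracle keeps
-- per-partition synopses and derives global stats from them instead of
-- rescanning the whole table. 'SALES_FACT' is a placeholder table name.
EXEC DBMS_STATS.SET_TABLE_PREFS(ownname => USER, tabname => 'SALES_FACT', pname => 'INCREMENTAL', pvalue => 'TRUE');

-- Gather stats as usual; AUTO granularity lets Oracle maintain
-- partition and global statistics together.
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'SALES_FACT', granularity => 'AUTO');
```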
Nov 3, 2010

Analysing ODI batch performance

I’ve been involved with some performance work around an ODI DWH load batch. The batch comprises well over 1000 tasks in ODI, and whilst the Operator console is not a bad interface, it’s not very easy to spot the areas consuming the most runtime. Here’s a set of SQL statements to run against the ODI work repository tables to help you methodically find the steps of most interest for tuning efforts.
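As a hypothetical illustration of the approach (work repository table and column names vary between ODI versions, and SNP_SESSION / SNP_SESS_TASK_LOG and their columns here are assumptions), ranking tasks by total runtime might look like:

```sql
-- Rank ODI sessions by total task runtime to find tuning candidates.
-- Table and column names are assumptions; check your ODI version's
-- work repository schema before running.
SELECT   s.sess_name,
         SUM(t.task_dur) AS total_duration_secs,
         COUNT(*)        AS executions
FROM     snp_sess_task_log t
         JOIN snp_session s ON s.sess_no = t.sess_no
GROUP BY s.sess_name
ORDER BY total_duration_secs DESC;
```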
Jul 23, 2009

Mark Rittman’s OBIEE repository for DAC

Mark Rittman has an excellent article about querying the DAC repository database tables, including a downloadable RPD file. Being new to working with RPDs, I thought it would be good practice to explore this as well as hopefully get some useful information about our current ETL deployment. I downloaded the RPD to c:\OracleBI\server\Repository and opened it up in the Admin tool (Administrator/Administrator). First off I changed the connection pool to point to my DAC repository database, having set up a TNS entry for it first.

Robin Moffatt

Robin Moffatt works on the DevRel team at Confluent. He likes writing about himself in the third person, eating good breakfasts, and drinking good beer.


© 2025