Streams

Most of the time it’s fine to retrieve data from the database using SELECT queries, and store data using INSERT. But for those cases where efficiency matters, there are two data streaming mechanisms to help you do this more efficiently: “streaming queries,” for reading query results from the database; and the @ref pqxx::stream_to class, for writing data from the client into a table.

These are less flexible than SQL queries. Also, depending on your needs, it may be a problem to lose your connection in mid-stream: you’ll have processed part of the data without knowing whether the query would ever have completed. But in return, you get better scalability and memory efficiency.

Just like regular querying, these streaming mechanisms do data conversion for you. You deal with the C++ data types, and the database deals with the SQL data types.

Interlude: null values

So how do you deal with nulls? It depends on the C++ type you’re using. Some types may have a built-in null value. For instance, if you have a char const * value and you convert it to an SQL string, then converting a nullptr will produce a NULL SQL value.

But what do you do about C++ types which don’t have a built-in null value, such as int? The trick is to wrap it in std::optional. The difference between int and std::optional<int> is that the former always has an int value, and the latter doesn’t have to.

Actually it’s not just std::optional. You can do the same thing with std::unique_ptr or std::shared_ptr. A smart pointer is less efficient than std::optional in most situations because it allocates its value on the heap, but sometimes that’s what you want in order to avoid moving or copying large values around.

This part is not generic though. It won’t work with just any smart-pointer type, just the ones which are explicitly supported: shared_ptr and unique_ptr. If you really need to, you can build support for additional wrappers and smart pointers by copying the implementation patterns from the existing smart-pointer support.
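For example, here’s a minimal sketch of reading a nullable column with a regular query. It assumes an open transaction tx and a score table whose points column can be NULL; process() and process_missing() are hypothetical handlers, not part of libpqxx.

    for (auto [name, points] :
        tx.query<std::string, std::optional<int>>(
            "SELECT name, points FROM score")
    )
    {
        if (points)
            process(name, *points);    // The row had an actual value.
        else
            process_missing(name);     // The row had a NULL in "points".
    }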

Streaming data from a query

Use @ref transaction_base::stream to read large amounts of data directly from the database. In terms of API it works just like @ref transaction_base::query, but for larger data sets it’s faster than the exec and query functions. Also, you won’t need to keep your full result set in memory, which can really matter when there’s a lot of data.

Another performance advantage is that with a streaming query, you can start processing your data right after the first row of data comes in from the server. With exec() or query() you need to wait to receive all data, and only then can you begin processing. With streaming queries you can be processing data on the client side while the server is still sending you the rest.

Not all kinds of queries will work in a stream. Internally the streams make use of PostgreSQL’s COPY command, so see the PostgreSQL documentation for COPY for the exact limitations. Basic SELECT and UPDATE ... RETURNING queries will just work, but fancier constructs may not.

As you read a row, the stream converts its fields to a tuple type containing the value types you ask for:

    for (auto [name, score] :
        tx.stream<std::string_view, int>("SELECT name, points FROM score")
    )
        process(name, score);

On each iteration, the stream gives you a std::tuple of the column types you specify. It converts the row’s fields (which internally arrive at the client in text format) to your chosen types.

The auto [name, score] in the example is a structured binding which unpacks the tuple’s fields into separate variables. If you prefer, you can receive the tuple as a whole instead: for (std::tuple<std::string_view, int> row : ...).
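For example, this is the same loop as above, written to receive the whole tuple (same hypothetical score table and process() helper):

    for (std::tuple<std::string_view, int> row :
        tx.stream<std::string_view, int>("SELECT name, points FROM score")
    )
        process(std::get<0>(row), std::get<1>(row));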

Is streaming right for my query?

Here are the things you need to be aware of when deciding whether to stream a query, or just execute it normally.

First, when you stream a query, there is no metadata describing how many rows it returned, what the columns are called, and so on. With a regular query you get a @ref result object which contains this metadata as well as the data itself. If you absolutely need this metadata for a particular query, then that means you can’t stream the query.
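To illustrate, here’s a sketch of a regular query where you do use that metadata. The query and table are just illustrations; the calls on the @ref result object are the point.

    pqxx::result r = tx.exec("SELECT name, points FROM score");

    // Metadata that a streamed query would not give you:
    auto const num_rows = r.size();            // How many rows came back.
    auto const num_cols = r.columns();         // How many columns per row.
    auto const first_col = r.column_name(0);   // Name of the first column.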

Second, under the bonnet, streaming from a query uses a PostgreSQL-specific SQL command COPY (...) TO STDOUT. There are some limitations on what kinds of queries this command can handle. These limitations may change over time, so I won’t describe them here. Instead, see PostgreSQL’s COPY documentation for the details. (Look for the TO variant, with a query as the data source.)

Third: when you stream a query, you start receiving and processing data before you even know whether you will receive all of the data. If you lose your connection to the database halfway through, you will have processed half your data, unaware that the query may never execute to completion. If this is a problem for your application, don’t stream that query!
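If you stream anyway, you can at least make that failure explicit by catching the connection error. A rough sketch, reusing the earlier example query; handle_incomplete_data() is a hypothetical cleanup step in your own code.

    try
    {
        for (auto [name, score] :
            tx.stream<std::string_view, int>("SELECT name, points FROM score")
        )
            process(name, score);
    }
    catch (pqxx::broken_connection const &)
    {
        // The stream ended early.  Rows processed so far came from a query
        // that never ran to completion, so compensate for that here.
        handle_incomplete_data();
        throw;
    }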

The fourth and final factor is performance. If you’re interested in streaming, obviously you care about this one.

I can’t tell you a priori whether streaming will make your query faster. It depends on how many rows you’re retrieving, how much data there is in those rows, the speed of your network connection to the database, your client encoding, how much processing you do per row, and the details of the client-side system: hardware speed, CPU load, and available memory.

Ultimately, no amount of theory beats real-world measurement for your specific situation so… if it really matters, measure. (And as per Knuth’s Law: if it doesn’t really matter, don’t optimise.)

That said, here are a few data points from some toy benchmarks:

If your query returns e.g. a hundred small rows, it’s not likely to make up a significant portion of your application’s run time. Streaming is likely to be slower than regular querying, but most likely the difference just won’t matter.

If your query returns a thousand small rows, streaming is probably still going to be a bit slower than regular querying, though “your mileage may vary.”

If you’re querying ten thousand small rows, however, it becomes more likely that streaming will speed it up. The advantage increases as the number of rows increases.

That’s for small rows, based on a test where each row consisted of just one integer number. If your query returns larger rows, with more columns, I find that streaming seems to become more attractive. In a simple test with 4 columns (two integers and two strings), streaming even just a thousand rows was considerably faster than a regular query.

If your network connection to the database is slow, however, that may make streaming a bit less efficient. Setting up a stream takes a bit more communication back and forth between the client and the database. This overhead takes a more or less constant amount of time, so for larger data sets it will tend to become insignificant compared to the other performance costs.

Streaming data into a table

Use stream_to to write data directly to a database table. This saves you having to perform an INSERT for every row, and so it can be significantly faster if you want to insert more than just one or two rows at a time.

As with streaming queries, there isn’t much flexibility: you specify the table and the columns, and not much else. You insert tuple-like objects of your choice:

    pqxx::stream_to stream{
        tx,
        "score",
        std::vector<std::string>{"name", "points"}};
    for (auto const &entry: scores)
        stream << entry;
    stream.complete();

Each row is processed as you provide it, and not retained in memory after that.

The call to complete() is more important here than it is for a streaming query. It’s a lot like a “commit” or “abort” at the end of a transaction. If you omit it, the stream’s destructor will do it for you, but since destructors can’t throw exceptions, any failures at that stage won’t be visible in your code. So, always call complete() on a stream_to to close it off properly!
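Tying this back to the interlude on null values: to write a NULL through a stream_to, pass an empty std::optional in that position. A sketch, using the same hypothetical score table as the example above:

    pqxx::stream_to stream{
        tx,
        "score",
        std::vector<std::string>{"name", "points"}};

    // An empty optional in the "points" position becomes an SQL NULL.
    stream << std::make_tuple("new player", std::optional<int>{});
    stream.complete();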