scramjetorg / scramjet

Public tracker for Scramjet Cloud Platform, a platform that brings data from many environments together.

Home Page: https://www.scramjet.org

DataStream.batch() streaming data from BigQuery

chapmanjacobd opened this issue

If I remove the .batch() or .timeBatch() line then it works fine.

With it I get this error:

Cannot read property 'value' of undefined

      at node_modules/.pnpm/scramjet-core@4.28.2/node_modules/scramjet-core/lib/util/mk-transform.js:59:44
      at processTicksAndRejections (internal/process/task_queues.js:97:5)
        caused by:
      at DataStream.<anonymous> (src/bq-to-mssql.ts:99:55)
        --- raised in DataStream(15) constructed ---
      at new PromiseTransformStream (node_modules/.pnpm/scramjet-core@4.28.2/node_modules/scramjet-core/lib/util/promise-transform-stream.js:65:27)
      at new DataStream (node_modules/.pnpm/scramjet-core@4.28.2/node_modules/scramjet-core/lib/data-stream.js:43:9)    
      at DataStream.map (node_modules/.pnpm/scramjet-core@4.28.2/node_modules/scramjet-core/lib/data-stream.js:186:26) 

To Reproduce

import { BigQuery } from '@google-cloud/bigquery';
import { DataStream } from 'scramjet';
import { Knex } from 'knex';

const bq = new BigQuery();

async function bqStreamToMSSQL(
  trx: Knex.Transaction<any, any>,
  table: string,
  query: string
) {

// BigQuery.createQueryStream: (options?: Query) => ResourceStream<any>
  return await bq
    .createQueryStream(query)
    .pipe(new DataStream({ maxParallel: 1 }))
    .timeBatch(7000, 10000)
    .map((row) => ({ ...row, at_ingest: row.at_ingest.value }))
    .do(async (row) => {
      await trx
        .delete()
        .from(table)
        .where('pk', '=', row.pk)
        .andWhere('at_ingest', '<=', row.at_ingest);

      await trx.insert(row).into(table);
    })
    .run();
}

maybe it is a limitation of the BigQuery method or maybe I'm doing this wrong

Ok, batch and timeBatch actually turn these rows into arrays of rows, so I'm guessing that's what causes the errors.
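
For example, a standalone sketch (not from the thread) showing the shape change:

import { DataStream } from 'scramjet';

// batch(3) regroups a stream of single chunks into arrays of up to 3 chunks
const grouped = await DataStream.from([1, 2, 3, 4, 5])
  .batch(3)
  .toArray();

console.log(grouped); // [ [ 1, 2, 3 ], [ 4, 5 ] ]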

You'd need to put the timeBatch after the map there, and in do either use Promise.all or, better yet, convert this to delete a number of records at once with IN.
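
A sketch of that reordering, reusing trx, table, and query from the snippet above. The whereIn batching is one reading of the IN suggestion, and it drops the per-row at_ingest comparison from the original for brevity:

return await bq
  .createQueryStream(query)
  .pipe(new DataStream({ maxParallel: 1 }))
  // map still receives single rows here
  .map((row) => ({ ...row, at_ingest: row.at_ingest.value }))
  // now group the already-mapped rows into arrays
  .timeBatch(7000, 10000)
  .do(async (rows: any[]) => {
    // one DELETE ... WHERE pk IN (...) per batch instead of one query per row
    await trx
      .delete()
      .from(table)
      .whereIn('pk', rows.map((r) => r.pk));

    // knex turns an array of rows into a multi-row INSERT
    await trx.insert(rows).into(table);
  })
  .run();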

Reach out to me on Slack directly and I can take a look at this. :)

BTW, you may want to look at the nagle method - this may work more like you expected, but it wouldn't be as efficient as what you can achieve using the above timeBatch with actual batched deletes and inserts.

Yeah, I decided to change my approach. I'm going with batched inserts to a temporary table outside of a transaction, and then a separate transaction that runs some fast queries to delete from the destination table and insert into it from the temp table.

That's cool. So I guess DELETE FROM x WHERE id IN (SELECT id FROM temp_x)?

For others, would you be so kind to post the code here when you're done?

yeah I just do something like this:

  await Promise.all(
    sourceTables.map((table) =>
      insertBQToMSSQLTempTable({
        query: genSinceSQL(table),
        tempTableName: `temp_${table}`,
      })
    )
  );

await mssql.transaction(async function (trx) {
  // move tables one at a time inside the single transaction
  for (const table of sourceTables) {
    await atomicMoveIntoTable(trx, sourceColumns, `temp_${table}`, table);
  }
});

async function insertBQToMSSQLTempTable({
  query,
  tempTableName,
}: {
  query: string;
  tempTableName: string;
}) {
  return await bq
    .createQueryStream(query)
    .pipe(new DataStream({ maxParallel: 16 }))
    .map((row) => {
      delete row.id;

      return { ...row, ['at_bq']: row['at_bq'].value };
    })
    .timeBatch(2000, 5000)
    .do(async (rows: any[]) => {
      await mssql.raw('SET NOCOUNT ON');
      await mssql.batchInsert(tempTableName, rows, 300).catch((err) => {
        console.log(err);
        console.log(err[0]);
        console.log(err.originalError);
        console.log(err.message);
        process.exit(2);
      });
    })
    .run();
}

async function atomicMoveIntoTable(
  trx: Knex.Transaction<any, any>,
  sourceColumns: SourceColumn[],
  tempTable: string,
  destinationTable: string
) {
  await trx.raw(
    `DELETE FROM ${destinationTable} WHERE pk IN
    (SELECT pk FROM ${tempTable})`
  );

  // specify all columns if you have any IDENTITY columns, if not you could just use `select *`
  await trx.raw(`INSERT INTO ${destinationTable} (${sourceColumns}) SELECT ${sourceColumns} FROM ${tempTable}`);

  await trx.schema.dropTable(tempTable);
}

how do I access the originalError from within the do method?

You caught the error as err so your originalError is just err.

If you don't catch it (just leave it as await), then a scramjet-wrapped error can be caught using a catch after the whole do, or simply by catching the error after run, which returns the promise.

It's a fantastic example BTW. :)

    .do(async (rows: any[]) => {
      await mssql.raw('SET NOCOUNT ON');
      await mssql.batchInsert(tempTableName, rows, 300).catch((err) => {
        console.log(err);
        console.log(err[0]);
        console.log(err.originalError);
        console.log(err.message);
        process.exit(2);
      });
    })
    // here .catch(err => err.cause) // then you can still run the queries, but seems you want to fail fast
    .run()
    .catch(err => err.cause /* maybe rollback the tx rather than a process exit? */)
;

If you don't catch it (just leave it as await) then a scramjet-wrapped error can be caught

ohh I see... so that's how it works

maybe rollback

ahh yes. I am worried that if I do too many inserts into SQL Server then the transaction log will run out of space. I'm not sure how it works internally so I'm being extra careful.

awaited knex.batchInsert() will actually do an implicit transaction, so it will never insert only some "chunks" but not others. It has an implicit commit/rollback.

awaited mssql.transaction(async function (trx) {}); also has an implicit commit/rollback.
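
For instance (illustrative; .transacting() is the knex way to make batchInsert join an outer transaction instead of opening its own):

// implicit transaction: either all chunks are inserted or none are
await mssql.batchInsert(tempTableName, rows, 300);

// or join an outer transaction explicitly, sharing its commit/rollback
await mssql.transaction(async (trx) => {
  await mssql.batchInsert(tempTableName, rows, 300).transacting(trx);
});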

The only leaking part here is that if you use this code you should (see the sketch after this list):

  1. make sure only one copy is running at a time and

  2. truncate the holding temp tables first thing in case there was an error previously.
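
A hypothetical guard for point 2, assuming the temp_<table> naming and the mssql knex instance from above (T-SQL, since the target is SQL Server):

// clear leftovers from a previous failed run before streaming new rows in
for (const table of sourceTables) {
  await mssql.raw(
    `IF OBJECT_ID('temp_${table}', 'U') IS NOT NULL TRUNCATE TABLE temp_${table};`
  );
}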