cassandra: Split author/committer/date/committer_date into individual columns
Cassandra does not support filtering on individual fields of UDTs, as it treats a UDT as a single, indivisible value.
However, the infra team needs to filter on author.email and committer.email, hence the need for separate columns.
This commit reads and writes the new split columns, but keeps reading the UDT as a fallback; the fallback will be removed once all rows have been migrated.
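For illustration, the read-side fallback amounts to something like the following minimal sketch (the row fields mirror the schema below, but the helper itself is hypothetical, not the actual swh-storage code):

```python
from swh.model.model import Person

def author_from_row(row) -> Person:
    """Sketch of the read-side fallback; ``row`` is a hypothetical
    cassandra-driver Row, not the actual swh-storage code."""
    if row.author_fullname is not None:
        # Row already migrated: the split columns are authoritative.
        return Person(
            fullname=row.author_fullname,
            name=row.author_name,
            email=row.author_email,
        )
    # Legacy row: fall back to the single UDT value.
    return Person(
        fullname=row.author.fullname,
        name=row.author.name,
        email=row.author.email,
    )
```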
Migration plan:

- Apply the schema changes:

    ALTER TABLE revision ADD (
        author_fullname blob,
        author_name blob,
        author_email blob,
        committer_fullname blob,
        committer_name blob,
        committer_email blob,
        date_seconds bigint,
        date_microseconds int,
        date_offset_bytes blob,
        committer_date_seconds bigint,
        committer_date_microseconds int,
        committer_date_offset_bytes blob
    );

    ALTER TABLE release ADD (
        author_fullname blob,
        author_name blob,
        author_email blob,
        date_seconds bigint,
        date_microseconds int,
        date_offset_bytes blob
    );

- Update the Python code and restart.

- Run a replayer on revision and release objects without a filtering proxy, in order to write the new columns (see the date-flattening sketch after this plan).
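As a companion sketch, flattening a date into the three new columns could look like this (assuming TimestampWithTimezone exposes timestamp.seconds, timestamp.microseconds and offset_bytes; the helper and key names are illustrative):

```python
from swh.model.model import TimestampWithTimezone

def flatten_date(prefix: str, date: TimestampWithTimezone) -> dict:
    # Hypothetical helper: maps one date onto the three columns added
    # above, keyed by column name.
    return {
        f"{prefix}_seconds": date.timestamp.seconds,
        f"{prefix}_microseconds": date.timestamp.microseconds,
        f"{prefix}_offset_bytes": date.offset_bytes,
    }
```

For a revision, flatten_date("date", rev.date) and flatten_date("committer_date", rev.committer_date) would then feed the INSERT for the revision table.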
requested review from @olasd
Jenkins job DSTO/gitlab-builds #748 failed in 10 min.

Thanks, this looks like a sound plan.
I think we should consider only storing the fullname, which is what is canonically used in the SWHID, and regenerating the name and email when the information is read, using Person.from_fullname() (see the sketch below). I'm not sure storing the parsed values adds much: at least I don't think it would substantially change the workload for the queries we were looking to do, and it would make the schema simpler and the data marginally smaller.

For schema changes, would it make sense to start adding a set of migration CQL scripts and a table to record which scripts have run? This would make the "easy" schema migrations such as this one more transparent, as we could include running them as a flag in the init-keyspace command.
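On the regeneration point, Person.from_fullname() re-derives name and email from the canonical fullname bytes, for example (the sample value is made up):

```python
from swh.model.model import Person

# from_fullname() parses the canonical "name <email>" form back into
# its components; the sample value here is made up.
person = Person.from_fullname(b"Jane Doe <jane@example.org>")
assert person.name == b"Jane Doe"
assert person.email == b"jane@example.org"
assert person.fullname == b"Jane Doe <jane@example.org>"
```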
I'm assuming there's no adverse behavior when running the old code on the new schema?
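To make the migration-scripts idea concrete, a minimal sketch; neither the schema_migrations table nor this helper exists in swh-storage today, and the keyspace name is illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("swh")
session.execute(
    "CREATE TABLE IF NOT EXISTS schema_migrations ("
    " script_name text PRIMARY KEY,"
    " applied_at timestamp"
    ")"
)

def run_migration(name: str, cql: str) -> None:
    # Skip scripts already recorded as applied.
    if session.execute(
        "SELECT script_name FROM schema_migrations WHERE script_name = %s",
        (name,),
    ).one():
        return
    session.execute(cql)
    session.execute(
        "INSERT INTO schema_migrations (script_name, applied_at)"
        " VALUES (%s, toTimestamp(now()))",
        (name,),
    )
```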
> I'm not sure storing the parsed values adds much (at least I don't think it would substantially change the workload for the queries we were looking to do, and it would make the schema simpler and the data marginally smaller).

Actually, queries on blobs will only allow filtering on an exact match, so that's not entirely correct. Meh.
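To spell out that limitation (hypothetical query; keyspace name and session setup are illustrative):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("swh")

# Equality on the whole blob value is expressible...
rows = session.execute(
    "SELECT id FROM revision WHERE author_email = %s ALLOW FILTERING",
    (b"jane@example.org",),
)
# ...but there is no substring or prefix matching on blobs, so e.g.
# "all addresses under @example.org" cannot be asked directly.
```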
> I think we should consider only storing the fullname
Then you won't be able to filter on emails, which is the whole point of this MR.
> I'm assuming there's no adverse behavior when running the old code on the new schema?

Not tested, but it should work (the old code will simply insert null values in the new columns).
You're saying "you" like this is a whim rather than an operational need, which is getting a bit frustrating.
I'm happy to consider any and all alternative proposals that would let us do ad-hoc filtering of the archive's live data on fields of the data model that aren't indexed or optimized for queries, as long as they don't involve days of wondering whether all the data was really queried. I believe making sure that most fields in our Cassandra keyspace are indeed queryable gets us there.
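For the record, the kind of ad-hoc query this enables (a full scan with ALLOW FILTERING here, or an indexed lookup later; keyspace name illustrative):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("swh")

# With the UDT layout, "WHERE committer.email = ..." is rejected
# outright; with the split columns the same question is expressible:
rows = session.execute(
    "SELECT id FROM revision WHERE committer_email = %s ALLOW FILTERING",
    (b"someone@example.org",),
)
print([row.id.hex() for row in rows])
```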
Jenkins job DSTO/gitlab-builds #749 failed in 7 min 47 sec.

Jenkins job DSTO/gitlab-builds #753 failed in 7 min 59 sec.

added 6 commits
- 3a2f2853...e62b6555 - 3 commits from branch master
- f61d649c - cassandra: Split author/committer into individual columns
- f5aaa1f5 - Fix docstring
- 49e06138 - Flatten dates too + add tests for nulls
Jenkins job DSTO/gitlab-builds #771 failed in 3 min 52 sec.

added 2 commits
@jenkins retry build
Jenkins job DSTO/gitlab-builds #772 failed in 12 min.