cassandra: Split author/committer/date/committer_date into individual columns
Cassandra does not support filtering on individual fields of UDTs, as it treats a UDT as a single, indivisible value.
However, the infra team needs to filter on author.email and committer.email, hence the need for separate columns.
This commit reads and writes the new split columns, but keeps reading the UDT as a fallback; the fallback will be removed once all rows have been migrated.
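For illustration, the read-side fallback amounts to something like the following minimal sketch (the row fields mirror the schema below, but the helper itself is hypothetical, not the actual swh-storage code):

```python
from swh.model.model import Person

def author_from_row(row) -> Person:
    """Sketch of the read-side fallback; ``row`` is a hypothetical
    cassandra-driver Row, not the actual swh-storage code."""
    if row.author_fullname is not None:
        # Row already migrated: the split columns are authoritative.
        return Person(
            fullname=row.author_fullname,
            name=row.author_name,
            email=row.author_email,
        )
    # Legacy row: fall back to the single UDT value.
    return Person(
        fullname=row.author.fullname,
        name=row.author.name,
        email=row.author.email,
    )
```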
Migration plan:

- Apply the schema changes:

    ALTER TABLE revision ADD (
        author_fullname blob,
        author_name blob,
        author_email blob,
        committer_fullname blob,
        committer_name blob,
        committer_email blob,
        date_seconds bigint,
        date_microseconds int,
        date_offset_bytes blob,
        committer_date_seconds bigint,
        committer_date_microseconds int,
        committer_date_offset_bytes blob
    );

    ALTER TABLE release ADD (
        author_fullname blob,
        author_name blob,
        author_email blob,
        date_seconds bigint,
        date_microseconds int,
        date_offset_bytes blob
    );

- Update the Python code and restart.

- Run a replayer on revision and release objects without a filtering proxy, in order to write the new columns (see the date-flattening sketch after this plan).
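As a companion sketch, flattening a date into the three new columns could look like this (assuming TimestampWithTimezone exposes timestamp.seconds, timestamp.microseconds and offset_bytes; the helper and key names are illustrative):

```python
from swh.model.model import TimestampWithTimezone

def flatten_date(prefix: str, date: TimestampWithTimezone) -> dict:
    # Hypothetical helper: maps one date onto the three columns added
    # above, keyed by column name.
    return {
        f"{prefix}_seconds": date.timestamp.seconds,
        f"{prefix}_microseconds": date.timestamp.microseconds,
        f"{prefix}_offset_bytes": date.offset_bytes,
    }
```

For a revision, flatten_date("date", rev.date) and flatten_date("committer_date", rev.committer_date) would then feed the INSERT for the revision table.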
requested review from @olasd
Jenkins job DSTO/gitlab-builds #748 failed in 10 min.

Thanks, this looks like a sound plan.
I think we should consider only storing the fullname, which is what is canonically used in the SWHID, and regenerating the name and email when the information is read, using Person.from_fullname() (see the sketch below). I'm not sure storing the parsed values adds much: at least I don't think it would substantially change the workload for the queries we were looking to do, and it would make the schema simpler and the data marginally smaller.

For schema changes, would it make sense to start adding a set of migration CQL scripts and a table to record which scripts have run? This would make the "easy" schema migrations such as this one more transparent, as we could include running them as a flag in the init-keyspace command.
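On the regeneration point, Person.from_fullname() re-derives name and email from the canonical fullname bytes, for example (the sample value is made up):

```python
from swh.model.model import Person

# from_fullname() parses the canonical "name <email>" form back into
# its components; the sample value here is made up.
person = Person.from_fullname(b"Jane Doe <jane@example.org>")
assert person.name == b"Jane Doe"
assert person.email == b"jane@example.org"
assert person.fullname == b"Jane Doe <jane@example.org>"
```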
I'm assuming there's no adverse behavior when running the old code on the new schema?
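To make the migration-scripts idea concrete, a minimal sketch; neither the schema_migrations table nor this helper exists in swh-storage today, and the keyspace name is illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("swh")
session.execute(
    "CREATE TABLE IF NOT EXISTS schema_migrations ("
    " script_name text PRIMARY KEY,"
    " applied_at timestamp"
    ")"
)

def run_migration(name: str, cql: str) -> None:
    # Skip scripts already recorded as applied.
    if session.execute(
        "SELECT script_name FROM schema_migrations WHERE script_name = %s",
        (name,),
    ).one():
        return
    session.execute(cql)
    session.execute(
        "INSERT INTO schema_migrations (script_name, applied_at)"
        " VALUES (%s, toTimestamp(now()))",
        (name,),
    )
```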
> I'm not sure storing the parsed values adds much (at least I don't think it would substantially change the workload for the queries we were looking to do, and it would make the schema simpler and the data marginally smaller).

Actually, queries on blobs will only allow filtering on an exact match, so that's not entirely correct. Meh.
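To spell out that limitation (hypothetical query; keyspace name and session setup are illustrative):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("swh")

# Equality on the whole blob value is expressible...
rows = session.execute(
    "SELECT id FROM revision WHERE author_email = %s ALLOW FILTERING",
    (b"jane@example.org",),
)
# ...but there is no substring or prefix matching on blobs, so e.g.
# "all addresses under @example.org" cannot be asked directly.
```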
> I think we should consider only storing the fullname
Then you won't be able to filter on emails, which is the whole point of this MR.
> I'm assuming there's no adverse behavior when running the old code on the new schema?

Not tested, but it should work (the old code will simply insert null values in the new columns).
You're saying "you" like this is a whim rather than an operational need, which is getting a bit frustrating.
I'm happy to consider any and all alternative proposals that would let us do ad-hoc filtering of the archive's live data on fields of the data model that aren't indexed or optimized for queries, as long as they don't involve days of wondering whether all the data was really queried. I believe making sure that most fields in our Cassandra keyspace are indeed queryable gets us there.
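For the record, the kind of ad-hoc query this enables (a full scan with ALLOW FILTERING here, or an indexed lookup later; keyspace name illustrative):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("swh")

# With the UDT layout, "WHERE committer.email = ..." is rejected
# outright; with the split columns the same question is expressible:
rows = session.execute(
    "SELECT id FROM revision WHERE committer_email = %s ALLOW FILTERING",
    (b"someone@example.org",),
)
print([row.id.hex() for row in rows])
```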
Jenkins job DSTO/gitlab-builds #749 failed in 7 min 47 sec.

Jenkins job DSTO/gitlab-builds #753 failed in 7 min 59 sec.

added 6 commits
- 3a2f2853...e62b6555 - 3 commits from branch master
- f61d649c - cassandra: Split author/committer into individual columns
- f5aaa1f5 - Fix docstring
- 49e06138 - Flatten dates too + add tests for nulls
Jenkins job DSTO/gitlab-builds #771 failed in 3 min 52 sec.

added 2 commits
@jenkins retry build
Jenkins job DSTO/gitlab-builds #772 failed in 12 min.