Census Data from Statistics Canada

Statistics Canada carries out a census every 5 years, with 2016 being the most recent run. The census data provide a wealth of insights but are published in raw format. Post-processing work is needed to derive information such as the median income of a neighbourhood, the age distribution of a city, etc. For someone like myself without any background in …

How imaging devices talk to each other (in DICOM)

Overview: In the previous post I briefly touched on DICOM as the crucial standard in medical imaging for both data exchange and data storage. It is important to understand that DICOM is such a massive standard, having expanded beyond data exchange and storage into many other areas of imaging, that no device (or information …

Spark, Cassandra and Python

In this post we touch briefly on Apache Spark, a cluster computing framework that supports a number of drivers for piping data in and whose stunning performance owes much to the resilient distributed dataset (RDD), its architectural foundation. In this hands-on guide, we expand on how to configure Spark and use Python to …

DataStax Python Driver

For someone with a relational database background, analyzing data in Cassandra isn’t intuitive, for two reasons. First, rows in a Cassandra table are rarely updated or deleted, in order to avoid tombstones. Insertion is the only action on the table, resulting in multiple versions of each record all stored in the same table, and thus a much longer table …

Cassandra data model (as opposed to relational model)

Bad data model design in Cassandra causes chronic pain as the application scales. I had to re-read the material on data model design in “Cassandra: The Definitive Guide” and am keeping my notes and thoughts in this post. Data modelling in the relational world is drilled into every student coming out of university. It embraces several things: Entity-Relation: …

Storage Nitty-Gritty 5 of 5 – Replication

Replication Terms: PIT (point in time) replica – snapshot of the source at some specific timestamp; Continuous replica – always in sync with the production data; Recoverability – enables restoration of data from the replica to the source if data loss or corruption occurs; Restartability – enables restarting business operations using the replicas. Local Replication Use Case: Alternative source …