If you haven’t heard the words in the title before, they might feel too ‘technical’. No fear, I’ll keep it simple.
If you’ve read the Chapter on Data Management in the Comprehensive Rigging Guide you’ll know that a Data Management Technician (DMT) or Digital Imaging Technician (DIT) – choose your title – is responsible for some or all of these tasks:
- Inspecting footage
- Creating backups
- Inspecting, changing and/or entering metadata
- Creating dailies and transcoding
- On-set color grading
- Asset management
- Assuring data integrity
I will leave out color grading because that requires different tools, and does not strictly fall under the ambit of data wrangling. The rest is a part of managing and handling data.
We’ve covered some of the above functions earlier in Data Management. The ones we haven’t covered are:
- Assuring data integrity
- Handling metadata
- Making backups
Let’s start with data integrity.
What is Checksum or Hash?
Imagine this: You are given a four digit number. E.g. 3892. Your task is to send it across the earth to your counterpart. When he or she receives it, it must still remain 3892. How do you ensure it?
One way to solve this problem is by sending the number twice or more. But that creates three new problems:
- It costs more to send the same thing again, and the message takes up twice (or some multiple of) the original size.
- If the system is incapable of sending the right number because of an inherent flaw, you’ll get the same mistake every time.
- You don’t know you need to send it again unless you know the first one failed. Otherwise you’ll always be sending multiple copies all the time, just to be safe.
In computer-speak, all this increases bandwidth or storage requirements, and takes more time. All of this costs money.
What else can you do? You can create a checksum. What’s that?
A checksum = check + sum = a sum that is used to check the original number. What do we mean by ‘sum’? Simple, you add the numbers like this:
3 + 8 + 9 + 2 = 22, and then 2 + 2 = 4. This ‘4’ is the checksum. Now, instead of sending 3892, you send 38924. It adds only one digit to the message, instead of the four extra digits you’d need to send it twice. When your counterpart receives it, he or she will perform the same ‘sum’ on the first four digits, and get ‘4’.
The two of you have already agreed beforehand that the last digit in the message will be the checksum. When your counterpart adds your message to get 4, and sees the checksum is also 4, he or she knows the message is correct.
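Here’s a minimal Python sketch of the decimal example above, just to make the idea concrete. The function names (digit_checksum, attach_checksum, verify) are my own, and the sketch assumes the agreement described above: the last digit of the message is always the checksum.

```python
def digit_checksum(number: int) -> int:
    """Add the decimal digits, repeating until a single digit is left."""
    total = number
    while total >= 10:
        total = sum(int(digit) for digit in str(total))
    return total


def attach_checksum(number: int) -> str:
    """Build the message: the original number followed by its checksum digit."""
    return f"{number}{digit_checksum(number)}"


def verify(message: str) -> bool:
    """Recompute the checksum of the payload and compare it to the last digit."""
    payload, check = message[:-1], message[-1]
    return digit_checksum(int(payload)) == int(check)


message = attach_checksum(3892)
print(message)          # 38924
print(verify(message))  # True
print(verify("38824"))  # False: 3 + 8 + 8 + 2 = 21, then 2 + 1 = 3, not 4
```

The receiver never needs a second copy of the number. One extra digit is enough to notice that 3892 arrived as 3882.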
In binary terms, a word (binary word) is made of 0s and 1s. Let’s say we have a word that is 8 bits (8 characters) long, e.g. 10010100. Add up the bits and you get 3, which is 11 in binary, so you add 11 to the end. The entire message is 1001010011. The person at the other end will look at the checksum, which is 11, and then add up the rest of the bits to see if they arrive at the same value. If they match, he or she knows the message is correct.
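The binary version can be sketched the same way. This is only an illustration: bit_sum_checksum and verify_word are names I made up, and the sketch assumes both sides have agreed that the word is always 8 bits long, so everything after the eighth bit is the checksum.

```python
def bit_sum_checksum(word: str) -> str:
    """Count the 1 bits in the word and return that count written in binary."""
    return format(word.count("1"), "b")


def verify_word(message: str, word_length: int = 8) -> bool:
    """Split the message into word and checksum, then recompute and compare."""
    word, check = message[:word_length], message[word_length:]
    return bit_sum_checksum(word) == check


word = "10010100"
checksum = bit_sum_checksum(word)
print(checksum)                     # 11 (three 1s, and 3 is 11 in binary)
print(word + checksum)              # 1001010011
print(verify_word("1001010011"))    # True
print(verify_word("1101010011"))    # False: four 1s would need checksum 100
```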
This checksum character or set of characters is also called a hash. Ideally, every change to the message produces a different hash.
A hash can be a binary value (if it is small and simple), or a hexadecimal value. Your hash, your rules. The algorithm that converts a message or data fragment into a hash is called a checksum algorithm.
The algorithm or software that checks the hash against the message or data fragment is a checksum verification algorithm or software.
Fundamental problems with Checksum Algorithms
The cool thing about the checksum idea is that it allows us to check for errors without duplicating the message or data. You’ll still have some overhead, but not much. That’s a big deal when you look at terabytes of data.
But, there are some major problems with checksums:
- A simple one-bit checksum can only be a 0 or a 1. If one bit of the message changes, the error is caught. If two bits change, they cancel each other out and the error will not be caught.
- There is more than one sequence of numbers that give the same sum. In our decimal example, there are many 4-digit combinations that add up to 4.
- Somebody who knows the checksum method (sum, addition, multiplication, whatever) will be able to easily beat the system.
The first problem is easy to correct. The other two can’t be eliminated completely, only made very unlikely.
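You can see the first two problems for yourself with a few lines of Python. This is a rough sketch with names of my own choosing: digit_checksum is the decimal ‘sum’ from earlier, and parity_bit is one way to build a one-bit checksum (the count of 1s, modulo 2).

```python
def digit_checksum(number: int) -> int:
    """Add the decimal digits, repeating until a single digit is left."""
    total = number
    while total >= 10:
        total = sum(int(digit) for digit in str(total))
    return total


def parity_bit(word: str) -> int:
    """A one-bit checksum: 1 if the word has an odd number of 1s, else 0."""
    return word.count("1") % 2


# Many different 4-digit numbers share the checksum of 3892.
target = digit_checksum(3892)
same_checksum = [n for n in range(1000, 10000) if digit_checksum(n) == target]
print(len(same_checksum))    # 1000
print(digit_checksum(8392))  # 4, same as 3892 even though two digits swapped

# A one-bit checksum catches one flipped bit but misses two.
original = "10010100"
one_flip = "00010100"   # first bit flipped
two_flips = "01010100"  # first and second bits flipped
print(parity_bit(original), parity_bit(one_flip), parity_bit(two_flips))  # 1 0 1
```

One in every nine 4-digit numbers collides with 3892, which is why a one-digit checksum isn’t much protection on its own.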
So, how does the checksum method circumvent the above ‘problems’? It’s got to be more useful, right? Here are a few tricks:
- You create a bigger checksum or hash. Instead of one or two characters, your hash is larger in size; but still proportionately smaller than the original message.
- You use more sophisticated algorithms instead of simple additions and subtractions.
Even these methods aren’t totally fool-proof, but they reduce the probability of an undetected error to within acceptable limits. It’s easy for two different messages to both give a 4, but for two to both give a 234fsd9f9323k3kj34344p isn’t likely within the same file set or day (or even year). What are the odds of that?
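To get a feel for what a bigger, more sophisticated hash looks like, here’s a small sketch using Python’s standard hashlib module. SHA-256 is purely my pick for illustration and the helper name hash_of is made up; the point is only that two inputs that fooled the simple digit sum now give completely different hashes.

```python
import hashlib


def hash_of(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


first = hash_of(b"3892")
second = hash_of(b"8392")  # same digits swapped: the digit sum can't tell them apart

print(first)
print(second)
print(len(first))       # 64 hexadecimal characters, whatever the input size
print(first == second)  # False
```

The hash stays the same length whether you feed it four characters or a whole camera card, so the overhead stays tiny compared to the data.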
What price do you pay for a better checksum algorithm?
- You need more processing power, since the computer will need to calculate the checksum and compare.
- You need more time because the computer takes more time.
So, the art of choosing the right checksum software is finding the right balance between speed and an acceptably low probability of a missed error. Obviously, you don’t want to compromise your data, but you don’t want to go overboard either. Some systems, like government records, bank records, etc., need greater security. Therefore, they tend to sacrifice speed for maximum security.
On the other hand, on-set duplication and backup doesn’t need this level of security. You just want to know if your data is error-free or not. The two most widely used algorithms in checksum software are:
Neither of these algorithms is secure enough for high-security data, but both are great for error checking. Keep in mind that a checksum only detects an error; to correct it, you copy the file again. They are also freely available and widely supported on all platforms: Mac, Windows and Linux.
So, how does this work with video? Every time the data is copied to another drive or system, you use the verification software to ensure the copy is exactly the same as the source. You can also use this to check if any metadata has been changed, and so on. This ensures that every backup is an exact replica of the original camera data.
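Here’s roughly what that copy-and-verify step looks like if you sketch it in Python. Treat it as an illustration, not as how any particular backup software works: the file names are invented, the ‘camera data’ is fake, SHA-256 is my own choice of algorithm, and hash_file and copy_and_verify are hypothetical helpers.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MB at a time so big clips don't fill memory


def hash_file(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file's contents chunk by chunk and return the hex digest."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination: Path) -> bool:
    """Copy the file, then confirm that source and copy hash to the same value."""
    shutil.copyfile(source, destination)
    return hash_file(source) == hash_file(destination)


with tempfile.TemporaryDirectory() as tmp:
    card = Path(tmp) / "clip_0001.mov"            # stand-in for a camera file
    backup = Path(tmp) / "clip_0001_backup.mov"
    card.write_bytes(b"pretend camera data" * 1000)

    copy_ok = copy_and_verify(card, backup)
    print(copy_ok)  # True

    # Simulate a bad copy by flipping a single bit in the backup.
    damaged = bytearray(backup.read_bytes())
    damaged[10] ^= 0x01
    backup.write_bytes(bytes(damaged))

    still_matches = hash_file(card) == hash_file(backup)
    print(still_matches)  # False
```

A single flipped bit in the backup is enough to make the hashes disagree, which tells you to copy that file again.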
That’s all you want it to do.
When would you need a video backup and checksum software?
Whenever you are dealing with a project that involves:
- Many days of shooting
- Many backups
- Creating dailies
- Dealing with metadata
- Liaising with Post facilities
You’ll need backup and checksum software. You could go about this in two ways:
- Get each tool separately and learn to fit them into your workflow
- Find a one-stop-shop software solution that is designed for digital data management
It’s up to you to figure out the right way to do it. Some technicians aren’t very tech savvy, so a single application gives them the peace of mind to carry out all these tasks. Others like to have total control over the entire workflow, and want to work with the best software only.
Understand this: There isn’t one application or suite of tools that does everything perfectly. It is extremely likely that you’ll end up using more than one, because any single tool will frustrate some part of your workflow.
In Part Two we’ll look at the typical data workflow on a set and list some important software that does video backup, checksum and data management well.