I’m making a cat vs dog image classifier using PyTorch. Should I have only one output variable, or should I have two (i.e. one-hot)?
Binary vs one hot
The choice between a single output variable and two output variables using one-hot encoding ultimately depends on your specific requirements and the complexity of your task. If you're only interested in classifying between cats and dogs and do not plan to expand the classification task in the future, using a single output variable can be a simpler and more efficient option. However, if you anticipate the need to extend your model to handle more classes in the future or if you want more explicit probabilities for each class, using two output variables with one-hot encoding can be a better choice.
🙂🙂🙂🙂🙂<←The tone posse, ready for action.
I don't think it makes a difference, but using the one-hot encoding makes it a bit simpler to write the loss function in a way that can be generalized to more possible outputs.
Agreed with Fleabay - in this situation, the single output should be more efficient.
A one-hot output requires extra ‘effort’ because the network not only has to positively reinforce the ‘right’ output but also to negatively reinforce the ‘wrong’ outputs. With a single variable, these are exactly the same operation.
Both comments above are correct in that one-hot starts to get better once you move to 3 or more classes, not just because it's easier for a human to understand, but because with a compact binary encoding each output bit is shared between 2 different classes and its meaning depends on the other bit. For example, if you had 00=cat, 01=dog, 10=duck, 11=fish, then your ‘dog’ training also needs to include ‘anti-fish’ training, as there's no longer just a single output you can increase to denote a more ‘dog-like’ input.
All that said, this is a trivial problem for modern systems and either would work.
I'm going to disagree with Kylotan (which I don't do often!). I have to go into details, so this might not be for everyone, but there is a nugget of knowledge in the fourth paragraph that you might not have thought of before.
If you use a single output and you want to interpret it as the probability of the image being a dog, you should probably have a final sigmoid layer, and the loss function you should use is the negative log-likelihood (also called binary cross-entropy), which for probability p and true label y is -y*log(p)-(1-y)*log(1-p).
If you have a one-hot encoding instead, you are probably going to have a final SoftMax layer, and the loss function you should use is cross-entropy, which amounts to -is_actually_a_dog*log(probability_of_dog)-is_actually_a_cat*log(probability_of_cat).
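For reference, here is a minimal sketch of how the two setups might look in PyTorch. The feature size, batch, and labels are made up for illustration. Note that nn.BCEWithLogitsLoss and nn.CrossEntropyLoss apply the sigmoid and the log-softmax internally, so with these losses the heads output raw logits and you don't add a separate sigmoid or SoftMax layer in front of them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 4 feature vectors of size 8 coming out of earlier layers.
features = torch.randn(4, 8)
is_dog = torch.tensor([1, 0, 1, 1])  # 1 = dog, 0 = cat (int64 class indices)

# Option 1: single output. The head emits one raw logit per image and the
# targets are floats with the same shape as the logits.
single_head = nn.Linear(8, 1)
single_logits = single_head(features)                       # shape (4, 1)
single_targets = is_dog.float().unsqueeze(1)                # shape (4, 1)
single_loss = nn.BCEWithLogitsLoss()(single_logits, single_targets)

# Option 2: two outputs. The head emits one raw logit per class and the
# targets are integer class indices (no explicit one-hot vector needed).
two_head = nn.Linear(8, 2)
two_logits = two_head(features)                             # shape (4, 2)
two_loss = nn.CrossEntropyLoss()(two_logits, is_dog)

print(single_loss.item(), two_loss.item())
```

The two printed losses will differ here because the two heads have different random weights; the point is only the shapes and target types each loss expects.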
Notice how the function sigmoid(x) = 1/(1+exp(-x)) can be written as sigmoid(x) = exp(x)/(exp(x)+exp(0)), which is the SoftMax with logits x and 0. Then 1-sigmoid(x) is the other value of the SoftMax, exp(0)/(exp(x)+exp(0)). So the one-output version is mathematically the same as the two-output version, but where you have hardcoded one of the classes to have logit 0.
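If you want to convince yourself numerically, a quick check along these lines should do it. The logit values are arbitrary, and float64 is used only to keep rounding noise out of the comparison.

```python
import torch

# Arbitrary logits, chosen only for illustration.
x = torch.tensor([-3.0, -0.5, 0.0, 1.2, 4.0], dtype=torch.float64)

p_sigmoid = torch.sigmoid(x)

# Two-class logits [x, 0]: the first class gets x, the second is pinned to 0.
logits = torch.stack([x, torch.zeros_like(x)], dim=1)      # shape (5, 2)
p_softmax = torch.softmax(logits, dim=1)

# sigmoid(x) should match the first SoftMax value, 1 - sigmoid(x) the second.
same_first = torch.allclose(p_sigmoid, p_softmax[:, 0])
same_second = torch.allclose(1 - p_sigmoid, p_softmax[:, 1])
print(same_first, same_second)
```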
These two options are identical for most purposes. Just before the final layer, you'll have some linear combinations. In the single-output case you'll have one linear combination. In the two-outputs case you'll have two linear combinations, where the loss function only cares about the difference. The gradients propagated to the layers before will be [I believe] identical.
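Here is a small autograd sketch of that claim, with made-up logit values. It only checks the gradients at the logits themselves: the single logit is set to the difference of the two logits, and the gradient it receives should match the one on the dog logit (with the cat logit getting the opposite sign). Whether whole training runs end up identical also depends on initialisation and the optimizer, which this does not test.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1])  # class index 1 = dog

# Two-output case: separate logits for cat (index 0) and dog (index 1).
z = torch.tensor([[0.3, 1.1]], dtype=torch.float64, requires_grad=True)
two_loss = F.cross_entropy(z, y)
two_loss.backward()

# Single-output case: one logit equal to the difference z_dog - z_cat.
x = torch.tensor([[1.1 - 0.3]], dtype=torch.float64, requires_grad=True)
one_loss = F.binary_cross_entropy_with_logits(x, y.double().unsqueeze(1))
one_loss.backward()

print(two_loss.item(), one_loss.item())  # the two loss values
print(z.grad)  # gradient w.r.t. (z_cat, z_dog)
print(x.grad)  # gradient w.r.t. the single logit
```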
In practice, you will not be able to measure the difference in performance between the two options (if anyone disagrees, I would like you to try to measure it). But the option with two outputs will be easier to get right (you will be less tempted to use an L2 loss, for instance), easier to think about and easier to extend beyond 2 classes. So that's the one I continue to recommend.
I thank you all for your time and expertise.
I get about 60% test data classified correctly, and I'm not sure if it's because I'm using a traditional neural network? I'd like to see results like 90% correct.
The code is at:
https://github.com/sjhalayka/pytorch_cats_vs_dogs/tree/4b55edb2c3038dcc75ca0ec0a8fa637145e8f704
taby said:
I get about 60% test data classified correctly
That's pretty good, for chihuahua or muffin…
60% is pretty bad when you consider that the baseline (pure guessing) is 50%. 60% is the score you would expect to get on a true-false quiz when you only know 20% of the answers.
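The arithmetic behind the quiz comparison, as a tiny sketch (the function name is just for illustration): the answers you know are always right, and the rest are coin flips.

```python
def expected_true_false_score(known_fraction: float) -> float:
    """Expected fraction correct when known answers are right and the rest are guessed."""
    guessed_fraction = 1.0 - known_fraction
    return known_fraction * 1.0 + guessed_fraction * 0.5

score = expected_true_false_score(0.2)
print(f"{score:.0%}")  # 60%
```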
32x32 is a pretty crappy resolution to get anything done. What does your data really look like?
A multi-layer perceptron with tanh non-linearities is not a great architecture for this task. If I were given this problem and enough pictures of dogs and cats to train a net, I would probably try a CNN (probably something like ResNet).
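To give an idea of what a convolutional model looks like in PyTorch, here is a minimal sketch. It is a plain two-layer CNN, not a ResNet (there are no residual connections), and it assumes 3-channel 32x32 inputs; the class name, layer sizes, and channel counts are arbitrary choices for illustration, not tuned values.

```python
import torch
import torch.nn as nn

class SmallCatDogCNN(nn.Module):
    """Hypothetical small CNN for 3x32x32 inputs with two output logits."""

    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # -> 16 x 32 x 32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16 x 16 x 16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32 x 16 x 16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32 x 8 x 8
        )
        self.classifier = nn.Linear(32 * 8 * 8, 2)        # logits for (cat, dog)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.features(images)
        return self.classifier(torch.flatten(x, start_dim=1))

model = SmallCatDogCNN()
dummy_batch = torch.randn(4, 3, 32, 32)  # random stand-in for real images
logits = model(dummy_batch)
print(logits.shape)
```

The two logits would go straight into nn.CrossEntropyLoss during training.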
But these days you should be able to solve this classification problem with no training at all, using CLIP. In brief, OpenAI already trained two neural networks for you: one that maps images to a fixed-length vector representation, and another one that maps text to the same fixed-length vector representation, in such a way that images and their descriptions have similar vector representations. You can compute the representations of “cat” and “dog”, and when you want to classify an image you just compute its vector representation and see if it's closer to “cat” or “dog” in this space.
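The last step, picking the closer text representation, can be sketched as below. This does not load CLIP: the vectors are tiny hand-written placeholders, not real CLIP outputs, and classify_by_embedding is a hypothetical helper name. With real embeddings you would substitute the vectors produced by the image and text encoders. Cosine similarity is used here as the closeness measure.

```python
import torch
import torch.nn.functional as F

def classify_by_embedding(image_embedding, text_embeddings, labels):
    """Return the label whose text embedding is most similar to the image embedding."""
    sims = F.cosine_similarity(image_embedding.unsqueeze(0), text_embeddings, dim=1)
    return labels[int(sims.argmax())], sims

labels = ["cat", "dog"]

# Placeholder 4-dimensional vectors, NOT real CLIP outputs.
text_embeddings = torch.tensor([[1.0, 0.0, 0.2, 0.0],    # stands in for "cat"
                                [0.0, 1.0, 0.0, 0.2]])   # stands in for "dog"
image_embedding = torch.tensor([0.1, 0.9, 0.0, 0.3])     # stands in for an image

label, sims = classify_by_embedding(image_embedding, text_embeddings, labels)
print(label, sims)
```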