This morning I went for a long drive. I started thinking about image recognition. The question is, "How can you train a computer to recognize a common image pattern, even if many parts of the image change?" In other words, can we teach a computer to recognize and identify memes? The thing with meme images is that they usually reuse the same image and people post different text overlays on it. Here's an example:
Here's another example:
So, if you took both of these images and just hashed them with MD5 and stored the result in a database somewhere, and then only looked at the hash value, you would have no way of knowing that these are practically the same image but with minor variations. How could you devise a better way to know what you're looking at and whether it's similar to other things? And why would you want to?
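To make the MD5 problem concrete, here's a minimal sketch. The byte strings are made-up stand-ins for two image files (not real image data) that differ by a single bit, and differing_bits is just a helper I'm introducing for illustration:

```python
import hashlib

# Hypothetical stand-ins for two image files; real files would be read from disk.
image_a = bytes([0x10] * 64)
tweaked = bytearray(image_a)
tweaked[0] ^= 0x01  # flip a single bit in the first byte
image_b = bytes(tweaked)

digest_a = hashlib.md5(image_a).hexdigest()
digest_b = hashlib.md5(image_b).hexdigest()


def differing_bits(hex_a, hex_b):
    """Count how many bits differ between two hex digests of equal length."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


print(digest_a)
print(digest_b)
print(differing_bits(digest_a, digest_b), "of 128 digest bits differ")
```

The two digests share nothing you could use to tell that the inputs were nearly identical, which is exactly what a cryptographic-style hash is designed to do.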
This is probably a solved problem and I'm probably reinventing the wheel -- and much more inefficiently, as first attempts tend to go. I imagine that many much smarter engineers working at facebook and google have already solved this problem and written extensive white papers on the subject, of which I am ignorant. But, I think the journey towards the discovery of a solution is half the fun and also a good exercise. It's like trying to solve a math problem in a textbook before checking the answer in the back of the book, right? The point isn't to get the answer, but to develop the method... So, onwards to what I came up with!
My initial thought was that you'd just feed an image into a machine learning neural network. If the network is big enough and the magic black box is complicated enough, and we give it enough repetitions in supervised learning, maybe it can learn to recognize images and similarities between images? But that sounds really complicated and I don't like things that "just work" by unexplainable "magic". It's great when it works, but what do you do when it doesn't work and you don't know why? What hope would you ever have of diagnosing the problem and fixing it? Maybe neural networks are enough and I'm unnecessarily overthinking things, but that's the fun of it. What if... instead of taking a hash value of the complete image, we created several hash values of subdivisions of the image? We could create a quad tree out of an image, and then generate a hash value for each quadrant, and then keep repeating the process until we get to some minimum sized subdivision. Then, we can store all of these hash values in a database as an index to the image data. To test if an image is similar to another, we just look at the list of subdivided hash values and see if we find any matches. The more matches you find, the more statistically likely you are to have a similar image match.
Of course, this all assumes that the images are going to be exactly the same (at the binary level) with the exception of text. If even one bit is off, then the generated hash value will be completely different. This could be particularly problematic with lossy image formats such as JPEG, so doing hashes of image quadrants may not necessarily be the best approach. But, what if we looped back to our machine learning idea and trained a neural network to specialize in recognizing subsections of an image, despite any lossy or incomplete data? The tolerance for lossiness would make it easier to correctly identify an image subquadrant and match it to a set of high and low resolution image quadrants.
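Here's a rough sketch of the quad tree idea, under some assumptions of mine: the image is a square grid of grayscale values whose side is a power of two, SHA-256 stands in for whatever hash you'd pick, and quadrant_hashes and similarity are names I made up. Similarity is scored as the fraction of quadrant hashes the two images share.

```python
import hashlib


def quadrant_hashes(pixels, min_size=2):
    """Hash the whole image, then each quadrant, recursively down to min_size.

    pixels: square list of rows of ints in 0..255, side a power of two.
    Returns the set of hex digests for every block visited.
    """
    hashes = set()

    def visit(x, y, size):
        block = bytes(
            pixels[row][col]
            for row in range(y, y + size)
            for col in range(x, x + size)
        )
        hashes.add(hashlib.sha256(block).hexdigest())
        half = size // 2
        if half >= min_size:
            for dy in (0, half):
                for dx in (0, half):
                    visit(x + dx, y + dy, half)

    visit(0, 0, len(pixels))
    return hashes


def similarity(image_a, image_b):
    """Fraction of quadrant hashes shared by the two images (0.0 to 1.0)."""
    hashes_a = quadrant_hashes(image_a)
    hashes_b = quadrant_hashes(image_b)
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)


# An 8x8 test image, and a copy with one changed pixel (a tiny "caption").
original = [[row * 8 + col for col in range(8)] for row in range(8)]
captioned = [list(row) for row in original]
captioned[0][0] = 255

print(len(quadrant_hashes(original)))   # 21 blocks: 1 + 4 + 16
print(similarity(original, captioned))  # 0.75
```

One changed pixel invalidates the whole-image hash plus the one 4x4 and one 2x2 block containing it, so 18 of the 24 distinct hashes across both images still match.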
But now, this is where it gets interesting.
What if... when someone uploads an image online, rather than storing the whole image, we subdivide the image into quadrants and compare each image quadrant against a set of existing image quadrants? The only data we would upload would be any image quadrants which had no matches in the database. The rest of the image data would just be stored as references to the existing image quadrants. Theoretically, you could have some pretty intense data compression. That 1024x1024 32 bit image (~4.2 MB uncompressed, at 4 bytes per pixel) could be 90% represented by pre-existing hash values representing image blocks in the database, and you'd be looking at a massive reduction in disk space usage. Rather than storing the full image itself, you'd be storing the deltas on a quadrant by quadrant basis, and as people ask for any image, it gets reconstructed on the fly based on the set of hash values which describe the image. And, if you have this way of laying out your data, you can create a heat map of which data quadrants are more popular than others, and then make those data quadrants far more "hot" and accessible. In other words, if a meme or video goes viral and needs to serve out a billion views, you would want a distributed file system which makes thousands of copies of that data available on hundreds of mirrors, and then let a network load balancer find the fastest data block to send to a user. The more frequently a block of data is requested, the more distributed and readily accessible it gets (with an obvious time decay of course, to handle dead memes). And in the best case scenario, if someone uploads an exact copy of an image which already has a hash value match to something else someone uploaded, we just create a duplicate reference to that data instead of a duplication of the data itself.
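A minimal sketch of that storage scheme, with simplifications I'm assuming for brevity: fixed-size byte chunks stand in for image quadrants, everything lives in an in-memory dict, and BlockStore, put, and get are hypothetical names. Each upload becomes a "recipe" of block hashes, and only blocks the store hasn't seen before cost any new space.

```python
import hashlib


class BlockStore:
    """Content-addressed store: each unique block is kept once, keyed by hash."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # hex digest -> block bytes

    def put(self, data):
        """Store data; return (recipe of block hashes, bytes newly stored)."""
        recipe = []
        new_bytes = 0
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            key = hashlib.sha256(block).hexdigest()
            if key not in self.blocks:
                self.blocks[key] = block
                new_bytes += len(block)
            recipe.append(key)
        return recipe, new_bytes

    def get(self, recipe):
        """Rebuild the original data from its list of block hashes."""
        return b"".join(self.blocks[key] for key in recipe)


store = BlockStore()
meme_a = b"AAAABBBBCCCCDDDD"  # made-up data: four blocks
meme_b = b"AAAABBBBCCCCEEEE"  # same picture, different "caption" block

recipe_a, stored_a = store.put(meme_a)
recipe_b, stored_b = store.put(meme_b)

print(stored_a, stored_b)             # 16 4
print(store.get(recipe_b) == meme_b)  # True
```

The second upload only adds its one new block; the other three are references to data that was already there. A popularity counter per block key would be the natural place to hang the "heat map" idea.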
Then, the question becomes: How and when do we delete data? If there's a dead meme, that's one thing. But what if someone uploads naked pictures of me to the internet and I want them deleted (just to personalize the stakes)? Both are use cases that need to be handled elegantly by the underlying design. In the case of dead memes, we would enforce a gradual decay timer based on frequency of access to each data block. If a block of data hasn't had a read request for months and we have 4 copies, maybe we can automatically delete three of those copies to free up space without completely deleting the data? If a user uploads an image, they don't "own" the image, but rather they're given a hash value which represents the composition of that image, which may have data blocks which are shared by other users who uploaded similar images. A "delete" action by this user would only delete their hash link rather than the underlying image itself. Maybe we also maintain a reference counter, such that when it reaches zero, it gets marked for garbage collection in a month or so?
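The reference counter with delayed garbage collection could look something like this. It's only a sketch: RefCountedBlocks and its methods are names I invented, time is passed in explicitly as seconds so the behavior is easy to follow, and the one month grace period is the figure floated above.

```python
class RefCountedBlocks:
    """Track how many uploads reference each block; collect after a grace period."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.ref_counts = {}  # block hash -> number of uploads referencing it
        self.zero_since = {}  # block hash -> time its count dropped to zero

    def add_reference(self, key):
        self.ref_counts[key] = self.ref_counts.get(key, 0) + 1
        self.zero_since.pop(key, None)  # a new reference rescues the block

    def remove_reference(self, key, now):
        """A user 'delete': drop one link, never the data itself."""
        self.ref_counts[key] -= 1
        if self.ref_counts[key] == 0:
            self.zero_since[key] = now

    def collect_garbage(self, now):
        """Forget blocks that have been unreferenced for the whole grace period."""
        expired = sorted(
            key for key, since in self.zero_since.items()
            if now - since >= self.grace_seconds
        )
        for key in expired:
            del self.ref_counts[key]
            del self.zero_since[key]
        return expired


MONTH = 30 * 24 * 3600
tracker = RefCountedBlocks(grace_seconds=MONTH)
tracker.add_reference("shared-block")   # used by two uploads
tracker.add_reference("shared-block")
tracker.add_reference("caption-block")  # used by one upload

# One user deletes their upload at time 0.
tracker.remove_reference("shared-block", now=0)
tracker.remove_reference("caption-block", now=0)

print(tracker.collect_garbage(now=MONTH - 1))  # []
print(tracker.collect_garbage(now=MONTH))      # ['caption-block']
```

The shared block survives because someone else still references it, which is the whole point of a delete only removing the link.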
In the case of someone uploading naked pictures of me, where removing the image is more important than just deleting references to it, there would need to be some sort of purge command which not only deletes all references to all data blocks, but also deletes the data blocks themselves, so any future read accesses will fail. This would be a dangerous section of code and would need to be done very carefully, with security and abuse in mind. But, that raises another question: who gets to decide who gets 'delete' access and who gets 'purge' access? Do you own every quadrant of an image you upload, even if it's shared among other related images? Let's say a vicious person took a naked picture of me and uploaded it to this cloud, and then I go through the process and get this image purged. What's going to stop this vicious person from uploading the image again, and causing me new headache and harm? Ideally, we'd like to block the vicious person from being able to upload the image ever again. But, in order to do that, we'd have to know that the data someone is trying to upload has been deemed "forbidden" by the system before we can block it. And how are we to know a block of data is forbidden without comparing it against an existing dataset? To be much more specific towards a real world problem, let's pretend that someone uploads child porn to our image cloud. They use the network to distribute the CP to other pedophiles as quickly as they can before it gets found and shut down. The server owners find and immediately purge the content as quickly as they can and want to block any further attempts to upload the same content/material, but they run into a bit of a challenge: In order to know what content to block, you must store copies of the content to recognize it, and that ultimately would mean that you're still storing child porn on your servers, which would then get you into legal trouble.
I think my quadrant based hash value idea would still be an elegant way to resolve this. Instead of storing the data itself, you store a list of banned hash values. If, during the image decomposition, one or more of the hashed image quadrants match a list of the banned image values, then you reject the complete image. If your hashing function has a one in a trillion collision rate, you don't really need to worry about false positives (provided your quadrants don't get down to the granular level of individual pixels).
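As a sketch of the ban list (again with fixed-size byte chunks standing in for quadrants, and with block_hashes, purge_and_ban, and is_rejected as made-up names), the server keeps only the hashes of purged blocks and checks every upload against them:

```python
import hashlib


def block_hashes(data, block_size=4):
    """Hash each fixed-size chunk of data (a stand-in for image quadrants)."""
    return {
        hashlib.sha256(data[start:start + block_size]).hexdigest()
        for start in range(0, len(data), block_size)
    }


banned_hashes = set()  # only digests are kept, never the banned data itself


def purge_and_ban(data):
    """Remember the hashes of a purged upload so it can be recognized later."""
    banned_hashes.update(block_hashes(data))


def is_rejected(data):
    """Reject an upload if any of its blocks matches a banned hash."""
    return not banned_hashes.isdisjoint(block_hashes(data))


purge_and_ban(b"XXXXYYYY")        # made-up bytes for the forbidden upload

print(is_rejected(b"AAAAXXXX"))   # True: one block matches the ban list
print(is_rejected(b"AAAABBBB"))   # False: nothing matches
```

One caveat this makes visible: banning every block of a purged image would also reject unrelated uploads that happen to share a block with it (a plain white background, say), so in practice you'd probably want to ban only the blocks unique to that image, or require several matches before rejecting.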
The other danger is that someone is a bit too liberal with the "purge & ban" button. Imagine that an artificial neural network isn't used just for pattern analysis to identify images and image quadrants, but also to identify paragraphs, sentences, and words in blogs, forums, and message boards. If someone posts copy/pasta which matches a forbidden topic, such as "Falun Gong" or the "Tiananmen Square Massacre" in China, this sort of system could potentially be used by authoritarians to squash free speech and ideas which they don't like. That could ultimately be more dangerous & damaging in the long run for the flourishing of mankind than a permissive policy. Imagine it gets really crazy, where the home owners association for your neighborhood has legal authority to silence anyone, and someone petty with a spiteful bone to pick decides to get happy with the "purge & ban" button with a neighbor they don't like? Obviously, the room for abuse on both the admin and user sides will need to be very carefully planned and designed. For that, I don't have easy technical solutions to people centered problems... Maybe in this case, a diversity of platforms and owners would be better than one large shared cloud? But maybe that's just trading one problem for another.
Anyways, this is the kind of stuff my mind tends to wander towards when I'm stuck in traffic.
Comments
Quote: Of course, this all assumes that the images are going to be exactly the same (at the binary level) with the exception of text. If even one bit is off, then the generated hash value will be completely different. This could be particularly problematic with lossy image formats such as JPEG, so doing hashes of image quadrants may not necessarily be the best approach.
Correct. MD5 or any other cryptographic or simple hashes are effectively useless here. Re-encoding an image using a different encoder or the same encoder with different settings would produce a vastly different hash in any algorithm, except perceptual hashes.
Perceptual hashes encode and compress the characteristics of an image in such a way that the hashes of similar images will have a small Hamming distance, despite distortions, artifacts, and watermarks. Check out the very excellent pHash library. The image macros in your example were rated as similar using pHashML, though with a large Hamming distance.
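To illustrate the idea, here is a toy "average hash", which is a much simpler relative of what pHash does and not the pHash algorithm itself. The sketch assumes the image has already been shrunk to an 8x8 grayscale grid; average_hash and hamming_distance are illustrative names, and the test images are synthetic gradients.

```python
def average_hash(pixels):
    """64-bit hash of an 8x8 grayscale grid: 1 where a pixel is above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# Synthetic 8x8 image: dark at the top, bright at the bottom.
base = [[row * 32 + col for col in range(8)] for row in range(8)]

# The same image with small per-pixel noise, loosely imitating lossy re-encoding.
noisy = [
    [max(0, min(255, value + (2 if (r + c) % 2 == 0 else -2)))
     for c, value in enumerate(row)]
    for r, row in enumerate(base)
]

# A genuinely different image: dark on the left, bright on the right.
different = [[col * 32 + row for col in range(8)] for row in range(8)]

print(hamming_distance(average_hash(base), average_hash(noisy)))      # 0
print(hamming_distance(average_hash(base), average_hash(different)))  # 32
```

The noise that would completely scramble an MD5 digest leaves this hash untouched, while a different picture lands half the bits away.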
On to compression, the problem with hashes is they are generally one-way and have collisions. And in addition to collisions, the image will need to be reconstructed somehow. Even if the image is composed of deltas off of existing hashes, the data that makes the uniqueness of that image must be encoded and stored somewhere. Requesting an image or retrieving an image from storage will require a vast database of hashes and their data to reconstruct all possible images, which would be infeasible to store or expensive to construct.
On 1/21/2019 at 8:39 PM, fastcall22 said:Correct. MD5 or any other cryptographic or simple hashes are effectively useless here. Re-encoding an image using a different encoder or the same encoder with different settings would produce a vastly different hash in any algorithm, except perceptual hashes.
Yeah, while hashes are one way, you can just create an indexed dictionary of key value pairs, where the key is the hash value and the value is the image or the quadrant of an image. If you store 1,000,000 memes on disk, you can save disk space by not storing the parts which are common across all 1,000,000 memes, and instead just have a reference link via the hash value (the key lookup in a dictionary).
The challenge comes with distortions in the image and recognizing that it's an anomaly that can be ignored. So, traditional hashing techniques, such as MD5, are not suited for this because they're looking at binary comparisons. Using MD5 would just get us back to where we started, which is storing 1,000,000 very similar images on disk with common information which may vary by a single bit. So, we'd want to have a neural network which looks at an image and is a little bit more fuzzy with precision. Changing a single bit between two identical images would not change its 99.9% confidence level at identification. So, if the neural network can tolerate a lot of distortion and variation between binary values but still have a 99.9% confidence level at correctly identifying the image, we can use that identified image's hash value to identify it.
The thing I haven't really considered deeply until now, is that the neural network will necessarily have some sort of threshold value for anomaly tolerance. What if... the tolerance is too high and we resolve two unique images to the same hash value? This would be a collision at the neural net level rather than at the hashing function level. Even with careful tweaking of tolerance values, you'd still run a risk of collisions with the neural network. But, maybe we just bite the bullet and say collisions don't matter? Can humans look at two distinct pictures and think they're the same? If humans can't tell the difference, then maybe we can excuse computers if they can't either? But this assumes that computer vision is on par with human vision as well, and I'm not sure we're there yet.
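The tolerance trade-off shows up even without a neural network. In this sketch I'm assuming images have already been reduced to 16-bit fingerprints (the bit patterns below are invented) and that "same image" means a Hamming distance at or under some tolerance; same_image is a hypothetical helper.

```python
def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two fingerprints differ."""
    return bin(hash_a ^ hash_b).count("1")


def same_image(hash_a, hash_b, tolerance):
    """Treat two fingerprints as the same image if they differ in few enough bits."""
    return hamming_distance(hash_a, hash_b) <= tolerance


# Invented 16-bit fingerprints, for illustration only.
meme = 0b1111000011110000
recompressed = 0b1111000011110001  # same meme after lossy re-encoding (1 bit off)
unrelated = 0b1111000000001111     # a different picture (8 bits off)

# A tight tolerance accepts the re-encoded copy and rejects the other picture.
print(same_image(meme, recompressed, tolerance=2))  # True
print(same_image(meme, unrelated, tolerance=2))     # False

# A loose tolerance also merges the unrelated picture: a collision.
print(same_image(meme, unrelated, tolerance=8))     # True
```

Wherever the threshold is set, some pair of distinct images will sit just inside it, so the real choice is which kind of mistake is cheaper.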
After I wrote my blog post, the following day I saw an article which claimed that something similar to what I described above could change the way databases catalog data:
https://blog.bradfieldcs.com/an-introduction-to-hashing-in-the-era-of-machine-learning-6039394549b0
It's interesting because the approaches to indexing data are similar... and makes me wonder if there is some really good targeted advertising towards me with another machine learning system? A bit ironic in a way. Anyways, read the article if you have time. It's a good one.
On 1/21/2019 at 7:30 PM, Stragen said:This is a bigger problem. Once something is out on the internet, how do you create any assurance that when you request a delete that anyone is going to respect that request? Until an authoritarian system exists for all content on the internet - and connected devices, which is incredibly unlikely to occur due to privacy, data ownership rights, and patent law (to name a few), there will be no true way to ensure that any form of delete request will be adhered to.
Yeah, I think this problem is gradually changing though. The internet is moving into the era of platforms, and you only need to enforce the ban at the platform level. If we treat facebook as a platform, and facebook decides they want to ban a naked picture of me at my request, then they could prevent all facebook users from seeing my naked pictures or re-uploading them. It won't prevent people from keeping the saved images on their own computers. Facebook is one platform of many, so if we hit all of the major content platforms and request that they ban a naked picture of me, then it makes it much harder for people to continue circulating the image. We could even do it at the cloud level -- such as AWS and Azure -- so that even if some third party creates a site which stores data in the cloud, the cloud itself could enforce blocking policies. If a majority of the content on the internet is contained on a handful of platforms, then maybe it's good enough if those platforms follow take down notices on certain kinds of content (such as child porn). If the spread and proliferation is contained and very difficult, then maybe that's sufficient for minimizing damage.
1 minute ago, slayemin said:We could even do it at the cloud level -- such as AWS and Azure -- so that even if some third party creates a site which stores data in the cloud, the cloud itself could enforce blocking policies.
The legal mechanisms to do this at a cloud level still don't exist, and this is what I mean by going down that authoritarian path... you stumble into a minefield of all sorts of proportions.
YouTube is a really good example of where this policy is implemented by a provider, and has challenges: it's attempting to enforce DMCA and copyright conditions on all its content through the takedown system. Not all countries have the same level of copyright adherence, but in order to post content on the system you have to agree to be subject to those rules. It's not uncommon to see false positive violations, resulting in video or audio being removed from the medium... which is where you're heading. This is all nicely wrapped up in the TOS, and so forth...
However, when you try to extend that to CSPs (Cloud Service Providers) and unstructured data within those zones, you're broaching a topic of much more immense levels of concern. Do we only care about the image files that are stored in unstructured data buckets such as S3? Or do we care about what's stored as data blobs within databases, and what if it's encrypted? Do we then go down the path of looking at transitive data in memory? S3 alone is likely to be over an exabyte of data, and potentially growing at a PB a day, likely more. In addition to this, whose jurisdiction applies? US, because it's a US based company? Australian, because the customer is in Australia? European, because the region the server is hosted in is European? What happens if a false positive trigger then wipes an entire PB of data because there was an 80% confidence match on a couple of files, and conversely, what happens when they fail to remove a file that was deemed to be removed - what is the cost of these events commercially? Does it delete from the whole cloud, and if so, what about those services that are 'quarantined' (cause this is a thing)? What if I want to maliciously remove someone else's presence on the internet?
I think it's an interesting idea, but in our current climate of 'privacy' and 'data sovereignty' it's actually a two edged sword to offer a service that scans the content of a cloud and removes the data... because you have to, somehow, inspect information and files that are considered protected to ensure that the content of interest is not present. This leads us down to the whole dystopian authority state condition that we get warned about, and such a powerful interface should not be released to the machines to control completely, and any people reviewing such triggers would have to be trusted implicitly and incorruptible.
14 minutes ago, Stragen said: The legal mechanisms to do this at a cloud level still don't exist, and this is what I mean by going down that authoritarian path... you stumble into a minefield of all sorts of proportions.
Tough questions to answer. My guiding idea is that tech moves fast but policy and government should move very slowly. If tech is going to be implementing policy before any legal mandates are pushed on them, then that too must be implemented very slowly and very carefully, with full consideration for vectors of abuse. Ideally, legal mandate would be a last resort and not necessarily even effective in a globalized digital economy. The US could push a law into effect on content, and if a company doesn't want to follow that law? Just relocate your base of operations to an ISP in Sweden or Russia. A more effective strategy would be to make a moral appeal to the content platforms to opt in on banning certain types of content. If they share ideological values, then their consent would be easy to gain without legal force and it could be applied globally instead of regionally (maybe it's wishful thinking on my part, cynicism hasn't taken over completely yet). I think, at the end of the day, companies are run by people, who are human beings with normative moral values, and everyone generally desires to do what they think is the right thing.
I also think that if & when a tech company creates policy which is enforced by algorithms, it must be done very slowly and carefully. There will be bugs, there will be false positives, there will be mistakes. You gotta create the risk assessment matrix and create a deployment plan and roll back plan accordingly, do a very small phased deployment, have human monitoring to verify accuracy and correctness, all before you open the floodgates on full deployment and integration. This is the only way you can really mitigate the risk of deleting a PB of data on an 80% confidence check.
If someone encrypts data before storing it on a database, then an algorithm trying to cross link common data blocks becomes useless and you would not have much of a chance of censoring the data unless you knew with certainty (from external sources) that it was bad. Then you delete & ban, and the content creator would then only need to apply a new crypto key to the same data and upload it again. You'd just be forced into playing a game of whack-a-mole at that point.
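A quick sketch of why encryption defeats block matching. The cipher here is a toy XOR stream I wrote purely for illustration (it is NOT secure and not any real library's API); the point is only that the same bytes under two different keys hash to unrelated values.

```python
import hashlib


def toy_encrypt(data, key):
    """Toy XOR stream cipher, for illustration only; NOT secure.

    Applying it twice with the same key returns the original data.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


image = b"same image bytes" * 4  # made-up stand-in for a banned image

upload_one = toy_encrypt(image, b"key-one")
upload_two = toy_encrypt(image, b"key-two")

hash_one = hashlib.sha256(upload_one).hexdigest()
hash_two = hashlib.sha256(upload_two).hexdigest()

print(hash_one == hash_two)                          # False
print(toy_encrypt(upload_one, b"key-one") == image)  # True
```

Banning hash_one does nothing to stop upload_two, even though both decrypt to the same image, which is the whack-a-mole game described above.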
14 minutes ago, slayemin said:A more effective strategy would be to make a moral appeal to the content platforms to opt in on banning certain types of content. If they share ideological values, then their consent would be easy to gain without legal force and it could be applied globally instead of regionally (maybe it's wishful thinking on my part, cynicism hasn't taken over completely yet).
A commercial entity's primary driver is making money rather than being morally or ideologically correct...
16 minutes ago, slayemin said:I think, at the end of the day, companies are run by people, who are human beings with normative moral values, and everyone generally desires to do what they think is the right thing
While it is true to say companies are run by people, I think it is a bit of a long bow to draw to say that companies of the size that would be required to do the things we're discussing are driven by the 'norm' of what we accept as moral values. Remember, morals are based on your culture, and cultures differ across the world. Ideologically what is acceptable in the US may not be acceptable in the Middle East, and certainly vice versa.
20 minutes ago, slayemin said:Then you delete & ban, and the content creator would then only need to apply a new crypto key to the same data and upload it again
And now you're in an authoritarian regime, where true encryption is outlawed (cause you need assurance you can reverse it to review the content), privacy is an illusion, and the government can see everything you're thinking about.
48 minutes ago, Stragen said:A commercial entity's primary driver is making money rather than being morally or ideologically correct...
I may be wrong here, but I think if you create a general consensus among companies on acceptable normative moral behavior, those companies who are out of alignment with the norm would face backlash and be pressured into falling in with the pack. I mean, who would want to say, "We make money because we allow child porn on our platform while nobody else does! We're not even ashamed, Yay, us! Yay money!"
You can see some of this normative behavior already in action among most major platforms by looking at their terms of service and finding a lot of common themes on what is and is not allowed on their platforms. Many of those terms are not legally mandated, they're voluntary. I admit, it's a bit of a stretch to imagine that every company will be a good actor with good intentions. It's one of the problems with idealism. But, despite that, I think this appeal to morality and human flourishing may ultimately be more effective than the force of regional laws. It at least allows room for disagreements on what is moral and what constitutes enriching the human experience.
To get philosophical for a moment, I don't buy into moral cultural relativism. I think there's an objective moral norm and it can be found by aggregating the consent and values of all cultures (which is why I think group diversity is valuable). Anything outside of the moral norm is increasingly morally questionable, especially if it's far outside the norm. Someone can ask how moral progress is possible if there's an objective moral norm, and that's a pretty legit criticism. I don't have a good, well thought out short answer for that, and it'd be a bit off track.
9 minutes ago, slayemin said:I may be wrong here, but I think if you create a general consensus among companies on acceptable normative moral behavior, those companies who are out of alignment with the norm would face backlash and be pressured into falling in with the pack.
You're correct in some respects, and I'm playing a little devil's advocate to tease out your thinking on it to see where you end up and build the discussion.
But to deviate very slightly for a moment, the problem I have particularly with moralistic thinking is that it's so subjective. It's a key, absolutely critical, aspect of thought when we start working on machine learning algorithms that have the ability to 'judge', which drives my comment that you need humans to make the final call - introducing the motivation element into the mix. A machine will consider only the historical evidence that has been used to build its models of reality, and if that 'evidence' is biased in any way you may as well not use it. This bias has been seen in insurance calculations, information heatmapping, and forecasting crime and incidents. Hell, it took Microsoft's chat bot only 24 hours of interaction with the human race to become a racist xenophobe.
16 minutes ago, slayemin said:I mean, who would want to say, "We make money because we allow child porn on our platform while nobody else does! We're not even ashamed, Yay, us! Yay money!"
To go back to your point from before, the problem with the current corporate and global environments is not so much bragging about doing something like that, but more "We ask no questions." In an environment where you can choose between an organisation that will "snoop on your data" and an organisation that won't... the majority of people currently would go with the latter over the former, because personal and organisational privacy is more important than the 'unknowns'.
10 minutes ago, Stragen said: You're correct in some respects, and I'm playing a little devil's advocate to tease out your thinking on it to see where you end up and build the discussion.
Yeah, morality is really hard to nail down. The philosophical branch of moral theory tries to do it, but there's been a lot of historical disagreement on which moral theory is right or best. That being said, several moral theories have generally been discarded in favor of better ones due to underlying problems with the theory, with cultural relativism being one of them. I'm a bit of an anti-theorist and moral pluralist myself. I think you can find fundamental problems and dilemmas with any moral theory, but if you take a little bit from all of them, you can get a pretty good approximation. The issue with morality is that it's always super contextual and relies a lot on informed consent, doing no harm, and doing things which promote human flourishing.
The problem with morality and machine learning is that "machine learning", in its current state, is really just a fancy term for statistical approximation algorithms. There isn't anything intelligent going on behind the scenes, so if we try to project our moral frameworks onto a machine learning algorithm, it's about as silly as trying to get the quadratic equation to behave morally. At least, that's the case with the state of machine learning today. The underlying problem is that the current machine learning algorithms are not really models of intelligence applied to machines. A lot of ML researchers are starting to run into a wall where their algorithms have trouble understanding context well enough to make accurate decisions. For example, Facebook uses machine learning algorithms to sort through the content people post, looking for content which violates their ToS, such as promoting violence, racism, Nazi glorification, porn, etc. They get billions of posts a day, so if their machine learning algorithms have a 99% success rate on a billion items, that still leaves ten million items which passed through the filters unnoticed. The ones which pass almost always require understanding the nuanced context of seemingly innocuous content.
Aside from the bias and racism in ML algorithms (which is just a case of garbage in, garbage out), the other problem is that eventually people and institutions are going to rely almost exclusively on algorithmic outputs to make decisions for them. I can walk into a bank and ask for a loan. The loan officer will pull up a form on a computer, ask me a few questions, enter the answers into the form and submit it, and some algorithm somewhere will spit out a result on whether he should give me a loan or not. There's nothing I could say or do which would change the result. Even if I had great charisma, need, or a compelling pitch which appeals to the human behind the desk, it doesn't matter, because he's handed his decision making authority over to an algorithm which does not weigh my other inputs as factors in its decision making process. This stuff can also be carried over to jail sentencing and terms, or to a war zone, allowing machines to decide who lives and who dies, etc. We may be putting too much blind faith in our technology, not understanding that it's being developed by faulty humans.
I was working on my own AI system a few months back which used a moral component as a factor in decision making. It was a relatively rudimentary implementation of goal oriented action planning combined with reinforcement learning, but I haven't had enough time recently to keep sinking effort into it. The approach I was using for measuring the moral value of a decision was based on the innate characteristics of a creature and their tolerance for moral violations. Zombies have no problem with cannibalism, so it has no effect on their decision making process. Humans deciding whether to eat a loaf of bread or not may be influenced by how hungry they are and whether that loaf belongs to them. The moral imperative initially blocks them from eating food that doesn't belong to them, but as you increase their hunger levels, the hunger motivation eventually overrides the moral imperative. The question is, is there an algorithmic way to assess the moral consequences of interactions based on the impact it has on other characters? A utilitarian / consequentialist approach may be the easiest way to implement this, but I'd like to avoid going through and manually specifying the moral weights for each action per creature type. It's hard to come up with something like this without a strong model for intelligence as an underlying foundation.
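To make the hunger-versus-moral-imperative idea concrete, here is a minimal Python sketch. It is not the actual implementation described above, just one way the trade-off could look. The names (Creature, moral_tolerance, action_score, will_eat) and the 0.6 violation weight are all made up for illustration, and the weight is hand-specified, which is exactly the manual step a better design would try to avoid.

```python
from dataclasses import dataclass


@dataclass
class Creature:
    name: str
    hunger: float           # 0.0 (full) .. 1.0 (starving)
    moral_tolerance: float  # 0.0 (strict) .. 1.0 (no moral qualms at all)


def action_score(creature: Creature, motivation: float, violation: float) -> float:
    """Motivation minus the moral penalty this creature actually feels."""
    felt_penalty = violation * (1.0 - creature.moral_tolerance)
    return motivation - felt_penalty


def will_eat(creature: Creature, food_belongs_to_other: bool) -> bool:
    # Hypothetical weight: taking someone else's food is a 0.6 violation
    # on a 0..1 scale. Eating your own food violates nothing.
    violation = 0.6 if food_belongs_to_other else 0.0
    return action_score(creature, creature.hunger, violation) > 0.0
```

With these assumed numbers, a mildly hungry human (hunger 0.3, tolerance 0.0) leaves the other person's bread alone, the same human at hunger 0.9 takes it, and a creature with tolerance 1.0 ignores the moral term entirely.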
As a quick hack, I would try something like separating the image into many different quads, calculating the average, minimum and maximum colour values of each quad, and storing those as my image thumbprint.
To compare a second image, you could say the two are somewhat alike if its averages, minimums and maximums fall within range of those of the first for more than a set percentage of the quads, e.g. 75% of the quads have an average within an acceptable "distance" of the first image's thumbprint.
Kind of like edit distance, but for images.
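A rough Python sketch of that thumbprint idea, under some assumptions of mine: the image is already a grayscale grid (a list of rows of integers from 0 to 255), both images are at least as large as the grid in each dimension, and "within range" means each statistic differs by no more than a fixed tolerance. The function names, the 4x4 grid and the tolerance of 10 are arbitrary picks for illustration.

```python
def quad_stats(pixels, grid=4):
    """Split a grayscale image into grid x grid quads and return
    (average, minimum, maximum) for each quad, row by row."""
    height, width = len(pixels), len(pixels[0])
    stats = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            values = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            stats.append((sum(values) / len(values), min(values), max(values)))
    return stats


def similarity(print_a, print_b, tolerance=10):
    """Fraction of quads whose average, min and max all differ
    by at most `tolerance` between the two thumbprints."""
    matches = sum(
        all(abs(a - b) <= tolerance for a, b in zip(quad_a, quad_b))
        for quad_a, quad_b in zip(print_a, print_b)
    )
    return matches / len(print_a)


def looks_alike(pixels_a, pixels_b, threshold=0.75):
    """True if at least `threshold` of the quads match."""
    return similarity(quad_stats(pixels_a), quad_stats(pixels_b)) >= threshold
```

Because each quad is compared by position, this tolerates a changed caption in one corner but not cropping or rotation. Splitting by fractions of the image size rather than by pixel counts means two differently sized copies still produce the same number of quads.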
This is definitely a solved problem as Google and TinEye are doing it already, but as to how they do it? That's beyond my knowledge and I tend to go for quick hacks like this that work rather than dive into deep complex maths to solve a problem.
Where I work I'm investigating video analytics, facial recognition, and algorithm-assisted image recognition solutions that are available in the market. There are some really rudimentary approaches (pixel matching), more interesting approaches (key item extraction and comparison), and even more complicated ones (object identification and extraction coupled with model development for future comparisons).
The approach you talk about above, with the MD5 hash, will allow you to compare image files to one another. However, MD5 hash fingerprints will only get you so far, and the same goes for pixel matching: both rely on the files being exactly the same in scale, rotation, etc. MD5 hashes should be identical between copies of the same file, but if the internal image metadata varies (not the image itself), the files will no longer be byte-for-byte identical and will fail an MD5 test while being visibly identical. Pixel matching falls apart when the image has been shrunk, and such challenges need to be accounted for. The other approaches extract aspects of similarity out of the image, and use those extracted elements (for example a face, face structure, etc.) to compare elements in images or videos for similarity and then determine a confidence level.
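The metadata point is easy to demonstrate with Python's standard hashlib module. In this small sketch the two "files" are stand-in byte strings I made up, not real image formats: they share the same pixel payload and differ only in a comment header.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Stand-in data, not a real image format: identical "pixels",
# different metadata header in front of them.
pixels = bytes(range(256)) * 4
file_a = b"comment=saved by editor A;" + pixels
file_b = b"comment=saved by editor B;" + pixels

same_pixels = file_a[-len(pixels):] == file_b[-len(pixels):]
same_hash = md5_hex(file_a) == md5_hex(file_b)

print("same pixels:", same_pixels)
print("same MD5:   ", same_hash)
```

The pixel payloads compare equal while the two digests do not, which is why a whole-file hash can only answer "is this the exact same file?" and never "is this the same picture?".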
It's a big field and there are a lot of data scientists, AI developers, and 'big data' analysts out there building these capabilities. If you're looking for 'real world' solutions that are out there, I'd recommend looking up OpenCV, TensorFlow, and cuDNN from an enabling perspective, and then products such as xJera and Qognify.
This is a bigger problem. Once something is out on the internet, how do you get any assurance that anyone is going to respect a delete request when you make one? Until an authoritarian system exists for all content on the internet and on connected devices, which is incredibly unlikely to occur due to privacy, data ownership rights, and patent law (to name a few), there will be no true way to ensure that any form of delete request will be adhered to.