There are several methods we can use to transfer the classification to our customers. In order to provide the highest level of protection, we chose these three methods.
||How many users
||Delay in classification
||How many file versions
Avast checks every executable before it’s executed in a customer’s machine. When no signature from the current threat database matches the file, the FileRep service is queried. If the returned user count (prevalence) is anomalously low, the executable ends up in the Avast Sandbox. If the executable trace log does not match any known threat, the real-time classifier is invoked. Avast extracts the feature vector, submits it to a cloud-based service, and waits for the response. Most of the low prevalence files are benign. Out of approximately 250,000 requests daily, about 4,000 are classified as malicious.
Once a file is classified as malicious and our internal systems check that it is safe to detect this particular file worldwide, a simple flag is set in the FileRep service. Every Avast client that encounters that particular file instantly blocks it and reports it as FileRepMalware.
Old string-based signatures work well when properly executed, and are especially good at generalizing many variants of a threat. But string-based signatures require an analyst and time. In today’s threat landscape, with all its variants and interconnectivity, there just aren’t enough people or enough time to keep pace. A new approach was required that generalizes like string signatures do, but doesn’t rely on human intervention or take as much time.
Enter Evo-gen. Evo-gen leverages the distance function to create a set of similar feature vectors which allows us to build a rule set from those features. Once we have a set of very similar feature vectors from the distance function, we can start to pick features that make them similar and build a rule set from those features. It is somewhat similar to rule-set generation in decision trees, but the objectives are different. To boost the generalization, we can pick as few rules as possible, while keeping hits in the clean set at zero. But there are many ways to pick 20 rules from 100 possible ones - 5.36x1020, or 536 billion, numerically speaking. We’re currently taming the combinatorial explosion with a stochastic approach, which provides better results than Scavenger approaches. This is where the speed of the GPUs is very important again.
While trying to understand how the Evo-gen rule sets (blue) affect the signature “ecosystem,” we produced the following visualization. Each blob represents a different rule set or signature, and the size of the blob is proportional to the number of detected variants.