National Coding Week – Golang, TiDB, and AI, Oh my!

This week is National Coding Week. The theme for 2023 (at least according to codingweek.org) is Artificial Intelligence. I heard about it in a LinkedIn post from a former colleague who was taking the week as an opportunity to learn Go.

I just learned that this week is National Coding Week. Now I have a flimsy excuse to start taking baby steps learning Go! #nationalcodingweek

As it happens, Go is used quite a lot at my new employer, PingCAP, in the development of TiDB (a modern distributed SQL database), so I thought that making a few “experimental” changes to TiDB would be a good way to learn a new language and scratch the AI itch at the same time.

Deciding on a Project

For my project I wanted to accomplish the following:

  1. Learn some Go programming
  2. Learn a bit more about TiDB
  3. Do something AI related

After seeing a recent blog post by Daniël van Eeden about extending TiDB with custom functions, I thought it would be interesting to follow that example and add some custom functions to TiDB. I had been looking at some of the capabilities offered by vector databases, and decided to add some experimental functions that would allow users to calculate the distance between vectors.

These kinds of functions are useful for users who are storing vector embeddings and want to calculate the distance between them. Measuring the distance between vectors can be used to calculate the similarity between two embeddings persisted in a database, or to compare persisted embeddings with an embedding generated from a search query.

While there are definitely more state-of-the-art approaches to optimizing performance and scaling for these kinds of vector similarity searches, for this experimental effort I have stuck with the more straightforward cosine similarity and dot product functions.
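
For reference, given two vectors a and b of equal length n, the two measures are defined as:

\[
a \cdot b \;=\; \sum_{i=1}^{n} a_i b_i,
\qquad
\cos(a, b) \;=\; \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\;=\; \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\;\sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]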

Getting Started

To get started with the project I relied on the detailed write-up from Daniël van Eeden, as well as TiDB’s excellent developer guide.

I followed the getting started guide to set up my IDE (Visual Studio Code) on my Mac.

It was remarkably easy to get the TiDB database up and running by following the guide, so I won’t replicate that documentation here.

Learning by Doing

My previous programming experience was primarily in Java, so adjusting to some of the syntactic differences between the two languages was interesting. Taking a look at some of the existing examples in the TiDB code base was very helpful in figuring things out, as was being able to dive into the Go documentation as needed.

The first thing I did was follow the example from the TiDB developer documentation, as these custom functions would be compiled into TiDB. The initial changes I made were to update functions.go so that TiDB would recognize the function names in SQL statements. I also updated builtin.go at this point to point to the function implementations I was going to write (as new builtin functions, compiled into TiDB).
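
To give a sense of what that wiring looks like, here is a rough sketch of the two edits. The constant and class names below are illustrative placeholders rather than the exact identifiers from my change; the real edits are in the commit linked at the end of this post.

Go
// Sketch of the change to functions.go: declare a name constant so the
// parser recognizes the function name used in SQL statements.
const (
	XCosineSim = "x_cosine_sim"
)

// Sketch of the change to builtin.go: add an entry to the existing funcs map
// pointing the name at a function class (min and max argument counts of 2).
var funcs = map[string]functionClass{
	// ... existing builtin functions ...
	ast.XCosineSim: &xCosineSimFunctionClass{baseFunctionClass{ast.XCosineSim, 2, 2}},
}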

Next, I set about defining the functions themselves. This helped me to learn a bit about how functions and methods are defined in Go. I particularly liked the ability to define multiple return values (somewhat reminiscent of returning multiple values as tuples in Haskell) to help with handling error cases. This was an interesting change of pace from Java and its use of exceptions to handle some of these cases.
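
As a small illustration (not taken from the TiDB code base), here is a Go function returning both a value and an error, along with the call-site handling that takes the place of a Java-style try/catch:

Go
package main

import (
	"fmt"
	"strconv"
)

// parseRatio returns both a result and an error; the error is an ordinary
// value the caller is expected to check, rather than an exception to catch.
func parseRatio(s string) (float64, error) {
	v, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid ratio %q: %w", s, err)
	}
	return v, nil
}

func main() {
	if r, err := parseRatio("0.75"); err != nil {
		fmt.Println("error:", err)
	} else {
		fmt.Println("ratio:", r)
	}
}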

I decided to extract the calculation of the cosine similarity into a function operating directly on float64 slices. I originally did this because I was thinking of importing the functions from a library (which would likely have operated on float64 values rather than the internal TiDB types), but after an initial investigation it seemed that these functions were easy enough to implement directly (for the purposes of this experimental project), so I just went ahead and did that:

Go
func Cosine(a []float64, b []float64) (cosine float64, err error) {
	if len(a) != len(b) {
		return 0.0, errors.New("Invalid vectors: two arrays of the same length were expected")
	}
	if len(a) == 0 {
		return 0.0, errors.New("Invalid vectors: two non-zero length arrays were expected")
	}

	sum := 0.0
	s1 := 0.0
	s2 := 0.0

	for i := range a {
		sum += a[i] * b[i]
		s1 += a[i] * a[i]
		s2 += b[i] * b[i]
	}
	return sum / (math.Sqrt(s1) * math.Sqrt(s2)), nil
}
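
The dot product calculation follows the same shape. Here is a sketch of that counterpart (the exact version, along with the tests, is in the commit linked below):

Go
// DotProduct mirrors Cosine above: the dot product of two float64 slices,
// with the same validation of the inputs.
func DotProduct(a []float64, b []float64) (product float64, err error) {
	if len(a) != len(b) {
		return 0.0, errors.New("Invalid vectors: two arrays of the same length were expected")
	}
	if len(a) == 0 {
		return 0.0, errors.New("Invalid vectors: two non-zero length arrays were expected")
	}

	for i := range a {
		product += a[i] * b[i]
	}
	return product, nil
}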

I also needed to decide how to store the vector embeddings in the database. To make life easy on myself I decided to forgo adding a custom type and instead use the JSON data type that is available in TiDB, so the functions operate on JSON arrays of numbers. To do this I used some of the useful capabilities exposed in TiDB types to convert from the JSON type to a slice of float64 in Go:

Go
func AsFloat64Array(binJson types.BinaryJSON) (values []float64, err error) {
	if binJson.TypeCode != types.JSONTypeCodeArray {
		err = errors.New("Invalid JSON Array: an array of numbers were expected")
		return nil, err
	}

	var arrCount int = binJson.GetElemCount()
	values = make([]float64, arrCount)
	for i := 0; i < arrCount && err == nil; i++ {
		var elem = binJson.ArrayGetElem(i)
		values[i], err = types.ConvertJSONToFloat(fakeSctx, elem)
	}
	return values, err
}

All the changes I made (including some basic test cases) are available on GitHub (see initial commit).

Phase 1 Complete

All of the changes I have made now enable me to easily calculate the cosine similarity and dot product of two JSON arrays in TiDB using SQL. Using the MySQL command line client (TiDB is wire-compatible with MySQL), I can run SQL like the following:

SQL
mysql> SELECT 1 AS r, x_cosine_sim('[1.0, 2.0, 3.0, 4.0 ,5.0]','[1.0, 2.0, 3.0, 4.0, 5.0]') AS result   
    -> UNION
    -> SELECT 2 AS r, x_cosine_sim('[1.0, 2.0, 3.0, 4.0 ,5.0]','[-1.0, -2.0, -3.0, -4.0, -5.0]') AS result
    -> UNION
    -> SELECT 3 as r, x_cosine_sim('[1.0, 2.0, 3.0, 4.0 ,5.0]','[-1.0, 2.0, 3.0, 4.0, -5.0]') AS result
    -> ORDER BY r ASC;
+------+---------------------+
| r    | result              |
+------+---------------------+
|    1 |                   1 |
|    2 |                  -1 |
|    3 | 0.05454545454545454 |
+------+---------------------+
3 rows in set (0.00 sec)

Reflections…

For some people (like myself), having a small project to work on is a useful tool for learning a new programming language. While the code I have shared is not professional quality (and could definitely be improved and made more idiomatic), it served its purpose of getting me more familiar with the Go language and its available tooling. Thanks to Chris and Daniël for the indirect inspiration for this project!

Scheduled Tasks – YARA Post #8

Today I’ll share just a short snippet that I used to look for some specific scheduled tasks on a Windows system. Luckily Windows creates XML files in a folder like C:\Windows\System32\Tasks. These files contain an XML representation of the scheduled tasks, and it is these files that I am scanning with YARA.

Here’s a quick example of the rule:

// Detects the scheduled task on a Windows machine that runs Ransim
rule RansimTaskDetect : RansomwareScheduledTask {
    meta:
        author = "Ben Meadowcroft (@BenMeadowcroft)"
    strings:
        // Microsoft XML Task files are UTF-16 encoded so using wide strings
        $ransim = "ransim.ps1 -mode encrypt</Arguments>" ascii wide nocase
        $task1  = "<Task" ascii wide
        $task2  = "xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">" ascii wide

    condition:
        all of them
}

The scheduled task runs a short PowerShell script that simulates some basic ransomware behavior, and this rule just matches the XML file for that task. The task file is encoded in UTF-16, so the $task1 and $task2 strings reference common strings found within the XML file (the start of the <Task element, and the XML namespace used to define the schema); the ascii wide modifiers search for each string in both ASCII and wide (two-byte) form. The remaining string looks for the invocation of the script as an argument to the task, ignoring case.

If I were looking for the presence of a task on a live system then of course I would have other tools I could use, such as schtasks /query. However, as I am often operating on the backups of a system, this file-based approach can be very helpful: it doesn’t rely on the availability of the primary system when I want to identify whether a scheduled task was present at some historical point in time.

Analyzing ZIP (OOXML) Files with YARA (part 3) – YARA Post #7

My prior posts about examining ZIP archives have covered matching the file names within a ZIP archive, as well as matching the pre-compression CRC values of the files within the archive. In this post I am going to reference an interesting example of parsing the OOXML format used by modern Microsoft Office products. This format is essentially a ZIP archive that contains a defined set of files describing the Office document.

Aaron Stephens at Mandiant wrote a blog post called “Detecting Embedded Content in OOXML Documents”. In that post Aaron shared a few different techniques used to detect and cluster Microsoft Office documents. One of these examples was detecting a specific PNG file embedded within documents; the image was used to guide the user towards enabling macros. The presence of the image in a phishing document could be used to cluster related attacks.

Given the image file’s CRC, its size, and the fact that it was a PNG file, the author was able to create a YARA rule that would match if this image file was located within the OOXML document (essentially a ZIP archive). This rule approaches the ZIP file a little differently than in my prior couple of posts. The author skips looking for the local file header signature and instead references the CRC ($crc) and uncompressed file size ($ufs) hex strings directly to narrow down the match, and also checks that the file name field ends with the ".png" extension.

rule png_397ba1d0601558dfe34cd5aafaedd18e {
    meta:
        author = "Aaron Stephens <[email protected]>"
        description = "PNG in OOXML document."

    strings:
        $crc = {f8158b40}
        $ext = ".png"
        $ufs = {b42c0000}

    condition:
        $ufs at @crc[1] + 8 and $ext at @crc[1] + uint16(@crc[1] + 12) + 16 - 4
}

In this example the condition uses @crc[1] as the base from which the offsets are calculated, unlike our prior examples where the offsets were based on the start of the local file header. The at operator tests for the presence of the other strings at a specific offset (relative to the CRC value in this case).

An alternative approach to consider is using the wildcard character ? in the hex string. This allows us to match on the CRC and uncompressed file size fields together while skipping over the 4 bytes used to store the compressed file size field, and then validate that the four-letter .png extension is at the end of the file name field.

rule png_alt {
    strings:
        $crc_ufs = {f8158b40 ???????? b42c0000}
        $ext = ".png"

    condition:
        $ext at @crc_ufs[1] + uint16(@crc_ufs[1] + 12) + 16 - 4
}

Analyzing ZIP Files with YARA (part 2) – YARA Post #6

In my first exploration of analyzing ZIP files with YARA I covered how we could create a YARA rule that matched on specific file names within the ZIP archive. In this post we’ll cover a few other fields that may be of interest.

One interesting example is looking for encrypted ZIP archives. Here’s the Twitter post from Tyler McLellan that showed me how to do this with YARA:

Check if ZIP Archive is Encrypted with YARA – from @tylabs

This snippet first checks whether the file starts with a local file header record, uint16be(0) == 0x504B. It then tests whether the first bit of the “general purpose bit flag” is set by performing a bitwise AND against the flag value and checking the result: uint16(6) & 0x1 == 0x1. This first bit indicates whether the file is encrypted or not:

4.4.4 general purpose bit flag: (2 bytes)

Bit 0: If set, indicates that the file is encrypted.

section 4.4.4 of the .ZIP File Format Specification
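
If it helps to see the same check outside of YARA, here is a small Go sketch (mine, not from the tweet above) that reads the general purpose bit flag of the first local file header and tests bit 0:

Go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: zipenc <file.zip>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil || len(data) < 8 {
		fmt.Println("could not read file")
		return
	}
	// A ZIP archive typically begins with a local file header, "PK\x03\x04",
	// which is 0x04034B50 when read as a little-endian uint32.
	if binary.LittleEndian.Uint32(data[0:4]) != 0x04034B50 {
		fmt.Println("does not start with a local file header")
		return
	}
	// The general purpose bit flag is the 2-byte field at offset 6;
	// bit 0 indicates that the file is encrypted.
	flags := binary.LittleEndian.Uint16(data[6:8])
	fmt.Println("first entry encrypted:", flags&0x1 == 0x1)
}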

Another interesting field in the ZIP archive’s local file header is the CRC32 of the file. This is stored at offset 14 from the start of the header and is 4 bytes long. If you are looking for specific files then this can help narrow things down. It should be noted that the CRC is not a cryptographic hash, but as it is stored in the header it is very simple to check for.

rule matchFileCRC {
   strings:
        $zip_header = {50 4B 03 04}
        $crc1 = {D8 39 6B 82}
        $crc2 = {B1 81 05 76}
        $crc3 = {C2 4A 02 94}
   condition:
        // iterate over local file zip headers
        for any i in (1..#zip_header):
        (
            // match any of the CRC32 values
            for any of ($crc*):
            (
                $ at @zip_header[i]+14
            )
        )
}

The line for any of ($crc*) checks all of the defined CRC values, and the line $ at @zip_header[i]+14 causes the rule to match if any of the CRCs we are looking for appears 14 bytes after the start of a local file header.

Learning from Others – YARA Post #5

Taking a quick break from my Zip Archive adventures, one thing I’d be remiss not to mention is the community sharing that happens around YARA. As well as the specific YARA rules that people share, there are also a lot of insights into how to use YARA, how to craft or generate rules, and lots of other creative uses of the tool.

One example of this is the activity around #100DaysofYARA on Twitter last year, which was kicked off by Greg Lesnewich. Looking through many of the tweets mentioning this hashtag will certainly show some interesting possibilities for using YARA. I’d recommend following that hashtag on Twitter and Mastodon, seeing what comes up on January 1st 2023, and sharing your own experiments!

Reading ZIP (JAR) Files with YARA (part 1) – YARA Post #4

ZIP files have a well-defined structure, which makes it possible to use YARA to match certain characteristics of files stored within a ZIP file. For example, file name information is stored in both the local file headers and the central directory file headers within the archive. Wikipedia has a decent write-up on the ZIP file format structure.

I first started experimenting with this to examine the contents of Java archives (JAR files, which are essentially ZIP archives) for specific files (in this case bundled log4j JARs). To do this I first defined the marker for the local file header, $zip_header = {50 4B 03 04}, in the strings section, followed by the files I was interested in locating in the ZIP. See the rule below:

rule vuln_log4j_jar_name_injar : log4j_vulnerable {
    strings:
        $zip_header = {50 4B 03 04}
        $a00 = "log4j-core-2.0-alpha1.jar"
        $a01 = "log4j-core-2.0-alpha2.jar"
        // …
        $a41 = "log4j-core-2.14.0.jar"
        $a42 = "log4j-core-2.14.1.jar"
            
    condition:
        // iterate over local file zip headers
        for any i in (1..#zip_header):
        (
            // match any of the file names
            for any of ($a*):
            (
                $ in (@zip_header[i]+30..@zip_header[i]+30+uint16(@zip_header[i]+26))
            )
        )
}

In the condition section we introduce some new capabilities. We iterate over the string matches for the $zip_header string. The variable #zip_header (note the #) gives us the count of the matches; for any i in (1..#zip_header):(…) iterates over the matches (populating i for each match), while the @zip_header[i] syntax (note the @) lets us reference the offset in the file of each match.

From the ZIP format specification we know that the file name is at offset 30 from the start of the local file header, and that the length of the file name is stored in the two bytes at offset 26. Given this information we can read the length of the file name using uint16(@zip_header[i]+26), and then check whether any of the file names we are looking for (referenced by $a*) falls within the file name field, which begins at offset 30 and runs for that length.
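
To make those offsets concrete, here is a small Go sketch (separate from the YARA rule) that walks the local file headers in an archive and prints each entry’s file name using the same two fields:

Go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: zipnames <file.zip>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("could not read file:", err)
		return
	}

	// Local file header signature "PK\x03\x04".
	sig := []byte{0x50, 0x4B, 0x03, 0x04}
	off := 0
	for {
		i := bytes.Index(data[off:], sig)
		if i < 0 {
			break
		}
		hdr := off + i
		if hdr+30 > len(data) {
			break
		}
		// File name length is the 2-byte field at offset 26 from the header;
		// the file name itself starts at offset 30.
		nameLen := int(binary.LittleEndian.Uint16(data[hdr+26 : hdr+28]))
		if hdr+30+nameLen <= len(data) {
			fmt.Println(string(data[hdr+30 : hdr+30+nameLen]))
		}
		off = hdr + 4
	}
}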

In part 2 I’ll dig into some other interesting things we can look for in the zip headers.

XProtect & YARA – YARA Post #3

In yesterday’s post I covered an example of a YARA rule from AT&T’s Alien Labs that detects payloads used by BlackCat ransomware. Today I’ll take a quick look at one of the YARA rules used by Apple in XProtect to help protect macOS devices.

Apple’s XProtect uses YARA rules to deliver “signature-based detection and removal of malware” as described in their security guide Protecting against malware in macOS.

On my version of macOS these signatures are located in /Library/Apple/System/Library/CoreServices/XProtect.bundle/Contents/Resources/XProtect.yara. In this file Apple uses a technique called private rules to define reusable YARA rules that can be referenced in the conditions of other rules.

For example, one of these rules identifies Portable Executables:

private rule PE
{
    meta:
        description = "private rule to match PE binaries"

    condition:
        uint16(0) == 0x5a4d and uint32(uint32(0x3C)) == 0x4550
}

This rule is a little different from the condition used in yesterday’s BlackCat rule, which just used uint16(0) == 0x5A4D to identify the executable. Here the rule Apple uses does a couple of lookups to identify that this is a PE file, using specific offsets defined in the PE file format specification:

After the MS-DOS stub, at the file offset specified at offset 0x3c, is a 4-byte signature that identifies the file as a PE format image file. This signature is “PE\0\0” (the letters “P” and “E” followed by two null bytes).

PE Format

If you are interested in seeing how Apple identifies Mach-O files (the executable format used by macOS and iOS), they have a private rule for that as well:

private rule Macho
{
    meta:
        description = "private rule to match Mach-O binaries"
    condition:
        uint32(0) == 0xfeedface or uint32(0) == 0xcefaedfe or uint32(0) == 0xfeedfacf or uint32(0) == 0xcffaedfe or uint32(0) == 0xcafebabe or uint32(0) == 0xbebafeca

}

These private rules can be used within the condition sections of the public rules that XProtect also uses. For example, in the XProtect_MACOS_51f7dde rule we can see that the rule’s condition first references the Macho private rule, then an upper bound on the file size, and then the presence of multiple strings (which were defined using the hexadecimal format in the strings section of the rule):

rule XProtect_MACOS_51f7dde
{
    meta:
        description = "MACOS.51f7dde"
    strings:

        $a = { 63 6F 6D 2E 72 65 66 6F 67 2E 76 69 65 77 65 72 }
        $b = { 53 6D 6F 6B 65 43 6F 6E 74 72 6F 6C 6C 65 72 }
        $c1 = { 75 70 64 61 74 65 53 6D 6F 6B 65 53 74 61 74 75 73 }
        $c2 = { 70 61 75 73 65 53 6D 6F 6B 65 3A }
        $c3 = { 72 65 73 75 6D 65 53 6D 6F 6B 65 3A }
        $c4 = { 73 74 6F 70 53 6D 6F 6B 65 3A }
    condition:
        Macho and filesize < 2MB and all of them
}

This shows how commonly used elements of a rule can be defined separately and then composed with other conditions to simplify the resulting rules.

YARA BlackCat Payload Example – YARA Post #2

In my first post in this series, Day 1 of the 12 Days of YARA, I shared some resources that can help you get started with YARA. YARA is a rules-based language that allows for pattern matching against a file (or a process’s memory). It is a useful tool that can describe a malware sample in a way that matches both a specific sample and (if properly generalized) similar samples. A generalized rule may continue to match related samples even if they change slightly from version to version; this is in contrast to cryptographic hash based approaches to detection (SHA1, SHA256, MD5), where minor changes in the malware produce an entirely different hash value.

YARA rules are quite simple in terms of structure. A rule typically consists of strings that will match file contents, and boolean conditions that determine whether the rule is matched or not. There’s a good overview of how to write a YARA rule in the documentation that is worth reading if you are unfamiliar with how the rules are structured.

When I learn something I like to learn by reading the documentation as well as examples published by others. This helps me get a better sense of the idiomatic use of a language. Luckily there are lots of examples published that you can use to see different approaches people have taken.

In this post I’ll reference one of the YARA rules provided by AT&T Alien Labs in their post about BlackCat ransomware earlier this year:

rule BlackCat : WindowsMalware {
   meta:
      author = "AlienLabs"
      description = "Detects BlackCat payloads."
      SHA256 = "6660d0e87a142ab1bde4521d9c6f5e148490b05a57c71122e28280b35452e896"
    strings:
        $rust = "/rust/" ascii wide
        $a0 = "vssadmin.exe Delete Shadows /all /quietshadow" ascii
        $a1 = "bcdedit /set {default}bcdedit /set {default} recoveryenabled No" ascii wide
        $a2 = "Services\\LanmanServer\\Parameters /v MaxMpxCt /d 65535" ascii wide
        $a3 = ".onion/?access-key=${ACCESS_KEY}" ascii wide
        $b0 = "config_id" ascii
        $b1 = "public_key" ascii
        $b2 = "extension" ascii
        $b3 = "note_file_name" ascii
        $b4 = "enable_esxi_vm_kill" ascii
        $b5 = "enable_esxi_vm_snapshot_kill" ascii
    condition:
        uint16(0) == 0x5A4D and filesize < 5MB and $rust and 2 of ($a*) and 3 of ($b*)

}

Taking a few moments to examine this, a few things begin to pop out.

First, the meta section provides information about the origin of the rule: in this case the authorship, a description, and the hash of the Windows BlackCat payload. While the hash in the metadata is not used in the rule itself, it is useful for looking up more information about the file in other sources (like VirusTotal).

Second, in the strings section the author has grouped related sets of strings together with the same prefix ($a, $b). When we look in the condition section we can see that YARA allows the use of wildcards to identify these different sets of strings and apply different conditional logic to them.

Third, in the condition section we can see how the author uses both the strings defined earlier, as well as some other conditional statements:

  • uint16(0) == 0x5A4D. The uint16(0) reads the unsigned 16-bit integer at the start of the file (offset 0) and compares it with 0x5A4D, the magic number indicating this is an executable file on Windows. YARA does include a PE module that allows for fine-grained inspection of the attributes of portable executable files, but looking for specific markers at a known offset (as in this case) may be more efficient from an execution perspective if the rule doesn’t need that level of granularity.
  • filesize < 5MB. Here the authors include a file size limit; this helps optimize the processing of very large files (above the 5MB specified) so that the scan can concentrate on the right set of files.
  • $rust and 2 of ($a*) and 3 of ($b*). Here the author uses the strings defined earlier. These sets of strings relate to characteristics that are less likely to change over time. For example, commands to delete shadow copies are a common characteristic of ransomware, so strings related to that operation are less likely to change than strings related to non-core aspects of the malware.

Stay tuned for Day 3 tomorrow!