Data Analysis for Network Security using Julia — Part 2

Pankaj Telang
8 min readFeb 5, 2021
Source: Pixabay

This is the second part of the blog that demonstrates elementary data analysis for network security on a synthetic dataset from Wildcard 400–2019 Trendmicro CTF. You can find the first part here.

Question 4: Private C&C channel

We’re always running a low-grade infection; some internal machines will always have some sort of malware. Some of these infected hosts phone home to C&C on a private channel. What unique port is used by external malware C&C to marshal its bots?

Answer 4:

We need to find a port that is used by an external C&C server to communicate with several internal machines (bots). Such a port will not be well-known and will be used by only (one or a) few external C&C servers.

We compute the number of external IPs that are communicating with internal host using each port.

src_count_by_port = combine(groupby(df_ext_int, :port), 
:src => (x -> size(unique(x))[1])
=> :src_count)
sort!(src_count_by_port, :src_count)

The code employs groupby for grouping the flows in df_ext_int by port, and combine to compute the count of distinct external source IP address for each group. The expression: :src => (x -> size(unique(x))[1]) => :src_count
computes the count of distinct src. The result is stored in column src_count.

The next line sorts src_count_by_port by src_count. Fig. 1 shows a few records from src_count_by_port.

Fig. 1: Sample records from src_count_by_port

Port 113 is unlike any other port. It is used by exactly one external IP address to communicate with internal machines. This is most likely the port used by external malware C&C to marshal its bots. Fig. 2 shows a few records from df_ext_int, where the port is 113.

df_ext_int[df_ext_int.port .== 113, :]
Fig. 2: Sample records from df_ext_int with port 113

The external C&C IP address is 15.104.76.58.

Question 5: Internal P2P

Sometimes our low-grade infection is visible in other ways. One particular virus has spread through a number of machines, which now are used to relay commands to each other. The malware has created an internal P2P network. What unique port is used by the largest internal clique, of all hosts talking to each other?

Answer 5:

To answer this question, we need to compute cliques from the network using Julia Graphs package.

using Graphsfunction max_clique_size(src, dst)
x = union(Set(src), Set(dst))
n = length(x)
g = simple_graph(n, is_directed=false)
broadcast((x, y) -> add_edge!(g, x, y), src, dst)
result = 0
for x in maximal_cliques(g)
if length(x) > result
result = length(x)
end
end
return result
end

The max_clique_size function constructs a graph for the given source and destination arrays as input. The function expects the source and destinations to be arrays of integers, that is, each source and destination is identified using an integer. The function employs maximal_cliques function from Graphs package to compute the cliques. Finally, it computes the size of the largest clique found in the graph.

Next, we encode the source and destination IPs in dst_int_int as integers.

unique_ips = union(Set(df_int_int.src), Set(df_int_int.dst))lookup = Dict()
for (i, ip) in enumerate(unique_ips)
push!(lookup, (ip => i))
end

The unique_ips is a set of source and destination hosts and lookup is a dictionary with key as host IP address and value as the encoded integer. We use the lookup dictionary to encode src and dst in df_int_int, and the unique function to dedupe the result.

df_int_int_encoded = transform(df_int_int, 
[:src, :dst] .=> ByRow(x -> lookup[x]) .=>
[:src, :dst])[!, [:src,:dst,:port]]
unique!(df_int_int_encoded)

Since we need to find the largest clique by port, we group df_int_int_encoded by port, and use the max_clique_size function to compute the maximum clique size for each port.

df_int_int_encoded_by_port = groupby(df_int_int_encoded, :port)clique_sizes_df = combine(df_int_int_encoded_by_port, 
[:src, :dst] => ((src, dst) -> max_clique_size(src, dst))
=> :max_clique_size)

The largest clique in clique_sizes_df is for port 83, which is the answer to this question.

Fig. 2.5 shows a few records from df_int_int that use port 83.

Fig. 2.5: Sample records from df_int_int with port 83

Question 6: Malware Controller

We were just blacklisted by an IP reputation service, because some host in our network is behaving badly. One host is a bot herder receiving C&C callbacks from its botnet, which has little other reason to communicate with hosts in the enterprise. What is its IP?

Answer 6:

We need to find one internal host that is receiving C&C callbacks from a botnet, that is several external IPs.

We first compute the number of distinct internal hosts accessed by each external IP. The result is sorted by the number of distinct internal hosts.

dst_count_by_src = combine(groupby(df_ext_int, :src), 
:dst => (x -> size(unique(x))[1]) => :dst_count)
sort!(dst_count_by_src, [order(:dst_count, rev=false)])

Fig. 3 shows a few records from dst_count_by_src.

Fig. 3: Sample records from dst_count_by_src

The dataframe dst_count_by_src contains several external IPs that communicate with exactly one internal host. Next, we need to find the internal host(s) that these external IPs communicate with.

First, we extract the list of external IPs from dst_count_by_src dataframe where the number of distinct internal hosts is 1.

dst_ips_list = 
dst_count_by_src[dst_count_by_src.dst_count .== 1, :].src

Next, we query the records in df_ext_int with src in the list dst_ips_list.

df_ext_int_botnet = 
filter(row -> row.src ∈ dst_ips_list, df_ext_int)

Notice the symbol ∈ in the above code. Julia allows unicode characters as operators. Specifically, ∈ is the in operator for a set, which returns true if an element is in the set.

Fig. 4 shows a few records from df_ext_int_botnet.

Fig. 4: Sample records from df_ext_int_botnet

All external IPs are communicating with the internal host 14.45.67.46. We confirm if 14.45.67.46 is the only internal host in df_ext_int_botnet.

unique(df_ext_int_botnet.dst)

The result of the above statement is: 14.45.67.46.

Question 7: Infected Host

One host is part of the botnet from Question 6, what is its IP?

Answer 7:

We need to find internal hosts that communicate with the host 14.45.67.46 and port 27 that we found in Answer 6.

df_int_int_botnet = df_int_int[(df_int_int.dst .== "14.45.67.46") .& 
(df_int_int.port .== 27), :]

Fig. 5 shows the all of the records from df_int_int_botnet.

Fig. 5: All records from df_int_int_botnet

There are two hosts that are part of the botnet: 14.51.84.50 and 13.42.70.40.

Question 8: Botnet Inside

There is a stealthier botnet in the network, using low frequency periodic callbacks to external C&C, with embedded higher frequency calls. What port does it use?

Answer 8:

To answer this question, we will assume that the bytes sent by the botnet hosts to the external C&C servers will have low variability.

We compute the standard deviation of bytes sent to external IPs, and sort the result by the standard deviation.

df_int_ext_stdbytes_by_dst = combine(groupby(df_int_ext, :dst), 
:bytes => std => :bytes_std)
sort!(df_int_ext_stdbytes_by_dst, [order(:bytes_std)])

Fig. 6 shows a few records from df_int_ext_stdbytes_by_dst.

Fig. 6: Sample records from df_int_ext_stdbytes_by_dst

The first 6 external IPs have standard deviation of 0. They are most likely the external IPs that are part of the botnet.

Next, we query internal hosts that communicate with the 6 external IPs.

external_ips_list = bydst_bytes[bydst_bytes.bytes_std .== 0, :dst]
df_int_ext_botnet = filter(row -> row.dst ∈ r, df_int_ext)

Fig. 7 shows a few records from df_int_ext_botnet.

Fig. 7: Sample records from df_int_ext_botnet

All of the above records seem to use port 51. We confirm it by querying unique ports from df_int_ext_botnet.

unique(df_int_ext_botnet.port)

Question 9: Lateral Brute

Once a machine is popped, it’s often used to explore what else can be reached. One host is being used to loudly probe the entire enterprise, trying to find ways onto every other host in the enterprise. What is its IP?

Answer 9:

We count the number of destination hosts accessed by each of the source hosts, and sort the result by the destination host count in the reverse order.

df_int_int_bysrc_dstcount = 
combine(groupby(df_int_int, :src),
:dst => (x -> size(unique(x))[1]) => :dst_count)
sort!(df_int_int_bysrc_dstcount, order(:dst_count, rev=true))

Fig. 8 shows a few records from df_int_int_bysrc_dstcount.

Fig. 8: Sample records from df_int_int_bysrc_dstcount

It is clear that the host 13.42.70.40 is scanning a significantly large number of internal hosts.

Question 10: Lateral Spy

One host is trying to find a way onto every other host more quietly. What is its IP?

Answer 10:

A host that is trying to find a way onto a target will try to connect to the target on several ports. Some of these ports will be uncommon and not used by other sources to connect to the target. To answer this question, we look for such uncommon (port, target), and then we will query the source hosts that are trying to connect to these uncommon (port, target) pairs.

We first filter out the hosts from previous answers from df_int_int since we know that the answer to this question is different.

prev_answers = ["13.37.84.125", "12.55.77.96", "12.30.96.87",
"13.42.70.40", "14.51.84.50"]
filtered_df_int_int = filter(row -> row.src ∉ prev_answers,
df_int_int)

Next, we first count the number of distinct source hosts that connect to a (port, target) pair.

df_int_int_bydstport_srcount = 
combine(groupby(filtered_df_int_int, [:port, :dst]),
:src => (x -> size(unique(x))[1]) => :src_count)

Then we look for (port, target) pairs where the src_count is 1, that is, where a single source connected to the (port, target) pair.

df_int_int_bydstport_srcount_1 = 
df_int_int_bydstport_srcount[
df_int_int_bydstport_srcount.src_count .== 1,
[:dst, :port]]

Fig. 9 shows a few records from df_int_int_bydstport_srcount_1. This shows that a single source connected to 14.36.98.25 on several different ports.

Fig. 9: Sample records from df_int_int_bydstport_srcount_1

Next, we join filtered_df_int_in and tdf_int_int_bydstport_srcount_1 to find the source.

result = innerjoin(filtered_df_int_in, 
tdf_int_int_bydstport_srcount_1,
on = [:port => :port, :dst => :dst])

Fig. 10 shows the result dataframe.

Fig. 10: Sample records from results dataframe

Finally, we find the unique source host.

unique(result.src)

The lateral spy host is: 12.49.123.62.

In conclusion, this blog demonstrated how network activities data can be analyzed to uncover cybersecurity threats. It also showed the use of Julia language for implementing the analysis.

--

--