Comprehensive, Multi-Source Cyber-Security Events

This data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory’s corporate, internal computer network.

The data sources include Windows-based authentication events from both individual computers and centralized Active Directory domain controller servers; process start and stop events from individual Windows computers; Domain Name Service (DNS) lookups as collected on internal DNS servers; network flow data as collected at several key router locations; and a set of well-defined red teaming events that represent bad behavior within the 58 days. The data set is approximately 12 gigabytes compressed across the five data elements and contains 1,648,275,307 events in total for 12,425 users, 17,684 computers, and 62,974 processes.

Specific users that are well-known and system related (SYSTEM, Local Service) were not de-identified, though any well-known administrator accounts were still de-identified. In the network flow data, well-known ports (e.g. 80, 443) were not de-identified. All other users, computers, processes, ports, times, and other details were de-identified as a unified set across all the data elements (e.g. U1 is the same U1 in all of the data). The specific timeframe used is not disclosed for security purposes. In addition, no data that allows association outside of LANL’s network is included. All data starts with a time epoch of 1 using a time resolution of 1 second. In the authentication data, failed authentication events are only included for users that had a successful authentication event somewhere within the data set.
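
Because times start at 1 and advance in 1-second steps, an event time can be mapped to a day index within the 58-day window. The following is a minimal sketch, not part of the data set: the function name day_and_offset is hypothetical, and it assumes every day spans exactly 86,400 seconds. Since the real timeframe is not disclosed, the second-of-day value is relative to the start of the capture and need not line up with local midnight.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400  # assumption: uniform 24-hour days


def day_and_offset(event_time: int) -> Tuple[int, int]:
    """Map a 1-based event time in seconds to (zero-based day, second within that day)."""
    if event_time < 1:
        raise ValueError("event times in this data set start at 1")
    elapsed = event_time - 1  # seconds since the first recorded second
    return divmod(elapsed, SECONDS_PER_DAY)
```

For example, day_and_offset(1) gives (0, 0), the first second of the first day.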

Individual File Descriptions

auth.txt.gz

This data represents authentication events collected from individual Windows-based desktop computers, servers, and Active Directory servers. Each event is on a separate line in the form of "time,source user@domain,destination user@domain,source computer,destination computer,authentication type,logon type,authentication orientation,success/failure" and represents an authentication event at the given time. The values are comma delimited and any fields that do not have a valid value are represented as a question mark ('?').

Here are three lines from the data as an example:

1,C625$@DOM1,U147@DOM1,C625,C625,Negotiate,Batch,LogOn,Success
1,C653$@DOM1,SYSTEM@C653,C653,C653,Negotiate,Service,LogOn,Success
1,C660$@DOM1,SYSTEM@C660,C660,C660,Negotiate,Service,LogOn,Success
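
The nine-field layout above can be read into a typed record as sketched below. This is an illustrative parser, not an official tool: the names AuthEvent and parse_auth_line are hypothetical, and the sketch assumes that the time field is always present and that no field contains an embedded comma. Fields holding '?' are mapped to None.

```python
from typing import NamedTuple, Optional


class AuthEvent(NamedTuple):
    time: int
    source_user: Optional[str]
    destination_user: Optional[str]
    source_computer: Optional[str]
    destination_computer: Optional[str]
    auth_type: Optional[str]
    logon_type: Optional[str]
    orientation: Optional[str]
    outcome: Optional[str]


def parse_auth_line(line: str) -> AuthEvent:
    """Parse one auth.txt line; '?' placeholders become None."""
    fields = line.rstrip("\r\n").split(",")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    # Assumption: the time field is never '?'.
    values = [None if field == "?" else field for field in fields[1:]]
    return AuthEvent(int(fields[0]), *values)
```

Applied to the first example line, parse_auth_line returns an event whose destination_user is "U147@DOM1" and whose logon_type is "Batch".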

proc.txt.gz

This data represents process start and stop events collected from individual Windows-based desktop computers and servers. Each event is on a separate line in the form of "time,user@domain,computer,process name,start/end" and represents a process event at the given time. The values are comma delimited and any fields that do not have a valid value are presented as a question mark ('?').

Here are three lines from the data as an example:

1,C553$@DOM1,C553,P16,Start
1,C553$@DOM1,C553,P25,End
1,C553$@DOM1,C553,P25,Start
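
One way to use these events is to track which processes have started but not yet ended on each computer. The sketch below is a hedged illustration: the function name open_process_counts is hypothetical, and it assumes that an End with no earlier Start (as with P25 in the example lines, where End precedes Start) can simply be ignored.

```python
from collections import Counter
from typing import Iterable


def open_process_counts(lines: Iterable[str]) -> Counter:
    """Count, per (computer, process), the starts not yet matched by an end."""
    open_counts: Counter = Counter()
    for line in lines:
        _time, _user, computer, process, action = line.strip().split(",")
        key = (computer, process)
        if action == "Start":
            open_counts[key] += 1
        elif action == "End" and open_counts[key] > 0:
            # Ends without a recorded start are skipped (assumption).
            open_counts[key] -= 1
    return +open_counts  # unary plus drops entries whose count fell to zero
```

On the three example lines this leaves P16 and P25 each counted once on C553, because the P25 End arrives before any P25 Start was seen.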

flows.txt.gz

This data represents network flow events collected from central routers within the network. Each event is on a separate line in the form of "time,duration,source computer,source port,destination computer,destination port,protocol,packet count,byte count" and represents a network flow event at the given time with the given duration in seconds. The values are comma delimited and any fields that do not have a valid value are presented as a question mark ('?').

Here are three lines from the data as an example:

1,9,C3090,N10471,C3420,N46,6,3,144
1,9,C3538,N2600,C3371,N46,6,3,144
2,0,C4316,N10199,C5030,443,6,2,92
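
The example lines show both kinds of port value: de-identified labels such as N46 and well-known ports such as 443 that were left as numbers. A minimal parsing sketch follows. The names Flow, parse_port, and parse_flow_line are hypothetical, and the sketch assumes that the time, duration, packet count, and byte count fields are always numeric.

```python
from typing import NamedTuple, Optional, Union

Port = Union[int, str, None]


def parse_port(field: str) -> Port:
    """Return an int for a well-known port, the label for a de-identified one, None for '?'."""
    if field == "?":
        return None
    return int(field) if field.isdigit() else field


class Flow(NamedTuple):
    time: int
    duration: int
    source_computer: str
    source_port: Port
    destination_computer: str
    destination_port: Port
    protocol: str
    packet_count: int
    byte_count: int


def parse_flow_line(line: str) -> Flow:
    """Parse one flows.txt line into a Flow record."""
    f = line.strip().split(",")
    if len(f) != 9:
        raise ValueError(f"expected 9 fields, got {len(f)}: {line!r}")
    # Assumption: the numeric fields are never '?'.
    return Flow(int(f[0]), int(f[1]), f[2], parse_port(f[3]),
                f[4], parse_port(f[5]), f[6], int(f[7]), int(f[8]))
```

Keeping de-identified ports as strings avoids treating a label such as N46 as if it were the real port number 46.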

dns.txt.gz

This data represents Domain Name Service (DNS) lookup events collected from the central DNS servers within the network. Each event is on a separate line in the form of "time,source computer,computer resolved" and represents a DNS lookup at the given time by the source computer for the resolved computer. Each lookup indicates a likely network connection originating from the source computer to the resolved computer. The values are comma delimited and any fields that do not have a valid value are presented as a question mark ('?').

Here are three lines from the data as an example:

31,C161,C2109
35,C5642,C528
38,C3380,C22841
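
Since each lookup suggests a likely connection, the DNS events can be folded into a map from each source computer to the set of computers it resolved. The sketch below is one possible approach under stated assumptions: the function names are hypothetical, rows containing '?' are skipped, and the compressed file is assumed to decode as UTF-8 text.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set


def read_gzip_lines(path: str) -> Iterator[str]:
    """Stream lines from a gzip-compressed text file without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def resolved_by_source(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each source computer to the set of computers it looked up."""
    lookups: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        _time, source, resolved = line.strip().split(",")
        if "?" in (source, resolved):
            continue  # skip rows with a missing endpoint (assumption)
        lookups[source].add(resolved)
    return dict(lookups)
```

A call such as resolved_by_source(read_gzip_lines("dns.txt.gz")) would process the file line by line, assuming it has been downloaded to the working directory.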

redteam.txt.gz

This data represents specific events taken from the authentication data that correspond to known red team compromise events. These may be used as ground truth of bad behavior that is different from normal user and computer activity. Each event is on a separate line in the form of "time,user@domain,source computer,destination computer" and represents a compromise event at the given time. The values are comma delimited.

Here are three lines from the data as an example:

151648,U748@DOM1,C17693,C728
151993,U6115@DOM1,C17693,C1173
153792,U636@DOM1,C17693,C294
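
Because these events are taken from the authentication data, they can be used to label authentication lines as ground truth. The sketch below is illustrative only. The function name label_auth_events is hypothetical, and it assumes that the red team user corresponds to the source user field of the authentication event, which the format description does not state explicitly.

```python
from typing import Iterable, Iterator, Tuple


def label_auth_events(
    auth_lines: Iterable[str], redteam_lines: Iterable[str]
) -> Iterator[Tuple[str, bool]]:
    """Yield (auth line, is_redteam) by matching time, user, source and destination computer."""
    redteam_keys = set()
    for line in redteam_lines:
        time, user, source, destination = line.strip().split(",")
        redteam_keys.add((time, user, source, destination))

    for line in auth_lines:
        fields = line.strip().split(",")
        # Assumption: fields[1] (source user) is the user named in redteam.txt.
        key = (fields[0], fields[1], fields[3], fields[4])
        yield line.strip(), key in redteam_keys
```

The red team file is small enough to hold in a set, so the much larger authentication file can be streamed once and labeled in a single pass.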

Data

The data is currently available as a single file for each data source.

Files:

auth.txt.gz
proc.txt.gz
flows.txt.gz
dns.txt.gz
redteam.txt.gz

Citing

If you use this data in a publication please cite the following paper:

A. D. Kent, "Cybersecurity Data Sources for Dynamic Network Research," in Dynamic Networks in Cybersecurity, 2015. 

@InProceedings{akent-2015-enterprise-data,
   author = {Alexander D. Kent},
   title = {{Cybersecurity Data Sources for Dynamic Network Research}},
   year = 2015,
   booktitle = {Dynamic Networks in Cybersecurity},
   month = jun,
   publisher = {Imperial College Press}
}

The data can be cited with the following:

A. D. Kent, “Comprehensive, Multi-Source Cyber-Security Events,” Los Alamos National Laboratory, http://dx.doi.org/10.17021/1179829, 2015.

@Misc{kent-2015-cyberdata1,
  author =     {Alexander D. Kent},
  title =      {{Comprehensive, Multi-Source Cyber-Security Events}},
  year =       {2015},
  howpublished = {Los Alamos National Laboratory},
  doi = {10.17021/1179829}
}

License

CC0
To the extent possible under law, Los Alamos National Laboratory has waived all copyright and related or neighboring rights to Comprehensive, Multi-Source Cyber-Security Events. This work is published from: United States.

Notes

This data set and associated research have been approved by the LANL Human Subject Research Review Board under approval LANL 14-07 X and have been approved for public release under approval LA-UR-15-23810.

Contact

For questions or other feedback, please contact cyberdata@lanl.gov.